SignalP 6.0 predicts the presence of signal peptides and the location of their cleavage sites in proteins from Archaea, Gram-positive Bacteria, Gram-negative Bacteria and Eukarya. In Bacteria and Archaea, SignalP 6.0 can discriminate between five types of signal peptides:
Sec/SPI: "standard" secretory signal peptides transported by the Sec translocon and cleaved by Signal Peptidase I (Lep).
Sec/SPII: lipoprotein signal peptides transported by the Sec translocon and cleaved by Signal Peptidase II (Lsp).
Tat/SPI: Tat signal peptides transported by the Tat translocon and cleaved by Signal Peptidase I (Lep).
Tat/SPII: Tat lipoprotein signal peptides transported by the Tat translocon and cleaved by Signal Peptidase II (Lsp).
Sec/SPIII: pilin and pilin-like signal peptides transported by the Sec translocon and cleaved by Signal Peptidase III (PilD/PibD).
Additionally, SignalP 6.0 predicts the regions of signal peptides. Depending on the type, the positions of the n-, h- and c-regions as well as of other distinctive features are predicted.
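Each of the five type labels above encodes both the translocon and the signal peptidase involved. As a small illustration, the list can be written as a lookup table; the names `SP_TYPES` and `describe_sp_type` are made up for this sketch and are not part of SignalP.

```python
# Lookup table for the five signal peptide types listed above.
# Keys are the type labels; values are (translocon, peptidase, description).
SP_TYPES = {
    "Sec/SPI": ("Sec", "Signal Peptidase I (Lep)", "standard secretory signal peptide"),
    "Sec/SPII": ("Sec", "Signal Peptidase II (Lsp)", "lipoprotein signal peptide"),
    "Tat/SPI": ("Tat", "Signal Peptidase I (Lep)", "Tat signal peptide"),
    "Tat/SPII": ("Tat", "Signal Peptidase II (Lsp)", "Tat lipoprotein signal peptide"),
    "Sec/SPIII": ("Sec", "Signal Peptidase III (PilD/PibD)", "pilin or pilin-like signal peptide"),
}


def describe_sp_type(label):
    """Return a one-line description of a signal peptide type label."""
    translocon, peptidase, kind = SP_TYPES[label]
    return f"{label}: {kind}; transported by the {translocon} translocon, cleaved by {peptidase}"


if __name__ == "__main__":
    for label in SP_TYPES:
        print(describe_sp_type(label))
```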
Download
Visit the SignalP 6.0 website, find "Download", fill in the requested information to obtain a download link, and download "signalp-6.0.fast.tar.gz". Two modes are offered for download: "slow_sequential" and "fast". The former runs the full model sequentially, taking the same amount of RAM as fast but being 6 times slower; the latter uses a smaller model that approximates the performance of the full model, requiring a fraction of the resources and being significantly faster. This tutorial uses the fast mode package.
A command takes the following form
signalp6 --fastafile /path/to/input.fasta --organism other --output_dir path/to/be/saved --format txt --mode fast
fastafile specifies the FASTA file with the protein sequences to be predicted.
organism is either other or Eukarya. Specifying Eukarya triggers post-processing of the SP predictions to prevent spurious results (only predicts type Sec/SPI).
format can take the values txt, png, eps, all. It defines what output files are created for individual sequences. txt produces a tabular .gff file with the per-position predictions for each sequence. png, eps, all additionally produce probability plots in the requested format. For larger prediction jobs, plotting will slow down the processing speed significantly.
mode is either fast, slow or slow-sequential. Default is fast, which uses a smaller model that approximates the performance of the full model, requiring a fraction of the resources and being significantly faster. slow runs the full model in parallel, which requires more than 14GB of RAM to be available. slow-sequential runs the full model sequentially, taking the same amount of RAM as fast but being 6 times slower. If the specified model is not installed, SignalP will abort with an error.
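When many genomes have to be processed, the command can be assembled programmatically. The following is a minimal sketch that only uses the flags shown in the command above and checks format and mode against the values listed in the option descriptions; the function name `build_signalp_command` and the example paths are assumptions made for illustration.

```python
# Values taken from the option descriptions above.
VALID_FORMATS = {"txt", "png", "eps", "all"}
VALID_MODES = {"fast", "slow", "slow-sequential"}


def build_signalp_command(fasta, output_dir, organism="other", fmt="txt", mode="fast"):
    """Build the signalp6 argument list; raise ValueError on an unknown format or mode."""
    if fmt not in VALID_FORMATS:
        raise ValueError(f"format must be one of {sorted(VALID_FORMATS)}, got {fmt!r}")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return [
        "signalp6",
        "--fastafile", str(fasta),
        "--organism", organism,
        "--output_dir", str(output_dir),
        "--format", fmt,
        "--mode", mode,
    ]


if __name__ == "__main__":
    # Hypothetical paths; replace them with your own files.
    cmd = build_signalp_command("genome1.faa", "genome1_signalp")
    print(" ".join(cmd))
    # To actually run the prediction (requires signalp6 to be installed):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Returning a list rather than a single string lets the command be passed to `subprocess.run` without going through a shell, so paths containing spaces need no quoting.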
Length: The length of the protein sequence.
Number of predicted TMHs: The number of predicted transmembrane helices.
Exp number of AAs in TMHs: The expected number of amino acids in transmembrane helices. If this number is larger than 18, it is very likely to be a transmembrane protein (or to have a signal peptide).
Exp number, first 60 AAs: The expected number of amino acids in transmembrane helices in the first 60 amino acids of the protein. If it is more than a few, be aware that a predicted transmembrane helix in the N-terminus could be a signal peptide.
Total prob of N-in: The total probability that the N-terminus is on the cytoplasmic side of the membrane.
POSSIBLE N-term signal sequence: A warning that is produced when "Exp number, first 60 AAs" is larger than 10.
Protein F01_bin.1_00110 has 436 amino acids in total and 5 transmembrane helices.
Protein F01_bin.1_00142 has 557 amino acids in total, all predicted to lie outside the membrane; that is, this sequence encodes a secreted protein.
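The statistics described above can also be collected with a few lines of Python. This sketch assumes that each statistic appears on a line of the form `# <protein id> <field name>: <value>`, which is how TMHMM long output usually looks; check this against your own file. The function name `parse_long_header` and the sample values are hypothetical.

```python
import re

# Matches lines such as "# example_protein Length: 278" (assumed layout).
HEADER_RE = re.compile(r"^#\s+(\S+)\s+(.+?):\s+(\S+)\s*$")


def parse_long_header(lines):
    """Collect the per-protein statistics from TMHMM long-format header lines."""
    stats = {}
    for line in lines:
        line = line.rstrip("\r\n")
        match = HEADER_RE.match(line)
        if match:
            protein_id, key, value = match.groups()
            stats.setdefault(protein_id, {})[key] = float(value)
        elif line.startswith("#") and "POSSIBLE N-term signal sequence" in line:
            # The warning line carries no value, so store it as a flag.
            protein_id = line.split()[1]
            stats.setdefault(protein_id, {})["warning"] = True
    return stats


if __name__ == "__main__":
    # Hypothetical sample, not taken from a real prediction.
    sample = [
        "# example_protein Length: 278",
        "# example_protein Number of predicted TMHs:  3",
        "# example_protein Exp number of AAs in TMHs: 68.69",
        "# example_protein Exp number, first 60 AAs:  39.89",
        "# example_protein Total prob of N-in:        0.99950",
        "# example_protein POSSIBLE N-term signal sequence",
    ]
    record = parse_long_header(sample)["example_protein"]
    # Thresholds as stated in the field descriptions above.
    print("likely transmembrane:", record["Exp number of AAs in TMHs"] > 18)
    print("possible signal peptide:", record["Exp number, first 60 AAs"] > 10)
```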
Short output format
"len=": The length of the protein sequence.
"ExpAA=": The expected number of amino acids in transmembrane helices. If this number is larger than 18, it is very likely to be a transmembrane protein (or to have a signal peptide).
"First60=": The expected number of amino acids in transmembrane helices in the first 60 amino acids of the protein. If it is more than a few, be aware that a predicted transmembrane helix in the N-terminus could be a signal peptide.
"PredHel=": The number of transmembrane helices predicted by N-best.
"Topology=": The topology predicted by N-best. The topology is given as the positions of the transmembrane helices, separated by "i" if the loop is on the inside or "o" if it is on the outside. For example, 'i7-29o44-66i87-109o' means that the protein starts on the inside, has a predicted TMH at positions 7 to 29, lies outside the membrane at positions 30-43, then has a TMH at positions 44-66, and so on.
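Summary of results
The web version gives us only a single list (Short output format), which has to be copied from the web page and pasted into a new file by hand. I named this file *_TMHMM_SHORT.txt and stored it in the *_signalp directory, which is generated by run_SignalP.pl. Below, I count for each genome the total number of signal peptide proteins, the number of secreted proteins and the number of transmembrane proteins and write them to Statistics.txt. I also extract the secreted protein sequences of each genome to *_signalp/*.secretory.faa and the transmembrane protein sequences to *_signalp/*.membrane.faa. All of this is done by tmhmm_parser.pl.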
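#!/usr/bin/perl
use strict;
use warnings;

# Author: Liu Hualin
# Date: Oct 15, 2021

# Use "or die" rather than "|| die" so that a failed open is actually caught.
open OUT, ">Statistics.txt" or die "Cannot write Statistics.txt: $!";
print OUT "Strain name\tSignal peptide numbers\tSecretory protein numbers\tMembrane protein numbers\n";
my @sig = glob("*_signalp");
foreach my $sig (@sig) {
    $sig =~ /(.+)_signalp/;
    my $str = $1;
    my $tmhmm = $sig . "/$str" . "_TMHMM_SHORT.txt";
    my $fasta = $sig . "/$str" . ".sigseq";
    my $secretory = $str . ".secretory.faa";
    my $membrane = $str . ".membrane.faa";
    open SEC, ">$secretory" or die "Cannot write $secretory: $!";
    open MEM, ">$membrane" or die "Cannot write $membrane: $!";
    my $out = 0;    # proteins lying entirely outside the membrane (secretory)
    my $on = 0;     # all other proteins (counted as membrane proteins)
    my %hash = idseq($fasta);
    open IN, $tmhmm or die "Cannot read $tmhmm: $!";
    while (<IN>) {
        chomp;
        $_ =~ s/[\r\n]+//g;
        next unless /\S/;    # skip blank lines left over from copy-pasting
        my @lines = split /\t/;
        if ($lines[5] eq "Topology=o") {
            $out++;
            print SEC ">$lines[0]\n$hash{$lines[0]}\n";
        } else {
            $on++;
            print MEM ">$lines[0]\n$hash{$lines[0]}\n";
        }
    }
    close IN;
    close SEC;
    close MEM;
    system("mv $secretory $membrane $sig");
    my $total = $out + $on;
    print OUT "$str\t$total\t$out\t$on\n";
}
close OUT;

# Read a FASTA file and return a hash of sequence ID => sequence.
sub idseq {
    my ($fasta) = @_;
    my %hash;
    local $/ = ">";    # read one FASTA record at a time
    open IN, $fasta or die "Cannot read $fasta: $!";
    <IN>;              # discard the empty chunk before the first ">"
    while (<IN>) {
        chomp;
        my ($header, $seq) = split (/\n/, $_, 2);
        $header =~ /(\S+)/;
        my $id = $1;
        $seq =~ s/\s+$//;    # drop the trailing newline so no blank line is printed
        $hash{$id} = $seq;
    }
    close IN;
    return (%hash);
}

How to run: place tmhmm_parser.pl in the parent directory of the *_signalp directories. Each *_signalp directory must contain the *_TMHMM_SHORT.txt file and the *.sigseq file. Then run the following in a terminal: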
perl tmhmm_parser.pl
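For readers who prefer Python, the classification rule that tmhmm_parser.pl applies can be sketched as follows: a record whose sixth field is exactly "Topology=o" is counted as secretory, and every other record as a membrane protein. This is not the author's script, only an illustration of the same rule; the function name and the sample records are hypothetical.

```python
def classify_short_output(lines):
    """Split TMHMM short-format records into secretory and membrane protein IDs."""
    secretory, membrane = [], []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        protein_id, topology = fields[0], fields[5]
        if topology == "Topology=o":
            secretory.append(protein_id)  # whole protein outside the membrane
        else:
            membrane.append(protein_id)   # at least one predicted helix
    return secretory, membrane


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    sample = [
        "protA\tlen=557\tExpAA=0.05\tFirst60=0.02\tPredHel=0\tTopology=o",
        "protB\tlen=120\tExpAA=65.20\tFirst60=40.10\tPredHel=3\tTopology=i7-29o44-66i87-109o",
    ]
    secretory, membrane = classify_short_output(sample)
    print("total:", len(secretory) + len(membrane))
    print("secretory:", secretory)
    print("membrane:", membrane)
```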
Getting the scripts
The scripts used in this article are available on GitHub. Notice: When you use the scripts in this article, please cite the link of this webpage and respect the author's work. Thank you!