                                  pepstats

Function

   Protein statistics

Description

   pepstats outputs a report of simple protein sequence information
   including:
     * Molecular weight
     * Number of residues
     * Average residue weight
     * Charge
     * Isoelectric point
     * For each type of amino acid: number, molar percent, DayhoffStat
     * For each physico-chemical class of amino acid: number, molar
       percent
     * Probability of protein expression in E. coli inclusion bodies
     * Molar extinction coefficient (A280)
     * Extinction coefficient at 1 mg/ml (A280)

   DayhoffStat is the amino acid's molar percentage divided by the
   Dayhoff statistic. The Dayhoff statistic is the amino acid's relative
   occurrence per 1000 aa normalised to 100 by rls@ebi.ac.uk (original
   work from 1993).

   The probability of expression in inclusion bodies is sometimes
   referred to as a type of solubility measure. However, a recombinant
   protein expressed in Escherichia coli can end up either soluble in
   the cytosol or insoluble in inclusion bodies. If the Harrison model
   predicts that a given protein will probably be expressed in inclusion
   bodies, this does not mean that it is impossible to obtain it in
   soluble form in the cytosol. One example: Thermotoga maritima cell
   division protein FtsA with a C-terminal His-Tag has a 58% Harrison
   probability of being expressed in inclusion bodies. However, there
   was plenty of soluble protein in the E. coli cytosol (F. van den Ent
   and J. Lowe, EMBO J. 19, 5300-5307 (2000)).

   Whether a protein is expressed in inclusion bodies depends not only
   on the sequence but also on many other factors, such as E. coli
   strain, incubation temperature, type of expression vector, strength
   of promoter and medium.

Usage

   Here is a sample session with pepstats

% pepstats
Protein statistics
Input protein sequence(s): tsw:laci_ecoli
Pepstats program output file [laci_ecoli.pepstats]:

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Protein sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outfile]           outfile    [*.pepstats] Pepstats program output file

   Additional (Optional) qualifiers: (none)

   Advanced (Unprompted) qualifiers:
   -aadata             datafile   [Eamino.dat] Molecular weight data for
                                  amino acids
   -[no]termini        boolean    [Y] Include charge at N and C terminus

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options.
                                  More information on associated and
                                  general qualifiers can be found with
                                  -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

   pepstats reads a normal protein sequence USA.

  Input files for usage example

   'tsw:laci_ecoli' is a sequence entry in the example protein database
   'tsw'

  Database entry: tsw:laci_ecoli

ID   LACI_ECOLI     STANDARD;      PRT;   360 AA.
AC   P03023; P71309; Q47338; O09196;
DT   21-JUL-1986 (Rel. 01, Created)
DT   01-NOV-1997 (Rel. 35, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   LACTOSE OPERON REPRESSOR.
GN   LACI.
OS   Escherichia coli.
OC   Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
OC   Escherichia.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 78246991.
RA   FARABAUGH P.J.;
RT   "Sequence of the lacI gene.";
RL   Nature 274:765-769(1978).
RN   [2]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12 / MG1655;
RX   MEDLINE; 97426617.
RA   BLATTNER F.R., PLUNKETT G. III, BLOCH C.A., PERNA N.T., BURLAND V.,
RA   RILEY M., COLLADO-VIDES J., GLASNER F.D., RODE C.K., MAYHEW G.F.,
RA   GREGOR J., DAVIS N.W., KIRKPATRICK H.A., GOEDEN M.A., ROSE D.J.,
RA   MAU B., SHAO Y.;
RT   "The complete genome sequence of Escherichia coli K-12.";
RL   Science 277:1453-1474(1997).
RN   [3]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12 / MG1655;
RA   DUNCAN M., ALLEN E., ARAUJO R., APARICIO A.M., CHUNG E., DAVIS K.,
RA   FEDERSPIEL N., HYMAN R., KALMAN S., KOMP C., KURDI O., LEW H.,
RA   LIN D., NAMATH A., OEFNER P., ROBERTS D., SCHRAMM S., DAVIS R.W.;
RL   Submitted (NOV-1996) to the EMBL/GenBank/DDBJ databases.
RN   [4]
RP   SEQUENCE FROM N.A.
RA   CHEN J., MATTHEWS K.K.S.M.;
RL   Submitted (MAY-1991) to the EMBL/GenBank/DDBJ databases.
RN   [5]
RP   SEQUENCE FROM N.A.
RA   MARSH S.;
RL   Submitted (JAN-1997) to the EMBL/GenBank/DDBJ databases.
RN   [6]
RP   SEQUENCE OF 1-147; 159-230 AND 233-360.
RX   MEDLINE; 76091932.
RA   BEYREUTHER K., ADLER K., FANNING E., MURRAY C., KLEMM A., GEISLER N.;
RT   "Amino-acid sequence of lac repressor from Escherichia coli.
RT   Isolation, sequence analysis and sequence assembly of tryptic
RT   peptides and cyanogen-bromide fragments.";
RL   Eur. J. Biochem. 59:491-509(1975).
RN   [7]

  [Part of this file has been deleted for brevity]

CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; V00294; CAA23569.1; -.
DR   EMBL; J01636; AAA24052.1; -.
DR   EMBL; AE000141; AAC73448.1; -.
DR   EMBL; U73857; AAB18069.1; ALT_INIT.
DR   EMBL; X58469; CAA41383.1; -.
DR   EMBL; U86347; AAB47270.1; ALT_INIT.
DR   EMBL; U72488; AAB36549.1; -.
DR   EMBL; U78872; AAB37348.1; -.
DR   EMBL; U78873; AAB37351.1; -.
DR   EMBL; U78874; AAB37354.1; -.
DR   PIR; A03558; RPECL.
DR   PIR; S02540; S02540.
DR   PDB; 1LCC; 31-JAN-94.
DR   PDB; 1LCD; 31-JAN-94.
DR   PDB; 1LTP; 31-OCT-93.
DR   PDB; 1TLF; 31-JUL-95.
DR   PDB; 1LBG; 11-JUL-96.
DR   PDB; 1LBH; 11-JUL-96.
DR   PDB; 1LBI; 11-JUL-96.
DR   PDB; 1LQC; 12-FEB-97.
DR   ECO2DBASE; H039.0; 6TH EDITION.
DR   ECOGENE; EG10525; LACI.
DR   PFAM; PF00356; lacI; 1.
DR   PFAM; PF00532; Peripla_BP_like; 1.
DR   PROSITE; PS00356; HTH_LACI_FAMILY; 1.
KW   Transcription regulation; DNA-binding; Repressor; 3D-structure.
FT   DNA_BIND      6     25       H-T-H MOTIF.
FT   MUTAGEN      17     17       Y->H: BROADENING OF SPECIFICITY.
FT   MUTAGEN      22     22       R->N: RECOGNIZE AN OPERATOR VARIANT.
FT   VARIANT     282    282       Y -> D (IN T41 MUTANT).
FT   CONFLICT    286    286       S -> L (IN AAA24052, REF. 2, 4 AND 5).
FT   HELIX         6     13
FT   TURN         14     14
FT   HELIX        17     24
FT   HELIX        32     44
FT   TURN         49     50
SQ   SEQUENCE   360 AA;  38564 MW;  4CA5A1D6 CRC32;
     MKPVTLYDVA EYAGVSYQTV SRVVNQASHV SAKTREKVEA AMAELNYIPN RVAQQLAGKQ
     SLLIGVATSS LALHAPSQIV AAIKSRADQL GASVVVSMVE RSGVEACKAA VHNLLAQRVS
     GLIINYPLDD QDAIAVEAAC TNVPALFLDV SDQTPINSII FSHEDGTRLG VEHLVALGHQ
     QIALLAGPLS SVSARLRLAG WHKYLTRNQI QPIAEREGDW SAMSGFQQTM QMLNEGIVPT
     AMLVANDQMA LGAMRAITES GLRVGADISV VGYDDTEDSS CYIPPSTTIK QDFRLLGQTS
     VDRLLQLSQG QAVKGNQLLP VSLVKRKTTL APNTQTASPR ALADSLMQLA RQVSRLESGQ
//

Output file format

  Output files for usage example

  File: laci_ecoli.pepstats

PEPSTATS of LACI_ECOLI from 1 to 360

Molecular weight = 38563.97             Residues = 360
Average Residue Weight  = 107.122       Charge   = 1.5
Isoelectric Point = 6.8820
A280 Molar Extinction Coefficient  = 21620
A280 Extinction Coefficient 1mg/ml = 0.56
Improbability of expression in inclusion bodies = 0.670

Residue         Number          Mole%           DayhoffStat
A = Ala         44              12.222          1.421
B = Asx         0               0.000           0.000
C = Cys         3               0.833           0.287
D = Asp         17              4.722           0.859
E = Glu         15              4.167           0.694
F = Phe         4               1.111           0.309
G = Gly         22              6.111           0.728
H = His         7               1.944           0.972
I = Ile         18              5.000           1.111
J = ---         0               0.000           0.000
K = Lys         11              3.056           0.463
L = Leu         40              11.111          1.502
M = Met         10              2.778           1.634
N = Asn         12              3.333           0.775
O = ---         0               0.000           0.000
P = Pro         14              3.889           0.748
Q = Gln         28              7.778           1.994
R = Arg         19              5.278           1.077
S = Ser         33              9.167           1.310
T = Thr         19              5.278           0.865
U = ---         0               0.000           0.000
V = Val         34              9.444           1.431
W = Trp         2               0.556           0.427
X = Xaa         0               0.000           0.000
Y = Tyr         8               2.222           0.654
Z = Glx         0               0.000           0.000

Property        Residues                        Number          Mole%
Tiny            (A+C+G+S+T)                     121             33.611
Small           (A+B+C+D+G+N+P+S+T+V)           198             55.000
Aliphatic       (A+I+L+V)                       136             37.778
Aromatic        (F+H+W+Y)                       21               5.833
Non-polar       (A+C+F+G+I+L+M+P+V+W+Y)         199             55.278
Polar           (D+E+H+K+N+Q+R+S+T+Z)           161             44.722
Charged         (B+D+E+H+K+R+Z)                 69              19.167
Basic           (H+K+R)                         37              10.278
Acidic          (B+D+E+Z)                       32               8.889

Data files

   The Dayhoff statistic is read from the EMBOSS data file
   'Edayhoff.freq'. You can inspect and modify this file by copying it
   into your current directory with the command: 'embossdata -fetch'.
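
   Several figures in the sample report can be cross-checked with simple
   arithmetic. The following is a minimal sketch in Python, not the
   pepstats implementation. The function names are hypothetical. The
   Dayhoff value used for Ala (8.6) is an assumption back-calculated from
   the report (Mole% divided by DayhoffStat) and should be checked
   against 'Edayhoff.freq'. The relation used for the 1 mg/ml extinction
   coefficient (molar coefficient divided by molecular weight) is
   likewise an assumption that happens to reproduce the reported value.

```python
# Values copied from the sample report for LACI_ECOLI.
MOLECULAR_WEIGHT = 38563.97
RESIDUES = 360
MOLAR_EXTINCTION_A280 = 21620
ALA_COUNT = 44

# Assumption: Dayhoff statistic for Ala, back-calculated from the report
# (12.222 / 1.421). Verify against the Edayhoff.freq data file.
ALA_DAYHOFF = 8.6


def mole_percent(count, total_residues):
    """Molar percentage of one residue type in the sequence."""
    return 100.0 * count / total_residues


def dayhoff_stat(count, total_residues, dayhoff_value):
    """Molar percentage divided by the Dayhoff statistic."""
    return mole_percent(count, total_residues) / dayhoff_value


def average_residue_weight(molecular_weight, total_residues):
    """Molecular weight divided by the number of residues."""
    return molecular_weight / total_residues


def extinction_1mg_per_ml(molar_extinction, molecular_weight):
    """Assumed relation: molar coefficient divided by molecular weight."""
    return molar_extinction / molecular_weight


if __name__ == "__main__":
    print("Ala Mole%            =", round(mole_percent(ALA_COUNT, RESIDUES), 3))
    print("Ala DayhoffStat      =",
          round(dayhoff_stat(ALA_COUNT, RESIDUES, ALA_DAYHOFF), 3))
    print("Average residue wt   =",
          round(average_residue_weight(MOLECULAR_WEIGHT, RESIDUES), 3))
    print("A280 at 1 mg/ml      =",
          round(extinction_1mg_per_ml(MOLAR_EXTINCTION_A280, MOLECULAR_WEIGHT), 2))
```

   With the values above, the rounded results agree with the Ala row and
   the header lines of the sample report.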

   EMBOSS data files are distributed with the application and stored in
   the standard EMBOSS data directory, which is defined by the EMBOSS
   environment variable EMBOSS_DATA.

   To see the available EMBOSS data files, run:

% embossdata -showall

   To fetch one of the data files (for example 'Exxx.dat') into your
   current directory for you to inspect or modify, run:

% embossdata -fetch -file Exxx.dat

   Users can provide their own data files in their own directories.
   Project specific files can be put in the current directory, or for
   tidier directory listings in a subdirectory called ".embossdata".
   Files for all EMBOSS runs can be put in the user's home directory, or
   again in a subdirectory called ".embossdata".

   The directories are searched in the following order:
     * . (your current directory)
     * .embossdata (under your current directory)
     * ~/ (your home directory)
     * ~/.embossdata

Notes

   None.

References

    1. Roger G. Harrison "Expression of soluble heterologous proteins via
       fusion with NusA protein" in inNovations 11, June 2000, p 4 - 7.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with a status of 0.

Known bugs

   None.

See also

   Program name   Description
   backtranambig  Back translate a protein sequence to ambiguous codons
   backtranseq    Back translate a protein sequence
   charge         Protein charge plot
   checktrans     Reports STOP codons and ORF statistics of a protein
   compseq        Count composition of dimer/trimer/etc words in a sequence
   emowse         Protein identification by mass spectrometry
   freak          Residue/base frequency table or plot
   iep            Calculates the isoelectric point of a protein
   mwcontam       Shows molwts that match across a set of files
   mwfilter       Filter noisy molwts from mass spec output
   octanol        Displays protein hydropathy
   pepinfo        Plots simple amino acid properties in parallel
   pepwindow      Displays protein hydropathy
   pepwindowall   Displays protein hydropathy of a set of sequences

Author(s)

   Alan Bleasby (ajb