SWISS-PROT RELEASE 14.0 RELEASE NOTES
1. INTRODUCTION
1.1 Evolution
Release 14.0 of SWISS-PROT contains 15409 sequence entries, comprising 4914264
amino acids abstracted from 15054 references. This represents an increase of
13% over release 13.0. The recent growth of the database is summarized below:
Release Date Number of entries Nb of amino acids
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 09/88 7724 2 224 465
9.0 12/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 06/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
Almost 1600 sequences have been added since release 13, the sequence data of 191
existing entries has been updated and the annotations of 2650 entries have been
revised. In particular we have used reviews articles to update the annotations
of the following groups or families of proteins:
2-S seed storage proteins
3-hydroxyacyl-CoA dehydrogenases
Acyl-CoA dehydrogenases
Acylphosphatases
Albumin / AFP / VDBP family
Aldo/keto reductases
Alkaline phosphatases
Amino acid permeases
Arthropod hemocyanins / insect LSPs
Bacterial activator proteins, gntR family
Calcitonin / CGRP / IAPP family
Cecropin family
Copper type II, ascorbate-dependent monooxygenases
Cytochromes b/b6
DNA ligases
Enoyl-CoA hydratases
Glutamine synthetases
Growth factor and cytokines receptors
Isocitrate lyases
Lamins
Leguminous lectins
Nerve growth factor family
Nitrogenase component 1 subunits
Phosphoglycerate mutases
Phosphoribosyl pyrophosphate synthetases
Phosphoribosylglycinamide synthetases
Phytochromes
Platelet-derived growth factor (PDGF) family
Proliferating cell nuclear antigen (PCNA)
Protamine P1
Rieske iron-sulfur proteins
Rotaviruses proteins
Sulfatases
Thiolases
Tryptophan synthetases
Ubiquitin carboxyl-terminal hydrolases
Ubiquitin-conjugating enzymes
Ureases
Urokinases and tissue plasminogen activators
Zinc carboxypeptidases
2 DATA SOURCES
Release 14.0 has been updated using protein sequence data from release 23.0 of
the PIR (Protein Identification Resource) protein data bank, as well as
translation of nucleotide sequence data from release 22.0 of the EMBL Nucleotide
Sequence Database.
As an indication to the source of the sequence data in the SWISS-PROT data bank
we list here the statistics concerning the DR (Databank Reference) pointer
lines:
Entries with pointer(s) to only PIR entri(es): 3079
Entries with pointer(s) to only EMBL entri(es): 6896
Entries with pointer(s) to both EMBL and PIR entri(es): 4557
Entries with no pointers lines (entered in house): 877
3 CHANGES AT THIS RELEASE
3.1 OG Line Format
The OG (OrGanelle) line format has been extended to take into account protein
sequences from genes originating from the cyanelle of bacteria such as
Cyanophora paradoxa. The valid syntax is:
OG CYANELLE.
3.2 DR Line Format
The DR line format has been extended to accept cross-references to PROSITE, the
data bank of sites and pattern in proteins which is now being distributed with
SWISS-PROT. The primary identifier is the PROSITE accession number, and the
secondary identifier is the PROSITE entry name. Examples:
DR PROSITE; PS00088; SOD_MN.
DR PROSITE; PS00021; KRINGLE.
3.3 New CC Line Topics
As of release 14 we have added two new topics for the comments (CC) linetype:
ENZYME REGULATION, and TISSUE SPECIFICITY. Example of their usage:
CC -!- ENZYME REGULATION: THE ACTIVITY OF THIS ENZYME IS CONTROLLED
CC BY ADENYLATION. THE FULLY ADENYLATED ENZYME COMPLEX IS
CC INACTIVE.
CC -!- TISSUE SPECIFICITY: KIDNEY; SUBMAXILLARY GLAND; URINE.
3.4 Documentation Files
The file ECNUMBER.DOC, which contained an index of SWISS-PROT enzyme entries
classified by EC number, is no longer distributed. This information can be
found in the new database ENZYME which is now distributed with SWISS-PROT.
The JOURLIST.DOC file now includes the ISSN numbers for all journals cited in
SWISS-PROT and PROSITE.
For the sake of consistency with the newly introduced PROSITE and ENZYME
databases, the name of the SWISS-PROT User Manual file has been changed from
USRMAN.DOC to SWISSPRT.USR. All three databases now have .USR as the file
extension for their User Manual files.
4 NEW PROSITE DATABASE
PROSITE is a compilation of sites and patterns found in protein sequences. This
database consists of two files: the first file contains the patterns as well as
the results of the scan of SWISS-PROT for these patterns, the second file
contains the documentation that fully describes each pattern. A sample pattern
entry and its corresponding documentation entry are shown below.
The use of protein sequence patterns (or motifs) to determine the function(s) of
proteins is becoming very rapidly one of the essential tools of sequence
analysis. PROSITE, as a stand-alone database is pertinent for such purposes.
But we also believe that PROSITE is an important addition to SWISS-PROT as it
allows the flexible classification of proteins into families.
PROSITE is distributed with SWISS-PROT; for a complete description of the
content and format of this database you should refer to the User Manual (file
PROSITE.USR).
------------------------ Start of Sample PROSITE Entry -------------------------
ID CARBOXYPEPTIDASE_SER; PATTERN.
AC PS00131;
DT APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
DE Serine carboxypeptidases, serine active site.
PA G-E-S-Y-A-G.
NR /RELEASE=14,15409;
NR /TOTAL=7(7); /POSITIVE=7(7); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR /FALSE_NEG=0(0);
CC /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC /SITE=3,active_site;
DR P10619, PRTP$HUMAN, T; P09620, KEX1$YEAST, T; P00729, CBPY$YEAST, T;
DR P07519, CBP1$HORVU, T; P08818, CBP2$HORVU, T; P08819, CBP2$WHEAT, T;
DR P11515, CBPG$WHEAT, T;
DO PDOC00122;
//
{PDOC00122}
{PS00131; CARBOXYPEPTIDASE_SER}
{BEGIN}
************************************************
* Serine carboxypeptidases, serine active site *
************************************************
All known carboxypeptidases are either metallo carboxypeptidases or serine
carboxypeptidases (EC 3.4.16.-). The catalytic activity of the serine carboxy-
peptidases, like that of the serine proteases of the trypsin family, is
provided by a charge relay system involving an aspartic acid residue hydrogen-
bonded to an histidine, which itself is hydrogen-bonded to a serine. Proteins
known or proposed to be serine carboxypeptidases are:
- Barley serine carboxypeptidases I and II [1,2].
- Wheat serine carboxypeptidase II [3].
- A probable wheat serine carboxypeptidase induced by gibberellin [4].
- Yeast carboxypeptidase Y [5], a vacuolar protease involved in the degra-
dation of small peptides.
- Yeast KEX1 protease [6], which is involved in killer toxin and alpha-
factor precursor processing.
- Human 'protective protein' [7], a lysosomal protein which appears to be
essential for both the activity of beta-galactosidase and neuraminidase.
The sequence in the vicinity of the active site serine residue is perfectly
conserved in all these serine carboxypeptidases.
-Consensus pattern: G-E-S-Y-A-G
[S is the active site residue]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Last update: October 1989 / Text revised.
[ 1] Sorensen S.B., Breddam K., Svendsen I.
Carlsberg Res. Commun. 51:475-485(1986).
[ 2] Sorensen S.B., Svendsen I., Breddam K.
Carlsberg Res. Commun. 52:285-295(1987).
[ 3] Breddam K., Sorensen S.B., Svendsen I.
Carlsberg Res. Commun. 52:297-311(1987).
[ 4] Baulcombe D.C., Barker R.F., Jarvis M.G.
J. Biol. Chem. 262:13726-13735(1987).
[ 5] Valls L.A., Hunter C.P., Rothman J.H., Stevens T.H.
Cell 48:887-897(1987).
[ 6] Dmochowska A., Dignard D., Henning D., Thomas D.Y., Bussey H.
Cell 50:573-584(1987).
[ 7] Galjart N.J., Gillemans N., Harris A., van de Horst G.T.J.,
Verheijen F.W., Galjaard H., D'Azzo A.
Cell 54:755-764(1988).
{END}
------------------------- End of Sample PROSITE Entry --------------------------
5 NEW ENZYME DATABASE
A new secondary database called ENZYME has been established. It contains the
following data for each type of characterized enzyme for which an EC number has
been assigned:
o EC number
o Recommended name
o Alternative names (if any)
o Catalytic activity
o Cofactors (if any)
o Pointers to the SWISS-PROT entry(ies) that correspond to the enzyme
(if any)
We believe that the ENZYME database will be useful to anybody working with
enzymes and will allow programs to be developed that can help with the creation
of new metabolic pathways.
A sample entry is shown below:
ID 1.14.17.3
DE PEPTIDYLGLYCINE MONOOXYGENASE.
AN PEPTIDYL ALPHA-AMIDATING ENZYME.
CA PEPTIDYLGLYCINE + ASCORBATE + O(2) = PEPTIDYL(2-HYDROXYGLYCINE) +
CA DEHYDROASCORBATE + H(2)O.
CC THE PRODUCT IS UNSTABLE AND DISMUTATES TO GLYOXYLATE AND THE
CC CORRESPONDING DESGLYCINE PEPTIDE AMIDE.
CF COPPER.
DR P10731, AMD$BOVIN ; P14925, AMD$RAT ; P08478, AMD1$XENLA;
DR P12890, AMD2$XENLA;
//
The impact of this new database on SWISS-PROT is the following:
o The ECNUMBER.DOC file is now obsolete and is now longer generated.
o Instead of having CC (comments) lines with the topics:
CC -!- CATALYTIC ACTIVITY: description_of_catalytic_activity.
CC -!- COFACTOR: description_of_cofactor.
the enzyme entries in SWISS-PROT will, in future releases, have two new
linetypes:
CA Description_of_catalytic_activity
CF Description_of_cofactor
These lines will be automatically generated at each release of
SWISS-PROT from the information stored in the ENZYME database. The
introduction of these new linetypes is planned for release 16 of
SWISS-PROT.
ENZYME is distributed with SWISS-PROT; for a complete description of the content
and format of this database you should refer to the User Manual (file
ENZYME.USR).
6 DISTRIBUTION MEDIA
Data is available on magnetic tape, TK50 cassette and CD-ROM. This section of
the release notes applies to tape and TK50 cassette only; CD-ROM releases are
accompanied by their own release notes which detail the file organisation used
on CD.
6.1 Tape Formats
The distribution tapes are 9-track industry standard magnetic tapes. Each file
consists of fixed-length 80 byte records, padded with trailing blanks as
appropriate (except for VMS Backup format tapes which have variable length
records). Tape format details (density, blocksize, label type, character set)
are attached to each tape.
In many formats, a release requires more than one tape volume. In order to
support sequential volume serial numbers for multi-volume tape sets, the volume
labels are EMBL01 for the first tape, EMBL02 for the second tape, and so on.
VMS Backup format tapes (and all TK50 cassettes) contain the files listed below,
in the order shown, as a single save set called SWISS14.BCK.
6.2 Documentation
The documentation files on tape (those ending with a file extension of .DOC) are
designed to be easily printable. As with all other tape files they have a fixed
record length of 80 bytes. The page length of 63 lines per page was chosen so
that the pages will fit both on DIN A4 paper and on American 8-1/2" x 11" paper.
Page throws are indicated by lines with the six character string <PAGE> in
positions 1-6, and nothing else. If you wish to print any of these files we
suggest you copy them down onto disk, use your local editor to replace every
occurrence of <PAGE> in columns 1-6 by a formfeed (or whatever is appropriate to
force a page throw on your printer), and then print them.
6.3 Release 14 Files
The distribution tape(s) contain the files shown below, in the order listed.
Where more than one tape is required, subsequent volumes will continue where the
preceding volume left off.
File Number File Name Description #Records
----------- ------------ ----------------------------- --------
1 CONTENTS.DOC Tape Contents (this table) 63
2 SWISSPRT.USR User Manual 1829
3 RELNOTES.DOC Release Notes (this document) 1063
4 SPECIES.NDX Species Index 8779
5 KEYWORD.NDX Keyword Index 16830
6 AUTHOR.NDX Author Index 74268
7 SHORTDIR.NDX Short Directory Index 31363
8 ACNUMBER.NDX Accession Number Index 15535
9 CITATION.NDX Citation Index 30411
10 SPIDCODE.NDX Species ID Code Index 2842
11 EMBLTOSP.DOC EMBL/SWISS-PROT Xreferences 13567
12 ORGCODES.DOC Organism Code List 3199
13 PDBTOSP.DOC PDB/SWISS-PROT Xreferences 575
14 JOURLIST.DOC Journal Abbreviation List 1343
15 DATASUB.TXT Data Submission Form 323
16 SEQ.DAT SWISS-PROT Sequence Entries 490281
17 PROSITE.USR PROSITE Database User Manual 915
18 PROSITE.LIS PROSITE Entry List 424
19 PROSITE.DOC PROSITE Entry Documentation 12313
20 PROSITE.DAT PROSITE Database Entries 6401
21 ENZYME.USR ENZYME Database User Manual 427
22 ENZYME.DAT ENZYME Database 12008
7 INDEX FILE FORMATS
The index key of each index file (keywords, authors, citations, etc.) is sorted
alphabetically; the names of all entries containing the index key are listed
alphabetically after the key. Each entry name is accompanied by its primary
accession number.
Except for the short directory, accession number and species id code indices,
all index files have the same layout: each value of the index key begins on a
new line in column 1, and the associated entry names begin on the next line.
Lines containing entry names are in fixed-format, layed out as follows:
Columns Description
------- ---------------------------
14-23 entry name (left-justified)
29-34 primary accession number
36-45 entry name (left-justified)
51-56 primary accession number
58-67 entry name (left-justified)
73-78 primary accession number
Up to three entry names fit on each such line; if a given index key has more
than three entries associated with it, additional lines are used (with exactly
the same layout). This index file format is identical to that of the EMBL
Nucleotide Sequence database.
7.1 Species Index
This file lists all species which appear in the database. It is sorted
alphabetically on (english) common name. The latin genus and species will be
listed, if present in the database entries. Mitochondrion and chloroplast
sequences appear under separate index keys, immediately after the related
nuclear sequences. An excerpt from the species index file is given below (the
ruler is presented for your convenience - it does not appear in the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
CHILEAN POTATO-TREE (SOLANUM CRISPUM)
PLAS$SOLCR P00297
CHIMPANZEE (PAN SATYRUS)
HBA3$PANSA P01935
CHIMPANZEE (PAN TROGLODYTES)
CD4$PANTR P16004 HA1A$PANTR P13748 HA1B$PANTR P13749
HA1C$PANTR P16209 HA1D$PANTR P16210 HA1E$PANTR P16215
HA1M$PANTR P13750 HA1N$PANTR P13751 HBAZ$PANTR P06347
MBP$PANTR P06906 MYG$PANTR P02145 NUO4$PANTR P03906
NUO5$PANTR P03916
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.2 Keyword Index
This file lists all keywords which appear in the database (on the KW lines). It
is sorted alphabetically on keyword. An excerpt from the keyword index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACETYLCHOLINE RECEPTOR INHIBITOR
CXA1$CONGE P01519 CXA1$CONMA P01521 CXA1$CONST P15471
CXA2$CONGE P01520
ACIDIC PROTEIN
143E$BOVIN P11576 B23$RAT P13084 B231$HUMAN P08693
B232$HUMAN P06748 BAT$HALHA P13260 CALQ$CANFA P12637
CALQ$RABIT P07221 CENB$HUMAN P07199 CMGA$HUMAN P10645
CMGA$RAT P10354 GRPE$ECOLI P09372 MK16$YEAST P10962
NFL$BOVIN P02548 NO38$CHICK P16039 NU38$XENLA P07222
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.3 Author Index
This file lists all author names which appear in citations (on the RA lines) in
the database. It is sorted alphabetically on name. Names are presented as they
appear in the database entries (i.e. as cited in publications) - we do not
attempt to handle multiple surname spellings, or different initials, for the
same author. An excerpt from the author index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
AAN F.
PT3G$SALTY P02908
AARDEN L.A.
IL6$HUMAN P05231
AARONSON R.P.
HEMA$INCCA P03465
AARONSON S.
PGDS$HUMAN P16234
AARONSON S.A.
3ORF$EIAV1 P11305 DBL$HUMAN P10911 ENV$EIAV1 P11306
ENV$SMSAV P03384 GAG$AVISN P03342 GAG$MSVMO P03334
GAG$SMSAV P03330 KMOS$MSVMO P00538 PDGB$HUMAN P01127
POL$EIAV1 P11204 POL$MMTVB P03365 POL$SMSAV P03359
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.4 Short Directory Index
This file contains summary information about all entries in the database,
including a brief description of the entry, its sequence length, molecule type
and data management class. The file is sorted alphabetically on entry name.
The lines are fixed-format, layed out as follows:
Columns Field Name Description
------- --------------- -------------------------------------------
01-10 entry name left-justified
14-14 data class s = standard
u = unannotated
p = preliminary
r = unreviewed
16-18 molecule type PRT (protein)
20-25 sequence length right-justified
27-80 description left-justified
If an entry's description occupies more than 54 characters (cols 27-80), it will
be continued onto one or more continuation lines. Continuation lines contain
description text (left-justified) in cols 27-80; cols 01-26 are blank. An
excerpt from the short description index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
104K$THEPA p PRT 924 104 KD MICRONEME-RHOPTRY ANTIGEN - THEILERIA PARVA
10KA$MYCTU s PRT 100 BCG-A HEATSHOCK PROTEIN (10 KD ANTIGEN) -
MYCOBACTERIUM TUBERCULOSIS
10KS$HUMAN s PRT 91 10 KD SECRETORY PROTEIN PRECURSOR - HUMAN (HOMO
SAPIENS)
110K$PLAKN s PRT 296 110 KD ANTIGEN (PK110) (FRAGMENT) - PLASMODIUM
KNOWLESI
11KD$ADE02 s PRT 79 11 KD CORE PROTEIN PRECURSOR (LATE L2 MU CORE PROTEIN)
(PROTEIN X) - ADENOVIRUS TYPE 2
11SB$CUCMA s PRT 480 11-S GLOBULIN BETA SUBUNIT PRECURSOR - PUMPKIN
(CUCURBITA MAXIMA)
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.5 Accession Number Index
This file lists all accession numbers in the database. It is sorted
alphabetically on accession number. Each accession number is followed by the
name and primary accession number of every entry in which it occurs.
The lines are fixed-format, layed out exactly the same as the other index files;
the only difference is that the index key (accession number) appears on the same
line (in cols 1-6) as the list of entries which contain the key. In the other
index files, the index key appears on a line by itself. This index key,
however, is short enough to fit on the same line as the entries, and we have
done this to save space.
Accession numbers which have been deleted from the database also appear in this
index, containing the word DELETED (left-justified) in the entry name field.
An excerpt from the accession number index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
P00835 ATPE$MAIZE P00835
P00836 ATPE$HORVU P00836
P00837 ATPG$ECOLI P00837
P00838 ATPG$ECOLI P00837
P00839 ATPL$BOVIN P00839 ATPM$BOVIN P07926
P00840 ATP9$MAIZE P00840
P00841 ATP9$YEAST P00841
P00842 ATPL$NEUCR P00842
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.6 Citation Index
This file lists all journal citations which appear in the database. It is
sorted alphabetically on citation. An excerpt from the citation index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
ANN. GENET. SEL. ANIM. 4:515-521(1972)
CASK$BOVIN P02668
ANN. INST. PASTEUR IMMUNOL. 127C:261-271(1976)
KV3F$HUMAN P01624
ANN. INST. PASTEUR IMMUNOL. 132D:77-88(1981)
HV41$MOUSE P01811
ANN. N.Y. ACAD. SCI. 165:360-377(1969)
HBB$ATEGE P02034
ANN. N.Y. ACAD. SCI. 241:436-438(1974)
HBB$RABIT P02057
ANN. N.Y. ACAD. SCI. 356:1-13(1980)
CALM$METSE P02596
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.7 Species ID Code Index
This file lists all species id codes which appear in the database. It is sorted
alphabetically on species id code. Each code is followed by a (sorted) list of
all sequence entry codes which are associated with the species id code. The
sequence entry code is the first component of each entry name (before the "$")
and the species id code is the second component of each entry name (after the
"$"). For example, the entry called COA3$ADEA2 has a species id code of ADEA2
and a sequence entry code of COA3.
Each species id code starts on a new line, occupying columns 1-5, and the
associated sequence entry codes (each up to 4 characters long) start in columns
10, 15, 20, 25, 30, 35, ... 70, 75. If there are more than 14 sequence entry
codes for a given species id, as many continuation lines as necessary will be
used, with columns 1-9 left blank.
An excerpt from the species id code index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACEME PER RBS1 RBS2 RBS3 RBS4 RBS5
ACENE CYC
ACHKL CALM
ACHLY API
ACIBA KKA
ACICA BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 PQQ5 PQQL PQQR TRPC
TRPD TRPG
ACIFE HGDA HGDB YHGD
ACIGL ASPQ
ACIGU PRT1
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
8 WE NEED YOUR HELP !
We welcome any feedback from our users. We would especially appreciate
information about any sequences belonging to your field of expertise which are
missing from the database. We also would like to be notified about annotations
which should be updated (e.g. if the function of a protein has been clarified,
or if new post-translational information has become available).
APPENDIX A
SOME STATISTICS
A.1 AMINO ACID COMPOSITION
Composition in percent for the complete database:
Ala (A) 7.69 Gln (Q) 4.11 Leu (L) 9.10 Ser (S) 7.06
Arg (R) 5.21 Glu (E) 6.30 Lys (K) 5.86 Thr (T) 5.84
Asn (N) 4.43 Gly (G) 7.17 Met (M) 2.30 Trp (W) 1.32
Asp (D) 5.24 His (H) 2.26 Phe (F) 3.95 Tyr (Y) 3.20
Cys (C) 1.83 Ile (I) 5.39 Pro (P) 5.12 Val (V) 6.48
Asx (B) 0.01 Glx (Z) 0.01 Xaa (X) 0.04
Classification of the amino acids by their frequency:
Leu, Ala, Gly, Ser, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
Phe, Tyr, Met, His, Cys, Trp
A.2 DISTRIBUTION OF SEQUENCES BY SPECIES
Total number of species represented in this release of the database: 2192.
Species represented 1x: 964
2x: 413
3x: 224
4x: 127
5x: 96
6x: 76
7x: 40
8x: 29
9x: 47
10x: 24
11- 20x: 75
21-100x: 60
>100x: 17
Table of the most common species:
Number Frequency Species
1 1336 Human
2 1150 Escherichia coli
3 763 Mouse
4 652 Rat
5 501 Baker's yeast (Saccharomyces cerevisiae)
6 387 Bovine
7 263 Fruit fly (Drosophila melanogaster)
8 253 Chicken
9 210 Rabbit
10 176 Pig
11 169 Bacillus subtilis
12 146 African clawed frog (Xenopus laevis)
13 134 Salmonella typhimurium
14 132 Bacteriophage T4
15 118 Maize
16 104 Rice
104 Tobacco
18 85 Wheat
19 84 Liverwort (Marchantia polymorpha)
20 77 Pea
21 75 Slime mold (Dictyostelium discoideum)
75 Spinach
75 Staphylococcus aureus
24 74 Soybean
25 73 Vaccinia virus
A.3 DISTRIBUTION OF SEQUENCES BY LENGTH
From To Number From To Number
1- 50 1012 1001-1100 131
51- 100 1810 1101-1200 84
101- 150 2648 1201-1300 68
151- 200 1546 1301-1400 46
201- 250 1233 1401-1500 35
251- 300 1093 1501-1600 17
301- 350 929 1601-1700 20
351- 400 939 1701-1800 16
401- 450 699 1801-1900 13
451- 500 786 1901-2000 18
501- 550 601 2001-2100 7
551- 600 381 2101-2200 19
601- 650 279 2201-2300 19
651- 700 199 2301-2400 10
701- 750 196 2401-2500 9
751- 800 125 >2500 29
801- 850 115
851- 900 134
901- 950 71
951-1000 72
Currently the five largest sequences are:
RYNR$RABIT 5037 a.a.
APB$HUMAN 4563 a.a.
APOA$HUMAN 4548 a.a.
DMD$HUMAN 3685 a.a.
DMD$CHICK 3660 a.a.