SWISS-PROT RELEASE 18.0 RELEASE NOTES
1. INTRODUCTION
1.1 Evolution
Release 18.0 of SWISS-PROT contains 20772 sequence entries, comprising
6'792'034 amino acids abstracted from 20580 references. This represents
an increase of 4% over release 17. The recent growth of the data bank is
summarized below:
Release Date Number of entries Nb of amino acids
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 08/88 7724 2 224 465
9.0 11/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 07/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
15.0 08/90 16941 5 486 399
16.0 11/90 18364 5 986 949
17.0 02/91 20024 6 524 504
18.0 05/91 20772 6 792 034
1.2 Source of data
Release 18.0 has been updated using protein sequence data from release
27.0 of the PIR (Protein Identification Resource) protein data bank, as
well as translation of nucleotide sequence data from release 26.0 of the
EMBL Nucleotide Sequence Data Library.
As an indication to the source of the sequence data in the SWISS-PROT
data bank we list here the statistics concerning the DR (Databank
Reference) pointer lines:
Entries with pointer(s) to only PIR entri(es): 3863
Entries with pointer(s) to only EMBL entri(es): 3435
Entries with pointer(s) to both EMBL and PIR entri(es): 13001
Entries with no pointers lines: 473
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 17
2.1 Sequences and annotations
About 780 sequences have been added since release 17, the sequence data
of 139 existing entries has been updated and the annotations of 3010
entries have been revised. In particular we have used reviews articles
to update the annotations of the following groups or families of
proteins:
- Bacterial luciferase subunits
- Beta-amylases
- Carboxylesterases type-B
- Class-I aminoacyl-tRNA synthetases
- Clusterins
- Fumarate reductases / succinate dehydrogenases
- G-protein coupled receptors
- GTPase-activating proteins
- IMP dehydrogenase / GMP reductase
- Insulin-like growth factor binding proteins
- Leucine/isoleucine-binding proteins
- Lipocalins
- Manganese-dependent dipeptidases
- Nickel-dependent hydrogenases
- P-II proteins (glnB)
- Pectinesterases
- Phenylalanine and histidine ammonia-lyases
- Phosphoenolpyruvate carboxykinase (ATP and GTP)
- Polygalacturonases
- Prokaryotic molybdopterin oxidoreductases
- Receptors tyrosine kinase class IV (FGF receptors)
- Ribosomal proteins
- Signal recognition particle 54 Kd protein
- Somatostatins
- Thymosin beta-4 family
- Tyrosinases
- Wnt-1 family
2.2 Change in the OS line
As previously announced we have inverted the order of the information in
the OS line. We switched from 'English common name (Latin name)` to
'Latin name (English common name)`. Example:
OS HUMAN (HOMO SAPIENS).
as been changed to:
OS HOMO SAPIENS (HUMAN).
2.3 Cross-references to the Escherichia coli gene-protein database
We have added cross-references to the Escherichia coli gene-protein
database (2D gels spots) (for a description see: VanBogelen R.A., Hutton
M.E., and Neidhardt F.C., in Electrophoresis (1990), 12:1131-1166).
These cross-references are present in the DR lines. The data bank
identifier is EC-2D-GEL, the primary identifier is the 2D gel spot
alphanumeric designation, and the secondary identifier is the latest
edition of the data bank that we have used to derive the cross-
reference. Example of a DR line for the Escherichia coli gene-protein
database:
DR EC-2D-GEL; G052.0; 3RD EDITION.
2.4 Status of cross-references to PIR
We have continued adding cross-references to entries in the unannotated
sections of PIR (known as PIR2 and PIR3); currently we have cross-
references to 13118 sequence entries in PIR2/3 out of a total of 19051
entries in those sections in release 27 of PIR.
2.5 Documentation changes
- The EC2DTOSP.TXT document is an index of Escherichia coli Gene-
protein database entries referenced in SWISS-PROT (see section 2.3).
- The SPEINDEX.TXT document is a species index.
- The JOURLIST.TXT document now indicates, when it exist, the 6
characters CODEN designation of the journals cited in SWISS-PROT and
in PROSITE. Example of an entry in the JOURLIST.TXT file:
Abbrev: EMBO J.
Title : EMBO Journal
ISSN : 0261-4189
Coden : EMJ0DG
- The SPECODES.TXT document is no longer distributed. The information
contained in this document was duplicating that found in the species
index.
2.6 Absence of the line-types: CA and CF
We announced in the last two release notes that, starting with release
18, the enzyme entries in SWISS-PROT would have two new line-types:
CA Description_of_catalytic_activity.
CF Description_of_cofactor.
We finally decided not to implement this change as it would create line-
types specific to a subset of entries (enzymes); it would open the door
to the creation of too many types of lines. We believe that the use of
topics in the comment line is a better approach for the storage of such
information.
3. FORTHCOMING CHANGES
3.1 New line-types: RP and RC
We plan to implement the following change in release 19; the current RN
line will be replaced by three line types: a modified RN (Reference
Number) line type containing just the reference number, a new RP
(Reference Position) line type containing the extent of the work carried
out by the authors of the reference, and a new RC (Reference Comment)
line type containing comments relevant to the reference (strain, tissue,
etc.). Three examples of the usage of these new lines are given below.
RN [1]
RP SEQUENCE FROM N.A., AND SEQUENCE OF 1-23.
RC STRAIN=K12;
RN [1]
RP SEQUENCE OF 24-56 AND 67-89.
RC STRAIN=BALB/C; TISSUE=BRAIN;
RN [2]
RP X-RAY CRYSTALLOGRAPHY 1.8 ANGSTROMS.
- Each reference block will continue to have exactly one RN line.
- There will always be a single RP line which will be in free text
format.
- As many RC lines as are needed to display the comments will appear;
if a reference has no comment then the RC line will not appear.
- A precise syntax will be used to display the information that appear
on the RC line.
The syntax of the Rc line is:
RC TOKEN=Text; TOKEN=Text;....
Where the following token are already defined: MEDLINE, PLASMID,
SPECIES, STRAIN, and TISSUE. Additional tokens will probably be added to
this list.
3.2 MEDLINE unique identifiers
Starting with release 19 each journal reference listed in SWISS-PROT
which exists in the MEDLINE bibliographic data bank will include the
"Unique Identifier" (UI) of that reference in MEDLINE. This information
will be stored in the new RC line using the "MEDLINE" token. Example:
RC MEDLINE=90205618;
It is planned that, in a few months, MEDLINE will add cross-references
to SWISS-PROT.
4. ENZYME AND PROSITE
4.1 The ENZYME data bank
Release 5.0 of the ENZYME data bank is distributed along with release 18
of SWISS-PROT. ENZYME release 5.0 contains information relative to 3072
enzymes. The data bank is complete and up to date. Until new enzyme
nomenclature data is published we only plan to update the SWISS-PROT
pointers at each release of the protein sequence data bank, correct
eventual errors, and complete the information concerning synonyms and
cofactors using the literature.
4.2 The PROSITE data bank
Release 7.0 of the PROSITE data bank is distributed along with release
18 of SWISS-PROT. Release 7.0 contains 441 documentation chapters that
describes 508 different patterns. Since the last major release of
PROSITE (release 6.0 of November 1990), 69 new chapters have been added
and 163 chapters have been updated.
5. WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate that
you notify us if you find that sequences belonging to your field of
expertise are missing from the data bank. We also would like to be
notified about annotations to be updated, as for example if the function
of a protein has been clarified or if new post-translational information
has become available.
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.65 Gln (Q) 4.09 Leu (L) 9.11 Ser (S) 7.08
Arg (R) 5.23 Glu (E) 6.28 Lys (K) 5.85 Thr (T) 5.85
Asn (N) 4.44 Gly (G) 7.12 Met (M) 2.32 Trp (W) 1.30
Asp (D) 5.24 His (H) 2.27 Phe (F) 3.95 Tyr (Y) 3.21
Cys (C) 1.83 Ile (I) 5.44 Pro (P) 5.09 Val (V) 6.49
Asx (B) 0.01 Glx (Z) 0.01 Xaa (X) 0.03
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
Phe, Tyr, Met, His, Cys, Trp
A.2 Repartition of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 2864
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 1266
2x: 520
3x: 284
4x: 172
5x: 121
6x: 94
7x: 73
8x: 37
9x: 58
10x: 29
11- 20x: 101
21-100x: 86
>100x: 23
A.2.2 Table of the most represented species
Number Frequency Species
1 1715 Human
2 1454 Escherichia coli
3 1001 Mouse
4 933 Rat
5 675 Baker's yeast (Saccharomyces cerevisiae)
6 486 Bovine
7 388 Fruit fly (Drosophila melanogaster)
8 354 Chicken
9 282 Bacillus subtilis
10 255 Rabbit
11 251 Vaccinia virus (strain Copenhagen)
12 246 African clawed frog (Xenopus laevis)
13 230 Pig
14 193 Human cytomegalovirus (strain AD169)
15 191 Salmonella typhimurium
16 160 Bacteriophage T4
17 152 Maize
18 128 Rice
19 113 Vaccinia virus (strain WR)
20 112 Tobacco
Pea
22 106 Wheat
23 102 Staphylococcus aureus
24 96 Sheep
25 93 Barley
Slime mold (Dictyostelium discoideum)
27 84 Agrobacterium tumefaciens
Liverwort (Marchantia polymorpha)
Spinach
30 83 Soybean
31 80 Fission yeast (Schizosaccharomyces pombe)
32 79 Pseudomonas aeruginosa
Klebsiella pneumoniae
34 78 Dog
35 77 Neurospora crassa
A.3 Repartition of the sequences by size
From To Number From To Number
1- 50 1390 1001-1100 191
51- 100 2369 1101-1200 122
101- 150 3358 1201-1300 94
151- 200 2034 1301-1400 57
201- 250 1666 1401-1500 46
251- 300 1490 1501-1600 24
301- 350 1311 1601-1700 23
351- 400 1281 1701-1800 23
401- 450 979 1801-1900 27
451- 500 1057 1901-2000 22
501- 550 811 2001-2100 9
551- 600 555 2101-2200 23
601- 650 390 2201-2300 28
651- 700 288 2301-2400 11
701- 750 270 2401-2500 13
751- 800 214 >2500 50
801- 850 162
851- 900 172
901- 950 112
951-1000 100
Currently the ten largest sequences are:
RYNR$RABIT 5037 a.a.
RYNR$HUMAN 5032 a.a.
APB$HUMAN 4563 a.a.
APOA$HUMAN 4548 a.a.
POLG$BVDV 3988 a.a.
POLG$HCVA 3898 a.a.
POLG$HCVB 3898 a.a.
TRX$DROME 3759 a.a.
ACVA$PENCH 3746 a.a.
DMD$HUMAN 3685 a.a.
APPENDIX B: ON-LINE EXPERTS
B.1 List of on-line experts for PROSITE and SWISS-PROT
Field of expertise Name Email address
--------------------------- ------------------- --------------------------
Alcohol dehydrogenases Bengt P. bengt@medfys.ki.se
Aldehyde dehydrogenases Bengt P. bengt@medfys.ki.se
Alpha-2-macroglobulins Van Leuven F. fred@blekul13.bitnet
Apolipoproteins Boguski M.S. boguski@ncbi.nlm.nih.gov
Arrestins Kolakowski L.F. Jr. lfk@athena.mit.edu
Bacteriophage P4 Halling C. chh9@midway.uchicago.edu
Beta-lactamases Brannigan J. jab5@vaxa.york.ac.uk
Chitinases Henrissat B. cermav@frgren81.bitnet
CTF/NF-I Mermod N. nmermod@clsuni51.bitnet
Cytochromes P450 Holsztynska E.J. ela@netcom.uucp
netcom!ela@apple.com
EF-hand calcium-binding Cox J.A. cox@cgeuge52.bitnet
Kretsinger R.H. rhk5i@virginia.bitnet
Eryf1-type zinc-fingers Boguski M.S. boguski@ncbi.nlm.nih.gov
Glucanases Henrissat B. cermav@frgren81.bitnet
Beguin P. phycel@pasteur.bitnet
G-protein coupled receptors Chollet A. chollet@clients.switch.ch
Attwood T.K. bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins Boguski M.S. boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17 Landsman D. landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases Kolakowski L.F. Jr. lfk@athena.mit.edu
Integrases Roy P.H. 2020000@lavalvx1.bitnet
Phytochromes Partis M.D. partis@gcri.afrc.ac.uk
Protein kinases Hanks S. hanks@vuctrvax.bitnet
Restriction-modification Bickle T. bickle@urz.unibas.ch
Roberts R.J. roberts@cshl.org
Ring-cleavage dioxygenases Harayama S. harayama@cgecmu51.bitnet
Subtilisin family proteases Brannigan J. jab5@vaxa.york.ac.uk
Thiol proteases Turks B. turk@ijs.ac.mail.yu
Thiol proteases inhibitors Turks B. turk@ijs.ac.mail.yu
TPR repeats Boguski M.S. boguski@ncbi.nlm.nih.gov
Transit peptides von Heijne G. gunnar@cbts.sunet.se
Type-II membrane antigens Levy S. levy@cellbio.stanford.edu
Xylose isomerase Jenkins J. jenkins@frira.afrc.ac.uk
B.2 Requirements to fulfill to become an on-line expert
An expert should be a scientist working with specific famili(es) of
proteins (or specific domains) and which would:
a) Review the protein sequences in SWISS-PROT and the patterns/matrices
in PROSITE relevant to their field of research.
b) Agree to be contacted by people that have obtained new sequence(s)
which seem to belong to "their" familie(s) of proteins.
c) Have access to electronic mail and be willing to use it to send and
receive data.
If you are willing to be part of this scheme please contact Amos Bairoch
at one of the following electronic mail addresses:
bairoch@cgecmu51.bitnet
bairoch@cmu.unige.ch