SWISS-PROT RELEASE 25.0 RELEASE NOTES
1. INTRODUCTION
1.1 Evolution
Release 25.0 of SWISS-PROT contains 29955 sequence entries, comprising
10'214'020 amino acids abstracted from 29176 references. This represents
an increase of 7% over release 24. The recent growth of the data bank is
summarized below.
Release Date Number of entries Nb of amino acids
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 08/88 7724 2 224 465
9.0 11/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 07/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
15.0 08/90 16941 5 486 399
16.0 11/90 18364 5 986 949
17.0 02/91 20024 6 524 504
18.0 05/91 20772 6 792 034
19.0 08/91 21795 7 173 785
20.0 11/91 22654 7 500 130
21.0 03/92 23742 7 866 596
22.0 05/92 25044 8 375 696
23.0 08/92 26706 9 011 391
24.0 12/92 28154 9 545 427
25.0 04/93 29955 10 214 020
1.2 Source of data
Release 25.0 has been updated using protein sequence data from release
35.0 of the PIR (Protein Identification Resource) protein data bank, as
well as translation of nucleotide sequence data from release 34.0 of the
EMBL Nucleotide Sequence Database.
As an indication to the source of the sequence data in the SWISS-PROT
data bank we list here the statistics concerning the DR (Database cross-
references) pointer lines:
Entries with pointer(s) to only PIR entri(es): 4420
Entries with pointer(s) to only EMBL entri(es): 4235
Entries with pointer(s) to both EMBL and PIR entri(es): 20694
Entries with no pointers lines: 606
<PAGE>
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 24
2.1 Sequences and annotations
About 1820 sequences have been added since release 24, the sequence data
of 191 existing entries has been updated and the annotations of 2900
entries have been revised. In particular we have used reviews articles
to update the annotations of the following groups or families of
proteins:
- 2'-5'-oligoadenylate synthetases
- Adaptins
- ADP-glucose pyrophosphorylases
- Alanine dehydrogenases and pyridine nucleotide transhydrogenases
- Alpha-isopropylmalate and homocitrate synthases
- Ankrepeats proteins
- Bacterial regulatory proteins, arsR family
- Cobalt-binding eukaryotic proteins
- Cytochrome c and c1 heme lyases
- Elongation factor 1 beta/beta'/delta and gamma chains
- Eukaryotic initiation factor 4E
- Erythropoietins
- Glycosyl hydrolases family 8
- Glucoamylases
- Heat shock 20 Kd family
- Interleukins-4
- MARCKS family
- Oleosins
- Prokaryotic transcription elongation factors (greA/B)
- Serine proteases, ompT family
- Tyrosine protein kinases
- Uroporphyrin-III C-methyltransferases
- XPGC family proteins
2.2 SWISS-PROT and 2D gel databases
Two-dimensional (2D) gel techniques have made enormous progress in the
last few years. One of the consequences of this evolution has been the
development of databases that contain master gels from a variety of
mammalian tissues or from bacterial sources. These databases are going
to play an increasingly important role in the analysis of genomes and of
molecular diseases. 2D gel databases generally contain one or more
master images of the gels that correspond to the tissue or organism
studied; spots on these images are attributed an identification code and
a variable percentage of these spots are linked to known proteins. The
identification of a protein on a 2D gel is generally carried out using
antibodies or by microsequencing. Microsequencing of 2D gel spots also
produces partial sequences and physico-chemical data for a number of yet
uncharacterized proteins.
<PAGE>
SWISS-PROT has committed itself to work in close collaboration with a
number of groups developing 2D gel databases. Since last year cross-
references are already available to the gene-protein database of
Escherichia coli K-12 (now called ECO2DBASE) [1] and symmetrically that
database now contains cross-references to SWISS-PROT. As a second step
we have expanded our links to 2D gel databases by integrating data from
the following sources:
- The Human 2D gel protein database of the Faculty of Medicine of the
University of Geneva (known as SWISS-2DPAGE). SWISS-2DPAGE currently
contains data concerning plasma [2] and liver [3] proteins, but will
soon include additional tissues.
- The Human keratinocyte 2D gel protein database from the universities
of Aarhus and Ghent [4] (known as AARHUS/GHENT-2DPAGE).
For both of the above databases we provide:
a) Cross-references (using the DR line) to the identificators for the
spots corresponding to known or unknown microsequenced proteins.
b) We have created new entries for microsequences that correspond to
novel, yet unidentified, proteins.
c) In some cases we have entered the extent of the microsequences for
already known proteins. This was done for proteins which are not yet
well characterized. The availability of such microsequences allows,
for example, to confirm the position of a signal sequence cleavage
site or to confirm the correctness of a translated genomic sequence.
In the near future the collaboration with the group of D. Hochstrasser
which produces the SWISS-2DPAGE database will be expanded in the
following directions:
a) The MELANIE software package [5] which is a complete system for the
analysis of 2D gels and which is developed by the group of
Hochstrasser will allow its users to navigate back and forth between
SWISS-2DPAGE and SWISS-PROT. For more information on Melanie, please
contact Dr. Ron Appel:
Email: appel@cih.hcuge.ch
Tel: +41-22-372 62 64
Fax: +41-22-372 61 98
b) A file server will be set up that will allow anyone with a network
connection to obtain annotated graphic files containing the region of
the gels that correspond to a selected SWISS-PROT entry linked to
SWISS-2DPAGE.
[1] VanBogelen R.A., Sankar P., Clark R.L., Bogan J.A., Neidhardt F.C.
Electrophoresis 13:1014-1054(1992).
<PAGE>
[2] Hughes G.J., Frutiger S., Paquet N., Ravier F., Pasquali C.,
Sanchez J.-C., James R., Tissot J.-D., Bjellqvist B., Hochstrasser
D.F.
Electrophoresis 13:707-714(1992).
[3] Hochstrasser D.F., Frutiger S., Paquet N., Bairoch A., Ravier F.,
Pasquali C., Sanchez J.-C., Tissot J.-D., Bjellqvist B., Vargas R.,
Appel R.D., Hughes G.J.
Electrophoresis 13:992-1001(1992).
[4] Celis J.E., Rasmussen H.H., Madsen P., Leffers H., Honore B.,
Dejgaard K., Gesser B., Olsen E., Gromov P., Hoffmann H.J., Nielsen
M., Celis A., Basse B., Lauridsen J.B., Ratz G.P., Nielsen H.,
Andersen A.H., Walbum E., Kjaergaard I., Puype M., Van Damme J.,
Vandekerckhove J.
Electrophoresis 13:893-959(1992).
[5] Appel R., Hochstrasser D.F., Funk M., Vargas J.R., Pellegrini C.,
Muller A.F., Scherrer J.-R.
Electrophoresis 12:722-735(1991).
2.3 Changes in the DR line
a) The cross-references to the SWISS-2DPAGE and AARHUS/GHENT-2DPAGE 2D
gel databases (see the description in the above section).
Data bank identifier: SWISS-2DPAGE
Primary identifier : The protein spot alphanumeric designation.
Note: SWISS-2DPAGE uses SWISS-PROT primary
accession numbers as the alphanumeric designation
of spots that are linked to SWISS-PROT entries.
Secondary identifier: The species of origin (currently only `HUMAN' is
used).
Example : DR SWISS-2DPAGE; P10599; HUMAN.
Data bank identifier: AARHUS/GHENT-2DPAGE
Primary identifier : The protein spot alphanumeric designation.
Secondary identifier: Either `IEF' (for isoelectric focusing) or
`NEPHGE' (for non-equilibrium pH gradient
electrophoresis).
Example : DR AARHUS/GHENT-2DPAGE; 8006; IEF.
b) Instead of using the release number as the secondary identifier for a
cross-reference to the FlyBase database, we now use the gene name
indicated in the UID table of this database. Example:
DR FLYBASE; 00055; RELASE 9209.
is now:
DR FLYBASE; 00055; ADH.
<PAGE>
c) Instead of using "EC-2D-GEL" to designate the 2D gel gene-protein
database of Escherichia coli, we now use "ECO2DBASE" to reflect the
new official name of that database. Example:
DR EC-2D-GEL; I026.0; 4TH EDITION.
is now:
DR ECO2DBASE; I026.0; 5TH EDITION.
2.4 Weekly update of SWISS-PROT
Starting with the last release we started to provide weekly updates of
SWISS-PROT. These updates are available by anonymous FTP. Three files
are updated every week:
new_seq.dat Contains all the new entries since the last full release.
upd_seq.dat Contains the entries for which the sequence data has been
updated since the last release.
upd_ann.dat Contains the entries for which one or more annotation
fields have been updated since
the last release.
Currently these files are available on the following anonymous ftp
servers:
Organism EMBL ftp server
Address ftp.embl-heidelberg.de (or 192.54.41.33)
Directory /pub/databases/swissprot/new
Organism Basel Biozentrum Biocomputing server (EMBnet SWISS node)
Address bioftp.unibas.ch (or 131.152.8.1)
Directory /archive_data/brand_new/swissprot/updates
Organism ExPASy (Geneva University Expert Protein Analysis System)
Address expasy.hcuge.ch (or 129.195.254.61)
Directory /databases/swiss-prot/updates
Organism National Center for Biotechnology Information (NCBI)
Address ncbi.nlm.nih.gov (or 130.14.20.1)
Directory /repository/swiss-prot/updates
!! Important notes !!!
Although we try to follow a regular schedule, we do not promise to
update these files every week. In some cases two weeks will elapse in-
between two updates.
<PAGE>
Due to the current mechanism used to build a release the entries that
are provided in these updates are not guaranteed to be error free. Also,
for the same reason, new entries do not contain an OC (Organism
Classification) line.
2.5 Cancelling the announced change in the RA line concerning the
author names format
After reconsideration, the RA line change announced in release 24 has
been cancelled as the expected benefits did not seem to compensate
possible negative effects on existing software. The format of the RA
lines therefore remains unchanged both in the EMBL Nucleotide sequence
database and in SWISS-PROT.
3. ENZYME AND PROSITE
3.1 The ENZYME data bank
Release 12.0 of the ENZYME data bank is distributed with release 25 of
SWISS-PROT. ENZYME release 12.0 contains information relative to 3489
enzymes.
3.2 The PROSITE data bank
Release 10.1 of the PROSITE data bank is distributed with release 25 of
SWISS-PROT. Release 10.1 contains 635 documentation chapters that
describes 803 different patterns. Release 10.1 does not really represent
a new release; the only changes between releases 10.0 and 10.1 are
updating of the pointers to the SWISS-PROT entries whose name have been
modified between releases 24 and 25. The next release of PROSITE (11.0)
will be distributed with release 26 of SWISS-PROT.
4. WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate that
you notify us if you find that sequences belonging to your field of
expertise are missing from the data bank. We also would like to be
notified about annotations to be updated, if, for example, the function
of a protein has been clarified or if new post-translational information
has become available.
<PAGE>
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.68 Gln (Q) 4.03 Leu (L) 9.17 Ser (S) 7.07
Arg (R) 5.26 Glu (E) 6.26 Lys (K) 5.81 Thr (T) 5.84
Asn (N) 4.43 Gly (G) 7.08 Met (M) 2.35 Trp (W) 1.31
Asp (D) 5.26 His (H) 2.26 Phe (F) 3.98 Tyr (Y) 3.22
Cys (C) 1.80 Ile (I) 5.51 Pro (P) 5.05 Val (V) 6.52
Asx (B) 0.01 Glx (Z) 0.01 Xaa (X) 0.03
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
Phe, Tyr, Met, His, Cys, Trp
A.2 Repartition of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 3876
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 1755
2x: 633
3x: 376
4x: 240
5x: 169
6x: 133
7x: 81
8x: 74
9x: 79
10x: 38
11- 20x: 144
21- 50x: 92
51-100x: 29
>100x: 37
<PAGE>
A.2.2 Table of the most represented species
Number Frequency Species
1 2297 Human
2 2039 Escherichia coli
3 1367 Mouse
4 1262 Rat
5 1143 Baker's yeast (Saccharomyces cerevisiae)
6 593 Bovine
7 527 Fruit fly (Drosophila melanogaster)
8 463 Chicken
9 431 Bacillus subtilis
10 334 African clawed frog (Xenopus laevis)
11 330 Salmonella typhimurium
12 315 Rabbit
13 287 Pig
14 251 Vaccinia virus (strain Copenhagen)
15 211 Maize
16 193 Human cytomegalovirus (strain AD169)
17 173 Vaccinia virus (strain WR)
18 171 Rice
19 168 Bacteriophage T4
20 164 Arabidopsis thaliana (Mouse-ear cress)
21 154 Tobacco
22 151 Wheat
23 148 Pea
24 145 Pseudomonas aeruginosa
25 139 Caenorhabditis elegans
26 129 Barley
27 127 Fission yeast (Schizosaccharomyces pombe)
28 123 Staphylococcus aureus
29 120 Spinach
30 119 Sheep
31 118 Soybean
32 117 Marchantia polymorpha (Liverwort)
117 Slime mold (Dictyostelium discoideum)
34 106 Dog
106 Neurospora crassa
36 103 Klebsiella pneumoniae
37 102 Pseudomonas putida
<PAGE>
A.3 Repartition of the sequences by size
From To Number From To Number
1- 50 1786 1001-1100 287
51- 100 3037 1101-1200 169
101- 150 4386 1201-1300 142
151- 200 2924 1301-1400 92
201- 250 2493 1401-1500 78
251- 300 2209 1501-1600 39
301- 350 2035 1601-1700 39
351- 400 2110 1701-1800 35
401- 450 1546 1801-1900 38
451- 500 1738 1901-2000 29
501- 550 1175 2001-2100 12
551- 600 828 2101-2200 33
601- 650 575 2201-2300 43
651- 700 423 2301-2400 14
701- 750 402 2401-2500 16
751- 800 326 >2500 90
801- 850 240
851- 900 248
901- 950 153
951-1000 165
Currently the ten largest sequences are:
RYNR_RABIT 5037 a.a.
RYNR_HUMAN 5032 a.a.
APB_HUMAN 4563 a.a.
APOA_HUMAN 4548 a.a.
RRPA_CVMJH 4488 a.a.
DYHC_TRIGR 4466 a.a.
GRSB_BACBR 4451 a.a.
PLEC_RAT 4140 a.a.
POLG_BVDV 3988 a.a.
VGF1_IBVB 3951 a.a.
<PAGE>
APPENDIX B: ON-LINE EXPERTS
B.1 List of on-line experts for PROSITE and SWISS-PROT
Field of expertise Name Email address
--------------------------- ------------------ ----------------------------
Alcohol dehydrogenases Joernvall H. hans.jornvall@k1m.ki.se
Persson B. bengt.persson@embl-heidelberg.
de
Aldehyde dehydrogenases Joernvall H. hans.jornvall@k1m.ki.se
Persson B. bengt.persson@embl-heidelberg.
de
Alpha-crystallins/HSP-20 Leunissen J.A.M. jackl@caos.caos.kun.nl
de Jong W. u629000@hnykun11.bitnet
Alpha-2-macroglobulins Van Leuven F. fred@blekul13.bitnet
AA-tRNA synthetases class II Leberman R. leberman@frembl51.bitnet
Apolipoproteins Boguski M.S. boguski@ncbi.nlm.nih.gov
AraC family HTH proteins Ramos J.L. jlramos@cnbvx3.cnb.uam.es
Arrestins Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.
edu
ATP synthase c subunit Recipon H. recipon@ncbi.nlm.nih.gov
Band 4.1 family proteins Rees J. jrees@vax.oxford.ac.uk
Beta-lactamases Brannigan J. jab5@vaxa.york.ac.uk
Beta-transducin family Boguski M.S. boguski@ncbi.nlm.nih.gov
C-type lectin domain Drickamer K. drick@cuhhca.hhmi.columbia.
edu
Chalcone/stilbene synthases Schroeder J. raf@sun1.ruf.uni-freiburg.de
Chaperonins cpn10/cpn60 Georgopoulos C. georgopo@cmu.unige.ch
Chaperonins TCP1 family Willison K.R. willison@icr.ac.uk
Chitinases Henrissat B. bernie@cermav.grenet.fr
Clusterin Peitsch M.C. peitsch@ulbio1.unil.ch
Cold shock domain Landsman D. landsman@ncbi.nlm.nih.gov
CTF/NF-I Mermod N. nmermod@ulys.unil.ch
Gronostajski R. gronosr@ccsmtp.ccf.org
Cytochromes P450 Holsztynska E.J. ela@netcom.uucp
netcom!ela@apple.com
DEAD-box helicases Linder P. linder@urz.unibas.ch
dnaJ family Kelley W. kelley@cmu.unige.ch
EF-hand calcium-binding Cox J.A. cox@sc2a.unige.ch
Kretsinger R.H. rhk5i@virginia.bitnet
Elongation factor 1 Amons R. wmbamons@rulgl.leidenuniv.nl
Enoyl-CoA hydratase Hofmann K.O. khofmann@biomed.biolan.
uni-koeln.de
fruR/lacI family HTH proteins Reizer J. jreizer@ucsd.edu
GATA-type zinc-fingers Boguski M.S. boguski@ncbi.nlm.nih.gov
GDT/GTP dissociation stimul. Boguski M.S. boguski@ncbi.nlm.nih.gov
GltP family of transporters Hofmann K.O. khofmann@biomed.biolan.
uni-koeln.de
<PAGE>
Glucanases Henrissat B. bernie@cermav.grenet.fr
Beguin P. phycel@pasteur.bitnet
Glutamine synthetase Tateno Y. ytateno@genes.nig.ac.jp
G-protein coupled receptors Chollet A. arc3029@ggr.co.uk
Attwood T.K. bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins Boguski M.S. boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17 Landsman D. landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.
edu
Integrases Roy P.H. 2020000@saphir.ulaval.ca
Kringle domain Ikeo K. kikeo@genes.nig.ac.jp
Lipocalins Boguski M.S. boguski@ncbi.nlm.nih.gov
Peitsch M.C. peitsch@ulbio1.unil.ch
lysR family HTH proteins Henikoff S. henikoff@sparky.fhcrc.org
MAC components / perforin Peitsch M.C. peitsch@ulbio1.unil.ch
Malic enzymes Glynias M. mglynias@ncsa.uiuc.edu
MAM domain Bork P. bork@embl-heidelberg.de
MIP family proteins Reizer J. jreizer@ucsd.edu
Myelin proteolipid protein Hofmann K.O. khofmann@biomed.biolan.
uni-koeln.de
Pancreatic trypsin inhibitor Ikeo K. kikeo@genes.nig.ac.jp
PEP requiring enzymes Reizer J. jreizer@ucsd.edu
pfkB carbohydrate kinases Reizer J. jreizer@ucsd.edu
Phytochromes Partis M.D. partis@gcri.afrc.ac.uk
Protein kinases Hanks S. hanks@vuctrvax.bitnet
Hunter T. hunter@salk.bitnet
PTS proteins Reizer J. jreizer@ucsd.edu
Restriction-modification Bickle T. bickle@urz.unibas.ch
enzymes Roberts R.J. roberts@neb.com
Ribosomal protein S3 Hallick R. hallick%biotec@arizona.edu
Ribosomal protein S15 Ellis S.R. srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases Harayama S. harayama@cmu.unige.ch
Signal sequence peptidases von Heijne G. gvh@csb.ki.se
Dalbey R.E. rdalbey@magnus.acs.ohio-state.
edu
Sodium symporters Reizer J. jreizer@ucsd.edu
Subtilases Brannigan J. jab5@vaxa.york.ac.uk
Siezen R.J. nizo@caos.caos.kun.nl
Thiol proteases Turk B. turk@ijs.ac.mail.yu
Thiol proteases inhibitors Turk B. turk@ijs.ac.mail.yu
TNF family Jongeneel C.V. vjongene@isrecmail.unil.ch
TPR repeats Boguski M.S. boguski@ncbi.nlm.nih.gov
Transit peptides von Heijne G. gvh@csb.ki.se
Type-II membrane antigens Levy S. levy@cellbio.stanford.edu
Uracil-DNA glycosylase Aasland R. aasland@bio.uib.no
Vitamin K-depend. Gla domain Price P.A. pprice@ucsd.edu
XPGC protein Clarkson S.G. clarkson@cmu.unige.ch
Xylose isomerase Jenkins J. jenkins@frira.afrc.ac.uk
WAP-type domain Claverie J.-M. jmc@ncbi.nlm.nih.gov
ZP domain Bork P. bork@embl-heidelberg.de
<PAGE>
African swine fever virus Yanez R.J. ryanez@cbm2.uam.es
Bacteriophage P4 Halling C. chh9@midway.uchicago.edu
Drosophila Ashburner M. ma11@phx.cam.ac.uk
Escherichia coli Rudd K. rudd@ncbi.nlm.nih.gov
Salmonella typhimurium Rudd K. rudd@ncbi.nlm.nih.gov
Snakes Stocklin R. stocklin@cmu.unige.ch
Yeast chromosome I Ouellette F. francis@monod.biol.mcgill.ca
B.2 Requirements to fulfill to become an on-line expert
An expert should be a scientist working with specific famili(es) of
proteins (or specific domains) and who would:
a) Review the protein sequences in SWISS-PROT and the patterns/matrices
in PROSITE relevant to their field of research.
b) Agree to be contacted by people that have obtained new sequence(s)
which seem to belong to "their" familie(s) of proteins.
c) Have access to electronic mail and be willing to use it to send and
receive data.
If you are willing to be part of this scheme please contact Amos Bairoch
at one of the following electronic mail addresses:
bairoch@cmu.unige.ch
bairoch@cgecmu51.bitnet
<PAGE>
APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES
The current status of the relationships (cross-references) between some
biomolecular databases is shown in the following schematic:
**********************
*********************** * EPD [Euk. Promot.] *
* EMBL Nucleotide * <----> **********************
* Sequence Data *
****************** * Library * **********************
* FLYBASE * <--> *********************** <----- * ECD [E. coli map] *
* [Drosophila * ^ ^ **********************
* genomic d.b.] * <------+ | |
****************** | | +--------- **********************
| | * TFD [Trans. fact.] *
| | +--------> **********************
| | |
****************** v v v **********************
* REBASE * *********************** * ENZYME [Nomencl.] *
* [Restriction * <--- * SWISS-PROT * <----- **********************
* enzymes] * * Protein Sequence * |
****************** * Data Bank * v
*********************** **********************
****************** ^ ^ | | ^ ^ | * OMIM [Diseases] *
* EcoGene/EcoSeq * | | | | | | +--------> **********************
* [E. coli] * <-----+ | | | | |
****************** | | | | +-----------> **********************
| | | | * ECO2DBASE [2D] *
| | | | **********************
****************** | | | |
* PROSITE * <--------+ | | +---------------> **********************
* [Patterns] * | | * SWISS-2DPAGE [2D] *
****************** | +---------------+ **********************
| v |
| *********************** | **********************
+--------> * PDB [3D structures] * +--> * Aarhus/Ghent [2D] *
*********************** **********************
<PAGE>