SWISS-PROT RELEASE 16.0 RELEASE NOTES
1. INTRODUCTION
1.1 Evolution
Release 16.0 of SWISS-PROT contains 18364 sequence entries, comprising 5'986'949
amino acids abstracted from 17763 references. This represents an increase of 9%
over release 15. The recent growth of the data bank is summarized below:
Release Date Number of entries Nb of amino acids
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 08/88 7724 2 224 465
9.0 11/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 07/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
15.0 08/90 16941 5 486 399
16.0 11/90 18364 5 986 949
More than 1400 sequences have been added since release 15, the sequence data of
271 existing entries has been updated and the annotations of 3500 entries have
been revised. In particular we have used reviews articles to update the
annotations of the following groups or families of proteins:
- Alpha and beta adrenergic receptors
- Arrestins
- Chromogranins / secretogranins
- CTF/NF-I family
- ClpP proteases
- ets family
- GABA(A) receptors
- Gram-positive cocci surface proteins
- Hexokinases
- Integrins alpha and beta chains
- NMePhe pili proteins
- p53 proteins
- Poly(ADP-ribose) polymerase
- Profilins
- S-Adenosylmethionine synthetases
- Site-specific recombinases
- Synaptobrevins
- Type-II membrane antigens
- UDP-glucoronosyl transferases
- Uteroglobin family
- LBP / BPI / CETP family
2 DATA SOURCES
Release 16.0 has been updated using protein sequence data from release 25.0 of
the PIR (Protein Identification Resource) protein data bank, as well as
translation of nucleotide sequence data from release 24.0 of the EMBL Nucleotide
Sequence Data Library.
As an indication to the source of the sequence data in the SWISS-PROT data bank
we list here the statistics concerning the DR (Databank Reference) pointer
lines:
Entries with pointer(s) to only PIR entri(es): 3335
Entries with pointer(s) to only EMBL entri(es): 5468
Entries with pointer(s) to both EMBL and PIR entri(es): 8908
Entries with no pointers lines: 653
3 CHANGES AT THIS RELEASE
3.1 Cross-References To MIM
We have finished adding cross-references to all human protein sequence entries
which are represented in the latest edition of the MIM (Mendelian Inheritance in
Man) book [1].
There are currently 842 SWISS-PROT entries that have cross-references to one or
more MIM catalog number.
A new document file, called MIMTOSP.DOC, is provided with SWISS-PROT, it is a
sorted list of the MIM catalog entries cross-referenced in SWISS- PROT and the
corresponding protein sequence entry names.
4 FORTHCOMING CHANGES
We plan to implement the following changes in release 18 (these changes were
announced for release 16, but we are postponing their application so as to leave
more time to sequence analysis software developers to update their packages).
4.1 New Linetypes: CA and CF
As we announced in the last release notes, the enzyme entries in SWISS- PROT
will have two new line-types:
CA Description_of_catalytic_activity.
CF Description_of_cofactor.
These lines will be automatically generated at each release of SWISS- PROT from
the information stored in the ENZYME data bank. They will replace the
'CATALYTIC ACTIVITY` and 'COFACTORS` comment lines (CC) topics. For example:
CC -!- CATALYTIC ACTIVITY: L-ASPARTATE + 2-OXOGLUTARATE =
OXALOACETATE + L-GLUTAMATE.
--------------------------------------------------------------------------------
(1) McKusick Victor A., Mendelian Inheritance in Man, Catalogs of autosomal
dominant, autosomal recessive, and X-linked phenotypes, Ninth edition,
Johns Hopkins University Press, Baltimore, (1990).
CC -!- COFACTOR: PYRIDOXAL PHOSPHATE.
will be changed to:
CA L-ASPARTATE + 2-OXOGLUTARATE = OXALOACETATE + L-GLUTAMATE.
CF PYRIDOXAL PHOSPHATE.
4.2 OS Line Format
We will invert the order of the information in the OS line. Currently we have
"English common name (Latin name)"; we will switch to "Latin name (English
common name)". For example:
OS HUMAN (HOMO SAPIENS).
will be changed to:
OS HOMO SAPIENS (HUMAN).
5 ENZYME AND PROSITE DATABASES
Release 3.0 of the ENZYME data bank is distributed along with release 16 of
SWISS-PROT. ENZYME release 3.0 contains information relative to 3071 enzyme
entries. The data bank is complete and up to date. Until new enzyme
nomenclature data is published we only plan to update the SWISS- PROT pointers
at each release of the protein sequence data bank, correct eventual errors, and
complete the information concerning synonyms and cofactors using the literature.
Release 6.0 of the PROSITE data bank is distributed along with release 16 of
SWISS-PROT. PROSITE release 6 contains 375 documentation chapters that describe
433 different patterns. Since release 5.1 77 new chapters have been added and
131 have been updated.
6 DISTRIBUTION MEDIA
Data is available on magnetic tape, TK50 cassette and CD-ROM. This section of
the release notes applies to tape and TK50 cassette only; CD-ROM releases are
accompanied by their own release notes which detail the file organisation used
on CD.
6.1 Tape Formats
The distribution tapes are 9-track industry standard magnetic tapes. Each file
consists of fixed-length 80 byte records, padded with trailing blanks as
appropriate (except for VMS Backup format tapes which have variable length
records). Tape format details (density, blocksize, label type, character set)
are attached to each tape.
In many formats, a release requires more than one tape volume. In order to
support sequential volume serial numbers for multi-volume tape sets, the volume
labels are EMBL01 for the first tape, EMBL02 for the second tape, and so on.
VMS Backup format tapes (and all TK50 cassettes) contain the files listed below,
in the order shown, as a single save set called SWISS15.BCK.
6.2 Documentation
The documentation files on tape (those ending with a file extension of .DOC) are
designed to be easily printable. As with all other tape files they have a fixed
record length of 80 bytes. The page length of 63 lines per page was chosen so
that the pages will fit both on DIN A4 paper and on American 8-1/2" x 11" paper.
Page throws are indicated by lines with the six character string <PAGE> in
positions 1-6, and nothing else. If you wish to print any of these files we
suggest you copy them down onto disk, use your local editor to replace every
occurrence of <PAGE> in columns 1-6 by a formfeed (or whatever is appropriate to
force a page throw on your printer), and then print them.
6.3 Release 16 Files
The distribution tape(s) contain the files shown below, in the order listed.
Where more than one tape is required, subsequent volumes will continue where the
preceding volume left off.
File Number File Name Description #Records
----------- ------------ ----------------------------- --------
1 CONTENTS.DOC Tape Contents (this table) 64
2 SWISSPRT.USR User Manual 1830
3 RELNOTES.DOC Release Notes (this document) 895
4 SPECIES.NDX Species Index 10297
5 KEYWORD.NDX Keyword Index 20291
6 AUTHOR.NDX Author Index 88407
7 SHORTDIR.NDX Short Directory Index 38209
8 ACNUMBER.NDX Accession Number Index 18543
9 CITATION.NDX Citation Index 36092
10 SPIDCODE.NDX Species ID Code Index 3285
11 EMBLTOSP.DOC EMBL/SWISS-PROT Xreferences 17022
12 ORGCODES.DOC Organism Code List 3518
13 MIMTOSP.DOC MIM/SWISS-PROT Xreferences 1022
14 PDBTOSP.DOC PDB/SWISS-PROT Xreferences 574
15 JOURLIST.DOC Journal Abbreviation List 1470
16 DATASUB.TXT Data Submission Form 315
17 SEQ.DAT SWISS-PROT Sequence Entries 602162
18 PROSITE.USR PROSITE Database User Manual 915
19 PROSITE.LIS PROSITE Entry List 508
20 PROSITE.DOC PROSITE Entry Documentation 15669
21 PROSITE.DAT PROSITE Database Entries 8229
22 ENZYME.USR ENZYME Database User Manual 487
23 ENZYME.DAT ENZYME Database 20871
7 INDEX FILE FORMATS
The index key of each index file (keywords, authors, citations, etc.) is sorted
alphabetically; the names of all entries containing the index key are listed
alphabetically after the key. Each entry name is accompanied by its primary
accession number.
Except for the short directory, accession number and species id code indices,
all index files have the same layout: each value of the index key begins on a
new line in column 1, and the associated entry names begin on the next line.
Lines containing entry names are in fixed-format, layed out as follows:
Columns Description
------- ---------------------------
14-23 entry name (left-justified)
29-34 primary accession number
36-45 entry name (left-justified)
51-56 primary accession number
58-67 entry name (left-justified)
73-78 primary accession number
Up to three entry names fit on each such line; if a given index key has more
than three entries associated with it, additional lines are used (with exactly
the same layout). This index file format is identical to that of the EMBL
Nucleotide Sequence database.
7.1 Species Index
This file lists all species which appear in the database. It is sorted
alphabetically on (english) common name. The latin genus and species will be
listed, if present in the database entries. Mitochondrion and chloroplast
sequences appear under separate index keys, immediately after the related
nuclear sequences. An excerpt from the species index file is given below (the
ruler is presented for your convenience - it does not appear in the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
CHILEAN POTATO-TREE (SOLANUM CRISPUM)
PLAS$SOLCR P00297
CHIMPANZEE IMMUNODEFICIENCY VIRUS (CIV) (SIV(CPZ))
ENV$SIVCZ P17281 GAG$SIVCZ P17282 NEF$SIVCZ P17664
POL$SIVCZ P17283 REV$SIVCZ P17280 TAT$SIVCZ P17285
VIF$SIVCZ P17284 VPR$SIVCZ P17287 VPU$SIVCZ P17286
CHIMPANZEE (PAN SATYRUS)
HBA3$PANSA P01935
CHIMPANZEE (PAN TROGLODYTES)
CD4$PANTR P16004 HA1A$PANTR P13748 HA1B$PANTR P13749
HA1C$PANTR P16209 HA1D$PANTR P16210 HA1E$PANTR P16215
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.2 Keyword Index
This file lists all keywords which appear in the database (on the KW lines). It
is sorted alphabetically on keyword. An excerpt from the keyword index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACETYLCHOLINE RECEPTOR INHIBITOR
CXA1$CONGE P01519 CXA1$CONMA P01521 CXA1$CONST P15471
CXA2$CONGE P01520
ACIDIC PROTEIN
143E$BOVIN P11576 B23$RAT P13084 B231$HUMAN P08693
B232$HUMAN P06748 BAT$HALHA P13260 CALQ$CANFA P12637
CALQ$RABIT P07221 CENB$HUMAN P07199 CMGA$HUMAN P10645
CMGA$RAT P10354 GRPE$ECOLI P09372 MK16$YEAST P10962
NFL$BOVIN P02548 NO38$CHICK P16039 NU38$XENLA P07222
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.3 Author Index
This file lists all author names which appear in citations (on the RA lines) in
the database. It is sorted alphabetically on name. Names are presented as they
appear in the database entries (i.e. as cited in publications) - we do not
attempt to handle multiple surname spellings, or different initials, for the
same author. An excerpt from the author index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
AAN F.
PT3G$SALTY P02908
AARDEN L.A.
IL6$HUMAN P05231
AARONSON R.P.
HEMA$INCCA P03465
AARONSON S.
PGDS$HUMAN P16234
AARONSON S.A.
3ORF$EIAV1 P11305 DBL$HUMAN P10911 ENV$EIAV1 P11306
ENV$SMSAV P03384 GAG$AVISN P03342 GAG$MSVMO P03334
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.4 Short Directory Index
This file contains summary information about all entries in the database,
including a brief description of the entry, its sequence length, molecule type
and data management class. The file is sorted alphabetically on entry name.
The lines are fixed-format, layed out as follows:
Columns Field Name Description
------- --------------- -------------------------------------------
01-10 entry name left-justified
14-14 data class s = standard
u = unannotated
p = preliminary
r = unreviewed
16-18 molecule type PRT (protein)
20-25 sequence length right-justified
27-80 description left-justified
If an entry's description occupies more than 54 characters (cols 27-80), it will
be continued onto one or more continuation lines. Continuation lines contain
description text (left-justified) in cols 27-80; cols 01-26 are blank. An
excerpt from the short description index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
104K$THEPA s PRT 924 104 KD MICRONEME-RHOPTRY ANTIGEN - THEILERIA PARVA
10KA$MYCTU s PRT 100 BCG-A HEATSHOCK PROTEIN (10 KD ANTIGEN) -
MYCOBACTERIUM TUBERCULOSIS
10KS$HUMAN s PRT 91 10 KD SECRETORY PROTEIN PRECURSOR - HUMAN (HOMO
SAPIENS)
10KS$RAT s PRT 18 10 KD SECRETORY PROTEIN (CC10) (FRAGMENT) - RAT
(RATTUS NORVEGICUS)
110K$PLAKN s PRT 296 110 KD ANTIGEN (PK110) (FRAGMENT) - PLASMODIUM
KNOWLESI
11KD$ADE02 s PRT 79 11 KD CORE PROTEIN PRECURSOR (LATE L2 MU CORE PROTEIN)
(PROTEIN X) - ADENOVIRUS TYPE 2
11SB$CUCMA s PRT 480 11-S GLOBULIN BETA SUBUNIT PRECURSOR - PUMPKIN
(CUCURBITA MAXIMA)
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.5 Accession Number Index
This file lists all accession numbers in the database. It is sorted
alphabetically on accession number. Each accession number is followed by the
name and primary accession number of every entry in which it occurs.
The lines are fixed-format, layed out exactly the same as the other index files;
the only difference is that the index key (accession number) appears on the same
line (in cols 1-6) as the list of entries which contain the key. In the other
index files, the index key appears on a line by itself. This index key,
however, is short enough to fit on the same line as the entries, and we have
done this to save space.
Accession numbers which have been deleted from the database also appear in this
index, containing the word DELETED (left-justified) in the entry name field.
An excerpt from the accession number index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
P00836 ATPE$HORVU P00836
P00837 ATPG$ECOLI P00837
P00838 ATPG$ECOLI P00837
P00839 ATPL$BOVIN P00839 ATPM$BOVIN P07926
P00840 ATP9$MAIZE P00840
P00841 ATP9$YEAST P00841
P00842 ATPL$NEUCR P00842
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.6 Citation Index
This file lists all journal citations which appear in the database. It is
sorted alphabetically on citation. An excerpt from the citation index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
ANN. GENET. SEL. ANIM. 4:515-521(1972)
CASK$BOVIN P02668
ANN. INST. PASTEUR IMMUNOL. 127C:261-271(1976)
KV3F$HUMAN P01624
ANN. INST. PASTEUR IMMUNOL. 132D:77-88(1981)
HV41$MOUSE P01811
ANN. N.Y. ACAD. SCI. 165:360-377(1969)
HBB$ATEGE P02034
ANN. N.Y. ACAD. SCI. 241:436-438(1974)
HBB$RABIT P02057
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
7.7 Species ID Code Index
This file lists all species id codes which appear in the database. It is sorted
alphabetically on species id code. Each code is followed by a (sorted) list of
all sequence entry codes which are associated with the species id code. The
sequence entry code is the first component of each entry name (before the "$")
and the species id code is the second component of each entry name (after the
"$"). For example, the entry called COA3$ADEA2 has a species id code of ADEA2
and a sequence entry code of COA3.
Each species id code starts on a new line, occupying columns 1-5, and the
associated sequence entry codes (each up to 4 characters long) start in columns
10, 15, 20, 25, 30, 35, ... 70, 75. If there are more than 14 sequence entry
codes for a given species id, as many continuation lines as necessary will be
used, with columns 1-9 left blank.
An excerpt from the species id code index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):
1 10 20 30 40 50 60 70 80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACHKL CALM
ACHLY API
ACIBA KKA
ACICA BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 PQQ5 PQQL PQQR TRPB
TRPC TRPD TRPF TRPG
ACIFE HGDA HGDB YHGD
ACIGL ASPQ
ACIGU PRT1
ACISP CYMO
+--------+---------+---------+---------+---------+---------+---------+---------+
1 10 20 30 40 50 60 70 80
8 WE NEED YOUR HELP !
We welcome any feedback from our users. We would especially appreciate
information about any sequences belonging to your field of expertise which are
missing from the database. We also would like to be notified about annotations
which should be updated (e.g. if the function of a protein has been clarified,
or if new post-translational information has become available).
APPENDIX A
SOME STATISTICS
A.1 AMINO ACID COMPOSITION
Composition in percent for the complete database:
Ala (A) 7.67 Gln (Q) 4.09 Leu (L) 9.09 Ser (S) 7.09
Arg (R) 5.25 Glu (E) 6.29 Lys (K) 5.85 Thr (T) 5.86
Asn (N) 4.42 Gly (G) 7.14 Met (M) 2.31 Trp (W) 1.31
Asp (D) 5.23 His (H) 2.27 Phe (F) 3.95 Tyr (Y) 3.20
Cys (C) 1.83 Ile (I) 5.40 Pro (P) 5.11 Val (V) 6.48
Asx (B) 0.01 Glx (Z) 0.01 Xaa (X) 0.03
Classification of the amino acids by their frequency:
Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Arg, Asp, Pro, Asn, Gln,
Phe, Tyr, Met, His, Cys, Trp
A.2 DISTRIBUTION OF SEQUENCES BY SPECIES
Total number of species represented in this release of the database: 2492.
Species represented 1x: 1097
2x: 459
3x: 238
4x: 153
5x: 114
6x: 86
7x: 52
8x: 37
9x: 51
10x: 17
11- 20x: 97
21-100x: 72
>100x: 19
Table of the most common species:
Number Frequency Species
1 1550 Human
2 1326 Escherichia coli
3 886 Mouse
4 791 Rat
5 591 Baker's yeast (Saccharomyces cerevisiae)
6 422 Bovine
7 342 Fruit fly (Drosophila melanogaster)
8 311 Chicken
9 229 Rabbit
10 226 Bacillus subtilis
11 220 African clawed frog (Xenopus laevis)
12 205 Pig
13 189 Human cytomegalovirus (strain AD169)
14 168 Salmonella typhimurium
15 154 Bacteriophage T4
16 133 Maize
17 118 Rice
18 108 Tobacco
19 105 Vaccinia virus
20 95 Wheat
21 94 Pea
22 88 Staphylococcus aureus
23 86 Slime mold (Dictyostelium discoideum)
24 84 Liverwort (Marchantia polymorpha)
25 83 Sheep
26 81 Spinach
27 80 Barley
80 Soybean
29 70 Herpes simplex virus type 1 (strain 17)
70 Fission yeast (Schizosaccharomyces pombe)
A.3 DISTRIBUTION OF SEQUENCES BY LENGTH
From To Number From To Number
1- 50 1174 1001-1100 162
51- 100 2099 1101-1200 105
101- 150 3021 1201-1300 88
151- 200 1820 1301-1400 52
201- 250 1468 1401-1500 45
251- 300 1316 1501-1600 20
301- 350 1147 1601-1700 22
351- 400 1131 1701-1800 20
401- 450 867 1801-1900 22
451- 500 959 1901-2000 21
501- 550 718 2001-2100 9
551- 600 475 2101-2200 22
601- 650 336 2201-2300 24
651- 700 258 2301-2400 11
701- 750 243 2401-2500 11
751- 800 183 >2500 37
801- 850 144
851- 900 150
901- 950 95
951-1000 89
Currently the five largest sequences are:
RYNR$RABIT 5037 a.a.
APB$HUMAN 4563 a.a.
APOA$HUMAN 4548 a.a.
DMD$HUMAN 3685 a.a.
DMD$CHICK 3660 a.a.