SWISS-PROT RELEASE 29.0 RELEASE NOTES
1. INTRODUCTION
1.1 Evolution
Release 29.0 of SWISS-PROT contains 38303 sequence entries, comprising
13'464'008 amino acids abstracted from 36638 references. This represents
an increase of 7.7% over release 28. The recent growth of the data bank
is summarized below.
Release Date Number of entries Nb of amino acids
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 08/88 7724 2 224 465
9.0 11/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 07/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
15.0 08/90 16941 5 486 399
16.0 11/90 18364 5 986 949
17.0 02/91 20024 6 524 504
18.0 05/91 20772 6 792 034
19.0 08/91 21795 7 173 785
20.0 11/91 22654 7 500 130
21.0 03/92 23742 7 866 596
22.0 05/92 25044 8 375 696
23.0 08/92 26706 9 011 391
24.0 12/92 28154 9 545 427
25.0 04/93 29955 10 214 020
26.0 07/93 31808 10 875 091
27.0 10/93 33329 11 484 420
28.0 02/94 36000 12 496 420
29.0 06/94 38303 13 464 008
1.2 Source of data
Release 29.0 has been updated using protein sequence data from release
40.0 of the PIR (Protein Identification Resource) protein data bank, as
well as translation of nucleotide sequence data from release 38.0 of the
EMBL Nucleotide Sequence Database.
As an indication to the source of the sequence data in the SWISS-PROT
data bank we list here the statistics concerning the DR (Database cross-
references) pointer lines:
Entries with pointer(s) to only PIR entri(es): 4682
Entries with pointer(s) to only EMBL entri(es): 5191
Entries with pointer(s) to both EMBL and PIR entri(es): 27691
Entries with no pointers lines: 739
<PAGE>
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 28
2.1 Sequences and annotations
About 2320 sequences have been added since release 28, the sequence data
of 351 existing entries has been updated and the annotations of 7300
entries have been revised.
We are continuing the process to 'clean-up' the various representations
of domains in the feature lines (especially the usage of the feature
keys "DOMAIN", "REPEAT", "DNA_BIND", and "SITE"). We also have undertook
an overall revision of the CC topics "SUBCELLULAR LOCATION", "SUBUNIT
and "CAUTION".
2.2 What's happening with the model organisms
As we announced in the last two releases we have selected a number of
organisms that are the target of genome sequencing and/or mapping
projects and for which we intend to:
- Be as complete as possible. All sequences available at a given time
should be immediately included in SWISS-PROT. This also includes
sequence corrections and updates.
- Provide a high level of annotations.
- Cross-references to specialized database(s) that contain, among other
data, some genetic information about the genes that code for these
proteins.
- Provide specific indices or documents.
What was done since the last release or in preparation for the next
release:
- We have added Homo sapiens (human) as the sixth model organism (see
the next section for some additional information).
- In the next release we will add Bacillus subtilis as the seventh
organism. We will link SWISS-PROT to the SubtiList database currently
being designed by Ivan Moszer of the Pasteur Institute in Paris.
- The next release of LISTA will include accession numbers for each
gene entry; we will therefore be able to cross-reference SWISS-PROT
to LISTA.
Here is the current status of the model organisms:
Organism Database Index file Number of
cross-referenced sequences
-------------- -------------------------- -------------- ---------
B.subtilis SubtiList (in preparation) In preparation 563
C.elegans WormPep CELEGANS.TXT 679
D.discoideum DictyDB DICTY.TXT 198
D.melanogaster FlyBase In preparation 600
E.coli EcoGene ECOLI.TXT 2674
H.sapiens MIM MIMTOSP.TXT 2862
S.cerevisiae LISTA (in preparation) YEAST.TXT 1951
<PAGE>
2.3 Human genetic diseases
We have made an important effort in the implementation, in SWISS-PROT,
of data relevant to human genetic diseases. This effort has mainly dealt
with the following enhancements:
a) In sequence entries associated with one more genetic diseases, we
have updated and expanded the annotations characterizing those
diseases. These annotations are stored in the CC line topic
'DISEASE'.
Examples:
CC -!- DISEASE: DEFECTS IN GALNS ARE A CAUSE OF MUCOPOLYSACCHARIDOSIS
CC TYPE IVA (MPS IVA) (ALSO KNOWN AS MORQUIO A SYNDROME) WHICH IS
CC CHARACTERIZED BY SPECIFIC SPONDYLOEPIPHYSEAL DYSPLASIA, SHORT
CC TRUNK DWARFISM, COXA VALGA, ODONTOID HYPOPLASIA, CORNEAL
CC OPACITIES, PRESERVATION OF INTELLIGENCE, AND EXCESSIVE URINARY
CC EXCRETION OF KERATAN SULFATE AND CHONDROITIN-6-SULFATE.
CC -!- DISEASE: DEFECTS IN KRT9 ARE A CAUSE OF EPIDERMOLYTIC
CC PALMOPLANTAR KERATODERMA (EPPK), AN AUTOSOMAL DOMINANT DISEASE
CC CHARACTERIZED BY DIFFUSE THICKENING OF THE EPIDERMIS ON THE
CC ENTIRE SURFACE OF PALMS AND SOLES SHARPLY BORDERED WITH
CC ERYTHEMATOUS MARGINS.
b) We have entered in SWISS-PROT all the mutations linked with genetic
diseases or polymorphisms as long as they are not frameshift or
nonsense mutation. These mutations are described in the feature table
('VARIANT' key) and the relevant references have been added.
Partial example (from entry P07949 / KRET_HUMAN) describing the RET
protein which is linked with the diseases MEN2A, MEN2B, MTC and HSCR:
RN [7]
RP VARIANT MEN2B MET-929.
RM 94159102
RA HOFSTRA R.M.W., LANDSVATER R.M., CECCHERINI I., STULP R.P.,
RA STELWAGEN T., LUO Y., PASINI B., HOEPPENER J.W.M.,
RA VAN AMSTEL H.K.P., ROMEO G., LIPS C.J.M., BUYS C.H.C.M.;
RL NATURE 367:375-376(1994).
RN [8]
RP VARIANTS HSCR PRO-765; GLN-897 AND GLY-972.
RM 94159103
RA ROMEO G., RONCHETTO P., LUO Y., BARONE V., SERI M., CECCHERINI I.,
RA PASINI B., BOCCIARDI R., LERONE M., KAARLAINEN H., MARTUCCIELLO G.;
RL NATURE 367:377-378(1994).
FT VARIANT 765 765 S -> P (IN HSCR).
FT VARIANT 897 897 R -> Q (IN HSCR).
FT VARIANT 929 929 T -> M (IN MEN2B).
FT VARIANT 972 972 R -> G (IN HSCR).
<PAGE>
c) A new CC topic 'POLYMORPHISM' has been implemented. Examples of its
use:
CC -!- POLYMORPHISM: THE ALLELIC FORM OF THE ENZYME WITH GLN-191
CC HYDROLYZES PARAOXON WITH A LOW TURNOVER NUMBER AND THE ONE WITH
CC ARG-191 WITH A HIGH TURNOVER NUMBER.
CC -!- POLYMORPHISM: OVER 80 VARIANTS OF HUMAN DBP HAVE BEEN
CC IDENTIFIED. THE THREE MOST COMMON ALLELES ARE CALLED GC1F,
CC GC1S, AND GC2. THE SEQUENCE SHOWN IS THAT OF THE GC2 ALLELE.
d) New keywords have been introduced:
- 'DISEASE MUTATION' is used for sequences in which there is at least
one known disease-inducing mutation.
- 'POLYMORPHISM' is used in each entry where "neutral" variants have
been found (at the level of the protein sequence).
- 'CHROMOSOMAL TRANSLOCATION' is used to indicate proteins whose gene
are known to be involved in chromosomal translocations.
- Keywords have been implemented for genetic diseases linked with more
than a single gene/protein. These keywords are:
ALBINISM
ALZHEIMER'S DISEASE
AMYOTROPHIC LATERAL SCLEROSIS
ATHEROSCLEROSIS
AUTOIMMUNE ENCEPHALOMYELITIS
AUTOIMMUNE UVEITIS
BERNARD SOULIER SYNDROME
CHARCOT-MARIE-TOOTH DISEASE
CHRONIC GRANULOMATOUS DISEASE
COCKAYNE'S SYNDROME
DEJERINE-SOTTAS SYNDROME
DIABETES
DOWN'S SYNDROME
DWARFISM
ELLIPTOCYTOSIS
EMPHYSEMA
GAUCHER DISEASE
GLYCOGEN STORAGE DISEASE
GM2-GANGLIOSIDOSIS
GOUT
HEMOPHILIA
HEREDITARY HEMOLYTIC ANEMIA
HYPERLIPOPROTEINEMIA
LEBER'S HEREDITARY OPTIC NEUROPATHY
MAPLE SYRUP URINE DISEASE
METACHROMATIC LEUCODYSTROPHY
MUCOPOLYSACCHARIDOSIS
PHENYLKETONURIA
PSEUDOHERMAPHRODITISM
RETINITIS PIGMENTOSA
SCID
SYSTEMIC LUPUS ERYTHEMATOSUS
THROMBOPHILIA
<PAGE>
VON WILLEBRAND DISEASE
XERODERMA PIGMENTOSUM
e) The GDB list of genes has been used to update the GN (gene name) line
of many SWISS-PROT entries.
2.4 Changes in the DR line
We have added cross-references from SWISS-PROT to two additional
databases:
- The G-protein--coupled receptor database (GCRDb) prepared by Lee
Frank Kolakowski at the Massachusetts General Hospital Renal Unit.
Reference: Kolakowski L.F. Jr.; Receptors Channels In press(1994).
- The Maize Genome Database (MaizeDB) developed by the USDA-ARS Maize
Genome Project as part of the National Agricultural Library's Plant
Genome Research Program.
These cross-references are present in the DR lines:
Data bank identifier: GCRDB
Primary identifier : Unique identifier (accession number) of the
entry
Secondary identifier: None; a dash '-' is stored in that field
Example : DR GCRDB; GCR_0087; -.
Data bank identifier: MAIZEDB
Primary identifier : 'Gene-product' accession ID
Secondary identifier: None; a dash '-' is stored in that field
Example : DR MAIZEDB; 25342; -.
We have removed from SWISS-PROT cross-references to TFD (the
Transcription Factor Database). The main reason for this decision is
that the information stored in the 'polypeptide' table of TFD does not
expand on the data present in the corresponding SWISS-PROT entry.
2.5 Status of the documentation files
SWISS-PROT is distributed with a large number of documentation files.
Some of these files have been available for a long time (the user
manual, release notes, the various indices for authors, citations,
keywords, etc.), but many have been created recently and we are
continuously adding new files. The following table list all the
documents that are currently available or that will be added in the next
few months.
USERMAN .TXT User manual
RELNOTES.TXT Release notes
SHORTDES.TXT Short description of entries in SWISS-PROT
<PAGE>
JOURLIST.TXT List of abbreviations for journals cited
KEYWLIST.TXT List of keywords in use
SPECLIST.TXT List of organism identification codes
EXPERTS .TXT List of on-line experts for PROSITE and SWISS-PROT [1, 3]
ACINDEX .TXT Accession number index
AUTINDEX.TXT Author index
CITINDEX.TXT Citation index
KEYINDEX.TXT Keyword index
SPEINDEX.TXT Species index
7TMRLIST.TXT List of 7-transmembrane G-linked receptors entries
CDLIST .TXT CD nomenclature for surface proteins of human leucocytes
CELEGANS.TXT Index of Caenorhabditis elegans entries and their
corresponding gene designations and WormPep cross-
references
DICTY .TXT Index of Dictyostelium discoideum entries and their
corresponding gene designations and DictyDB cross-
references
EC2DTOSP.TXT Index of Escherichia coli Gene-protein database entries
referenced in SWISS-PROT
ECOLI .TXT Index of Escherichia coli K12 chromosomal entries and
their corresponding EcoGene cross-reference [4]
EMBLTOSP.TXT Index of EMBL Database entries referenced in SWISS-PROT
GLYCOSYL.TXT Index of glycosyl hydrolases classified by families on
the basis of sequence similarities [2]
HOXLIST .TXT Vertebrate homeobox proteins: nomenclature and index
MIMTOSP .TXT Index of MIM entries referenced in SWISS-PROT
NOMLIST .TXT List of nomenclature related references for proteins [1]
PDBTOSP .TXT Index of Brookhaven PDB entries referenced in SWISS-PROT
PLASTID .TXT List of chloroplast and cyanelle encoded proteins
RESTRIC .TXT List of restriction enzymes and methylases entries [1]
RIBOSOMP.TXT Index of ribosomal proteins classified by families on the
basis of sequence similarities [2]
YEAST .TXT Index of Saccharomyces cerevisiae entries and their
corresponding gene designations
YEAST11 .TXT Yeast Chromosome XI entries [1]
Notes:
[1] New in release 29.
[2] Will be available starting with release 30 in October 1994.
[3] The list of on-line experts used to be an appendix of the release
notes. We now provide it as a separate document.
[4] The format of this file was slightly modified in this release; we
added a field that indicates if the 3D structure of an E.coli
protein is available in PDB.
<PAGE>
2.6 The Expasy World-Wide Web server
2.6.1 Background information
The World-Wide Web (WWW), which originated at CERN, is a powerful global
information system merging networked information retrieval and
hypertext. It gives access, using hypertext links, to the documents and
information contained in all the existing WWW servers around the world,
as well as to the data obtainable through other information retrieval
systems like WAIS, Gopher, X500, etc. To access a WWW server, one has to
run on a local computer a client program (a WWW browser), which displays
hypertext documents. The user can then either request a keyword search
or jump to another document by following a hypertext link. WWW has the
outstanding advantage of extending the hypertext model to the whole
world (by allowing hypertext jumps to documents anywhere on the internet
network) and by being device and user-interface independent (browsers
exist for a variety of computers and user-interfaces, including Unix
workstations running XWindows, MacIntoshes and PCs with Microsoft
Windows).
The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT, PROSITE, SWISS-2DPAGE and SWISS-3DIMAGE
databases and, through any SWISS-PROT protein sequence entry, to other
databases such as EMBL, PROSITE, REBASE, FlyBase, GCRDb, MaizeDB, OMIM,
PDB and Medline. Using a browser which is able to display images one can
also remotely access 2D gels image data from SWISS-2DPAGE.
A WWW server can be accessed on the internet through its Uniform
Resource Locator (URL), the addressing system defined by the WWW model.
The URL for the ExPASy WWW server is:
http://expasy.hcuge.ch/
or
http://129.195.254.61/
To access a WWW server, you need to run a browser (or client) program on
your local computer. Browsers exist for a variety of machines and may be
obtained by anonymous ftp. ExPASy can be used with any WWW browser, but
we recommend NCSA Mosaic. It is a very flexible and powerful browser
with a graphical user interface; available for Unix boxes using
X11/Motif; for Apple McIntoshes and for Microsoft Windows. You can get
it from the FTP site: ftp.ncsa.uiuc.edu.
To access all the data available from SWISS-2DPAGE, the user's local
computer needs to run an image viewing program. For most browsers on
Unix workstations the default program is xv, a shareware application.
Similar Windows or Apple shareware or public domain applications are
also available.
For more information on the ExPASy WWW server, you can read the
following article:
Appel R.D., Bairoch A., Hochstrasser D.F.
A new generation of information retrieval tools for biologists: the
example of the ExPASy WWW server.
Trends Biochem. Sci. 19:258-260(1994).
<PAGE>
Or you can contact Dr. Ron Appel:
Email: appel@cih.hcuge.ch
Fax: +41-22-372 61 98
2.6.2 Changes to the WWW ExPASy server
There has been quite a number of changes to the server in the last three
months. We want to list specifically the following enhancements:
- A direct entry point to PROSITE has been implemented. It is possible
to search in PROSITE by description (title of the entry), entry name,
accession number, author name and by performing a full text search.
- It is now possible to retrieve either the EMBL or GenBank version of
a cross-referenced nucleotide sequence entry.
- Active cross-references are now provided to GCRDb and MaizeDB (see
section 2.4 above).
- New SWISS-PROT documents such as RESTRIC.TXT or YEAST11.TXT (see
section 2.5 above) are available as hypertext documents.
2.7 Weekly updates of SWISS-PROT
Since release 24, we provide weekly updates of SWISS-PROT.
The weekly updates are available by anonymous FTP. Three files are
updated at each update:
new_seq.dat Contains all the new entries since the last full release.
upd_seq.dat Contains the entries for which the sequence data has been
updated since the last release.
upd_ann.dat Contains the entries for which one or more annotation
fields have been updated since the last release.
Currently these files are available on the following anonymous ftp
servers:
Organization ExPASy (Geneva University Expert Protein Analysis System)
Address expasy.hcuge.ch (or 129.195.254.61)
Directory /databases/swiss-prot/updates
Organization National Center for Biotechnology Information (NCBI)
Address ncbi.nlm.nih.gov (or 130.14.20.1)
Directory /repository/swiss-prot/updates
Organization EMBL ftp server
Address ftp.embl-heidelberg.de (or 192.54.41.33)
Directory /pub/databases/swissprot/new
!! Important notes !!!
Although we try to follow a regular schedule, we do not promise to
update these files every week. In some cases two weeks will elapse in-
between two updates.
<PAGE>
Due to the current mechanism used to build a release the entries that
are provided in these updates are not guaranteed to be error free. Also,
for the same reason, new entries do not contain an OC (Organism
Classification) line.
3. ENZYME AND PROSITE
3.1 The ENZYME data bank
Release 16.0 of the ENZYME data bank is distributed with release 29 of
SWISS-PROT. ENZYME release 16.0 contains information relative to 3546
enzymes. For the first time we have integrated information directly sent
to us by the Enzyme nomenclature subcommitee of the NCB-IUBMB.
3.2 The PROSITE data bank
3.2.1 Statistics for release 12
Release 12.0 of the PROSITE data bank is distributed with release 29 of
SWISS-PROT. Release 12 contains 785 documentation chapters that
describes 1029 different patterns, rules and profiles/matrices. Since
the last major release of PROSITE (release 11.0 of October 1993), 71 new
chapters have been added and 338 chapters have been updated.
Out of a total of 38303 entries in SWISS-PROT, 18786 are cross-
referenced in PROSITE (excluding the false positives). This tally for
exactly 49% of the sequences in SWISS-PROT.
The next release of PROSITE (12.1) will be distributed with release 30
of SWISS-PROT.
3.2.2 List of the new entries in release 12
Ly-6 / u-PAR domain signature
Nuclear transition protein 2 signatures
Ribosomal protein L19 signature
Ribosomal protein L20 signature
Ribosomal protein L35 signature
Ribosomal protein L1e signature
Ribosomal protein S2 signatures
Ribosomal protein S7e signature
Ribosomal protein S21e signature
Ribosomal protein S28e signature
DnaA protein signature
NAD-dependent glycerol-3-phosphate dehydrogenase signature
FAD-dependent glycerol-3-phosphate dehydrogenase signatures
Mannitol dehydrogenases signature
Coproporphyrinogen III oxidase signature
Bacterial-type phytoene dehydrogenase signature
Ergosterol biosynthesis ERG4/ERG24 family signatures
Transaldolase active site
<PAGE>
Myristoyl-CoA:protein N-myristoyltransferase signatures
PTS EIIB domains cysteine phosphorylation site signature
Eukaryotic RNA polymerases 15 Kd subunits signature
Protein phosphatase 2A regulatory subunit PR55 signatures
Protein phosphatase 2C signature
Glycosyl hydrolases family 16 signature
Glycosyl hydrolases family 25 active sites signature
Glycosyl hydrolases family 39 putative active site
Ubiquitin carboxyl-terminal hydrolases family 2 signatures
Glycoprotease family signature
Dehydroquinase class I active site
Dehydroquinase class II signature
Imidazoleglycerol-phosphate dehydratase signatures
Cysteine synthase/cystathionine beta-synthase P-phosphate
Glyoxalase I signatures
6-pyruvoyl tetrahydropterin synthase signatures
Phosphomannose isomerase type I signatures
Folylpolyglutamate synthase signatures
Transposases, Mutator family, signature
OHHL biosynthesis luxI family signature
Succinate dehydrogenase cytochrome b subunit signatures
Globins profile
PTR2 family proton/oligopeptide symporters signatures
glpT family of transporters signature
Bacterial formate and nitrite transporters signatures
Fungal hydrophobins signature
G-protein coupled receptors family 3 signatures
Antenna complexes alpha and beta subunits signatures
Photosystem I psaG and psaK proteins signature
ER lumen protein retaining receptor signatures
Neuromedin U signature
Urotensin II signature
Neutrophil bactenecins signatures
Gamma-thionins family signature
Streptomyces subtilisin-type inhibitors signature
Heat shock hsp20 proteins family profile
Bacterial export FHIPEP family signature
Cytochrome c oxidase assembly factor COX10/ctaB/cyoE signature
Cyclin-dependent kinases regulatory subunits signatures
ADP-ribosylation factors family signature
SAR1 family signature
Initiation factor 3 signature
Transcription termination factor nusG signature
BTG1 family signature
G10 protein signatures
Clathrin adaptor complexes medium chain signatures
Clathrin adaptor complexes small chain signature
Extracellular proteins SCP/Tpx-1/Ag5/PR-1/Sc7 signatures
Oxysterol-binding protein family signature
Serum amyloid A proteins signature
Spermadhesins family signatures
Syndecans signature
Translationally controlled tumor protein signatures
<PAGE>
3.2.3 Status of profiles in PROSITE
There are a number of protein families as well as functional or
structural domains that cannot be detected using patterns due to their
extreme sequence divergence. Typical examples of important functional
domains which are weakly conserved are the immunoglobulin domains, the
SH2 and SH3 domains, or the fibronectin type III domain. In such domains
there are only a few sequence positions which are well conserved. Any
attempt of building a consensus pattern for such regions will either
fail to pick up a significant proportion of the protein sequences that
contain such region (false negative) or will pick up too many proteins
that do not contain the region (false positive). The use of technique
based on weight matrices or profiles allows the detection of such
proteins or domains. Philipp Bucher, Kay Hofmann at ISREC in Lausanne
and myself are collaborating to include such methods into PROSITE.
This is the first release of PROSITE to include weight matrices (also
known as profiles). In this release only two profiles entries are
available (for the hsp20 family of small chaperones and for globins). We
plan to add many new profiles for the next major release (release 13) as
well as to replace some of the existing pattern entries by profiles.
None of the many academic or commercial programs which has been
developed to scan PROSITE can currently make use of the profile entries.
We are therefore distributing, with PROSITE, the source code (C
language) of two programs that should help software developers to
implement profile-specific routines in their application(s):
scan4prf Loads a sequence from a file and scans it with all (or one) of
the PROSITE profiles.
srch4prf Loads a profile from a file and scans for that profile in a
SWISS-PROT data base file.
These programs will soon be available in the respective /prosite
directory of the NCBI and Expasy anonymous FTP servers (for the
addresses, see section 2.7).
Important notice for software developers: the integration of profiles
into PROSITE did not "break" the current format. The profiles entries in
the PROFILE.DAT file are tagged with the token "MATRIX" on the "ID"
line; a new line-type "MA" is used in these entries to store all the
weight matrices specific parameters. The full description of the format
of the "MA" line-type is available as a user's manual (file name:
PROFILE.TXT) that is part of the PROSITE distribution files. The format
of the PROSITE.DOC file has not be changed.
3.2.4 Author index file
Starting with this release, we distribute a file that contains an index
of the authors (and on-line experts) referenced in the PROSITE.DOC file.
The name of this file is 'PAUTINDX.TXT'.
<PAGE>
WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate that
you notify us if you find that sequences belonging to your field of
expertise are missing from the data bank. We also would like to be
notified about annotations to be updated, if, for example, the function
of a protein has been clarified or if new post-translational information
has become available.
<PAGE>
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.60 Gln (Q) 4.03 Leu (L) 9.21 Ser (S) 7.15
Arg (R) 5.23 Glu (E) 6.27 Lys (K) 5.83 Thr (T) 5.82
Asn (N) 4.47 Gly (G) 6.97 Met (M) 2.36 Trp (W) 1.29
Asp (D) 5.28 His (H) 2.25 Phe (F) 4.00 Tyr (Y) 3.21
Cys (C) 1.78 Ile (I) 5.58 Pro (P) 5.02 Val (V) 6.52
Asx (B) 0.005 Glx (Z) 0.005 Xaa (X) 0.02
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
Phe, Tyr, Met, His, Cys, Trp
A.2 Repartition of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 4471
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 2010
2x: 735
3x: 402
4x: 255
5x: 182
6x: 196
7x: 102
8x: 79
9x: 88
10x: 45
11- 20x: 176
21- 50x: 121
51-100x: 37
>100x: 43
<PAGE>
A.2.2 Table of the most represented species
Number Frequency Species
1 2862 Human
2 2674 Escherichia coli
3 1951 Baker's yeast (Saccharomyces cerevisiae)
4 1697 Mouse
5 1565 Rat
6 710 Bovine
7 679 Caenorhabditis elegans
8 633 Fruit fly (Drosophila melanogaster)
9 563 Bacillus subtilis
10 542 Chicken
11 410 African clawed frog (Xenopus laevis)
12 394 Salmonella typhimurium
13 387 Rabbit
14 339 Pig
15 251 Vaccinia virus (strain Copenhagen)
16 239 Maize
17 229 Arabidopsis thaliana (Mouse-ear cress)
18 221 Fission yeast (Schizosaccharomyces pombe)
19 200 Bacteriophage T4
20 198 Slime mold (Dictyostelium discoideum)
21 197 Rice
22 193 Human cytomegalovirus (strain AD169)
23 185 Pseudomonas aeruginosa
24 183 Vaccinia virus (strain WR)
25 180 Tobacco
26 174 Pea
27 168 Wheat
28 158 Barley
29 146 Variola virus
30 142 Dog
31 139 Sheep
32 137 Soybean
33 134 Staphylococcus aureus
34 131 Spinach
35 127 Pseudomonas putida
36 124 Neurospora crassa
37 122 Marchantia polymorpha (Liverwort)
38 121 Rhodobacter capsulatus
39 119 Klebsiella pneumoniae
40 111 Agrobacterium tumefaciens
41 108 Bacillus stearothermophilus
42 104 Tomato
43 101 Rhizobium meliloti
<PAGE>
A.3 Repartition of the sequences by size
From To Number From To Number
1- 50 2159 1001-1100 363
51- 100 3748 1101-1200 257
101- 150 5182 1201-1300 196
151- 200 3711 1301-1400 115
201- 250 3240 1401-1500 113
251- 300 2860 1501-1600 54
301- 350 2687 1601-1700 55
351- 400 2771 1701-1800 45
401- 450 2065 1801-1900 51
451- 500 2192 1901-2000 37
501- 550 1560 2001-2100 19
551- 600 1089 2101-2200 48
601- 650 776 2201-2300 53
651- 700 580 2301-2400 19
701- 750 558 2401-2500 25
751- 800 425 >2500 119
801- 850 328
851- 900 353
901- 950 229
951-1000 221
Currently the ten longest sequences are:
HTS1_COCCA 5217 a.a.
FAT_DROME 5147 a.a.
RYNR_RABIT 5037 a.a.
RYNR_HUMAN 5032 a.a.
RYNC_RABIT 4969 a.a.
DYHC_DICDI 4725 a.a.
APB_HUMAN 4563 a.a.
APOA_HUMAN 4548 a.a.
RRPA_CVMJH 4488 a.a.
DYHC_TRIGR 4466 a.a.
<PAGE>
APPENDIX B: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES
The current status of the relationships (cross-references) between some
biomolecular databases is shown in the following schematic:
**********************
*********************** * EPD [Euk. Promot.] *
* EMBL Nucleotide * <----> **********************
* Sequence Data *
****************** * Library * **********************
* FLYBASE * <--> *********************** <----- * ECD [E. coli map] *
* [Drosophila * ^ ^ ^ **********************
* genomic d.b.] * <---------+ | | |
****************** | | | +------------ **********************
| | | * TFD [Trans. fact.] *
| | | **********************
****************** | | |
* MaizeDb * <------+ | | | **********************
****************** | | | +--------------> * GCRDb [7TM recep.] *
| | | | **********************
****************** | | | |
* WormPep * | | | | **********************
* [C.elegans] * <----+ | | | | +------> * DictyDB [D.disco.] *
****************** | | | | | | **********************
| | | | | |
****************** | v v v v v **********************
* REBASE * *********************** * ENZYME [Nomencl.] *
* [Restriction * <--- * SWISS-PROT * <----- **********************
* enzymes] * * Protein Sequence * |
****************** * Data Bank * v
*********************** **********************
****************** ^ ^ | | ^ ^ | * OMIM [Diseases] *
* EcoGene/EcoSeq * | | | | | | +--------> **********************
* [E. coli] * <-----+ | | | | |
****************** | | | | +-----------> **********************
| | | | * ECO2DBASE [2D] *
| | | | **********************
****************** | | | |
* PROSITE * <--------+ | | +---------------> **********************
* [Patterns] * | | * SWISS-2DPAGE [2D] *
****************** | +---------------+ **********************
| v |
| *********************** | **********************
+--------> * PDB [3D structures] * +--> * Aarhus/Ghent [2D] *
*********************** **********************
<PAGE>