                  SWISS-PROT RELEASE 37.0 RELEASE NOTES

!! Important: do not forget to read section 10 of these release notes. It
contains an important announcement relevant to SWISS-PROT and PROSITE !!



                           1.  INTRODUCTION


Release 37.0  of SWISS-PROT  contains 77'977 sequence entries, comprising
28'268'293 amino acids abstracted from 62'513 references. This represents
an increase  of 5.3%  over release  36. The  growth of  the data  bank is
summarized below.

 Release      Date           Number of       Number of amino
                               entries                 acids
    2.0       09/86               3939               900 163
    3.0       11/86               4160               969 641
    4.0       04/87               4387             1 036 010
    5.0       09/87               5205             1 327 683
    6.0       01/88               6102             1 653 982
    7.0       04/88               6821             1 885 771
    8.0       08/88               7724             2 224 465
    9.0       11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
   31.0       02/95              43470            15 335 248
   32.0       11/95              49340            17 385 503
   33.0       02/96              52205            18 531 384
   34.0       10/96              59021            21 210 389
   35.0       11/97              69113            25 083 768
   36.0       07/98              74019            26 840 295
   37.0       12/98              77977            28 268 293
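
The growth figure quoted above can be checked against the last two rows
of the table. The following is only a small arithmetic sketch; the
variable and function names are illustrative and not part of any
SWISS-PROT tool.

```python
# Entry counts copied from the last two rows of the table above.
entries_rel36 = 74019
entries_rel37 = 77977


def percent_increase(old: int, new: int) -> float:
    """Return the relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


growth = percent_increase(entries_rel36, entries_rel37)
print(f"Net growth in entries: {growth:.1f}%")
```

The same calculation applied to the amino acid counts (26 840 295 and
28 268 293) also rounds to 5.3%.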



     2.  DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 36


2.1  Sequences and annotations

3'988 sequences  have been  added since  release 36, the sequence data of
667 existing  entries has  been updated  and the  annotations  of  12'047
entries have been revised.


2.2  What's happening with the model organisms

We have  selected a  number of  organisms that  are the  target of genome
sequencing and/or mapping projects and for which we intend to:

o  Be as  complete as possible.  All sequences  available at a given time
   should  be  immediately  included  in  SWISS-PROT.  This also includes
   sequence corrections and updates;
o  Provide a higher level of annotation;
o  Provide  cross-references  to  specialized  database(s) that  contain,
   among other  data,  some genetic information about the genes that code
   for these proteins;
o  Provide specific indices or documents.

Here is the current status of the model organisms in SWISS-PROT:

 Organism        Database            Index file       Number of
                 cross-referenced                     sequences
 --------------  ----------------    --------------   ---------
 A.thaliana      None yet            In preparation         792
 B.subtilis      SubtiList           SUBTILIS.TXT          2046
 C.albicans      None yet            CALBICAN.TXT           194
 C.elegans       Wormpep             CELEGANS.TXT          1956
 D.discoideum    DictyDB             DICTY.TXT              285
 D.melanogaster  FlyBase             FLY.TXT               1064
 E.coli          EcoGene             ECOLI.TXT             4476
 H.influenzae    HiDB (TIGR)         HAEINFLU.TXT          1701
 H.sapiens       MIM                 MIMTOSP.TXT           5146
 H.pylori        HpDB (TIGR)         HPYLORI.TXT            367
 M.genitalium    MgDB (TIGR)         MGENITAL.TXT           470
 M.musculus      MGD                 MGDTOSP.TXT           3387
 M.jannaschii    MjDB (TIGR)         MJANNASC.TXT          1307
 M.tuberculosis  None yet            None yet               918
 S.cerevisiae    SGD                 YEAST.TXT             4806
 S.typhimurium   StyGene             SALTY.TXT              723
 S.pombe         None yet            POMBE.TXT             1406
 S.solfataricus  None yet            None yet                84

We  plan  to  finish  as  quickly  as  possible  the  annotation  of  the
Escherichia coli,  Haemophilus influenzae,  Methanococcus jannaschii  and
yeast (S.cerevisiae)  sequence entries  which are  not yet part of SWISS-
PROT.


2.3  Switch to the NCBI taxonomy

To contribute  to the standardization of the taxonomies used in molecular
sequence databases  we have changed our taxonomy with release 37. We have
switched  to   the  NCBI   taxonomy,  which   is  already   used  by  the
DDBJ/EMBL/GenBank   nucleotide    sequence   databases.   The   taxonomic
classification maintained  at the  NCBI  is  available  from  the  server
http://www.ncbi.nlm.nih.gov/Taxonomy.

This modification affects the OC (Organism Classification) lines. However,
it has no impact on the format of that line-type, only on its content.
For example, the OC lines for Homo sapiens (human) used to be:

OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.

and are now:

OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; MAMMALIA; EUTHERIA;
OC   PRIMATES; CATARRHINI; HOMINIDAE; HOMO.

The switch to the new taxonomy indirectly brings along additional
changes. Most of these changes are subtle, yet they may have an impact on
some users and on some specific uses of SWISS-PROT. We describe some of
these changes here.

The NCBI taxonomy is much more detailed than that formerly used by SWISS-
PROT. The  number of  nodes listed in the OC lines is therefore generally
larger. For example, the taxonomic lineage for Pisum sativum (garden pea)
used to be:

OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   FABALES; FABACEAE.

It is now:

OC   EUKARYOTA; VIRIDIPLANTAE; STREPTOPHYTA; EMBRYOPHYTA; TRACHEOPHYTA;
OC   EUPHYLLOPHYTES; SPERMATOPHYTA; MAGNOLIOPHYTA; EUDICOTYLEDONS;
OC   ROSIDAE; FABALES; FABACEAE; PAPILIONOIDEAE; PISUM.

The names  of the  taxonomic kingdoms  at the  root of the NCBI taxonomic
tree differ from the old SWISS-PROT taxonomy in the following manner:

        NCBI        Old SWISS-PROT
        ----------  --------------
        Archaea     Archaebacteria
        Bacteria    Prokaryota
        Eukaryota   Eukaryota
        Viruses     Viridae

This is important for users selecting a subset of the database based on a
particular taxonomic kingdom.
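
For programs that select entries by kingdom, the renaming above could be
handled along the following lines. This is a minimal sketch, assuming
that the kingdom is always the first node of the OC lineage (as in the
examples shown in this section); the function names are hypothetical.

```python
# Old SWISS-PROT kingdom names mapped to the NCBI names, as listed in the
# table above (upper-cased, as they appear in OC lines).
OLD_TO_NCBI = {
    "ARCHAEBACTERIA": "ARCHAEA",
    "PROKARYOTA": "BACTERIA",
    "EUKARYOTA": "EUKARYOTA",
    "VIRIDAE": "VIRUSES",
}


def kingdom_from_oc(oc_lines):
    """Return the first lineage node from a list of OC lines.

    Assumption: the first node of the lineage is the kingdom.
    """
    # Drop the two-letter line code, then join the continuation lines.
    lineage = " ".join(line[2:].strip() for line in oc_lines)
    return lineage.split(";")[0].strip().rstrip(".").upper()


def normalize_kingdom(name):
    """Translate an old kingdom name to its NCBI equivalent if needed."""
    return OLD_TO_NCBI.get(name.upper(), name.upper())


human = [
    "OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; MAMMALIA; EUTHERIA;",
    "OC   PRIMATES; CATARRHINI; HOMINIDAE; HOMO.",
]
print(normalize_kingdom(kingdom_from_oc(human)))
```

Normalizing both old and new names to the NCBI form lets the same
selection code run on entries from releases before and after the switch.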

We  also   changed  the   names  of   the  corresponding   files  in  the
special_selection section  of the anonymous FTP server (see section 7.1).
The files:

archaebacteria.seq.xxxxxx
eukaryota.seq.xxxxxx
prokaryota.seq.xxxxxx
viridae.seq.xxxxxx

(where 'xxxxxx' is the date the file was created) have been renamed:

archaea.seq.xxxxxx
eukaryota.seq.xxxxxx
bacteria.seq.xxxxxx
viruses.seq.xxxxxx

The format and content of the 'speclist.txt' documentation file (see
section 4) have changed. It no longer contains the section that used to
list the taxonomic nodes, as this would now be too cumbersome to include
in such a document. The SWISS-PROT taxonomic node code is replaced by the
NCBI taxonomy ID (TaxID). As the NCBI code does not convey any
information per se on which taxonomic kingdom a species belongs to, each
organism code is now followed by a letter that indicates that kingdom. It
can be one of the following:

'A' for archaea (=archaebacteria);
'B' for bacteria (=prokaryota or eubacteria);
'E' for eukaryota;
'V' for viruses and phages (=viridae).

Example:

DROME E 007227: N=Drosophila melanogaster
                C=Fruit fly

On the ExPASy WWW version (http://www.expasy.ch/cgi-bin/speclist) of this
document, the  NCBI TaxID  is an active link to the NCBI server, querying
the Taxonomic database on the lineage of the selected organism.
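
A species list entry of the form shown in the example above could be
read as follows. This is a sketch based only on that one example: it
assumes that the first line carries the organism code, the kingdom
letter and the TaxID, and that continuation lines are indented and hold
one 'X=value' field each. The function name is hypothetical.

```python
import re

# First line: organism code, kingdom letter, TaxID, then the first field.
HEAD = re.compile(
    r"^(?P<code>\S+)\s+(?P<kingdom>[ABEV])\s+(?P<taxid>\d+):\s+"
    r"(?P<key>[A-Z])=(?P<value>.*)$"
)
# Continuation lines: indentation, then one more 'X=value' field.
CONT = re.compile(r"^\s+(?P<key>[A-Z])=(?P<value>.*)$")

KINGDOMS = {
    "A": "archaea",
    "B": "bacteria",
    "E": "eukaryota",
    "V": "viruses and phages",
}


def parse_species_entry(lines):
    """Parse the lines of one species list entry into a dictionary."""
    head = HEAD.match(lines[0])
    if head is None:
        raise ValueError("not the first line of a species entry")
    entry = {
        "code": head["code"],
        "kingdom": KINGDOMS[head["kingdom"]],
        "taxid": int(head["taxid"]),  # leading zeros are dropped
        "names": {head["key"]: head["value"].strip()},
    }
    for line in lines[1:]:
        cont = CONT.match(line)
        if cont:
            entry["names"][cont["key"]] = cont["value"].strip()
    return entry


entry = parse_species_entry([
    "DROME E 007227: N=Drosophila melanogaster",
    "                C=Fruit fly",
])
print(entry)
```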

While in the process of mapping the old SWISS-PROT taxonomy to that of
NCBI, we corrected more than 100 misspellings in species names. We also
updated many names to newer and more appropriate designations (but kept
the previous names as synonyms).


2.4  Introduction of the Reference Title (RT) line-type

In release  37 we  have introduced  a new  line type,  the RT  (Reference
Title) line. This optional line is placed between the RA and RL lines. The
RT line  gives the title of the paper (or other work) cited as exactly as
possible given  the limitations of the computer character set. The format
of the RT line is:

RT   "TITLE";

An example of the use of RT lines is shown below:

RT   "Sequence analysis of the genome of the unicellular cyanobacterium
RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb
RT   region from map positions 64% to 92% of the genome.";

It should be noted that:

o The form  used is  that which  would be  used in a citation rather than
  that displayed  at the  top of the published paper. For instance, where
  journals capitalize major title words this is not preserved;
o The text of a title ends  with either a period '.', a question mark '?'
  or an exclamation mark '!';
o Double  quotation  marks '"' are  not present in the text of the title;
  they are replaced by single quotation marks;
o Titles of articles published in a language other than English have been
  translated into English;
o Greek letters are spelled out (alpha, beta, etc.).
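
The format rules above translate into a short reader for RT lines. This
is a minimal sketch, not the code used to build the release: the
function name is hypothetical, and joining continuation lines with a
single space is an assumption that holds for the example shown above.

```python
def parse_rt(rt_lines):
    """Join RT continuation lines and return the bare title text."""
    # Drop the two-letter line code and the padding that follows it.
    text = " ".join(line[2:].strip() for line in rt_lines)
    # The whole block must have the form "TITLE";
    if not (text.startswith('"') and text.endswith('";')):
        raise ValueError('RT block is not of the form "TITLE";')
    title = text[1:-2]
    # Rules listed above: terminal punctuation, no double quotes inside.
    if title[-1:] not in (".", "?", "!"):
        raise ValueError("title must end with '.', '?' or '!'")
    if '"' in title:
        raise ValueError("double quotation marks are not allowed in titles")
    return title


example = [
    'RT   "Sequence analysis of the genome of the unicellular cyanobacterium',
    "RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb",
    'RT   region from map positions 64% to 92% of the genome.";',
]
print(parse_rt(example))
```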

The RT lines were introduced in journal, book and patent references as
well as in some other types of references (Plant Gene Register, Worm
Breeder's Gazette). They have not yet been systematically introduced for
unpublished submissions. The RT lines were introduced using the following
sources of information:

o  For all references linked to Medline, the titles were automatically
   extracted from the relevant Medline abstracts;
o  The EMBL DNA sequence database was then automatically scanned to
   retrieve additional titles;
o  We then searched for the remaining missing titles in a variety of
   on-line resources:
     o  The LITDB bibliographic database from the Protein Research
        Foundation in Japan;
     o  The AGRICOLA bibliographic database from NAL;
     o  The Web sites of various journals;
     o  The Korean journals abstract database;
     o  The PDB 3D-structure database;
     o  The MIM database;
     o  The Plant Gene Register;
     o  The NCBI Entrez protein search tool;
     o  The European Patent Office patent database;
o  About 200 titles were typed in by going to various libraries in Geneva
   to find the relevant papers;
o  Finally, some authors, editors or publishers were contacted by email.
   We want to thank all those who responded and sent us titles that
   would otherwise have been very difficult to find.

Currently, out of more than 62'000 references, we lack the title for
fewer than 50 (this corresponds to a coverage of more than 99.9%).

The RT  line has been introduced in mixed-case, instead of the ALL UPPER-
CASE format used elsewhere in SWISS-PROT. As you will see in section 3.1,
we plan to gradually convert all of SWISS-PROT to mixed-case.


2.5  Changes affecting the accession numbers

With the  creation of  the TrEMBL  database (see section 6) and the rapid
increase in  the amount  of sequence data, we are faced with a problem of
availability of  accession numbers.  Currently we use a system based on a
one-letter prefix  followed by 5 digits. This system was also used by the
nucleotide sequence databases, which had originally reserved for SWISS-
PROT the prefix letters 'P' and 'Q'. The nucleotide databases, having run
out of space (due mainly to ESTs), have been forced to start using a new
format based on a two-letter prefix followed by 6 digits.

We have used up all possible numbers with 'P' and 'Q', and the only letter
prefix which was not used by the nucleotide databases is 'O'. As we
believe that changing the format of the accession numbers to that now
used by the nucleotide databases would wreak havoc on the numerous
software packages using SWISS-PROT, we have decided to keep a system of
accession numbers based on a six-character code, but with the following
changes:

o We  have  started  using  'O'.  This  extra  letter  should  allow  the
  continuation of  the present  format (1  prefix letter  + 5 digits) for
  approximately one year.
o When we have finished using up 'O', we will introduce a system
  based on the following format:

    1        2       3          4            5            6
    [O,P,Q]  [0-9]  [A-Z, 0-9]  [A-Z, 0-9]   [A-Z, 0-9]   [0-9]

What the above means is that we will keep a six-character code, but that
in positions 3, 4 and 5 of this code any combination of letters and
digits can be present. This format allows a total of about 14 million
accession numbers (3 x 10 x 36 x 36 x 36 x 10 = 13'996'800), up from
300'000 with the current system.

We only allow digits in positions 2 and 6 so that the SWISS-PROT
accession numbers cannot be mistaken for gene names, acronyms, other
types of accession numbers or any kind of word!

Examples: P0A3S2, Q2ASD4, O13YX2, P9B123
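
Software that validates accession numbers may want to accept the
extended format ahead of time. The sketch below encodes the position
table above as a regular expression; the names are illustrative only.
Accession numbers in the current format (one letter followed by five
digits) also match, since digits are allowed in positions 3 to 5.

```python
import re

# Position 1: O, P or Q; position 2: digit; positions 3-5: letter or
# digit; position 6: digit (see the position table above).
ACCESSION = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(ac):
    """Return True if ac matches the six-character accession format."""
    return ACCESSION.fullmatch(ac) is not None


# Number of distinct codes: 3 prefixes x 10 x 36^3 x 10.
capacity_new = 3 * 10 * 36**3 * 10
# Current system: 3 prefixes x 5 digits.
capacity_old = 3 * 10**5

for ac in ["P0A3S2", "Q2ASD4", "O13YX2", "P9B123", "P14236"]:
    print(ac, is_valid_accession(ac))
print(capacity_new, capacity_old)
```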


2.6  Changes concerning the reference location line (RL)

The (IN)  prefix is  mainly used  for book  citations. We  have  slightly
changed the  format of  these book  citations so  that the  format is now
similar to  that used  by the  EMBL nucleotide sequence database. The new
format is:

RL   (IN) EDITOR_1 I.[, EDITOR_2 I., EDITOR_X I.] (EDS.);
RL   BOOK-NAME, PP.[VOL:]FIRST-LAST, PUBLISHER, CITY (YEAR).

So, what was before:

RL   (IN) TRENDS IN QSAR AND MOLECULAR MODELING 92, WERMUTH C.G., ED.,
RL   PP.485-486, ESCOM, LEIDEN, (1993).

is now:

RL   (IN) WERMUTH C.G. (EDS.);
RL   TRENDS IN QSAR AND MOLECULAR MODELLING 92, PP.485-486, ESCOM
RL   SCIENCE PUBLISHERS, LEIDEN (1993).

Since release 36, the (IN) prefix has also been used for citations to the
electronic Plant Gene Register. In release 37 it can additionally be used
for references to the Worm Breeder's Gazette (see
http://elegans.swmed.edu/wli/). Example:

RL   (IN) WORM BREEDER'S GAZETTE 15(3):34(1998).


2.7  Cleaning up of the SIMILARITY comment line (CC) topic

We are continuing a major overhaul of the SIMILARITY topic. We would like
the majority  of the  information stored  in this  topic to  be usable by
computer  programs   (while  being   human-readable).  We  are  therefore
standardizing the format of this topic using two different subformats,
one to describe the family to which a protein belongs:

CC   -!-  SIMILARITY: BELONGS TO THE <Name1> FAMILY [OF <Name2>].
CC        [<Name3> SUBFAMILY.]

Examples:

CC   -!-  SIMILARITY: BELONGS TO THE 14-3-3 FAMILY.
CC   -!-  SIMILARITY: BELONGS TO THE 6-PHOSPHOGLUCONATE DEHYDROGENASE
CC        FAMILY.
CC   -!-  SIMILARITY: BELONGS TO THE AAA FAMILY OF ATPASES.
CC   -!-  SIMILARITY: BELONGS TO THE IRON/ASCORBATE-DEPENDENT FAMILY OF
CC        OXIDOREDUCTASES.
CC   -!-  SIMILARITY: BELONGS TO THE ANTP FAMILY OF HOMEOBOX PROTEINS.
CC        "DEFORMED" SUBFAMILY.
CC   -!-  SIMILARITY: BELONGS TO THE KINESIN-LIKE PROTEIN FAMILY. KINESIN
CC        SUBFAMILY.

And one to describe which domains are found in a given protein:

CC   -!-  SIMILARITY: CONTAINS n <Name> [DOMAIN|REPEAT][S].

Examples:

CC   -!-  SIMILARITY: CONTAINS 1 FHA DOMAIN.
CC   -!-  SIMILARITY: CONTAINS 45 EGF-LIKE DOMAINS.
CC   -!-  SIMILARITY: CONTAINS 2 SH3 DOMAINS.
CC   -!-  SIMILARITY: CONTAINS 2 SUSHI (SCR) REPEATS.

We have  already updated  many entries  in this and the previous releases
and plan to complete this change for the next release.
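
As an illustration of how the two subformats can be used by programs,
the following sketch parses the text of a SIMILARITY comment once its
lines have been joined and the 'SIMILARITY:' prefix removed. The regular
expressions are derived from the templates above and cover only those
two forms; the function name and the dictionary keys are hypothetical.

```python
import re

# BELONGS TO THE <Name1> FAMILY [OF <Name2>]. [<Name3> SUBFAMILY.]
FAMILY = re.compile(
    r"^BELONGS TO THE (?P<family>.+?) FAMILY"
    r"(?: OF (?P<group>.+?))?\."
    r"(?: (?P<subfamily>.+?) SUBFAMILY\.)?$"
)
# CONTAINS n <Name> [DOMAIN|REPEAT][S].
DOMAIN = re.compile(
    r"^CONTAINS (?P<count>\d+) (?P<name>.+?) (?P<kind>DOMAIN|REPEAT)S?\.$"
)


def parse_similarity(text):
    """Parse a joined SIMILARITY comment; return None if not standardized."""
    text = " ".join(text.split())  # collapse runs of whitespace
    m = FAMILY.match(text)
    if m:
        return {"type": "family", **m.groupdict()}
    m = DOMAIN.match(text)
    if m:
        return {
            "type": m["kind"].lower(),
            "name": m["name"],
            "count": int(m["count"]),
        }
    return None


print(parse_similarity('BELONGS TO THE ANTP FAMILY OF HOMEOBOX PROTEINS. '
                       '"DEFORMED" SUBFAMILY.'))
print(parse_similarity("CONTAINS 2 SUSHI (SCR) REPEATS."))
```

Comments that have not yet been converted to one of the two subformats
return None and can be handled separately.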


2.8  Changes concerning cross-references (DR line)

We have added cross-references from SWISS-PROT to the Pfam protein domain
database  (see   http://www.sanger.ac.uk/Pfam/;  reference:  Bateman  A.,
Birney E., Durbin R., Eddy S.R., Finn R.D. and Sonnhammer E.L.L.; Nucleic
Acids Res.  27:260-262(1999)). These  cross-references are present in the
DR lines. The specific format for cross-references to the Pfam database
is almost identical to that used for the PROSITE database:

DR   PFAM; ACCESSION_NUMBER; ENTRY_NAME; STATUS.

Where 'ACCESSION_NUMBER' stands for the accession number of the Pfam HMM-
profile entry; 'ENTRY_NAME' is the name of the entry and 'STATUS' is
either 'n' or 'PARTIAL'. 'n' is the number of hits of the profile in
that particular protein sequence. The 'PARTIAL' status indicates that the
profile did not detect the sequence because that sequence is not complete
and lacks the region on which the profile is based. The difference
between the cross-references to Pfam and those to PROSITE is that the
PROSITE DR lines make use of two additional 'STATUS' values: 'FALSE_NEG'
and 'UNKNOWN'.

Examples of Pfam cross-references:

DR   PFAM; PF00017; SH2; 1.
DR   PFAM; PF00008; EGF; 8.
DR   PFAM; PF00595; PDZ; PARTIAL.
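
A Pfam cross-reference line of the form described above could be read as
follows. This is a minimal sketch with a hypothetical function name; it
handles only the two 'STATUS' values defined for Pfam and rejects DR
lines that point to other databases.

```python
def parse_pfam_dr(line):
    """Parse 'DR   PFAM; ACCESSION_NUMBER; ENTRY_NAME; STATUS.'"""
    if not line.startswith("DR"):
        raise ValueError("not a DR line")
    # Drop the line code and the final period, then split on ';'.
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 4 or fields[0] != "PFAM":
        raise ValueError("not a Pfam cross-reference")
    _, accession, name, status = fields
    partial = status == "PARTIAL"
    return {
        "accession": accession,
        "name": name,
        "hits": None if partial else int(status),  # 'n' in the format
        "partial": partial,
    }


for dr in ["DR   PFAM; PF00017; SH2; 1.",
           "DR   PFAM; PF00008; EGF; 8.",
           "DR   PFAM; PF00595; PDZ; PARTIAL."]:
    print(parse_pfam_dr(dr))
```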


In this  release, we  have also  updated all the DR lines pointing to the
HSSP, Mendel and TRANSFAC databases.




                        3. PLANNED CHANGES


3.1  Conversion of SWISS-PROT to mixed-case characters

We are  happy to  announce that we will gradually start the conversion of
SWISS-PROT entries from all 'UPPER CASE' to 'MiXeD CaSe'. The first line-
type that  follows the  new format  is the  newly introduced RT line (see
section 2.4). In release 38 we are planning to convert the following line
types:

                        DT, OS, OG, OC, RL and KW

Further lines will be converted in release 39, and this process will
probably be completed by January 1, 2000. We can't enter the third
millennium with a carry-over from the time of punched tapes and
teletypes!

Here is  an example  of what a SWISS-PROT entry will look like in release
38:

ID   PETG_CYAPA     STANDARD;      PRT;    37 AA.
AC   P14236;
DT   01-JAN-1990 (Rel. 13, Created)
DT   01-JAN-1990 (Rel. 13, Last sequence update)
DT   01-NOV-1997 (Rel. 35, Last annotation update)
DE   CYTOCHROME B6-F COMPLEX SUBUNIT 5.
GN   PETG.
OS   Cyanophora paradoxa.
OG   Cyanelle.
OC   Eukaryota; Glaucocystophyceae; Cyanophoraceae; Cyanophora.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=LB555 / PRINGSHEIM;
RX   MEDLINE; 90098772.
RA   STIREWALT V.L., BRYANT D.A.;
RT   "Molecular cloning and nucleotide sequence of the petG gene of the
RT   cyanelle genome of Cyanophora paradoxa.";
RL   Nucleic Acids Res. 17:10095-10095(1989).
RN   [2]
RP   SEQUENCE FROM N.A.
RC   STRAIN=LB555 / PRINGSHEIM;
RA   STIREWALT V.L., MICHALOWSKI C.B., LUFFELHARDT W., BOHNERT H.J.,
RA   BRYANT D.A.;
RL   Submitted (JUL-1995) to the EMBL/GenBank/DDBJ databases.
CC   -!- FUNCTION: THE CYTOCHROME B6-F COMPLEX FUNCTIONS IN THE LINEAR
CC       CROSS-MEMBRANE TRANSPORT OF ELECTRONS BETWEEN PHOTOSYSTEM II AND
CC       I, AS WELL AS IN CYCLIC ELECTRON FLOW AROUND PHOTOSYSTEM I.
CC       PETG IS REQUIRED FOR EITHER THE STABILITY OR ASSEMBLY OF THE
CC       CYTOCHROME B6-F COMPLEX.
CC   -!- SUBCELLULAR LOCATION: THYLAKOID MEMBRANE-ASSOCIATED.
CC   -!- SIMILARITY: BELONGS TO THE PETG FAMILY.
DR   EMBL; X16974; G12549; -.
DR   EMBL; U30821; G1016164; -.
DR   PIR; S06916; S06916.
DR   MENDEL; 7879; CYApa;petG;1.
KW   Electron transport; Respiratory chain; Cyanelle;
KW   Thylakoid membrane; Transmembrane.
FT   DOMAIN        1      4       LUMENAL (POTENTIAL).
FT   TRANSMEM      5     25       POTENTIAL.
FT   DOMAIN       26     37       STROMAL (POTENTIAL).
SQ   SEQUENCE   37 AA;  4139 MW;  265A8973 CRC32;
     MVEPLLSGIV LGLIPVTLIG LFVAAYLQYR RGNQFEF
//
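
Programs that compare text in SWISS-PROT entries case-sensitively may
need adapting before release 38. The sketch below groups the lines of an
entry by line code and reports which line types contain lower-case
characters. It assumes the layout visible in the example above (a
two-letter code followed by three spaces, sequence data lines starting
with spaces, '//' as terminator); the function names are hypothetical
and the sample is a shortened copy of the entry above.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of one entry by their two-letter line code."""
    groups = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):        # end-of-entry marker
            break
        if line.startswith("     "):     # sequence data carries no code
            groups["sequence"].append(line.replace(" ", ""))
        elif len(line) > 5:
            groups[line[:2]].append(line[5:].rstrip())
    return dict(groups)


def is_mixed_case(text):
    """Return True if the text contains at least one lower-case letter."""
    return any(char.islower() for char in text)


sample = """\
ID   PETG_CYAPA     STANDARD;      PRT;    37 AA.
DE   CYTOCHROME B6-F COMPLEX SUBUNIT 5.
OS   Cyanophora paradoxa.
OG   Cyanelle.
RL   Nucleic Acids Res. 17:10095-10095(1989).
SQ   SEQUENCE   37 AA;  4139 MW;  265A8973 CRC32;
     MVEPLLSGIV LGLIPVTLIG LFVAAYLQYR RGNQFEF
//
"""

groups = group_by_line_code(sample)
mixed = sorted(code for code, lines in groups.items()
               if code != "sequence" and any(map(is_mixed_case, lines)))
print(mixed)
print(len("".join(groups["sequence"])))
```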


3.2  Extension of the accession number system

As already explained in detail in section 2.5, we will extend the
accession number system once we have used up the 'O' series of accession
numbers. We expect this to happen in early 1999.


3.3  Introduction of a new CC line-type topic: MISCELLANEOUS

We will introduce in the next release a new 'topic' for the comments (CC)
line-type:  'MISCELLANEOUS'.  This  topic  will  be used for all comments
which  do not  belong to any other already defined topic. What this means
is that,  starting with release 38, all comment lines will be assigned to
a topic. Example:

CC   -!- BINDS TO BACITRACIN.

will become:

CC   -!- MISCELLANEOUS: BINDS TO BACITRACIN.
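
Software that must handle both old and new entries could apply the same
rule when reading comments. The sketch below is only an illustration:
it assumes that a topic is a run of upper-case words ending in a colon
directly after the '-!-' marker, and it does not re-wrap lines that
become too long. The function name is hypothetical.

```python
import re

# Assumed shape of a topic: upper-case words followed by a colon,
# e.g. 'FUNCTION:' or 'SUBCELLULAR LOCATION:'.
TOPIC = re.compile(r"^CC   -!-\s+[A-Z][A-Z ]*:")
PREFIX = "CC   -!- "


def add_default_topic(cc_line):
    """Prefix a topic-less first comment line with MISCELLANEOUS."""
    if not cc_line.startswith(PREFIX) or TOPIC.match(cc_line):
        # Continuation line, or a topic is already present.
        return cc_line
    return PREFIX + "MISCELLANEOUS: " + cc_line[len(PREFIX):]


print(add_default_topic("CC   -!- BINDS TO BACITRACIN."))
print(add_default_topic("CC   -!- SIMILARITY: BELONGS TO THE PETG FAMILY."))
```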


3.4  Introduction  of   a  unique   identifier  in  the  VARIANT  feature
     description of human sequence entries

We plan  to introduce  in release  38 a unique identifier for all VARIANT
feature keys  in human  sequence entries.  This change  is the first step
toward providing  a unique  identifier to  all SWISS-PROT features. Human
sequence  variants   were  chosen   as  a   prototype  for  this  planned
improvement. It  will be  possible, as  soon as  these identifiers become
available, to  directly link  specific sequence  variants to the relevant
entries in  disease mutation  databases  as  well  as  to  provide  these
databases with a method to implement reciprocal links.

The unique identifier will be of the form '/FTId=VAR_nnnnnn' and will
be added as the last part of the description field of a 'VARIANT' feature
key. Examples:

FT   VARIANT       6      6       E -> V (IN S; SICKLE CELL ANEMIA).
FT   VARIANT      11     11       V -> D (IN WINDSOR; O2 AFFINITY UP;
FT                                UNSTABLE).

will become:

FT   VARIANT       6      6       E -> V (IN S; SICKLE CELL ANEMIA);
FT                                /FTId=VAR_000001.
FT   VARIANT      11     11       V -> D (IN WINDSOR; O2 AFFINITY UP;
FT                                UNSTABLE); /FTId=VAR_000234.
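
Once the identifiers are present, they can be separated from the rest of
the description as sketched below. The code works on a description whose
continuation lines have already been joined with single spaces; the
function name is hypothetical, and the pattern assumes the six-digit
form shown in the examples above.

```python
import re

# The identifier is the last part of the description, before the period.
FTID = re.compile(r"/FTId=(VAR_\d{6})\.?$")


def split_ftid(description):
    """Split a joined VARIANT description into (text, identifier)."""
    match = FTID.search(description)
    if match is None:
        return description, None          # old-style description
    text = description[:match.start()].rstrip().rstrip(";")
    return text, match.group(1)


print(split_ftid("E -> V (IN S; SICKLE CELL ANEMIA); /FTId=VAR_000001."))
print(split_ftid("E -> V (IN S; SICKLE CELL ANEMIA)."))
```

Descriptions without an identifier are returned unchanged, so the same
code can read entries from releases before and after the change.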


3.5  Small change  in the  format of  RL lines for submissions to the DNA
     databases

Along with the conversion of the RL line to mixed-case (see 3.1), we will
also make a small change to the format of RL lines for submissions to the
DNA databases. What is now:

RL   SUBMITTED (MMM-YEAR) TO EMBL/GENBANK/DDBJ DATA BANKS.

will be changed to:

RL   Submitted (MMM-YEAR) to the EMBL/GenBank/DDBJ databases.

This change brings the format closer to that used by the EMBL
nucleotide sequence database.
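
Programs that recognize submission references by matching the RL line
will need to accept both forms. The following sketch rewrites the
current form into the planned one; the function name is hypothetical,
and the date is assumed to keep its upper-case 'MMM-YEAR' form, as in
the sample entry of section 3.1.

```python
import re

# Current (release 37) form of the RL line for database submissions.
OLD_RL = re.compile(
    r"^RL   SUBMITTED \((?P<date>[A-Z]{3}-\d{4})\) "
    r"TO EMBL/GENBANK/DDBJ DATA BANKS\.$"
)


def convert_submission_rl(line):
    """Rewrite an old-style submission RL line; leave others untouched."""
    match = OLD_RL.match(line)
    if match is None:
        return line
    return f"RL   Submitted ({match['date']}) to the EMBL/GenBank/DDBJ databases."


print(convert_submission_rl(
    "RL   SUBMITTED (JUL-1995) TO EMBL/GENBANK/DDBJ DATA BANKS."))
```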




                  4. STATUS OF THE DOCUMENTATION FILES


SWISS-PROT is  distributed with  a large  number of  documentation files.
Some of these files have been available for a long time (the user manual,
release notes,  the various  indices for  authors,  citations,  keywords,
etc.), but many have been created recently and we are continuously adding
new files. The following table lists all the documents that are currently
available.

 USERMAN.TXT    User manual
 RELNOTES.TXT   Release notes for current release (37)
 OLDRLNOT.TXT   Release notes for previous release (36)
 SHORTDES.TXT   Short description of entries in SWISS-PROT
 JOURLIST.TXT   List of abbreviations for journals cited [see 1]
 KEYWLIST.TXT   List of keywords in use [see 2]
 SPECLIST.TXT   List of organism identification codes [see 3]
 TISSLIST.TXT   List of tissues
 EXPERTS.TXT    List of on-line experts for PROSITE and SWISS-PROT
 SUBMIT.TXT     Submission of sequence data to SWISS-PROT

 ACINDEX.TXT    Accession number index
 AUTINDEX.TXT   Author index
 CITINDEX.TXT   Citation index
 KEYINDEX.TXT   Keyword index
 SPEINDEX.TXT   Species index
 DELETEAC.TXT   Deleted accession number index

 7TMRLIST.TXT   List of 7-transmembrane G-linked receptor entries
 AATRNASY.TXT   List of aminoacyl-tRNA synthetases
 ALLERGEN.TXT   Nomenclature and index of allergen sequences
 BLOODGRP.TXT   List of blood group antigen proteins
 CALBICAN.TXT   Index   of  Candida  albicans  entries   and  their
                corresponding gene designations
 CDLIST.TXT     CD  nomenclature  for  surface  proteins  of  human
                leucocytes
 CELEGANS.TXT   Index  of Caenorhabditis elegans entries  and their
                corresponding gene Wormpep cross-references
 DICTY.TXT      Index   of  Dictyostelium  discoideum  entries  and
                their  corresponding gene designations  and DictyDB
                cross-references
 EC2DTOSP.TXT   Index  of  Escherichia coli  Gene-protein  database
                entries referenced in SWISS-PROT
 ECOLI.TXT      Index  of Escherichia coli K12  chromosomal entries
                and their corresponding EcoGene cross-references
 EMBLTOSP.TXT   Index  of   EMBL  Database  entries  referenced  in
                SWISS-PROT
 EXTRADOM.TXT   Nomenclature of extracellular domains
 FLY.TXT        Index  of  Drosophila  entries and  FlyBase  cross-
                references
 GLYCOSID.TXT   Classification  of glycosyl hydrolase  families and
                index of glycosyl hydrolase entries
 HAEINFLU.TXT   Index  of  Haemophilus  influenzae  RD  chromosomal
                entries
 HOXLIST.TXT    Vertebrate  homeotic Hox proteins: nomenclature and
                index
 HPYLORI.TXT    Index   of   Helicobacter   pylori   strain   26695
                chromosomal entries
 HUMCHR17.TXT   Index of protein  sequence entries encoded on human
                chromosome 17
 HUMCHR18.TXT   Index of protein  sequence entries encoded on human
                chromosome 18
 HUMCHR19.TXT   Index of protein  sequence entries encoded on human
                chromosome 19
 HUMCHR20.TXT   Index of protein  sequence entries encoded on human
                chromosome 20
 HUMCHR21.TXT   Index of protein  sequence entries encoded on human
                chromosome 21
 HUMCHR22.TXT   Index of protein  sequence entries encoded on human
                chromosome 22
 HUMCHRX.TXT    Index of protein  sequence entries encoded on human
                chromosome X
 HUMCHRY.TXT    Index of protein  sequence entries encoded on human
                chromosome Y
 HUMPVAR.TXT    Index of human proteins with sequence variants
 INITFACT.TXT   List and index of translation initiation factors
 MIMTOSP.TXT    Index of MIM entries referenced in SWISS-PROT
 METALLO.TXT    Classification  of  metallothioneins and  index  of
                entries in SWISS-PROT
 MGDTOSP.TXT    Index of MGD entries referenced in SWISS-PROT
 MGENITAL.TXT   Index  of Mycoplasma genitalium chromosomal entries
 MJANNASC.TXT   Index of Methanococcus jannaschii entries
 NGR234.TXT     Table  of   putative  genes  in  Rhizobium  plasmid
                pNGR234a
 NOMLIST.TXT    List   of  nomenclature   related  references   for
                proteins
 PCC6803.TXT    Index of Synechocystis strain PCC 6803 entries
 PDBTOSP.TXT    Index  of X-ray  crystallography Protein Data  Bank
                (PDB) entries referenced in SWISS-PROT
 PEPTIDAS.TXT   Classification  of peptidase families and  index of
                peptidase entries
 PLASTID.TXT    List of chloroplast and cyanelle encoded proteins
 POMBE.TXT      Index   of  Schizosaccharomyces  pombe  entries  in
                SWISS-PROT    and    their    corresponding    gene
                designations
 RESTRIC.TXT    List of restriction enzyme and methylase entries
 RIBOSOMP.TXT   Index of  ribosomal proteins classified by families
                on the basis of sequence similarities
 SALTY.TXT      Index  of  Salmonella typhimurium  LT2  chromosomal
                entries  and  their  corresponding  StyGene  cross-
                references
 SUBTILIS.TXT   Index of  Bacillus subtilis 168 chromosomal entries
                and their corresponding SubtiList cross-references
 UPFLIST.TXT    UPF  (Uncharacterized  Protein Families)  list  and
                index of members
 YEAST.TXT      Index   of  Saccharomyces  cerevisiae  entries  and
                their corresponding gene designations
 YEAST1.TXT     Yeast Chromosome I entries
 YEAST2.TXT     Yeast Chromosome II entries
 YEAST3.TXT     Yeast Chromosome III entries
 YEAST5.TXT     Yeast Chromosome V entries
 YEAST6.TXT     Yeast Chromosome VI entries
 YEAST7.TXT     Yeast Chromosome VII entries
 YEAST8.TXT     Yeast Chromosome VIII entries
 YEAST9.TXT     Yeast Chromosome IX entries
 YEAST10.TXT    Yeast Chromosome X entries
 YEAST11.TXT    Yeast Chromosome XI entries
 YEAST13.TXT    Yeast Chromosome XIII entries
 YEAST14.TXT    Yeast Chromosome XIV entries

Notes:

[1]  The journal list ('jourlist.txt') has been extensively updated. This
     document now  lists for  each journal  the name  of  its  publisher.
     Journal subtitles,  when they  are available,  have also been added.
     This file  can now  be considered as a mini-database on life science
     journals. It  lists 1073  journals and  contains more  than 800  Web
     links. Example of an entry in the journal list:

     Abbrev: Allergy
     Title : Allergy
             [European Journal of Allergy and Clinical Immunology]
     ISSN  : 0105-4538
     CODEN : LLRGDY
     Publis: Munksgaard
     Note  : Replaces Acta Allergol., starts with vol. 33 in 1978.
     Server: http://www.munksgaard.dk/allergy/

[2]  The keyword  list ('keywlist.txt') has been converted to  mixed-case
     characters.
[3]  The species  list ('speclist.txt') has been extensively  updated due
     to the  switch  to  the NCBI taxonomy (see section 2.3); it also has
     been converted to mixed-case characters.

We have continued to include in some SWISS-PROT document files
references to Web sites relevant to the subject under consideration.
There are now 40 documents that include such links.




                  5. THE EXPASY WORLD-WIDE WEB SERVER


5.1  Background information

The most  efficient and  user-friendly way  to  browse  interactively  in
SWISS-PROT, PROSITE,  ENZYME, SWISS-2DPAGE  and other databases is to use
the World-Wide  Web (WWW)  molecular biology  server ExPASy.  The  ExPASy
server was  made available  to  the  public  in  September  1993  and  is
reachable at the following address:

                          http://www.expasy.ch/

The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE
and CD40Lbase databases and, through any SWISS-PROT protein sequence
entry, to other databases such as EMBL, Eco2DBASE, EcoCyc, FlyBase,
GCRDb, MaizeDB, Mendel, OMIM, PDB, HSSP, Pfam, ProDom, REBASE, SGD,
SubtiList/NRSub, TRANSFAC, YPD and Medline. ExPASy also offers many tools
for the analysis of protein sequences and 2D gels.


5.2  Swiss-Shop

We provide,  on ExPASy,  a service  called Swiss-Shop.  Swiss-Shop is  an
automated sequence  alerting system  which allows  users  to  obtain,  by
email, new  sequence entries  relevant to  their  field(s)  of  interest.
Various criteria can be combined:

o    By entering  one or  more  words  that  should  be  present  in  the
     description line;
o    By entering one or more species name(s) or taxonomic division(s);
o    By entering one or more keywords;
o    By entering one or more author names;
o    By entering  the accession  number (or  entry  name)  of  a  PROSITE
     pattern or a user-defined sequence pattern;
o    By entering  the accession  number (or  entry name)  of an  existing
     SWISS-PROT entry or by entering a private sequence.

Every week,  the new  sequences entered  in SWISS-PROT  are automatically
compared with  all the criteria that have been defined by the users. If a
sequence corresponds  to the  selection criteria  defined by a user, that
sequence is sent by electronic mail.


5.3  What is new on ExPASy

ExPASy is constantly modified and improved. If you wish to be informed of
the changes made to the server you can either:

o    Read the  document History of changes, improvements and new features
     which is available at the address:
  
                 http://www.expasy.ch/www/history.html

o    Subscribe to  Swiss-Flash, a service that reports news of databases,
     software and  services developments. By subscribing to this service,
     you will automatically get Swiss-Flash bulletins by electronic mail.
     To subscribe use the address:
     
              http://www.expasy.ch/www/swiss-flash.html

Among all  the improvements  and the  new features  introduced during the
last  six   months,  there  are  at  least  three  that  we  believe  are
specifically useful to SWISS-PROT users:

o NiceProt is a tool that provides a user-friendly tabular view of SWISS-
  PROT entries. The 'NiceProt View of SWISS-PROT' is accessible from the
  top and bottom of each SWISS-PROT entry on ExPASy. You can use this
  tool to link to any SWISS-PROT entry by using the following style of
  URL: http://www.expasy.ch/cgi-bin/niceprot.pl?P01585 (where the last
  part of the URL is a valid primary accession number).

o The  SWISS-PROT/TrEMBL  full  text  search  tool has been improved. The
  databases are now  indexed  using  the Glimpse search engine, wildcards
  can be used in query strings, more fields (line types) are  indexed and
  response times are much shorter than before. See:

             http://www.expasy.ch/cgi-bin/sprot-search-ful

o Users who wish to save and retrieve all SWISS-PROT entries  originating
  from  a species can do this via the SWISS-PROT 'speclist.txt' document.
  By clicking on any of the species codes and specifying a file name, one
  can save  all corresponding  entries to  a file  that can be  retrieved
  from the anonymous ExPASy FTP server.
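
The NiceProt URL style described above can be generated programmatically.
The following is a minimal sketch, not an official ExPASy client: the
function name 'niceprot_url' is hypothetical, and the accession number
pattern (one upper-case letter followed by five digits) is an assumption
inferred from the accession numbers quoted in these notes.

```python
import re

# Base address taken from the NiceProt example URL in these release notes.
NICEPROT_BASE = "http://www.expasy.ch/cgi-bin/niceprot.pl"

# Assumption: a primary accession number is one upper-case letter followed
# by five digits (e.g. P01585). Adjust the pattern if your data differs.
_ACCESSION = re.compile(r"[A-Z][0-9]{5}")


def niceprot_url(accession: str) -> str:
    """Return the NiceProt URL for a primary accession number."""
    acc = accession.strip().upper()
    if not _ACCESSION.fullmatch(acc):
        raise ValueError(f"not a recognised accession number: {accession!r}")
    return f"{NICEPROT_BASE}?{acc}"


if __name__ == "__main__":
    print(niceprot_url("P01585"))
```

Validating the accession number before building the URL avoids producing
links that cannot resolve to an entry.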




                  6. TREMBL - A SUPPLEMENT TO SWISS-PROT


The ongoing  genome sequencing  and mapping  projects  have  dramatically
increased the  number of protein sequences to be incorporated into SWISS-
PROT. Since  we do not want to dilute the quality standards of SWISS-PROT
by incorporating  sequences into  the database  without  proper  sequence
analysis and  annotation, we  cannot speed  up the  incorporation of  new
incoming data  indefinitely. But  as we  also want  to make the sequences
available as  fast as  possible, we  have introduced  with  SWISS-PROT  a
computer annotated  supplement. This  supplement consists  of entries  in
SWISS-PROT-like  format  derived  from  the  translation  of  all  coding
sequences (CDS)  in the  EMBL nucleotide  sequence database, except those
already included in SWISS-PROT.

We name this supplement TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of SWISS-PROT. This SWISS-PROT
release is supplemented by TrEMBL release 8. TrEMBL is split into two
main sections, SP-TrEMBL and REM-TrEMBL:

SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (180'763 in release 8)
which  should  be  incorporated  into  SWISS-PROT.  SWISS-PROT  accession
numbers have been assigned for all SP-TrEMBL entries.

REM-TrEMBL (REMaining TrEMBL) contains the entries (43'780 in release 8)
that we do not want to include in SWISS-PROT for a variety of reasons
(synthetic sequences, pseudogenes, translations of incorrect open reading
frames, fragments with fewer than eight amino acids, patent-derived
sequences, immunoglobulins and T-cell receptors, etc.).

TrEMBL is available by FTP from the EBI and ExPASy servers in the
directory 'databases/trembl'. It can be queried on WWW by the EBI and
ExPASy SRS servers. It is also searchable on the FASTA, BIC-SW and BLAST
servers of the EBI.




                7.  FTP ANONYMOUS ACCESS TO SWISS-PROT


7.1  Generalities

SWISS-PROT is  available for  download on  the  following  anonymous  FTP
servers:

Organization   Swiss Institute of Bioinformatics (SIB)
Address        ftp.expasy.ch
Directory      /databases/swiss-prot/

Organization   European Bioinformatics Institute (EBI)
Address        ftp.ebi.ac.uk
Directory      /pub/databases/swissprot/

We have  reorganized the  directory on the ExPASy FTP server where SWISS-
PROT is stored. The new organization is shown below.

+--swiss-prot-+
              |
              |--release             The files for the current release of
              |                      SWISS-PROT
              |
              |--release_compressed  The files of the compressed version
              |                      (*.Z) of the current release of SWISS-
              |                      PROT
              |
              |--special_selections  Files storing SWISS-PROT entries either
              |                      from a specific taxonomic subset or
              |                      linked to a specific database
              |
              |--sw_old_releases     The compressed 'tar' (archive) files
              |                      of previous releases of SWISS-PROT
              |
              |--updates             The files of the cumulative weekly
              |                      updates
              |
              +--updates_compressed  The files of the compressed version
                                     (*.Z) of the cumulative weekly updates


The main differences from the previous release are:

o The  SWISS-PROT  release  files  are  now  in  a  subdirectory  (swiss-
  prot/release) instead of the main directory which is now devoid of data
  files.
o A  new   subdirectory  (swiss-prot/sw_old_releases)   was  created.  It
  contains Unix  compressed 'tar' (archive) files of previous releases of
  SWISS-PROT.  Each   release  is   stored  in   a  file  with  the  name
  sprotNN.tar.Z where  NN is a release number. Such a file stores all the
  documentation (*.txt)  files and  the data  file (sprotNN.dat)  of  the
  corresponding SWISS-PROT  release. The  release notes  are renamed from
  release.txt to  release.NN. We  have decided  to provide these files to
  answer two  kinds of  requests. The  main one originates from users who
  want to  compare sequence analysis algorithms by benchmarking them on a
  specific release  of the  database so  as to compare their results with
  those of a competing program. The second type of request originates
  from legal departments of biotech companies that often want to be able
  to check the state of knowledge on a particular sequence at a given
  point in time.
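
The naming scheme for archived releases can be expressed as a small
helper. This is only a sketch: the function name 'old_release_files' is
hypothetical, and we assume that NN is written without zero-padding,
which these notes do not specify.

```python
# Directory on ftp.expasy.ch as described in section 7.1.
OLD_RELEASES_DIR = "/databases/swiss-prot/sw_old_releases"


def old_release_files(release: int) -> dict:
    """Return the file names associated with an archived release NN.

    Assumption: NN is the plain release number with no zero-padding.
    """
    if release < 1:
        raise ValueError("release number must be positive")
    nn = str(release)
    return {
        # Unix compressed 'tar' archive holding the whole release.
        "archive": f"{OLD_RELEASES_DIR}/sprot{nn}.tar.Z",
        # Data file stored inside the archive.
        "data": f"sprot{nn}.dat",
        # Release notes, renamed from release.txt inside the archive.
        "notes": f"release.{nn}",
    }


if __name__ == "__main__":
    for kind, name in old_release_files(36).items():
        print(f"{kind:8s} {name}")
```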


7.2  Weekly updates of SWISS-PROT

Weekly updates  of SWISS-PROT are available by anonymous FTP. Three files
are generated at each update:

new_seq.dat    Contains all the new entries since the last full release;
upd_seq.dat    Contains the  entries for which the sequence data has been
               updated since the last release;
upd_ann.dat    Contains the  entries for  which one  or  more  annotation
               fields have been updated since the last release.
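
One way to bring a local copy of a release up to date with these three
files is sketched below. It is an illustration under stated assumptions
rather than a supported procedure: the function names are hypothetical,
the files are assumed to be already uncompressed and read into strings,
and entries are matched on the first accession number of their AC line.

```python
def split_entries(text):
    """Yield each entry of a flat file; an entry ends with a '//' line."""
    current = []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                yield "\n".join(current + ["//"])
            current = []
        else:
            current.append(line)


def primary_accession(entry):
    """Return the first accession number found on the entry's AC line."""
    for line in entry.splitlines():
        if line.startswith("AC "):
            return line[2:].split(";")[0].strip()
    raise ValueError("entry has no AC line")


def apply_updates(release_text, *update_texts):
    """Merge update files into a release, keyed by primary accession.

    Entries from later arguments replace earlier ones with the same key,
    so new entries are added and updated entries overwrite the old copy.
    """
    merged = {primary_accession(e): e for e in split_entries(release_text)}
    for text in update_texts:
        for entry in split_entries(text):
            merged[primary_accession(entry)] = entry
    return merged
```

A typical call would be apply_updates(release, new_seq, upd_seq, upd_ann),
where each argument is the content of the corresponding file.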

!! Important notes !!

o Although we  try to  follow a  regular schedule,  we do  not promise to
  update these  files every  week. In  most cases  two weeks  may  elapse
  between two updates.
o Instead of  using the  above files,  you can,  every week,  download an
  updated copy  of the SWISS-PROT database. This file is available in the
  directory containing the non-redundant database (see next section).


7.3  Non-redundant database

About a  year ago,  we started  to distribute  on the  ExPASy and EBI FTP
servers, files  that make  up a  non-redundant (see further) and complete
protein sequence database consisting of three components:

1) SWISS-PROT
2) TrEMBL
3) New  entries to  be later  integrated into  TrEMBL (hereafter known as
   TrEMBL_New)

Every week  three files  are completely  rebuilt. These  files are named:
sprot.dat.Z, trembl.dat.Z  and trembl_new.dat.Z. As indicated by their .Z
extension these  are Unix compress format files which, when decompressed,
will produce ASCII files in SWISS-PROT format.

Three other files are also available (sprot.fas.Z, trembl.fas.Z and
trembl_new.fas.Z). These are compressed FASTA format sequence files useful
for building the databases used by FASTA, BLAST and other sequence
similarity search programs. Please do not use these files for any other
purpose, as you will lose all annotations by using this very primitive
format.

The files  for the  non-redundant database  are stored  in the  directory
/databases/sp_tr_nrdb on the ExPASy FTP server (ftp.expasy.ch) and in the
directory   /pub/databases/sp_tr_nrdb    on   the    EBI    FTP    server
(ftp.ebi.ac.uk).

Additional notes

o The SWISS-PROT  file continuously  grows as new annotated sequences are
  added.

o The TrEMBL file decreases in size as sequences are moved out of that
  section after being annotated and moved into SWISS-PROT. Four times a
  year a new release of TrEMBL is built at EBI; at this point the TrEMBL
  file increases in size as it then includes all of the new data (see
  next section) that has accumulated since the last release.

o The TrEMBL_New file starts as a very small file and grows in size until
  a new release of TrEMBL is available.

o SWISS-PROT and  TrEMBL share  the same  system  of  accession  numbers.
  Therefore you  will not  find any  primary accession  number duplicated
  between the two sections. A TrEMBL entry (and its associated accession
  number(s)) can either move to SWISS-PROT as a new entry or be merged
  with an existing SWISS-PROT entry. In the latter case, the accession
  number(s) of that TrEMBL entry are added to those of the SWISS-PROT
  entry.

o TrEMBL_New does  not  have  real  accession  numbers.  However  it  was
  necessary to  have an AC line so as to be able to use it with different
  software products.  This AC  line contains a temporary identifier which
  consists of  the pID (protein identifier) of the coding sequence in the
  parent nucleotide sequence.

o While these three files allow you to build what we call a non-redundant
  database, it must be noted that this description is not completely
  accurate. Without going into a long explanation, we can say that this
  is currently the best attempt at providing a complete selection of
  protein sequence entries while trying to eliminate redundancies.
  Although SWISS-PROT is completely (well, 99.994%!) non-redundant,
  TrEMBL is far from being non-redundant, and the combination of
  SWISS-PROT + TrEMBL is even less so.

o To describe  to your  users the  version of  the non-redundant database
  that you  are providing  them with,  you should  use a statement of the
  form:

     SWISS-PROT release 37 and updates until <current_date>;
     TrEMBL release  8  minus  data  integrated  into  SWISS-PROT  as  of
     <current_date>;
     New preliminary TrEMBL entries created since release 8 of TrEMBL
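
The statement above can be generated automatically each time a local copy
is rebuilt. The sketch below is one possible approach; the function name
'nrdb_statement' is hypothetical, and the ISO date format used for
<current_date> is our assumption, as no format is prescribed here.

```python
from datetime import date


def nrdb_statement(sprot_release: int, trembl_release: int, as_of: date) -> str:
    """Build the three-line description of a non-redundant database copy.

    Assumption: <current_date> is rendered in ISO format (YYYY-MM-DD).
    """
    stamp = as_of.isoformat()
    return "\n".join([
        f"SWISS-PROT release {sprot_release} and updates until {stamp};",
        f"TrEMBL release {trembl_release} minus data integrated into "
        f"SWISS-PROT as of {stamp};",
        f"New preliminary TrEMBL entries created since release "
        f"{trembl_release} of TrEMBL",
    ])


if __name__ == "__main__":
    print(nrdb_statement(37, 8, date.today()))
```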




                         8.  ENZYME and PROSITE


8.1  The ENZYME data bank

Release 24.0  of the  ENZYME data  bank is distributed with release 37 of
SWISS-PROT. ENZYME  release 24.0  contains information  relative to  3704
enzymes. It  differs from  the previous release (23 of July 1998) in that
we have  converted the  CA (Catalytic Activity) and DI (DIsease) lines to
mixed-case characters.  The conversion  of the  ENZYME database  from ALL
UPPER-CASE to mixed-case is therefore completed.

For example, what was previously:

ID   1.14.15.4
DE   Steroid 11-beta-monooxygenase.
AN   Steroid 11-beta-hydroxylase.
AN   Steroid 11-beta/18-hydroxylase.
AN   Cytochrome p450 XIB1.
CA   A STEROID + REDUCED ADRENAL FERREDOXIN + O(2) = AN 11-BETA-
CA   HYDROXYSTEROID + OXIDIZED ADRENAL FERREDOXIN + H(2)O.
CF   Heme-thiolate.
CC   -!- Also hydroxylates steroids at the 18-position, and converts
CC       18-hydroxycorticosterone into aldosterone.
DI   ADRENAL HYPERPLASIA IV; MIM:202010.
PR   PROSITE; PDOC00081;
DR   P15150, CPN1_BOVIN;  Q64408, CPN1_CAVPO;  P15538, CPN1_HUMAN;
DR   P97720, CPN1_MESAU;  Q29527, CPN1_PAPHA;  Q29552, CPN1_PIG  ;
DR   Q92104, CPN1_RANCA;  P15393, CPN1_RAT  ;  P51663, CPN1_SHEEP;
DR   P19099, CPN2_HUMAN;  Q64658, CPN2_MESAU;  P15539, CPN2_MOUSE;
DR   P30099, CPN2_RAT  ;  P30100, CPN3_RAT  ;
//

is now:

ID   1.14.15.4
DE   Steroid 11-beta-monooxygenase.
AN   Steroid 11-beta-hydroxylase.
AN   Steroid 11-beta/18-hydroxylase.
AN   Cytochrome p450 XIB1.
CA   A steroid + reduced adrenal ferredoxin + O(2) = an 11-beta-
CA   hydroxysteroid + oxidized adrenal ferredoxin + H(2)O.
CF   Heme-thiolate.
CC   -!- Also hydroxylates steroids at the 18-position, and converts
CC       18-hydroxycorticosterone into aldosterone.
DI   Adrenal hyperplasia IV; MIM:202010.
PR   PROSITE; PDOC00081;
DR   P15150, CPN1_BOVIN;  Q64408, CPN1_CAVPO;  P15538, CPN1_HUMAN;
DR   P97720, CPN1_MESAU;  Q29527, CPN1_PAPHA;  Q29552, CPN1_PIG  ;
DR   Q92104, CPN1_RANCA;  P15393, CPN1_RAT  ;  P51663, CPN1_SHEEP;
DR   P19099, CPN2_HUMAN;  Q64658, CPN2_MESAU;  P15539, CPN2_MOUSE;
DR   P30099, CPN2_RAT  ;  P30100, CPN3_RAT  ;
//

In this  release, we  have also updated and added a significant number of
DI (Disease) lines and added synonyms (AN lines) to a number of entries.
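
Entries in the format shown above can be read with a few lines of code.
The following sketch groups the lines of one entry by their two-letter
line code; the function names are hypothetical, and the rule that a CA
line ending in a hyphen continues on the next line without a space is an
assumption based on the example entry. The sample is an abridged copy of
that entry.

```python
from collections import defaultdict

SAMPLE = """\
ID   1.14.15.4
DE   Steroid 11-beta-monooxygenase.
CA   A steroid + reduced adrenal ferredoxin + O(2) = an 11-beta-
CA   hydroxysteroid + oxidized adrenal ferredoxin + H(2)O.
DI   Adrenal hyperplasia IV; MIM:202010.
DR   P15150, CPN1_BOVIN;  Q64408, CPN1_CAVPO;  P15538, CPN1_HUMAN;
DR   P30099, CPN2_RAT  ;  P30100, CPN3_RAT  ;
//
"""


def parse_enzyme_entry(text):
    """Group the lines of a single entry by line code (ID, DE, CA, ...)."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        if len(line) < 2:
            continue
        # Columns 1-2 hold the line code; data starts at column 6.
        fields[line[:2]].append(line[5:].rstrip())
    return dict(fields)


def catalytic_activity(fields):
    """Join CA lines; a trailing hyphen is assumed to glue to the next line."""
    joined = ""
    for part in fields.get("CA", []):
        if not joined or joined.endswith("-"):
            joined += part
        else:
            joined += " " + part
    return joined


def cross_references(fields):
    """Return (accession number, entry name) pairs from the DR lines."""
    pairs = []
    for line in fields.get("DR", []):
        for item in line.split(";"):
            if item.strip():
                accession, name = (part.strip() for part in item.split(","))
                pairs.append((accession, name))
    return pairs


if __name__ == "__main__":
    entry = parse_enzyme_entry(SAMPLE)
    print(catalytic_activity(entry))
    print(cross_references(entry))
```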

The WWW  version of  ENZYME on  ExPASy now  includes links  to the BRENDA
database of enzymes. See:

  http://www.uni-koeln.de/math-nat-fak/biochemie/ds/dsbren_e.htm


8.2  The PROSITE data bank

Release 15.0  of the  PROSITE data bank is distributed with release 36 of
SWISS-PROT. This  release of  PROSITE contains 1014 documentation entries
that describe 1'352 different patterns, rules and profiles/matrices.



                       9. WE NEED YOUR HELP !


We welcome feedback from our users. We would especially appreciate it if
you would notify us when you find that sequences belonging to your field
of expertise are missing from the database. We would also like to be
notified about annotations to be updated, if, for example, the function
of a protein has been clarified or if new information about post-
translational modifications has become available. To facilitate this
feedback we offer, on the ExPASy WWW server, a form that allows the
submission of updates and/or corrections to SWISS-PROT:

             http://www.expasy.ch/sprot/sp_update_form.html

It is also possible, from any entry in SWISS-PROT displayed by the ExPASy
server, to  submit updates  and/or corrections for that particular entry.
Finally, you  can also  send your  comments by  electronic  mail  to  the
address:

                         swiss-prot@expasy.ch

Note that  from January  1999, all  update requests  will be  assigned  a
unique  identifier   of the form  'UR-Xnnnn'  (example:  UR-A0123).  This
identifier will be used internally by the SWISS-PROT staff at SIB and EBI
to track  down the  fate of  requests and  will also  be  used  in  email
exchanges with the persons having submitted a request.
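
Users who file many requests may wish to recognise these identifiers in
their own correspondence. The sketch below is a guess at the syntax, not a
specification: we assume that 'X' stands for a single upper-case letter
and 'nnnn' for exactly four digits, as suggested by the example UR-A0123.
The function name is hypothetical.

```python
import re

# Assumption: 'UR-' + one upper-case letter + four digits (e.g. UR-A0123).
_UPDATE_REQUEST_ID = re.compile(r"UR-[A-Z][0-9]{4}")


def is_update_request_id(text: str) -> bool:
    """Return True if text looks like an update request identifier."""
    return _UPDATE_REQUEST_ID.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    for candidate in ("UR-A0123", "UR-0123", "ur-a0123"):
        print(candidate, is_update_request_id(candidate))
```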



                    10. IMPORTANT ANNOUNCEMENT


It has become obvious in recent years that the tremendous increase in
data flow has created a requirement for resources which cannot be
addressed in full by public funding. This is causing databases to fall
behind the research. We believe that the only solution to the resource
shortfall is to ask commercial users to participate by paying a license
fee. No fee will be charged to academic users, nor will any restriction
be imposed on their use or reuse of the data. Both SWISS-PROT and PROSITE
are affected by these changes; ENZYME is not.

A document fully describing the impact of this change on SWISS-PROT is
available with the SWISS-PROT distribution files on FTP (sp_info.txt).
You can also access the document, as well as other relevant ones, from:

                     http://www.expasy.ch/announce/
                     http://www.ebi.ac.uk/news.html

If you  do not  have the  time to  read this document, the most important
take-home message is that these changes should not have any impact on the
way SWISS-PROT  or PROSITE  are accessed or redistributed. Academic users
will not be affected by these changes. Industrial end-users will also not
directly be  affected as long as their employer pays the license fee. The
same holds  true  for  bioinformatics  companies.  Academic  software  or
database  developers  as  well  as  providers  of  database  distribution
services will  be only minimally affected by these changes. We hope to be
able to  keep the  spirit of SWISS-PROT and PROSITE alive and at the same
time ensure  their long-term  financial survival.  We sincerely  hope and
believe that  in the next two years the only change that will matter will
be the increase in scope and timeliness of the databases.


----------------------------------------------------------------------------
SWISS-PROT is copyright.  It is produced through a collaboration between the
Swiss Institute  of  Bioinformatics   and the EMBL Outstation - the European
Bioinformatics Institute. There are no restrictions on its use by non-profit
institutions as long as its  content is in no way modified. Usage by and for
commercial entities requires a license agreement.  For information about the
licensing  scheme  see: http://www.isb-sib.ch/announce/ or send  an email to
license@isb-sib.ch.
----------------------------------------------------------------------------

   ========================================================================


                         APPENDIX A: SOME STATISTICS


   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.58   Gln (Q) 3.97   Leu (L) 9.42   Ser (S) 7.12
   Arg (R) 5.16   Glu (E) 6.37   Lys (K) 5.95   Thr (T) 5.67
   Asn (N) 4.45   Gly (G) 6.84   Met (M) 2.37   Trp (W) 1.23
   Asp (D) 5.28   His (H) 2.24   Phe (F) 4.09   Tyr (Y) 3.18
   Cys (C) 1.66   Ile (I) 5.81   Pro (P) 4.90   Val (V) 6.58

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.01


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
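
   The classification in A.1.2 follows directly from the percentages in
   A.1.1. The sketch below reproduces the ranking from the table and shows
   how the same kind of composition could be computed for a single
   sequence. The function names are hypothetical, and one-letter codes are
   assumed for the sequence input.

```python
from collections import Counter

# Percentages copied from table A.1.1 (standard amino acids only).
COMPOSITION_PERCENT = {
    "Ala": 7.58, "Arg": 5.16, "Asn": 4.45, "Asp": 5.28, "Cys": 1.66,
    "Gln": 3.97, "Glu": 6.37, "Gly": 6.84, "His": 2.24, "Ile": 5.81,
    "Leu": 9.42, "Lys": 5.95, "Met": 2.37, "Phe": 4.09, "Pro": 4.90,
    "Ser": 7.12, "Thr": 5.67, "Trp": 1.23, "Tyr": 3.18, "Val": 6.58,
}


def rank_by_frequency(table):
    """Return residue names ordered from most to least frequent."""
    ordered = sorted(table.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ordered]


def composition(sequence):
    """Return the percentage of each one-letter residue in a sequence."""
    if not sequence:
        raise ValueError("empty sequence")
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {res: round(100.0 * n / total, 2) for res, n in counts.items()}


if __name__ == "__main__":
    print(", ".join(rank_by_frequency(COMPOSITION_PERCENT)))
```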



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 6307

   The first twenty species represent 36880 sequences: 47.3 % of the total
   number of entries.


   A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 2929
                            2x:  984
                            3x:  503
                            4x:  340
                            5x:  244
                            6x:  216
                            7x:  161
                            8x:  116
                            9x:  107
                           10x:   65
                       11- 20x:  297
                       21- 50x:  185
                       51-100x:   74
                         >100x:   86


   A.2.2  Table of the most represented species

    Number   Frequency     Species
         1        5146     Human
         2        4806     Baker's yeast (Saccharomyces cerevisiae)
         3        4476     Escherichia coli
         4        3387     Mouse
         5        2550     Rat
         6        2046     Bacillus subtilis
         7        1956     Caenorhabditis elegans
         8        1701     Haemophilus influenzae
         9        1406     Fission yeast (Schizosaccharomyces pombe)
        10        1307     Methanococcus jannaschii
        11        1126     Bovine
        12        1064     Fruit fly (Drosophila melanogaster)
        13         918     Mycobacterium tuberculosis
        14         862     Chicken
        15         792     Arabidopsis thaliana (Mouse-ear cress)
        16         723     Salmonella typhimurium
        17         711     African clawed frog (Xenopus laevis)
        18         670     Synechocystis sp. (strain PCC 6803)
        19         651     Pig
        20         582     Rabbit
        21         490     Mycoplasma pneumoniae
        22         470     Mycoplasma genitalium
        23         428     Maize
        24         403     Rhizobium sp. (strain NGR234)
        25         367     Helicobacter pylori
        26         363     Pseudomonas aeruginosa
        27         332     Rice
        28         296     Dog
        29         295     Tobacco
        30         285     Slime mold (Dictyostelium discoideum)
        31         274     Treponema pallidum
        32         272     Bacteriophage T4
        33         268     Sheep
        34         262     Mycobacterium leprae
        35         260     Borrelia burgdorferi
        36         256     Pea
        37         253     Vaccinia virus (strain Copenhagen)
        38         235     Methanobacterium thermoautotrophicum
                   235     Soybean
        40         224     Neurospora crassa
        41         222     Staphylococcus aureus
        42         221     Barley
        43         219     Porphyra purpurea
        44         209     Wheat
        45         203     Tomato
        46         201     Rhodobacter capsulatus
        47         199     Potato
        48         198     Klebsiella pneumoniae
        49         194     Candida albicans
        50         193     Human cytomegalovirus (strain AD169)
        51         192     Bacillus stearothermophilus
        52         189     Archaeoglobus fulgidus
                   189     Pseudomonas putida
        54         186     Vaccinia virus (strain WR)
        55         170     Agrobacterium tumefaciens
        56         169     Spinach
        57         166     Guinea pig
        58         159     Chlamydomonas reinhardtii
        59         158     Rhizobium meliloti
        60         154     Autographa californica nuclear polyhedrosis virus
        61         150     Aspergillus nidulans
                   150     Marchantia polymorpha (Liverwort)
        63         148     Streptomyces coelicolor
                   148     Guillardia theta (Cryptomonas phi)
        65         147     Cyanophora paradoxa
        66         146     Variola virus
        67         144     Golden hamster
        68         143     Horse
        69         140     Lactococcus lactis (subsp. lactis)
        70         139     Odontella sinensis
        71         134     Orgyia pseudotsugata multicapsid polyhedrosis virus
        72         132     Kluyveromyces lactis
        73         127     Trypanosoma brucei brucei
        74         126     Synechococcus sp. (strain PCC 7942)
        75         125     Thermus aquaticus (subsp. thermophilus)
        76         120     Alcaligenes eutrophus
        77         115     Bombyx mori (Silk moth)
                   115     Anabaena sp. (strain PCC 7120)
        79         114     Bradyrhizobium japonicum
        80         109     Yersinia enterocolitica
        81         107     Streptococcus pneumoniae
        82         105     Brachydanio rerio (Zebrafish)
        83         104     Oncorhynchus mykiss (Rainbow trout)
                   104     Brassica napus (Rape)
        85         102     Rhodobacter sphaeroides
        86         101     Cat
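
   The summary figures given at the start of A.2 can be checked against the
   two tables. The short sketch below does this arithmetic; the total of
   77'977 entries is taken from the introduction of these notes, and the
   variable and function names are our own.

```python
# Total number of entries in release 37.0 (see the introduction).
TOTAL_ENTRIES = 77977

# Frequencies of the twenty most represented species (table A.2.2).
TOP_TWENTY = [
    5146, 4806, 4476, 3387, 2550, 2046, 1956, 1701, 1406, 1307,
    1126, 1064, 918, 862, 792, 723, 711, 670, 651, 582,
]

# Number of species per frequency class (table A.2.1), from 1x to >100x.
SPECIES_PER_CLASS = [
    2929, 984, 503, 340, 244, 216, 161, 116, 107, 65, 297, 185, 74, 86,
]


def share_percent(part, total):
    """Return part as a percentage of total, rounded to one decimal."""
    return round(100.0 * part / total, 1)


if __name__ == "__main__":
    top = sum(TOP_TWENTY)
    print("sequences in first twenty species:", top)
    print("share of all entries (%):", share_percent(top, TOTAL_ENTRIES))
    print("species represented:", sum(SPECIES_PER_CLASS))
```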



   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    3186             1001-1100      708
                 51- 100    6584             1101-1200      537
                101- 150    9506             1201-1300      365
                151- 200    7467             1301-1400      246
                201- 250    7006             1401-1500      202
                251- 300    6508             1501-1600      127
                301- 350    6115             1601-1700      115
                351- 400    6164             1701-1800       86
                401- 450    4707             1801-1900       93
                451- 500    4450             1901-2000       62
                501- 550    3351             2001-2100       34
                551- 600    2258             2101-2200       68
                601- 650    1768             2201-2300       70
                651- 700    1292             2301-2400       35
                701- 750    1146             2401-2500       41
                751- 800     941             >2500          222
                801- 850     740
                851- 900     781
                901- 950     536
                951-1000     460
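
   The size classes used above are 50 residues wide up to 1000 residues,
   100 residues wide from 1001 to 2500, and a single class beyond that. A
   minimal sketch of the binning rule follows; the function name 'size_bin'
   is hypothetical and the labels are written without the column padding
   used in the table.

```python
from collections import Counter


def size_bin(length: int) -> str:
    """Return the A.3 size class label for a sequence length."""
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    if length > 2500:
        return ">2500"
    # Classes are 50 wide up to 1000 residues, 100 wide afterwards.
    width = 50 if length <= 1000 else 100
    low = ((length - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def size_histogram(lengths):
    """Count how many sequence lengths fall into each size class."""
    return Counter(size_bin(n) for n in lengths)


if __name__ == "__main__":
    # Lengths of the three longest sequences listed in A.4, plus two others.
    print(size_histogram([5217, 5179, 5147, 120, 1050]))
```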



   A.4  Longest sequences

   The longest sequences (>=4000 residues) are listed here:

                              HTS1_COCCA  5217
                              MUC2_HUMAN  5179
                              FAT_DROME   5147
                              RYNR_RABIT  5037
                              RYNR_PIG    5035
                              RYNR_HUMAN  5032
                              RYNC_RABIT  4969
                              LRP_CAEEL   4753
                              DYHC_DICDI  4725
                              PLEC_RAT    4687
                              LRP2_RAT    4660
                              LRP2_HUMAN  4655
                              DYHC_RAT    4644
                              DYHC_DROME  4639
                              DYHC_CAEEL  4568
                              DYHB_CHLRE  4568
                              APB_HUMAN   4563
                              APOA_HUMAN  4548
                              LRP1_HUMAN  4544
                              LRP1_CHICK  4543
                              DYHC_PARTE  4540
                              RRPA_CVMJH  4488
                              DYHG_CHLRE  4485
                              DYHC_ANTCR  4466
                              DYHC_TRIGR  4466
                              GRSB_BACBR  4451
                              PKSK_BACSU  4447
                              PKSL_BACSU  4427
                              PGBM_HUMAN  4393
                              YP73_CAEEL  4385
                              DYHC_NEUCR  4367
                              DYHC_NECHA  4349
                              DYHC_EMENI  4344
                              PKD1_HUMAN  4303
                              DYHC_SCHPO  4196
                              DYHC_YEAST  4092
                              RRPA_CVH22  4085
                              RRPL_DUGBV  4036


   A.5  Statistics for journal citations


   Total number of journals cited in this release of SWISS-PROT: 955


   A.5.1 Table of the frequency of journal citations

        Journals cited 1x: 351
                       2x: 130
                       3x:  79
                       4x:  43
                       5x:  33
                       6x:  26
                       7x:  15
                       8x:  17
                       9x:  15
                      10x:  12
                  11- 20x:  66
                  21- 50x:  67
                  51-100x:  25
                    >100x:  76


   A.5.2  List of the most cited journals in SWISS-PROT

   Nb    Citations       Journal abbreviation
   --    ---------       ----------------------------------
    1    6476            J. Biol. Chem.
    2    3931            Proc. Natl. Acad. Sci. U.S.A.
    3    3418            Nucleic Acids Res.
    4    2815            J. Bacteriol.
    5    2606            Gene
    6    2119            FEBS Lett.
    7    1994            Eur. J. Biochem.
    8    1843            Biochem. Biophys. Res. Commun.
    9    1811            Biochemistry
   10    1751            EMBO J.
   11    1650            Nature
   12    1484            Biochim. Biophys. Acta
   13    1398            J. Mol. Biol.
   14    1264            Cell
   15    1214            Mol. Cell. Biol.
   16     981            Mol. Gen. Genet.
   17     973            Plant Mol. Biol.
   18     941            Genomics
   19     922            Biochem. J.
   20     833            Science
   21     811            Mol. Microbiol.
   22     778            Virology
   23     702            J. Biochem.
   24     525            J. Virol.
   25     482            Yeast
   26     472            J. Cell Biol.
   27     464            J. Gen. Virol.
   28     452            Plant Physiol.
   29     431            Hum. Mutat.
   30     419            Genes Dev.
   31     402            Hum. Mol. Genet.
   32     355            J. Immunol.
   33     344            Arch. Biochem. Biophys.
   34     339            Infect. Immun.
   35     324            Curr. Genet.
   36     322            Oncogene
   37     309            Mol. Biochem. Parasitol.
   38     295            FEMS Microbiol. Lett.
   39     291            Structure
   40     269            Am. J. Hum. Genet.
   41     264            Biol. Chem. Hoppe-Seyler
   42     261            Nat. Genet.
   43     250            Development
   44     244            J. Clin. Invest.
   45     238            Mol. Endocrinol.
   46     238            Microbiology
   47     225            J. Mol. Evol.
   48     220            J. Gen. Microbiol.
   49     220            Genetics
   50     219            Nat. Struct. Biol.
   51     213            Hoppe-Seyler's Z. Physiol. Chem.
   52     200            DNA Cell Biol.
   53     199            Hum. Genet.
   54     196            Appl. Environ. Microbiol.
   55     186            J. Exp. Med.
   56     183            Blood
   57     182            Dev. Biol.
   58     176            Protein Sci.
   59     175            Neuron
   60     154            Immunogenetics
   61     152            DNA
   62     146            Endocrinology
   63     146            DNA Seq.
   64     136            Plant Cell
   65     122            Cancer Res.
   66     116            Plant J.
   67     116            Hemoglobin
   68     115            Bioorg. Khim.
   69     115            Biochimie
   70     112            Mol. Biol. Evol.
   71     112            J. Neurochem.
   72     109            Virus Res.
   73     109            Agric. Biol. Chem.
   74     107            Comp. Biochem. Physiol.
   75     106            Brain Res. Mol. Brain Res.
   76     101            Mech. Dev.


   ========================================================================


   APPENDIX B: RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR
               DATABASES

   The current  status of  the relationships (cross-references) between
   SWISS-PROT and some biomolecular databases is shown in the following
   schematic:


                         ***********************
                         *  EMBL Nucleotide    *
                         *  Sequence Database  *
                         *       [EBI]         *
                         ***********************
                           ^ ^ ^  ^  ^ ^ ^ ^ ^
******************         | | |  I  | | | | |         **********************
* FlyBase        * <-------+ | |  I  | | | | +-------> * MGD [Mouse]        *
******************         | | |  I  | | | | |         **********************
                           | | |  I  | | | | |
******************         | | |  I  | | | | |         **********************
* SubtiList      * <---------+ |  I  | | | +---------> * GCRDb [7TM recep.] *
* [B.subtilis]   *         | | |  I  | | | | |         **********************
******************         | | |  I  | | | | |
                           | | |  I  | | | | |         **********************
******************         | | |  I  | | +-----------> * EcoGene [E.coli]   *
* Mendel [Plant] * <-----+ | | |  I  | | | | |         **********************
******************       | | | |  I  | | | | |
                         | | | |  I  | | | | |         **********************
******************       | | | |  I  +---------------> * SGD [Yeast]        *
* MaizeDB        * <-----------+  I  | | | | |         **********************
* [Zea mays]     *       | | | |  I  | | | | |
******************       | | | |  I  | | | | |         **********************
                         | | | |  I  | +-------------> * DictyDB [D.disco.] *
******************       | | | |  I  | | | | |         **********************
* WormPep        *       | | | |  I  | | | | |
* [C.elegans]    * <---+ | | | |  I  | | | | |         **********************
******************     | | | | |  I  | | | | | +-----> * ENZYME [Nomencl.]  *
                       | | | | |  I  | | | | | |       **********************
******************     | v v v v  v  v v v v v v           v
* REBASE         *     *************************       **********************
* [Restriction   * <-- *   SWISS-PROT          * ----> * OMIM [Human]       *
*  enzymes]      *     *   Protein Sequence    *       **********************
******************     *   Data Bank           *
                       *************************       **********************
******************      ^ ^ ^ ^ ^ ^ ^ | ^ ^ ^          * ECO2DBASE     [2D] *
* StyGene        *      | | | | | | | | | | +--------> **********************
* [S.typhimurium]* <----+ | | | | | | | | |
******************        | | | | | | | | |            **********************
                          | | | | | | | | +----------> * Maize-2DPAGE  [2D] *
******************        | | | | | | | |              **********************
* TRANSFAC       * <------+ | | | | | | |
******************          | | | | | | |              **********************
                            | | | | | | +------------> * SWISS-2DPAGE  [2D] *
******************          | | | | | |                **********************
* Harefield [2D] * <--------+ | | | | |
******************            | | | | |                **********************
                              | | | | +--------------> * Aarhus/Ghent  [2D] *
******************            | | | |                  **********************
* PROSITE        *            | | | |
* [Patterns and  * <----------+ | | +----------------> **********************
* profiles]      *              | |                    * YEPD [Yeast]  [2D] *
******************              | +----------------+   **********************
             |                  v                  |
             |          ***********************    +-> **********************
             +--------> * PDB [3D structures] * <----- * HSSP [3D similar.] *
                        ***********************        **********************

   =End=of=SWISS-PROT=release=37=notes=====================================
