-------------------------------------------------------------------------
SWISS-PROT Protein Knowledgebase
Release Notes
Release 40, October 2001
-------------------------------------------------------------------------
Table of contents
1 Introduction
2 Description of the changes made to SWISS-PROT since release 38
3 Forthcoming changes
4 Status of the documentation files
5 The ExPASy World-Wide Web server
6 TrEMBL - a supplement to SWISS-PROT
7 FTP access to SWISS-PROT and TrEMBL
8 ENZYME and PROSITE
9 We need your help!
A Appendix A
1 Introduction
Release 40.0 of SWISS-PROT contains 101'602 sequence entries, comprising
37'315'215 amino acids abstracted from 91'880 references. This represents
an increase of 18% over release 39. The growth of the data bank is
summarized below.
Release  Date    Number of     Number of
                   entries   amino acids
2.0      09/86       3'939       900'163
3.0      11/86       4'160       969'641
4.0      04/87       4'387     1'036'010
5.0      09/87       5'205     1'327'683
6.0      01/88       6'102     1'653'982
7.0      04/88       6'821     1'885'771
8.0      08/88       7'724     2'224'465
9.0      11/88       8'702     2'498'140
10.0     03/89      10'008     2'952'613
11.0     07/89      10'856     3'265'966
12.0     10/89      12'305     3'797'482
13.0     01/90      13'837     4'347'336
14.0     04/90      15'409     4'914'264
15.0     08/90      16'941     5'486'399
16.0     11/90      18'364     5'986'949
17.0     02/91      20'024     6'524'504
18.0     05/91      20'772     6'792'034
19.0     08/91      21'795     7'173'785
20.0     11/91      22'654     7'500'130
21.0     03/92      23'742     7'866'596
22.0     05/92      25'044     8'375'696
23.0     08/92      26'706     9'011'391
24.0     12/92      28'154     9'545'427
25.0     04/93      29'955    10'214'020
26.0     07/93      31'808    10'875'091
27.0     10/93      33'329    11'484'420
28.0     02/94      36'000    12'496'420
29.0     06/94      38'303    13'464'008
30.0     10/94      40'292    14'147'368
31.0     02/95      43'470    15'335'248
32.0     11/95      49'340    17'385'503
33.0     02/96      52'205    18'531'384
34.0     10/96      59'021    21'210'389
35.0     11/97      69'113    25'083'768
36.0     07/98      74'019    26'840'295
37.0     12/98      77'977    28'268'293
38.0     07/99      80'000    29'085'965
39.0     05/00      86'593    31'411'114
40.0     10/01     101'602    37'315'215
2 Description of the changes made to SWISS-PROT since release 38
The name of the database changed from 'SWISS-PROT protein sequence
database' to 'SWISS-PROT knowledgebase' to emphasize that SWISS-PROT
collects far more than just information on protein sequences, and that it
is a central linking and linked database which connects the various
findings in the diverse fields of proteomics research.
We apologize that due to technical problems we never posted the release
notes of release 39. Therefore this document describes the changes that
took place not only since release 39 but also those between releases 38 and
39.
2.1 Sequences and annotations
15'184 sequences have been added since release 39, the sequence data of
2'908 existing entries has been updated and the annotations of 44'684
entries have been revised. With this release SWISS-PROT has passed the
symbolic mark of 100 thousand entries.
2.2 The HPI project
The Human Proteomics Initiative (HPI) has been introduced to put a major
effort on the annotation of all known human sequences according to the
quality standards of SWISS-PROT. This means that, for each known protein, a
wealth of information is provided, which includes the description of its
function, its domain structure, subcellular location, posttranslational
modifications, variants, similarities to other proteins, etc. This not only
implies the annotation of newly detected proteins, but also the integration
of new research data to the existing entries by specialized biologists, who
are in close contact with experts all over the world.
There are currently 7'471 annotated human sequences in SWISS-PROT. These
entries are associated with 19'922 literature references, 18'974
experimental or predicted PTMs, 1'697 splice variants and 12'061
polymorphisms (most of which are linked with disease states).
Simultaneously, two further efforts have been intensified: the description
of human diseases associated with deficiency(ies) in a protein, and the
annotation of mammalian orthologs of human proteins at a level equivalent
to that of the cognate human sequences.
For all aspects of the HPI projects, we would appreciate the help and
collaboration of the scientific community. Information concerning the human
proteome is highly critical to a large section of the life science
community. We therefore appeal to the user community to fully participate
in this initiative by providing all the necessary information to help and
to speed up the comprehensive annotation of the human proteome.
For a detailed description of the HPI project and its current status please
consult:
http://www.expasy.org/sprot/hpi/
2.3 The HAMAP project
The first complete microbial genomic sequence was that of the bacterium
Haemophilus influenzae, which became available in 1995. Since then more
than 50 bacterial and archaeal genomes have been sequenced and many more
sequencing projects of pathogenic as well as nonpathogenic microbes are in
progress. To date, the publicly available microbial genomes collectively
encode more than 100'000 different proteins.
In order to handle the large amount of "raw" data coming from the microbial
genomic sequencing, the High quality Automated Microbial Annotation of
Proteomes (HAMAP) project was initiated. The latter aims to automatically
annotate a significant percentage of proteins which originate from
microbial genome sequencing projects.
To maintain a high quality of annotation, specific tools are being
developed to deal with two completely separate subsets of bacterial and
archaeal proteins: proteins that have no recognizable similarity to any
other microbial or non-microbial proteins ("ORFans") and proteins that are
part of well-defined families or subfamilies. This is done by using a rule
system that describes the level and extent of annotations that can be
assigned by similarity with a prototype manually-annotated entry. The
result is a curated entry whose quality is identical to that produced
manually by an expert annotator.
The programs in development are designed to recognize protein
peculiarities, and only proteins which match the defined criteria will be
processed automatically. Protein sequences which fail to fit into that rule
system will be further analyzed by SWISS-PROT expert annotators.
For a detailed description of the HAMAP project and its current status
please consult:
http://www.expasy.org/sprot/hamap/
2.4 What's happening with the model organisms?
We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:
* be as complete as possible. All sequences available at a given time
should be immediately included in SWISS-PROT. This also includes
sequence corrections and updates;
* provide a higher level of annotation;
* provide cross-references to specialized database(s) that contain,
among other data, some genetic information about the genes that code
for these proteins;
* provide specific indices or documents.
The HPI project (see 2.2) arose from our efforts to annotate human sequence
entries as completely as possible, and the bacterial model organisms became
part of the HAMAP project (see 2.3). Here is the current status of the
model organisms which are not covered by these two projects:
Organism        Database          Index file      Number of
                cross-references                  sequences
--------------  ----------------  --------------  ---------
A.thaliana      None yet          In preparation      1'409
C.albicans      None yet          CALBICAN.TXT          256
C.elegans       Wormpep           CELEGANS.TXT        2'184
D.discoideum    DictyDB           DICTY.TXT             311
D.melanogaster  FlyBase           FLY.TXT             1'514
M.musculus      MGD               MGDTOSP.TXT         4'816
S.cerevisiae    SGD               YEAST.TXT           4'859
S.pombe         None yet          POMBE.TXT           1'782
2.5 Progress in the conversion of SWISS-PROT to mixed-case
characters
We are gradually converting SWISS-PROT entries from all 'UPPER CASE' to
'MiXeD CaSe'. The line-types that have been converted between releases 38
and 40 are: DE (DEscription), most RC (Reference Comment) topics (SPECIES,
TISSUE, PLASMID and TRANSPOSON) and DR (Database cross-Reference). The new
OX line (Organism cross-reference; see section 2.8) and the new CC topics
PHARMACEUTICAL and BIOTECHNOLOGY (described in section 2.11) have been
introduced in mixed case. The CC topic MASS SPECTROMETRY has been converted
to mixed case. As described in section 3.5, the process of converting all
of SWISS-PROT to mixed case continues.
2.6 Extension of the accession number system
With the creation of the TrEMBL database and the rapid increase in the
amount of sequence data, we were faced with a problem of availability of
accession numbers. We used a system based on a one-letter prefix followed
by 5 digits. This system was also used by the nucleotide sequence databases
which had originally reserved for SWISS-PROT the prefix letters 'O', 'P'
and 'Q'. Having run out of space (due mainly to EST's), the nucleotide
sequence databases have been forced to choose a new format, which became a
two-letter prefix followed by 6 digits.
We have now used up all possible numbers with 'O', 'P' and 'Q'. We believe
that adopting the accession number format now used by the nucleotide
sequence databases would have wreaked havoc on the numerous software
packages using SWISS-PROT. We therefore decided to keep a system of
accession numbers based on a 6-character code, but with the following
format extension:
1        2      3           4           5           6
[O,P,Q]  [0-9]  [A-Z, 0-9]  [A-Z, 0-9]  [A-Z, 0-9]  [0-9]
What the above means is that we kept a 6-character code, but that in
positions 3, 4 and 5 of this code any combination of letters and numbers
can be present. This format allows a total of 14 million accession numbers
(compared with only 300'000 with the former system).
We only allow numbers in positions 2 and 6 so that SWISS-PROT accession
numbers cannot be mistaken for gene names, acronyms, other types of
accession numbers or any kind of word!
Examples: P0A3S2, Q2ASD4, O13YX2, P9B123.
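As an illustration only, the six-position format above can be written as a regular expression, and the quoted capacities can be recomputed from the number of characters allowed in each position. This is a minimal Python sketch; the names ACCESSION_RE and is_valid_accession are hypothetical and not part of any SWISS-PROT software.

```python
import re

# Pattern transcribed from the position table in section 2.6:
# [O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9]
ACCESSION_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(ac: str) -> bool:
    """Return True if `ac` fits the extended six-character format."""
    return ACCESSION_RE.fullmatch(ac) is not None


# Capacity of the extended format: 3 prefix letters, a digit,
# three alphanumeric positions (36 choices each) and a final digit.
NEW_CAPACITY = 3 * 10 * 36**3 * 10

# Capacity of the former format: 3 prefix letters followed by 5 digits.
OLD_CAPACITY = 3 * 10**5
```

Accession numbers in the former format (one letter followed by 5 digits) also match this pattern, so existing numbers stay valid under the extension.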
2.7 Multiple AC lines
Starting from release 39, there can be more than one AC (ACcession) line
per SWISS-PROT entry. Strictly speaking this was not a format change and
the SWISS-PROT user's manual always indicated that there could be more than
one AC line per entry. Until recently, a single line was sufficient and the
majority of entries contained only a single accession number. But, in the
process of providing an optimally non-redundant database, we are merging
information from TrEMBL entries into SWISS-PROT entries. When we merge a
TrEMBL entry into a SWISS-PROT one, we add to the latter the accession
number(s) of the TrEMBL entry. The repetition of such a process sometimes
produces an accession number list which can no longer fit in a single AC
line. Therefore there are now some entries with two, three (as shown below)
or more AC lines.
AC P16070; P22511; Q04858; Q13419; Q13957; Q13958; Q13959; Q13960;
AC Q13961; Q13967; Q13968; Q13980; Q15861; Q16064; Q16065; Q16066;
AC Q16208; Q16522;
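Programs that assumed a single AC line need to collect the accession numbers from every AC line of an entry. The following Python sketch shows one possible way to do this; the function name parse_ac_lines is hypothetical, and treating the first number in the list as the primary accession number is an assumption based on the usual SWISS-PROT convention.

```python
def parse_ac_lines(lines):
    """Collect accession numbers from all AC lines, keeping their order."""
    accessions = []
    for line in lines:
        if not line.startswith("AC"):
            continue
        # Drop the two-letter line code, then split on the ';' terminators.
        for token in line[2:].split(";"):
            token = token.strip()
            if token:
                accessions.append(token)
    return accessions


entry_lines = [
    "AC P16070; P22511; Q04858; Q13419; Q13957; Q13958; Q13959; Q13960;",
    "AC Q13961; Q13967; Q13968; Q13980; Q15861; Q16064; Q16065; Q16066;",
    "AC Q16208; Q16522;",
]
accessions = parse_ac_lines(entry_lines)

# Assumption: the first accession number is the primary one.
primary, secondary = accessions[0], accessions[1:]
```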
2.8 Introduction of the new line type OX: Organism taxonomy
cross-reference
The OX (Organism taXonomy cross-reference) line has been introduced to
indicate the identifier to a specific organism in a taxonomic database. The
number of taxonomic codes is identical to the number of species given in
the OS line. There can be more than one OX line in an entry and its format
is:
OX Taxonomy-database_Qualifier=Taxonomic code[, Taxonomic code...];
The cross-references point to the taxonomic database of NCBI, which is
associated with the qualifier 'TaxID' and a one- to six-digit taxonomic
code.
Examples of its usage:
OX NCBI_TaxID=10116;
OX NCBI_TaxID=9606, 10090, 9913, 9823, 10141, 10029, 10030, 10116, 9986,
OX 9031, 8355, 7227, 7213, 7108, 7130;
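Because a list of taxonomic codes can continue over several OX lines, a parser has to join the lines before splitting the list. A minimal Python sketch follows; parse_ox_lines is a hypothetical helper, and it assumes that all OX lines of an entry share a single qualifier, as in the examples above.

```python
def parse_ox_lines(lines):
    """Return (qualifier, [taxonomic codes]) from one or more OX lines."""
    # Remove the 'OX' line code and join continuation lines.
    body = " ".join(
        line[2:].strip() for line in lines if line.startswith("OX")
    )
    body = body.rstrip(";")
    qualifier, _, codes = body.partition("=")
    tax_ids = [int(code) for code in codes.split(",") if code.strip()]
    return qualifier.strip(), tax_ids


ox_lines = [
    "OX NCBI_TaxID=9606, 10090, 9913, 9823, 10141, 10029, 10030, 10116, 9986,",
    "OX 9031, 8355, 7227, 7213, 7108, 7130;",
]
qualifier, tax_ids = parse_ox_lines(ox_lines)
```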
2.9 Changes concerning the RC line
We are gradually implementing controlled vocabularies for the different
types of RC tokens. To complement the tissue list (TISSLIST.TXT), we have
now added a plasmid list (PLASMID.TXT) and are in the process of creating a
strain list. Controlled vocabularies are part of the SWISS-PROT
documentation files that are all described in section 4.
2.10 Changes concerning the RX line
The RX line format has changed, and it now also provides identifiers for
the bibliographic database PubMed.
The old format was:
RX MEDLINE; unique_identifier.
The new format is:
RX BIBLIOGRAPHIC_DATABASE=IDENTIFIER[; BIBLIOGRAPHIC_DATABASE=IDENTIFIER...];
Example of RX lines:
RX PubMed=9145897;
RX MEDLINE=79012484; PubMed=358200;
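The new RX format is a list of DATABASE=IDENTIFIER pairs, each terminated by a semicolon, which makes it straightforward to split. The Python sketch below is one possible reading of that format; parse_rx_line is a hypothetical name, and the old 'RX MEDLINE; unique_identifier.' format is deliberately not handled.

```python
def parse_rx_line(line):
    """Map each bibliographic database named in an RX line to its identifier."""
    references = {}
    # Drop the 'RX' line code, then split on the ';' terminators.
    for item in line[2:].split(";"):
        item = item.strip()
        if not item:
            continue
        database, _, identifier = item.partition("=")
        references[database.strip()] = identifier.strip()
    return references


refs = parse_rx_line("RX MEDLINE=79012484; PubMed=358200;")
```

Identifiers are kept as strings here so that no assumption is made about their numeric range.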
2.11 Introduction of two new CC line topics: BIOTECHNOLOGY and
PHARMACEUTICAL
We have introduced two new 'topics' for the comments (CC) line type.
The topic 'BIOTECHNOLOGY' has been introduced to describe the use of a
specific protein in the biotechnological industry. This topic contains the
name(s) of the company(ies) that produce the protein or the genetically
manipulated organism as well as a short description of the biotechnological
function of the protein. The brand name(s) under which a protein is
available are added, if applicable.
Examples of the usage:
CC -!- BIOTECHNOLOGY: Introduced by genetic manipulation and
CC expressed in improved ripening tomato by Monsanto. ACC is the
CC immediate precursor of the phytohormone ethylene who is
CC involved in the control of ripening. ACC deaminase reduces
CC ethylene biosynthesis and thus extend the shelf life of fruits
CC and vegetables.
CC -!- BIOTECHNOLOGY: Used in the food industry for high temperature
CC liquefaction of starch-containing mashes and in the detergent
CC industry to remove starch. Sold under the name Termamyl by
CC Novozymes.
The topic 'PHARMACEUTICAL' has been introduced to describe the use of a
specific protein as a pharmaceutical drug. The information provided by such
a topic will include the brand name(s) under which a protein is available,
the name(s) of the company(ies) that produce it as well as a short
description of the therapeutic usage of the protein. It should be noted
that any entries containing such a comment field will also be tagged with
the keyword 'Pharmaceutical'.
Examples of the usage:
CC -!- PHARMACEUTICAL: Available under the names Avonex (Biogen),
CC Betaseron (Berlex) and Rebif (Serono). Used in the treatment
CC of multiple sclerosis (MS). Betaseron is a slightly modified
CC form of IFNB1 with two residue substitutions.
CC -!- PHARMACEUTICAL: Available under the name Proleukin (Chiron).
CC Used in patients with renal cell carcinoma or metastatic
CC melanoma.
2.12 Cleaning up of comment line (CC) topics
We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while being human-readable). We are therefore standardizing the
format of the topics.
The topic ALTERNATIVE PRODUCTS has two sub-formats:
CC -!- ALTERNATIVE PRODUCTS: <n> isoforms; <name_1> (shown here),
CC <name_2>, <...> and <name_n>; are produced by alternative splicing.
CC [Comment.]
CC -!- ALTERNATIVE PRODUCTS: <n> isoforms; <name_1> (shown here),
CC <name_2> and <name_n>; are produced by alternative
CC initiation. [Comment.]
Examples:
CC -!- ALTERNATIVE PRODUCTS: At least 5 isoforms; 1 (shown here), 2, 3, 4
CC and 5; are produced by alternative splicing. They differ in their
CC acetylcholine receptor clustering activity.
CC -!- ALTERNATIVE PRODUCTS: 3 isoforms; TRAC-2 (shown here), TRAC-3 and
CC TRAC-4; are produced by alternative initiation.
We are gradually cleaning up the comment line topic SIMILARITY. To describe
the similarity of the protein to a protein family, we use the following
subformat:
CC -!- SIMILARITY: Belongs to the <family_name>[. <sub-family_name>].
Examples:
CC -!- SIMILARITY: Belongs to the 14-3-3 family.
CC -!- SIMILARITY: Belongs to the glucosamine/galactosamine-6-phosphate
CC isomerase family. 6-phosphogluconolactonase subfamily.
To describe conserved domains within a protein sequence, we use the
subformat:
CC -!- SIMILARITY: Contains n <domain_name>.
Examples:
CC -!- SIMILARITY: Contains 10 HEAT repeats.
CC -!- SIMILARITY: Contains 1 FKBP-type PPIase domain.
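To show how the standardized SIMILARITY subformats could be used by a program, here is a minimal Python sketch. It assumes that the comment text has already been extracted, i.e. the 'CC -!- SIMILARITY: ' prefix has been removed and continuation lines have been joined with single spaces. The names BELONGS_RE, CONTAINS_RE and parse_similarity are hypothetical.

```python
import re

# "Belongs to the <family_name>[. <sub-family_name>]."
BELONGS_RE = re.compile(
    r"Belongs to the (?P<family>.+?) family\."
    r"(?: (?P<subfamily>.+?) subfamily\.)?$"
)

# "Contains n <domain_name>."
CONTAINS_RE = re.compile(r"Contains (?P<count>\d+) (?P<domain>.+?)\.$")


def parse_similarity(text):
    """Return a dict describing the comment, or None if it is free text."""
    match = BELONGS_RE.match(text)
    if match:
        return {
            "kind": "family",
            "family": match["family"],
            "subfamily": match["subfamily"],
        }
    match = CONTAINS_RE.match(text)
    if match:
        return {
            "kind": "domain",
            "count": int(match["count"]),
            "domain": match["domain"],
        }
    return None
```

Comments that have not yet been converted to one of the two subformats return None, so a caller can fall back to treating them as free text.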
2.13 Changes concerning cross-references (DR line)
We have added cross-references from SWISS-PROT to the following databases:
2.13.1 GlycoSuiteDB
GlycoSuiteDB, a database of glycan structures available at
http://www.glycosuite.com/ (see Cooper C.A., Harrison M.J., Wilkins M.R.
and Packer N.H.; Nucleic Acids Res. 29:332-335(2001)). The identifiers of
the appropriate DR line are:
Data bank identifier: GlycoSuiteDB
Primary identifier:   GlycoSuiteDB unique identifier for a glycoprotein,
                      which is identical to the SWISS-PROT primary AC
                      number of that protein.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR GlycoSuiteDB; P05067; -.
2.13.2 SMART
The Simple Modular Architecture Research Tool (SMART), a database of
functional sites available at http://smart.embl-heidelberg.de/ (see Schultz
J., Copley R.R., Doerks T., Ponting C.P. and Bork P.; Nucleic Acids Res.
28:231-234(2000)). The cross-references for this database are composed of
the following items:
Data bank identifier: SMART
Primary identifier: SMART unique identifier for a domain.
Secondary identifier: Abbreviation for the name of a domain or module.
Fourth item: Number of hits of the domain in the entry.
Example: DR SMART; SM00370; LRR; 6.
2.13.3 Leproma
The Mycobacterium leprae genome database Leproma, which is available at
http://genolist.pasteur.fr/Leproma/. The information is available in the DR
line:
Data bank identifier: Leproma
Primary identifier: Leproma unique identifier for an ORF.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR Leproma; ML0485; -.
2.13.4 MEROPS
MEROPS, the protease database available at http://www.merops.co.uk/ (see
Rawlings N.D. and Barrett A.J.; Nucleic Acids Res. 28:323-325(2000)). The
following information is available in the two qualifiers of the DR line:
Data bank identifier: MEROPS
Primary identifier: The MEROPS unique identifier for a peptidase.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR MEROPS; M41.001; -.
2.13.5 MypuList
The Mycoplasma pulmonis genome database MypuList, available at
http://genolist.pasteur.fr/MypuList/. The following information is
available in the two identifiers of the DR line:
Data bank identifier: MypuList
Primary identifier: The MypuList unique identifier for an ORF.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR MypuList; MYPU_4900; -.
2.13.6 ProDom
Cross-references to the ProDom protein domain database used to be provided
as implicit links; links are now also available as explicit links:
Data bank identifier: ProDom
Primary identifier: The ProDom unique identifier for a domain.
Secondary identifier: The ProDom entry name.
Fourth item: Number of hits of the domain in the entry.
Example for an explicit link: DR ProDom; PD000600; 14-3-3; 1.
2.13.7 ANU-2DPAGE
The Australian National University Two-Dimensional Polyacrylamide Gel
Electrophoresis Database (ANU-2DPAGE) is available at
http://semele.anu.edu.au/2d/2d.html (see Imin N., Kerim T., Weinman J.J.
and Rolfe B.G.; Proteomics 1:1149-1161(2001)). The following information is
available in the DR line:
Data bank identifier: ANU-2DPAGE
Primary identifier:   ANU-2DPAGE unique identifier, which is identical to
                      the SWISS-PROT primary AC number of that protein.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR ANU-2DPAGE; Q9XEA8; -.
2.13.8 COMPLUYEAST-2DPAGE
The two-dimensional polyacrylamide gel electrophoresis database at
Universidad Complutense de Madrid (COMPLUYEAST-2DPAGE) is available at
http://babbage.csc.ucm.es/2d/2d.html. The following information is
available in the DR line:
Data bank identifier: COMPLUYEAST-2DPAGE
Primary identifier:   COMPLUYEAST-2DPAGE unique identifier, which is
                      identical to the SWISS-PROT primary AC number of that
                      protein.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR COMPLUYEAST-2DPAGE; P43067; -.
2.13.9 PHCI-2DPAGE
The Parasite Host Cell Interaction 2D-PAGE database (PHCI-2DPAGE) is
available at http://www.gram.au.dk/2d/2d.html. The cross-references for
this database are composed of the following items:
Data bank identifier: PHCI-2DPAGE
Primary identifier:   PHCI-2DPAGE unique identifier, which is identical to
                      the SWISS-PROT primary AC number of that protein.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR PHCI-2DPAGE; Q9Z6V3; -.
2.13.10 PMMA-2DPAGE
The Purkyne Military Medical Academy 2D-PAGE database (PMMA-2DPAGE) is
available at http://www.pmma.pmfhk.cz/2d/2d.html. The identifiers of the
appropriate DR line are:
Data bank identifier: PMMA-2DPAGE
Primary identifier:   PMMA-2DPAGE unique identifier, which is identical to
                      the SWISS-PROT primary AC number of that protein.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR PMMA-2DPAGE; Q01995; -.
2.13.11 Siena-2DPAGE
The 2D-PAGE database from the Department of Molecular Biology, University
of Siena, Italy, is available at http://www.bio-mol.unisi.it/2d/2d.html.
The components of the corresponding DR line are:
Data bank identifier: Siena-2DPAGE
Primary identifier:   Siena-2DPAGE unique identifier, which is identical to
                      the SWISS-PROT primary AC number of that protein.
Secondary identifier: None; a dash '-' is stored in that field.
Example: DR Siena-2DPAGE; P01591; -.
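All the cross-references described in this section share the same layout: a data bank identifier, a primary identifier, a secondary identifier (or a dash), and for some databases a fourth item. The Python sketch below shows one way a program could split such a line; parse_dr_line and the dictionary keys are hypothetical names chosen for this illustration.

```python
def parse_dr_line(line):
    """Split a DR line into its data bank identifier and remaining items."""
    body = line[2:].strip()
    # Remove only the terminating period; identifiers such as
    # 'M41.001' contain periods of their own.
    if body.endswith("."):
        body = body[:-1]
    items = [item.strip() for item in body.split(";")]
    return {
        "database": items[0],
        "primary": items[1],
        # A dash means that no secondary identifier is stored.
        "secondary": None if items[2] == "-" else items[2],
        # Fourth item, e.g. the number of hits for SMART and ProDom.
        "extra": items[3:],
    }


smart = parse_dr_line("DR SMART; SM00370; LRR; 6.")
merops = parse_dr_line("DR MEROPS; M41.001; -.")
```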
2.14 Introduction of a new FT key: SE_CYS
Selenocysteine is the 21st 'natural' amino acid. It is now known to occur
in several prokaryotic and eukaryotic proteins. Its mRNA codon is UGA,
which usually serves as a stop codon. In the presence of a specific
downstream sequence forming a loop and of a specific translational
elongation factor, however, UGA is recognized as the site of selenocysteine
incorporation into proteins.
The joint nomenclature committee of the IUPAC/IUBMB (see
http://www.chem.qmw.ac.uk/iupac/jcbn/) officially recommended
(http://www.chem.qmw.ac.uk/iubmb/newsletter/1999/item3.html) a three-letter
and a one-letter symbol for selenocysteine, namely 'Sec' and 'U'.
Introducing a new one-letter code in the sequence records would have
disrupted most, if not all, sequence analysis software. We therefore decided
to change, in SWISS-PROT, the rules used to annotate the presence of
selenocysteine residues in sequence entries in the manner described below.
Selenocysteines were stored, in the sequence records, using the one-letter
symbol 'C' for cysteine and were indicated in the feature table (FT) by a
line of the type:
FT BINDING x x SELENIUM.
The one-letter code has not been changed (for the reason explained above),
but we introduced a specific feature key (SE_CYS) to indicate the presence
of a selenocysteine at a given sequence position. The above example has
therefore been changed to:
FT SE_CYS x x
We also want to remind users that the keyword 'Selenocysteine' continues
to be used to tag sequence entries that contain at least one such residue.
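Software that does understand the symbol 'U' can recover the true residues by combining the stored sequence with the SE_CYS features. The Python sketch below illustrates the idea under the assumption that feature positions are 1-based and inclusive; apply_se_cys is a hypothetical helper and the short sequence is made up for the example.

```python
def apply_se_cys(sequence, ft_lines):
    """Return a copy of `sequence` with 'U' at every SE_CYS position."""
    residues = list(sequence)
    for line in ft_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "FT" and fields[1] == "SE_CYS":
            start, end = int(fields[2]), int(fields[3])
            for position in range(start, end + 1):  # 1-based, inclusive
                if residues[position - 1] != "C":
                    raise ValueError(
                        f"position {position} is not stored as 'C'"
                    )
                residues[position - 1] = "U"
    return "".join(residues)


# Made-up six-residue sequence with a selenocysteine at position 3.
converted = apply_se_cys("MACGCK", ["FT SE_CYS 3 3"])
```

The check that the stored residue is 'C' guards against applying a feature table to the wrong sequence.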
2.15 Introduction of feature identifiers to the feature keys
CARBOHYD and VARIANT
We have introduced unique and stable feature identifiers (FTId) which make
it possible to construct links directly from position-specific annotation
in the feature table to specialized protein-related databases. Examples are
databases specialized in certain types of posttranslational modifications
of proteins, or in mutations. The FTId is always the last component in the
feature description.
2.15.1 Feature identifiers in FT VARIANT lines of human sequence
entries
The feature identifiers in the FT VARIANT lines of human sequence entries
make it possible to refer to a sequence variation and serve as anchors for
specifically directed links. A federated single human mutation database
(HmutDB; http://www2.ebi.ac.uk/mutations/central/proposal.html) has been
proposed, and the complete set of all FT VARIANT lines has been indexed for
SRS at EBI (http://srs.ebi.ac.uk/), under the name SWISSCHANGE. The
database SWISSCHANGE can be queried by SWISS-PROT FTIds.
The format of FT VARIANT lines with feature identifiers is:
FT VARIANT x x Description.
FT /FTId=VAR_number.
Example:
FT VARIANT 3 3 A -> L.
FT /FTId=VAR_000001.
2.15.2 Feature identifiers in FT CARBOHYD lines
The same principle is used to further enhance the links to GlycoSuiteDB, an
annotated database of glycan structures (see section 2.13.1). So, in
addition to the explicit global link in the DR line, we create unique
feature identifiers for each of the FT CARBOHYD lines, which will allow
direct access to the glycan structure.
The format of FT CARBOHYD lines with feature identifiers is:
FT CARBOHYD x x Description.
FT /FTId=CAR_number.
Example:
FT CARBOHYD 251 251 N-LINKED (GLCNAC...).
FT /FTId=CAR_000070.
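Since the FTId appears on a continuation line after the feature it belongs to, a parser has to remember the most recent VARIANT or CARBOHYD feature. The following Python sketch shows one possible approach; FTID_RE and collect_ftids are hypothetical names, and the sketch ignores descriptions that continue over several lines.

```python
import re

# Matches the two identifier families described above: VAR_ and CAR_.
FTID_RE = re.compile(r"/FTId=(?P<prefix>VAR|CAR)_(?P<number>\d+)\.")


def collect_ftids(ft_lines):
    """Pair each VARIANT or CARBOHYD feature with the FTId that follows it."""
    features = []
    current = None
    for line in ft_lines:
        fields = line.split(None, 4)
        if len(fields) >= 4 and fields[1] in ("VARIANT", "CARBOHYD"):
            current = {
                "key": fields[1],
                "start": int(fields[2]),
                "end": int(fields[3]),
                "description": fields[4] if len(fields) > 4 else "",
                "ftid": None,
            }
            features.append(current)
            continue
        match = FTID_RE.search(line)
        if match and current is not None:
            current["ftid"] = f"{match['prefix']}_{match['number']}"
    return features


features = collect_ftids([
    "FT VARIANT 3 3 A -> L.",
    "FT /FTId=VAR_000001.",
    "FT CARBOHYD 251 251 N-LINKED (GLCNAC...).",
    "FT /FTId=CAR_000070.",
])
```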
2.16 Change in the syntax of the SQ line
The SQ (SeQuence header) line marks the beginning of the sequence data and
gives a quick summary of its content. The format of the SQ line was:
SQ SEQUENCE XXXX AA; XXXXXX MW; XXXXXXXX CRC32;
The last information item in the SQ line was a 32-bit CRC (Cyclic
Redundancy Check) value which is computed from the sequence. As the number
of available sequences is increasing rapidly, there are now a few cases
where two sequences share the same CRC32 (but none that also share the
same molecular weight 'MW' or number of amino acids 'AA'). To address
this issue we replaced the 32-bit CRC value by a 64-bit CRC. The format of
the SQ line therefore changed to:
SQ SEQUENCE XXXX AA; XXXXXX MW; XXXXXXXXXXXXXXXX CRC64;
Example:
SQ SEQUENCE 233 AA; 25630 MW; 146A1B48A1475C86 CRC64;
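Programs that read the SQ line may need to accept both the old and the new checksum fields during the transition. The Python sketch below parses either form and checks that the checksum has the width shown in the format lines above (8 hexadecimal digits for CRC32, 16 for CRC64). SQ_RE and parse_sq_line are hypothetical names, and the sketch does not compute the CRC itself, since the algorithm is not specified in these notes.

```python
import re

SQ_RE = re.compile(
    r"SQ\s+SEQUENCE\s+(?P<length>\d+)\s+AA;\s+(?P<weight>\d+)\s+MW;\s+"
    r"(?P<checksum>[0-9A-F]+)\s+(?P<kind>CRC32|CRC64);"
)


def parse_sq_line(line):
    """Extract length, molecular weight and checksum from an SQ line."""
    match = SQ_RE.match(line)
    if match is None:
        raise ValueError("not an SQ line")
    kind = match["kind"]
    checksum = match["checksum"]
    # 32 bits are written as 8 hex digits, 64 bits as 16.
    expected_digits = 8 if kind == "CRC32" else 16
    if len(checksum) != expected_digits:
        raise ValueError(f"{kind} checksum must have {expected_digits} digits")
    return {
        "length": int(match["length"]),
        "mol_weight": int(match["weight"]),
        "checksum": checksum,
        "checksum_type": kind,
    }


sq = parse_sq_line("SQ SEQUENCE 233 AA; 25630 MW; 146A1B48A1475C86 CRC64;")
```

A caller could additionally compare the parsed length with the number of residues actually read from the sequence lines.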
3 Forthcoming changes
3.1 Version of SP in XML format
A distribution version of SWISS-PROT and TrEMBL in XML format is being
developed. The specifications of this new format will be described when it
is first implemented in TrEMBL.
3.2 Extension of the entry name format
We endeavor to assign meaningful entry names that facilitate the
identification of the proteins and the species of origin concerning an
entry. Currently the entry name consists of up to ten uppercase
alphanumeric characters. SWISS-PROT uses a general purpose naming
convention that can be symbolized as X_Y, where X is a mnemonic code of at
most 4 alphanumeric characters representing the protein name, the '_' sign
serves as a separator, and the Y is a mnemonic species identification code
of at most 5 alphanumeric characters representing the biological source of
the protein.
We are planning to extend the mnemonic code for the protein name from up
to 4 characters to up to 5 characters. E.g. the mnemonic code for the
meiotic recombination protein rec10 is currently 'RE10'. After the
introduction of extended entry names it could be modified to the 5-letter
code 'REC10'.
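Software that validates entry names may want to make the maximum length of the protein mnemonic configurable ahead of this change. The Python sketch below is one way to express the X_Y convention; entry_name_pattern and is_valid_entry_name are hypothetical names, and it assumes that both parts contain at least one character.

```python
import re


def entry_name_pattern(max_protein_chars=4):
    """Build a pattern for X_Y entry names with the given limit on X."""
    return re.compile(
        rf"[A-Z0-9]{{1,{max_protein_chars}}}_[A-Z0-9]{{1,5}}"
    )


def is_valid_entry_name(name, max_protein_chars=4):
    """Check an entry name against the current or the extended format."""
    return entry_name_pattern(max_protein_chars).fullmatch(name) is not None


# Current format: at most 4 characters before the '_' separator.
current_ok = is_valid_entry_name("GSA_ECOLI")

# Extended format: pass 5 to accept names such as 'REC10_...'.
# 'ABCDE' is a placeholder species code used only for this example.
extended_ok = is_valid_entry_name("REC10_ABCDE", max_protein_chars=5)
```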
3.3 Multiple RP lines
Starting with release 41, there can be more than one RP (Reference
Position) line per reference in a SWISS-PROT entry. The RP line describes
the extent of the work carried out by the authors of the reference, e.g.
the type of molecule that has been sequenced, the characterization of the
protein, characterization of PTMs, analysis of the protein structure,
detection of variants, etc.
As the number of experimental results per publication has increased over
the years, a single RP line per reference has more and more often been
insufficient to hold all the information in a consistent format. We
therefore decided to allow multiple RP lines.
Example:
RP SEQUENCE FROM N.A., PARTIAL SEQUENCE, AND CHARACTERIZATION.
could become
RP SEQUENCE FROM N.A., SEQUENCE OF 23-42 AND 351-365, AND
RP CHARACTERIZATION.
3.4 Cleaning up of comment line (CC) topics
We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while being human-readable). We are therefore standardizing the
format of the topics.
We are gradually cleaning up the comment line topic PATHWAY. To describe
the biochemical pathway in which the protein is involved, we use the
following format:
CC -!- PATHWAY: biochemical pathway; nth step[. Comment].
Example:
CC -!- PATHWAY: Coenzyme A (CoA) biosynthesis; first step.
The comment line topic COFACTOR will be modified gradually to the following
format:
CC -!- COFACTOR: cofactor1[, cofactor2 and cofactor3][. Comment].
Examples:
CC -!- COFACTOR: Magnesium.
CC -!- COFACTOR: Copper, Manganese, and Nickel.
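Once these topics follow the formats above, their content can be split into fields by a program. The Python sketch below illustrates this for PATHWAY and COFACTOR, assuming that the topic prefix has been removed and continuation lines have been joined. parse_pathway and parse_cofactors are hypothetical helpers, and the sketch assumes that pathway and cofactor names contain no period, semicolon or comma of their own.

```python
import re


def parse_pathway(text):
    """Split 'biochemical pathway; nth step[. Comment].' into its parts."""
    pathway, _, rest = text.partition(";")
    step, _, comment = rest.strip().partition(".")
    return {
        "pathway": pathway.strip(),
        "step": step.strip(),
        "comment": comment.strip().rstrip(".") or None,
    }


def parse_cofactors(text):
    """Return the list of cofactor names, ignoring any trailing comment."""
    names, _, _comment = text.partition(".")
    # Accept both 'A, B and C' and 'A, B, and C'.
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", names)
    return [part.strip() for part in parts if part.strip()]


pathway = parse_pathway("Coenzyme A (CoA) biosynthesis; first step.")
cofactors = parse_cofactors("Copper, Manganese, and Nickel.")
```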
3.5 Continuation of the conversion of SWISS-PROT to mixed-case
characters
We will continue to convert SWISS-PROT entries from all 'UPPER CASE' to
'MiXeD CaSe'. In release 41 we are planning to convert the GN (Gene Name)
line, the RC (Reference Comment) line topic STRAIN, and the CC (Comment)
line topics CATALYTIC ACTIVITY and PATHWAY.
Here is an example of what a SWISS-PROT entry will look like in release 41:
ID GSA_ECOLI STANDARD; PRT; 426 AA.
AC P23893; P78277;
DT 01-NOV-1991 (Rel. 20, Created)
DT 01-NOV-1997 (Rel. 35, Last sequence update)
DT 01-MAR-2002 (Rel. 41, Last annotation update)
DE Glutamate-1-semialdehyde 2,1-aminomutase (EC 5.4.3.8) (GSA)
DE (Glutamate-1-semialdehyde aminotransferase) (GSA-AT).
GN hemL or gsa or popC or B0154.
OS Escherichia coli.
OC Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
OC Escherichia.
OX NCBI_TaxID=562;
RN [1]
RP SEQUENCE FROM N.A.
RX MEDLINE=91155920; PubMed=1900346;
RA Grimm B., Bull A., Breu V.;
RT "Structural genes of glutamate 1-semialdehyde aminotransferase for
RT porphyrin synthesis in a cyanobacterium and Escherichia coli.";
RL Mol. Gen. Genet. 225:1-10(1991).
RN [2]
RP SEQUENCE FROM N.A.
RC STRAIN=K12 / W3110;
RX MEDLINE=94261430; PubMed=8202364;
RA Fujita N., Mori H., Yura T., Ishihama A.;
RT "Systematic sequencing of the Escherichia coli genome: analysis of
RT the 2.4-4.1 min (110,917-193,643 bp) region.";
RL Nucleic Acids Res. 22:1637-1639(1994).
RN [3]
RP SEQUENCE FROM N.A.
RC STRAIN=K12 / MG1655;
RX MEDLINE=97426617; PubMed=9278503;
RA Blattner F.R., Plunkett G. III, Bloch C.A., Perna N.T., Burland V.,
RA Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F.,
RA Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J.,
RA Mau B., Shao Y.;
RT "The complete genome sequence of Escherichia coli K-12.";
RL Science 277:1453-1474(1997).
RN [4]
RP SEQUENCE FROM N.A.
RA Schramm S., Duncan M., Allen E., Araujo R., Aparicio A., Chung E.,
RA Davis K., Federspiel N., Hyman R., Kalman S., Komp C., Kurdi O.,
RA Lashkari D., Lew H., Lin D., Namath A., Oefner P., Roberts D.,
RA Davis R.W.;
RL Submitted (SEP-1996) to the EMBL/GenBank/DDBJ databases.
RN [5]
RP CHARACTERIZATION.
RX MEDLINE=91258321; PubMed=2045363;
RA Ilag L.L., Jahn D., Eggertsson G., Soell D.;
RT "The Escherichia coli hemL gene encodes glutamate 1-semialdehyde
RT aminotransferase.";
RL J. Bacteriol. 173:3408-3413(1991).
RN [6]
RP MUTAGENESIS OF LYS-265.
RX MEDLINE=92353044; PubMed=1643048;
RA Ilag L.L., Jahn D.;
RT "Activity and spectroscopic properties of the Escherichia coli
RT glutamate 1-semialdehyde aminotransferase and the putative active
RT site mutant K265R.";
RL Biochemistry 31:7143-7151(1992).
CC -!- CATALYTIC ACTIVITY: (S)-4-amino-5-oxopentanoate =
CC 5-aminolevulinate.
CC -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC -!- PATHWAY: Porphyrin biosynthesis by the C5 pathway; second step.
CC -!- SUBUNIT: HOMODIMER.
CC -!- SUBCELLULAR LOCATION: CYTOPLASMIC (POTENTIAL).
CC -!- SIMILARITY: BELONGS TO CLASS-III OF PYRIDOXAL-PHOSPHATE-DEPENDENT
CC AMINOTRANSFERASES.
DR EMBL; X53696; CAA37734.1; -.
DR EMBL; D26562; CAB20274.1; -.
DR EMBL; AE000125; AAC73265.1; -.
DR EMBL; U70214; AAB08584.1; -.
DR PIR; S13327; S13327.
DR PIR; S45223; S45223.
DR HSSP; P24630; 2GSA.
DR EcoGene; EG10432; hemL.
DR InterPro; IPR000954; Aminotran_3.
DR Pfam; PF00202; aminotran_3; 1.
DR PROSITE; PS00600; AA_TRANSFER_CLASS_3; 1.
KW Porphyrin biosynthesis; Isomerase; Pyridoxal phosphate;
KW Complete proteome.
FT BINDING 265 265 PYRIDOXAL PHOSPHATE (PROBABLE).
FT MUTAGEN 265 265 K->R: 2% OF WILD-TYPE ACTIVITY.
FT CONFLICT 2 2 S -> R (IN REF. 1 AND 2).
FT CONFLICT 9 9 S -> Q (IN REF. 1 AND 2).
SQ SEQUENCE 426 AA; 45366 MW; BED817E100468CF2 CRC64;
MSKSENLYSA ARELIPGGVN SPVRAFTGVG GTPLFIEKAD GAYLYDVDGK AYIDYVGSWG
PMVLGHNHPA IRNAVIEAAE RGLSFGAPTE MEVKMAQLVT ELVPTMDMVR MVNSGTEATM
SAIRLARGFT GRDKIIKFEG CYHGHADCLL VKAGSGALTL GQPNSPGVPA DFAKYTLTCT
YNDLASVRAA FEQYPQEIAC IIVEPVAGNM NCVPPLPEFL PGLRALCDEF GALLIIDEVM
TGFRVALAGA QDYYGVVPDL TCLGKIIGGG MPVGAFGGRR DVMDALAPTG PVYQAGTLSG
NPIAMAAGFA CLNEVAQPGV HETLDELTTR LAEGLLEAAE EAGIPLVVNH VGGMFGIFFT
DAESVTCYQD VMACDVERFK RFFHMMLDEG VYLAPSAFEA GFMSVAHSME DINNTIDAAR
RVFAKL
//
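The SQ line of the entry above declares the sequence length, and that figure can be cross-checked against the sequence data that follows it. The Python sketch below is only an illustration and is not part of any SWISS-PROT software: the function name is hypothetical, and it assumes that the SQ line has the form 'SQ SEQUENCE <n> AA; ...' and that the sequence lines consist of residue blocks separated by spaces, as shown in the entry.

```python
import re

def check_sequence_length(entry_lines):
    """Return (declared, actual) residue counts for one entry.

    entry_lines: lines running from the SQ line to the '//' terminator.
    The function name and interface are hypothetical.
    """
    declared = None
    residues = []
    in_sequence = False
    for line in entry_lines:
        if not in_sequence and line.startswith("SQ"):
            # Assumed header layout: 'SQ SEQUENCE 426 AA; ...'
            match = re.search(r"SEQUENCE\s+(\d+)\s+AA;", line)
            if match is None:
                raise ValueError("unrecognised SQ line: %r" % line)
            declared = int(match.group(1))
            in_sequence = True
        elif line.startswith("//"):
            break  # end of entry
        elif in_sequence:
            # Drop the spaces that separate the blocks of ten residues.
            residues.append(line.replace(" ", "").strip())
    if declared is None:
        raise ValueError("no SQ line found")
    return declared, len("".join(residues))

# SQ block copied from the sample entry above.
SAMPLE_SQ_BLOCK = """\
SQ SEQUENCE 426 AA; 45366 MW; BED817E100468CF2 CRC64;
MSKSENLYSA ARELIPGGVN SPVRAFTGVG GTPLFIEKAD GAYLYDVDGK AYIDYVGSWG
PMVLGHNHPA IRNAVIEAAE RGLSFGAPTE MEVKMAQLVT ELVPTMDMVR MVNSGTEATM
SAIRLARGFT GRDKIIKFEG CYHGHADCLL VKAGSGALTL GQPNSPGVPA DFAKYTLTCT
YNDLASVRAA FEQYPQEIAC IIVEPVAGNM NCVPPLPEFL PGLRALCDEF GALLIIDEVM
TGFRVALAGA QDYYGVVPDL TCLGKIIGGG MPVGAFGGRR DVMDALAPTG PVYQAGTLSG
NPIAMAAGFA CLNEVAQPGV HETLDELTTR LAEGLLEAAE EAGIPLVVNH VGGMFGIFFT
DAESVTCYQD VMACDVERFK RFFHMMLDEG VYLAPSAFEA GFMSVAHSME DINNTIDAAR
RVFAKL
//
""".splitlines()

declared, actual = check_sequence_length(SAMPLE_SQ_BLOCK)
print("declared:", declared, "counted:", actual)
```

A mismatch between the two numbers would most likely indicate a truncated or damaged copy of the entry.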
4 Status of the documentation files
SWISS-PROT is distributed with a large number of documentation files. Some
of these files have been available for a long time (the user manual,
release notes, the various indices for authors, citations, keywords, etc.),
but many have been created recently and we are continuously adding new
files, and updating and modifying existing files. Please note that the
header of many documentation files has changed. The following table lists
all the documents that are currently available.
See also section 7.3 for information on how to access updated versions of
all documents in-between major releases.
USERMAN.TXT User manual
RELNOTES.TXT Release notes for the current release (40)
SHORTDES.TXT Short description of entries in SWISS-PROT [see 1]
JOURLIST.TXT List of cited journals
KEYWLIST.TXT List of keywords
PLASMID.TXT List of plasmids [see 2]
SPECLIST.TXT List of organism (species) identification codes
TISSLIST.TXT List of tissues
EXPERTS.TXT List of on-line experts for PROSITE and SWISS-PROT
DBXREF.TXT List of databases cross-referenced in SWISS-PROT [see 2]
SUBMIT.TXT Submission of sequence data to SWISS-PROT
ACINDEX.TXT Accession number index
AUTINDEX.TXT Authors index
CITINDEX.TXT Citation index
KEYINDEX.TXT Keywords index
SPEINDEX.TXT Species index
DELETEAC.TXT Deleted accession number index
7TMRLIST.TXT List of 7-transmembrane G-linked receptor entries [see 1]
AATRNASY.TXT List of aminoacyl-tRNA synthetases
ALLERGEN.TXT Nomenclature and index of allergen sequences
ANNBIOCH.TXT SWISS-PROT annotation: how is biochemical information
assigned to sequence entries
BLOODGRP.TXT Blood group antigen proteins
CALBICAN.TXT Index of Candida albicans entries and their corresponding
gene designations
CDLIST.TXT CD nomenclature for surface proteins of human leucocytes
CELEGANS.TXT Index of Caenorhabditis elegans entries and their
corresponding gene designations and WormPep
cross-references
DICTY.TXT Index of Dictyostelium discoideum entries and their
corresponding gene designations and DictyDB
cross-references
EC2DTOSP.TXT Index of Escherichia coli Gene-protein database
(ECO2DBASE) entries referenced in SWISS-PROT
ECOLI.TXT Index of Escherichia coli strain K12 chromosomal entries
and their corresponding EcoGene cross-references
EMBLTOSP.TXT Index of EMBL Nucleotide Sequence Database entries
referenced in SWISS-PROT
EXTRADOM.TXT Nomenclature of extracellular domains
FLY.TXT Index of Drosophila entries and their corresponding
FlyBase cross-references
GLYCOSID.TXT Classification of glycosyl hydrolase families and index of
glycosyl hydrolase entries in SWISS-PROT
HAEINFLU.TXT Index of Haemophilus influenzae strain Rd chromosomal
entries
HOXLIST.TXT Vertebrate homeotic Hox proteins: nomenclature and index
HPYLORI.TXT Index of Helicobacter pylori strain 26695 chromosomal
entries
HUMCHR01.TXT Index of proteins encoded on human chromosome 1 [see 2]
HUMCHR02.TXT Index of proteins encoded on human chromosome 2 [see 2]
HUMCHR03.TXT Index of proteins encoded on human chromosome 3 [see 2]
HUMCHR04.TXT Index of proteins encoded on human chromosome 4 [see 2]
HUMCHR05.TXT Index of proteins encoded on human chromosome 5 [see 2]
HUMCHR06.TXT Index of proteins encoded on human chromosome 6 [see 2]
HUMCHR07.TXT Index of proteins encoded on human chromosome 7 [see 2]
HUMCHR08.TXT Index of proteins encoded on human chromosome 8 [see 2]
HUMCHR09.TXT Index of proteins encoded on human chromosome 9 [see 2]
HUMCHR10.TXT Index of proteins encoded on human chromosome 10 [see 2]
HUMCHR11.TXT Index of proteins encoded on human chromosome 11 [see 2]
HUMCHR12.TXT Index of proteins encoded on human chromosome 12 [see 2]
HUMCHR13.TXT Index of proteins encoded on human chromosome 13
HUMCHR14.TXT Index of proteins encoded on human chromosome 14 [see 2]
HUMCHR15.TXT Index of proteins encoded on human chromosome 15 [see 2]
HUMCHR16.TXT Index of proteins encoded on human chromosome 16
HUMCHR17.TXT Index of proteins encoded on human chromosome 17
HUMCHR18.TXT Index of proteins encoded on human chromosome 18
HUMCHR19.TXT Index of proteins encoded on human chromosome 19
HUMCHR20.TXT Index of proteins encoded on human chromosome 20
HUMCHR21.TXT Index of proteins encoded on human chromosome 21
HUMCHR22.TXT Index of proteins encoded on human chromosome 22
HUMCHRX.TXT Index of proteins encoded on human chromosome X
HUMCHRY.TXT Index of proteins encoded on human chromosome Y
HUMPVAR.TXT Index of human proteins with sequence variants
INITFACT.TXT List and index of translation initiation factors
INTEIN.TXT Index of intein-containing entries referenced in
SWISS-PROT [see 2]
METALLO.TXT Classification of metallothioneins and index of the
entries in SWISS-PROT
MGDTOSP.TXT Index of MGD entries referenced in SWISS-PROT
MGENITAL.TXT Index of Mycoplasma genitalium strain G-37 chromosomal
entries
MIMTOSP.TXT Index of MIM entries referenced in SWISS-PROT
MJANNASC.TXT Index of Methanococcus jannaschii entries
NGR234.TXT Table of predicted proteins in Rhizobium plasmid pNGR234a
NOMLIST.TXT List of nomenclature related references for proteins
PCC6803.TXT Index of Synechocystis strain PCC 6803 entries
PDBTOSP.TXT Index of Protein Data Bank (PDB) entries referenced in
SWISS-PROT
PEPTIDAS.TXT Classification of peptidase families and index of
peptidase entries in SWISS-PROT
PLASTID.TXT List of chloroplast and cyanelle encoded proteins
POMBE.TXT Index of Schizosaccharomyces pombe entries and their
corresponding gene designations
RESTRIC.TXT List of restriction enzyme and methylase entries
RIBOSOMP.TXT Index of ribosomal proteins classified by families on the
basis of sequence similarities
RPROWAZE.TXT Index of Rickettsia prowazekii strain Madrid E entries
[see 2]
SALTY.TXT Index of Salmonella typhimurium strain LT2 chromosomal
entries and their corresponding StyGene cross-references
SUBTILIS.TXT Index of Bacillus subtilis strain 168 chromosomal entries
and their corresponding SubtiList cross-references
UPFLIST.TXT UPF (Uncharacterized Protein Families) list and index of
members
YEAST.TXT Index of Saccharomyces cerevisiae entries in SWISS-PROT
and their corresponding gene designations
YEAST1.TXT Yeast Chromosome I entries
YEAST2.TXT Yeast Chromosome II entries
YEAST3.TXT Yeast Chromosome III entries
YEAST5.TXT Yeast Chromosome V entries
YEAST6.TXT Yeast Chromosome VI entries
YEAST7.TXT Yeast Chromosome VII entries
YEAST8.TXT Yeast Chromosome VIII entries
YEAST9.TXT Yeast Chromosome IX entries
YEAST10.TXT Yeast Chromosome X entries
YEAST11.TXT Yeast Chromosome XI entries
YEAST13.TXT Yeast Chromosome XIII entries
YEAST14.TXT Yeast Chromosome XIV entries
Notes:
1 The '7TMRLIST.TXT' and 'SHORTDES.TXT' files have been converted to
mixed-case characters.
2 The 'DBXREF.TXT', 'HUMCHR01.TXT', 'HUMCHR02.TXT', 'HUMCHR03.TXT',
'HUMCHR04.TXT', 'HUMCHR05.TXT', 'HUMCHR06.TXT', 'HUMCHR07.TXT',
'HUMCHR08.TXT', 'HUMCHR09.TXT', 'HUMCHR10.TXT', 'HUMCHR11.TXT',
'HUMCHR12.TXT', 'HUMCHR14.TXT', 'HUMCHR15.TXT', 'INTEIN.TXT',
'PLASMID.TXT', and 'RPROWAZE.TXT' files are new documents introduced
since release 38.
We have continued to include in some SWISS-PROT documentation files the
references of Web sites relevant to the subject under consideration. There
are now 89 documents that include such links.
5 The ExPASy World-Wide Web server
5.1 Background information
The most efficient and user-friendly way to browse interactively in
SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use the
World-Wide Web (WWW) molecular biology server ExPASy. The ExPASy server was
made available to the public in September 1993 and is reachable at the
following address:
http://www.expasy.org/
The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT/TrEMBL, PROSITE, ENZYME, SWISS-2DPAGE,
SWISS-3DIMAGE and CD40Lbase databases and, through any SWISS-PROT protein
sequence entry, to other databases such as EMBL, Eco2DBASE, EcoCyc,
EcoGene, FlyBase, GCRDb, GlycoSuiteDB, MaizeDB, OMIM, PDB, HSSP, Pfam,
ProDom, REBASE, SGD, SubtiList, TRANSFAC, YPD, ZFIN and Medline. ExPASy
also offers many tools for the analysis of protein sequences and 2D gels.
There are currently five mirror sites of ExPASy, i.e. exact copies of the
server. The ExPASy mirror sites are located in:
Australia http://au.expasy.org/
at the Australian Proteome Analysis Facility (APAF), Sydney
Canada http://ca.expasy.org/
at the Canadian Bioinformatics Resource (CBR), Halifax
China http://cn.expasy.org/
at the Center of Bioinformatics, Peking University, Beijing
Korea http://kr.expasy.org/
at the Yonsei Proteome Research Center
Taiwan http://tw.expasy.org/
at the National Health Research Institutes (NHRI), Taipei
General, continuously updated documentation about the ExPASy server is
available at http://www.expasy.org/doc/expasy.pdf.
5.2 Swiss-Shop
We provide, on ExPASy, a service called Swiss-Shop
(http://www.expasy.org/swiss-shop/). Swiss-Shop is an automated sequence
alerting system which allows users to obtain, by email, new sequence
entries relevant to their field(s) of interest. Every week, the new
sequences entered in SWISS-PROT are automatically compared with all the
criteria that have been defined by the users. If a sequence corresponds to
the selection criteria defined by a user, that sequence is sent by
electronic mail. Various criteria can be combined:
* By entering one or more words that should be present in the
description line;
* By entering one or more species name(s) or taxonomic division(s);
* By entering one or more keywords;
* By entering one or more author names;
* By entering the accession number (or entry name) of a PROSITE pattern
or a user-defined sequence pattern. In this case, all new SWISS-PROT
entries matching this pattern will be reported;
* By entering the accession number (or entry name) of an existing
SWISS-PROT entry or by entering a 'private' sequence. In this case,
all new SWISS-PROT entries similar to that sequence will be reported.
5.3 What is new on ExPASy
ExPASy is constantly modified and improved. If you wish to be informed of
the changes made to the server you can either:
* Read the document 'History of changes, improvements and new features'
which is available at the address: http://www.expasy.org/history.html
* Subscribe to Swiss-Flash, a service that reports news of databases,
software and service developments. By subscribing to this service, you
will automatically get Swiss-Flash bulletins by electronic mail. To
subscribe, use the address: http://www.expasy.org/swiss-flash/
Among all the improvements and the new features introduced since the last
SWISS-PROT release, here are those that we believe are specifically useful
to SWISS-PROT users:
1. A new and improved version of the NiceProt view of SWISS-PROT is
available and offers the following new features: a link to a
printer-friendly view of a SWISS-PROT entry, display of the length of
certain features in the FT lines, and access to a new tool, the
'Feature aligner', which allows users to select features for submission to
the ClustalW multiple alignment program.
2. SWISS-PROT release statistics are now available for every update of
the database (http://www.expasy.org/sprot/relnotes/relstat.html).
Among other parameters, statistics about database growth, average
sequence lengths and amino acid composition, taxonomic origin, journal
citations and database cross-references are presented, including some
graphics.
3. A new view is available within the SRS Sequence Retrieval System.
It displays, for each protein corresponding to a user query, gene
name(s) and organism (in addition to the parameters ID, AC,
description and sequence length which are displayed by the default
view "Short description"). This new view is entitled "Long
description" and is available from the menu "Use view ..." in the SRS
query form.
4. The SIB Blast interface (accessible also via "Quick BLAST" or from
the bottom of every SWISS-PROT/TrEMBL entry) now offers the
possibility to restrict the similarity search by using taxonomic
criteria. A "Taxonomic View" of the results can also be obtained via
the BLAST result page. The user can also select a number of matching
sequences and directly submit them to a ClustalW search, or retrieve
and download the corresponding SWISS-PROT/TrEMBL entries. An
alternative view of the results, NiceBlast, is available, which
consists of an html table, detailing complete descriptions of all
matching proteins, including the full protein name, gene name,
sequence length and organism.
5. Explicit cross-references have been implemented between SWISS-PROT
and BLOCKS, GlycoSuiteDB, InterPro, Leproma, MEROPS, MypuList, SMART,
TubercuList, ANU-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, COMPLUYEAST-2DPAGE,
and Siena-2DPAGE. Implicit links have been added to the resources DIP,
GeneCensus, GeneLynx, HUGE and NucleaRDB.
6. A new tool has been added to the ExPASy suite of proteomics tools:
FindPept (http://www.expasy.org/tools/findpept.html) can identify
peptides that result from unspecific cleavage of proteins from their
experimental masses, taking into account artefactual chemical
modifications, post-translational modifications (PTM) and protease
autolytic cleavage. This new tool has been closely integrated with the
other proteomics tools on ExPASy, such as PeptIdent and FindMod.
7. The Sulfinator (http://www.expasy.org/tools/sulfinator/) is a newly
developed tool to predict tyrosine sulfation sites for a protein
sequence, using four different Hidden Markov Models (HMM).
8. Sequences of alternatively spliced isoforms of the same protein are
documented in the feature table of that protein sequence record. In
collaboration with the SWISS-PROT group at EBI, a program varsplic.pl
has been written to generate additional records from SWISS-PROT and
TrEMBL, one for each splice isoform of each protein. The resulting
data sets for SWISS-PROT and TrEMBL are available on the ExPASy ftp
server (ftp://ftp.expasy.org/databases/sp_tr_nrdb/), along with a more
detailed description of the project and information on how to obtain a
local copy of the varsplic.pl program.
The additional isoform entries have been added to the
SWISS-PROT/TrEMBL databases underlying the BLAST server at SIB
Switzerland, ScanProsite, and PeptIdent. Gradually, all other tools on
ExPASy will be modified to handle splice isoforms. The NiceProt view
of SWISS-PROT/TrEMBL provides links from the isoform name in the
feature table (example: Q01432) to a page displaying the sequence of
the corresponding isoform.
9. In the framework of the HAMAP project (see section 2.3), several
new features and tools have been implemented on ExPASy:
o The keyword "Complete Proteome" has been introduced to all
SWISS-PROT/TrEMBL entries describing a protein which is thought
to be expressed by an organism whose genome has been completely
sequenced. This keyword is so far only used for microbial
(bacterial and archaeal) proteins. A complete set of proteins
from a microbial genome can therefore be obtained using this
keyword across SWISS-PROT and TrEMBL.
o We provide clean non-redundant SWISS-PROT/TrEMBL data sets for
all completely sequenced microbial genomes. These files are
available on the ExPASy ftp server in SWISS-PROT and Fasta format
(ftp://ftp.expasy.org/databases/complete_proteomes/), and can
also be used for similarity searches on the SIB Blast server
("microbial proteomes").
o A Genomic Proximity Viewer is available for those microbial
genomes where an ORF numbering system exists. For those
organisms, it is possible to click on the ORF name in the
SWISS-PROT/TrEMBL GN lines to obtain a list of proteins encoded
by genes in proximity. The tool is also accessible from the HAMAP
complete proteome pages of those organisms. Example: Borrelia
burgdorferi,
http://www.expasy.org/cgi-bin/genomeview.pl?bn=BORBU.
10. A year ago we launched Protein Spotlight
(http://www.expasy.org/spotlight/), a periodical review centered on a
specific protein or group of proteins. It is published on a monthly
basis. You can subscribe to receive each issue, free of charge, in
HTML or PDF format.
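Item 9 above notes that the keyword "Complete Proteome" can be used to collect the proteins of a completely sequenced microbial genome. The Python sketch below shows one way this could be done on a local copy of the entries. It is an illustration only: the helper names are hypothetical, and it assumes that KW lines hold keywords separated by semicolons and closed by a period, as in the sample entry shown in these notes. The comparison ignores case because the keyword appears as "Complete proteome" in that entry.

```python
def keywords(entry_lines):
    """Collect the keywords found on the KW lines of one entry.

    Assumes keywords are separated by ';' and the list ends with '.'.
    """
    text = " ".join(
        line[2:].strip() for line in entry_lines if line[:2] == "KW"
    )
    text = text.rstrip(".")
    return [kw.strip() for kw in text.split(";") if kw.strip()]

def has_keyword(entry_lines, wanted="Complete proteome"):
    """True if the entry carries the wanted keyword (case-insensitive)."""
    wanted = wanted.lower()
    return any(kw.lower() == wanted for kw in keywords(entry_lines))

# KW lines taken from the sample entry in these notes.
sample = [
    "KW Porphyrin biosynthesis; Isomerase; Pyridoxal phosphate;",
    "KW Complete proteome.",
]
print(keywords(sample))
print(has_keyword(sample))
```

Applying has_keyword to every entry of both SWISS-PROT and TrEMBL, and then selecting one organism, would give the proteome set described in item 9.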
6 TrEMBL - a supplement to SWISS-PROT
The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
SWISS-PROT. Since we do not want to dilute the quality standards of
SWISS-PROT by incorporating sequences into the database without proper
sequence analysis and annotation, we cannot speed up the incorporation of
new incoming data indefinitely. But as we also want to make the sequences
available as fast as possible, we have introduced with SWISS-PROT a
computer-annotated supplement. This supplement consists of entries in
SWISS-PROT-like format derived from the translation of all coding sequences
(CDS) in the EMBL nucleotide sequence database, except those already
included in SWISS-PROT.
This supplement is named TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of SWISS-PROT. This SWISS-PROT release
is supplemented by TrEMBL release 18.
TrEMBL is available by FTP from the EBI and ExPASy servers in the directory
'databases/trembl'. It can be queried on WWW by the EBI and ExPASy SRS
servers. It is distributed with its own set of release notes.
7 FTP access to SWISS-PROT and TrEMBL
7.1 Generalities
SWISS-PROT is available for download on the following anonymous FTP
servers:
Organization Swiss Institute of Bioinformatics (SIB)
Address ftp.expasy.org, au.expasy.org/ftp/,
ca.expasy.org/ftp/, cn.expasy.org/ftp/,
kr.expasy.org/ftp/, tw.expasy.org/ftp/
Directory /databases/swiss-prot/
Organization European Bioinformatics Institute (EBI)
Address ftp.ebi.ac.uk
Directory /pub/databases/swissprot/
7.2 Non-redundant database
We distribute, on the ExPASy and EBI FTP servers, files that make up a
non-redundant (see further) and complete protein sequence database
consisting of three components:
1) SWISS-PROT
2) TrEMBL
3) New entries to be later integrated into TrEMBL (hereafter known as
TrEMBL_New)
Every week three files are completely rebuilt. These files are named:
sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz. As indicated by their
'.gz' extension, these are gzip-compressed files which, when decompressed,
will produce ASCII files in SWISS-PROT format.
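The compressed files do not need to be unpacked on disk before use. The Python sketch below reads one of them directly and splits it into entries. It is a sketch under stated assumptions, not a reference parser: the function names are hypothetical, the file is assumed to have been downloaded already, and entries are assumed to end with a '//' line as in the sample entry shown in these notes.

```python
import gzip

def iter_entries(path):
    """Yield each entry of a gzip-compressed SWISS-PROT-format file.

    Each entry is returned as a list of lines without the '//' terminator.
    """
    entry = []
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("//"):
                if entry:
                    yield entry
                entry = []
            else:
                entry.append(line)
    if entry:
        # Tolerate a final entry that lacks its terminator line.
        yield entry

def count_entries(path):
    """Count the entries in the file without keeping them in memory."""
    return sum(1 for _ in iter_entries(path))
```

With a local copy of the weekly file, count_entries("sprot.dat.gz") would return the number of entries it contains; the path is an example only.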
Three other files are also available (sprot.fas.gz, trembl.fas.gz and
trembl_new.fas.gz) which are compressed 'fasta' format sequence files
useful for building the databases used by FASTA, BLAST and other sequence
similarity search programs. Please do not use these files for any other
purpose, as you will lose all annotations by using this very 'primitive'
format.
The files for the non-redundant database are stored in the directory
'/databases/sp_tr_nrdb' on the ExPASy FTP server (ftp.expasy.org) and in
the directory '/pub/databases/sp_tr_nrdb' on the EBI FTP server
(ftp.ebi.ac.uk).
Additional notes:
* The SWISS-PROT file continuously grows as new annotated sequences are
added.
* The TrEMBL file decreases in size as sequences are moved out of that
section after being annotated and moved into SWISS-PROT. Four times a
year a new release of TrEMBL is built at EBI; at this point the TrEMBL
file increases in size as it then includes all of the new data (see
next section) that has accumulated since the last release.
* The TrEMBL_New file starts as a very small file and grows in size
until a new release of TrEMBL is available.
* SWISS-PROT and TrEMBL share the same system of accession numbers.
Therefore you will not find any primary accession number duplicated
between the two sections. A TrEMBL entry (and its associated accession
number(s)) can either move to SWISS-PROT as new entry or be merged
with an existing SWISS-PROT entry. In the latter case, the accession
number(s) of that TrEMBL entry are added to that of the SWISS-PROT
entry.
* TrEMBL_New does not have real accession numbers. However it was
necessary to have an 'AC' line so as to be able to use it with
different software products. This AC line contains a temporary
identifier which consists of the protein_ID (protein sequence
identifier) of the coding sequence in the parent nucleotide sequence.
* TrEMBL_New is quite messy! You will of course find new sequence
entries but you will also encounter sequences that are going to be
used to update existing TrEMBL or SWISS-PROT entries. None of the
"cleaning" steps that are applied to produce a TrEMBL release are run
on TrEMBL_New nor are any of the computer-annotation software tools
that are used to enhance the information content of TrEMBL. TrEMBL_New
is provided only so that users can be sure not to miss any important
new sequences when they run similarity searches.
* While these three files allow you to build what we call a
'non-redundant' database, it must be noted that this description is not
entirely accurate. Without going into a long explanation, we can say
that this is currently the best attempt at providing a complete
selection of protein sequence entries while trying to eliminate
redundancies. Although SWISS-PROT is completely (well, 99.994%!)
non-redundant, TrEMBL is far from being non-redundant, and the
combination of SWISS-PROT + TrEMBL is even less so.
* To describe to your users the version of the non-redundant database
that you are providing them with, you should use a statement of the
form:
SWISS-PROT release 40.0 of 17-Oct-2001;
TrEMBL release 18.0 of 22-Oct-2001;
TrEMBL_New of 22-Oct-2001.
7.3 Weekly updates of SWISS-PROT documents
Until now, the ExPASy FTP server offered the SWISS-PROT documents and
indexes only in their versions at the time of the last full release. All
documents are now updated with every weekly release of SWISS-PROT. They
are available for FTP download from the directory
/databases/swiss-prot/updated_doc/.
7.4 Weekly updates of SWISS-PROT
Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
are generated at each update:
new_seq.dat Contains all the new entries since the last full
release;
upd_seq.dat Contains the entries for which the sequence data has
been updated since the last release;
upd_ann.dat Contains the entries for which one or more annotation
fields have been updated since the last release.
Important notes
* Although we try to follow a regular schedule, we do not promise to
update these files every week. In most cases two weeks may elapse
between two updates.
* Instead of using the above files, you can, every week, download an
updated copy of the SWISS-PROT database. This file is available in the
directory containing the non-redundant database (see section 7.2).
8 ENZYME and PROSITE
8.1 The ENZYME nomenclature database
Release 27.0 of the ENZYME nomenclature database is distributed with
release 40 of SWISS-PROT. ENZYME release 27.0 contains information relative
to 3'870 enzymes. In this release, we have added a significant number of
new entries and we also updated many entries.
8.2 The PROSITE database
Release 17.0 of the PROSITE database will be available in a few weeks.
PROSITE will now come with its own set of release notes.
9 We need your help!
We welcome feedback from our users. We would especially appreciate it if
you notified us when you find that sequences belonging to your field of
expertise are missing from the database. We also would like to be notified about
annotations to be updated, if, for example, the function of a protein has
been clarified or if new information about post-translational modifications
has become available. To facilitate this feedback we offer, on the ExPASy
WWW server, a form that allows the submission of updates and/or corrections
to SWISS-PROT:
http://www.expasy.org/sprot/sp_update_form.html
It is also possible, from any entry in SWISS-PROT displayed by the ExPASy
server, to submit updates and/or corrections for that particular entry.
Finally, you can also send your comments by electronic mail to the address:
swiss-prot@expasy.org
Note that all update requests are assigned a unique identifier of the
form UR-Xnnnn (example: UR-A0123). This identifier is used internally by
the SWISS-PROT staff at SIB and EBI to track the fate of requests and is
also used in email exchanges with the persons who submitted a request.
APPENDIX A: Some statistics
A.1 Amino acid composition
A.1.1 Composition in percent for the complete database
Ala (A) 7.61     Gln (Q) 3.93     Leu (L) 9.53     Ser (S) 7.08
Arg (R) 5.19     Glu (E) 6.47     Lys (K) 5.97     Thr (T) 5.58
Asn (N) 4.36     Gly (G) 6.85     Met (M) 2.37     Trp (W) 1.21
Asp (D) 5.25     His (H) 2.24     Phe (F) 4.10     Tyr (Y) 3.16
Cys (C) 1.63     Ile (I) 5.85     Pro (P) 4.89     Val (V) 6.61
Asx (B) 0.000    Glx (Z) 0.000    Xaa (X) 0.01
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
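The ranking in A.1.2 follows directly from the percentages in A.1.1. As a small illustration (the variable names are arbitrary), the following Python sketch sorts the twenty standard amino acids by the percentages listed in A.1.1:

```python
# Percentages copied from table A.1.1 (standard amino acids only).
COMPOSITION = {
    "Ala": 7.61, "Gln": 3.93, "Leu": 9.53, "Ser": 7.08,
    "Arg": 5.19, "Glu": 6.47, "Lys": 5.97, "Thr": 5.58,
    "Asn": 4.36, "Gly": 6.85, "Met": 2.37, "Trp": 1.21,
    "Asp": 5.25, "His": 2.24, "Phe": 4.10, "Tyr": 3.16,
    "Cys": 1.63, "Ile": 5.85, "Pro": 4.89, "Val": 6.61,
}

# Most frequent first; no two values in the table are equal, so the
# order is unambiguous.
ranked = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
print(", ".join(ranked))
```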
A.2 Taxonomic origin
Total number of species represented in this release of SWISS-PROT: 7'188
The first twenty species represent 45'181 sequences: 44.5 % of the total
number of entries.
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 3396
2x: 1086
3x: 589
4x: 366
5x: 267
6x: 251
7x: 169
8x: 137
9x: 125
10x: 61
11- 20x: 308
21- 50x: 231
51-100x: 78
>100x: 124
A.2.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 7471 Homo sapiens (Human)
2 4859 Saccharomyces cerevisiae (Baker's yeast)
3 4816 Mus musculus (Mouse)
4 4741 Escherichia coli
5 3091 Rattus norvegicus (Rat)
6 2260 Bacillus subtilis
7 2184 Caenorhabditis elegans
8 1782 Schizosaccharomyces pombe (Fission yeast)
9 1769 Haemophilus influenzae
10 1514 Drosophila melanogaster (Fruit fly)
11 1472 Methanococcus jannaschii
12 1409 Arabidopsis thaliana (Mouse-ear cress)
13 1321 Mycobacterium tuberculosis
14 1295 Bos taurus (Bovine)
15 1004 Gallus gallus (Chicken)
16 883 Synechocystis sp. (strain PCC 6803)
17 872 Escherichia coli O157:H7
18 846 Salmonella typhimurium
19 798 Archaeoglobus fulgidus
20 794 Xenopus laevis (African clawed frog)
21 765 Sus scrofa (Pig)
22 680 Aquifex aeolicus
23 671 Oryctolagus cuniculus (Rabbit)
24 662 Mycoplasma pneumoniae
25 594 Pseudomonas aeruginosa
26 588 Treponema pallidum
27 557 Buchnera aphidicola (subsp. Acyrthosiphon pisum)
28 523 Rickettsia prowazekii
29 522 Helicobacter pylori (Campylobacter pylori)
30 505 Helicobacter pylori J99 (Campylobacter pylori J99)
31 503 Mycobacterium leprae
32 486 Mycoplasma genitalium
33 481 Zea mays (Maize)
34 450 Methanobacterium thermoautotrophicum
35 403 Rhizobium sp. (strain NGR234)
36 395 Borrelia burgdorferi (Lyme disease spirochete)
37 390 Oryza sativa (Rice)
38 387 Chlamydia trachomatis
39 375 Thermotoga maritima
40 374 Streptomyces coelicolor
41 371 Chlamydia pneumoniae (Chlamydophila pneumoniae)
42 368 Canis familiaris (Dog)
43 364 Chlamydia muridarum
44 356 Rhizobium meliloti (Sinorhizobium meliloti)
45 353 Vibrio cholerae
46 333 Nicotiana tabacum (Common tobacco)
47 323 Pasteurella multocida
48 322 Ovis aries (Sheep)
49 320 Pyrococcus horikoshii
50 311 Dictyostelium discoideum (Slime mold)
51 301 Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
52 284 Pyrococcus abyssi
53 276 Pisum sativum (Garden pea)
54 272 Bacteriophage T4
55 260 Staphylococcus aureus
56 256 Candida albicans (Yeast)
57 255 Neurospora crassa
58 254 Vaccinia virus (strain Copenhagen)
59 247 Triticum aestivum (Wheat)
60 247 Bacillus halodurans
61 244 Glycine max (Soybean)
62 243 Hordeum vulgare (Barley)
63 242 Aeropyrum pernix
64 241 Rhodobacter capsulatus (Rhodopseudomonas capsulata)
65 231 Pseudomonas putida
66 227 Lycopersicon esculentum (Tomato)
67 221 Cavia porcellus (Guinea pig)
68 220 Porphyra purpurea
69 219 Solanum tuberosum (Potato)
70 214 Spinacia oleracea (Spinach)
71 214 Klebsiella pneumoniae
72 213 Bacillus stearothermophilus
73 210 Neisseria meningitidis (serogroup B)
74 204 Neisseria meningitidis (serogroup A)
75 193 Human cytomegalovirus (strain AD169)
76 188 Campylobacter jejuni
77 187 Vaccinia virus (strain WR)
78 183 Deinococcus radiodurans
79 180 Agrobacterium tumefaciens
80 179 Sulfolobus solfataricus
81 179 Brachydanio rerio (Zebrafish) (Zebra danio)
82 173 Equus caballus (Horse)
83 171 Mesocricetus auratus (Golden hamster)
84 171 Chlamydomonas reinhardtii
85 170 Thermoplasma acidophilum
86 168 Emericella nidulans (Aspergillus nidulans)
87 158 Halobacterium sp. (strain NRC-1)
88 154 Autographa californica nuclear polyhedrosis virus (AcMNPV)
89 153 Cyanidium caldarium
90 152 Thermus aquaticus (subsp. thermophilus)
91 151 Marchantia polymorpha (Liverwort)
92 151 Cyanophora paradoxa
93 149 Xylella fastidiosa
94 148 Fowlpox virus (FPV)
95 148 Guillardia theta (Cryptomonas phi)
96 147 Synechococcus sp. (strain PCC 7942) (Anacystis nidulans R2)
97 147 Variola virus
98 143 Caulobacter crescentus
99 142 Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
100 142 Kluyveromyces lactis (Yeast)
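The figure quoted at the start of A.2 for the first twenty species can be reproduced from the table above. The following Python sketch is an illustration only; the total of 101'602 entries is the release 40.0 figure given in the Introduction.

```python
# Entry counts of the twenty most represented species (table A.2.2).
TOP_TWENTY = [
    7471, 4859, 4816, 4741, 3091, 2260, 2184, 1782, 1769, 1514,
    1472, 1409, 1321, 1295, 1004, 883, 872, 846, 798, 794,
]
TOTAL_ENTRIES = 101602  # release 40.0, from the Introduction

top_sum = sum(TOP_TWENTY)
share = 100.0 * top_sum / TOTAL_ENTRIES
print(top_sum, "sequences,", round(share, 1), "% of all entries")
```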
A.2.3 Taxonomic distribution of the sequences
Kingdom Sequences (% of the database)
Archaea 5032 ( 5%)
Bacteria 34782 ( 34%)
Eukaryota 53357 ( 53%)
Viruses 8431 ( 8%)
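The percentages in A.2.3 are the sequence counts divided by the size of the database and rounded to the nearest whole percent. A short Python sketch, given for illustration only:

```python
# Sequence counts per kingdom, copied from table A.2.3.
KINGDOM_COUNTS = {
    "Archaea": 5032,
    "Bacteria": 34782,
    "Eukaryota": 53357,
    "Viruses": 8431,
}

total = sum(KINGDOM_COUNTS.values())
percentages = {
    kingdom: round(100.0 * count / total)
    for kingdom, count in KINGDOM_COUNTS.items()
}
print(total, percentages)
```

The four counts add up to the total number of entries in the release, so every entry is assigned to exactly one of the four groups.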
A.3 Sequence size
A.3.1 Repartition of the sequences by size (excluding fragments)
 From   To     Number       From   To     Number
   1-  50        1950      1001-1100         915
  51- 100        7099      1101-1200         708
 101- 150       10484      1201-1300         471
 151- 200        9010      1301-1400         318
 201- 250        8978      1401-1500         268
 251- 300        8130      1501-1600         172
 301- 350        7894      1601-1700         150
 351- 400        7945      1701-1800         105
 401- 450        5869      1801-1900         116
 451- 500        5485      1901-2000          87
 501- 550        4190      2001-2100          47
 551- 600        2852      2101-2200          87
 601- 650        2249      2201-2300          89
 651- 700        1651      2301-2400          50
 701- 750        1457      2401-2500          48
 751- 800        1240          >2500         273
 801- 850         985
 851- 900         965
 901- 950         700
 951-1000         593
A.3.2 Longest and shortest sequences
The shortest sequence is GRWM_HUMAN (P24272) : 3 amino acids.
The longest sequence is NEBU_HUMAN (P20929) : 6669 amino acids.
A.4 Journal citations
Note: the following citation statistics reflect the number of distinct
journal citations.
Total number of journals cited in this release of SWISS-PROT: 1'190
A.4.1 Table of the frequency of journal citations
Journals cited 1x: 443
2x: 157
3x: 87
4x: 58
5x: 51
6x: 27
7x: 24
8x: 19
9x: 21
10x: 11
11- 20x: 83
21- 50x: 88
51-100x: 31
>100x: 90
A.4.2 List of the most cited journals in SWISS-PROT
Nb Citations Journal name
-- --------- -------------------------------------------------------------
1 8033 Journal of Biological Chemistry
2 4615 Proceedings of the National Academy of Sciences of the U.S.A.
3 3554 Nucleic Acids Research
4 3295 Journal of Bacteriology
5 3144 Gene
6 2492 FEBS Letters
7 2293 Biochemical and Biophysical Research Communications
8 2255 European Journal of Biochemistry
9 2144 Biochemistry
10 1998 The EMBO Journal
11 1894 Nature
12 1833 Biochimica et Biophysica Acta
13 1682 Journal of Molecular Biology
14 1503 Genomics
15 1477 Cell
16 1434 Molecular and Cellular Biology
17 1096 Biochemical Journal
18 1085 Molecular and General Genetics
19 1078 Plant Molecular Biology
20 1024 Science
21 982 Molecular Microbiology
22 814 Virology
23 808 Journal of Biochemistry
24 637 Human Molecular Genetics
25 592 Journal of Cell Biology
26 573 Journal of Virology
27 525 Human Mutation
28 520 Plant Physiology
29 518 Genes and Development
30 510 Yeast
31 505 Nature Genetics
32 494 Oncogene
33 486 Journal of General Virology
34 477 Infection and Immunity
35 461 Journal of Immunology
36 441 The American Journal of Human Genetics
37 424 Structure
38 420 Archives of Biochemistry and Biophysics
39 391 FEMS Microbiology Letters
40 366 Microbiology
41 358 Current Genetics
42 346 Development
43 333 Nature Structural Biology
44 331 Molecular and Biochemical Parasitology
45 320 Human Genetics
46 293 Genetics
47 280 Molecular Endocrinology
48 277 Journal of Clinical Investigation
49 270 Biological Chemistry Hoppe-Seyler
50 267 Applied and Environmental Microbiology
51 265 Blood
52 263 Journal of Molecular Evolution
53 253 Protein Science
54 249 DNA and Cell Biology
55 243 Developmental Biology
56 229 Journal of General Microbiology
57 224 Journal of Experimental Medicine
58 213 Neuron
59 213 Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
60 211 Cancer Research
61 210 Immunogenetics
62 208 Mammalian Genome
63 197 Endocrinology
64 182 Mechanisms of Development
65 180 DNA Sequence
66 170 Acta Crystallographica, Section D
67 164 The Plant Cell
68 161 Brain Research. Molecular Brain Research
69 159 Journal of Neurochemistry
70 158 Molecular Biology and Evolution
71 156 DNA
72 155 Molecular Biology of the Cell
73 147 The Plant Journal
74 146 Journal of Cell Science
75 145 Journal of Neuroscience
76 135 Comparative Biochemistry and Physiology
77 133 Bioscience, Biotechnology, and Biochemistry
78 130 Antimicrobial Agents and Chemotherapy
79 125 Biochimie
80 123 Virus Research
81 122 Bioorganicheskaia Khimiia
82 120 Molecular Pharmacology
83 117 Hemoglobin
84 116 The Journal of Clinical Endocrinology and Metabolism
85 113 Agricultural and Biological Chemistry
86 112 Cytogenetics and Cell Genetics
87 112 American Journal of Physiology
88 110 Molecular Plant-Microbe Interactions
89 105 Proteins
90 102 Peptides
91 100 DNA Research
A.5 Statistics for some line types
The following table gives, for selected SWISS-PROT line types, the total
number of lines, the number of entries containing at least one such line,
and the average number of such lines per entry, computed over all entries
in the release.
                                       Total  Number of    Average
Line type / subtype                   number    entries  per entry
---------------------------------   --------  ---------  ---------
References (RL)                       182326                  1.79
   Journal                            152419      89829       1.50
   Submitted to EMBL/GenBank/DDBJ      27607      24142       0.27
   Unpublished observations              500        496      <0.01
   Book citation                         438        428      <0.01
   Submitted to SWISS-PROT               437        435      <0.01
   Plant Gene Register                   385        378      <0.01
   Submitted to other databases          185        183      <0.01
   Thesis                                160        159      <0.01
   Unpublished results                   114        112      <0.01
   Patent                                 79         77      <0.01
   Worm Breeder's Gazette                  2          2      <0.01
Comments (CC)                         309232                  3.04
   SIMILARITY                          91246      81758       0.90
   FUNCTION                            61984      61049       0.61
   SUBCELLULAR LOCATION                42010      42010       0.41
   CATALYTIC ACTIVITY                  27896      26508       0.27
   SUBUNIT                             25865      25864       0.25
   PATHWAY                             11464      11431       0.11
   TISSUE SPECIFICITY                  10070      10070       0.10
   COFACTOR                             7811       7811       0.08
   MISCELLANEOUS                        6942       6352       0.07
   PTM                                  5829       5447       0.06
   INDUCTION                            2971       2971       0.03
   DEVELOPMENTAL STAGE                  2811       2811       0.03
   ALTERNATIVE PRODUCTS                 2755       2754       0.03
   DOMAIN                               2658       2471       0.03
   CAUTION                              2169       2099       0.02
   DISEASE                              1865       1620       0.02
   ENZYME REGULATION                    1473       1473       0.01
   MASS SPECTROMETRY                     548        506       0.01
   DATABASE                              503        465      <0.01
   POLYMORPHISM                          295        287      <0.01
   PHARMACEUTICAL                         38         38      <0.01
   BIOTECHNOLOGY                          29         29      <0.01
Features (FT)                         471213                  4.64
   DOMAIN                              76115      22381       0.75
   TRANSMEM                            64913      14473       0.64
   CARBOHYD                            40298       9840       0.40
   CONFLICT                            36638      12924       0.36
   DISULFID                            34856       9355       0.34
   METAL                               27931       6801       0.27
   CHAIN                               20956      16975       0.21
   VARIANT                             18980       3544       0.19
   ACT_SITE                            18495      11839       0.18
   REPEAT                              17543       3013       0.17
   SIGNAL                              12976      12975       0.13
   NP_BIND                             12514       8916       0.12
   MOD_RES                             11665       6503       0.11
   NON_TER                             10234       7849       0.10
   BINDING                              7710       6160       0.08
   TURN                                 7330        633       0.07
   STRAND                               7077        562       0.07
   ZN_FING                              5911       2061       0.06
   INIT_MET                             4892       4868       0.05
   HELIX                                4644        587       0.05
   VARSPLIC                             4211       2068       0.04
   SITE                                 4151       3019       0.04
   PROPEP                               3842       3488       0.04
   DNA_BIND                             3796       3589       0.04
   MUTAGEN                              2797        963       0.03
   LIPID                                2684       2174       0.03
   TRANSIT                              2300       2284       0.02
   PEPTIDE                              2202        830       0.02
   CA_BIND                              2106        840       0.02
   NON_CONS                              732        387       0.01
   UNSURE                                255        117      <0.01
   SIMILAR                               242        203      <0.01
   SE_CYS                                104         64      <0.01
   THIOETH                                90         31      <0.01
   THIOLEST                               23         23      <0.01
Cross-references (DR)                 718458                  7.07
   EMBL                               179318      95610       1.76
   InterPro                           128566      81051       1.27
   Pfam                               101086      77741       0.99
   PROSITE                             83189      53484       0.82
   PIR                                 47057      35789       0.46
   HSSP                                33548      33548       0.33
   PRINTS                              30494      27899       0.30
   SMART                               30434      22855       0.30
   ProDom                              16772      16337       0.17
   PDB                                 10380       3124       0.10
   TIGR                                 9378       9343       0.09
   MIM                                  6755       6024       0.07
   SGD                                  4903       4849       0.05
   MGD                                  4408       4397       0.04
   EcoGene                              4134       4132       0.04
   Mendel                               3041       2942       0.03
   MEROPS                               2348       2260       0.02
   SubtiList                            2234       2233       0.02
   WormPep                              2071       2034       0.02
   FlyBase                              1936       1883       0.02
   GCRDb                                1661        972       0.02
   TRANSFAC                             1612       1494       0.02
   TubercuList                          1350       1313       0.01
   StyGene                               799        798       0.01
   SWISS-2DPAGE                          746        745       0.01
   Leproma                               501        497      <0.01
   MaizeDB                               402        398      <0.01
   HIV                                   370        354      <0.01
   REBASE                                352        347      <0.01
   ECO2DBASE                             351        299      <0.01
   DictyDb                               313        310      <0.01
   GlycoSuiteDB                          249        249      <0.01
   ZFIN                                  154        154      <0.01
   YEPD                                  129        120      <0.01
   Aarhus/Ghent-2DPAGE                   128         98      <0.01
   PHCI-2DPAGE                           128        128      <0.01
   Siena-2DPAGE                          104        104      <0.01
   HSC-2DPAGE                             85         85      <0.01
   COMPLUYEAST-2DPAGE                     50         50      <0.01
   CarbBank                               41         21      <0.01
   Maize-2DPAGE                           39         39      <0.01
   PMMA-2DPAGE                            26         26      <0.01
   MypuList                               21         21      <0.01
   ANU-2DPAGE                             13         13      <0.01
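The averages in the table above appear to be the total number of lines
divided by the total number of entries in the release (101'602, as given
in the Introduction), rounded to two decimals, with values that round to
zero shown as "<0.01". The following Python sketch reproduces that
arithmetic; the function name `average_per_entry` and the rounding rule
are assumptions inferred from the published figures, not a description
of the actual program used to build the table.

```python
# Number of entries in release 40.0, as stated in the Introduction.
TOTAL_ENTRIES = 101_602


def average_per_entry(total_lines, entries=TOTAL_ENTRIES):
    """Format the average number of lines per entry as in the table.

    Assumption: the divisor is the number of entries in the whole
    release, not the number of entries that carry the line type.
    """
    text = f"{total_lines / entries:.2f}"
    # Values that round down to zero are printed as "<0.01".
    return "<0.01" if text == "0.00" else text


# Totals taken from the table: RL, CC, FT and DR lines.
averages = {
    "References (RL)": average_per_entry(182326),
    "Comments (CC)": average_per_entry(309232),
    "Features (FT)": average_per_entry(471213),
    "Cross-references (DR)": average_per_entry(718458),
}
```

Under this assumption the four line-type totals give 1.79, 3.04, 4.64
and 7.07, which match the published averages.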
A.6 Miscellaneous statistics
Total number of distinct authors cited in SWISS-PROT: 146'936
Total number of entries encoded on a chloroplast : 2'609
Total number of entries encoded on a mitochondrion : 2'262
Total number of entries encoded on a cyanelle : 145
Total number of entries encoded on a plasmid : 2'344
Number of additional sequences encoded by splice variants : 3'505
--End of document--