UniProt Knowledgebase
Swiss-Prot Protein Knowledgebase
TrEMBL Protein Database

Release notes
UniProt release 5.0 of 10-May-2005

Content

  Introduction
  UniProt/Swiss-Prot Protein Knowledgebase release statistics
  UniProt/TrEMBL Protein Database release statistics

  Submissions and Updates
  Download information
  Contact
  Citation

  Related documents: UniProt user manual, Recent changes, Forthcoming changes.

Introduction

Release 5.0 of the UniProt Knowledgebase is composed of the UniProt/Swiss-Prot Protein Knowledgebase release 47.0 and the UniProt/TrEMBL Protein Database release 30.0.

More information on these databases can be found in the user manual section "What is the UniProt Knowledgebase?".


UniProt/Swiss-Prot protein knowledgebase release 47.0 statistics

Release 47.0 of 10-May-2005 of Swiss-Prot contains 181'577 sequence entries, comprising 65'746'672 amino acids abstracted from 128'440 references.

The growth of the database is summarized below.

Release  Date   Number of entries  Number of amino acids
-------  -----  -----------------  ---------------------
    2.0  09/86              3'939                900'163
    3.0  11/86              4'160                969'641
    4.0  04/87              4'387              1'036'010
    5.0  09/87              5'205              1'327'683
    6.0  01/88              6'102              1'653'982
    7.0  04/88              6'821              1'885'771
    8.0  08/88              7'724              2'224'465
    9.0  11/88              8'702              2'498'140
   10.0  03/89             10'008              2'952'613
   11.0  07/89             10'856              3'265'966
   12.0  10/89             12'305              3'797'482
   13.0  01/90             13'837              4'347'336
   14.0  04/90             15'409              4'914'264
   15.0  08/90             16'941              5'486'399
   16.0  11/90             18'364              5'986'949
   17.0  02/91             20'024              6'524'504
   18.0  05/91             20'772              6'792'034
   19.0  08/91             21'795              7'173'785
   20.0  11/91             22'654              7'500'130
   21.0  03/92             23'742              7'866'596
   22.0  05/92             25'044              8'375'696
   23.0  08/92             26'706              9'011'391
   24.0  12/92             28'154              9'545'427
   25.0  04/93             29'955             10'214'020
   26.0  07/93             31'808             10'875'091
   27.0  10/93             33'329             11'484'420
   28.0  02/94             36'000             12'496'420
   29.0  06/94             38'303             13'464'008
   30.0  10/94             40'292             14'147'368
   31.0  02/95             43'470             15'335'248
   32.0  11/95             49'340             17'385'503
   33.0  02/96             52'205             18'531'384
   34.0  10/96             59'021             21'210'389
   35.0  11/97             69'113             25'083'768
   36.0  07/98             74'019             26'840'295
   37.0  12/98             77'977             28'268'293
   38.0  07/99             80'000             29'085'965
   39.0  05/00             86'593             31'411'114
   40.0  10/01            101'602             37'315'215
   41.0  02/03            122'564             44'986'459
   42.0  10/03            135'850             50'046'799
   43.0  03/04            146'720             54'093'154
   44.0  07/04            153'871             56'608'159
   45.0  10/04            163'235             59'631'787
   46.0  02/05            168'297             61'443'278
   47.0  05/05            181'577             65'746'672
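
For readers who want to process the growth figures programmatically, the following Python sketch parses rows that use the apostrophe as a thousands separator and derives the net change in entries between consecutive releases. The names `Release`, `parse_row` and `net_growth` are hypothetical and chosen for illustration only; the three sample rows are copied from the table. Note that a net change is not the same as the number of sequences added, since entries can also be deleted.

```python
from typing import NamedTuple


class Release(NamedTuple):
    release: str
    date: str          # MM/YY, as printed in the table
    entries: int
    amino_acids: int


def parse_row(line: str) -> Release:
    """Parse one whitespace-separated row such as "2.0 09/86 3'939 900'163"."""
    release, date, entries, amino_acids = line.split()
    # The apostrophe is only a thousands separator, so it can be dropped.
    return Release(release, date,
                   int(entries.replace("'", "")),
                   int(amino_acids.replace("'", "")))


def net_growth(rows: list[Release]) -> list[tuple[str, int]]:
    """Return (release, net entries gained since the previous release) pairs."""
    return [(cur.release, cur.entries - prev.entries)
            for prev, cur in zip(rows, rows[1:])]


rows = [parse_row(line) for line in (
    "2.0 09/86 3'939 900'163",
    "3.0 11/86 4'160 969'641",
    "4.0 04/87 4'387 1'036'010",
)]
print(net_growth(rows))  # [('3.0', 221), ('4.0', 227)]
```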

In rare cases, Swiss-Prot entries are removed. Deleted entries are almost exclusively Open Reading Frames (ORFs) that have been wrongly predicted to code for proteins. When there is enough evidence that these hypothetical proteins are not real, we remove them from Swiss-Prot. The document delac_sp.txt lists all accession numbers that were previously present in UniProt/Swiss-Prot but have since been deleted from the database.


Status of the model organisms

We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:

From our efforts to annotate human sequence entries as completely as possible arose the HPI project, and the bacterial model organisms became the focus of the HAMAP project. Here is the current status of the model organisms which are not covered by these two projects:

Organism        Database cross-references  Index file    Number of sequences
--------------  -------------------------  ------------  -------------------
A.thaliana      None yet                   arath.txt                   3'288
C.albicans      None yet                   calbican.txt                  333
C.elegans       Wormpep                    celegans.txt                2'651
D.discoideum    DictyBase                  dicty.txt                     324
D.melanogaster  FlyBase                    fly.txt                     2'226
M.musculus      MGD                        mgdtosp.txt                 9'228
S.cerevisiae    SGD                        yeast.txt                   5'090
S.pombe         GeneDB_SPombe              pombe.txt                   2'778

UniProt/Swiss-Prot release statistics

1.  INTRODUCTION

Release 47.0 of 10-May-2005 of Swiss-Prot contains 181'577 sequence entries,
comprising 65'746'672 amino acids abstracted from 128'440 references. 

11'531 sequences have been added since release 46, the sequence data of
841 existing entries has been updated and the annotations of
166'572 entries have been revised. This represents an increase of 6%.


2.  AMINO ACID COMPOSITION

   2.1  Composition in percent for the complete database

   Ala (A) 7.84   Gln (Q) 3.94   Leu (L) 9.64   Ser (S) 6.85
   Arg (R) 5.34   Glu (E) 6.61   Lys (K) 5.91   Thr (T) 5.44
   Asn (N) 4.18   Gly (G) 6.95   Met (M) 2.38   Trp (W) 1.15
   Asp (D) 5.31   His (H) 2.28   Phe (F) 4.00   Tyr (Y) 3.06
   Cys (C) 1.54   Ile (I) 5.91   Pro (P) 4.83   Val (V) 6.73

   Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.01


   2.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Lys, Ile, Thr, Arg, Asp, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
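
   The ranking in 2.2 follows from sorting the percentages in 2.1 in
   descending order. A minimal Python sketch is given below; the variable
   names are illustrative. Lys and Ile are tied at 5.91 when rounded to two
   decimals, so their relative order here simply follows the order in which
   they are entered, which is an assumption made to match the listing above.

```python
# Composition in percent, copied from section 2.1 in table reading order.
composition = {
    "Ala": 7.84, "Gln": 3.94, "Leu": 9.64, "Ser": 6.85,
    "Arg": 5.34, "Glu": 6.61, "Lys": 5.91, "Thr": 5.44,
    "Asn": 4.18, "Gly": 6.95, "Met": 2.38, "Trp": 1.15,
    "Asp": 5.31, "His": 2.28, "Phe": 4.00, "Tyr": 3.06,
    "Cys": 1.54, "Ile": 5.91, "Pro": 4.83, "Val": 6.73,
}

# sorted() is stable, also with reverse=True, so tied values (Lys and Ile)
# keep the order in which they appear in the dictionary.
ranking = [name for name, _ in
           sorted(composition.items(), key=lambda kv: kv[1], reverse=True)]

print(", ".join(ranking))
```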


3.  TAXONOMIC ORIGIN

   Total number of species represented in this release of Swiss-Prot: 9212

   The first twenty species represent 64219 sequences:  35.4 % of the total
   number of entries.
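
   The figure above can be checked against the first twenty frequencies
   listed in table 3.2 and the total number of entries in this release.
   The following Python sketch does this; the variable names are
   illustrative.

```python
# Total number of Swiss-Prot entries in release 47.0.
TOTAL_ENTRIES = 181_577

# Frequencies of the twenty most represented species, copied from table 3.2.
top20 = [
    12202, 9228, 5090, 4842, 4300, 3288, 2778, 2777, 2651, 2226,
    1782, 1773, 1738, 1562, 1500, 1412, 1400, 1383, 1157, 1130,
]

covered = sum(top20)
share = 100 * covered / TOTAL_ENTRIES

print(covered, f"{share:.1f} %")  # 64219 35.4 %
```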


   3.1 Table of the frequency of occurrence of species

        Species represented 1x: 4395
                            2x: 1441
                            3x:  721
                            4x:  464
                            5x:  318
                            6x:  275
                            7x:  190
                            8x:  160
                            9x:  135
                           10x:   74
                       11- 20x:  376
                       21- 50x:  298
                       51-100x:  108
                         >100x:  257


   3.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1      12202  Homo sapiens (Human)
       2       9228  Mus musculus (Mouse)
       3       5090  Saccharomyces cerevisiae (Baker's yeast)
       4       4842  Escherichia coli
       5       4300  Rattus norvegicus (Rat)
       6       3288  Arabidopsis thaliana (Mouse-ear cress)
       7       2778  Schizosaccharomyces pombe (Fission yeast)
       8       2777  Bacillus subtilis
       9       2651  Caenorhabditis elegans
      10       2226  Drosophila melanogaster (Fruit fly)
      11       1782  Methanococcus jannaschii
      12       1773  Haemophilus influenzae
      13       1738  Escherichia coli O157:H7
      14       1562  Bos taurus (Bovine)
      15       1500  Salmonella typhimurium
      16       1412  Escherichia coli O6
      17       1400  Mycobacterium tuberculosis
      18       1383  Shigella flexneri
      19       1157  Gallus gallus (Chicken)
      20       1130  Mycobacterium bovis
      21       1087  Salmonella typhi
      22       1019  Pseudomonas aeruginosa
      23        960  Synechocystis sp. (strain PCC 6803)
      24        960  Archaeoglobus fulgidus
      25        958  Sus scrofa (Pig)
      26        945  Xenopus laevis (African clawed frog)
      27        816  Rhizobium meliloti (Sinorhizobium meliloti)
      28        803  Vibrio cholerae
      29        791  Yersinia pestis
      30        760  Oryctolagus cuniculus (Rabbit)
      31        745  Aquifex aeolicus
      32        687  Mycoplasma pneumoniae
      33        686  Pasteurella multocida
      34        639  Vibrio parahaemolyticus
      35        639  Streptomyces coelicolor
      36        624  Bacillus halodurans
      37        618  Mycobacterium leprae
      38        607  Treponema pallidum
      39        589  Vibrio vulnificus
      40        579  Canis familiaris (Dog)
      41        577  Methanobacterium thermoautotrophicum
      42        577  Anabaena sp. (strain PCC 7120)
      43        572  Buchnera aphidicola (subsp. Acyrthosiphon pisum) 
      44        565  Staphylococcus aureus (strain Mu50 / ATCC 700699)
      45        563  Helicobacter pylori (Campylobacter pylori)
      46        562  Staphylococcus aureus (strain N315)
      47        561  Buchnera aphidicola (subsp. Schizaphis graminum)
      48        546  Rickettsia prowazekii
      49        545  Staphylococcus aureus (strain MW2)
      50        544  Helicobacter pylori J99 (Campylobacter pylori J99)
      51        532  Pseudomonas putida (strain KT2440)
      52        528  Pseudomonas syringae (pv. tomato)
      53        522  Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
      54        520  Vibrio vulnificus (strain YJ016)
      55        517  Zea mays (Maize)
      56        515  Staphylococcus epidermidis
      57        507  Buchnera aphidicola (subsp. Baizongia pistaciae)
      58        506  Ralstonia solanacearum (Pseudomonas solanacearum)
      59        505  Bacillus anthracis
      60        505  Agrobacterium tumefaciens (strain C58 / ATCC 33970)
      61        500  Listeria monocytogenes
      62        500  Bradyrhizobium japonicum
      63        496  Listeria innocua
      64        495  Rhizobium loti (Mesorhizobium loti)
      65        487  Xanthomonas campestris (pv. campestris)
      66        486  Mycoplasma genitalium
      67        482  Neisseria meningitidis (serogroup B)
      68        482  Neisseria meningitidis (serogroup A)
      69        481  Oryza sativa (Rice)
      70        479  Clostridium acetobutylicum
      71        467  Caulobacter crescentus
      72        463  Thermotoga maritima
      73        450  Xanthomonas axonopodis (pv. citri)
      74        445  Streptococcus pneumoniae
      75        444  Photorhabdus luminescens (subsp. laumondii)
      76        440  Shewanella oneidensis
      77        440  Xylella fastidiosa
      78        439  Deinococcus radiodurans
      79        438  Pan troglodytes (Chimpanzee)
      80        434  Brachydanio rerio (Zebrafish) (Danio rerio)
      81        433  Bacillus cereus (strain ATCC 14579 / DSM 31)
      82        432  Pyrococcus horikoshii
      83        431  Chlamydia trachomatis
      84        428  Xylella fastidiosa (strain Temecula1 / ATCC 700964)
      85        427  Pyrococcus abyssi
      86        419  Methanosarcina acetivorans
      87        417  Borrelia burgdorferi (Lyme disease spirochete)
      88        417  Brucella suis
      89        417  Clostridium perfringens
      90        416  Brucella melitensis
      91        415  Corynebacterium glutamicum (Brevibacterium flavum)
      92        412  Chlamydia pneumoniae (Chlamydophila pneumoniae)
      93        404  Oceanobacillus iheyensis
      94        404  Rhizobium sp. (strain NGR234)
      95        403  Staphylococcus aureus (strain MRSA252)
      96        402  Chlamydia muridarum
      97        402  Methanosarcina mazei (Methanosarcina frisia)
      98        401  Halobacterium sp. (strain NRC-1 / ATCC 700922 / JCM 11081)
      99        400  Staphylococcus aureus (strain MSSA476)
     100        390  Pyrococcus furiosus
     101        386  Thermoanaerobacter tengcongensis
     102        382  Lactobacillus plantarum
     103        381  Ovis aries (Sheep)
     104        381  Sulfolobus solfataricus
     105        380  Campylobacter jejuni
     106        380  Neurospora crassa
     107        371  Streptococcus pyogenes
     108        369  Streptococcus pneumoniae (strain ATCC BAA-255 / R6)
     109        368  Nicotiana tabacum (Common tobacco)
     110        364  Rickettsia conorii
     111        361  Streptococcus mutans
     112        357  Synechococcus elongatus (Thermosynechococcus elongatus)
     113        345  Pongo pygmaeus (Orangutan)
     114        342  Chlorobium tepidum
     115        338  Enterococcus faecalis (Streptococcus faecalis)
     116        337  Bordetella bronchiseptica (Alcaligenes bronchisepticus)
     117        336  Macaca fascicularis (Crab eating macaque) (Cynomolgus monkey)
     118        335  Aeropyrum pernix
     119        333  Candida albicans (Yeast)
     120        333  Bordetella pertussis
     121        328  Streptomyces avermitilis
     122        327  Bordetella parapertussis
     123        327  Haemophilus ducreyi
     124        327  Streptococcus pyogenes (serotype M18)
     125        325  Chromobacterium violaceum
     126        324  Dictyostelium discoideum (Slime mold)
     127        323  Streptococcus pyogenes (serotype M3)
     128        321  Staphylococcus aureus
     129        320  Methanopyrus kandleri
     130        310  Corynebacterium efficiens
     131        307  Pisum sativum (Garden pea)
     132        304  Sulfolobus tokodaii
     133        300  Yersinia pseudotuberculosis
     134        296  Leptospira interrogans
     135        293  Nitrosomonas europaea
     136        291  Thermoplasma acidophilum
     137        283  Triticum aestivum (Wheat)
     138        282  Streptococcus agalactiae (serotype V)
     139        281  Streptococcus agalactiae (serotype III)
     140        278  Fusobacterium nucleatum (subsp. nucleatum)
     141        272  Hordeum vulgare (Barley)
     142        268  Lycopersicon esculentum (Tomato)
     143        268  Bacteriophage T4
     144        266  Glycine max (Soybean)
     145        261  Cavia porcellus (Guinea pig)
     146        261  Gloeobacter violaceus
     147        260  Bacillus cereus (strain ATCC 10987)
     148        257  Thermoplasma volcanium
     149        256  Solanum tuberosum (Potato)
     150        256  Pyrobaculum aerophilum
     151        254  Rhodobacter capsulatus (Rhodopseudomonas capsulata)
     152        254  Vaccinia virus (strain Copenhagen) (VACV)
     153        254  Synechococcus sp. (strain WH8102)
     154        250  Pseudomonas putida
     155        247  Prochlorococcus marinus (strain MIT 9313)
     156        245  Prochlorococcus marinus
     157        244  Coxiella burnetii
     158        243  Kluyveromyces lactis (Yeast)
     159        242  Spinacia oleracea (Spinach)
     160        242  Macaca mulatta (Rhesus macaque)
     161        242  Clostridium tetani
     162        241  Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
     163        241  Erwinia carotovora (subsp. atroseptica) (Pectobacterium atrosepticum)
     164        236  Bacteroides thetaiotaomicron
     165        233  Bacillus stearothermophilus
     166        233  Prochlorococcus marinus subsp. pastoris (strain CCMP 1378 / MED4)
     167        231  Rhodopseudomonas palustris
     168        228  Photobacterium profundum (Photobacterium sp. (strain SS9))
     169        225  Wolinella succinogenes
     170        225  Wigglesworthia glossinidia brevipalpis
     171        224  Equus caballus (Horse)
     172        224  Chlamydophila caviae
     173        220  Porphyra purpurea
     174        220  Ashbya gossypii (Yeast) (Eremothecium gossypii)
     175        214  Leptospira interrogans (serogroup Icterohaemorrhagiae / serovar Copenhageni)
     176        213  Chlamydomonas reinhardtii
     177        212  Bifidobacterium longum
     178        209  Klebsiella pneumoniae
     179        205  Listeria monocytogenes (serotype 4b / strain F2365)
     180        204  Porphyromonas gingivalis (Bacteroides gingivalis)
     181        204  Rhodopirellula baltica
     182        203  Mycobacterium paratuberculosis
     183        200  Acinetobacter sp. (strain ADP1)
     184        200  Vaccinia virus (strain Western Reserve / WR) (VACV)


   
   3.3  Taxonomic distribution of the sequences

   Kingdom        sequences (% of the database)
    Archaea            9277 (  5%)
    Bacteria          82443 ( 45%)
    Eukaryota         80554 ( 44%)
    Viruses            9303 (  5%)


   Within Eukaryota:

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                  12203 ( 15%)           (  7%)
     Other Mammalia         23783 ( 30%)           ( 13%)
     Other Vertebrata        7207 (  9%)           (  4%)
     Viridiplantae          12609 ( 16%)           (  7%)
     Fungi                  11668 ( 14%)           (  6%)
     Insecta                 4327 (  5%)           (  2%)
     Nematoda                2930 (  4%)           (  2%)
     Other                   5827 (  7%)           (  3%)
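
   The kingdom-level percentages in this section are each count divided by
   the total number of entries, rounded to a whole percent. A short Python
   sketch reproducing them is given below; the names are illustrative, and
   rounding to the nearest integer is an assumption that happens to match
   the printed values.

```python
# Sequence counts per kingdom, copied from the kingdom table in section 3.3.
kingdoms = {
    "Archaea": 9277,
    "Bacteria": 82443,
    "Eukaryota": 80554,
    "Viruses": 9303,
}

total = sum(kingdoms.values())

# Share of the database per kingdom, rounded to a whole percent.
percentages = {name: round(100 * count / total)
               for name, count in kingdoms.items()}

for name, count in kingdoms.items():
    print(f"{name:<12}{count:>10} ({percentages[name]:>3}%)")
```

   Because each value is rounded independently, the printed percentages
   need not add up to exactly 100.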


4.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50    3796             1001-1100     1494
                 51- 100   12862             1101-1200     1068
                101- 150   18457             1201-1300      767
                151- 200   17580             1301-1400      591
                201- 250   18107             1401-1500      454
                251- 300   15445             1501-1600      290
                301- 350   16195             1601-1700      216
                351- 400   14600             1701-1800      162
                401- 450   11251             1801-1900      177
                451- 500    9483             1901-2000      141
                501- 550    7016             2001-2100       86
                551- 600    4881             2101-2200      131
                601- 650    4022             2201-2300      115
                651- 700    2888             2301-2400       76
                701- 750    2437             2401-2500       63
                751- 800    2037             >2500          461
                801- 850    1633
                851- 900    1804
                901- 950    1267
                951-1000    1040


   The average sequence length in Swiss-Prot is 362 amino acids.

   The shortest sequence is   GWA_SEPOF (P83570):     2 amino acids.
   The longest sequence is  SYNE1_HUMAN (Q8NF91):  8797 amino acids.
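
   The average length quoted above is consistent with dividing the total
   number of amino acids by the total number of entries, as the Python
   sketch below shows. That the published figure was computed in exactly
   this way is an assumption; the variable names are illustrative.

```python
# Totals for Swiss-Prot release 47.0, as stated in the introduction.
total_amino_acids = 65_746_672
total_entries = 181_577

average_length = total_amino_acids / total_entries

print(round(average_length))  # 362
```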


5.  JOURNAL CITATIONS

   Note: the following citation statistics reflect the number of distinct
         journal citations.

   Total number of journals cited in this release of Swiss-Prot: 1579


   5.1 Table of the frequency of journal citations

        Journals cited 1x:  570
                       2x:  219
                       3x:  108
                       4x:   74
                       5x:   58
                       6x:   31
                       7x:   38
                       8x:   27
                       9x:   22
                      10x:   14
                  11- 20x:  123
                  21- 50x:  127
                  51-100x:   55
                    >100x:  113


   5.2  List of the most cited journals in Swiss-Prot

   Nb    Citations   Journal name
   --    ---------   -------------------------------------------------------------
    1        11906   Journal of Biological Chemistry
    2         6037   Proceedings of the National Academy of Sciences of the U.S.A.
    3         4124   Journal of Bacteriology
    4         3852   Gene
    5         3833   Nucleic Acids Research
    6         3227   Biochemical and Biophysical Research Communications
    7         3188   FEBS Letters
    8         2861   Biochemistry
    9         2776   European Journal of Biochemistry
   10         2674   The EMBO Journal
   11         2443   Nature
   12         2410   Biochimica et Biophysica Acta
   13         2180   Journal of Molecular Biology
   14         2076   Genomics
   15         2006   Molecular and Cellular Biology
   16         1960   Cell
   17         1567   Biochemical Journal
   18         1458   Science
   19         1302   Molecular Microbiology
   20         1235   Plant Molecular Biology
   21         1225   Molecular and General Genetics
   22         1001   Journal of Biochemistry
   23          981   Journal of Cell Biology
   24          943   Virology
   25          927   Human Molecular Genetics
   26          857   Nature Genetics
   27          797   Genes and Development
   28          796   Journal of Virology
   29          744   The American Journal of Human Genetics
   30          743   Oncogene
   31          720   Plant Physiology
   32          708   Human Mutation
   33          648   Journal of Immunology
   34          635   Infection and Immunity
   35          623   Archives of Biochemistry and Biophysics
   36          615   Yeast
   37          610   Structure
   38          567   Development
   39          561   Journal of General Virology
   40          539   Microbiology
   41          521   Genetics
   42          507   FEMS Microbiology Letters
   43          492   Nature Structural Biology
   44          448   Human Genetics
   45          448   Blood
   46          443   Current Genetics
   47          387   Molecular and Biochemical Parasitology
   48          384   Applied and Environmental Microbiology
   49          378   Molecular Biology of the Cell
   50          372   Journal of Clinical Investigation
   51          363   Developmental Biology
   52          359   Mammalian Genome
   53          356   Cancer Research
   54          353   Molecular Endocrinology
   55          352   The Plant Cell
   56          351   Protein Science
   57          338   Acta Crystallographica, Section D
   58          334   Journal of Cell Science
   59          333   Immunogenetics
   60          333   Mechanisms of Development
   61          332   Neuron
   62          324   The Journal of Experimental Medicine
   63          320   Journal of Molecular Evolution
   64          311   DNA and Cell Biology
   65          305   The Plant Journal
   66          292   Journal of Neuroscience
   67          286   Endocrinology
   68          282   Biological Chemistry Hoppe-Seyler
   69          273   DNA Sequence
   70          263   Molecular Cell
   71          260   Journal of Neurochemistry
   72          249   Molecular Biology and Evolution
   73          247   The Journal of Clinical Endocrinology and Metabolism
   74          245   Current Biology
   75          239   Journal of General Microbiology
   76          239   Brain Research. Molecular Brain Research
   77          232   Toxicon
   78          229   Bioscience, Biotechnology, and Biochemistry
   79          222   American Journal of Physiology
   80          221   Cytogenetics and Cell Genetics
   81          214   Comparative Biochemistry and Physiology
   82          214   Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
   83          186   Molecular Pharmacology
   84          185   Antimicrobial Agents and Chemotherapy
   85          173   Proteins
   86          172   Journal of Investigative Dermatology
   87          163   Journal of Medical Genetics
   88          158   DNA Research
   89          158   DNA
   90          155   Peptides
   91          154   Molecular Plant-Microbe Interactions
   92          152   Genome Research
   93          152   Virus Research
   94          150   American Journal of Medical Genetics
   95          148   Tissue Antigens
   96          143   Biochimie
   97          139   Biology of Reproduction
   98          138   Bioorganicheskaia Khimiia
   99          135   Hemoglobin
  100          134   European Journal of Immunology
  101          130   Molecular and Cellular Endocrinology
  102          130   Plant and Cell Physiology
  103          117   Insect Biochemistry and Molecular Biology
  104          116   Agricultural and Biological Chemistry
  105          114   Archives of Microbiology
  106          114   Molecular Phylogenetics and Evolution
  107          107   General and Comparative Endocrinology
  108          107   Annals of Neurology
  109          104   European Journal of Human Genetics
  110          103   Diabetes
  111          103   Experimental Cell Research
  112          102   Journal of Human Genetics
  113          102   Neurology


6.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some Swiss-Prot lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                     354347              1.95
   Journal                          314613    170221    1.73
   Submitted to EMBL/GenBank/DDBJ    36948     31617    0.20
   Submitted to Swiss-Prot             646       643   <0.01
   Plant Gene Register                 500       488   <0.01
   Book citation                       490       478   <0.01
   Unpublished observations            397       393   <0.01
   Thesis                              288       286   <0.01
   Submitted to other databases        254       250   <0.01
   Patent                              122       120   <0.01
   Unpublished results                  83        81   <0.01
   Worm Breeder's Gazette                6         6   <0.01

Comments (CC)                       669562              3.69
   SIMILARITY                       192885    162260    1.06
   FUNCTION                         121080    118343    0.67
   SUBCELLULAR LOCATION              90200     90200    0.50
   CATALYTIC ACTIVITY                64662     60703    0.36
   SUBUNIT                           58853     58853    0.32
   PATHWAY                           32505     29804    0.18
   COFACTOR                          21977     21977    0.12
   TISSUE SPECIFICITY                19565     19565    0.11
   PTM                               12088     10734    0.07
   MISCELLANEOUS                     10225      9394    0.06
   DOMAIN                             8227      7239    0.05
   ALTERNATIVE PRODUCTS               7024      7024    0.04
   CAUTION                            6313      5604    0.03
   INDUCTION                          5029      5029    0.03
   DEVELOPMENTAL STAGE                4666      4666    0.03
   INTERACTION                        3083      3083    0.02
   DISEASE                            2933      2140    0.02
   ENZYME REGULATION                  2551      2551    0.01
   MASS SPECTROMETRY                  1754      1532    0.01
   DATABASE                           1302      1241    0.01
   BIOPHYSICOCHEMICAL PROPERTIES       961       961    0.01
   POLYMORPHISM                        504       491   <0.01
   ALLERGEN                            380       380   <0.01
   RNA EDITING                         355       355   <0.01
   TOXIC DOSE                          269       268   <0.01
   BIOTECHNOLOGY                       116       116   <0.01
   PHARMACEUTICAL                       55        55   <0.01

Features (FT)                      1008341              5.55
   TRANSMEM                         115652     25159    0.64
   METAL                             70212     17480    0.39
   CONFLICT                          67057     23460    0.37
   TURN                              62464      4662    0.34
   CARBOHYD                          59729     14988    0.33
   STRAND                            57266      4165    0.32
   DISULFID                          54939     14707    0.30
   TOPO_DOM                          52817     11324    0.29
   DOMAIN                            48227     25421    0.27
   HELIX                             45089      4519    0.25
   ACT_SITE                          41384     24663    0.23
   REPEAT                            38277      5571    0.21
   VARIANT                           33268      6451    0.18
   CHAIN                             29926     24316    0.16
   NP_BIND                           26120     18212    0.14
   MOD_RES                           22591     11680    0.12
   REGION                            21623     10792    0.12
   SIGNAL                            19085     19083    0.11
   COMPBIAS                          17474      9465    0.10
   BINDING                           15645     10259    0.09
   VARSPLIC                          14106      6212    0.08
   SITE                              11811      6609    0.07
   ZN_FING                           11593      4480    0.06
   MUTAGEN                           11144      2918    0.06
   NON_TER                           10952      8325    0.06
   MOTIF                              8401      6395    0.05
   INIT_MET                           8137      8073    0.04
   PROPEP                             6236      5233    0.03
   DNA_BIND                           5521      5189    0.03
   LIPID                              5352      3532    0.03
   COILED                             4676      2837    0.03
   PEPTIDE                            3812      1774    0.02
   TRANSIT                            3214      3184    0.02
   CA_BIND                            2288       928    0.01
   NON_CONS                           1052       506    0.01
   CROSSLNK                            592       480   <0.01
   UNSURE                              416       170   <0.01
   SE_CYS                              193       135   <0.01

Cross-references (DR)              1842067             10.14
   InterPro                         362455    164421    2.00
   EMBL                             350521    173966    1.93
   Pfam                             211867    155896    1.17
   PROSITE                          163750    101742    0.90
   PIR                               92542     85789    0.51
   GO                                83471     23262    0.46
   HSSP                              71939     71939    0.40
   PRINTS                            67421     52265    0.37
   TIGRFAMs                          64396     60065    0.35
   HAMAP                             58792     58682    0.32
   ProDom                            49046     47123    0.27
   SMART                             44819     34097    0.25
   Ensembl                           34859     34856    0.19
   PDB                               29940      8125    0.16
   SMR                               23589     23589    0.13
   TIGR                              17791     17285    0.10
   PIRSF                             13545     13348    0.07
   Genew                             11320     11263    0.06
   MIM                               10709      8803    0.06
   MGI                                8850      8809    0.05
   PANTHER                            7807      7795    0.04
   SGD                                5140      5079    0.03
   GermOnline                         4927      4877    0.03
   EcoGene                            4225      4223    0.02
   EchoBASE                           4159      4127    0.02
   IntAct                             3946      3946    0.02
   MEROPS                             3726      3615    0.02
   H-InvDB                            3677      3659    0.02
   WormPep                            3036      2648    0.02
   RGD                                3010      3007    0.02
   FlyBase                            2825      2797    0.02
   GeneDB_Spombe                      2806      2776    0.02
   TRANSFAC                           2749      2465    0.02
   SubtiList                          2727      2726    0.02
   WormBase                           2710      2635    0.01
   StyGene                            1454      1451    0.01
   TubercuList                        1428      1392    0.01
   SWISS-2DPAGE                       1132      1132    0.01
   ListiList                           997       989    0.01
   GeneFarm                            952       948    0.01
   Reactome                            720       720   <0.01
   Gramene                             641       609   <0.01
   Leproma                             622       618   <0.01
   PhotoList                           444       444   <0.01
   ZFIN                                434       427   <0.01
   MaizeDB                             423       418   <0.01
   HIV                                 370       365   <0.01
   REBASE                              368       363   <0.01
   OGP                                 365       365   <0.01
   ECO2DBASE                           351       299   <0.01
   DictyBase                           325       323   <0.01
   GlycoSuiteDB                        283       283   <0.01
   SagaList                            282       281   <0.01
   PHCI-2DPAGE                         239       239   <0.01
   AGD                                 226       220   <0.01
   LegioList                           184       184   <0.01
   MypuList                            173       173   <0.01
   Aarhus/Ghent-2DPAGE                 128        98   <0.01
   Siena-2DPAGE                        103       103   <0.01
   HSC-2DPAGE                           85        85   <0.01
   COMPLUYEAST-2DPAGE                   59        59   <0.01
   PhosSite                             54        54   <0.01
   PMMA-2DPAGE                          52        52   <0.01
   Maize-2DPAGE                         39        39   <0.01
   Rat-heart-2DPAGE                     28        28   <0.01
   ANU-2DPAGE                           14        14   <0.01

Number of explicitly cross-referenced databases: 67
Number of implicitly cross-referenced databases: 31


7.  MISCELLANEOUS STATISTICS

Total number of distinct authors cited in Swiss-Prot: 201756

Total number of entries encoded on a chloroplast: 4293
Total number of entries encoded on a mitochondrion: 3318
Total number of entries encoded on a cyanelle: 145
Total number of entries encoded on a plasmid: 3019

Number of fragments: 8484
Number of additional sequences encoded on splice variants: 10767


UniProt/TrEMBL protein database release 30.0 statistics


1.  INTRODUCTION

Release 30.0 of 10-May-2005 of UniProt/TrEMBL has been produced in synch
with UniProt/Swiss-Prot release 47 and EMBL/DDBJ/GenBank nucleotide
sequence database release 81 and updates until 16-April-2005. It contains
1'714'475 sequence entries, comprising 540'729'498 amino acids.

149'924 sequences have been added since release 29. This represents an 
increase of 11.24%.

In the document delac_tr.txt, you will find a list of all accession numbers
which were previously present in UniProt/TrEMBL, but which have now been
deleted from the database. Most deletions are due to the deletion of the
corresponding CDS in the source nucleotide sequence databases
EMBL-Bank/DDBJ/GenBank. In addition, some entries are recognised to be open
reading frames (ORFs) that have been wrongly predicted to code for proteins.
When there is enough evidence that these hypothetical proteins are not real,
we remove them from TrEMBL.


2.  AMINO ACID COMPOSITION

   2.1  Composition in percent for the complete database

   Ala (A) 7.72   Gln (Q) 3.88   Leu (L) 9.73   Ser (S) 7.10
   Arg (R) 5.30   Glu (E) 6.08   Lys (K) 5.56   Thr (T) 5.73
   Asn (N) 4.47   Gly (G) 6.89   Met (M) 2.41   Trp (W) 1.37
   Asp (D) 5.10   His (H) 2.27   Phe (F) 4.14   Tyr (Y) 3.14
   Cys (C) 1.50   Ile (I) 6.03   Pro (P) 4.94   Val (V) 6.48

   Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.07


   2.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Ile, Thr, Lys, Arg, Asp, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
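
The ordering in 2.2 follows directly from the percentages in 2.1. As a
minimal sketch (the variable names are illustrative and not part of any
UniProt tool), sorting the twenty standard residues by their percentage
reproduces the list above:

```python
# Percentages copied from section 2.1 (three-letter codes).
composition = {
    "Ala": 7.72, "Gln": 3.88, "Leu": 9.73, "Ser": 7.10,
    "Arg": 5.30, "Glu": 6.08, "Lys": 5.56, "Thr": 5.73,
    "Asn": 4.47, "Gly": 6.89, "Met": 2.41, "Trp": 1.37,
    "Asp": 5.10, "His": 2.27, "Phe": 4.14, "Tyr": 3.14,
    "Cys": 1.50, "Ile": 6.03, "Pro": 4.94, "Val": 6.48,
}

# Sort residues from most to least frequent; no two values are equal,
# so the order is unambiguous.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```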


3.  TAXONOMIC ORIGIN

   Total number of species represented in this release of 
   UniProt/TrEMBL: 89807

   The first twenty species represent 499903 sequences: 29.2 % of the
   total number of entries.


   3.1 Table of the frequency of occurrence of species

        Species represented 1x:44301
                            2x:17044
                            3x: 8565
                            4x: 4547
                            5x: 2681
                            6x: 2006
                            7x: 1349
                            8x: 1150
                            9x:  935
                           10x:  816
                       11- 20x: 2983
                       21- 50x: 1787
                       51-100x:  721
                         >100x:  922


   3.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1     126858  Human immunodeficiency virus 1
       2      56039  Homo sapiens (Human)
       3      47506  Oryza sativa (japonica cultivar-group)
       4      39737  Arabidopsis thaliana (Mouse-ear cress)
       5      38848  Mus musculus (Mouse)
       6      24423  Drosophila melanogaster (Fruit fly)
       7      22971  Hepatitis C virus
       8      20188  Caenorhabditis elegans
       9      15226  Anopheles gambiae str. PEST
      10      13201  Caenorhabditis briggsae
      11      12210  Brachydanio rerio (Zebrafish) (Danio rerio)
      12      10974  Neurospora crassa
      13      10718  Xenopus laevis (African clawed frog)
      14       9678  Schistosoma japonicum (Blood fluke)
      15       9528  Aspergillus nidulans FGSC A4
      16       9240  Candida albicans SC5314
      17       9048  Rattus norvegicus (Rat)
      18       8142  Bradyrhizobium japonicum
      19       7802  Plasmodium yoelii yoelii
      20       7566  Streptomyces coelicolor
      21       7397  Hepatitis B virus
      22       7379  Streptomyces avermitilis
      23       7183  Rhizobium loti (Mesorhizobium loti)
      24       7081  uncultured bacterium
      25       7067  Rhodopirellula baltica
      26       7006  Escherichia coli
      27       7006  Agrobacterium tumefaciens (strain C58 / ATCC 33970)
      28       6575  Cryptococcus neoformans (Filobasidiella neoformans)
      29       6493  Pseudomonas aeruginosa
      30       6463  Yarrowia lipolytica (Candida lipolytica)
      31       6394  Giardia lamblia ATCC 50803
      32       6275  Bacillus anthracis
      33       6241  Debaryomyces hansenii (Yeast) (Torulaspora hansenii)
      34       5803  Nocardia farcinica
      35       5747  Burkholderia pseudomallei (Pseudomonas pseudomallei)
      36       5696  Rhizobium meliloti (Sinorhizobium meliloti)
      37       5565  Anabaena sp. (strain PCC 7120)
      38       5560  Bacillus cereus (strain ATCC 10987)
      39       5374  Gallus gallus (Chicken)
      40       5228  Plasmodium falciparum (isolate 3D7)
      41       5217  Yersinia pestis
      42       5197  Trypanosoma brucei
      43       5195  Kluyveromyces lactis (Yeast)
      44       5136  Helicobacter pylori (Campylobacter pylori)
      45       5121  Photobacterium profundum (Photobacterium sp. (strain SS9))
      46       5106  Candida glabrata (Yeast) (Torulopsis glabrata)
      47       4969  Pseudomonas syringae (pv. tomato)
      48       4964  Bacillus cereus (strain ZK)
      49       4947  Bordetella bronchiseptica (Alcaligenes bronchisepticus)
      50       4894  Bacillus thuringiensis (subsp. konkukian)
      51       4889  Escherichia coli O157:H7
      52       4867  Bacillus licheniformis (strain DSM 13 / ATCC 14580)
      53       4806  Bacillus cereus (strain ATCC 14579 / DSM 31)
      54       4781  Pseudomonas putida (strain KT2440)
      55       4772  Bacteroides fragilis
      56       4728  Ralstonia solanacearum (Pseudomonas solanacearum)
      57       4629  Xanthomonas oryzae (pv. oryzae)
      58       4616  Rhodopseudomonas palustris
      59       4607  Bacteroides thetaiotaomicron
      60       4586  Leptospira interrogans
      61       4530  Ashbya gossypii (Yeast) (Eremothecium gossypii)
      62       4470  Vibrio vulnificus (strain YJ016)
      63       4441  Azoarcus sp. (strain EbN1)
      64       4407  Oryza sativa (Rice)
      65       4404  Pongo pygmaeus (Orangutan)
      66       4404  Burkholderia mallei (Pseudomonas mallei)
      67       4396  Vibrio parahaemolyticus
      68       4319  Mycobacterium tuberculosis
      69       4247  Erwinia carotovora (subsp. atroseptica) (Pectobacterium atrosepticum)
      70       4241  Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67
      71       4212  Mycobacterium paratuberculosis
      72       4181  Silicibacter pomeroyi
      73       4150  Gloeobacter violaceus
      74       4146  Shewanella oneidensis
      75       4115  Photorhabdus luminescens (subsp. laumondii)
      76       4105  Haloarcula marismortui (Halobacterium marismortui)
      77       4086  Chromobacterium violaceum
      78       4060  Corynebacterium glutamicum (Brevibacterium flavum)
      79       4058  Methanosarcina acetivorans
      80       4034  Plasmodium falciparum
      81       4031  Cryptosporidium parvum
      82       4027  Salmonella typhi
      83       4027  Vibrio vulnificus
      84       3989  Vibrio cholerae
      85       3978  Cryptosporidium hominis
      86       3958  Salmonella paratyphi-a
      87       3941  Bacillus clausii (strain KSM-K16)
      88       3938  Yersinia pseudotuberculosis
      89       3927  Shigella flexneri
      90       3926  Escherichia coli O6
      91       3911  Xanthomonas axonopodis (pv. citri)
      92       3850  Bordetella parapertussis
      93       3769  Vibrio fischeri (strain ATCC 700601 / ES114)
      94       3768  Listeria monocytogenes
      95       3755  Bos taurus (Bovine)
      96       3753  Salmonella typhimurium
      97       3720  Xanthomonas campestris (pv. campestris)
      98       3588  Enterococcus faecalis (Streptococcus faecalis)
      99       3562  Bacillus halodurans
     100       3539  Streptococcus pneumoniae
     101       3501  Torque teno virus
     102       3477  Bdellovibrio bacteriovorus
     103       3436  Leptospira interrogans (serogroup Icterohaemorrhagiae / serovar Copenhageni)
     104       3407  Clostridium acetobutylicum
     105       3407  Geobacillus kaustophilus
     106       3325  Desulfovibrio vulgaris (strain Hildenborough / ATCC 29579 / NCIMB 8303)
     107       3321  Caulobacter crescentus
     108       3290  Chimpanzee immunodeficiency virus (SIV(cpz)) (CIV)
     109       3225  Dictyostelium discoideum (Slime mold)
     110       3214  Geobacter sulfurreducens
     111       3213  Xenopus tropicalis (Western clawed frog) (Silurana tropicalis)
     112       3192  Symbiobacterium thermophilum
     113       3140  Acinetobacter sp. (strain ADP1)
     114       3113  Desulfotalea psychrophila
     115       3104  Streptococcus pyogenes
     116       3086  Oceanobacillus iheyensis
     117       3076  Brucella abortus biovar 1 str. 9-941
     118       3050  Legionella pneumophila (strain Paris)
     119       3033  Bordetella pertussis
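
The figure quoted at the start of section 3 for the first twenty species
can be checked against the table above. The sketch below simply sums the
first twenty frequencies and divides by the total number of entries given
in the introduction; the variable names are illustrative.

```python
# Frequencies of the first twenty rows of the table in section 3.2.
top_twenty = [
    126858, 56039, 47506, 39737, 38848, 24423, 22971, 20188, 15226, 13201,
    12210, 10974, 10718, 9678, 9528, 9240, 9048, 8142, 7802, 7566,
]

# Total number of sequence entries, from the introduction (1'714'475).
total_entries = 1714475

top_sum = sum(top_twenty)
share = 100 * top_sum / total_entries
print(top_sum, f"{share:.1f} %")
```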



   3.3  Taxonomic distribution of the sequences

   Kingdom        sequences (% of the database)
    Archaea           45239 (  3%)
    Bacteria         633961 ( 37%)
    Eukaryota        728869 ( 43%)
    Viruses          304257 ( 18%)
    Other              2149 ( <1%)

   Within Eukaryota:

   Category            sequences (% of Eukaryota) (% of the complete database)
     Human                  56039 (  8%)           (  3%)
     Other Mammalia         92879 ( 13%)           (  5%)
     Other Vertebrata       92476 ( 13%)           (  5%)
     Viridiplantae         177898 ( 24%)           ( 10%)
     Fungi                  89518 ( 12%)           (  5%)
     Insecta                85761 ( 12%)           (  5%)
     Nematoda               35993 (  5%)           (  2%)
     Other                  98305 ( 13%)           (  6%)
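
The percentages in 3.3 appear to be plain ratios rounded to whole numbers,
with "<1" used when the value rounds to zero. The following sketch works
under that assumption; the helper name percent_label is hypothetical.

```python
def percent_label(count, total):
    """Return a whole-number percentage, or '<1' when it rounds to zero."""
    pct = round(100 * count / total)
    return "<1" if pct == 0 else str(pct)


# Sequence counts copied from the two tables in section 3.3.
kingdoms = {
    "Archaea": 45239, "Bacteria": 633961, "Eukaryota": 728869,
    "Viruses": 304257, "Other": 2149,
}
eukaryota = {
    "Human": 56039, "Other Mammalia": 92879, "Other Vertebrata": 92476,
    "Viridiplantae": 177898, "Fungi": 89518, "Insecta": 85761,
    "Nematoda": 35993, "Other": 98305,
}

database_total = sum(kingdoms.values())
eukaryota_total = sum(eukaryota.values())

for name, count in kingdoms.items():
    print(f"{name:<18}{count:>8} ({percent_label(count, database_total):>3}%)")

for name, count in eukaryota.items():
    of_euk = percent_label(count, eukaryota_total)
    of_db = percent_label(count, database_total)
    print(f"{name:<18}{count:>8} ({of_euk:>3}%) ({of_db:>3}%)")
```

The kingdom counts add up to the total number of entries in the release,
and the Eukaryota categories add up to the Eukaryota row.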



4.  SEQUENCE SIZE

   4.1  Distribution of the sequences by size (excluding fragments)

              From   To  Number             From   To   Number
                  1-  50   21155             1001-1100     9713
                 51- 100  102798             1101-1200     6992
                101- 150  128388             1201-1300     5303
                151- 200  117224             1301-1400     3404
                201- 250  118574             1401-1500     2846
                251- 300  109753             1501-1600     1953
                301- 350  107083             1601-1700     1551
                351- 400   86474             1701-1800     1314
                401- 450   67185             1801-1900     1058
                451- 500   58853             1901-2000      876
                501- 550   46252             2001-2100      675
                551- 600   32250             2101-2200      817
                601- 650   24775             2201-2300      679
                651- 700   19308             2301-2400      536
                701- 750   16492             2401-2500      377
                751- 800   13713             >2500         3339
                801- 850   11539
                851- 900   10187
                901- 950    7584
                951-1000    6052
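
Because the table excludes fragments, its bins should add up to the total
number of entries minus the number of fragments reported in section 6.
The sketch below performs that consistency check; the variable names are
illustrative.

```python
# Counts for the bins 1-50 up to 951-1000 (left-hand columns).
short_bins = [
    21155, 102798, 128388, 117224, 118574, 109753, 107083, 86474, 67185,
    58853, 46252, 32250, 24775, 19308, 16492, 13713, 11539, 10187, 7584,
    6052,
]
# Counts for the bins 1001-1100 up to 2401-2500, then >2500 (right-hand columns).
long_bins = [
    9713, 6992, 5303, 3404, 2846, 1953, 1551, 1314, 1058, 876, 675, 817,
    679, 536, 377, 3339,
]

total_entries = 1714475   # from the introduction
fragments = 567403        # "Number of fragments" in section 6

non_fragments = sum(short_bins) + sum(long_bins)
print(non_fragments, total_entries - fragments)
```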

 


   4.2  Longest and shortest sequences

   The shortest sequence is Q16047_HUMAN:     4 amino acids.
   The longest sequence is  Q8WZ42_HUMAN: 34350 amino acids.


5.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some UniProt/TrEMBL 
lines, as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                    2369863              1.38
   Journal                         1496311   1260307    0.87
   Submitted to EMBL/GenBank/DDBJ   860323    677541    0.50
   Thesis                             4686      4634   <0.01
   Book citation                      3792      3748   <0.01
   Submitted to other databases        448       440   <0.01
   Other                              4303      4302   <0.01
 
Comments (CC)                       964235              0.56
   SIMILARITY                       174303    171382    0.10
   FUNCTION                         172046    170766    0.10
   CATALYTIC ACTIVITY               170005    152037    0.10
   SUBCELLULAR LOCATION             159288    159288    0.09
   SUBUNIT                           92437     92437    0.05
   CAUTION                           75319     75218    0.04
   PATHWAY                           54200     53057    0.03
   COFACTOR                          56256     56256    0.03
   INTERACTION                        1161      1161   <0.01
   MISCELLANEOUS                      3581      3564   <0.01
   DOMAIN                             5309      4658   <0.01
   ALLERGEN                            172       172   <0.01
   
Features (FT)                      1014689              0.59
   NON_TER                          960278    565239    0.56
   CHAIN                             40998     24408    0.02
   SIGNAL                            12809     12587    0.01
   TRANSIT                             604       600   <0.01
   

Cross-references (DR)             12951295              7.55
   GO                              3878298   1080665    2.26
   InterPro                        2376394   1270589    1.39
   EMBL                            2019928   1708137    1.18
   Pfam                            1588390   1198364    0.93
   PROSITE                          827908    540191    0.48
   PRINTS                           392823    317615    0.23
   SMART                            295055    227125    0.17
   HSSP                             290497    290218    0.17
   SMR                              248032    247914    0.14
   ProDom                           204145    195938    0.12
   PIR                              197872    162162    0.12
   TIGRFAMs                         181672    168057    0.11
   TIGR                              92497     86467    0.05
   Ensembl                           73110     73097    0.04
   PANTHER                           54609     54599    0.03
   Gramene                           43206     43193    0.03
   PIRSF                             32718     31848    0.02
   FlyBase                           29155     22548    0.02
   MGI                               24517     24515    0.01
   WormPep                           19095     19014    0.01
   WormBase                          19083     19014    0.01
   ZFIN                               8839      8837    0.01
   MEROPS                             8517      8250   <0.01
   LegioList                          5711      5681   <0.01
   IntAct                             5352      5352   <0.01
   ListiList                          4818      4801   <0.01
   AGD                                4483      4483   <0.01
   PhotoList                          4236      4112   <0.01
   Genew                              3264      3264   <0.01
   PDB                                2759      1629   <0.01
   TubercuList                        2494      2488   <0.01
   RGD                                2474      2459   <0.01
   GeneDB_Spombe                      2125      2119   <0.01
   SagaList                           1812      1718   <0.01
   SGD                                1374      1373   <0.01
   TRANSFAC                           1030      1017   <0.01
   Leproma                             985       984   <0.01
   DictyBase                           980       980   <0.01
   MypuList                            609       605   <0.01
   REBASE                              125       120   <0.01
   PHCI-2DPAGE                         108       108   <0.01
   SWISS-2DPAGE                         87        87   <0.01
   ANU-2DPAGE                           73        73   <0.01
   Reactome                             30        30   <0.01
   PMMA-2DPAGE                           3         3   <0.01
   Siena-2DPAGE                          2         2   <0.01
   COMPLUYEAST-2DPAGE                    1         1   <0.01
   
Number of explicitly cross-referenced databases: 68
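
The "Average per entry" column appears to be the total number of lines
divided by the total number of entries in the release (1'714'475), not by
the number of entries that carry such a line. A minimal sketch under that
assumption, with illustrative variable names:

```python
# Total number of sequence entries, from the introduction.
total_entries = 1714475

# "Total number" values copied from the table in section 5.
line_totals = {
    "References (RL)": 2369863,
    "Comments (CC)": 964235,
    "Features (FT)": 1014689,
    "Cross-references (DR)": 12951295,
    "GO": 3878298,
}

# Average number of lines per entry, formatted to two decimals.
averages = {
    name: f"{count / total_entries:.2f}" for name, count in line_totals.items()
}

for name, value in averages.items():
    print(f"{name:<24}{value:>6}")
```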

6.  MISCELLANEOUS STATISTICS

Total number of distinct authors cited in UniProt/TrEMBL: 210250

Total number of entries encoded on a chloroplast: 41294
Total number of entries encoded on a mitochondrion: 99934
Total number of entries encoded on a cyanelle: 2
Total number of entries encoded on a plasmid: 33862

Number of fragments: 567403
Number of additional sequences encoded on splice variants: 55


Submissions and Updates

We welcome feedback from our users. We would especially appreciate your notifying us if you find that sequences belonging to your field of expertise are missing from the database. We also would like to be notified about annotations to be updated, if, for example, the function of a protein has been clarified or if new information about post-translational modifications has become available.

Submit new sequence data, updates and corrections at http://www.uniprot.org/support/submissions.shtml

For all queries regarding submissions to UniProt and to submit new protein sequence data, please contact:

UniProt Knowledgebase
The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: (+44 1223) 494 462
Telefax: (+44 1223) 494 468
E-mail:


Download information

Bi-Weekly releases

The latest data of the UniProt Knowledgebase is available in various formats (flatfile, XML or FASTA) at http://www.uniprot.org/database/download.shtml. The data is further supplemented by two files containing the sequences of all additional splice isoforms annotated in UniProt/Swiss-Prot and UniProt/TrEMBL. These data sets are documented in the file ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/README.varsplic
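
As a rough sketch of how a FASTA download could be used to recount entries and amino acids, the following reads generic FASTA text. The function name read_fasta and the two sample records are made up for illustration; nothing is assumed about the header layout of the actual UniProt files.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Two made-up records stand in for a downloaded file.
sample = io.StringIO(">entry_one\nMKT\nAYI\n>entry_two\nGGS\n")
records = list(read_fasta(sample))

n_entries = len(records)
n_residues = sum(len(seq) for _, seq in records)
print(n_entries, n_residues)
```

For a real file, the StringIO object would be replaced by an open file handle.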

Major releases

For users who wish to download the UniProt Knowledgebase only occasionally, we distribute the latest major release (updated 4 times per year) in flatfile format. Previous UniProt/Swiss-Prot and UniProt/TrEMBL releases are archived under ftp://ftp.uniprot.org/databases/uniprot/previous_major_releases. The UniProt Knowledgebase major release is also available on CD-ROM from the EBI.


Contact

EMBL Outstation
European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: (+44 1223) 494 444
Fax: (+44 1223) 494 468
Electronic mail address:
WWW server: http://www.ebi.ac.uk/


Swiss Institute of Bioinformatics (SIB)
Centre Medical Universitaire
1, rue Michel Servet
1211 Geneva 4
Switzerland

Telephone: (+41 22) 702 50 50
Fax: (+41 22) 702 58 58
Electronic mail address:
WWW server: http://www.expasy.org/


Protein Information Resource (PIR)
Georgetown University Medical Center
3900 Reservoir Road, NW
Box 571455
Washington, DC 20057-1455
United States of America

Telephone: (+1 202) 687 1039
Fax: (+1 202) 687 0057
Electronic mail address:
WWW server: http://pir.georgetown.edu

Citation

If you want to cite UniProt in a publication, please use the following reference:

Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O'Donovan C., Redaschi N., Yeh L.S., The Universal Protein Resource (UniProt), Nucleic Acids Res. 33: D154-D159 (2005).