UniProt Knowledgebase Release notes: UniProt release 5.0 of 10-May-2005
Related documents: UniProt user manual, Recent changes, Forthcoming changes.
| Introduction |
|---|
Release 5.0 of the UniProt Knowledgebase is composed of the UniProt/Swiss-Prot Protein Knowledgebase release 47.0 and the UniProt/TrEMBL Protein Database release 30.0.
More information on these databases can be found in the user manual section "What is the UniProt Knowledgebase?".
| UniProt/Swiss-Prot protein knowledgebase release 47.0 statistics |
|---|
Release 47.0 of 10-May-2005 of Swiss-Prot contains 181'577 sequence entries, comprising 65'746'672 amino acids abstracted from 128'440 references.
The growth of the database is summarized below.
| Release | Date | Number of entries | Number of amino acids |
|---|---|---|---|
| 2.0 | 09/86 | 3'939 | 900'163 |
| 3.0 | 11/86 | 4'160 | 969'641 |
| 4.0 | 04/87 | 4'387 | 1'036'010 |
| 5.0 | 09/87 | 5'205 | 1'327'683 |
| 6.0 | 01/88 | 6'102 | 1'653'982 |
| 7.0 | 04/88 | 6'821 | 1'885'771 |
| 8.0 | 08/88 | 7'724 | 2'224'465 |
| 9.0 | 11/88 | 8'702 | 2'498'140 |
| 10.0 | 03/89 | 10'008 | 2'952'613 |
| 11.0 | 07/89 | 10'856 | 3'265'966 |
| 12.0 | 10/89 | 12'305 | 3'797'482 |
| 13.0 | 01/90 | 13'837 | 4'347'336 |
| 14.0 | 04/90 | 15'409 | 4'914'264 |
| 15.0 | 08/90 | 16'941 | 5'486'399 |
| 16.0 | 11/90 | 18'364 | 5'986'949 |
| 17.0 | 02/91 | 20'024 | 6'524'504 |
| 18.0 | 05/91 | 20'772 | 6'792'034 |
| 19.0 | 08/91 | 21'795 | 7'173'785 |
| 20.0 | 11/91 | 22'654 | 7'500'130 |
| 21.0 | 03/92 | 23'742 | 7'866'596 |
| 22.0 | 05/92 | 25'044 | 8'375'696 |
| 23.0 | 08/92 | 26'706 | 9'011'391 |
| 24.0 | 12/92 | 28'154 | 9'545'427 |
| 25.0 | 04/93 | 29'955 | 10'214'020 |
| 26.0 | 07/93 | 31'808 | 10'875'091 |
| 27.0 | 10/93 | 33'329 | 11'484'420 |
| 28.0 | 02/94 | 36'000 | 12'496'420 |
| 29.0 | 06/94 | 38'303 | 13'464'008 |
| 30.0 | 10/94 | 40'292 | 14'147'368 |
| 31.0 | 02/95 | 43'470 | 15'335'248 |
| 32.0 | 11/95 | 49'340 | 17'385'503 |
| 33.0 | 02/96 | 52'205 | 18'531'384 |
| 34.0 | 10/96 | 59'021 | 21'210'389 |
| 35.0 | 11/97 | 69'113 | 25'083'768 |
| 36.0 | 07/98 | 74'019 | 26'840'295 |
| 37.0 | 12/98 | 77'977 | 28'268'293 |
| 38.0 | 07/99 | 80'000 | 29'085'965 |
| 39.0 | 05/00 | 86'593 | 31'411'114 |
| 40.0 | 10/01 | 101'602 | 37'315'215 |
| 41.0 | 02/03 | 122'564 | 44'986'459 |
| 42.0 | 10/03 | 135'850 | 50'046'799 |
| 43.0 | 03/04 | 146'720 | 54'093'154 |
| 44.0 | 07/04 | 153'871 | 56'608'159 |
| 45.0 | 10/04 | 163'235 | 59'631'787 |
| 46.0 | 02/05 | 168'297 | 61'443'278 |
| 47.0 | 05/05 | 181'577 | 65'746'672 |
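The counts in this table use an apostrophe as thousands separator, so they need a small normalisation step before any arithmetic. The following is a minimal sketch, assuming rows are copied verbatim from the table above; the helper names `parse_count` and `parse_row` are illustrative and not part of any UniProt tool. It derives the mean number of amino acids per entry for a few releases.

```python
def parse_count(text: str) -> int:
    """Convert a count such as "65'746'672" to an integer."""
    return int(text.replace("'", ""))


def parse_row(row: str) -> dict:
    """Split one pipe-delimited table row into typed fields."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    release, date, entries, amino_acids = cells
    return {
        "release": release,
        "date": date,
        "entries": parse_count(entries),
        "amino_acids": parse_count(amino_acids),
    }


# Three rows copied from the growth table.
rows = [
    "| 2.0 | 09/86 | 3'939 | 900'163 |",
    "| 46.0 | 02/05 | 168'297 | 61'443'278 |",
    "| 47.0 | 05/05 | 181'577 | 65'746'672 |",
]

records = [parse_row(r) for r in rows]
for rec in records:
    mean_length = rec["amino_acids"] / rec["entries"]
    print(f"release {rec['release']}: {mean_length:.1f} amino acids per entry")
```

For release 47.0 the ratio rounds to 362, which agrees with the average sequence length quoted in the sequence size statistics.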
In rare cases, Swiss-Prot entries are removed. Deleted entries are almost exclusively Open Reading Frames (ORFs) that have been wrongly predicted to code for proteins. When there is enough evidence that these hypothetical proteins are not real, we remove them from Swiss-Prot. The document delac_sp.txt lists all accession numbers that were previously present in UniProt/Swiss-Prot but have now been deleted from the database.
We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:
From our efforts to annotate human sequence entries as completely as possible arose the HPI project, and the bacterial model organisms became the focus of the HAMAP project. Here is the current status of the model organisms which are not covered by these two projects:
| Organism | Database cross-references | Index file | Number of sequences |
|---|---|---|---|
| A.thaliana | None yet | arath.txt | 3'288 |
| C.albicans | None yet | calbican.txt | 333 |
| C.elegans | Wormpep | celegans.txt | 2'651 |
| D.discoideum | DictyBase | dicty.txt | 324 |
| D.melanogaster | FlyBase | fly.txt | 2'226 |
| M.musculus | MGD | mgdtosp.txt | 9'228 |
| S.cerevisiae | SGD | yeast.txt | 5'090 |
| S.pombe | GeneDB_SPombe | pombe.txt | 2'778 |
1. INTRODUCTION
Release 47.0 of 10-May-2005 of Swiss-Prot contains 181'577 sequence entries,
comprising 65'746'672 amino acids abstracted from 128'440 references.
11'531 sequences have been added since release 46, the sequence data of
841 existing entries has been updated and the annotations of
166'572 entries have been revised. This represents an increase of 6%.
2. AMINO ACID COMPOSITION
2.1 Composition in percent for the complete database
Ala (A) 7.84 Gln (Q) 3.94 Leu (L) 9.64 Ser (S) 6.85
Arg (R) 5.34 Glu (E) 6.61 Lys (K) 5.91 Thr (T) 5.44
Asn (N) 4.18 Gly (G) 6.95 Met (M) 2.38 Trp (W) 1.15
Asp (D) 5.31 His (H) 2.28 Phe (F) 4.00 Tyr (Y) 3.06
Cys (C) 1.54 Ile (I) 5.91 Pro (P) 4.83 Val (V) 6.73
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.01
2.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Lys, Ile, Thr, Arg, Asp, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
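The classification above follows directly from the percentages in section 2.1. As a minimal sketch, the values below are copied from that table in reading order. The tie between Lys and Ile (both 5.91) is resolved here by input order, which is an assumption about how the original list was produced.

```python
# Percent composition copied from section 2.1 (Swiss-Prot release 47.0),
# listed in the reading order of that table.
composition = {
    "Ala": 7.84, "Gln": 3.94, "Leu": 9.64, "Ser": 6.85,
    "Arg": 5.34, "Glu": 6.61, "Lys": 5.91, "Thr": 5.44,
    "Asn": 4.18, "Gly": 6.95, "Met": 2.38, "Trp": 1.15,
    "Asp": 5.31, "His": 2.28, "Phe": 4.00, "Tyr": 3.06,
    "Cys": 1.54, "Ile": 5.91, "Pro": 4.83, "Val": 6.73,
}

# sorted() is stable, also with reverse=True, so residues with equal
# percentages (Lys and Ile here) keep their input order.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```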
3. TAXONOMIC ORIGIN
Total number of species represented in this release of Swiss-Prot: 9212
The first twenty species represent 64219 sequences: 35.4 % of the total
number of entries.
3.1 Table of the frequency of occurrence of species
Species represented 1x: 4395
2x: 1441
3x: 721
4x: 464
5x: 318
6x: 275
7x: 190
8x: 160
9x: 135
10x: 74
11- 20x: 376
21- 50x: 298
51-100x: 108
>100x: 257
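The occurrence classes in table 3.1 partition the species, so their counts should add up to the total number of species given at the start of section 3. A short sketch of that consistency check, with the values copied from the table and the dictionary keys chosen only for illustration:

```python
# Number of species per occurrence class, copied from table 3.1.
species_per_class = {
    "1x": 4395, "2x": 1441, "3x": 721, "4x": 464, "5x": 318,
    "6x": 275, "7x": 190, "8x": 160, "9x": 135, "10x": 74,
    "11-20x": 376, "21-50x": 298, "51-100x": 108, ">100x": 257,
}

total_species = sum(species_per_class.values())
singleton_share = 100 * species_per_class["1x"] / total_species

print(total_species)
print(f"{singleton_share:.1f}% of species are represented by a single entry")
```

The sum is 9212, the species total reported above, and a little under half of the species occur only once.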
3.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 12202 Homo sapiens (Human)
2 9228 Mus musculus (Mouse)
3 5090 Saccharomyces cerevisiae (Baker's yeast)
4 4842 Escherichia coli
5 4300 Rattus norvegicus (Rat)
6 3288 Arabidopsis thaliana (Mouse-ear cress)
7 2778 Schizosaccharomyces pombe (Fission yeast)
8 2777 Bacillus subtilis
9 2651 Caenorhabditis elegans
10 2226 Drosophila melanogaster (Fruit fly)
11 1782 Methanococcus jannaschii
12 1773 Haemophilus influenzae
13 1738 Escherichia coli O157:H7
14 1562 Bos taurus (Bovine)
15 1500 Salmonella typhimurium
16 1412 Escherichia coli O6
17 1400 Mycobacterium tuberculosis
18 1383 Shigella flexneri
19 1157 Gallus gallus (Chicken)
20 1130 Mycobacterium bovis
21 1087 Salmonella typhi
22 1019 Pseudomonas aeruginosa
23 960 Synechocystis sp. (strain PCC 6803)
24 960 Archaeoglobus fulgidus
25 958 Sus scrofa (Pig)
26 945 Xenopus laevis (African clawed frog)
27 816 Rhizobium meliloti (Sinorhizobium meliloti)
28 803 Vibrio cholerae
29 791 Yersinia pestis
30 760 Oryctolagus cuniculus (Rabbit)
31 745 Aquifex aeolicus
32 687 Mycoplasma pneumoniae
33 686 Pasteurella multocida
34 639 Vibrio parahaemolyticus
35 639 Streptomyces coelicolor
36 624 Bacillus halodurans
37 618 Mycobacterium leprae
38 607 Treponema pallidum
39 589 Vibrio vulnificus
40 579 Canis familiaris (Dog)
41 577 Methanobacterium thermoautotrophicum
42 577 Anabaena sp. (strain PCC 7120)
43 572 Buchnera aphidicola (subsp. Acyrthosiphon pisum)
44 565 Staphylococcus aureus (strain Mu50 / ATCC 700699)
45 563 Helicobacter pylori (Campylobacter pylori)
46 562 Staphylococcus aureus (strain N315)
47 561 Buchnera aphidicola (subsp. Schizaphis graminum)
48 546 Rickettsia prowazekii
49 545 Staphylococcus aureus (strain MW2)
50 544 Helicobacter pylori J99 (Campylobacter pylori J99)
51 532 Pseudomonas putida (strain KT2440)
52 528 Pseudomonas syringae (pv. tomato)
53 522 Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
54 520 Vibrio vulnificus (strain YJ016)
55 517 Zea mays (Maize)
56 515 Staphylococcus epidermidis
57 507 Buchnera aphidicola (subsp. Baizongia pistaciae)
58 506 Ralstonia solanacearum (Pseudomonas solanacearum)
59 505 Bacillus anthracis
60 505 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
61 500 Listeria monocytogenes
62 500 Bradyrhizobium japonicum
63 496 Listeria innocua
64 495 Rhizobium loti (Mesorhizobium loti)
65 487 Xanthomonas campestris (pv. campestris)
66 486 Mycoplasma genitalium
67 482 Neisseria meningitidis (serogroup B)
68 482 Neisseria meningitidis (serogroup A)
69 481 Oryza sativa (Rice)
70 479 Clostridium acetobutylicum
71 467 Caulobacter crescentus
72 463 Thermotoga maritima
73 450 Xanthomonas axonopodis (pv. citri)
74 445 Streptococcus pneumoniae
75 444 Photorhabdus luminescens (subsp. laumondii)
76 440 Shewanella oneidensis
77 440 Xylella fastidiosa
78 439 Deinococcus radiodurans
79 438 Pan troglodytes (Chimpanzee)
80 434 Brachydanio rerio (Zebrafish) (Danio rerio)
81 433 Bacillus cereus (strain ATCC 14579 / DSM 31)
82 432 Pyrococcus horikoshii
83 431 Chlamydia trachomatis
84 428 Xylella fastidiosa (strain Temecula1 / ATCC 700964)
85 427 Pyrococcus abyssi
86 419 Methanosarcina acetivorans
87 417 Borrelia burgdorferi (Lyme disease spirochete)
88 417 Brucella suis
89 417 Clostridium perfringens
90 416 Brucella melitensis
91 415 Corynebacterium glutamicum (Brevibacterium flavum)
92 412 Chlamydia pneumoniae (Chlamydophila pneumoniae)
93 404 Oceanobacillus iheyensis
94 404 Rhizobium sp. (strain NGR234)
95 403 Staphylococcus aureus (strain MRSA252)
96 402 Chlamydia muridarum
97 402 Methanosarcina mazei (Methanosarcina frisia)
98 401 Halobacterium sp. (strain NRC-1 / ATCC 700922 / JCM 11081)
99 400 Staphylococcus aureus (strain MSSA476)
100 390 Pyrococcus furiosus
101 386 Thermoanaerobacter tengcongensis
102 382 Lactobacillus plantarum
103 381 Ovis aries (Sheep)
104 381 Sulfolobus solfataricus
105 380 Campylobacter jejuni
106 380 Neurospora crassa
107 371 Streptococcus pyogenes
108 369 Streptococcus pneumoniae (strain ATCC BAA-255 / R6)
109 368 Nicotiana tabacum (Common tobacco)
110 364 Rickettsia conorii
111 361 Streptococcus mutans
112 357 Synechococcus elongatus (Thermosynechococcus elongatus)
113 345 Pongo pygmaeus (Orangutan)
114 342 Chlorobium tepidum
115 338 Enterococcus faecalis (Streptococcus faecalis)
116 337 Bordetella bronchiseptica (Alcaligenes bronchisepticus)
117 336 Macaca fascicularis (Crab eating macaque) (Cynomolgus monkey)
118 335 Aeropyrum pernix
119 333 Candida albicans (Yeast)
120 333 Bordetella pertussis
121 328 Streptomyces avermitilis
122 327 Bordetella parapertussis
123 327 Haemophilus ducreyi
124 327 Streptococcus pyogenes (serotype M18)
125 325 Chromobacterium violaceum
126 324 Dictyostelium discoideum (Slime mold)
127 323 Streptococcus pyogenes (serotype M3)
128 321 Staphylococcus aureus
129 320 Methanopyrus kandleri
130 310 Corynebacterium efficiens
131 307 Pisum sativum (Garden pea)
132 304 Sulfolobus tokodaii
133 300 Yersinia pseudotuberculosis
134 296 Leptospira interrogans
135 293 Nitrosomonas europaea
136 291 Thermoplasma acidophilum
137 283 Triticum aestivum (Wheat)
138 282 Streptococcus agalactiae (serotype V)
139 281 Streptococcus agalactiae (serotype III)
140 278 Fusobacterium nucleatum (subsp. nucleatum)
141 272 Hordeum vulgare (Barley)
142 268 Lycopersicon esculentum (Tomato)
143 268 Bacteriophage T4
144 266 Glycine max (Soybean)
145 261 Cavia porcellus (Guinea pig)
146 261 Gloeobacter violaceus
147 260 Bacillus cereus (strain ATCC 10987)
148 257 Thermoplasma volcanium
149 256 Solanum tuberosum (Potato)
150 256 Pyrobaculum aerophilum
151 254 Rhodobacter capsulatus (Rhodopseudomonas capsulata)
152 254 Vaccinia virus (strain Copenhagen) (VACV)
153 254 Synechococcus sp. (strain WH8102)
154 250 Pseudomonas putida
155 247 Prochlorococcus marinus (strain MIT 9313)
156 245 Prochlorococcus marinus
157 244 Coxiella burnetii
158 243 Kluyveromyces lactis (Yeast)
159 242 Spinacia oleracea (Spinach)
160 242 Macaca mulatta (Rhesus macaque)
161 242 Clostridium tetani
162 241 Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
163 241 Erwinia carotovora (subsp. atroseptica) (Pectobacterium atrosepticum)
164 236 Bacteroides thetaiotaomicron
165 233 Bacillus stearothermophilus
166 233 Prochlorococcus marinus subsp. pastoris (strain CCMP 1378 / MED4)
167 231 Rhodopseudomonas palustris
168 228 Photobacterium profundum (Photobacterium sp. (strain SS9))
169 225 Wolinella succinogenes
170 225 Wigglesworthia glossinidia brevipalpis
171 224 Equus caballus (Horse)
172 224 Chlamydophila caviae
173 220 Porphyra purpurea
174 220 Ashbya gossypii (Yeast) (Eremothecium gossypii)
175 214 Leptospira interrogans (serogroup Icterohaemorrhagiae / serovar Copenhageni)
176 213 Chlamydomonas reinhardtii
177 212 Bifidobacterium longum
178 209 Klebsiella pneumoniae
179 205 Listeria monocytogenes (serotype 4b / strain F2365)
180 204 Porphyromonas gingivalis (Bacteroides gingivalis)
181 204 Rhodopirellula baltica
182 203 Mycobacterium paratuberculosis
183 200 Acinetobacter sp. (strain ADP1)
184 200 Vaccinia virus (strain Western Reserve / WR) (VACV)
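The share of the first twenty species quoted at the start of section 3 can be recomputed from rows 1 to 20 of the table above. A minimal sketch, with the frequencies and the entry total copied from this release:

```python
# Frequencies of the twenty most represented species (rows 1 to 20 above).
top_twenty = [
    12202, 9228, 5090, 4842, 4300, 3288, 2778, 2777, 2651, 2226,
    1782, 1773, 1738, 1562, 1500, 1412, 1400, 1383, 1157, 1130,
]
total_entries = 181_577  # Swiss-Prot release 47.0

covered = sum(top_twenty)
share = 100 * covered / total_entries
print(f"{covered} sequences: {share:.1f} % of the total number of entries")
```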
3.3 Taxonomic distribution of the sequences
Kingdom sequences (% of the database)
Archaea 9277 ( 5%)
Bacteria 82443 ( 45%)
Eukaryota 80554 ( 44%)
Viruses 9303 ( 5%)
Within Eukaryota:
Category sequences (% of Eukaryota) (% of the complete database)
Human 12203 ( 15%) ( 7%)
Other Mammalia 23783 ( 30%) ( 13%)
Other Vertebrata 7207 ( 9%) ( 4%)
Viridiplantae 12609 ( 16%) ( 7%)
Fungi 11668 ( 14%) ( 6%)
Insecta 4327 ( 5%) ( 2%)
Nematoda 2930 ( 4%) ( 2%)
Other 5827 ( 7%) ( 3%)
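The percentages in section 3.3 are whole-number roundings of simple ratios. The sketch below recomputes them from the counts above; the `percent` helper is illustrative, and rounding to the nearest integer is an assumption that happens to reproduce every figure in the two tables.

```python
# Counts copied from section 3.3 (Swiss-Prot release 47.0).
kingdoms = {"Archaea": 9277, "Bacteria": 82443, "Eukaryota": 80554, "Viruses": 9303}
eukaryota = {
    "Human": 12203, "Other Mammalia": 23783, "Other Vertebrata": 7207,
    "Viridiplantae": 12609, "Fungi": 11668, "Insecta": 4327,
    "Nematoda": 2930, "Other": 5827,
}


def percent(part: int, whole: int) -> int:
    """Share of `part` in `whole`, rounded to a whole percent."""
    return round(100 * part / whole)


total = sum(kingdoms.values())
euk_total = kingdoms["Eukaryota"]

for name, count in kingdoms.items():
    print(f"{name:<18}{count:>7} ({percent(count, total):>3}%)")

print("Within Eukaryota:")
for name, count in eukaryota.items():
    print(f"{name:<18}{count:>7} "
          f"({percent(count, euk_total):>3}%) ({percent(count, total):>3}%)")
```

The kingdom counts add up to the 181'577 entries of the release, and the eukaryotic categories add up to the Eukaryota count.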
4. SEQUENCE SIZE
Distribution of the sequences by size (excluding fragments)
From To Number From To Number
1- 50 3796 1001-1100 1494
51- 100 12862 1101-1200 1068
101- 150 18457 1201-1300 767
151- 200 17580 1301-1400 591
201- 250 18107 1401-1500 454
251- 300 15445 1501-1600 290
301- 350 16195 1601-1700 216
351- 400 14600 1701-1800 162
401- 450 11251 1801-1900 177
451- 500 9483 1901-2000 141
501- 550 7016 2001-2100 86
551- 600 4881 2101-2200 131
601- 650 4022 2201-2300 115
651- 700 2888 2301-2400 76
701- 750 2437 2401-2500 63
751- 800 2037 >2500 461
801- 850 1633
851- 900 1804
901- 950 1267
951-1000 1040
The average sequence length in Swiss-Prot is 362 amino acids.
The shortest sequence is GWA_SEPOF (P83570): 2 amino acids.
The longest sequence is SYNE1_HUMAN (Q8NF91): 8797 amino acids.
5. JOURNAL CITATIONS
Note: the following citation statistics reflect the number of distinct
journal citations.
Total number of journals cited in this release of Swiss-Prot: 1579
5.1 Table of the frequency of journal citations
Journals cited 1x: 570
2x: 219
3x: 108
4x: 74
5x: 58
6x: 31
7x: 38
8x: 27
9x: 22
10x: 14
11- 20x: 123
21- 50x: 127
51-100x: 55
>100x: 113
5.2 List of the most cited journals in Swiss-Prot
Nb Citations Journal name
-- --------- -------------------------------------------------------------
1 11906 Journal of Biological Chemistry
2 6037 Proceedings of the National Academy of Sciences of the U.S.A.
3 4124 Journal of Bacteriology
4 3852 Gene
5 3833 Nucleic Acids Research
6 3227 Biochemical and Biophysical Research Communications
7 3188 FEBS Letters
8 2861 Biochemistry
9 2776 European Journal of Biochemistry
10 2674 The EMBO Journal
11 2443 Nature
12 2410 Biochimica et Biophysica Acta
13 2180 Journal of Molecular Biology
14 2076 Genomics
15 2006 Molecular and Cellular Biology
16 1960 Cell
17 1567 Biochemical Journal
18 1458 Science
19 1302 Molecular Microbiology
20 1235 Plant Molecular Biology
21 1225 Molecular and General Genetics
22 1001 Journal of Biochemistry
23 981 Journal of Cell Biology
24 943 Virology
25 927 Human Molecular Genetics
26 857 Nature Genetics
27 797 Genes and Development
28 796 Journal of Virology
29 744 The American Journal of Human Genetics
30 743 Oncogene
31 720 Plant Physiology
32 708 Human Mutation
33 648 Journal of Immunology
34 635 Infection and Immunity
35 623 Archives of Biochemistry and Biophysics
36 615 Yeast
37 610 Structure
38 567 Development
39 561 Journal of General Virology
40 539 Microbiology
41 521 Genetics
42 507 FEMS Microbiology Letters
43 492 Nature Structural Biology
44 448 Human Genetics
45 448 Blood
46 443 Current Genetics
47 387 Molecular and Biochemical Parasitology
48 384 Applied and Environmental Microbiology
49 378 Molecular Biology of the Cell
50 372 Journal of Clinical Investigation
51 363 Developmental Biology
52 359 Mammalian Genome
53 356 Cancer Research
54 353 Molecular Endocrinology
55 352 The Plant Cell
56 351 Protein Science
57 338 Acta Crystallographica, Section D
58 334 Journal of Cell Science
59 333 Immunogenetics
60 333 Mechanisms of Development
61 332 Neuron
62 324 The Journal of Experimental Medicine
63 320 Journal of Molecular Evolution
64 311 DNA and Cell Biology
65 305 The Plant Journal
66 292 Journal of Neuroscience
67 286 Endocrinology
68 282 Biological Chemistry Hoppe-Seyler
69 273 DNA Sequence
70 263 Molecular Cell
71 260 Journal of Neurochemistry
72 249 Molecular Biology and Evolution
73 247 The Journal of Clinical Endocrinology and Metabolism
74 245 Current Biology
75 239 Journal of General Microbiology
76 239 Brain Research. Molecular Brain Research
77 232 Toxicon
78 229 Bioscience, Biotechnology, and Biochemistry
79 222 American Journal of Physiology
80 221 Cytogenetics and Cell Genetics
81 214 Comparative Biochemistry and Physiology
82 214 Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
83 186 Molecular Pharmacology
84 185 Antimicrobial Agents and Chemotherapy
85 173 Proteins
86 172 Journal of Investigative Dermatology
87 163 Journal of Medical Genetics
88 158 DNA Research
89 158 DNA
90 155 Peptides
91 154 Molecular Plant-Microbe Interactions
92 152 Genome Research
93 152 Virus Research
94 150 American Journal of Medical Genetics
95 148 Tissue Antigens
96 143 Biochimie
97 139 Biology of Reproduction
98 138 Bioorganicheskaia Khimiia
99 135 Hemoglobin
100 134 European Journal of Immunology
101 130 Molecular and Cellular Endocrinology
102 130 Plant and Cell Physiology
103 117 Insect Biochemistry and Molecular Biology
104 116 Agricultural and Biological Chemistry
105 114 Archives of Microbiology
106 114 Molecular Phylogenetics and Evolution
107 107 General and Comparative Endocrinology
108 107 Annals of Neurology
109 104 European Journal of Human Genetics
110 103 Diabetes
111 103 Experimental Cell Research
112 102 Journal of Human Genetics
113 102 Neurology
6. STATISTICS FOR SOME LINE TYPES
The following table summarizes the total number of some Swiss-Prot lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.
Total Number of Average
Line type / subtype number entries per entry
--------------------------------- -------- --------- ---------
References (RL) 354347 1.95
Journal 314613 170221 1.73
Submitted to EMBL/GenBank/DDBJ 36948 31617 0.20
Submitted to Swiss-Prot 646 643 <0.01
Plant Gene Register 500 488 <0.01
Book citation 490 478 <0.01
Unpublished observations 397 393 <0.01
Thesis 288 286 <0.01
Submitted to other databases 254 250 <0.01
Patent 122 120 <0.01
Unpublished results 83 81 <0.01
Worm Breeder's Gazette 6 6 <0.01
Comments (CC) 669562 3.69
SIMILARITY 192885 162260 1.06
FUNCTION 121080 118343 0.67
SUBCELLULAR LOCATION 90200 90200 0.50
CATALYTIC ACTIVITY 64662 60703 0.36
SUBUNIT 58853 58853 0.32
PATHWAY 32505 29804 0.18
COFACTOR 21977 21977 0.12
TISSUE SPECIFICITY 19565 19565 0.11
PTM 12088 10734 0.07
MISCELLANEOUS 10225 9394 0.06
DOMAIN 8227 7239 0.05
ALTERNATIVE PRODUCTS 7024 7024 0.04
CAUTION 6313 5604 0.03
INDUCTION 5029 5029 0.03
DEVELOPMENTAL STAGE 4666 4666 0.03
INTERACTION 3083 3083 0.02
DISEASE 2933 2140 0.02
ENZYME REGULATION 2551 2551 0.01
MASS SPECTROMETRY 1754 1532 0.01
DATABASE 1302 1241 0.01
BIOPHYSICOCHEMICAL PROPERTIES 961 961 0.01
POLYMORPHISM 504 491 <0.01
ALLERGEN 380 380 <0.01
RNA EDITING 355 355 <0.01
TOXIC DOSE 269 268 <0.01
BIOTECHNOLOGY 116 116 <0.01
PHARMACEUTICAL 55 55 <0.01
Features (FT) 1008341 5.55
TRANSMEM 115652 25159 0.64
METAL 70212 17480 0.39
CONFLICT 67057 23460 0.37
TURN 62464 4662 0.34
CARBOHYD 59729 14988 0.33
STRAND 57266 4165 0.32
DISULFID 54939 14707 0.30
TOPO_DOM 52817 11324 0.29
DOMAIN 48227 25421 0.27
HELIX 45089 4519 0.25
ACT_SITE 41384 24663 0.23
REPEAT 38277 5571 0.21
VARIANT 33268 6451 0.18
CHAIN 29926 24316 0.16
NP_BIND 26120 18212 0.14
MOD_RES 22591 11680 0.12
REGION 21623 10792 0.12
SIGNAL 19085 19083 0.11
COMPBIAS 17474 9465 0.10
BINDING 15645 10259 0.09
VARSPLIC 14106 6212 0.08
SITE 11811 6609 0.07
ZN_FING 11593 4480 0.06
MUTAGEN 11144 2918 0.06
NON_TER 10952 8325 0.06
MOTIF 8401 6395 0.05
INIT_MET 8137 8073 0.04
PROPEP 6236 5233 0.03
DNA_BIND 5521 5189 0.03
LIPID 5352 3532 0.03
COILED 4676 2837 0.03
PEPTIDE 3812 1774 0.02
TRANSIT 3214 3184 0.02
CA_BIND 2288 928 0.01
NON_CONS 1052 506 0.01
CROSSLNK 592 480 <0.01
UNSURE 416 170 <0.01
SE_CYS 193 135 <0.01
Cross-references (DR) 1842067 10.14
InterPro 362455 164421 2.00
EMBL 350521 173966 1.93
Pfam 211867 155896 1.17
PROSITE 163750 101742 0.90
PIR 92542 85789 0.51
GO 83471 23262 0.46
HSSP 71939 71939 0.40
PRINTS 67421 52265 0.37
TIGRFAMs 64396 60065 0.35
HAMAP 58792 58682 0.32
ProDom 49046 47123 0.27
SMART 44819 34097 0.25
Ensembl 34859 34856 0.19
PDB 29940 8125 0.16
SMR 23589 23589 0.13
TIGR 17791 17285 0.10
PIRSF 13545 13348 0.07
Genew 11320 11263 0.06
MIM 10709 8803 0.06
MGI 8850 8809 0.05
PANTHER 7807 7795 0.04
SGD 5140 5079 0.03
GermOnline 4927 4877 0.03
EcoGene 4225 4223 0.02
EchoBASE 4159 4127 0.02
IntAct 3946 3946 0.02
MEROPS 3726 3615 0.02
H-InvDB 3677 3659 0.02
WormPep 3036 2648 0.02
RGD 3010 3007 0.02
FlyBase 2825 2797 0.02
GeneDB_Spombe 2806 2776 0.02
TRANSFAC 2749 2465 0.02
SubtiList 2727 2726 0.02
WormBase 2710 2635 0.01
StyGene 1454 1451 0.01
TubercuList 1428 1392 0.01
SWISS-2DPAGE 1132 1132 0.01
ListiList 997 989 0.01
GeneFarm 952 948 0.01
Reactome 720 720 <0.01
Gramene 641 609 <0.01
Leproma 622 618 <0.01
PhotoList 444 444 <0.01
ZFIN 434 427 <0.01
MaizeDB 423 418 <0.01
HIV 370 365 <0.01
REBASE 368 363 <0.01
OGP 365 365 <0.01
ECO2DBASE 351 299 <0.01
DictyBase 325 323 <0.01
GlycoSuiteDB 283 283 <0.01
SagaList 282 281 <0.01
PHCI-2DPAGE 239 239 <0.01
AGD 226 220 <0.01
LegioList 184 184 <0.01
MypuList 173 173 <0.01
Aarhus/Ghent-2DPAGE 128 98 <0.01
Siena-2DPAGE 103 103 <0.01
HSC-2DPAGE 85 85 <0.01
COMPLUYEAST-2DPAGE 59 59 <0.01
PhosSite 54 54 <0.01
PMMA-2DPAGE 52 52 <0.01
Maize-2DPAGE 39 39 <0.01
Rat-heart-2DPAGE 28 28 <0.01
ANU-2DPAGE 14 14 <0.01
Number of explicitly cross-referenced databases: 67
Number of implicitly cross-referenced databases: 31
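In the table above, the "Average per entry" column is the total number of lines divided by the number of entries in the whole release, not by the number of entries that carry such a line. A minimal sketch that reproduces the column for the four main line types and one subtype, with the totals copied from the table:

```python
ENTRIES = 181_577  # Swiss-Prot release 47.0

# Total number of lines per type, copied from the table above.
line_totals = {
    "References (RL)": 354_347,
    "Comments (CC)": 669_562,
    "Features (FT)": 1_008_341,
    "Cross-references (DR)": 1_842_067,
    "SIMILARITY": 192_885,  # a comment subtype, for comparison
}

averages = {name: total / ENTRIES for name, total in line_totals.items()}
for name, average in averages.items():
    print(f"{name:<24}{average:6.2f}")
```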
7. MISCELLANEOUS STATISTICS
Total number of distinct authors cited in Swiss-Prot: 201756
Total number of entries encoded on a chloroplast: 4293
Total number of entries encoded on a mitochondrion: 3318
Total number of entries encoded on a cyanelle: 145
Total number of entries encoded on a plasmid: 3019
Number of fragments: 8484
Number of additional sequences encoded on splice variants: 10767
| UniProt/TrEMBL protein database release 30.0 statistics |
|---|
1. INTRODUCTION
Release 30.0 of 10-May-2005 of UniProt/TrEMBL has been produced in sync
with UniProt/Swiss-Prot release 47 and EMBL/DDBJ/GenBank nucleotide
sequence database release 81 and its updates up to 16-April-2005. It contains
1'714'475 sequence entries, comprising 540'729'498 amino acids.
149'924 sequences have been added since release 29. This represents an
increase of 11.24%.
In the document delac_tr.txt, you will find a list of all accession numbers
which were previously present in UniProt/TrEMBL, but which have now been
deleted from the database. Most deletions are due to the deletion of the
corresponding CDS in the source nucleotide sequence databases
EMBL-Bank/DDBJ/GenBank. In addition, some entries are recognised to be Open
Reading Frames (ORFs) that have been wrongly predicted to code for proteins.
When there is enough evidence that these hypothetical proteins are not real,
we take the decision to remove them from TrEMBL.
2. AMINO ACID COMPOSITION
2.1 Composition in percent for the complete database
Ala (A) 7.72 Gln (Q) 3.88 Leu (L) 9.73 Ser (S) 7.10
Arg (R) 5.30 Glu (E) 6.08 Lys (K) 5.56 Thr (T) 5.73
Asn (N) 4.47 Gly (G) 6.89 Met (M) 2.41 Trp (W) 1.37
Asp (D) 5.10 His (H) 2.27 Phe (F) 4.14 Tyr (Y) 3.14
Cys (C) 1.50 Ile (I) 6.03 Pro (P) 4.94 Val (V) 6.48
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.07
2.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Ile, Thr, Lys, Arg, Asp, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
3. TAXONOMIC ORIGIN
Total number of species represented in this release of
UniProt/TrEMBL: 89807
The first twenty species represent 499903 sequences: 29.2 % of the
total number of entries.
3.1 Table of the frequency of occurrence of species
Species represented 1x: 44301
2x: 17044
3x: 8565
4x: 4547
5x: 2681
6x: 2006
7x: 1349
8x: 1150
9x: 935
10x: 816
11- 20x: 2983
21- 50x: 1787
51-100x: 721
>100x: 922
3.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 126858 Human immunodeficiency virus 1
2 56039 Homo sapiens (Human)
3 47506 Oryza sativa (japonica cultivar-group)
4 39737 Arabidopsis thaliana (Mouse-ear cress)
5 38848 Mus musculus (Mouse)
6 24423 Drosophila melanogaster (Fruit fly)
7 22971 Hepatitis C virus
8 20188 Caenorhabditis elegans
9 15226 Anopheles gambiae str. PEST
10 13201 Caenorhabditis briggsae
11 12210 Brachydanio rerio (Zebrafish) (Danio rerio)
12 10974 Neurospora crassa
13 10718 Xenopus laevis (African clawed frog)
14 9678 Schistosoma japonicum (Blood fluke)
15 9528 Aspergillus nidulans FGSC A4
16 9240 Candida albicans SC5314
17 9048 Rattus norvegicus (Rat)
18 8142 Bradyrhizobium japonicum
19 7802 Plasmodium yoelii yoelii
20 7566 Streptomyces coelicolor
21 7397 Hepatitis B virus
22 7379 Streptomyces avermitilis
23 7183 Rhizobium loti (Mesorhizobium loti)
24 7081 uncultured bacterium
25 7067 Rhodopirellula baltica
26 7006 Escherichia coli
27 7006 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
28 6575 Cryptococcus neoformans (Filobasidiella neoformans)
29 6493 Pseudomonas aeruginosa
30 6463 Yarrowia lipolytica (Candida lipolytica)
31 6394 Giardia lamblia ATCC 50803
32 6275 Bacillus anthracis
33 6241 Debaryomyces hansenii (Yeast) (Torulaspora hansenii)
34 5803 Nocardia farcinica
35 5747 Burkholderia pseudomallei (Pseudomonas pseudomallei)
36 5696 Rhizobium meliloti (Sinorhizobium meliloti)
37 5565 Anabaena sp. (strain PCC 7120)
38 5560 Bacillus cereus (strain ATCC 10987)
39 5374 Gallus gallus (Chicken)
40 5228 Plasmodium falciparum (isolate 3D7)
41 5217 Yersinia pestis
42 5197 Trypanosoma brucei
43 5195 Kluyveromyces lactis (Yeast)
44 5136 Helicobacter pylori (Campylobacter pylori)
45 5121 Photobacterium profundum (Photobacterium sp. (strain SS9))
46 5106 Candida glabrata (Yeast) (Torulopsis glabrata)
47 4969 Pseudomonas syringae (pv. tomato)
48 4964 Bacillus cereus (strain ZK)
49 4947 Bordetella bronchiseptica (Alcaligenes bronchisepticus)
50 4894 Bacillus thuringiensis (subsp. konkukian)
51 4889 Escherichia coli O157:H7
52 4867 Bacillus licheniformis (strain DSM 13 / ATCC 14580)
53 4806 Bacillus cereus (strain ATCC 14579 / DSM 31)
54 4781 Pseudomonas putida (strain KT2440)
55 4772 Bacteroides fragilis
56 4728 Ralstonia solanacearum (Pseudomonas solanacearum)
57 4629 Xanthomonas oryzae (pv. oryzae)
58 4616 Rhodopseudomonas palustris
59 4607 Bacteroides thetaiotaomicron
60 4586 Leptospira interrogans
61 4530 Ashbya gossypii (Yeast) (Eremothecium gossypii)
62 4470 Vibrio vulnificus (strain YJ016)
63 4441 Azoarcus sp. (strain EbN1)
64 4407 Oryza sativa (Rice)
65 4404 Pongo pygmaeus (Orangutan)
66 4404 Burkholderia mallei (Pseudomonas mallei)
67 4396 Vibrio parahaemolyticus
68 4319 Mycobacterium tuberculosis
69 4247 Erwinia carotovora (subsp. atroseptica) (Pectobacterium atrosepticum)
70 4241 Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67
71 4212 Mycobacterium paratuberculosis
72 4181 Silicibacter pomeroyi
73 4150 Gloeobacter violaceus
74 4146 Shewanella oneidensis
75 4115 Photorhabdus luminescens (subsp. laumondii)
76 4105 Haloarcula marismortui (Halobacterium marismortui)
77 4086 Chromobacterium violaceum
78 4060 Corynebacterium glutamicum (Brevibacterium flavum)
79 4058 Methanosarcina acetivorans
80 4034 Plasmodium falciparum
81 4031 Cryptosporidium parvum
82 4027 Salmonella typhi
83 4027 Vibrio vulnificus
84 3989 Vibrio cholerae
85 3978 Cryptosporidium hominis
86 3958 Salmonella paratyphi-a
87 3941 Bacillus clausii (strain KSM-K16)
88 3938 Yersinia pseudotuberculosis
89 3927 Shigella flexneri
90 3926 Escherichia coli O6
91 3911 Xanthomonas axonopodis (pv. citri)
92 3850 Bordetella parapertussis
93 3769 Vibrio fischeri (strain ATCC 700601 / ES114)
94 3768 Listeria monocytogenes
95 3755 Bos taurus (Bovine)
96 3753 Salmonella typhimurium
97 3720 Xanthomonas campestris (pv. campestris)
98 3588 Enterococcus faecalis (Streptococcus faecalis)
99 3562 Bacillus halodurans
100 3539 Streptococcus pneumoniae
101 3501 Torque teno virus
102 3477 Bdellovibrio bacteriovorus
103 3436 Leptospira interrogans (serogroup Icterohaemorrhagiae / serovar Copenhageni)
104 3407 Clostridium acetobutylicum
105 3407 Geobacillus kaustophilus
106 3325 Desulfovibrio vulgaris (strain Hildenborough / ATCC 29579 / NCIMB 8303)
107 3321 Caulobacter crescentus
108 3290 Chimpanzee immunodeficiency virus (SIV(cpz)) (CIV)
109 3225 Dictyostelium discoideum (Slime mold)
110 3214 Geobacter sulfurreducens
111 3213 Xenopus tropicalis (Western clawed frog) (Silurana tropicalis)
112 3192 Symbiobacterium thermophilum
113 3140 Acinetobacter sp. (strain ADP1)
114 3113 Desulfotalea psychrophila
115 3104 Streptococcus pyogenes
116 3086 Oceanobacillus iheyensis
117 3076 Brucella abortus biovar 1 str. 9-941
118 3050 Legionella pneumophila (strain Paris)
119 3033 Bordetella pertussis
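Rows of the table above have a fixed shape: a rank, a frequency, then the species name, which may itself contain digits and parentheses. The following sketch parses such rows with a regular expression; the pattern and the function name `parse_species_row` are illustrative assumptions, and the sample lines are copied from the table.

```python
import re

# rank, frequency, then the rest of the line as the species name
ROW = re.compile(r"^(\d+)\s+(\d+)\s+(.+)$")


def parse_species_row(line: str) -> tuple:
    """Return (rank, frequency, species) for one table row."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"not a species row: {line!r}")
    rank, frequency, species = match.groups()
    return int(rank), int(frequency), species


sample = """\
1 126858 Human immunodeficiency virus 1
2 56039 Homo sapiens (Human)
9 15226 Anopheles gambiae str. PEST
"""

rows = [parse_species_row(line) for line in sample.splitlines()]
total_entries = 1_714_475  # UniProt/TrEMBL release 30.0

for rank, frequency, species in rows:
    share = 100 * frequency / total_entries
    print(f"{rank:>3} {species}: {share:.1f}% of all entries")
```

Because only the first two fields are matched as numbers, the trailing "1" in "Human immunodeficiency virus 1" stays part of the species name.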
3.3 Taxonomic distribution of the sequences
Kingdom sequences (% of the database)
Archaea 45239 ( 3%)
Bacteria 633961 ( 37%)
Eukaryota 728869 ( 43%)
Viruses 304257 ( 18%)
Other 2149 ( <1%)
Within Eukaryota:
Category sequences (% of Eukaryota) (% of the complete database)
Human 56039 ( 8%) ( 3%)
Other Mammalia 92879 ( 13%) ( 5%)
Other Vertebrata 92476 ( 13%) ( 5%)
Viridiplantae 177898 ( 24%) ( 10%)
Fungi 89518 ( 12%) ( 5%)
Insecta 85761 ( 12%) ( 5%)
Nematoda 35993 ( 5%) ( 2%)
Other 98305 ( 13%) ( 6%)
4. SEQUENCE SIZE
4.1 Distribution of the sequences by size (excluding fragments)
From To Number From To Number
1- 50 21155 1001-1100 9713
51- 100 102798 1101-1200 6992
101- 150 128388 1201-1300 5303
151- 200 117224 1301-1400 3404
201- 250 118574 1401-1500 2846
251- 300 109753 1501-1600 1953
301- 350 107083 1601-1700 1551
351- 400 86474 1701-1800 1314
401- 450 67185 1801-1900 1058
451- 500 58853 1901-2000 876
501- 550 46252 2001-2100 675
551- 600 32250 2101-2200 817
601- 650 24775 2201-2300 679
651- 700 19308 2301-2400 536
701- 750 16492 2401-2500 377
751- 800 13713 >2500 3339
801- 850 11539
851- 900 10187
901- 950 7584
951-1000 6052
4.2 Longest and shortest sequences
The shortest sequence is Q16047_HUMAN: 4 amino acids.
The longest sequence is Q8WZ42_HUMAN: 34350 amino acids.
5. STATISTICS FOR SOME LINE TYPES
The following table summarizes the total number of some UniProt/TrEMBL
lines, as well as the number of entries with at least one such line, and the
frequency of the lines.
Total Number of Average
Line type / subtype number entries per entry
--------------------------------- -------- --------- ---------
References (RL) 2369863 1.38
Journal 1496311 1260307 0.87
Submitted to EMBL/GenBank/DDBJ 860323 677541 0.50
Thesis 4686 4634 <0.01
Book citation 3792 3748 <0.01
Submitted to other databases 448 440 <0.01
Other 4303 4302 <0.01
Comments (CC) 964235 0.56
SIMILARITY 174303 171382 0.10
FUNCTION 172046 170766 0.10
CATALYTIC ACTIVITY 170005 152037 0.10
SUBCELLULAR LOCATION 159288 159288 0.09
SUBUNIT 92437 92437 0.05
CAUTION 75319 75218 0.04
PATHWAY 54200 53057 0.03
COFACTOR 56256 56256 0.03
INTERACTION 1161 1161 <0.01
MISCELLANEOUS 3581 3564 <0.01
DOMAIN 5309 4658 <0.01
ALLERGEN 172 172 <0.01
Features (FT) 1014689 0.59
NON_TER 960278 565239 0.56
CHAIN 40998 24408 0.02
SIGNAL 12809 12587 0.01
TRANSIT 604 600 <0.01
Cross-references (DR) 12951295 7.55
GO 3878298 1080665 2.26
InterPro 2376394 1270589 1.39
EMBL 2019928 1708137 1.18
Pfam 1588390 1198364 0.93
PROSITE 827908 540191 0.48
PRINTS 392823 317615 0.23
SMART 295055 227125 0.17
HSSP 290497 290218 0.17
SMR 248032 247914 0.14
ProDom 204145 195938 0.12
PIR 197872 162162 0.12
TIGRFAMs 181672 168057 0.11
TIGR 92497 86467 0.05
Ensembl 73110 73097 0.04
PANTHER 54609 54599 0.03
Gramene 43206 43193 0.03
PIRSF 32718 31848 0.02
FlyBase 29155 22548 0.02
MGI 24517 24515 0.01
WormPep 19095 19014 0.01
WormBase 19083 19014 0.01
ZFIN 8839 8837 0.01
MEROPS 8517 8250 <0.01
LegioList 5711 5681 <0.01
IntAct 5352 5352 <0.01
ListiList 4818 4801 <0.01
AGD 4483 4483 <0.01
PhotoList 4236 4112 <0.01
Genew 3264 3264 <0.01
PDB 2759 1629 <0.01
TubercuList 2494 2488 <0.01
RGD 2474 2459 <0.01
GeneDB_Spombe 2125 2119 <0.01
SagaList 1812 1718 <0.01
SGD 1374 1373 <0.01
TRANSFAC 1030 1017 <0.01
Leproma 985 984 <0.01
DictyBase 980 980 <0.01
MypuList 609 605 <0.01
REBASE 125 120 <0.01
PHCI-2DPAGE 108 108 <0.01
SWISS-2DPAGE 87 87 <0.01
ANU-2DPAGE 73 73 <0.01
Reactome 30 30 <0.01
PMMA-2DPAGE 3 3 <0.01
Siena-2DPAGE 2 2 <0.01
COMPLUYEAST-2DPAGE 1 1 <0.01
Number of explicitly cross-referenced databases: 68
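Counts such as those in the cross-references part of the table can be obtained by tallying DR lines per database. The sketch below assumes the flat-file layout in which each line starts with a two-letter line code and a DR line names the database in its first semicolon-separated field. The sample entry is entirely hypothetical (placeholder identifiers, not real records), and `count_cross_references` is an illustrative name.

```python
from collections import Counter


def count_cross_references(entry_text: str) -> Counter:
    """Tally DR lines of one flat-file entry by database name."""
    counts = Counter()
    for line in entry_text.splitlines():
        if line.startswith("DR "):
            # Drop the line code, then keep the field before the first ';'.
            database = line[2:].strip().split(";", 1)[0].strip()
            counts[database] += 1
    return counts


# Hypothetical entry fragment with placeholder identifiers.
entry = """\
ID   EXAMPLE_HUMAN
DR   EMBL; X00000; CAA00000.1; -.
DR   InterPro; IPR000000; Example_domain.
DR   Pfam; PF00000; Example; 1.
DR   GO; GO:0000000; F:example; IEA.
DR   GO; GO:0000001; P:example; IEA.
"""

counts = count_cross_references(entry)
for database, number in counts.most_common():
    print(f"{database:<10}{number}")
```

Summing such counters over all entries would give the "Total number" column. The "Number of entries" column would instead add one per database for each entry in which that database appears at least once.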
6. MISCELLANEOUS STATISTICS
Total number of distinct authors cited in UniProt/TrEMBL: 210250
Total number of entries encoded on a chloroplast: 41294
Total number of entries encoded on a mitochondrion: 99934
Total number of entries encoded on a cyanelle: 2
Total number of entries encoded on a plasmid: 33862
Number of fragments: 567403
Number of additional sequences encoded on splice variants: 55
| Submissions and Updates |
|---|
We welcome feedback from our users. We would especially appreciate your notifying us if you find that sequences belonging to your field of expertise are missing from the database. We also would like to be notified about annotations to be updated, if, for example, the function of a protein has been clarified or if new information about post-translational modifications has become available.
Submit new sequence data, updates and corrections at http://www.uniprot.org/support/submissions.shtml
For all queries regarding submissions to UniProt and to submit new protein sequence data, please contact:
UniProt Knowledgebase
The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Telephone: (+44 1223) 494 462
Telefax: (+44 1223) 494 468
E-mail:
| Download information |
|---|
The latest data of the UniProt Knowledgebase is available in various formats (flatfile, XML or FASTA) at http://www.uniprot.org/database/download.shtml. The data is further supplemented by two files containing the sequences of all additional splice isoforms annotated in UniProt/Swiss-Prot and UniProt/TrEMBL. These data sets are documented in the file ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/README.varsplic
For users who wish to download the UniProt Knowledgebase only occasionally, we distribute the latest major release (updated 4 times per year) in flatfile format. Previous UniProt/Swiss-Prot and UniProt/TrEMBL releases are archived under ftp://ftp.uniprot.org/databases/uniprot/previous_major_releases. The UniProt Knowledgebase major release is also available on CD-ROM from the EBI.
| Citation |
|---|
If you want to cite UniProt in a publication, please use the following reference:
Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O'Donovan C., Redaschi N., Yeh L.S., The Universal Protein Resource (UniProt), Nucleic Acids Res. 33: D154-D159 (2005).