UniProt Knowledgebase release notes: UniProtKB release 6.0 of 13-Sep-2005
| Content |
|---|
Related documents: UniProtKB user manual, Recent changes, Forthcoming changes.
| Introduction |
|---|
Release 6.0 of the UniProt Knowledgebase is composed of the UniProtKB/Swiss-Prot Protein Knowledgebase release 48.0 and the UniProtKB/TrEMBL Protein Database release 31.0.
More information on these databases can be found in the user manual under "What is the UniProt Knowledgebase?".
| UniProtKB/Swiss-Prot protein knowledgebase release 48.0 statistics |
|---|
Release 48.0 of 13-Sep-2005 of Swiss-Prot contains 194'317 sequence entries, comprising 70'391'852 amino acids abstracted from 133'723 references.
The growth of the database is summarized below.
| Release | Date | Number of entries | Number of amino acids |
|---|---|---|---|
| 2.0 | 09/86 | 3'939 | 900'163 |
| 3.0 | 11/86 | 4'160 | 969'641 |
| 4.0 | 04/87 | 4'387 | 1'036'010 |
| 5.0 | 09/87 | 5'205 | 1'327'683 |
| 6.0 | 01/88 | 6'102 | 1'653'982 |
| 7.0 | 04/88 | 6'821 | 1'885'771 |
| 8.0 | 08/88 | 7'724 | 2'224'465 |
| 9.0 | 11/88 | 8'702 | 2'498'140 |
| 10.0 | 03/89 | 10'008 | 2'952'613 |
| 11.0 | 07/89 | 10'856 | 3'265'966 |
| 12.0 | 10/89 | 12'305 | 3'797'482 |
| 13.0 | 01/90 | 13'837 | 4'347'336 |
| 14.0 | 04/90 | 15'409 | 4'914'264 |
| 15.0 | 08/90 | 16'941 | 5'486'399 |
| 16.0 | 11/90 | 18'364 | 5'986'949 |
| 17.0 | 02/91 | 20'024 | 6'524'504 |
| 18.0 | 05/91 | 20'772 | 6'792'034 |
| 19.0 | 08/91 | 21'795 | 7'173'785 |
| 20.0 | 11/91 | 22'654 | 7'500'130 |
| 21.0 | 03/92 | 23'742 | 7'866'596 |
| 22.0 | 05/92 | 25'044 | 8'375'696 |
| 23.0 | 08/92 | 26'706 | 9'011'391 |
| 24.0 | 12/92 | 28'154 | 9'545'427 |
| 25.0 | 04/93 | 29'955 | 10'214'020 |
| 26.0 | 07/93 | 31'808 | 10'875'091 |
| 27.0 | 10/93 | 33'329 | 11'484'420 |
| 28.0 | 02/94 | 36'000 | 12'496'420 |
| 29.0 | 06/94 | 38'303 | 13'464'008 |
| 30.0 | 10/94 | 40'292 | 14'147'368 |
| 31.0 | 02/95 | 43'470 | 15'335'248 |
| 32.0 | 11/95 | 49'340 | 17'385'503 |
| 33.0 | 02/96 | 52'205 | 18'531'384 |
| 34.0 | 10/96 | 59'021 | 21'210'389 |
| 35.0 | 11/97 | 69'113 | 25'083'768 |
| 36.0 | 07/98 | 74'019 | 26'840'295 |
| 37.0 | 12/98 | 77'977 | 28'268'293 |
| 38.0 | 07/99 | 80'000 | 29'085'965 |
| 39.0 | 05/00 | 86'593 | 31'411'114 |
| 40.0 | 10/01 | 101'602 | 37'315'215 |
| 41.0 | 02/03 | 122'564 | 44'986'459 |
| 42.0 | 10/03 | 135'850 | 50'046'799 |
| 43.0 | 03/04 | 146'720 | 54'093'154 |
| 44.0 | 07/04 | 153'871 | 56'608'159 |
| 45.0 | 10/04 | 163'235 | 59'631'787 |
| 46.0 | 02/05 | 168'297 | 61'443'278 |
| 47.0 | 05/05 | 181'577 | 65'746'672 |
| 48.0 | 09/05 | 194'317 | 70'391'852 |
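The figures in the table above use an apostrophe as the thousands separator, so they need a small conversion step before any arithmetic. The following is a minimal sketch of reading two rows and computing the net change in entry count between releases 47.0 and 48.0. The helper names `parse_count` and `parse_row` are illustrative assumptions, not part of any UniProt tooling, and the net change in the table need not equal the number of newly added sequences.

```python
def parse_count(text):
    """Convert an apostrophe-grouped figure such as "194'317" to an int."""
    return int(text.replace("'", ""))


def parse_row(row):
    """Split one table row into (release, date, entries, amino_acids)."""
    release, date, entries, residues = [
        cell.strip() for cell in row.strip().strip("|").split("|")
    ]
    return release, date, parse_count(entries), parse_count(residues)


# Last two rows of the growth table, copied verbatim.
rows = [
    "| 47.0 | 05/05 | 181'577 | 65'746'672 |",
    "| 48.0 | 09/05 | 194'317 | 70'391'852 |",
]

prev, curr = (parse_row(r) for r in rows)
net_entries = curr[2] - prev[2]
net_growth = net_entries / prev[2]

print(f"{prev[0]} -> {curr[0]}: {net_entries} more entries ({net_growth:.1%})")
```

The same two helpers can be applied to every row of the table to obtain the growth between any pair of releases.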
In rare cases, Swiss-Prot entries are removed. Deleted entries are almost exclusively Open Reading Frames (ORFs) that have been wrongly predicted to code for proteins. When there is enough evidence that these hypothetical proteins are not real, we remove them from Swiss-Prot. The document delac_sp.txt lists all accession numbers that were previously present in UniProtKB/Swiss-Prot but have since been deleted from the database.
We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:
From our efforts to annotate human sequence entries as completely as possible arose the HPI project, and the bacterial model organisms became the focus of the HAMAP project. Here is the current status of the model organisms which are not covered by these two projects:
| Organism | Database cross-references | Index file | Number of sequences |
|---|---|---|---|
| A.thaliana | TAIR | arath.txt | 3'609 |
| C.albicans | None yet | calbican.txt | 390 |
| C.elegans | Wormpep | celegans.txt | 2'667 |
| D.discoideum | DictyBase | dicty.txt | 325 |
| D.melanogaster | FlyBase | fly.txt | 2'273 |
| M.musculus | MGD | mgdtosp.txt | 9'933 |
| S.cerevisiae | SGD | yeast.txt | 5'139 |
| S.pombe | GeneDB_SPombe | pombe.txt | 2'840 |
1. INTRODUCTION
Release 48.0 of 13-Sep-2005 of Swiss-Prot contains 194'317 sequence entries,
comprising 70'391'852 amino acids abstracted from 133'723 references.
11'963 sequences have been added since release 47, the sequence data of
1'095 existing entries has been updated and the annotations of
93'692 entries have been revised. This represents an increase of 7% in the
number of entries.
The growth of the database is summarized in the table above.
2. AMINO ACID COMPOSITION
2.1 Composition in percent for the complete database
Ala (A) 7.83 Gln (Q) 3.94 Leu (L) 9.64 Ser (S) 6.85
Arg (R) 5.35 Glu (E) 6.63 Lys (K) 5.93 Thr (T) 5.42
Asn (N) 4.18 Gly (G) 6.94 Met (M) 2.37 Trp (W) 1.15
Asp (D) 5.32 His (H) 2.28 Phe (F) 4.00 Tyr (Y) 3.06
Cys (C) 1.53 Ile (I) 5.92 Pro (P) 4.83 Val (V) 6.72
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.01
2.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Lys, Ile, Thr, Arg, Asp, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
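The classification in 2.2 follows directly from the percentages in 2.1. As a small sketch (the dictionary simply restates the published values, and the variable names are illustrative), sorting the twenty standard amino acids by decreasing frequency reproduces the order given above:

```python
# Composition in percent, copied from section 2.1 (Swiss-Prot release 48.0).
composition = {
    "Ala": 7.83, "Gln": 3.94, "Leu": 9.64, "Ser": 6.85,
    "Arg": 5.35, "Glu": 6.63, "Lys": 5.93, "Thr": 5.42,
    "Asn": 4.18, "Gly": 6.94, "Met": 2.37, "Trp": 1.15,
    "Asp": 5.32, "His": 2.28, "Phe": 4.00, "Tyr": 3.06,
    "Cys": 1.53, "Ile": 5.92, "Pro": 4.83, "Val": 6.72,
}

# Rank residues from most to least frequent.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```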
3. TAXONOMIC ORIGIN
Total number of species represented in this release of Swiss-Prot: 9'479
The twenty most represented species account for 66639 sequences, or 34.3% of
the total number of entries.
3.1 Table of the frequency of occurrence of species
Species represented 1x: 4552
2x: 1489
3x: 734
4x: 476
5x: 320
6x: 281
7x: 197
8x: 156
9x: 138
10x: 78
11- 20x: 382
21- 50x: 287
51-100x: 111
>100x: 278
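The bins of table 3.1 partition the set of species, so their counts should add up to the total number of species quoted at the start of this section. A short sketch of that check, which also derives the share of species represented by at most ten entries (the variable names are illustrative):

```python
# Number of species per occurrence bin, copied from table 3.1.
species_per_bin = {
    "1x": 4552, "2x": 1489, "3x": 734, "4x": 476, "5x": 320,
    "6x": 281, "7x": 197, "8x": 156, "9x": 138, "10x": 78,
    "11-20x": 382, "21-50x": 287, "51-100x": 111, ">100x": 278,
}

total_species = sum(species_per_bin.values())

# Species with ten entries or fewer are the first ten bins.
rare_bins = ["1x", "2x", "3x", "4x", "5x", "6x", "7x", "8x", "9x", "10x"]
rare_species = sum(species_per_bin[b] for b in rare_bins)

print(f"Total species: {total_species}")
print(f"Species with at most 10 entries: {rare_species / total_species:.1%}")
```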
3.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 12860 Homo sapiens (Human)
2 9933 Mus musculus (Mouse)
3 5139 Saccharomyces cerevisiae (Baker's yeast)
4 4846 Escherichia coli
5 4570 Rattus norvegicus (Rat)
6 3609 Arabidopsis thaliana (Mouse-ear cress)
7 2840 Schizosaccharomyces pombe (Fission yeast)
8 2814 Bacillus subtilis
9 2667 Caenorhabditis elegans
10 2273 Drosophila melanogaster (Fruit fly)
11 1782 Methanococcus jannaschii
12 1772 Haemophilus influenzae
13 1758 Escherichia coli O157:H7
14 1653 Bos taurus (Bovine)
15 1512 Salmonella typhimurium
16 1438 Escherichia coli O6
17 1404 Shigella flexneri
18 1403 Mycobacterium tuberculosis
19 1230 Gallus gallus (Chicken)
20 1136 Mycobacterium bovis
21 1106 Salmonella typhi
22 1029 Pseudomonas aeruginosa
23 1001 Xenopus laevis (African clawed frog)
24 983 Sus scrofa (Pig)
25 964 Synechocystis sp. (strain PCC 6803)
26 964 Archaeoglobus fulgidus
27 823 Rhizobium meliloti (Sinorhizobium meliloti)
28 810 Vibrio cholerae
29 809 Yersinia pestis
30 770 Oryctolagus cuniculus (Rabbit)
31 746 Aquifex aeolicus
32 694 Pasteurella multocida
33 687 Mycoplasma pneumoniae
34 661 Pongo pygmaeus (Orangutan)
35 652 Vibrio parahaemolyticus
36 644 Streptomyces coelicolor
37 632 Bacillus halodurans
38 621 Mycobacterium leprae
39 608 Treponema pallidum
40 603 Canis familiaris (Dog)
41 599 Vibrio vulnificus
42 591 Staphylococcus aureus (strain Mu50 / ATCC 700699)
43 588 Staphylococcus aureus (strain N315)
44 587 Anabaena sp. (strain PCC 7120)
45 583 Methanobacterium thermoautotrophicum
46 578 Vibrio vulnificus (strain YJ016)
47 572 Buchnera aphidicola subsp. Acyrthosiphon pisum
48 571 Staphylococcus aureus (strain MW2)
49 566 Oryza sativa (Rice)
50 563 Helicobacter pylori (Campylobacter pylori)
51 562 Buchnera aphidicola subsp. Schizaphis graminum
52 546 Pseudomonas putida (strain KT2440)
53 546 Rickettsia prowazekii
54 544 Helicobacter pylori J99 (Campylobacter pylori J99)
55 541 Pseudomonas syringae pv. tomato
56 531 Bacillus anthracis
57 528 Lactococcus lactis subsp. lactis (Streptococcus lactis)
58 528 Staphylococcus epidermidis
59 524 Bradyrhizobium japonicum
60 523 Brachydanio rerio (Zebrafish) (Danio rerio)
61 521 Zea mays (Maize)
62 517 Ralstonia solanacearum (Pseudomonas solanacearum)
63 513 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
64 512 Listeria monocytogenes
65 507 Buchnera aphidicola subsp. Baizongia pistaciae
66 506 Listeria innocua
67 500 Rhizobium loti (Mesorhizobium loti)
68 493 Xanthomonas campestris pv. campestris
69 493 Neisseria meningitidis serogroup B
70 490 Neisseria meningitidis serogroup A
71 488 Photorhabdus luminescens subsp. laumondii
72 486 Mycoplasma genitalium
73 485 Clostridium acetobutylicum
74 475 Caulobacter crescentus
75 467 Thermotoga maritima
76 462 Staphylococcus aureus (strain MRSA252)
77 461 Staphylococcus aureus (strain MSSA476)
78 459 Shewanella oneidensis
79 458 Bacillus cereus (strain ATCC 14579 / DSM 31)
80 456 Xanthomonas axonopodis pv. citri
81 453 Streptococcus pneumoniae
82 451 Pan troglodytes (Chimpanzee)
83 447 Xylella fastidiosa
84 441 Deinococcus radiodurans
85 440 Listeria monocytogenes serotype 4b (strain F2365)
86 437 Xylella fastidiosa (strain Temecula1 / ATCC 700964)
87 436 Pyrococcus horikoshii
88 431 Chlamydia trachomatis
89 431 Pyrococcus abyssi
90 430 Methanosarcina acetivorans
91 426 Halobacterium salinarium (Halobacterium halobium)
92 423 Brucella melitensis
93 423 Brucella suis
94 422 Clostridium perfringens
95 421 Corynebacterium glutamicum (Brevibacterium flavum)
96 419 Oceanobacillus iheyensis
97 419 Haemophilus ducreyi
98 418 Borrelia burgdorferi (Lyme disease spirochete)
99 418 Neurospora crassa
100 417 Mimivirus
101 412 Chlamydia pneumoniae (Chlamydophila pneumoniae)
102 410 Methanosarcina mazei (Methanosarcina frisia)
103 404 Rhizobium sp. (strain NGR234)
104 402 Chlamydia muridarum
105 399 Streptococcus pneumoniae (strain ATCC BAA-255 / R6)
106 399 Yersinia pseudotuberculosis
107 398 Pyrococcus furiosus
108 390 Thermoanaerobacter tengcongensis
109 390 Candida albicans (Yeast)
110 389 Macaca fascicularis (Crab eating macaque) (Cynomolgus monkey)
111 388 Lactobacillus plantarum
112 385 Campylobacter jejuni
113 384 Ovis aries (Sheep)
114 383 Sulfolobus solfataricus
115 375 Streptococcus mutans
116 372 Synechococcus elongatus (Thermosynechococcus elongatus)
117 370 Bordetella bronchiseptica (Alcaligenes bronchisepticus)
118 369 Nicotiana tabacum (Common tobacco)
119 367 Rickettsia conorii
120 366 Streptococcus pyogenes serotype M1
121 363 Streptococcus pyogenes serotype M6
122 363 Bordetella pertussis
123 361 Enterococcus faecalis (Streptococcus faecalis)
124 360 Chromobacterium violaceum
125 360 Streptococcus pyogenes serotype M18
126 359 Bordetella parapertussis
127 359 Streptococcus pyogenes serotype M3
128 359 Streptomyces avermitilis
129 346 Chlorobium tepidum
130 338 Aeropyrum pernix
131 338 Staphylococcus aureus
132 332 Methanopyrus kandleri
133 330 Leptospira interrogans
134 329 Corynebacterium efficiens
135 329 Erwinia carotovora subsp. atroseptica (Pectobacterium atrosepticum)
136 328 Pyrococcus kodakaraensis (Thermococcus kodakaraensis)
137 325 Dictyostelium discoideum (Slime mold)
138 319 Leptospira interrogans serogroup Icterohaemorrhagiae serovar copenhageni
139 313 Bacillus cereus (strain ATCC 10987)
140 313 Nitrosomonas europaea
141 313 Pisum sativum (Garden pea)
142 309 Staphylococcus aureus (strain COL)
143 309 Sulfolobus tokodaii
144 307 Kluyveromyces lactis (Yeast)
145 297 Streptococcus agalactiae serotype V
146 297 Streptococcus agalactiae serotype III
147 297 Thermoplasma acidophilum
148 296 Gloeobacter violaceus
149 294 Photobacterium profundum (Photobacterium sp. (strain SS9))
150 294 Ashbya gossypii (Yeast) (Eremothecium gossypii)
151 285 Triticum aestivum (Wheat)
152 280 Synechococcus sp. (strain WH8102)
153 280 Fusobacterium nucleatum subsp. nucleatum
154 279 Staphylococcus epidermidis (strain ATCC 35984 / RP62A)
155 278 Pseudomonas putida
156 273 Prochlorococcus marinus (strain MIT 9313)
157 273 Hordeum vulgare (Barley)
158 270 Lycopersicon esculentum (Tomato)
159 268 Cavia porcellus (Guinea pig)
160 268 Bacteriophage T4
161 268 Glycine max (Soybean)
162 267 Macaca mulatta (Rhesus macaque)
163 265 Rhodopseudomonas palustris
164 265 Prochlorococcus marinus
165 264 Pyrobaculum aerophilum
166 262 Coxiella burnetii
167 261 Thermoplasma volcanium
168 257 Solanum tuberosum (Potato)
169 257 Prochlorococcus marinus subsp. pastoris (strain CCMP 1378 / MED4)
170 256 Clostridium tetani
171 254 Rhodobacter capsulatus (Rhodopseudomonas capsulata)
172 254 Vaccinia virus (strain Copenhagen) (VACV)
173 253 Candida glabrata (Yeast) (Torulopsis glabrata)
174 252 Acinetobacter sp. (strain ADP1)
175 251 Emericella nidulans (Aspergillus nidulans)
176 250 Bacteroides thetaiotaomicron
177 249 Bacillus thuringiensis subsp. konkukian
178 246 Salmonella paratyphi-a
179 245 Spinacia oleracea (Spinach)
180 245 Wolinella succinogenes
181 242 Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
182 239 Mycobacterium paratuberculosis
183 235 Bacillus stearothermophilus
184 233 Wigglesworthia glossinidia brevipalpis
185 231 Thermus thermophilus (strain HB8 / ATCC 27634 / DSM 579)
186 228 Equus caballus (Horse)
187 227 Chlamydophila caviae
188 227 Bifidobacterium longum
189 223 Geobacter sulfurreducens
190 221 Rhodopirellula baltica
191 220 Porphyra purpurea
192 219 Porphyromonas gingivalis (Bacteroides gingivalis)
193 219 Burkholderia pseudomallei (Pseudomonas pseudomallei)
194 217 Corynebacterium diphtheriae
195 216 Chlamydomonas reinhardtii
196 214 Helicobacter hepaticus
197 213 Methanococcus maripaludis
198 212 Bacillus clausii (strain KSM-K16)
199 211 Bacillus cereus (strain ZK)
200 210 Desulfovibrio vulgaris (strain Hildenborough / ATCC 29579 / NCIMB 8303)
201 209 Klebsiella pneumoniae
202 209 Thermus thermophilus (strain HB27 / ATCC BAA-163 / DSM 7039)
203 203 Haloarcula marismortui (Halobacterium marismortui)
204 202 Mannheimia succiniciproducens (strain MBEL55E)
205 202 Yarrowia lipolytica (Candida lipolytica)
206 200 Vaccinia virus (strain Western Reserve / WR) (VACV)
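The figure quoted at the start of section 3 for the twenty most represented species can be recomputed from the first twenty rows of the table above. A minimal sketch, using the release total of 194'317 entries from the introduction:

```python
# Frequencies of the 20 most represented species, copied from table 3.2.
top20 = [
    12860, 9933, 5139, 4846, 4570, 3609, 2840, 2814, 2667, 2273,
    1782, 1772, 1758, 1653, 1512, 1438, 1404, 1403, 1230, 1136,
]
total_entries = 194_317  # entries in Swiss-Prot release 48.0

covered = sum(top20)
share = covered / total_entries

print(f"{covered} sequences, {share:.1%} of all entries")
```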
3.3 Taxonomic distribution of the sequences
Kingdom sequences (% of the database)
Archaea 9783 ( 5%)
Bacteria 89394 ( 46%)
Eukaryota 85403 ( 44%)
Viruses 9737 ( 5%)
Within Eukaryota:
Category sequences (% of Eukaryota) (% of the complete database)
Human 12861 ( 15%) ( 7%)
Other Mammalia 25396 ( 30%) ( 13%)
Other Vertebrata 7582 ( 9%) ( 4%)
Viridiplantae 13805 ( 16%) ( 7%)
Fungi 12450 ( 15%) ( 6%)
Insecta 4391 ( 5%) ( 2%)
Nematoda 2971 ( 3%) ( 2%)
Other 5947 ( 7%) ( 3%)
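The percentages in 3.3 are each count divided by the relevant total and rounded to a whole percent. A brief sketch for the kingdom-level rows, assuming ordinary rounding to the nearest integer:

```python
# Sequence counts per kingdom, copied from section 3.3.
kingdoms = {"Archaea": 9783, "Bacteria": 89394, "Eukaryota": 85403, "Viruses": 9737}

total = sum(kingdoms.values())
percent = {name: round(100 * n / total) for name, n in kingdoms.items()}

for name, n in kingdoms.items():
    print(f"{name:<10}{n:>7} ({percent[name]:>3}%)")
```

The same division, with the Eukaryota count as denominator, gives the "% of Eukaryota" column of the second table.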
4. SEQUENCE SIZE
Distribution of the sequences by size (excluding fragments)
From To Number From To Number
1- 50 4028 1001-1100 1637
51- 100 13893 1101-1200 1126
101- 150 19808 1201-1300 842
151- 200 18963 1301-1400 639
201- 250 19439 1401-1500 495
251- 300 16591 1501-1600 309
301- 350 17240 1601-1700 230
351- 400 15608 1701-1800 177
401- 450 12090 1801-1900 189
451- 500 10072 1901-2000 154
501- 550 7724 2001-2100 95
551- 600 5156 2101-2200 148
601- 650 4379 2201-2300 119
651- 700 3103 2301-2400 84
701- 750 2629 2401-2500 64
751- 800 2190 >2500 480
801- 850 1774
851- 900 1891
901- 950 1339
951-1000 1108
The average sequence length in Swiss-Prot is 362 amino acids.
The shortest sequence is GWA_SEPOF (P83570): 2 amino acids.
The longest sequence is SYNE1_HUMAN (Q8NF91): 8797 amino acids.
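Two consistency checks tie this section to figures reported elsewhere in these notes: the average length is the total number of amino acids divided by the number of entries, and the size bins, which exclude fragments, should add up to the number of entries minus the number of fragments given in section 7. A sketch of both checks, with illustrative variable names:

```python
total_residues = 70_391_852  # amino acids in release 48.0
total_entries = 194_317      # sequence entries in release 48.0
fragments = 8_504            # "Number of fragments" from section 7

# Counts per size bin, copied from the table above (1-50 ... 951-1000).
bins_up_to_1000 = [
    4028, 13893, 19808, 18963, 19439, 16591, 17240, 15608, 12090, 10072,
    7724, 5156, 4379, 3103, 2629, 2190, 1774, 1891, 1339, 1108,
]
# Counts per size bin for 1001-1100 ... 2401-2500, then >2500.
bins_above_1000 = [
    1637, 1126, 842, 639, 495, 309, 230, 177, 189, 154,
    95, 148, 119, 84, 64, 480,
]

average_length = total_residues / total_entries
binned = sum(bins_up_to_1000) + sum(bins_above_1000)

print(f"Average length: {average_length:.0f} amino acids")
print(f"Binned sequences: {binned}, entries minus fragments: {total_entries - fragments}")
```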
5. JOURNAL CITATIONS
Note: the following citation statistics reflect the number of distinct
journal citations.
Total number of journals cited in this release of Swiss-Prot: 1618
5.1 Table of the frequency of journal citations
Journals cited 1x: 577
2x: 226
3x: 114
4x: 77
5x: 55
6x: 36
7x: 33
8x: 36
9x: 21
10x: 15
11- 20x: 124
21- 50x: 132
51-100x: 56
>100x: 116
5.2 List of the most cited journals in Swiss-Prot
Nb Citations Journal name
-- --------- -------------------------------------------------------------
1 12470 Journal of Biological Chemistry
2 6211 Proceedings of the National Academy of Sciences of the U.S.A.
3 4207 Journal of Bacteriology
4 3922 Gene
5 3880 Nucleic Acids Research
6 3338 Biochemical and Biophysical Research Communications
7 3249 FEBS Letters
8 2906 Biochemistry
9 2799 European Journal of Biochemistry
10 2756 The EMBO Journal
11 2524 Nature
12 2458 Biochimica et Biophysica Acta
13 2228 Journal of Molecular Biology
14 2134 Molecular and Cellular Biology
15 2118 Genomics
16 2018 Cell
17 1619 Biochemical Journal
18 1490 Science
19 1337 Molecular Microbiology
20 1251 Plant Molecular Biology
21 1241 Molecular and General Genetics
22 1020 Journal of Cell Biology
23 1013 Journal of Biochemistry
24 963 Virology
25 961 Human Molecular Genetics
26 894 Nature Genetics
27 854 Journal of Virology
28 837 Genes and Development
29 778 The American Journal of Human Genetics
30 766 Oncogene
31 757 Plant Physiology
32 735 Human Mutation
33 669 Infection and Immunity
34 668 Journal of Immunology
35 638 Structure
36 636 Archives of Biochemistry and Biophysics
37 626 Yeast
38 625 Development
39 578 Journal of General Virology
40 561 Genetics
41 559 Microbiology
42 517 FEMS Microbiology Letters
43 507 Nature Structural Biology
44 473 Blood
45 457 Human Genetics
46 452 Current Genetics
47 410 Molecular Biology of the Cell
48 396 Applied and Environmental Microbiology
49 394 The Plant Cell
50 390 Molecular and Biochemical Parasitology
51 384 Journal of Clinical Investigation
52 383 Developmental Biology
53 374 Cancer Research
54 370 Mammalian Genome
55 367 Journal of Cell Science
56 361 Protein Science
57 358 Mechanisms of Development
58 356 Molecular Endocrinology
59 346 Neuron
60 344 Acta Crystallographica, Section D
61 340 Immunogenetics
62 331 The Journal of Experimental Medicine
63 327 Journal of Molecular Evolution
64 326 The Plant Journal
65 316 DNA and Cell Biology
66 315 Journal of Neuroscience
67 314 Molecular Cell
68 298 Endocrinology
69 283 Biological Chemistry Hoppe-Seyler
70 279 DNA Sequence
71 276 Journal of Neurochemistry
72 268 Current Biology
73 259 The Journal of Clinical Endocrinology and Metabolism
74 255 Molecular Biology and Evolution
75 247 Brain Research. Molecular Brain Research
76 240 Journal of General Microbiology
77 240 Bioscience, Biotechnology, and Biochemistry
78 239 Toxicon
79 238 American Journal of Physiology
80 227 Cytogenetics and Cell Genetics
81 216 Comparative Biochemistry and Physiology
82 214 Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
83 198 Antimicrobial Agents and Chemotherapy
84 189 Molecular Pharmacology
85 181 Journal of Investigative Dermatology
86 179 Proteins
87 173 Journal of Medical Genetics
88 163 Peptides
89 161 DNA Research
90 158 DNA
91 157 Molecular Plant-Microbe Interactions
92 155 Genome Research
93 154 Virus Research
94 154 American Journal of Medical Genetics
95 152 Tissue Antigens
96 151 Plant and Cell Physiology
97 144 Biochimie
98 144 European Journal of Immunology
99 143 Biology of Reproduction
100 138 Bioorganicheskaia Khimiia
101 137 Molecular and Cellular Endocrinology
102 135 Hemoglobin
103 119 Archives of Microbiology
104 119 Insect Biochemistry and Molecular Biology
105 119 Molecular Phylogenetics and Evolution
106 118 Agricultural and Biological Chemistry
107 117 Experimental Cell Research
108 112 Journal of Human Genetics
109 112 Annals of Neurology
110 110 Nature Cell Biology
111 110 European Journal of Human Genetics
112 109 General and Comparative Endocrinology
113 108 Neurology
114 106 RNA
115 104 Diabetes
116 103 Developmental Dynamics
6. STATISTICS FOR SOME LINE TYPES
The following table summarizes the total number of some Swiss-Prot lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.
Line type / subtype               Total number   Number of entries   Average per entry
--------------------------------- ------------ ----------------- -----------------
References (RL) 378807 1.95
Journal 335752 181666 1.73
Submitted to EMBL/GenBank/DDBJ 40044 34264 0.21
Submitted to Swiss-Prot 671 668 <0.01
Plant Gene Register 510 498 <0.01
Book citation 494 482 <0.01
Unpublished observations 469 465 <0.01
Submitted to other databases 353 345 <0.01
Thesis 301 299 <0.01
Patent 124 122 <0.01
Unpublished results 83 81 <0.01
Worm Breeder's Gazette 6 6 <0.01
Comments (CC) 732470 3.77
SIMILARITY 208863 174553 1.07
FUNCTION 130272 127160 0.67
SUBCELLULAR LOCATION 97545 97545 0.50
CATALYTIC ACTIVITY 69879 65211 0.36
SUBUNIT 64500 64500 0.33
PATHWAY 35314 32279 0.18
COFACTOR 27402 24680 0.14
TISSUE SPECIFICITY 20576 20576 0.11
PTM 13541 11921 0.07
MISCELLANEOUS 12757 11828 0.07
DOMAIN 9667 8549 0.05
ALTERNATIVE PRODUCTS 7803 7803 0.04
CAUTION 7087 6298 0.04
INDUCTION 5374 5374 0.03
INTERACTION 4966 4966 0.03
DEVELOPMENTAL STAGE 4928 4928 0.03
DISEASE 3066 2236 0.02
ENZYME REGULATION 2770 2770 0.01
MASS SPECTROMETRY 1881 1597 0.01
DATABASE 1413 1322 0.01
BIOPHYSICOCHEMICAL PROPERTIES 1117 1117 0.01
POLYMORPHISM 509 496 <0.01
RNA EDITING 401 401 <0.01
ALLERGEN 387 387 <0.01
TOXIC DOSE 277 276 <0.01
BIOTECHNOLOGY 117 117 <0.01
PHARMACEUTICAL 58 58 <0.01
Features (FT) 1082233 5.57
TRANSMEM 123597 26975 0.64
METAL 77253 19387 0.40
CONFLICT 71568 24920 0.37
TOPO_DOM 62341 12741 0.32
TURN 62287 4652 0.32
CARBOHYD 61962 15580 0.32
STRAND 57083 4155 0.29
DISULFID 57006 15630 0.29
DOMAIN 55803 29830 0.29
ACT_SITE 45107 26532 0.23
HELIX 44973 4509 0.23
REPEAT 42352 6078 0.22
VARIANT 34916 6771 0.18
CHAIN 31363 25429 0.16
NP_BIND 26977 18970 0.14
MOD_RES 25644 13037 0.13
REGION 22431 11354 0.12
BINDING 20862 11717 0.11
SIGNAL 19975 19973 0.10
COMPBIAS 18938 10358 0.10
VARSPLIC 16016 6973 0.08
MUTAGEN 12686 3279 0.07
ZN_FING 12442 4999 0.06
SITE 12122 6874 0.06
NON_TER 10993 8349 0.06
MOTIF 10543 7622 0.05
INIT_MET 8934 8858 0.05
PROPEP 6568 5481 0.03
DNA_BIND 5748 5387 0.03
LIPID 5732 3787 0.03
COILED 5564 3389 0.03
PEPTIDE 4022 1851 0.02
TRANSIT 3372 3342 0.02
CA_BIND 2334 941 0.01
NON_CONS 1096 526 0.01
CROSSLNK 1001 761 0.01
UNSURE 417 170 <0.01
SE_CYS 205 139 <0.01
Cross-references (DR) 2038749 10.49
InterPro 399585 178352 2.06
EMBL 371472 186572 1.91
Pfam 235069 172308 1.21
PROSITE 175229 108545 0.90
GO 95530 27032 0.49
PIR 93479 86872 0.48
PRINTS 74849 58177 0.39
HSSP 73924 73924 0.38
TIGRFAMs 72463 67688 0.37
HAMAP 66305 66201 0.34
ProDom 52161 50186 0.27
SMART 48269 36731 0.25
PANTHER 47077 44568 0.24
Ensembl 36475 36473 0.19
PDB 30616 8392 0.16
SMR 26242 26242 0.14
TIGR 19090 18555 0.10
PIRSF 15949 15699 0.08
HGNC 12075 12018 0.06
MIM 11065 9064 0.06
MGI 9693 9659 0.05
IntAct 7064 7064 0.04
SGD 5192 5129 0.03
GermOnline 4926 4876 0.03
EcoGene 4225 4223 0.02
EchoBASE 4159 4127 0.02
MEROPS 3861 3746 0.02
H-InvDB 3676 3658 0.02
TAIR 3675 3603 0.02
RGD 3231 3228 0.02
WormPep 3097 2666 0.02
FlyBase 2883 2852 0.01
GeneDB_Spombe 2872 2838 0.01
TRANSFAC 2782 2494 0.01
SubtiList 2757 2756 0.01
WormBase 2738 2661 0.01
Gramene 1890 1883 0.01
StyGene 1467 1464 0.01
TubercuList 1431 1395 0.01
SWISS-2DPAGE 1155 1155 0.01
GeneFarm 1059 1053 0.01
ListiList 1019 1011 0.01
Reactome 992 992 0.01
Leproma 625 621 <0.01
ZFIN 516 509 <0.01
PhotoList 488 488 <0.01
MaizeDB 426 421 <0.01
HIV 370 365 <0.01
REBASE 367 362 <0.01
OGP 367 367 <0.01
ECO2DBASE 351 299 <0.01
DictyBase 326 324 <0.01
AGD 300 294 <0.01
SagaList 298 297 <0.01
LegioList 286 286 <0.01
GlycoSuiteDB 283 283 <0.01
PHCI-2DPAGE 239 239 <0.01
MypuList 175 175 <0.01
Aarhus/Ghent-2DPAGE 128 98 <0.01
Siena-2DPAGE 103 103 <0.01
HSC-2DPAGE 85 85 <0.01
COMPLUYEAST-2DPAGE 59 59 <0.01
PhosSite 54 54 <0.01
PMMA-2DPAGE 52 52 <0.01
Maize-2DPAGE 39 39 <0.01
Rat-heart-2DPAGE 28 28 <0.01
ANU-2DPAGE 16 16 <0.01
Number of explicitly cross-referenced databases: 69
Number of implicitly cross-referenced databases: 31
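The "average per entry" column of the table above is the total number of lines divided by the number of entries in the release (194'317), not by the number of entries that carry such a line. A minimal sketch of that calculation follows. The function name and the rule for printing `<0.01` (used here whenever the value would round to 0.00) are assumptions inferred from the table, not a documented procedure.

```python
TOTAL_ENTRIES = 194_317  # entries in Swiss-Prot release 48.0


def average_per_entry(total_lines, n_entries=TOTAL_ENTRIES):
    """Format the mean number of lines per entry as shown in the table."""
    avg = total_lines / n_entries
    # Assumed convention: values that would round to 0.00 are shown as "<0.01".
    return "<0.01" if avg < 0.005 else f"{avg:.2f}"


# Totals copied from the table above.
for label, total in [
    ("References (RL)", 378807),
    ("Comments (CC)", 732470),
    ("Features (FT)", 1082233),
    ("Cross-references (DR)", 2038749),
    ("Worm Breeder's Gazette", 6),
]:
    print(f"{label:<24}{average_per_entry(total):>6}")
```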
7. MISCELLANEOUS STATISTICS
Total number of distinct authors cited in Swiss-Prot: 208469
Total number of entries encoded on a plastid: 64
Total number of entries encoded on a mitochondrion: 3334
Total number of entries encoded on a plasmid: 3046
Number of fragments: 8504
Number of additional sequences encoded on splice variants: 12128
| UniProtKB/TrEMBL protein database release 31.0 statistics |
|---|
1. INTRODUCTION
Release 31.0 of 13-Sep-2005 of UniProtKB/TrEMBL has been produced in sync
with UniProtKB/Swiss-Prot release 48 and EMBL/DDBJ/GenBank nucleotide
sequence database release 83 and its updates up to 19-Aug-2005. It contains
2'055'517 sequence entries, comprising 680'464'593 amino acids.
405'513 sequences have been added since release 30. This represents an
increase of 27%.
The document delac_tr.txt lists all accession numbers which were previously
present in TrEMBL, but which have now been deleted from the database. Most
deletions are due to the deletion of the corresponding CDS in the source
nucleotide sequence databases EMBL-Bank/DDBJ/GenBank. In addition, some
entries are recognised to be Open Reading Frames (ORFs) that have been wrongly
predicted to code for proteins. When there is enough evidence that these
hypothetical proteins are not real, we remove them from TrEMBL.
2. AMINO ACID COMPOSITION
2.1 Composition in percent for the complete database
Ala (A) 7.85 Gln (Q) 3.87 Leu (L) 9.72 Ser (S) 7.14
Arg (R) 5.38 Glu (E) 6.08 Lys (K) 5.51 Thr (T) 5.71
Asn (N) 4.48 Gly (G) 6.87 Met (M) 2.38 Trp (W) 1.35
Asp (D) 5.13 His (H) 2.27 Phe (F) 4.11 Tyr (Y) 3.11
Cys (C) 1.49 Ile (I) 5.96 Pro (P) 4.94 Val (V) 6.48
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.06
2.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Ile, Thr, Lys, Arg, Asp, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
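The TrEMBL ranking differs slightly from the Swiss-Prot ranking given earlier in these notes (for example, Ser moves ahead of Gly). As an illustrative sketch, the two published compositions can be compared residue by residue to see where they diverge most. The variable names are assumptions and the values are copied from the two composition tables.

```python
# Composition in percent for Swiss-Prot release 48.0.
swiss_prot = {
    "Ala": 7.83, "Gln": 3.94, "Leu": 9.64, "Ser": 6.85,
    "Arg": 5.35, "Glu": 6.63, "Lys": 5.93, "Thr": 5.42,
    "Asn": 4.18, "Gly": 6.94, "Met": 2.37, "Trp": 1.15,
    "Asp": 5.32, "His": 2.28, "Phe": 4.00, "Tyr": 3.06,
    "Cys": 1.53, "Ile": 5.92, "Pro": 4.83, "Val": 6.72,
}
# Composition in percent for TrEMBL release 31.0.
trembl = {
    "Ala": 7.85, "Gln": 3.87, "Leu": 9.72, "Ser": 7.14,
    "Arg": 5.38, "Glu": 6.08, "Lys": 5.51, "Thr": 5.71,
    "Asn": 4.48, "Gly": 6.87, "Met": 2.38, "Trp": 1.35,
    "Asp": 5.13, "His": 2.27, "Phe": 4.11, "Tyr": 3.11,
    "Cys": 1.49, "Ile": 5.96, "Pro": 4.94, "Val": 6.48,
}

# Difference in percentage points (TrEMBL minus Swiss-Prot), rounded to 2 places.
diff = {aa: round(trembl[aa] - swiss_prot[aa], 2) for aa in swiss_prot}

# Three residues with the largest absolute difference.
largest = sorted(diff, key=lambda aa: abs(diff[aa]), reverse=True)[:3]
for aa in largest:
    print(f"{aa}: {diff[aa]:+.2f} percentage points")
```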
3. TAXONOMIC ORIGIN
Total number of species represented in this release of
UniProtKB/TrEMBL: 95545
The twenty most represented species account for 571629 sequences, or 27.1% of
the total number of entries.
3.1 Table of the frequency of occurrence of species
Species represented 1x:46664
2x:18167
3x: 9165
4x: 4827
5x: 2821
6x: 2208
7x: 1484
8x: 1222
9x: 1005
10x: 763
11- 20x: 3497
21- 50x: 1926
51-100x: 775
>100x: 1021
3.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 138508 Human immunodeficiency virus 1
2 58027 Homo sapiens (Human)
3 49342 Oryza sativa (japonica cultivar-group)
4 39688 Arabidopsis thaliana (Mouse-ear cress)
5 39144 Mus musculus (Mouse)
6 27998 Tetraodon nigroviridis (Green puffer)
7 25252 Drosophila melanogaster (Fruit fly)
8 25184 Hepatitis C virus
9 20341 Caenorhabditis elegans
10 20090 Trypanosoma cruzi
11 15223 Anopheles gambiae str. PEST
12 14672 Plasmodium chabaudi
13 14614 Dictyostelium discoideum (Slime mold)
14 13522 Brachydanio rerio (Zebrafish) (Danio rerio)
15 13197 Caenorhabditis briggsae
16 11765 Plasmodium berghei
17 11636 Gibberella zeae PH-1
18 11543 Xenopus laevis (African clawed frog)
19 11007 Magnaporthe grisea 70-15
20 10876 Neurospora crassa
21 9872 Aspergillus fumigatus Af293
22 9826 Rattus norvegicus (Rat)
23 9676 Schistosoma japonicum (Blood fluke)
24 9474 Aspergillus nidulans FGSC A4
25 9168 Candida albicans SC5314
26 9092 Entamoeba histolytica HM-1:IMSS
27 8990 Hepatitis B virus
28 8349 uncultured bacterium
29 8212 Leishmania major
30 8122 Bradyrhizobium japonicum
31 8063 Solibacter usitatus Ellin6076
32 7801 Plasmodium yoelii yoelii
33 7663 Burkholderia vietnamiensis G4
34 7563 Streptomyces coelicolor
35 7349 Streptomyces avermitilis
36 7236 Escherichia coli
37 7178 Rhizobium loti (Mesorhizobium loti)
38 7050 Rhodopirellula baltica
39 7049 Burkholderia cenocepacia HI2424
40 6994 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
41 6545 Pseudomonas aeruginosa
42 6531 Cryptococcus neoformans var. neoformans B-3501A
43 6498 Ustilago maydis 521
44 6456 Burkholderia cenocepacia AU 1054
45 6433 Ralstonia eutropha JMP134
46 6399 Yarrowia lipolytica (Candida lipolytica)
47 6394 Giardia lamblia ATCC 50803
48 6243 Bacillus anthracis
49 6180 Debaryomyces hansenii (Yeast) (Torulaspora hansenii)
50 6124 Pseudomonas fluorescens (strain Pf-5 / ATCC BAA-477)
51 5905 Bacillus cereus G9241
52 5848 Cryptococcus neoformans var. neoformans JEC21
53 5757 Nocardia farcinica
54 5701 Burkholderia pseudomallei (Pseudomonas pseudomallei)
55 5694 Rhizobium meliloti (Sinorhizobium meliloti)
56 5661 Crocosphaera watsonii
57 5644 Polaromonas sp. JS666
58 5556 Anabaena sp. (strain PCC 7120)
59 5507 Bacillus cereus (strain ATCC 10987)
60 5474 Gallus gallus (Chicken)
61 5429 Trypanosoma brucei
62 5421 Bacillus cereus (strain ZK)
63 5226 Plasmodium falciparum (isolate 3D7)
64 5193 Yersinia pestis
65 5183 Helicobacter pylori (Campylobacter pylori)
66 5131 Kluyveromyces lactis (Yeast)
67 5074 Pseudomonas syringae pv. syringae (strain B728a)
68 5055 Photobacterium profundum (Photobacterium sp. (strain SS9))
69 5043 Candida glabrata (Yeast) (Torulopsis glabrata)
70 5042 Pseudomonas syringae pv. phaseolicola 1448A
71 4959 Pseudomonas syringae pv. tomato
72 4938 Azotobacter vinelandii AvOP
73 4918 Bordetella bronchiseptica (Alcaligenes bronchisepticus)
74 4872 Colwellia psychrerythraea (strain 34H / ATCC BAA-681) (Vibrio psychroerythus)
75 4865 Escherichia coli O157:H7
76 4837 Bacillus thuringiensis subsp. konkukian
77 4796 Bacillus licheniformis (strain DSM 13 / ATCC 14580)
78 4782 Bacillus cereus (strain ATCC 14579 / DSM 31)
79 4767 Pseudomonas putida (strain KT2440)
80 4747 Streptococcus pneumoniae
81 4744 Bacteroides fragilis
82 4730 Ralstonia solanacearum (Pseudomonas solanacearum)
83 4610 Burkholderia mallei (Pseudomonas mallei)
84 4593 Bacteroides thetaiotaomicron
85 4583 Rhodopseudomonas palustris
86 4563 Xanthomonas oryzae pv. oryzae
87 4552 Leptospira interrogans
88 4546 Oryza sativa (Rice)
89 4533 Frankia sp. CcI3
90 4518 Arthrobacter sp. FB24
91 4515 Salmonella choleraesuis
92 4456 Ashbya gossypii (Yeast) (Eremothecium gossypii)
93 4412 Vibrio vulnificus (strain YJ016)
94 4390 Vibrio parahaemolyticus
95 4381 Azoarcus sp. (strain EbN1)
96 4352 Mycobacterium tuberculosis
97 4310 Anaeromyxobacter dehalogenans 2CP-C
98 4237 Xanthomonas campestris pv. campestris (strain 8004)
99 4213 Bacteroides fragilis (strain ATCC 25285 / NCTC 9343)
100 4180 Mycobacterium paratuberculosis
101 4159 Erwinia carotovora subsp. atroseptica (Pectobacterium atrosepticum)
102 4155 Dechloromonas aromatica RCB
103 4127 Shewanella oneidensis
104 4119 Silicibacter pomeroyi
105 4116 Gloeobacter violaceus
106 4106 Theileria parva
107 4099 Pongo pygmaeus (Orangutan)
108 4081 Photorhabdus luminescens subsp. laumondii
109 4075 Plasmodium falciparum
110 4056 Corynebacterium glutamicum (Brevibacterium flavum)
111 4051 Chromobacterium violaceum
112 4051 Cryptosporidium parvum
113 4046 Methanosarcina acetivorans
114 4037 Haloarcula marismortui (Halobacterium marismortui)
115 4027 Macaca fascicularis (Crab eating macaque) (Cynomolgus monkey)
116 4023 Vibrio vulnificus
117 4007 Cryptosporidium hominis
118 4006 Salmonella typhi
3.3 Taxonomic distribution of the sequences
Kingdom sequences (% of the database)
Archaea 50509 ( 3%)
Bacteria 804377 ( 37%)
Eukaryota 914970 ( 43%)
Viruses 333442 ( 18%)
Other 1078 ( <1%)
Within Eukaryota:
Category sequences (% of Eukaryota) (% of the complete database)
Human 58027 ( 6%) ( 3%)
Other Mammalia 98381 ( 11%) ( 5%)
Other Vertebrata 128012 ( 14%) ( 6%)
Viridiplantae 185692 ( 20%) ( 9%)
Fungi 135546 ( 15%) ( 6%)
Insecta 89476 ( 10%) ( 4%)
Nematoda 36401 ( 4%) ( 2%)
Other 183435 ( 20%) ( 9%)
4. SEQUENCE SIZE
4.1 Distribution of the sequences by size (excluding fragments)
From To Number From To Number
1- 50 26909 1001-1100 13065
51- 100 126492 1101-1200 9306
101- 150 159806 1201-1300 6932
151- 200 147708 1301-1400 4548
201- 250 149510 1401-1500 3797
251- 300 138902 1501-1600 2629
301- 350 135080 1601-1700 2119
351- 400 109439 1701-1800 1745
401- 450 86159 1801-1900 1340
451- 500 75059 1901-2000 1143
501- 550 58434 2001-2100 861
551- 600 41682 2101-2200 1010
601- 650 32186 2201-2300 823
651- 700 24992 2301-2400 645
701- 750 21448 2401-2500 481
751- 800 18053 >2500 4181
801- 850 14994
851- 900 13392
901- 950 10107
951-1000 8026
4.2 Longest and shortest sequences
The shortest sequence is Q16047_HUMAN: 4 amino acids.
The longest sequence is Q8WZ42_HUMAN: 34350 amino acids.
5. STATISTICS FOR SOME LINE TYPES
The following table summarizes the total number of some TrEMBL
lines, as well as the number of entries with at least one such line, and the
frequency of the lines.
Line type / subtype               Total number   Number of entries   Average per entry
--------------------------------- ------------ ----------------- -----------------
References (RL) 2964602 1.41
Journal 1700868 1464012 0.81
Submitted to EMBL/GenBank/DDBJ 1219962 912710 0.58
Thesis 4784 4732 <0.01
Book citation 4076 4032 <0.01
Submitted to other databases 440 432 <0.01
Other 34472 20641 0.02
Comments (CC) 1056686 0.50
CAUTION 323137 323137 0.15
SIMILARITY 237377 234839 0.11
FUNCTION 131897 117967 0.06
SUBCELLULAR LOCATION 108462 108460 0.05
CATALYTIC ACTIVITY 105716 91616 0.05
SUBUNIT 65915 65915 0.03
COFACTOR 42117 42117 0.02
PATHWAY 32883 32502 0.02
MISCELLANEOUS 3629 3619 <0.01
INTERACTION 3468 3468 <0.01
DOMAIN 1951 1592 <0.01
MASS SPECTROMETRY 119 63 <0.01
ALLERGEN 15 15 <0.01
Features (FT) 1162978 0.55
NON_TER 1088124 650363 0.52
CHAIN 42871 25628 0.02
SIGNAL 31423 30474 0.01
TRANSIT 560 556 <0.01
Cross-references (DR) 14786754 7.02
GO 4294849 1243800 2.04
InterPro 2765243 1431739 1.31
EMBL 2446971 2099223 1.16
Pfam 1748797 1332161 0.83
PROSITE 1016814 646081 0.48
PRINTS 431666 357696 0.21
SMART 352928 268332 0.17
HSSP 286785 286508 0.14
SMR 277524 277496 0.13
ProDom 222934 214109 0.11
TIGRFAMs 205592 190426 0.10
PIR 196746 161117 0.09
Ensembl 117293 117293 0.06
TIGR 91544 85531 0.04
Gramene 58354 58319 0.03
PANTHER 53906 53896 0.03
PIRSF 40276 39473 0.02
MGI 35859 33668 0.02
FlyBase 22134 22084 0.01
WormPep 19260 19178 0.01
WormBase 19250 19178 0.01
TAIR 17779 17718 0.01
ZFIN 10704 10700 0.01
MEROPS 8295 8031 <0.01
IntAct 5715 5715 <0.01
LegioList 5607 5577 <0.01
ListiList 4796 4779 <0.01
AGD 4416 4416 <0.01
PhotoList 4192 4068 <0.01
HGNC 3538 3538 <0.01
PDB 2968 1762 <0.01
TubercuList 2557 2551 <0.01
RGD 2331 2316 <0.01
GeneDB_Spombe 2063 2057 <0.01
SagaList 1796 1702 <0.01
SGD 1323 1321 <0.01
TRANSFAC 989 977 <0.01
Leproma 982 981 <0.01
DictyBase 979 979 <0.01
MypuList 607 603 <0.01
REBASE 125 120 <0.01
PHCI-2DPAGE 108 108 <0.01
ANU-2DPAGE 70 70 <0.01
SWISS-2DPAGE 63 63 <0.01
Reactome 20 20 <0.01
PMMA-2DPAGE 3 3 <0.01
Siena-2DPAGE 2 2 <0.01
COMPLUYEAST-2DPAGE 1 1 <0.01
Number of explicitly cross-referenced databases: 69
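The "average per entry" column appears to be the total number of lines divided by the total number of TrEMBL entries, shown to two decimals, with values that would round to 0.00 printed as "<0.01". The sketch below works under two assumptions that these notes do not state explicitly: that formatting rule, and a total of 2105517 entries, obtained by adding the non-fragment counts of section 4.1 to the number of fragments in section 6. The function name `format_average` is illustrative.

```python
# Assumption: non-fragment sequences (sum of section 4.1) + fragments (section 6)
TOTAL_ENTRIES = 1453003 + 652514  # = 2105517


def format_average(total_lines: int, total_entries: int = TOTAL_ENTRIES) -> str:
    """Format lines-per-entry the way the table seems to present it."""
    average = total_lines / total_entries
    # Values that would display as 0.00 are shown as "<0.01" instead.
    return "<0.01" if average < 0.005 else f"{average:.2f}"


print(format_average(2964602))   # References (RL)       -> 1.41
print(format_average(14786754))  # Cross-references (DR) -> 7.02
print(format_average(4784))      # Thesis                -> <0.01
```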
6. MISCELLANEOUS STATISTICS
Total number of distinct authors cited in TrEMBL: 216643
Total number of entries encoded on Plastid; Chloroplast: 42172
Total number of entries encoded on Mitochondrion: 108892
Total number of entries encoded on Plastid; Cyanelle: 7
Total number of entries encoded on Plastid; Apicoplast: 142
Total number of entries encoded on Plastid; Non-photosynthetic plastid: 198
Total number of entries encoded on Plastid: 1833
Total number of entries encoded on Plasmid: 37058
Number of fragments: 652514
| Submissions and Updates |
|---|
We welcome feedback from our users. We would especially appreciate your notifying us if you find that sequences belonging to your field of expertise are missing from the database. We also would like to be notified about annotations to be updated, if, for example, the function of a protein has been clarified or if new information about post-translational modifications has become available.
Submit new sequence data, updates and corrections at http://www.uniprot.org/support/submissions.shtml
For all queries regarding submissions to UniProtKB, and to submit new protein sequence data, please contact:
UniProt Knowledgebase
The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Telephone: (+44 1223) 494 462
Telefax: (+44 1223) 494 468
E-mail:
| Download information |
|---|
The latest data of the UniProt Knowledgebase is available in various formats (flatfile, XML or FASTA) at http://www.uniprot.org/database/download.shtml. The data is further supplemented by a file containing the sequences of all additional splice isoforms annotated in UniProtKB/Swiss-Prot. This data set is documented in the file ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/README.varsplic
For users who wish to download the UniProt Knowledgebase only occasionally, we distribute the latest major release (updated 4 times per year) in flatfile format. Previous releases of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL are archived under ftp://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases. The UniProt Knowledgebase major release is also available on CD-ROM from the EBI.
| Contact |
|---|
| Citation |
|---|
If you want to cite UniProt in a publication, please use the following reference:
Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O'Donovan C., Redaschi N., Yeh L.S., The Universal Protein Resource (UniProt), Nucleic Acids Res. 33: D154-D159 (2005).