UniProt Knowledgebase release notes: UniProtKB release 7.0 of 7-Feb-2006
Related documents: UniProtKB user manual, Recent changes, Forthcoming changes.
| Introduction |
|---|
Release 7.0 of the UniProt Knowledgebase is composed of the UniProtKB/Swiss-Prot Protein Knowledgebase release 49.0 and the UniProtKB/TrEMBL Protein Database release 32.0.
More information on these databases can be found in the user manual section "What is the UniProt Knowledgebase?".
| UniProtKB/Swiss-Prot protein knowledgebase release 49.0 statistics |
|---|
Release 49.0 of 07-Feb-2006 of Swiss-Prot contains 207'132 sequence entries, comprising 75'438'310 amino acids abstracted from 139'151 references.
The growth of the database is summarized below.
| Release | Date | Number of entries | Number of amino acids |
|---|---|---|---|
| 2.0 | 09/86 | 3'939 | 900'163 |
| 3.0 | 11/86 | 4'160 | 969'641 |
| 4.0 | 04/87 | 4'387 | 1'036'010 |
| 5.0 | 09/87 | 5'205 | 1'327'683 |
| 6.0 | 01/88 | 6'102 | 1'653'982 |
| 7.0 | 04/88 | 6'821 | 1'885'771 |
| 8.0 | 08/88 | 7'724 | 2'224'465 |
| 9.0 | 11/88 | 8'702 | 2'498'140 |
| 10.0 | 03/89 | 10'008 | 2'952'613 |
| 11.0 | 07/89 | 10'856 | 3'265'966 |
| 12.0 | 10/89 | 12'305 | 3'797'482 |
| 13.0 | 01/90 | 13'837 | 4'347'336 |
| 14.0 | 04/90 | 15'409 | 4'914'264 |
| 15.0 | 08/90 | 16'941 | 5'486'399 |
| 16.0 | 11/90 | 18'364 | 5'986'949 |
| 17.0 | 02/91 | 20'024 | 6'524'504 |
| 18.0 | 05/91 | 20'772 | 6'792'034 |
| 19.0 | 08/91 | 21'795 | 7'173'785 |
| 20.0 | 11/91 | 22'654 | 7'500'130 |
| 21.0 | 03/92 | 23'742 | 7'866'596 |
| 22.0 | 05/92 | 25'044 | 8'375'696 |
| 23.0 | 08/92 | 26'706 | 9'011'391 |
| 24.0 | 12/92 | 28'154 | 9'545'427 |
| 25.0 | 04/93 | 29'955 | 10'214'020 |
| 26.0 | 07/93 | 31'808 | 10'875'091 |
| 27.0 | 10/93 | 33'329 | 11'484'420 |
| 28.0 | 02/94 | 36'000 | 12'496'420 |
| 29.0 | 06/94 | 38'303 | 13'464'008 |
| 30.0 | 10/94 | 40'292 | 14'147'368 |
| 31.0 | 02/95 | 43'470 | 15'335'248 |
| 32.0 | 11/95 | 49'340 | 17'385'503 |
| 33.0 | 02/96 | 52'205 | 18'531'384 |
| 34.0 | 10/96 | 59'021 | 21'210'389 |
| 35.0 | 11/97 | 69'113 | 25'083'768 |
| 36.0 | 07/98 | 74'019 | 26'840'295 |
| 37.0 | 12/98 | 77'977 | 28'268'293 |
| 38.0 | 07/99 | 80'000 | 29'085'965 |
| 39.0 | 05/00 | 86'593 | 31'411'114 |
| 40.0 | 10/01 | 101'602 | 37'315'215 |
| 41.0 | 02/03 | 122'564 | 44'986'459 |
| 42.0 | 10/03 | 135'850 | 50'046'799 |
| 43.0 | 03/04 | 146'720 | 54'093'154 |
| 44.0 | 07/04 | 153'871 | 56'608'159 |
| 45.0 | 10/04 | 163'235 | 59'631'787 |
| 46.0 | 02/05 | 168'297 | 61'443'278 |
| 47.0 | 05/05 | 181'577 | 65'746'672 |
| 48.0 | 09/05 | 194'317 | 70'391'852 |
| 49.0 | 02/06 | 207'132 | 75'438'310 |
In rare cases, Swiss-Prot entries are removed. Deleted entries are almost exclusively Open Reading Frames (ORFs) that have been wrongly predicted to code for proteins. When there is enough evidence that these hypothetical proteins are not real, we remove them from Swiss-Prot. The document delac_sp.txt lists all accession numbers that were previously present in UniProtKB/Swiss-Prot but have now been deleted from the database.
We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:
From our efforts to annotate human sequence entries as completely as possible arose the HPI project, and the bacterial model organisms became the focus of the HAMAP project. Here is the current status of the model organisms which are not covered by these two projects:
| Organism | Database cross-references | Index file | Number of sequences |
|---|---|---|---|
| A.thaliana | TAIR | arath.txt | 3'957 |
| C.albicans | None yet | calbican.txt | 479 |
| C.elegans | Wormpep | celegans.txt | 2'784 |
| D.discoideum | DictyBase | dicty.txt | 325 |
| D.melanogaster | FlyBase | fly.txt | 2'338 |
| M.musculus | MGD | mgdtosp.txt | 10'523 |
| S.cerevisiae | SGD | yeast.txt | 5'271 |
| S.pombe | GeneDB_SPombe | pombe.txt | 2'945 |
1. INTRODUCTION
Release 49.0 of 07-Feb-2006 of UniProtKB/Swiss-Prot contains 207'132 sequence entries,
comprising 75'438'310 amino acids abstracted from 139'151 references.
12'815 sequences have been added since release 48, the sequence data of 991 existing
entries has been updated and the annotations of all entries have been revised.
This represents an increase of 7%.
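The rounded 7% figure can be checked from the counts quoted in these notes. The following is a minimal sketch, assuming the release 48.0 total of 194'317 entries given in the growth table; the helper name `parse_count` is illustrative and not part of any UniProt tool.

```python
def parse_count(text: str) -> int:
    """Convert a count written with apostrophe separators (e.g. 207'132) to an int."""
    return int(text.replace("'", ""))


entries_rel48 = parse_count("194'317")  # release 48.0 (09/05), from the growth table
entries_rel49 = parse_count("207'132")  # release 49.0 (02/06)

# Net change in the number of entries between the two releases.
net_growth = entries_rel49 - entries_rel48

# Relative growth, expressed as a percentage of the release 48.0 total.
growth_pct = 100 * net_growth / entries_rel48

print(net_growth)
print(f"{growth_pct:.1f}%")
```

The net change equals the 12'815 added sequences quoted above, and the unrounded growth is a little under the 7% stated.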
2. AMINO ACID COMPOSITION
2.1 Composition in percent for the complete database
Ala (A) 7.83 Gln (Q) 3.95 Leu (L) 9.64 Ser (S) 6.86
Arg (R) 5.35 Glu (E) 6.64 Lys (K) 5.93 Thr (T) 5.42
Asn (N) 4.18 Gly (G) 6.93 Met (M) 2.38 Trp (W) 1.15
Asp (D) 5.32 His (H) 2.29 Phe (F) 4.00 Tyr (Y) 3.06
Cys (C) 1.52 Ile (I) 5.91 Pro (P) 4.83 Val (V) 6.71
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.00
2.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Lys, Ile, Thr, Arg, Asp, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
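The ranking in 2.2 follows directly from the percentages in 2.1. As a small sketch (the dictionary simply transcribes the twenty standard residues from the table above; the ambiguity codes B, Z and X are left out), sorting by descending frequency reproduces the order:

```python
# Percent composition of the twenty standard amino acids, copied from section 2.1.
composition = {
    "Ala": 7.83, "Gln": 3.95, "Leu": 9.64, "Ser": 6.86,
    "Arg": 5.35, "Glu": 6.64, "Lys": 5.93, "Thr": 5.42,
    "Asn": 4.18, "Gly": 6.93, "Met": 2.38, "Trp": 1.15,
    "Asp": 5.32, "His": 2.29, "Phe": 4.00, "Tyr": 3.06,
    "Cys": 1.52, "Ile": 5.91, "Pro": 4.83, "Val": 6.71,
}

# Sort residue names from most to least frequent.
ranking = sorted(composition, key=composition.get, reverse=True)
total_pct = sum(composition.values())

print(", ".join(ranking))
print(f"sum of rounded percentages: {total_pct:.2f}")
```

Because each value is rounded to two decimals, the percentages need not add up to exactly 100.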
3. TAXONOMIC ORIGIN
Total number of species represented in this release of UniProtKB/Swiss-Prot: 9731
The first twenty species represent 69270 sequences: 33.4 % of the total
number of entries.
3.1 Table of the frequency of occurrence of species
Species represented 1x: 4676
2x: 1525
3x: 748
4x: 488
5x: 319
6x: 287
7x: 197
8x: 159
9x: 140
10x: 78
11- 20x: 404
21- 50x: 308
51-100x: 110
>100x: 292
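The bins of table 3.1 partition the species, so their counts should add up to the total number of species quoted at the start of section 3. A short sketch of that check, with the bin labels used only as dictionary keys:

```python
# Number of species per occurrence bin, copied from table 3.1.
species_bins = {
    "1x": 4676, "2x": 1525, "3x": 748, "4x": 488, "5x": 319,
    "6x": 287, "7x": 197, "8x": 159, "9x": 140, "10x": 78,
    "11-20x": 404, "21-50x": 308, "51-100x": 110, ">100x": 292,
}

total_species = sum(species_bins.values())

# Share of species that are represented by a single entry only.
singleton_pct = 100 * species_bins["1x"] / total_species

print(total_species)
print(f"{singleton_pct:.1f}% of species occur once")
```

The sum matches the 9731 species reported, and roughly half of all species are represented by a single sequence.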
3.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 13433 Homo sapiens (Human)
2 10523 Mus musculus (Mouse)
3 5271 Saccharomyces cerevisiae (Baker's yeast)
4 4865 Rattus norvegicus (Rat)
5 4849 Escherichia coli
6 3957 Arabidopsis thaliana (Mouse-ear cress)
7 2945 Schizosaccharomyces pombe (Fission yeast)
8 2824 Bacillus subtilis
9 2784 Caenorhabditis elegans
10 2338 Drosophila melanogaster (Fruit fly)
11 1796 Escherichia coli O157:H7
12 1789 Bos taurus (Bovine)
13 1782 Methanococcus jannaschii
14 1772 Haemophilus influenzae
15 1549 Salmonella typhimurium
16 1476 Escherichia coli O6
17 1444 Shigella flexneri
18 1405 Mycobacterium tuberculosis
19 1323 Gallus gallus (Chicken)
20 1145 Mycobacterium bovis
21 1141 Salmonella typhi
22 1121 Xenopus laevis (African clawed frog)
23 1057 Pseudomonas aeruginosa
24 1022 Sus scrofa (Pig)
25 967 Archaeoglobus fulgidus
26 966 Synechocystis sp. (strain PCC 6803)
27 929 Pongo pygmaeus (Orangutan)
28 846 Vibrio cholerae
29 844 Yersinia pestis
30 836 Rhizobium meliloti (Sinorhizobium meliloti)
31 784 Oryctolagus cuniculus (Rabbit)
32 748 Aquifex aeolicus
33 724 Oryza sativa (Rice)
34 711 Pasteurella multocida
35 687 Vibrio parahaemolyticus
36 687 Mycoplasma pneumoniae
37 657 Staphylococcus aureus (strain Mu50 / ATCC 700699)
38 654 Staphylococcus aureus (strain N315)
39 650 Streptomyces coelicolor
40 643 Bacillus halodurans
41 639 Staphylococcus aureus (strain MW2)
42 636 Staphylococcus aureus (strain COL)
43 634 Staphylococcus aureus (strain MSSA476)
44 633 Staphylococcus aureus (strain MRSA252)
45 633 Vibrio vulnificus
46 627 Canis familiaris (Dog)
47 624 Mycobacterium leprae
48 619 Brachydanio rerio (Zebrafish) (Danio rerio)
49 613 Vibrio vulnificus (strain YJ016)
50 608 Treponema pallidum
51 596 Anabaena sp. (strain PCC 7120)
52 585 Methanobacterium thermoautotrophicum
53 572 Buchnera aphidicola subsp. Acyrthosiphon pisum
54 565 Pseudomonas putida (strain KT2440)
55 565 Helicobacter pylori (Campylobacter pylori)
56 562 Buchnera aphidicola subsp. Schizaphis graminum
57 560 Pseudomonas syringae pv. tomato
58 550 Bacillus anthracis
59 548 Staphylococcus epidermidis (strain ATCC 35984 / RP62A)
60 547 Rickettsia prowazekii
61 547 Staphylococcus epidermidis (strain ATCC 12228)
62 546 Helicobacter pylori J99 (Campylobacter pylori J99)
63 542 Bradyrhizobium japonicum
64 536 Lactococcus lactis subsp. lactis (Streptococcus lactis)
65 529 Ralstonia solanacearum (Pseudomonas solanacearum)
66 526 Zea mays (Maize)
67 526 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
68 525 Listeria monocytogenes
69 525 Photorhabdus luminescens subsp. laumondii
70 519 Listeria innocua
71 513 Rhizobium loti (Mesorhizobium loti)
72 508 Xanthomonas campestris pv. campestris
73 507 Buchnera aphidicola subsp. Baizongia pistaciae
74 505 Neisseria meningitidis serogroup B
75 502 Neisseria meningitidis serogroup A
76 495 Clostridium acetobutylicum
77 493 Shewanella oneidensis
78 492 Pan troglodytes (Chimpanzee)
79 490 Neurospora crassa
80 486 Mycoplasma genitalium
81 486 Caulobacter crescentus
82 479 Candida albicans (Yeast)
83 477 Bacillus cereus (strain ATCC 14579 / DSM 31)
84 473 Macaca fascicularis (Crab eating macaque) (Cynomolgus monkey)
85 470 Thermotoga maritima
86 470 Xanthomonas axonopodis pv. citri
87 464 Streptococcus pneumoniae
88 458 Xylella fastidiosa
89 455 Yersinia pseudotuberculosis
90 455 Listeria monocytogenes serotype 4b (strain F2365)
91 449 Xylella fastidiosa (strain Temecula1 / ATCC 700964)
92 446 Deinococcus radiodurans
93 440 Mimivirus
94 440 Pyrococcus horikoshii
95 440 Haemophilus ducreyi
96 436 Brucella melitensis
97 436 Methanosarcina acetivorans
98 435 Oceanobacillus iheyensis
99 435 Pyrococcus abyssi
100 435 Brucella suis
101 433 Corynebacterium glutamicum (Brevibacterium flavum)
102 433 Clostridium perfringens
103 432 Chlamydia trachomatis
104 432 Halobacterium salinarium (Halobacterium halobium)
105 427 Kluyveromyces lactis (Yeast)
106 419 Borrelia burgdorferi (Lyme disease spirochete)
107 416 Methanosarcina mazei (Methanosarcina frisia)
108 415 Ashbya gossypii (Yeast) (Eremothecium gossypii)
109 413 Chlamydia pneumoniae (Chlamydophila pneumoniae)
110 411 Streptococcus pneumoniae (strain ATCC BAA-255 / R6)
111 408 Nicotiana tabacum (Common tobacco)
112 408 Pyrococcus furiosus
113 404 Rhizobium sp. (strain NGR234)
114 403 Chlamydia muridarum
115 400 Thermoanaerobacter tengcongensis
116 396 Lactobacillus plantarum
117 391 Campylobacter jejuni
118 389 Ovis aries (Sheep)
119 389 Bordetella bronchiseptica (Alcaligenes bronchisepticus)
120 387 Sulfolobus solfataricus
121 386 Streptococcus mutans
122 384 Synechococcus elongatus (Thermosynechococcus elongatus)
123 384 Erwinia carotovora subsp. atroseptica (Pectobacterium atrosepticum)
124 379 Chromobacterium violaceum
125 379 Streptococcus pyogenes serotype M1
126 377 Streptococcus pyogenes serotype M6
127 377 Bordetella pertussis
128 376 Bordetella parapertussis
129 376 Enterococcus faecalis (Streptococcus faecalis)
130 374 Streptococcus pyogenes serotype M18
131 373 Rickettsia conorii
132 373 Streptococcus pyogenes serotype M3
133 370 Candida glabrata (Yeast) (Torulopsis glabrata)
134 369 Staphylococcus aureus
135 368 Streptomyces avermitilis
136 353 Pyrococcus kodakaraensis (Thermococcus kodakaraensis)
137 352 Chlorobium tepidum
138 342 Aeropyrum pernix
139 341 Corynebacterium efficiens
140 340 Bacillus cereus (strain ATCC 10987)
141 338 Methanopyrus kandleri
142 337 Photobacterium profundum (Photobacterium sp. (strain SS9))
143 336 Leptospira interrogans
144 328 Nitrosomonas europaea
145 326 Leptospira interrogans serogroup Icterohaemorrhagiae serovar copenhageni
146 325 Dictyostelium discoideum (Slime mold)
147 323 Salmonella paratyphi-a
148 319 Emericella nidulans (Aspergillus nidulans)
149 317 Sulfolobus tokodaii
150 316 Pisum sativum (Garden pea)
151 313 Streptococcus agalactiae serotype III
152 310 Streptococcus agalactiae serotype V
153 305 Gloeobacter violaceus
154 303 Thermoplasma acidophilum
155 302 Lycopersicon esculentum (Tomato)
156 295 Yarrowia lipolytica (Candida lipolytica)
157 293 Triticum aestivum (Wheat)
158 292 Synechococcus sp. (strain WH8102)
159 289 Fusobacterium nucleatum subsp. nucleatum
160 287 Prochlorococcus marinus (strain MIT 9313)
161 287 Rhodopseudomonas palustris
162 284 Prochlorococcus marinus
163 283 Bacillus thuringiensis subsp. konkukian
164 281 Macaca mulatta (Rhesus macaque)
165 281 Acinetobacter sp. (strain ADP1)
166 280 Pseudomonas putida
167 278 Sulfolobus acidocaldarius
168 276 Hordeum vulgare (Barley)
169 274 Coxiella burnetii
170 271 Cavia porcellus (Guinea pig)
171 269 Pyrobaculum aerophilum
172 269 Glycine max (Soybean)
173 268 Bacteriophage T4
174 268 Prochlorococcus marinus subsp. pastoris (strain CCMP 1378 / MED4)
175 267 Thermoplasma volcanium
176 265 Clostridium tetani
177 261 Solanum tuberosum (Potato)
178 259 Bacteroides thetaiotaomicron
179 258 Debaryomyces hansenii (Yeast) (Torulaspora hansenii)
180 258 Rhodopirellula baltica
181 257 Mycobacterium paratuberculosis
182 254 Rhodobacter capsulatus (Rhodopseudomonas capsulata)
183 254 Vaccinia virus (strain Copenhagen) (VACV)
184 254 Wolinella succinogenes
185 249 Bacillus clausii (strain KSM-K16)
186 248 Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
187 246 Spinacia oleracea (Spinach)
188 244 Burkholderia pseudomallei (Pseudomonas pseudomallei)
189 244 Bacillus cereus (strain ZK / E33L)
190 243 Mannheimia succiniciproducens (strain MBEL55E)
191 242 Wigglesworthia glossinidia brevipalpis
192 240 Thermus thermophilus (strain HB8 / ATCC 27634 / DSM 579)
193 238 Geobacter sulfurreducens
194 237 Bifidobacterium longum
195 235 Bacillus stearothermophilus
196 234 Corynebacterium diphtheriae
197 232 Equus caballus (Horse)
198 231 Chlamydophila caviae
199 229 Porphyromonas gingivalis (Bacteroides gingivalis)
200 225 Desulfovibrio vulgaris (strain Hildenborough / ATCC 29579 / NCIMB 8303)
201 224 Burkholderia mallei (Pseudomonas mallei)
202 224 Helicobacter hepaticus
203 224 Methanococcus maripaludis
204 221 Methylococcus capsulatus
205 220 Porphyra purpurea
206 219 Thermus thermophilus (strain HB27 / ATCC BAA-163 / DSM 7039)
207 217 Haloarcula marismortui (Halobacterium marismortui)
208 216 Chlamydomonas reinhardtii
209 212 Zymomonas mobilis
210 212 Synechococcus sp. (strain PCC 6301) (Anacystis nidulans)
211 209 Klebsiella pneumoniae
212 209 Leifsonia xyli subsp. xyli
213 205 Blochmannia floridanus
214 204 Geobacillus kaustophilus
215 203 Nocardia farcinica
216 200 Vaccinia virus (strain Western Reserve / WR) (VACV)
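The "first twenty species" figure quoted at the start of section 3 can be reproduced from the first twenty rows of the table above. A minimal sketch, assuming the release total of 207'132 entries:

```python
# Frequencies of the twenty most represented species (rows 1 to 20 of table 3.2).
top20 = [
    13433, 10523, 5271, 4865, 4849, 3957, 2945, 2824, 2784, 2338,
    1796, 1789, 1782, 1772, 1549, 1476, 1444, 1405, 1323, 1145,
]
total_entries = 207132  # UniProtKB/Swiss-Prot release 49.0

top20_sequences = sum(top20)
top20_share = 100 * top20_sequences / total_entries

print(top20_sequences)
print(f"{top20_share:.1f} % of the total number of entries")
```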
3.3 Taxonomic distribution of the sequences
Kingdom sequences (% of the database)
Archaea 10124 ( 5%)
Bacteria 96390 ( 47%)
Eukaryota 90758 ( 44%)
Viruses 9860 ( 5%)
Within Eukaryota:
Category sequences (% of Eukaryota) (% of the complete database)
Human 13434 ( 15%) ( 6%)
Other Mammalia 27101 ( 30%) ( 13%)
Other Vertebrata 8051 ( 9%) ( 4%)
Viridiplantae 14694 ( 16%) ( 7%)
Fungi 13810 ( 15%) ( 7%)
Insecta 4492 ( 5%) ( 2%)
Nematoda 3133 ( 3%) ( 2%)
Other 6043 ( 7%) ( 3%)
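The percentages in 3.3 are rounded to whole numbers, which is why the four kingdom shares add up to 101 rather than 100. The sketch below recomputes them from the raw counts; it assumes ordinary rounding to the nearest integer, which is consistent with the published values but is not stated in the notes.

```python
# Sequence counts per kingdom, copied from section 3.3.
kingdoms = {"Archaea": 10124, "Bacteria": 96390, "Eukaryota": 90758, "Viruses": 9860}

total = sum(kingdoms.values())

# Whole-number percentages of the complete database.
rounded_pct = {name: round(100 * count / total) for name, count in kingdoms.items()}

for name, pct in rounded_pct.items():
    print(f"{name:<10} {kingdoms[name]:>6} ({pct:>3}%)")
print("sum of rounded percentages:", sum(rounded_pct.values()))
```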
4. SEQUENCE SIZE
Distribution of the sequences by size (excluding fragments)
From To Number From To Number
1- 50 4139 1001-1100 1796
51- 100 14793 1101-1200 1183
101- 150 21093 1201-1300 898
151- 200 20181 1301-1400 675
201- 250 20675 1401-1500 557
251- 300 17760 1501-1600 335
301- 350 18308 1601-1700 241
351- 400 16671 1701-1800 198
401- 450 13067 1801-1900 196
451- 500 10877 1901-2000 162
501- 550 8282 2001-2100 104
551- 600 5613 2101-2200 153
601- 650 4693 2201-2300 132
651- 700 3352 2301-2400 89
701- 750 2810 2401-2500 70
751- 800 2326 >2500 521
801- 850 1932
851- 900 2052
901- 950 1565
951-1000 1200
The average sequence length in UniProtKB/Swiss-Prot is 364 amino acids.
The shortest sequence is GWA_SEPOF (P83570): 2 amino acids.
The longest sequence is SYNE1_HUMAN (Q8NF91): 8797 amino acids.
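Two consistency checks on this section can be sketched as follows. The size table excludes fragments, so its total should equal the number of entries minus the fragment count reported in section 7, and the average length should equal total residues divided by total entries. The assumption that the average is taken over all entries, fragments included, is ours.

```python
total_entries = 207132     # entries in release 49.0
total_residues = 75438310  # amino acids in release 49.0
fragments = 8433           # "Number of fragments" from section 7

# Counts for the 50-residue bins 1-50 ... 951-1000.
bins_50 = [4139, 14793, 21093, 20181, 20675, 17760, 18308, 16671, 13067, 10877,
           8282, 5613, 4693, 3352, 2810, 2326, 1932, 2052, 1565, 1200]
# Counts for the 100-residue bins 1001-1100 ... 2401-2500.
bins_100 = [1796, 1183, 898, 675, 557, 335, 241, 198, 196, 162,
            104, 153, 132, 89, 70]
over_2500 = 521

complete_sequences = sum(bins_50) + sum(bins_100) + over_2500
share_up_to_500 = 100 * sum(bins_50[:10]) / complete_sequences
average_length = total_residues / total_entries

print(complete_sequences, total_entries - fragments)
print(f"{share_up_to_500:.1f}% of complete sequences have at most 500 residues")
print(f"average length: {average_length:.1f}")
```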
5. JOURNAL CITATIONS
Note: the following citation statistics reflect the number of distinct
journal citations.
Total number of journals cited in this release of UniProtKB/Swiss-Prot: 1662
5.1 Table of the frequency of journal citations
Journals cited 1x: 586
2x: 225
3x: 128
4x: 81
5x: 56
6x: 39
7x: 31
8x: 41
9x: 18
10x: 16
11- 20x: 116
21- 50x: 145
51-100x: 59
>100x: 121
5.2 List of the most cited journals in UniProtKB/Swiss-Prot
Nb Citations Journal name
-- --------- -------------------------------------------------------------
1 13103 Journal of Biological Chemistry
2 6437 Proceedings of the National Academy of Sciences of the U.S.A.
3 4277 Journal of Bacteriology
4 3988 Gene
5 3925 Nucleic Acids Research
6 3442 Biochemical and Biophysical Research Communications
7 3316 FEBS Letters
8 3026 Biochemistry
9 2866 The EMBO Journal
10 2842 European Journal of Biochemistry
11 2599 Nature
12 2504 Biochimica et Biophysica Acta
13 2294 Journal of Molecular Biology
14 2257 Molecular and Cellular Biology
15 2158 Genomics
16 2069 Cell
17 1667 Biochemical Journal
18 1560 Science
19 1379 Molecular Microbiology
20 1277 Plant Molecular Biology
21 1251 Molecular and General Genetics
22 1080 Journal of Cell Biology
23 1034 Journal of Biochemistry
24 1006 Virology
25 989 Human Molecular Genetics
26 956 Journal of Virology
27 946 Nature Genetics
28 899 Genes and Development
29 828 Plant Physiology
30 815 Oncogene
31 807 The American Journal of Human Genetics
32 747 Human Mutation
33 699 Journal of Immunology
34 693 Infection and Immunity
35 664 Structure
36 663 Development
37 652 Archives of Biochemistry and Biophysics
38 641 Yeast
39 616 Journal of General Virology
40 608 Genetics
41 573 Microbiology
42 530 FEMS Microbiology Letters
43 520 Nature Structural Biology
44 490 Blood
45 465 Human Genetics
46 462 The Plant Cell
47 456 Current Genetics
48 455 Molecular Biology of the Cell
49 408 Applied and Environmental Microbiology
50 404 Cancer Research
51 403 Developmental Biology
52 395 Journal of Clinical Investigation
53 393 Molecular and Biochemical Parasitology
54 391 Journal of Cell Science
55 381 Mammalian Genome
56 379 Protein Science
57 378 Mechanisms of Development
58 375 Neuron
59 375 The Plant Journal
60 367 Molecular Endocrinology
61 362 Acta Crystallographica, Section D
62 358 Molecular Cell
63 354 The Journal of Experimental Medicine
64 346 Immunogenetics
65 340 Journal of Neuroscience
66 331 Journal of Molecular Evolution
67 321 Endocrinology
68 320 DNA and Cell Biology
69 304 Current Biology
70 294 Journal of Neurochemistry
71 286 DNA Sequence
72 283 Biological Chemistry Hoppe-Seyler
73 270 American Journal of Physiology
74 267 Molecular Biology and Evolution
75 266 The Journal of Clinical Endocrinology and Metabolism
76 260 Bioscience, Biotechnology, and Biochemistry
77 258 Brain Research. Molecular Brain Research
78 243 Toxicon
79 241 Journal of General Microbiology
80 238 Cytogenetics and Cell Genetics
81 221 Comparative Biochemistry and Physiology
82 214 Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
83 207 Antimicrobial Agents and Chemotherapy
84 201 Proteins
85 196 Molecular Pharmacology
86 186 Journal of Investigative Dermatology
87 186 Journal of Medical Genetics
88 170 DNA Research
89 170 Peptides
90 166 Plant and Cell Physiology
91 162 Molecular Plant-Microbe Interactions
92 162 Virus Research
93 161 Genome Research
94 159 Biology of Reproduction
95 158 DNA
96 152 Tissue Antigens
97 151 European Journal of Immunology
98 146 Biochimie
99 141 Molecular and Cellular Endocrinology
100 139 American Journal of Medical Genetics
101 138 Bioorganicheskaia Khimiia
102 135 Hemoglobin
103 128 Experimental Cell Research
104 127 Nature Cell Biology
105 126 Archives of Microbiology
106 124 Annals of Neurology
107 124 Molecular Phylogenetics and Evolution
108 121 Neurology
109 120 Insect Biochemistry and Molecular Biology
110 118 Agricultural and Biological Chemistry
111 117 European Journal of Human Genetics
112 113 Journal of Human Genetics
113 113 Immunity
114 113 RNA
115 112 General and Comparative Endocrinology
116 111 Developmental Dynamics
117 106 Diabetes
118 103 Molecular Reproduction and Development
119 103 Molecular Immunology
120 103 Planta
121 102 Genes to Cells
122 100 Journal of Protein Chemistry
6. STATISTICS FOR SOME LINE TYPES
The following table summarizes the total number of some UniProtKB/Swiss-Prot lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.
Total Number of Average
Line type / subtype number entries per entry
--------------------------------- -------- --------- ---------
References (RL) 407878 1.97
Journal 361072 192976 1.74
Submitted to EMBL/GenBank/DDBJ 43559 37255 0.21
Submitted to Swiss-Prot 726 723 <0.01
Unpublished observations 567 563 <0.01
Book citation 547 535 <0.01
Plant Gene Register 519 507 <0.01
Submitted to other databases 388 380 <0.01
Thesis 341 339 <0.01
Patent 131 129 <0.01
Unpublished results 22 22 <0.01
Worm Breeder's Gazette 6 6 <0.01
Comments (CC) 801430 3.87
SIMILARITY 225736 186967 1.09
FUNCTION 140862 137032 0.68
SUBCELLULAR LOCATION 106708 106708 0.52
CATALYTIC ACTIVITY 75237 69754 0.36
SUBUNIT 71489 71489 0.35
PATHWAY 39782 34536 0.19
COFACTOR 30534 27405 0.15
TISSUE SPECIFICITY 21716 21716 0.10
MISCELLANEOUS 17969 16342 0.09
PTM 15424 13217 0.07
DOMAIN 10600 9263 0.05
ALTERNATIVE PRODUCTS 8509 8509 0.04
CAUTION 7869 6989 0.04
INDUCTION 5801 5801 0.03
DEVELOPMENTAL STAGE 5218 5218 0.03
INTERACTION 4956 4956 0.02
DISEASE 3190 2329 0.02
ENZYME REGULATION 3089 3089 0.01
MASS SPECTROMETRY 2070 1747 0.01
DATABASE 1562 1406 0.01
BIOPHYSICOCHEMICAL PROPERTIES 1303 1303 0.01
POLYMORPHISM 531 519 <0.01
ALLERGEN 406 406 <0.01
RNA EDITING 403 403 <0.01
TOXIC DOSE 280 278 <0.01
BIOTECHNOLOGY 125 125 <0.01
PHARMACEUTICAL 61 61 <0.01
Features (FT) 1502318 7.25
CHAIN 210458 203970 1.02
STRAND 147090 7249 0.71
TRANSMEM 135865 29419 0.66
TURN 95995 7364 0.46
METAL 88482 21513 0.43
CONFLICT 75912 26372 0.37
TOPO_DOM 70026 14125 0.34
HELIX 67093 7146 0.32
CARBOHYD 64943 16393 0.31
DISULFID 64555 16785 0.31
DOMAIN 62427 33891 0.30
ACT_SITE 48527 28416 0.23
REPEAT 45273 6546 0.22
VARIANT 37244 7394 0.18
BINDING 30308 14568 0.15
MOD_RES 29797 14761 0.14
NP_BIND 28495 20106 0.14
REGION 27780 14613 0.13
SIGNAL 20949 20947 0.10
COMPBIAS 20505 11351 0.10
VARSPLIC 17662 7652 0.09
MUTAGEN 14751 3693 0.07
ZN_FING 14118 5500 0.07
MOTIF 12527 8696 0.06
SITE 10950 6128 0.05
NON_TER 10844 8278 0.05
INIT_MET 9461 9384 0.05
PROPEP 6855 5725 0.03
COILED 6420 3939 0.03
DNA_BIND 6094 5691 0.03
LIPID 6044 3972 0.03
PEPTIDE 5845 3574 0.03
TRANSIT 3637 3603 0.02
CA_BIND 2417 978 0.01
CROSSLNK 1210 942 0.01
NON_CONS 1120 523 0.01
UNSURE 418 170 <0.01
SE_CYS 221 155 <0.01
Cross-references (DR) 2236664 10.80
InterPro 423342 189133 2.04
EMBL 395184 199200 1.91
Pfam 248875 182310 1.20
PROSITE 188497 116608 0.91
GO 97279 27620 0.47
PIR 94760 88562 0.46
PRINTS 78657 61371 0.38
TIGRFAMs 76795 71754 0.37
HSSP 76069 76069 0.37
HAMAP 71745 71631 0.35
BioCyc 67849 62817 0.33
SMART 58049 44253 0.28
ProDom 54565 52510 0.26
PANTHER 48143 45588 0.23
Ensembl 38163 38153 0.18
PDB 30838 8497 0.15
SMR 26812 26812 0.13
TIGR 20204 19648 0.10
PIRSF 17045 16795 0.08
LinkHub 14271 14271 0.07
HGNC 12793 12737 0.06
MIM 11422 9364 0.06
MGI 10357 10318 0.05
IntAct 6588 6588 0.03
SGD 5328 5263 0.03
GermOnline 4926 4880 0.02
RGD 4605 4602 0.02
EcoGene 4225 4223 0.02
EchoBASE 4159 4127 0.02
TAIR 3998 3926 0.02
MEROPS 3958 3837 0.02
H-InvDB 3676 3658 0.02
WormPep 3260 2782 0.02
GeneDB_Spombe 2978 2943 0.01
FlyBase 2967 2920 0.01
WormBase 2859 2781 0.01
TRANSFAC 2811 2522 0.01
SubtiList 2766 2765 0.01
Gramene 2092 2084 0.01
StyGene 1505 1502 0.01
TubercuList 1433 1397 0.01
GeneFarm 1305 1299 0.01
SWISS-2DPAGE 1166 1166 0.01
ListiList 1045 1037 0.01
Reactome 998 998 <0.01
Leproma 627 624 <0.01
ZFIN 613 606 <0.01
PhotoList 525 525 <0.01
MaizeDB 432 427 <0.01
AGD 421 415 <0.01
HIV 370 365 <0.01
OGP 369 369 <0.01
REBASE 352 348 <0.01
ECO2DBASE 351 299 <0.01
LegioList 334 334 <0.01
DictyBase 326 324 <0.01
SagaList 314 313 <0.01
GlycoSuiteDB 282 282 <0.01
PHCI-2DPAGE 239 239 <0.01
MypuList 181 181 <0.01
Aarhus/Ghent-2DPAGE 128 98 <0.01
Siena-2DPAGE 103 103 <0.01
HSC-2DPAGE 85 85 <0.01
PhosSite 64 62 <0.01
COMPLUYEAST-2DPAGE 59 59 <0.01
PMMA-2DPAGE 52 52 <0.01
Rat-heart-2DPAGE 28 28 <0.01
PptaseDB 27 27 <0.01
ANU-2DPAGE 20 20 <0.01
Number of explicitly cross-referenced databases: 70
Number of implicitly cross-referenced databases: 29
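The "Average per entry" column is the total number of lines divided by the total number of entries in the release, not by the number of entries that carry such a line. A brief sketch for the four main line types, assuming the release total of 207'132 entries:

```python
total_entries = 207132  # UniProtKB/Swiss-Prot release 49.0

# Total number of lines per main line type, copied from the table above.
line_totals = {
    "References (RL)": 407878,
    "Comments (CC)": 801430,
    "Features (FT)": 1502318,
    "Cross-references (DR)": 2236664,
}

# Average number of lines per entry, formatted to two decimals as in the table.
averages = {name: f"{count / total_entries:.2f}" for name, count in line_totals.items()}

for name, avg in averages.items():
    print(f"{name:<24} {avg:>6}")
```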
7. MISCELLANEOUS STATISTICS
Total number of distinct authors cited in UniProtKB/Swiss-Prot: 216069
Total number of entries encoded on a Mitochondrion: 3397
Total number of entries encoded on a Plasmid: 3073
Total number of entries encoded on a Plastid: 20
Total number of entries encoded on a Plastid; Apicoplast: 2
Total number of entries encoded on a Plastid; Chloroplast: 5174
Total number of entries encoded on a Plastid; Cyanelle: 145
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: 86
Number of fragments: 8433
Number of additional sequences encoded on splice variants: 13333
| UniProtKB/TrEMBL protein database release 32.0 statistics |
|---|
1. INTRODUCTION
Release 32.0 of 07-February-2006 of UniProtKB/TrEMBL has been produced in sync
with UniProtKB/Swiss-Prot release 49 and EMBL/DDBJ/GenBank nucleotide sequence
database release 85, together with updates up to 30-January-2006. It contains
2'605'574 sequence entries comprising 838'379'783 amino acids.
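These notes do not quote an average sequence length for UniProtKB/TrEMBL, but it can be derived from the two totals above. A minimal sketch, assuming the average is taken over all entries, fragments included:

```python
trembl_entries = 2605574     # sequence entries in UniProtKB/TrEMBL release 32.0
trembl_residues = 838379783  # amino acids in UniProtKB/TrEMBL release 32.0

# Mean number of amino acids per entry.
trembl_average_length = trembl_residues / trembl_entries

print(f"average length: {trembl_average_length:.1f} amino acids")
```

The result is somewhat lower than the 364 amino acids reported for UniProtKB/Swiss-Prot.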
In the document delac_tr.txt, you will find a list of all accession numbers
which were previously present in UniProtKB/TrEMBL, but which have now been
deleted from the database. Most deletions are due to the deletion of the
corresponding CDS in the source nucleotide sequence databases
EMBL-Bank/DDBJ/GenBank. In addition, some entries are recognised to be Open
Reading Frames (ORFs) that have been wrongly predicted to code for proteins.
When there is enough evidence that these hypothetical proteins are not real,
we take the decision to remove them from UniProtKB/TrEMBL.
2. AMINO ACID COMPOSITION
2.1 Composition in percent for the complete database
Ala (A) 8.20 Gln (Q) 3.87 Leu (L) 9.82 Ser (S) 6.97
Arg (R) 5.50 Glu (E) 6.06 Lys (K) 5.32 Thr (T) 5.67
Asn (N) 4.32 Gly (G) 6.99 Met (M) 2.39 Trp (W) 1.34
Asp (D) 5.18 His (H) 2.26 Phe (F) 4.06 Tyr (Y) 3.05
Cys (C) 1.42 Ile (I) 5.93 Pro (P) 4.91 Val (V) 6.58
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.05
2.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Ile, Thr, Arg, Lys, Asp, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
3. TAXONOMIC ORIGIN
Total number of species represented in this release of
UniProtKB/TrEMBL: 103997
The first twenty species represent 604592 sequences: 23.2 % of the
total number of entries.
3.1 Table of the frequency of occurrence of species
Species represented 1x:49711
2x:19812
3x: 9920
4x: 5494
5x: 3115
6x: 2428
7x: 1675
8x: 1380
9x: 1119
10x: 923
11- 20x: 4231
21- 50x: 2134
51-100x: 860
>100x: 1195
3.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 146096 Human immunodeficiency virus 1
2 57551 Homo sapiens (Human)
3 57096 Oryza sativa (japonica cultivar-group)
4 51339 Mus musculus (Mouse)
5 42093 Arabidopsis thaliana (Mouse-ear cress)
6 28014 Tetraodon nigroviridis (Green puffer)
7 26030 Hepatitis C virus
8 25255 Drosophila melanogaster (Fruit fly)
9 20339 Caenorhabditis elegans
10 20120 Trypanosoma cruzi
11 15136 Anopheles gambiae str. PEST
12 14669 Plasmodium chabaudi
13 14617 Dictyostelium discoideum (Slime mold)
14 13851 Brachydanio rerio (Zebrafish) (Danio rerio)
15 13144 Caenorhabditis briggsae
16 12337 Xenopus laevis (African clawed frog)
17 12181 Aspergillus oryzae
18 11767 Plasmodium berghei
19 11748 Gibberella zeae (Fusarium graminearum)
20 11209 uncultured bacterium
21 10803 Neurospora crassa
22 10435 Hepatitis B virus (HBV)
23 10158 Aspergillus fumigatus (Sartorya fumigata)
24 9889 Rattus norvegicus (Rat)
25 9739 Trypanosoma brucei
26 9693 Schistosoma japonicum (Blood fluke)
27 9405 Aspergillus nidulans FGSC A4
28 9090 Entamoeba histolytica HM-1:IMSS
29 9050 Candida albicans SC5314
30 8102 Bradyrhizobium japonicum
31 8063 Solibacter usitatus Ellin6076
32 7937 Frankia sp. EAN1pec
33 7800 Plasmodium yoelii yoelii
34 7740 Escherichia coli
35 7715 Burkholderia sp. (strain 383) (Burkholderia cepacia
36 7663 Burkholderia vietnamiensis G4
37 7559 Streptomyces coelicolor
38 7432 Bradyrhizobium sp. BTAi1
39 7341 Streptomyces avermitilis
40 7165 Rhizobium loti (Mesorhizobium loti)
41 7085 Leishmania major
42 7049 Burkholderia cenocepacia HI2424
43 7013 Rhodopirellula baltica
44 6979 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
45 6752 Hahella chejuensis KCTC 2396
46 6567 Pseudomonas aeruginosa
47 6562 Bos taurus (Bovine)
48 6526 Burkholderia ambifaria AMMD
49 6505 Cryptococcus neoformans (Filobasidiella neoformans)
50 6475 Cryptococcus neoformans var. neoformans B-3501A
51 6456 Burkholderia cenocepacia AU 1054
52 6451 Ustilago maydis 521
53 6408 Ralstonia eutropha (strain JMP134) (Alcaligenes eutrophus)
54 6394 Giardia lamblia ATCC 50803
55 6329 Burkholderia pseudomallei (strain 1710b)
56 6316 Ralstonia metallidurans (strain CH34)
57 6310 Yarrowia lipolytica (Candida lipolytica)
58 6228 Bacillus anthracis
59 6129 Bacillus thuringiensis serovar israelensis ATCC 35646
60 6084 Debaryomyces hansenii (Yeast) (Torulaspora hansenii)
61 6079 Pseudomonas fluorescens (strain Pf-5 / ATCC BAA-477)
62 5905 Bacillus cereus G9241
63 5737 Nocardia farcinica
64 5728 Pseudomonas fluorescens (strain PfO-1)
65 5707 Rhizobium meliloti (Sinorhizobium meliloti)
66 5686 Burkholderia pseudomallei (Pseudomonas pseudomallei)
67 5661 Crocosphaera watsonii
68 5646 Polaromonas sp. JS666
69 5638 Anabaena variabilis (strain ATCC 29413)
70 5593 Gallus gallus (Chicken)
71 5561 Burkholderia thailandensis E264
72 5550 Anabaena sp. (strain PCC 7120)
73 5494 Bacillus cereus (strain ATCC 10987)
74 5394 Bacillus cereus (strain ZK / E33L)
75 5312 Chimpanzee immunodeficiency virus (SIV-cpz)
76 5288 Helicobacter pylori (Campylobacter pylori)
77 5245 Pseudomonas putida F1
78 5234 Plasmodium falciparum
79 5223 Plasmodium falciparum (isolate 3D7)
80 5153 Yersinia pestis
81 5084 Paracoccus denitrificans PD1222
82 5053 Clostridium beijerincki NCIMB 8052
83 5050 Streptococcus pneumoniae
84 5019 Pseudomonas syringae pv. syringae (strain B728a)
85 5018 Photobacterium profundum (Photobacterium sp. (strain SS9))
86 5009 Pseudomonas syringae pv. phaseolicola (strain 1448A / Race 6)
87 5005 Kluyveromyces lactis (Yeast)
88 4971 Bordetella bronchiseptica (Alcaligenes bronchisepticus)
89 4955 Pseudomonas syringae pv. tomato
90 4938 Azotobacter vinelandii AvOP
91 4935 Rhodopseudomonas palustris BisB18
92 4929 Candida glabrata (Yeast) (Torulopsis glabrata)
93 4911 Nocardioides sp. JS614
94 4896 Rhodopseudomonas palustris BisA53
95 4827 Colwellia psychrerythraea (strain 34H / ATCC BAA-681) (Vibrio psychroerythus)
96 4818 Escherichia coli O157:H7
97 4809 Bacillus thuringiensis subsp. konkukian
98 4769 Bacillus cereus (strain ATCC 14579 / DSM 31)
99 4751 Bacillus licheniformis (strain DSM 13 / ATCC 14580)
100 4748 Pseudomonas putida (strain KT2440)
3.3 Taxonomic distribution of the sequences
Kingdom sequences (% of the database)
Archaea 63173 ( 2%)
Bacteria 1193711 ( 46%)
Eukaryota 984105 ( 38%)
Viruses 362130 ( 14%)
Other 2455 ( <1%)
Within Eukaryota:
Category sequences (% of Eukaryota) (% of the complete database)
Human 0 ( 0%) ( 0%)
Other Mammalia 174802 ( 18%) ( 7%)
Other Vertebrata 137413 ( 14%) ( 5%)
Viridiplantae 204082 ( 21%) ( 8%)
Fungi 137160 ( 14%) ( 5%)
Insecta 98720 ( 10%) ( 4%)
Nematoda 36538 ( 4%) ( 1%)
Other 195390 ( 20%) ( 7%)
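The two tables in 3.3 can be cross-checked against each other: the kingdom counts should add up to the release total, and the categories listed under "Within Eukaryota" should add up to the Eukaryota row. A short sketch of both checks, with the second "Other" row renamed to `Other Eukaryota` only to keep the dictionary keys distinct:

```python
trembl_entries = 2605574  # UniProtKB/TrEMBL release 32.0

# Sequence counts per kingdom, copied from section 3.3.
kingdoms = {
    "Archaea": 63173, "Bacteria": 1193711, "Eukaryota": 984105,
    "Viruses": 362130, "Other": 2455,
}

# Breakdown of the Eukaryota row, copied from the "Within Eukaryota" table.
eukaryota = {
    "Human": 0, "Other Mammalia": 174802, "Other Vertebrata": 137413,
    "Viridiplantae": 204082, "Fungi": 137160, "Insecta": 98720,
    "Nematoda": 36538, "Other Eukaryota": 195390,
}

kingdom_total = sum(kingdoms.values())
eukaryota_total = sum(eukaryota.values())
virus_pct = 100 * kingdoms["Viruses"] / kingdom_total

print(kingdom_total == trembl_entries)
print(eukaryota_total == kingdoms["Eukaryota"])
print(f"Viruses: {virus_pct:.1f}% of the database")
```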
4. SEQUENCE SIZE
4.1 Distribution of the sequences by size (excluding fragments)
From To Number From To Number
1- 50 30122 1001-1100 15745
51- 100 163087 1101-1200 11054
101- 150 210030 1201-1300 7947
151- 200 196359 1301-1400 5191
201- 250 198812 1401-1500 4323
251- 300 186329 1501-1600 2996
301- 350 177963 1601-1700 2360
351- 400 142131 1701-1800 1985
401- 450 113612 1801-1900 1498
451- 500 97281 1901-2000 1280
501- 550 73289 2001-2100 953
551- 600 52919 2101-2200 1068
601- 650 40519 2201-2300 877
651- 700 31396 2301-2400 695
701- 750 27159 2401-2500 514
751- 800 22979 >2500 4643
801- 850 18351
851- 900 16327
901- 950 12048
951-1000 9446
4.2 Longest and shortest sequences
The shortest sequence is Q16047_HUMAN: 4 amino acids.
The longest sequence is Q3ASY8_CHLCH: 36805 amino acids.
5. STATISTICS FOR SOME LINE TYPES
The following table summarizes the total number of some UniProtKB/TrEMBL
lines, as well as the number of entries with at least one such line, and the
frequency of the lines.
| Line type / subtype | Total number | Number of entries | Average per entry |
|---|---|---|---|
| References (RL) | 3900912 | | 1.50 |
| Journal | 2025956 | 1662223 | 0.78 |
| Submitted to EMBL/GenBank/DDBJ | 1829984 | 1216539 | 0.70 |
| Thesis | 4894 | 4841 | <0.01 |
| Book citation | 4117 | 4074 | <0.01 |
| Submitted to other databases | 435 | 427 | <0.01 |
| Other | 35526 | 21675 | 0.01 |
| Comments (CC) | 1391074 | | 0.53 |
| CAUTION | 555822 | 555822 | 0.21 |
| SIMILARITY | 287304 | 283060 | 0.11 |
| SUBCELLULAR LOCATION | 139002 | 139002 | 0.05 |
| FUNCTION | 135438 | 134264 | 0.05 |
| CATALYTIC ACTIVITY | 98620 | 95511 | 0.04 |
| SUBUNIT | 65771 | 65771 | 0.03 |
| COFACTOR | 51801 | 51334 | 0.02 |
| PATHWAY | 44306 | 42318 | 0.02 |
| DOMAIN | 5498 | 3877 | <0.01 |
| MISCELLANEOUS | 3737 | 3727 | <0.01 |
| INTERACTION | 3643 | 3643 | <0.01 |
| MASS SPECTROMETRY | 116 | 61 | <0.01 |
| ALLERGEN | 16 | 16 | <0.01 |
| Features (FT) | 1334625 | | 0.51 |
| NON_TER | 1204261 | 720162 | 0.46 |
| SIGNAL | 84376 | 81398 | 0.03 |
| CHAIN | 45421 | 27275 | 0.02 |
| TRANSIT | 567 | 563 | <0.01 |
| Cross-references (DR) | 18719573 | | 7.18 |
| GO | 5741378 | 1631013 | 2.20 |
| InterPro | 3343552 | 1700220 | 1.28 |
| EMBL | 2991301 | 2596793 | 1.15 |
| Pfam | 2100412 | 1572588 | 0.81 |
| PROSITE | 1218339 | 783259 | 0.47 |
| PRINTS | 519797 | 431928 | 0.20 |
| SMART | 394067 | 312319 | 0.15 |
| SMR | 294379 | 294364 | 0.11 |
| BioCyc | 290893 | 275436 | 0.11 |
| TIGRFAMs | 289333 | 268049 | 0.11 |
| HSSP | 282910 | 282629 | 0.11 |
| ProDom | 277265 | 266338 | 0.11 |
| PANTHER | 246734 | 236085 | 0.09 |
| PIR | 195329 | 159779 | 0.07 |
| Ensembl | 111130 | 111128 | 0.04 |
| TIGR | 96019 | 89923 | 0.04 |
| Gramene | 57215 | 57183 | 0.02 |
| PIRSF | 56960 | 56161 | 0.02 |
| MGI | 46996 | 44473 | 0.02 |
| FlyBase | 26906 | 26866 | 0.01 |
| TAIR | 20427 | 20360 | 0.01 |
| WormPep | 19117 | 19036 | 0.01 |
| WormBase | 19116 | 19036 | 0.01 |
| LinkHub | 15357 | 15357 | 0.01 |
| ZFIN | 11986 | 11982 | <0.01 |
| MEROPS | 8168 | 7910 | <0.01 |
| IntAct | 5821 | 5821 | <0.01 |
| LegioList | 5569 | 5539 | <0.01 |
| ListiList | 4770 | 4753 | <0.01 |
| AGD | 4295 | 4295 | <0.01 |
| PhotoList | 4155 | 4031 | <0.01 |
| PDB | 3162 | 1872 | <0.01 |
| HGNC | 3063 | 3063 | <0.01 |
| TubercuList | 2555 | 2549 | <0.01 |
| RGD | 2144 | 2132 | <0.01 |
| GeneDB_Spombe | 1963 | 1957 | <0.01 |
| SagaList | 1780 | 1686 | <0.01 |
| SGD | 1327 | 1323 | <0.01 |
| Leproma | 980 | 979 | <0.01 |
| DictyBase | 979 | 979 | <0.01 |
| TRANSFAC | 954 | 942 | <0.01 |
| MypuList | 601 | 597 | <0.01 |
| REBASE | 124 | 119 | <0.01 |
| PHCI-2DPAGE | 108 | 108 | <0.01 |
| ANU-2DPAGE | 65 | 65 | <0.01 |
| SWISS-2DPAGE | 52 | 52 | <0.01 |
| Reactome | 14 | 14 | <0.01 |
| PMMA-2DPAGE | 3 | 3 | <0.01 |
| Siena-2DPAGE | 2 | 2 | <0.01 |
| COMPLUYEAST-2DPAGE | 1 | 1 | <0.01 |
Number of explicitly cross-referenced databases: 70
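The average-per-entry figures in section 5 appear to be the total number of lines divided by the total number of entries in the database. The sketch below illustrates that reading under two assumptions: the entry total is taken as the sum of the kingdom counts in section 3.3 (2605574), and averages that round below 0.01 are shown as `<0.01`. `TOTAL_ENTRIES` and `average_per_entry` are hypothetical names.

```python
# Assumption: total entry count derived by summing the kingdom counts
# listed in section 3.3 (Archaea, Bacteria, Eukaryota, Viruses, Other).
TOTAL_ENTRIES = 63173 + 1193711 + 984105 + 362130 + 2455


def average_per_entry(total_lines: int, total_entries: int = TOTAL_ENTRIES) -> str:
    """Format the mean number of lines per database entry to two decimals.

    Non-zero averages that would round to 0.00 are reported as '<0.01',
    which is an assumption inferred from the published table.
    """
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    avg = total_lines / total_entries
    if 0 < avg < 0.005:
        return "<0.01"
    return f"{avg:.2f}"


if __name__ == "__main__":
    # Totals copied from the table in section 5.
    for label, total in [
        ("References (RL)", 3900912),
        ("Cross-references (DR)", 18719573),
        ("GO", 5741378),
        ("ZFIN", 11986),
    ]:
        print(f"{label:25s} {average_per_entry(total)}")
```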
6. MISCELLANEOUS STATISTICS
Total number of distinct authors cited in UniProtKB/TrEMBL: 222640
Total number of entries encoded on a Mitochondrion: 124483
Total number of entries encoded on a Plasmid: 41026
Total number of entries encoded on a Plastid: 2319
Total number of entries encoded on a Plastid; Apicoplast: 125
Total number of entries encoded on a Plastid; Chloroplast: 45103
Total number of entries encoded on a Plastid; Cyanelle: 5
Total number of entries encoded on a Plastid; Non-photosynthetic plastid:
Number of fragments: 722286
| Submissions and Updates |
|---|
We welcome feedback from our users. We would especially appreciate your notifying us if you find that sequences belonging to your field of expertise are missing from the database. We also would like to be notified about annotations to be updated, if, for example, the function of a protein has been clarified or if new information about post-translational modifications has become available.
Submit new sequence data, updates and corrections at http://www.uniprot.org/support/submissions.shtml
For all queries regarding submissions to UniProtKB and to submit new protein sequence data, please contact:
UniProt Knowledgebase
The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Telephone: (+44 1223) 494 462
Telefax: (+44 1223) 494 468
E-mail:
| Download information |
|---|
The latest data of the UniProt Knowledgebase is available in various formats (flatfile, XML or FASTA) at http://www.uniprot.org/database/download.shtml. The data is further supplemented by a file containing the sequences of all additional splice isoforms annotated in UniProtKB/Swiss-Prot. This data set is documented in the file ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/README.varsplic
For users who wish to download the UniProt Knowledgebase only occasionally, we distribute the latest major release (updated 3 times per year) in flatfile format. Previous UniProtKB/Swiss-Prot and UniProtKB/TrEMBL releases are archived under ftp://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases. The UniProt Knowledgebase major release is also available on CD-ROM from the EBI.
| Contact |
|---|
| Citation |
|---|
If you want to cite UniProt in a publication please use the following reference:
Wu C.H., Apweiler R., Bairoch A., Natale D.A., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Mazumder R., O'Donovan C., Redaschi N., Suzek B. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34: D187-D191 (2006).