To provide human proteomics MS/MS search databases that are well defined, comprehensive, and frequently updated, we have developed an automated system that integrates all of the major sources of human protein sequences into a set of search databases. These databases are tiered into several levels of complexity, from which researchers may choose depending on the goal of the experiment and the data processing resources available.
On the first of every month, all protein lists are downloaded from their original sources. If any source has changed, the lists are integrated as described in Deutsch et al. (see the citation below) and a new release is posted here. If none of the source databases has changed, there is no new release. Briefly, the individual levels are as follows:
Level | Description |
---|---|
Level 1 | Includes only the core ~20,000 primary isoforms from Swiss-Prot and the Universal Protein Contaminants. |
Level 2 | Level 1 plus all ~22,000 "varsplic" alternative splice isoforms from Swiss-Prot, and immunoglobulin variable region sequences from Swiss-Prot and IMGT. |
Level 3 | Level 2 plus GENCODE, UniProt "UP000005640", and additional non-redundant sequences from other small sources including microbes, external contributions, and additional RefSeq XP sequences. |
Level 4 | A "kitchen sink" database that includes Level 3 plus all other distinct sequences from UniProtKB/TrEMBL and RefSeq XP that are not already present in lower levels. |
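
The cumulative structure of the levels can be illustrated with a minimal sketch. This is not the THISP pipeline itself: the function name `build_tiers`, the toy accessions, and the use of exact sequence identity as the redundancy test are all assumptions made for illustration, and the actual integration rules are those described in the THISP article.

```python
from typing import Iterable, List, Tuple

Entry = Tuple[str, str]  # (accession, amino-acid sequence)


def build_tiers(sources_by_level: List[Iterable[Entry]]) -> List[List[Entry]]:
    """Return cumulative tiers: tier i holds tier i-1 plus newly seen sequences."""
    seen = set()                 # sequences already present in a lower level
    current: List[Entry] = []    # entries accumulated so far
    tiers: List[List[Entry]] = []
    for level_sources in sources_by_level:
        for accession, sequence in level_sources:
            # Assumption: redundancy means an exact sequence match.
            if sequence not in seen:
                seen.add(sequence)
                current.append((accession, sequence))
        tiers.append(list(current))  # snapshot of the cumulative level
    return tiers


# Toy data: hypothetical accessions and sequences, for illustration only.
level1 = [("P1", "MKT"), ("C1", "AAA")]      # primary isoform + contaminant
level2 = [("P1-2", "MKTL"), ("IG1", "QVQ")]  # splice isoform + immunoglobulin
level3 = [("G1", "MKT"), ("G2", "MSS")]      # G1 duplicates P1 and is skipped

tiers = build_tiers([level1, level2, level3])
print([len(t) for t in tiers])  # [2, 4, 5]
```

Because each level is a superset of the one below it, an entry found in a Level 1 search is also present in every higher level. This matches the per-level counts in the table that follows, where a source's contribution never shrinks the lower-level content.
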
Database | Date | # Entries | Level 1 | Level 2 | Level 3 | Level 4 |
---|---|---|---|---|---|---|
Swiss-Prot canonical | 2024-12-01 | 20,406 | 20,406 | 20,406 | 20,406 | 20,406 |
Swiss-Prot + varsplic | 2024-12-01 | 42,502 | 20,406 | 42,498 | 42,498 | 42,498 |
GENCODE | 2024-11-01 | 112,218 | | | 60,333 | 60,333 |
UP000005640 | 2024-12-01 | 105,497 | 20,406 | 42,498 | 45,321 | 45,321 |
UniProtKB + TrEMBL | 2024-12-01 | 227,111 | 20,406 | 42,498 | 45,321 | 142,898 |
NCBI RefSeq NP | 2024-12-01 | 67,695 | | | 13,430 | 12,778 |
NCBI RefSeq XP | 2024-12-01 | 131,347 | | | | 50,937 |
IMGT | 2024-12-01 | 711 | | 711 | 711 | 711 |
Microb | 2024-12-01 | 1,608 | | | 1,608 | 1,608 |
Contrib | 2024-12-01 | 726,331 | | | 726,331 | 726,331 |
Contaminant | 2024-04-19 | 499 | 299 | 299 | 299 | 299 |
# Entries | | | 20,705 | 43,508 | 848,047 | 995,909 |

Below are the monthly releases of the THISP databases available for download. The "Base" is the set of Level 1-4 FASTA files (target and target-decoy). The "Components" is the set of all individual source components (from neXtProt, RefSeq, IMGT, cRAP, etc.) used to make the FASTA files in "Base", as described in the THISP article.
If you use this database, please cite us:
General purpose citation: Deutsch et al., "Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics", J Proteome Res. Author manuscript; available in PMC 2016 Nov 4.