Getting Blast sequence databanks
Introduction 

Blast only accepts sets of sequences that have been formatted for it. This is not specific to KoriBlast; it is a requirement of the Blast software itself. To run Blast against any set of sequences, you first have to process them with formatdb, a tool available from the NCBI as part of the Blast software suite. formatdb is the only tool capable of creating a Blast ready sequence databank, and it only accepts Fasta formatted sequences as input.
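As an illustration of the requirement above, here is a minimal Python sketch that checks whether a file looks like Fasta and builds a formatdb command line for it. The function names are hypothetical and are not part of KoriBlast. The options used (-i for the input file, -p T/F for protein or nucleotide, -n for the databank name) are those of the legacy formatdb tool as we understand them, so check the help of your own formatdb version before relying on them.

```python
from pathlib import Path


def looks_like_fasta(path):
    """Return True if the first non-blank line of the file starts with '>'."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False  # empty file


def formatdb_command(fasta_path, is_protein, db_name=None):
    """Build (but do not run) a formatdb argument list.

    Assumption: -i is the input file, -p is T for proteins and F for
    nucleotides, and -n is the base name of the databank.
    """
    fasta_path = Path(fasta_path)
    name = db_name or fasta_path.stem
    return [
        "formatdb",
        "-i", str(fasta_path),
        "-p", "T" if is_protein else "F",
        "-n", name,
    ]
```

The returned list can be passed to subprocess.run(command, check=True) on a machine where formatdb is installed and available on the PATH.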

KoriBlast extends the capabilities of formatdb and gives you an easy way to prepare Blast databanks from standard sequence data sets in Genbank, Embl, Uniprot or Swissprot format (either plain text or gzipped).

There are two ways to provide KoriBlast (and Blast) with a sequence databank: either you download a Blast ready databank from the web, or you prepare one from a set of sequence files. In both cases, KoriBlast will help you install the databank.

 

Getting Blast databanks from the NCBI 

As far as we know, only the NCBI directly provides Blast ready sequence databanks. They can be downloaded from the NCBI FTP site: ftp://ftp.ncbi.nih.gov/blast/db. The files of interest all have the extension '.tar.gz' and have to be downloaded using FileZilla or any other FTP tool of your choice. Starting with KoriBlast 2.6, these files can be uncompressed and unarchived directly from KoriBlast.
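If you prefer to unpack a downloaded '.tar.gz' file outside of KoriBlast, the step can be sketched in Python with the standard tarfile module. This is only an illustration under our own assumptions: the function name unpack_databank is hypothetical, and the sketch does not reproduce what KoriBlast does internally.

```python
import tarfile
from pathlib import Path


def unpack_databank(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the extracted file names."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        members = archive.getmembers()
        for member in members:
            target = (dest / member.name).resolve()
            # Refuse entries that would be written outside dest_dir.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)
    return sorted(m.name for m in members if m.isfile())
```

Calling the function once per downloaded archive, with the same destination directory each time, gathers all the extracted files in one place.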

The following table explains which files to download for some particular databanks. The NCBI help desk provides a more detailed document about the content of this repository, so do not hesitate to read it too.

Databank | What to download

Protein sequences (nr) | Retrieve all files called nr.XX.tar.gz, where XX is a number.

Nucleotide sequences (nt) | Retrieve all files called nt.XX.tar.gz, where XX is a number.

RefSeq Genomic | Retrieve all files called refseq_genomic.XX.tar.gz, where XX is a number.

Swissprot | There are two ways to get the SwissProt Blast databank from the NCBI. First, you can retrieve the file called swissprot.tar.gz as well as all the files for nr (see above). Second, and much faster: go to the directory ftp://ftp.ncbi.nih.gov/blast/db/fasta, get the swissprot.gz file and use KoriBlast to prepare the corresponding databank.

PDB | There are two ways to get the PDB Blast databank from the NCBI. First, you can retrieve the file called pdb.tar.gz as well as all the files for nr (see above). Second, and much faster: go to the directory ftp://ftp.ncbi.nih.gov/blast/db/fasta, get the pdbaa.gz file and use KoriBlast to prepare the corresponding databank.

All the files you can download from the NCBI Blast databank repository can be provided to KoriBlast to run Blast searches. This is quite easy, as explained here.
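The naming pattern used for nr, nt and refseq_genomic (a databank name, a number, then '.tar.gz') can be checked with a few lines of Python. The sketch below assumes you already have the list of file names from the repository, for example copied from your FTP tool; the function name select_volumes is hypothetical.

```python
import re


def select_volumes(file_names, databank):
    """Return the names matching '<databank>.<digits>.tar.gz', sorted by number."""
    pattern = re.compile(r"^" + re.escape(databank) + r"\.(\d+)\.tar\.gz$")
    matches = []
    for name in file_names:
        found = pattern.match(name)
        if found:
            matches.append((int(found.group(1)), name))
    return [name for _, name in sorted(matches)]
```

For instance, select_volumes(listing, "nr") keeps only the nr.XX.tar.gz files and ignores everything else in the listing, including single-file databanks such as swissprot.tar.gz.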

 

Getting sequence data files from the Web 

Many sequence data sets are available all around the world, in various file formats accepted by KoriBlast (Genbank, Embl, Uniprot, Swissprot, Fasta). We cannot mention all of them, but here are some major sources of data:

Data types | Source | What to download

Various sequence sets | ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA | The NCBI provides a wide range of Fasta formatted sequence data sets. From the NCBI repository, just download the gzipped files of interest.

Genome data sets | ftp://ftp.ncbi.nih.gov/genbank/genomes | Within that repository, you can find, for each genome of interest, the following Fasta formatted files: (1) files with the '.fna' extension: sequence of a chromosome; (2) files with the '.faa' extension: set of proteins; (3) files with the '.ffn' extension: set of CDS.

RefSeq genome data set | ftp://ftp.ncbi.nih.gov/genbank/genomes | Same comments as above.

UniProt (SwissProt and TrEmbl) | ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/ | Get one of the '.fasta.gz' files as needed.

Vector contamination detection | ftp://ftp.ncbi.nih.gov/pub/UniVec | Get one of the following files: UniVec or UniVec_core.

Human SNPs | ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/rs_fasta | Get the files with the extension '.fas.gz'.

Other SNPs | ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/ | From the provided link, enter the directory of the target organism, then enter the 'rs_fasta' directory. From there, get the files with the extension '.fas.gz'.

All the files you can download from this table can be provided to KoriBlast to prepare a Blast databank. This is quite easy, as explained here.
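Before preparing a databank from a downloaded file, it can help to know which kind of sequences it holds and how many. The following Python sketch is only an illustration: the function names are hypothetical, and the mapping assumes that '.ffn' files (CDS) contain nucleotide sequences, which you should verify for your own data.

```python
import gzip

# Extension -> sequence type. Assumption: '.ffn' (CDS) holds nucleotide sequences.
SEQUENCE_TYPE_BY_EXTENSION = {
    ".fna": "nucleotide",
    ".faa": "protein",
    ".ffn": "nucleotide",
}


def sequence_type(file_name):
    """Guess the sequence type from the extension, ignoring a trailing '.gz'."""
    name = file_name[:-3] if file_name.endswith(".gz") else file_name
    for extension, kind in SEQUENCE_TYPE_BY_EXTENSION.items():
        if name.endswith(extension):
            return kind
    return None  # unknown extension


def count_fasta_records(path):
    """Count the '>' header lines of a plain text or gzipped Fasta file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

A count of zero is a good hint that the file is not Fasta formatted, or that the download was incomplete.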

 

Using NCBI thematic databanks 

The NCBI provides a wide range of sequence databanks for use with its Blast Internet service, and KoriBlast is capable of using all of them. This is explained in the KoriBlast User Manual. Within KoriBlast, click on the [Help and Visual Tutorial] button to open the manual. In the table of contents, follow the path: The Configuration Module, Updating the sequence databases for the NCBI Blast system.

 


 

© 2007-2012 Korilog SARL, all rights reserved. Terms of Use and Privacy Policy