| How to prepare your FASTA files for BLAST databank creation? |
|
To use your own set of sequences as a sequence bank that can be searched with BLAST, you have to convert your sequences to a BLAST binary file. This conversion requires the NCBI tool called formatdb. This tool converts the file(s) containing your set of sequences (in FASTA format) into an indexed binary file optimized to speed up BLAST searches. By default, the formatdb tool only accepts FASTA files whose sequence identifiers follow some rules. The following section explains how to prepare your FASTA files so that they conform to the formatdb specification.

If your FASTA file contains simple sequence IDs, such as:
>P18646 10 kDa protein precursor - Vigna unguiculata (Cowpea)

it can be used with formatdb as is.

If your FASTA file contains multiple sequence IDs, such as:
>P18646|DEF_VIGUN 10 kDa protein precursor - Vigna unguiculata (Cowpea)

(note the two IDs separated by a pipe '|' character), then you have to prefix the sequence ID with the string "lcl|" (without the quotes), like this:
>lcl|P18646|DEF_VIGUN 10 kDa protein precursor - Vigna unguiculata (Cowpea)

If your FASTA file contains one or more IDs prefixed with a database identifier, such as:

>sp|P18646|DEF_VIGUN Defensin-like protein OS=Vigna unguiculata PE=3 SV=1
MEKKSIAGLCFLFLVLFVAQEVVVQSEAKTCENLVDTYRGPCFTTGSCDDHCKNKEHLLS
GRCRDDVRCWCTRNC

(note the database code: sp), then you have to check that the database code is recognized by the formatdb tool. Here are the valid database codes you can use: gb, emb, dbj, pir, prf, sp, pdb, pat, bbs, ref and lcl.
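The three cases above can be sketched as a small Python helper that rewrites header lines before the file is given to formatdb. This is only an illustrative sketch: the names `VALID_DB_CODES`, `normalize_header` and `normalize_fasta` are hypothetical, and the rule "add lcl| whenever the first pipe-separated field is not a listed database code" is an interpretation of the text above, not an official NCBI procedure.

```python
# Database codes listed above as recognized by formatdb.
VALID_DB_CODES = {"gb", "emb", "dbj", "pir", "prf", "sp",
                  "pdb", "pat", "bbs", "ref", "lcl"}


def normalize_header(header):
    """Return a FASTA header line adjusted to the rules described above.

    Hypothetical helper: simple IDs are kept as is, IDs that start with
    a listed database code are kept as is, and any other pipe-separated
    ID is prefixed with "lcl|".
    """
    header = header.rstrip()
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line: %r" % header)

    # The sequence ID is the first whitespace-delimited token after '>'.
    parts = header[1:].split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""

    if "|" in seq_id:
        db_code = seq_id.split("|", 1)[0]
        if db_code not in VALID_DB_CODES:
            seq_id = "lcl|" + seq_id

    return ">" + seq_id + (" " + description if description else "")


def normalize_fasta(lines):
    """Yield the lines of a FASTA file with header lines normalized."""
    for line in lines:
        line = line.rstrip("\n")
        yield normalize_header(line) if line.startswith(">") else line
```

For example, `normalize_header(">P18646|DEF_VIGUN 10 kDa protein precursor")` adds the "lcl|" prefix, while a header that starts with ">sp|" or that holds a single simple ID is returned unchanged.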
Please refer to this document (table 1.1) for more information about FASTA file compatibility with the NCBI formatdb tool.
If your FASTA files conform to these rules, they can be converted to a BLAST databank by KoriBlast/formatdb.
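For readers who run formatdb directly rather than through KoriBlast, the conversion step might look like the sketch below. It only builds the command line. The function name `formatdb_command` and the file name in the usage note are hypothetical, and the options that KoriBlast itself passes to formatdb are not documented here. The flags used are those of the legacy NCBI formatdb program: `-i` (input file), `-p` (T for protein, F for nucleotide), `-o T` (parse sequence IDs, which is when the header rules above matter) and `-n` (databank base name).

```python
import subprocess  # used only in the usage note below


def formatdb_command(fasta_path, is_protein=True, db_name=None):
    """Build a legacy formatdb command line for one FASTA file.

    Hypothetical helper: it returns the argument list without running it,
    so it can be inspected before execution.
    """
    cmd = [
        "formatdb",
        "-i", fasta_path,                   # input FASTA file
        "-p", "T" if is_protein else "F",   # protein or nucleotide bank
        "-o", "T",                          # parse sequence IDs and index them
    ]
    if db_name:
        cmd += ["-n", db_name]              # base name of the output files
    return cmd
```

Assuming formatdb is installed and on the PATH, `subprocess.run(formatdb_command("my_sequences.fasta"), check=True)` would then run the conversion.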
|