Get sequences with '|' in their names

Dear all,

I just discovered and installed SequenceServer. I love it, congratulations.

I have a set of sequences in a BLAST database that have piple (|) in their names. For these sequences, and only these, I cannot view or download the sequence from the BLAST results page.

Here is an example name: PWVi6_TR80574|c0_g1_i1 (it’s derived from a Trinity assembly contig).

If I try to get the sequence directly from the blast database, it works, so I guess this is a problem within SequenceServer.

tristan@salentinella:/home/blastdb$ blastdbcmd -db all35sp_contigs.fas -entry 'PWVi6_TR80574|c0_g1_i1'

lcl>PWVi6_TR80574|c0_g1_i1 len=350 path=[655:0-349] [-1, 655, -2]
CGAGGATCCTTTGTCCTTATCAATGGCAAATAATGTCATTTGTCATTTATTTATTCCTTTGAAAGGAGTTGAAAGGACGA
TAGGAATAATAAATAAATCTAGGAATTCTTTTCTTTATCAGAATAGATGAACCAACGTCTCCAAATTATGCGTTTGAGGC
TCTTGACAAAAATCAAGAAAAGTTGGCAGTTTTCAACAATTTAAAAATTCGTTTTTAAAATCAAGAAATTACAAAAAATG
GTGAATTCTATAATTAATAAAATGTGAAGGAGATTATGAAACAGAATTTTGGTGAAGCAGAGTGTTTAATTTATACTTAA
AGGAAATATTTTTAGTTGATGAACTGAAAG

Thanks for your help!

Sorry, I meant pipes (not piple).

Hey Tristane,

Were the databases already created from FASTA files, or did you have SequenceServer create them for you? I’m guessing you had the databases from before and they weren’t created with the -parse_seqids option (I think if they had been, lcl would have been gnl). If that’s indeed the case, deleting the databases and re-creating them with SequenceServer (or manually with the -parse_seqids option) should fix the issue.

– Priyam

Hi Anurag,
The database was made manually with the -parse_seqids option, and the download works for most sequences, just not for the ones with pipes. I’ll try formatting the db with SequenceServer to see if it makes any difference.

Hey Tristane,

In that case re-creating dbs with SequenceServer won’t be of any help. I just picked your example sequence and tried running it locally. It appears in the blast result page just fine. But download indeed doesn’t work.

The issue is in part BLAST’s whimsical treatment of sequence ids and in part how SequenceServer attempts to generalise / work around it. The command that SequenceServer runs to retrieve sequences is:

blastdbcmd -outfmt '%g %i %a %t %s' -db … -entry 'id_as_obtained_from_blast_xml_output'

(note the -outfmt flag)

This fails behind the scenes for your sequence with the following error:

Error: [blastdbcmd] FASTA-style ID LCL|PWVI6_TR80574|C0_G1_I1 has too many parts.

However, blastdbcmd -outfmt '%f' -db … -entry … (same as not specifying the -outfmt switch) works.
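
To compare the two invocations above outside SequenceServer, a small Python sketch along these lines might help. The function names blastdbcmd_args and run_blastdbcmd are made up for illustration, the database path is left as a parameter, and only the blastdbcmd flags already shown in this thread (-db, -entry, -outfmt) are used. It assumes blastdbcmd is on the PATH.

```python
import subprocess


def blastdbcmd_args(db, entry, outfmt=None):
    """Build the blastdbcmd argument list (hypothetical helper).

    Passing the arguments as a list avoids shell quoting problems with '|'.
    """
    args = ["blastdbcmd", "-db", db, "-entry", entry]
    if outfmt is not None:
        args += ["-outfmt", outfmt]
    return args


def run_blastdbcmd(db, entry, outfmt=None):
    """Run blastdbcmd and return (exit status, stdout, stderr) for inspection."""
    result = subprocess.run(
        blastdbcmd_args(db, entry, outfmt),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr


# Example comparison (db_path is whatever you pass to -db):
#   run_blastdbcmd(db_path, "PWVi6_TR80574|c0_g1_i1", "%g %i %a %t %s")
#   run_blastdbcmd(db_path, "PWVi6_TR80574|c0_g1_i1", "%f")
```

Looking at stderr from the first call and stdout from the second should show whether a given id is affected.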

I will see how to deal with this. But the short-term fix, and really the better long-term one, is to avoid pipes in your sequence ids, or to prefix all your sequence ids with lcl| or gnl| (if you must have pipes).
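
As a rough sketch of the two workarounds just mentioned (prefixing the ids or removing the pipes), something like the following Python could rewrite the headers of the FASTA file before the database is rebuilt. The function name rewrite_fasta_headers, its parameters, and the choice of '_' as the replacement character are assumptions for illustration, not something SequenceServer or BLAST provides. It has not been checked against any particular BLAST+ version, so whether the prefix alone is enough should be verified on your own data.

```python
def rewrite_fasta_headers(lines, mode="prefix", prefix="lcl|", replacement="_"):
    """Yield FASTA lines with the sequence id of each header rewritten.

    mode="prefix":  prepend `prefix` to every id that does not already
                    start with lcl| or gnl|.
    mode="replace": replace every '|' in the id with `replacement`.
    The description after the first space is left untouched.
    """
    if mode not in ("prefix", "replace"):
        raise ValueError("mode must be 'prefix' or 'replace'")
    for line in lines:
        if not line.startswith(">"):
            # Sequence lines pass through unchanged.
            yield line
            continue
        header = line[1:].rstrip("\r\n")
        seq_id, sep, description = header.partition(" ")
        if mode == "replace":
            seq_id = seq_id.replace("|", replacement)
        elif not seq_id.startswith(("lcl|", "gnl|")):
            seq_id = prefix + seq_id
        yield ">" + seq_id + sep + description + "\n"


# Usage with files (both file names are placeholders):
#   with open("contigs.fas") as src, open("contigs.renamed.fas", "w") as dst:
#       dst.writelines(rewrite_fasta_headers(src))
```

With mode="prefix", the header ">PWVi6_TR80574|c0_g1_i1 len=350" becomes ">lcl|PWVi6_TR80574|c0_g1_i1 len=350"; with mode="replace" it becomes ">PWVi6_TR80574_c0_g1_i1 len=350". The database would then need to be re-created from the rewritten file, with the -parse_seqids option as discussed earlier in this thread.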

– Priyam

Hi Anurag,
Thanks for the explanation, that makes sense now. I also see that this is an old BLAST+ issue that is dependent on the BLAST version, see this post:
http://blastedbio.blogspot.co.uk/2013/12/blast-should-keep-its-blordid.html?m=1

I am reluctant to remove the pipes, as they come from the assembler upstream and removing them would break backward compatibility in many places. I think I’ll go with adding a prefix such as lcl for the time being.

Thanks again,
– Tristan