Bug run into upon blasting

This happened when running 1 protein blast query against 2 nucleotide databases (TBLASTN). Running the query against each database individually works fine:

Hey Tim,

I’m unable to reproduce this issue. Can you reproduce the issue and send me the BLAST archive file sequenceserver generates? To get the archive file, launch SequenceServer in debug mode:

sequenceserver -D

Then BLAST and look at the logs in the terminal. You will see

DEBUG Executing: tblastn -db …

and then

DEBUG Executing: blast_formatter -archive ‘/path/to/archive/file’ …

Copy the archive file to your home directory and send it over in an email attachment or using pastie / gist.

– Priyam

Hi Priyam,

I’ve been unable to replicate this. If I run into it again I’ll send you the causal files. One changed variable was that earlier I had installed SequenceServer in my user directory (gem install --user-install sequenceserver), but am now using a globally installed sequenceserver.

Hi Priyam.

I’m having the same issue that Tim reported. I was having the issue on version 1.0.2. I have now upgraded to 1.0.4 and still have the problem. Here’s an example error message. The problem occurs for some pairs of databases, but not all.

Ox::ParseError - invalid format, elements overlap at line 962, column 26 [parse.c:561]

I have attached the blast results file as you suggested to Tim, and the xml report file (each with a .txt added to the name to get them accepted in the e-mail). I am new to SequenceServer, so let me know if I should provide more information.

Thanks for any help you can give me.

Chris

sequenceserver_blast_result20150602-9295-lg937f.txt (221 KB)

sequenceserver-xml_report.xml20150602-9295-kyxw4n.txt (52.2 KB)

Hi Priyam,

I took a look at the XML file. It looks like the and <Iteration_hits> elements have no closing tags in the cases where the error occurs.

Chris

Hi,

I ran blast_formatter on the command line against the results file. Looks like the problem with the XML file is caused by an exception from blast_formatter:

blast_formatter -archive ‘sequenceserver_blast_result20150602-13132-1t9kh01’ -outfmt 5 -out ‘bf.out’
Error: Failed s_BlastXMLAddIteration Q(0/1NCBI C++ Exception:
T0 “/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_10349_130.14.22.10_9008__PrepareRelease_Linux64-Centos_1414443165/c++/compilers/unix/…/…/src/objtools/alnmgr/alnvec.cpp”, line 902: Error: ncbi::objects::CAlnVec::TranslateNAToAA() - CAlnVec::TranslateNAToAA(): NA size expected to be divisible by 3

Chris

Hey Chris,

Thanks for sending the XML and archive files. I can reproduce the XML parsing error. And the error does seem to be due to & <Iteration_hits> tags not being closed.

I have opened a GitHub issue to track this bug - https://github.com/wurmlab/sequenceserver/issues/194

I am not able to get XML from the archive file because BLAST looks for the database from which archive file was generated (6691_Trinity.fasta). But the error you are seeing very well might be the reason the generated XML is malformed.

– Priyam

Hey Chris,

Would you be able to share subset of the FASTA file containing the hit(s) that cause BLAST to error out? This would be helpful to submit a bug report to BLAST.

– Priyam

Hi Priyam,

I will check with the owners of the data to see if I can give it to you.

One thing to remember is that, as with Tim’s original problem, I can run the BLASTs against each of my databases individually and everything is fine, but if I check 2 databases and run it I get the error.

I have found a single hit that I can remove from the blast results for the two databases that allows blast_formatter to run. It’s this one:

{
dim 2,
ids {
local str “Query_1”,
local str “comp125955_c0_seq1”
},
loc {
int {
from 17,
to 56,
strand unknown,
id local str “Query_1”
},
int {
from 248,
to 367,
strand plus,
id local str “comp125955_c0_seq1”
}
}

But this exact hit also appears in the single database results. So, my guess is a bug in blast_formatter.

Chris

Hello again,

I think I’ve found the cause of the problem. The FASTA files for the 2 databases both contain sequences with identifier comp125955_c0_seq1 - but they are different. The hit is to the sequence in the second database, but blast_formatter is loading the sequence from the first database. This sequence is actually too short to contain the hit (i.e. the hit runs from 248 to 367 while the sequence in the first database is only 366 bases long).

So I’m no longer sure whether that’s a bug in anything but the data. What do you think?

Thanks very much for your help on this.

Chris

Hi Chris,

Thanks for tracking this down. I’ve now been able to reproduce it. Steps:

(1) Create a seqserv db folder with the following 2 databases. The important thing is that they have the same name, and the second db alphabetically has a longer sequence length

$ head *fa
==> rand1.fa <==

random_sequence
AAAAAAAAAAAAAAAAAAAAAAAAAAAASAAACTAGTAGTAAAACGAAAACCCCAGAACGTTTCTTAGACTCACCA

==> rand2.fa <==

random_sequence
TCGATTAAAATAAAGCTTGTCATGGAGTGTCCTAGTAGTAAAACGAAAACCCCAGAACGTTTCTTAGACTCACCACCCCACCCCTGACAACTCCCGTAGG

(2) Run seqserv 1.0.4, blast 2.2.29+ or 2.2.30+

(3) tblastn with the following sequence (ie rand2 translated)
SWSVLVVKRKPQNVS*THHPTPDNSRR

=> error thrown.

I also note that querying with rand2 in tblastx also fails the same way. blastn works, but the alignment is incorrectly displayed.

random_sequence 1 / 1

Hit length: 75

1.
Score E value Identities Gaps Strand
147.36 (162) 5.10 × 10-41 16/81 (19.75) 0/81 (0.00) +/+
Query 1 TCATGGAGTGTCCTAGTAGTAAAACGAAAACCCCAGAACGTTTCTTAGACTCACCACCCC 60
| | ||| ||| ||| | || | |
Subject 20 AAAAAAAAASAAACTAGTAGTAAAACGAAAACCCCAGAACGTTTCTTAGACTCACCA--- 76

Query 61 ACCCCTGACAACTCCCGTAGG 81
``
Subject 77 --------------------- 76
2.
Score E value Identities Gaps Strand
80.63 (88) 6.22 × 10-21 44/44 (100.00) 0/44 (0.00) +/+
Query 13 CTAGTAGTAAAACGAAAACCCCAGAACGTTTCTTAGACTCACCA 56
||||||||||||||||||||||||||||||||||||||||||||
Subject 33 CTAGTAGTAAAACGAAAACCCCAGAACGTTTCTTAGACTCACCA 76

Have not tried blastp, but presumably there is a similar problem there.

I feel then that blast is at fault here in this corner case. Command line blast also has a similar problem

$ blastn -query <(echo TCGATTAAAATAAAGCTTGTCATGGAGTGTCCTAGTAGTAAAACGAAAACCCCAGAACGTTTCTTAGACTCACCACCCCACCCCTGACAACTCCCGTAGG) -db ‘rand1.fa rand2.fa’

>lcl|random_sequence
Length=76

Score = 185 bits (100), Expect = 2e-52
Identities = 24/100 (24%), Gaps = 0/100 (0%)
Strand=Plus/Plus

Query 1 TCGATTAAAATAAAGCTTGTCATGGAGTGTCCTAGTAGTAAAACGAAAACCCCAGAACGT 60
| |||| ||| | | ||| ||| ||| |
Sbjct 1 AAAAAAAAAAAAAAAAAAAAAAAAAAAASAAACTAGTAGTAAAACGAAAACCCCAGAACG 60

Query 61 TTCTTAGACTCACCACCCCACCCCTGACAACTCCCGTAGG 100
|| | |
Sbjct 61 TTTCTTAGACTCACCA------------------------ 100

Score = 82.4 bits (44), Expect = 2e-21
Identities = 44/44 (100%), Gaps = 0/44 (0%)
Strand=Plus/Plus

Query 32 CTAGTAGTAAAACGAAAACCCCAGAACGTTTCTTAGACTCACCA 75
||||||||||||||||||||||||||||||||||||||||||||
Sbjct 33 CTAGTAGTAAAACGAAAACCCCAGAACGTTTCTTAGACTCACCA 76

Unfortunately even if the XML could be parsed correctly there is no way to determine which DB the query is hitting, as far as I can tell.

$ tblastn -query <(echo 'SWSVLVVKRKPQNVS*THHPTPDNSRR') -db 'rand1.fa rand2.fa' -outfmt 5
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" ["http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd"](http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd)>
<BlastOutput>
<BlastOutput_program>tblastn</BlastOutput_program>
<BlastOutput_version>TBLASTN 2.2.29+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>rand1.fa rand2.fa</BlastOutput_db>
<BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
<BlastOutput_query-def>unnamed protein product</BlastOutput_query-def>
<BlastOutput_query-len>27</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>10</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>L;</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>
<Iteration_query-def>unnamed protein product</Iteration_query-def>
<Iteration_query-len>27</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>random_sequence</Hit_id>
<Hit_def>No definition line</Hit_def>
<Hit_accession>random_sequence</Hit_accession>
<Hit_len>76</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>54.6842</Hsp_bit-score>
<Hsp_score>130</Hsp_score>
<Hsp_evalue>1.79999e-17</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>27</Hsp_query-to>
<Hsp_hit-from>20</Hsp_hit-from>
<Hsp_hit-to>100</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>2</Hsp_hit-frame>
<Hsp_identity>2</Hsp_identity>
<Hsp_positive>2</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>27</Hsp_align-len>
<Hsp_qseq>SWSVLVVKRKPQNVS*THHPTPDNSRR</Hsp_qseq>
<Hsp_hseq>KKKZTSSKTKTPERFLDSP--------</Hsp_hseq>
<Hsp_midline> K K </Hsp_midline>
</Hsp>
<Hsp>
<Hsp_num>2</Hsp_num>
<Hsp_bit-score>28.8758</Hsp_bit-score>
<Hsp_score>63</Hsp_score>
<Hsp_evalue>7.24853e-08</Hsp_evalue>
<Hsp_query-from>5</Hsp_query-from>
<Hsp_query-to>18</Hsp_query-to>
<Hsp_hit-from>33</Hsp_hit-from>
<Hsp_hit-to>74</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>3</Hsp_hit-frame>
<Hsp_identity>14</Hsp_identity>
<Hsp_positive>14</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>14</Hsp_align-len>
<Hsp_qseq>LVVKRKPQNVS*TH</Hsp_qseq>
<Hsp_hseq>LVVKRKPQNVS*TH</Hsp_hseq>
<Hsp_midline>LVVKRKPQNVS*TH</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>2</Statistics_db-num>
<Statistics_db-len>176</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>1566</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

And I couldn’t seem any help in the blast_formater -help either (e.g. db# the hit came from). So some possible solutions:

  1. wait for ncbi to fix this issue, which presumably requires adding more info to output files, then update seqserv to suit.
  2. seqserv runs against each blast database separately, but then e-values will be off.
  3. ensure uniqueness of sequence IDs when formatting databases.

Anurag, what do you think? None of those sounds particularly appealing to me. Is this a known blast bug at all do you know?

ben

Indeed there’s no way for us to know which database a hit came from … would have made life easier. But all this while I assumed BLAST knows, just that it doesn’t expose this information to us. What we are looking at here is a major problem, imo. It’s not uncommon for predicted transcripts and corresponding proteins to have the same id. Globally unique sequence id should not be assumed for local BLAST.

Vivek and I tried the second approach while working on the 1.0 rewrite. For us the goal was to make sequence retrieval unambiguous. We tried using the -dbsize option so that evalue will be the same.(Using blastdbcmd size of each db in number of residues can be queried. Sum that for all the dbs that were selected for blast and use that with -dbsize). But evalue still changed, marginally but noticeably (e.g., 1.2e-3 would be 1.6e-3 … or the other way round). We weren’t sure if this was good so we didn’t go ahead with this approach.

As I mentioned earlier, it’s not uncommon for transcritps and their corresponding cds and proteins to have the same id (but in different FASTA files), we decided not to enforce globally unique ids either. Instead we started working on what we called ‘doctor’ (to be invoked as sequenceserver --doctor) which would just warn users of non-unique ids (globally) and a few other things (like, numeric ids) instead of enforcing it. This work remains unfinished here - https://github.com/wurmlab/sequenceserver/pull/133

So,

  1. We can advise users to ensure globally unique ids (by simply prefix something meaningful to the ids), but we cannot, should not enforce it.
  2. Till some version back BLAST didn’t even bother about duplicate ids in the same FASTA file. At least now it does. BLAST should take care of this as well.

Ben, would it be possible for you to take up this issue with NCBI? … maybe try finalising ‘sequenceserver --doctor’ as well (unlikely that I will be able to get to this anytime soon).

Thank you both, Chris and Ben, for debugging this.

– Priyam

Thanks Anurag,

I have sent an email to blast-help, will report anything back to this list.

In the process, though, I came across another (imperfect) workaround. If 
you format the databases manually without -parse_seqids (the seqserv 
default), then it seems to be ok. This breaks sequence retrieval though. 
The other workaround is simply to ensure global uniqueness of names e.g. 
by putting the name of the database as part of the sequence identifier 
(needs to be part of the first word of the identifier?).

Also, I noticed that getting XML output does actually generate an error, 
though the exitstatus is still 0, hence we aren't picking it up (true 
either going through a blast archive or directly output from the blast 
call). We should be able to detect this easily by simply ensuring that 
no stderr is output, I think.

None of this actually solves the problem though. I'm pretty busy too but 
I'll try to make time for doctoring.

On top of that, my email was actually wrong - I was unable to reproduce 
with that error. The following databases do though:
==> rand1.fa <==
 >random_sequence
CATGGAGTGTCCTAGTAGTAAAACGAAAACCCCAGAACGTTTCTTAGACTCACCACCCCACCCCTGACAACTCCCGTAGG

==> rand2.fa <==
 >random_sequence
TCGATTAAAATAAAGCTTGTCATGGAGTGTCCTAGTAGTAAAACGAAAACCCCAGAACGTTTCTTAGACTCACCACCCCACCCCTGACAACTCCCGTAGG

Thanks,
ben