Auto Select database(s) based on sequence type

Hi,

I would like to know whether there is any way to automatically check the database(s) based on the sequence type detected from the sequence pasted into the sequence box.

For example, if I paste a sequence containing ATGC, I want all the databases in the nucleotide section to be checked automatically (highlighted in yellow). Is this possible in SequenceServer?

Thank you!!

Hi,

There is no automatic way to do that. You will have to implement it yourself in JavaScript. I believe there is an event that you can listen to for sequence type changes and act accordingly.

Priyam

Thanks Priyam for the response.

Can you please tell me how I can fix the error below? It occurs when I try to download a sequence hit. I know the issue occurs because nothing is found in the database under the id local to SequenceServer, but I don’t know how to make SequenceServer use the id that is used in the database instead.

# ERROR: incorrect number of sequences found.
# You requested 1 sequence(s) with the following identifiers:
# 845014
# from the following databases:
# OrthoDNA_Sequences.db
# but we found 0 sequence(s).

Did you format the database yourself? If so, this can happen if you missed the -parse_seqids option. If not, what is your BLAST version? If SequenceServer downloaded BLAST, it is likely to be 2.2.30. That and many subsequent versions of BLAST had problems with numeric sequence ids. In that case, try at least 2.9.0, or even 2.10.0. Please let us know if that helped.
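
For anyone reformatting a database by hand, the two BLAST+ commands involved can be sketched as below. This is only an illustration, not SequenceServer's own code: the helper names are made up, the path is the database name from the error above (assumed here to also be the FASTA file), and the commands are built and printed rather than run.

```ruby
require 'shellwords'

# Hypothetical helper: build a makeblastdb call that keeps the FASTA
# identifiers so sequences can later be retrieved by id.
def makeblastdb_command(fasta, dbtype)
  ['makeblastdb', '-in', fasta, '-dbtype', dbtype, '-parse_seqids']
end

# Hypothetical helper: build a blastdbcmd call that looks up one id,
# which is a quick way to check that the ids were indexed.
def blastdbcmd_command(db, id)
  ['blastdbcmd', '-db', db, '-entry', id]
end

# Path and id are taken from the error message; adjust to your setup.
puts makeblastdb_command('OrthoDNA_Sequences.db', 'nucl').shelljoin
puts blastdbcmd_command('OrthoDNA_Sequences.db', '845014').shelljoin
# To actually run a command: system(*makeblastdb_command(path, 'nucl'))
```

If the second command returns the sequence on the command line, the download link in SequenceServer should be able to find it too.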

Priyam

Hi Priyam,

Thank you for the response.
Happy New Year! I had to change the fasta file header but your suggestion to use -parse_seqids worked. I am encountering another problem related to UTF-8 with some of the protein sequences where view or download sequence capability is not working. My hunch is that it’s happening only to those sequences which are ending with asterisk (*), a special character.

Below is the error I am receiving for the sequences ending with *. Can you please suggest how we can fix this without removing the *?

Hi Varnika,

Happy New Year!
You have two options:

  1. Upgrade to 2.0.0-rc8 where the issue is fixed.
  2. In /var/lib/gems/2.7.0/gems/sequenceserver-1.0.14/lib/sequenceserver/sequence.rb, look for the line Sequence.new(*line.chomp.split(' ')) (it should be at or around line 197, based on the error message) and add the following just above it:

line = line.encode('UTF-8', invalid: :replace, replace: 'X')

This will replace * characters with X in the output.
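
To see this kind of replacement in isolation, here is a minimal sketch outside SequenceServer. It assumes the problem characters arrive as bytes that are not valid UTF-8. The sample string and the helper name are made up, and it uses String#scrub, a standard Ruby method (2.1 and later) that performs the same sort of substitution, rather than the exact encode call above.

```ruby
# Hypothetical helper: replace every byte sequence that is invalid in the
# string's encoding with a placeholder residue.
def replace_invalid_bytes(line, placeholder = 'X')
  line.scrub(placeholder)
end

# Made-up sample: a short protein sequence followed by a byte that is not
# valid UTF-8.
raw = "MKVLA\xFF".dup.force_encoding('UTF-8')
clean = replace_invalid_bytes(raw)

puts raw.valid_encoding?    # false
puts clean                  # MKVLAX
puts clean.valid_encoding?  # true
```

Strings that are already valid UTF-8 pass through unchanged, so applying this to every line is harmless.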

Priyam

Thank you Priyam for all your help - upgrading sequence server to 2.0.0 did fix the download issue for protein sequences having special characters in there.

Thanks,
Varnika

Cheers!

Priyam

Dear Anurag and the Sequenceserver team,

Would replacing the asterisk with X not make the reported length of the protein too big by 1 amino acid (for downstream software)? Or is it done in a different way in the rc8 version?

Best regards,
Lukasz

Dear Lukasz,

2.0-rc8 takes the approach of replacing asterisk with X. We took this approach after much deliberation: https://github.com/wurmlab/sequenceserver/issues/188.

If you would rather sequenceserver additionally show a warning when * has been replaced by X, that may be possible, although I will only be able to look into it after a few weeks. If you have other ideas, I am all ears.

Priyam

Hello Anurag,

I see it now - this is about situations when * appears in the middle of the sequence. I agree that replacing it with X then is reasonable. But if the * is at the end of the sequence it seems to me that removing it would be better (only if it’s at the end, as the STOP codon).

The edge case is that there is a selenocysteine as the last residue but this would be unresolvable anyway. Or if there are two * characters at the end, I would replace the penultimate one with X and remove the last one.
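
As a rough sketch of this rule (the helper name is hypothetical, and this is not how SequenceServer currently behaves): remove a single trailing '*' and replace every remaining '*' with 'X'.

```ruby
# Hypothetical helper sketching the proposal: drop one trailing '*' (read as
# the STOP codon), then turn every remaining '*' into 'X'.
def normalize_stops(seq)
  seq.sub(/\*\z/, '').tr('*', 'X')
end

puts normalize_stops('MKV*')   # MKV
puts normalize_stops('MK*V')   # MKXV
puts normalize_stops('MKV**')  # MKVX
```

With this rule the length of 'MKV*' is reported as 3 rather than 4, while an internal '*' still occupies one position.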

Cheers,
Łukasz

Hi Lukasz,

I think it would be possible to remove ‘*’ if it is at the end of the sequence. I will discuss with Yannick if I should make this change. I didn’t fully understand your concern about selenocysteine - do you think that if the last residue is ‘U’ (i.e., selenocysteine), that should be removed too?

Priyam

Hi Anurag,

I considered these cases:

  1. there is one ‘X’ at the end of the sequence and it was put there by BLAST because it was a STOP codon marked with ‘*’;

  2. there is one ‘X’ at the end of the sequence and it was put there by BLAST because it was a non-standard residue;

  3. there are two ‘X’s at the end of the sequence because it ended with “U*” or “X*”, for example.

Case 1 is probably the most likely. Case 2 seems quite unlikely and case 3 is very unlikely, but possible. It seems to me that the length would be reported correctly for the largest fraction of results if an X is removed when it is the last character of the sequence, and only the last X is removed if the sequence ends with XX.
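
The three cases can be checked with a small sketch. It rests on the assumption discussed in this thread, not verified here, that the reported sequence shows 'X' in place of both '*' and a non-standard residue such as 'U'. All helper names and sample sequences are made up.

```ruby
# Assumption from the thread: '*' and 'U' both appear as 'X' in the output.
def as_reported(seq)
  seq.tr('*U', 'XX')
end

# Proposed rule: remove a single 'X' when it is the last character.
def drop_last_x(seq)
  seq.sub(/X\z/, '')
end

# True residue count of the database sequence: everything except a final '*'.
def residue_count(seq)
  seq.sub(/\*\z/, '').length
end

examples = { 'case 1' => 'MKV*', 'case 2' => 'MKVU', 'case 3' => 'MKVU*' }
examples.each do |label, original|
  trimmed = drop_last_x(as_reported(original))
  correct = trimmed.length == residue_count(original)
  puts "#{label}: #{original} -> #{trimmed} (length correct: #{correct})"
end
```

Under that assumption the rule gives the right length in cases 1 and 3 and is short by one residue in case 2, which matches the ranking of likelihoods above.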

Best regards,
Lukasz

Hi Lukasz,
thanks for highlighting this. In discussing potential solutions, I’m concerned about inconsistencies that we might create between the HTML output and, e.g., a table format output.

I guess with protein-protein BLAST SequenceServer shouldn’t have to do anything, because the easy solution is at the BLAST database level.
But the issue does potentially exist with translated BLAST.
Or am I reading this wrong?
Thanks,
Yannick

Dear Yannick,

Thank you for replying - yes, it’s true that this could cause potential incompatibility in reports and yes, it would be easier to fix at a database level. I was more focused on my usual use case (blastp).

I may be misunderstanding this, but from what Anurag wrote/linked I thought that BLAST would convert all the non-standard amino acid letters (e.g. U, B, J) and non-standard characters (*) into Xs when producing the output file you use in sequenceserver to visualize the results (an .xml, I suppose). If this happens, and the X is at the end of a sequence, it would probably have come from a ‘*’ character at the end of the sequence in the protein database. So the reported full subject length would be overestimated by 1. Please let me know if this is all wrong!

I also think the issue wouldn’t exist for blastx because BLAST would be internally consistent in how it translates the nucleotides. And for tblastn it would be the same as for blastp, no? Because the ‘subjects’ are translated, not the query?

Cheers,
Lukasz