SequenceServer integration with HPC system

Hello Anurag,

I would like to thank you for this software and for your previous post on integrating SequenceServer with HPC. I have a question about the implementation.

I replaced the temp locations in blast.rb and replaced the BLAST binaries with SGE wrappers, and jobs now get submitted to the cluster. However, because the job result, which is in BLAST Archive format, ends up under a different filename, SequenceServer is not able to generate the XML. Can you help us resolve this issue?

On Wednesday, 18 March 2015 12:39:36 UTC+5:30, Anurag Priyam wrote:

Thanks!

Based on user input, SequenceServer constructs a command (just as you would without SequenceServer, e.g. blastp -query foo.fa -db "bar.fa baz.fa"), which is then executed in the shell with due security considerations. Output, in BLAST Archive format (-outfmt 11), is redirected to a file. We then obtain XML output from the archive file using blast_formatter (again, output is redirected to a file). We parse the XML and generate HTML ourselves. The same archive file is used to generate the XML and tabular reports for download.

We used pipes in the very early days of SequenceServer (when we were just starting out) but soon felt that pipes were unreliable. So not anymore. Query sequences are written to a file and passed to blast using -query option instead of piping from stdin. Output is written to a file which is subsequently read instead of reading from a pipe.
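
The file-based flow described above can be sketched roughly as follows. This is an illustration only, not SequenceServer's actual code: the function name and the file names result.asn, error.log, report.xml and report.tsv are made up for the example, and the tabular columns SequenceServer really requests may differ from plain -outfmt 6.

```shell
#!/usr/bin/env sh
# Sketch of the file-based pipeline. File names are hypothetical;
# SequenceServer generates its own temporary names.

run_blast_via_files() {
    query_file=$1   # FASTA file holding the query sequences
    databases=$2    # space-separated database paths, passed as ONE argument
    workdir=$3      # directory for the intermediate files

    # 1. Run BLAST, writing the BLAST Archive (-outfmt 11) to a file
    #    and any error messages to a second file.
    blastn -query "$query_file" -db "$databases" -outfmt 11 \
        > "$workdir/result.asn" 2> "$workdir/error.log" || return 1

    # 2. Convert the archive to XML (-outfmt 5) for parsing.
    blast_formatter -archive "$workdir/result.asn" -outfmt 5 \
        > "$workdir/report.xml" || return 1

    # 3. Reuse the same archive for a tabular report (-outfmt 6).
    blast_formatter -archive "$workdir/result.asn" -outfmt 6 \
        > "$workdir/report.tsv"
}
```

It would be called as, for example, run_blast_via_files query.fa "bar.fa baz.fa" /some/workdir. Nothing is read from or written to a pipe; every step hands over to the next through a file.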

For antgenomes.org, which is hosted on a thin server but runs BLAST on a 48 core fat machine (designated node on QMUL’s HPC cluster), we simply replace BLAST+ binaries with a shim that executes BLAST on the fat machine via ssh:

#!/usr/bin/env sh

# user@host is a placeholder for the login on the fat machine
ssh user@host /path/to/blastn "$@"

The same scheme can be used to queue jobs if the queuing system allows waiting on a job id. I guess the corresponding shim would look something like:

#!/usr/bin/env sh

job_id=
qsub -N $job_id /path/to/blastn "$@"

qsub -hold_jid $job_id

(or use -sync option maybe)

If waiting on job id is not allowed in the conventional UNIX sense, it will not work because SequenceServer processes requests synchronously. That bit is due to change soon though.

I hope this helps. Please let us know if you took the above suggestions to integrate SequenceServer into an HPC system. We will be happy to help along the way.

– Priyam

Hey Aslam,

I am sorry I couldn’t respond sooner - was really caught up last week.

I don’t quite understand why the BLAST archive files have a different name from what SequenceServer expects. Could you provide some more context here? What do your shims look like? What changes did you make to SS? Is the file system shared between the machine that runs SS and the nodes that run the job?

Also, you can use the TMPDIR environment variable to change the location where SequenceServer writes temporary files (query, BLAST archive, XML, TSV, FASTA(s) to download, etc.) without having to modify blast.rb.
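
As a rough sketch of the TMPDIR suggestion, assuming SequenceServer is started with the sequenceserver command: /path/to/shared/tmp is a placeholder for a directory visible to both the web host and the compute nodes, and start_sequenceserver is a made-up helper name.

```shell
#!/usr/bin/env sh
# Assumption: /path/to/shared/tmp stands for a directory on the shared
# file system. Replace it with a real path on your cluster.
shared_tmp=/path/to/shared/tmp

start_sequenceserver() {
    # TMPDIR is passed in the environment of this one command, so
    # SequenceServer's temporary files land in the shared directory.
    TMPDIR="$shared_tmp" sequenceserver "$@"
}
```

Calling start_sequenceserver then launches the server with its temporary files on the shared file system, without touching blast.rb.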

– Priyam

Hi Priyam,

Thanks for your reply. Yes, I have created the TMPDIR environment variable and run the query. Here are the files that were generated, but it is unable to generate the XML report. Is there anything I missed in the configuration?

total 142
-rw------- 1 xxxx yyyy 0 Dec 22 05:50 sequenceserver_blast_error20151222-850-1v4obxl
-rw------- 1 xxxx yyyy 45 Dec 22 05:50 sequenceserver_blast_result20151222-850-1hrqe4e
-rw------- 1 xxxx yyyy 57 Dec 22 05:50 sequenceserver_query20151222-850-n0gd19
-rw------- 1 xxxx yyyy 0 Dec 22 05:50 sequenceserver-xml_report.xml20151222-850-o447f
[xxxx@zzzz tmp]$ cat sequenceserver_blast_error20151222-850-1v4obxl
[xxxx@zzzz tmp]$ cat sequenceserver_blast_result20151222-850-1hrqe4e
Your job 20558 ("blastn") has been submitted
[xxxx@zzzz tmp]$ cat sequenceserver-xml_report.xml20151222-850-o447f

Thanks
Syed

Hello Priyam,

Yes, the file system is shared between SS and the execution nodes.

[xxxx@yyyyy ~]$ cat /home/xxxx/blast_formatter.e20558
Illegal variable name.

Thanks
Aslam Syed

Hey Aslam,

I am sorry I have been sitting on a response to your email because I am really out of ideas. Can we take this off the list and see if you can get me temporary access to your setup and we can investigate what’s happening? (you know my email I hope?)

– Priyam

Hi both,

Don’t know if you made any progress with this but…

The problem is that redirecting with > and 2> captures only the output and errors of the qsub command itself (for example, the “Your job ... has been submitted” message), not those of the job it submits.

So this line in the ruby script blast.rb:

system("#{command} > #{rfile.path} 2> #{efile.path}")

works fine locally, and even when sending the job somewhere else through SSH, but will not work when qsub-ing. You need to specify -o and -e options in your shim and remove the redirection in the ruby script.
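
The effect can be reproduced without a cluster. In the sketch below, fake_qsub is a hypothetical stand-in for a batch submitter, not a real command: it prints a confirmation on its own stdout and writes the job's output to the file given with -o.

```shell
#!/usr/bin/env sh
# fake_qsub is a hypothetical stand-in for a batch submitter.
# Usage: fake_qsub -o PATH command [args...]
fake_qsub() {
    out_path=$2
    shift 2
    "$@" > "$out_path"   # the job's output goes where -o says
    printf 'Your job 1 ("%s") has been submitted\n' "$1"
}

workdir=$(mktemp -d)

# The caller's redirection only sees what the submitter itself prints.
fake_qsub -o "$workdir/job.out" echo "BLAST archive data" > "$workdir/result"

cat "$workdir/result"    # the confirmation line
cat "$workdir/job.out"   # the actual job output
```

The file named by the caller's > ends up holding only the confirmation line, which matches the 45-byte result file seen earlier in the thread, while the real output lands wherever -o points.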

So what I did (I don’t know Ruby very well so I appreciate this is very rudimentary!)…

I added a couple of lines to blast.rb:

File.write('/tmp/outpath', "#{rfile.path}")
File.write('/tmp/errpath', "#{efile.path}")

Obviously after the rfile and efile Tempfiles have already been declared!

Then I changed the line:

system("#{command} > #{rfile.path} 2> #{efile.path}")

to just:

system("#{command}")

Then my shim looks like this:

#!/usr/bin/env sh

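# Generate a unique job name; only the name is needed, so remove the file
jobid=$(mktemp bl.XXXX)
rm "$jobid"

# Paths written by blast.rb for this request
opath=$(cat /tmp/outpath)
epath=$(cat /tmp/errpath)

qsub -sync y -b y -pe slowpara 2 -N "$jobid" -o "$opath" -e "$epath" /usr/local/bin/blastn "$@"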

And now it’s working a treat!

Also note that you need the -b y option, because blastn is a binary and qsub expects a script by default.

I should probably also point out that I’m using Sun Grid Engine, which may be different to other grid engines (although it does use the DRMAA).

Was your solution any different? I can see a potential issue with my solution if two users submit jobs simultaneously, and the paths to one job’s output/error is overwritten by the other, but that seems extremely unlikely in my scenario.
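
One possible way around that race, sketched here purely as an assumption (it is not necessarily what SequenceServer does): have blast.rb pass the two paths in the environment of each command, which Ruby's system allows through an environment hash as its first argument, and let the shim read them from there. SS_OUT_PATH and SS_ERR_PATH are hypothetical variable names, and the -pe and -N options are left out for brevity.

```shell
#!/usr/bin/env sh
# Sketch: take the output and error paths from environment variables set
# per request, rather than from shared files in /tmp.
# SS_OUT_PATH and SS_ERR_PATH are hypothetical names.

submit_blast() {
    # Abort with a message if the caller did not provide the paths.
    : "${SS_OUT_PATH:?must be set by the caller}"
    : "${SS_ERR_PATH:?must be set by the caller}"

    qsub -sync y -b y -o "$SS_OUT_PATH" -e "$SS_ERR_PATH" \
        /usr/local/bin/blastn "$@"
}
```

Because each request runs in its own process with its own environment, two simultaneous submissions cannot overwrite each other's paths.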

Andy

I also just found another problem when selecting multiple reference genomes: the quotes (' ') around the database list were not passed through correctly to the command line, which resulted in a “too many arguments” error.

Fixed that by changing my shim to:

#!/usr/bin/env sh

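jobid=$(mktemp bl.XXXX)
rm "$jobid"

opath=$(cat /tmp/outpath)
epath=$(cat /tmp/errpath)

# Put single quotes back around the database list between -db and -query
cmdline=$(echo "$@" | sed "s/\-db\ /\-db\ \'/" | sed "s/\ \-query\ /\'\ \-query\ /")

# $cmdline is deliberately unquoted so that it splits into separate arguments
qsub -sync y -b y -pe slowpara 2 -N "$jobid" -o "$opath" -e "$epath" /usr/local/bin/blastn $cmdline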
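
To see what the two sed substitutions do, here is the same rewrite in isolation, written without the redundant backslashes and wrapped in a made-up function name, requote_db. It assumes, as the shim does, that -db comes before -query and that the database list runs right up to -query.

```shell
#!/usr/bin/env sh
# Re-insert single quotes around everything between "-db " and " -query ".
requote_db() {
    printf '%s\n' "$*" | sed "s/-db /-db '/" | sed "s/ -query /' -query /"
}

requote_db -db "/db/a.fa /db/b.fa" -query /tmp/q.fa -evalue 1e-5
# prints: -db '/db/a.fa /db/b.fa' -query /tmp/q.fa -evalue 1e-5
```

If the options arrived in a different order, the quotes would end up in the wrong place, so this is a workaround tied to the argument order rather than a general quoting fix.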

This is brilliant, Andy. Do you mind if I update the HPC section of the website (http://www.sequenceserver.com/doc/#hpc) with your follow-up?

The problem with quotes not being correctly parsed is something I face with ssh as well. I add an extra quote directly in blast.rb to cope with that.

Priyam

Yes, please do, glad to help.

It’s occurred to me you don’t need to specify the job ID, as the -sync y option works fine, but I’ve left it in so I can identify BLAST jobs from other HPC jobs.

My shim now looks like this so I don’t have to change the shim code for each BLAST type:

#!/usr/bin/env sh
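jobid=$(mktemp bl.XXXX)
rm "$jobid"

# The name this shim was invoked as (blastn, blastp, ...) selects the binary
bltype=$(basename "$0")

opath=$(cat /tmp/outpath)
epath=$(cat /tmp/errpath)

cmdline=$(echo "$@" | sed "s/\-db\ /\-db\ \'/" | sed "s/\ \-query\ /\'\ \-query\ /")

# $cmdline is deliberately unquoted so that it splits into separate arguments
qsub -sync y -b y -pe slowpara 4 -N "$jobid" -o "$opath" -e "$epath" /usr/local/bin/$bltype $cmdline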
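
The basename trick works because $0 holds the name the script was invoked under. A small self-contained demonstration, assuming the shim is installed once and symlinked under each BLAST+ program name (copies would behave the same way); the stand-in shim here only reports what it would submit:

```shell
#!/usr/bin/env sh
demo_dir=$(mktemp -d)

# A stand-in shim that only reports which program it would submit.
cat > "$demo_dir/shim.sh" <<'EOF'
#!/usr/bin/env sh
bltype=$(basename "$0")
echo "would submit /usr/local/bin/$bltype"
EOF
chmod +x "$demo_dir/shim.sh"

# One symlink per BLAST+ program, all pointing at the same shim.
for prog in blastn blastp blastx tblastn tblastx; do
    ln -s "$demo_dir/shim.sh" "$demo_dir/$prog"
done

"$demo_dir/blastp"   # prints: would submit /usr/local/bin/blastp
```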

Andy

Hi Andy,

I have included your script on the website. In doing so I made a small modification regarding how rfile and efile are passed to the script. I thought it would interest you - http://www.sequenceserver.com/doc/#hpc.

I was also wondering if you could review that section in general. Is there anything else you stumbled over while setting up SequenceServer to run BLAST on your cluster?

Priyam