UCSC pairwise alignment (axtNet format) is now supported (new in pygr 0.7beta3, will be released soon). User can add whatever pairwise alignment they want from UCSC axtNet files. In this example, we are using hg18_self (
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsSelf/axtNet/), hg18_panTro2 (
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/axtNet/), hg18_mm8 (
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsMm8/axtNet/), hg18_rn4 (
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsRn4/axtNet/), hg18_canFam2 (
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsCanFam2/axtNet/) for widely used genome assemblies.
First, you have to create seqdb.BlastDB for all of the genome assemblies. Sequence name should be same as assembly name. I.e, save hg18 assembly as "hg18" and make seqdb.BlastDB.
Then, download all axtNet alignments from UCSC website (hg18_canFam2 hg18_mm8 hg18_panTro2 hg18_rn4 hg18_self) and uncompress gzip archives. Here is the simple example how to make pygr.NLMSA for hg18.
#!/usr/bin/env python
import sys, os, string, glob
from pygr import seqdb
from pygr import cnestedlist
args = sys.argv
if len(args) != 2:
print 'Usage:', args[0], 'dummy'
sys.exit()
genomes = {}
seqlist = ['hg18', 'panTro2', 'mm8', 'rn4', 'canFam2']
for orgstr in seqlist:
genomes[orgstr] = seqdb.BlastDB(orgstr)
genomeUnion = seqdb.PrefixUnionDict(genomes)
msaname = 'hg18_pairwise5way'
mafdir = '/data/root2/' + msaname + '/'
axtlist = glob.glob(os.path.join(axtdir, '*/*.axt')
axtlist.sort()
msa = cnestedlist.NLMSA(msaname, 'w', genomeUnion, axtFiles = axtlist) # axtFiles for axtNet (cf. mafFiles for MAF), FOR AXTNET FORMAT, YOU *MUST* GIVE *axtFiles =*
msa.save_seq_dict() # IF YOU ARE NOT USING pygr.Data, YOU *MUST* SAVE SEQDICT.
Depending on your system, it would take several hours to finish. By default, it will use 1GB memory. If you have less memory, you have to decrease file size. You can pass arguments as follows.
cnestedlist.NLMSA(msaname, 'w', genomeUnion, axtFiles = axtlist, maxlen = 536870912, maxint = 22369620) # 500MB Version. It will use less than 750MB Memory at most.
If you are planning to save NLMSA into pygr.Data and never open directly from file, you don't have to give additional options. But, if you are planning to open NLMSA directly from file, seqDict should be saved into file and you have to do msa.save_seq_dict().
One big important differenct between MAF NLMSA and axtNet NLMSA is... if we are using hg18 referenced alignments, in MAF NLMSA, if you query using hg18, you can get all alignments of other species (same as axtNet NLMSA). But, if you query using panTro2, you can only get hg18 alignments in axtNet (MAF NLMSA gives all other hits from all other species). It is because axtNet NLMSA alignment is always stored as "pairwiseMode".