LPmerge parser for unix (or linux) environment

If you are using LPmerge to construct consensus maps from several individual maps, you will quickly realize that manually checking the lengths and RMSE values for each chromosome for all K values can be tiring.

As I work on a species with 21 chromosomes (& currently happen to be working with the task of merging 25 individual maps), I thought why not automate the process of selecting the “best” map. And by best, I mean the map that gives the lowest RMSE and sd values. However, low RMSE and sd values do not always gurantee you the best map, so manual scan is necessary every now and then…. especially for chromosomes that have bizarre map lengths. For the most part though, you can put your trust in RMSE doing a good job.

After you run LPmerge, you will have a ‘.Rout’ file. Use this as your input file and this bash script will give you the best linkage map for each K value you run for each chromosome.

#import your file and extract lines with relevant data
#to run: ./script.sh infile outfile

awk '/Consensus/' "$1" | awk -v OFS="\t" '$1=$1' | awk -v OFS="\t" '{print $4,$5}' > con.tmp
awk '/mean/' "$1" | awk -v OFS="\t" '$1=$1' | awk -v OFS="\t" '{print $2,$3}' > avg.tmp
awk '/sd/' "$1" | awk -v OFS="\t" '$1=$1' | awk -v OFS="\t" '{print $2,$3}' > std.tmp

#get rid of the ugly quotes that R likes to add to every friggin thing
sed -i -e 's/"//g' con.tmp

#bring everything into a single file
paste con.tmp avg.tmp std.tmp > pstd.tmp

#assign k values -- note: if you are using a different k value then change the 5 below
#to whatever value of k you are using
awk -v OFS="\t" '{print "k"(NR-1) % 5 + 1, $0}' pstd.tmp > withk.tmp

#sum RMSE and sd since we essentially want the smallest of both
#ime, the smallest RMSE usually tends to pair with the smallest sd
awk -v OFS="\t" 'NR==1{print $0, $5+$7; next} {print $0,$5+$7}' withk.tmp > sum.tmp

#split the file into blocks of 5 as our k = 1:5
#again, if you are using k = 1:4, then change the 5 below to 4
split -l 5 sum.tmp del

#for each block, print the line with lowest RMSE+sd
for m in del*; do
 awk 'NR==1 || $8 < min {row= $0; min = $8}; END {print row}' $m 
done > kval.tmp

#add chromosome numbers to the file
awk -v OFS="\t" '{$1 = "chrom_"(NR) FS $1;}1' kval.tmp > "$2"

#cleanup
rm del*
rm *.tmp

That’s it! Feel free to reach out if this isn’t working for you.

Advertisements

याँ मुन्तिर कमेन्ट ठोक्नुस्

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s