Single Consolidated Output for Multiple Traits in GAPIT GWAS

I like GAPIT for the occasional GWAS I need to run.

Because it's one of the more user-friendly packages for GWAS work.
Because the degree of customization it offers within a single package is great.
And because it outputs every single thing. Some unnecessary things too, sure, but having every output on hand is better than not.

Though, if you have several traits for a set of genotypes, you know that you will get a different file for each trait. Consolidating the GWAS results for all traits into one single file can be painful if you have, say, 3 or more traits. I usually have 10+ traits (different traits from multiple locations and years/seasons), so opening each output file and carrying out copy-pasta is uncool and therefore not for me. Hence, this post! 😉

I'm not sure if you source the GAPIT functions script directly from zzlab; I don't. I save the script locally as 'gapit_source_code.txt' and source that file instead.

Once you save the text file, go to line 5205.
Alternatively, search for this line:

Memory=GAPIT.Memory(Memory=Memory,Infor="Extract GWAS start")

as line numbers may get changed in future versions of the text file.
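On a unix-like system, you can let grep find the anchor's current line number instead of trusting line 5205. A minimal sketch; the three-line mock file here merely stands in for the real, much larger source file:

```shell
# mock stand-in for the real gapit_source_code.txt (the real file is thousands of lines)
printf 'line1\nline2\nMemory=GAPIT.Memory(Memory=Memory,Infor="Extract GWAS start")\n' > gapit_demo.txt

# -n prefixes each match with its line number; run this on your saved source file
grep -n 'Extract GWAS start' gapit_demo.txt
# → 3:Memory=GAPIT.Memory(Memory=Memory,Infor="Extract GWAS start")

rm gapit_demo.txt
```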

Then, right before that line, insert this snippet:

colnames(new.PWI.Filtered) = c("SNP", "Chromosome", "Position",
    paste(name.of.trait, colnames(new.PWI.Filtered)[-(1:3)], sep = "_")) #prefix the statistic columns with the trait name so they stay distinct when merged later
assign(paste("trait_", name.of.trait, sep = ""),
    new.PWI.Filtered, envir = globalenv()) #save each trait's results to the global environment as trait_<name>

Save and close the text file, or save it under a different name, but be sure to source the altered file in your work!

Alternatively, download this file, copy the code, paste it into a text editor, and save it as a txt file. In that file, I have only added the few lines of code shown above; none of the functions are altered.

Now, in your R script — the one you are using to import genotype, phenotype, etc — insert these two lines AFTER you run the GAPIT function.

traits_list <- mget(ls(pattern = "trait_.*"))
all_traits_GWAS <- Reduce(function(...) merge(..., all = T), traits_list)

For example, this is how my script looks:

library(compiler) #this library is already installed in R

source("gapit_source_code.txt") #remember, this is the altered source code

myY <- read.table("pheno.txt", header = T)
myG <- read.table("geno.txt", header = F)

myGAPIT <- GAPIT(Y=myY, G=myG, Model.selection = F)
#myGAPIT <- GAPIT(Y=myY, G=myG, Model.selection = T)

traits_list <- mget(ls(pattern = "trait_.*")) #make a list of all traits
all_traits_GWAS <- Reduce(function(...) merge(..., all = T), traits_list) #merge all lists into a single data frame
write.table(all_traits_GWAS, "all_traits_GWAS.txt", sep = "\t", quote = F, row.names = F)

The “all_traits_GWAS.txt” file is the consolidated output, with the GWAS results of every trait in a single table, which is a much more convenient and efficient way of doing things.
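As a quick sanity check on the merged file (unix-like systems): every row should have the same number of tab-separated columns. The sketch below runs the check on a two-row mock table; point the same awk command at your all_traits_GWAS.txt instead:

```shell
# print the distinct column counts found across all rows; a clean merge yields exactly one value
printf 'SNP\tChromosome\tPosition\ttrait_a_P\ns1\t1A\t100\t0.01\n' |
awk -F'\t' '{print NF}' | sort -u
# → 4
```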

If this isn’t working for you, feel free to reach out.


LPmerge parser for unix (or linux) environment

If you are using LPmerge to construct consensus maps from several individual maps, you will quickly realize that manually checking the lengths and RMSE values for each chromosome for all K values can be tiring.

As I work on a species with 21 chromosomes (and am currently merging 25 individual maps), I thought why not automate the process of selecting the “best” map. And by best, I mean the map that gives the lowest RMSE and sd values. However, low RMSE and sd values do not always guarantee you the best map, so a manual scan is necessary every now and then, especially for chromosomes that have bizarre map lengths. For the most part though, you can put your trust in RMSE doing a good job.

After you run LPmerge, you will have a ‘.Rout’ file. Use this as your input file, and this bash script will pick, for each chromosome, the best linkage map among the K values you ran.

#!/bin/bash
#import your .Rout file and extract lines with relevant data
#to run: ./<this_script> infile outfile

awk '/Consensus/' "$1" | awk -v OFS="\t" '$1=$1' | awk -v OFS="\t" '{print $4,$5}' > con.tmp
awk '/mean/' "$1" | awk -v OFS="\t" '$1=$1' | awk -v OFS="\t" '{print $2,$3}' > avg.tmp
awk '/sd/' "$1" | awk -v OFS="\t" '$1=$1' | awk -v OFS="\t" '{print $2,$3}' > std.tmp

#get rid of the ugly quotes that R likes to add to every friggin thing
sed -i -e 's/"//g' con.tmp

#bring everything into a single file
paste con.tmp avg.tmp std.tmp > pstd.tmp

#assign k values -- note: if you are using a different k value then change the 5 below
#to whatever value of k you are using
awk -v OFS="\t" '{print "k"(NR-1) % 5 + 1, $0}' pstd.tmp > withk.tmp

#sum RMSE and sd since we essentially want the smallest of both
#ime, the smallest RMSE usually tends to pair with the smallest sd
awk -v OFS="\t" '{print $0, $5+$7}' withk.tmp > sum.tmp

#split the file into blocks of 5 as our k = 1:5
#again, if you are using k = 1:4, then change the 5 below to 4
split -l 5 sum.tmp del

#for each block, print the line with lowest RMSE+sd
for m in del*; do
 awk 'NR==1 || $8 < min {row = $0; min = $8}; END {print row}' "$m"
done > kval.tmp

#add chromosome numbers to the file
awk -v OFS="\t" '{$1 = "chrom_"(NR) FS $1;}1' kval.tmp > "$2"

rm del*
rm *.tmp
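To see the selection step in isolation, here is the same lowest-RMSE+sd logic run on one toy block of three K values (the numbers are invented; the columns mimic the k / consensus / mean / sd / sum layout built above, with the RMSE+sd total in column 8):

```shell
# three candidate rows for one chromosome; column 8 holds RMSE+sd
# the awk one-liner keeps whichever row has the smallest column-8 value
printf 'k1\tA\t10\tx\t2.1\ty\t0.9\t3.0\nk2\tA\t11\tx\t1.2\ty\t0.5\t1.7\nk3\tA\t12\tx\t1.8\ty\t0.8\t2.6\n' |
awk 'NR==1 || $8 < min {row = $0; min = $8}; END {print row}'
# → the k2 row, since its RMSE+sd (1.7) is the smallest
```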

That’s it! Feel free to reach out if this isn’t working for you.

Estimating and plotting the decay of linkage disequilibrium

A quick post that shows how you can estimate the decay of linkage disequilibrium (LD) and plot it along the genome/chromosome. We’ll be looking at two methods to do this, which, unsurprisingly, produce different estimates of LD decay. I personally prefer the second method, as it is more statistically sound.

For both methods, I am using this input file. It has two tab-delimited columns: one with distance in megabase pairs (Mbp) and the other with LD (r2) values.

First, import the file:

file <- read.table("D_genome_ld.txt", sep="\t", header=T)

1. Method 1: The “Neanderthal” Way (sorry, dear mortal Neanderthals!)


Chromosome or group specific LD heatmaps

If you work on genomics, chances are you have to create an LD heatmap figure sooner or later. I am sure you have used the R package LDheatmap or have thought of using it at some point. The only problem is that despite being a sweet (not necessarily fast) program, it does have its limitations. The biggest one is producing publication-worthy graphics.

To give you an example, look at the following LD heatmap it produced with my mock data. My mock data has 7 chromosomes with 10 markers on each chromosome.

Here is the input data in Excel format.
