A quick post that shows how you can estimate the decay of linkage disequilibrium (LD) and plot it along the genome/chromosome. We’ll be looking at two methods to do this, which unsurprisingly, produce different estimates of LD decay. I personally prefer the 2nd method as it is more scientific.
For both methods, I am using this input file. It basically has two tab delimited columns – one with distance in megabase pairs (Mbp) and the other with LD (r2) values.
First, import the file:
file <- read.table("D_genome_ld.txt", sep="\t", header=T)
1. Method 1: The “Neanderthal” Way (sorry, dear mortal Neanderthals!)
Basically, this is a quick and dirty method of determining the LD decay by using a loess curve function. This approach reports the decay distance at r2 = 0.2. Now, why 0.2? This value is generally considered as one conventional way of estimating where the ld decay begins. I have read papers where they also use 0.1 but that is a bit too low, if you ask me. You can do a google scholar search to find out why some people prefer 0.2.
Anyway, here’s the code:
loes <- loess(file$rsq ~ file$dist, span=0.9, degree=1, control = loess.control(surface = "direct")) pre <- predict(loes) frame <- data.frame(file$dist, pre) xval <- frame[which.min(abs(0.2 - frame$pre)),] #find x value where y=0.2 plot(file$dist, file$rsq, pch=19, cex=0.9, xlab="distance (Mbp)", ylab="LD (r^2)") lines(frame, col="blue", lwd=2) abline(h=0.2, col="red") abline(v=xval[1], col="green") mtext(round(xval[1,1],2), side=1, line=0.05, at=xval[1,1], cex=0.9, col="green")
And the plot we get is:
That half decay position of 11.6 Mbp is pretty high! This data comes from the D sub-genome of wheat, which should have pretty low LD. You could play with the loess function to obtain a smoother line but in my opinion it is not worth it. Hence, we need a better, proven method of accurately calculating the decay distance.
2. Method 2: The Hill and Weir (1988) method
Full disclosure: this code was first shared by Fabio in his blog. I have modified his code a bit to import input data from a file, label the plots differently, and finally, plot it in ggplot.
Hill and Weir published this method first (following equation):
It was later simplified by Remington et al., (1991):
You can read those papers (citations below) if you’re interested in their derivations and details.
Now the code that implements this formula:
# this data was obtained from genotyping of 250 individuals, so n = no. of gametes = no. of sampled chromosomes = 2 x 250 n = 500 ## in my experience, changing the value of C does not do a whole lot of damage to your calculations ## you can try different values of Cs to see for yourself ## but reported C values range from about 0.5 to 2 Cstart <- c(C=0.1) # let's fit a non linear model using the arbitrary C value modelC <- nls(rsq ~ ((10+C*dist)/((2+C*dist)*(11+C*dist)))*(1+((3+C*dist)*(12+12*C*dist+(C*dist)^2))/(n*(2+C*dist)*(11+C*dist))), data=file, start=Cstart, control=nls.control(maxiter=100)) # extract rho, the recombination parameter, 4Nr rho <- summary(modelC)$parameters[1] # feed in the new value of rho to obtain LD values adjusted for their distances along the chromosome/genome newrsq <- ((10+rho*file$dist)/((2+rho*file$dist)*(11+rho*file$dist)))*(1+((3+rho*file$dist)*(12+12*rho*file$dist+(rho*file$dist)^2))/(n*(2+rho*file$dist)*(11+rho*file$dist))) newfile <- data.frame(file$dist, newrsq) #maxld <- max(file$rsq) #using max LD value from initial input file maxld <- max(newfile$newrsq) #using max LD value from adjusted data halfdecay = maxld*0.5 halfdecaydist <- newfile$file.dist[which.min(abs(newfile$newrsq-halfdecay))] newfile <- newfile[order(newfile$file.dist),] # create the plot plot(file$dist, file$rsq, pch=".", cex=2, xlab="distance (Mbp)", ylab="LD (r^2)") lines(newfile$file.dist, newfile$newrsq, col="blue", lwd=2) abline(h=0.2, col="red") abline(v=halfdecaydist, col="green") mtext(round(halfdecaydist,2), side=1, line=0.05, at=halfdecaydist, cex=0.75, col="green")
This produces the plot below:
We see that the LD decay value is much lower now. This looks about right to me.
Then I was interested in seeing where the r2=0.2 intersected in this method.
Not that far from the curve actually! So if someone wants to use the value of 0.2 as determining the decay distance, it should be okay. Not pinpoint accuracy but nothing devastating either.
To get the point of intersection when r2=0.2 (or any other value)
f1 <- data.frame(newfile$file.dist, newfile$newrsq) xval <- f1[which.min(abs(0.2 - f1$newfile.newrsq)),] #find x value where y=0.2 xval[1,1]
Finally, let’s draw the plot in ggplot2:
library(ggplot2) #load the library ggplot(file, aes(dist, rsq)) + geom_point() + geom_smooth(formula = newfile$file.dist ~ newfile$newrsq, se=F)
The plot:
You can make it look prettier by playing with the aesthetic settings.
And there you have it. Different approaches to plot your LD decay in R, including ggplot2. Thanks to Fabio again for sharing the code many years ago.
References:
- Hill WG and BS Weir (1988) Variances and covariances of squared linkage disequilibria in finite populations. Theor Popul Biol 33:54–78. doi:10.1016/0040-5809(88)90004-4
- Remington DL, Thornsberry JM, Matsuoka Y, Wilson LM, Whitt SR, Doebley J, Kresovich S, Goodman MM, Buckler ES (2001) Structure of linkage disequilibrium and phenotypic associations in the maize genome. Proc Natl Acad Sci USA 98:11479–11484
Meki, use:
f1 <- data.frame(newfile$file.dist, newfile$newrsq)
xval <- f1[which.min(abs(0.2 – f1$newfile.newrsq)),] #find x value where y=0.2
Updated the post too. And, no citation needed! 😀 Cheers!
LikeLike
hi jujumaan,
for method2 (The Hill and Weir (1988) method), if I want to use the value of 0.1 as determining the decay distance, how can I determine the LD decay value?
Thanks
Meki
LikeLike
Thanks!
what about when i have thousands of SNP markers? can i use all the r2 values between any two markers pairs or should I use a subset like r2 only within 100kbp?
please also tell me how to site your script
LikeLike
It should, as this method starts with pre-calculated LD values. So as long as you have r2 (or any other measure of LD) and the genomic distance, this will work. If you have higher ploidy data, LD estimation may require special programs/packages though.
LikeLike
does this work any ploidy SNP data?
LikeLike