A quick post that shows how you can estimate the decay of linkage disequilibrium (LD) and plot it along the genome/chromosome. We’ll be looking at two methods to do this, which unsurprisingly, produce different estimates of LD decay. I personally prefer the 2nd method as it is more scientific.
For both methods, I am using this input file. It basically has two tab delimited columns – one with distance in megabase pairs (Mbp) and the other with LD (r2) values.
First, import the file:
file <- read.table("D_genome_ld.txt", sep="\t", header=T)
1. Method 1: The “Neanderthal” Way (sorry, dear mortal Neanderthals!)
Basically, this is a quick and dirty method of determining the LD decay by using a loess curve function. This approach reports the decay distance at r2 = 0.2. Now, why 0.2? This value is generally considered as one conventional way of estimating where the ld decay begins. I have read papers where they also use 0.1 but that is a bit too low, if you ask me. You can do a google scholar search to find out why some people prefer 0.2.
Anyway, here’s the code:
loes <- loess(file$rsq ~ file$dist, span=0.9, degree=1, control = loess.control(surface = "direct")) pre <- predict(loes) frame <- data.frame(file$dist, pre) xval <- frame[which.min(abs(0.2 - frame$pre)),] #find x value where y=0.2 plot(file$dist, file$rsq, pch=19, cex=0.9, xlab="distance (Mbp)", ylab="LD (r^2)") lines(frame, col="blue", lwd=2) abline(h=0.2, col="red") abline(v=xval, col="green") mtext(round(xval[1,1],2), side=1, line=0.05, at=xval[1,1], cex=0.9, col="green")
And the plot we get is:
That half decay position of 11.6 Mbp is pretty high! This data comes from the D sub-genome of wheat, which should have pretty low LD. You could play with the loess function to obtain a smoother line but in my opinion it is not worth it. Hence, we need a better, proven method of accurately calculating the decay distance.
2. Method 2: The Hill and Weir (1988) method
Full disclosure: this code was first shared by Fabio in his blog. I have modified his code a bit to import input data from a file, label the plots differently, and finally, plot it in ggplot.
Hill and Weir published this method first (following equation):
It was later simplified by Remington et al., (1991):
You can read those papers (citations below) if you’re interested in their derivations and details.
Now the code that implements this formula:
# this data was obtained from genotyping of 250 individuals, so n = no. of gametes = no. of sampled chromosomes = 2 x 250 n = 500 ## in my experience, changing the value of C does not do a whole lot of damage to your calculations ## you can try different values of Cs to see for yourself ## but reported C values range from about 0.5 to 2 Cstart <- c(C=0.1) # let's fit a non linear model using the arbitrary C value modelC <- nls(rsq ~ ((10+C*dist)/((2+C*dist)*(11+C*dist)))*(1+((3+C*dist)*(12+12*C*dist+(C*dist)^2))/(n*(2+C*dist)*(11+C*dist))), data=file, start=Cstart, control=nls.control(maxiter=100)) # extract rho, the recombination parameter, 4Nr rho <- summary(modelC)$parameters # feed in the new value of rho to obtain LD values adjusted for their distances along the chromosome/genome newrsq <- ((10+rho*file$dist)/((2+rho*file$dist)*(11+rho*file$dist)))*(1+((3+rho*file$dist)*(12+12*rho*file$dist+(rho*file$dist)^2))/(n*(2+rho*file$dist)*(11+rho*file$dist))) newfile <- data.frame(file$dist, newrsq) #maxld <- max(file$rsq) #using max LD value from initial input file maxld <- max(newfile$newrsq) #using max LD value from adjusted data halfdecay = maxld*0.5 halfdecaydist <- newfile$file.dist[which.min(abs(newfile$newrsq-halfdecay))] newfile <- newfile[order(newfile$file.dist),] # create the plot plot(file$dist, file$rsq, pch=".", cex=2, xlab="distance (Mbp)", ylab="LD (r^2)") lines(newfile$file.dist, newfile$newrsq, col="blue", lwd=2) abline(h=0.2, col="red") abline(v=halfdecaydist, col="green") mtext(round(halfdecaydist,2), side=1, line=0.05, at=halfdecaydist, cex=0.75, col="green")
This produces the plot below:
We see that the LD decay value is much lower now. This looks about right to me.
Then I was interested in seeing where the r2=0.2 intersected in this method.
Not that far from the curve actually! So if someone wants to use the value of 0.2 as determining the decay distance, it should be okay. Not pinpoint accuracy but nothing devastating either.
Finally, let’s draw the plot in ggplot2:
library(ggplot2) #load the library ggplot(file, aes(dist, rsq)) + geom_point() + geom_smooth(formula = newfile$file.dist ~ newfile$newrsq, se=F)
You can make it look prettier by playing with the aesthetic settings.
And there you have it. Different approaches to plot your LD decay in R, including ggplot2. Thanks to Fabio again for sharing the code many years ago.
- Hill WG and BS Weir (1988) Variances and covariances of squared linkage disequilibria in finite populations. Theor Popul Biol 33:54–78. doi:10.1016/0040-5809(88)90004-4
- Remington DL, Thornsberry JM, Matsuoka Y, Wilson LM, Whitt SR, Doebley J, Kresovich S, Goodman MM, Buckler ES (2001) Structure of linkage disequilibrium and phenotypic associations in the maize genome. Proc Natl Acad Sci USA 98:11479–11484