We have identified a region that is a very good candidate for harbouring the sex-determination locus on LG6.
To further validate our findings, let’s now conduct an association test for each SNP from the adult sexed individuals dataset. We will then plot the statistical significance of association for each SNP against their linkage map position to see whether we detect the same region as we did using the sex-linked markers.
We will use the package SNPassoc.
library(SNPassoc)
library(xtable)
library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.0.5
library(janitor)
library(dartR)
library(rrBLUP)
library(ggpubr)
First of all, we are going to assess if there is any population structure within our adult sexed individuals. They come from 4 different creeks so that is likely the case, and if so we will have to account for it in the GWAS.
We will read in the vcf tools of the adult individuals produced with the population module (see: “13. Identify Sex Linked Markers” for details).
my.gl <- dartR::gl.read.vcf(vcffile = "../3_SexLinked/13_1_populations_sex_DM_R40.vcf")
## Starting ::
## Starting dartR
## Starting gl.read.vcf
## Starting gl.compliance.check
## Processing genlight object with SNP data
## Checking coding of SNPs
## SNP data scored NA, 0, 1 or 2 confirmed
## Checking locus metrics and flags
## Recalculating locus metrics
## Checking for monomorphic loci
## No monomorphic loci detected
## Checking for loci with all missing data
## No loci with all missing data detected
## Checking whether individual names are unique.
## Checking for individual metrics
## Warning: Creating a slot for individual metrics
## Checking for population assignments
## Population assignments not detected, individuals assigned to a single population labelled 'pop1'
## Spelling of coordinates checked and changed if necessary to lat/lon
## Completed: gl.compliance.check
## Starting gl.recalc.metrics
## Processing genlight object with SNP data
## Starting utils.recalc.avgpic
## Processing genlight object with SNP data
## Recalculating OneRatioRef, OneRatioSnp, PICRef, PICSnp, AvgPIC
## Completed: utils.recalc.avgpic
## Starting utils.recalc.callrate
## Processing genlight object with SNP data
## Recalculating locus metric CallRate
## Completed: utils.recalc.callrate
## Starting utils.recalc.maf
## Processing genlight object with SNP data
## Recalculating FreqHoms and FreqHets
## Starting utils.recalc.freqhets
## Processing genlight object with SNP data
## Recalculating locus metric freqHets
## Completed: utils.recalc.freqhets
## Starting utils.recalc.freqhomref
## Processing genlight object with SNP data
## Recalculating locus metric freqHomRef
## Completed: utils.recalc.freqhomref
## Starting utils.recalc.freqhomsnp
## Processing genlight object with SNP data
## Recalculating locus metric freqHomSnp
## Completed: utils.recalc.freqhomsnp
## Recalculating Minor Allele Frequency (MAF)
## Completed: utils.recalc.maf
## Locus metrics recalculated
## Completed: gl.recalc.metrics
Let’s check call rate by individual and by locus
#individual call rate
gl.report.callrate(my.gl,method = "ind")
## Starting gl.report.callrate
## Processing genlight object with SNP data
## Reporting Call Rate by Individual
## No. of loci = 14069
## No. of individuals = 38
## Minimum : 0.6316725
## 1st quartile : 0.7095565
## Median : 0.7674319
## Mean : 0.7552663
## 3r quartile : 0.7982444
## Maximum : 0.8509489
## Missing Rate Overall: 0.24
## Quantile Threshold Retained Percent Filtered Percent
## 1 100% 0.8509489 1 2.6 37 97.4
## 2 95% 0.8298529 2 5.3 36 94.7
## 3 90% 0.8228232 4 10.5 34 89.5
## 4 85% 0.8212133 6 15.8 32 84.2
## 5 80% 0.8097093 8 21.1 30 78.9
## 6 75% 0.7982444 10 26.3 28 73.7
## 7 70% 0.7851944 12 31.6 26 68.4
## 8 65% 0.7808835 13 34.2 25 65.8
## 9 60% 0.7771981 15 39.5 23 60.5
## 10 55% 0.7705843 17 44.7 21 55.3
## 11 50% 0.7674319 19 50.0 19 50.0
## 12 45% 0.7523278 21 55.3 17 44.7
## 13 40% 0.7403085 23 60.5 15 39.5
## 14 35% 0.7329021 25 65.8 13 34.2
## 15 30% 0.7236548 26 68.4 12 31.6
## 16 25% 0.7095565 28 73.7 10 26.3
## 17 20% 0.7027649 30 78.9 8 21.1
## 18 15% 0.6946691 32 84.2 6 15.8
## 19 10% 0.6845831 34 89.5 4 10.5
## 20 5% 0.6694648 36 94.7 2 5.3
## 21 0% 0.6316725 38 100.0 0 0.0
## Completed: gl.report.callrate
Individuals have mostly low levels of missing data with no obvious outliers, so all good there.
#filter snp call rate
gl.report.callrate(my.gl)
## Starting gl.report.callrate
## Processing genlight object with SNP data
## Reporting Call Rate by Locus
## No. of loci = 14069
## No. of individuals = 38
## Minimum : 0.421053
## 1st quartile : 0.578947
## Median : 0.789474
## Mean : 0.7552663
## 3r quartile : 0.947368
## Maximum : 1
## Missing Rate Overall: 0.24
## Quantile Threshold Retained Percent Filtered Percent
## 1 100% 1.000000 1553 11.0 12516 89.0
## 2 95% 1.000000 1553 11.0 12516 89.0
## 3 90% 1.000000 1553 11.0 12516 89.0
## 4 85% 0.973684 2698 19.2 11371 80.8
## 5 80% 0.947368 3687 26.2 10382 73.8
## 6 75% 0.947368 3687 26.2 10382 73.8
## 7 70% 0.921053 4479 31.8 9590 68.2
## 8 65% 0.894737 5179 36.8 8890 63.2
## 9 60% 0.868421 5674 40.3 8395 59.7
## 10 55% 0.815789 6606 47.0 7463 53.0
## 11 50% 0.789474 7142 50.8 6927 49.2
## 12 45% 0.736842 8074 57.4 5995 42.6
## 13 40% 0.710526 8527 60.6 5542 39.4
## 14 35% 0.657895 9433 67.0 4636 33.0
## 15 30% 0.631579 9892 70.3 4177 29.7
## 16 25% 0.578947 10803 76.8 3266 23.2
## 17 20% 0.552632 11259 80.0 2810 20.0
## 18 15% 0.500000 12374 88.0 1695 12.0
## 19 10% 0.473684 12942 92.0 1127 8.0
## 20 5% 0.447368 13469 95.7 600 4.3
## 21 0% 0.421053 14069 100.0 0 0.0
## Completed: gl.report.callrate
In terms of call rate by SNP we have a few SNPs with low (< 0.7 Call Rate). That was alright for heterozygosity method that were looking at sex-specific loci, but for GWAS we might be a bit more stringent. Let’s filter then to retain only markers present in at least 80% of individuals.
my.gl <- gl.filter.callrate(my.gl, threshold = 0.8)
## Starting gl.filter.callrate
## Processing genlight object with SNP data
## Recalculating Call Rate
## Removing loci based on Call Rate, threshold = 0.8
## Completed: gl.filter.callrate
Now let’s look at relatedness and population structure.
my.grm <- dartR::gl.grm(my.gl)
## Starting ::
## Starting dartR
## Starting gl.grm
## Processing genlight object with SNP data
## Registered S3 method overwritten by 'gplots':
## method from
## reorder.factor gdata
## Completed: ::
## Completed: dartR
## Completed: gl.grm
gl.grm.network(my.grm,my.gl,method = 'kk',node.label.size = 2,node.size = 4)
## Starting gl.grm.network
## Processing genlight object with SNP data
## Completed: gl.grm.network
From this it already looks clear that we have structure of YC and MRu against all other samples, and the YC and MRu samples are also highly related to each other.
Let’s look where they fall on a PCOA.
my.pcoa <- gl.pcoa(my.gl)
## Starting gl.pcoa
## Processing genlight object with SNP data
## Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state
## Completed: gl.pcoa
gl.pcoa.plot(glPca = my.pcoa, x = my.gl,interactive = FALSE)
## Starting gl.pcoa.plot
## Processing an ordination file (glPca)
## Processing genlight object with SNP data
## Plotting populations in a space defined by the SNPs
## Preparing plot .... please wait
## Completed: gl.pcoa.plot
There are some individuals that are clustering against the rest of the samples. Let’s check which ones these are (likely the YC and MRu samples).
as.data.frame(my.pcoa$scores) %>% filter(PC1 < -20)
## PC1 PC2 PC3 PC4 PC5
## LS_DM_MRu_1204_F1 -28.49856 -8.582607 15.4940058 8.4292523 -16.19871527
## LS_DM_MRu_1210_M1 -28.42007 -9.911423 12.6980512 -7.4089470 18.45627751
## LS_DM_YC_1233_M1 -29.52757 -10.156051 -15.5945210 16.7996139 5.85532878
## LS_DM_YC_1236_F1 -27.70711 -8.976040 -11.9079795 -18.7166884 -9.53751517
## LS_DM_YC_1237_F1 -27.93108 19.270559 -0.5749979 0.8138258 0.05768402
## LS_DM_YC_1240_M1 -29.09871 18.428612 0.3256411 -0.7787646 0.89298018
As epected the YC and MRu individuals cluster on their own. I have ran the next step both with and without these individuals and the results persists. Here to account for pop structure I will choose the easier path and remove those 6 individuals.
Now let’s run a GWAS for sex using the SNPassoc package.
First of all, we will format the data to the format required by SNPassoc, which is the ped format.
Clear environment
rm(list = ls())
Read genotype data for the adult sexed individuals. This was produced before (see: “13. Identify Sex Linked Markers” for details).
my.df <- readRDS(file = "../3_SexLinked/13_2_InputDataframe_sex_R40.RData")
Remove individuals from the YC and MRu populations. We can do this using the Pop identifier in their name.
my.df <- select(my.df, -contains("_YC_"))
my.df <- select(my.df, -contains("_MRu_"))
Then we remove "_a" from the end of sample IDs, as we don’t need the replicate info anymore and it makes life harder. Remember, we have already retained one replicate per sample.
names(my.df) <- gsub(pattern = "_a",
replacement = "",
x = names(my.df))
names(my.df) <- gsub(pattern = "_b",
replacement = "",
x = names(my.df))
Transpose to have individuals as rows:
my.df <- as.data.frame(t(my.df))
my.df <- row_to_names(my.df, row_number = 1)
my.df <- rownames_to_column(my.df, var = "sampleID")
Add columns required for ped file (I am using the RIGHT function by Andrew Haynes found here).
RIGHT = function(x,n){
substring(x,nchar(x)-n+1)
}
my.df <- my.df %>%
#add a FAM column
add_column(FAM="FAM", .before = 'sampleID') %>%
#add father info
add_column(father=0, .after = 'sampleID') %>%
#add mother info
add_column(mother=0, .after = 'father') %>%
#aadd sex information, which is stored in sample IDs
mutate(sex=case_when(RIGHT(sampleID,2)=="F1" ~ 2,
RIGHT(sampleID,2)=="M1" ~ 1),
.after = 'mother') %>%
mutate(sex2=case_when(RIGHT(sampleID,2)=="F1" ~ 2,
RIGHT(sampleID,2)=="M1" ~ 1),
.after = 'mother')
Extract genotype columns to recode in plink format
gens <- my.df[,7:ncol(my.df)]
gens[gens==-1]<-NA
gens[gens==" 0"]<-"AA"
gens[gens==" 1"]<-"AC"
gens[gens==" 2"]<-"CC"
Make final dataset by merging sampleID, sex and genotypes coded as required by plink format
my.df <- cbind(my.df %>% select(sampleID,sex),gens)
Remove unnecessary objects from R environment
rm(gens, RIGHT)
Save the resulting file
write_delim(my.df, file = "../3_SexLinked/15_2_SNPassoc_Input.txt",
delim = "\t")
Now we can run the association test. To do so, we need to convert to SNPassoc data format
sex <- setupSNP(data=my.df, colSNPs=3:ncol(my.df), sep="")
Next we conduct association tests for all markers using the codominant model.
WGcod <- WGassociation(sex, data = sex, model = "codominant",
quantitative = FALSE, genotypingRate = 0.6)
Save the resulting file, as well as the SNPassoc formatted R object
write_delim(x = cbind("SNPID"=rownames(WGcod),WGcod), file = "../3_SexLinked/15_3_SNPassoc_Output_Raw.txt",
delim = "\t",)
saveRDS(WGcod,file = "../3_SexLinked/15_3_SNPassoc_Output_Raw.RData")
saveRDS(sex,file = "../3_SexLinked/15_3_SNPassoc_Input_Formatted.RData")
Now let’s investigate how our results are. We will do so by producing a QQplot, calculating an inflation factor and adjusting significane using FDR correction.
WGcod <- readRDS(file = "../3_SexLinked/15_3_SNPassoc_Output_Raw.RData")
qqpval(WGcod$codominant)
This qqplot plots sorted observed Pvalues on the y axis and expected Pvalues based on aw uniform distribution. If you look at the underlying function these are produced with (1:n)/(n+1)
where n
is the number of loci tested.
Let’s plot the qqplot with Confidence intervals as well. To do so I use the gg_qqplot function from this blog. First I define the function.
gg_qqplot <- function(ps, ci = 0.95) {
n <- length(ps)
df <- data.frame(
observed = -log10(sort(ps)),
expected = -log10(ppoints(n)),
clower = -log10(qbeta(p = (1 - ci) / 2, shape1 = 1:n, shape2 = n:1)),
cupper = -log10(qbeta(p = (1 + ci) / 2, shape1 = 1:n, shape2 = n:1))
)
log10Pe <- expression(paste("Expected -log"[10], plain((p-value))))
log10Po <- expression(paste("Observed -log"[10], plain((p-value))))
ggplot(df) +
geom_ribbon(
mapping = aes(x = expected, ymin = clower, ymax = cupper),
alpha = 0.1
) +
geom_point(aes(expected, observed), shape = 1, size = 2) +
geom_abline(intercept = 0, slope = 1, alpha = 0.5) +
# geom_line(aes(expected, cupper), linetype = 2, size = 0.5) +
# geom_line(aes(expected, clower), linetype = 2, size = 0.5) +
xlab(log10Pe) +
ylab(log10Po)
}
Then I use it to produce the qqplot.
#retain only non NA pvalues. NA pvalues are returned where the two groups being tested are identical (i.e., exact same numbers for each genotype for the two sexes)
my.ps <- WGcod$codominant[is.na(WGcod$codominant) == FALSE ]
#and plot
q1 <- gg_qqplot(my.ps, ci = 0.95) +
theme_bw(base_size = 12) +
theme(
axis.ticks = element_line(size=0.5),
panel.grid = element_blank()
)
q1
Save the resulting graph
ggsave("../3_SexLinked/15_4_SNPassoc_QQplot.jpeg", plot = q1,
device = "jpeg",dpi = 300,width = 8,height = 4)
This shows us that the lower p-values we obtained are outside the expected 95% Confidence Intervals. Let’s see how many markers are significant at an alpha value of 0.05 and after FDR correction.
as.data.frame(WGcod) %>%
filter(is.na(comments)==TRUE & is.na(codominant)==FALSE & codominant < 0.05)
## comments codominant
## X334_26 <NA> 1.612097e-05
## X334_36 <NA> 1.612097e-05
## X374_45 <NA> 3.725127e-02
## X639_67 <NA> 2.505355e-02
## X650_68 <NA> 2.618523e-03
## X735_59 <NA> 3.878582e-02
## X802_9 <NA> 4.934112e-02
## X802_19 <NA> 4.934112e-02
## X914_10 <NA> 2.735361e-02
## X914_68 <NA> 4.209276e-02
## X1024_12 <NA> 5.197401e-03
## X1054_19 <NA> 3.596809e-02
## X2232_35 <NA> 8.588863e-03
## X2232_36 <NA> 6.864989e-04
## X2232_41 <NA> 4.210526e-03
## X2475_52 <NA> 1.885730e-02
## X3099_27 <NA> 3.259451e-02
## X3325_29 <NA> 3.018402e-02
## X3325_44 <NA> 3.018402e-02
## X3325_61 <NA> 3.018402e-02
## X3331_53 <NA> 1.032433e-02
## X3735_62 <NA> 3.694014e-02
## X3747_57 <NA> 3.731630e-02
## X3908_43 <NA> 4.411765e-02
## X3908_45 <NA> 4.411765e-02
## X3908_51 <NA> 4.411765e-02
## X3908_52 <NA> 4.411765e-02
## X4032_13 <NA> 6.770808e-05
## X4032_53 <NA> 6.770808e-05
## X4693_59 <NA> 4.140787e-04
## X4911_50 <NA> 1.211616e-02
## X4933_10 <NA> 1.619039e-02
## X5219_30 <NA> 4.241063e-02
## X5700_35 <NA> 5.980881e-03
## X5707_18 <NA> 2.736803e-02
## X5707_50 <NA> 2.736803e-02
## X6308_47 <NA> 6.993007e-03
## X6667_8 <NA> 3.197277e-02
## X6667_34 <NA> 2.580246e-02
## X6814_7 <NA> 3.596809e-02
## X6814_41 <NA> 3.913043e-02
## X6981_7 <NA> 3.195239e-02
## X6981_31 <NA> 3.195239e-02
## X7141_47 <NA> 2.837529e-03
## X7194_9 <NA> 4.997501e-05
## X7517_59 <NA> 4.545455e-02
## X7621_64 <NA> 2.747253e-02
## X7627_8 <NA> 3.726708e-02
## X7627_39 <NA> 3.726708e-02
## X7627_61 <NA> 1.372998e-02
## X7627_64 <NA> 3.726708e-02
## X7727_20 <NA> 1.767396e-02
## X8155_52 <NA> 4.151923e-04
## X8268_44 <NA> 8.398770e-03
## X8307_48 <NA> 1.328959e-02
## X8351_15 <NA> 3.250774e-02
## X8351_17 <NA> 3.250774e-02
## X8351_23 <NA> 3.250774e-02
## X9618_22 <NA> 3.852064e-05
## X9618_32 <NA> 3.224194e-06
## X11005_55 <NA> 4.267203e-02
## X11297_62 <NA> 3.171046e-02
## X11376_27 <NA> 4.516660e-02
## X11427_40 <NA> 1.888457e-02
## X12040_47 <NA> 3.229646e-02
## X12065_18 <NA> 1.372998e-02
## X12440_13 <NA> 1.686633e-02
## X12451_48 <NA> 4.489164e-02
## X12700_56 <NA> 2.385965e-02
## X12726_45 <NA> 2.960560e-02
## X12778_11 <NA> 4.214559e-02
## X13015_13 <NA> 1.940508e-03
## X13267_26 <NA> 1.162718e-02
## X13322_25 <NA> 6.676560e-03
## X13326_11 <NA> 4.214559e-02
## X13727_20 <NA> 3.947472e-02
## X13727_22 <NA> 3.947472e-02
## X13894_32 <NA> 3.496503e-03
## X14133_41 <NA> 9.147039e-03
## X14133_57 <NA> 1.134221e-02
## X14158_40 <NA> 2.608696e-02
## X14842_52 <NA> 2.401058e-02
## X14842_65 <NA> 2.401058e-02
## X14924_58 <NA> 2.810607e-02
## X15306_55 <NA> 3.195239e-02
## X15777_44 <NA> 1.102605e-02
## X15777_54 <NA> 1.102605e-02
## X15889_7 <NA> 4.395604e-02
## X15889_16 <NA> 2.777223e-02
## X16011_58 <NA> 2.098118e-02
## X16214_28 <NA> 9.293891e-03
## X16214_47 <NA> 6.639555e-03
## X16681_58 <NA> 4.289216e-02
## X16896_51 <NA> 2.523608e-02
## X16919_17 <NA> 6.770808e-05
## X16941_12 <NA> 3.894598e-02
## X16986_12 <NA> 8.942897e-05
## X16986_61 <NA> 1.377485e-03
## X17224_38 <NA> 3.913043e-02
## X17401_60 <NA> 1.185771e-03
## X17481_61 <NA> 4.456375e-02
## X17724_6 <NA> 1.888457e-02
## X18102_35 <NA> 4.449649e-02
## X18102_49 <NA> 4.449649e-02
## X18329_57 <NA> 5.797101e-04
## X18329_64 <NA> 1.204013e-03
## X18388_21 <NA> 2.747253e-02
## X18747_35 <NA> 2.107871e-02
## X18747_62 <NA> 2.107871e-02
## X18916_33 <NA> 3.173666e-02
## X19156_57 <NA> 2.698322e-02
## X19602_42 <NA> 6.191950e-03
## X19863_14 <NA> 3.088621e-02
## X19863_46 <NA> 3.088621e-02
## X19871_46 <NA> 1.425404e-02
## X19918_47 <NA> 4.114907e-02
## X20053_36 <NA> 4.927761e-02
## X20630_53 <NA> 1.041234e-03
## X20807_40 <NA> 3.181812e-02
## X20918_41 <NA> 2.380952e-02
## X20919_21 <NA> 5.685531e-04
## X21001_58 <NA> 3.964554e-02
## X21001_63 <NA> 3.964554e-02
## X21315_28 <NA> 1.490683e-02
## X21315_41 <NA> 1.490683e-02
## X21315_47 <NA> 1.490683e-02
## X21713_27 <NA> 3.611971e-02
## X22398_31 <NA> 4.884821e-03
## X22744_40 <NA> 3.311037e-02
## X22756_51 <NA> 4.214559e-02
## X22756_52 <NA> 4.214559e-02
## X22776_57 <NA> 1.991800e-02
## X23244_66 <NA> 1.525553e-04
## X23407_20 <NA> 8.061561e-03
## X23597_35 <NA> 4.407602e-02
## X23918_47 <NA> 1.282051e-02
## X24064_66 <NA> 2.484141e-02
## X24066_32 <NA> 1.708823e-04
## X24640_55 <NA> 7.326304e-03
## X24912_64 <NA> 4.471764e-02
## X25072_33 <NA> 6.057343e-04
## X25138_39 <NA> 3.678337e-03
## X25445_21 <NA> 3.340622e-02
## X25445_63 <NA> 3.340622e-02
## X25533_7 <NA> 2.397602e-02
## X25533_10 <NA> 2.397602e-02
## X25533_19 <NA> 2.397602e-02
## X25915_26 <NA> 8.326827e-03
## X25915_46 <NA> 8.326827e-03
## X25915_55 <NA> 1.430206e-02
## X26687_36 <NA> 1.084295e-04
## X27632_8 <NA> 3.164931e-02
## X27755_40 <NA> 1.888457e-02
## X27755_44 <NA> 1.888457e-02
## X27779_50 <NA> 4.338154e-02
## X27779_66 <NA> 4.338154e-02
## X28026_47 <NA> 3.336633e-02
## X28473_53 <NA> 4.482505e-03
## X28534_40 <NA> 2.909467e-02
## X28680_16 <NA> 4.125088e-02
## X29169_39 <NA> 3.410400e-02
## X29717_11 <NA> 4.159587e-02
## X29717_30 <NA> 2.585374e-02
## X29717_58 <NA> 4.159587e-02
## X29883_62 <NA> 2.424242e-02
## X29928_20 <NA> 1.204289e-02
## X30107_7 <NA> 2.580246e-02
## X30107_17 <NA> 2.580246e-02
## X30107_33 <NA> 2.580246e-02
## X30503_14 <NA> 2.824478e-04
## X30536_35 <NA> 8.092728e-04
## X30641_43 <NA> 4.156858e-04
## X30641_54 <NA> 1.399300e-04
## X30763_31 <NA> 1.852537e-02
## X30794_14 <NA> 4.449649e-02
## X30794_40 <NA> 4.449649e-02
## X30933_30 <NA> 1.960784e-02
## X31389_31 <NA> 7.263035e-04
## X31389_42 <NA> 1.054312e-03
## X31389_50 <NA> 1.624188e-03
## X31389_51 <NA> 1.054312e-03
## X31415_50 <NA> 6.993007e-03
## X31958_29 <NA> 9.700774e-03
## X32322_27 <NA> 2.900808e-02
## X32322_50 <NA> 4.259721e-02
## X32707_50 <NA> 7.711895e-04
## X32829_10 <NA> 3.138052e-04
## X32829_41 <NA> 3.138052e-04
## X32829_45 <NA> 3.138052e-04
## X33230_68 <NA> 3.137652e-02
## X33276_26 <NA> 3.432494e-03
## X33613_48 <NA> 2.532769e-02
## X33613_59 <NA> 2.532769e-02
## X33781_49 <NA> 4.338154e-02
## X33808_58 <NA> 3.384210e-03
## X34059_60 <NA> 1.237069e-02
## X34276_9 <NA> 1.904925e-02
## X34276_19 <NA> 1.904925e-02
## X34358_50 <NA> 3.024414e-02
## X34683_51 <NA> 2.714932e-02
## X35595_56 <NA> 3.895525e-02
## X35595_66 <NA> 3.895525e-02
## X35748_28 <NA> 4.338154e-02
## X35842_38 <NA> 6.770808e-05
## X35993_43 <NA> 2.168820e-02
## X36141_59 <NA> 4.833490e-02
## X36195_7 <NA> 4.214559e-02
## X36625_10 <NA> 2.105536e-02
## X36625_63 <NA> 1.699101e-02
## X36885_68 <NA> 2.287582e-02
## X37034_38 <NA> 4.113534e-05
## X37034_66 <NA> 4.113534e-05
## X37466_39 <NA> 4.449649e-02
## X37619_21 <NA> 7.120743e-03
## X37619_33 <NA> 8.218407e-03
## X37619_42 <NA> 8.218407e-03
## X37987_12 <NA> 2.321981e-02
## X37987_26 <NA> 2.361883e-02
## X37987_52 <NA> 2.361883e-02
## X38114_63 <NA> 1.673722e-02
## X38177_9 <NA> 4.449649e-02
## X38177_64 <NA> 4.449649e-02
## X38312_6 <NA> 3.020967e-02
## X38607_48 <NA> 6.770808e-05
## X38866_34 <NA> 6.931540e-03
## X38866_47 <NA> 6.864989e-04
## X39344_24 <NA> 3.728997e-02
## X39344_26 <NA> 3.728997e-02
## X39664_7 <NA> 6.987594e-03
## X39971_50 <NA> 1.212276e-02
## X40145_58 <NA> 2.371542e-02
## X40145_61 <NA> 7.489078e-03
## X40444_17 <NA> 9.810479e-03
## X40998_10 <NA> 2.322204e-02
## X40998_67 <NA> 3.336633e-02
## X41125_40 <NA> 2.514371e-02
## X41261_10 <NA> 3.370798e-03
## X41742_17 <NA> 9.657010e-03
## X41896_48 <NA> 1.818182e-02
## X42297_19 <NA> 3.540925e-02
## X42612_62 <NA> 1.010101e-02
## X42648_64 <NA> 4.113534e-02
## X42662_30 <NA> 4.027500e-02
## X42662_35 <NA> 2.924092e-02
## X42662_36 <NA> 2.924092e-02
## X42662_40 <NA> 2.924092e-02
## X43497_68 <NA> 3.182798e-02
## X43840_52 <NA> 4.154246e-03
## X43924_10 <NA> 3.139030e-02
## X44883_48 <NA> 1.213599e-02
## X45038_61 <NA> 1.292437e-02
## X45038_62 <NA> 1.863328e-02
## X45263_7 <NA> 3.483823e-02
## X45277_20 <NA> 2.810607e-02
## X45277_67 <NA> 2.810607e-02
## X45294_22 <NA> 2.245146e-02
## X45294_55 <NA> 2.245146e-02
## X45299_49 <NA> 1.298701e-02
## X45307_28 <NA> 6.170300e-03
## X45688_44 <NA> 1.359135e-02
## X45990_11 <NA> 4.982669e-02
## X45993_12 <NA> 4.333794e-04
## X46352_67 <NA> 4.757416e-02
## X47206_13 <NA> 6.617716e-03
## X47388_63 <NA> 4.545455e-02
## X47692_48 <NA> 2.275526e-02
## X48241_60 <NA> 2.329920e-02
## X48483_21 <NA> 3.770739e-02
## X48483_48 <NA> 3.194221e-02
## X49716_7 <NA> 2.023220e-02
## X49924_37 <NA> 4.765590e-02
## X49924_47 <NA> 4.765590e-02
## X51163_39 <NA> 7.751371e-03
## X51332_32 <NA> 4.210526e-02
## X51574_47 <NA> 4.576659e-04
## X52278_15 <NA> 1.364248e-02
## X54327_49 <NA> 4.710305e-02
## X54587_64 <NA> 4.395604e-02
## X57428_53 <NA> 5.431671e-03
## X60625_8 <NA> 4.449649e-02
## X60625_41 <NA> 4.449649e-02
## X65816_49 <NA> 1.533166e-02
## X68160_20 <NA> 4.214559e-02
## X69834_64 <NA> 1.035197e-04
## X75021_34 <NA> 9.395302e-04
## X76522_27 <NA> 1.667605e-02
## X76864_52 <NA> 3.229646e-02
## X80361_62 <NA> 1.427974e-02
## X84083_68 <NA> 2.194543e-02
## X86312_62 <NA> 4.377704e-02
## X116502_23 <NA> 2.105263e-02
## X116502_26 <NA> 2.105263e-02
## X116502_27 <NA> 4.912281e-02
## X116502_28 <NA> 4.912281e-02
## X151186_22 <NA> 4.872257e-02
## X153320_24 <NA> 2.357242e-02
## X175541_10 <NA> 3.841391e-02
## X175541_24 <NA> 3.841391e-02
## X178740_42 <NA> 4.074074e-02
## X179603_11 <NA> 3.229646e-02
## X199881_10 <NA> 6.879945e-05
## X203234_39 <NA> 3.373097e-03
## X204127_54 <NA> 2.748826e-03
## X206720_38 <NA> 2.197802e-02
## X207207_38 <NA> 9.121313e-04
## X208246_64 <NA> 3.496503e-02
## X214417_43 <NA> 8.085431e-03
## X214417_53 <NA> 8.085431e-03
## X214417_63 <NA> 3.171046e-02
## X215072_12 <NA> 6.506035e-06
## X215072_57 <NA> 3.311037e-02
## X215072_68 <NA> 5.928854e-03
## X217209_24 <NA> 4.834484e-02
## X217209_35 <NA> 4.834484e-02
## X217209_68 <NA> 4.834484e-02
## X218040_28 <NA> 4.345654e-02
## X218040_32 <NA> 4.345654e-02
## X218040_37 <NA> 3.384116e-02
## X218040_46 <NA> 4.345654e-02
## X218040_61 <NA> 4.345654e-02
## X219037_53 <NA> 2.298137e-02
## X220670_15 <NA> 2.161478e-02
Before correction there are 322 markers significantly associated with sex at an alpha of 0.05. Given we are doing ~10,000 tests we will need to adjust for multiple testing.
ajs_p <- p.adjust(my.ps,method = "fdr", n=length(my.ps))
How many markers are significant now at an alpha value of 0.05?
ajs_p.len <- length(ajs_p[which(ajs_p<=0.05)])
cat('fdr corrected threshold p value is:', my.ps[ajs_p.len], 'total:', ajs_p.len, "\n")
## fdr corrected threshold p value is: 0.2197166 total: 14
Let’s see which are these markers. First we add corrected p-values to our dataframe
WGcod <- as.data.frame(WGcod) %>%
filter(is.na(codominant)==FALSE) %>% select(-comments)
WGcod <- cbind(WGcod,"p.adjust"=ajs_p)
Now see which markers are significant.
WGcod %>% filter(p.adjust < 0.05) %>% print.data.frame()
## codominant p.adjust
## X334_26 1.612097e-05 0.03958907
## X334_36 1.612097e-05 0.03958907
## X4032_13 6.770808e-05 0.04827264
## X4032_53 6.770808e-05 0.04827264
## X7194_9 4.997501e-05 0.04827264
## X9618_22 3.852064e-05 0.04827264
## X9618_32 3.224194e-06 0.03167126
## X16919_17 6.770808e-05 0.04827264
## X35842_38 6.770808e-05 0.04827264
## X37034_38 4.113534e-05 0.04827264
## X37034_66 4.113534e-05 0.04827264
## X38607_48 6.770808e-05 0.04827264
## X199881_10 6.879945e-05 0.04827264
## X215072_12 6.506035e-06 0.03195439
We can see there are four pairs of SNPs that come from the same locus. This is to be expected if those loci are linked to a sex-determining locus. We will explore this further in a second, now let’s finish assessing our GWAS by calculating the genomic inflation factor. This should be close to 1. Values higher than 1 indicate p-value inflation, usually due to population structure.
my.chisq <- qchisq(1 - my.ps, 1)
median(my.chisq) / qchisq(0.5, 1)
## [1] 0.6849137
It is lower than 1, so no population structure. Yet it is a bit too low. I have not figured out why this is the case yet. I will report it and see what reviewers say.
Let’s finish by saving the dataset with adjusted p-values.
WGcod <- WGcod %>%
rownames_to_column(var="markerID") %>%
mutate(markerID = gsub("X","",markerID))
And save
write_delim(WGcod,file = "../3_SexLinked/15_4_SNPassoc_Output_FDR.txt",
delim = "\t",col_names = TRUE)
Now that we have conducted the GWAS, let’s see how significance changes across the genome and where the 14 SNPs significantly associated with sex fall on the genome.
First we clear the environment
rm(list=ls())
We first import the list of sex-linked markers from previous analyses, the list of p-values from the GWAS and the linkage map data.
my.gwas <- read_delim("../3_SexLinked/15_4_SNPassoc_Output_FDR.txt",
delim = "\t")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## markerID = col_character(),
## codominant = col_double(),
## p.adjust = col_double()
## )
my.map <- read_delim("../2_LinMap/10_4_linkageMap.txt",
delim = "\t")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## SNPID = col_character(),
## fasta = col_character(),
## Segregation = col_character(),
## tag = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
my.sexLinked <- read_delim("../3_SexLinked/13_3_SexLinkedMarkers_DM_2.txt",
delim = "\t")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## markerID = col_character(),
## Method = col_character(),
## SexDet = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
my.gcounts <- readRDS("../3_SexLinked/13_2_SexGenotypeCounts.RData")
Now that we have all of this, let’s add it all to a single dataframe.
Create dataframe of all markerIDs across the four datasets
ids1 <- my.gcounts %>% pull(markerID) %>% unique()
ids2 <- my.gwas %>% pull(markerID) %>% unique()
ids3 <- my.map %>% pull(SNPID) %>% unique()
ids4 <- my.sexLinked %>% pull(markerID) %>% unique()
markerID <- unique(c(ids1,ids2,ids3,ids4))
my.df <- as.data.frame(markerID)
rm(ids1,ids2,ids3,ids4,markerID)
Add genotype counts
my.df <- left_join(my.df, my.gcounts, by="markerID")
Add linkage map data to the database of genotype counts
#rename colname of marker IDs so that it matches between datasets
my.map <- my.map %>% rename("markerID"="SNPID")
my.df <- left_join(my.df,my.map, by = "markerID")
Add data on sex-linked loci to this dataframe.
my.sexLinked <- my.sexLinked %>% select(markerID,Method,SexDet,LG_locus,mean_F,mean_M)
my.sexLinked <- pivot_wider(data=my.sexLinked,
names_from = Method,
values_from = Method)
my.df <- left_join(my.df,my.sexLinked,
by=c("markerID"))
Add p-values from the GWAS
my.gwas <- my.gwas %>% rename("GWAS.pvalue"="codominant",
"GWAS.FDR"="p.adjust")
my.df <- left_join(my.df,my.gwas, by="markerID")
Add a column to depict significant markers from the GWAS
my.df <- my.df %>%
mutate("GWAS.associated"=case_when(GWAS.FDR < 0.05 ~ "significant",
GWAS.FDR >= 0.05 ~ "non.significant",
is.na(GWAS.FDR)==TRUE ~ "not.assessed"))
Add Locus ID for all markers (currently only present for linkage map markers). It is contained within the markerID, but it is good to have it easily accessibly in a separate column.
my.df <- my.df %>% mutate(Locus = gsub("_.*","",markerID))
And let’s rename the LepMap ID (currently called ID) to make it more explicit
my.df <- my.df %>% rename("LepmapID"="ID")
We currently have fasta sequences only for the linkage map markers, but we want it for all markers, including sex-linked and sex-associated. This might be useful for anyone trying to use this map to blast it to other genomes, or to check the sex-linked/associated markers.
Clear environment
rm(my.gcounts,my.gwas,my.map,my.sexLinked)
To achieve this, first thing we need to do is retain a list of unique Locus IDs first.
my.ids <- my.df %>% pull(Locus) %>% unique()
Next we want to save it as a single column, and as a tab delimited file
write_delim(as.data.frame(my.ids),
file = "../3_SexLinked/15_5_All_LociIDs.txt",
delim = "\t",col_names = FALSE)
Now we extract the corresponding fasta sequences from the stacks catalogue with seqtk
.
#!/bin/bash
#load seqtk
module load seqtk/1.3
#run seqtk for each family
seqtk subseq ./1_DataPrep/catalog.fa.gz ./3_SexLinked/15_5_All_LociIDs.txt > ./3_SexLinked/15_5_All_Loci.fa
The resulting output has two lines per locus. The first one has the format >locusID NS=...
while the second one contains the fasta sequence.
To convert this to tabular format to then merge it with my current table I downloaded pyfaidx version 0.6.4 (05 April 2022). I then used the below command.
faidx --transform transposed 15_5_All_Loci.fa > 15_5_All_Loci.tsv
Let’s import the resulting file
my.fastas <- read_delim("../3_SexLinked/15_5_All_Loci.tsv",
delim = "\t",
col_names = c("Locus","some",
"length","fasta"))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Locus = col_double(),
## some = col_double(),
## length = col_double(),
## fasta = col_character()
## )
Let’s retain only columns of interest (i.e., locus, fasta and length)
my.fastas <- my.fastas %>%
select(Locus, fasta, length) %>%
mutate(Locus=as.character(Locus))
And merge with the current table
my.df <- left_join(my.df %>% select(-fasta), my.fastas, by = "Locus")
my.df <- my.df %>% select(-length.x) %>% rename("length"="length.y")
Finally, for the 14 markers associated with sex, let’s retrieve whether markers from the same locus have been mapped (sometimes the specific marker was not mapped, but the same locus was).
temp <- my.df %>%
dplyr::select(Locus,LG,male,female,sex.averaged) %>%
group_by(Locus) %>%
summarise(LG_locus = mean(LG,na.rm=TRUE),
mean_F = mean(female,na.rm = TRUE),
mean_M = mean(male,na.rm = TRUE),
mean_Ave = mean(sex.averaged, na.rm = TRUE))
temp <- temp %>%
mutate(Locus = as.character(Locus))
my.df <- my.df %>% select(-c(LG_locus,mean_F,mean_M))
my.df <- left_join(my.df, temp, by = "Locus")
rm(temp)
Let’s re-arrange columns a bit
my.df <- my.df %>% select(markerID,Locus,LepmapID,fasta,length,
LG, male, female, sex.averaged,
LG_locus,mean_F,mean_M,mean_Ave,
Segregation:NULLS,
SexDet,Method1:GWAS.associated,
Zero_M:Snp_F)
Let’s save the resulting table, which we will attach as supplementary material to the publication.
write_csv(my.df,file = "../3_SexLinked/15_5_SummaryTable_MapAndSex.csv",
col_names = TRUE)
Okay now we have all of our data in one neat table. Let’s plot results of the GWAS across linkage groups, and then compare the markers identified with those identified based on heterozygosity methods.
rm(list=ls())
Read in the table. I had to use read.csv
and not read_csv
because the latter tried to guess my data types for columns based on the first 1000 rows, and freaked out when it encountered non-NA values after 1000 rows if the first 1000 were all NAs.
my.df <- read.csv("../3_SexLinked/15_5_SummaryTable_MapAndSex.csv")
Let’s start by plotting significance of GWAS by linkage group. First we add alternating colours to linkage groups for plotting.
my.df <- my.df %>% mutate(colour = case_when(is.na(LG)==TRUE ~ "NA",
LG %% 2 == 0 ~ "black",
LG %% 2 != 0 ~ "red"))
Now retain a dataframe of markers present on linkage map and the GWAS analysis.
my.gmap <- my.df %>% filter(is.na(LG)==FALSE & is.na(GWAS.FDR)==FALSE)
And now plot it
p1 <- ggplot(my.gmap, aes(x = sex.averaged, y=-log10(GWAS.pvalue), )) +
#geom_histogram(binwidth = 1, aes(fill=colour, col=colour)) +
geom_hline(yintercept=0) +
geom_point(aes(fill=colour, col=colour)) +
facet_grid(~LG, scales = "free_x", switch = "x") +
geom_hline(yintercept=(-log10(0.05))) +
theme_classic() +
scale_fill_manual(values = c("black","red")) +
scale_color_manual(values = c("black","red")) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.line.x = element_blank(),
strip.background = element_blank(),
legend.position = "none") +
labs(y = "-log10(pvalue)", x = "Linkage groups")
p1
This first graph shows the original p-values before correction. Now let’s plot it after FDR correction.
p2 <- ggplot(my.gmap, aes(x = sex.averaged, y=-log10(GWAS.FDR), )) +
#geom_histogram(binwidth = 1, aes(fill=colour, col=colour)) +
geom_hline(yintercept=0) +
geom_point(aes(fill=colour, col=colour)) +
facet_grid(~LG, scales = "free_x", switch = "x") +
geom_hline(yintercept=(-log10(0.05))) +
theme_classic() +
scale_fill_manual(values = c("black","red")) +
scale_color_manual(values = c("black","red")) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.line.x = element_blank(),
strip.background = element_blank(),
legend.position = "none") +
labs(y = "-log10(pvalue corrected by FDR method)", x = "Linkage groups")
p2
Save both plots
ggsave("../3_SexLinked/15_6_SNPassoc_SignificanceByLG.jpeg", p1,
device = "jpeg", dpi = 300, width = 8, height = 4)
ggsave("../3_SexLinked/15_6_SNPassoc_SignificanceByLG_FDR.jpeg", p2,
device = "jpeg", dpi = 300, width = 8, height = 4)
Now let’s plot pvalue versus linkage map position for LG6 only. And let’s highlight markers that were identified by Methods 1 to 5 as well, for both corrected and uncorrected pvalues
both.ps <- my.df %>%
filter(is.na(SexDet)==FALSE & GWAS.pvalue<0.05) %>%
pull(markerID)
both.FDR <- my.df %>%
filter(is.na(SexDet)==FALSE & GWAS.FDR<0.05) %>%
pull(markerID)
Assign an extra column stating whether those markers were identified by sex-linkage methods as well.
my.df.2 <- my.df %>%
mutate(both.ps=if_else(markerID%in%both.ps,"both","not")) %>%
mutate(both.FDR=if_else(markerID%in%both.FDR,"both","not"))
Plot linkage group 6
p3 <- ggplot(my.df.2 %>% filter(LG==6), aes(x = sex.averaged, y=-log10(GWAS.pvalue))) +
geom_point(aes(fill=both.ps, col=both.ps)) +
geom_hline(yintercept=(-log10(0.05))) +
theme_classic() +
scale_fill_manual(values = c("red","black")) +
scale_color_manual(values = c("red","black")) +
theme(strip.background = element_blank(),
legend.position = "none",
panel.spacing.x = unit(0,"lines"),
panel.spacing.y = unit(1,"lines")) +
labs(y = "-log10(pvalue)", x = element_blank())
p3
## Warning: Removed 518 rows containing missing values (geom_point).
p4 <- ggplot(my.df.2 %>% filter(LG==6), aes(x = sex.averaged, y=-log10(GWAS.FDR))) +
geom_point(aes(fill=both.FDR, col=both.FDR)) +
geom_hline(yintercept=(-log10(0.05))) +
theme_classic() +
scale_fill_manual(values = c("red","black")) +
scale_color_manual(values = c("red","black")) +
theme(strip.background = element_blank(),
legend.position = "none",
panel.spacing.x = unit(0,"lines"),
panel.spacing.y = unit(1,"lines")) +
labs(y = "-log10(pvalue)", x = "Distance on Linkage Group 6 (cM)")
p4
## Warning: Removed 518 rows containing missing values (geom_point).
Arrange them together and plot:
p5 <- ggarrange(p3,p4, ncol=1)
p5
And save the resulting graph
ggsave("../3_SexLinked/15_6_SNPassoc_SignificanceLG6.jpeg", device="jpeg",
dpi=300, width = 8, height = 4)
Finally, let’s summarise simply a few things:
How many markers significant after FDR for GWAS:
cat('Total number of markers significant after FDR correction:',
nrow(my.df %>% filter(GWAS.associated=="significant")), "\n")
## Total number of markers significant after FDR correction: 14
How many XY and ZW markers identified?
cat('Total number of XY and ZW markers identified is',
length(my.df %>% filter(SexDet=="XY") %>% pull(markerID) %>% unique()),
'and',
length(my.df %>% filter(SexDet=="ZW") %>% pull(markerID) %>% unique()),
'respectively', "\n")
## Total number of XY and ZW markers identified is 29 and 2 respectively
How many were identified by both?
cat(nrow(my.df %>% filter(SexDet=="XY"&GWAS.associated=="significant")),
'markers were identified by both methods with FDR, and',
nrow(my.df %>% filter(SexDet=="XY"&GWAS.pvalue<0.06)),
'were identified by both before correcting for multiple testing')
## 7 markers were identified by both methods with FDR, and 29 were identified by both before correcting for multiple testing
How many of those occur on Linkage group 6?
cat(nrow(my.df %>%
filter(SexDet=="XY" & GWAS.associated=="significant" & LG==6)),
'of these markers occurs on LG6 after FDR, while',
nrow(my.df %>%
filter(SexDet=="XY" & GWAS.pvalue < 0.05 & LG==6)),
'of these occur on LG6 before FDR')
## 1 of these markers occurs on LG6 after FDR, while 8 of these occur on LG6 before FDR
How many occur on other Linkage groups?
cat(nrow(my.df %>%
filter(SexDet=="XY" & GWAS.associated=="significant" & LG!=6)),
'and',
nrow(my.df %>%
filter(SexDet=="XY" & GWAS.pvalue < 0.05 & LG!=6)),
'of these markers occur on other LGs before and after FDR correction, respectively')
## 0 and 1 of these markers occur on other LGs before and after FDR correction, respectively