15. Association tests

1. Background
2. Load packages
3. Data exploration
4. GWAS
5. Compare and plot on linkage map
6. GWAS plots

1. Background

We have identified a region that is a very good candidate for harbouring the sex-determination locus on LG6.

To further validate our findings, let’s now conduct an association test for each SNP from the adult sexed individuals dataset. We will then plot the statistical significance of association for each SNP against their linkage map position to see whether we detect the same region as we did using the sex-linked markers.

We will use the package SNPassoc.

2. Load packages

library(SNPassoc)
library(xtable)
library(tidyverse)

## Warning: package 'tidyr' was built under R version 4.0.5

library(janitor)
library(dartR)
library(rrBLUP)
library(ggpubr)

3. Data exploration

First of all, we are going to assess if there is any population structure within our adult sexed individuals. They come from 4 different creeks so that is likely the case, and if so we will have to account for it in the GWAS.

We will read in the vcf tools of the adult individuals produced with the population module (see: “13. Identify Sex Linked Markers” for details).

my.gl <- dartR::gl.read.vcf(vcffile = "../3_SexLinked/13_1_populations_sex_DM_R40.vcf")

## Starting :: 
##  Starting dartR 
##  Starting gl.read.vcf 
## Starting gl.compliance.check 
##   Processing genlight object with SNP data
##   Checking coding of SNPs
##     SNP data scored NA, 0, 1 or 2 confirmed
##   Checking locus metrics and flags
##   Recalculating locus metrics
##   Checking for monomorphic loci
##     No monomorphic loci detected
##   Checking for loci with all missing data
##     No loci with all missing data detected
##   Checking whether individual names are unique.
##   Checking for individual metrics
##   Warning: Creating a slot for individual metrics
##   Checking for population assignments
##   Population assignments not detected, individuals assigned to a single population labelled 'pop1'
##   Spelling of coordinates checked and changed if necessary to lat/lon
## Completed: gl.compliance.check 
## Starting gl.recalc.metrics 
##   Processing genlight object with SNP data
## Starting utils.recalc.avgpic 
##   Processing genlight object with SNP data
##   Recalculating OneRatioRef, OneRatioSnp, PICRef, PICSnp, AvgPIC
## Completed: utils.recalc.avgpic 
## Starting utils.recalc.callrate 
##   Processing genlight object with SNP data
##   Recalculating locus metric CallRate
## Completed: utils.recalc.callrate 
## Starting utils.recalc.maf 
##   Processing genlight object with SNP data
##   Recalculating FreqHoms and FreqHets
## Starting utils.recalc.freqhets 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHets
## Completed: utils.recalc.freqhets 
## Starting utils.recalc.freqhomref 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHomRef
## Completed: utils.recalc.freqhomref 
## Starting utils.recalc.freqhomsnp 
##   Processing genlight object with SNP data
##   Recalculating locus metric freqHomSnp
## Completed: utils.recalc.freqhomsnp 
##   Recalculating Minor Allele Frequency (MAF)
## Completed: utils.recalc.maf 
##   Locus metrics recalculated
## Completed: gl.recalc.metrics

Let’s check call rate by individual and by locus

#individual call rate
gl.report.callrate(my.gl,method = "ind")

## Starting gl.report.callrate 
##   Processing genlight object with SNP data
##   Reporting Call Rate by Individual
##   No. of loci = 14069 
##   No. of individuals = 38 
##     Minimum      :  0.6316725 
##     1st quartile :  0.7095565 
##     Median       :  0.7674319 
##     Mean         :  0.7552663 
##     3r quartile  :  0.7982444 
##     Maximum      :  0.8509489 
##     Missing Rate Overall:  0.24

##    Quantile Threshold Retained Percent Filtered Percent
## 1      100% 0.8509489        1     2.6       37    97.4
## 2       95% 0.8298529        2     5.3       36    94.7
## 3       90% 0.8228232        4    10.5       34    89.5
## 4       85% 0.8212133        6    15.8       32    84.2
## 5       80% 0.8097093        8    21.1       30    78.9
## 6       75% 0.7982444       10    26.3       28    73.7
## 7       70% 0.7851944       12    31.6       26    68.4
## 8       65% 0.7808835       13    34.2       25    65.8
## 9       60% 0.7771981       15    39.5       23    60.5
## 10      55% 0.7705843       17    44.7       21    55.3
## 11      50% 0.7674319       19    50.0       19    50.0
## 12      45% 0.7523278       21    55.3       17    44.7
## 13      40% 0.7403085       23    60.5       15    39.5
## 14      35% 0.7329021       25    65.8       13    34.2
## 15      30% 0.7236548       26    68.4       12    31.6
## 16      25% 0.7095565       28    73.7       10    26.3
## 17      20% 0.7027649       30    78.9        8    21.1
## 18      15% 0.6946691       32    84.2        6    15.8
## 19      10% 0.6845831       34    89.5        4    10.5
## 20       5% 0.6694648       36    94.7        2     5.3
## 21       0% 0.6316725       38   100.0        0     0.0
## Completed: gl.report.callrate

Individuals have mostly low levels of missing data with no obvious outliers, so all good there.

#filter snp call rate
gl.report.callrate(my.gl)

## Starting gl.report.callrate 
##   Processing genlight object with SNP data
##   Reporting Call Rate by Locus
##   No. of loci = 14069 
##   No. of individuals = 38 
##     Minimum      :  0.421053 
##     1st quartile :  0.578947 
##     Median       :  0.789474 
##     Mean         :  0.7552663 
##     3r quartile  :  0.947368 
##     Maximum      :  1 
##     Missing Rate Overall:  0.24

##    Quantile Threshold Retained Percent Filtered Percent
## 1      100%  1.000000     1553    11.0    12516    89.0
## 2       95%  1.000000     1553    11.0    12516    89.0
## 3       90%  1.000000     1553    11.0    12516    89.0
## 4       85%  0.973684     2698    19.2    11371    80.8
## 5       80%  0.947368     3687    26.2    10382    73.8
## 6       75%  0.947368     3687    26.2    10382    73.8
## 7       70%  0.921053     4479    31.8     9590    68.2
## 8       65%  0.894737     5179    36.8     8890    63.2
## 9       60%  0.868421     5674    40.3     8395    59.7
## 10      55%  0.815789     6606    47.0     7463    53.0
## 11      50%  0.789474     7142    50.8     6927    49.2
## 12      45%  0.736842     8074    57.4     5995    42.6
## 13      40%  0.710526     8527    60.6     5542    39.4
## 14      35%  0.657895     9433    67.0     4636    33.0
## 15      30%  0.631579     9892    70.3     4177    29.7
## 16      25%  0.578947    10803    76.8     3266    23.2
## 17      20%  0.552632    11259    80.0     2810    20.0
## 18      15%  0.500000    12374    88.0     1695    12.0
## 19      10%  0.473684    12942    92.0     1127     8.0
## 20       5%  0.447368    13469    95.7      600     4.3
## 21       0%  0.421053    14069   100.0        0     0.0
## Completed: gl.report.callrate

In terms of call rate by SNP we have a few SNPs with low (< 0.7 Call Rate). That was alright for heterozygosity method that were looking at sex-specific loci, but for GWAS we might be a bit more stringent. Let’s filter then to retain only markers present in at least 80% of individuals.

my.gl <- gl.filter.callrate(my.gl, threshold = 0.8)

## Starting gl.filter.callrate 
##   Processing genlight object with SNP data
##   Recalculating Call Rate
##   Removing loci based on Call Rate, threshold = 0.8

## Completed: gl.filter.callrate

Now let’s look at relatedness and population structure.

my.grm <- dartR::gl.grm(my.gl)

## Starting :: 
##  Starting dartR 
##  Starting gl.grm 
##   Processing genlight object with SNP data

## Registered S3 method overwritten by 'gplots':
##   method         from 
##   reorder.factor gdata

## Completed: :: 
##  Completed: dartR 
##  Completed: gl.grm

gl.grm.network(my.grm,my.gl,method = 'kk',node.label.size = 2,node.size = 4)

## Starting gl.grm.network 
##   Processing genlight object with SNP data

## Completed: gl.grm.network

From this it already looks clear that we have structure of YC and MRu against all other samples, and the YC and MRu samples are also highly related to each other.

Let’s look where they fall on a PCOA.

my.pcoa <- gl.pcoa(my.gl)

## Starting gl.pcoa 
##   Processing genlight object with SNP data
##   Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state

## Completed: gl.pcoa

gl.pcoa.plot(glPca = my.pcoa, x = my.gl,interactive = FALSE)

## Starting gl.pcoa.plot 
##   Processing an ordination file (glPca)
##   Processing genlight object with SNP data
##   Plotting populations in a space defined by the SNPs
##   Preparing plot .... please wait

## Completed: gl.pcoa.plot

There are some individuals that are clustering against the rest of the samples. Let’s check which ones these are (likely the YC and MRu samples).

as.data.frame(my.pcoa$scores) %>% filter(PC1 < -20)

##                         PC1        PC2         PC3         PC4          PC5
## LS_DM_MRu_1204_F1 -28.49856  -8.582607  15.4940058   8.4292523 -16.19871527
## LS_DM_MRu_1210_M1 -28.42007  -9.911423  12.6980512  -7.4089470  18.45627751
## LS_DM_YC_1233_M1  -29.52757 -10.156051 -15.5945210  16.7996139   5.85532878
## LS_DM_YC_1236_F1  -27.70711  -8.976040 -11.9079795 -18.7166884  -9.53751517
## LS_DM_YC_1237_F1  -27.93108  19.270559  -0.5749979   0.8138258   0.05768402
## LS_DM_YC_1240_M1  -29.09871  18.428612   0.3256411  -0.7787646   0.89298018

As epected the YC and MRu individuals cluster on their own. I have ran the next step both with and without these individuals and the results persists. Here to account for pop structure I will choose the easier path and remove those 6 individuals.

4. GWAS

Now let’s run a GWAS for sex using the SNPassoc package.

4.1. Convert to Ped format

First of all, we will format the data to the format required by SNPassoc, which is the ped format.

Clear environment

rm(list = ls())

Read genotype data for the adult sexed individuals. This was produced before (see: “13. Identify Sex Linked Markers” for details).

my.df <- readRDS(file = "../3_SexLinked/13_2_InputDataframe_sex_R40.RData")

Remove individuals from the YC and MRu populations. We can do this using the Pop identifier in their name.

my.df <- select(my.df, -contains("_YC_"))
my.df <- select(my.df, -contains("_MRu_"))

Then we remove "_a" from the end of sample IDs, as we don’t need the replicate info anymore and it makes life harder. Remember, we have already retained one replicate per sample.

names(my.df) <- gsub(pattern = "_a",
                     replacement = "",
                     x = names(my.df))

names(my.df) <- gsub(pattern = "_b",
                     replacement = "",
                     x = names(my.df))

Transpose to have individuals as rows:

my.df <- as.data.frame(t(my.df))

my.df <- row_to_names(my.df, row_number = 1)

my.df <- rownames_to_column(my.df, var = "sampleID")

Add columns required for ped file (I am using the RIGHT function by Andrew Haynes found here).

RIGHT = function(x,n){
  substring(x,nchar(x)-n+1)
}

my.df <- my.df %>%
  #add a FAM column
  add_column(FAM="FAM", .before = 'sampleID') %>%
  #add father info
  add_column(father=0, .after =  'sampleID') %>%
  #add mother info
  add_column(mother=0, .after =  'father') %>%
  #aadd sex information, which is stored in sample IDs
  mutate(sex=case_when(RIGHT(sampleID,2)=="F1" ~ 2,
                       RIGHT(sampleID,2)=="M1" ~ 1),
         .after = 'mother') %>%
  mutate(sex2=case_when(RIGHT(sampleID,2)=="F1" ~ 2,
                       RIGHT(sampleID,2)=="M1" ~ 1),
         .after = 'mother')

Extract genotype columns to recode in plink format

gens <- my.df[,7:ncol(my.df)]

gens[gens==-1]<-NA
gens[gens==" 0"]<-"AA"
gens[gens==" 1"]<-"AC"
gens[gens==" 2"]<-"CC"

Make final dataset by merging sampleID, sex and genotypes coded as required by plink format

my.df <- cbind(my.df %>% select(sampleID,sex),gens)

Remove unnecessary objects from R environment

rm(gens, RIGHT)

Save the resulting file

write_delim(my.df, file = "../3_SexLinked/15_2_SNPassoc_Input.txt",
            delim = "\t")

4.2 Association test

Now we can run the association test. To do so, we need to convert to SNPassoc data format

sex <- setupSNP(data=my.df, colSNPs=3:ncol(my.df), sep="")

Next we conduct association tests for all markers using the codominant model.

WGcod <- WGassociation(sex, data = sex, model = "codominant",
              quantitative = FALSE, genotypingRate = 0.6)

Save the resulting file, as well as the SNPassoc formatted R object

write_delim(x = cbind("SNPID"=rownames(WGcod),WGcod), file = "../3_SexLinked/15_3_SNPassoc_Output_Raw.txt",
            delim = "\t",)

saveRDS(WGcod,file = "../3_SexLinked/15_3_SNPassoc_Output_Raw.RData")
saveRDS(sex,file = "../3_SexLinked/15_3_SNPassoc_Input_Formatted.RData")

4.3. QQplot, significance and inflation factor

Now let’s investigate how our results are. We will do so by producing a QQplot, calculating an inflation factor and adjusting significane using FDR correction.

WGcod <- readRDS(file = "../3_SexLinked/15_3_SNPassoc_Output_Raw.RData")

qqpval(WGcod$codominant)

This qqplot plots sorted observed Pvalues on the y axis and expected Pvalues based on aw uniform distribution. If you look at the underlying function these are produced with (1:n)/(n+1) where n is the number of loci tested.

Let’s plot the qqplot with Confidence intervals as well. To do so I use the gg_qqplot function from this blog. First I define the function.

gg_qqplot <- function(ps, ci = 0.95) {
  n  <- length(ps)
  df <- data.frame(
    observed = -log10(sort(ps)),
    expected = -log10(ppoints(n)),
    clower   = -log10(qbeta(p = (1 - ci) / 2, shape1 = 1:n, shape2 = n:1)),
    cupper   = -log10(qbeta(p = (1 + ci) / 2, shape1 = 1:n, shape2 = n:1))
  )
  log10Pe <- expression(paste("Expected -log"[10], plain((p-value))))
  log10Po <- expression(paste("Observed -log"[10], plain((p-value))))
  ggplot(df) +
    geom_ribbon(
      mapping = aes(x = expected, ymin = clower, ymax = cupper),
      alpha = 0.1
    ) +
    geom_point(aes(expected, observed), shape = 1, size = 2) +
    geom_abline(intercept = 0, slope = 1, alpha = 0.5) +
    # geom_line(aes(expected, cupper), linetype = 2, size = 0.5) +
    # geom_line(aes(expected, clower), linetype = 2, size = 0.5) +
    xlab(log10Pe) +
    ylab(log10Po)
}

Then I use it to produce the qqplot.

#retain only non NA pvalues. NA pvalues are returned where the two groups being tested are identical (i.e., exact same numbers for each genotype for the two sexes)
my.ps <- WGcod$codominant[is.na(WGcod$codominant) == FALSE ]

#and plot
q1 <- gg_qqplot(my.ps, ci = 0.95) +
  theme_bw(base_size = 12) +
  theme(
    axis.ticks = element_line(size=0.5),
    panel.grid = element_blank()
  )
q1

Save the resulting graph

ggsave("../3_SexLinked/15_4_SNPassoc_QQplot.jpeg", plot = q1,
       device = "jpeg",dpi = 300,width = 8,height = 4)

This shows us that the lower p-values we obtained are outside the expected 95% Confidence Intervals. Let’s see how many markers are significant at an alpha value of 0.05 and after FDR correction.

as.data.frame(WGcod) %>%
  filter(is.na(comments)==TRUE & is.na(codominant)==FALSE & codominant < 0.05)

##            comments   codominant
## X334_26        <NA> 1.612097e-05
## X334_36        <NA> 1.612097e-05
## X374_45        <NA> 3.725127e-02
## X639_67        <NA> 2.505355e-02
## X650_68        <NA> 2.618523e-03
## X735_59        <NA> 3.878582e-02
## X802_9         <NA> 4.934112e-02
## X802_19        <NA> 4.934112e-02
## X914_10        <NA> 2.735361e-02
## X914_68        <NA> 4.209276e-02
## X1024_12       <NA> 5.197401e-03
## X1054_19       <NA> 3.596809e-02
## X2232_35       <NA> 8.588863e-03
## X2232_36       <NA> 6.864989e-04
## X2232_41       <NA> 4.210526e-03
## X2475_52       <NA> 1.885730e-02
## X3099_27       <NA> 3.259451e-02
## X3325_29       <NA> 3.018402e-02
## X3325_44       <NA> 3.018402e-02
## X3325_61       <NA> 3.018402e-02
## X3331_53       <NA> 1.032433e-02
## X3735_62       <NA> 3.694014e-02
## X3747_57       <NA> 3.731630e-02
## X3908_43       <NA> 4.411765e-02
## X3908_45       <NA> 4.411765e-02
## X3908_51       <NA> 4.411765e-02
## X3908_52       <NA> 4.411765e-02
## X4032_13       <NA> 6.770808e-05
## X4032_53       <NA> 6.770808e-05
## X4693_59       <NA> 4.140787e-04
## X4911_50       <NA> 1.211616e-02
## X4933_10       <NA> 1.619039e-02
## X5219_30       <NA> 4.241063e-02
## X5700_35       <NA> 5.980881e-03
## X5707_18       <NA> 2.736803e-02
## X5707_50       <NA> 2.736803e-02
## X6308_47       <NA> 6.993007e-03
## X6667_8        <NA> 3.197277e-02
## X6667_34       <NA> 2.580246e-02
## X6814_7        <NA> 3.596809e-02
## X6814_41       <NA> 3.913043e-02
## X6981_7        <NA> 3.195239e-02
## X6981_31       <NA> 3.195239e-02
## X7141_47       <NA> 2.837529e-03
## X7194_9        <NA> 4.997501e-05
## X7517_59       <NA> 4.545455e-02
## X7621_64       <NA> 2.747253e-02
## X7627_8        <NA> 3.726708e-02
## X7627_39       <NA> 3.726708e-02
## X7627_61       <NA> 1.372998e-02
## X7627_64       <NA> 3.726708e-02
## X7727_20       <NA> 1.767396e-02
## X8155_52       <NA> 4.151923e-04
## X8268_44       <NA> 8.398770e-03
## X8307_48       <NA> 1.328959e-02
## X8351_15       <NA> 3.250774e-02
## X8351_17       <NA> 3.250774e-02
## X8351_23       <NA> 3.250774e-02
## X9618_22       <NA> 3.852064e-05
## X9618_32       <NA> 3.224194e-06
## X11005_55      <NA> 4.267203e-02
## X11297_62      <NA> 3.171046e-02
## X11376_27      <NA> 4.516660e-02
## X11427_40      <NA> 1.888457e-02
## X12040_47      <NA> 3.229646e-02
## X12065_18      <NA> 1.372998e-02
## X12440_13      <NA> 1.686633e-02
## X12451_48      <NA> 4.489164e-02
## X12700_56      <NA> 2.385965e-02
## X12726_45      <NA> 2.960560e-02
## X12778_11      <NA> 4.214559e-02
## X13015_13      <NA> 1.940508e-03
## X13267_26      <NA> 1.162718e-02
## X13322_25      <NA> 6.676560e-03
## X13326_11      <NA> 4.214559e-02
## X13727_20      <NA> 3.947472e-02
## X13727_22      <NA> 3.947472e-02
## X13894_32      <NA> 3.496503e-03
## X14133_41      <NA> 9.147039e-03
## X14133_57      <NA> 1.134221e-02
## X14158_40      <NA> 2.608696e-02
## X14842_52      <NA> 2.401058e-02
## X14842_65      <NA> 2.401058e-02
## X14924_58      <NA> 2.810607e-02
## X15306_55      <NA> 3.195239e-02
## X15777_44      <NA> 1.102605e-02
## X15777_54      <NA> 1.102605e-02
## X15889_7       <NA> 4.395604e-02
## X15889_16      <NA> 2.777223e-02
## X16011_58      <NA> 2.098118e-02
## X16214_28      <NA> 9.293891e-03
## X16214_47      <NA> 6.639555e-03
## X16681_58      <NA> 4.289216e-02
## X16896_51      <NA> 2.523608e-02
## X16919_17      <NA> 6.770808e-05
## X16941_12      <NA> 3.894598e-02
## X16986_12      <NA> 8.942897e-05
## X16986_61      <NA> 1.377485e-03
## X17224_38      <NA> 3.913043e-02
## X17401_60      <NA> 1.185771e-03
## X17481_61      <NA> 4.456375e-02
## X17724_6       <NA> 1.888457e-02
## X18102_35      <NA> 4.449649e-02
## X18102_49      <NA> 4.449649e-02
## X18329_57      <NA> 5.797101e-04
## X18329_64      <NA> 1.204013e-03
## X18388_21      <NA> 2.747253e-02
## X18747_35      <NA> 2.107871e-02
## X18747_62      <NA> 2.107871e-02
## X18916_33      <NA> 3.173666e-02
## X19156_57      <NA> 2.698322e-02
## X19602_42      <NA> 6.191950e-03
## X19863_14      <NA> 3.088621e-02
## X19863_46      <NA> 3.088621e-02
## X19871_46      <NA> 1.425404e-02
## X19918_47      <NA> 4.114907e-02
## X20053_36      <NA> 4.927761e-02
## X20630_53      <NA> 1.041234e-03
## X20807_40      <NA> 3.181812e-02
## X20918_41      <NA> 2.380952e-02
## X20919_21      <NA> 5.685531e-04
## X21001_58      <NA> 3.964554e-02
## X21001_63      <NA> 3.964554e-02
## X21315_28      <NA> 1.490683e-02
## X21315_41      <NA> 1.490683e-02
## X21315_47      <NA> 1.490683e-02
## X21713_27      <NA> 3.611971e-02
## X22398_31      <NA> 4.884821e-03
## X22744_40      <NA> 3.311037e-02
## X22756_51      <NA> 4.214559e-02
## X22756_52      <NA> 4.214559e-02
## X22776_57      <NA> 1.991800e-02
## X23244_66      <NA> 1.525553e-04
## X23407_20      <NA> 8.061561e-03
## X23597_35      <NA> 4.407602e-02
## X23918_47      <NA> 1.282051e-02
## X24064_66      <NA> 2.484141e-02
## X24066_32      <NA> 1.708823e-04
## X24640_55      <NA> 7.326304e-03
## X24912_64      <NA> 4.471764e-02
## X25072_33      <NA> 6.057343e-04
## X25138_39      <NA> 3.678337e-03
## X25445_21      <NA> 3.340622e-02
## X25445_63      <NA> 3.340622e-02
## X25533_7       <NA> 2.397602e-02
## X25533_10      <NA> 2.397602e-02
## X25533_19      <NA> 2.397602e-02
## X25915_26      <NA> 8.326827e-03
## X25915_46      <NA> 8.326827e-03
## X25915_55      <NA> 1.430206e-02
## X26687_36      <NA> 1.084295e-04
## X27632_8       <NA> 3.164931e-02
## X27755_40      <NA> 1.888457e-02
## X27755_44      <NA> 1.888457e-02
## X27779_50      <NA> 4.338154e-02
## X27779_66      <NA> 4.338154e-02
## X28026_47      <NA> 3.336633e-02
## X28473_53      <NA> 4.482505e-03
## X28534_40      <NA> 2.909467e-02
## X28680_16      <NA> 4.125088e-02
## X29169_39      <NA> 3.410400e-02
## X29717_11      <NA> 4.159587e-02
## X29717_30      <NA> 2.585374e-02
## X29717_58      <NA> 4.159587e-02
## X29883_62      <NA> 2.424242e-02
## X29928_20      <NA> 1.204289e-02
## X30107_7       <NA> 2.580246e-02
## X30107_17      <NA> 2.580246e-02
## X30107_33      <NA> 2.580246e-02
## X30503_14      <NA> 2.824478e-04
## X30536_35      <NA> 8.092728e-04
## X30641_43      <NA> 4.156858e-04
## X30641_54      <NA> 1.399300e-04
## X30763_31      <NA> 1.852537e-02
## X30794_14      <NA> 4.449649e-02
## X30794_40      <NA> 4.449649e-02
## X30933_30      <NA> 1.960784e-02
## X31389_31      <NA> 7.263035e-04
## X31389_42      <NA> 1.054312e-03
## X31389_50      <NA> 1.624188e-03
## X31389_51      <NA> 1.054312e-03
## X31415_50      <NA> 6.993007e-03
## X31958_29      <NA> 9.700774e-03
## X32322_27      <NA> 2.900808e-02
## X32322_50      <NA> 4.259721e-02
## X32707_50      <NA> 7.711895e-04
## X32829_10      <NA> 3.138052e-04
## X32829_41      <NA> 3.138052e-04
## X32829_45      <NA> 3.138052e-04
## X33230_68      <NA> 3.137652e-02
## X33276_26      <NA> 3.432494e-03
## X33613_48      <NA> 2.532769e-02
## X33613_59      <NA> 2.532769e-02
## X33781_49      <NA> 4.338154e-02
## X33808_58      <NA> 3.384210e-03
## X34059_60      <NA> 1.237069e-02
## X34276_9       <NA> 1.904925e-02
## X34276_19      <NA> 1.904925e-02
## X34358_50      <NA> 3.024414e-02
## X34683_51      <NA> 2.714932e-02
## X35595_56      <NA> 3.895525e-02
## X35595_66      <NA> 3.895525e-02
## X35748_28      <NA> 4.338154e-02
## X35842_38      <NA> 6.770808e-05
## X35993_43      <NA> 2.168820e-02
## X36141_59      <NA> 4.833490e-02
## X36195_7       <NA> 4.214559e-02
## X36625_10      <NA> 2.105536e-02
## X36625_63      <NA> 1.699101e-02
## X36885_68      <NA> 2.287582e-02
## X37034_38      <NA> 4.113534e-05
## X37034_66      <NA> 4.113534e-05
## X37466_39      <NA> 4.449649e-02
## X37619_21      <NA> 7.120743e-03
## X37619_33      <NA> 8.218407e-03
## X37619_42      <NA> 8.218407e-03
## X37987_12      <NA> 2.321981e-02
## X37987_26      <NA> 2.361883e-02
## X37987_52      <NA> 2.361883e-02
## X38114_63      <NA> 1.673722e-02
## X38177_9       <NA> 4.449649e-02
## X38177_64      <NA> 4.449649e-02
## X38312_6       <NA> 3.020967e-02
## X38607_48      <NA> 6.770808e-05
## X38866_34      <NA> 6.931540e-03
## X38866_47      <NA> 6.864989e-04
## X39344_24      <NA> 3.728997e-02
## X39344_26      <NA> 3.728997e-02
## X39664_7       <NA> 6.987594e-03
## X39971_50      <NA> 1.212276e-02
## X40145_58      <NA> 2.371542e-02
## X40145_61      <NA> 7.489078e-03
## X40444_17      <NA> 9.810479e-03
## X40998_10      <NA> 2.322204e-02
## X40998_67      <NA> 3.336633e-02
## X41125_40      <NA> 2.514371e-02
## X41261_10      <NA> 3.370798e-03
## X41742_17      <NA> 9.657010e-03
## X41896_48      <NA> 1.818182e-02
## X42297_19      <NA> 3.540925e-02
## X42612_62      <NA> 1.010101e-02
## X42648_64      <NA> 4.113534e-02
## X42662_30      <NA> 4.027500e-02
## X42662_35      <NA> 2.924092e-02
## X42662_36      <NA> 2.924092e-02
## X42662_40      <NA> 2.924092e-02
## X43497_68      <NA> 3.182798e-02
## X43840_52      <NA> 4.154246e-03
## X43924_10      <NA> 3.139030e-02
## X44883_48      <NA> 1.213599e-02
## X45038_61      <NA> 1.292437e-02
## X45038_62      <NA> 1.863328e-02
## X45263_7       <NA> 3.483823e-02
## X45277_20      <NA> 2.810607e-02
## X45277_67      <NA> 2.810607e-02
## X45294_22      <NA> 2.245146e-02
## X45294_55      <NA> 2.245146e-02
## X45299_49      <NA> 1.298701e-02
## X45307_28      <NA> 6.170300e-03
## X45688_44      <NA> 1.359135e-02
## X45990_11      <NA> 4.982669e-02
## X45993_12      <NA> 4.333794e-04
## X46352_67      <NA> 4.757416e-02
## X47206_13      <NA> 6.617716e-03
## X47388_63      <NA> 4.545455e-02
## X47692_48      <NA> 2.275526e-02
## X48241_60      <NA> 2.329920e-02
## X48483_21      <NA> 3.770739e-02
## X48483_48      <NA> 3.194221e-02
## X49716_7       <NA> 2.023220e-02
## X49924_37      <NA> 4.765590e-02
## X49924_47      <NA> 4.765590e-02
## X51163_39      <NA> 7.751371e-03
## X51332_32      <NA> 4.210526e-02
## X51574_47      <NA> 4.576659e-04
## X52278_15      <NA> 1.364248e-02
## X54327_49      <NA> 4.710305e-02
## X54587_64      <NA> 4.395604e-02
## X57428_53      <NA> 5.431671e-03
## X60625_8       <NA> 4.449649e-02
## X60625_41      <NA> 4.449649e-02
## X65816_49      <NA> 1.533166e-02
## X68160_20      <NA> 4.214559e-02
## X69834_64      <NA> 1.035197e-04
## X75021_34      <NA> 9.395302e-04
## X76522_27      <NA> 1.667605e-02
## X76864_52      <NA> 3.229646e-02
## X80361_62      <NA> 1.427974e-02
## X84083_68      <NA> 2.194543e-02
## X86312_62      <NA> 4.377704e-02
## X116502_23     <NA> 2.105263e-02
## X116502_26     <NA> 2.105263e-02
## X116502_27     <NA> 4.912281e-02
## X116502_28     <NA> 4.912281e-02
## X151186_22     <NA> 4.872257e-02
## X153320_24     <NA> 2.357242e-02
## X175541_10     <NA> 3.841391e-02
## X175541_24     <NA> 3.841391e-02
## X178740_42     <NA> 4.074074e-02
## X179603_11     <NA> 3.229646e-02
## X199881_10     <NA> 6.879945e-05
## X203234_39     <NA> 3.373097e-03
## X204127_54     <NA> 2.748826e-03
## X206720_38     <NA> 2.197802e-02
## X207207_38     <NA> 9.121313e-04
## X208246_64     <NA> 3.496503e-02
## X214417_43     <NA> 8.085431e-03
## X214417_53     <NA> 8.085431e-03
## X214417_63     <NA> 3.171046e-02
## X215072_12     <NA> 6.506035e-06
## X215072_57     <NA> 3.311037e-02
## X215072_68     <NA> 5.928854e-03
## X217209_24     <NA> 4.834484e-02
## X217209_35     <NA> 4.834484e-02
## X217209_68     <NA> 4.834484e-02
## X218040_28     <NA> 4.345654e-02
## X218040_32     <NA> 4.345654e-02
## X218040_37     <NA> 3.384116e-02
## X218040_46     <NA> 4.345654e-02
## X218040_61     <NA> 4.345654e-02
## X219037_53     <NA> 2.298137e-02
## X220670_15     <NA> 2.161478e-02

Before correction there are 322 markers significantly associated with sex at an alpha of 0.05. Given we are doing ~10,000 tests we will need to adjust for multiple testing.

ajs_p <- p.adjust(my.ps,method = "fdr", n=length(my.ps))

How many markers are significant now at an alpha value of 0.05?

ajs_p.len <- length(ajs_p[which(ajs_p<=0.05)])
cat('fdr corrected threshold p value is:', my.ps[ajs_p.len], 'total:', ajs_p.len, "\n")

## fdr corrected threshold p value is: 0.2197166 total: 14

Let’s see which are these markers. First we add corrected p-values to our dataframe

WGcod <- as.data.frame(WGcod) %>%
  filter(is.na(codominant)==FALSE) %>% select(-comments)
WGcod <- cbind(WGcod,"p.adjust"=ajs_p)

Now see which markers are significant.

WGcod %>% filter(p.adjust < 0.05) %>% print.data.frame()

##              codominant   p.adjust
## X334_26    1.612097e-05 0.03958907
## X334_36    1.612097e-05 0.03958907
## X4032_13   6.770808e-05 0.04827264
## X4032_53   6.770808e-05 0.04827264
## X7194_9    4.997501e-05 0.04827264
## X9618_22   3.852064e-05 0.04827264
## X9618_32   3.224194e-06 0.03167126
## X16919_17  6.770808e-05 0.04827264
## X35842_38  6.770808e-05 0.04827264
## X37034_38  4.113534e-05 0.04827264
## X37034_66  4.113534e-05 0.04827264
## X38607_48  6.770808e-05 0.04827264
## X199881_10 6.879945e-05 0.04827264
## X215072_12 6.506035e-06 0.03195439

We can see there are four pairs of SNPs that come from the same locus. This is to be expected if those loci are linked to a sex-determining locus. We will explore this further in a second, now let’s finish assessing our GWAS by calculating the genomic inflation factor. This should be close to 1. Values higher than 1 indicate p-value inflation, usually due to population structure.

my.chisq <- qchisq(1 - my.ps, 1)
median(my.chisq) / qchisq(0.5, 1)

## [1] 0.6849137

It is lower than 1, so no population structure. Yet it is a bit too low. I have not figured out why this is the case yet. I will report it and see what reviewers say.

Let’s finish by saving the dataset with adjusted p-values.

WGcod <- WGcod %>%
  rownames_to_column(var="markerID") %>%
  mutate(markerID = gsub("X","",markerID))

And save

write_delim(WGcod,file = "../3_SexLinked/15_4_SNPassoc_Output_FDR.txt",
            delim = "\t",col_names = TRUE)

5. Compare and plot on linkage map

Now that we have conducted the GWAS, let’s see how significance changes across the genome and where the 14 SNPs significantly associated with sex fall on the genome.

First we clear the environment

rm(list=ls())

5.1. Read in all required datasets

We first import the list of sex-linked markers from previous analyses, the list of p-values from the GWAS and the linkage map data.

GWAS data

my.gwas <- read_delim("../3_SexLinked/15_4_SNPassoc_Output_FDR.txt",
                      delim = "\t")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   markerID = col_character(),
##   codominant = col_double(),
##   p.adjust = col_double()
## )

Linkage map

my.map <- read_delim("../2_LinMap/10_4_linkageMap.txt",
                     delim = "\t")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   SNPID = col_character(),
##   fasta = col_character(),
##   Segregation = col_character(),
##   tag = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

Sex-linked markers

my.sexLinked <- read_delim("../3_SexLinked/13_3_SexLinkedMarkers_DM_2.txt",
                           delim = "\t")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   markerID = col_character(),
##   Method = col_character(),
##   SexDet = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

SNP genotype counts by sex

my.gcounts <- readRDS("../3_SexLinked/13_2_SexGenotypeCounts.RData")

Now that we have all of this, let’s add it all to a single dataframe.

5.2. Merge datasets

Create dataframe of all markerIDs across the four datasets

ids1 <- my.gcounts %>% pull(markerID) %>% unique()
ids2 <- my.gwas %>% pull(markerID) %>% unique()
ids3 <- my.map %>% pull(SNPID) %>% unique()
ids4 <- my.sexLinked %>% pull(markerID) %>% unique()

markerID <- unique(c(ids1,ids2,ids3,ids4))

my.df <- as.data.frame(markerID)

rm(ids1,ids2,ids3,ids4,markerID)

Add genotype counts

my.df <- left_join(my.df, my.gcounts, by="markerID")

Add linkage map data to the database of genotype counts

#rename colname of marker IDs so that it matches between datasets
my.map <- my.map %>% rename("markerID"="SNPID")


my.df <- left_join(my.df,my.map, by = "markerID")

Add data on sex-linked loci to this dataframe.

my.sexLinked <- my.sexLinked %>% select(markerID,Method,SexDet,LG_locus,mean_F,mean_M)

my.sexLinked <- pivot_wider(data=my.sexLinked,
                             names_from = Method,
                             values_from = Method)

my.df <- left_join(my.df,my.sexLinked,
                  by=c("markerID"))

Add p-values from the GWAS

my.gwas <- my.gwas %>% rename("GWAS.pvalue"="codominant",
                              "GWAS.FDR"="p.adjust")

my.df <- left_join(my.df,my.gwas, by="markerID")

Add a column to depict significant markers from the GWAS

my.df <- my.df %>%
  mutate("GWAS.associated"=case_when(GWAS.FDR < 0.05 ~ "significant",
                                     GWAS.FDR >= 0.05 ~ "non.significant",
                                     is.na(GWAS.FDR)==TRUE ~ "not.assessed"))

Add Locus ID for all markers (currently only present for linkage map markers). It is contained within the markerID, but it is good to have it easily accessibly in a separate column.

my.df <- my.df %>% mutate(Locus = gsub("_.*","",markerID))

And let’s rename the LepMap ID (currently called ID) to make it more explicit

my.df <- my.df %>% rename("LepmapID"="ID")

5.3. Add fasta sequence to all loci

We currently have fasta sequences only for the linkage map markers, but we want it for all markers, including sex-linked and sex-associated. This might be useful for anyone trying to use this map to blast it to other genomes, or to check the sex-linked/associated markers.

Clear environment

rm(my.gcounts,my.gwas,my.map,my.sexLinked)

To achieve this, first thing we need to do is retain a list of unique Locus IDs first.

my.ids <- my.df %>% pull(Locus) %>% unique()

Next we want to save it as a single column, and as a tab delimited file

write_delim(as.data.frame(my.ids),
            file = "../3_SexLinked/15_5_All_LociIDs.txt",
              delim = "\t",col_names = FALSE)

Now we extract the corresponding fasta sequences from the stacks catalogue with seqtk.

#!/bin/bash

#load seqtk
module load seqtk/1.3

#run seqtk for each family
seqtk subseq ./1_DataPrep/catalog.fa.gz ./3_SexLinked/15_5_All_LociIDs.txt > ./3_SexLinked/15_5_All_Loci.fa

The resulting output has two lines per locus. The first one has the format >locusID NS=... while the second one contains the fasta sequence.

To convert this to tabular format to then merge it with my current table I downloaded pyfaidx version 0.6.4 (05 April 2022). I then used the below command.

faidx --transform transposed 15_5_All_Loci.fa > 15_5_All_Loci.tsv

Let’s import the resulting file

my.fastas <- read_delim("../3_SexLinked/15_5_All_Loci.tsv",
                        delim = "\t",
                        col_names = c("Locus","some",
                                      "length","fasta"))

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Locus = col_double(),
##   some = col_double(),
##   length = col_double(),
##   fasta = col_character()
## )

Let’s retain only columns of interest (i.e., locus, fasta and length)

my.fastas <- my.fastas %>%
  select(Locus, fasta, length) %>%
  mutate(Locus=as.character(Locus))

And merge with the current table

my.df <- left_join(my.df %>% select(-fasta), my.fastas, by = "Locus")

my.df <- my.df %>% select(-length.x) %>% rename("length"="length.y")

Finally, for the 14 markers associated with sex, let’s retrieve whether markers from the same locus have been mapped (sometimes the specific marker was not mapped, but the same locus was).

temp <- my.df %>%
  dplyr::select(Locus,LG,male,female,sex.averaged) %>%
  group_by(Locus) %>%
  summarise(LG_locus = mean(LG,na.rm=TRUE),
            mean_F = mean(female,na.rm = TRUE),
            mean_M = mean(male,na.rm = TRUE),
            mean_Ave = mean(sex.averaged, na.rm = TRUE))

temp <- temp %>%
  mutate(Locus = as.character(Locus))

my.df <- my.df %>% select(-c(LG_locus,mean_F,mean_M))

my.df <- left_join(my.df, temp, by = "Locus")

rm(temp)

Let’s re-arrange columns a bit

my.df <- my.df %>% select(markerID,Locus,LepmapID,fasta,length,
                        LG, male, female, sex.averaged,
                        LG_locus,mean_F,mean_M,mean_Ave,
                        Segregation:NULLS,
                        SexDet,Method1:GWAS.associated,
                        Zero_M:Snp_F)

Let’s save the resulting table, which we will attach as supplementary material to the publication.

write_csv(my.df,file = "../3_SexLinked/15_5_SummaryTable_MapAndSex.csv",
            col_names = TRUE)

6. GWAS plots

Okay now we have all of our data in one neat table. Let’s plot results of the GWAS across linkage groups, and then compare the markers identified with those identified based on heterozygosity methods.

rm(list=ls())

Read in the table. I had to use read.csv and not read_csv because the latter tried to guess my data types for columns based on the first 1000 rows, and freaked out when it encountered non-NA values after 1000 rows if the first 1000 were all NAs.

my.df <- read.csv("../3_SexLinked/15_5_SummaryTable_MapAndSex.csv")

Let’s start by plotting significance of GWAS by linkage group. First we add alternating colours to linkage groups for plotting.

my.df <- my.df %>% mutate(colour = case_when(is.na(LG)==TRUE ~ "NA",
                                             LG %% 2 == 0 ~ "black",
                                             LG %% 2 != 0 ~ "red"))

Now retain a dataframe of markers present on linkage map and the GWAS analysis.

my.gmap <- my.df %>% filter(is.na(LG)==FALSE & is.na(GWAS.FDR)==FALSE)

And now plot it

p1 <- ggplot(my.gmap, aes(x = sex.averaged, y=-log10(GWAS.pvalue), )) + 
  #geom_histogram(binwidth = 1, aes(fill=colour, col=colour)) +
  geom_hline(yintercept=0) +
  geom_point(aes(fill=colour, col=colour)) +
  facet_grid(~LG, scales = "free_x", switch = "x") +
  geom_hline(yintercept=(-log10(0.05))) +
  theme_classic() +
  scale_fill_manual(values = c("black","red")) +
  scale_color_manual(values = c("black","red")) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line.x = element_blank(),
        strip.background = element_blank(),
        legend.position = "none") +
  labs(y = "-log10(pvalue)", x = "Linkage groups")

p1

This first graph shows the original p-values before correction. Now let’s plot it after FDR correction.

p2 <- ggplot(my.gmap, aes(x = sex.averaged, y=-log10(GWAS.FDR), )) + 
  #geom_histogram(binwidth = 1, aes(fill=colour, col=colour)) +
  geom_hline(yintercept=0) +
  geom_point(aes(fill=colour, col=colour)) +
  facet_grid(~LG, scales = "free_x", switch = "x") +
  geom_hline(yintercept=(-log10(0.05))) +
  theme_classic() +
  scale_fill_manual(values = c("black","red")) +
  scale_color_manual(values = c("black","red")) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line.x = element_blank(),
        strip.background = element_blank(),
        legend.position = "none") +
  labs(y = "-log10(pvalue corrected by FDR method)", x = "Linkage groups")

p2

Save both plots

ggsave("../3_SexLinked/15_6_SNPassoc_SignificanceByLG.jpeg", p1,
       device = "jpeg", dpi = 300, width = 8, height = 4)
ggsave("../3_SexLinked/15_6_SNPassoc_SignificanceByLG_FDR.jpeg", p2,
       device = "jpeg", dpi = 300, width = 8, height = 4)

Now let’s plot pvalue versus linkage map position for LG6 only. And let’s highlight markers that were identified by Methods 1 to 5 as well, for both corrected and uncorrected pvalues

both.ps <- my.df %>%
  filter(is.na(SexDet)==FALSE & GWAS.pvalue<0.05) %>%
  pull(markerID)

both.FDR <- my.df %>%
  filter(is.na(SexDet)==FALSE & GWAS.FDR<0.05) %>%
  pull(markerID)

Assign an extra column stating whether those markers were identified by sex-linkage methods as well.

my.df.2 <- my.df %>%
  mutate(both.ps=if_else(markerID%in%both.ps,"both","not")) %>%
  mutate(both.FDR=if_else(markerID%in%both.FDR,"both","not"))

Plot linkage group 6

p3 <- ggplot(my.df.2 %>% filter(LG==6), aes(x = sex.averaged, y=-log10(GWAS.pvalue))) +
  geom_point(aes(fill=both.ps, col=both.ps)) +
  geom_hline(yintercept=(-log10(0.05))) +
  theme_classic() +
  scale_fill_manual(values = c("red","black")) +
  scale_color_manual(values = c("red","black")) +
  theme(strip.background = element_blank(),
        legend.position = "none",
        panel.spacing.x = unit(0,"lines"),
        panel.spacing.y = unit(1,"lines")) +
  labs(y = "-log10(pvalue)", x = element_blank())

p3

## Warning: Removed 518 rows containing missing values (geom_point).

p4 <- ggplot(my.df.2 %>% filter(LG==6), aes(x = sex.averaged, y=-log10(GWAS.FDR))) +
  geom_point(aes(fill=both.FDR, col=both.FDR)) +
  geom_hline(yintercept=(-log10(0.05))) +
  theme_classic() +
  scale_fill_manual(values = c("red","black")) +
  scale_color_manual(values = c("red","black")) +
  theme(strip.background = element_blank(),
        legend.position = "none",
        panel.spacing.x = unit(0,"lines"),
        panel.spacing.y = unit(1,"lines")) +
  labs(y = "-log10(pvalue)", x = "Distance on Linkage Group 6 (cM)")

p4

## Warning: Removed 518 rows containing missing values (geom_point).

Arrange them together and plot:

p5 <- ggarrange(p3,p4, ncol=1)
p5

And save the resulting graph

ggsave("../3_SexLinked/15_6_SNPassoc_SignificanceLG6.jpeg", device="jpeg",
       dpi=300, width = 8, height = 4)

Finally, let’s summarise simply a few things:

How many markers significant after FDR for GWAS:

cat('Total number of markers significant after FDR correction:',
    nrow(my.df %>% filter(GWAS.associated=="significant")), "\n")

## Total number of markers significant after FDR correction: 14

How many XY and ZW markers identified?

cat('Total number of XY and ZW markers identified is',
    length(my.df %>% filter(SexDet=="XY") %>% pull(markerID) %>% unique()),
    'and',
    length(my.df %>% filter(SexDet=="ZW") %>% pull(markerID) %>% unique()),
    'respectively', "\n")

## Total number of XY and ZW markers identified is 29 and 2 respectively

How many were identified by both?

cat(nrow(my.df %>% filter(SexDet=="XY"&GWAS.associated=="significant")),
    'markers were identified by both methods with FDR, and',
    nrow(my.df %>% filter(SexDet=="XY"&GWAS.pvalue<0.06)),
    'were identified by both before correcting for multiple testing')

## 7 markers were identified by both methods with FDR, and 29 were identified by both before correcting for multiple testing

How many of those occur on Linkage group 6?

cat(nrow(my.df %>%
           filter(SexDet=="XY" & GWAS.associated=="significant" & LG==6)),
    'of these markers occurs on LG6 after FDR, while',
    nrow(my.df %>%
           filter(SexDet=="XY" & GWAS.pvalue < 0.05 & LG==6)),
    'of these occur on LG6 before FDR')

## 1 of these markers occurs on LG6 after FDR, while 8 of these occur on LG6 before FDR

How many occur on other Linkage groups?

cat(nrow(my.df %>%
           filter(SexDet=="XY" & GWAS.associated=="significant" & LG!=6)),
    'and',
    nrow(my.df %>%
           filter(SexDet=="XY" & GWAS.pvalue < 0.05 & LG!=6)),
    'of these markers occur on other LGs before and after FDR correction, respectively')

## 0 and 1 of these markers occur on other LGs before and after FDR correction, respectively