1. Background

In the file 1_DataPrep_Stacks.Rmd I produced a catalogue of loci from raw FASTQ files for all Litoria serrata/myola samples we have sequenced thus far.

In the file 2_DataPrep_QCFiltering.Rmd I filtered individuals by call rate, retained only one individual per replicate pair, and then filtered markers by call rate.

Here I will assess segregation distortion in the linkage mapping family only. I will assign a segregation type to each marker by comparing the genotype ratios observed in the offspring to those expected for the different segregation types. Then, using the known parent genotypes and the segregation assigned from the offspring, I will further filter the markers using Mendelian segregation expectations.

1.1. Possible segregation patterns

We counted the 0, 1 and 2 (i.e., AA, AB, BB) genotypes for each SNP so that we can compare expected to observed genotype proportions.
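
For context, here is a minimal sketch of how such counts can be obtained (the matrix geno and its values are hypothetical, made up for illustration; the column names Zero/One/Two match those used later in this document):

#hypothetical genotype matrix: markers in rows, offspring in columns, coded 0/1/2
geno <- rbind(marker1 = c(0, 1, 1, 2, NA),
              marker2 = c(1, 1, 0, 0, 1))

#count the 0/1/2 genotypes per marker, ignoring missing calls
t(apply(geno, 1, function(g) c(Zero = sum(g == 0, na.rm = TRUE),
                               One  = sum(g == 1, na.rm = TRUE),
                               Two  = sum(g == 2, na.rm = TRUE))))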

Now we want to define all possible segregations. For bi-allelic markers there are six possible combinations of parent genotypes; adding the null-allele case (ABx–) gives seven expected segregation patterns:

  • AAxBB -> 0AA:1AB:0BB [Uninformative]
  • AAxAA -> 1AA:0AB:0BB [Uninformative]
  • BBxBB -> 0AA:0AB:1BB [Uninformative]
  • ABx– -> 1AA:0AB:1BB [Null Alleles]
  • ABxAB -> 1AA:2AB:1BB [Double heterozygote]
  • ABxAA -> 1AA:1AB:0BB [Uniparental]
  • ABxBB -> 0AA:1AB:1BB [Uniparental]

To assess which segregation each marker comes from, I will compare the observed genotype proportions to the expected proportions using a chi-square test.

First, I need to define the expected genotype proportions for all 7 possible segregations (listed below, and collected into R code just after the list). Because the chi-square test cannot take expected proportions of 0, I added a small error proportion (from 0.005 to 0.02, depending on the segregation type). I arrived at these proportions through trial and error, checking which markers were successfully identified.

  • “AAxBB” = c(0.005,0.99,0.005)
  • “AAxAA” = c(0.99,0.005,0.005)
  • “BBxBB” = c(0.005,0.005,0.99)
  • “NULLS” = c(0.49, 0.02, 0.49)
  • “ABxAB” = c(0.25, 0.5, 0.25)
  • “ABxAA” = c(0.49, 0.49, 0.02)
  • “ABxBB” = c(0.02, 0.49, 0.49)
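
For convenience, the same expected proportions can be collected into a named list in R; this is just a reference sketch (the object name exp_props is mine), as the HPC script further down hard-codes the same vectors directly:

#expected genotype proportions (AA, AB, BB) for each segregation type
exp_props <- list(
  AAxBB = c(0.005, 0.99, 0.005),
  AAxAA = c(0.99, 0.005, 0.005),
  BBxBB = c(0.005, 0.005, 0.99),
  NULLS = c(0.49, 0.02, 0.49),
  ABxAB = c(0.25, 0.50, 0.25),
  ABxAA = c(0.49, 0.49, 0.02),
  ABxBB = c(0.02, 0.49, 0.49)
)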

2. Assign a segregation to each marker

Let’s load a few packages we will need throughout

library(tidyverse)
library(knitr)
library(pander)

2.1 Calculate chi-square statistic

Basically, for each marker I calculate a p-value indicating whether the observed segregation is significantly different from the expected segregation. I do this for all 7 possible segregation types for each marker.

So, for instance, let’s assume I have a marker, markerID_1, which was successfully genotyped in 120 offspring. Let’s assume that this marker has 30 AA offspring, 60 AB offspring and 30 BB offspring, with the respective proportions of 0.25 AA, 0.50 AB and 0.25 BB. I would expect p-values as follows:

| MarkerID   | AA | AB | BB | AAxAA  | …   | ABxAB  | ABxAA  |
|------------|:--:|:--:|:--:|:------:|:---:|:------:|:------:|
| markerID_1 | 30 | 60 | 30 | < 0.01 | …   | 0.06   | < 0.01 |
| markerID_2 | 60 | 60 | 0  | < 0.01 | …   | < 0.01 | 0.06   |

This is an idealised example, but you’ll see that in practice it is usually pretty clear cut. As input I use a table with markerID and counts of the three possible genotypes (AA, AB, BB), and I add a column for each segregation with the corresponding p-value. I then expect p-values below a certain threshold when observed and expected counts are very different, and above that threshold when they are very similar and thus likely to come from the same segregation. For instance, for markerID_1 we can be pretty confident that it comes from an ABxAB segregation.
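
As a quick sanity check, the idealised markerID_1 example can be reproduced directly with chisq.test() (the counts are the made-up ones from the table above; with simulate.p.value = TRUE the exact p-values vary slightly between runs):

#idealised example: 30 AA, 60 AB, 30 BB offspring
obs <- c(AA = 30, AB = 60, BB = 30)

#perfect match to the ABxAB expectation -> large p-value
chisq.test(x = obs, p = c(0.25, 0.50, 0.25), simulate.p.value = TRUE)$p.value

#poor match to the ABxAA expectation -> very small p-value
chisq.test(x = obs, p = c(0.49, 0.49, 0.02), simulate.p.value = TRUE)$p.value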

I run this step on the HPC, as it needs many cores to finish in a reasonable time. To do so I use an R script called 3_AssignSegregation.R, which I submit through qsub with the shell script 3_AssignSegregation.sh.

This is the R script:

#read in required packages
library(tidyverse)
library(parallel)

#set working directory
setwd("~/ProjectDirectory/1_DataPrep/")

#read in input RData file
metadata <- readRDS("1_3_FilteredMarkerData_metadata.RData")

#select columns of interest
temp <- lapply(metadata, function(x){x %>% select(markerID,Zero:Two)})

#create cluster of 40 cores
cl <- parallel::makeCluster(40)

#calculate chi.square for each possible segregation
fam_seg <- lapply(temp, function(d) {
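  #note: with simulate.p.value = TRUE and the default B = 2000 replicates, the smallest
  #attainable p-value is 1/(B + 1), i.e. ~0.0005 (hence the recurring 0.0004997501 below)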
  AAxBB <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.005,0.99,0.005),simulate.p.value = T)$p.value)
  AAxAA <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.99,0.005,0.005),simulate.p.value = T)$p.value)
  BBxBB <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.005,0.005,0.99),simulate.p.value = T)$p.value)
  NULLS <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.49, 0.02, 0.49),simulate.p.value = T)$p.value)
  ABxAB <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.25, 0.5, 0.25),simulate.p.value = T)$p.value)
  ABxAA <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.49, 0.49, 0.02),simulate.p.value = T)$p.value)
  ABxBB <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.02, 0.49, 0.49),simulate.p.value = T)$p.value)
  d <- cbind(d,AAxBB,AAxAA,BBxBB,NULLS,ABxAB,ABxAA,ABxBB)
  return(d)
})

#stop the cluster
parallel::stopCluster(cl)

#save the results of chi square
saveRDS(fam_seg,file = "3_1_FilteredMarkerData_metadata.RData")

And this is the shell script I use to submit the R script:

#!/bin/bash

#PBS -N segCount
#PBS -l ncpus=40
#PBS -l mem=20g
#PBS -l walltime=24:00:00

#load R 4.0.3
#R 4.0.3 is available through a conda3 environment at the JCU HPC
module load anaconda3
source $CONDA_PROF/conda.sh
conda activate R-403

#set directory to linkage mapping folder
cd ~/ProjectDirectory/1_DataPrep/

#run R script
#this script will save its result within the 1_DataPrep folder
R -f ./0_Rscripts/3_AssignSegregation.R

#close conda environment
conda deactivate

2.2 Decide on an alpha value

Now, using the calculated chi-square p-values, I will assign a segregation to each marker. I will do so by assigning a segregation only if exactly one of the 7 comparisons returned a non-significant result, i.e., the marker is statistically indistinguishable from only one segregation. To do this, though, I first need to identify an ideal alpha value.
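
One way to make the “exactly one non-significant comparison” rule explicit is a small helper like the one below (assign_seg is a hypothetical function, sketched only for illustration; the code that follows instead flags multiple matches with an Above column and assigns segregations per alpha value):

#hypothetical helper: return a segregation name only when exactly one comparison is non-significant
assign_seg <- function(pvals, alpha) {
  hits <- names(pvals)[pvals > alpha]
  if (length(hits) == 1) hits else "NoSeg"
}

#made-up p-values for the 7 segregation types
assign_seg(c(AAxBB = 5e-04, AAxAA = 5e-04, BBxBB = 5e-04, NULLS = 5e-04,
             ABxAB = 0.62, ABxAA = 5e-04, ABxBB = 5e-04), alpha = 0.005)
#returns "ABxAB"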

First, I make sure I have a clean working environment

rm(list = ls())

Load packages

#read in required packages
library(tidyverse)

Then, I can read in the resulting file from the previous step: 3_1_FilteredMarkerData_metadata.RData.

metadata <- readRDS("../1_DataPrep/3_1_FilteredMarkerData_metadata.RData")

I also read in the old metadata file to check that everything worked fine

old.met <- readRDS("../1_DataPrep/2_3_FilteredMarkerData_metadata_HC.RData")

Let’s compare the number of rows to check they match.

nrow(metadata)==nrow(old.met)
## [1] TRUE

Now, let’s see if there are many SNPs with more than one non-significant p-value. Those SNPs would be very similar to more than one segregation, and it would thus be hard or impossible to reliably assign them to a single segregation.

temp <- metadata %>%
  mutate(Above = rowSums(select(., AAxBB:ABxBB)>0.0005))
table(temp$Above)
## 
##     0     1 
##  3808 12537

There are no markers that would be assigned to more than 1 segregation, while 3808 are assigned to no segregation at all.

Let’s look at markers with no assigned segregation.

head(temp %>%
    select(markerID:Two,Above) %>%
    filter(Above == 0))
##   markerID Zero One Two Above
## 1    44_45   71  74  52     0
## 2     49_9   91  70  33     0
## 3    49_43   63 134   0     0
## 4    72_21   68  72  70     0
## 5   107_17  137  63   0     0
## 6   110_40  106  87  41     0

Visually inspecting these, they seem to be mostly highly distorted markers (e.g., genotype ratios of roughly 2:2:1, 1:1:1 or 1:2:0, none of which is expected from any possible parent pair). To try and get a better feel for it, I will check a few extra things.

To do this, we first have to create, for each candidate alpha value, a column assigning the segregation and giving “NoSeg” when no segregation is assigned.

#select only column with p-values from chi-square
y <- temp %>% select(AAxBB:ABxBB)

#assign segregation at different alpha values
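#note: if more than one p-value exceeds the threshold, ifelse() keeps only the first
#matching segregation name; such markers are instead flagged through the Above column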
Seg0005 <- apply(y, 1, function(x) {
    ifelse(any(x > 0.0005), names(x)[x > 0.0005], "NoSeg") })

Seg001 <- apply(y, 1, function(x) {
    ifelse(any(x > 0.001), names(x)[x > 0.001], "NoSeg") })

Seg005 <- apply(y, 1, function(x) {
    ifelse(any(x > 0.005), names(x)[x > 0.005], "NoSeg") })

Seg01 <- apply(y, 1, function(x) {
    ifelse(any(x > 0.01), names(x)[x > 0.01], "NoSeg") })

Seg05 <- apply(y, 1, function(x) {
    ifelse(any(x > 0.05), names(x)[x > 0.05], "NoSeg") })

#bind all together into one dataframe
fam_seg_all <- cbind(temp,Seg0005,Seg001,Seg005,Seg01,Seg05)

#remove unnecessary files
rm(y)

Now, let’s look at the extremes, in terms of genotype counts, for each segregation.

ABxAA: Informative uniparental reference

First we look at ABxAA markers with the most BB genotypes.

head(fam_seg_all %>%
         filter(Seg0005 == "ABxAA") %>%
         arrange(desc(Two))
     )
##   markerID Zero One Two        AAxBB        AAxAA        BBxBB        NULLS
## 1 15369_60  142 140  16 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 2 22715_26  136 134  16 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 3   737_23  144 132  15 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 4  8483_29  131  96  15 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 5 17054_56  115  88  14 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 6 27174_57  135  90  14 0.0004997501 0.0004997501 0.0004997501 0.0004997501
##          ABxAB        ABxAA        ABxBB Above Seg0005 Seg001 Seg005 Seg01
## 1 0.0004997501 0.0009995002 0.0004997501     1   ABxAA  NoSeg  NoSeg NoSeg
## 2 0.0004997501 0.0014992504 0.0004997501     1   ABxAA  ABxAA  NoSeg NoSeg
## 3 0.0004997501 0.0019990005 0.0004997501     1   ABxAA  ABxAA  NoSeg NoSeg
## 4 0.0004997501 0.0009995002 0.0004997501     1   ABxAA  NoSeg  NoSeg NoSeg
## 5 0.0004997501 0.0014992504 0.0004997501     1   ABxAA  ABxAA  NoSeg NoSeg
## 6 0.0004997501 0.0009995002 0.0004997501     1   ABxAA  NoSeg  NoSeg NoSeg
##   Seg05
## 1 NoSeg
## 2 NoSeg
## 3 NoSeg
## 4 NoSeg
## 5 NoSeg
## 6 NoSeg

There is a maximum of 16 BB genotypes out of ~300 individuals, an error rate of roughly 5%. At this stage I consider this acceptable.

Let’s now look at how many of the ABxAA markers were assigned to more than 1 segregation with an alpha value of 0.0001.

head(fam_seg_all %>%
       mutate(Above = rowSums(
         select(., AAxBB:ABxBB)>0.0001)) %>%
       filter(Seg0005 == "ABxAA" & Above > 1) %>%
       arrange(Two)
     )
##   markerID Zero One Two        AAxBB        AAxAA        BBxBB        NULLS
## 1     1_21  143 140   0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 2    42_37  145 157   0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 3    42_58  158 143   0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 4    42_63  158 143   0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 5    55_31  110 102   0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 6    63_62  133 151   0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
##          ABxAB      ABxAA        ABxBB Above Seg0005 Seg001 Seg005 Seg01 Seg05
## 1 0.0004997501 0.05347326 0.0004997501     7   ABxAA  ABxAA  ABxAA ABxAA ABxAA
## 2 0.0004997501 0.03048476 0.0004997501     7   ABxAA  ABxAA  ABxAA ABxAA NoSeg
## 3 0.0004997501 0.02498751 0.0004997501     7   ABxAA  ABxAA  ABxAA ABxAA NoSeg
## 4 0.0004997501 0.03398301 0.0004997501     7   ABxAA  ABxAA  ABxAA ABxAA NoSeg
## 5 0.0004997501 0.10344828 0.0004997501     7   ABxAA  ABxAA  ABxAA ABxAA ABxAA
## 6 0.0004997501 0.03048476 0.0004997501     7   ABxAA  ABxAA  ABxAA ABxAA NoSeg

We can see that if we were to use an alpha value of 0.0001, markers that clearly come from an ABxAA segregation would match several segregation types and thus could not be uniquely identified. Let’s try an alpha value of 0.0005.

head(fam_seg_all %>%
       mutate(Above = rowSums(
         select(., AAxBB:ABxBB)>0.0005)) %>%
       filter(Seg0005 == "ABxAA" & Above > 1) %>%
       arrange(Two)
     )
##  [1] markerID Zero     One      Two      AAxBB    AAxAA    BBxBB    NULLS   
##  [9] ABxAB    ABxAA    ABxBB    Above    Seg0005  Seg001   Seg005   Seg01   
## [17] Seg05   
## <0 rows> (or 0-length row.names)

With this alpha value no markers are assigned to more than one segregation. This is a good result, as it means that markers in this informative category can be identified reliably with an alpha value of 0.0005.

But let’s check whether the markers we identified as ABxAA have genotype counts similar to the expected ones (i.e., 1AA:1AB:0BB). Let’s look at the ABxAA markers most likely to be wrong, those with the lowest p-values.

head(fam_seg_all %>%
       mutate(Above = rowSums(
         select(., AAxBB:ABxBB)>0.0005)) %>%
       filter(Seg0005 == "ABxAA") %>%
       filter(Above > 0) %>%
       arrange(ABxAA) %>%
       select(markerID:Two,ABxAA)
     )
##   markerID Zero One Two        ABxAA
## 1   525_57  126  66   2 0.0009995002
## 2   527_20  144  76   0 0.0009995002
## 3   744_50  147  87   3 0.0009995002
## 4  1006_41   78 127   0 0.0009995002
## 5  2013_49  153  79   0 0.0009995002
## 6  2187_61  144  80  11 0.0009995002

We can see that there is some segregation distortion (e.g., 525_57 has a ratio of roughly 2AA:1AB:0BB), with some erroneous individuals. Because of the low number of erroneous individuals we will retain these markers for now, as they are more likely the result of segregation distortion than of sequencing and genotype-calling error.

Nevertheless, let’s see what they would look like if we used a more conservative alpha value.

head(fam_seg_all %>%
       mutate(Above = rowSums(
         select(., AAxBB:ABxBB)>0.001)) %>%
       filter(Seg0005 == "ABxAA") %>%
       filter(Above > 0) %>%
       arrange(ABxAA) %>%
       select(markerID:Two,ABxAA)
     )
##   markerID Zero One Two      ABxAA
## 1   107_40   74 119   0 0.00149925
## 2   604_27  160 100   4 0.00149925
## 3   613_53  141  92   0 0.00149925
## 4  1038_40  169 116  11 0.00149925
## 5  1114_15  106 154   0 0.00149925
## 6  2067_56   82 143   1 0.00149925

The ratio of AA to AB is closer to 1:1, but we have a similar number of erroneous genotypes (BB individuals).

Using 0.001 as the threshold might be too conservative for these markers. Let’s see how many would fall below it.

nrow(fam_seg_all %>%
         filter(Seg0005 == "ABxAA") %>%
         filter(Above > 0) %>%
         filter(ABxAA <= 0.001)
       )
## [1] 129

Not too many. 0.001 might be an acceptable threshold for ABxAA markers, removing those that are too distorted while retaining as many markers as possible. Let’s look at how many markers we would lose with a more conservative alpha value of 0.005.

nrow(fam_seg_all %>%
         filter(Seg0005 == "ABxAA") %>%
         filter(Above > 0) %>%
         filter(ABxAA <= 0.005)
       )
## [1] 641

Nearly five times as many, so possibly too conservative.

ABx–: Informative uniparental null alleles

Null alleles are those where one of the two parents does not get sequenced at the locus (possibly because of a mutation at the cut site), while the other parent has an AB genotype. All offspring therefore end up typed as either AA or BB. These markers can still be used for linkage mapping if we consider the null-allele parent homozygous. To make it easy, and because all information comes from the heterozygous parent anyway, we can consider that parent as AA, so all A- offspring become AA and all B- offspring become AB.
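
A minimal sketch of this recoding (recode_nulls is a hypothetical helper, not part of the scripts used here; it assumes genotypes are coded 0 = AA, 1 = AB, 2 = BB for a marker already classified as NULLS):

#recode a NULLS marker: treating the null-allele parent as AA,
#A- offspring (coded 0) stay AA and B- offspring (coded 2) become AB (coded 1)
recode_nulls <- function(geno) {
  ifelse(geno == 2, 1, geno)
}

recode_nulls(c(0, 2, 2, 0, NA, 2))
#returns 0 1 1 0 NA 1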

Let’s look at markers identified as NULLS with an alpha value of 0.0005, ordering them by how many incorrect AB individuals there are. NOTE that for these markers the AB individuals might in fact be correct, being the ones for which both alleles could be sequenced despite the mutation at the restriction site (if this is indeed the mechanism generating them).

head(fam_seg_all %>%
       mutate(Above = rowSums(
         select(., AAxBB:ABxBB)>0.0005)) %>%
         filter(Seg0005 == "NULLS") %>%
         arrange(desc(One))
       )
##   markerID Zero One Two        AAxBB        AAxAA        BBxBB        NULLS
## 1  8401_50  126  15 117 0.0004997501 0.0004997501 0.0004997501 0.0009995002
## 2 56458_19  143  12 105 0.0004997501 0.0004997501 0.0004997501 0.0019990005
## 3  7399_44  122  10 116 0.0004997501 0.0004997501 0.0004997501 0.0584707646
## 4 21073_27  144  10 106 0.0004997501 0.0004997501 0.0004997501 0.0064967516
## 5  5670_52  114   9 105 0.0004997501 0.0004997501 0.0004997501 0.0814592704
## 6 78356_61  124   9 104 0.0004997501 0.0004997501 0.0004997501 0.0659670165
##          ABxAB        ABxAA        ABxBB Above Seg0005 Seg001 Seg005 Seg01
## 1 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NoSeg  NoSeg NoSeg
## 2 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NULLS  NoSeg NoSeg
## 3 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NULLS  NULLS NULLS
## 4 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NULLS  NULLS NoSeg
## 5 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NULLS  NULLS NULLS
## 6 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NULLS  NULLS NULLS
##   Seg05
## 1 NoSeg
## 2 NoSeg
## 3 NULLS
## 4 NoSeg
## 5 NULLS
## 6 NULLS

We have one marker with an ‘error rate’ (used loosely here) of ~5%, while all other markers have an error rate below 3-4%.

Furthermore, it seems that if we used an alpha of 0.005 instead of 0.0005 we would filter out the markers with more than 10 individuals genotyped as AB (of which we would expect 0 for these markers).

head(fam_seg_all %>%
       mutate(Above = rowSums(
         select(., AAxBB:ABxBB)>0.005)) %>%
       filter(Seg0005 == "NULLS") %>%
       filter(NULLS > 0.005) %>%
       arrange(desc(One))
       )
##   markerID Zero One Two        AAxBB        AAxAA        BBxBB       NULLS
## 1  7399_44  122  10 116 0.0004997501 0.0004997501 0.0004997501 0.058470765
## 2 21073_27  144  10 106 0.0004997501 0.0004997501 0.0004997501 0.006496752
## 3  5670_52  114   9 105 0.0004997501 0.0004997501 0.0004997501 0.081459270
## 4 78356_61  124   9 104 0.0004997501 0.0004997501 0.0004997501 0.065967016
## 5  13527_8  143   7 112 0.0004997501 0.0004997501 0.0004997501 0.110444778
## 6 21772_25  115   7  99 0.0004997501 0.0004997501 0.0004997501 0.257371314
##          ABxAB        ABxAA        ABxBB Above Seg0005 Seg001 Seg005 Seg01
## 1 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NULLS  NULLS NULLS
## 2 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NULLS  NULLS NoSeg
## 3 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NULLS  NULLS NULLS
## 4 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NULLS  NULLS NULLS
## 5 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NULLS  NULLS NULLS
## 6 0.0004997501 0.0004997501 0.0004997501     1   NULLS  NULLS  NULLS NULLS
##   Seg05
## 1 NULLS
## 2 NoSeg
## 3 NULLS
## 4 NULLS
## 5 NULLS
## 6 NULLS

Let’s look at how easily NULLS markers are identified by assessing how many segregations they would be assigned to at an alpha value of 0.0005.

head(fam_seg_all %>%
       mutate(Above = rowSums(
         select(., AAxBB:ABxBB)>0.0005)) %>%
         filter(Seg0005 == "NULLS" & Above > 1) %>%
         arrange(One)
       )
##  [1] markerID Zero     One      Two      AAxBB    AAxAA    BBxBB    NULLS   
##  [9] ABxAB    ABxAA    ABxBB    Above    Seg0005  Seg001   Seg005   Seg01   
## [17] Seg05   
## <0 rows> (or 0-length row.names)

With an alpha value of 0.0005 no NULL markers are assigned to more than one segregation.

Let’s look at the NULLS markers with the lowest p-values.

head(fam_seg_all %>%
       mutate(Above = rowSums(
         select(., AAxBB:ABxBB)>0.0005)) %>%
         filter(Seg0005 == "NULLS") %>%
         filter(Above > 0) %>%
         arrange(NULLS) %>%
         select(markerID:Two,NULLS)
       )
##   markerID Zero One Two        NULLS
## 1  8401_50  126  15 117 0.0009995002
## 2 51449_58  131   1  70 0.0009995002
## 3 27575_56  141   0  90 0.0014992504
## 4 56458_19  143  12 105 0.0019990005
## 5 40616_56  121   2  74 0.0029985007
## 6   6575_8  149   7  98 0.0039980010

These include the two markers with more than 10 erroneous genotypes. Having an alpha value of 0.005 in this case would filter those two markers out.

nrow(fam_seg_all %>%
         filter(Seg0005 == "NULLS") %>%
         filter(Above > 0) %>%
         filter(NULLS <= 0.005)
       )
## [1] 7

Further, it would remove only 7 markers total, so alpha = 0.005 is probably a good threshold for NULLS markers.

ABxAB: Informative double-heterozygote

Once again, let’s start by seeing if these markers are consistently assigned to one segregation only at a non-conservative alpha value of 0.0005.

head(fam_seg_all %>%
       mutate(Above = rowSums
              (select(., AAxBB:ABxBB)>0.0005)) %>%
       filter(Seg0005 == "ABxAB" & Above > 1)
       )
##  [1] markerID Zero     One      Two      AAxBB    AAxAA    BBxBB    NULLS   
##  [9] ABxAB    ABxAA    ABxBB    Above    Seg0005  Seg001   Seg005   Seg01   
## [17] Seg05   
## <0 rows> (or 0-length row.names)

There are no markers assigned to more than 1 segregation. Good result!

In light of this, let’s check the distribution across segregation types of markers with a p-value > 0.0005 for more than one segregation type.

temp2 <- fam_seg_all %>%
  mutate(Above = rowSums(
    select(., AAxBB:ABxBB)>0.0005)) %>%
  filter(Above > 1)

table(temp2$Seg0005)
## < table of extent 0 >

There are no markers assigned to more than one segregation. From applying this workflow to other families with fewer than ~150 individuals, I know that this is a result of our large number of offspring (~300): for smaller families I would get at least ~50 markers assigned to more than one segregation.

Let’s now plot the number of markers assigned to each segregation across the different alpha values

alpha <- c(0.0005,0.001,0.005,0.01,0.05)

t1 <- table(fam_seg_all$Seg0005)
t2 <- table(fam_seg_all$Seg001)
t3 <- table(fam_seg_all$Seg005)
t4 <- table(fam_seg_all$Seg01)
t5 <- table(fam_seg_all$Seg05)

t6 <- rbind(t1,t2,t3,t4,t5)
## Warning in rbind(t1, t2, t3, t4, t5): number of columns of result is not a
## multiple of vector length (arg 4)
t6 <- cbind(t6,alpha)

#tables for some alpha values are missing a segregation category, so rbind() recycled
#values (hence the warning above); overwrite those rows with the correct counts
t6[4,] <- c(0,843,8083,1958,5253,208,0.01)
t6[5,] <- c(0,792,3049,1795,10548,161,0.05)

fam_seg_count <- t6 %>% as_tibble %>%
    pivot_longer(!alpha, names_to="Seg",values_to="Count")

rm(t1,t2,t3,t4,t5,t6)

And plot it

ggplot(fam_seg_count,
       aes(x=alpha, y=Count, colour=Seg)) +
  geom_point() + geom_line() +
  scale_x_continuous(trans='log10') +
  theme_classic()

From this graph it seems obvious that 0.05 is too strict, as more than 10,000 markers end up not being assigned to a segregation. I will thus assign a segregation to markers using a threshold of 0.005, which removes most of the markers we found with more than 10 erroneous offspring while retaining a good number of markers.

#remove markers assigned to more than 1 segregation
#should be zero. This is a leftover from analysing a larger dataset of 4 families. Still thought I'd leave it
fam_seg <- fam_seg_all %>%
  mutate(Above = rowSums(
    select(., AAxBB:ABxBB)>0.0005)) %>%
  filter(Above < 2)

#now let's assign a segregation only if the chi-square p-value is above 0.005
#IMPORTANT: THIS MEANS WE ASSIGN A SEGREGATION ONLY IF THE OBSERVED GENOTYPE PROPORTIONS ARE NOT SIGNIFICANTLY DIFFERENT FROM THE EXPECTED PROPORTIONS FOR THAT SEGREGATION.
y <- fam_seg %>% select(AAxBB:ABxBB)
Segregation <- apply(y, 1, function(x) {
    ifelse(any(x > 0.005), names(x)[x > 0.005], "NoSeg") })

fam_seg <- cbind(fam_seg[,1:11],Segregation)

Let’s look at the number of markers per segregation now

table(fam_seg$Segregation)
## 
## AAxAA AAxBB ABxAA ABxAB NoSeg NULLS 
##     4   863  8607  2005  4652   214

Most markers (8607) are uniparental (ABxAA), which are the preferred markers for linkage mapping. 2005 markers come from a double-heterozygote cross, 214 are NULLS markers, and 4652 could not be assigned to a segregation.

Let’s check the extremes for our informative segregations once more, to confirm we only retained markers for which we can confidently assign a segregation.

ABxAA The fewest AAs:

head(fam_seg %>%
         filter(Segregation == "ABxAA") %>%
         arrange(Zero) %>%
         select(markerID:Two,ABxAA)
       )
##   markerID Zero One Two       ABxAA
## 1 43206_68   73 106   8 0.006996502
## 2  6354_66   77 108   0 0.011994003
## 3 38636_16   77 112   0 0.008995502
## 4 20262_64   78 107   0 0.010994503
## 5 23675_31   78 113   1 0.011994003
## 6 23675_35   78 115   0 0.005497251

Skewed but few (<10 erroneous genotypes).

Now the fewest ABs

head(fam_seg %>%
         filter(Segregation == "ABxAA") %>%
         arrange(One) %>%
         select(markerID:Two,ABxAA)
       )
##   markerID Zero One Two       ABxAA
## 1 35205_52  114  74   2 0.010994503
## 2 64937_19  116  74   2 0.013493253
## 3  2713_25  113  75   1 0.009995002
## 4 18761_24  121  75   3 0.006996502
## 5 27228_46  116  75   5 0.015492254
## 6 38740_16  116  75   1 0.008495752

Same story, skewed but few (<6) erroneous genotypes.

Now the most BBs, of which we should have none.

head(fam_seg %>%
         filter(Segregation == "ABxAA") %>%
         arrange(desc(Two)) %>%
         select(markerID:Two,ABxAA)
       )
##    markerID Zero One Two       ABxAA
## 1  13007_16  148 131  13 0.010494753
## 2  13007_47  148 131  13 0.005497251
## 3 137542_36  156 134  13 0.009495252
## 4   6086_55  126 107  12 0.005497251
## 5   6377_17  138 142  12 0.039480260
## 6   6377_18  138 142  12 0.028985507

A maximum of 13, which is less than a 5% error rate. Acceptable. We will filter those genotype calls out later.

ABx– Let’s repeat these steps for ABx– markers

head(fam_seg %>%
         filter(Segregation == "NULLS") %>%
         arrange(Zero) %>%
         select(markerID:Two,NULLS)
       )
##    markerID Zero One Two     NULLS
## 1  30571_49   94   3  95 0.9445277
## 2  30571_66   94   2  98 0.6246877
## 3  84861_58   99   0 101 0.1194403
## 4  33265_46  100   3 105 0.7906047
## 5 210906_31  100   2 115 0.3323338
## 6 210906_32  100   2 116 0.2983508
head(fam_seg %>%
         filter(Segregation == "NULLS") %>%
         arrange(desc(One)) %>%
         select(markerID:Two,NULLS)
       )
##   markerID Zero One Two       NULLS
## 1  7399_44  122  10 116 0.058470765
## 2 21073_27  144  10 106 0.006496752
## 3  5670_52  114   9 105 0.081459270
## 4 78356_61  124   9 104 0.065967016
## 5  13527_8  143   7 112 0.110444778
## 6 21772_25  115   7  99 0.257371314
head(fam_seg %>%
         filter(Segregation == "NULLS") %>%
         arrange(Two) %>%
         select(markerID:Two,NULLS)
       )
##   markerID Zero One Two      NULLS
## 1 33512_30  111   6  85 0.12093953
## 2 83568_57  115   0  89 0.02698651
## 3 33574_28  126   6  90 0.04697651
## 4 83568_25  116   0  90 0.02348826
## 5  3244_44  105   0  91 0.08245877
## 6 13823_51  122   2  91 0.04997501

Maximum of 10 erroneous individuals and genotype ratios very close to 1:0:1 in all 3 extreme cases.

ABxAB And for ABxAB markers.

head(fam_seg %>%
         filter(Segregation == "ABxAB") %>%
         arrange(Zero) %>%
         select(markerID:Two,ABxAB)
       )
##    markerID Zero One Two     ABxAB
## 1 176909_14   38 106  41 0.1459270
## 2 176909_32   38 106  41 0.1279360
## 3   1296_35   41 108  41 0.1854073
## 4   1296_20   42 106  42 0.2893553
## 5   1296_51   42 106  41 0.2538731
## 6  43457_39   44 104  40 0.3153423
head(fam_seg %>%
         filter(Segregation == "ABxAB") %>%
         arrange(One) %>%
         select(markerID:Two,ABxAB)
       )
##   markerID Zero One Two       ABxAB
## 1 90277_30   62  71  54 0.005497251
## 2 79093_58   57  73  56 0.013493253
## 3  4248_51   60  74  56 0.011994003
## 4   962_46   60  75  57 0.007496252
## 5 24044_13   61  75  57 0.009495252
## 6 55621_42   62  75  54 0.006996502
head(fam_seg %>%
         filter(Segregation == "ABxAB") %>%
         arrange(Two) %>%
         select(markerID:Two,ABxAB)
       )
##    markerID Zero One Two       ABxAB
## 1   1335_25   62  98  33 0.014992504
## 2   1335_68   62  95  34 0.013493253
## 3 121317_30   59 105  34 0.028485757
## 4     780_8   62  93  35 0.024487756
## 5   6690_10   66  90  35 0.005497251
## 6     46_28   49 109  37 0.120939530

This approach seems to work well. For now I feel confident that filtering by segregation distortion will deal with most of the dubious markers in these three categories (ABxAA, ABxAB and ABx–).

3. Splitting uninformative and informative markers

Let’s define the three categories

SegAss <- c("ABxAA","ABxAB","NULLS")
NoSeg <- c("NoSeg")
UnInf <- c("AAxBB","AAxAA","BBxBB")

Let’s re-display the number of markers in each segregation, to check the output’s accuracy

table(fam_seg$Segregation)
## 
## AAxAA AAxBB ABxAA ABxAB NoSeg NULLS 
##     4   863  8607  2005  4652   214

And now let’s save an R object for each category

3.1 Uninformative markers

fam_UnInf  <- fam_seg %>% filter(Segregation %in% UnInf)

nrow(fam_UnInf)
## [1] 867
#save(fam_UnInf, file = "../1_DataPrep/3_2_FilteredMarkerData_Uninformative.RData")

3.2 Informative markers

fam_SegAss <- fam_seg %>% filter(Segregation %in% SegAss)

nrow(fam_SegAss)
## [1] 10826
#saveRDS(fam_SegAss, file = "../1_DataPrep/3_2_FilteredMarkerData_Informative.RData")

3.3 No segregation assigned

fam_NoSeg <- fam_seg %>% filter(Segregation %in% NoSeg)

nrow(fam_NoSeg)
## [1] 4652
#save(fam_NoSeg, file = "../1_DataPrep/3_2_FilteredMarkerData_NoSegAss.RData")

4. Conclusion

Done! In this section we identified an ideal alpha value to distinguish between markers that can and cannot be confidently assigned to a segregation. We then proceeded to assign a segregation and saved three files: one with informative markers (ABxAA, ABxAB and ABx–); one with uninformative markers (AAxBB, AAxAA, BBxBB); and one with markers for which a segregation could not be identified.

In the next step we will silence individual genotypes and remove markers when the offspring and parent genotypes do not match.