1. Background

In this section am going to filter and correct markers based on the expected mendelian segregations. I am going to do this by: - deleting markers where neither of the two parents matches the segregation observed in the offspring (e.g., the parents are AAxBB and the offspring have a 1AA:2AB:1BB propotion, thus expecting ABxAB parents) - correcting parents if correcting only one parent matches the expected segregation based on the offspring genotypes (e.g., parents are AAxAB and the offspring have a 1AA:2AB:1BB propotion, thus expecting ABxAB parents) - silence erroneour offspring genotypes (e.g., observed proportions match an AAxAB segregation, parents are indeed AAxAB but there are 2% of individuals with BB genotypes.)

First, I make sure I have a clear working environment

rm(list = ls())

And load the required packages

library(tidyverse)

2. Load and manipulate data

Genotype data

fam_dfs <- readRDS("../1_DataPrep/2_3_FilteredMarkerData_withParents_HC.RData")

Segregation data for informative markers only

fam_SegAss <- readRDS("../1_DataPrep/3_2_FilteredMarkerData_Informative.RData")

First, let’s have a look at the numbers of markers in each

Number of markers in the genotype tibbles

## [1] 16345

Number of markers in the segregation tibbles

## [1] 10826

These are pretty good numbers of informative markers.

Now, let’s join the two dataframes, to retain only the markers in fam_SegAss. We can do this with the inner_join function, which returns all values from x where there are matching values in y.

fam_dfs <- inner_join(fam_SegAss, fam_dfs, by = "markerID")

Before we proceed, let’s check that the new genotype tibbles have the same markers as the segregation tibbles

sum(fam_dfs$markerID == fam_SegAss$markerID) == nrow(fam_SegAss)
## [1] TRUE

Finally, let’s remove the columns with p-values from the chi.square test as we don’t need those anymore. We will keep everything else.

fam_dfs <- fam_dfs %>% select(-c(AAxBB:ABxBB))

3. Correct Parent Genotypes

And now let’s correct parent genotypes if possible, silencing offsprings with erroneous genotypes, and delete markers where the parent segregation doesn’t match the assigned segregation.

To make my life easier in later if statements, I create a column containing the genotypes for both parents.

fam_dfs <- fam_dfs %>% tidyr::unite("parents",
                                    ends_with("_FAT_M1"),
                                    ends_with("_MOT_F1"),
                                    remove = FALSE)

Now we can start correcting the parents according to the segregation assigned to that marker. There are three informative segregations: ABxAB, ABxAA and NULL alleles.

Let’s start with ABxAB. First, we break down what combinations of parent genotypes are correct, which ones can be corrected, and which ones can not.

ABxAB

Correct combinations

  • AB x AB [1_1]

Combinations that CAN be corrected

  • AB x AA [1_0] -> AB x AB
  • AA x AB [0_1] -> AB x AB
  • AB x BB [1_2] -> AB x AB
  • BB x AB [2_1] -> AB x AB

Combinations that CAN NOT be corrected

  • AA x BB [0_1]
  • BB x AA [1_0]
  • AA x AA [0_0]
  • BB x BB [1_1]

Note that for the last group, we could correct them and change both parents to ABxAB, but for this pipeline I decided to be more conservative and retain markers only if I could correct parent pairs by changing one parent only. This is also true for markers when one parent is correct and the other was not genotyped (e.g., 2_-)

Now, if the parent genotypes are correct we won’t do anything, if they can be corrected we correct them, and if they can’t be corrected we filter them out.

First, we create a dataframe of only ABxAB markers

fam_ABxAB <- fam_dfs %>% filter(Segregation == "ABxAB")

Let’s check parent genotypes for such markers

table(fam_ABxAB$parents)
## 
## -1_-1  -1_1  0_-1   0_2  1_-1   1_1  2_-1   2_0   2_1 
##   268   118     1    34   293  1244     2    44     1

Most parent pairs are already correct (both heterozygous 1), which is a good sign (1244 out of 2005). These are the markers for which we have the most confidence of them being correct. Let’s make a column assigning them the CORR tag, and all other markers the ERR tag.

For the remaining markers, we can see there are 2 categories that contain most of the remaining markers, -1_1 and 1_-1. These represent markers where one parent is heterozygous and the other wasn’t successfully sequenced. We will correct only those ones in this instance, as the other ones that could be corrected represent very small proportions, and thus the risk/return factor is not very good (as in, we would be retaining possibly incorrect markers, but there’s only a few of those so not worth the risk).

Finally, we will leave all markers where neither parent is heterozygous and filter them out.

fam_ABxAB <- fam_ABxAB %>%
  #assign tag "CORR" if parents already correct
    mutate(tag = ifelse(parents == "1_1", "CORR", "ERR")) %>%
  #correct parents from "-1_1" to "1_1"
    mutate(parents = case_when(
      parents == "-1_1" ~ "1_1",
      parents == "1_-1" ~ "1_1",
      TRUE ~ parents)) %>%
  #filter if parents are not "1_1" at this stage
    filter(parents == "1_1") %>%
  #split genotype of corrected parents
    separate(parents,c("FAT_M1","MOT_F1"),sep = "_") %>%
  #remove old uncorrected genotypes
    select(-ends_with("_FAT_M1"),-ends_with("_MOT_F1"))

Check how many markers are retained

nrow(fam_ABxAB)
## [1] 1655

Check how many were correct from the start and how many were corrected

table(fam_ABxAB$tag)
## 
## CORR  ERR 
## 1244  411

NULLS

Correct combinations

  • AB x – [1_-1]
  • – x AB [-1_1]

In this case, we will retain only markers where the parents already have the right genotype. But we will have to recode them. The -1 parent will become AA, AA offspring will become AB, and BB offspring will stay the same. AB offspring on the other hand will be silenced to -1.

fam_NULLS <- fam_dfs %>% filter(Segregation == "NULLS")
table(fam_NULLS$parents)
## 
## -1_-1  -1_1   0_0  1_-1 
##    78    64     1    71

Once again, most parents already have the correct genotype (135 out of 214).

Now we can modify parents genotypes from null alleles (AB and “-1” for missing data; “-1_1” and “1_-1”), to genotype calls that can be used for linkage mapping (AB and AA; "1_0). We are basically treating the null allele as an allele. It could be anything, (A, B or even a third allele C). But for simplicity, and because all other SNPs are biallelic, we keep null alleles biallelic too.

fam_NULLS <- fam_NULLS %>% 
  #retain only ABx-- parents
    filter(parents == "1_-1" | parents == "-1_1") %>%
  #change from ABx-- to ABxAA
    mutate(parents = case_when(
      parents == "1_-1" ~ "1_0",
      parents == "-1_1" ~ "0_1")) %>%
  #separate correct parent genotypes into two separate columns
    separate(parents,c("FAT_M1","MOT_F1"),sep = "_") %>%
  #remove non-corrected parent genotypes
    select(-ends_with("_MOT_F1"),-ends_with("_FAT_M1"))

Then we have to modify the offspring genotypes.

fam_NULLS <- fam_NULLS %>%
  #mutate only offspring (all start with LS)
    mutate_at(vars(starts_with("LS")),
              #change AB to missing 
              function(d) case_when(d == 1 ~ -1,
              #change BB to AB (AA stays the same)
                                    d == 2 ~ 1,
                                    TRUE ~ d)
              )

And finally we add a column called tag, to match the tibble for other informative segregations. For null alleles, we retained only markers were both parents were correct, so we give all of them the tag CORR

fam_NULLS <- fam_NULLS %>% mutate(tag = "CORR")

Let’s check how many null markers are left for each family:

nrow(fam_NULLS)
## [1] 135

Not many. This is interesting. When I did this analysis on the genotype data called by dart 1/3 of the markers were null alleles. Processing the raw fastq reads addressed this ‘issue’ in large part. Will need to investigate why there’s this massive difference. I suspect that it is because dart doesn’t use the 3 sequencing runs for the parents to build the ‘stacks’ (i.e., loci). They thus have way lower coverage for many of the markers in the parents and end up losing those markers.

ABxAA

Now onto the last class, ABxAA markers. To start, let’s see the distribution of parent genotypes

fam_ABxAA <- fam_dfs %>% filter(Segregation == "ABxAA")

table(fam_ABxAA$parents)
## 
## -1_-1  -1_0  -1_1  -1_2  0_-1   0_0   0_1   0_2  1_-1   1_0   1_1  2_-1   2_0 
##  1175   252   244    75   620    12  2715   138   530  2551    11    98   186

Fortunately, most parents are correct (2715 and 2551 out of 8607, or 5266/8607). These will be the markers we will use to build the framework maps most likely. But to try and increase the density of our maps, let’s see what other markers we might be able to correct.

Correct combinations

  • AB x AA [1_0]
  • AA x AB [0_1]

Combinations that CAN be corrected by modifying parent with missing data

  • AA x – [0_-1] -> AA x AB [0_1]
  • – x AA [-1_0] -> AB x AA [1_0]
  • – x AB [-1_1] -> AA x AB [0_1]
  • AB x – [1_-1] -> AB x AA [1_0]

Combinations that COULD be corrected by modifying one parent only

  • AA x BB [0_2] -> AA x AB [0_1]
  • BB x AA [2_0] -> AB x AA [1_0]

Combinations that can’t be corrected by modifying one parent only

  • – x – [-1_-1]
  • – x BB [-1_2]
  • BB x – [2_-1]
  • AA x AA [0_0]
  • AB x AB [1_1]

Before we correct them, we will have a look at what proportion the genotypes are in for each of the 6 parent genotype pairs that could be corrected to see how confident we are in their assigned segregation.

head(fam_ABxAA %>%
    select(markerID:Two,parents) %>%
    filter(parents == "0_-1"))
##   markerID Zero One Two parents
## 1   124_59  154 123   1    0_-1
## 2   183_33  133 150   2    0_-1
## 3   217_21  151 145   0    0_-1
## 4   221_65  149 151   0    0_-1
## 5   324_55  158 130   4    0_-1
## 6   324_66  155 131   4    0_-1
head(fam_ABxAA %>%
    select(markerID:Two,parents) %>%
    filter(parents == "0_-1")) %>%
    arrange(desc(Two))
##   markerID Zero One Two parents
## 1   324_55  158 130   4    0_-1
## 2   324_66  155 131   4    0_-1
## 3   183_33  133 150   2    0_-1
## 4   124_59  154 123   1    0_-1
## 5   217_21  151 145   0    0_-1
## 6   221_65  149 151   0    0_-1

A maximum of 4 erroneous genotypes out of 300. This is pretty good.

head(fam_ABxAA %>%
    select(markerID:Two,parents) %>%
    filter(parents == "-1_0"))
##   markerID Zero One Two parents
## 1   227_12  156 148   0    -1_0
## 2   227_39  156 148   0    -1_0
## 3   227_68  150 154   0    -1_0
## 4   293_42  141 131   0    -1_0
## 5    334_8  148 153   0    -1_0
## 6   334_17  147 153   0    -1_0
head(fam_ABxAA %>%
    select(markerID:Two,parents) %>%
    filter(parents == "-1_0")) %>%
    arrange(desc(Two))
##   markerID Zero One Two parents
## 1   227_12  156 148   0    -1_0
## 2   227_39  156 148   0    -1_0
## 3   227_68  150 154   0    -1_0
## 4   293_42  141 131   0    -1_0
## 5    334_8  148 153   0    -1_0
## 6   334_17  147 153   0    -1_0

Offspring genotypes for -1_0 and 0_-1 markers seem to perfectly match an ABxAA segregation, one of their parents is correct, and the other has a missing genotype. Thus, we will correct those.

head(fam_ABxAA %>%
    select(markerID:Two,parents) %>%
    filter(parents == "-1_1"))
##   markerID Zero One Two parents
## 1   171_49  144 159   0    -1_1
## 2    175_8  135 168   0    -1_1
## 3   210_25  162 144   0    -1_1
## 4   210_54  162 144   0    -1_1
## 5   325_64  156 134   0    -1_1
## 6   668_40  131 144   0    -1_1
head(fam_ABxAA %>%
    select(markerID:Two,parents) %>%
    filter(parents == "-1_1")) %>%
    arrange(desc(Two))
##   markerID Zero One Two parents
## 1   171_49  144 159   0    -1_1
## 2    175_8  135 168   0    -1_1
## 3   210_25  162 144   0    -1_1
## 4   210_54  162 144   0    -1_1
## 5   325_64  156 134   0    -1_1
## 6   668_40  131 144   0    -1_1
head(fam_ABxAA %>%
    select(markerID:Two,parents) %>%
    filter(parents == "1_-1"))
##   markerID Zero One Two parents
## 1    42_37  145 157   0    1_-1
## 2    42_58  158 143   0    1_-1
## 3    42_63  158 143   0    1_-1
## 4    63_62  133 151   0    1_-1
## 5    79_31  141 157   0    1_-1
## 6    90_51  156 145   0    1_-1
head(fam_ABxAA %>%
    select(markerID:Two,parents) %>%
    filter(parents == "1_-1")) %>%
    arrange(desc(Two))
##   markerID Zero One Two parents
## 1    42_37  145 157   0    1_-1
## 2    42_58  158 143   0    1_-1
## 3    42_63  158 143   0    1_-1
## 4    63_62  133 151   0    1_-1
## 5    79_31  141 157   0    1_-1
## 6    90_51  156 145   0    1_-1

The same goes for 1_-1 and -1_1 markers, so we will correct those as well.

On the other hand, there are no 1_2 and 2_1 markers, and 0_2 and 2_0 markers are few, and we would have to modify a parent call, instead of just calling a missing genotype. For these reasons, we will not correct those markers.

So, first step, let’s correct the parents from 1_-1, -1_1, 0_-1 and -1_0 markers.

fam_ABxAA <- fam_ABxAA %>%
  #assign "CORR" tag if parents already correct
  mutate(tag = ifelse((parents == "1_0" | parents == "0_1"),
                      "CORR", "ERR")) %>%
  #correct parents where one correct and the other is NA
  mutate(parents = case_when(
    parents == "-1_0" ~ "1_0",
    parents == "0_-1" ~ "0_1",
    parents == "-1_1" ~ "0_1",
    parents == "1_-1" ~ "1_0",
    TRUE ~ parents)) %>%
  #retain only correct markers at this stage
  filter(parents == "0_1" | parents == "1_0") %>%
  #separate parent genotypes
  separate(parents,c("FAT_M1","MOT_F1"),sep = "_") %>%
  #remove old parent genotypes
  select(-ends_with("_MOT_F1"),-ends_with("_FAT_M1"))

Then, let’s silence all erroneous offspring

fam_ABxAA <- fam_ABxAA %>%
  #correct only offspring genotypes (they start with "LS")
    mutate_at(vars(starts_with("LS")),
              #silence BB offspring
              function(d) case_when(d == 2 ~ -1,
                                    TRUE ~ d)
              )

Check how many markers are retained

nrow(fam_ABxAA)
## [1] 6912

Check how many were correct from the start and how many were corrected

table(fam_ABxAA$tag)
## 
## CORR  ERR 
## 5266 1646

4. Join informative markers and save

As a final step, let’s join all the informative markers into a single tibble per family and save it.

temp <- rbind(fam_ABxAB, fam_NULLS)
fam_SNP <- rbind(temp, fam_ABxAA)

rm(temp)

Let’s check the number of rows is correct between the individual tibbles and the final merged tibble

nrow(fam_ABxAB) + nrow(fam_NULLS) + nrow(fam_ABxAA)
## [1] 8702
nrow(fam_SNP)
## [1] 8702

Compare to total number of markers to which a segregation was assigned, to assess how many were lost per family

nrow(fam_dfs) - nrow(fam_SNP)
## [1] 2124

Clear the working environment from now unwanted objects

rm(fam_ABxAA, fam_NULLS, fam_ABxAB, fam_dfs, fam_SegAss)

And save the final dataset of informative SNP markers, to which a segregation was successfully assigned, and for which parents were either correct (CORR), or could be corrected (ERR).

#saveRDS(fam_SNP, file = #"../1_DataPrep/4_1_FilteredMarkerData_Informative_HC.RData")