9. Polishing Maps

1. Background
2. Polishing maps
3. Re-run Order markers without ‘bad’ markers
4. Select the best run
5. Produce LMPlot plots

1. Background

Now that we have identified the markers causing long (>15-20 cM) gaps at the end, let’s remove them and re-run the ordering module.

2. Polishing maps

Let’s remove for each LG the problematic markers, re-order the map and produce LMPlot to see if it fixed the order.

First, I need to produce a file containing marker ID and assigned linkage group, so that I can assign markers i want to exclude to linkage group 0 to exclude them from the ordering stage.

I achieve this with the following command

pr -mts$'\t' <(cut -d$'\t' -f1 ../1_DataPrep/5_2_LepMapInput_HC_A.txt | tail -n +7) <(cut -d$'\t' -f1 ../2_LinMap/6_3_joinsingles_HC | tail -n +2) > ../2_LinMap/9_1_markersList_HC

Next I silence the markers i identified before by assigning them to LG0. These markers are:

1886_8, 1886_34, 1886_53, 1886_57 for LG2
19546_11, 19546_37 for LG4
26181_55 for LG5
2634_6, 2634_20, 2634_21, 2634_27 for LG8
19546_13, 19546_18, 56691_24, 56691_26 for LG9
19546_24, 19546_53 for LG11

awk 'BEGIN{FS=OFS="\t"} $1~/1886_/ || $1~/19546_/ || $1~/2634_/ || $1=="26181_55" || $1=="56691_24" || $1=="56691_26" {$2="0"}1' ../2_LinMap/9_1_markersList_HC | awk 'BEGIN{FS="\t"} {print $2}' > ../2_LinMap/9_2_markersList_HC_Polished

3. Re-run Order markers without ‘bad’ markers

Now I re-run the orderMarkers module using the scripts 3_7_OrderChromosomes2_HC_Sex.sh (where Sex is replaced with the two sexes respectively:

#!/bin/bash

#PBS -N orderHC
#PBS -l ncpus=4
#PBS -l mem=15g
#PBS -l walltime=10:00:00
#PBS -J 1-195

##### Change Directory #####
cd ~/ProjectDirectory

##### Get params from array #####
# we can re-use the same array file as before
parameters=`sed -n "${PBS_ARRAY_INDEX} p" 0_metadata/2_4_Array`
parameterArray=($parameters)

LG_INDEX=${parameterArray[0]}
REP_INDEX=${parameterArray[1]}
out_lod=2_LinMap/9_3_female/9_3_HC_Female_LG${LG_INDEX}_REP${REP_INDEX}_LOD
out_int=2_LinMap/9_3_female/9_3_HC_Female_LG${LG_INDEX}_REP${REP_INDEX}_INT
output=2_LinMap/9_3_female/9_3_HC_Female_LG${LG_INDEX}_REP${REP_INDEX}.txt

##### Run order chromosomes #####
java -cp 0_lepmap/bin/ OrderMarkers2 \
data=./1_DataPrep/5_2_LepMapInput_HC_A.txt map=./2_LinMap/9_2_markersList_HC_Polished \
chromosome=${LG_INDEX} computeLODScores=${out_lod} calculateIntervals=${out_int} informativeMask=23 numThreads=4 > $output

4. Select the best run

Now let’s extract the likelihood for each run. I did so with a series of bash for loops. The script loops through:

the 13 chromosomes { 1 .. 13}
the 11 iterations { a .. o}

#!/bin/bash
for LG in {1..13} ; do
  for REP in {a..o} ; do
    echo -n "$LG ${REP} " >> "9_4_HC_male_LIK.txt"
    echo | sed '2q;d' "9_3_HC_Male_LG${LG}_REP${REP}.txt" | awk '{print $7}' >> "9_4_HC_male_LIK.txt"
  done
done

for LG in {1..13} ; do
  for REP in {a..o} ; do
    echo -n "$LG ${REP} " >> "9_4_HC_female_LIK.txt"
    echo | sed '2q;d' "9_3_HC_Female_LG${LG}_REP${REP}.txt" | awk '{print $7}' >> "9_4_HC_female_LIK.txt"
  done
done

Now I have two files for each family containing the log likelihoods of each of the 11 runs for each of the 13 chromosomes. Let’s import them into R so we can identify which run for each chromosome has the best likelihood value, so that we can start plotting the resulting maps and identifying any possible errors.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## Warning: package 'tidyr' was built under R version 4.0.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggpubr)

HC_LIKm <- read_delim("../2_LinMap/9_4_HC_male_LIK.txt", delim = " ", col_names = c("LG","ITE","LIK"))

HC_LIKf <- read_delim("../2_LinMap/9_4_HC_female_LIK.txt", delim = " ", col_names = c("LG","ITE","LIK"))

Now, let’s plot them LG by LG, to see if there’s any obvious outliers

p1 <- ggplot(HC_LIKm,
             aes(x = LG, y = LIK, group = LG)) +
  geom_boxplot() +
  scale_x_continuous(breaks = seq(1,13,1)) +
  ylab("Likelihood") + xlab("Linkage group") +
  theme(axis.text.x = element_text(size=6))

p2 <- ggplot(HC_LIKf,
             aes(x = LG, y = LIK, group = LG)) +
  geom_boxplot() +
  scale_x_continuous(breaks = seq(1,13,1)) +
  ylab("Likelihood") + xlab("Linkage group") +
  theme(axis.text.x = element_text(size=6))

ggarrange(p1,p2,
          nrow = 2, align = "hv",
          font.label = list(size = 8))

Now, let’s find which iteration has the highest likelihood.

HC_best <- HC_LIKm %>% group_by(LG) %>%
  slice_max(n=1, order_by=LIK, with_ties=FALSE) %>%
  select(LG, ITE) %>% unite("Best", remove = TRUE, sep = "_")

write_delim(HC_best, file = "../2_LinMap/9_5_HC_male_Best.txt",
            delim = "\t", col_names = FALSE)

HC_best <- HC_LIKf %>% group_by(LG) %>%
  slice_max(n=1, order_by=LIK, with_ties=FALSE) %>%
  select(LG, ITE) %>% unite("Best", remove = TRUE, sep = "_")

write_delim(HC_best, file = "../2_LinMap/9_5_HC_female_Best.txt",
            delim = "\t", col_names = FALSE)

And let’s copy to a sub directory the files corresponding to the best run (both the map, the lods and the intervals). I rename them to start with “3_10_” and remove the REP information as it is not needed moving forward.

I do so with a for loop plus while script

#!/bin/bash

#loop through rows of file containing best run info
while read p; do
  #select LG from row
  LG=`awk -F'[_]' '{print $1}' <<< "$p"`
  #select REP id from row
   REP=`awk -F'[_]' '{print $2}' <<< "$p"`
  #copy and rename the files for the best run for each LG
  cp "9_3_HC_Male_LG${LG}_REP${REP}.txt" "9_6_HC_male_LG${LG}.txt"
  cp "9_3_HC_Male_LG${LG}_REP${REP}_LOD" "9_6_HC_male_LG${LG}_LOD"
  cp "9_3_HC_Male_LG${LG}_REP${REP}_INT" "9_6_HC_male_LG${LG}_INT"
done < "9_5_HC_male_Best.txt"

#loop through rows of file containing best run info
while read p; do
  #select LG from row
  LG=`awk -F'[_]' '{print $1}' <<< "$p"`
  #select REP id from row
  REP=`awk -F'[_]' '{print $2}' <<< "$p"`
  #copy and rename the files for the best run for each LG
  cp "9_3_HC_Female_LG${LG}_REP${REP}.txt" "9_6_HC_female_LG${LG}.txt"
  cp "9_3_HC_Female_LG${LG}_REP${REP}_LOD" "9_6_HC_female_LG${LG}_LOD"
  cp "9_3_HC_Female_LG${LG}_REP${REP}_INT" "9_6_HC_female_LG${LG}_INT"
done < "9_5_HC_female_Best.txt"

5. Produce LMPlot plots

Finally I produce dot files for all runs using LMPlot. This is an additional way to check your maps. Although at this stage I am pretty confident in their current condition.

#!/bin/bash
for LG in {1..13} ; do
#produce dot file
  java -cp ../../0_lepmap/bin/ LMPlot 9_6_HC_male_LG${LG}.txt > 9_7_HC_male_LG${LG}_LMPlot.txt

#plot dot file
  dot -Tpng 9_7_HC_male_LG${LG}_LMPlot.txt > 9_7_HC_male_LG${LG}_LMPlot.png
done

for LG in {1..13} ; do
#produce dot file
  java -cp ../../0_lepmap/bin/ LMPlot 9_6_HC_female_LG${LG}.txt > 9_7_HC_female_LG${LG}_LMPlot.txt

#plot dot file
  dot -Tpng 9_7_HC_female_LG${LG}_LMPlot.txt > 9_7_HC_female_LG${LG}_LMPlot.png
done