1. Background

Now that we have assigned markers to linkage groups is time to order markers within each linkage group.

2. Order markers

When running OrderMarkers as default, even though I do not add the sexAverage=1 option, LepMap maps all markers on both the male and the female map. I do not particularly like nor understand this, so now I run orderMarkers with male informative only (informativeMask=13) and female informative only (informativeMask=23) markers separately.

I run each chromosome 10 times and select the best run, as suggested by Pasi Rastas. To speed up things, I submit the job as an array with one job per linkage group.

#!/bin/bash

#PBS -l ncpus=4
#PBS -l mem=15g
#PBS -l walltime=10:00:00
#PBS -J 1-195

##### Change Directory #####
cd ~/ProjectDirectory/2_LinMap

##### Get params from array #####
parameters=`sed -n "${PBS_ARRAY_INDEX} p" 0_metadata/2_4_Array`
parameterArray=($parameters)

LG_INDEX=${parameterArray[0]}
REP_INDEX=${parameterArray[1]}
out_lod=3_orderMarkers_SP/male/7_1_HC_Male_LG${LG_INDEX}_REP${REP_INDEX}_LOD
out_int=3_orderMarkers_SP/male/7_1_HC_Male_LG${LG_INDEX}_REP${REP_INDEX}_INT
output=3_orderMarkers_SP/male/7_1_HC_Male_LG${LG_INDEX}_REP${REP_INDEX}.txt

##### Run order chromosomes #####
java -cp 0_lepmap/bin/ OrderMarkers2 \
data=6_1_parcall_HC map=6_3_joinsingles_HC \
chromosome=${LG_INDEX} computeLODScores=${out_lod} calculateIntervals=${out_int} informativeMask=13 numThreads=4 > $output

I produced the 2_4_Array array file with the below command. It produces a file with 195 rows and two columns, where each linkage group is replicated 15 times (first column), and those 15 replicates are denoted by the letters a to o (second column).

for i in $(seq 1 13)
do
  for j in $(seq 1 15)
  do
    echo $i
  done
done > 0_metadata/2_4_array

for i in $(seq 1 13)
do
  printf "a\nb\nc\nd\ne\nf\ng\nh\nj\nk\ni\nl\nm\nn\no\n"done > 0_metadata/temp
  
paste -d' ' 0_metadata/2_4_array 0_metadata/temp > 0_metadata/2_4_array

rm 0_metadata/temp

For each run I saved:

3. Selecting the best run

Now let’s extract the likelihood for each run. I did so with a series of bash for loops. In this order, the script loops through:

#!/bin/bash
cd "./male"

for LG in {1..13} ; do
  for REP in {a..o} ; do
    echo -n "$LG ${REP} " >> "7_2_HC_male_LIK.txt"
    echo | sed '2q;d' "7_1_HC_Male_LG${LG}_REP${REP}.txt" | awk '{print $7}' >> "7_2_HC_male_LIK.txt"
  done
done

cd "./female"
for LG in {1..13} ; do
  for REP in {a..o} ; do
    echo -n "$LG ${REP} " >> "7_2_HC}_female_LIK.txt"
    echo | sed '2q;d' "7_1_HC_Female_LG${LG}_REP${REP}.txt" | awk '{print $7}' >> "7_2_HC_female_LIK.txt"
  done
done

Now I have two files for each family containing the log likelihoods of each of the 11 runs for each of the 13 chromosomes. Let’s import them into R so we can identify which run for each chromosome has the best likelihood value, so that we can start plotting the resulting maps and identifying any possible errors.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.0.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggpubr)
HC_LIKm <- read_delim("../2_LinMap/7_2_HC_male_LIK.txt",
                      delim = " ",
                      col_names = c("LG","ITE","LIK"))

HC_LIKf <- read_delim("../2_LinMap/7_2_HC_female_LIK.txt",
                      delim = " ",
                      col_names = c("LG","ITE","LIK"))

Now, let’s plot them LG by LG for each family, to see if there’s any obvious outliers

p1 <- ggplot(HC_LIKm, aes(x = LG, y = LIK, group = LG)) +
  geom_boxplot() +
  scale_x_continuous(breaks = seq(1,13,1)) +
  ylab("Likelihood") + xlab("Linkage group") +
  theme(axis.text.x = element_text(size=6))


p2 <- ggplot(HC_LIKf, aes(x = LG, y = LIK, group = LG)) +
  geom_boxplot() +
  scale_x_continuous(breaks = seq(1,13,1)) +
  ylab("Likelihood") + xlab("Linkage group") +
  theme(axis.text.x = element_text(size=6))

ggarrange(p1,p2,
          ncol = 1, nrow = 2, align = "hv",
          labels = c("male","female"),
          font.label = list(size = 8))

The likelihood is pretty consistent across the 15 runs. Now, let’s find which iteration has the highest likelihood.

HC_bestm <- HC_LIKm %>% group_by(LG) %>%
  slice_max(n=1, order_by=LIK, with_ties=FALSE) %>%
  select(LG, ITE) %>% unite("Best", remove = TRUE, sep = "_")

HC_bestf <- HC_LIKf %>% group_by(LG) %>%
  slice_max(n=1, order_by=LIK, with_ties=FALSE) %>%
  select(LG, ITE) %>% unite("Best", remove = TRUE, sep = "_")
write_delim(HC_bestm,
            file = "../2_LinMap/7_3_HC_male_Best.txt",
            delim = "\t", col_names = FALSE)

write_delim(HC_bestf,
            file = "../2_LinMap/7_3_HC_female_Best.txt",
            delim = "\t", col_names = FALSE)

Now, let’s copy to this directory the files corresponding to the best run (both the map, the lods and the intervals). I rename them to start with “3_4_” and remove the REP information as it is not needed moving forward.

I do so with a for loop plus while script.

DISCLAIMER: I Left this here for others that might find they have a similar issue (e.g., wanting to move a lot of files at once based on info on another file). The directory paths are all wrong as they match the directories on my HPC home directory. They will not work for you. I just thought it might be a useful script for others wanting to extract the best runs.

#!/bin/bash
cd ../3_orderMarkers_SP/male

#loop through the four families
for FAM in FC HC KI WC ; do
#loop through rows of file containing best run info
  while read p; do
  #select LG from row
    LG=`awk -F'[_]' '{print $1}' <<< "$p"`
  #select REP id from row
    REP=`awk -F'[_]' '{print $2}' <<< "$p"`
  #copy and rename the files for the best run for each LG
    cp "7_1_${FAM}_Male_LG${LG}_REP${REP}.txt" "7_4_${FAM}_male_LG${LG}.txt"
    cp "7_1_${FAM}_Male_LG${LG}_REP${REP}_LOD" "7_4_${FAM}_male_LG${LG}_LOD"
    cp "7_1_${FAM}_Male_LG${LG}_REP${REP}_INT" "7_4_${FAM}_male_LG${LG}_INT"
  done < "7_3_${FAM}_male_Best.txt"
done

cd ../female

#loop through the four families
for FAM in FC HC KI WC ; do
#loop through rows of file containing best run info
  while read p; do
  #select LG from row
    LG=`awk -F'[_]' '{print $1}' <<< "$p"`
  #select REP id from row
    REP=`awk -F'[_]' '{print $2}' <<< "$p"`
  #copy and rename the files for the best run for each LG
    cp "7_1_${FAM}_Female_LG${LG}_REP${REP}.txt" "7_4_${FAM}_female_LG${LG}.txt"
    cp "7_1_${FAM}_Female_LG${LG}_REP${REP}_LOD" "7_4_${FAM}_female_LG${LG}_LOD"
    cp "7_1_${FAM}_Female_LG${LG}_REP${REP}_INT" "7_4_${FAM}_female_LG${LG}_INT"
  done < "7_3_${FAM}_female_Best.txt"
done