Title: | Assignment of Haplotypes Based on SNP Dosages in Diploids and Polyploids |
---|---|
Description: | Infer the genetic composition of individuals in terms of haplotype dosages for a haploblock, based on bi-allelic marker dosages, for any ploidy level. Reference: Voorrips and Tumino: PolyHaplotyper: haplotyping in polyploids based on bi-allelic marker dosage data. Submitted to BMC Bioinformatics (2021). |
Authors: | Roeland E. Voorrips |
Maintainer: | Roeland E. Voorrips <[email protected]> |
License: | GPL-2 |
Version: | 1.0.1 |
Built: | 2024-11-15 04:07:46 UTC |
Source: | https://github.com/cran/PolyHaplotyper |
Find all possible haplotypes for a haploblock from the haplotyping result
allhap(hapresults, haploblock)
allhap(hapresults, haploblock)
hapresults |
list as returned by inferhaplotypes, or one element of such a list (i.e. the results for one haploblock) |
haploblock |
if hapresults is one element of the return value of inferHaplotypes, haploblock should be missing of NULL; else haploblock is a single value indicating the haploblock: either its name of its index in hapresults |
This function works with the results of inferHaplotypes; the setting of dropUnused does not affect this function
an array with all possible haplotypes. The haplotypes are in columns, with the haplotype numbers as colnames; the markers are in rows.
data(PolyHaplotyper_small) # show the composition of all possible haplotypes with the markers # in the first haploblock: allhap(hapresults=phresults, haploblock=1)
data(PolyHaplotyper_small) # show the composition of all possible haplotypes with the markers # in the first haploblock: allhap(hapresults=phresults, haploblock=1)
Given a set of bi-allelic (SNP) marker names, generate all possible haplotypes
allHaplotypes(mrknames)
allHaplotypes(mrknames)
mrknames |
the names of the (bi-allelic) markers in the haploblock (contig) |
a matrix with markers in columns and all possible (2 ^ nmrk) haplotypes in rows. 0: haplotype contains the non-dosage-counted marker allele (the reference allele); 1: haplotype contains the dosage-counted (alternative) marker allele. The colnames are the marker names.
# show the 8 possible haplotypes with 3 bi-allelic markers: allHaplotypes(mrknames=c("mrkA", "mrkB", "mrkC"))
# show the 8 possible haplotypes with 3 bi-allelic markers: allHaplotypes(mrknames=c("mrkA", "mrkB", "mrkC"))
generate a list which contains for each marker dosage combination at a given ploidy all matching haplotype combinations
build_ahccompletelist(ploidy, maxmrk, savesec=1800, printsec=300, overwrite, shorten=FALSE)
build_ahccompletelist(ploidy, maxmrk, savesec=1800, printsec=300, overwrite, shorten=FALSE)
ploidy |
ploidy; may be even or odd, but inferHaplotypes only works with even ploidy if FS families are present, so building an ahccompletelist for odd ploidy is probably not useful |
maxmrk |
the list will countain all haplotype combinations for haploblocks of 1...maxmrk markers |
savesec |
default 1800: number of seconds between successive saves of the intermediate results |
printsec |
default 300: number of seconds between printout of the current set of haplotypes. NA or 0 suppresses printing. |
overwrite |
the new file ahccompletelist_<ploidy>x.RData is written in the working directory. If a file with that name already exists and would be changed, this is aborted and the old file maintained if overwrite is FALSE |
shorten |
if file ahccompletelist_<ploidy>x.RData already exists and covers haploblocks of lengths longer than maxmrk, this file is left unchanged if shorten is FALSE (default), but is shortened to maxmrk if shorten is TRUE |
An ahccompletelist reduces the processing time of inferHaplotypes enormously
but takes a long time to build. This should therefore be done when
PolyHaplotyper will be used multiple times.
If an ahccompletelist file already exists in the working directory, this file is
used as starting point; if maxmrk is larger than the length of the existing
list only the additional items are calculated.
If this function crashes (due to exceeding the memory limits of the computer
or of R) the file ahccompletelist_<ploidy>x.RData in the working directory
contains an intact version of the ahccompletelist for the last finished
marker count. Also, the temporary file
"buildAHCcompletelist_<datetime>"_"<nmrk>.RData" contains a part of the
list item for the number of markers being calculated at that moment.
Note that this function takes lots of time and memory already for
ploidy 4 / maxmrk 7 and ploidy 6 / maxmrk 5, and is not practicable above
ploidy 4 / maxmrk 8, ploidy 6 / maxmrk 6, ploidy 8 / maxmrk 5. Also, the
files are large and take up considerable memory, and most of the size is
used for the largest haploblock size. Therefore maxmrk should
not be taken larger than necessary.
NULL (invisible); the actual result is a file in the working directory named ahccompletelist_<ploidy>x.RData
# this example will create a file ahccompletelist_4x.RData # in the working directory, for ploidy=4x and 1 - 4 markers: build_ahccompletelist(ploidy=4, maxmrk=4, overwrite=FALSE)
# this example will create a file ahccompletelist_4x.RData # in the working directory, for ploidy=4x and 1 - 4 markers: build_ahccompletelist(ploidy=4, maxmrk=4, overwrite=FALSE)
produce a table of nr of markers vs nr of haplotypes
calcMrkHaptable(ovwFS)
calcMrkHaptable(ovwFS)
ovwFS |
a list as produced by overviewFS. Only the first 2 columns of list item ovw are used |
a frequency table with the numbers of haploblocks, with all combinations of marker counts and inferred haplotype counts per haploblock. The column with haplotype count NA (if any) shows the haploblocks for which no haplotype solution was found (the reason for that would usually be found in column 1 of ovwFS$messages)
data(PolyHaplotyper_small) phovw <- overviewByFS(haploblock=phblocks, parents=phpar, FS=phFS, hapresults=phresults) # in this small dataset there are only 2 haploblocks, each with 4 markers: calcMrkHaptable(ovwFS=phovw) # in both haploblocks 5 haplotypes are inferred
data(PolyHaplotyper_small) phovw <- overviewByFS(haploblock=phblocks, parents=phpar, FS=phFS, hapresults=phresults) # in this small dataset there are only 2 haploblocks, each with 4 markers: calcMrkHaptable(ovwFS=phovw) # in both haploblocks 5 haplotypes are inferred
calculate some statistics of the solutions of all haploblocks
calcStatistics(pedchk, ovwFS, indiv, haploblocks)
calcStatistics(pedchk, ovwFS, indiv, haploblocks)
pedchk |
a list as generated by pedigreeHapCheck |
ovwFS |
a list as produced by overviewFS, or NA if no FS families available. If ovwFS is specified, pedchk and ovwFS should have been calculated from the same underlying data |
indiv |
missing, NULL or a vector of names of individuals. If present, the component pedstats is calculated only over the individuals in the vector, else over all individuals. Component FSstats is not affected by indiv |
haploblocks |
missing, NULL or a vector of names or numbers of haploblocks, or a list of haploblocks as in the inferHaplotypes input. If present, components pedstats and FSstats are calculated only over the specified haploblocks, else over all haploblocks. |
a list with one or two items:
pedstats: a data.frame with one row per haploblock, showing numbers
of individuals in columns
- mrk: individuals with complete marker data
- hap: individuals with an assigned haplotype combination
- match.NA: individuals where a match between the haplotypes of the individual
and its parents could not be verified (either the individual itself or both
parents have no haplotype combination assigned)
- noDR.TRUE: individuals whose haplotype combination matches that of its
parents, assuming polysomic inheritance without Double Reduction
- noDR.FALSE: individuals whose haplotype combination doesn't match that of its
parents, assuming polysomic inheritance without Double Reduction
- withDR.TRUE: individuals whose haplotype combination matches that of its
parents, assuming polysomic inheritance, allowing Double Reduction
- withDR.FALSE: individuals whose haplotype combination doesn't match that of
its parents, assuming polysomic inheritance, allowing Double Reduction
The total number of individuals is
match.NA + noDR.TRUE + noDR.FALSE == match.NA + withDR.TRUE + withDR.FALSE;
the number of individuals with missing marker data or without assigned
haplotype dosages can be calculated by subtracting mrk or hap from that
total.
FSstats (only if ovwFS specified): a data.frame with for each FS family,
"rest" and "all" one row, with columns
- pop: the FS family numbers, "rest" (all individuals not belonging to the
FSs or their parents) and "all" (all individuals)
- bothparmrk: for how many haploblocks have both parents complete marker data?
- FSdone: for how many haploblocks has a solution based on polysomic
inheritance been found for this FS family?
- mrk: the average number of (FS) individuals with complete marker data
- imp: the average number of (FS) individuals with imputed marker data
- hap: the average number of (FS) individuals with an assigned haplotype
combination
The numbers for "all" include also the FS parents (each parent counted only
once, even if some parents produced more than one FS family), which are not
included in any of the other rows; therefore "all" is not equal to the sum of
the rows above.
data(PolyHaplotyper_small) phchk <- pedigreeHapCheck(ped=phped, mrkDosage=phdos, haploblock=phblocks, hapresults=phresults) phovw <- overviewByFS(haploblock=phblocks, parents=phpar, FS=phFS, hapresults=phresults) calcStatistics(pedchk=phchk, ovwFS=phovw)
data(PolyHaplotyper_small) phchk <- pedigreeHapCheck(ped=phped, mrkDosage=phdos, haploblock=phblocks, hapresults=phresults) phovw <- overviewByFS(haploblock=phblocks, parents=phpar, FS=phFS, hapresults=phresults) calcStatistics(pedchk=phchk, ovwFS=phovw)
check a marker dosages matrix or data.frame, select columns and rows, convert to matrix
checkmrkDosage(mrkDosage, ploidy, indiv=NULL, markers=NULL, generateMarkernames=TRUE)
checkmrkDosage(mrkDosage, ploidy, indiv=NULL, markers=NULL, generateMarkernames=TRUE)
mrkDosage |
matrix or data.frame. Markers are in rows, individuals in columns, each cell has a marker dosage. If mrkDosage is a matrix, the colnames are the individual names and the rownames are the marker names. If mrkDosage is a data.frame and the name of the first column starts with "marker" (upper/lowercase not relevant) the contents of this column are used as marker names, else the rownames are used as marker names. The (other) column names are the individual names. All marker dosages must be in 0:ploidy or NA. |
ploidy |
an integer value; all dosages are checked to be in 0:ploidy or NA |
indiv |
NULL (default) or a character vector with names of individuals to be selected. If NULL, all individuals are selected |
markers |
NULL (default) or a character vector with names of markers to be selected. If NULL, all markers are selected. |
generateMarkernames |
if TRUE (default) and mrkDosage has no markernames specified, markernames are generated automatically. Markernames must be either present or be generated, and none may be missing. |
This function is called by inferHaplotypes and by mergeReplicates, so normally there is no need for the user to call this function directly.
a matrix with the selected columns in the order of indiv and the selected rows in order of markers (or all columns or rows in the original order if indiv or markers are not specified), with names of individuals as column names, marker names as row names
mrkdos <- data.frame( marker=paste0("mrk", 1:3), indiv1=c(1, 2, 2), indiv2=c(4, 0, 0), indiv3=c(3, 0, 2)) # use all rows and columns: checkmrkDosage(mrkDosage=mrkdos, ploidy=4) # use only first and last row and column and change the order: checkmrkDosage(mrkDosage=mrkdos, ploidy=4, indiv=c("indiv3", "indiv1"), markers=c("mrk3", "mrk1"))
mrkdos <- data.frame( marker=paste0("mrk", 1:3), indiv1=c(1, 2, 2), indiv2=c(4, 0, 0), indiv3=c(3, 0, 2)) # use all rows and columns: checkmrkDosage(mrkDosage=mrkdos, ploidy=4) # use only first and last row and column and change the order: checkmrkDosage(mrkDosage=mrkdos, ploidy=4, indiv=c("indiv3", "indiv1"), markers=c("mrk3", "mrk1"))
compare haplotyping results with observed markers dosages
compareHapMrkDosages(mrkDosage, hapresults)
compareHapMrkDosages(mrkDosage, hapresults)
mrkDosage |
a data.frame or matrix with the marker dosages, in the format of inferHaplotypes() |
hapresults |
a list as returned by inferHaplotypes, with one item (itself a list) per haploblock with at least a matrix hapdos and a character vector markers. All markers in all haploblocks must also occur in mrkDosages, and all individuals in the hapdos matrices must also occur in mrkDosages |
a 3-D array with dimensions haploblock, individual, and chkresult: mrkNA (ALL markers in the haploblock have missing data for the individual), hapNA (the haplotype dosages do not sum to ploidy and/or are missing), match (TRUE if the non-missing marker dosages match the haplotype dosages, FALSE is there is a conflict, NA if mrkNA and/or hapNA are TRUE) Each element is itself a list with elements:
data(PolyHaplotyper_small) chmd <- compareHapMrkDosages(mrkDosage=phdos, hapresults=phresults) # show results for first haploblock, first 8 individuals: chmd[1, 1:8,]
data(PolyHaplotyper_small) chmd <- compareHapMrkDosages(mrkDosage=phdos, hapresults=phresults) # show results for first haploblock, first 8 individuals: chmd[1, 1:8,]
compare two haplotyping results, e.g. PolyHaplotyper and SATlotyper
compareHapresults(haploblock, hapresultsA, hapresultsB)
compareHapresults(haploblock, hapresultsA, hapresultsB)
haploblock |
a list of character vectors. The names are the names of the haploblocks, the character vectors have the names of the markers in each haploblock. |
hapresultsA |
and |
hapresultsB |
two list as returned by inferHaplotypes, with one item (itself a list) per haploblock with at least a matrix hapdos and a character vector markers. All haploblocks in param haploblock must occur in hapresultsA and in hapresultsB. The individual names (colnames of the hapdos items for each haploblock) must be identical and in the same order in hapresultsA and hapresultsB |
a list with one element per haploblock in param haploblock.
Each element is itself a list with elements:
$identical: TRUE or FALSE
$message: a single string, "" if the comparison is possible, else the reason
why not (if $message is not "", $identical is always FALSE). The next
elements are only present if $message is "":
$compindiv: a matrix comparing the two hapdos, with one column per individual
and 5 rows: Both_NA, A_NA, B_NA, Equal, Uneq. The last 2 have NA values if A
and/or B is NA
$haplofreq: a matrix with one row per haplotype occurring in A and/or B,
and columns A and B, with the total frequency of each haplotype in
hapdos A or hapdos B
A data.frame with a pedigree; used in vignette, stored in PolyHaplotyper_demo.RData
a data.frame with 661 rows (samples) and 4 columns: genotype, mother, father, sample_nr
A data.frame with the allele dosages for 30 SNPs in 661 samples; used in vignette, stored in PolyHaplotyper_demo.RData
a data.frame with 30 rows (SNPs) and 663 columns. The first two columns contain the SNP name and contig; the remaining columns contain the allele dosages (integers in 0..6 or NA) of the SNPs in 661 samples
Haplotype dosage matrices generated with the dropUnused=TRUE lack some haplotypes that are required by pedigreeHapCheck; this function adds them back.
expandHapdos(hapdos, nhap, hbname)
expandHapdos(hapdos, nhap, hbname)
hapdos |
a haplotype dosage matrix as returned by a.o. inferHaplotypes. The rownames are assumed to have all the same length: they are composed of the haploblock name to which the 0-padded haplotype number is appended, separated by "_" |
nhap |
the total number of haplotypes for the haploblock: nhap = 2^nmrk with nmrk the number of markers in the haploblock |
hbname |
the haploblock name (only used when hapdos has 0 rows) |
a matrix similar to hapdos with rows for the dropped haplotypes re-inserted, with the correct rownames and containing only 0 and NA values
# specify haplotype dosages of 4 tetraploid individuals, # only the 3 occurring haplotypes (1, 5 and 6) are given: haplodosg <- matrix(c(1,2,1, 4,0,0, 0,4,0, 0,0,4), nrow=3, dimnames=list(paste0("demohap_", c(1,5,6)), paste0("indiv", 1:4))) # add the rows for the absent haplotypes, assuming the haploblock consists # of 3 markers, so 8 haplotypes are possible: expandHapdos(hapdos=haplodosg, nhap=8)
# specify haplotype dosages of 4 tetraploid individuals, # only the 3 occurring haplotypes (1, 5 and 6) are given: haplodosg <- matrix(c(1,2,1, 4,0,0, 0,4,0, 0,0,4), nrow=3, dimnames=list(paste0("demohap_", c(1,5,6)), paste0("indiv", 1:4))) # add the rows for the absent haplotypes, assuming the haploblock consists # of 3 markers, so 8 haplotypes are possible: expandHapdos(hapdos=haplodosg, nhap=8)
get all FS haplotype combinations expected from two parental haplotype combinations, with their frequencies. Different from getFScombs in that it returns the unique FS combinations as well as their frequencies, and that it also considers Double Reduction
getFSfreqs(parhac, DRrate)
getFSfreqs(parhac, DRrate)
parhac |
matrix with one column for each parent and <ploidy> rows, giving the haplotype combinations for each parent |
DRrate |
the rate of double reduction per meiosis (NOT per allele!); e.g. with a DRrate of 0.04, a tetraploid parent with genotype ABCD will produce a fraction of 0.04 of DR gametes AA, BB, CC and DD (each with a frequency of 0.01), and a fraction of 0.96 of the non-DR gametes AB, AC, AD, BC, BD, CD (each with a frequency of 0.16) |
a list of 2 elements:
FShac: a matrix with one column per unique FS haplotype combination and
the same row count as parhac, giving the FS haplotype combinations. There are
no duplicated columns but several (not necessarily adjacent) columns
may correspond to the same mrkdid
freq: a vector of length ncol(hapcomb), with for each FS haplotype combination
in hapcomb its frequency (0.0 ... 1.0). The colnames are NULL: haplotype
combinations are not named.
# specify combinations of haplotypes in two tetraploid parents: hapcomb <- matrix(c(2,2,5,6, 1,2,5,5), ncol=2) # FS frequencies without double reduction: getFSfreqs(parhac=hapcomb, DRrate=0) # FS frequencies with 5\% double reduction: getFSfreqs(parhac=hapcomb, DRrate=0.05)
# specify combinations of haplotypes in two tetraploid parents: hapcomb <- matrix(c(2,2,5,6, 1,2,5,5), ncol=2) # FS frequencies without double reduction: getFSfreqs(parhac=hapcomb, DRrate=0) # FS frequencies with 5\% double reduction: getFSfreqs(parhac=hapcomb, DRrate=0.05)
get all gametes and their frequencies for a parental haplotype combination
getGameteFreqs(parhac, DRrate)
getGameteFreqs(parhac, DRrate)
parhac |
vector of the parental haplotype combination of length <ploidy>, giving the <ploidy> haplotype numbers present per haplotype giving the dosage of that haplotype, summing to ploidy |
DRrate |
the rate of double reduction per meiosis (NOT per allele!); e.g. with a DRrate of 0.04, a tetraploid parent with genotype ABCD will produce a fraction of 0.04 of DR gametes AA, BB, CC and DD (each with a frequency of 0.01), and a fraction of 0.96 of the non-DR gametes AB, AC, AD, BC, BD, CD (each with a frequency of 0.16) |
for hexaploids the DR gametes consist of a duplication of one of the
6 parental alleles, combined with one copy of one of the other 5 alleles.
Calculation is faster if DRrate is 0.0
a list of 2 elements:
hapcomb: a matrix with one column per unique gamete and ploidy/2 rows.
Each element is the number (ID) of a haplotype). Within columns the
haplotypes are sorted from low to high; the columns are ordered from left
to right, first on row 1, then on row 2 etc
freq: a vector of length ncol(hapcomb), with for each gamete in hapcomb its
frequency (0.0 ... 1.0)
# specify combination of haplotypes in a tetraploid parent: hapcomb <- c(2,2,5,6) # 2 copies of haplotype 2, 1 each of 5 and 6 # gamete frequencies without double reduction: getGameteFreqs(parhac=hapcomb, DRrate=0) # gamete frequencies with 5\% double reduction: getGameteFreqs(parhac=hapcomb, DRrate=0.05)
# specify combination of haplotypes in a tetraploid parent: hapcomb <- c(2,2,5,6) # 2 copies of haplotype 2, 1 each of 5 and 6 # gamete frequencies without double reduction: getGameteFreqs(parhac=hapcomb, DRrate=0) # gamete frequencies with 5\% double reduction: getGameteFreqs(parhac=hapcomb, DRrate=0.05)
converts matrices that contain haplotype combinations in columns (as the matrices in ahccompletelist or ahclist) to matrices with the haplotype dosages
hapcomb2hapdos(hapcomb, nhap)
hapcomb2hapdos(hapcomb, nhap)
hapcomb |
matrix with <ploidy> rows and any number of columns. Each column contains a set of haplotypes (numbers in 1 ... nhap). NAs allowed but no values outside this range (not checked). E.g. a column with numbers 1-1-3-12 means two copies of haplotype 1 and one of haplotypes 3 and 12 each. |
nhap |
the total number of possible haplotypes: 2^nmrk where nmrk is the number of bi-allelic markers in the haploblock |
matrix with nhap rows (one row for each possible haplotype) and as many columns as in hapmat, giving the dosages of each haplotype in each column of hapmat.
# specify haplotype combinations of 2 individuals: haplocomb <- matrix(c(1,5,5,6, 5,6,6,6), ncol=2, dimnames=list(NULL, c("indiv1", "indiv2"))) # convert to dosage matrix, # assuming haploblock has 3 markers, so 8 possible haplotypes: hapcomb2hapdos(hapcomb=haplocomb, nhap=8)
# specify haplotype combinations of 2 individuals: haplocomb <- matrix(c(1,5,5,6, 5,6,6,6), ncol=2, dimnames=list(NULL, c("indiv1", "indiv2"))) # convert to dosage matrix, # assuming haploblock has 3 markers, so 8 possible haplotypes: hapcomb2hapdos(hapcomb=haplocomb, nhap=8)
convert a vector or matrix of haplotype dosages to a vector or matrix of haplotype combinations
hapdos2hapcomb(hapdos, ploidy)
hapdos2hapcomb(hapdos, ploidy)
hapdos |
a vector with one item for some or all possible haplotypes, with the dosage of the haplotypes, summing to ploidy; or a matrix where each column is such a vector, each representing an individual. Missing data are allowed (but any NA makes the entire vector or matrix column unknown); if a matrix, 0 rows indicate missing values for all individuals. When hapdos is a matrix, the haplotype numbers must be given as rownames; when a vector, as names. Not all possible haplotype numbers need to be present but the ones that are present must be in ascending order, no duplicates or missing values allowed. |
ploidy |
the ploidy (one number) |
a sorted vector of length ploidy with all haplotype numbers present, or a matrix where each column is such a vector, with the same colnames as hapdos
# specify haplotype dosages of 4 tetraploid individuals, # only the 3 occurring haplotypes (1, 5 and 6) are given: haplodosg <- matrix(c(1,2,1, 4,0,0, 0,4,0, 0,0,4), nrow=3, dimnames=list(paste0("demohap_", c(1,5,6)), paste0("indiv", 1:4))) # usage with hapdos as matrix: hapdos2hapcomb(hapdos=haplodosg, ploidy=4) # usage with hapdos as vector: hapdos2hapcomb(hapdos=haplodosg[, 1], ploidy=4)
# specify haplotype dosages of 4 tetraploid individuals, # only the 3 occurring haplotypes (1, 5 and 6) are given: haplodosg <- matrix(c(1,2,1, 4,0,0, 0,4,0, 0,0,4), nrow=3, dimnames=list(paste0("demohap_", c(1,5,6)), paste0("indiv", 1:4))) # usage with hapdos as matrix: hapdos2hapcomb(hapdos=haplodosg, ploidy=4) # usage with hapdos as vector: hapdos2hapcomb(hapdos=haplodosg[, 1], ploidy=4)
calculate the marker dosages resulting from haplotype dosage combinations
hapdos2mrkdos(hapdos, allhap)
hapdos2mrkdos(hapdos, allhap)
hapdos |
a matrix with one column per combination of haplotypes and one row for each possible haplotype (corresponding to the rows of allhap) with dosage of the haplotypes in each combination. A vector is interpreted as a one-column matrix; all columns must sum to ploidy. The rownames of the matrix (or names of the vector) must contain the haplotype numbers |
allhap |
a matrix as returned by allHaplotypes |
if hapdos contains NA values, all values in the corresponding column of the result will also be NA
a matrix with columns corresponding to the columns of hapdos and one row for each marker, with the dosages of each marker in each combination; colnames are the mrkdids (marker dosage IDs), rownames are the marker names taken from allhap
# get a matrix of all haplotypes with the 3 specified markers: ah <- allHaplotypes(mrknames=c("mrkA", "mrkB", "mrkC")) # specify haplotype dosages of 4 tetraploid individuals, # only the 3 occurring haplotypes (1, 5 and 6) are given: haplodosg <- matrix(c(1,2,1, 4,0,0, 0,4,0, 0,0,4), nrow=3, dimnames=list(paste0("demohap_", c(1,5,6)), paste0("indiv", 1:4))) # calculate the corresponding marker (SNP) dosages: hapdos2mrkdos(hapdos=haplodosg, allhap=ah)
# get a matrix of all haplotypes with the 3 specified markers: ah <- allHaplotypes(mrknames=c("mrkA", "mrkB", "mrkC")) # specify haplotype dosages of 4 tetraploid individuals, # only the 3 occurring haplotypes (1, 5 and 6) are given: haplodosg <- matrix(c(1,2,1, 4,0,0, 0,4,0, 0,0,4), nrow=3, dimnames=list(paste0("demohap_", c(1,5,6)), paste0("indiv", 1:4))) # calculate the corresponding marker (SNP) dosages: hapdos2mrkdos(hapdos=haplodosg, allhap=ah)
convert a haploblock-defining data frame to a list as needed by inferHaplotypes
haploblock_df2list(df, mrkcol, hbcol, sorted=TRUE)
haploblock_df2list(df, mrkcol, hbcol, sorted=TRUE)
df |
a data.frame with at least the following two columns |
mrkcol |
the name or number of the column with the markers |
hbcol |
the name or number of the column with the haploblocks |
sorted |
if TRUE (default) the haploblock list is sorted in alphabetical order of haploblock name (the markers within haploblocks are not sorted); if FALSE the haploblocks will be in order of first occurrence in df |
function inferHaplotypes needs a list where each item is a vector of all markers in one haploblock. This function produces such a list from a data.frame where the markers are in one column and the haploblocks in another column. The markers and haploblocks columns may be character, factor or numeric, and the columns may be indicated by name or number.
the desired list
df1 <- data.frame( marker=paste0("mrk",1:9), block=LETTERS[c(1,2,3,2,3,1,1,2,2)], extracol=runif(9)) haploblock_df2list(df=df1, mrkcol="marker", hbcol=2)
df1 <- data.frame( marker=paste0("mrk",1:9), block=LETTERS[c(1,2,3,2,3,1,1,2,2)], extracol=runif(9)) haploblock_df2list(df=df1, mrkcol="marker", hbcol=2)
infer haplotypes for one or more haploblocks, for all individuals, using FS family(s) (with parents) if present, and infer haplotypes for non-FS material as well
inferHaplotypes(mrkDosage, indiv=NULL, ploidy, haploblock, parents=NULL, FS=NULL, minfrac=c(0.1, 0.01), errfrac=0.025, DRrate=0.025, maxmrk=0, dropUnused=TRUE, maxparcombs=150000, minPseg=1e-8, knownHap=integer(0), progress=TRUE, printtimes=FALSE, ahcdir)
inferHaplotypes(mrkDosage, indiv=NULL, ploidy, haploblock, parents=NULL, FS=NULL, minfrac=c(0.1, 0.01), errfrac=0.025, DRrate=0.025, maxmrk=0, dropUnused=TRUE, maxparcombs=150000, minPseg=1e-8, knownHap=integer(0), progress=TRUE, printtimes=FALSE, ahcdir)
mrkDosage |
matrix or data.frame. Markers are in rows, individuals in columns, each cell has a marker dosage. Names of individuals are the column names, marker names are the row names or (if a data.frame) in a column named MarkerNames. All marker dosages must be in 0:ploidy or NA. |
indiv |
NULL (default) or a character vector with names of all individuals
to be considered. If NULL, all columns of mrkDosage are selected. |
ploidy |
all marker dosages should be in 0:ploidy or NA |
haploblock |
a list of character vectors. The names are the names of the haploblocks, the character vectors have the names of the markers in each haploblock. Haplotype names are constructed from the haploblock names, which are used as prefixes to which the (zero-padded) haplotype numbers are are appended with separator '_'. |
parents |
a matrix with one row for each FS family and two columns for the two parents, containing the names of the female and male parent of each family. |
FS |
a list of character vectors. Each character vector has the names of the individuals of one FS family. The items of the list should correspond to the rows of the parents matrix, in the same order. |
minfrac |
vector of two fractions, default 0.1 and 0.01. A haplotype is considered to be certainly present if it must occur in at least a fraction minfrac[1] of all individuals; in the final stage for the "other" individuals (those that do not belong to the FS or its parents) this fraction is lowered to minfrac[2]; see also inferHaps_noFS |
errfrac |
default 0.025. The assumed fraction marker genotypes with an error (over all markers in the haploblock). The errors are assumed to be uniformly distributed over all except the original marker dosage combinations (mrkdids) |
DRrate |
default 0.025. The rate of double reduction per meiosis (NOT per allele!); e.g. with a DRrate of 0.04, a tetraploid parent with genotype ABCD will produce a fraction of 0.04 of DR gametes AA, BB, CC and DD (each with a frequency of 0.01), and a fraction of 0.96 of the non-DR gametes AB, AC, AD, BC, BD, CD (each with a frequency of 0.16) |
maxmrk |
Haploblocks with more than maxmrk markers will be skipped. Default 0: no haploblocks are skipped |
dropUnused |
TRUE (default) if the returned matrix should only contain rows for haplotypes that are present; if FALSE matrix contains rows for all possible haplotypes |
maxparcombs |
Parent 1 and 2 both may have multiple possible haplotype combinations. For each pair of haplotype combinations (one from P1 and one from P2) the expected FS segregation must be checked against the observed. This may take a long time if many such combinations need to be checked. This parameter sets a limit to the number of allowed combinations per haploblock; default 150000 takes about 45 min. |
minPseg |
default 1e-8. The minimum P-value of a chisquared test for segregation in FS families. The best solution for an FS family is selected based on a combination of P-value and number of required haplotypes, among all candidate solutions with a P-value of at least minPseg. If no such solution is found the FS and its parents are treated as unrelated material |
knownHap |
integer vector with haplotype numbers (haplotypes that must be present according to prior inference or knowledge, numbers refer to rows of matrix produced by allHaplotypes); default integer(0), i.e. no known haplotypes |
progress |
if TRUE, and new haplotype combinations need to be calculated, and the number of markers and the ploidy are both >= 6, progress is indicated by printed messages |
printtimes |
if TRUE, the time needed to process each haploblock is printed |
ahcdir |
a single directory, or not specified.
inferHaplotypes uses lists that for each combination of marker
dosages give all possible combinations of haplotype dosages. These lists
(ahclist and ahccompletelist) are loaded and saved at the directory
specified by ahcdir. If no ahcdir is specified it is set to the current
working directory. |
First we consider the case where one or more FS families and their
parents are present in the set of samples. In that case, initially the
possible haplotype configurations of the parents are determined.
From that, all their possible gametes (assuming polysomic
inheritance) are calculated and all possible FS haplotype configurations.
Comparing this with the observed FS marker dosages the most likely parental
and FS configurations are found.
It is possible that multiple parental combinations can explain the observed
marker dosages in the FS. In that case, if one is clearly more likely and/or
needs less haplotypes, that one is chosen. If there is no clear best solution
still the parents and FS individuals that have the same haplotype
configuration over all likely solutions are assigned that configuration.
For FS where no good solution is found (because of an error in the marker
dosages of a parent, or because the correct solution was not considered) the
parents and individuals will be considered as unrelated material.
If several FS families share common parents they are treated as a group,
and only solutions are considered that are acceptable for all families
in the group.
Finally (or if no FS families are present, immediately) the other samples
are haplotyped, which are considered as unrelated material. If FS families
have been solved the haplotypes in their parents are considered "known",
and known haplotypes can also be supplied (parameter knownHap). For these
samples we consecutively add haplotypes that must be present in a minimum
number of individuals, always trying to minimize the number of needed
haplotypes.
InferHaplotypes uses tables that, for each combination of dosages of the
markers in the haploblock, list all haplotype combinations (ahc) that
result in these marker dosages. In principle inferHaplotypes uses a list
(ahccompletelist) that, for a given ploidy, has all the haplotype combinations
for haploblocks from 1 up to some maximum number of markers. This list can be
computed with function build_ahccompletelist. If this list is not available
(or is some haploblocks contain more markers than the list), the ahc for
the (extra) marker.
See the PolyHaplotyper vignette for an illustrated explanation.
a list with for each haploblock one item that itself is a list
with items:
message; if this is "" the haploblock is processed and further
elements are present; else this message says why the haploblock was
skipped (currently only if it contains too many markers)
hapdos: a matrix with the dosages of each haplotype (in rows) for each
individual (in columns). For each individual the haplotype dosages
sum to the ploidy. If dropUnused is TRUE Only the haplotypes that
occur in the population are shown, else all haplotypes
mrkdids: a vector of the mrkdid (marker dosage ID) for each individual
(each combination of marker dosages has its own ID; if any of the
markers has an NA dosage the corresponding mrkdid is also NA).
The mrkdids can be converted to the marker dosages with function
mrkdid2mrkdos.
markers: a vector with the names of the markers in the haploblock
imputedGeno: a matrix in the same format as param mrkDosage, with one row
for each marker in the haploblock and one column per imputed
individual, with the dosages of the markers. These are the individuals
that have incomplete data in mrkDosage but where the available marker
dosages match only one of the expected marker genotypes in the FS
family (only individuals in FS families are imputed). It is possible
that an individual with imputed marker dosages is not haplotyped (as
is the case for individuals with complete marker data) if the
marker dosages match different possible haplotype combinations.
The next elements are only present if one or more FS families were
specified:
FSfit: a logical vector with one element per FS family; TRUE if a (or
more than one) acceptable solution for the FS is found (although
if multiple solution are found they might not be used if unclear
which one is the best solution). (Even if no
solution was found for an FS, still its individuals may have a
haplotype combination assigned ignoring their pedigree)
FSmessages: a character vector with one item per FS family: any
message relating to the fitting of a model for that FS,
not necessarily an error
FSpval: a vector of the chi-squared P-value associated with the selected
FS model for each FS family, or the maximum P value over all
models in case none was selected
If for new combinations of marker dosages the possible haplotype combinations
have to be calculated, an ahclist file is written to ahcdir
# this example takes about 1 minute to run: data(PolyHaplotyper_small) results <- inferHaplotypes(mrkDosage=phdos, ploidy=6, haploblock=phblocks, parents=phpar, FS=phFS) names(results) names(results[[1]])
# this example takes about 1 minute to run: data(PolyHaplotyper_small) results <- inferHaplotypes(mrkDosage=phdos, ploidy=6, haploblock=phblocks, parents=phpar, FS=phFS) names(results) names(results[[1]])
convert PolyHaplotyper input data to Happy-inf format for a single haploblock
make.Happyinf.input(mrkDosage, indiv=NULL, haploblock, ploidy, fname)
make.Happyinf.input(mrkDosage, indiv=NULL, haploblock, ploidy, fname)
mrkDosage |
matrix or data.frame of allele dosages; same as input for inferHaplotypes. Markers are in rows, individuals in columns, each cell has a marker dosage. All marker dosages must be in 0:ploidy or NA |
indiv |
the names of the individuals to include in the Happy-inf input data. Default NULL includes all individuals |
haploblock |
a list of character vectors. The names are the names of the haploblocks, the character vectors have the names of the markers in each haploblock. Only the markers in haploblock will be included in the Happy-inf input data |
ploidy |
single integer: the ploidy level |
fname |
filename of a tab-separated output file: this will contain the data in Happy-inf format (the saved data.frame is also the return value). If "" no file is written |
a data.frame in the Happy-inf input format: a header row with "SNPID", "block" and names of the individuals and one row per marker, with only the individuals, markers and blocks as specified. #'SNPID has the marker names, "Block" the haploblock names. All markers of a haploblock are in contiguous rows. Missing dosages are represented by "NA"
data(PolyHaplotyper_small) HAPin <- make.Happyinf.input(mrkDosage=phdos, haploblock=phblocks, ploidy=6, fname="") HAPin[,1:8]
data(PolyHaplotyper_small) HAPin <- make.Happyinf.input(mrkDosage=phdos, haploblock=phblocks, ploidy=6, fname="") HAPin[,1:8]
convert PolyHaplotyper marker data (input) to SATlotyper format for a single haploblock.
make.SATlotyper.input(mrkDosage, indiv=NULL, markers, ploidy, phenotype=0, fname)
make.SATlotyper.input(mrkDosage, indiv=NULL, markers, ploidy, phenotype=0, fname)
mrkDosage |
matrix or data.frame of allele dosages; same as input for inferHaplotypes. Markers are in rows, individuals in columns, each cell has a marker dosage. All marker dosages must be in 0:ploidy or NA |
indiv |
the names of the individuals to include in the ShesisPlus input data. Default NULL includes all individuals |
markers |
character vector with the names of the markers in the haploblock; all must occur in mrkDosage |
ploidy |
single integer: the ploidy level |
phenotype |
vector with the phenotypes of all individuals, in order of the columns of dosmat; default 0 |
fname |
filename of a *.csv output file: this will contain the data in the SATlotyper format (the saved data.frame is also the return value). If "" no file is written |
a data.frame in the SATlotyper input format: a header row with "Genotype" and the marker names, and one row per individual with the individual name plus for each marker the genotype as a sorted string of <ploidy> A's and B's, or <ploidy> N's
data(PolyHaplotyper_small) SATin <- make.SATlotyper.input(mrkDosage=phdos, markers=phblocks[[1]], ploidy=6, fname="") head(SATin)
data(PolyHaplotyper_small) SATin <- make.SATlotyper.input(mrkDosage=phdos, markers=phblocks[[1]], ploidy=6, fname="") head(SATin)
convert PolyHaplotyper marker data (input) to ShesisPlus format for a single haploblock.
make.ShesisPlus.input(mrkDosage, indiv=NULL, markers, ploidy, phenotype=0, fname="")
make.ShesisPlus.input(mrkDosage, indiv=NULL, markers, ploidy, phenotype=0, fname="")
mrkDosage |
matrix or data.frame of allele dosages; same as input for inferHaplotypes. Markers are in rows, individuals in columns, each cell has a marker dosage. All marker dosages must be in 0:ploidy or NA |
indiv |
the names of the individuals to include in the ShesisPlus input data. Default NULL includes all individuals |
markers |
character vector with the names of the markers in the haploblock; all must occur in mrkDosage |
ploidy |
single integer: the ploidy level |
phenotype |
vector with the phenotypes of all individuals, in order of the columns of dosmat; default 0 |
fname |
filename of the output file: this will contain the data in the ShesisPlus format (the saved data.frame is also the return value). If "" (default) no file is written |
ShesisPlus needs the data formatted as: 1 column with names of
individuals, 1 column with the phenotypes (with missing values represented
as "NA"), and for each of the selected markers <ploidy> columns,
with the (unphased) marker alleles. Here we use only biallelic markers;
their alleles are indicated by 1 and 2 for the ref and alt allele,
and 0 for missing data.
The contents of file fname can be pasted into the input data box of
the ShesisPlus web interface at http://shesisplus.bio-x.cn/SHEsis.html
a data.frame in the described ShesisPlus input format
data(PolyHaplotyper_small) SSPin <- make.ShesisPlus.input(mrkDosage=phdos, markers=phblocks[[1]], ploidy=6) SSPin[1:6,1:8]
data(PolyHaplotyper_small) SSPin <- make.ShesisPlus.input(mrkDosage=phdos, markers=phblocks[[1]], ploidy=6) SSPin[1:6,1:8]
merge replicate samples in dosage matrix
mergeReplicates(mrkDosage, replist, solveConflicts=TRUE)
mergeReplicates(mrkDosage, replist, solveConflicts=TRUE)
mrkDosage |
a dosage matrix with markers in rows and individuals in columns. row names are marker names, column names are individual names. |
replist |
a list of character vectors, each of which has the sample names of a set of replicates |
solveConflicts |
if TRUE (default) and there are conflicting dosage assignments between replicates for the same marker, the one with highest frequency is used, provided the total freq of other dosages = 1 OR <= 10% of the frequency of the most frequent dosage. If solveConflicts is FALSE and there are conflicting dosages, the consensus for that marker will be NA |
This function merges all sets of replicates, each to one column.
The column name of the one retained column is the first one for that set
in replist.
For each set of replicates it calls getConsensusmrkDosage
a version of mrkDosage in which only one column of each set of replicates is retained; this column (the first in its set as specified in replist) now has the consensus scores over all replicates. Also, if mrkDosage was a data.frame, it is converted into a matrix.
# construct a dosage matrix with some missing data: dosmat <- matrix(c(rep(c(3,0,1),3), rep(c(1,1,2),4)), nrow=3, dimnames=list(c("mrk1","mrk2","mrk3"), c("a1","a2","a3", "b1","b2","b3","b4"))) ix <- matrix(c(1,1, 3,1, 2,2, 1,3, 2,4, 1,5, 2,6), ncol=2, byrow=TRUE) dosmat[ix] <- NA dosmat # define 2 sets of replicates: reps <- list(c("a1","a2","a3"), c("b1","b2","b3","b4")) # merge: mergeReplicates(mrkDosage=dosmat, replist=reps) # introduce a conflicting dosage: dosmat[3,2] <- 2 # merge: mergeReplicates(mrkDosage=dosmat, replist=reps)
# construct a dosage matrix with some missing data: dosmat <- matrix(c(rep(c(3,0,1),3), rep(c(1,1,2),4)), nrow=3, dimnames=list(c("mrk1","mrk2","mrk3"), c("a1","a2","a3", "b1","b2","b3","b4"))) ix <- matrix(c(1,1, 3,1, 2,2, 1,3, 2,4, 1,5, 2,6), ncol=2, byrow=TRUE) dosmat[ix] <- NA dosmat # define 2 sets of replicates: reps <- list(c("a1","a2","a3"), c("b1","b2","b3","b4")) # merge: mergeReplicates(mrkDosage=dosmat, replist=reps) # introduce a conflicting dosage: dosmat[3,2] <- 2 # merge: mergeReplicates(mrkDosage=dosmat, replist=reps)
get the marker dosages from mrkdids (marker dosage IDs)
mrkdid2mrkdos(dosageIDs, nmrk, ploidy, mrknames=NULL)
mrkdid2mrkdos(dosageIDs, nmrk, ploidy, mrknames=NULL)
dosageIDs |
vector of marker-dosage-combination IDs (mrkdid) |
nmrk |
character vector of marker names in the haploblock |
ploidy |
the ploidy level, a single positive integer |
mrknames |
a vector of nmrk marker names (default NULL): if not NULL these are used as rownames of the returned matrix |
a matrix with in columns the marker dosages corresponding to the marker dosageIDs, with these mrkdids as colnames, and one row per marker, with marker names as rownames if mrknames are specified
# dosages of 3 markers in 3 tetraploid individuals: mrkdosg <- matrix(c(1,2,2, 4,0,0, 3,0,2), nrow=3, dimnames=list(c("mrkA", "mrkB", "mrkC"), c("indiv1", "indiv2", "indiv3"))) # get the "marker dosage IDs": dids <- mrkdos2mrkdid(mrkDosage=mrkdosg, ploidy=4) # convert dids back to marker dosages: mrkdid2mrkdos(dosageIDs=dids, nmrk=3, ploidy=4, mrknames=c("mrkA", "mrkB", "mrkC"))
# dosages of 3 markers in 3 tetraploid individuals: mrkdosg <- matrix(c(1,2,2, 4,0,0, 3,0,2), nrow=3, dimnames=list(c("mrkA", "mrkB", "mrkC"), c("indiv1", "indiv2", "indiv3"))) # get the "marker dosage IDs": dids <- mrkdos2mrkdid(mrkDosage=mrkdosg, ploidy=4) # convert dids back to marker dosages: mrkdid2mrkdos(dosageIDs=dids, nmrk=3, ploidy=4, mrknames=c("mrkA", "mrkB", "mrkC"))
get marker dosage IDs (mrkdid) from marker dosages
mrkdos2mrkdid(mrkDosage, indiv=NULL, ploidy, check=TRUE)
mrkdos2mrkdid(mrkDosage, indiv=NULL, ploidy, check=TRUE)
mrkDosage |
matrix or data.frame. Markers are in rows, individuals in columns, each cell has a marker dosage. Names of individuals are the column names, marker names are the row names or (if a data.frame) in a column named MarkerNames. All marker dosages must be in 0:ploidy or NA. If a data.frame, additional columns may be present. |
indiv |
NULL (default) or a character vector with names of individuals to be selected. If NULL, all columns are selected; if mrkDosage is a data.frame, that is probably not what is intended. |
ploidy |
all marker dosages are checked to be in 0:ploidy or NA |
check |
if TRUE (default) checkmrkDosage is called. If FALSE it is assumed that mrkDosage is a matrix (not a data.frame) and it is not checked. |
with ploidy==1 and (of course) all dosages 0 or 1 this function returns the haplotype numbers for the haplotype specified by each column
a vector of marker dosage IDs, one for each column of mrkDosage: each a number in 1:((ploidy+1)^nrow(mrkDosage)), NA for each column in dosages where any of the dosages are NA
# dosages of 3 markers in 3 tetraploid individuals: mrkdosg <- matrix(c(1,2,2, 4,0,0, 3,0,2), nrow=3, dimnames=list(c("mrkA", "mrkB", "mrkC"), c("indiv1", "indiv2", "indiv3"))) # get the "marker dosage IDs": dids <- mrkdos2mrkdid(mrkDosage=mrkdosg, ploidy=4) # convert dids back to marker dosages: mrkdid2mrkdos(dosageIDs=dids, nmrk=3, ploidy=4, mrknames=c("mrkA", "mrkB", "mrkC"))
# dosages of 3 markers in 3 tetraploid individuals: mrkdosg <- matrix(c(1,2,2, 4,0,0, 3,0,2), nrow=3, dimnames=list(c("mrkA", "mrkB", "mrkC"), c("indiv1", "indiv2", "indiv3"))) # get the "marker dosage IDs": dids <- mrkdos2mrkdid(mrkDosage=mrkdosg, ploidy=4) # convert dids back to marker dosages: mrkdid2mrkdos(dosageIDs=dids, nmrk=3, ploidy=4, mrknames=c("mrkA", "mrkB", "mrkC"))
generate an overview of the results by haploblock and by FS family
overviewByFS(haploblock, parents, FS, hapresults)
overviewByFS(haploblock, parents, FS, hapresults)
haploblock |
a list of character vectors. The names are the names of the haploblocks, the character vectors have the names of the markers in each haploblock. |
parents |
a matrix of two colums and one row per FS, containing the names of the two parents of each FS |
FS |
a list of character vectors with the names of the samples for each FS |
hapresults |
a list as returned by inferHaplotypes, with one item per haploblock, containing at least all those in the list specified by haploblock |
a list with two items:
- ovw: an integer matrix with one row per haploblock and the following
columns:
* nmrk: the number of markers in the haploblock
* nhap: the number of different haplotypes assigned over all individuals
(NA if no solution was found for this haplotype; the reason for that is
listed in the first column of item messages of the return value)
* for each FS family a set of 6 columns:
+ parmrk (0, 1 or 2: the number of parents with complete marker data)
+ fit (0=FALSE or 1=TRUE), indicating if a solution for the FS was found
based on polysomic inheritance)
+ mrk: the number of FS progeny with complete marker data
+ imp: the number of FS progeny where complete marker data were imputed
+ hap: the number of FS progeny with assigned haplotype combinations.
hap will be less than the mrk value if the same FS marker genotype can be
produced with different combinations of haplotypes that are all compatible
with the parental haplotype combinations) or if some FS marker genotypes
cannot be produced by the inferred parental haplotype combinations
+ P: the chi-squared P-value of the best fitting solution, even if this is
discarded because of lack of fit).
* For "rest" (all individuals that are not part of the FS's or their
parents) and "all" (all individuals) there are also columns mrk and hap,
and for "all" there is also a column imp, similar to those for the
FS families. The numbers for "all" are the sums
of those for the FS families, the FS parents (some FS may share a parent
but shared parents are counted only one) and the "rest".
- messages : a character matrix with one row per haploblock and the following
columns:
* haploblock: the reason why there is no solution for the
haploblock ("" if there is a solution)
* one column for each FS family with a possible message or "". A message
can indicate a failure to find a solution for the FS family but may also
describe less significant problems, such as some progeny with unexpected
marker dosages etc.
data(PolyHaplotyper_small) overviewByFS(haploblock=phblocks, parents=phpar, FS=phFS, hapresults=phresults)
data(PolyHaplotyper_small) overviewByFS(haploblock=phblocks, parents=phpar, FS=phFS, hapresults=phresults)
pad an integer (prefix with zeroes to a fixed length)
padded(x, maxx=0)
padded(x, maxx=0)
x |
vector of non-negative integers |
maxx |
a single integer to whose nchar all x elements will be padded; if 0 (default) the largest value in x will be used |
a character vector representing the values of x left-padded with 0's to the length of integer maxx or of max(x)
padded(c(21, 1, 121, NA, 0)) padded(c(21, 1, 121, NA, 0), maxx=1000)
padded(c(21, 1, 121, NA, 0)) padded(c(21, 1, 121, NA, 0), maxx=1000)
For all haploblocks, check whether the inheritance of inferred haplotypes over the pedigree is consistent without or with allowing for double reduction (DR), and assuming polysomic inheritance.
pedigreeHapCheck(ped, mrkDosage, haploblock, hapresults)
pedigreeHapCheck(ped, mrkDosage, haploblock, hapresults)
ped |
data.frame or matrix listing the pedigree. Column 1 has names of all individuals (no duplicates, no NA), column 2 and 3 have the parents (duplicates and NA allowed, also individuals with one known parent or with two identical parents allowed). No sorting of the pedigree is needed. All parents should also be listed as individuals; if that is not the case the missing parents will be added as founders and a warning will be issued. |
mrkDosage |
a matrix with one row for each marker and one column for each individual, with the observed dosages. Individuals with one or more NA dosages are considered to have no marker data. Individual names are the colnames, no duplicates allowed; the set of individuals needs not be the same as those in ped; individual names common to mrkDosage and ped must be exactly identical (upper/lower case and whitespace included) |
haploblock |
a list of character vectors. The names are the names of the haploblocks, the character vectors have the names of the markers in each haploblock |
hapresults |
a list as returned by inferHaplotypes, with one item per haploblock, containing at least all those in the list specified by haploblock |
a list with two items:
- ped_arr: a 3D logical array with dimensions individuals, diagnostics and
haploblocks. For each individual and each haploblock there are 4
diagnostics:
* mrk: does the individual have complete marker dosage data?
* imp: were the marker dosages for this individual imputed?
* hap: is there a haplotype combination assigned to the individual?
* noDR: does the haplotype genotype of the individual match that of its
parents, assuming polysomic inheritance but no Double Reduction? NA if the
individual or both its parents do not have a haplotype genotype assigned
* withDR: as noDR, but allowing Double Reduction
- parents_arr: a 3D integer array with dimensions parents (all individuals
that occur as parents in the pedigree), diagnostics and haploblocks.
For each parent and each haploblock there are 7 diagnostics:
* par_mrkdata: 0=FALSE, 1=TRUE, does this parent have complete marker data?
* par_hapdata: 0=FALSE, 1=TRUE, does this parent have a haplotype
genotype assigned?
* totprogeny: how many first-generation progeny (children) does this parent
have (combined over all its matings, both as mother and as father)
* prog_mrkdata: how many progeny have complete marker dosage data?
* prog_hapdata: how many progeny have a haplotype combination assigned?
* nonDRmatch: how many progeny have a haplotype combination that is
compatible with their parent's haplotype combinations, assuming polysomic
inheritance but no Double Reduction
* DRmatch: as nonDRmatch, but also allowing DR
Both ped_arr and parents_arr contain all haploblocks in haploblock, also
those skipped because of too many markers and those without any haplotyped
individuals. These can be excluded by excluding them from the haploblock list.
data(PolyHaplotyper_small) phchk <- pedigreeHapCheck(ped=phped, mrkDosage=phdos, haploblock=phblocks, hapresults=phresults) # show the top of the ped_arr for haploblock 1: phchk$ped_arr[1:6,,1] # show the top of the parents_arr for haploblock 1: phchk$parents_arr[1:6,,1]
data(PolyHaplotyper_small) phchk <- pedigreeHapCheck(ped=phped, mrkDosage=phdos, haploblock=phblocks, hapresults=phresults) # show the top of the ped_arr for haploblock 1: phchk$ped_arr[1:6,,1] # show the top of the parents_arr for haploblock 1: phchk$parents_arr[1:6,,1]
convert the PedigreeSim true haplotypes to marker dosages and haplotype dosages
pedigreeSim2PH(ps_geno, haploblock, indiv=NULL, dropUnused=TRUE)
pedigreeSim2PH(ps_geno, haploblock, indiv=NULL, dropUnused=TRUE)
ps_geno |
the filename of a *_genotypes.dat file as produced by PedigreeSim, or a data.frame read from such a file (with a column marker, followed by <ploidy> columns per individual with the marker alleles for each homolog; the names of these columns must be <indivname>_<homolog number>). ps_geno should not contain missing data (this will be true unless the PedigreeSim output is midified) |
haploblock |
a list of character vectors. The names are the names of the
haploblocks, the character vectors have the names of the markers in each
haploblock. Haplotype names are constructed from the haploblock names, which
are used as prefixes to which the (zero-padded) haplotype numbers are are
appended with separator '_'. |
indiv |
the names of the individuals to be extracted; default NULL: all individuals in ps_geno. If not NULL, all indiv must be present in ps_geno |
dropUnused |
TRUE (default) if the returned matrix should only contain rows for haplotypes that are present; if FALSE matrix contains rows for all possible haplotypes |
if all alleles are in 0/1, these are kept. If only 2 allele symbols
occur in ps_geno all are converted (alphabetically) to 0/1 in
the same way (e.g. if the alleles are A and B, A -> 0 and B -> 1, even in
markers with only B's). If different allele symbols are used between
markers, per marker the (alphabetically or numerically) lowest -> 0 and
the highest -> 1.
So if more than two different allele symbols occur in ps_geno, and
one marker has only A and another only B alleles, both are converted
to 0's.
in mrkDosage, the dosage of the alleles (converted to) 1 is reported
a list with 2 items:
$mrkDosage: a matrix of marker dosages in the input format of inferHaplotypes,
with all markers that occur in haploblock, sorted according to haploblock,
and for the selected indiv in the specified order
$haplist: (only if haploblock is not NULL) a list with one element for each
haploblock, similar to the
inferHaplotypes output). Each element is itself a list, with two components:
$hapdos is a matrix with the haplotype dosages for that haploblock for each
individual
$markers: a vector with the names of
the markers in the haploblock in the output of inferHaplotypes
A list with for each haploblock the names of the markers it contains; used in manuals, stored in PolyHaplotyper_small.RData
a list with 2 character vectors
A matrix with the allele dosages for 8 SNPs in 661 samples; used in manuals, stored in PolyHaplotyper_small.RData
a data.frame with 30 rows (SNPs) and 663 columns. The first two columns contain the SNP name and contig; the remaining columns contain the allele dosages (integers in 0..6 or NA) of the SNPs in 661 samples
A list with for each FS family the names of the individuals it contains; used in manuals, stored in PolyHaplotyper_small.RData
a list with 4 vectors, each with the (here numeric) names of the FS members
A matrix with the names of the parents of each FS family; used in manuals, stored in PolyHaplotyper_small.RData
a matrix with 4 rows, 1 per FS family, and 2 columns, 1 for each parent
A data.frame with a pedigree; used in manuals, stored in PolyHaplotyper_small.RData
a data.frame with 661 rows (samples) and 4 columns: genotype, mother, father, sample_nr
A list with the results of function inferHaplotypes; used in manuals, stored in PolyHaplotyper_small.RData
a list with 2 elements, one per haploblock; each element itself a list with multiple components
read the haplotyping results from the Happy-inf output
read.Happyinf.output(file_prefix, dropUnused=TRUE)
read.Happyinf.output(file_prefix, dropUnused=TRUE)
file_prefix |
the prefix for the output files generated by Happy-inf (the value of the -o parameter) |
dropUnused |
TRUE (default) if the returned matrix should only contain rows for haplotypes that are present; if FALSE matrix contains rows for all possible haplotypes |
This function reads the <file_prefix>.stat.dat and <file_prefix>.stats.dat files. The first contains one row per SNP marker, grouped per haploblock and <ploidy> columns per individual with the SNP haplotypes, where SNP alleles are represented as 0 or 1, or NA for unknown. For one SNP, the <ploidy> alleles in an individual are all known or all unknown It is possible that of the different SNPs in a haploblock some are known and some are not; in the conversion to PolyHaplotyper format all these partially known haplotypes are made unknown. The latter file has a statistics "mismatch" and "ratio" for each individual / haploblock combination; for their meaning see the Happyinf readme file.
a list with 2 items:
$hapdos is a matrix with individuals in columns and haplotypes in rows,
giving the dosages of the haplotypes in each individual (summing to ploidy).
This is the same format as the hapdos components of the inferHaplotypes
results of PolyHaplotyper except that
$stats is a matrix with individuals in rows (matching dosmat) and two rows
named mismatch and ratio.
read the haplotyping results from the SATlotyper output
read.SATlotyper.output(fname, output="", allelecodes=c("A", "B"), sep, haploblockname="")
read.SATlotyper.output(fname, output="", allelecodes=c("A", "B"), sep, haploblockname="")
fname |
filename of an xml file produced by SATlotyper |
output |
character vector, console output of SATlotyper; this contains a table of calculated haplotypes with more info than the xml file |
allelecodes |
the codes used for the ref and alt SNP alleles in make.SATlotyper.input, default "A" and "B" |
sep |
the separator used in make.SATlotyper.input (SATlotyper cleverly identifies this separator and uses it for its output), default tab |
haploblockname |
if not "" (default) the haplotype IDs are given as <haploblockname>_<hapnr>, else just as <hapnr>, left_padded with zeroes |
The xml file is parsed using the package XML. The resulting list has 3 items named source, bootstrapping and haplotyping. In this function source is ignored and the results are obtained from haplotypings[[1]] (i.e. even if multiple haplotypings were done only the first is extracted), with the haplotype score from the bootstrapping item added to the haplotype.info.
a list with 3 items:
$hapdos is a matrix with individuals in columns and haplotypes in rows,
giving the dosages of the haplotypes in each individual (summing to ploidy).
This is the same format as the hapdos components of the inferHaplotypes
results of PolyHaplotyper
$haplotype.info is a data.frame that is a combination of information from two
or three tables:
columns Haplotype and necessity are from element
haplotypings[[1]]$haplotypes in the xml file,
column hapnr gives the PolyHaplotyper equivalents of the Haplotypes,
column score is from element bootstrapping in the xml file,
columns id, number, frequency, homozygous, necessary, distance and
neighbours are from (the console) output if present.
The haplotype.info data.frame is ordered by hapnr.
$HaplotypingScore is the single number in $haplotyping[[1]]$score.
Read the haplotyping results from the ShesisPlus output
read.ShesisPlus.output(SSPout, order.by=c("", "hapnr", "count")[2])
read.ShesisPlus.output(SSPout, order.by=c("", "hapnr", "count")[2])
SSPout |
filename of a text file with ShesisPlus output (as copied from the web page produced by running ShesisPlus web interface at http://shesisplus.bio-x.cn/SHEsis.html); or a character vector with the same output |
order.by |
how to order the rows of the hapstat data.frame. "hapnr" means ordering by haplotype number, "count" means ordering by decreasing Total.count, anything else results in no reordering. By default order by hapnr. |
If present, the markernames and the haplotype statistics are read from the file. ShesisPlus does not provide haplotype combinations for individuals
a list with 2 items: $markernames has the marker names in the markers in the haploblock (if present in the file); hapstat contains the haplotype statistics as read from the file with an additional (first) column hapnr: the haplotype numbers as defined in PolyHaplotyper. The other columns are Haplotype (as sequences of marker alleles), Total.count (of the haplotype over the whole population) and, according to an email from Zhiqiang Li of 13-04-2020) BETA: Regression coefficient, SE: Standard error, R2: Regression r-squared, T: t-distribution statistics, P: p-value
# we give a typical SSP output as character vector; instead we could also # give the name of a text file SSPout <- c( " Please cite:", "", " Shen, J. et al. SHEsisPlus, a toolset for genetic studies ...", " Shi, Y. et al. SHEsis, a powerful software platform ...", " Li, Z. et al. A partition-ligation-combination-subdivision ...", "", "if you find this tool useful in your research. Thanks!", "", "Haplotype Analysis:", "", "Haplotypes with frequency <0.03 are ignored.", "Loci chosen for haplotype analysis: m1, m2, m3, m4", "Haplotype Total count Beta SE R2 T p", "1122 2232 0.002 0.01 1.01e-04 0.252 0.8", "1222 230 -0.019 0.017 0.002 -1.15 0.25", "1221 152 -0.01 0.024 2.94e-04 -0.43 0.667", "2222 288 0.008 0.02 2.93e-04 0.429 0.667", "1121 142 -0.009 0.022 2.81e-04 -0.42 0.674" ) read.ShesisPlus.output(SSPout)
# we give a typical SSP output as character vector; instead we could also # give the name of a text file SSPout <- c( " Please cite:", "", " Shen, J. et al. SHEsisPlus, a toolset for genetic studies ...", " Shi, Y. et al. SHEsis, a powerful software platform ...", " Li, Z. et al. A partition-ligation-combination-subdivision ...", "", "if you find this tool useful in your research. Thanks!", "", "Haplotype Analysis:", "", "Haplotypes with frequency <0.03 are ignored.", "Loci chosen for haplotype analysis: m1, m2, m3, m4", "Haplotype Total count Beta SE R2 T p", "1122 2232 0.002 0.01 1.01e-04 0.252 0.8", "1222 230 -0.019 0.017 0.002 -1.15 0.25", "1221 152 -0.01 0.024 2.94e-04 -0.43 0.667", "2222 288 0.008 0.02 2.93e-04 0.429 0.667", "1121 142 -0.009 0.022 2.81e-04 -0.42 0.674" ) read.ShesisPlus.output(SSPout)
A simple interface to run SATlotyper
run.SATlotyper(path_to_SATlotyper, infile, outfile, SAT_solver="sat4j.conf")
run.SATlotyper(path_to_SATlotyper, infile, outfile, SAT_solver="sat4j.conf")
path_to_SATlotyper |
path to the folder where SATlotyper.jar and the *.conf files of the SAT-solvers are located |
infile |
name (and path) to the SATlotyper input file |
outfile |
name (and path) for the SATlotyper output file (an xml file) |
SAT_solver |
name of the *.conf file for the SAT-solver to use. Default "sat4j.conf" because this works under both Windows and Linux. |
This function issues a system command to invoke SATlotyper. Java and SATlotyper must be installed. This is just a simple interface for convenience; for more control run SATlotyper directly from the command window.
The return is a list with elements:
$cmd : the command passed to the system
$result: screen output from SATlotyper; this
contains some extra info not present in the output xml file)
The main result is the outfile
show marker and haplotype dosages for one FS family and its parents
showOneFS(FSnr, hbresults, mrkDosage, FS, parents)
showOneFS(FSnr, hbresults, mrkDosage, FS, parents)
FSnr |
the number of the FS family (indexes the FS list and parents) |
hbresults |
a list with the haplotyping results for one haploblock: one element of a list as returned by inferHaplotypes |
mrkDosage |
a matrix of marker dosages, may contain rows for more markers than only those in the current haploblock (the relevant markers are specified in the hbresults list) |
FS |
a list of which each item is a (character) vector with the names of the individuals in that FS family |
parents |
a (character) matrix with 2 columns and one row for each FS family in FS, with the names of the two parents of each family |
a list with 3 elements:
$mrkdat: a matrix with info on the marker dosages distribution in the FS. The
first two columns are for the parents: the parent name is between (brackets)
in the column name, their mrkdid (marker dosage ID) in row 1 and their marker
dosages below that. The remaining columns are for the different mrkdids
observed in the FS: the mrkdid itself in the column name, its frequency in
row 1 and its marker dosages below that. The final column gives the frequency
of individuals with one or more missing marker dosages.
$hapdat: a matrix with similar layout as mrkdat, but now with the haplotype
dosages rather than the marker dosages. Some mrkdids (columns) may not have
a haplotype dosage combination assigned (if multiple possible haplotype
combinations result in the same marker dosages)
usedhap: a matrix with the dosage (0 or 1) of each marker in each of the
used haplotypes; haplotype nrs in columns, markers in rows
data(PolyHaplotyper_small) # show the results of the first FS family in the first haploblock: showOneFS(FSnr=1, hbresults=phresults[[1]], mrkDosage=phdos, FS=phFS, parents=phpar)
data(PolyHaplotyper_small) # show the results of the first FS family in the first haploblock: showOneFS(FSnr=1, hbresults=phresults[[1]], mrkDosage=phdos, FS=phFS, parents=phpar)
calculate the total nr of possible haplotype combinations
totHapcombCount(ploidy, nmrk)
totHapcombCount(ploidy, nmrk)
ploidy |
a vector of 1 or more ploidy levels |
nmrk |
a vector of 1 or more numbers of markers per haploblock |
nmrk is used to calculate the total number of possible haplotypes = 2^nmrk. The shorter vector of ploidy and nmrk is recycled.
a vector with the number of possible haplotype combinations for each pair of ploidy and nmrk values
totHapcombCount(ploidy=4, nmrk=c(1:8))
totHapcombCount(ploidy=4, nmrk=c(1:8))
Find all haplotypes for a haploblock that were inferred to be present in the population (i.e. all haplotypes used for haplotyping any of the individuals)
usedhap(hapresults, haploblock)
usedhap(hapresults, haploblock)
hapresults |
list as returned by inferhaplotypes, or one element of such a list (i.e. the results for one haploblock) |
haploblock |
if hapresults is one element of the return value of inferHaplotypes, haploblock should be missing of NULL; else haploblock is a single value indicating the haploblock: either its name of its index in hapresults |
This function works with the results of inferHaplotypes; the setting of dropUnused does not affect this function
an array with the haplotypes that are used in the population. The haplotypes are in columns, with the haplotype numbers as colnames; the markers are in rows.
data(PolyHaplotyper_small) # show the composition of haplotypes inferred to be present # in the first haploblock: usedhap(hapresults=phresults, haploblock=1)
data(PolyHaplotyper_small) # show the composition of haplotypes inferred to be present # in the first haploblock: usedhap(hapresults=phresults, haploblock=1)