Reads and validates the input files for a polylinkR analysis. Handles
file path resolution, column name canonicalisation, data validation, and
optional filtering and data generation tasks like coordinate conversion or
gene score calculation.
Usage
plR_read(
  input.path = NULL,
  obj.info.path = NULL,
  set.info.path = NULL,
  set.obj.path = NULL,
  var.info.path = NULL,
  rec.rate.path = NULL,
  min.set.n = 2L,
  max.set.n = Inf,
  group = NULL,
  map.fun = "kosambi",
  obj.buffer = "auto",
  obj.stat.fun = "non.param",
  bin.size = 250L,
  obj.in = NULL,
  obj.out = NULL,
  set.in = NULL,
  set.out = NULL,
  set.merge = 0.95,
  rem.genes = FALSE,
  verbose = TRUE
)
Arguments
- input.path
character; path to the directory with input files. Defaults to NULL. Not required if using separate file paths. Automatically searches for required files with the following labels:
  - setinfo: a data frame with set-level information.
  - objinfo: a data frame with object-level information.
  - setobj: a data frame mapping objects to sets.
It also searches for optional files:
  - recrate: an optional file for genetic coordinate conversion.
  - varinfo: an optional file for gene score generation.
Allowable file names include capitalisation of each word in the label and use of internal separators (e.g., set.info, set_info, SetInfo, Set.Info, Set_Info). Also see the section on required and optional input files for the columns needed in each input file.
- obj.info.path
character; path to the obj.info file. Defaults to NULL. Ignored if input.path is provided; otherwise all required file paths must be specified.
- set.info.path
character; path to the set.info file. Defaults to NULL. Ignored if input.path is provided; otherwise all required file paths must be specified.
- set.obj.path
character; path to the set.obj file. Defaults to NULL. Ignored if input.path is provided; otherwise all required file paths must be specified.
- var.info.path
character; path to the var.info file. Defaults to NULL. Optional file used for gene score generation. Ignored if input.path is provided.
- rec.rate.path
character; path to the rec.rate file. Defaults to NULL. Optional file used for genetic coordinate conversion. Ignored if input.path is provided.
- min.set.n
integer; minimum size of gene sets to be retained. Defaults to 2. Must be in the range [2L, max.set.n).
- max.set.n
integer; maximum size of gene sets to be retained. Defaults to Inf. Must be in the range (min.set.n, Inf).
- group
character; label used to identify input files within a directory. Defaults to NULL.
- map.fun
character; mapping function to convert physical to genetic distances. Options are "Haldane", "Kosambi" (default), "Carter-Falconer", and "Morgan".
- obj.buffer
numeric; interval around genes (in base pairs) to include when assigning values from the var.info file. Defaults to 1e4 if var.info is provided and user does not set a value, otherwise it is set to 0 if score assignment is not performed. User values must be in the range [0, 1e5L]. Note that if the user provides their own gene scores in the obj.info input file (i.e. not computed from the var.info file), then the start and end positions must include any buffer used to bin scores, otherwise polylinkR deconfounding and autocorrelation inference will not be performed appropriately.
- obj.stat.fun
character; function used to correct maximum gene scores based on the number of overlapping summary statistics (SNPs or windows). Default is non.param, a robust non-parametric method that uses binned data to calculate median and median absolute deviation (MAD) to normalise scores. Alternatively, lm.logN applies a linear regression to the log-transformed SNP / bin counts (assumes a roughly linear relationship is appropriate). In both cases, expected scores are estimated and gene scores calculated as the residual value. Ignored if no var.info file is provided.
- bin.size
integer; gene set size interval for non-parametric correction. Defaults to 250L. Must be in the range [50L, 1e3L]; ignored if the parametric function is used.
- obj.in
character or numeric vector; objIDs of genes to explicitly retain. Defaults to NULL.
- obj.out
character or numeric vector; objIDs of genes to explicitly remove. Defaults to NULL.
- set.in
character or numeric vector; setIDs of gene sets to explicitly retain. Defaults to NULL.
- set.out
character or numeric vector; setIDs of gene sets to explicitly remove. Defaults to NULL.
- set.merge
numeric; minimum proportion of shared genes for merging gene sets. Defaults to 0.95. Must be in the range (0, 1].
- rem.genes
logical; should genes with identical genomic positions be removed? Defaults to FALSE.
- verbose
logical; should progress messages be printed to the console? Defaults to TRUE.
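The file name rule described under input.path (a label such as setinfo, with optional capitalisation and an optional "." or "_" between the two words) can be illustrated with a small pattern match. This is a minimal sketch only: matches_label is a hypothetical helper, not a polylinkR function, and the pattern is an assumption about how such labels could be detected, not the package's actual search logic. Note that ignore.case = TRUE is more permissive than the documented capitalisation rule.

```r
# Hypothetical helper (not part of polylinkR): does a file name carry a label?
# Allows an optional "." or "_" between the two words of the label and
# ignores case, which is looser than "capitalisation of each word".
matches_label <- function(file.names, label.words) {
  pattern <- paste0(label.words[1], "[._]?", label.words[2])
  grepl(pattern, file.names, ignore.case = TRUE)
}

# Invented file names for illustration
candidates <- c("set.info.csv", "Set_Info.tsv", "SetInfo.csv",
                "POP1_setobj.csv", "notes.txt")

is.set.info <- matches_label(candidates, c("set", "info"))
is.set.obj  <- matches_label(candidates, c("set", "obj"))

candidates[is.set.info]  # names that would count as a set.info file
candidates[is.set.obj]   # names that would count as a set.obj file
```

Checking file names this way before calling plR_read(input.path = ...) may help spot a misnamed file early; the function itself remains the authority on which names it accepts.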
Value
A plR S3 object containing three complementary datasets:
- obj.info: data.table containing information for each gene (object).
- set.info: data.table containing information for each gene set.
- set.obj: data.table containing the mapping of genes to gene sets.
All plR S3 objects include auxiliary information and summaries as
attributes:
- plr.data: Reusable datasets and parameters, including GPD fitting results and autocovariance estimation.
- plr.args: Argument settings used in the function.
- plr.session: R session information and function run time.
- plr.track: Internal tracking number indicating functional steps.
- class: S3 object class (i.e., plr).
Each attribute can be accessed using attr() or attributes().
The first four attribute classes aggregate information over successive
functions. For example, to access the plr.data attribute for the
plR object output after running plR_permute, use
attributes(X)$plR.data$permute.data, where X is the object
name. Similarly, the arguments used in plR_read are in
attributes(X)$plR.args$read.args.
The primary data structure of the plR object can be accessed using
print() or by simply typing the object's name.
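As a minimal sketch of attribute access with base R, the mock object below stands in for plR_read output. Its contents are invented, and the attribute name plr.args simply follows the attribute list in this section; on a real object, names(attributes(X)) shows the exact names available.

```r
# Mock object standing in for plR_read() output; all contents are invented.
X <- structure(
  list(obj.info = data.frame(objID = c("g1", "g2"))),
  plr.args = list(read.args = list(min.set.n = 2L, set.merge = 0.95)),
  class = "plr"
)

# List the attribute names actually present on the object
attr.names <- names(attributes(X))

# Two equivalent ways to reach the stored plR_read arguments
read.args     <- attr(X, "plr.args")$read.args
read.args.alt <- attributes(X)$plr.args$read.args

read.args$set.merge
```

attr() retrieves a single attribute by name, while attributes() returns all of them as a list, which is convenient for browsing nested entries with $.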
Required input file structure
The following three files are compulsory for a polylinkR analysis
and extend the format introduced in Polysel. All files must be comma-
separated (.csv) or tab-separated (.tsv), with a header
(see examples at https://github.com/CMPG/polysel/tree/master/data).
- set.info
A data.table (and data.frame) with gene set information.
  - setID: character or factor vector of unique gene set identifiers. Required.
- obj.info
A data.table (and data.frame) with gene (object) information.
  - objID: character or factor vector of unique gene identifiers. Required.
  - objStat: optional numeric vector of pre-computed gene scores. Required if var.info is absent; otherwise computed from var.info.
  - CovX: numeric vector of covariate scores, where X is a positive integer denoting covariate number. Required for deconfounding gene scores (objStat) in plR_permute.
  - chr: character or numeric chromosome / contig labels. Required for gene score deconfounding and gene set score decorrelation in plR_permute and plR_rescale, respectively.
  - startpos: numeric gene start position in base pairs. If the user is providing their own gene scores, then this position represents the original position minus any buffer used to assign summary statistics to genes. Required for gene score deconfounding and gene set score decorrelation in plR_permute and plR_rescale, respectively.
  - endpos: numeric gene end position in base pairs. If the user is providing their own gene scores, then this position represents the original position plus any buffer used to assign summary statistics to genes. Required for gene score deconfounding and gene set score decorrelation in plR_permute and plR_rescale, respectively.
  - startpos.base: numeric gene start position in base pairs. Default lower gene boundary (ignoring buffer used in gene assignment). Created internally if absent or if gene scores are estimated (i.e. var.info provided). Required for gene score deconfounding and set score decorrelation in plR_permute and plR_rescale, respectively.
  - endpos.base: numeric gene end position in base pairs. Default upper gene boundary (ignoring buffer used in gene assignment). Created internally if absent or if gene scores are estimated (i.e. var.info provided). Required for gene score deconfounding and set score decorrelation in plR_permute and plR_rescale, respectively.
- set.obj
A data.table (and data.frame) mapping genes to gene sets.
  - setID: character or factor vector of gene set identifiers. Required.
  - objID: character or factor vector of gene identifiers. Required.
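The three required files can be mocked up in base R to check the column layout before running an analysis. The sketch below uses invented gene and set identifiers, scores, and positions, and assumes file names of the form set.info.csv; it writes the files to a temporary directory and runs a few consistency checks that follow from the column descriptions above (unique identifiers, and every mapped ID present in its info table).

```r
# Temporary directory for the demo files (assumed layout, invented data)
in.dir <- file.path(tempdir(), "plr_input_demo")
dir.create(in.dir, showWarnings = FALSE)

set.info <- data.frame(setID = c("set1", "set2"))

obj.info <- data.frame(
  objID    = c("g1", "g2", "g3"),
  objStat  = c(0.8, -0.3, 1.5),     # pre-computed gene scores
  chr      = c("1", "1", "2"),
  startpos = c(1000, 50000, 2000),  # base pairs
  endpos   = c(9000, 62000, 15000)
)

set.obj <- data.frame(
  setID = c("set1", "set1", "set2", "set2"),
  objID = c("g1", "g2", "g2", "g3")
)

# Consistency checks implied by the column descriptions
stopifnot(
  anyDuplicated(set.info$setID) == 0,
  anyDuplicated(obj.info$objID) == 0,
  all(set.obj$setID %in% set.info$setID),
  all(set.obj$objID %in% obj.info$objID),
  all(obj.info$startpos <= obj.info$endpos)
)

# Write comma-separated files with a header row
write.csv(set.info, file.path(in.dir, "set.info.csv"), row.names = FALSE)
write.csv(obj.info, file.path(in.dir, "obj.info.csv"), row.names = FALSE)
write.csv(set.obj,  file.path(in.dir, "set.obj.csv"),  row.names = FALSE)

written <- list.files(in.dir)
```

Each mock set contains two genes, so both would pass the default min.set.n = 2L filter. The directory in.dir could then be supplied as input.path.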
Optional input file structure
These optional comma-separated (.csv) or tab-separated (.tsv)
files provide the following plR_read functionality:
- var.info
Contains information used to compute gene scores (objStat).
  - chr: character or numeric chromosome / contig labels. Required.
  - pos: numeric position where the statistic was evaluated (e.g., SNP or central point in window). Required.
  - value: numeric statistical value from SNP / window score. Required.
- rec.rate
Used to convert gene coordinates from physical to genetic distances. Requires startpos and endpos in obj.info to be base pair coordinates. Uses HapMap format (see labelled examples at https://zenodo.org/records/11437540).
  - chr: character or numeric chromosome / contig labels. Required.
  - pos: numeric base pair position of upstream marker for the recombination interval. Required.
  - rate: numeric recombination rate in cM per bp in the downstream interval. Required.
  - map: numeric genetic distance in centiMorgans (cM). Optional; will be calculated from recombination rates if absent.
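To make the relationship between the rate and map columns concrete, the sketch below accumulates a genetic map from per-interval rates and then interpolates a gene position onto it. It is an illustration under stated assumptions, not the routine used inside plR_read: the data are invented, a single chromosome is assumed, rates are taken as cM per bp as described above, and no mapping function (map.fun) is applied.

```r
# Invented recombination map for one chromosome
rec.rate <- data.frame(
  chr  = "1",
  pos  = c(1000, 2000, 4000, 7000),  # upstream marker of each interval (bp)
  rate = c(1e-6, 2e-6, 1e-6, 0)      # cM per bp in the downstream interval
)

# Length of each interval between consecutive markers (bp)
interval.bp <- diff(rec.rate$pos)

# Genetic length of each interval: rate at the upstream marker x length
interval.cM <- head(rec.rate$rate, -1) * interval.bp

# Cumulative genetic position (cM) at each marker, starting from 0
rec.rate$map <- c(0, cumsum(interval.cM))

# Linear interpolation of a gene start position (bp) onto the cM scale
gene.start.bp <- 3000
gene.start.cM <- approx(rec.rate$pos, rec.rate$map, xout = gene.start.bp)$y
```

With several chromosomes, the same steps would need to be run separately within each chr value so that the cumulative map restarts at every chromosome.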
Examples
if (FALSE) { # \dontrun{
  # Example 1: Read all files from a single folder
  my_plr <- plR_read(input.path = "path/to/files")

  # Example 2: Read all files with "POP1" in label from a single folder
  my_plr <- plR_read(
    input.path = "path/to/files",
    group = "POP1"
  )

  # Example 3: Relax gene set merging criteria
  my_plr <- plR_read(
    input.path = "path/to/files",
    set.merge = 0.50
  )

  # Example 4: Specify separate file paths and remove user-specified sets
  my_plr <- plR_read(
    set.info.path = "path/to/set.info",
    set.obj.path = "path/to/set.obj",
    obj.info.path = "path/to/obj.info",
    set.out = c("set1", "set2")
  )

  # Example 5: Generate gene scores from var.info using regression
  # and include a 50 kb buffer around genes
  my_plr <- plR_read(
    set.info.path = "path/to/set.info",
    set.obj.path = "path/to/set.obj",
    obj.info.path = "path/to/obj.info",
    var.info.path = "path/to/var.info",
    obj.stat.fun = "lm.logN",
    obj.buffer = 5e4
  )

  # Example 6: Convert distances to cM using rec.rate file
  my_plr <- plR_read(
    set.info.path = "path/to/set.info",
    set.obj.path = "path/to/set.obj",
    obj.info.path = "path/to/obj.info",
    rec.rate.path = "path/to/rec.rate"
  )
} # }
