
Controlling for Confounding Variables in Polygenic Linkage Analysis
Source:vignettes/covariates-deconfounding.Rmd
covariates-deconfounding.Rmd1. Why Control for Confounders in Polygenic Linkage Analysis
Confounding variables are factors that influence both the genetic architecture and the trait being studied, potentially leading to spurious associations. In polygenic pathway enrichment analysis, common confounders include:
- Age: Gene expression and methylation patterns change with age
- Sex: X-chromosome inactivation and sex-specific genetic effects
- Ancestry: Population stratification can create LD patterns unrelated to trait biology
Without accounting for these factors, pathway enrichment tests may:
- Inflate false positive rates: Genes correlated with confounders appear enriched
- Mask true signals: Real pathway associations may be obscured by confounder-induced noise
- Produce irreproducible results: Confounder-driven associations rarely replicate
polylinkR addresses this through a local quadratic regression approach that:
- Calculates Mahalanobis distances between genes in covariate space
- Applies kernel weighting to identify genetically proximal genes with similar covariate profiles
- Regresses out confounder effects while preserving biological signal
2. Creating Covariate Dataframes with Sample IDs Matching Genetic Data
The Anatolia_EF_CLR dataset contains gene-level
statistics but no covariate columns. For this vignette, we’ll create
synthetic confounders that simulate real-world scenarios.
First, let’s examine the structure of the data:
# Load the example data
data(Anatolia_EF_CLR)
# Check the structure
str(Anatolia_EF_CLR)
head(Anatolia_EF_CLR)Key columns for covariate integration:
-
objID: Unique gene identifier (must match between obj.info and covariates) -
objStat: Gene-level test statistic (the score to be deconfounded) -
chr,startpos,endpos: Genomic coordinates (required for deconfounding)
Now, let’s create synthetic covariates that simulate age, sex, and ancestry effects:
# Create a copy of the data for modification
obj_info <- copy(Anatolia_EF_CLR)
# Set seed for reproducibility
set.seed(42)
n_genes <- nrow(obj_info)
# Cov1: Age-related methylation proxy
# Genes on certain chromosomes tend to have age-related expression changes
obj_info[, Cov1 := rnorm(.N, mean = 0, sd = 1)]
# Add chromosome-specific patterns (simulating age effects)
obj_info[chr %in% c(1, 3, 5), Cov1 := Cov1 + 0.5]
obj_info[chr %in% c(2, 4, 6), Cov1 := Cov1 - 0.3]
# Cov2: Sex chromosome dosage
# X chromosome genes have different expression patterns
obj_info[, Cov2 := rnorm(.N, mean = 0, sd = 1)]
obj_info[chr == 23, Cov2 := Cov2 + 1.5] # X chromosome enrichment
obj_info[chr == 24, Cov2 := Cov2 - 1.0] # Y chromosome depletion
# Cov3: Ancestry principal component proxy
# Spatial autocorrelation simulating population structure
obj_info[, gene_mid := (startpos + endpos) / 2]
obj_info[, Cov3 := rnorm(.N, mean = 0, sd = 1)]
# Create regional patterns
for (chrom in unique(obj_info$chr)) {
chr_genes <- which(obj_info$chr == chrom)
if (length(chr_genes) > 10) {
# Smooth spatial pattern
positions <- obj_info$gene_mid[chr_genes]
sorted_idx <- order(positions)
trend <- sin(positions[sorted_idx] / max(positions) * 2 * pi)
obj_info$Cov3[chr_genes[sorted_idx]] <- obj_info$Cov3[chr_genes[sorted_idx]] + trend
}
}
# Remove temporary column
obj_info[, gene_mid := NULL]
# Display the new covariate columns
head(obj_info[, c("objID", "objName", "Cov1", "Cov2", "Cov3")])
# Summary statistics
summary(obj_info[, .(Cov1, Cov2, Cov3)])Now let’s simulate how these covariates might confound the gene scores:
# Add confounding effect to gene scores
# In real data, these effects are already present in the objStat values
# Here we demonstrate by adding synthetic confounding
obj_info[, objStat_original := objStat]
# Add covariate effects to simulate confounded data
obj_info[, objStat := objStat +
0.3 * Cov1 + # Age effect
0.2 * Cov2 + # Sex effect
0.25 * Cov3] # Ancestry effect
# Check correlation between covariates and scores
print("Correlations with confounded scores:")
cor_matrix <- cor(obj_info[, .(objStat, Cov1, Cov2, Cov3)])
print(cor_matrix)3. Using the covar Parameter in
plR_permute() and plR_rescale()
In polylinkR, covariates are detected automatically from columns with
the Cov prefix in the obj.info file. The
deconfounding process happens in plR_permute() through
these key parameters:
-
md.meth: Mahalanobis distance method ("robust"for rank-based,"raw"for Pearson) -
kern.bound: Genomic distance flanking region to exclude from kernel weights -
kern.func: Kernel function ("normal","exponential", or"inverse") -
kern.scale: Scaling parameter for kernel decay -
kern.wt.max: Maximum weight for any single gene
Let’s prepare the polylinkR input with the covariate-enriched
obj.info:
# Load additional required data
data(PolyLinkR_SetInfo)
data(PolyLinkR_SetObj)
# Create temporary directory for input files
temp_dir <- tempfile("polylinkR_covar_demo")
dir.create(temp_dir, recursive = TRUE)
# Write files with covariate columns
fwrite(obj_info, file.path(temp_dir, "ObjInfo.txt"), sep = "\t")
fwrite(PolyLinkR_SetInfo, file.path(temp_dir, "SetInfo.txt"), sep = "\t")
fwrite(PolyLinkR_SetObj, file.path(temp_dir, "SetObj.txt"), sep = "\t")
# Read the data into polylinkR
plr_obj <- plR_read(
input.path = temp_dir,
verbose = TRUE
)
# Verify the object structure
print(plr_obj)4. Example with Synthetic Confounders on Anatolia_EF_CLR Data
4.1 Running Analysis WITH Covariate Control
# Run deconfounding and permutation with covariate control
# Note: Using reduced permutations for demonstration (normally use 1e5 or more)
plr_with_covar <- plR_permute(
plR.input = plr_obj,
permute = TRUE,
n.perm = 1e4, # Reduced for demo; use 1e5+ for real analysis
n.boot = 10, # Reduced for demo
md.meth = "robust", # Use rank-based robust method
kern.bound = 1e5, # 100kb flanking region
kern.func = "normal", # Gaussian kernel
kern.scale = 2, # Standard decay
kern.wt.max = 0.05, # Max 5% weight for any gene
seed = 42,
verbose = TRUE,
n.cores = 1
)For this vignette, we’ll demonstrate the setup:
# Check that covariates are properly detected
obj_info_check <- copy(plr_obj$obj.info)
# Identify covariate columns
cov_names <- grep("^Cov[0-9]+$", names(obj_info_check), value = TRUE)
cat("Detected covariate columns:", paste(cov_names, collapse = ", "), "\n")
cat("Number of covariates:", length(cov_names), "\n")4.2 Running Analysis WITHOUT Covariate Control
To run without covariate control, simply exclude the Cov
columns from obj.info or set permute = FALSE
with covariates removed:
# Create obj.info without covariate columns
obj_info_no_covar <- obj_info[, .(objID, objStat = objStat_original, objName,
GeneLength, chr, startpos, endpos)]
# Write to temporary directory
temp_dir2 <- tempfile("polylinkR_no_covar")
dir.create(temp_dir2, recursive = TRUE)
fwrite(obj_info_no_covar, file.path(temp_dir2, "ObjInfo.txt"), sep = "\t")
fwrite(PolyLinkR_SetInfo, file.path(temp_dir2, "SetInfo.txt"), sep = "\t")
fwrite(PolyLinkR_SetObj, file.path(temp_dir2, "SetObj.txt"), sep = "\t")
# Read without covariates
plr_no_covar <- plR_read(
input.path = temp_dir2,
verbose = TRUE
)
# Check that no covariates are detected
cat("Covariate columns in no-covar object:",
length(grep("^Cov[0-9]+$", names(plr_no_covar$obj.info))), "\n")5. Interpreting Results With and Without Covariate Control
When comparing results with and without covariate control, look for these patterns:
5.1 Gene Score Changes
After deconfounding, examine how gene scores changed:
# Extract deconfounded scores (would be available after running plR_permute)
# deconfounded_scores <- plr_with_covar$obj.info$objStat_dc
# original_scores <- plr_obj$obj.info$objStat
# Plot comparison
# plot(original_scores, deconfounded_scores,
# xlab = "Original Scores", ylab = "Deconfounded Scores",
# main = "Impact of Covariate Adjustment")
# abline(0, 1, col = "red", lty = 2)5.2 Key Metrics to Compare
| Metric | Without Covariates | With Covariates | Interpretation |
|---|---|---|---|
| Pathway p-values | May be inflated | Generally more conservative | Confounders can create artificial enrichment |
| Significant pathways | Potentially many false positives | Fewer but more reliable | Covariates reduce spurious associations |
| Gene score variance | Higher | Reduced (confounder component removed) | Deconfounding removes systematic variation |
5.3 Diagnostic Plots
When running with verbose = TRUE, polylinkR generates
diagnostic information about:
- Covariate correlations: How strongly each covariate correlates with gene scores
- Mahalanobis distances: Distribution of distances in covariate space
- Kernel weights: How genes are weighted in the local regression
- Deconfounding statistics: Model fit and residual patterns
6. Best Practices for Selecting Covariates
6.1 What to Include
Essential covariates: - Age (continuous or binned) - Sex (binary or factor) - Ancestry principal components (typically PC1-PC5) - Technical factors (batch, platform, sequencing depth)
Optional covariates: - Cell type proportions (for bulk tissue) - BMI, smoking status (if relevant to trait) - Sample collection time/season
6.2 What to Avoid
- The trait of interest itself: Never include the outcome variable as a covariate
- Mediators: Variables on the causal pathway between genes and trait
- Colliders: Variables influenced by both genes and trait
- Too many PCs: Including >10 ancestry PCs often captures noise rather than structure
6.3 Covariate Preprocessing
# Example preprocessing pipeline
preprocess_covariates <- function(cov_data) {
# 1. Handle missing values
cov_data <- na.omit(cov_data)
# 2. Standardize continuous covariates
cont_cols <- c("Age", "BMI", "PC1", "PC2", "PC3")
for (col in cont_cols) {
if (col %in% names(cov_data)) {
cov_data[[col]] <- scale(cov_data[[col]])[, 1]
}
}
# 3. Encode categorical variables
if ("Sex" %in% names(cov_data)) {
cov_data$Sex <- as.numeric(factor(cov_data$Sex)) - 1 # 0/1 encoding
}
# 4. Remove collinear covariates
cor_matrix <- cor(cov_data[, -1]) # Exclude ID column
high_cor <- which(abs(cor_matrix) > 0.9 & upper.tri(cor_matrix), arr.ind = TRUE)
if (nrow(high_cor) > 0) {
cat("Highly correlated covariate pairs detected:\n")
for (i in seq_len(nrow(high_cor))) {
cat(names(cov_data)[high_cor[i, 1] + 1], "~",
names(cov_data)[high_cor[i, 2] + 1], "\n")
}
}
return(cov_data)
}6.4 Naming Convention
polylinkR automatically detects covariates using the pattern
Cov[0-9]+. Ensure your covariate columns follow this naming
scheme:
# Example: Proper naming for polylinkR
obj_info_named <- data.table(
objID = 1:100,
objStat = rnorm(100),
Cov1 = rnorm(100), # Age
Cov2 = rbinom(100, 1, 0.5), # Sex
Cov3 = rnorm(100), # PC1
Cov4 = rnorm(100), # PC2
chr = sample(1:22, 100, replace = TRUE),
startpos = sample(1e6:1e8, 100),
endpos = sample(1e6:1e8, 100)
)
# Order doesn't matter, but the prefix must be exactly "Cov"
head(obj_info_named)6.5 Parameter Tuning Guidelines
| Parameter | Default | When to Adjust | Conservative Setting |
|---|---|---|---|
md.meth |
"robust" |
Use "raw" for normal-distributed covariates |
"robust" (always recommended) |
kern.bound |
1e5 | Increase for gene-dense regions | 2e5 or higher |
kern.scale |
2 | Decrease for tighter local regression | 1 |
kern.wt.max |
0.05 | Decrease to limit influence of single genes | 0.02 or 0.01 |
Session Information
sessionInfo()
# Clean up temporary directories
unlink(temp_dir, recursive = TRUE)
unlink(temp_dir2, recursive = TRUE)References
Population Stratification: Price AL, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics.
Batch Effects: Leek JT, et al. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics.
Local Regression: Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association.
Mahalanobis Distance: Mahalanobis PC (1936) On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India.