
Parallel Processing for Performance
Source: vignettes/parallel-processing.Rmd

Introduction to Parallel Processing in polylinkR
polylinkR performs gene-based pathway enrichment analysis using a permutation approach that accounts for linkage disequilibrium (LD) structure. The permutation step—which generates thousands of bootstrap iterations to create null distributions—is embarrassingly parallel. This means each permutation iteration is independent and can be computed simultaneously on multiple CPU cores.
Why Parallelize?
The bootstrap iterations in plR_permute() involve:
- Randomly sampling genes while preserving LD structure
- Computing pathway scores for each sample
- Accumulating statistics across iterations
These operations are independent across iterations, making them ideal candidates for parallelization. The performance gains can be substantial:
| Dataset Size | Iterations | Sequential Time | Parallel (8 cores) | Speedup |
|---|---|---|---|---|
| Small | 1,000 | 30 sec | 8 sec | 3.8× |
| Medium | 10,000 | 5 min | 45 sec | 6.7× |
| Large | 100,000 | 45 min | 6 min | 7.5× |
Note: Benchmarks are approximate and vary by hardware.
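The independence of iterations can be illustrated with a toy foreach loop. This is a minimal sketch, not polylinkR's internal code: each iteration depends only on its own random draw, so `%dopar%` can hand iterations to different workers.

```r
library(foreach)
library(doFuture)

plan(multisession, workers = 2)
registerDoFuture()

# Each iteration is a stand-in for one permutation's pathway score;
# seed = TRUE requests parallel-safe random number streams.
null_stats <- foreach(i = 1:100, .combine = c,
                      .options.future = list(seed = TRUE)) %dopar% {
  mean(rnorm(50))
}

plan(sequential)
length(null_stats)  # one null value per iteration
```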
What Operations Benefit Most?
- plR_permute(): the primary beneficiary; bootstrap iterations are distributed across cores
- Large gene sets: more genes = more computation per iteration = greater parallel efficiency
- High iteration counts: 10,000+ permutations show the best speedup
Operations that don’t benefit from parallelization:
- File I/O in plR_read()
- Single-threaded GPD fitting
- Memory-bound rescaling operations
Setting Up Parallel Backend
polylinkR uses the future framework for parallel processing, which provides a unified interface across different parallel backends. The recommended approach uses doFuture with foreach.
Required Packages
# Install required packages (run once)
install.packages(c("foreach", "doFuture", "progressr"))

Creating and Registering a Cluster
The standard workflow involves:
- Plan the future: select the parallel backend
- Register with foreach: enable %dopar% operations
- Run parallel code: execute polylinkR functions
- Clean up: reset to sequential mode
# Load required libraries
library(foreach)
library(doFuture)
library(progressr)
# Check available cores
ncores <- parallel::detectCores()
cat("Available cores:", ncores, "\n")
# Plan parallel execution with multisession (works on all platforms)
# Best practice: leave 1 core for the OS
workers <- max(1, ncores - 1)
cat("Using workers:", workers, "\n")

Platform-Specific Backend Selection
Windows:
- Use plan(multisession) for process-based parallelism
- Avoid multicore (forking); it is not supported on Windows and silently falls back to sequential execution

Linux/Mac:
- plan(multicore) offers lower overhead via forking
- plan(multisession) also works and is more stable
Best Practices for Core Selection
| Total Cores | Recommended Workers | Notes |
|---|---|---|
| 4 | 3 | Leave 1 for OS and UI |
| 8 | 6-7 | Good for most workstations |
| 16+ | 12-14 | Leave cores for other tasks |
| 32+ (HPC) | 28-30 | Consider memory limits |
Key guideline: Always leave at least 1 core free for the operating system and other processes. This prevents system lag and memory swapping.
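The guideline above can be sketched in a few lines. Note that `parallelly::availableCores()` (an assumption here; it ships as a dependency of the future package) respects scheduler and cgroup limits on shared systems, unlike `parallel::detectCores()`:

```r
library(future)

# Leave one core for the OS, but never go below one worker.
workers <- max(1, parallelly::availableCores() - 1)
plan(multisession, workers = workers)
```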
Basic Parallel Example
Sequential vs Parallel Comparison
Let’s compare sequential and parallel execution using a small example:
# Load example data
data_path <- system.file("extdata", "tiny_polylinkR", package = "polylinkR")
plr_obj <- plR_read(data_path, verbose = FALSE)
# Sequential execution (default)
cat("Running sequential...\n")
t1 <- system.time({
result_seq <- plR_permute(
plr_obj,
N.SNPs = 100, # Small number for quick demo
N.iter = 100,
verbose = FALSE
)
})
cat("Sequential time:", t1["elapsed"], "seconds\n")

Parallel Execution with %dopar%
To enable parallel processing in polylinkR, you need to:
- Plan the future backend
- Register with foreach
- Call polylinkR functions (they'll use %dopar% internally)
# Set up parallel backend
plan(multisession, workers = 2) # Use 2 workers for vignette
doFuture::registerDoFuture()
# Parallel execution
cat("Running parallel...\n")
t2 <- system.time({
handlers(global = TRUE)
result_par <- plR_permute(
plr_obj,
N.SNPs = 100,
N.iter = 100,
verbose = FALSE
)
})
cat("Parallel time:", t2["elapsed"], "seconds\n")
# Calculate speedup
speedup <- t1["elapsed"] / t2["elapsed"]
cat("Speedup:", round(speedup, 2), "×\n")
# Reset to sequential
plan(sequential)

Monitoring Progress
When running long parallel jobs, progress bars help monitor execution:
# Enable global progress handlers
handlers(global = TRUE)
handlers("txtprogressbar") # Text-based progress bar
# Run with progress reporting
result <- plR_permute(
plr_obj,
N.SNPs = 1000,
N.iter = 10000,
verbose = TRUE # Enables progress output
)

For silent/background execution:
# Disable progress for batch jobs
handlers(global = FALSE)
result <- plR_permute(
plr_obj,
N.SNPs = 1000,
N.iter = 10000,
verbose = FALSE
)

Performance Benchmarks
Benchmarking Script
Here’s a template for benchmarking your specific hardware:
benchmark_parallel <- function(plr_obj, iterations = c(100, 1000, 5000)) {
  results <- data.frame(
    iterations = integer(),
    cores = integer(),
    time_seconds = numeric()
  )
  # Test different core counts
  core_options <- c(1, 2, 4, parallel::detectCores() - 1)
  for (n_iter in iterations) {
    for (n_cores in core_options) {
      if (n_cores == 1) {
        plan(sequential)
      } else {
        plan(multisession, workers = n_cores)
        doFuture::registerDoFuture()
      }
      t <- system.time({
        res <- plR_permute(plr_obj, N.SNPs = 500, N.iter = n_iter, verbose = FALSE)
      })
      results <- rbind(results, data.frame(
        iterations = n_iter,
        cores = n_cores,
        time_seconds = t["elapsed"]
      ))
      # Clean up workers
      if (n_cores > 1) {
        plan(sequential)
      }
    }
  }
  return(results)
}
# Run benchmark
bench_results <- benchmark_parallel(plr_obj)
print(bench_results)

Expected Performance
Based on typical workstation hardware (Intel/AMD 8-core CPU):
| Iterations | 1 Core | 2 Cores | 4 Cores | 8 Cores | Efficiency |
|---|---|---|---|---|---|
| 1,000 | 30s | 16s | 9s | 6s | 62% |
| 10,000 | 5m | 2.6m | 1.4m | 50s | 75% |
| 100,000 | 45m | 23m | 12m | 6.5m | 87% |
Efficiency = (Speedup / Core Count) × 100. Higher iteration counts show better efficiency because the fixed startup and communication overhead is amortized over more computation.
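Applying that definition to the 100,000-iteration row of the table above (45 min sequential, 6.5 min on 8 cores):

```r
seq_time <- 45 * 60    # sequential time in seconds
par_time <- 6.5 * 60   # parallel time on 8 cores
cores <- 8

speedup <- seq_time / par_time
efficiency <- speedup / cores * 100

round(speedup, 2)   # 6.92
round(efficiency)   # 87
```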
Memory Usage Considerations
Parallel execution increases memory usage:
- Sequential: Memory × 1
- N workers: Memory × (1 + N × 0.3) approximately
Each worker maintains a copy of necessary data structures. For large datasets:
# Check available memory
mem_info <- gc(reset = TRUE)
cat("Current memory usage:", mem_info[2, 2], "MB\n")
# Estimate parallel memory
estimate_memory <- function(n_workers, base_mb = 500) {
  # Base usage + overhead per worker
  total <- base_mb * (1 + n_workers * 0.3)
  return(total)
}
available_workers <- function(total_ram_gb, base_mb = 500) {
  max_workers <- floor((total_ram_gb * 1024 - base_mb) / (base_mb * 0.3))
  return(min(max_workers, parallel::detectCores() - 1))
}
cat("Recommended workers for 16GB RAM:", available_workers(16), "\n")

When Parallelization Isn't Worth It
Skip parallel processing when:
- Small datasets (< 100 genes, < 1,000 iterations): overhead exceeds computation time, so sequential may be faster
- Very fast iterations (< 10 ms each): communication overhead dominates
- Memory-constrained systems: swapping kills performance
- Windows with > 10 cores: process creation overhead can be significant
Rule of thumb: Parallelize when expected time > 30 seconds and iterations > 5,000.
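The rule of thumb can be encoded in a small helper. This is purely illustrative; `should_parallelize` is a hypothetical function, and the thresholds come from the text above, not from polylinkR itself:

```r
# Decide whether parallel setup cost is likely to pay off.
should_parallelize <- function(est_seconds, n_iter) {
  est_seconds > 30 && n_iter > 5000
}

should_parallelize(est_seconds = 300, n_iter = 10000)  # TRUE
should_parallelize(est_seconds = 5,   n_iter = 1000)   # FALSE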
Advanced Tips
Chunking Strategies
For extremely large iteration counts, use chunking to reduce scheduling overhead:
# Manual chunking for fine control
plan(multisession, workers = 4)
doFuture::registerDoFuture()
# Divide work into chunks
N.iter <- 50000
n_chunks <- 20
chunk_size <- N.iter / n_chunks
# Process chunks in parallel
results <- foreach(i = 1:n_chunks, .combine = c) %dopar% {
  # Each worker processes a chunk
  process_chunk(chunk_size)
}

polylinkR automatically handles chunking internally for large N.iter values.
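When the chunk count does not divide the iteration count evenly, a fractional chunk size can silently drop iterations. A sketch of safer chunk arithmetic (assuming you are chunking manually, as above):

```r
N.iter <- 50000
n_chunks <- 20

# ceiling() guards against a fractional chunk size; the last chunk
# absorbs the remainder so the total is preserved exactly.
chunk_sizes <- rep(ceiling(N.iter / n_chunks), n_chunks)
chunk_sizes[n_chunks] <- N.iter - sum(chunk_sizes[-n_chunks])
stopifnot(sum(chunk_sizes) == N.iter)
```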
Handling Memory Limits
When hitting memory limits:
# Strategy 1: Reduce workers
plan(multisession, workers = 2) # Fewer workers = less memory
# Strategy 2: Process in batches
batch_permute <- function(plr_obj, n_iter_total, batch_size = 1000) {
  n_batches <- ceiling(n_iter_total / batch_size)
  all_results <- list()
  for (batch in 1:n_batches) {
    # Process one batch at a time
    res <- plR_permute(plr_obj, N.iter = batch_size, verbose = FALSE)
    all_results[[batch]] <- res
    gc() # Force garbage collection between batches
  }
  # Combine results
  return(combine_results(all_results))
}
# Strategy 3: Use disk-backed storage for large objects
options(future.globals.maxSize = 1024 * 1024^2) # 1GB limit

Troubleshooting Common Parallel Errors
Error: “Cannot find variable X”
# Solution: Ensure all required packages are loaded in workers
plan(multisession, workers = 4)
doFuture::registerDoFuture()
# Explicitly export needed objects
foreach(i = 1:10, .export = c("my_data", "my_function")) %dopar% {
  # computation
}

Error: "Worker timeout" or "Connection error"
# Solution: Increase the timeout for worker startup and communication
options(future.wait.timeout = 600) # 10 minutes
plan(multisession, workers = 4)

Error: "Memory limit exceeded"
# Solution: Reduce workers or chunk size
plan(multisession, workers = 2) # Reduce parallelism
# Or process sequentially in smaller batches

Windows-specific: "Process creation failed"
# Solution: Reduce worker count on Windows
# Windows has higher overhead for process creation
plan(multisession, workers = min(4, parallelly::availableCores() - 1))

Cleaning Up Clusters Properly
Always clean up parallel resources to prevent zombie processes and memory leaks:
# Standard cleanup - returns to sequential processing
plan(sequential)
# Verify cleanup
cat("Current plan:", class(plan()), "\n")

For more thorough cleanup (especially after errors):
# Reset all future state
future::resetWorkers(plan())
plan(sequential)
# Force garbage collection
gc()
# On Windows, check for orphaned R processes in Task Manager

Platform-Specific Recommendations
Windows:
- Always use multisession (never multicore)
- Limit workers to 4-8 for stability
- Watch for antivirus interference with socket connections

Linux:
- multicore offers the best performance
- Check ulimit settings for process limits
- Consider cgroups for resource control on shared systems

macOS:
- multicore works well on Intel Macs
- Apple Silicon: multisession is more reliable
- Check "System Settings > General > Login Items" for process limits
Complete Working Example
Here’s a complete, production-ready script for parallel pathway analysis:
#!/usr/bin/env Rscript
# parallel_pathway_analysis.R
library(polylinkR)
library(foreach)
library(doFuture)
library(progressr)
# Configuration
CONFIG <- list(
  data_path = "path/to/your/data",
  output_path = "results.rds",
  n_iterations = 10000,
  n_snps = 1000,
  verbose = TRUE
)
# Setup parallel backend
setup_parallel <- function() {
  ncores <- parallel::detectCores()
  workers <- max(1, ncores - 1)
  cat("Setting up parallel backend with", workers, "workers\n")
  # Platform-specific setup
  if (.Platform$OS.type == "windows") {
    plan(multisession, workers = workers)
  } else {
    # Try multicore first, fall back to multisession
    tryCatch({
      plan(multicore, workers = workers)
    }, error = function(e) {
      message("multicore failed, using multisession: ", e$message)
      plan(multisession, workers = workers)
    })
  }
  doFuture::registerDoFuture()
  handlers(global = TRUE)
  return(workers)
}
# Main analysis
main <- function() {
  # Setup
  workers <- setup_parallel()
  on.exit({
    cat("Cleaning up parallel resources...\n")
    plan(sequential)
  }) # Ensure cleanup even on error
  # Read data
  cat("Reading input data...\n")
  plr_obj <- plR_read(CONFIG$data_path, verbose = CONFIG$verbose)
  # Run parallel permutation
  cat("Running parallel permutation analysis...\n")
  start_time <- Sys.time()
  result <- plR_permute(
    plr_obj,
    N.SNPs = CONFIG$n_snps,
    N.iter = CONFIG$n_iterations,
    verbose = CONFIG$verbose
  )
  end_time <- Sys.time()
  elapsed <- as.numeric(end_time - start_time, units = "secs")
  cat("Analysis complete!\n")
  cat("Total time:", round(elapsed, 2), "seconds\n")
  cat("Effective rate:", round(CONFIG$n_iterations / elapsed, 1), "iterations/sec\n")
  # Save results
  saveRDS(result, CONFIG$output_path)
  cat("Results saved to:", CONFIG$output_path, "\n")
  # Cleanup (also handled by on.exit)
  plan(sequential)
  return(result)
}

# Run if executed as script
if (!interactive()) {
  result <- main()
}

Safe Shutdown Pattern
Use this pattern to ensure resources are always released:
safe_parallel <- function(expr) {
  # Setup
  plan(multisession, workers = 4)
  doFuture::registerDoFuture()
  # Execute with error handling and cleanup
  tryCatch({
    result <- expr
    return(result)
  }, finally = {
    # Always execute cleanup
    cat("Shutting down parallel workers...\n")
    plan(sequential)
    gc()
  })
}

# Usage
result <- safe_parallel({
  plR_permute(plr_obj, N.SNPs = 1000, N.iter = 10000)
})

Summary
Key Takeaways
- Always parallelize large permutation jobs: the speedup is substantial for 10,000+ iterations
- Leave 1 core free: prevents system lag and memory issues
- Use multisession on Windows: more stable than forking
- Monitor memory usage: each worker increases the memory footprint by ~30%
- Always clean up: use plan(sequential) or on.exit() to prevent zombie processes
- Progress bars help: enable with handlers(global = TRUE) for long runs
- Chunk large jobs: reduces scheduling overhead for 100,000+ iterations
Quick Reference
# Minimal setup
library(foreach)
library(doFuture)
plan(multisession, workers = parallel::detectCores() - 1)
doFuture::registerDoFuture()
# Run analysis
result <- plR_permute(plr_obj, N.SNPs = 1000, N.iter = 10000)
# Cleanup
plan(sequential)