
Sample Size and Inter-Sample Correlation in CNV Detection
This technical review examines how cohort size (the number of samples) and inter-sample correlation influence the power and error rates of detecting genomic copy number variants (CNVs). It covers theoretical and practical findings, evidence from the literature, algorithmic comparisons, and study design considerations.
What Are Copy Number Variants (CNVs)?
Genetic diversity in the human genome encompasses many variations that are not yet fully understood. These differences span a broad range, from single-base alterations in the DNA sequence to larger structural rearrangements. For example, single nucleotide polymorphisms (SNPs) are common variants observed in a significant portion of the population (greater than 1%) and represent changes at a single nucleotide position.
CNVs, on the other hand, are structural variants that involve increases or decreases in the copy number of specific regions of the DNA. These alterations occur through the deletion or duplication of the affected genomic segment. They typically impact large regions of DNA—often thousands of bases long—and may span multiple genes; however, their size and frequency can vary widely.
Purpose of Copy Number Alteration Detection:
- To measure deviations in coverage depth or read density across genomic regions,
- To assess diploid imbalance through allele frequency and heterozygosity,
- To segment the genome using segmentation algorithms and identify statistically abnormal regions.
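The first goal above, measuring deviations in coverage depth, is usually expressed as a log2 ratio between a sample's depth and a reference depth. The following is a minimal sketch under simplifying assumptions (a pure, clonal, diploid sample); the function names and the pseudocount value are illustrative and not taken from any specific tool.

```python
import math

def log2_copy_ratio(sample_depth, reference_depth, pseudocount=0.5):
    """Observed log2 ratio of sample depth to reference depth for one region.

    The pseudocount (an assumed value) avoids taking the log of zero.
    """
    return math.log2((sample_depth + pseudocount) / (reference_depth + pseudocount))

def expected_log2_ratio(copy_number, ploidy=2):
    """Idealized log2 ratio for a given copy number in a pure, clonal sample."""
    return math.log2(copy_number / ploidy)

# A heterozygous deletion (1 copy) and a single-copy gain (3 copies)
deletion_ratio = expected_log2_ratio(1)
gain_ratio = expected_log2_ratio(3)
```

In this idealized setting a one-copy loss shifts the ratio by a full unit while a one-copy gain shifts it by less, which is one reason duplications tend to be harder to detect than deletions at the same noise level.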
Impact of Sample Size on CNV Detection
When the number of samples is low, the reference distribution is poorly estimated, which increases false-positive and false-negative rates for single-exon and small multi-exon CNVs. This issue is particularly pronounced in panel-based sequencing data, where target regions naturally exhibit large variations in read depth. Accurately modeling these technical fluctuations requires a sufficiently large sample pool.
Babadi et al. (2023) benchmarked GATK-gCNV on 7,962 exomes and applied it to exome data from 197,306 UK Biobank participants, demonstrating that large cohorts are critical for reliably detecting rare CNVs. As sample size increases, both detection accuracy and the diversity of captured variants improve. Additionally, the statistical power provided by inter-sample correlation enhances CNV resolution. The reported recall of up to 95% for rare CNVs spanning more than two exons indicates that larger cohorts and inter-sample statistical relationships directly improve CNV analysis performance.
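The effect of a small reference pool can be illustrated with a toy simulation. The sketch below assumes normally distributed, already-normalized depth at a single target and a fixed z-score cutoff; both are simplifying assumptions, and the function name is hypothetical. It estimates how often a copy-neutral sample is flagged when the reference mean and standard deviation are estimated from only a few samples.

```python
import numpy as np

def false_positive_rate(n_reference, n_trials=20000, z_cutoff=3.0, seed=0):
    """Fraction of copy-neutral test samples flagged at |z| > z_cutoff.

    Each trial draws a fresh reference pool of size n_reference and one
    copy-neutral test sample from the same (assumed normal) distribution.
    """
    rng = np.random.default_rng(seed)
    reference = rng.normal(size=(n_trials, n_reference))
    test_sample = rng.normal(size=n_trials)
    # The reference mean and SD are estimated from the pool, as a caller would do
    z = (test_sample - reference.mean(axis=1)) / reference.std(axis=1, ddof=1)
    return float(np.mean(np.abs(z) > z_cutoff))

rates = {n: false_positive_rate(n) for n in (5, 30, 200)}
for n, rate in rates.items():
    print(f"reference size {n:>3}: false-positive rate {rate:.4f}")
```

With few reference samples the estimated standard deviation is itself noisy, so the same cutoff flags far more copy-neutral samples than it does with a large pool. Real targets add GC and batch effects on top of this sampling noise.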
Furthermore, genetic correlation among samples and cross-referencing with existing microarray or genome sequencing data enhance the reliability of CNV detection. In other words, consistency and comparability across individuals help reduce erroneous calls.
NGS Cloud’s Innovative and Standardized Approach to CNV Detection
Accurate detection of rare copy number alterations critically depends on the number of samples. The study by Babadi et al. shows that as sample size increases, both the diversity of detectable variants and the accuracy of CNV calls improve. Furthermore, strong correlation among samples can raise the recall rate of multi-exon variants to as high as 95%.
NGS Cloud’s CNV algorithm operates on datasets that share similar correlation characteristics across parameters such as read depth, sequencing kit and instrument, and total data yield. In addition, analyses performed using more than 30,000 samples from NGS Cloud’s internal database provide up to 95% accuracy, high reproducibility, and fully standardized CNV detection results.

Figure 1. Aneuploid copy number alteration on chromosome 9 captured with moderate accuracy on NGS Cloud.
Why Is Inter-Sample Correlation So Important in CNV Analyses?
Using a well-selected reference set, or samples drawn from the same distribution (e.g., matched normals or controls with similar coverage profiles), increases the sensitivity of CNV calling. Global coverage correlation among samples can be used to construct that reference set.
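One way to build such a reference set is to rank candidate samples by how well their coverage profile correlates with the test sample and keep the top few. The sketch below assumes normalized per-region coverage vectors of equal length; the function name and the default `k` are illustrative choices, not parameters of a specific caller.

```python
import numpy as np

def select_reference_samples(test_profile, candidate_profiles, k=10):
    """Return indices and correlations of the k candidates most correlated
    with the test sample's coverage profile (Pearson correlation)."""
    test = np.asarray(test_profile, dtype=float)
    candidates = np.asarray(candidate_profiles, dtype=float)
    # Center each profile so the dot product below yields Pearson correlation
    t = test - test.mean()
    c = candidates - candidates.mean(axis=1, keepdims=True)
    corr = (c @ t) / (np.linalg.norm(c, axis=1) * np.linalg.norm(t))
    order = np.argsort(corr)[::-1][:k]
    return order, corr[order]
```

In practice the cutoff could also be a minimum correlation rather than a fixed `k`; which choice works better depends on cohort size and batch structure.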
Population-specific CNVs or clonality (tumor subclones) can create overlapping CNV patterns across samples; therefore, measuring inter-sample correlation helps identify biologically meaningful shared CNVs and distinguish them from technical artifacts.
Correlation-based CNV calling algorithms employed by NGS Cloud—particularly those relying on read-depth—show improved accuracy when the correlation structure among samples is incorporated (e.g., through correlation matrices or genomic correlation integration).
Quality Control of Raw Data
Statistical and visual evaluation of raw sequencing data is the first step in minimizing false-positive and false-negative results. At this stage, analyzing the following metrics is recommended:
- Coverage histogram: The distribution of coverage across the whole genome or targeted regions enables early detection of systematic deviations.
- GC-bias analysis: The nonlinear relationship between GC content and coverage can severely distort CNV calling. This bias must be quantitatively measured and reported.
- Duplicate rate: An important indicator of library quality; when library complexity is low, the performance of CNV detection algorithms declines.
This QC step provides a foundational reference for subsequent normalization procedures and allows early identification of batch effects.
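The three metrics above can be summarized per sample with a few lines of array code. This is a rough sketch, assuming per-target depth and GC fraction are already available as arrays and that duplicate and total read counts come from an upstream tool; the function name, the bin counts, and the `min_depth` threshold are assumptions for illustration.

```python
import numpy as np

def coverage_qc(depth, gc_fraction, n_duplicate_reads, n_total_reads,
                min_depth=20, n_gc_bins=10, n_hist_bins=10):
    """Summarize coverage distribution, GC bias, and duplicate rate."""
    depth = np.asarray(depth, dtype=float)
    gc = np.asarray(gc_fraction, dtype=float)
    counts, edges = np.histogram(depth, bins=n_hist_bins)
    # Median depth per GC bin; because GC bias is nonlinear, a binned summary
    # is used here instead of a single linear correlation coefficient
    bins = np.linspace(0.0, 1.0, n_gc_bins + 1)
    idx = np.clip(np.digitize(gc, bins) - 1, 0, n_gc_bins - 1)
    medians = {int(b): float(np.median(depth[idx == b])) for b in np.unique(idx)}
    spread = (max(medians.values()) - min(medians.values())) / float(np.median(depth))
    return {
        "coverage_histogram": (counts, edges),
        "mean_depth": float(depth.mean()),
        "coefficient_of_variation": float(depth.std() / depth.mean()),
        "frac_ge_min_depth": float((depth >= min_depth).mean()),
        "gc_bin_medians": medians,
        "gc_bias_spread": spread,
        "duplicate_rate": n_duplicate_reads / n_total_reads,
    }
```

Comparing these summaries across samples, rather than inspecting each sample in isolation, is what makes batch effects visible early.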
Calculation and Clustering of Inter-Sample Correlation
Similarity in coverage profiles across samples is important for understanding both biological and technical shared signals. Therefore, after generating normalized coverage profiles for each sample:
- An inter-sample correlation matrix should be computed.
- Hierarchical clustering should be performed on this matrix.
If strong correlation clusters (e.g., batch clusters) are observed:
- A separate normalization set should be used for each cluster,
- Reference samples should be redefined within each cluster,
- CNV callers should be executed on a cluster-specific basis.
This approach has been highlighted in publications in Nature Genetics and Nature Communications as a critical step for achieving high accuracy in CNV detection.
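The correlation and clustering steps can be sketched as follows. To stay dependency-light, this example groups samples by connected components of the thresholded correlation graph, which is equivalent to cutting a single-linkage hierarchical clustering at distance 1 − `min_corr`; other linkage methods would need a dedicated clustering library. The function names and the 0.8 threshold are assumptions.

```python
import numpy as np

def correlation_clusters(profiles, min_corr=0.8):
    """Compute the inter-sample correlation matrix and single-linkage clusters.

    profiles: samples x regions matrix of normalized coverage.
    Returns (correlation_matrix, cluster_labels).
    """
    corr = np.corrcoef(np.asarray(profiles, dtype=float))
    n = corr.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first walk over sufficiently correlated samples
            i = stack.pop()
            for j in np.flatnonzero((corr[i] >= min_corr) & (labels == -1)):
                labels[j] = current
                stack.append(int(j))
        current += 1
    return corr, labels

def cluster_references(profiles, labels):
    """Median coverage profile per cluster, usable as a cluster-specific reference."""
    profiles = np.asarray(profiles, dtype=float)
    return {int(c): np.median(profiles[labels == c], axis=0) for c in np.unique(labels)}
```

Each cluster's median profile can then serve as the reference when the caller is run on samples from that cluster only.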
GC and Mappability Correction with PCA-Based Reduction of Technical Components
A two-step normalization process is required to reduce technical noise:
1. GC and Mappability Correction
GC content and regional mappability have systematic effects on coverage. Therefore:
- GC content should be calculated for each target region,
- Coverage values should be normalized according to a GC model,
- Regions with low mappability should be filtered or weighted.
Studies archived in the Max Planck Society publication repository (pure.mpg.de) report that this correction reduces false CNV calls by approximately 20–40%, particularly in targeted sequencing projects such as whole-exome sequencing (WES).
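A simple form of this correction rescales each target by the median depth of targets with similar GC content. The sketch below uses median binning as a stand-in for the smoother GC models (such as LOESS fits) that production callers often use, and it filters low-mappability targets rather than weighting them. The function name, bin count, and mappability threshold are assumptions.

```python
import numpy as np

def gc_correct(depth, gc_fraction, mappability=None, n_bins=20, min_mappability=0.8):
    """Median-bin GC correction; filtered targets are returned as NaN."""
    depth = np.asarray(depth, dtype=float)
    gc = np.asarray(gc_fraction, dtype=float)
    if mappability is None:
        keep = np.ones(depth.shape, dtype=bool)
    else:
        keep = np.asarray(mappability, dtype=float) >= min_mappability
    bins = np.clip((gc * n_bins).astype(int), 0, n_bins - 1)
    global_median = np.median(depth[keep])
    corrected = np.full(depth.shape, np.nan)
    for b in np.unique(bins[keep]):
        in_bin = keep & (bins == b)
        bin_median = np.median(depth[in_bin])
        if bin_median > 0:
            # Rescale so every GC bin shares the same median depth
            corrected[in_bin] = depth[in_bin] * global_median / bin_median
    return corrected
```

Bins with very few targets give unstable medians, so a real pipeline would likely merge sparse bins or smooth across neighboring bins.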
2. PCA-Based Technical Component Reduction
After normalization, principal component analysis (PCA) is recommended to remove residual batch signals:
- The first few technical components account for a large portion of coverage variance.
- These components can be regressed out to enhance biological signals.
- PCA has become an industry standard for decomposing sources of variation.
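Regressing out the leading components can be done directly with a singular value decomposition. This is a minimal sketch, assuming a samples × regions matrix of log-transformed, GC-corrected coverage; the function name and the default number of components are illustrative, and choosing that number is a tuning decision the text above does not specify.

```python
import numpy as np

def remove_top_components(log_coverage, n_components=2):
    """Subtract the top principal components from a samples x regions matrix.

    Returns (residuals, explained_fraction), where residuals are centered per
    region and explained_fraction is the variance share of the removed components.
    """
    x = np.asarray(log_coverage, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, len(s))
    technical = (u[:, :k] * s[:k]) @ vt[:k]  # rank-k reconstruction
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return centered - technical, explained
```

Removing too many components risks absorbing real signal, for example a CNV shared by many samples, so the explained-variance fraction is worth inspecting before fixing `n_components`.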
CNV Callers, Ensemble Approaches, and Machine Learning
Relying on a single CNV caller can lead to inconsistent results due to algorithmic limitations. Therefore:
- It is recommended to use both read-depth–based and split-read/paired-end–based callers together.
- Ensemble models significantly reduce erroneous calls by integrating shared signals across different callers.
- Studies archived in PubMed Central (PMC) report that machine-learning–based validation layers improve sensitivity.
Example tools: CNVkit, ExomeDepth, Control-FREEC, GATK gCNV, Lumpy, Delly.
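A basic ensemble rule keeps only calls supported by more than one caller. The sketch below assumes each call has already been parsed into a `(chromosome, start, end, type)` tuple and uses a 50% reciprocal-overlap criterion; the tuple layout, caller labels, and thresholds are assumptions for illustration, not the output format of any tool listed above.

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap fraction of two calls; 0.0 if chromosome or type differ."""
    if a[0] != b[0] or a[3] != b[3]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))

def consensus_calls(calls_by_caller, min_support=2, min_overlap=0.5):
    """Keep calls supported by at least min_support callers (one entry per event)."""
    callers = list(calls_by_caller)
    consensus = []
    for caller in callers:
        for call in calls_by_caller[caller]:
            supporters = {caller}
            for other in callers:
                if other != caller and any(
                    reciprocal_overlap(call, o) >= min_overlap
                    for o in calls_by_caller[other]
                ):
                    supporters.add(other)
            # Skip events already represented by an overlapping kept call
            seen = any(reciprocal_overlap(call, kept) >= min_overlap
                       for kept, _ in consensus)
            if len(supporters) >= min_support and not seen:
                consensus.append((call, sorted(supporters)))
    return consensus
```

The kept call inherits the breakpoints of whichever caller is listed first; merging breakpoints across callers is a separate design choice left out here.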
Biological and Technical Cross-Validation of Results
To increase the reliability of detected CNVs, the following validation strategies are recommended:
- Use of matched normals: Essential for somatic variant analysis.
- Orthogonal assay validation: Techniques such as qPCR, MLPA, array-CGH, or optical genome mapping can be used.
- Population-based validation: Databases such as gnomAD SV and other large community datasets should be used to confirm rarity and assess inter-sample variance.
Publications archived in PubMed Central (PMC) indicate that multi-layered validation is the most reliable approach for clinical reporting.
Machine Learning and Deep Learning–Based CNV Calling Approaches
In recent years, copy number variant (CNV) detection has undergone a major shift, expanding beyond classical statistical methods and signal-based segmentation approaches. The integration of machine learning (ML) and deep learning (DL) models into genomic analysis pipelines has played a transformative role in this evolution. The accumulation of large cohorts and the rapid expansion of high-volume genomic datasets (WES, WGS, and targeted panel data) have enabled the use of a wide range of models—from decision trees to convolutional neural networks (CNNs)—in CNV calling workflows.
This section presents ML/DL-based CNV calling methodologies, their advantages, application examples, and a review of the relevant scientific literature.
The Role of Random Forest and Gradient Boosting Models in CNV Calling
The first wave of machine learning–based CNV detection approaches emerged in scenarios where a rich feature space—coverage, GC content, mappability scores, segment variance, z-scores, and mapping quality metrics—could be leveraged effectively for classification tasks. In these contexts, random forest and gradient boosting methods stood out as powerful model families for CNV identification.
Random forest models provide the following advantages:
- Strong generalization capability in high-dimensional and noisy signal environments
- Ability to learn local variations within read-depth profiles
- Capacity to evaluate multiple features simultaneously and distinguish false-positive segments
- Robustness against overfitting in low-sample datasets
In particular, in hybrid Bayesian + ML models such as ExomeDepth, it has been demonstrated that learned statistical classifiers significantly improve CNV calling scores.
CNN-Based Segment Signal Classifiers
With the introduction of deep learning models into the genomic signal processing domain, convolutional neural networks (CNNs) have become powerful architectures capable of taking raw coverage signals as input and performing CNV classification directly.
CNN-based models:
- Process read-depth fluctuations as spatial signals.
- Learn both local and global segment patterns simultaneously.
- Automatically capture complex variance structures that are not easily detectable by human inspection.
- Handle challenges stemming from segment length, coverage heterogeneity, and GC bias.
For example, DeepCNV leverages CNNs together with attention mechanisms to accurately distinguish both very small and large CNVs. Notably, in low-depth WES datasets, it has been shown to produce lower false-positive rates compared with classical methods.
Autoencoder-Based Noise Reduction Models
In datasets with low quality or substantial technical noise, CNV signals are often masked, making detection difficult. Autoencoder models:
- Compress the high-dimensional structure of coverage signals.
- Remove technical variance (batch effects, GC bias, and kit-derived variation), thereby amplifying the underlying biological signal.
- Improve the sensitivity of CNV callers by operating on a “denoised” coverage profile.
This approach significantly enhances CNV resolution, particularly in targeted panel data, when used in combination with CNN- or random-forest-based models.
Graph Learning and Genomic Segmentation
In recent studies, graph neural network (GNN)-based models have been developed in which the topological relationships of genomic segments and neighboring regions are learned through graph representations. This approach:
- Represents the linear structure of the genome and the regional continuity of the coverage signal on a graph.
- Processes characteristics such as inter-segment correlation, mappability similarity, and GC proximity as edge weights.
- Can better capture local signal changes in short segments thanks to its topological structure.
Graph learning models, especially in multi-sample analyses (cohort-based CNV detection), enhance detection power by integrating relationships between samples into the graph structure.

Figure 2. Overview of the CNV-P Framework. (A) Schematic representation of the CNV-P pipeline, which evaluates candidate CNV calls and assigns each as either True or False based on a supervised classification strategy. (B) Summary of the feature set incorporated into the training process of the supervised machine learning models, capturing both coverage-derived metrics and CNV-level contextual attributes.
ML-Based CNV Validation Layers
Machine learning (ML) and deep learning (DL) approaches are used not only during the CNV calling stage but also in the downstream filtering and validation of detected variants. These validation layers leverage multiple signal- and feature-level metrics, such as:
- Coverage z-score
- B-allele frequency (BAF) patterns
- Segment variance
- Post-normalization PCA component effects
- Control-sample correlation patterns
Using these metrics as input features, ML models can classify candidate variants as either “true CNVs” or “artifacts.” In large-scale cohorts, ML-based validation layers have been shown to reduce false positive rates by approximately 10–25% compared to traditional rule-based filtering strategies.
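As a rough illustration of such a validation layer, the sketch below trains a plain logistic regression on per-candidate features. It is a deliberately simple stand-in for the random forest, gradient boosting, or deep models discussed earlier; the function names are hypothetical, and the feature columns (for example coverage z-score, segment variance, control-sample correlation) are assumed to be precomputed for each candidate call.

```python
import numpy as np

def _design_matrix(features, mean, std):
    """Standardize features and append an intercept column."""
    x = (np.asarray(features, dtype=float) - mean) / std
    return np.hstack([x, np.ones((len(x), 1))])

def train_logistic(features, labels, learning_rate=0.1, n_iter=2000):
    """Fit logistic regression by batch gradient descent.

    labels: 1 for validated true CNVs, 0 for artifacts.
    Returns a (weights, mean, std) tuple used for later scoring.
    """
    x = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    mean, std = x.mean(axis=0), x.std(axis=0) + 1e-9
    xs = _design_matrix(x, mean, std)
    w = np.zeros(xs.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(xs @ w)))
        w -= learning_rate * xs.T @ (p - y) / len(y)
    return w, mean, std

def predict_true_cnv_probability(features, model):
    """Probability that each candidate is a true CNV under the fitted model."""
    w, mean, std = model
    xs = _design_matrix(features, mean, std)
    return 1.0 / (1.0 + np.exp(-(xs @ w)))
```

Candidates scoring below a chosen probability threshold would be flagged as likely artifacts. The training labels must come from orthogonally validated calls, otherwise the classifier only learns to reproduce the upstream caller's biases.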
References
- Thapar, A., & Cooper, M. (2013). Copy number variation: What is it and what has it told us about child psychiatric disorders? Journal of the American Academy of Child & Adolescent Psychiatry, 52(8), 802–812. https://doi.org/10.1016/j.jaac.2013.05.013
- Moreno-Cabrera, J. M., del Valle, J., Castellanos, E., Feliubadaló, L., Pineda, M., Brunet, J., Serra, E., Capellà, G., Lázaro, C., & Gel, B. (2020). Evaluation of CNV detection tools for NGS panel data in genetic diagnostics. European Journal of Human Genetics, 28, 1645–1655. https://doi.org/10.1038/s41431-020-0675-z
- Babadi, M., Fu, J. M., Lee, S. K., Smirnov, A. N., Gauthier, L. D., Walker, M., Benjamin, D. I., Zhao, X., Karczewski, K. J., Wong, I., Collins, R. L., Sanchis-Juan, A., Brand, H., Banks, E., & Talkowski, M. E. (2023). GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data. Nature Genetics, 55, 1589–1597. https://doi.org/10.1038/s41588-023-01449-0
- Wineinger, N. E., Tiwari, H. K., et al. (2012). The impact of errors in copy number variation detection algorithms on association results. PLoS ONE, 7(4), e32396. https://doi.org/10.1371/journal.pone.0032396
- Talevich, E., Shain, A. H., Botton, T., & Bastian, B. C. (2016). CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. PLOS Computational Biology, 12(4), e1004873. https://doi.org/10.1371/journal.pcbi.1004873
- Auwerx, C., Lepamets, M., Sadler, M. C., Patxot, M., Stojanov, M., Baud, D., Mägi, R., Porcu, E., Reymond, A., & Kutalik, Z. (2022). The individual and global impact of copy-number variants on complex human traits. The American Journal of Human Genetics, 109(4), 647–668. https://doi.org/10.1016/j.ajhg.2022.02.010
- Luo, X., Qin, F., Cai, G., & Xiao, F. (2021). Integrating genomic correlation structure improves copy number variations detection. Bioinformatics, 37(3), 312–317. https://doi.org/10.1093/bioinformatics/btaa737
- Auwerx, C., et al. (2022). The individual and global impact of copy-number variants on complex human traits. American Journal of Human Genetics. https://dx.doi.org/10.17632/z54dc3b6jz.1
- Demidov, G., et al. (2024). Comprehensive reanalysis for CNVs in ES data from routine diagnostics. npj Genomic Medicine. https://doi.org/10.1038/s41525-024-00436-6
- Stamoulis, C., et al. (2011). Estimation of correlations between copy-number variants. PLoS ONE, 6(9), e25673. https://doi.org/10.1109/iembs.2011.6091345
- Glessner, J. T., Hou, X., Zhong, C., Zhang, J., Khan, M., Brand, F., … Wei, Z. (2021). DeepCNV: A deep learning approach for authenticating copy number variations. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbaa381
- Tan, R., & Shen, Y. (2022). Accurate in silico confirmation of rare copy number variant calls from exome sequencing data using transfer learning. Nucleic Acids Research, 50(21), e123. https://doi.org/10.1093/nar/gkac788
- Zhang, Y., Jin, L., Wang, B., Hu, D., Wang, L., Li, P., … Lang, J. (2020). DL-CNV: A deep learning method for identifying copy number variations based on next generation target sequencing. Mathematical Biosciences and Engineering, 17(1), 202–215. https://doi.org/10.3934/mbe.2020011
- Wang, T., Sun, J., Zhang, X., Wang, W-J., & Zhou, Q. (2021). CNV-P: A machine-learning framework for predicting high-confident copy number variations. PeerJ, 9, e12564. https://doi.org/10.7717/peerj.12564

