Netboost: statistical modeling strategies for high-dimensional data

Abstract

Background:
State-of-the-art methods often fail to identify weak but cumulative effects of variables in high-dimensional omics datasets. Nevertheless, these effects play important roles in many diseases, for example in the clonal development of leukemic cells and in CKD metabolism.

Results:
We propose Netboost, a three-step dimension reduction technique. First, boosting-based filters are combined with the topological overlap measure to identify the essential edges of the network. Second, sparse hierarchical clustering is applied to the selected edges to identify modules. Third, the information in each module is aggregated by principal components. The primary analysis is then carried out on these summary measures instead of the original data, allowing for a localized dimensionality reduction.
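
The three steps can be illustrated with a minimal Python sketch. It is not the Netboost implementation. It assumes a simple top-partner correlation filter as a stand-in for the boosting-based filter, and single-linkage clustering cut at a fixed height (equivalent to taking connected components of the retained edges) as a stand-in for sparse hierarchical clustering. All function names, the soft-threshold power of 2 and the cut-off `min_tom` are hypothetical choices made for illustration.

```python
import numpy as np

def filter_edges(corr, n_partners):
    """Stand-in filter (NOT boosting): keep each variable's strongest partners."""
    p = corr.shape[0]
    strength = np.abs(corr).copy()
    np.fill_diagonal(strength, -np.inf)          # never select self-edges
    top = np.argsort(-strength, axis=1)[:, :n_partners]
    keep = np.zeros((p, p), dtype=bool)
    keep[np.repeat(np.arange(p), n_partners), top.ravel()] = True
    return keep | keep.T                         # edge kept if either end chose it

def topological_overlap(adj):
    """Unsigned topological overlap for a symmetric adjacency with zero diagonal."""
    shared = adj @ adj                           # weight of shared neighbours
    k = adj.sum(axis=0)                          # connectivity of each variable
    tom = (shared + adj) / (np.minimum.outer(k, k) + 1.0 - adj)
    np.fill_diagonal(tom, 1.0)
    return tom

def modules_from_edges(keep, tom, min_tom):
    """Single linkage cut at a fixed height: connected components of strong edges."""
    p = keep.shape[0]
    parent = list(range(p))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path compression
            i = parent[i]
        return i
    strong = np.triu(keep & (tom >= min_tom), k=1)
    for i, j in zip(*np.nonzero(strong)):
        parent[find(int(i))] = find(int(j))
    roots = [find(i) for i in range(p)]
    label_of = {r: n for n, r in enumerate(dict.fromkeys(roots))}
    return np.array([label_of[r] for r in roots])

def module_summaries(data, labels):
    """Aggregate each module by the scores of its first principal component."""
    cols = []
    for m in np.unique(labels):
        block = data[:, labels == m]
        block = block - block.mean(axis=0)
        u, s, _ = np.linalg.svd(block, full_matrices=False)
        cols.append(u[:, 0] * s[0])
    return np.column_stack(cols)

# Simulated data (assumption): two latent factors, each driving five variables.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
data = np.repeat(factors, 5, axis=1) + 0.3 * rng.normal(size=(200, 10))

corr = np.corrcoef(data, rowvar=False)
keep = filter_edges(corr, n_partners=2)                  # step 1a: filter
adj = np.where(keep, np.abs(corr) ** 2, 0.0)             # soft-thresholded edges
tom = topological_overlap(adj)                           # step 1b: overlap
labels = modules_from_edges(keep, tom, min_tom=0.1)      # step 2: modules
summaries = module_summaries(data, labels)               # step 3: aggregation
```

In this toy setting, `summaries` holds one column per module, and a downstream model would be fitted on these columns rather than on the ten original variables.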

We demonstrate the application of the newly developed Netboost in integration with CoxBoost for survival prediction, genetic association studies to understand the human metabolism and random forests for disease classification. We applied our method in 7 independent cohorts spanning 6 diseases, a variety of high-dimensional data types (DNA methylation, metabolomics, miRNA, RNA arrays, RNA sequencing) and human as well as murine in vivo samples.
In many of these settings, we were able to show significant advantages over competing state-of-the-art analysis strategies with respect to prediction errors, power and misclassification rates, as assessed by cross-validation, general resampling and independent replication.

By integrating our novel method into the analysis of several biomedical research projects, we were able to attain and confirm biological insights that could not have been reached by the compared state-of-the-art methods.
In particular, the two biologically most insightful findings in this dissertation were both replicated in independent datasets.
First, we identified a chromatin-modifying enzyme signature associated with overall survival, which separates patients into two groups with a threefold difference in median survival time. Second, we established the list of ADME (absorption, distribution, metabolism and excretion) processes, originally defined in the context of pharmacological research, as the central concept in human urinary metabolism.

Furthermore, comparing Netboost, WGCNA and k-means in several datasets, we demonstrated a lower sampling uncertainty for Netboost, both for overall networks and for individual network components, and found that method uncertainty dominated sampling uncertainty.

Finally, we integrate Netboost with robust methodology, designing a Netboost adaptation that is invariant to monotone transformations of variables, and thus obtain an advantageous extension for cases of non-linear relationships between variables.
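
What invariance to monotone transformations means can be shown with a small example. The abstract does not state how the robust adaptation achieves this property. The sketch below assumes a rank-based (Spearman) association measure purely for illustration, and the helper names `ranks` and `spearman` are hypothetical.

```python
import numpy as np

def ranks(x):
    """Ordinal ranks of x; ties are not handled (continuous data assumed)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(x, y):
    """Rank correlation: Pearson correlation of the ranks."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = x + 0.5 * rng.normal(size=300)

# Strictly increasing transformations leave the ranks untouched.
rank_before = spearman(x, y)
rank_after = spearman(np.exp(x), y ** 3)

# The ordinary (Pearson) correlation changes under the same transformations.
pearson_before = np.corrcoef(x, y)[0, 1]
pearson_after = np.corrcoef(np.exp(x), y ** 3)[0, 1]
```

Here `rank_before` and `rank_after` agree, while `pearson_before` and `pearson_after` differ, which is why a rank-based measure can be preferable when relationships between variables are monotone but non-linear.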

Conclusion:
The newly developed approach Netboost offers a versatile statistical modeling strategy for high-dimensional data and is freely available as a Bioconductor R package. Through dimensionality reduction, it improves accuracy, power and stability in various analysis settings, including time-to-event analysis, GWAS and classification.

Location
Deutsche Nationalbibliothek Frankfurt am Main
Extent
Online resource
Language
English
Notes
University of Freiburg, dissertation, 2019

Classification
Natural sciences

Event
Publication
(where)
Freiburg
(who)
University
(when)
2019
Creator

DOI
10.6094/UNIFR/151256
URN
urn:nbn:de:bsz:25-freidok-1512562
Rights
No Open Access; access to the object is unrestricted.
Last update
25.03.2025, 1:46 PM CET

Data provider

This object is provided by:
Deutsche Nationalbibliothek. If you have any questions about the object, please contact the data provider.

Time of origin

  • 2019
