Workflow recommendations using deep learning and machine learning tools for exploratory and predictive data analysis in Galaxy
Abstract: Galaxy is an open-source web platform for scientific data analysis. The Galaxy Europe platform hosts over 3,000 scientific tools for processing and analysing scientific datasets. It also has access to a large computing cluster comprising thousands of CPU cores, several terabytes (TB) of memory and a few petabytes (PB) of storage for executing those tools. To promote exploratory and predictive data analysis in Galaxy Europe, the thesis develops approaches in two parts: (a) a workflow recommendation system that uses deep learning (DL) to predict tools for extending data analyses and workflows, and (b) tools and infrastructure with robust machine learning (ML) approaches that let researchers perform scalable and reproducible applied machine learning research.
Scientific analyses are carried out using several blocks of tools that collectively process datasets stepwise, transforming raw data into conclusion-bearing insights. In Galaxy, such blocks of scientific tools are chained together to form workflows. Creating meaningful workflows is a complicated task and requires knowledge of many tools. A guidance system that recommends high-quality tools at each step is therefore essential for helping researchers build complex scientific workflows. In addition, offering multiple recommendations at each step paves the way for divergent workflows, enabling exploratory data analysis. The Galaxy recommendation system is created by training two deep learning architectures, recurrent neural networks (RNN) and Transformers (publications P1 and P2, respectively), on workflows stored in Galaxy Europe. Galaxy workflows are directed acyclic graphs consisting of tool sequences, and both architectures are well suited to learning sequential patterns, so they learn the underlying sequential nature of scientific workflows to recommend the most useful tools. The tool recommendation models are trained on tool sequences, with the task framed as multi-label, multi-class classification. A Galaxy API has been developed that takes a tool or a tool sequence as input and returns recommendations from the trained DL models. These recommendations are displayed in Galaxy through two user interface (UI) integrations.
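How a workflow graph can be turned into multi-label training examples may be easier to see in code. The following is a minimal sketch under stated assumptions, not the preprocessing used in the thesis: the edge-list representation, the function names workflow_paths and multilabel_examples, and the tool names are all hypothetical and chosen only for illustration. It enumerates the tool sequences in a small workflow DAG and maps each sequence prefix to the set of tools that follow it somewhere in the workflow, which is the kind of target a multi-label classifier would be trained on.

```python
from collections import defaultdict


def workflow_paths(edges):
    """Enumerate every source-to-sink tool path in a workflow DAG.

    `edges` is a list of (upstream_tool, downstream_tool) pairs.
    This representation is an assumption made for the sketch.
    """
    children = defaultdict(list)
    nodes, has_parent = set(), set()
    for src, dst in edges:
        children[src].append(dst)
        nodes.update((src, dst))
        has_parent.add(dst)

    paths = []

    def walk(node, prefix):
        path = prefix + [node]
        successors = children.get(node, [])
        if not successors:  # sink reached: one complete tool sequence
            paths.append(path)
            return
        for nxt in successors:
            walk(nxt, path)

    # Start from tools that have no upstream tool.
    for root in sorted(nodes - has_parent):
        walk(root, [])
    return paths


def multilabel_examples(paths):
    """Map each tool-sequence prefix to the set of tools observed next."""
    labels = defaultdict(set)
    for path in paths:
        for i in range(1, len(path)):
            labels[tuple(path[:i])].add(path[i])
    return dict(labels)


# Hypothetical workflow: tool names are illustrative only.
edges = [
    ("upload", "fastqc"),
    ("upload", "trimmomatic"),
    ("trimmomatic", "bowtie2"),
    ("trimmomatic", "fastqc"),
]
examples = multilabel_examples(workflow_paths(edges))
for prefix, next_tools in examples.items():
    print(" > ".join(prefix), "->", sorted(next_tools))
```

In this toy workflow the prefix consisting of "upload" alone has two valid next tools, so a single-label target would discard information. Keeping the whole set as the label is what makes the task multi-label, and it is also what allows several recommendations to be shown at each step.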
Machine learning methods are widely used for predictive analysis tasks in bioinformatics, such as the classification of DNA sequences, protein functions and gene expression patterns, biomedical image analysis, drug-response prediction and many more, achieving state-of-the-art accuracy. Applying these methods to high-dimensional biological datasets often requires enormous computing resources: several CPU cores and GPUs, ample disk space, and high memory. Such resources are readily available only to a few researchers. JupyterLab is a popular editor for rapidly developing prototypes and end-to-end analyses for machine learning and data science projects. As part of this thesis, it has been integrated into Galaxy as an online tool (available through Galaxy Europe) that can access Galaxy's large computing cluster. The tool is developed as a Docker container with several machine learning software packages installed. Packaging the software in a Docker container ensures reproducibility and secure execution of Python scripts written in JupyterLab notebooks. The JupyterLab tool can also be used as a regular Galaxy tool and integrated directly into any workflow, and researchers can use it to train machine learning models remotely. The resulting model, represented in the Open Neural Network Exchange (ONNX) format, and other supporting datasets become available in Galaxy.
The usage of the JupyterLab tool in Galaxy is demonstrated by two use cases: prediction of infected regions in COVID-19 CT scans and prediction of the 3D structure of proteins (publication P3). However, researchers with limited programming expertise may be unable to use this tool for their predictive tasks, because it requires developing machine learning analysis notebooks inside JupyterLab. To address this gap, several methods from Scikit-learn, TensorFlow, and XGBoost have been wrapped and integrated into Galaxy, giving researchers access to UI-based machine learning tools that run on Galaxy's large computing cluster. These Galaxy-ML tools cover a wide range of functions, broadly divided into data preprocessing, classification, regression, and clustering. Researchers can use these tools on Galaxy to create end-to-end machine learning analysis workflows. Several ready-to-use Galaxy Training Network (GTN) tutorials have also been developed to demonstrate the creation of an end-to-end machine learning analysis. These tutorials showcase the usage of Galaxy-ML tools for reproducing the results of two scientific publications: the prediction of chronological human age using RNA-seq and DNA-methylation datasets (publication P4) and the classification of two types of leukaemia, a type of blood cancer, using a gene expression dataset
- Location
-
Deutsche Nationalbibliothek Frankfurt am Main
- Extent
-
Online resource
- Language
-
English
- Notes
-
Universität Freiburg, Dissertation, 2024
- Keywords
-
Machine learning
Recommender system
E-learning
Data mining
Process management
Deep learning
- Event
-
Publication
- (where)
-
Freiburg
- (who)
-
University
- (when)
-
2024
- Creator
- Participating persons and organisations
- DOI
-
10.6094/UNIFR/257331
- URN
-
urn:nbn:de:bsz:25-freidok-2573314
- Rights information
-
Open Access; access to the object is unrestricted.
- Last updated
-
15.08.2025, 07:32 CEST
Data partner
Deutsche Nationalbibliothek. If you have questions about the object, please contact the data partner.
Contributors
- Kumar, Anup
- Backofen, Rolf
- Nekrutenko, Anton
- Albert-Ludwigs-Universität Freiburg. Fakultät für Angewandte Wissenschaften
- University
Created
- 2024