Designing Efficient Data Augmentation Strategies for Flow Cytometry Data

Supervisor: Michael Reiter 


Flow Cytometry is a laser-based technique to measure antigen expression levels of blood cells. It is used in research as well as in daily clinical routines for immunophenotyping and for monitoring residual numbers of cancer cells during chemotherapy. One patient’s sample contains approximately 50-300k cells (also called events) with up to 15 different features (markers) measured. Each feature corresponds to either physical properties of a cell (cell size, granularity) or to the level of expression of a specific antigen marker on the cell’s surface. 

Problem Statement 

Since Flow Cytometry samples are obtained directly during cancer therapy, the number of available samples is very limited. However, modern deep learning models rely on a vast number of training samples that cover the variations present in the input data. By slightly modifying the cell data of existing samples, the pool of available training data can be artificially extended. The challenge is to introduce variations in the training data without breaking the relative relationships between different cell populations and without distorting the sample to the point that it becomes biologically implausible.


More than 600 different clinical samples of Acute Lymphoblastic Leukemia (ALL) patients are available from 3 different centers: St. Anna Hospital Vienna, Charité Berlin, and Garrahan Hospital Buenos Aires. With each sample containing roughly 200k events, the overall data pool contains approximately 650 × 2 × 10^5 = 130 million cells.


Simple augmentation strategies, such as linear data shifts and population scaling, have proven to be effective; however, the possibilities of augmentation are not yet exhausted for this type of data. The goal of this work is to further explore, develop, and validate new data augmentation strategies for Flow Cytometry data. Different augmentation strategies should be developed and compared by their performance on the given data and prediction models as well as by practical aspects.
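To make the two baseline strategies concrete, the following is a minimal sketch in Python with NumPy. It is not the project's actual implementation. It assumes a sample is stored as an array of shape (events, markers) with values already normalized to a comparable range, and that a per-event population label is available. The function names, the parameter max_shift, and the interpretation of "population scaling" as resampling one labelled population to change its relative size are all assumptions made for illustration.

```python
import numpy as np


def linear_shift(events, rng, max_shift=0.05):
    """Add one random offset per marker to every event in the sample.

    Because all events receive the same offset, the relative positions
    of the cell populations to each other are preserved.
    max_shift is a hypothetical parameter, not taken from the project.
    """
    offsets = rng.uniform(-max_shift, max_shift, size=events.shape[1])
    return events + offsets


def scale_population(events, labels, population, factor, rng):
    """Resample one labelled population to about `factor` times its size.

    This is one possible reading of "population scaling" (an assumption):
    events of the chosen population are subsampled (factor < 1) or drawn
    with replacement (factor > 1); all other events are kept unchanged.
    """
    idx = np.flatnonzero(labels == population)
    other = np.flatnonzero(labels != population)
    if idx.size == 0:
        # Nothing to scale: return unchanged copies.
        return events.copy(), labels.copy()
    n_new = int(round(factor * idx.size))
    resampled = rng.choice(idx, size=n_new, replace=n_new > idx.size)
    keep = np.concatenate([other, resampled])
    rng.shuffle(keep)  # avoid grouping the resampled events at the end
    return events[keep], labels[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for one sample: 1000 events, 15 markers.
    events = rng.normal(size=(1000, 15))
    labels = (rng.random(1000) < 0.1).astype(int)  # 1 = hypothetical target population

    shifted = linear_shift(events, rng)
    scaled_events, scaled_labels = scale_population(events, labels, 1, 2.0, rng)
    print(shifted.shape, scaled_events.shape, int(scaled_labels.sum()))
```

Both functions return new arrays and leave the input untouched, so several augmentations can be chained or applied independently to the same original sample. Whether a given shift magnitude or scaling factor still yields a biologically plausible sample has to be validated on the real data.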



Literature research 

Implementation in Python 


Written report or thesis (in English) and final presentation 



Basic knowledge of Deep Learning (PyTorch, TensorFlow) and Machine Learning