Multi-step classification of flow cytometry cell data with set transformers 

Supervisors: Michael Reiter, Matthias Wödlinger, Florian Kowarsch


Flow Cytometry is a laser-based technique for measuring antigen expression levels of blood cells. It is used in research as well as in daily clinical routine for immunophenotyping and for monitoring the residual number of cancer cells during chemotherapy. One patient’s sample contains approximately 50,000 to 300,000 cells (also called events), with up to 15 different features (markers) measured per cell. Each feature corresponds either to a physical property of the cell (cell size, granularity) or to the expression level of a specific antigen marker on the cell’s surface.

In a process called Gating, medical experts draw polygons around cell populations on 2D plots in order to hierarchically sub-select and track down cancer cell populations. For instance, measurement artifacts are usually excluded on the first plot. The resulting data is then filtered further by selecting certain cell populations based on their biological features. This process is repeated until the filtered data consists only of cancer cells. These annotations are used as ground truth to train automated prediction models.
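The hierarchical filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not the gating software used in clinical practice: the marker names (`FSC-A`, `SSC-A`, `CD19`, `CD10`), the polygon coordinates, and the helper functions are all hypothetical, and the point-in-polygon test is a plain ray-casting routine.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how often a horizontal ray from (x, y) crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def apply_gates(events, gates):
    """Apply gates in order; return, per event, the index of the first gate it failed.

    An event that passes every gate gets the value len(gates).
    """
    exclusion_step = []
    for event in events:
        step = len(gates)
        for i, (x_marker, y_marker, polygon) in enumerate(gates):
            if not point_in_polygon(event[x_marker], event[y_marker], polygon):
                step = i
                break
        exclusion_step.append(step)
    return exclusion_step


# Hypothetical gates: (x marker, y marker, polygon vertices).
gates = [
    ("FSC-A", "SSC-A", [(1, 1), (9, 1), (9, 9), (1, 9)]),    # exclude artifacts
    ("CD19", "CD10", [(5, 5), (10, 5), (10, 10), (5, 10)]),  # select a population
]
events = [
    {"FSC-A": 0.5, "SSC-A": 4.0, "CD19": 7.0, "CD10": 7.0},  # fails gate 0
    {"FSC-A": 5.0, "SSC-A": 5.0, "CD19": 2.0, "CD10": 7.0},  # fails gate 1
    {"FSC-A": 5.0, "SSC-A": 5.0, "CD19": 7.0, "CD10": 7.0},  # passes all gates
]
exclusion_step = apply_gates(events, gates)
```

Events that pass every gate correspond to the final (cancer cell) population; all other events carry the index of the gate that removed them.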

Figure 1: Illustration of the Gating process

Problem Statement 

Powerful deep learning architectures such as set transformers [1] have proven suitable for cell classification in high-dimensional Flow Cytometry data [2]. Wödlinger’s set transformer model takes all events of a given Flow Cytometry sample as input and predicts the cell class (cancer cell or not) for each cell. An alternative approach could predict not only whether a cell is a cancer cell, but also the cell class at each Gating step (that is, the Gating step at which the cell was excluded).
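One way to make the difference between the two labelings concrete is to derive both from the same annotation. The sketch below assumes that each cell is annotated with the index of the Gating step at which it was excluded, with the total number of steps meaning it was never excluded. This encoding and the function names are illustrative assumptions, not taken from Wödlinger’s implementation.

```python
def per_step_targets(exclusion_step, num_steps):
    """One binary target per Gating step: 1 if the cell is still kept after that step."""
    return [1 if step_index < exclusion_step else 0 for step_index in range(num_steps)]


def binary_target(exclusion_step, num_steps):
    """Single cancer / non-cancer label: 1 only if the cell survived every step."""
    return 1 if exclusion_step == num_steps else 0


num_steps = 3
exclusion_steps = [0, 2, 3]  # hypothetical annotations for three cells

multi_step = [per_step_targets(s, num_steps) for s in exclusion_steps]
single_step = [binary_target(s, num_steps) for s in exclusion_steps]
```

Under this encoding, the target for the last Gating step coincides with the single cancer / non-cancer label, so the multi-step labeling contains the binary one as a special case.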


More than 600 different clinical samples of Acute Lymphoblastic Leukemia (ALL) patients are available from 3 different centers: St. Anna Hospital Vienna, Charité Berlin, and Garrahan Hospital Buenos Aires. With each sample containing roughly 200k events, the overall data pool contains approximately 650 × 2 × 10^5 = 130 million cells.


The goal of this work is to implement a set transformer network similar to Wödlinger’s approach, with the difference that each step of the Gating process is predicted for each cell. Two different approaches should be considered:

  1. Predicting all cell classes at once (with one forward pass)
  2. Predicting the Gating steps sequentially, one after another (autoregressive execution of multiple forward passes)
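The two approaches differ mainly in control flow, which the following sketch illustrates with plain Python callables standing in for trained networks. Everything here is hypothetical: `joint_model`, `step_models`, and the threshold rules are toy placeholders, and an actual implementation would use a set transformer (for example in PyTorch) that attends over all events of a sample.

```python
def predict_all_at_once(joint_model, events):
    """One forward pass: the model returns a keep/exclude flag per step for every event."""
    return joint_model(events)


def predict_autoregressive(step_models, events):
    """One forward pass per Gating step; each step only sees events kept so far."""
    num_steps = len(step_models)
    predictions = [[0] * num_steps for _ in events]
    active = list(range(len(events)))  # indices of events not yet excluded
    for step, model in enumerate(step_models):
        if not active:
            break
        kept_flags = model([events[i] for i in active])
        still_active = []
        for i, kept in zip(active, kept_flags):
            predictions[i][step] = kept
            if kept:
                still_active.append(i)
        active = still_active
    return predictions


# Toy stand-ins for trained networks (assumption: each event is a (size, marker) pair).
step_models = [
    lambda evs: [1 if size > 1.0 else 0 for size, _ in evs],      # step 0
    lambda evs: [1 if marker > 5.0 else 0 for _, marker in evs],  # step 1
]


def joint_model(evs):
    """Toy joint model that reproduces both step decisions in a single call."""
    out = []
    for ev in evs:
        first = step_models[0]([ev])[0]
        second = step_models[1]([ev])[0] if first else 0
        out.append([first, second])
    return out


events = [(0.5, 7.0), (2.0, 3.0), (2.0, 7.0)]
joint = predict_all_at_once(joint_model, events)
sequential = predict_autoregressive(step_models, events)
```

In this toy setup both schemes produce the same per-step predictions. The sequential version passes fewer events to each later step but requires one forward pass per Gating step, which is one of the practical trade-offs such a comparison would need to examine.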

The developed network should be evaluated against Wödlinger’s set transformer on the given dataset. The advantages, disadvantages, and practical aspects of the different architectures should be discussed.



Literature research 

Implementation in Python 


Written report or thesis (in English) and final presentation 



Basic knowledge of deep learning (PyTorch, TensorFlow) and machine learning