### Workshop C2 - Computational Mathematical Biology with emphasis on the Genome

**Organizers:** Steve Smale (City University of Hong Kong, China & University of California at Berkeley, USA) - Mike Shub (City University of New York, USA) - Indika Rajapakse (University of Michigan, USA)

## Talks

July 17, 14:30 ~ 15:20 - Room B2

## Biological Clocks in Cell Division and Infectious Disease

### John Harer

### Duke University, Geometric Data Analytics and Mimetics Biosciences, United States - john.harer@duke.edu

Gene regulatory networks (GRNs) can drive cyclic and temporally ordered processes in biological systems. One of the best-known GRNs of this type is the circadian clock, which drives rhythmic behaviors with a period of approximately 24 hours. The circadian clock network exerts its control, in part, by regulating a dynamic program of gene expression where substantial fractions of the genome are expressed during distinct phases of the circadian cycle. We have proposed that a different GRN controls the cyclic program of gene expression that is observed during the cell division cycle. In fact, we have observed similar gene expression programs across time scales from hours to days, and across organisms that are evolutionarily diverged by millions of years. These observations suggest that a class of GRNs may serve as central mechanisms that drive temporal gene expression programs in biological systems.

In this talk we will discuss a pipeline of mathematical methods designed to infer GRNs that control various cyclic and temporally ordered processes, including the Yeast Cell Cycle and Parasite Clock Networks.

Joint work with Steve Haase (Duke University and Mimetics Biosciences, USA), Anastasia Deckard (Geometric Data Analytics, USA), Francis Motta (Duke University), Tina Kelliher (Duke University), Kevin McGoff (UNC Charlotte, USA) and Xin Guo (The Hong Kong Polytechnic University).

July 17, 15:30 ~ 16:00 - Room B2

## Regulation networks for stem cells

### Natalia Komarova

### University of California - Irvine, USA - komarova@uci.edu

Design principles of biological networks have been studied extensively in the context of protein-protein interaction networks, metabolic networks, and regulatory (transcriptional) networks. Here we focus on less studied regulation networks that occur on larger scales, namely, the cell-to-cell signaling networks that connect groups of cells in multicellular organisms. These are the feedback loops that orchestrate the complex dynamics of cell fate decisions and are necessary for the maintenance of homeostasis in stem cell lineages. We focus on "minimal" networks, that is, those that have the smallest possible numbers of controls. Using the formalism of digraphs, we show that in two-compartment lineages, reducible systems must contain two 1-cycles, and irreducible systems one 1-cycle and one 2-cycle; stability follows from the signs of the controls and does not require magnitude restrictions. In three-compartment systems, irreducible digraphs have a tree structure or have one 3-cycle and at least two more shorter cycles, at least one of which is a 1-cycle. Our results may serve as a first step toward an understanding of ways in which these networks fail in cancer.

July 17, 16:00 ~ 16:30 - Room B2

## An Approach to Control Partially Known Networks

### Gemunu Gunaratne

### University of Houston, USA - gemunu.gunaratne@gmail.com

Coupled networks are needed to represent a wide range of problems including bio-molecular processes in cells, interactions within social groups, and ecosystems. One of the major goals of network analyses is to design methods to control such systems to a pre-specified target state. For example, in cellular processes, we may inquire if consequences of genetic mutations or chromosomal rearrangements can be mitigated through external intervention. The obvious approach is to start from a realistic model of the underlying network. Unfortunately, this is an extremely difficult task. Precise quantitative forms of interactions between biomolecules are currently unavailable, and are unlikely to be available in the near future. The model-independent approach proposed here relies on representing the "state" of a cell through its gene expression profile; i.e., the levels of mRNA within a cell. The data can be extracted using techniques such as RNA-Seq. Under this assumption, key ingredients needed for control are "response surfaces," each of which expresses how the gene expression profile responds to a specific external perturbation. The number of control nodes, i.e., nodes whose levels are to be externally controlled, can be systematically increased in order to reach the target state. Importantly, the most appropriate control node to be added at a given level is determined computationally from prior data. Analyses of synthetic models and (experiments on) nonlinear electrical circuits show that the target state can be typically reached with a few (3 or 4) control genes. This observation is consistent with seminal work on reprogramming mature cells to stem cells using the transcription factors Myc, Oct3/4, Sox2, and Klf4 [Takahashi and Yamanaka, Cell 126, 663, 2006] and on reprogramming fibroblasts to cardiomyocytes using three transcription factors Gata4, Mef2c and Tbx5 [Ieda et al., Cell 142, 375, 2010].
The work outlined is a model-free approach to select, systematically, a sequence of genes whose levels need to be subjected to external control in order to approach the pre-specified final state. We propose experimental validations of the approach in the sleep-deprivation network and in an addiction network in Drosophila.

July 17, 17:00 ~ 17:30 - Room B2

## Genome and Transcriptome Dynamics in Cancer Development

### Thomas Ried

### National Cancer Institute, USA - riedt@mail.nih.gov

The development of carcinomas, e.g., cervical or colorectal tumors, is defined by the sequential acquisition of chromosomal copy number changes, also referred to as aneuploidy. Importantly, these changes can be invariably characterized as follows: (1) they occur in premalignant lesions before the transition to invasive disease; (2) they are surprisingly tissue specific: in cervical cancers we observe copy number gains of chromosome 3, whereas colon cancers carry extra copies of chromosomes 7, 13 and 20, along with losses of 18; (3) these chromosomal aneuploidies are maintained in metastases and in cell lines, despite ongoing, apparently random, chromosome mis-segregation. Our observations indicate that they are drivers of cancer development and its evolutionary bottleneck. We have conducted extensive analyses to show that genome copy number changes directly affect the expression of genes that reside on the affected chromosomes, both in primary tumors and in experimental models, and therefore result in a complex deregulation of the cancer transcriptome. In addition to chromosomal aberrations, specific genetic signaling pathways are activated by mutations. We are presently analyzing how these cancer-specific genetic alterations affect structure/function relationships in the cell nucleus. This is achieved by correlating high-resolution maps of nuclear chromosomal architecture with global gene expression analyses. We realize that such changes cannot be understood in a meaningful way without considering the time aspect in which they occur. We propose applying data-guided mathematical methods to understand the dynamical aspect of tumor development, with the ultimate goal of controlling the cancer genome, and reversing cancer cells to the differentiated state they were derived from.

Joint work with Indika Rajapakse (University of Michigan).

July 17, 17:30 ~ 18:00 - Room B2

## Considering Quantum Information and Other Non-sequence Information in Understanding the Operating Parameters of the Genome

### Christian Macedonia

### Johns Hopkins University School of Medicine, USA - cmacedo2@jhmi.edu

The discovery of biological information encoding in nucleotide sequences was one of the most profound scientific breakthroughs of the 20th century. Scientists learned that the information necessary to regenerate living things resided in the nucleus but more pointedly found that the basic code for the generation of proteins was reducible to a remarkably simple series of three-nucleotide combinations. This discovery was the result of an inspiring collaboration between biologists, chemists, and mathematicians. Unfortunately, it also set forth a "central dogma" that has readily been disproved but has had lingering negative effects on the full comprehension of genomic information storage and processing. For instance, biologists continue to use the term "non-coding regions" to describe areas of the human genome that clearly have information encoded but not encoded to produce proteins. Biological information is also manifest within chromosomal material in non-sequence motifs such as through epigenetics, histone coiling, and other geometric modifications of DNA and RNA crystal structure. A single diploid human zygote contains roughly 1.5 GB of raw data of which only about 30 MB is exomic and yet something as extraordinarily complex as a newborn baby is produced from such sparse information. Either nature uses compression algorithms heretofore unimaginable to the world's best cryptographers or there are layers of information encoding and processing yet to be discovered. While particle physics and quantum information theory are applied ever more frequently in the world's top data encryption laboratories, there appears to be no serious effort in applying these disciplines to understanding information encoding and processing in the genome.
While it is known that DNA is held together with quantum entangled electrons along the entirety of the DNA backbone, we have yet to explore the possibility that this or any other non-sequence physical property contributes to information storage and management in the genome. Genomics as an information sciences discipline needs to expand beyond simply understanding nucleotide sequence. Without this, a full understanding of genomic contributions to health and disease can never be completely realized.

July 17, 18:00 ~ 18:30 - Room B2

## The Local Edge Machine: inference of dynamic models of gene regulation

### Xin GUO

### The Hong Kong Polytechnic University, Hong Kong - x.guo@polyu.edu.hk

We present a novel approach, the Local Edge Machine, for the inference of regulatory interactions directly from time-series gene expression data. We demonstrate its performance, robustness, and scalability on in silico datasets with varying behaviors, sizes, and degrees of complexity. Moreover, we demonstrate its ability to incorporate biological prior information and make informative predictions on a well-characterized in vivo system using data from budding yeast that have been synchronized in the cell cycle. Finally, we use an atlas of transcription data in a mammalian circadian system to illustrate how the method can be used for discovery in the context of large complex networks.

Joint work with Kevin A. McGoff (UNC Charlotte, USA), Anastasia Deckard (Duke University, USA), Christina M. Kelliher (Duke University, USA), Adam R. Leman (Duke University, USA), Lauren J. Francey (Duke University, USA), John B. Hogenesch (University of Cincinnati, USA), Steven B. Haase (Duke University, USA) and John L. Harer (Duke University, USA).

July 18, 14:30 ~ 15:20 - Room B2

## Topological determination of H-bonds rotations in Proteins

### Jørgen Andersen

### Aarhus University, Denmark - jea.qgm@gmail.com

In this talk we will first recall our H-bond rotation descriptor of the local geometry in a protein around an H-bond. We will then report on recent experimental work in progress on how to determine this descriptor by combinatorial topological methods. Finally, we will discuss possible strategies for determining combinatorial topological data from the primary sequence.

July 18, 15:30 ~ 16:00 - Room B2

## Exploration of Cancer Detection and Treatment with Data Science

### Xuan Michael

### UniData Technology, China - michael.xuan@unidt.com

There is a lot of attention and exploration in the detection and treatment of certain types of cancer with the help of big data and machine learning methods.

We would like to report on some of the developments and explorations in this field at UniData Technology.

July 18, 16:00 ~ 16:30 - Room B2

## Centrality in biological networks

### Alfred Hero

### University of Michigan, USA - hero@eecs.umich.edu

Vertex centrality measures arise in topological network analysis, in particular for community detection, clustering, and vertex nomination. This talk will discuss centrality in the context of biological networks. We will then introduce a new variational representation for betweenness centrality in large Euclidean graphs that can be used to characterize influential vertices in gene expression networks.

July 18, 17:00 ~ 17:30 - Room B2

## DSGRN: Global guide to regulatory network dynamics I: Applications

### Tomas Gedeon

### Department of Mathematical Sciences, Montana State University, United States - gedeon@math.montana.edu

Dynamic Signatures Generated by Regulatory Networks (DSGRN) provides a queryable description of global dynamics over the entire parameter space. DSGRN is based on a new approach to dynamical systems, which moves the focus away from trajectories and invariant sets, and toward a robust, scalable and computable description of dynamics in terms of lattices and posets. On the level of software, DSGRN takes as input a regulatory network and outputs a queryable SQL database that provides information about the structure of global dynamics over all of the associated parameter space.

In this first lecture of the series of two, we apply DSGRN to challenging problems in gene regulation. Experimental data on gene regulation is mostly qualitative, and quantitative data is subject to large uncertainty. There are no first-principle models for gene regulatory networks, and no canonical choices of nonlinearities. Given these realities, it is very difficult to make reliable predictions of dynamical behavior of these networks using mathematical models. The current approach of choosing reasonable nonlinearities, reasonable parameter values, and a few initial conditions, and then running simulations, severely subsamples both the parameter and phase space, and does not provide provable predictions about the dynamics. The coarse description of dynamics over the entire parameter space obtained by our method provably captures several important aspects of the dynamics, and addresses the limitations imposed by sub-cellular biology. In this talk we describe work on the E2F-Rb network, which controls the mammalian cell cycle restriction point and has been implicated in many cancers. We show that a large portion of the parameters support either the proliferative state, the quiescent state, or hysteresis between these two states. We sample perturbations of this network and study the robustness of these dynamics in the network space.

Joint work with Bree Cummins (Montana State University, USA), Shaun Harker (Rutgers University, USA) and Konstantin Mischaikow (Rutgers University, USA).

July 18, 17:30 ~ 18:00 - Room B2

## Global guide to regulatory network dynamics II: Theory

### Konstantin Mischaikow

### Rutgers University, USA - mischaik@math.rutgers.edu

Dynamic Signatures Generated by Regulatory Networks (DSGRN) provides a queryable description of global dynamics over the entire parameter space. DSGRN is based on a new approach to dynamical systems, which moves the focus away from trajectories and invariant sets, and toward a robust, scalable and computable description of dynamics in terms of lattices and posets. On the level of software, DSGRN takes as input a regulatory network and outputs a queryable SQL database that provides information about the structure of global dynamics over all of the associated parameter space.

In this second lecture of the series of two, we discuss the theory and algorithms that form the basis for the DSGRN computations. In particular, we will describe tasks that DSGRN performs including:

1. the decomposition of parameter space into a finite collection of regions with explicit expressions as semi-algebraic sets,

2. for each element of parameter space, a representation of the dynamics in the form of a state transition graph, with the property that the state transition graph is constant over the above-mentioned regions of the parameter space,

3. the reduction of information of the state transition graph into a poset, called a Morse graph, for which the nodes represent recurrent dynamics and the edges indicate the direction of gradient-like dynamics.
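The third task can be illustrated on a toy example. The sketch below is not DSGRN code; it only assumes that a state transition graph is given as a Python dictionary mapping each state to its successors (every state must appear as a key). It finds the strongly connected components, keeps those that carry recurrent dynamics (a cycle or a self-loop), and orders them by reachability. The function names and the toy graph are hypothetical, and the returned relation is the full reachability order rather than a reduced Hasse diagram.

```python
from collections import defaultdict


def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each state to a list of successors."""
    order, seen = [], set()
    for root in graph:  # first pass: record DFS finishing order
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph[root]))]
        while stack:
            node, successors = stack[-1]
            advanced = False
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()
    reverse = defaultdict(list)
    for node, successors in graph.items():
        for nxt in successors:
            reverse[nxt].append(node)
    component = {}
    for root in reversed(order):  # second pass on the reversed graph
        if root in component:
            continue
        component[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in component:
                    component[prev] = root
                    stack.append(prev)
    groups = defaultdict(set)
    for node, label in component.items():
        groups[label].add(node)
    return [frozenset(group) for group in groups.values()]


def reachable(graph, start_states):
    """All states reachable from start_states (including themselves)."""
    seen, stack = set(start_states), list(start_states)
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def morse_graph(graph):
    """Recurrent components and the reachability order between them."""
    components = strongly_connected_components(graph)
    recurrent = [c for c in components
                 if len(c) > 1 or any(v in graph[v] for v in c)]
    order = set()
    for a in recurrent:
        reach = reachable(graph, a)
        for b in recurrent:
            if a != b and b <= reach:
                order.add((a, b))
    return recurrent, order


# Toy graph: a 2-cycle {0, 1} that flows through state 2 into two fixed points.
toy = {0: [1], 1: [0, 2], 2: [3, 4], 3: [3], 4: [4]}
nodes, order = morse_graph(toy)
```

In the toy graph, state 2 is transient and is dropped, leaving three recurrent nodes with the cycle lying above both fixed points.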

Time permitting we will discuss current questions including the following:

1. theorems that relate the DSGRN information to more classical concepts from dynamical systems such as attractors and Morse decompositions,

2. introducing algebraic topological methods for analyzing the structure of the nonlinear dynamics,

3. refinement of the computations,

4. potential methods for compressing the information in the database.

Joint work with Tomas Gedeon (Montana State University, USA), Shaun Harker (Rutgers University, USA) and Bree Cummins (Montana State University, USA).

July 18, 18:00 ~ 18:30 - Room B2

## Dynamics in Large Genetic Networks

### Leon Glass

### McGill University, Canada - glass@cnd.mcgill.ca

Genetic activity is partially regulated by a complicated network of proteins, coded by genes, called transcription factors. I will describe a mathematical framework dating to the 1970s to relate the structure of a network to qualitative features of the dynamics. The underlying idea is to first capture the topology and logic of the network interactions by a Boolean network, and to then embed the Boolean network into a continuous piecewise linear differential equation. A symbolic representation of any given network and of its dynamics is given by a directed graph on a Boolean N-cube, where N is the number of genes. Assuming no self-input, each different network of N genes generates a different directed N-cube in which each edge is uniquely directed. An attracting cycle on a directed N-cube is a cycle for which each vertex on the cycle is connected to N-2 edges not on the cycle that are directed towards it. In any dimension, an attracting cycle on the hypercube implies a unique stable limit cycle in an associated piecewise linear differential equation. A simple example is the 3-dimensional repressilator. More complicated examples in 4 and higher dimensions can have chaotic dynamics. I pose the question: What can we tell about the dynamics in a given piecewise linear equation from its associated directed N-cube as N becomes large? If this model is a reasonable way to think about real genetic networks, then there must be strong restrictions on the class of possible logical structures in order to have well controlled dynamics. I have benefited from collaborations with many colleagues, especially S. A. Kauffman and J. Pasternack in the 1970s and R. Edwards more recently.
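As a minimal sketch of the directed N-cube construction, the code below builds the cube for a 3-gene repressilator and searches for attracting cycles. It assumes a particular wiring (gene 1 is repressed by gene 3, gene 2 by gene 1, gene 3 by gene 2); the function names are hypothetical and not taken from the talk. An edge along coordinate i leaves a vertex exactly when the Boolean update disagrees with the current value of gene i, so a cycle is attracting when every vertex on it has a single outgoing edge.

```python
from itertools import product


def out_edges(state, update):
    """Coordinates along which the directed cube edge leaves `state`."""
    target = update(state)
    return [i for i, (x, t) in enumerate(zip(state, target)) if x != t]


def flip(state, i):
    """Neighbouring cube vertex obtained by flipping coordinate i."""
    return state[:i] + (1 - state[i],) + state[i + 1:]


def attracting_cycles(n, update):
    """Cycles on the directed n-cube whose vertices all have out-degree 1."""
    cycles = set()
    for start in product((0, 1), repeat=n):
        path, state = [], start
        while state not in path:
            moves = out_edges(state, update)
            if len(moves) != 1:
                break  # branching vertex or fixed point: abandon this path
            path.append(state)
            state = flip(state, moves[0])
        else:
            cycle = path[path.index(state):]
            k = cycle.index(min(cycle))  # rotate to a canonical start
            cycles.add(tuple(cycle[k:] + cycle[:k]))
    return cycles


def repressilator(state):
    """Assumed wiring: each gene is ON exactly when its repressor is OFF."""
    x1, x2, x3 = state
    return (1 - x3, 1 - x1, 1 - x2)


cycles = attracting_cycles(3, repressilator)
```

For this wiring the search returns a single cycle through six of the eight vertices; the two remaining vertices, (0, 0, 0) and (1, 1, 1), have all three edges pointing outward.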

July 18, 18:30 ~ 19:00 - Room B2

## An Algorithm for Cellular Reprogramming

### Indika Rajapakse

### University of Michigan, USA - indikar@umich.edu

In 2007, a remarkable discovery was made that with just four external inputs (transcription factors), it was possible to change differentiated cells into embryonic-like cells. This type of cellular reprogramming changes the fundamental nature of a cell. It invites the possibility of building a universal template for transcription factor-guided reprogramming. I will present our work on an algorithm for cellular reprogramming that combines advanced genomics technologies with mathematics.

Joint work with Anthony Bloch (University of Michigan) and Roger Brockett (Harvard University).

July 19, 14:30 ~ 15:00 - Room B2

## Can a maximum entropy model explain codon and amino acid abundances in genomes and proteomes?

### Ignacio Enrique Sánchez

### Universidad de Buenos Aires, Argentina - nachoquique@gmail.com

Living cells share some nearly universal features, such as the genetic code and the monomers that constitute proteins and nucleic acids. Our hypothesis is that the relative abundances of codons in genomes and amino acids in proteomes can be explained from a reduced number of nearly universal principles. In previous work, we were able to explain the relative abundances of amino acids in over 100 proteomes by postulating that all organisms minimize the energy flux through amino acid metabolism. We now present a two-layer maximum entropy model that can simultaneously explain the relative abundances of amino acids in proteomes and codons in genomes by adding coding constraints to the analysis.

Joint work with Michael Shub and Teresa Krick.

July 19, 15:00 ~ 15:30 - Room B2

## Large Scale 3D Chromatin Reconstruction From Chromosomal Contacts

### Shuaicheng Li

### City University of Hong Kong, Hong Kong - shuaicli@cityu.edu.hk

Recent advances in genome analysis techniques have established that chromatin has preferred three-dimensional (3D) conformations. Spatial folding can bring two genes that are distant along the nucleotide sequence into contact. Identifying these contacts is important for understanding the activities of genes. This has motivated the proposal of methods like Hi-C to detect long-range interactions in the past decade. One can further gain insights on the contacts by applying distance geometry techniques to infer the chromosomal 3D structures from Hi-C data, as has been demonstrated by algorithms such as ChromSDE and ShRec3D. These matrix-based algorithms, however, are space- and time-consuming on very large datasets. A human genome at 100 kilobase resolution would involve ~30,000 loci, requiring gigabytes merely to store the matrices. In this work, we propose a succinct representation of the distance matrix which tremendously reduces this space requirement. We give a complete solution, called SuperRec, for the inference of chromosomal structures from Hi-C data, through solving the large-scale weighted multidimensional scaling problem. SuperRec runs faster than earlier systems without compromising the accuracy of the results. Using SuperRec, we were able to reconstruct a structure of 30,000 loci more than 400 times faster than existing methods: SuperRec solved the structure in 43 minutes, whereas the state-of-the-art method ShRec3D took 13 days.
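The storage and speed-up figures quoted above can be checked with a few lines of arithmetic. The sketch assumes a dense matrix of 8-byte floating-point entries, which is an illustrative assumption and not a detail stated in the abstract; the function name is hypothetical.

```python
def dense_matrix_bytes(num_loci, bytes_per_entry=8):
    """Bytes for a dense num_loci x num_loci matrix (8 bytes assumes float64)."""
    return num_loci * num_loci * bytes_per_entry


loci = 30_000

# Full square distance matrix: 30,000^2 entries * 8 bytes = 7.2 GB.
full_gb = dense_matrix_bytes(loci) / 1e9

# Storing only the upper triangle of the symmetric matrix roughly halves this.
upper_gb = loci * (loci + 1) // 2 * 8 / 1e9

# Reported run times: 13 days for ShRec3D versus 43 minutes for SuperRec.
speedup = (13 * 24 * 60) / 43
```

Under these assumptions a single dense matrix already occupies several gigabytes, and the ratio of the two run times is a little over 435, consistent with "more than 400 times faster".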

July 19, 15:30 ~ 16:00 - Room B2

## Clustering with sparse data: the case of single-cell genomic profiles

### Jun Li

### University of Michigan, USA - junzli@med.umich.edu

Sequencing-based genomic profiling of individual cells, such as single-cell RNA-sequencing (scRNA-seq), produces low-count integer data for the number of transcripts of each gene measured in each cell. These data are called "sparse", because of high fractions of missing values as well as the issue of low-rank approximation. My group, along with other members of the Michigan Center for Single-Cell Genomic Data Analytics, has been wrestling with the problem of ab initio class discovery using such sparse data. Many specialized methods have appeared in recent years to address noise modeling, normalization, batch effect correction, rare cell type identification, network inference, and integration across sparse 'omics datatypes. However, several challenges remain in making statistically principled inference. Foremost among these is the need for community standards in declaring and objectively assessing clustering solutions. This is challenging as the true "structure" of the data is not known beforehand, and the full range of generative models includes not only manifolds of all possible shapes, but also discrete vs. continuous clusters, simple partitioning vs. hierarchical, full membership vs. partial membership, or combinations of these scenarios. Further, distribution density in the high-dimensional space could vary greatly from one region of the manifold to another, calling for adaptive learning of "local structure". In practice, however, the development of new methods and their application to real datasets have often considered only a subset of these models, and the underlying assumptions are often unacknowledged. With this backdrop, our Center has begun designing best practices that systematically evaluate diverse generative models, and providing guidance on how to identify the most appropriate ones. We also assess how the underlying assumptions impact the choice of data normalization and clustering algorithms.
Addressing these challenges is relevant in other data science areas, including learning from electronic health records, consumer purchasing and rating data, location and usage patterns of mobile devices, connectivity in social networks, and medical imaging data. Similarly, the task of manifold alignment across omics data types represents a special case of integration over multiple networks inferred with imperfect information.

Joint work with Sue Hammoud, Xiang Zhou and Anna Gilbert (University of Michigan).

July 19, 16:00 ~ 16:30 - Room B2

## Mathematics of Transcription: Decoding cis-regulation from the Drosophilidae to the Sepsidae and back again

### John Reinitz

### University of Chicago, USA - reinitz@galton.uchicago.edu

The syncytial organization of the blastoderm stage of Drosophila development affords unique advantages for the mathematical modeling of transcriptional control. It is possible to use the entire embryo as a spatially resolved microarray in which the response of reporters to quantitatively assayed transcription factors can be monitored at cellular resolution. This provides an opportunity to construct quantitative and predictive models of transcriptional control that are not limited to single enhancers. I will discuss progress in this effort, with emphasis on a case where understanding conservation of function in the absence of conservation of sequence leads to the identification of functional clusters of binding sites that are arguably interpretable as cis-regulatory codons.

July 19, 17:00 ~ 17:30 - Room B2

## A Differential Inclusion Approach to Variable Selection and Applications

### Yuan Yao

### Hong Kong University of Science and Technology, China - yuany@ust.hk

A differential inclusion approach is presented for variable selection in high dimensional statistics. The differential inclusions arise from the limits of the Linearized Bregman Iterations, or equivalently the Mirror Descent algorithm associated with sparsity-induced Bregman distances. A path consistency theory is presented to explain its observed superb performance in comparison to the (generalized) Lasso estimate. Application examples are given, including a case study in Alzheimer's disease detection.
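For orientation, the following is a minimal sketch of a Linearized Bregman Iteration for sparse linear regression with squared loss, in the form commonly given in the literature; the exact variant used in the talk may differ. The auxiliary variable z accumulates scaled gradient steps, and the estimate is a soft-thresholded, rescaled copy of z, so variables enter the model one after another along the iteration path. The parameter names `kappa` and `alpha`, their default values, and the toy data are assumptions made for illustration.

```python
def soft_threshold(value, level=1.0):
    """Shrink `value` toward zero by `level`; zero inside [-level, level]."""
    if value > level:
        return value - level
    if value < -level:
        return value + level
    return 0.0


def linearized_bregman_path(X, y, kappa=5.0, alpha=0.1, n_iter=100):
    """Return the list of coefficient vectors visited by the iteration.

    z_{k+1}    = z_k + (alpha / n) * X^T (y - X beta_k)
    beta_{k+1} = kappa * soft_threshold(z_{k+1}, 1)
    """
    n, p = len(X), len(X[0])
    z = [0.0] * p
    beta = [0.0] * p
    path = []
    for _ in range(n_iter):
        residual = [y[i] - sum(X[i][j] * beta[j] for j in range(p))
                    for i in range(n)]
        for j in range(p):
            gradient_j = sum(X[i][j] * residual[i] for i in range(n)) / n
            z[j] += alpha * gradient_j
        beta = [kappa * soft_threshold(z_j) for z_j in z]
        path.append(list(beta))
    return path


def entry_times(path):
    """First iteration at which each coefficient becomes nonzero (or None)."""
    p = len(path[0])
    times = [None] * p
    for k, beta in enumerate(path):
        for j in range(p):
            if times[j] is None and beta[j] != 0.0:
                times[j] = k
    return times


# Toy data: identity design, one strong signal, one weak signal, one null.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.5, 0.0]
path = linearized_bregman_path(X, y)
times = entry_times(path)
```

On the toy data the strong signal enters the path first, the weak one much later, and the null variable never enters, which is why stopping the iteration early acts as a form of regularization.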

Joint work with Chendi Huang (Peking University), Lingjing Hu (Chinese Capital Medical University), Stan Osher (UCLA), Feng Ruan (Stanford), Xinwei Sun (Peking University), Yizhou Wang (Peking University) and Jiechao Xiong (Peking University and Tencent AI Lab).