Robust inference of gene regulatory networks : System properties, variable selection, subnetworks, and design of experiments

Abstract: In this thesis, inference of biological networks from in vivo data generated by perturbation experiments is considered, i.e. deduction of causal interactions that exist among the observed variables. Knowledge of such regulatory influences is essential in biology.A system property–interampatteness–is introduced that explains why the variation in existing gene expression data is concentrated to a few “characteristic modes” or “eigengenes”, and why previously inferred models have a large number of false positive and false negative links. An interampatte system is characterized by strong INTERactions enabling simultaneous AMPlification and ATTEnuation of different signals and we show that perturbation of individual state variables, e.g. genes, typically leads to ill-conditioned data with both characteristic and weak modes. The weak modes are typically dominated by measurement noise due to poor excitation and their existence hampers network reconstruction.The excitation problem is solved by iterative design of correlated multi-gene perturbation experiments that counteract the intrinsic signal attenuation of the system. The next perturbation should be designed such that the expected response practically spans an additional dimension of the state space. The proposed design is numerically demonstrated for the Snf1 signalling pathway in S. cerevisiae.The impact of unperturbed and unobserved latent state variables, that exist in any real biological system, on the inferred network and required set-up of the experiments for network inference is analysed. Their existence implies that a subnetwork of pseudo-direct causal regulatory influences, accounting for all environmental effects, in general is inferred. In principle, the number of latent states and different paths between the nodes of the network can be estimated, but their identity cannot be determined unless they are observed or perturbed directly.Network inference is recognized as a variable/model selection problem and solved by considering all possible models of a specified class that can explain the data at a desired significance level, and by classifying only the links present in all of these models as existing. As shown, these links can be determined without any parameter estimation by reformulating the variable selection problem as a robust rank problem. Solution of the rank problem enable assignment of confidence to individual interactions, without resorting to any approximation or asymptotic results. This is demonstrated by reverse engineering of the synthetic IRMA gene regulatory network from published data. A previously unknown activation of transcription of SWI5 by CBF1 in the IRMA strain of S. cerevisiae is proven to exist, which serves to illustrate that even the accumulated knowledge of well studied genes is incomplete.