Inferring Genetic Regulatory Networks

from Microarray Data

with the Webmind AI Engine

 

 

Ben Goertzel

Webmind Diehards, LLC

May 31, 2001

 

 

This document is written for Webmind insiders who know a lot about Webmind AI and not much about genetics and microarrays.  It reviews the data analysis problem of “genetic regulatory network inference” – the mathematical structure of the problem, the pragmatic nature of the data, and various prior approaches – and then outlines a Webmind-AI-Engine based approach, giving reasons why it is believed this approach will be superior.

 

 

 

1. Introduction

 

Now that the human genome project is essentially complete, the bioinformatics community is looking ahead to the next challenge: figuring out how the genetic code actually does anything.  How do these sequences of nucleotides decode themselves, making use of other molecules in their environment to create organisms like us?

 

We’re a long way from fully understanding this, but thanks to recent revolutionary advances in experimental apparatus design, we now have data that lets us begin to approach the problem.  DNA chips, microarrays and related technologies allow scientists to collect time series data regarding the expression of genes during cell development. 

 

For the yeast organism, for example, there is a particularly good dataset available online: we have information about gene expression at 20 time points during cell development, for each of the roughly 6000 genes that yeast possesses.  At each of the 20 time points, we can see which genes are expressed and which are not.  From this data, it should be possible to infer something of the genetic regulatory network of yeast.  Of course, yeast in itself is not generally that interesting to humans (except bakers), but as far as cell development goes, yeast and humans are about 80% the same.   There are other datasets available as well, but the yeast dataset is the most thorough one available online for free.

 

A number of researchers, including well-known ones at excellent biotech labs, have tried to solve this particular data analysis problem.  But as I’ll review here, the results have not been tremendously successful.   Here I will discuss the likely reasons for their lack of success, and propose a method that may well overcome the obstacles they encountered.

 

There are many parallels between this data analysis problem and financial market prediction, a problem with which I have ample experience.  For instance:

 

-  The data is noisy (here in the literal sense; there is ample experimental error in measuring the underlying biophysical system)

-  The data is distressingly scant (2000 time series points would be a lot nicer than 20)

-  Broadly speaking, current approaches fall into two categories:

   -  extremely simplistic approaches that ignore most of the structure in the data

   -  overly complex approaches with too many parameters, which are less successful than the simplistic approaches

 

We have seen in the case of market prediction that, in spite of these obstacles, a fairly straightforward adaptive statistical pattern recognition approach (the so-called Extended Simplex Predictor) works wonders.  The regulatory network inference problem is by no means mathematically identical to the market prediction problem – in market prediction the time series are hundreds of points long, and one is interested in predicting the course of a single series based on other series, rather than in inferring an underlying interactional network that is known to exist.  But in an abstract way it’s in the same “problem species,” so the general lessons learned in the one case may be applicable to the other.

 

In outlining my proposed solution to the problem, I will assume familiarity with the Webmind AI Engine, at the level of the April 2001 version of Digital Intuition (Goertzel, 2001).  Essentially my proposal is to:

 

-  Define a special kind of logical expression to represent a conjectural regulatory pattern

-  Use inference (HOI) to assess whether a conjectural regulatory pattern fits the data or not, incorporating biological background knowledge where appropriate

-  Use CRN crossover/mutation, as well as inference and perhaps specialized heuristics, to create conjectural regulatory patterns

 

For the inference component, my intuition is that for this kind of problem, PTL will outperform NARS, because node probabilities are supplied directly by the data. 

 

A conjectural regulatory pattern represents a relationship between genes.  A key point here is that a gene is not exclusively characterized by its expression profile as determined by microarray experiments.  A gene may also have other characteristics associated with it, as mined from annotations in online datasets.  For instance, many genes will be annotated with information as to their functions, as determined by various non-microarray experiments.  This annotation data serves two functions:

 

-  The system may use annotations in the process of conceiving conjectural regulatory patterns

-  Some annotations may bias the system toward assessing certain conjectural regulatory patterns as more plausible than others

 

The ability to incorporate diverse data in this way is a major strength of the Webmind approach to this problem.  However, getting the use of the annotation data right will require some biological intuition.

 

2. About Microarrays

 

Chemists have long had methods for carrying out many simultaneous chemical reactions.  Most simply, trays can be created with 96 or 384 wells, each containing a different chemical and a unique bar code.

 

The last few years, however, have seen the development of methodologies that push even further in this direction – making possible experiments that scientists only a few years ago would have called impossible.  The application of these new methodologies to the analysis of gene and protein data has led to a new area of research that may be called massively parallel genomics and proteomics.

 

Most of the work done so far has been in genomics; the extension to proteomic analysis is more recent.  Here, for concreteness, I’ll discuss microarrays as used for genomic analysis; the proteomics case is basically the same from the point of view of data analysis, though vastly more difficult from the point of view of the experimental apparatus.  (If you are in need of a review of very basic genetics, there are many resources available online.  For instance, a tutorial at the level of first-year college biology may be found at http://vector.cshl.org/dnaftb/26/concept/index.html.)

There are several types of microarrays used in genomics, but they all embody a common methodology.  Single stranded DNA/RNA molecules are anchored by one end to some kind of surface (a chip or a plate depending on the type of apparatus).  The surface is then placed in a solution, and the molecules affixed to the chip will seek to hybridize with complementary strands (“target molecules”) floating in the solution.  (Hybridization refers to the formation of base pairs between complementary regions of two strands of DNA that were not originally paired).

Affymetrix’s technology, pioneered by Dr. Stephen Fodor, involves making DNA chips in a manner similar to the manufacture of semiconductor chips.  A process known as “photolithography” is used to create a huge number of molecules, directly on a silicon wafer.  A single chip measuring 1.28 cm X 1.28 cm can hold more than 400,000 “probe” molecules.  The procedure of gene chip manufacture has been fully automated for a while now, and Affymetrix manufactures 5,000-10,000 DNA chips per month.

Affymetrix DNA chips have a significant limitation in terms of the size of the molecules that can be affixed to them.  So far they’re normally used with DNA/RNA segments of length 25 or less.   Also, they are very expensive.  It currently costs about $500,000 to fabricate the light masks for a new array design, so their technology is most appropriate when the same chip needs to be used again and again and again.  The main example of this kind of use case is disease diagnosis.

On the other hand, spotted microarrays, first developed by Pat Brown at Stanford, are ordinary microscope slides on which robot arms lay down rows of tiny drops from racks of previously prepared DNA/RNA samples.  At present this technology can lay down tens of thousands of probe molecules, at least an order of magnitude below what Affymetrix can do.  The advantage of this approach is that any given DNA/RNA probe can be hundreds of bases long, and can, in principle, be made from any DNA/RNA sample.

There are other approaches as well.  For instance, Agilent Technologies, a spin-off from HP, is manufacturing array makers using ink-jet printer technology.  Their approach is interesting in that it promises to make practical the synthesis of a single instance of a given array design.  Lynx Corporation is pursuing a somewhat Affymetrix-like approach, but circumventing Affymetrix’s patents by using addressable beads instead of a silicon wafer.  And so forth.

So let’s suppose that, one way or another, we have a surface with a number of DNA/RNA molecules attached to it.  How do we do chemical reactions and measure their results?

First, the target molecules are fluorescently labeled, so that the spots on the chip/array where hybridization occurs can be identified.  The strength of the fluorescence emanating from a given region of the surface is a rough indicator of the amount of target substance that bound to the molecule affixed to that region.  In practical terms, what happens is that an image file is created, a photograph of some sort of the pattern of fluorescence emanating from the microarray itself.  Typically the image file is then “gridded”, i.e. mapped into a pixel array with a pixel corresponding to each probe molecule.  Then, there is a bit of black art involved in computing the hybridization level for a spot, involving various normalization functions that seem to have more basis in trial-and-error than in fundamentals. 

This data is very noisy, however.  To get more reliable results, researchers generally work with a slightly more complex procedure.  First, they prepare two related samples, each of which is colored with a different fluorescent substance (usually, one green, one red).  They then compare the relative amounts of expressed DNA/RNA in the two samples.  The ratio of green/red at a given location is a very meaningful number.  Using this ratio is a way of normalizing out various kinds of experiment-specific “noise”, assuming that these noise factors will be roughly constant across the two samples.
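To make the two-channel procedure concrete, here is a minimal sketch of the ratio computation in Python.  The background-subtraction and log-transform steps are typical of the literature rather than prescribed by any particular apparatus, and the intensity values are invented for illustration.

import numpy as np

# Invented example intensities for a handful of spots (arbitrary units).
red = np.array([1200.0, 340.0, 8800.0, 150.0])    # reference sample (red channel)
green = np.array([2400.0, 300.0, 4400.0, 160.0])  # test sample (green channel)
red_bg, green_bg = 100.0, 100.0                   # local background estimates

# Background-subtract, flooring at a small positive value to avoid log(0).
r = np.maximum(red - red_bg, 1.0)
g = np.maximum(green - green_bg, 1.0)

# Log2 ratio: +1 means twofold over-expression in the green sample,
# -1 means twofold under-expression, 0 means no change.
log_ratio = np.log2(g / r)
print(log_ratio)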

But even this ratio data is still highly noise-ridden, for a number of reasons beyond the usual risk of experimental error or manufacturing defects in the experimental apparatus.   For one thing, there are many different factors influencing the strength of the bond formed between two single stranded DNA/RNA molecules, including but not limited to:

-  the length of the bonded molecules

-  the base composition of the molecules (A/T and C/G bond with different energies)

-  the number and location of base mismatches (location being important since, for example, end mismatches are less likely to cause hybridization to fail)

Frequently, false positive hybridization events will occur, due to the ability of DNA to bind to sequences that are roughly complementary but not an exact match.  This can be controlled to some extent by the application of heat, which breaks bonds between molecules – getting the temperature just right will break false positive bonds and not true positive ones.  Other laboratory conditions besides temperature can have similar effects.  Another problem is that the “probe molecules” affixed to the surface may fold up and self-hybridize, thus rendering them relatively unavailable for hybridization with the target.

All these issues mean that a single data point in a large microarray data set cannot be taken all that seriously.  The data as a whole is extremely valuable and informative, but there are a lot of things that can go wrong and lead to spurious data points.  This means that data analysis methods, to be successfully applied to microarray data, have got to be extremely robust with respect to noise.

What’s the use of these high volumes of noisy data?  There are many different applications, actually.  Microarrays can be used for sequencing variants of a known genome, for detecting single nucleotide polymorphisms (SNPs), or for identifying a specific strain of virus (e.g. the Affymetrix HIV-1 array).  They can be used to measure the differences in gene expression between normal cells and tumor cells, which helps determine which genes may cause/cure cancer, or to identify which treatment a specific tumor should respond to best.  They can measure differences in gene expression between different tissue types, to determine what makes one cell type different from another.

The application of relevance to the present discussion, however, is the identification of genes involved in cell development.  For example, a classic 1998 paper by Cho et al gives a series of measurements of nearly the complete yeast genome taken at various stages during the cell division cycle.

 

3.  Available Data

While the human genome is the one of most interest to us humans, the prize for the highest volume of online data goes to the yeast genome.  There are a couple of reasons for this.  First, yeast study is of less immediate economic value, so that information on the yeast genome is not so closely guarded.  Second, yeastkind is simpler than humankind; there are fewer genes and so it has been easier to do comprehensive studies.

Of course, there are many aspects of human genetics and epigenetic development that cannot be approached by studying yeast.  However, there are many simple and unresolved problems that are common to both humans and yeast.  It seems reasonable to believe that, if one can’t crack a given problem in regard to yeast, one won’t be able to crack it in regard to humans.  Thus a reasonable way to proceed is to demonstrate one’s methods on yeast datasets, and then, once they’ve been proven in this simple case, move on to apply them to the analysis of human gene expression data.

A lot of information about yeast genetics and epigenetics, and about microarray data analysis in general, is available online; the most useful resources I have found are described below.

 

 

Cho et al’s paper giving time series data for gene expression during the yeast development cycle (derived using an Affymetrix DNA chip) is available online on the Science website, to subscribers.  The abstract of the paper is as follows (see http://cmgm.stanford.edu/pbrown/sporulation/abstract.html):

Diploid cells of budding yeast produce haploid cells through the developmental program of sporulation, which consists of meiosis and spore morphogenesis. DNA microarrays containing nearly every yeast gene were used to assay changes in gene expression during sporulation. At least seven classes with distinct temporal patterns of induction were observed. The transcription factor Ndt80 appears to be important for induction of a large group of genes at the end of meiotic prophase. Consensus sequences known or proposed to be responsible for temporal regulation could be identified solely from analysis of sequences of coordinately expressed genes. The temporal expression pattern provided clues to potential functions of hundreds of previously uncharacterized genes, some of which have vertebrate homologues that may function during gametogenesis.

 

The raw data from the experiments discussed in this paper are available online in tab-separated format at http://cmgm.stanford.edu/pbrown/sporulation/additional/spospread.txt

 

Chen et al (1999) include in their (online) paper an informative discussion of how to filter this data for easy computational analysis.  For instance, they point out that for many genes in the data set, either

-  the expression level remains negligibly low across all the time points, or

-  the expression level hardly varies from one time point to the next.

These genes may sensibly be omitted from consideration as “noise”, from the point of view of regulatory network inference.  They give the specific thresholds that they used in order to choose the set of relevant genes from the full data set, by these criteria.
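As a concrete illustration of this style of filtering, the sketch below drops genes that fail either criterion.  The threshold values are placeholders chosen for illustration, not the thresholds Chen et al actually used.

import numpy as np

def filter_genes(expression, min_peak=0.5, min_range=0.3):
    # expression: (n_genes, n_times) array of normalized expression levels.
    # Keep a gene only if it peaks above min_peak at some time point AND
    # its level varies by at least min_range across the series.
    peak = expression.max(axis=1)
    spread = expression.max(axis=1) - expression.min(axis=1)
    keep = (peak >= min_peak) & (spread >= min_range)
    return np.flatnonzero(keep)

# Toy data: 4 genes x 5 time points.
data = np.array([
    [0.1, 0.1, 0.1, 0.1, 0.1],   # flat and low: dropped
    [0.6, 0.6, 0.6, 0.6, 0.6],   # high but flat: dropped
    [0.1, 0.4, 0.9, 0.4, 0.1],   # varies strongly: kept
    [0.0, 0.2, 0.7, 0.8, 0.3],   # varies strongly: kept
])
print(filter_genes(data))        # -> [2 3]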

 

There may be more recent and more thorough data than the Cho et al data in the public domain, but I have not located any.  This data set has been well studied during the last year, that’s for sure.

 

Wessels et al (2001), in their systematic comparative study of methods for the inference of regulatory networks, created a number of synthetic data sets which are much smaller than the actual yeast data set.  I haven’t found their synthetic data sets online, but they may be available, and if not I’m sure the authors would supply us with them.  In any case, their paper describes how the datasets were generated.  It is probably worth experimenting with these toy datasets before tackling yeast.

 

Next, there is non-numerical data that may also be valuable for the analysis of gene expression.  When biologists interpret time series like the Cho et al data, they use a great deal of background knowledge about gene function.  On the other hand, automated data analysis methods tend to go purely by the numbers.  The ability to integrate numerical and non-numerical information about genes and gene expression can potentially add a great deal to the analysis of microarray data.

 

Fortunately, a strong effort is underway to standardize the expression of non-numerical genetic data.   One part of this is the Gene Ontology Project, described in detail at http://www.geneontology.org/ .  To quote that site,

 

The Gene Ontology (GO) project was established to provide a common language to describe aspects of a gene product's biology. The use of a consistent vocabulary allows genes from different species to be compared based on their GO annotations. For each of three categories of biological information--molecular function, biological process, and cellular component--a set of terms has been selected and organized. Each set of terms uses a controlled vocabulary, and parent-child relationships between terms are defined. This combination of a controlled vocabulary with defined relationships between items is referred to as an ontology. Within an ontology, a child may be a "part of" or an example ("instance") of its parent. There are three independently organized controlled vocabularies, or gene ontologies, one for molecular function, one for biological process, and one for cellular component. Many-to-many parent-child relationships are allowed in the ontologies. A gene may be annotated to any level in an ontology, and to more than one item within an ontology. The Gene Ontology project is a collaboration between three model organism databases, FlyBase (Drosophila), Saccharomyces Genome Database (SGD) and Mouse Genome Informatics (MGI).

 

The standard GO notation is used in the creation of Gene Summary Paragraphs, which now exist online for 300+ yeast genes, and which are described as follows:

 

A Gene Summary Paragraph is a summary of published biological information for a gene and its product which is designed to familiarize both yeast and non-yeast researchers with the general facts and important subtleties regarding a locus. SGD curators compose Gene Summary Paragraphs using natural language and a controlled vocabulary based on the Gene Ontology (GO). Gene Summary Paragraphs contain references and links to further information, and highlight connections between genes from yeast and other species wherever possible.

 

Of course, tuning an NL system to automatically parse gene summary paragraphs would be a very good thing, and a not terribly difficult task.  On the other hand, converting Gene Summary Paragraphs into a formal language by hand would be even easier, since currently for yeast there are only 300+ of them, and they use a controlled vocabulary.  

 

A very simple example of useful information derivable from Gene Summary Paragraphs is similarity information.  Knowing which genes are functionally similar can be very useful when analyzing noisy data that points one toward a lot of possible gene similarities, some real and some spurious.  Of the many gene similarities suggested by the data, the ones that should be attended to most vigorously are the ones that also agree with functional information from Gene Summary Paragraphs (although the other ones shouldn’t be ignored either).

 

Gene Summaries cover only a small percentage of the yeast genome at present.  However, machine learning methods may be useful for extending the knowledge they contain to more of the genome.  For instance, Brown et al (1999) use Support Vector Machines to infer functional categories for a large number of genes, based on given functional categories for a small number of genes, plus microarray expression data for all genes.  Thus, one could proceed by using machine learning tools to extrapolate knowledge from well-studied genes to other genes, and then use this extrapolated knowledge as well as the directly human-encoded knowledge as background information for the inference of regulatory networks.

 

4.   A Mathematical Formulation of the Genetic Regulatory Network Inference Problem

 

In (Wessels et al, 2001), the problem of inferring regulatory networks from microarray data is formulated mathematically in a simple and useful, but overly limited, way.  Taking my cue from their treatment, I’ll introduce a formalism vaguely similar to theirs but far more general, and in my view clearer. 

 

First, some notation:

-  E(t) denotes the vector of expression levels, across all genes under study, measured at time t

-  S denotes the observed time series of such vectors

We may say, e.g., S = {E(t(1)), E(t(2)), …, E(t(N))}, where E(t(i)) is the expression level vector associated with time t(i), and i > j implies t(i) > t(j).  In reality the elements of the series are not always equispaced, i.e. t(i+1) – t(i) need not be constant.

 

Next, let S(j,k) = {E(t(j)), …, E(t(k))} be the subseries of S running from time t(j) to time t(k).  Then the regulatory network inference problem can be cast as follows:

 

Given S,

one must infer F so that F(S(i-n,i)) = E(t(i+1)) for all i with n+1 <= i <= N-1,

where n is the depth of history that the dynamic F is permitted to consult.

 

This is a classic inverse problem, of inferring dynamics from data.  James Crutchfield has written a great deal about this type of problem under the general name of Computational Mechanics (see e.g. Crutchfield and Young, 1990; Crutchfield, 1989).

 

Supposing there is a real dynamic underlying the data, which we may call F*, the goal is to find an F that best approximates F*.
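Since F* itself is unobservable, “best approximates” has to be operationalized against the data.  A minimal formulation, sketched below under assumptions: score a candidate F by its one-step-ahead prediction error over the observed series, with the window depth n and the squared-error metric as free choices.

import numpy as np

def one_step_error(F, E, n):
    # E: list of expression vectors E(t(1)), ..., E(t(N)).
    # F: callable taking the window [E(t(i-n)), ..., E(t(i))] and returning
    # a predicted E(t(i+1)).  Returns the mean squared one-step error.
    errors = []
    for i in range(n, len(E) - 1):
        pred = F(E[i - n : i + 1])
        errors.append(np.mean((pred - E[i + 1]) ** 2))
    return float(np.mean(errors))

# Trivial baseline: predict that expression levels don't change.
persistence = lambda window: window[-1]
E = [np.random.rand(6000) for _ in range(20)]   # shaped like the yeast data
print(one_step_error(persistence, E, n=2))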

 

Wessels et al (2001) describe a number of mathematical criteria for judging algorithms that infer regulatory networks from microarray data.  Cast in the current notation, these include:

 

-  inferential power (defined as the similarity between the inferred F and the true F*)

-  predictive power (defined as how closely, given the inferred F rather than the true F*, one can reconstitute the actual time series S from the initial conditions E(t(0)))

-  robustness (how noise-tolerant the method is)

-  consistency (whether the method avoids inferring too many distinct networks compatible with the given data)

-  stability (whether the predicted expression levels remain finitely bounded)

-  computational cost

 

These are all definitely important criteria.  Predictive power is the most directly measurable of them; a minimal sketch of how one might compute it follows.
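The sketch below, under the same assumptions as the previous one, iterates an inferred F forward from the initial conditions and measures how far the rollout drifts from the observed series; the mean-squared-error metric is again an arbitrary choice.

import numpy as np

def predictive_power(F, E, n):
    # Seed the rollout with the first n+1 observed vectors, then let the
    # inferred dynamic F generate the rest of the series on its own.
    rollout = list(E[: n + 1])
    for i in range(n, len(E) - 1):
        rollout.append(F(rollout[i - n : i + 1]))
    # Drift between the self-generated series and the observed one;
    # smaller drift means greater predictive power.
    drift = [np.mean((r - e) ** 2) for r, e in zip(rollout, E)]
    return float(np.mean(drift))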

 

A key point is that different approaches to the problem make different assumptions about the space from which F is drawn.  Some assume a space of linear functions, some a space of finite Boolean networks, some a space of nonlinear functions of some specific form.

 

5. Approaches to the Inference of Genetic Regulatory Networks

 

Many different approaches have been taken to the problem of inferring genetic regulatory networks from microarray data, none of them particularly successful.  I won’t give a complete review here, but will just review what appear to me to be the highlights of the field.

 

Existing models may be categorized, roughly and not exhaustively, as follows:

 

Continuous

            Pairwise

            Rough Network

            Complex Network

 

Discrete

            Boolean network

            Bayesian

 

Discrete models involve some discretization of the input data as a first step.  Continuous models do not. 

 

5.1  Continuous Models

 

The continuous models are well-reviewed in Wessels et al (2001).

 

Pairwise models seek to find pairs of genes whose expression time series are appropriately correlated.  Arkin, Shen and Ross (1997) did this statistically; Chen et al (1999) took a more innovative computer-science-oriented approach, involving the analysis of peaks in gene expression time series.  The problem with pairwise models is that they ignore the many types of gene interaction that are not binary in nature.

 

What Wessels et al call “rough network models” are models that use linear differential equations to model gene interactions.  These models capture stimulation and inhibition relationships among genes, but they miss interactions that involve intermediate products such as proteins and metabolites.  A good example of this kind of work is (van Someren et al, 2000).

 

Finally, complex network models attempt to model the full complexity of genetic regulation using nonlinear equations of various sorts.  The best example here is (Savageau, 1998).  The problem with this kind of model is the large number of parameters involved, which the data would currently seem to be too scant to determine.

 

Von Dassow et al (2000) is worth reading as an indication of the complexity of the equations and the number of parameters needed to really model a genetic regulatory network (the segment polarity network in insects, in this case).  Their equations are derived in an integrative way from a large number of different data types, some quantitative, some qualitative.  Inferring this kind of equational model from microarray data in its current quantity and quality is, as they note, not possible.

 

 

5.2  Discrete Models

 

Among the first significant efforts in this area was the work of Liang et al (1998) on the REVEAL system.  This algorithm infers a Boolean network from microarray data, but it is computationally very inefficient.  Akutsu et al (1999) devised a more efficient approach to the same problem, and showed that if the in-degree of the networks involved is bounded, then only O(log N) examples are needed to infer a network of size N.  In practical experiments, they found that about 100 examples were needed to infer a network of size 100,000.  However, they tested their algorithm only on simulated data, not real microarray data, so its functionality in practice remains unknown.

 

 

There are certainly other discrete approaches as well.  For instance, Kirillova (2001) has described an innovative cellular automaton model of genetic regulation.  And John Koza (2000) has applied genetic programming to the problem of inferring chemical reactions from time series data (a consequence of the general principle that John Koza has applied genetic programming to every problem on Earth).  Koza has created an appealing representation of chemical reactions in terms of program trees.  But his work so far has been with toy problems, involving unrealistically clean and abundant data; it remains unknown how well the GP methodology would extend to the analysis of realistic microarray data.

 

 

5.3  Discussion of Existing Approaches

 

There are some basic problems confronted in any situation in which a complex underlying system is represented by a relatively scant amount of noisy data.  One is generally caught between two errors:

-  oversimplified models, which ignore most of the structure actually present in the data

-  overly complex models, which have so many parameters that they end up fitting the noise rather than the underlying structure

In the case of financial analysis, the trend in the industry is to stick with oversimplistic linear models (e.g. the Barra models, which decompose market behavior as sums of linear factors).  Nonlinear models of market behavior traditionally have suffered from overfitting and haven’t succeeded, even though everyone knows financial markets are in essence a nonlinear phenomenon.  We have learned through experience that it is possible to get nonlinear financial forecasting right, via a statistical pattern-recognition approach, but only through very careful attention to practicalities as opposed to abstract theory.

 

In the case of inferring genetic regulatory networks, it seems that

-  the simple models (pairwise, linear, Boolean) discard much of the structure actually present in the data, while

-  the complex equational models involve far more parameters than 20-point time series can determine.

This suggests that a statistical pattern recognition approach may be successful.  In this approach, one searches for the patterns that are actually there in the data, neither restricting the search to a very simple class of patterns (linear or Boolean), nor seeking to impose a theoretically-derived pattern on the data (as in a complex equational model of gene regulation, derived from biological theory but disconnected from the actual data).

 

 

6. A Possible Webmind-Based Solution

 

How can we approach the problem of inferring regulatory networks using Webmind AI?  Given the above background, the approach can be described fairly briefly.  The basic idea is to use Webmind as a statistical pattern-recognition tool, i.e. to:

-  represent conjectural regulatory patterns as logical expressions of a special form

-  generate candidate patterns using CRN crossover/mutation, inference, and perhaps specialized heuristics

-  evaluate candidate patterns against the microarray data, and against available background knowledge, using uncertain inference

The discussion in this section is conceptually completely in line with the section on genetic regulatory network inference in my earlier document “Webmind Based Data Mining” (Goertzel, 2001), though the notation and terminology there are a little different, and things are worked out in more detail here.

 

Proceeding more systematically, the first step, as with all other approaches, is to make a particular assumption about the space from which the genetic dynamic F is drawn.  We will assume that candidate dynamics are defined in a certain way in terms of sets of implications in uncertain logic (uncertain logic meaning either NARS or PTL).

 

Note that this assumption lies somewhere between the continuous and discrete models pursued in the literature.  We are creating continuous functions by adding uncertainty to logical networks.  Unlike in the Bayesian network approach, we are not relying upon human intuition to construct the patterns being evaluated.  We are identifying a large space of possible patterns and then using AI to search this space and probability theory to evaluate pattern quality.

 

The elementary relationships involved in these logical expressions are of the form

 

expression(g, t)

 

where g is a gene, t is a point in time, and “expression” is a special ConceptNode.   This is the form in which the empirical data from the microarray data files is to be fed into Webmind.
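A minimal sketch of what feeding in this data might look like: each row of a tab-separated expression file becomes a set of expression(g, t) relations with associated truth values.  The class and function names here are stand-ins for whatever the actual Webmind node-and-link API provides, and the file format is assumed, not prescribed.

from dataclasses import dataclass

@dataclass(frozen=True)
class ExpressionRelation:
    gene: str      # gene identifier, e.g. a yeast ORF name like "YAL001C"
    time: float    # time point of the measurement
    truth: float   # normalized expression level in [0, 1]

def load_expression(path, time_points):
    # Parse a tab-separated file: gene name, then one level per time point.
    relations = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            gene, levels = fields[0], [float(x) for x in fields[1:]]
            for t, level in zip(time_points, levels):
                relations.append(ExpressionRelation(gene, t, level))
    return relations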

 

These elementary gene expression relationships are joined using AND, OR and NOT to form more complex logical relationships.  Both the elementary gene expression relationships and the compound logical relationships may have truth values in [0,1], which is what differentiates the present approach from the Boolean network approach. 
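One simple way to realize AND, OR and NOT over [0,1] truth values is the min/max/complement family sketched below.  PTL’s actual truth-value formulas are probabilistic and considerably more refined, so treat this as the crudest workable stand-in rather than the real thing.

def f_and(*vals):   # conjunction: as true as its least true conjunct
    return min(vals)

def f_or(*vals):    # disjunction: as true as its most true disjunct
    return max(vals)

def f_not(v):       # negation: complement in [0, 1]
    return 1.0 - v

# e.g. "g1 is expressed AND g2 is not expressed":
print(f_and(0.8, f_not(0.3)))   # -> 0.7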

 

Appropriate normalization of empirical expression values into [0,1] is a nontrivial task, which may be carried out using any of various mathematical mappings; this is basically the same as the link strength normalization problem we have encountered before (see Digital Intuition).  What we discovered there is that the optimal solution depends on the probability distribution of the total set of values being mapped.
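Because the optimal mapping depends on the distribution of the values being mapped, one distribution-free option is to send each value to its empirical quantile, as sketched below.  This is one candidate among many, not the normalization Digital Intuition prescribes.

import numpy as np

def rank_normalize(values):
    # Map raw expression values to [0, 1] by their empirical quantile,
    # so the output distribution is uniform regardless of the input's.
    values = np.asarray(values, dtype=float)
    ranks = values.argsort().argsort()
    return ranks / max(len(values) - 1, 1)

print(rank_normalize([3.1, 0.2, 7.9, 0.9]))   # -> [0.667, 0.0, 1.0, 0.333]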

 

An important function in microarray data analysis is clustering.  A review of some clustering methods used on this data is given in Brazma and Vilo (2000).  Basically they’re the standard ones: K-means, hierarchical clustering, and so forth.  Clustering can be valuable here because there are sets of genes whose expressions vary very closely together; these genes can be assumed to be coregulated (i.e. regulated by the same other factors). 
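As a minimal sketch of the clustering step: z-scoring each gene's profile and then running K-means makes Euclidean distance behave like (one minus) the correlation between profiles, which is the natural similarity measure for coregulation.  The choice of K-means over hierarchical methods, and of k itself, is arbitrary here.

import numpy as np
from sklearn.cluster import KMeans

def cluster_genes(expression, k):
    # expression: (n_genes, n_times) array of expression profiles.
    z = (expression - expression.mean(axis=1, keepdims=True)) \
        / (expression.std(axis=1, keepdims=True) + 1e-9)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(z)

# Genes sharing a cluster label become candidate coregulated sets.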

 

In a Webmind context, we may define a GeneNode as a ConceptNode that represents a gene.  Clustering then produces ConceptNodes embodying categories of genes, i.e. ConceptNodes with MemberLinks to GeneNodes.   

 

We may define a genetic node as any node that either

-  is a GeneNode, or

-  is a ConceptNode grouping GeneNodes (e.g. a cluster-derived category with MemberLinks to GeneNodes).

An atomic genetic expression relationship is then any relation of the form

 

            expression(N, t)

 

where N is a genetic node and t is a time.  A compound genetic expression relationship is a CompoundRelationNode whose atoms are genetic expression relationships.

 

 

What we are looking for, to solve the regulatory network inference problem, are patterns represented by high-truth-value ImplicationLinks of the form

 

Atomic or compound genetic expression relationship pertinent only to times <= t

            =>

expression(g,s) AND after(s,t,I)

(*)

 

 

for some gene g and some time interval I, where after(s,t,I) asserts that time s follows time t by a lag lying in the interval I.

 

In order to evaluate the truth values of these expressions most effectively, the system has to have some knowledge of number.  For instance, it must recognize that expression(g,.6) is very similar (in the logical sense) to expression(g,.62).   For initial work, it may be adequate to simply code this as an explicit grounding for the expression(,) relationship.
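A minimal grounding of this numeric similarity, under assumptions: let the logical similarity of two expression values decay exponentially with their numerical distance.  The kernel shape and the width parameter are arbitrary choices to be tuned.

import math

def expression_similarity(a, b, width=0.1):
    # Similarity in [0, 1] of two expression values: 1.0 when equal,
    # falling off exponentially with their distance.
    return math.exp(-abs(a - b) / width)

print(expression_similarity(0.60, 0.62))   # -> ~0.82: "very similar"
print(expression_similarity(0.60, 0.95))   # -> ~0.03: clearly different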

 

What does all this add up to?  In the above-introduced notation, given an S, how do we find the corresponding F – hopefully approximating the real genetic dynamic F*?  In the Webmind model, the space of regulatory mappings is isomorphic to the set of sets of implications of the form (*).

 

Specifically, this isomorphism is defined as follows.  Suppose one has done some Webmind analysis and come up with a set of such implications; call the set I.  Then, how is a genetic dynamic F defined in terms of I?  Suppose one is given an argument S(i-n,i).  Then, all implications in the set I are evaluated at the times t(i-n), …, t(i).  Consequences for the expression of genes at time t(i+1) are then gathered, and used to form a gene expression level vector E(t(i+1)).  This mapping from S(i-n,i) to E(t(i+1)) is F.
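A minimal sketch of this construction, under assumptions: each implication is represented as a premise evaluable on the window S(i-n,i), a target gene, a predicted level, and a truth value, and predictions for the same gene are combined by a truth-weighted average.  That combination rule is one plausible choice among several, not a Webmind-specified one.

from collections import defaultdict

def make_F(implications, genes):
    # implications: list of (premise, target_gene, predicted_level, truth),
    # where premise maps a window (a list of {gene: level} dicts, oldest
    # first) to a degree of satisfaction in [0, 1].
    def F(window):
        votes = defaultdict(list)
        for premise, gene, level, truth in implications:
            weight = premise(window) * truth
            if weight > 0.0:
                votes[gene].append((level, weight))
        nxt = {}
        for g in genes:
            if votes[g]:
                total = sum(w for _, w in votes[g])
                nxt[g] = sum(l * w for l, w in votes[g]) / total
            else:
                nxt[g] = window[-1][g]   # no applicable rule: persist
        return nxt
    return F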

 

I have not said how the appropriate implication relationships are conceived, nor how the appropriate uncertain logical expressions are constructed.  That’s because this is Webmind cognition.  By casting the problem as I’ve done, I’ve rendered the genetic regulatory network inference problem as a special case of the problem of general cognition, which can be solved by our general-purpose cognition algorithms.  In this case the key ingredients would seem to be:

-  higher-order inference (HOI) to evaluate candidate implications against the data, incorporating background knowledge where available

-  CRN crossover/mutation to create new candidate relationships

-  association-finding and clustering to suggest which genetic nodes are worth relating

In addition to these general mechanisms, however, some specialized algorithms for CRN (CompoundRelationNode) formation may be valuable.  It would be interesting, for example, to implement the algorithm of Akutsu et al (1999) to guess Boolean networks governing gene regulation.  Their algorithm does not produce probabilistic results, but it could be used to produce “CRN skeletons” that are then probabilistically evaluated.  The idea is that it might provide better initial guesses than random CRN formation.

 

Note that this is not free-ranging, purely self-organizing cognition; it is goal-directed cognition, where the goal is to create ImplicationLinks of a certain form.  Implementationally, this means that the process should be driven by a hand-coded cognitive schema, linked to an application-specific GoalNode.  This cognitive schema effectively performs fitness evaluation.  It looks at the CompoundRelationNodes in the system, and tries to build ImplicationLinks of the appropriate form to them.  It gives importance to the nodes and links involved if it succeeds.
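Stripped of the Webmind machinery, the control loop such a schema drives looks roughly like the evolutionary sketch below.  Here score, mutate and crossover stand for inference-based truth-value evaluation, CRN mutation and CRN crossover respectively; all three are left abstract, and the population parameters are arbitrary.

import random

def evolve_patterns(seed_patterns, score, mutate, crossover,
                    generations=100, population=200, elite=20):
    # Goal-directed search: keep the conjectural regulatory patterns whose
    # ImplicationLinks best fit the data, and breed new candidates from them.
    pool = list(seed_patterns)            # needs at least two seeds
    for _ in range(generations):
        ranked = sorted(pool, key=score, reverse=True)[:elite]
        pool = list(ranked)
        while len(pool) < population:
            a, b = random.sample(ranked, 2)
            pool.append(mutate(crossover(a, b)))
    return sorted(pool, key=score, reverse=True)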

 

The above describes a first approach to solving the genetic regulatory network inference problem using Webmind, and because of this it refers to numerical microarray data only.  But the next step is clear: to incorporate non-numerical annotations regarding the functional properties of various genes.  Once this information is in the form of Webmind nodes and links, processing like inference and association-finding can use it automatically, requiring no significant modifications to the above-described application logic.  For instance, non-numerical data can give us ConceptNodes embodying a priori gene categories, which can be combined with the ConceptNodes inferred from statistical clustering.  It can also tell us that in other organisms a certain gene tends to regulate a certain other gene, which will bias the system to look for a regulatory relationship between those two genes.  And so forth.

 

Of course, what I’ve described here is actually a general methodology for Webmind-based time series prediction, not specifically restricted to gene expression time series.  The heuristics for forming CRNs will be specific to the application, however, as will be the numerous parameter settings.

 

Finally, it should be noted that the main use of this kind of pattern analysis tool in practice is to guide biologists in designing new experiments.  Once one has guessed what the regulatory network is, using noisy microarray data, one’s hypotheses can be tested using more accurate methods.  Thus it is important to have a comprehensible output of the model derived by the program.  Fortunately, uncertain logical expressions resemble Boolean expressions, which are easily comprehended, so that biological interpretability of the program’s output shouldn’t be a problem.  A graphical viewer of genetically relevant CompoundRelationNodes as trees would seem to be an appropriate tool around which to build an interface for viewing the system’s results.

 

 

 

References

 

Tatsuya Akutsu, Satoru Miyano, Satoru Kuhara (1999).  Identification Of Genetic Networks From A Small Number Of Gene Expression Patterns Under The Boolean Network Model, http://citeseer.nj.nec.com/akutsu99identification.html

Arkin, Shen and Ross (1997).  A test case of correlation metric construction of a reaction pathway from measurements.  Science 277

 

Catherine A. Ball, Kara Dolinski, Selina S. Dwight, Midori A. Harris, Laurie Issel-Tarver, Andrew Kasarskis, Charles R. Scafe, Gavin Sherlock, Gail Binkley, Heng Jin, Mira Kaloper, Sidney D. Orr, Mark Schroeder, Shuai Weng, Yan Zhu, David Botstein and J. Michael Cherry (2000).  Integrating functional genomic information into the Saccharomyces Genome Database.  Nucleic Acids Research.  Online at http://nar.oupjournals.org/cgi/content/full/28/1/77

 

Brazma, Alvis and Jaak Vilo (2000).  Gene Expression Data Analysis.  FEBS Letters 480, http://www.elsevier.nl/febs/118/19/21/00023893.pdf

 

 

Brown, Michael, William Grundy, David Lin, Nello Cristianini, Charles Sugnet, Manuel Ares Jr., David Haussler (1999).  Support Vector Machine classification of microarray gene expression data.  http://www.cse.ucsc.edu/research/compbio/genex/genexTR2html/genex.html

 

 

Chen, Ting, Vladimir Filkov and Steven S. Skiena (1999).  Identifying gene regulatory networks from experimental data.  http://citeseer.nj.nec.com/chen99identifying.html

 

 

Cho, R., et al. (1998).  A genome-wide transcriptional analysis of the mitotic cell cycle.  Mol. Cell. 2:65-73, raw data online at http://cmgm.stanford.edu/pbrown/sporulation/additional/

 

Crutchfield, J. P. (1989).  Inferring the Dynamic, Quantifying Physical Complexity, in Measures of Complexity and Chaos, A. M. Albano, N. B. Abraham, P. E. Rapp, and A. Passamante, editors, Plenum Press, New York, p. 327, http://www.santafe.edu/projects/CompMech/papers/CompMechCommun.html

 

Crutchfield, J.P. and K. Young (1990).  Computation at the Onset of Chaos, in Entropy, Complexity, and Physics of Information, W. Zurek, editor, SFI Studies in the Sciences of Complexity, VIII, Addison-Wesley, Reading, Massachusetts, 223-269, http://www.santafe.edu/projects/CompMech/papers/CompMechCommun.html

 

De Risi, Iyer and Brown (1997).  Exploring the metabolic and genetic control of gene expression on a genomic scale.  Science 278

 

Goertzel, Ben (2001).  Webmind-Based Data Mining.  Webmind Diehards LLC Technical report.

 

Goertzel, Ben (2001).  Digital Intuition: A Conceptual Overview of the Webmind AI Engine.  Webmind Diehards LLC technical report, based on some prior Webmind Inc. material.

 

Hartemink, Alexander, David K. Gifford, Tommi S. Jaakkola, Richard A. Young (2001).  Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks.  Pacific Symposium on Biocomputing.  http://www.psrg.lcs.mit.edu/publications/Papers/psbabs.html

 

 

Kirillova, Olga (2001).  Cellular Automata Model for Gene Networks.  http://www.csa.ru/Inst/gorb_dep/inbios/genet/Netmodel/bool/camod.htm

 

Koza, John, William Mydlowec, Guido Lanza, Jessen Yu, Martin A. Keane (2000).  Reverse Engineering of Metabolic Pathways from observed data using genetic programming.  http://smi-web.stanford.edu/pubs/SMI_Abstracts/SMI-2000-0851.html

 

Liang, Fuhrman and Somogyi (1998).  REVEAL: a general reverse engineering algorithm for inference of genetic network architectures.  Pac. Symp. on Biocomputing '98, 3:18-29

 

Maki, Y., D. Tominaga, M. Okamoto, S. Watanabe, Y. Eguchi (2001).  Development of a system for the inference of large-scale genetic networks.  http://citeseer.nj.nec.com/maki01development.html

 

 

Raychaudhuri, S., J.M. Stuart and R.B. Altman (2000).  Principal components analysis to summarize microarray experiments: application to sporulation time series.  Pacific Symposium on Biocomputing 2000, 455-66

Savageau, M.A. (1998).  Rules for the evolution of gene circuitry.  Pacific Symposium on Biocomputing, 3:54-65

 

Sherman, Fred (1998).  An Introduction to the Genetics and Molecular Biology of the Yeast Saccharomyces cerevisiae, http://dbb.urmc.rochester.edu/labs/Sherman_f/yeast/Index.html

 

van Someren, E.P., L.F.A. Wessels and M.J.T. Reinders (2000).  Linear Modeling of Genetic Networks from Experimental Data.  In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB00), p. 355-366, La Jolla, California, August 2000, see http://www.genlab.tudelft.nl/publications/


Vilo, Jaak, Alvis Brazma, Inge Jonassen, Alan Robinson, and Esko Ukkonen (2000).  Mining for Putative Regulatory Elements in the Yeast Genome Using Gene Expression Data, ISMB-2000, August 2000, http://industry.ebi.ac.uk/~vilo/Publications/ISMB2000.pdf

 

Von Dassow, George, Eli Meir, Edwin Munro and Garrett Odell (2000).  The Segment Polarity Network is a Robust Developmental Module.  Nature 406, pp. 188-192, http://www.euchromatin.org/Biomath01.htm

 

 

Wen, X., S. Fuhrman, G.S. Michaels, D.B. Carr, S. Smith, J. Barker and R. Somogyi (1998).  Large-Scale Temporal Gene Expression Mapping of Central Nervous System Development.  PNAS 95:334-339

 

Wessels, L.F.A., E.P. van Someren and M.J.T. Reinders (2001).  Genetic network models: a comparative study.  http://citeseer.nj.nec.com/vansomeren01genetic.html

 

Yuh, Bolouri and Davidson (1998).  Genomic Cis-regulatory logic: experimental and computational analysis of a sea urchin gene.  Science 279