Inferring Genetic Regulatory Networks with the Webmind AI Engine
Ben Goertzel
Webmind Diehards, LLC
May 31, 2001
This document is written for Webmind insiders who know a lot about Webmind AI and not much about genetics and microarrays. It reviews the data analysis problem of “genetic regulatory network inference” – the mathematical structure of the problem, the pragmatic nature of the data, and various prior approaches – and then outlines a Webmind-AI-Engine based approach, giving reasons why it is believed this approach will be superior.
1. Introduction
Now that the human genome project is essentially complete, the bioinformatics community is looking ahead to the next challenge: figuring out how the genetic code actually does anything. How do these sequences of nucleotides decode themselves, making use of other molecules in their environment to create organisms like us?
We’re a long way from fully understanding this, but thanks to recent revolutionary advances in experimental apparatus design, we now have data that lets us begin to approach the problem. DNA chips, microarrays and related technologies allow scientists to collect time series data regarding the expression of genes during cell development.
For the yeast organism, for example, there is a particularly good dataset available online: we have information about gene expression at 20 time points during cell development, for each of the roughly 6000 genes that yeast possesses. At each of the 20 time points, we can see which genes are expressed and which are not. From this data, it should be possible to infer something of the genetic regulatory network of yeast. Of course, yeast in itself is not generally that interesting to humans (except bakers), but as far as cell development goes, yeast and humans are about 80% the same. There are other datasets available as well, but the yeast dataset is the most thorough one available online for free.
A number of researchers, including well-known ones at excellent biotech labs, have tried to solve this particular data analysis problem. But as I’ll review here, the results have not been tremendously successful. Here I will discuss the likely reasons for their lack of success, and propose a method that may well overcome the obstacles they encountered.
There are many parallels between this data analysis problem and financial market prediction, a problem with which I have ample experience. For instance:
- The data is noisy (here in the literal sense: there is ample experimental error in measuring the underlying biophysical system)
- The data is distressingly scant (2000 time series points would be a lot nicer than 20)
- Broadly speaking, current approaches fall into two categories:
  - extremely simplistic approaches that ignore most of the structure in the data
  - overly complex approaches with too many parameters, which are less successful than the simplistic approaches
We have seen in the case of market prediction that, in spite of these obstacles, a fairly straightforward adaptive statistical pattern recognition approach (the so-called Extended Simplex Predictor) works wonders. The regulatory network inference problem is by no means mathematically identical to the market prediction problem – in market prediction the time series are hundreds of points long, and one is interested in predicting the course of a single series based on other series, rather than in inferring an underlying interactional network that is known to exist. But in an abstract way it's in the same "problem species," so the general lessons learned in the one case may be applicable to the other.
In outlining my proposed solution to the problem, I will assume familiarity with the Webmind AI Engine, at the level of the April 2001 version of Digital Intuition (Goertzel, 2001). Essentially my proposal is to:
- Define a special kind of logical expression to represent a conjectural regulatory pattern
- Use inference (HOI) to assess whether a conjectural regulatory pattern fits the data or not, incorporating biological background knowledge where appropriate
- Use CRN crossover/mutation, as well as inference and perhaps specialized heuristics, to create conjectural regulatory patterns
For the inference component, my intuition is that for this kind of problem, PTL will outperform NARS, because node probabilities are supplied directly by the data.
A conjectural regulatory pattern represents a relationship between genes. A key point here is that a gene is not exclusively characterized by its expression profile as determined by microarray experiments. A gene may also have other characteristics associated with it, as mined from annotations in online datasets. For instance, many genes will be annotated with information as to their functions, as determined by various non-microarray experiments. This annotation data serves two functions:
- The system may use them in the process of conceiving conjectural regulatory patterns
- Some of them may bias the system toward assessing certain conjectural regulatory patterns as more plausible than others
The ability to incorporate diverse data in this way is a major strength of the Webmind approach to this problem. However, getting the use of the annotation data right will require some biological intuition.
2. About Microarrays
Chemists have long had methods for carrying out many simultaneous chemical reactions. Most simply, trays can be created with 96 or 384 wells, each containing a different chemical and a unique bar code.
The last few years, however, have seen the development of methodologies that push even further in this direction – making possible experiments that scientists only a few years ago would have called impossible. The application of these new methodologies to the analysis of gene and protein data has led to a new area of research that may be called massively parallel genomics and proteomics.
Most of the work done so far has been in genomics; the extension to proteomic analysis is more recent. Here, for concreteness, I’ll discuss microarrays as used for genomic analysis; the proteomics case is basically the same from the point of view of data analysis, though vastly more difficult from the point of view of experimental apparatus biomechanics. (If you are in need of a review of very basic genetics, there are many resources available online. For instance, a tutorial at the level of first year college biology may be found at http://vector.cshl.org/dnaftb/26/concept/index.html.)
There are several types of microarrays used in genomics, but they all embody a common methodology. Single-stranded DNA/RNA molecules are anchored by one end to some kind of surface (a chip or a plate, depending on the type of apparatus). The surface is then placed in a solution, and the molecules affixed to the chip will seek to hybridize with complementary strands ("target molecules") floating in the solution. (Hybridization refers to the formation of base pairs between complementary regions of two strands of DNA that were not originally paired.)
Affymetrix's technology, pioneered by Dr. Stephen Fodor, involves making DNA chips in a manner similar to the manufacture of semiconductor chips. A process known as "photolithography" is used to create a huge number of molecules directly on a silicon wafer. A single chip measuring 1.28 cm x 1.28 cm can hold more than 400,000 "probe" molecules. The procedure of gene chip manufacture has been fully automated for a while now, and Affymetrix manufactures 5,000-10,000 DNA chips per month.
Affymetrix DNA chips have a significant limitation in terms of the size of the molecules that can be affixed to them. So far they're normally used with DNA/RNA segments of length 25 or less. Also, they are very expensive. It currently costs about $500,000 to fabricate the light masks for a new array design, so their technology is most appropriate when the same chip needs to be used again and again and again. The main example of this kind of use case is disease diagnosis.
On the other hand, spotted microarrays, first developed by Pat Brown at Stanford, are ordinary microscope slides on which robot arms lay down rows of tiny drops from racks of previously prepared DNA/RNA samples. At present this technology can lay down tens of thousands of probe molecules, at least an order of magnitude short of what Affymetrix can do. The advantage of this approach is that any given DNA/RNA probe can be hundreds of bases long, and can, in principle, be made from any DNA/RNA sample.
There are other approaches as well. For instance, Agilent Technologies, a spin-off from HP, is manufacturing array makers using ink-jet printer technology. Their approach is interesting in that it promises to make practical the synthesis of a single instance of a given array design. Lynx Corporation is pursuing a somewhat Affymetrix-like approach, but circumventing Affymetrix's patents by using addressable beads instead of a silicon wafer. And so forth.
So let's suppose that, one way or another, we have a surface with a number of DNA/RNA molecules attached to it. How do we do chemical reactions and measure their results?
First, the target molecules are fluorescently labeled, so that the spots on the chip/array where hybridization occurs can be identified. The strength of the fluorescence emanating from a given region of the surface is a rough indicator of the amount of target substance that bound to the molecule affixed to that region. In practical terms, what happens is that an image file is created: a photograph of some sort of the pattern of fluorescence emanating from the microarray itself. Typically the image file is then "gridded", i.e. mapped into a pixel array with a pixel corresponding to each probe molecule. Then there is a bit of black art involved in computing the hybridization level for a spot, involving various normalization functions that seem to have more basis in trial-and-error than in fundamentals.
This data is very noisy, however. To get more reliable results, researchers generally work with a slightly more complex procedure. First, they prepare two related samples, each of which is colored with a different fluorescent substance (usually, one green, one red). They then compare the relative amounts of expressed DNA/RNA in the two samples. The ratio of green to red at a given location is a very meaningful number. Using this ratio is a way of normalizing out various kinds of experiment-specific "noise", assuming that these noise factors will be roughly constant across the two samples.
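For concreteness, here is a minimal Python sketch of the kind of reduction involved: computing median-centered log-ratios from two-channel intensities. The function name, the pseudocount, and the particular centering rule are illustrative choices of mine, not a description of any specific lab's pipeline.

    import numpy as np

    def log_ratios(red, green, pseudocount=1.0):
        """Reduce two-channel spot intensities to log2(red/green) ratios.

        The pseudocount guards against zero intensities.  Real pipelines
        apply further, apparatus-specific normalization on top of this."""
        red = np.asarray(red, dtype=float) + pseudocount
        green = np.asarray(green, dtype=float) + pseudocount
        ratios = np.log2(red / green)
        # Center on the median, under the common (but not universal)
        # assumption that most genes are unchanged between the two samples.
        return ratios - np.median(ratios)

    # Example: raw intensities for five spots
    print(log_ratios([520, 80, 300, 45, 900], [500, 320, 310, 40, 150]))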
But even this ratio data is still highly noise-ridden, for a number of reasons beyond the usual risk of experimental error or manufacturing defects in the experimental apparatus. For one thing, there are many different factors influencing the strength of the bond formed between two single-stranded DNA/RNA molecules, including but not limited to:

- the length of the bonded molecules
- the base composition of the molecules (A/T and C/G bond with different energies)
- the number and location of base mismatches (location being important since, for example, end mismatches are less likely to cause hybridization to fail)
Frequently, false positive hybridization events will occur, due to the ability of DNA to bind to sequences that are roughly complementary but not an exact match. This can be controlled to some extent by the application of heat, which breaks bonds between molecules – getting the temperature just right will break false positive bonds and not true positive ones. Other laboratory conditions besides temperature can have similar effects. Another problem is that the "probe molecules" affixed to the surface may fold up and self-hybridize, thus rendering them relatively inaccessible for hybridization with the target.
All these issues mean that a single data point in a large microarray data set cannot be taken all that seriously. The data as a whole is extremely valuable and informative, but there are a lot of things that can go wrong and lead to spurious data points. This means that data analysis methods, to be successfully applied to microarray data, have got to be extremely robust with respect to noise.
What's the use of these high volumes of noisy data? There are many different applications, actually. Microarrays can be used for sequencing variants of a known genome, for detecting single nucleotide polymorphisms (SNPs), or for identifying a specific strain of virus (e.g. the Affymetrix HIV-1 array). They can be used to measure the differences in gene expression between normal cells and tumor cells, which helps determine which genes may cause or cure cancer, or identify which treatment a specific tumor should respond to best. They can measure differences in gene expression between different tissue types, to determine what makes one cell type different from another.

The application of relevance to the present discussion, however, is the identification of genes involved in cell development. For example, a classic 1998 paper by Cho et al gives a series of measurements of nearly the complete yeast genome, taken at various stages during the cell division cycle.
3. Available Data
While the human genome is the one of most interest to us humans, the prize for the highest volume of online data goes to the yeast genome. There are a couple of reasons for this. First, yeast study is of less immediate economic value, so information on the yeast genome is not so closely guarded. Second, yeastkind is simpler than humankind: there are fewer genes, and so it has been easier to do comprehensive studies.
Of course, there are many aspects of human genetics and epigenetic development that cannot be approached by studying yeast. However, there are many simple and unresolved problems that are common to both humans and yeast. It seems reasonable to believe that, if one can't crack a given problem in regard to yeast, one won't be able to crack it in regard to humans. Thus a reasonable way to proceed is to demonstrate one's methods on yeast datasets, and then, once they've been proven in this simple case, move on to apply them to the analysis of human gene expression data.
A lot of information about yeast genetics and epigenetics, and about microarray data analysis in general, is freely available online (see the references at the end of this document for some starting points).
Cho et al's paper giving time series data for gene expression during the yeast development cycle (derived using an Affymetrix DNA chip) is available online on the Science website, to subscribers. The abstract of the paper is as follows (see http://cmgm.stanford.edu/pbrown/sporulation/abstract.html):
Diploid cells of budding yeast produce haploid cells through the developmental program of sporulation, which consists of meiosis and spore morphogenesis. DNA microarrays containing nearly every yeast gene were used to assay changes in gene expression during sporulation. At least seven classes with distinct temporal patterns of induction were observed. The transcription factor Ndt80 appears to be important for induction of a large group of genes at the end of meiotic prophase. Consensus sequences known or proposed to be responsible for temporal regulation could be identified solely from analysis of sequences of coordinately expressed genes. The temporal expression pattern provided clues to potential functions of hundreds of previously uncharacterized genes, some of which have vertebrate homologues that may function during gametogenesis.
The raw data from the experiments discussed in this paper are available online in tab-separated format at http://cmgm.stanford.edu/pbrown/sporulation/additional/spospread.txt
Chen et al (1999) include in their (online) paper an informative discussion of how to filter this data for easy computational analysis. For instance, they point out that for many genes in the data set, either the expression level never rises appreciably above the noise floor at any time point, or it varies very little over the course of the experiment.
These genes may sensibly be omitted from consideration as “noise”, from the point of view of regulatory network inference. They give the specific thresholds that they used in order to choose the set of relevant genes from the full data set, by these criteria.
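A sketch of this kind of filtering is given below in Python. The threshold values and the array layout (genes as rows, time points as columns) are placeholders of mine, not the actual values from Chen et al's paper.

    import numpy as np

    def filter_genes(expr, min_peak=30.0, min_fold_change=2.0):
        """Drop genes that look like pure noise: those that are never
        strongly expressed, or whose expression hardly varies.
        expr is a (genes x timepoints) array of expression levels."""
        peak = expr.max(axis=1)
        trough = np.maximum(expr.min(axis=1), 1e-9)
        keep = (peak >= min_peak) & (peak / trough >= min_fold_change)
        return keep

    # Toy data: 4 genes, 5 time points
    expr = np.array([[1, 2, 1, 2, 1],         # low everywhere -> drop
                     [50, 55, 52, 51, 54],    # high but flat -> drop
                     [10, 40, 90, 40, 10],    # strong dynamics -> keep
                     [35, 70, 140, 70, 35]])  # strong dynamics -> keep
    print(filter_genes(expr))                 # [False False  True  True]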
There may be more recent and more thorough data than the Cho et al data in the public domain, but I have not located any. This data set has been well studied during the last year, that’s for sure.
Wessels et al (2001), in their systematic comparative study of methods for the inference of regulatory networks, created a number of synthetic data sets which are much smaller than the actual yeast data set. I haven’t found their synthetic data sets online, but they may be available, and if not I’m sure the authors would supply us with them. They do tell how they made these datasets, in their paper. It is probably worth experimenting with these toy datasets before tackling yeast.
Next, there is non-numerical data that may also be valuable for the analysis of gene expression. When biologists interpret time series like the Cho et al data, they use a great deal of background knowledge about gene function. On the other hand, automated data analysis methods tend to go purely by the numbers. The ability to integrate numerical and non-numerical information about genes and gene expression can potentially add a great deal to the analysis of microarray data.
Fortunately, a strong effort is underway to standardize the expression of non-numerical genetic data. One part of this is the Gene Ontology Project, described in detail at http://www.geneontology.org/. To quote that site:
The Gene Ontology (GO) project was established to provide a common language to describe aspects of a gene product's biology. The use of a consistent vocabulary allows genes from different species to be compared based on their GO annotations. For each of three categories of biological information – molecular function, biological process, and cellular component – a set of terms has been selected and organized. Each set of terms uses a controlled vocabulary, and parent-child relationships between terms are defined. This combination of a controlled vocabulary with defined relationships between items is referred to as an ontology. Within an ontology, a child may be a "part of" or an example ("instance") of its parent. There are three independently organized controlled vocabularies, or gene ontologies, one for molecular function, one for biological process, and one for cellular component. Many-to-many parent-child relationships are allowed in the ontologies. A gene may be annotated to any level in an ontology, and to more than one item within an ontology. The Gene Ontology project is a collaboration between three model organism databases, FlyBase (Drosophila), Saccharomyces Genome Database (SGD) and Mouse Genome Informatics (MGI).
The standard GO notation is used in the creation of Gene Summary Paragraphs, which now exist online for 300+ yeast genes, and which are described as follows:
A Gene Summary Paragraph is a summary of published biological information for a gene and its product which is designed to familiarize both yeast and non-yeast researchers with the general facts and important subtleties regarding a locus. SGD curators compose Gene Summary Paragraphs using natural language and a controlled vocabulary based on the Gene Ontology (GO). Gene Summary Paragraphs contain references and links to further information, and highlight connections between genes from yeast and other species wherever possible.
Of course, tuning an NL system to automatically parse gene summary paragraphs would be a very good thing, and a not terribly difficult task. On the other hand, converting Gene Summary Paragraphs into a formal language by hand would be even easier, since currently for yeast there are only 300+ of them, and they use a controlled vocabulary.
A very simple example of useful information derivable from Gene Summary Paragraphs is similarity information. Knowing which genes are functionally similar can be very useful when analyzing noisy data that points one toward a lot of possible gene similarities, some real and some spurious. Of the many gene similarities suggested by the data, the ones that should be attended to most vigorously are the ones that also agree with functional information from Gene Summary Paragraphs (although the other ones shouldn't be ignored either).
Gene Summaries cover only a small percentage of the yeast genome at present. However, machine learning methods may be useful for extending the knowledge they contain to more of the genome. For instance, Brown et al (1999) use Support Vector Machines to infer functional categories for a large number of genes, based on given functional categories for a small number of genes, plus microarray expression data for all genes. Thus, one could proceed by using machine learning tools to extrapolate knowledge from well-studied genes to other genes, and then use this extrapolated knowledge as well as the directly human-encoded knowledge as background information for the inference of regulatory networks.
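As a rough illustration of the Brown et al idea (using a modern SVM library as a stand-in for their implementation, and entirely fabricated toy data), one might proceed as follows:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    profiles = rng.normal(size=(100, 20))   # 100 genes x 20 time points
    # Stand-in functional category, correlated with early expression
    labels = (profiles[:, :5].mean(axis=1) > 0).astype(int)

    annotated = slice(0, 30)      # genes with curated annotations
    unannotated = slice(30, 100)  # genes whose function we extrapolate

    clf = SVC(kernel="rbf").fit(profiles[annotated], labels[annotated])
    predicted = clf.predict(profiles[unannotated])
    print("predicted category of first 10 unannotated genes:", predicted[:10])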
4. A Mathematical Formulation of the Genetic Regulatory Network Inference Problem
In (Wessels et al, 2001), the problem of inferring regulatory networks from microarray data is formulated mathematically in a simple and useful, but overly limited, way. Taking my cue from their treatment, I’ll introduce a formalism vaguely similar to theirs but far more general, and in my view clearer.
First, some notation. We may write the observed time series as S = {E(t(1)), E(t(2)), …, E(t(N))}, where E(t(i)) is the vector of expression levels of all genes at time t(i), and i > j ⇒ t(i) > t(j). In reality the elements of the series are not always equispaced, i.e. t(i+1) - t(i) may not be constant.

Next, let S(j,k) = {E(t(j)), …, E(t(k))} denote a subseries of S. Then the regulatory network inference problem can be cast as follows: given S, one must infer a mapping F such that, for some fixed memory depth n,

    F(S(i-n, i)) = E(t(i+1))    for all i with n < i < N
This is a classic inverse problem, of inferring dynamics from data. James Crutchfield has written a great deal about this type of problem under the general name of Computational Mechanics (see e.g. Crutchfield and Young, 1990; Crutchfield, 1989).
Supposing that there is a real F underlying the data, which we may call F*, the goal is then to find an F that best approximates F*.
Wessels et al (2001) describe a number of mathematical criteria for judging algorithms that infer regulatory networks from microarray data. Cast in the current notation, these include:

- inferential power (the similarity between F* and F)
- predictive power (how closely, given the inferred F and not F*, one can reconstitute the actual time series S from the initial conditions E(t(0)))
- robustness (how noise-tolerant the method is)
- consistency (does the method infer too many possible networks underlying the given data?)
- stability (are the predicted expression levels finitely bounded?)
- computational cost

These are all definitely important criteria.

A key point is that different approaches to the problem make different assumptions about what the space F is. Some assume it's a space of linear functions, some that it's a space of finite Boolean networks, some that it consists of nonlinear functions of some specific form.
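To make the predictive-power criterion concrete, here is a minimal Python sketch, in the formulation above, of scoring a candidate dynamic F by iterating it forward from the initial window and measuring reconstruction error. The function name and the mean-squared-error scoring rule are illustrative choices of mine, not Wessels et al's exact definitions.

    import numpy as np

    def predictive_power(F, series, n):
        """Score a candidate dynamic F by iterating it forward from the
        first n+1 observed vectors and comparing the reconstruction to
        the observed series.  series is a (timepoints x genes) array;
        F maps a window of n+1 expression vectors to the next vector.
        Lower scores are better."""
        recon = [series[i] for i in range(n + 1)]
        for i in range(n + 1, len(series)):
            recon.append(F(np.array(recon[-(n + 1):])))
        return float(np.mean((np.array(recon) - series) ** 2))

    # Trivial candidate: "the next state equals the current state"
    persist = lambda window: window[-1]
    series = np.cumsum(np.ones((20, 6)), axis=0)  # toy expression data
    print(predictive_power(persist, series, n=2))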
5. Approaches to the Inference of Genetic Regulatory Networks
Many different approaches have been taken to the problem of inferring genetic regulatory networks from microarray data, none of them particularly successful. I won't give a complete review here, but will just cover what appear to me to be the highlights of the field.
Existing models may be categorized roughly (and not exhaustively) as follows:

- Continuous
  - Pairwise
  - Rough network
  - Complex network
- Discrete
  - Boolean network
  - Bayesian
Discrete models involve some discretization of the input data as a first step; continuous models do not.
5.1 Continuous Models
The continuous models are well reviewed in Wessels et al (2001).

Pairwise models seek to find pairs of genes whose expression time series are appropriately correlated. Arkin and Ross (1997) did this statistically; Chen et al (1999) took a more innovative, computer-science-oriented approach, involving the analysis of peaks in gene expression time series. The problem with pairwise models is that they ignore the many types of gene interactions that are not binary in nature.

What Wessels et al call "rough network models" are models that use linear differential equations to model gene interactions. These models capture stimulation and inhibition relationships among genes, but they miss interactions that involve intermediate products such as proteins and metabolites. A good example of this kind of work is (van Someren et al, 2000).
Finally, complex network models attempt to model the full complexity of genetic regulation using nonlinear equations of various sorts. The best example here is (Savageau, 1998). The problem with this kind of model is the large number of parameters involved, which the data would currently seem to be too scant to determine.

Von Dassow et al (2000) is worth reading as an indication of the complexity of the equations and the number of parameters needed to really model a genetic regulatory network (the segment polarity network in insects, in this case). Their equations are derived in an integrative way from a large number of different data types, some quantitative, some qualitative. Inferring this kind of equational model from microarray data in its current quantity and quality is, as they note, not possible.
5.2 Discrete Models

Among the first significant pieces of work in this area was Liang et al (1998)'s REVEAL system. This algorithm infers a Boolean network from microarray data, but it is internally quite inefficient. Akutsu et al (1999) devised a more efficient approach to the same problem, and showed that if the in-degree of the networks involved is bounded, then only on the order of log(N) examples are needed to infer a network of size N. In practical examples, they found that about 100 examples were needed to infer a network of size 100,000. However, they only tested their algorithm on simulated data, not real microarray data, so its functionality in practice remains unknown.
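To convey the flavor of this family of algorithms, here is a brute-force Python sketch of bounded in-degree Boolean network inference: for each gene, it searches over all input sets of size k and all Boolean functions on those inputs for a rule consistent with the observed state transitions. This is the underlying idea only, not REVEAL's or Akutsu et al's exact algorithm, and the toy state sequence is fabricated.

    from itertools import combinations, product

    def infer_boolean_rules(states, k=2):
        """For each gene, search for a Boolean function of k input genes
        consistent with all observed state transitions.
        states: list of binary tuples, consecutive network states."""
        n = len(states[0])
        transitions = list(zip(states[:-1], states[1:]))
        rules = {}
        for target in range(n):
            for inputs in combinations(range(n), k):
                # A Boolean function of k inputs is a truth table of 2^k bits.
                for table in product([0, 1], repeat=2 ** k):
                    def f(state, inputs=inputs, table=table):
                        idx = sum(state[g] << pos for pos, g in enumerate(inputs))
                        return table[idx]
                    if all(f(prev) == nxt[target] for prev, nxt in transitions):
                        rules[target] = (inputs, table)
                        break
                if target in rules:
                    break
        return rules

    # Toy 3-gene network: g0' = NOT g0, g1' = XNOR(g0, g1), g2' = AND(g0, g1)
    states = [(1, 1, 0), (0, 1, 1), (1, 0, 0), (0, 0, 0), (1, 1, 0), (0, 1, 1)]
    print(infer_boolean_rules(states))

The exhaustive search is exponential in k, which is why the bounded in-degree assumption matters: it keeps k small and the search tractable.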
There are certainly other discrete approaches as well. For instance, Kirillova (2001) has described an innovative cellular automaton model of genetic regulation. And John Koza (Koza et al, 2000) has applied genetic programming to the problem of inferring chemical reactions from time series data (a consequence of the general principle that John Koza has applied genetic programming to every problem on Earth). Koza has created an appealing representation of chemical reactions in terms of program trees. But his work so far has been with toy problems, involving unrealistically clean and abundant data; it remains unknown how well the GP methodology would extend to the analysis of realistic microarray data.
There are some basic problems confronted in any situation in which a complex underlying system is represented by a relatively scant amount of noisy data. One is generally caught between two errors: on the one hand, adopting an oversimplified model that ignores most of the structure in the data; on the other, adopting a model with so many parameters that it winds up fitting the noise rather than the underlying system.
In the case of financial analysis, the trend in the industry is to stick with oversimplistic linear models (e.g. the Barra models, which decompose market behavior as sums of linear factors). Nonlinear models of market behavior traditionally have suffered from overfitting and haven’t succeeded, even though everyone knows financial markets are in essence a nonlinear phenomenon. We have learned through experience that it is possible to get nonlinear financial forecasting right, via a statistical pattern-recognition approach, but only through very careful attention to practicalities as opposed to abstract theory.
In the case of inferring genetic regulatory networks, it seems that the same dichotomy holds: the linear and Boolean models are too simple to capture much of the real structure of genetic regulation, while the complex nonlinear models involve more parameters than the scant, noisy data can currently determine.
This suggests that a statistical pattern recognition approach may be successful. In this approach, one searches for the patterns that are there in the data, not restricting the search to a very simple class of patterns (linear or Boolean), but neither seeking to impose a theoretically-derived pattern on the data (as in a complex equational model of gene regulation, derived from biological theory but disconnected from the actual data).
6. A Possible Webmind-Based Solution
How can we approach the problem of inferring regulatory networks using Webmind AI? Given the above background, the approach can be described fairly briefly. The basic idea is to use Webmind as a statistical pattern-recognition tool: i.e., to search a large space of conjectural regulatory patterns, expressed as uncertain logical relationships, and to evaluate each candidate probabilistically against the microarray data and relevant background knowledge.
The discussion in this section is conceptually completely in line with the section on genetic regulatory network inference in my earlier document “Webmind Based Data Mining” (Goertzel, 2001), but the notation and terminology are a little different there, and things are worked out in more detail.
Proceeding more systematically, the first step, as with all other approaches, is to make a particular assumption regarding the space F of genetic dynamical equations. We will assume that they are defined in a certain way in terms of sets of implications in uncertain logic (uncertain logic meaning either NARS or PTL).
Note that this assumption lies somewhere between the continuous and discrete models pursued in the literature. We are creating continuous functions by adding uncertainty to logical networks. Unlike in the Bayesian network approach, we are not relying upon human intuition to construct the patterns being evaluated. We are identifying a large space of possible patterns and then using AI to search this space and probability theory to evaluate pattern quality.
The elementary relationships involved in these logical expressions are of the form

    expression(g, t)

where g is a gene, t is a point in time, and "expression" is a special ConceptNode. This is the form in which the empirical data from the microarray data files is to be fed into Webmind.
These elementary gene expression relationships are joined using AND, OR and NOT to form more complex logical relationships. Both the elementary gene expression relationships and the compound logical relationships may have truth values in [0,1], which is what differentiates the present approach from the Boolean network approach.
Appropriate normalization of empirical expression values into [0,1] is a nontrivial task which may be done according to various mathematical functions; this is basically the same as the link strength normalization problem we have encountered before (see Digital Intuition). What we discovered there is that the optimal solution depends on the probability distribution of the total set of values being mapped.
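One simple, distribution-adaptive possibility is rank-based normalization, sketched below in Python. This is just one reasonable mapping into [0,1], an assumption of mine, not a claim about what Webmind's link-strength normalization actually does.

    import numpy as np

    def to_unit_interval(values):
        """Map raw expression values into [0,1] by empirical rank, so
        that the mapping adapts to whatever distribution the values
        happen to have (heavy-tailed intensity data included)."""
        v = np.asarray(values, dtype=float)
        order = np.argsort(v)
        ranks = np.empty(len(v))
        ranks[order] = np.arange(len(v))
        return ranks / (len(v) - 1)

    print(to_unit_interval([0.2, 8.0, 1.5, 3.0, 100.0]))
    # -> [0.  0.75 0.25 0.5  1.]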
An important function in microarray data analysis is clustering. A review of some clustering methods used on this data is given in Brazma and Vilo (2000). Basically they’re the standard ones: K-means, hierarchical clustering, and so forth. Clustering can be valuable here because there are sets of genes whose expressions vary very closely together; these genes can be assumed to be coregulated (i.e. regulated by the same other factors).
In a Webmind context, we may define a GeneNode as a ConceptNode that represents a gene. Clustering then produces ConceptNodes embodying categories of genes, i.e. ConceptNodes with MemberLinks to GeneNodes.
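A sketch of this step in Python, with K-means as the clustering method, fabricated data, and plain dicts standing in for GeneNodes, ConceptNodes and MemberLinks:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    profiles = rng.normal(size=(60, 20))      # 60 genes x 20 time points
    genes = [f"gene_{i}" for i in range(60)]  # hypothetical gene names

    # Each cluster becomes a category "ConceptNode" whose member list
    # plays the role of MemberLinks to the clustered "GeneNodes".
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(profiles)
    categories = {}
    for gene, label in zip(genes, labels):
        categories.setdefault(f"cluster_{label}", []).append(gene)
    print({name: len(members) for name, members in sorted(categories.items())})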
We may define a genetic node as any node that is either a GeneNode, or a ConceptNode grouping GeneNodes together (such as the cluster-derived category nodes just described).
An atomic genetic expression relationship is then any relation of the form
expression(N, t)
where N is a genetic node and t is a time. A compound genetic expression relationship is a CompoundRelationNode whose atoms are genetic expression relationships.
What we are looking for, to solve the regulatory network inference problem, are patterns represented by high-truth-value ImplicationLinks of the form

    [atomic or compound genetic expression relationship pertinent only to times <= t]
        ⇒ expression(g, s) AND after(s, t, I)    (*)

for some gene g and some time interval I.
In order to evaluate the truth values of these expressions most effectively, the system has to have some knowledge of number. For instance, it must recognize that expression(g,.6) is very similar (in the logical sense) to expression(g,.62). For initial work, it may be adequate to simply code this as an explicit grounding for the expression(,) relationship.
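A crude Python sketch of such an evaluation is given below: the antecedent is a fuzzy predicate over the expression history, and the implication's truth value is estimated as the weighted frequency with which the consequent gene is expressed when the antecedent holds. This is a placeholder scoring rule of mine; PTL's actual truth-value formulas are more involved.

    def truth_value(antecedent, consequent_gene, lag, expr, times):
        """Crude probabilistic truth value for an implication of form (*):
        roughly P(consequent gene expressed at t+lag | antecedent at t),
        with expression degrees in [0,1] treated as fuzzy truth values."""
        support, weight = 0.0, 0.0
        for i in range(len(times) - lag):
            a = antecedent(expr, i)
            weight += a
            support += a * expr[consequent_gene][i + lag]
        return support / weight if weight > 0 else 0.0

    # Hypothetical pattern: (g0 AND (NOT g1)) at time t => g2 at time t+1
    ante = lambda e, i: min(e["g0"][i], 1 - e["g1"][i])
    expr = {"g0": [0.9, 0.1, 0.8, 0.9, 0.2],
            "g1": [0.1, 0.9, 0.2, 0.1, 0.8],
            "g2": [0.0, 0.8, 0.1, 0.9, 0.9]}
    print(round(truth_value(ante, "g2", 1, expr, range(5)), 3))  # -> 0.837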
What does all this add up to? In the above-introduced notation, given an S, how do we find the corresponding F – hopefully approximating the real genetic dynamic F*? In the Webmind model, the space F of regulatory mappings is isomorphic to the set of sets of implications of the form (*).
Specifically, this isomorphism is defined as follows. Suppose one has done some Webmind analysis and come up with a set of such implications; call this set I. How, then, is a genetic dynamic F defined in terms of I? Suppose one is given an argument S(i-n, i). All implications in the set I are evaluated at the times t(i-n), …, t(i). Consequences for the expression of each gene g at time t(i+1) are then gathered, and used to form a predicted expression level vector E(t(i+1)). This mapping from S(i-n, i) to E(t(i+1)) is F.
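A minimal Python sketch of this mapping follows. The weighted-average rule for combining multiple implications bearing on the same gene, and persistence as the default for genes no implication speaks to, are both assumptions of mine, not fixed features of the design.

    import numpy as np

    def apply_implications(implications, window):
        """Define F from a set of implications: each implication looks at
        the window of recent expression vectors and, if its antecedent
        fires, votes (weighted by its truth value) for the expression
        level of its target gene at the next time step."""
        n_genes = window.shape[1]
        votes = np.zeros(n_genes)
        weights = np.zeros(n_genes)
        for antecedent, target, strength in implications:
            a = antecedent(window)          # antecedent degree in [0,1]
            votes[target] += strength * a
            weights[target] += strength
        prev = window[-1]
        # Genes with no applicable implication default to their last value.
        return np.where(weights > 0, votes / np.maximum(weights, 1e-9), prev)

    # Two hypothetical implications over 3 genes:
    #   "gene 0 high recently" => gene 2 expressed next  (truth value 0.9)
    #   "gene 1 high recently" => gene 0 expressed next  (truth value 0.6)
    imps = [(lambda w: w[:, 0].mean(), 2, 0.9),
            (lambda w: w[:, 1].mean(), 0, 0.6)]
    window = np.array([[0.9, 0.1, 0.0],
                       [0.8, 0.2, 0.1]])
    print(apply_implications(imps, window))  # -> [0.15 0.2  0.85]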
I have not said how the appropriate implication relationships are conceived, nor how the appropriate uncertain logical expressions are constructed. That's because this is Webmind cognition. By casting the problem as I've done, I've rendered the genetic regulatory network inference problem as a special case of the problem of general cognition, which can be solved by our general-purpose cognition algorithms. In this case the key ingredients would seem to be:

- higher-order inference (HOI), to evaluate conjectural regulatory patterns against the data and against biological background knowledge
- CRN crossover/mutation, to create new conjectural patterns from promising existing ones
- association-finding and clustering, to suggest which genes and gene categories are worth relating in the first place
In addition to these general mechanisms, however, some specialized algorithms for CRN (CompoundRelationNode) formation may be valuable. It would be interesting, for example, to implement the algorithm of Akutsu et al (1999) to guess Boolean networks governing gene regulation. Their algorithm does not produce probabilistic results, but it could be used to produce "CRN skeletons" that are then probabilistically evaluated. The idea is that it might provide better initial guesses than random CRN formation.
Note that this is not free-ranging, purely self-organizing cognition, it’s goal-directed cognition, where the goal is to create ImplicationLinks of a certain form. Implementationally, this means that the process should be driven by a hand-coded cognitive schema, linked to an application-specific GoalNode. This cognitive schema effectively performs fitness evaluation. It looks at the CompoundRelationNodes in the system, and tries to build ImplicationLinks of the appropriate form to them. It gives importance to the nodes and links involved if it succeeds.
The above describes only a first approach to solving the genetic regulatory network inference problem using Webmind, and because of this it refers to numerical microarray data only. But the next step is clear – to incorporate non-numerical annotations regarding functional properties of various genes. Once this information is in the form of Webmind nodes and links, processing like inference and association-finding can use it automatically, requiring no significant modifications to the above-described application logic. For instance, non-numerical data can give us ConceptNodes embodying a priori gene categories, which can be combined with ConceptNodes inferred from statistical clustering. It can also tell us that in other organisms a certain gene tends to regulate a certain other gene, which will bias the system to look for regulatory relationships between those two genes. And so forth.
Of course, what I’ve described here is actually a general methodology for Webmind-based time series prediction, not really specifically restricted to gene expression time series prediction. The heuristics for forming CRN’s will be specific to the application, however, as will be the numerous parameter settings.
Finally, it should be noted that the main use of this kind of pattern analysis tool in practice is to guide biologists in designing new experiments. Once one has guessed what the regulatory network is, using noisy microarray data, one’s hypotheses can be tested using more accurate methods. Thus it is important to have a comprehensible output of the model derived by the program. Fortunately, uncertain logical expressions resemble Boolean expressions, which are easily comprehended, so that biological interpretability of the program’s output shouldn’t be a problem. A graphical viewer of genetically relevant CompoundRelationNodes as trees would seem to be an appropriate tool around which to build an interface for viewing the system’s results.
References

Akutsu, Tatsuya, Satoru Miyano and Satoru Kuhara (1999). Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. http://citeseer.nj.nec.com/akutsu99identification.html

Arkin, Shen and Ross (1997). A test case of correlation metric construction of a reaction pathway from measurements. Science 277

Ball, Catherine A., Kara Dolinski, Selina S. Dwight, Midori A. Harris, Laurie Issel-Tarver, Andrew Kasarskis, Charles R. Scafe, Gavin Sherlock, Gail Binkley, Heng Jin, Mira Kaloper, Sidney D. Orr, Mark Schroeder, Shuai Weng, Yan Zhu, David Botstein and J. Michael Cherry (2000). Integrating functional genomic information into the Saccharomyces Genome Database. Nucleic Acids Research. Online at http://nar.oupjournals.org/cgi/content/full/28/1/77

Brazma, Alvis and Jaak Vilo (2000). Gene expression data analysis. FEBS Letters 480, http://www.elsevier.nl/febs/118/19/21/00023893.pdf

Brown, Michael, William Grundy, David Lin, Nello Cristianini, Charles Sugnet, Manuel Ares Jr. and David Haussler (1999). Support Vector Machine classification of microarray gene expression data. http://www.cse.ucsc.edu/research/compbio/genex/genexTR2html/genex.html

Chen, Ting, Vladimir Filkov and Steven S. Skiena (1999). Identifying gene regulatory networks from experimental data. http://citeseer.nj.nec.com/chen99identifying.html

Cho, R., et al. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2:65-73, raw data online at http://cmgm.stanford.edu/pbrown/sporulation/additional/

Crutchfield, J.P. (1989). Inferring the dynamic, quantifying physical complexity. In Measures of Complexity and Chaos, A.M. Albano, N.B. Abraham, P.E. Rapp and A. Passamante, editors, Plenum Press, New York, p. 327, http://www.santafe.edu/projects/CompMech/papers/CompMechCommun.html

Crutchfield, J.P. and K. Young (1990). Computation at the onset of chaos. In Entropy, Complexity, and Physics of Information, W. Zurek, editor, SFI Studies in the Sciences of Complexity VIII, Addison-Wesley, Reading, Massachusetts, 223-269, http://www.santafe.edu/projects/CompMech/papers/CompMechCommun.html

De Risi, Iyer and Brown (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278

Goertzel, Ben (2001). Webmind-Based Data Mining. Webmind Diehards LLC technical report.

Goertzel, Ben (2001). Digital Intuition: A Conceptual Overview of the Webmind AI Engine. Webmind Diehards LLC technical report, based on some prior Webmind Inc. material.

Hartemink, Alexander, David K. Gifford, Tommi S. Jaakkola and Richard A. Young (2001). Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pacific Symposium on Biocomputing. http://www.psrg.lcs.mit.edu/publications/Papers/psbabs.html

Kirillova, Olga (2001). Cellular automata model for gene networks. http://www.csa.ru/Inst/gorb_dep/inbios/genet/Netmodel/bool/camod.htm

Koza, John, William Mydlowec, Guido Lanza, Jessen Yu and Martin A. Keane (2000). Reverse engineering of metabolic pathways from observed data using genetic programming. http://smi-web.stanford.edu/pubs/SMI_Abstracts/SMI-2000-0851.html

Liang, Fuhrman and Somogyi (1998). REVEAL: a general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing '98, 3:18-29

Maki, Y., D. Tominaga, M. Okamoto, S. Watanabe and Y. Eguchi (2001). Development of a system for the inference of large-scale genetic networks. http://citeseer.nj.nec.com/maki01development.html

Raychaudhuri, S., J.M. Stuart and R.B. Altman (2000). Principal components analysis to summarize microarray experiments: application to sporulation time series. Pacific Symposium on Biocomputing 2000:455-66

Savageau, M.A. (1998). Rules for the evolution of gene circuitry. Pacific Symposium on Biocomputing, 3:54-65

Sherman, Fred (1998). An Introduction to the Genetics and Molecular Biology of the Yeast Saccharomyces cerevisiae. http://dbb.urmc.rochester.edu/labs/Sherman_f/yeast/Index.html

van Someren, E.P., L.F.A. Wessels and M.J.T. Reinders (2000). Linear modeling of genetic networks from experimental data. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB00), p. 355-366, La Jolla, California, August 2000, see http://www.genlab.tudelft.nl/publications/

Vilo, Jaak, Alvis Brazma, Inge Jonassen, Alan Robinson and Esko Ukkonen (2000). Mining for putative regulatory elements in the yeast genome using gene expression data. ISMB-2000, August 2000, http://industry.ebi.ac.uk/~vilo/Publications/ISMB2000.pdf

Von Dassow, George, Eli Meir, Edwin Munro and Garrett Odell (2000). The segment polarity network is a robust developmental module. Nature 406, pp. 188-192, http://www.euchromatin.org/Biomath01.htm

Wen, X., S. Fuhrman, G.S. Michaels, D.B. Carr, S. Smith, J. Barker and R. Somogyi (1998). Large-scale temporal gene expression mapping of central nervous system development. PNAS 95:334-339

Wessels, L.F.A., E.P. van Someren and M.J.T. Reinders (2001). Genetic network models: a comparative study. http://citeseer.nj.nec.com/vansomeren01genetic.html

Yuh, Bolouri and Davidson (1998). Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science 279