in the
virtual world context means the evaluation of one of a set of pre-specified perception predicates,
with an argument consisting of one of the entities in the observed environment.
Given N entities (provided by the Entity filter), there are usually O(N^2) potential perceptions
in the Atomspace, due to binary perceptions like

near(owner bird)
inside(toy box)
The perception filter proceeds by computing the entropy of any potential perception happening
during a learning session. Indeed, if the entropy of a given perception P is high, that
means that a conditional if(P B1 B2) has a rather balanced probability of taking branch B1
or B2. On the other hand, if the entropy is low then the probability of taking these branches
is unbalanced; for instance the probability of taking B1 may be significantly higher than the
probability of taking B2, and therefore if(P B1 B2) could reasonably be substituted by B1.
For example, assume that during the teaching sessions the predicate near(owner bird) is
false 99% of the time; then near(owner bird) will have a low entropy and will possibly
be discarded by the filter (depending on the threshold). If the bird is always far from the owner
then the predicate will have entropy 0 and will surely be discarded, but if the bird comes and goes it will
have a high entropy and will pass the filter. Let P be such a perception, and let P_t return 1 when
the perception is true at instant t and 0 otherwise, where t ranges over the set of instants, of size N,
recorded between the beginning and the end of the demonstrated trick. The calculation goes as
follows
208 32 Procedure Learning via Adaptively Biased Hillclimbing
Entropy(P) = H( (1/N) * sum_{t=1..N} P_t )

where H(p) = -p log(p) - (1 - p) log(1 - p). There are additional subtleties when the perception
involves random operators, like near(owner random_object); in that case the entropy is calculated by
taking into account a certain distribution over entities grouped under the term random_object.
The calculation is optimized to ignore instants when the perception relates to objects that have
not moved, which makes it efficient enough, but there is room to improve it in
various ways; for instance, it could be made to choose perceptions based not only on entropy
but also on inferred relevancy with respect to the context, using PLN.
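The entropy computation described above can be sketched as follows. This is an illustrative reconstruction, not the actual OpenCog code; the function names and the representation of each perception as a list of 0/1 truth values over the recorded instants are assumptions for the example.

```python
import math

def binary_entropy(p):
    """H(p) = -p*log(p) - (1-p)*log(1-p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def perception_entropy(truth_values):
    """Entropy of a perception from its truth values P_t over the
    N instants recorded during the learning session."""
    p = sum(truth_values) / len(truth_values)
    return binary_entropy(p)

def entropy_filter(perceptions, threshold):
    """Keep only perceptions whose entropy exceeds the threshold.
    `perceptions` maps a perception name to its list of truth values."""
    return {name for name, tv in perceptions.items()
            if perception_entropy(tv) > threshold}
```

For instance, a perception that is false at every recorded instant has entropy 0 and is discarded, while one that flips regularly has entropy near log(2) and passes the filter.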
32.4 Using Action Sequences as Building Blocks
A heuristic that has been shown to work, in the "virtual pet trick" context, is to consider
sequences of actions that are compatible with the behavior demonstrated by the avatar showing
the trick as building blocks when defining the neighborhood of a candidate. For instance if the
trick is to fetch a ball, compatible sequences would be
goto(ball), grab(ball), goto(owner), drop
goto(random_object), grab(nearest_object), goto(owner), drop
Sub-sequences can be considered as well, though too many building blocks also increase the
neighborhood exponentially, so one has to be careful when doing that. In practice, using the
set of whole compatible sequences worked well. This can, for instance, speed up the
learning of the trick triple_kick many fold, as shown in Section 32.6.
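As a rough sketch of the idea (not the actual implementation), candidates can be represented here as flat tuples of action primitives, with whole compatible sequences added to the move set alongside single-action edits; all names are illustrative:

```python
def neighborhood(candidate, actions, building_blocks):
    """Neighbors of a candidate under insertion moves. Candidates are
    flat tuples of primitives here; a real implementation would operate
    on Combo program trees."""
    neighbors = []
    # Standard one-step edits: insert a single action at each position.
    for i in range(len(candidate) + 1):
        for a in actions:
            neighbors.append(candidate[:i] + (a,) + candidate[i:])
    # Building-block edits: insert a whole compatible sequence at once, so
    # e.g. goto(ball) grab(ball) goto(owner) drop is one step away.
    for i in range(len(candidate) + 1):
        for seq in building_blocks:
            neighbors.append(candidate[:i] + tuple(seq) + candidate[i:])
    return neighbors
```

With the fetch_ball sequence as a building block, the full trick is reachable from the empty candidate in a single hillclimbing step instead of four.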
32.5 Automatically Parametrizing the Program Size Penalty
A common heuristic for program learning is an "Occam penalty" that penalizes large programs,
hence biasing search toward compact programs. The function we use to penalize program size
is inspired by Ray Solomonoff's theory of optimal inductive inference [Sol64a, Sol64b]; simply
said, a program is penalized exponentially with respect to its size. One may also say that, since
the number of program candidates grows exponentially with their size, exploring solutions of
greater size must be exponentially worth the cost.
In the next subsections we describe the particular penalty function we have used and how
to tune its parameters.
32.5.1 Definition of the complexity penalty
Let p be a program candidate and penalty(p) a function, with values in [0, 1], measuring the
complexity of p. If we consider the complexity penalty function penalty(p) as if it denoted
the prior probability of p, and score(p) (the quality of p as utilized within the hill climbing
algorithm) as denoting the conditional probability of the desired behavior given p, then
Bayes rule tells us that
fitness(p) = score(p) x penalty(p)
denotes the conditional probability of p knowing the right behavior to imitate, the fitness
function that we want to maximize.
It happens that in the pet trick learning context, which is our main example in this chapter,
score(p) does not denote such a probability; instead it measures how similar the behavior
generated by p is to the behavior to imitate. However, we utilize the above formula anyway,
with a heuristic interpretation. One may construct assumptions under which score(p) does
represent a probability, but this would take us too far afield.
The penalty function we use is then given by:
penalty(p) = exp(-a * log(b * |A| + e) * |p|)

where |p| is the program size, |A| its alphabet size and e = exp(1). The reason |A| enters
into the equation is that the alphabet size varies from one problem to another due to the
perception and action filters. Without that constraint the term log(b * |A| + e) could simply
be folded into a. The higher a is, the more intense the penalty. The parameter b controls how
that intensity varies with the alphabet size.
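The penalty and the fitness combination can be written directly from the formulas above. This is a sketch with illustrative names; the default parameter values are the ones reported for the default configuration later in the chapter (a = 0.03, b = 0.34):

```python
import math

def penalty(program_size, alphabet_size, a, b):
    """penalty(p) = exp(-a * log(b*|A| + e) * |p|): exponential in the
    program size, with intensity modulated by the alphabet size |A|."""
    return math.exp(-a * math.log(b * alphabet_size + math.e) * program_size)

def fitness(score, program_size, alphabet_size, a=0.03, b=0.34):
    """Heuristic Bayes combination fitness(p) = score(p) * penalty(p).
    Defaults are the chapter's default-configuration values."""
    return score * penalty(program_size, alphabet_size, a, b)
```

Note that an empty program gets penalty 1 (no penalty), and each added node multiplies the penalty by the same factor smaller than 1.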
It is important to remark the difference between such a penalty function and lexicographic
parsimony pressure (literally: everything else being equal, choose the shortest program). Because
of the use of sequences as building blocks, without such a penalty function the algorithm may
rapidly reach a seemingly optimal program (a mere long sequence of actions) and remain stuck at an
apparent optimum while missing the very logic of the action sequence that the human wants
to convey.
32.5.2 Parameterizing the complexity penalty
Due to the nature of the search algorithm (hill climbing with restart), the choice of the candidate
used to restart the search is crucial. In our case we restart with the candidate having the best
fitness so far which has not yet been used for a restart. The danger of such an approach is that
when the algorithm enters a region with local optima (like a plateau), it may basically stay
there as long as there exist better candidates in that region than outside of it not yet used for
restart. Longer programs tend to generate larger regions of local optima (because they have
exponentially more syntactic variations that lead to close behaviors), so if the search enters
such a region via an overly complex program it is likely to take a very long time to get out
of it. Introducing randomness into the choice of the restart may help avoid that sort of trap,
but having experimented with that, it turned out not to be significantly better on average for
learning relatively simple things (indeed, although the restart choice is more diverse it still tends
to occur in large regions of local optima).
However, a significant improvement we have found is to carefully choose the size penalty
function so that the search will tend to restart on simpler programs even if they do not exhibit
* Bayes rule as used here is P(M|D) = P(M)P(D|M)/P(D), where M denotes the model (the program) and D
denotes the data (the behavior to imitate); here P(D) is ignored, that is, the data is assumed to be distributed
uniformly.
the best behaviors, but will still be able to reach the optimal solution even if it is a complex
one.
The solution we suggest is to choose a and b such that penalty(p) is:
1. as penalizing as possible, to focus on simpler programs first (although that constraint may
possibly be lightened as the experimentation shows),
2. but still correct in the sense that the optimal solution p maximizes fitness(p).
And we want that to work for all problems we are interested in. That restriction is an
important point because it is likely that in general the second constraint will be too strict to
produce a good penalty function.
We will now formalize the above problem. Let i be an index that ranges over the set of
problems of interest (in our case, pet tricks to learn); score_i and fitness_i denote the score and
fitness functions of the i-th problem. Let S_i(s) denote the set of programs of score s

S_i(s) = {p | score_i(p) = s}

Define a family of partial functions

f_i : [0, 1] -> N

so that

f_i(s) = min_{p in S_i(s)} |p|

What this says is that for any given score s, f_i(s) gives the size of the shortest program p with
that score. And f_i is partial because there may not be any program returning a given score.
Let g_i be the family of partial functions

g_i : [0, 1] -> [0, 1]

parametrized by a and b such that

g_i(s) = s * exp(-a * log(b * |A_i| + e) * f_i(s))

That is: given a score s, g_i(s) returns the fitness fitness_i(p) of the shortest program p that
attains that score.
32.5.3 Definition of the Optimization Problem
Let s_i be the highest score obtained for fitness function i (that is, the score of the program
chosen as the current best solution of problem i). Now the optimization problem consists of finding
a and b such that

for all i,  argmax_s g_i(s) = s_i
that is, the highest score also has the highest fitness. We started by choosing a and b as high
as possible; this is a good heuristic but not the best one. The best would be to choose a and b so
that they minimize the number of iterations (number of restarts) needed to reach a global optimum,
which is a harder problem.
Also, regarding the resolution of the above equation, it is worth noting that we do not need the
analytical expression of score_i(p). Using past learning experiences we can get a partial description
of the fitness landscape of each problem just by looking at the traces of the search.
Overall we have found that this optimization works rather well; that is, tricks that would otherwise
take several hours or days of computation can be learned in seconds or minutes. The method
also enables fast learning for new tricks; in fact, all tricks we have experimented with so far could
be learned reasonably fast (seconds or minutes) without the need to retune the penalty function.
In the current CogPrime codebase, the algorithm in charge of calibrating the parameters
of the penalty function has been written in Python. It takes as input the log of the imitation
learning engine, which contains the score, the size, the penalty and the fitness of all candidates
explored for all tricks taken into consideration for the parameterizing. The algorithm proceeds in
two steps:
1. Reconstitute the partial functions f_i for all fitness functions i already attempted, based on
the traces of these previously optimized fitness functions.
2. Try to find the highest a and b so that

   for all i,  argmax_s g_i(s) = s_i
For step 2, since there are only two parameters to tune, we have used a 2D grid, enumerating
all points (a, b) and zooming in when necessary. So the speed of the process depends largely on
the resolution of the grid, but (on an ordinary 2009 PC processor) it usually does not require
more than 20 minutes to both extract the f_i and find a and b with a satisfactory resolution.
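Under the simplifying assumption that each search trace is available as a list of (score, size) pairs, the two calibration steps can be sketched as follows (illustrative names, not the actual calibration code):

```python
import math

def reconstruct_f(trace):
    """Step 1: from a trace of (score, size) pairs, recover the partial
    function f(s) = size of the smallest program seen with score s."""
    f = {}
    for score, size in trace:
        f[score] = min(size, f.get(score, size))
    return f

def g(s, f, a, b, alphabet_size):
    """g(s) = s * exp(-a * log(b*|A| + e) * f(s)): the fitness of the
    smallest program attaining score s."""
    return s * math.exp(-a * math.log(b * alphabet_size + math.e) * f[s])

def calibration_ok(a, b, problems):
    """Check that for every problem the best score also has the best
    fitness. Each problem is (trace, best_score, alphabet_size)."""
    for trace, best_score, alphabet_size in problems:
        f = reconstruct_f(trace)
        top = max(f, key=lambda s: g(s, f, a, b, alphabet_size))
        if top != best_score:
            return False
    return True

def calibrate(problems, grid):
    """Step 2: among candidate (a, b) grid points, return a valid pair
    with the most intense penalty (here: largest a, then largest b)."""
    valid = [(a, b) for a, b in grid if calibration_ok(a, b, problems)]
    return max(valid, default=None)
```

A real grid would enumerate many (a, b) points and zoom in; the sketch just scans whatever candidate pairs it is given.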
32.6 Some Simple Experimental Results
To test the above ideas in a simple context, we initially used them to enable an OpenCog
powered virtual world agent to learn a variety of simple "dog tricks" based on imitation and
reinforcement learning in the Multiverse virtual world. We have since deployed them on a variety
of other applications in various domains.
We began these experiments by running learning on two tricks, fetch_ball and triple_kick,
described below, in order to calibrate the size penalty function:
1. fetch_ball, which corresponds to the Combo program
and_seq(goto(ball)
        grab(ball)
        goto(owner)
        drop)
2. triple_kick: if the stick is near the ball then kick 3 times with the left leg, and otherwise 3
times with the right leg. So for that trick the owner had to provide 2 exemplars, one for
kickL (with the stick near the ball) and one for kickR, and move the ball away from the
stick before showing the second exemplar. Below is the Combo program of triple_kick

if(near(stick ball)
   and_seq(kickL kickL kickL)
   and_seq(kickR kickR kickR))
Before choosing an exponential size penalty function and calibrating it, fetch_ball would be
learned rather rapidly, in a few seconds, but triple_kick would take more than an hour. After
calibration both fetch_ball and triple_kick would be learned rapidly, the latter in less than a
minute.
Then we experimented with a few new tricks, some simpler, like sit_under_tree

and_seq(goto(tree) sit)
and others more complex like double_dance, where the trick consists of dancing until the
owner emits the message "stop dancing", and changing the dance upon owner's actions
while(not(says(owner "stop dancing"))
if(last_action(owner "kickL")
tap_dance
lean_rock_dance))
That is the pet performs a tap_dance when the last action of the owner is kickL, and
otherwise performs a lean_rock_dance.
We tested learning for 3 tricks: fetch_ball, triple_kick and double_dance. Each trick was
tested in ten settings, denoted conf1 to conf10, summarized in Table 32.1.
• conf1 is the default configuration of the system; the parameters of the size penalty function
are a = 0.03 and b = 0.34, which is actually not what is returned by the calibration
technique but close to it. That is because in practice we have found that on average learning
works slightly faster with these values.
• conf2 is the configuration with the exact values returned by the calibration, that is a = 0.05,
b = 0.94.
• conf3 has the reduction engine disabled.
• conf4 has the entropy filter disabled (the threshold is null so all perceptions pass the filter).
• conf5 has the intensity of the penalty function set to 0.
• conf6 has the penalty function set with low intensity.
• conf7 and conf8 have the penalty function set with high intensity.
• conf9 has the action sequence building block enabled.
• conf10 has the action sequence building block enabled but with a slightly lower intensity of
the size penalty function than normal.
Reduct ActSeq Entropy a      b    Setting
On     Off    0.1     0.03   0.34 conf1
On     Off    0.1     0.05   0.94 conf2
Off    Off    0.1     0.03   0.34 conf3
On     Off    0       0.03   0.34 conf4
On     Off    0.1     0      0.34 conf5
On     Off    0.1     0.0003 0.34 conf6
On     Off    0.1     0.3    0.34 conf7
On     Off    0.1     3      0.34 conf8
On     On     0.1     0.03   0.34 conf9
On     On     0.1     0.025  0.34 conf10

Table 32.1: Settings for each learning experiment
Setting Percep Restart Eval    Time
conf1   3      4       653     5s218
conf2   3      3       245     2s
conf3   3      3       1073    8s42
conf4   136    3       28287   4mn7s
conf5   3      >700    >500000 >1h
conf6   3      3       653     5s218
conf7   3      8       3121    23s42
conf8   3      147     65948   8mn10s
conf9   3      0       89      410ms
conf10  3      0       33      161ms

Table 32.2: Learning time for fetch_ball
Setting Percep Restart Eval   Time
conf1   1      18      2783   21s47
conf2   1      110     11426  1mn53s2
conf3   1      49      15069  2mn15s2
conf4   124    oo      oo     oo
conf5   1      >800    >200K  >1h
conf6   1      7       1191   9s67
conf7   1      >2500   >200K  >1h
conf8   1      >2500   >200K  >1h
conf9   1      0       107    146ms
conf10  1      0       101    164ms

Table 32.3: Learning time for triple_kick
Setting Percep Restart Eval  Time
conf1   -      -       113   4s
conf2   -      -       113   4s
conf3   -      -       150   6s20ms
conf4   -      -       >60K  >1h
conf5   -      -       113   4s
conf6   -      -       113   4s
conf7   -      -       113   4s
conf8   -      -       >300K >1h
conf9   -      -       138   4s191ms
conf10  -      -       219K  56mn3s

Table 32.4: Learning time for double_dance
Tables 32.2, 32.3 and 32.4 contain the results of the learning experiments for the three tricks:
fetch_ball, triple_kick and double_dance. In each table the column Percep gives the number
of perceptions taken into account for the learning, Restart gives the number of restarts
hill climbing had to do before reaching the solution, Eval gives the number of evaluations, and
Time the search time.
In Tables 32.2 and 32.4 we can see that fetch_ball and double_dance are learned in a few
seconds both in conf1 and conf2. In Table 32.3, however, learning is about five times faster with conf1
than with conf2, which was the motivation to go with conf1 as the default configuration, but conf2
still performs well.
As Tables 32.2, 32.3 and 32.4 demonstrate for setting conf3, the reduction engine speeds the
search up by less than a factor of two for fetch_ball and double_dance, and many times over for triple_kick.
The results for conf4 show the importance of the filtering function: learning is dramatically
slowed down without it. A simple trick like fetch_ball took a few minutes instead of seconds,
double_dance could not be learned after an hour, and triple_kick might never be learned
because the search did not focus on the right perception from the start.
The results for conf5 show that without any kind of complexity penalty, learning can be
dramatically slowed down, for the reasons explained in Section 32.5: the search loses
itself in large regions of sub-optima. Only double_dance was not affected by that, which is
probably explained by the fact that only one restart occurred for double_dance and it happened
to be the right one.
The results for conf6 show that when the action-sequence building block is disabled, the intensity
of the penalty function can be set even lower. For instance triple_kick is learned faster (9s67
instead of 21s47 for conf1). Conversely, the results for conf10 show that when the action-sequence
building block is enabled, if the Occam's razor is too weak it can dramatically slow down the
search. That is because in this circumstance the search is misled by longer candidates that fit,
and takes a very long time before it can reach the more compact optimal solution.
32.7 Conclusion
In our experimentation with hillclimbing for learning pet tricks in a virtual world, we have
shown that the combination of
1. candidate reduction into normal form,
2. filtering operators to narrow the alphabet,
3. using action sequences that are compatible with the shown behavior as building blocks,
4. adequately choosing and calibrating the complexity penalty function,
can speed up imitation learning so that moderately complex tricks can be learned within seconds
to minutes instead of hours, using a simple "hill climbing with restarts" learning algorithm.
While we have discussed these ideas in the context of pet tricks, they have of course been
developed with more general applications in mind, and have been applied in many additional
contexts. Combo can be used to represent any sort of procedure, and both the hillclimbing
algorithm and the optimization heuristics described here appear broad in their relevance.
Natural extensions of the approach described here include the following directions:
1. improving the Entity and Entropy filters using ECAN and PLN, so that filtering is based not only
on entropy but also on relevancy with respect to the context and background knowledge,
2. using transfer learning (see Section 33.5 of Chapter 33) to tune the parameters of the algorithm
using contextual and background knowledge.
Indeed these improvements are under active investigation at time of writing, and some may
well have been implemented and tested by the time you read this.
Chapter 33
Probabilistic Evolutionary Procedure Learning
Co-authored with Moshe Looks*
33.1 Introduction
The CogPrime architecture fundamentally requires, as one of its components, some powerful
algorithm for automated program learning. This algorithm must be able to solve procedure
learning problems relevant to achieving human-like goals in everyday human environments, re-
lying on the support of other cognitive processes, and providing them with support in turn. The
requirement is not that complex human behaviors need to be learnable via program induction
alone, but rather that, when the best way for the system to achieve a certain goal seems to be
the acquisition of a chunk of procedural knowledge, the program learning component should be
able to acquire the requisite procedural knowledge.
As CogPrime is a fairly broadly-defined architecture overall, there are no extremely precise
requirements for its procedure learning component. There could be variants of CogPrime in
which procedure learning carried more or less weight, relative to other components.
Some guidance here may be provided by looking at which tasks are generally handled by
humans primarily using procedural learning, a topic on which cognitive psychology has a fair
amount to say, and which is also relatively amenable to commonsense understanding based on
our introspective and social experience of being human. When we know how to do something,
but can't explain very clearly to ourselves or others how we do it, the chances are high that we
have acquired this knowledge using some form of "procedure learning" as opposed to declarative
learning. This is especially the case if we can do this same sort of thing in many different
contexts, each time displaying a conceptually similar series of actions, but adapted to the specific
situation. We would like CogPrime to be able to carry out procedural learning in roughly the
same situations ordinary humans can (and potentially other situations as well: maybe even at
the start, and definitely as development proceeds), largely via action of its program learning
component.
In practical terms, our intuition (based on considerable experience with automated program
learning, in OpenCog and other contexts) is that one requires a program learning component
capable of learning programs with between dozens and hundreds of program tree nodes, in
Combo or some similar representation - not able to learn arbitrary programs of this size, but
rather able to solve problems arising in everyday human situations in which the simplest ac-
ceptable solutions involve programs of this size. We also suggest that the majority of procedure
" First author
learning problems arising in everyday human situations can be solved via programs with hierarchical
structure, so that it likely suffices to be able to learn programs with between dozens and
hundreds of program tree nodes, where the programs have a modular structure, consisting of
modules each possessing no more than dozens of program tree nodes. Roughly speaking, with
only a few dozen Combo tree nodes, complex behaviors seem only achievable via using very
subtle algorithmic tricks that aren't the sort of thing a human-like mind in the early stages of
development could be expected to figure out; whereas, getting beyond a few hundred Combo
tree nodes, one seems to get into the domain where an automated program learning approach is
likely infeasible without rather strong restrictions on the program structure, so that a more ap-
propriate approach within CogPrime would be to use PLN, concept creation or other methods
to fuse together the results of multiple smaller procedure learning runs.
While simple program learning techniques like hillclimbing (as discussed in Chapter 32 above)
can be surprisingly powerful, they do have fundamental limitations, and our experience and
intuition both indicate that they are not adequate for serving as CogPrime's primary program
learning component. This chapter describes an algorithm that we do believe is thus capable -
CogPrime's most powerful and general procedure learning algorithm, MOSES, an integrative
probabilistic evolutionary program learning algorithm that was briefly overviewed in Chapter
6 of Part I.
While MOSES as currently designed and implemented embodies a number of specific algo-
rithmic and structural choices, at bottom it embodies two fundamental insights that are critical
to generally intelligent procedure learning:
• Evolution is the right approach to the learning of difficult procedures
• Enhancing evolution with probabilistic methods is necessary. Pure evolution, in the vein of
the evolution of organisms and species, is too slow for broad use within cognition; so what
is required is a hybridization of evolutionary and probabilistic methods, where probabilistic
methods provide a more directed approach to generating candidate solutions than is possible
with typical evolutionary heuristics like crossover and mutation
We summarize these insights in the phrase Probabilistic Evolutionary Program Learning (PEPL);
MOSES is then one particular PEPL algorithm, and in our view a very good one. We have also
considered other related algorithms such as the PLEASURE algorithm [Goe08a] (which may
also be hybridized with MOSES), but for the time being it appears to us that MOSES satisfies
CogPrime's needs.
Our views on the fundamental role of evolutionary dynamics in intelligence were briefly pre-
sented in Chapter 3 of Part 1. Terrence Deacon said it even more emphatically: "At every step
the design logic of brains is a Darwinian logic: overproduction, variation, competition, selec-
tion ... it should not come as a surprise that this same logic is also the basis for the normal
millisecond-by-millisecond information processing that continues to adapt neural software to
the world." [Dea98] He has articulated ways in which, during neurodevelopment, different com-
putations compete with each other (e.g., to determine which brain regions are responsible for
motor control). More generally, he posits a kind of continuous flux as control shifts between
competing brain regions, again, based on high-level "cognitive demand."
Deacon's intuition is similar to the one that led Edelman to propose Neural Darwinism
[Ede93], and Calvin and Bickerton [CB00] to pose the notion of mind as a "Darwin Machine".
The latter have given plausible neural mechanisms ("Darwin Machines") for synthesizing short
"programs". These programs are for tasks such as rock throwing and sentence generation, which
are represented as coherent firing patterns in the cerebral cortex. A population of such patterns,
competing for neurocomputational territory, replicates with variations, under selection pressure
to conform to background knowledge and constraints.
To incorporate these insights, a system is needed that can recombine existing solutions in
a non-local synthetic fashion, learning nested and sequential structures, and incorporate back-
ground knowledge (e.g. previously learned routines). MOSES is a particular kind of program
evolution intended to satisfy these goals, using a combination of probability theory with ideas
drawn from genetic programming, and also incorporating some ideas we have seen in previous
chapters such as program normalization.
The main conceptual assumption about CogPrime's world, implicit in the suggestion of
MOSES as the primary program learning component, is that the goal-relevant knowledge that
cannot effectively be acquired by the other methods at CogPrime's disposal (PLN, ECAN, etc.)
forms a body of knowledge that can effectively be induced via probabilistic modeling on
the space of programs for controlling a CogPrime agent. If this is not true, then MOSES will
provide no advantage over simple methods like well-tuned hillclimbing as described in Chapter
32. If it is true, then the effort of deploying a complicated algorithm like MOSES is worthwhile.
In essence, the assumption is that there are relatively simple regularities among the programs
implementing those procedures that are most critical for a human-like intelligence to acquire
via procedure learning rather than other methods.
33.1.1 Explicit versus Implicit Evolution in CogPrime
Of course, the general importance of evolutionary dynamics for intelligence does not imply the
need to use explicit evolutionary algorithms in one's AGI system. Evolution can occur in an
intelligent system whether or not the low-level implementation layer of the system involves any
explicitly evolutionary processes. For instance it's clear that the human mind/brain involves
evolution in this sense on the emergent level - we create new ideas and procedures by varying
and combining ones that we've found useful in the past, and this occurs on a variety of levels
of abstraction in the mind. In CogPrime, however, we have chosen to implement evolutionary
dynamics explicitly, as well as encouraging them to occur implicitly.
CogPrime is intended to display evolutionary dynamics on the derived-hypergraph level,
and this is intended to be a consequence of both explicitly-evolutionary and not-explicitly-
evolutionary dynamics. Cognitive processes such as PLN inference may lead to emergent evo-
lutionary dynamics (as useful logical relationships are reasoned on and combined, leading to
new logical relationships in an evolutionary manner); even though PLN in itself is not explicitly
evolutionary in character, it becomes emergently evolutionary via its coupling with CogPrime's
attention allocation subsystem, which gives more cognitive attention to Atoms with more impor-
tance, and hence creates an evolutionary dynamic with importance as the fitness criterion and
the whole constellation of MindAgents as the novelty-generation mechanism. However, MOSES
explicitly embodies evolutionary dynamics for the learning of new patterns and procedures that
are too complex for hillclimbing or other simple heuristics to handle. And this evolutionary,
learning subsystem naturally also contributes to the creation of evolutionary patterns on the
emergent, derived-hypergraph level.
33.2 Estimation of Distribution Algorithms
There is a long history in AI of applying evolution-derived methods to practical problem-solving;
John Holland's genetic algorithm [Hol75], initially a theoretical model, has been adapted successfully
to a wide variety of applications (see e.g. the proceedings of the GECCO conferences).
Briefly, the methodology applied is as follows:
1. generate a random population of solutions to a problem
2. evaluate the solutions in the population using a predefined fitness function
3. select solutions from the population proportionate to their fitness
4. recombine/mutate them to generate a new population
5. go to step 2
Holland's paradigm has been adapted from the case of fixed-length strings to the evolution of
variable-sized and shaped trees (typically Lisp S-expressions), which in principle can represent
arbitrary computer programs [Koz92, Koz94].
Recently, replacements-for/extensions-of the genetic algorithm have been developed (for
fixed-length strings) which may be described as estimation-of-distribution algorithms (see
[Pel05] for an overview). These methods, which outperform genetic algorithms and related
techniques across a range of problems, maintain centralized probabilistic models of the population,
learned with sophisticated datamining techniques. One of the most powerful of these
methods is the Bayesian optimization algorithm (BOA) [Pel05].
The basic steps of the BOA are:
1. generate a random population of solutions to a problem
2. evaluate the solutions in the population using a predefined fitness function
3. from the promising solutions in the population, learn a generative model
4. create new solutions using the model, and merge them into the existing population
5. go to step 2.
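The five steps above can be sketched as a toy estimation-of-distribution loop over bit strings. For simplicity this sketch uses a univariate marginal model (in the style of UMDA/PBIL) in place of the Bayesian network the BOA would actually learn; all names are illustrative:

```python
import random

def eda(fitness, n_bits, pop_size=50, n_select=10, generations=30, seed=0):
    """Toy estimation-of-distribution algorithm over bit strings, using
    a univariate marginal model instead of the BOA's Bayesian network."""
    rng = random.Random(seed)
    # step 1: generate a random initial population
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # step 2: evaluate; step 3: learn a model of the promising solutions
        pop.sort(key=fitness, reverse=True)
        elite = pop[:n_select]
        probs = [sum(x[i] for x in elite) / n_select for i in range(n_bits)]
        # step 4: sample new solutions from the model and merge (step 5: loop)
        new = [[1 if rng.random() < p else 0 for p in probs]
               for _ in range(pop_size - n_select)]
        pop = elite + new
    return max(pop, key=fitness)

# OneMax: fitness is the number of ones; the optimum is the all-ones string.
best = eda(sum, n_bits=20)
```

Because the elite are carried over and the marginals sharpen each generation, the model quickly concentrates on the all-ones optimum for OneMax.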
The neurological implausibility of this sort of algorithm is readily apparent - yet recall that in
CogPrime we are attempting to roughly emulate human cognition on the level of behavior not
structure or dynamics.
Fundamentally, the BOA and its ilk (the competent adaptive optimization algorithms) dif-
fer from classic selecto-recombinative search by attempting to dynamically learn a problem
decomposition, in terms of the variables that have been pre-specified. The BOA represents
this decomposition as a Bayesian network (directed acyclic graph with the variables as nodes,
and an edge from x to y indicating that y is probabilistically dependent on x). An extension,
the hierarchical Bayesian optimization algorithm (hBOA) uses a Bayesian network with lo-
cal structure to more accurately represent hierarchical dependency relationships. The BOA and
hBOA are scalable and robust to noise across the range of nearly decomposable functions. They
are also effective, empirically, on real-world problems with unknown decompositions, which may
or may not be effectively representable by the algorithms; robust, high-quality results have been
obtained for Ising spin glasses and MaxSAT, as well as a variety of real-world problems.
33.3 Competent Program Evolution via MOSES
In this section we summarize meta-optimizing semantic evolutionary search (MOSES), a system
for competent program evolution, described more thoroughly in [Loo06]. Based on the viewpoint
developed in the previous section, MOSES is designed around the central and unprecedented
capability of competent optimization algorithms such as the hBOA, to generate new solutions
that simultaneously combine sets of promising assignments from previous solutions according
to a dynamically learned problem decomposition. The novel aspects of MOSES described herein
are built around this core to exploit the unique properties of program learning problems. This
facilitates effective problem decomposition (and thus competent optimization).
33.3.1 Statics
The basic goal of MOSES is to exploit the regularities in program spaces outlined in the previ-
ous section, most critically behavioral decomposability and white box execution, to dynamically
construct representations that limit and transform the program space being searched into a
relevant subspace with a compact problem decomposition. These representations will evolve as
the search progresses.
33.3.1.1 An Example
Let's start with an easy example. What knobs (meaningful parameters to vary) exist for the
family of programs depicted in Figure ?? on the left? We can assume, in accordance with the
principle of white box execution, that all symbols have their standard mathematical interpre-
tations, and that x, y, and z are real-valued variables.
In this case, all three programs correspond to variations on the behavior represented graph-
ically on the right in the figure. Based on the principle of behavioral decomposability, good
knobs should express plausible evolutionary variation and recombination of features in behav-
ior space, regardless of the nature of the corresponding changes in program space. It's worth
repeating once more that this goal cannot be meaningfully addressed on a syntactic level - it
requires us to leverage background knowledge of what the symbols in our vocabulary (cos, +,
0.35, etc.) actually mean.
A good set of knobs will also be orthogonal. Since we are searching through the space of
combinations of knob settings (not a single change at a time, but a set of changes), any knob
whose effects are equivalent to another knob or combination of knobs is undesirable.† Corre-
spondingly, our set of knobs should span all of the given programs (i.e., be able to represent
them as various knob settings).
A small basis for these programs could be the 3-dimensional parameter space, x1 ∈ {x, z, 0}
(left argument of the root node), x2 ∈ {y, x} (argument of cos), and x3 ∈ [0.3, 0.4] (multiplier
for the cos-expression). However, this is a very limiting view, and overly tied to the particulars
† First because this will increase the number of samples needed to effectively model the structure of knob-space,
and second because this modeling will typically be quadratic with the number of knobs, at least for the BOA
or hBOA.
220 33 Probabilistic Evolutionary Procedure Learning
of how these three programs happen to be encoded. Considering the space behaviorally (right of
Figure ??), a number of additional knobs can be imagined which might be turned in meaningful
ways, such as:
1. numerical constants modifying the phase and frequency of the cosine expression,
2. considering some weighted average of x and y instead of one or the other,
3. multiplying the entire expression by a constant,
4. adjusting the relative weightings of the two arguments to +.
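The idea of a representation as a knob space can be sketched concretely. The snippet below is a hypothetical encoding (the knob names follow the small basis described above; the continuous multiplier knob x3 is discretized here to three sample values, including the 0.35 appearing in the example vocabulary) and simply enumerates the subspace of programs that the knob settings span.

```python
from itertools import product

# The small basis described above: x1 is the left argument of the root +,
# x2 the argument of cos, x3 the multiplier of the cos-expression.
knobs = {
    "x1": ["x", "z", "0"],
    "x2": ["y", "x"],
    "x3": ["0.3", "0.35", "0.4"],  # discretization of the [0.3, 0.4] knob
}

def render(x1, x2, x3):
    """Render one knob-setting as a program (an arithmetic expression)."""
    return f"{x1} + {x3}*cos({x2})"

# The subspace defined by the representation: 3 * 2 * 3 = 18 programs.
space = [render(*setting) for setting in product(*knobs.values())]
```

Each point in this small space is a syntactically valid program, which is exactly what lets an optimization algorithm treat knob settings as its search space.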
33.3.1.2 Syntax and Semantics
This kind of representation-building calls for a correspondence between syntactic and semantic
variation. The properties of program spaces that make this difficult are over-representation
and chaotic execution, which lead to non-orthogonality, oversampling of distant behaviors, and
undersampling of nearby behaviors, all of which can directly impede effective program evolution.
Non-orthogonality is caused by over-representation. For example, based on the properties
of commutativity and associativity, a1 + a2 + ... + an may be expressed in exponentially many
different ways, if + is treated as a non-commutative and non-associative binary operator.
Similarly, operations such as addition of zero and multiplication by one have no effect, the
successive addition of two constants is equivalent to the addition of their sum, etc. These effects
are not quirks of real-valued expressions; similar redundancies appear in Boolean formulae (x
AND x ≡ x), list manipulation (cdr(cons(x, L)) ≡ L), and conditionals (if x then y else
z ≡ if NOT x then z else y).
Without the ability to exploit these identities, we are forced to work in a greatly expanded
space which represents equivalent expressions in many different ways, and will therefore be very
far from orthogonality. Completely eliminating redundancy is infeasible, and typically NP-hard
(in the domain of Boolean formulae it is reducible to the satisfiability problem, for instance),
but one can go quite far with a heuristic approach.
Oversampling of distant behaviors is caused directly by chaotic execution, as well as a
somewhat subtle effect of over-representation, which can lead to simpler programs being heavily
oversampled. Simplicity is defined relative to a given program space in terms of minimal length,
the number of symbols in the shortest program that produces the same behavior.
Undersampling of nearby behaviors is the flip side of the oversampling of distant
behaviors. As we have seen, syntactically diverse programs can have the same behavior; this
can be attributed to redundancy, as well as non-redundant programs that simply compute the
same result by different means. For example, 3*x can also be computed as x+x+x; the first
version uses fewer symbols, but neither contains any obvious "bloat" such as addition of zero or
multiplication by one. Note however that the nearby behavior of 3.1*x is syntactically close
to the former, and relatively far from the latter. The converse is the case for the behavior
of x+x+y. In a sense, these two expressions can be said to exemplify differing organizational
principles, or points of view, on the underlying function.
Differing organizational principles lead to different biases in sampling nearby behaviors. A
superior organizational principle (one leading to higher-fitness syntactically nearby programs
for a particular problem) might be considered a metaptation (adaptation at the second tier).
Since equivalent programs organized according to different principles will have identical fitness,
some methodology beyond selection for high fitness must be employed to search for good
organizational principles. Thus, the resolution of undersampling of nearby behaviors revolves
around the management of neutrality in search, a complex topic beyond the scope of this
chapter.
These three properties of program spaces greatly affect the performance of evolutionary
methods based solely on syntactic variation and recombination operators, such as local search
or genetic programming. In fact, when quantified in terms of various fitness-distance correlation
measures, they can be effective predictors of algorithm performance, although they are of course
not the whole story. A semantic search procedure will address these concerns in terms of the
underlying behavioral effects of and interactions between a language's basic operators; the
general scheme for doing so in MOSES is the topic of the next subsection.
33.3.1.3 Neighborhoods and Normal Forms
The procedure MOSES uses to construct a set of knobs for a given program (or family of
structurally related programs) is based on three conceptual steps: reduction to normal form,
neighborhood enumeration, and neighborhood reduction.
Reduction to normal form
- Redundancy is heuristically eliminated by reducing programs to a normal form. Typically, this
will be via the iterative application of a series of local rewrite rules (e.g., ∀x, x + 0 → x), until
the target program no longer changes. Note that the well-known conjunctive and disjunctive
normal forms for Boolean formulae are generally unsuitable for this purpose; they destroy the
hierarchical structure of formulae, and dramatically limit the range of behaviors (in this case
Boolean functions) that can be expressed compactly. Rather, hierarchical normal forms for
programs are required.
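A minimal sketch of this reduction process, under an assumed encoding of programs as nested tuples with string leaves (our illustration, not MOSES's actual representation): two local rewrite rules are applied bottom-up, and iterated until a fixpoint is reached.

```python
def rewrite(expr):
    """Apply the local rewrite rules once, bottom-up.
    Programs are nested tuples; leaves are variable/constant names."""
    if not isinstance(expr, tuple):
        return expr
    op = expr[0]
    args = tuple(rewrite(a) for a in expr[1:])
    if op == "+" and "0" in args:        # rule: forall x, x + 0 -> x
        rest = tuple(a for a in args if a != "0")
        if len(rest) == 1:
            return rest[0]
        args = rest if rest else ("0",)
    if op == "*" and "1" in args:        # rule: forall x, x * 1 -> x
        rest = tuple(a for a in args if a != "1")
        if len(rest) == 1:
            return rest[0]
        args = rest if rest else ("1",)
    return (op,) + args

def normal_form(expr):
    """Iterate the rules until the target program no longer changes."""
    while True:
        reduced = rewrite(expr)
        if reduced == expr:
            return expr
        expr = reduced
```

Because the rules are local and only shrink the tree, the iteration terminates; note that the result preserves the hierarchical structure of the program, unlike a conversion to conjunctive or disjunctive normal form.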
Neighborhood enumeration
- A set of possible atomic perturbations is generated for all programs under consideration (the
overall perturbation set will be the union of these). The goal is to heuristically generate new
programs that correspond to behaviorally nearby variations on the source program, in such a
way that arbitrary sets of perturbations may be composed combinatorially to generate novel
valid programs.
Neighborhood reduction
- Redundant perturbations are heuristically culled to reach a more orthogonal set. A straightfor-
ward way to do this is to exploit the reduction to normal form outlined above; if multiple knobs
lead to the same normal form program, only one of them is actually needed. Additionally,
note that the number of symbols in the normal form of a program can be used as a heuristic
approximation for its minimal length - if the reduction to normal form of the program resulting
from twiddling some knob significantly decreases its size, it can be assumed to be a source of
oversampling, and hence eliminated from consideration. A slightly smaller program is typically
a meaningful change to make, but a large reduction in complexity will rarely be useful (and
if so, can be accomplished through a combination of knobs that individually produce small
changes).
At the end of this process, we will be left with a set of knobs defining a subspace of programs
centered around a particular region in program space and heuristically centered around the cor-
responding region in behavior space as well. This is part of the meta aspect of MOSES, which
seeks not to evaluate variations on existing programs itself, but to construct parameterized
program subspaces (representations) containing meaningful variations, guided by background
knowledge. These representations are used as search spaces within which an optimization algo-
rithm can be applied.
33.3.2 Dynamics
As described above, the representation-building component of MOSES constructs a parameter-
ized representation of a particular region of program space, centered around a single program
or family of closely related programs. This is consistent with the line of thought developed above,
that a representation constructed across an arbitrary region of program space (e.g., all programs
containing less than n symbols), or spanning an arbitrary collection of unrelated programs, is
unlikely to produce a meaningful parameterization (i.e., one leading to a compact problem
decomposition).
A sample of programs within a region derived from representation-building together with
the corresponding set of knobs will be referred to herein as a deme;† a set of demes (together
spanning an arbitrary area within program space in a patchwork fashion) will be referred to as
a metapopulation.‡ MOSES operates on a metapopulation, adaptively creating, removing, and
allocating optimization effort to various demes. Deme management is the second fundamental
meta aspect of MOSES, after (and above) representation-building; it essentially corresponds to
the problem of effectively allocating computational resources to competing regions, and hence
to competing programmatic organizational-representational schemes.
33.3.2.1 Algorithmic Sketch
The salient aspects of programs and program learning lead to requirements for competent
program evolution that can be addressed via a representation-building process such as the one
shown above, combined with effective deme management. The following sketch of MOSES,
elaborating Figure 33.1 repeated here from Chapter 8 of Part 1, presents a simple control flow
that dynamically integrates these processes into an overall program evolution procedure:
1. Construct an initial set of knobs based on some prior (e.g., based on an empty program)
and use it to generate an initial random sampling of programs. Add this deme to the
metapopulation.
2. Select a deme from the metapopulation and update its sample, as follows:
† A term borrowed from biology, referring to a somewhat isolated local population of a species.
‡ Another term borrowed from biology, referring to a group of somewhat separate populations (the demes) that
nonetheless interact.
[Figure 33.1 diagram: representation-building, random sampling, and optimization components, connected by directed edges.]
Fig. 33.1: The top-level architectural components of MOSES, with directed edges indicating the
flow of information and program control.
a. Select some promising programs from the deme's existing sample to use for modeling,
according to the fitness function.
b. Considering the promising programs as collections of knob settings, generate new col-
lections of knob settings by applying some (competent) optimization algorithm.
c. Convert the new collections of knob settings into their corresponding programs, reduce
the programs to normal form, evaluate their fitness, and integrate them into the deme's
sample, replacing less promising programs.
3. For each new program that meets the criteria for creating a new deme, if any:
a. Construct a new set of knobs (via representation-building) to define a region centered
around the program (the deme's exemplar), and use it to generate a new random sam-
pling of programs, producing a new deme.
b. Integrate the new deme into the metapopulation, possibly displacing less promising
demes.
4. Repeat from step 2.
The criterion for creating a new deme is behavioral non-dominance (programs which are not
dominated by the exemplars of any existing demes are used as exemplars to create new demes),
which can be defined in a domain-specific fashion. As a default, the fitness function may be
used to induce dominance, in which case the set of exemplar programs for demes corresponds
to the set of top-fitness programs.
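The control flow of steps 1-4 can be rendered as a toy sketch. Everything below is illustrative rather than faithful: knob settings are bit vectors, the competent optimization algorithm of step 2b is replaced by one-bit perturbation of promising settings, representation-building is just random sampling around an exemplar, and deme creation uses the default fitness-induced dominance criterion.

```python
import random

def moses_sketch(fitness, n_knobs=8, deme_size=12, rounds=15, seed=0):
    """Toy rendering of the MOSES control flow (real MOSES would use the
    hBOA in step 2b and reduce programs to normal form in step 2c)."""
    rng = random.Random(seed)

    def new_deme(exemplar):
        # "representation-building": random sampling around the exemplar
        sample = [exemplar[:]]
        for _ in range(deme_size - 1):
            s = exemplar[:]
            s[rng.randrange(n_knobs)] ^= 1
            sample.append(s)
        return {"exemplar": exemplar, "sample": sample}

    metapop = [new_deme([0] * n_knobs)]                                # step 1
    for _ in range(rounds):
        deme = max(metapop, key=lambda d: fitness(d["exemplar"]))      # step 2
        promising = sorted(deme["sample"], key=fitness, reverse=True)[:3]  # 2a
        for p in promising:                                            # 2b
            child = p[:]
            child[rng.randrange(n_knobs)] ^= 1
            deme["sample"].append(child)                               # 2c
            best_ex = max(fitness(d["exemplar"]) for d in metapop)
            if fitness(child) > best_ex:        # step 3: spawn a new deme
                metapop.append(new_deme(child))
        metapop = metapop[-10:]        # 3b: displace less promising demes
    return max((s for d in metapop for s in d["sample"]), key=fitness)

best = moses_sketch(sum)   # toy fitness: maximize the number of 1-knobs
```

Even in this degenerate form the two distinctive ingredients are visible: optimization happens inside a deme's knob space, while a separate outer loop decides where demes live and which ones receive effort.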
33.3.3 Architecture
The preceding algorithmic sketch of MOSES leads to the top-level architecture depicted in
Figure ??. Of the four top-level components, only the fitness function is problem-specific. The
representation-building process is domain-specific, while the random sampling methodology
and optimization algorithm are domain-general. There is of course the possibility of improving
performance by incorporating domain and/or problem-specific bias into random sampling and
optimization as well.
33.3.4 Example: Artificial Ant Problem
Let's go through all of the steps that are needed to apply MOSES to a small problem, the
artificial ant on the Santa Fe trail [Koz92], and describe the search process. The artificial ant
domain is a two-dimensional grid landscape where each cell may or may not contain a piece of
food. The artificial ant has a location (a cell) and orientation (facing up, down, left, or right),
and navigates the landscape via a primitive sensor, which detects whether or not there is food
in the cell that the ant is facing, and primitive actuators move (take a single step forward),
right (rotate 90 degrees clockwise), and left (rotate 90 degrees counter-clockwise). The Santa
Fe trail problem is a particular 32 x 32 toroidal grid with food scattered on it (Figure 4), and a
fitness function counting the number of unique pieces of food the ant eats (by entering the cell
containing the food) within 600 steps (movement and 90 degree rotations are considered single
steps).
Programs are composed of the primitive actions taking no arguments, a conditional (if-food-
ahead),† which takes two arguments and evaluates one or the other based on whether or not
there is food ahead, and progn, which takes a variable number of arguments and sequentially
evaluates all of them from left to right. To compute a program's fitness, it is evaluated continuously until
600 time steps have passed, or all of the food is eaten (whichever comes first). Thus for example,
the program if-food-ahead(m, r) moves forward as long as there is food ahead of it, at which
point it rotates clockwise until food is again spotted. It can successfully navigate the first two
turns of the Santa Fe trail, but cannot cross "gaps" in the trail, giving it a final fitness of
11.
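This evaluation procedure is easy to reproduce on a toy grid. The sketch below uses our own encoding of programs as nested tuples and a small made-up trail rather than the actual Santa Fe trail; the interpreter charges one time step per move or rotation, exactly as described above.

```python
def eat_food(program, trail, steps=600):
    """Run an ant program repeatedly on a small toroidal grid and count
    the food eaten. `trail` is a set of (row, col) food cells."""
    rows, cols = 6, 6
    headings = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # up, right, down, left
    food = set(trail)
    r, c, h = 0, 0, 1                               # start at (0,0) facing right
    eaten = 0
    clock = 0

    def ahead():
        dr, dc = headings[h]
        return ((r + dr) % rows, (c + dc) % cols)

    def run(node):
        nonlocal r, c, h, eaten, clock
        if clock >= steps:
            return
        op = node[0] if isinstance(node, tuple) else node
        if op == "move":
            clock += 1
            r, c = ahead()
            if (r, c) in food:
                food.discard((r, c))
                eaten += 1
        elif op == "left":
            clock += 1
            h = (h - 1) % 4
        elif op == "right":
            clock += 1
            h = (h + 1) % 4
        elif op == "if-food-ahead":
            run(node[1] if ahead() in food else node[2])
        elif op == "progn":
            for child in node[1:]:
                run(child)

    while clock < steps and food:
        run(program)
    return eaten

# The example program: move while food is ahead, else rotate clockwise.
prog = ("if-food-ahead", "move", "right")
trail = {(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)}   # a small L-shaped trail
```

On this L-shaped trail the example program eats all five pieces of food, since each piece is adjacent to the previous one; introduce a gap and it stalls, just as described for the Santa Fe trail.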
The first step in applying MOSES is to decide what our reduction rules should look like.
This program space has several clear sources of redundancy leading to over-representation that
we can eliminate, leading to the following reduction rules:
1. Any sequence of rotations may be reduced to either a left rotation, a right rotation, or a
reversal, for example:
progn(left, left, left)
reduces to
right
2. Any if-food-ahead statement which is the child of an if-food-ahead statement may be elim-
inated, as one of its branches is clearly irrelevant, for example:
if-food-ahead(m, if-food-ahead(l, r))
reduces to
if-food-ahead(m, r)
† This formulation is equivalent to using a general three-argument if-then-else statement with a predicate as
the first argument, as there is only a single predicate (food-ahead) for the ant problem.
3. Any progn statement which is the child of a progn statement may be eliminated and replaced
by its children, for example:
progn(progn(left, move), move)
reduces to
progn(left, move, move)
The representation language for the ant problem is simple enough that these are the only
three rules needed - in principle there could be many more. The first rule may be seen as a
consequence of general domain-knowledge pertaining to rotation. The second and third rules
are fully general simplification rules based on the semantics of if-then-else statements and
associative functions (such as progn), respectively.
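The three rules can be sketched directly, again under an assumed encoding of programs as nested tuples (the helper names and the representation of a net half-turn as "reversal" are ours):

```python
ROT = {"left": 3, "right": 1, "reversal": 2}   # net quarter-turns clockwise
NAME = {1: "right", 2: "reversal", 3: "left"}

def reduce_ant(node):
    """Apply the three ant-domain reduction rules bottom-up."""
    if not isinstance(node, tuple):
        return node
    op, *args = node
    args = [reduce_ant(a) for a in args]
    if op == "if-food-ahead":
        t, f = args
        # rule 2: a conditional directly inside a conditional is redundant,
        # since the outer test has already decided which branch is taken
        if isinstance(t, tuple) and t[0] == "if-food-ahead":
            t = t[1]          # food is known to be ahead here
        if isinstance(f, tuple) and f[0] == "if-food-ahead":
            f = f[2]          # food is known to be absent here
        return ("if-food-ahead", t, f)
    if op == "progn":
        flat = []
        for a in args:        # rule 3: splice child progns into the parent
            flat.extend(a[1:] if isinstance(a, tuple) and a[0] == "progn" else [a])
        out = []
        for a in flat:        # rule 1: collapse runs of rotations
            if a in ROT and out and out[-1] in ROT:
                net = (ROT[out.pop()] + ROT[a]) % 4
                if net:
                    out.append(NAME[net])
            else:
                out.append(a)
        return out[0] if len(out) == 1 else ("progn", *out)
    return (op, *args)
```

Note that a run of rotations can cancel entirely (left followed by right), leaving an empty progn; as discussed later, this "do nothing" possibility is a legitimate and even useful program fragment.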
These rules allow us to naturally parameterize a knob space corresponding to a given program
(note that the arguments to the progn and if-food-ahead functions will be recursively reduced
and parameterized according to the same procedure). Rotations will correspond to knobs with
four possibilities (left, right, reversal, no rotation). Movement commands will correspond to
knobs with two possibilities (move, no movement). There is also the possibility of introducing a
new command in between, before, or after, existing commands. Some convention (a "canonical
form") for our space is needed to determine how the knobs for new commands will be introduced.
A representation consists of a rotation knob, followed by a conditional knob, followed by a
movement knob, followed by a rotation knob, etc.†
The structure of the space (how large and what shape) and default knob values will be
determined by the "exemplar" program used to construct it. The default values are used to
bias the initial sampling to focus around the prototype associated to the exemplar: all of the n
direct neighbors of the prototype are first added to the sample, followed by a random selection
of n programs at a distance of two from the prototype, n programs at a distance of three,
etc., until the entire sample is filled. Note that the hBOA can of course effectively recombine
this sample to generate novel programs at any distance from the initial prototype. The empty
program progn (which can be used as the initial exemplar for MOSES), for example, leads to
the following prototype:
progn(
  rotate? [default no rotation],
  if-food-ahead(
    progn(
      rotate? [default no rotation],
      move? [default no movement]),
    progn(
      rotate? [default no rotation],
      move? [default no movement])),
  move? [default no movement])
There are six parameters here, three which are quaternary (rotate), and three which are
binary (move). So the program
† That there is some fixed ordering on the knobs is important, so that two rotation knobs are not placed next to
each other (as this would introduce redundancy). In this case, the precise ordering chosen (rotation, conditional,
movement) does not appear to be critical.
progn(left, if-food-ahead(move, left))
would be encoded in the space as
[left, no rotation, move, left, no movement, no movement]
with knobs ordered according to a pre-order left-to-right traversal of the program's parse tree
(this is merely for exposition; the ordering of the parameters has no effect on MOSES). For
a prototype program already containing an if-food-ahead statement, nested conditionals would
be considered.
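The correspondence between knob settings and programs can be sketched as a decoder over the six-knob prototype given above (an illustrative reconstruction; the real representation also includes knobs for inserting new commands, and the reduction rules would further normalize the output):

```python
def decode(settings):
    """Decode a 6-knob setting vector (in pre-order) into an ant program,
    following the prototype above. Knobs set to 'no rotation' or
    'no movement' emit nothing; singleton progns are unwrapped."""
    r1, r2, m2, r3, m3, m1 = settings

    def seq(*knobs):
        return tuple(k for k in knobs if k not in ("no rotation", "no movement"))

    def progn_of(acts):
        return acts[0] if len(acts) == 1 else ("progn",) + acts

    cond = ("if-food-ahead", progn_of(seq(r2, m2)), progn_of(seq(r3, m3)))
    return progn_of(seq(r1) + (cond,) + seq(m1))

prog = decode(["left", "no rotation", "move", "left",
               "no movement", "no movement"])
```

Decoding the knob vector from the text recovers progn(left, if-food-ahead(move, left)), and a vector with both conditional branches switched off yields empty progn branches, the "do nothing" behavior mentioned later.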
[Figure 33.2, top panel: histogram over # program evaluations (approximately 4,000 to 20,000).]
Technique Computational Effort
Genetic Programming 450,000 evaluations
Evolutionary Programming 136,000 evaluations
MOSES 23,000 evaluations
Fig. 33.2: On the top, histogram of the number of global optima found after a given number of
program evaluations for 100 runs of MOSES on the artificial ant problem (each run is counted
once for the first global optimum reached). On the bottom, computational effort required to
find an optimal solution for various techniques with probability p=.99 (for MOSES p=1, since
an optimal solution was found in all runs).
A space with six parameters in it is small enough that MOSES can reliably find the optimum
(the program progn(right, if-food-ahead(progn(), left), move)), with a very small population. Af-
ter no further improvements have been made in the search for a specified number of generations
(calculated based on the size of the space, using a model derived from [23] that is general
to the hBOA, and not at all tuned for the artificial ant problem), a new representation is
constructed centered around this program.† Additional knobs are introduced "in between" all
† MOSES reduces the exemplar program to normal form before constructing the representation; in this
particular case however, no transformations are needed. Similarly, in general neighborhood reduction would be
existing ones (e.g., an optional move in between the first rotation and the first conditional),
and possible nested conditionals are considered (a nested conditional occurring in a sequence
after some other action has been taken is not redundant). The resulting space has 39 knobs,
still quite tractable for hBOA, which typically finds a global optimum within a few genera-
tions. If the optimum were not to be found, MOSES would construct a new (possibly larger
or smaller) representation, centered around the best program that was found, and the process
would repeat.
The artificial ant problem is well-studied, with published benchmark results available for
genetic programming as well as evolutionary programming based solely on mutation (i.e., a
form of population-based stochastic hill climbing). Furthermore, an extensive analysis of the
search space has been carried out by Langdon and Poli [LP02], with the authors concluding:
1. The problem is "deceptive at all levels", meaning that the partial solutions that must be
recombined to solve the problem to optimality have lower average fitness than the partial
solutions that lead to inferior local optima.
2. The search space contains many symmetries (e.g., between left and right rotations).
3. There is an unusually high density of global optima in the space (relative to other common
test problems).
4. Even though current evolutionary methods can solve the problem, they are not significantly
more effective (in terms of the number of program evaluations required) than random sampling.
5. "If real program spaces have the above characteristics (we expect them to do so but be still
worse) then it is important to be able to demonstrate scalable techniques on such problem
spaces".
33.3.4.1 Test Results
Koza [Koz92] reports on a set of 148 runs of genetic programming with a population size of
500 which had a 16% success rate after 51 generations when the runs were terminated (a total
of 25,500 program evaluations per run). The minimal "computational effort" needed to achieve
success with 99% probability, attained by processing through generation 14, was 450,000
(based on parallel independent runs). Chellapilla [Che97] reports 47 out of 50 successful runs
with a minimal computational effort (again, for success with 99% probability) of 136,000 for
his stochastic hill climbing method.
In our experiment with the artificial ant problem, one hundred runs of MOSES were executed.
Beyond the domain knowledge embodied in the reduction and knob construction procedure, the
only parameter that needed to be set was the population scaling factor, which was set to 30
(MOSES automatically adjusts to generate a larger population as the size of the representa-
tion grows, with the base case determined by this factor). Based on these "factory" settings,
MOSES found optimal solutions on every run out of 100 trials, within a maximum of 23,000
program evaluations (the computational effort figure corresponding to 100% success). The av-
erage number of program evaluations required was 6952, with 95% confidence intervals of ±856
evaluations.
Why does MOSES outperform other techniques? One factor to consider first is that the
language programs are evolved in is slightly more expressive than that used for the other
used to eliminate any extraneous knobs (based on domain-specific heuristics). For the ant domain however no
such reductions are necessary.
techniques; specifically, a progn is allowed to have no children (if all of its possible children are
"turned off"), leading to the possibility of if-food-ahead statements which do nothing if food
is present (or not present). Indeed, many of the smallest solutions found by MOSES exploit
this feature. This can be tested by inserting a "do nothing" operation into the terminal set for
genetic programming (for example). Indeed, this reduces the computational effort to 272,000;
an interesting effect, but still over an order of magnitude short of the results obtained with
MOSES (the success rate after 50 generations is still only 20%).
Another possibility is that the reductions in the search space via simplification of programs
alone are responsible. However, the results of past attempts at introducing program simplification
into genetic programming systems [27, 28] have been mixed; although the system may be sped
up (because programs are smaller), no dramatic improvements in results have been noted.
To be fair, these results have been primarily focused on the symbolic regression domain; I am
not aware of any results for the artificial ant problem.
The final contributor to consider is the sampling mechanism (knowledge-driven knob-creation
followed by probabilistic model-building). We can test to what extent model-building con-
tributes to the bottom line by simply disabling it and assuming probabilistic independence
between all knobs. The result here is of interest because model-building can be quite expensive
(O(n2N) per generation, where n is the problem size and N is the population sizett). In 50
independent runs of MOSES without model-building, a global optimum was still discovered in
all runs. However, the variance in the number of evaluations required was much higher (in two
cases over 100,000 evaluations were needed). The new average was 26,355 evaluations to reach
an optimum (about 3.5 times more than required with model-building). The contribution of
model-building to the performance of MOSES is expected to be even greater for more difficult
problems.
Applying MOSES without model-building (i.e., a model assuming no interactions between
variables) is a way to test the combination of representation-building with an approach re-
sembling the probabilistic incremental program learning (PIPE) algorithm [SS03], which learns
programs based on a probabilistic model without any interactions. PIPE has been shown
to provide results competitive with genetic programming on a number of problems (regression,
agent control, etc.).
It is additionally possible to look inside the models that the hBOA constructs (based on the
empirical statistics of successful programs) to see what sorts of linkages between knobs are being
learned.‡‡ For the 6-knob model given above, for instance, an analysis of the linkages learned
shows that the three most common pairwise dependencies uncovered, occurring in over 90% of
the models across 100 runs, are between the rotation knobs. No other individual dependencies
occurred in more than 32% of the models. This preliminary finding is quite significant given
Langdon and Poli's findings on symmetry, and their observation that "[t]hese symmetries lead
to essentially the same solutions appearing to be the opposite of each other. E.g. either a pair
of Right or pair of Left terminals at a particular location may be important."
In this relatively simple case, all of the components of MOSES appear to mesh together to
provide superior performance - which is promising, though it of course does not prove that
these same advantages will apply across the range of problems relevant to human-level AGI.
†† The fact that reduction to normal form tends to reduce the problem size is another synergy between it and the
application of probabilistic model-building.
‡‡ There is in fact even more information available in the hBOA models concerning hierarchy and direction of
dependence, but this is difficult to analyze.
33.3.5 Discussion
The overall MOSES design is unique. However, it is instructive at this point to compare its
two primary unique facets (representation-building and deme management) to related work in
evolutionary computation.
Rosca's adaptive representation architecture [Ros99] is an approach to program evolution
which also alternates between separate representation-building and optimization stages. It is
based on Koza's genetic programming, and modifies the representation based on a syntactic
analysis driven by the fitness function, as well as a modularity bias. The representation-building
that takes place consists of introducing new compound operators, and hence modifying the
implicit distance function in tree-space. This modification is uniform, in the sense that the new
operators can be placed in any context, without regard for semantics.
In contrast to Rosca's work and other approaches to representation-building such as Koza's
automatically defined functions [KA95], MOSES explicitly addresses the underlying (semantic)
structure of program space independently of the search for any kind of modularity or problem
decomposition. This preliminary stage critically changes neighborhood structures (syntactic
similarity) and other aggregate properties of programs.
Regarding deme management, the embedding of an evolutionary algorithm within a super-
ordinate procedure maintaining a metapopulation is most commonly associated with "island
model" architectures [SWM90]. One of the motivations articulated for using island models has
been to allow distinct islands to (usually implicitly) explore different regions of the search space,
as MOSES does explicitly. MOSES can thus be seen as a very particular kind of island model
architecture, where programs never migrate between islands (demos), and islands are created
and destroyed dynamically as the search progresses.
In MOSES, optimization does not operate directly on program space, but rather on a sub-
space defined by the representation-building process. This subspace may be considered as being
defined by a sort of template assigning values to some of the underlying dimensions (e.g., it
restricts the size and shape of any resulting trees). The messy genetic algorithm, an early
competent optimization algorithm, uses a similar mechanism - a common "competitive
template" is used to evaluate candidate solutions to the optimization problem which are
themselves underspecified. Search consequently centers on the template(s), much as search in
MOSES centers on the programs used to create new demes (and thereby new representations).
The issue of deme management can thus be seen as analogous to the issue of template selection
in the messy genetic algorithm.
33.3.6 Conclusion
Competent evolutionary optimization algorithms are a pivotal development, allowing encoded
problems with compact decompositions to be tractably solved according to normative princi-
ples. We are still faced with the problem of representation-building - casting a problem in terms
of knobs that can be twiddled to solve it. Hopefully, the chosen encoding will allow for a com-
pact problem decomposition. Program learning problems in particular rarely possess compact
decompositions, due to particular features generally present in program spaces (and in the map-
ping between programs and behaviors). This often leads to intractable problem formulations,
even if the mapping between behaviors and fitness has an intrinsic separable or nearly
decomposable structure. As a consequence, practitioners must often resort to manually carrying out
the analogue of representation-building, on a problem-specific basis. Working under the thesis
that the properties of programs and program spaces can be leveraged as inductive bias to remove
the burden of manual representation-building, leading to competent program evolution, we have
developed the MOSES system, and explored its properties.
While the discussion above has highlighted many of the features that make MOSES uniquely
powerful, in a sense it has told only half the story. Part of what makes MOSES valuable for
CogPrime is that it's good on its own; and the other part is that it cooperates well with the
other cognitive processes within CogPrime. We have discussed aspects of this already in Chapter
8 of Part 1, especially in regard to the MOSES/PLN relationship. In the following section we
proceed further to explore the interaction of MOSES with other aspects of the CogPrime system
— a topic that will arise repeatedly in later chapters as well.
33.4 Integrating Feature Selection Into the Learning Process
In the typical workflow of applied machine learning, one begins with a large number of features,
each applicable to some or all of the entities one wishes to learn about; then one applies some
feature selection heuristics to whittle down the large set of features into a smaller one; then
one applies a learning algorithm to the reduced set of features. The reason for this approach is
that the more powerful among the existing machine learning algorithms tend to get confused
when supplied with too many features. The problem with this approach is that sometimes one
winds up throwing out potentially very useful information during the feature selection phase.
This same sort of problem exists with MOSES in its simplest form, as described above.
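The "select features first, then learn" workflow described above can be sketched as follows. This is a generic filter-style illustration, not OpenCog code; the correlation scorer and the cutoff k stand in for whatever heuristics a practitioner would actually use.

```python
def abs_correlation(xs, ys):
    """Absolute Pearson correlation between one feature column and the labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov / (vx * vy) ** 0.5) if vx > 0 and vy > 0 else 0.0

def select_features(data, labels, k):
    """Filter-style feature selection: keep the k columns scoring highest
    against the labels; the learner then sees only these columns."""
    scored = sorted(range(len(data[0])),
                    key=lambda j: abs_correlation([row[j] for row in data], labels),
                    reverse=True)
    return scored[:k]
```

Any feature that does not survive the cutoff is invisible to the learner from then on, which is exactly the information loss described in the text.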
The human mind, as best we understand it, does things a bit differently than this standard
"feature selection followed by learning" process. It does seem to perform operations analogous
to feature selection, and operations analogous to the application of a machine learning algorithm
to a reduced feature set - but then it also involves feedback from these "machine learning like"
operations to the "feature selection like" operations, so that the intermediate results of learning
can cause the introduction into the learning process of features additional to those initially
selected, thus allowing the development of better learning results.
Compositional spatiotemporal deep learning (CSDLN) architectures like HTM or
DeSTIN [ARC09a], as discussed in Chapter 27, incorporate this same sort of feedback. The lower levels
of such an architecture, in effect, carry out "feature selection" for the upper levels - but then
feedback from the upper to the lower levels also occurs, thus in effect modulating the "feature
selection like" activity at the lower levels based on the more abstract learning activity on the
upper levels. However, such CSDLN architectures are specifically biased toward recognition of
certain sorts of patterns - an aspect that may be considered a bug or a feature of this class of
learning architecture, depending on the context. For visual pattern recognition, it appears to be
a feature, since the hierarchical structure of such algorithms roughly mimics the architecture of
visual cortex. For automated learning of computer programs carrying out symbolic tasks, on the
other hand, CSDLN architectures are awkward at best and probably generally inappropriate.
For cases like language learning or abstract conceptual inference, the jury is out.
In this section we explore the question of how to introduce an appropriate feedback between
feature selection and learning in the case of machine learning algorithms with general scope
and without explicit hierarchical structure - such as MOSES. We introduce a specific technique
enabling this, which we call LIFES, short for Learning-Incorporated Feature Selection. We argue
that LIFES is particularly applicable to learning problems that possess the conjunction of two
properties that we call data focusability and feature focusability. We illustrate LIFES in a
MOSES context, via describing a specific incarnation of the LIFES technique that does feature
selection repeatedly during the MOSES learning process, rather than just doing it initially prior
to MOSES learning.
33.4.1 Machine Learning, Feature Selection and AGI
The relation between feature selection and machine learning appears an excellent example of
the way that, even when the same basic technique is useful in both narrow AI and AGI, the
method of utilization is often quite different. In most applied machine learning tasks, the need
to customize feature selection heuristics for each application domain (and in some cases, each
particular problem) is not a major difficulty. This need does limit the practical utilization of
machine learning algorithms, because it means that many ML applications require an expert
user who understands something about machine learning, both to deal with feature selection
issues and to interpret the results. But it doesn't stand in the way of ML's fundamental usability.
On the other hand, in an AGI context, the situation is different, and the need for human-crafted,
context-appropriate feature selection does stand in the way of the straightforward insertion of
most ML algorithms into an integrative AGI system.
For instance, in the OpenCog integrative AGI architecture that we have co-architected,
the MOSES automated program learning algorithm plays a key role. It is OpenCog's
main algorithm for acquiring procedural knowledge, and is used for generating some sorts of
declarative knowledge as well. However, when MOSES tasks are launched automatically via
the OpenCog scheduler based on an OpenCog agent's goals, there is no opportunity for the
clever choice of feature selection heuristics based on the particular data involved. And crude
feature selection heuristics based on elementary statistics are often insufficiently effective, as
they rule out too many valuable features (and sometimes rule out the most critical features).
In this context, having a variant of MOSES that can sift through the scope of possible features
in the course of its learning is very important.
An example from the virtual dog domain pursued in our earlier work would be as follows. Each
procedure learned by the virtual dog combines a number of different actions, such as "step
forward", "bark", "turn around", "look right", "lift left front leg", etc. In the virtual dog
experiments done previously, the number of different actions permitted to the dog was less than
100, so that feature selection was not a major issue. However, this was an artifact of the
relatively simplistic nature of the experiments conducted. For a real organism, or for a robot that
learns its own behavioral procedures (say, via a deep learning algorithm) rather than using a
pre-configured set of "animated" behaviors, the number of possible behavioral procedures to
potentially be combined using a MOSES-learned program may be very large. In this case, one
must either use some crude feature selection heuristic, have a human select the features, or use
something like the LIFES approach described here. LIFES addresses a key problem in moving
from the relatively simple virtual dog work done before, to related work with virtual agents
displaying greater general intelligence.
As another example, suppose an OpenCog-controlled agent is using MOSES to learn procedures
for navigating in a dynamic environment. The features that candidate navigation procedures
will want to pay attention to may be different in a well-lit environment than in a
dark environment. However, if the MOSES learning process is being launched internally via
OpenCog's goal system, there is no opportunity for a human to adjust the feature selection
heuristics based on the amount of light in the environment. Instead, MOSES has got to figure
out what features to pay attention to all by itself. LIFES is designed to allow MOSES (or other
comparable learning algorithms) to do this.
So far we have tested LIFES in genomics and other narrow-AI application areas, as a way of
initially exploring and validating the technique. As our OpenCog work proceeds, we will explore
more AGI-oriented applications of MOSES-LIFES. This will be relatively straightforward on a
software level as MOSES is fully integrated with OpenCog.
33.4.2 Data- and Feature- Focusable Learning Problems
Learning-integrated feature selection as described here is applicable across multiple domain
areas and types of learning problem - but it is not completely broadly applicable. Rather it
is most appropriate for learning problems possessing two properties we call data focusability
and feature focusability. While these properties can be defined with mathematical rigor, here
we will not be proving any theorems about them, so we will content ourselves with semi-formal
definitions, sufficient to guide practical work.
We consider a fitness function φ, defined on a space of programs f whose inputs are
features defined on elements of a reference dataset S, and whose outputs lie in the interval
[0,1]. The features are construed as functions mapping elements of S into [0,1]. Where
F(x) = (F_1(x), ..., F_n(x)) is the set of features evaluated on x ∈ S, we use f(x) as a shorthand
for f(F(x)).
We are specifically interested in φ which are "data focusable", in the sense that, for a large
number of highly fit programs f, there is some subset S_f ⊆ S on which f is highly concentrated
(note that S_f will be different for different f). By "concentrated" it is meant that the ratio

Σ_{x ∈ S_f} f(x) / Σ_{x ∈ S} f(x)

is large. A simple case is where f is Boolean and f(x) = 1 <=> x ∈ S_f.
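This concentration measure can be made concrete with a short sketch; the names are illustrative, and the threshold for counting the ratio as "large" is left to the user.

```python
def concentration(f, S, S_f):
    """Fraction of f's total output mass over the dataset S that falls on
    the candidate focus subset S_f; values near 1 mean f is data-focused
    on S_f."""
    total = sum(f(x) for x in S)
    return sum(f(x) for x in S_f) / total if total > 0 else 0.0
```

In the Boolean case, where f(x) = 1 exactly on S_f, the ratio is exactly 1.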
One important case is where φ is "property-based", in the sense that each element x ∈ S has
some Boolean or numeric property p(x), and the fitness function φ(f) rewards f for predicting
p(x) given x for x ∈ S_f, where S_f is some non-trivial subset of S. For example, each element
of S might belong to some category, and the fitness function might represent the problem of
placing elements of S into the proper category - but with the twist that f gets rewarded if
it accurately places some subset S_f of elements in S into the proper category, even if it has
nothing to say about the elements in S but not in S_f.
For instance, consider the case where S is a set of images. Suppose the function p(x) indicates
whether the image x contains a picture of a cat or not. Then, a suitable fitness function φ would
be one measuring whether there is some non-trivially large set of images S_f so that if x ∈ S_f,
then f can accurately predict whether x contains a picture of a cat or not. A key point is that
the fitness function doesn't care whether f can accurately predict whether x contains a picture
of a cat or not, for x outside S_f.
Or, consider the case where S is a discrete series of time points, and p(x) indicates the
value of some quantity (say, a person's EEG) at a certain point in time. Then a suitable fitness
function φ might measure whether there is some non-trivially large set of time-points S_f so
that if x ∈ S_f, then f can accurately predict whether p(x) will be above a certain level L or not.
Finally, in addition to the property of data-focusability introduced above, we will concern
ourselves with the complementary property of "feature-focusability." This means that, while
the elements of S are each characterized by a potentially large set of features, there are many
highly fit programs f that utilize only a small subset of this large set of features. The case of
most interest here is where there are various highly fit programs f, each utilizing a different
small subset of the overall large set of features. In this case one has (loosely speaking) a pattern
recognition problem, with approximate solutions comprising various patterns that combine
various different features in various different ways. For example, this would be the case if
there were many different programs for recognizing pictures containing cats, each one utilizing
different features of cats and hence applying to different subsets of the overall database of
images.
There may, of course, be many important learning problems that are neither data nor feature
focusable. However, the LIFES technique presented here for integrating feature selection into
learning is specifically applicable to objective functions that are both data and feature focusable.
In this sense, the conjunction of data and feature focusability appears to be a kind of "tractabil-
ity" that allows one to bypass the troublesome separation of feature selection and learning, and
straightforwardly combine the two into a single integrated process. Being property-based in the
sense described above does not seem to be necessary for the application of LIFES, though most
practical problems do seem to be property-based.
33.4.3 Integrating Feature Selection Into Learning
The essential idea proposed here is a simple one. Suppose one has a learning problem involving
a fitness function that is both data and feature focusable. And suppose that, in the course
of learning according to some learning algorithm, one has a candidate program f, which is
reasonably fit but merits improvement. Suppose that f uses a subset F_f of the total set F of
possible input features. Then, one may do a special feature selection step, customized just for
f. Namely, one may look at the total set F of possible features, and ask which features or
small feature-sets display desirable properties on the set S_f. This will lead to a new set
of features potentially worthy of exploration; let's call it F'_f. We can then attempt to improve
f by creating variants of f introducing some of the features in F'_f - either replacing features
in F_f or augmenting them. The process of creating and refining these variants will then lead
to new candidate programs g, potentially concentrated on sets S_g different from S_f, in which
case the process may be repeated. This is what we call LIFES - Learning-Integrated Feature
Selection.
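One LIFES refinement step might look like the sketch below. Everything here is hypothetical scaffolding: `focus_set`, `feature_score`, `make_variants`, and the program representation are placeholders for whatever the host learner (e.g. MOSES) actually supplies.

```python
def lifes_step(f, used_features, all_features, data, fitness,
               focus_set, feature_score, make_variants, top_k=5):
    """One Learning-Integrated Feature Selection step for a promising
    candidate f:
      1. isolate the focus subset S_f on which f is concentrated;
      2. score every not-yet-used feature on S_f, keeping the top_k as F'_f;
      3. build variants of f that add or swap in features from F'_f,
         returning the fittest program found (possibly f itself)."""
    S_f = focus_set(f, data)
    candidates = [ft for ft in all_features if ft not in used_features]
    candidates.sort(key=lambda ft: feature_score(ft, S_f), reverse=True)
    new_features = candidates[:top_k]
    best, best_fit = f, fitness(f)
    for g in make_variants(f, used_features, new_features):
        fit = fitness(g)
        if fit > best_fit:
            best, best_fit = g, fit
    return best, new_features
```

The returned program may concentrate on a different subset S_g, so the step can be repeated until no variant improves fitness.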
As described above the LIFES process is quite general, and applies to a variety of learning
algorithms - basically any learning algorithm that includes the capability to refine a candidate
solution via the introduction of novel features. The nature of the "desirable properties" used
to evaluate candidate features or feature-sets on S_f needs to be specified, but a variety of
standard techniques may be used here (along with more advanced ideas) - for instance, in the
case where the fitness function is defined in terms of some property mapping p as described
above, then given a feature F_i, one can calculate the mutual information of F_i with p over S_f.
Other measures than mutual information may be used here as well.
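For binary features and a binary property p, the mutual-information score restricted to S_f can be computed with a plug-in estimate like the following. This is a generic sketch, not OpenCog's implementation.

```python
import math

def mi_on_focus(feature, p, S_f):
    """Plug-in mutual information (in nats) between a binary feature and a
    binary property p, estimated only over the focus set S_f."""
    pairs = [(feature(x), p(x)) for x in S_f]
    n = len(pairs)
    mi = 0.0
    for fv in (0, 1):
        for pv in (0, 1):
            pj = sum(1 for a, b in pairs if (a, b) == (fv, pv)) / n
            pa = sum(1 for a, _ in pairs if a == fv) / n
            pb = sum(1 for _, b in pairs if b == pv) / n
            if pj > 0:  # skip zero-probability cells
                mi += pj * math.log(pj / (pa * pb))
    return mi
```

A feature that perfectly predicts p on S_f scores log 2 ≈ 0.693 nats; an uninformative feature scores near 0.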
The LIFES process doesn't necessarily obviate the need for up-front feature selection. What
it does, is prevent up-front feature selection from limiting the ultimate feature usage of the
learning algorithm. It allows the initially selected features to be used as a rough initial guide
to learning - and for the candidates learned using these initial features, to then be refined and
improved using additional features chosen opportunistically along the learning path. In some
cases, the best programs ultimately learned via this approach might not end up involving any
of the initially selected features.
33.4.4 Integrating Feature Selection into MOSES Learning
The application of the general LIFES process in the MOSES context is relatively straightforward.
Quite simply, given a reasonably fit program f produced within a deme, one then isolates
the set S_f on which f is concentrated, and identifies a set F'_f of features within F that displays
desirable properties relative to S_f. One then creates a new deme f̃, with exemplar f, and with
a set of potential input features consisting of F_f ∪ F'_f.
What does it mean to create a deme f̃ with a certain set of "potential input features"
F_f ∪ F'_f? Abstractly, it means that F_f̃ = F_f ∪ F'_f. Concretely, it means that the knobs in the
new deme's exemplar must be supplied with settings corresponding to the elements of F_f ∪ F'_f.
The right way to do this will depend on the semantics of the features.
For instance, it may be that the overall feature space F is naturally divided into groups of
features. In that case, each new feature F_i in F'_f would be added, as a potential knob setting,
to any knob in f corresponding to a feature in the same group as F_i.
On the other hand, if there is no knob in f corresponding to features in F_i's knob group,
then one has a different situation, and it is necessary to "mutate" f by adding a new node
with a new kind of knob corresponding to F_i, or replacing an existing node with a new one
corresponding to F_i.
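This knob-augmentation logic can be sketched as below, with the deme's knobs represented as a table keyed by feature group. The group structure and the flat knob representation are assumptions made for illustration.

```python
def augment_knobs(knobs, new_features, group_of):
    """Wire new features into a deme's representation: each new feature
    becomes an extra setting of the existing knob for its feature group,
    or, if no knob covers that group, a brand-new knob ("mutating" the
    exemplar). `knobs` maps a group name to its list of candidate settings."""
    for feat in new_features:
        group = group_of(feat)
        if group in knobs:
            if feat not in knobs[group]:
                knobs[group].append(feat)  # new setting on an existing knob
        else:
            knobs[group] = [feat]          # new knob for an uncovered group
    return knobs
```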
33.4.5 Application to Genomic Data Classification
To illustrate the effectiveness of LIFES in a MOSES context, we now briefly describe an exam-
ple application, in the genomics domain. The application of MOSES to gene expression data is
described in more detail in [Loo07], and is only very briefly summarized here. To obtain the
results summarized here, we have used MOSES, with and without LIFES, to analyze two different
genomics datasets, including an Alzheimer's SNP (single nucleotide polymorphism) dataset
previously analyzed using ensemble genetic programming [CCP+09]. The dataset is of the form
"Case vs. Control", where the Case category consists of data from individuals with Alzheimer's
and Control consists of matched controls. MOSES was used to learn Boolean program trees
embodying predictive models that take in a subset of the genes in an individual, and output a
Boolean combination of their discretized expression values that is interpreted as a prediction of
whether the individual is in the Case or Control category. Prior to feeding them into MOSES,
expression values were first Q-normalized, and then discretized via comparison to the median
expression measured across all genes on a per-individual basis (1 for greater than the median,
0 for less than). Fitness was taken as precision, with a penalty factor restricting attention to
program trees with recall above a specified minimum level.
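That fitness can be sketched as follows; the text specifies only that recall below a minimum is penalized, so the linear shortfall penalty here is an assumption.

```python
def precision_with_recall_floor(predicted, actual, min_recall=0.5, penalty=10.0):
    """Precision over positive predictions, minus an (assumed linear)
    penalty for any shortfall of recall below min_recall."""
    tp = sum(1 for q, a in zip(predicted, actual) if q == 1 and a == 1)
    fp = sum(1 for q, a in zip(predicted, actual) if q == 1 and a == 0)
    fn = sum(1 for q, a in zip(predicted, actual) if q == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision - penalty * max(0.0, min_recall - recall)
```

With a steep penalty, any program whose recall falls below the floor scores worse than all programs that satisfy it, which is the effect of "restricting attention".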
This study was carried out, not merely for testing MOSES and LIFES, but as part of a
practical investigation into which genes and gene combinations may be the best drug targets
for Alzheimer's Disease. The overall methodology for the biological investigation is to find
a (hopefully diverse) ensemble of accurate classification models, and
then statistically observe which genes tend to occur most often in this ensemble, and which
combinations of genes tend to co-occur most often in the models in the ensemble. These most
frequent genes and combinations are taken as potential therapeutic targets for the Case cate-
gory of the underlying classification problem (which in this case denotes inflammation). This
methodology has been biologically validated by follow-up lab work in a number of cases; see
e.g. [Gea05] where this approach resulted in the first evidence of a genetic basis for Chronic
Fatigue Syndrome. A significant body of unpublished commercial work along these lines has
been done by Biomind LLC (http://biomind.com) for its various customers.
Comparing MOSES-LIFES to MOSES with conventional feature selection, we find that the
former finds model ensembles combining greater diversity with greater precision, and equivalent
recall. This is because conventional feature selection eliminates numerous genes that actually
have predictive value for the phenotype of inflammation, so that MOSES never gets to see them.
LIFES exposes MOSES to a much greater number of genes, some of which MOSES finds useful.
And LIFES enables MOSES to explore this larger space of genes without getting bollixed by
the potential combinatorial explosion of possibilities.
Algorithm     Train. Precision  Train. Recall  Test Precision  Test Recall  Selection criterion
MOSES         .81               .51            .65             .42          best training precision
MOSES         .80               .52            .69             .43          best test precision
MOSES-LIFES   .84               .51            .68             .38          best training precision
MOSES-LIFES   .82               .51            .72             .48          best test precision
Table 33.1: Impact of LIFES on MOSES classification of Alzheimer's Disease SNP data. The fitness
function sought to maximize precision consistent with a constraint of recall being at least
0.5. Precision and recall figures are average figures over 10 folds, using 10-fold cross-validation.
The results shown here are drawn from a larger set of runs, and are selected according to
two criteria: best training precision (the fair way to do it) and best test precision (just for
comparison). We see that use of LIFES increases precision by around 3% in these tests, which
is highly statistically significant according to permutation analysis.
The genomics example shows that LIFES makes sense and works in the context of MOSES,
broadly speaking. It seems very plausible that LIFES will also work effectively with MOSES
in an integrative AGI context, for instance in OpenCog deployments where MOSES is used
to drive procedure learning, with fitness functions supplied by other OpenCog components.
However, the empirical validation of this plausible conjecture remains for future work.
33.5 Supplying Evolutionary Learning with Long-Term Memory
This section introduces an important enhancement to evolutionary learning, which extends the
basic PEPL framework, by forming an adaptive hybridization of PEPL optimization with PLN
inference (rather than merely using PLN inference within evolutionary learning to aid with
modeling).
The first idea here is the use of PLN to supply evolutionary learning with a long-term memory.
Evolutionary learning approaches each problem as an isolated entity, but in reality, a CogPrime
system will be confronting a long series of optimization problems, with subtle interrelationships.
When trying to optimize the function f, CogPrime may make use of its experience in optimizing
other functions g.
Inference allows optimizers of g to be analogically transformed into optimizers of f, for
instance it allows one to conclude:
Inheritance f g
EvaluationLink f x
EvaluationLink g x
However, less obviously, inference also allows patterns in populations of optimizers of g to be
analogically transformed into patterns in populations of optimizers of f. For example, if pat is
a pattern in good optimizers of f, then we have:
InheritanceLink f g
ImplicationLink
EvaluationLink f x
EvaluationLink pat x
ImplicationLink
EvaluationLink g x
EvaluationLink pat x
(with appropriate probabilistic truth values), an inference which says that patterns in the
population of f-optimizers should also be patterns in the population of g-optimizers.
Note that we can write the previous example more briefly as:
InheritanceLink f g
ImplicationLink (EvaluationLink f) (EvaluationLink pat)
ImplicationLink (EvaluationLink g) (EvaluationLink pat)
A similar formula holds for SimilarityLinks.
We may also infer:
ImplicationLink (EvaluationLink g) (EvaluationLink pat_g)
ImplicationLink (EvaluationLink f) (EvaluationLink pat_f)
ImplicationLink
(EvaluationLink (g AND f))
(EvaluationLink (pat_g AND pat_f))
and:
ImplicationLink (EvaluationLink f) (EvaluationLink pat)
ImplicationLink (EvaluationLink -f) (EvaluationLink -pat)
Through these sorts of inferences, PLN inference can be used to give evolutionary learning
a long-term memory. allowing knowledge about population models to be transferred from one
optimization problem to another. This complements the more obvious use of inference to transfer
knowledge about specific solutions from one optimization problem to another.
For instance, in the problem of finding a compact program generating some given sequences of
bits, the system might have noticed that when the number of 0s roughly balances the number of
1s (let us call this property STR_BALANCE), successful optimizers tend to give greater biases
toward conditionals involving comparisons of the number of 0s and 1s inside the condition (let
us call this property over optimizers COMP_CARD_DIGIT_BIAS). This can be expressed in
PLN as follows:
AverageQuantifierLink (tv)
ListLink
$X
$Y
ImplicationLink
ANDLink
InheritanceLink
STR_BALANCE
$X
EvaluationLink
SUCCESSFUL_OPTIMIZER_OF
ListLink
$Y
$X
InheritanceLink
COMP_CARD_DIGIT_BIAS
$Y
which translates as: if the problem $X inherits from STR_BALANCE and $Y is a successful
optimizer of $X then, with probability p calculated according to tv, $Y tends to be biased
according to the property described by COMP_CARD_DIGIT_BIAS.
33.6 Hierarchical Program Learning
Next we discuss hierarchical program structure, and its reflection in probabilistic modeling, in
more depth. This is a surprisingly subtle and critical topic, which may be approached from
several different complementary angles. To an extent, hierarchical structure is automatically
accounted for in MOSES, but it may also be valuable to pay more explicit mind to it.
In human-created software projects, one common approach for dealing with the existence of
complex interdependencies between parts of a program is to give the program a hierarchical
structure. The program is then a hierarchical arrangement of programs within programs within
programs, each one of which has relatively simple dependencies between its parts (however its
parts may themselves be hierarchical composites). This notion of hierarchy is essential to such
programming methodologies as modular programming and object-oriented design.
Pelikan and Goldberg discuss the hierarchical nature of human problem-solving, in the context
of the hBOA (hierarchical BOA) version of BOA. However, the hBOA algorithm does not
incorporate hierarchical program structure nearly as deeply and thoroughly as the hierarchical
procedure learning approach proposed here. In hBOA the hierarchy is implicit in the models of
the evolving population, but the population instances themselves are not necessarily explicitly
hierarchical in structure. In hierarchical PEPL as we describe it here, the population consists of
hierarchically structured Combo trees, and the hierarchy of the probabilistic models corresponds
directly to this hierarchical program structure.
The ideas presented here have some commonalities with John Koza's ADFs and related tricks for
putting reusable subroutines in GP trees, but there are also some very substantial differences,
which we believe will make the current approach far more effective (though also involving
considerably more computational overhead).
We believe that this sort of hierarchically-savvy modeling is what will be needed to get
probabilistic evolutionary learning to scale to large and complex programs, just as hierarchy-
based methodologies like modular and object-oriented programming are needed to get human
software engineering to scale to large and complex programs.
33.6.1 Hierarchical Modeling of Composite Procedures in the
AtomSpace
The possibility of hierarchically structured programs is (intentionally) present in the CogPrime
design, even without any special effort to build hierarchy into the PEPL framework. Combo
trees may contain Nodes that point to PredicateNodes, which may in turn contain Combo trees,
etc. However, our current framework for learning Combo trees does not take advantage of this
hierarchy. What is needed, in order to do so, is for the models used for instance generation to
include events of the form:
Combo tree Node at position x has type PredicateNode; and the PredicateNode at position x
contains a Combo tree that possesses property P.
where x is a position in a Combo tree and P is a property that may or may not be true of
any given Combo tree. Using events like this, a relatively small program explicitly incorporating
only short-range dependencies may implicitly encapsulate long-range dependencies via the
properties P.
But where do these properties P come from? These properties should be patterns learned as
part of the probabilistic modeling of the Combo tree inside the PredicateNode at position x.
For example, if one is using a decision tree modeling framework, then the properties might be
of the form "decision tree D evaluates to True". Note that not all of these properties have to be
statistically correlated with the fitness of the PredicateNode at position x (although some of
them surely will be).
Thus we have a multi-level probabilistic modeling strategy. The top-level Combo tree has
a probabilistic model whose events may refer to patterns that are parts of the probabilistic
models of Combo trees that occur within it, and so on down.
In instance generation, when a newly generated Combo tree is given a PredicateNode at
position x, two possibilities exist:
• There is already a model for PredicateNodes at position x in Combo trees in the given
population, in which case a population of PredicateNodes potentially living at that position
is drawn from the known model, and evaluated.
• There is no such model (because it has never been tried to create a PredicateNode at
position x in this population before), in which case a new population of Combo trees is
created corresponding to the position, and evaluated.
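The two cases can be summarized in a small dispatch sketch; the per-position model table, the sampler, and the population constructor are placeholders for the actual PEPL machinery.

```python
def population_for_position(position, models, draw_from_model, fresh_population):
    """Instance generation for a PredicateNode at a given Combo tree position:
    if a model for that position already exists in this population, draw
    candidate subtrees from it; otherwise start (and later evaluate) a
    brand-new population of Combo trees for the position."""
    if position in models:
        return draw_from_model(models[position])   # case 1: known model
    return fresh_population(position)              # case 2: new position
```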
Note that the fitness of a Combo tree that is not at the top level of the overall process, is assessed
indirectly in terms of the fitness of the higher-level Combo tree in which it is embedded, due to
the requirement of having certain properties, etc.
Suppose each Combo tree in the hierarchy has on average R adaptable sub-programs (represented
as Nodes pointing to PredicateNodes containing Combo trees to be learned). Suppose
the hierarchy is K levels deep. Then we will have about R x K program tree populations in the
tree. This suggests that hierarchies shouldn't get too big, and indeed, they shouldn't need to,
for the same essential reason that human-created software programs, if well-designed, tend not
to require extremely deep and complex hierarchical structures.
One may also introduce a notion of reusable components across various program learning
runs, or across several portions of the same hierarchical program. Here one learns patterns
of the form:
If property P1(C,x) applies to a Combo tree C and a node x within it, then it is often good
for node x to refer to a PredicateNode containing a Combo tree with property P2.
These patterns may be assigned probabilities and may be used in instance generation. They
are general or specialized programming guidelines, which may be learned over time.
33.6.2 Identifying Hierarchical Structure In Combo trees via
MetaNodes and Dimensional Embedding
One may also apply the concepts of the previous section to model a population of CTs that
doesn't explicitly have a hierarchical structure, via introducing the hierarchical structure during
the evolutionary process, through the introduction of special extra Combo tree nodes called
MetaNodes. For instance, MetaNodes may represent subtrees of Combo trees which have proved
useful enough that it seems justifiable to extract them as "macros." This concept may be
implemented in a couple of different ways; here we will introduce a simple way of doing this based
on dimensional embedding, and then in the next section we will allude to a more sophisticated
approach that uses inference instead.
The basic idea is to couple decision tree modeling with dimensional embedding of subtrees,
a trick that enables small decision tree models to cover large regions of a CT in an approximate
way, and which leads naturally to a form of probabilistically-guided crossover.
The approach as described here works most simply for CTs that have many subtrees that can
be viewed as mapping numerical inputs into numerical outputs. There are clear generalizations
to other sorts of CTs, but it seems advisable to test the approach on this relatively simple case
first.
The first part of the idea is to represent subtrees of a CT as numerical vectors in a relatively
low-dimensional space (say N=50 dimensions). This can be done using our existing dimensional
embedding algorithm, which maps any metric space of entities into a dimensional space. All
that's required is that we define a way of measuring distance between subtrees. If we look at
subtrees with numerical inputs and outputs, this is easy. Such a subtree can be viewed as a
function mapping R^n into R^m, and there are many standard ways to calculate the distance
between two functions of this sort (for instance one can make a Monte Carlo estimate of the
Lp metric, which is defined as
[ Sum_x (f(x) - g(x))^p ]^(1/p) ).
Of course, the same idea works for subtrees with non-numerical inputs and outputs; the
tuning and implementation are just a little trickier.
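For concreteness, such a Monte Carlo estimate of the Lp distance between two subtree-functions might be sketched as follows; the function names and the uniform sampling domain are assumptions of this illustration, not details given in the text:

```python
import random

def lp_distance(f, g, sample_domain, p=2, n_samples=1000, rng=None):
    """Monte Carlo estimate of the Lp distance between two functions,
    sampled at points drawn by sample_domain(rng)."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        x = sample_domain(rng)
        total += abs(f(x) - g(x)) ** p
    return (total / n_samples) ** (1.0 / p)

# Example: two subtree-functions differing by a constant offset of 1,
# compared over uniform samples on [0, 1]; their L2 distance is exactly 1.
d = lp_distance(lambda x: x * x,
                lambda x: x * x + 1.0,
                sample_domain=lambda rng: rng.uniform(0.0, 1.0))
```

The same skeleton works for any input domain for which a sampler can be written, which is what makes the embedding approach applicable beyond purely numerical subtrees.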
Next, one can augment a CT with MetaNodes that correspond to subtrees. Each MetaNode is
of a special CT node type MetaNode, and comes tagged with an N-dimensional vector. Exactly
which subtrees to replace with MetaNodes is an interesting question that must be solved via
some heuristics.
Then, in the course of executing the PEPL algorithm, one does decision tree modeling as
usual, but making use of MetaNodes as well as ordinary CT nodes. The modeling of MetaNodes
is quite similar to the modeling of Nodes representing ConceptNodes and PredicateNodes using
embedding vectors. In this way, one can use standard, small decision tree models to model fairly
large portions of CTs (because portions of CTs are approximately represented by MetaNodes).
But how does one do instance generation, in this scheme? What happens when one tries to do
instance generation using a model that predicts a MetaNode existing in a certain location in a
CT? Then, the instance generation process has got to find some CT subtree to put in the place
where the MetaNode is predicted. It needs to find a subtree whose corresponding embedding
vector is close to the embedding vector stored in the MetaNode. But how can it find such a
subtree?
There seem to be two ways:
1. A reasonable solution is to look at the database of subtrees that have been seen before
in the evolving population, and choose one from this database, with the probability of
choosing subtree X decreasing with the distance between X's embedding vector and
the embedding vector stored in the MetaNode.
2. One can simply choose good subtrees, where the goodness of a subtree is judged by the
average fitness of the instances containing the target subtree.
One can use a combination of both of these processes during instance generation.
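A minimal sketch of how these two heuristics might be combined; the inverse-distance closeness score, the additive weighting, and the roulette-wheel sampling are assumptions of this illustration, not details specified in the text:

```python
import math
import random

def choose_subtree(candidates, target_vec, fitness_of, w_dist=1.0, w_fit=1.0,
                   rng=None):
    """Pick a subtree to substitute for a MetaNode: score each candidate
    (tree, embedding_vector) pair by closeness of its embedding to the
    MetaNode's vector, plus the average fitness of instances containing it,
    then sample proportionally to the combined score."""
    rng = rng or random.Random(0)
    def closeness(vec):
        return 1.0 / (1.0 + math.dist(vec, target_vec))  # higher when closer
    scores = [w_dist * closeness(vec) + w_fit * fitness_of(tree)
              for tree, vec in candidates]
    r = rng.uniform(0.0, sum(scores))
    acc = 0.0
    for (tree, _), s in zip(candidates, scores):
        acc += s
        if r <= acc:
            return tree
    return candidates[-1][0]

# A close, highly fit candidate dominates a distant, unfit one:
pick = choose_subtree(
    candidates=[("tree_a", [0.0, 0.0]), ("tree_b", [10.0, 10.0])],
    target_vec=[0.0, 0.0],
    fitness_of={"tree_a": 1000.0, "tree_b": 0.0}.get)
```

The relative weights of embedding closeness versus observed fitness would, in a real system, themselves be tunable parameters.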
But of course, what this means is that we're in a sense doing a form of crossover, because we're
generating new instances that combine subtrees from previous instances. But we're combining
subtrees in a judicious way guided by probabilistic modeling, rather than in a random way as
in GP-style crossover.
33.6.2.1 Inferential MetaNodes
MetaNodes are an interesting and potentially powerful technique, but we don't believe that
they, or any other algorithmic trick, are going to be the solution to the problem of learning
hierarchical procedures. We believe that this is a cognitive science problem that probably isn't
amenable to a purely computer science oriented solution. In other words, we suspect that the
correct way to break a Combo tree down into hierarchical components depends on context;
algorithms are of course required, but they're algorithms for relating a CT to its context rather
than pure CT-manipulation algorithms. Dimensional embedding is arguably a tool for capturing
contextual relationships, but it's a very crude one.
Generally speaking, what we need to be learning are patterns of the form "A subtree meeting
requirements X is often fit when linked to a subtree meeting requirements Y, when solving a
problem of type Z". Here the context requirements Y will not pertain to absolute tree position
but rather to abstract properties of a subtree.
The MetaNode approach as outlined above is a kind of halfway measure toward this goal,
good because of its relative computational efficiency, but ultimately too limited in its power
to deal with really hard hierarchical learning problems. The reason the MetaNode approach is
crude is simply because it involves describing subtrees via points in an embedding space. We
believe that the correct (but computationally expensive) approach is indeed to use MetaNodes
- but with each MetaNode tagged, not with coordinates in an embedding space, but with a
set of logical relationships describing the subtree that the MetaNode stands for. A candidate
subtree's similarity to the MetaNode may then be determined by inference rather than by the
simple computation of a distance between points in the embedding space. (And, note that we
may have a hierarchy of MetaNodes, with small subtrees corresponding to MetaNodes, larger
subtrees comprising networks of small subtrees also corresponding to MetaNodes, etc.)
The question then becomes which logical relationships one tries to look for, when characterizing
a MetaNode. This may be partially domain-specific, in the sense that different properties
will be more interesting when studying motor-control procedures than when studying cognitive
procedures.
To intuitively understand the nature of this idea, let's consider some abstract but common-
sense examples. Firstly, suppose one is learning procedures for serving a ball in tennis. Suppose
all the successful procedures work by first throwing the ball up really high, then doing other
stuff. The internal details of the different procedures for throwing the ball up really high may
be wildly different. What we need is to learn the pattern
Implication
Inheritance X "throwing the ball up really high"
"X then Y" is fit
Here X and Y are MetaNodes. But the question is: how do we learn to break trees down into
MetaNodes according to the formula "tree = 'X then Y' where X inherits from 'throwing the ball
up really high'"?
Similarly, suppose one is learning procedures to do first-order inference. What we need is to
learn a pattern such as:
Implication
AND
F involves grabbing pairs from the AtomTable
G involves applying an inference rule to each such pair
H involves putting the results back in the AtomTable
"F I G (H)))" is fit
Here we need MetaNodes for F, G and H, but we need to characterize e.g. the MetaNode F
by a relationship such as "involves grabbing pairs from the AtomTable."
Until we can characterize MetaNodes using abstract descriptors like this, one might argue
we're just doing "statistical learning" rather than "general intelligence style" procedure learning.
But to do this kind of abstraction intelligently seems to require some background knowledge
about the domain.
In the "throwing the ball up really high" case the assignment of a descriptive relationship
to a subtree involves looking, not at the internals of the subtree itself, but at the state of the
world after the subtree has been executed.
In the "grabbing pairs from the AtomTable" case it's a bit simpler but still requires some
kind of abstract model of what the subtree is doing, i.e. a model involving a logic expression
such as "The output of F is a set S so that if P belongs to S then P is a set of two Atoms A1
and A2, and both A1 and A2 were produced via the getAtom operator."
How can this kind of abstraction be learned? It seems unlikely that abstractions like this will
be found via evolutionary search over the space of all possible predicates describing program
subtrees. Rather, they need to be found via probabilistic reasoning based on the terms combined
in subtrees, put together with background knowledge about the domain in which the fitness
function exists. In short, integrative cognition is required to learn hierarchically structured pro-
grams in a truly effective way, because the appropriate hierarchical breakdowns are contextual
in nature, and to search for appropriate hierarchical breakdowns without using inference to take
context into account, involves intractably large search spaces.
33.7 Fitness Function Estimation via Integrative Intelligence
If instance generation is very cheap and fitness evaluation is very expensive (as is the case
in many applications of evolutionary learning in CogPrime), one can accelerate evolutionary
learning via a "fitness function estimation" approach. Given a fitness function embodied in a
predicate P, the goal is to learn a predicate Q so that:
1. Q is much cheaper than P to evaluate, and
2. There is a high-strength relationship:
Similarity Q P
or else
ContextLink C (Similarity Q P)
where C is a relevant context.
Given such a predicate Q, one could proceed to optimize P by ignoring evolutionary learning
altogether and just repeatedly following the algorithm:
• Randomly generate N candidate solutions.
• Evaluate each of the N candidate solutions according to Q.
• Take the k ≪ N solutions that satisfy Q best, and evaluate them according to P.
As this loop iterates, Q may be improved based on the new evaluations of P that are done. Of
course, this would not be as good
as incorporating fitness function estimation into an overall evolutionary learning framework.
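The loop above might be sketched as follows, with the cheap estimator Q used as a filter before the expensive P is consulted; the toy candidate generator and the surrogate Q here are illustrative assumptions:

```python
import random

def optimize_with_estimator(generate, Q, P, n=200, k=0.1, rounds=5, rng=None):
    """Repeatedly: generate n random candidates, score them all with the
    cheap estimator Q, then spend the expensive fitness function P only on
    the top k*n fraction.  Returns the best candidate found according to P."""
    rng = rng or random.Random(0)
    best, best_fit = None, float("-inf")
    for _ in range(rounds):
        pool = [generate(rng) for _ in range(n)]
        pool.sort(key=Q, reverse=True)        # best-by-Q first
        for s in pool[:max(1, int(k * n))]:
            f = P(s)                          # expensive evaluation
            if f > best_fit:
                best, best_fit = s, f
    return best, best_fit

# Toy usage: pretend P is expensive; Q is a cheap monotone surrogate of it.
P = lambda x: -(x - 3.0) ** 2
Q = lambda x: -abs(x - 3.0)
best, fit = optimize_with_estimator(lambda rng: rng.uniform(0, 10), Q, P)
```

Only about k of the candidates ever touch P, which is the entire point of the estimation approach when P involves, say, running a simulated environment.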
Heavy utilization of fitness function estimation may be appropriate, for example, if the
entities being evolved are schemata intended to control an agent's actions in a real or simulated
environment. In this case the specification predicate P, in order to evaluate P(S), has to actually
use the schema S to control the agent in the environment. So one may search for Q that do
not involve any simulated environment, but are constrained to be relatively small predicates
involving only cheap-to-evaluate terms (e.g. one may allow standard combinators, numbers,
strings, ConceptNodes, and predicates built up recursively from these). Then Q will be an
abstract predictor of concrete environment success.
We have left open the all-important question of how to find the "specification approximating
predicate" Q.
One approach is to use evolutionary learning. In this case, one has a population of predi-
cates, which are candidates for Q. The fitness of each candidate Q is judged by how well it
approximates P over the set of candidate solutions for P that have already been evaluated. If
one uses evolutionary learning to evolve Qs, then one is learning a probabilistic model of the
set of Qs, which tries to predict what sort of Qs will better solve the optimization problem
of approximating P's behavior. Of course, using evolutionary learning for this purpose potentially
initiates an infinite regress, but the regress can be stopped by, at some level, finding Qs
using a non-evolutionary learning based technique such as genetic programming, or a simple
evolutionary learning based technique like standard BOA programming.
Another approach to finding Q is to use inference based on background knowledge. Of course,
this is complementary rather than contradictory to using evolutionary learning for finding Q.
There may be information in the knowledge base that can be used to "analogize" regarding
which Qs may match P. Indeed, this will generally be the case in the example given above,
where P involves controlling actions in a simulated environment but Q does not.
An important point is that, if one uses a certain Q1 within fitness estimation, the evidence
one gains by trying Q1 on numerous fitness cases may be utilized in future inferences regarding
other Q2 that may serve the role of Q. So, once inference gets into the picture, the quality
of fitness estimators may progressively improve via ongoing analogical inference based on the
internal structures of the previously attempted fitness estimators.
Section V
Declarative Learning
Chapter 34
Probabilistic Logic Networks
Co-authored with Matthew Iklé
34.1 Introduction
Now we turn to CogPrime's methods for handling declarative knowledge - beginning with
a series of chapters discussing the Probabilistic Logic Networks (PLN) [GIGH08] approach
to uncertain logical reasoning, and then turning to chapters on pattern mining and concept
creation. In this first of the chapters on PLN, we give a high-level overview, summarizing
material given in the book Probabilistic Logic Networks [GIGH08] more compactly and in
a somewhat differently-organized way. For a more thorough treatment of the concepts and
motivations underlying PLN, the reader is encouraged to read [GIGH08].
PLN is a mathematical and software framework for uncertain inference, operative within
the CogPrime software framework and intended to enable the combination of probabilistic
truth values with general logical reasoning rules. Some of the key requirements underlying the
development of PLN were the following:
• To enable uncertainty-savvy versions of all known varieties of logical reasoning, including for
instance higher-order reasoning involving quantifiers, higher-order functions, and so forth
• To reduce to crisp "theorem prover" style behavior in the limiting case where uncertainty
tends to zero
• To encompass inductive and abductive as well as deductive reasoning
• To agree with probability theory in those reasoning cases where probability theory, in its
current state of development, provides solutions within reasonable calculational effort based
on assumptions that are plausible in the context of real-world embodied software systems
• To gracefully incorporate heuristics not explicitly based on probability theory, in cases
where probability theory, at its current state of development, does not provide adequate
pragmatic solutions
• To provide "scalable" reasoning, in the sense of being able to carry out inferences involving
billions of premises.
• To easily accept input from, and send input to, natural language processing software systems
In practice, PLN consists of
• a set of inference rules (e.g. deduction, Bayes rule, variable unification, modus ponens, etc.),
each of which takes one or more logical relationships or terms (represented as CogPrime
Atoms) as inputs, and produces others as outputs
• specific mathematical formulas for calculating the probability value of the conclusion of
an inference rule based on the probability values of the premises plus (in some cases)
appropriate background assumptions.
PLN also involves a particular approach to estimating the confidence values with which
these probability values are held (weight of evidence, or second-order uncertainty). Finally,
the implementation of PLN in software requires important choices regarding the structural
representation of inference rules, and also regarding "inference control" - the strategies required
to decide what inferences to do in what order, in each particular practical situation. Currently
PLN is being utilized to enable an animated agent to achieve goals via combining actions in
a game world. For example, it can figure out that to obtain an object located on top of a
wall, it may want to build stairs leading from the floor to the top of the wall. Earlier PLN
applications have involved simpler animated agent control problems, and also other domains,
such as reasoning based on information extracted from biomedical text using a language parser.
For all its sophistication, however, PLN falls prey to the same key weakness as other logical
inference systems: combinatorial explosion. In trying to find a logical chain of reasoning leading
to a desired conclusion, or to evaluate the consequences of a given set of premises, PLN may
need to explore an unwieldy number of possible combinations of the Atoms in CogPrime's
memory. For PLN to be practical beyond relatively simple and constrained problems (and most
definitely, for it to be useful for AGI at the human level or beyond), it must be coupled with a
powerful method for "inference tree pruning" - for paring down the space of possible inferences
that the PLN engine must evaluate as it goes about its business in pursuing a given goal in a
certain context. Inference control will be addressed in Chapter 36.
34.2 A Simple Overview of PLN
The key elements of PLN are its rules and formulas. In general, a PLN rule has
• Input: A tuple of Atoms (which must satisfy certain criteria, specific to the Rule)
• Output: A tuple of Atoms
Actually, in nearly all cases, the output is a single Atom; and the input is a single Atom or a
pair of Atoms.
The prototypical example is the DeductionRule. Its input must look like
X_Link A B
X_Link B C
And its output then looks like
X_Link A C
Here, X_Link may be either InheritanceLink, SubsetLink, ImplicationLink or
ExtensionalImplicationLink.
A PLN formula goes along with a PLN rule, and tells the uncertain truth value of the output,
based on the uncertain truth value of the input. For example, if we have
X_Link A B <sAB>
X_Link B C <sBC>
then the standard PLN deduction formula tells us
X_Link A C <sAC>
with
sAC = sAB sBC + (1 - sAB)(sC - sB sBC) / (1 - sB)
where e.g. sA denotes the strength of the truth value of node A.
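For concreteness, the standard PLN deduction strength formula can be transcribed directly into code (note that it presupposes sB < 1):

```python
def pln_deduction(sAB, sBC, sB, sC):
    """Standard PLN deduction strength formula, assuming the premises
    A->B and B->C are independent (requires sB < 1)."""
    return sAB * sBC + (1.0 - sAB) * (sC - sB * sBC) / (1.0 - sB)

# Sanity check: if A is wholly contained in B (sAB = 1), the second term
# vanishes and sAC collapses to sBC.
s = pln_deduction(sAB=1.0, sBC=0.7, sB=0.5, sC=0.6)  # -> 0.7
```

The term strengths sB and sC enter the formula because, when A->B is uncertain, the portion of A outside B must be routed to C using only the background probabilities of B and C.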
In this example, the uncertain truth value of each Atom is given as a single "strength" number.
In general, uncertain truth values in PLN may take multiple forms, such as
• Single strength values like .8, which may indicate probability or fuzzy truth value, depending
on the Atom type
• (strength, confidence) pairs like (.8, .4)
• (strength, count) pairs like (.8, 15)
• indefinite probabilities like (.6, .9, .95) which indicate credible intervals of probabilities
34.2.1 Forward and Backward Chaining
Typical patterns of usage of PLN are forward-chaining and backward-chaining inference.
Forward chaining basically means:
1. Given a pool (a list) of Atoms of interest
2. One applies PLN rules to these Atoms, to generate new Atoms, hopefully also of interest
3. Adding these new Atoms to the pool, one returns to Step 1
EXAMPLE: "People are animals" and "animals breathe" are in the pool of Atoms. These are
combined by the Deduction rule to form the conclusion "people breathe".
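The three steps above can be sketched as a toy forward chainer; the tuple encoding of Inheritance links is an assumption of the example, not CogPrime's actual Atom representation:

```python
def forward_chain(pool, rules, max_steps=100):
    """Toy forward chainer: repeatedly apply every rule to the current pool
    of atoms and add new conclusions back into the pool, stopping at a
    fixed point or when the step budget is exhausted."""
    pool = set(pool)
    for _ in range(max_steps):
        new = set()
        for rule in rules:
            new |= rule(pool) - pool
        if not new:
            break
        pool |= new
    return pool

def deduction(pool):
    # From ("inh", A, B) and ("inh", B, C), conclude ("inh", A, C).
    return {("inh", a, c)
            for (t1, a, b1) in pool if t1 == "inh"
            for (t2, b2, c) in pool if t2 == "inh" and b1 == b2}

atoms = {("inh", "people", "animals"), ("inh", "animals", "breathe")}
result = forward_chain(atoms, [deduction])   # now contains people->breathe
```

A real forward chainer would also propagate truth values and prioritize which Atoms to combine, which is exactly the inference-control problem discussed below.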
Backward chaining falls into two cases. First:
• "'Truth value query."' Given a target Atom whose truth value is not known (or is too
uncertainly known), plus a pool of Atoms, find a way to estimate the truth value of the
target Atom, via combining the Atoms in the pool using the inference Rules
EXAMPLE: The target is "do people breathe?" (InheritanceLink people breathe). The truth
value of the target is estimated via doing the inference "People are animals, animals breathe,
therefore people breathe."
Second:
• "'Variable fulfillment query"'. Given a target Link (Atoms may be Nodes or Links) with
one or more VariableAtoms among its targets, figure out what Atoms may be put in place
of these VariableAtoms, so as to give the target Link a high strength* confidence (i.e. a
"high truth value").
EXAMPLE: The target is "what breathes?", i.e. "InheritanceLink $X breathe"... Direct lookup
into the Atomspace reveals the Atom "InheritanceLink animal breathe", indicating that the slot
$X may be filled by "animal". Inference reveals that "InheritanceLink people breathe", so that the
slot $X may also be filled by "people".
EXAMPLE: the target is "what breathes and adds", i.e. "(InheritanceLink $X breathe) AND
(InheritanceLink $X add)". Inference reveals that the slot $X may be filled by "people" but not
"cats" or "computers."
Common-sense inference may involve a combination of backward chaining and forward chain-
ing.
The hardest part of inference is "inference control" - that is, knowing which among the
many possible inference steps to take, in order to obtain the desired information (in backward
chaining) or to obtain interesting new information (in forward chaining). In an Atomspace
with a large number of (often quite uncertain) Atoms, there are many, many possibilities and
powerful heuristics are needed to choose between them. The best guide to inference control is
some sort of induction based on the system's past history of which inferences have been useful.
But of course, a young system doesn't have much history to go on. And relying on indirectly
relevant history is, itself, an inference problem - which can be solved best by a system with
some history to draw on!
34.3 First Order Probabilistic Logic Networks
We now review the essentials of PLN in a more formal way. PLN is divided into first-order and
higher-order sub-theories (FOPLN and HOPLN). These terms are used in a nonstandard way
drawn conceptually from NARS [Wan06]. We develop FOPLN first, and then derive HOPLN
therefrom.
FOPLN is a term logic, involving terms and relationships (links) between terms. It is an
uncertain logic, in the sense that both terms and relationships are associated with truth value
objects, which may come in multiple varieties ranging from single numbers to complex structures
like indefinite probabilities. Terms may be either elementary observations, or abstract tokens
drawn from a token-set T.
34.3.1 Core FOPLN Relationships
"Core FOPLN" involves relationships drawn from the set: negation; Inheritance and probabilis-
tic conjunction and disjunction; Member and fuzzy conjunction and disjunction. Elementary
observations can have only Member links, while token terms can have any kinds of links. PLN
makes clear distinctions, via link type semantics, between probabilistic relationships and fuzzy
set relationships. Member semantics are usually fuzzy relationships (though they can also be
crisp), whereas Inheritance relationships are probabilistic, and there are rules governing the
interoperation of the two types.
Suppose a virtual agent makes an elementary VisualObservation o of a creature named Fluffy.
The agent might classify o as belonging, with degree 0.9, to the fuzzy set of furry objects. The
agent might also classify o as belonging with degree 0.8 to the fuzzy set of animals. The agent
could then build the following links in its memory:
Member o furry < 0.9 >
Member o animals < 0.8 >
The agent may later wish to refine its knowledge, by combining these MemberLinks. Using
the minimum fuzzy conjunction operator, the agent would conclude:
fuzzyAND < 0.8 >
Member o furry
Member o animals
meaning that the observation o is a visual observation of a fairly furry animal object.
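The minimum fuzzy conjunction used in this step is, rendered minimally:

```python
def fuzzy_and(*degrees):
    """Minimum-based fuzzy conjunction, the default PLN choice for
    combining fuzzy membership degrees."""
    return min(degrees)

# Combining the two MemberLinks from the example above:
strength = fuzzy_and(0.9, 0.8)  # -> 0.8, matching the fuzzyAND strength
```

Other t-norms (e.g. the product) could be substituted consistently with the overall PLN framework; min/max is simply the default.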
The semantics of (extensional) Inheritance are quite different from, though related to, those
of the MemberLink. ExtensionalInheritance represents a purely conditional probabilistic subset
relationship and is represented through the Subset relationship. If A is Fluffy and B is the set
of cats, then the statement
Subset < 0.9 >
A
B
means that
P(x is in the set B | x is in the set A) = 0.9.
34.3.2 PLN Truth Values
PLN is equipped with a variety of different truth-value types. In order of increasing
information about the full probability distribution, they are:
• strength truth-values, which consist of single numbers; e.g., < s > or < .8 >. Usually
strength values denote probabilities but this is not always the case.
• SimpleTruthValues, consisting of pairs of numbers. These pairs come in two forms: < s, w >,
where s is a strength and w is a "weight of evidence", and < s, N >, where N is a "count."
"Weight of evidence" is a qualitative measure of belief, while "count" is a quantitative
measure of accumulated evidence.
• IndefiniteTruthValues, which quantify truth-values in terms of an interval [L, U], a credibility
level b, and an integer k (called the lookahead). IndefiniteTruthValues quantify the idea
that after k more observations there is a probability b that the conclusion of the inference
will appear to lie in [L, U].
• DistributionalTruthValues, which are discretized approximations to entire probability
distributions.
34.3.3 Auxiliary FOPLN Relationships
Beyond the core FOPLN relationships, FOPLN involves additional relationship types of two
varieties. There are simple ones like Similarity, defined by
Similarity A B
We say a relationship R is simple if the truth value of R A B can be calculated solely in terms of
the truth values of core FOPLN relationships between A and B. There are also complex "auxiliary"
relationships like IntensionalInheritance, which, as discussed in depth in Appendix
??, measures the extensional inheritance between the set of properties or patterns associated
with one term and the corresponding set associated with another.
Returning to our example, the agent may observe that two properties of cats are that they
are furry, and purr. Since Fluffy is also a furry animal, the agent might then obtain, for
example
IntensionalInheritance < 0.5 >
Fluffy
cat
meaning that Fluffy shares about 50% of the properties of cat. Building upon this relationship
even further, PLN also has a mixed intensional/extensional Inheritance relationship which
is defined simply as the disjunction of the Subset and IntensionalInheritance relationships.
As this example illustrates, for a complex auxiliary relationship R, the truth value of R A B
is defined in terms of the truth values of a number of different FOPLN relationships among
different terms (others than A and B), specified by a certain mathematical formula.
34.3.4 PLN Rules and Formulas
A distinction is made in PLN between rules and formulas. PLN logical inferences take the
form of "syllogistic rules," which give patterns for combining statements with matching terms.
Examples of PLN rules include, but are not limited to,
• deduction ((A → B) ∧ (B → C) ⇒ (A → C)),
• induction ((A → B) ∧ (A → C) ⇒ (B → C)),
• abduction ((A → C) ∧ (B → C) ⇒ (A → B)),
• revision, which merges two versions of the same logical relationship that have different truth
values,
• inversion ((A → B) ⇒ (B → A)).
The basic schematic of the first four of these rules is shown in Figure 34.1. We can see that the
first three rules represent the natural ways of doing inference on three interrelated terms. We
can also see that induction and abduction can be obtained from the combination of deduction
and inversion, a fact utilized in PLN's truth value formulas.
Related to each rule is a formula which calculates the truth value resulting from application
of the rule. As an example, suppose sA, sB, sC, sAB, and sBC represent the truth values for the
terms A, B, C, as well as the truth values of the relationships A → B and B → C, respectively.
Then, under suitable conditions imposed upon these input truth values, the formula for the
deduction rule is given by:
sAC = sAB sBC + (1 - sAB)(sC - sB sBC) / (1 - sB)
Fig. 34.1: The four most basic first-order PLN inference rules (panels: deduction, abduction, induction, revision)
where sAC represents the truth value of the relationship A → C. This formula is directly derived
from probability theory given the assumption that A → B and B → C are independent.
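The earlier observation that induction and abduction follow from deduction plus inversion (Bayes rule) can be made concrete as follows; the argument orderings are an illustrative convention of this sketch:

```python
def inversion(sAB, sA, sB):
    """Bayes rule: strength of B -> A from A -> B and term probabilities."""
    return sAB * sA / sB

def deduction(sXY, sYZ, sY, sZ):
    """PLN deduction strength for X -> Z via intermediate term Y."""
    return sXY * sYZ + (1.0 - sXY) * (sZ - sY * sYZ) / (1.0 - sY)

def induction(sAB, sAC, sA, sB, sC):
    """(A -> B) and (A -> C) => (B -> C): invert A -> B, then deduce via A."""
    return deduction(inversion(sAB, sA, sB), sAC, sA, sC)

def abduction(sAC, sBC, sB, sC):
    """(A -> C) and (B -> C) => (A -> B): invert B -> C, then deduce via C."""
    return deduction(sAC, inversion(sBC, sB, sC), sC, sB)
```

Composing the two rules in this way is exactly the fact, noted above, that the three-term inference patterns are not independent primitives.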
For inferences involving solely fuzzy operators, the default version of PLN uses standard
fuzzy logic with min/max truth value formulas (though alternatives may also be employed
consistently with the overall PLN framework). Finally, the semantics of combining fuzzy and
probabilistic operators is hinted at in [GIGH08] but addressed more rigorously in 'GUN, which
gives a precise semantics for constructs of the form
Inheritance A B
where A and B are characterized by relationships of the form Member C A, Member D B, etc.
It is easy to see that, in the crisp case, where all MemberLinks and InheritanceLinks have
strength 0 or 1, FOPLN reduces to standard propositional logic. Where inheritance is crisp
but membership isn't, FOPLN reduces to higher-order fuzzy logic (including fuzzy statements
about terms or fuzzy statements, etc.).
34.3.5 Inference Trails
Inference trails are a mechanism used in some implementations of PLN, borrowed from the
NARS inference engine [Wan06]. In this approach, each Atom contains a trail structure, which
keeps a record of which Atoms were used in deriving the given Atom's TruthValue. In its simplest
form, the trail can just be a list of Atoms. The total set of Atoms involved in a given trail, in
principle, could be very large; but one can in practice cap trail size at 50 or some other similar
number. In a more sophisticated version, one can record the rules used with the Atoms in the
trail as well, allowing recapitulation of the whole inference history producing an Atom's truth
value. If the PLN MindAgents store all the inferences they do in some global inference history
structure, then trails are obviated, as the information in the trail can be found via consulting
this history structure.
The purpose of keeping inference trails is to avoid errors due to double-counting of evidence.
If links L1 and L2 are both derived largely based on link L0, and L1 and L2 both lead to L4
as a consequence - do we want to count this as two separate, independent pieces of evidence
about L4? Not really, because most of the information involved comes from the single Atom L0
anyway. If all the Atoms maintain trails then this sort of overlapping evidence can be identified
easily; otherwise it will be opaque to the reasoning system.
While Trails can be a useful tool, there is reason to believe they're not strictly necessary.
If one just keeps doing probabilistic inference iteratively without using Trails, eventually the
dependencies and overlapping evidence bases will tend to be accounted for, much as in a loopy
Bayes net. The key question then comes down to: how long is "eventually" and can the reasoning
system afford to wait? A reasonable strategy seems to be
• Use Trails for high-STI Atoms that are being reasoned about intensively, to minimize the
amount of error
• For lower-STI Atoms that are being reasoned on more casually in the background, allow
the double-counting to exist in the short term, figuring it will eventually "come out in the
wash" so it's not worth spending precious compute resources to more rigorously avoid it in
the short term
34.4 Higher-Order PLN
Higher-order PLN (HOPLN) is defined as the subset of PLN that applies to predicates (con-
sidered as functions mapping arguments into truth values). It includes mechanisms for dealing
with variable-bearing expressions and higher-order functions.
A predicate, in PLN, is a special kind of term that embodies a function mapping terms or
relationships into truth-values. HOPLN contains several relationships that act upon predicates
including Evaluation, Implication, and several types of quantifiers. The relationships can involve
constant terms, variables, or a mixture.
The Evaluation relationship, for example, evaluates a predicate on an input term. An agent
can thus create a relationship of the form
Evaluation
near
(Bob's house, Fluffy)
or, as an example involving variables,
Evaluation
near
(X, Fluffy)
The Implication relationship is a particularly simple kind of HOPLN relationship in that it
behaves very much like FOPLN relationships, via substitution of predicates in place of simple
terms. Since our agent knows, for example,
Implication
is_Fluffy
AND is_furry purrs
and
Implication
AND is_furry purrs
is_cat
the agent could then use the deduction rule to conclude
Implication is_Fluffy is_cat
PLN supports a variety of quantifiers, including traditional crisp and fuzzy quantifiers, plus
the AverageQuantifier defined so that the truth value of
AverageQuantifier X F(X)
is a weighted average of F(X) over all relevant inputs X. AverageQuantifier is used implicitly
in PLN to handle logical relationships between predicates, so that e.g. the conclusion of the
above deduction is implicitly interpreted as
AverageQuantifier X
Implication
Evaluation is_Fluffy X
Evaluation is_cat X
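The AverageQuantifier reading of such an implication can be illustrated concretely. The sketch below estimates the strength of an Implication between two fuzzy predicates by averaging over observed inputs, weighting each input by its degree of satisfying the antecedent. This weighting is one simple illustrative choice, not PLN's exact truth-value formula:

```python
def implication_strength(f, g, inputs):
    """Strength of (Implication f g) as a weighted average over relevant inputs.

    Each input is weighted by its degree of satisfying the antecedent f, so in
    the crisp case this reduces to the conditional frequency P(g|f)."""
    num = sum(min(f(x), g(x)) for x in inputs)
    den = sum(f(x) for x in inputs)
    return num / den if den > 0 else 0.0

# toy fuzzy predicates over a handful of observed entities
is_fluffy = {"Fluffy": 1.0, "Rex": 0.1, "Tweety": 0.3}.get
is_cat = {"Fluffy": 1.0, "Rex": 0.0, "Tweety": 0.1}.get

entities = ["Fluffy", "Rex", "Tweety"]
s = implication_strength(is_fluffy, is_cat, entities)  # (1.0 + 0.0 + 0.1) / 1.4
```

In the crisp case, where the predicates return only 0 or 1, this collapses to counting how often the consequent holds among inputs satisfying the antecedent, matching the intuitive probabilistic reading.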
We can now connect PLN with the SRAM model (defined in Chapter 7 of Part 1).
Suppose for instance that the agent observes Fluffy from across the room, and that it has
previously learned a Fetch procedure that tells it how to obtain an entity once it sees that
entity. Then, if the agent has the goal of finding a cat, and it has concluded based on the above
deduction that Fluffy is indeed a cat (since it is observed to be furry and purr), the cognitive
schematic (knowledge of the form Context & Procedure → Goal, as explained in Chapter 8 of
Part 1) may suggest that it execute the Fetch procedure.
34.4.1 Reducing HOPLN to FOPLN
In [GMIH08] it is shown that in principle, over any finite observation set, HOPLN reduces to
FOPLN. The key ideas of this reduction are the elimination of variables via use of higher-order
functions, and the use of the set-theoretic definition of function embodied in the SatisfyingSet
operator to map function-argument relationships into set-member relationships.
As an example, consider the Implication link. In HOPLN, where X is a variable
Implication
R1 A X
R2 B X
may be reduced to
Inheritance
SatisfyingSet(R1 A X)
SatisfyingSet(R2 B X)
where e.g. SatisfyingSet(R1 A X) is the fuzzy set of all X satisfying the relationship R1(A, X).
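A toy illustration of this reduction: represent each SatisfyingSet as a fuzzy set over a finite universe, and evaluate the Inheritance between them with a standard fuzzy-subset measure. The relation tables, names, and the particular subset measure here are all hypothetical choices for illustration:

```python
def satisfying_set(rel, a, universe):
    """SatisfyingSet(R a X): the fuzzy set of all X, with membership degree R(a, X)."""
    return {x: rel(a, x) for x in universe}

def inheritance_strength(s1, s2):
    """Fuzzy subset measure: the extent to which fuzzy set s1 is contained in s2."""
    num = sum(min(s1[x], s2[x]) for x in s1)
    den = sum(s1.values())
    return num / den if den > 0 else 0.0

# hypothetical fuzzy relations R1(A, X) and R2(B, X) over a small universe
universe = ["x1", "x2", "x3"]
R1 = lambda a, x: {"x1": 0.9, "x2": 0.2, "x3": 0.0}[x]
R2 = lambda b, x: {"x1": 0.8, "x2": 0.6, "x3": 0.1}[x]

# Implication (R1 A X) (R2 B X) ~ Inheritance SatisfyingSet(R1 A X) SatisfyingSet(R2 B X)
s = inheritance_strength(satisfying_set(R1, "A", universe),
                         satisfying_set(R2, "B", universe))  # 1.0 / 1.1
```

The point of the reduction is visible in the last line: a variable-bearing Implication has been replaced by a first-order Inheritance between two sets.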
Furthermore in Appendix ??, we show how experience-based possible world semantics can
be used to reduce PLN's existential and universal quantifiers to standard higher order PLN
relationships using AverageQuantifier relationships. This completes the reduction of HOPLN to
FOPLN in the SRAM context.
One may then wonder why it makes sense to think about HOPLN at all. The answer is that it
provides compact expression of a specific subset of FOPLN expressions, which is useful in cases
where agents have limited memory and these particular expressions provide the agents practical
value (the compactness lets a memory-limited agent reason as effectively with these higher-order
expressions as with first-order ones).
34.5 Predictive Implication and Attraction
This section briefly reviews the notions of predictive implication and predictive attraction,
which are critical to many aspects of CogPrime dynamics including goal-oriented behavior.
Define
Attraction A B <s>
as P(B|A) - P(B|¬A) = s, or in node and link terms
s = (Inheritance A B).s - (Inheritance ¬A B).s
For instance
(Attraction fat pig).s =
(Inheritance fat pig).s - (Inheritance ¬fat pig).s
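Computed from raw co-occurrence observations, this definition looks roughly as follows. The helper `conditional_probs` is a hypothetical name, and real PLN truth values would also track confidence, not just strength:

```python
def attraction(p_b_given_a, p_b_given_not_a):
    """Attraction A B <s>:  s = P(B|A) - P(B|not A)."""
    return p_b_given_a - p_b_given_not_a

def conditional_probs(pairs):
    """From boolean observations (a, b), estimate P(B|A) and P(B|not A)."""
    b_when_a = [b for a, b in pairs if a]
    b_when_not_a = [b for a, b in pairs if not a]
    return sum(b_when_a) / len(b_when_a), sum(b_when_not_a) / len(b_when_not_a)

# toy (fat, pig) observations: fat things are observed to be pigs much more often
obs = ([(True, True)] * 8 + [(True, False)] * 2 +
       [(False, True)] * 3 + [(False, False)] * 7)
p1, p0 = conditional_probs(obs)  # 0.8 and 0.3
s = attraction(p1, p0)           # approximately 0.5
```

A high Attraction thus means B is differentially more likely given A, which is stronger information than a high Inheritance strength alone.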
Relatedly, in the temporal domain, we have the link type PredictiveImplication, where
PredictiveImplication A B <s>
roughly means that s is the probability that
Implication A B <s>
holds and also A occurs before B. More sophisticated versions of PredictiveImplication come
along with more specific information regarding the time lag between A and B: for instance a
time interval T in which the lag must lie, or a probability distribution governing the lag between
the two events.
We may then introduce
PredictiveAttraction A B <s>
to mean
s = (PredictiveImplication A B).s - (PredictiveImplication ¬A B).s
For instance
(PredictiveAttraction kiss_Ben be_happy).s =
(PredictiveImplication kiss_Ben be_happy).s
- (PredictiveImplication ¬kiss_Ben be_happy).s
This is what really matters in determining whether kissing Ben is worth doing in pursuit of
the goal of being happy: not just how likely it is that you will be happy if you kiss Ben, but
how differentially likely it is that you will be happy if you kiss Ben.
Along with predictive implication and attraction, sequential logical operations are important,
represented by operators such as SequentialAND, SimultaneousAND and SimultaneousOR. For
instance:
PredictiveAttraction
SequentialAND
Teacher says 'fetch'
I get the ball
I bring the ball to the teacher
I get a reward
combines SequentialAND and PredictiveAttraction. In this manner, an arbitrarily complex
system of serial and parallel temporal events can be constructed.
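A simplified way to estimate the strength of a PredictiveImplication from timestamped event streams is to count how often an occurrence of A is followed by an occurrence of B within the allowed lag. This is a sketch under strong simplifying assumptions (point events, a single hard lag bound); actual PLN would carry full truth values and lag distributions:

```python
def predictive_implication(a_times, b_times, max_lag):
    """Estimate the strength of (PredictiveImplication A B): the fraction of
    occurrences of A that are followed by an occurrence of B within max_lag."""
    if not a_times:
        return 0.0
    followed = sum(
        1 for ta in a_times
        if any(ta < tb <= ta + max_lag for tb in b_times)
    )
    return followed / len(a_times)

kiss_ben = [1, 5, 9, 14]   # times at which "kiss_Ben" occurs
be_happy = [2, 6, 20]      # times at which "be_happy" occurs
# 2 of the 4 kisses are followed by happiness within 3 time units
s = predictive_implication(kiss_ben, be_happy, max_lag=3)  # 0.5
```

A SequentialAND of several events could be handled in the same spirit, by first collapsing each in-order chain of occurrences into a single composite event and then applying the same counting.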
34.6 Confidence Decay
PLN is all about uncertain truth values, yet there is an important kind of uncertainty it doesn't
handle explicitly and completely in its standard truth value representations: the decay of infor-
mation with time.
PLN does have an elegant mechanism for handling this: in the <s,d> formalism for truth
values, strength s may remain untouched by time (except as new evidence specifically corrects
it), but d may decay over time. So, our confidence in our old observations decreases with time.
In the indefinite probability formalism, what this means is that old truth value intervals get
wider, but retain the same mean as they had back in the good old days.
But the tricky question is: How fast does this decay happen?
This can be highly context-dependent.
For instance, 20 years ago we learned that the electric guitar is the most popular instrument
in the world, and also that there are more bacteria than humans on Earth. The former fact is
no longer true (keyboard synthesizers have outpaced electric guitars), but the latter is. And,
if you'd asked us 20 years ago which fact would be more likely to become obsolete, we would
have answered the former - because we knew particulars of technology would likely change far
faster than basic facts of earthly ecology.
On a smaller scale, it seems that estimating confidence decay rates for different sorts of
knowledge in different contexts is a tractable data mining problem that can be solved via
the system keeping a record of the observed truth values of a random sampling of Atoms as
they change over time. (Operationally, this record may be maintained in parallel with the
SystemActivityTable and other tables maintained for purposes of effort estimation, attention
allocation and credit assignment.) If the truth values of a certain sort of Atom in a certain
context change a lot, then the confidence decay rate for Atoms of that sort should be increased.
This can be quantified nicely using the indefinite probabilities framework.
For instance, we can calculate, for a given sort of Atom in a given context, separate b-level
credible intervals for the L and U components of the Atom's truth value at time t-r, centered
about the corresponding values at time t. (This would be computed by averaging over all t
values in the relevant past, where the relevant past is defined as some particular multiple of r;
and over a number of Atoms of the same sort in the same context.)
Since historically-estimated credible-intervals won't be available for every exact value of r,
interpolation will have to be used between the values calculated for specific values of r.
Also, while separate intervals for L and U would be kept for maximum accuracy, for reasons
of pragmatic memory efficiency one might want to maintain only a single number x, considered
as the radius of the confidence interval about both L and U. This could be obtained by averaging
together the empirically obtained intervals for L and U.
Then, when updating an Atom's truth value based on a new observation, one performs a
revision of the old TV with the new, but before doing so, one first widens the interval for the
old one by the amounts indicated by the above-mentioned credible intervals.
For instance, if one gets a new observation about A with TV (L_new, U_new), and the prior
TV of A, namely (L_old, U_old), is 2 weeks old, then one may calculate that L_old should really
be considered as
(L_old - x, L_old + x)
and U_old should really be considered as
(U_old - x, U_old + x)
so that (L_new, U_new) should actually be revised with
(L_old - x, U_old + x)
to get the total
(L, U)
for the Atom after the new observation.
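The widen-then-revise procedure can be sketched as follows. Note that the plain weighted averaging used for the revision step here is purely illustrative; PLN's actual revision rule is more involved:

```python
def widen(interval, x, bounds=(0.0, 1.0)):
    """Widen a truth-value interval by radius x, clipped to the legal bounds."""
    lo, hi = interval
    return (max(bounds[0], lo - x), min(bounds[1], hi + x))

def revise(old, new, w_old=0.5):
    """Revise two interval truth values. Weighted averaging is used here purely
    for illustration; PLN's actual revision rule is more involved."""
    return (w_old * old[0] + (1 - w_old) * new[0],
            w_old * old[1] + (1 - w_old) * new[1])

old_tv = (0.60, 0.70)  # (L_old, U_old), observed 2 weeks ago
x = 0.05               # empirically mined widening radius for this time lag
new_tv = (0.40, 0.50)  # (L_new, U_new), the fresh observation

widened = widen(old_tv, x)       # approximately (0.55, 0.75)
total = revise(widened, new_tv)  # approximately (0.475, 0.625)
```

The older the prior truth value, the larger the mined radius x, so stale evidence automatically counts for less in the revision.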
Note that we have referred fuzzily to "sort of Atom" rather than "type of Atom" in the above.
This is because Atom type is not really the right level of specificity to be looking at. Rather -
as in the guitar vs. bacteria example above - confidence decay rates may depend on semantic
categories, not just syntactic (Atom type) categories. To give another example, confidence in
the location of a person should decay more quickly than confidence in the location of a building.
So ultimately confidence decay needs to be managed by a pool of learned predicates, which are
applied periodically. These predicates are mainly to be learned by data mining, but inference
may also play a role in some cases.
The ConfidenceDecay MindAgent must take care of applying the confidence-decaying pred-
icates to the Atoms in the AtomTable, periodically.
The ConfidenceDecayUpdater MindAgent must take care of:
• forming new confidence-decaying predicates via data mining, and then revising them with
the existing relevant confidence-decaying predicates.
• flagging confidence-decaying predicates which pertain to important Atoms but are uncon-
fident, by giving them STICurrency, so as to make it likely that they will be visited by
inference.
34.6.1 An Example
As an example of the above issues, consider that the confidence decay of:
Inh Ari male
should be low whereas that of:
Inh Ari tired
should be higher, because we know that for humans, being male tends to be a more permanent
condition than being tired.
This suggests that concepts should have context-dependent decay rates, e.g. in the context
of humans, the default decay rate of maleness is low whereas the default decay rate of tired-ness
is high.
However, these defaults can be overridden. For instance, one can say "As he passed through
his 80's, Grandpa just got tired, and eventually he died." This kind of tiredness, even in the
context of humans, does not have a rapid decay rate. This example indicates why the confidence
decay rate of a particular Atom needs to be able to override the default.
In terms of implementation, one mechanism to achieve the above example would be as follows.
One could incorporate an interval confidence decay rate as an optional component of a truth
value. As noted above one can keep two separate intervals for the L and U bounds; or to simplify
things one can keep a single interval and apply it to both bounds separately.
Then, e.g., to define the decay rate for tiredness among humans, we could say:
ImplicationLink_HOJ
InheritanceLink $X human
InheritanceLink $X tired <confidenceDecay = [0, .1]>
or else (preferably):
ContextLink
human
InheritanceLink $X tired <confidenceDecay = [0, .1]>
Similarly, regarding maleness we could say:
ContextLink
human
Inh $X male <confidenceDecay = [0, .00001]>
Then one way to express the violation of the default in the case of grandpa's tiredness would
be:
InheritanceLink grandpa tired <confidenceDecay = [0, .001]>
(Another way to handle the violation from default, of course, would be to create a separate
Atom:
tired_from_old_age
and consider this as a separate sense of "tired" from the normal one, with its own confidence
decay setting.)
In this example we see that, when a new Atom is created (e.g. InheritanceLink Ari tired),
it needs to be assigned a confidence decay rate via inference based on relations such as the
ones given above (this might be done e.g. by placing it on the queue for immediate attention
by the ConfidenceDecayUpdater MindAgent). And periodically its confidence decay rate could
be updated based on ongoing inferences (in case relevant abstract knowledge about confidence
decay rates changes). Making this sort of inference reasonably efficient might require creating a
special index containing abstract relationships that tell you something about confidence decay
adjustment, such as the examples given above.
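A sketch of how such context-dependent defaults and per-Atom overrides might be looked up, with all table contents and function names hypothetical:

```python
# Hypothetical rule base mapping (context, predicate) to a default
# confidence-decay interval, with per-Atom overrides taking precedence.
DEFAULT_DECAY = {
    ("human", "male"):  (0.0, 0.00001),
    ("human", "tired"): (0.0, 0.1),
}
ATOM_OVERRIDES = {
    # "tired from old age" decays slowly, overriding the human-context default
    ("grandpa", "tired"): (0.0, 0.001),
}

def decay_rate(subject, predicate, context):
    """Atom-specific override first, then the context default, then no decay."""
    if (subject, predicate) in ATOM_OVERRIDES:
        return ATOM_OVERRIDES[(subject, predicate)]
    return DEFAULT_DECAY.get((context, predicate), (0.0, 0.0))

print(decay_rate("Ari", "tired", "human"))      # (0.0, 0.1)
print(decay_rate("grandpa", "tired", "human"))  # (0.0, 0.001)
```

In an actual system the tables would be replaced by the learned predicate pool described above, and the lookup would be an inference step rather than a dictionary access; the precedence structure is the point of the sketch.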
34.7 Why is PLN a Good Idea?
We have explored the intersection of the family of conceptual and formal structures that is
PLN, with a specific formal model of intelligent agents (SRAM) and its extension using the
cognitive schematic. The result is a simple and explicit formulation of PLN as a system by
which an agent can manipulate tokens in its memory, and thus represent observed and conjectured
relationships (between its observations and between other relationships), in a way that assists
it in choosing actions according to the cognitive schematic.
We have not, however, rigorously answered the question: What is the contribution of PLN
to intelligence, within the formal agents framework introduced above? This is a quite subtle
question, to which we can currently offer only an intuitive answer, not a rigorous one.
Firstly, there is the question of whether probability theory is really the best way to manage
uncertainty, in a practical context. Theoretical results like those of Cox [Cox61] and de Finetti
[dF37] demonstrate that probability theory is the optimal way to handle uncertainty, if one
makes certain reasonable assumptions. However, these reasonable assumptions don't actually
apply to real-world intelligent systems, which must operate with relatively severe computa-
tional resource constraints. For example, one of Cox's axioms dictates that a reasoning system
must assign the same truth value to a statement, regardless of the route it uses to derive the
statement. This is a nice idealization, but it can't be expected of any real-world, finite-resources
reasoning system dealing with a complex environment. So an open question exists, as to whether
probability theory is actually the best way for practical AGI systems to manage uncertainty.
Most contemporary AI researchers assume the answer is yes, and probabilistic AI has achieved
increasing popularity in recent years. However, there are also significant voices of dissent, such
as Pei Wang [Wan06] in the AGI community, and many within the fuzzy logic community.
PLN is not strictly probabilistic, in the sense that it combines formulas derived rigorously
from probability theory with others that are frankly heuristic in nature. PLN was created in a
spirit of open-mindedness regarding whether probability theory is actually the optimal approach
to reasoning under uncertainty using limited resources, versus merely an approximation to the
optimal approach in this case. Future versions of PLN might become either more or less strictly
probabilistic, depending on theoretical and practical advances.
Next, aside from the question of the practical value of probability theory, there is the question
of whether PLN in particular is a good approach to carrying out significant parts of what an
AGI system needs to do, to achieve human-like goals in environments similar to everyday human
environments.
Within a cognitive architecture where explicit utilization of the cognitive schematic (Context
& Procedure → Goal) is useful, clearly PLN is useful if it works reasonably well - so this
question partially reduces to: what are the environments in which agents relying on the cognitive
schematic are intelligent, according to formal intelligence measures like those defined in Chapter
7 of Part 1? And then there is the possibility that some uncertain reasoning formalism besides
PLN could be even more useful in the context of the cognitive schematic.
In particular, the question arises: What are the unique, peculiar aspects of PLN that make
it more useful in the context of the cognitive schematic, than some other, more straightforward
approach to probabilistic inference? Actually there are multiple such aspects that we believe
make it particularly useful. One is the indefinite probability approach to truth values, which
we believe is more robust for AGI than known alternatives. Another is the clean reduction of
higher order logic (as defined in PLN) to first-order logic (as defined in PLN), and the utilization
of term logic instead of predicate logic wherever possible — these aspects make PLN inferences
relatively simple in most cases where, according to human common sense, they should be simple.
A relatively subtle issue in this regard has to do with PLN intension. The cognitive schematic
is formulated in terms of PredictiveExtensionalImplication (or any equivalent form such as
PredictiveExtensionalAttraction), which means that intensional PLN links are not required
for handling it. The hypothesis of the usefulness of intensional PLN links embodies a subtle
assumption about the nature of the environments that intelligent agents are operating in. As
discussed in [Goe06], it requires an assumption related to Peirce's philosophical axiom of the
"tendency to take habits," which posits that in the real world, entities possessing some similar
patterns have a probabilistically surprising tendency to have more similar patterns.
Reflecting on these various theoretical subtleties and uncertainties, one may get the feeling
that the justification for applying PLN in practice is quite insecure! However, it must be noted
that no other formalism in AI has significantly better foundation, at present. Every AI method
involves certain heuristic assumptions, and the applicability of these assumptions in real life is
nearly always a matter of informal judgment and copious debate. Even a very rigorous technique,
like a crisp logic formalism or support vector machines for classification, requires non-rigorous
heuristic assumptions to be applied to the real world (how do sensation and actuation get
translated into logic formulas, or SVM feature vectors?). It would be great if it were possible to
use rigorous mathematical theory to derive an AGI design, but that's not the case right now,
and the development of this sort of mathematical theory seems quite a long way off. So for now,
we must proceed via a combination of mathematics, practice and intuition.
In terms of demonstrated practical utility, PLN has not yet confronted any really ambitious
AGI-type problems, but it has shown itself capable of simple practical problem-solving in areas
such as virtual agent control and natural language based scientific reasoning [HM08]. The
current PLN implementation within CogPrime can be used to learn to play fetch or tag, draw
analogies based on observed objects, or figure out how to carry out tasks like finding a cat.
We expect that further practical applications, as well as very ambitious AGI development, can
be successfully undertaken with PLN without a theoretical understanding of exactly what are
the properties of the environments and goals involved that allow PLN to be effective. However,
we expect that a deeper theoretical understanding may enable various aspects of PLN to be
adjusted in a more effective manner.
Chapter 35
Spatiotemporal Inference
35.1 Introduction
Most of the problems and situations humans confront every day involve space and time explicitly
and centrally. Thus, any AGI system aspiring to humanlike general intelligence must have some
reasonably efficient and general capability to solve spatiotemporal problems. Regarding how
this capability might get into the system, there is a spectrum of possibilities, ranging from rigid
hard-coding to tabula rasa experiential learning. Our bias in this regard is that it's probably
sensible to somehow "wire into" CogPrime some knowledge regarding space and time - these
being, after all, very basic categories for any embodied mind confronting the world.
It's arguable whether the explicit insertion of prior knowledge about spacetime is necessary
for achieving humanlike AGI using feasible resources. As an argument against the necessity
of this sort of prior knowledge, Ben Kuipers and his colleagues [SNK12] have shown that
an AI system can learn via experience that its perceptual stream comes from a world with
three, rather than two or four dimensions. There is a long way from learning the number of
dimensions in the world to learning the full scope of practical knowledge needed for effectively
reasoning about the world - but it does seem plausible, from their work, that a broad variety
of spatiotemporal knowledge could be inferred from raw experiential data. On the other hand,
it also seems clear that the human brain does not do it this way, and that a rich fund of
spatiotemporal knowledge is "hard-coded" into the brain by evolution - often in ways so low-
level that we take them for granted, e.g. the way some motion detection neurons fire in the
physical direction of motion, and the way somatosensory cortex presents a distorted map of the
body's surface. On a psychological level, it is known that some fundamental intuition for space
and time is hard-coded into the human infant's brain [Joh05]. So while we consider the learning
of basic spatiotemporal knowledge from raw experience a worthy research direction, and fully
compatible with the CogPrime vision, for our main current research we have chosen to
hard-wire some basic spatiotemporal knowledge.
If one does wish to hard-wire some basic spatiotemporal knowledge into one's AI system,
multiple alternate or complementary methodologies may be used to achieve this, including spa-
tiotemporal logical inference, internal simulation, or techniques like recurrent neural nets whose
dynamics defy simple analytic explanation. Though our focus in this chapter is on inference, we
must emphasize that inference, even very broadly conceived, is not the only way for an intelli-
gent agent to solve spatiotemporal problems occurring in its life. For instance, if the agent has
a detailed map of its environment, it may be able to answer some spatiotemporal questions by
directly retrieving information from the map. Or, logical inference may be substituted or aug-
mented by (implicitly or explicitly) building a model that satisfies the initial knowledge - either
abstractly or via incorporating "visualization" connected to sensory memory - and then
interpreting new knowledge over that model instead of inferring it. The latter is one way to interpret
what DeSTIN and other CSDLNs do; indeed, DeSTIN's perceptual hierarchy is often referred to
as a "state inference hierarchy." Any CSDLN contains biasing toward the commonsense struc-
ture of space and time, in its spatiotemporal hierarchical structure. It seems plausible that the
human mind uses a combination of multiple methods for spatiotemporal understanding, just as
we intend CogPrime to do.
In this chapter we focus on spatiotemporal logical inference, addressing the problem of creat-
ing a spatiotemporal logic adequate for use within an AGI system that confronts the same sort
of real-world problems that humans typically do. The idea is not to fully specify the system's
understanding of space and time in advance, but rather to provide some basic spatiotemporal logic
rules, with parameters to be adjusted based on experience, and the opportunity for augmenting
the logic over time with experientially-acquired rules. Most of the ideas in this chapter are
reviewed in more detail, with more explanation, in the book Real World Reasoning [GC+11];
this chapter represents a concise summary, compiled with the AGI context specifically in mind.
A great deal of excellent work has already been done in the areas of spatial, temporal and
spatiotemporal reasoning; however, this work does not quite provide an adequate foundation
for a logic-incorporating AGI system to do spatiotemporal reasoning, because it does not ade-
quately incorporate uncertainty. Our focus here is to extend existing spatiotemporal calculi to
appropriately encompass uncertainty, which we argue is sufficient to transform them into an
AGI-ready spatiotemporal reasoning framework. We also find that a simple extension of the
standard PLN uncertainty representations, inspired by P(Z)-logic [Yan10], allows more elegant
expression of probabilistic fuzzy predicates such as arise naturally in spatiotemporal logic.
In the final section of the chapter, we discuss the problem of planning, which has been con-
sidered extensively in the AI literature. We describe an approach to planning that incorporates
PLN inference using spatiotemporal logic, along with MOSES as a search method, and some
record-keeping methods inspired by traditional Al planning algorithms.
35.2 Related Work on Spatio-temporal Calculi
We now review several calculi that have previously been introduced for representing and rea-
soning about space, time and space-time combined.
Spatial Calculi
Calculi dealing with space usually model three types of relationships between spatial regions:
topological, directional and metric.
The most popular calculus dealing with topology is the Region Connection Calculus (RCC)
[RCC92], relying on a base relationship C (for Connected) and building up other relationships
from it, like P (for PartOf) or O (for Overlap). For instance P(X, Y), meaning X is a part of
Y, can be defined using C as follows
P(X, Y) iff ∀Z ∈ U, C(Z, X) → C(Z, Y) (35.1)
where U is the universe of regions. RCC-8 models eight base relationships; see Figure 35.1.
Fig. 35.1: The eight base relationships of RCC-8: DC(X, Y), EC(X, Y), PO(X, Y), EQ(X, Y), TPP(X, Y), NTPP(X, Y), TPPi(X, Y), NTPPi(X, Y)
It is also possible, using the notion of convexity, to model further relationships such as inside,
partially inside and outside; see Figure 35.2. For instance RCC-23 is an extension of RCC-8
using relationships based on the notion of convexity.
Fig. 35.2: Additional relationships using convexity: Inside(X, Y), P-Inside(X, Y), Outside(X, Y)
The 9-intersection calculus [WM95, Kur09] is
another calculus for reasoning on topological relationships, but one handling relationships between
heterogeneous objects: points, lines and surfaces.
Regarding reasoning about direction, the Cardinal Direction Calculus [GE01, ZL08] con-
siders directional relationships between regions, to express propositions such as "region A is to
the north of region B".
And finally regarding metric reasoning, spatial reasoning involving qualitative distance (such
as close, medium, far) and direction combined is considered in [CFH97].
Some work has also been done to extend and combine these various calculi, such as combining
RCC-8 and the Cardinal Direction Calculus [MAM], or using size [GR00] or shape [Coh95]
information in RCC.
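The crisp RCC definition of PartOf in Equation 35.1 is easy to operationalize over a finite universe of regions. In the toy model below a region is a set of grid cells and C holds when two regions share a cell; this is one simple model of connection, chosen purely for illustration:

```python
def C(x, y):
    """Connected: the two regions share at least one cell (a toy model of connection)."""
    return bool(x & y)

def P(x, y, universe):
    """PartOf, defined from C as in Eq. 35.1:
    every region Z in the universe connected to x is also connected to y."""
    return all(C(z, y) for z in universe if C(z, x))

# a toy universe of regions, each a frozenset of grid cells
a = frozenset({1, 2})
b = frozenset({1, 2, 3})
c = frozenset({5})
universe = [a, b, c, frozenset({2, 3}), frozenset({3, 5})]

print(P(a, b, universe))  # True: a is part of b
print(P(b, a, universe))  # False: {3, 5} connects to b but not to a
```

Note that how faithfully this P matches intuitive parthood depends on the universe being rich enough, which mirrors the quantification over all regions in Equation 35.1.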
Temporal Calculi
The best known temporal calculus is Allen's Interval Algebra, which considers 13 rela-
tionships over time intervals, such as Before, During, Overlap, Meet, etc. For instance one
can express that digestion occurs after or right after eating by
Before(Eat, Digest) ∨ Meet(Eat, Digest)
equivalently denoted Eat{Before,Meet}Digest. There also exists a generalization of Allen's
Interval Algebra that works on semi-intervals [FF92], that are intervals with possibly undefined
start or end.
There are modal temporal logics such as LTL and CTL, mostly used to check temporal
constraints on concurrent systems, such as deadlock or fairness, using Model Checking [Mai00].
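For a concrete feel for Allen's algebra, the basic relation holding between two crisp intervals can be classified directly from their endpoints. This sketch covers the seven relations from the first interval's side; the six inverses follow by swapping arguments:

```python
def allen_relation(i1, i2):
    """Classify the basic Allen relation of interval i1 = (s1, e1) to i2 = (s2, e2).
    Covers the seven relations from i1's side; the six inverses arise by swapping."""
    s1, e1 = i1
    s2, e2 = i2
    if e1 < s2:
        return "Before"
    if e1 == s2:
        return "Meet"
    if s1 == s2 and e1 == e2:
        return "Equal"
    if s1 == s2 and e1 < e2:
        return "Start"
    if s2 < s1 and e1 == e2:
        return "Finish"
    if s2 < s1 and e1 < e2:
        return "During"
    if s1 < s2 < e1 < e2:
        return "Overlap"
    return "Inverse"  # one of the six inverse relations

eat = (0, 3)
digest = (3, 8)
print(allen_relation(eat, digest))  # Meet
```

The eating/digestion example from the text falls out directly: the end of Eat coincides with the start of Digest, so the relation is Meet rather than Before.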
Calculi with Space and Time Combined
There exist calculi combining space and time, first of all those obtained by "temporizing" spatial
calculi, that is, tagging spatial predicates with timestamps or time intervals. For instance STCC
(for Spatio-temporal Constraint Calculus) [GN02] is basically RCC-8 combined with Allen's
Algebra. With STCC one can express spatiotemporal propositions such as
Meet(DC( Finger, Key), EC( Finger, Key))
which means that the interval during which the finger is away from the key meets the interval
during which the finger is against the key.
Another way to combine space and time is by modeling motion; e.g. the Qualitative Tra-
jectory Calculus (QTC) [WKB05] can be used to express whether two objects are going for-
ward/backward or left/right relative to each other.
Uncertainty in Spatio-temporal Calculi
In many situations it is worthwhile or even necessary to consider non-crisp extensions of these
calculi. For example it is not obvious how one should decide in practice whether two regions
are connected or disconnected. A desk against the wall would probably be considered connected
to it even if there is a small gap between the wall and the desk. Or if A is not entirely part
of B it may still be valuable to consider to what extent it is, rather than formally rejecting
PartOf(A, B). There are several ways to deal with such phenomena; one way is to consider
probabilistic or fuzzy extensions of spatiotemporal calculi.
For instance in [SDCCK08b, SDCCK08a] the RCC relationship C (for Connected) is replaced
by a fuzzy predicate representing closeness between regions, and all other relationships based
on it are extended accordingly. So e.g. DC (for Disconnected) is defined as follows
DC(X, Y) = 1 - C(X, Y) (35.2)
P (for PartOf) is defined as
P(X, Y) = inf_{Z ∈ U} I(C(Z, X), C(Z, Y)) (35.3)
where I is a fuzzy implication with some natural properties (usually I(x1, x2) = max(1 - x1, x2)).
Or, EQ (for Equal) is defined as
EQ(X, Y) = min(P(X, Y), P(Y, X)) (35.4)
and so on.
However the inference rules cannot determine the exact fuzzy values of the resulting rela-
tionships but only a lower bound, for instance
T(P(X, Y), P(Y, Z)) ≤ P(X, Z) (35.5)
where T(x1, x2) = max(0, x1 + x2 - 1). This is to be expected, since in order to know the resulting
fuzzy value one would need to know the exact spatial configuration. For instance Figure 35.3
depicts two possible configurations that would result in two different values of P(X, Z).
Fig. 35.3: Depending on where Z is (dashed line), P(X, Z) gets a different value.
One way to address this difficulty is to reason with interval-valued fuzzy logic [DP00], with the
downside of ending up with wide intervals. For example, applying the same inference rule from
Equation 35.5 in the case depicted in Figure 35.4 would result in the interval [0, 1], corresponding
to a state of total ignorance. This is the main reason why, as explained in the next section, we
have decided to use distributional fuzzy values for our AGI-oriented spatiotemporal reasoning.
There also exist attempts to use probability with RCC. For instance, in [Win00], RCC relationships are extracted from computer images and weighted based on their likelihood as
estimated by a shape recognition algorithm. However, to the best of our knowledge, no one
has used distributional fuzzy values [Yan] in the context of spatiotemporal reasoning; and we
believe this is important for the adaptation of spatiotemporal calculi to the AGI context.
35.3 Uncertainty with Distributional Fuzzy Values
Distributional fuzzy logic [Yan] is an extension of fuzzy logic that considers distributions of fuzzy values rather
than mere fuzzy values. That is, fuzzy connectors are extended to apply over probability density
functions of fuzzy truth values. For instance the negation connector (often defined as ¬x = 1 − x) is
extended such that the resulting distribution μ¬ : [0, 1] → R+ is

μ¬(x) = μ(1 − x) (35.6)

where μ is the probability density function of the unique argument. Similarly, one can define
μ∧ : [0, 1] → R+ as the resulting density function of the connector x1 ∧ x2 = min(x1, x2) over
the 2 arguments μ1 : [0, 1] → R+ and μ2 : [0, 1] → R+
μ∧(x) = μ1(x) ∫_x^1 μ2(x2) dx2 + μ2(x) ∫_x^1 μ1(x1) dx1 (35.7)

See [Yan] for the justification of Equations 35.6 and 35.7.
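Equations 35.6 and 35.7 are easy to check numerically. The sketch below discretizes two densities on [0, 1] and applies both connectors; the triangular example densities and the grid resolution are our own arbitrary choices.

```python
N = 400                              # grid resolution on [0, 1]
dx = 1.0 / N
xs = [(i + 0.5) * dx for i in range(N)]

# two (assumed) triangular densities on [0, 1]
mu1 = [2.0 * x for x in xs]          # peaked near 1
mu2 = [2.0 * (1.0 - x) for x in xs]  # peaked near 0

def tail(mu, i):
    """Riemann approximation of the integral of mu over [x_i, 1]."""
    return sum(mu[j] for j in range(i, N)) * dx

# Eq 35.7: density of min(x1, x2)
mu_min = [mu1[i] * tail(mu2, i) + mu2[i] * tail(mu1, i) for i in range(N)]

# Eq 35.6: density of the negation 1 - x, here applied to mu1
mu_neg = list(reversed(mu1))

total = sum(mu_min) * dx             # should be close to 1
```

As expected, μ∧ integrates to (approximately) 1, and the negation of the first triangular density coincides with the second.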
Besides extending the traditional fuzzy operators, one can also define a wider class of connectors that can fully modulate the output distribution. Let F : [0, 1]^n → ([0, 1] → R+) be an
n-ary connector that takes n fuzzy values and returns a probability density function. In that
case the probability density function resulting from the extension of F over distributional fuzzy
values is:

μ_F = ∫_0^1 ⋯ ∫_0^1 F(x1, …, xn) μ1(x1) ⋯ μn(xn) dx1 ⋯ dxn (35.8)

where μ1, …, μn are the n input arguments. That is, it is the average of all density functions
output by F applied over all fuzzy input values. Let us call this type of connector fuzzy-probabilistic.
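A Monte Carlo reading of Equation 35.8 can be sketched as follows. Here each density is represented by a sampler rather than an explicit function, and `noisy_min` is a hypothetical fuzzy-probabilistic connector invented purely for illustration.

```python
import random

BINS = 40

def extend(F, mus, n_samples=20000):
    """Monte Carlo sketch of Eq 35.8: average the output densities of F over
    fuzzy inputs drawn from the input distributions (given as samplers)."""
    hist = [0.0] * BINS
    for _ in range(n_samples):
        args = [mu() for mu in mus]            # sample each input fuzzy value
        out = F(*args)()                       # F returns a density; sample it
        hist[min(BINS - 1, int(out * BINS))] += 1
    return [h * BINS / n_samples for h in hist]   # normalised histogram density

def noisy_min(x1, x2):
    """A hypothetical fuzzy-probabilistic connector: a density centred on
    min(x1, x2) with a little spread."""
    m = min(x1, x2)
    return lambda: min(1.0, max(0.0, random.gauss(m, 0.05)))

uniform = lambda: random.random()
density = extend(noisy_min, [uniform, uniform])
```

For two uniform inputs most of the resulting mass sits below 0.5, as one would expect for a min-like connector.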
In the following we give an example of such a fuzzy-probabilistic connector.
Example with PartOf
Let us consider the RCC relationship PartOf (P for short, as defined in Equation 35.1). A typical
inference rule in the crisp case would be:

P(X, Y)   P(Y, Z)
-----------------  (35.9)
     P(X, Z)

expressing the transitivity of P. But using distributions of fuzzy values we would have the
following rule

P(X, Y) (μ1)   P(Y, Z) (μ2)
---------------------------  (35.10)
       P(X, Z) (μ_POT)
POT stands for PartOf Transitivity. The definition of μ_POT for that particular inference rule
may depend on many assumptions, such as the shapes and sizes of the regions X, Y and Z. In the
following we will give an example of a definition of μ_POT with respect to some oversimplified
assumptions chosen to keep the example short.
Let us define the fuzzy variant of PartOf(X, Y) as the proportion of X which is part of Y
(as suggested in [Yan]). Let us also assume that every region is a unitary circle. In this case,
the required proportion depends solely on the distance dXY between the centers of X and Y,
so we may define a function f that takes that distance and returns the according fuzzy value;
that is, f(dXY) = P(X, Y)
f(dXY) = (4α − 2 dXY sin α) / (2π)   if 0 ≤ dXY ≤ 2
f(dXY) = 0                           if dXY > 2          (35.11)

where α = cos⁻¹(dXY/2).
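Equation 35.11 can be implemented directly; the bisection inverse below is our own numerical shortcut, exploiting the fact that f is monotone decreasing on [0, 2].

```python
import math

def f(d):
    """Eq 35.11: fraction of a unit circle overlapped by another at centre distance d."""
    if d >= 2.0:
        return 0.0
    alpha = math.acos(d / 2.0)
    return (4.0 * alpha - 2.0 * d * math.sin(alpha)) / (2.0 * math.pi)

def f_inv(x):
    """Inverse of f on [0, 1] -> [0, 2], by bisection (f is monotone decreasing)."""
    lo, hi = 0.0, 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if f(mid) > x else (lo, mid)
    return (lo + hi) / 2.0
```

Sanity checks: f(0) = 1 (coincident circles), f(2) = 0 (externally tangent), and f(1) ≈ 0.391, the classical lens-area fraction.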
For 0 ≤ dXY ≤ 2, f(dXY) is monotone decreasing, so the inverse of f, which takes a
fuzzy value and returns a distance, is a function f⁻¹ : [0, 1] → [0, 2].
Let xXY = P(X, Y), xYZ = P(Y, Z), x = P(X, Z), dXY = f⁻¹(xXY), dYZ = f⁻¹(xYZ),
l = |dXY − dYZ| and u = dXY + dYZ. For dXY and dYZ fixed, let g : [0, π] → [l, u] be the function
that takes as input the angle θ between the two lines from the center of Y to X and from Y to Z (as
depicted in Figure 35.4) and returns the distance dXZ. g is defined as follows

g(θ) = √((dXY − dYZ cos θ)² + (dYZ sin θ)²)

So l ≤ dXZ ≤ u. It is easy to see that g is monotone increasing and surjective; therefore there
exists an inverse function g⁻¹ : [l, u] → [0, π]. Let h = f ∘ g, so h takes an angle as input and
Fig. 35.4: dXZ (dashed line) for 3 different angles
returns a fuzzy value, h : [0, π] → [0, 1]. Since f is monotone decreasing and g is monotone
increasing, h is monotone decreasing. Note that the codomain of h is [0, f(l)] if l < 2, or {0}
otherwise. Assuming that l < 2, the inverse of h is a function with the following signature:
h⁻¹ : [0, f(l)] → [0, π]. Using h⁻¹, and assuming that the probability of picking θ ∈ [0, π] is
uniform, we can define the binary connector POT. Let us define ν = POT(xXY, xYZ), recalling
that POT returns a density function, and assuming x < f(l)
ν(x) = lim_{δ→0} (h⁻¹(x) − h⁻¹(x + δ)) / (π δ) = −(h⁻¹)′(x) / π (35.12)

where (h⁻¹)′ is the derivative of h⁻¹. If x ≥ f(l) then ν(x) = 0. For the sake of simplicity the exact
expressions of h⁻¹ and ν(x) have been left out, and the case where one of the fuzzy arguments
xXY, xYZ, or both are null has not been considered, but would be treated similarly, assuming
some probability distribution over the distances dXY and dYZ.
It is now possible to define μ_POT in rule 35.10 (following Equation 35.8):

μ_POT = ∫_0^1 ∫_0^1 POT(x1, x2) μ1(x1) μ2(x2) dx1 dx2 (35.13)
Obviously, assuming that regions are unitary circles is crude; in practice, regions might be
of very different shapes and sizes. In fact it might be so difficult to choose the right assumptions
(and, once they are chosen, to define POT correctly) that in a complex practical context it may be best
to start with overly simplistic assumptions and then learn POT based on the experience of
the agent. So the agent would initially perform spatial reasoning not too accurately, but would
improve over time by adjusting POT, as well as the other connectors corresponding to other
inference rules.
It may also be useful to have more premises containing information about the sizes (e.g.
Big(X)) and shapes (e.g. Long(Y)) of the regions, like

B(X) (μ1)   L(Y) (μ2)   P(X, Y) (μ3)   P(Y, Z) (μ4)
---------------------------------------------------
                 P(X, Z) (μ5)
where B and L stand respectively for Big and Long.
Simplifying Numerical Calculation
Using probability density functions as described above is computationally expensive, and in many practical cases it's overkill. To decrease computational cost, several cruder approaches are possible,
such as discretizing the probability density functions with a coarse resolution, or restricting
attention to beta distributions and treating only their means and variances (as in [Yan]).
The right way to simplify depends on the fuzzy-probabilistic connector involved and on how
much inaccuracy can be tolerated in practice.
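The beta-distribution shortcut can be sketched as follows: each distributional fuzzy value is summarized by a (mean, variance) pair, fitted to a beta by the method of moments, pushed through a connector by sampling, and summarized again. The particular connector (min) and the parameter values are illustrative assumptions.

```python
import random

def beta_params(m, v):
    """Method-of-moments beta fit; assumes 0 < v < m * (1 - m)."""
    c = m * (1.0 - m) / v - 1.0
    return m * c, (1.0 - m) * c

def min_of_betas(mv1, mv2, n=20000):
    """Propagate (mean, variance) pairs through the min connector by sampling,
    then summarise the result by its moments again."""
    a1, b1 = beta_params(*mv1)
    a2, b2 = beta_params(*mv2)
    xs = [min(random.betavariate(a1, b1), random.betavariate(a2, b2))
          for _ in range(n)]
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / n
    return m, v

m, v = min_of_betas((0.7, 0.02), (0.5, 0.02))
```

Note that the mean of the min falls below the smaller input mean (0.5), since min(X1, X2) ≤ X2 with the inequality strict whenever the distributions overlap.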
35.4 Spatio-temporal Inference in PLN
We have discussed the representation of spatiotemporal knowledge, including associated uncertainty. But ultimately what matters is what an intelligent agent can do with this knowledge.
We now turn to uncertain reasoning based on uncertain spatiotemporal knowledge, using the
integration of the above-discussed calculi into the Probabilistic Logic Networks reasoning sys-
tem, an uncertain inference framework designed specifically for AGI and integrated into the
OpenCog AGI framework.
We give here a few examples of spatiotemporal inference rules coded in PLN. Although
the current implementation of PLN incorporates both fuzziness and probability, it does not
have a built-in truth value to represent distributional fuzzy values, or rather a distribution of
distributions of fuzzy values, as this is how, in essence, confidence is represented in PLN. At this
point, depending on design choice and experimentation, it is not clear whether we want to use
the existing truth values and treat them as distributional truth values, or implement a new type
of truth value dedicated to that; so for our present theoretical purposes we will just call it DF
Truth Value.
Due to the highly flexible HOJ formalism (Higher Order Judgment, explained in detail in the PLN
book) we can express the inference rule for the relationship PartOf directly as Nodes
and Links, as follows
ForAllLink $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            PartOf($X, $Y) (tv1)
            PartOf($Y, $Z) (tv2)                                  (35.14)
        ANDLink
            tv3 = μ_POT(tv1, tv2)
            PartOf($X, $Z) (tv3)
where μ_POT is defined in Equation 35.13, but extended over the domain of PLN DF Truth Values
instead of plain distributional fuzzy values. Note that PartOf($X, $Y) (tv) is a shorthand for

EvaluationLink (tv)
    PartOf
    ListLink                                                      (35.15)
        $X
        $Y
and ForAllLink $X $Y $Z is a shorthand for

ForAllLink
    ListLink
        $X                                                        (35.16)
        $Y
        $Z
Of course, one advantage of expressing the inference rule directly in Nodes and Links, rather
than as a built-in PLN inference rule, is that we can use OpenCog itself to improve and refine it,
or even create new spatiotemporal rules based on its experience. In the next 2 examples the
fuzzy-probabilistic connectors are ignored (so no DF Truth Value is indicated), but one could
define them similarly to μ_POT.
First consider a temporal rule from Allen's Interval Algebra. For instance "if $I1 meets $I2
and $I3 is during $I2 then $I3 is after $I1" would be expressed as
ForAllLink $I1 $I2 $I3
    ImplicationLink
        ANDLink
            Meet($I1, $I2)                                        (35.17)
            During($I3, $I2)
        After($I3, $I1)
And a last example with a metric predicate could be "if $X is near $Y and $X is far from $Z then
$Y is far from $Z":
ForAllLink $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            Near($X, $Y)                                          (35.18)
            Far($X, $Z)
        Far($Y, $Z)
That is only a small and partial illustrative example; for instance, other rules may be used to
specify that Near and Far are reflexive and symmetric.
35.5 Examples
The ideas presented here have extremely broad applicability; but for sake of concreteness, we
now give a handful of examples illustrating applications to commonsense reasoning problems.
35.5.1 Spatiotemporal Rules
The rules provided here are reduced to the strict minimum needed for the examples:
1. At $T, if $X is inside $Y and $Y is inside $Z then $X is inside $Z

ForAllLink $T $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            atTime($T, Inside($X, $Y))
            atTime($T, Inside($Y, $Z))
        atTime($T, Inside($X, $Z))
2. If a small object $X is over $Y and $Y is far from $Z then $X is far from $Z

ForAllLink $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            Small($X)
            Over($X, $Y)
            Far($Y, $Z)
        Far($X, $Z)
This rule is expressed in a crisp way, but again is to be understood in an uncertain way, although
we haven't worked out the exact formulae.
35.5.2 The Laptop is Safe from the Rain
A laptop is over the desk in the hotel room, and the desk is far from the window; we want to assess
to what extent the laptop is far from the window, and therefore safe from the rain.
Note that the truth values are ignored, but each concept is to be understood as fuzzy, that
is, as having a PLN Fuzzy Truth Value; the numerical calculations are left out.
We want to assess how far the Laptop is from the window
Far(Window, Laptop)
Assuming the following
1. The laptop is small
Small(Laptop)
2. The laptop is over the desk
Over(Laptop, Desk)
3. The desk is far from the window
Far(Desk, Window)
Now we can show an inference trail that leads to the conclusion; the numerical calculations are
left for later.
1. using axioms 1, 2, 3 and PLN AND rule
ANDLink
Small(Laptop)
Over(Laptop, Desk)
Far(Desk, Window)
2. using spatiotemporal rule 2, instantiated with $X = Laptop, $Y = Desk and $Z = Window
ImplicationLink_HOJ
ANDLink
Small(Laptop)
Over(Laptop, Desk)
Far(Desk, Window)
Far(Laptop, Window)
3. using the result of previous step as premise with PLN implication rule
Far(Laptop, Window)
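Numerically, this trail can be mimicked with assumed fuzzy truth values, the fuzzy minimum for the AND rule, and a Łukasiewicz-style discount for the implication; all three axiom strengths and the rule strength below are invented for illustration.

```python
# assumed fuzzy truth values for the three axioms
small_laptop = 0.9    # Small(Laptop)
over_desk    = 0.95   # Over(Laptop, Desk)
far_desk_win = 0.8    # Far(Desk, Window)

# step 1: PLN AND rule, here taken as the fuzzy minimum
premise = min(small_laptop, over_desk, far_desk_win)

# steps 2-3: apply spatiotemporal rule 2; we assume a strength for the rule
# and combine with the Lukasiewicz t-norm T(x1, x2) = max(0, x1 + x2 - 1)
rule_strength = 0.9
far_laptop_win = max(0.0, premise + rule_strength - 1.0)
```

With these assumed numbers the conclusion Far(Laptop, Window) comes out at 0.7, i.e. fairly far, and therefore fairly safe.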
35.5.3 Fetching the Toy Inside the Upper Cupboard
Suppose we know that there is a toy in an upper cupboard and near a bag, and want to assess
to what extent climbing on the pillow is going to bring us near the toy.
Here are the assumptions:
1. The toy is near the bag and inside the cupboard. The pillow is near and below the cupboard
Near(toy, bag) (tv1)
Inside(toy, cupboard) (tv2)
Below(pillow, cupboard) (tv3)
Near(pillow, cupboard) (tv4)
2. The toy is near the bag inside the cupboard, how near is the toy to the edge of the cupboard?
ImplicationLink_HOJ
    ANDLink
        Near(toy, bag) (tv1)
        Inside(toy, cupboard) (tv2)
    ANDLink
        tv3 = F1(tv1, tv2)
        Near(toy, cupboard_edge) (tv3)
3. If I climb on the pillow, then shortly after I'll be on the pillow
PredictiveImplicationLink
Climb_on(pillow)
Over(self, pillow)
4. If I am on the pillow near the edge of the cupboard, how near am I to the toy?

ImplicationLink_HOJ
    ANDLink
        Below(pillow, cupboard) (tv1)
        Near(pillow, cupboard) (tv2)
        Over(self, pillow) (tv3)
        Near(toy, cupboard_edge) (tv4)
    ANDLink
        tv5 = F2(tv1, tv2, tv3, tv4)
        Near(self, toy) (tv5)
The target theorem is "how near am I to the toy if I climb on the pillow?"
PredictiveImplicationLink
    Climb_on(pillow)
    Near(self, toy) (?)
And the inference chain as follows
1. Axiom 2 with axiom 1
Near(toy, cupboard_edge) (tv6)
2. Step 1 with axiom 1 and 3
PredictiveImplicationLink
    Climb_on(pillow)
    ANDLink
        Below(pillow, cupboard) (tv3)
        Near(pillow, cupboard) (tv4)
        Over(self, pillow) (tv7)
        Near(toy, cupboard_edge) (tv6)
3. Step 2 with axiom 4, giving the target theorem: how near am I to the toy if I climb on the pillow

PredictiveImplicationLink
    Climb_on(pillow)
    Near(self, toy) (tv8)
35.6 An Integrative Approach to Planning
Planning is a major research area in the mainstream AI community, and planning algorithms
have advanced dramatically in the last decade. However, the best of breed planning algorithms
are still not able to deal with planning in complex environments in the face of a high level of
uncertainty, which is the sort of situation routinely faced by humans in everyday life. Really
powerful planning, we suggest, requires an approach different than any of the dedicated planning
algorithms, involving spatiotemporal logic combined with a sophisticated search mechanism
(such as MOSES).
It may be valuable (or even necessary) for an intelligent system involved in planning-intensive
goals to maintain a specialized planning-focused data structure to guide general learning mech-
anisms toward more efficient learning in a planning context. But even if so, we believe planning
must ultimately be done as a case of more general learning, rather than via a specialized algo-
rithm.
The basic approach we suggest here is to
• use MOSES for the core plan learning algorithm. That is, MOSES would maintain a popu-
lation of "candidate partial plans", and evolve this population in an effort to find effective
complete plans.
• use PLN to help in the fitness evaluation of candidate partial plans. That is, PLN would
be used to estimate the probability that a partial plan can be extended into a high-quality
complete plan. This requires PLN to make heavy use of spatiotemporal logic, as described
in the previous sections of this chapter.
• use a GraphPlan-style [BF97] planning graph to record information about candidate plans,
and to propagate information about mutual exclusion between actions. The planning graph
may be used to help guide both MOSES and PLN.
In essence, the planning graph simply records different states of the world that may be achievable, with a high-strength PredictiveImplicationLink pointing between states X and Y if X
can sensibly serve as a predecessor to Y, and a low-strength (but potentially high-confidence)
PredictiveImplicationLink between X and Y if the former excludes the latter. This may be a
subgraph of the Atomspace or it may be separately cached; but in either case it must be frequently
accessed via PLN, in order for the latter to avoid making a massive number of unproductive
inferences in the course of assisting with planning.
One can think of this as being a bit like PGraphPlan [BL99], except that
• MOSES is being used in place of forward or backward chaining search, enabling a more
global search of the plan space (mixing forward and backward learning freely)
• PLN is being used to estimate the value of partial plans, replacing heuristic methods of
value propagation
Regarding PLN, one possibility would be to (explicitly, or in effect) create a special API
function looking something like
EstimateSuccessProbability(PartialPlan PP, Goal G)
(assuming the goal statement contains information about the time allotted to achieve the
goal). The PartialPlan is simply a predicate composed of predicates linked together via temporal
links such as PredictiveImplication and SimultaneousAND. Of course, such a function could be
used within many non-MOSES approaches to planning also.
Put simply, the estimation of the success probability is "just" a matter of asking the PLN
backward-chainer to figure out the truth value of a certain ImplicationLink, i.e.

PredictiveImplicationLink [time-lag T]
    EvaluationLink do PP
    G
But of course, this may be a very difficult inference without some special guidance to help
the backward chainer. The GraphPlan-style planning graph could be used by PLN to guide it
in doing the inference, via telling it what variables to look at, in doing its inferences. This sort
of reasoning also requires PLN to have a fairly robust capability to reason about time intervals
and events occurring therein (i.e., basic temporal inference).
Regarding MOSES, given a candidate plan, it could look into the planning graph to aid
with program tree expansion. That is, given a population of partial plans, MOSES would
progressively add new nodes to each plan, representing predecessors or successors to the actions
already described in the plans. In choosing which nodes to add, it could be probabilistically
biased toward adding nodes suggested by the planning graph.
So, overall what we have is an approach to doing planning via MOSES, with PLN for fitness
estimation - but using a GraphPlan-style planning graph to guide MOSES's exploration of the
neighborhood of partial plans, and to guide PLN's inferences regarding the success likelihood
of partial plans.
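The overall loop can be caricatured in a few lines: a hypothetical planning graph supplies PredictiveImplicationLink strengths between actions, a stand-in for EstimateSuccessProbability chains them, and plan extension is biased toward successors the graph suggests. The action names, strengths, and the multiplicative chaining rule are all invented for illustration; a real system would query the PLN backward chainer instead.

```python
import random

# hypothetical planning graph: PredictiveImplicationLink strengths between actions
succ = {
    "start":       {"grab_key": 0.9, "push_door": 0.2},
    "grab_key":    {"unlock_door": 0.9},
    "push_door":   {"unlock_door": 0.1},
    "unlock_door": {"open_door": 0.95},
}

def success_probability(plan):
    """Stand-in for EstimateSuccessProbability: chain the link strengths
    (a real system would ask the PLN backward chainer instead)."""
    p = 1.0
    for a, b in zip(plan, plan[1:]):
        p *= succ.get(a, {}).get(b, 0.0)
    return p

def extend(plan):
    """MOSES-style expansion, biased toward successors the graph suggests."""
    options = succ.get(plan[-1], {})
    if not options:
        return plan
    actions, weights = zip(*options.items())
    return plan + [random.choices(actions, weights=weights)[0]]

best = max((extend(extend(extend(["start"]))) for _ in range(200)),
           key=success_probability)
```

Over a couple hundred biased expansions the highest-strength chain (grab the key, unlock the door, open it) dominates the population.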
Chapter 36
Adaptive, Integrative Inference Control
36.1 Introduction
The subtlest and most difficult aspect of logical inference is not the logical rule-set nor the
management of uncertainty, but the control of inference: the choice of which inference steps
to take, in what order, in which contexts. Without effective inference control methods, logical
inference is an unscalable and infeasible approach to learning declarative knowledge. One of the
key ideas underlying the CogPrime design is that inference control cannot effectively be handled
by looking at logic alone. Instead, effective inference control must arise from the intersection
between logical methods and other cognitive processes. In this chapter we describe some of the
general principles used for inference control in the CogPrime design.
Logic itself is quite abstract and relatively (though not entirely) independent of the specific
environment and goals with respect to which a system's intelligence is oriented. Inference con-
trol, however, is (among other things) a way of adapting a logic system to operate effectively
with respect to a specific environment and goal-set. So, the reliance of CogPrime's inference
control methods on the integration between multiple cognitive processes, is a reflection of the
foundation of CogPrime on the assumption (articulated in Chapter 9) that the relevant en-
vironment and goals embody interactions between world-structures and interaction-structures
best addressed by these various processes.
36.2 High-Level Control Mechanisms
The PLN implementation in CogPrime is complex and lends itself to utilization via many
different methods. However, a convenient way to think about it is in terms of three basic
backward-focused query operations:
• findtv, which takes in an expression and tries to find its truth value.
• findExamples, which takes an expression containing variables and tries to find concrete
terms to fill in for the variables.
• createExamples, which takes an expression containing variables and tries to create new
Atoms to fill in for the variables, using concept creation heuristics as discussed in Chapter
38, coupled with inference for evaluating the products of concept creation.
and one forward-chaining operation:
• findConclusions, which takes a set of Atoms and seeks to draw the most interesting
possible set of conclusions via combining them with each other and with other knowledge
in the AtomTable.
These inference operations may of course call themselves and each other recursively, thus cre-
ating lengthy chains of diverse inference.
Findtv is quite straightforward, at the high level of discussion adopted here. Various inference
rules may match the Atom; in our current PLN implementation, loosely described below, these
inference rules are executed by objects called Rules. In the course of executing findtv, a decision
must be made regarding how much attention to allocate to each one of these Rule objects, and
some choices must be made by the objects themselves - issues that involve processes beyond
pure inference, and will be discussed later in this chapter. Depending on the inference rules
chosen, findtv may lead to the construction of inferences involving variable expressions, which
may then be evaluated via findExamples or createExamples queries.
The findExamples operation sometimes reduces to a simple search through the AtomSpace.
On the other hand, it can also be done in a subtler way. If the findExamples Rule wants to find
examples of $X so that F($X), but can't find any, then it can perform some sort of heuristic
search, or else it can run another findExamples query, looking for $G so that

Implication $G F

and then running findExamples on $G rather than F. But what if this findExamples query
doesn't come up with anything? Then it needs to run a createExamples query on the same
implication, trying to build a $G satisfying the implication.
Finally, forward-chaining inference (findConclusions) may be conceived of as a special heuris-
tic for handling special kinds of findExample problems. Suppose we have K Atoms and want to
find out what consequences logically ensue from these K Atoms, taken together. We can form
the conjunction of the K Atoms (let's call it C), and then look for $D so that

Implication C $D
Conceptually, this can be approached via findExamples, which defaults to createExamples in
cases where nothing is found. However, this sort of findExamples problem is special, involving
appropriate heuristics for combining the conjuncts contained in the expression C, which embody
the basic logic of forward-chaining rather than backward-chaining inference.
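The interplay of these query operations can be mocked concretely: the toy Atomspace below is just a dict of crisp links, and findtv falls back on the "look for an Implication into the predicate" strategy described above. This is a pedagogical stand-in, not the real PLN API.

```python
# a tiny mock Atomspace: links with crisp truth values
kb = {
    ("Implication", "man", "mortal"): 1.0,
    ("Implication", "greek", "man"): 1.0,
    ("Evaluation", "greek", "Socrates"): 1.0,
}

def findtv(expr, depth=3):
    """Backward query: find the truth value of expr, recursing through
    Implication links (a toy stand-in for the real findtv operation)."""
    if expr in kb:
        return kb[expr]
    if depth == 0:
        return 0.0
    _, pred, arg = expr
    # look for $G with (Implication $G pred), then evaluate $G on the argument
    best = 0.0
    for key, tv in kb.items():
        if key[0] == "Implication" and key[2] == pred:
            sub = findtv(("Evaluation", key[1], arg), depth - 1)
            best = max(best, min(tv, sub))
    return best

def findExamples(pred):
    """Variable query: concrete $X such that pred($X) has nonzero strength."""
    candidates = {k[2] for k in kb if k[0] == "Evaluation"}
    return [x for x in candidates if findtv(("Evaluation", pred, x)) > 0.0]
```

Here findtv on mortal(Socrates) succeeds only by recursing through two Implication links, which is exactly the pattern the text describes.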
36.2.1 The Need for Adaptive Inference Control
It is clear that in humans, inference control is all about context. We use different inference
strategies in different contexts, and learning these strategies is most of what learning to think
is all about. One might think to approach this aspect of cognition, in the CogPrime design, by
introducing a variety of different inference control heuristics, each one giving a certain algorithm
for choosing which inferences to carry out in which order in a certain context. (This is similar to
what has been done within Cyc, for example http://cyc.com.) However, in keeping with the
integrated intelligence theme that pervades CogPrime, we have chosen an alternate strategy for
PLN. We have one inference control scheme, which is quite simple, but which relies partially on
structures coming from outside PLN proper. The requisite variety of inference control strategies
is provided by variety in the non-PLN structures such as
• HebbianLinks existing in the AtomTable.
• Patterns recognized via pattern-mining in the corpus of prior inference trails
36.3 Inference Control in PLN
We will now describe the basic "inference control" loop of PLN in CogPrime. Pre-2013 OpenCog
versions used a somewhat different scheme, more similar to a traditional logic engine. The ap-
proach presented here is more cognitive synergy oriented, achieving PLN control via a combi-
nation of logic engine style methods and integration with attention allocation.
36.3.1 Representing PLN Rules as GroundedSchemaNodes
PLN inference rules may be represented as GroundedSchemaNodes. So for instance the PLN
Deduction Rule becomes a GroundedSchemaNode with the properties:
• Input: a pair of links (L1, L2), where L1 and L2 are the same type, which must be one of
InheritanceLink, ImplicationLink, SubsetLink or ExtensionalImplicationLink
• Output: a single link, of the same type as the input
The actual PLN Rules and Formulas are then packed into the internal execution methods of
GroundedSchemaNodes.
In the current PLN code, each inference rule has a Rule class and a separate Formula class.
So then, e.g., the PLNDeductionRule GroundedSchemaNode invokes a function of the general
form
Link PLNDeductionRule(Link L1, Link L2)
which calculates the deductive consequence of two links. This function then invokes a function
of the form
TruthValue PLNDeductionFormula(TruthValue tAB, TruthValue tBC, TruthValue tA, TruthValue tB, TruthValue tC)
which in turn invokes functions such as
SimpleTruthValue SimplePLNDeductionFormula(SimpleTruthValue tAB, SimpleTruthValue tBC, SimpleTruthValue tA, SimpleTruthValue tB, SimpleTruthValue tC)
IndefiniteTruthValue IndefinitePLNDeductionFormula(IndefiniteTruthValue tAB, IndefiniteTruthValue tBC, IndefiniteTruthValue tA, IndefiniteTruthValue tB, IndefiniteTruthValue tC)
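For orientation, the strength part of the simple independence-based deduction formula can be sketched in a few lines of Python; the real implementation also checks the consistency of its inputs and tracks confidence, both omitted here.

```python
def simple_deduction_strength(sAB, sBC, sA, sB, sC):
    """Independence-based PLN deduction strength: estimate s(A->C) from
    s(A->B), s(B->C) and the term probabilities sA, sB, sC.
    Sketch only; consistency checks and confidence handling are omitted."""
    if sB >= 1.0 - 1e-12:
        return sC
    return sAB * sBC + (1.0 - sAB) * (sC - sB * sBC) / (1.0 - sB)
```

As a quick sanity check, when sAB = 1 the formula reduces to sBC, as it should: if all of A lies in B, then A inherits B's relationship to C.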
36.3.2 Recording Executed PLN Inferences in the Atomspace
Once an inference has been carried out, it can be represented in the Atomspace, e.g. as
ExecutionLink
GroundedSchemaNode: PLNDeductionRule
ListLink
HypotheticalLink
InheritanceLink people animal <tvl>
HypotheticalLink
InheritanceLink animal breathe <tv2>
HypotheticalLink
InheritanceLink people breathe <tv3>
Note that a link such as
InheritanceLink people breathe <.8,.2>
will have its truth value stored as a truth value version within a CompositeTruthValue object.
In the above, e.g.
InheritanceLink people animal
is used as shorthand for
InheritanceLink C1 C2

where C1 and C2 are ConceptNodes representing "people" and "animal" respectively.
We can also have records of inferences involving variables, such as
ExecutionLink
GroundedSchemaNode: PLNDeductionRule
ListLink
HypotheticalLink
InheritanceLink $V1 animal <tvl>
HypotheticalLink
InheritanceLink animal breathe <tv2>
HypotheticalLink
InheritanceLink $V1 breathe <tv3>
where $V1 is a specific VariableNode.
36.3.3 Anatomy of a Single Inference Step
A single inference step, then, may be viewed as follows:
1. Choose an inference rule R, and a tuple of Atoms that collectively match the input conditions
of the rule
2. Apply the chosen rule R to the chosen input Atoms
3. Create an ExecutionLink recording the output found
4. In addition to retaining this ExecutionLink in the Atomspace, also save a copy of it to the
InferenceRepository (this is not needed for the very first implementation, but will be very
useful once PLN is in regular use)
The InferenceRepository, referred to here, is a special Atomspace that exists just to save a
record of PLN inferences. It can be mined, after the fact, to learn inference patterns, which can
be used to guide future inferences.
36.3.4 Basic Forward and Backward Inference Steps
The choice of an inference step, at the microscopic level, may be done in a number of ways, of
which perhaps the simplest are:
• "Basic forward step." Choose an Atom A1, then choose a rule R. If R only takes one input,
then apply R to A1. If R applies to two Atoms, then find another Atom A2 so that (A1,
A2) may be taken as the inputs of R.
• "Basic backward step." Choose an Atom A1, then choose a rule R. If R takes only one
input, then find an Atom A2 so that applying R to A2 yields A1 as output. If R takes two
inputs, then find two Atoms (A2, A3) so that applying R to (A2, A3) yields A1 as output.
Given a target Atom such as

A1 = Inheritance $V1 breathe

the VariableAbstractionRule will do inferences such as

ExecutionLink
    VariableAbstractionRule
    HypotheticalLink
        Inheritance people breathe
    HypotheticalLink
        Inheritance $V1 breathe
This allows the basic backward step to carry out variable fulfillment queries as well as truth
value queries. We may encapsulate these processes in the Atomspace as

GroundedSchemaNode: BasicForwardInferenceStep
GroundedSchemaNode: BasicBackwardInferenceStep

which take as input some Atom A1, and also as

GroundedSchemaNode: AttentionalForwardInferenceStep
GroundedSchemaNode: AttentionalBackwardInferenceStep

which automatically choose the Atom A1 they start with, via choosing some Atom within the
AttentionalFocus, with probability proportional to STI.
Forward chaining, in its simplest form, then becomes the process of repeatedly executing
the AttentionalForwardInferenceStep SchemaNode.
Backward chaining, in the simplest case (we will discuss more complex cases below), becomes
the process of:
1. Repeatedly executing the BasicBackwardInferenceStep SchemaNode, starting from a given
target Atom
2. Concurrently, repeatedly executing the AttentionalBackwardInferenceStep SchemaNode, to
ensure that backward inference keeps occurring, regarding Atoms that were created via Step
1
Inside the BasicForwardStep or BasicBackwardStep schema, there are two choices to be
made: choosing a rule R, and then choosing additional Atoms A2 and possibly A3.
The choice of the rule R should be made probabilistically, choosing each rule with probability
proportional to a certain weight associated with each rule. Initially we can assign these weights
generically, by hand, separately for each application domain. Later on they should be chosen
adaptively, based on information mined from the InferenceRepository, regarding which rules
have been better in which contexts.
The choice of the additional Atoms A2 and A3 is subtler, and should be done using STI
values as a guide:
• First the AttentionalFocus is searched, to find all the Atoms there that fit the input criteria
of the rule R. Among all the Atoms found, an Atom is chosen with probability proportional
to STI.
• If the AttentionalFocus doesn't contain anything suitable, then an effort may be made to
search the rest of the Atomspace to find something suitable. If multiple candidates are
found within the amount of effort allotted, then one should be chosen with probability
proportional to STI
If an Atom A is produced as output of a forward inference step, or is chosen as the input of a
backward inference step, then the STI of this Atom A should be incremented. This will increase
the probability of A being chosen for ongoing inference. In this way, attention allocation is used
to guide the course of ongoing inference.
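The STI-proportional choice described above amounts to roulette-wheel selection, which can be sketched as follows; the example Atoms and STI numbers are invented.

```python
import random

def pick_by_sti(atoms):
    """Choose an Atom with probability proportional to its STI value."""
    total = sum(sti for _, sti in atoms)
    r = random.uniform(0.0, total)
    for atom, sti in atoms:
        r -= sti
        if r <= 0.0:
            return atom
    return atoms[-1][0]          # guard against floating-point leftovers

# a toy AttentionalFocus: (Atom, STI) pairs
attentional_focus = [("InheritanceLink Bob rich", 80.0),
                     ("InheritanceLink cat animal", 15.0),
                     ("InheritanceLink sky blue", 5.0)]

counts = {a: 0 for a, _ in attentional_focus}
for _ in range(10000):
    counts[pick_by_sti(attentional_focus)] += 1
```

Over many draws the selection frequencies track the STI proportions, so high-STI Atoms dominate ongoing inference without entirely starving the rest.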
36.3.5 Interaction of Forward and Backward Inference
Starting from a target, a series of backward inferences can figure out ways to estimate the truth
value of that target, or fill in the variables within that target.
However, once the backward-going chain of inferences is done (to some reasonable degree
of satisfaction), there is still the remaining task of using all the conclusions drawn during the
series of backward inferences, to actually update the target.
Elegantly, this can be done via forward inference. So if forward and backward inference
are both operating concurrently on the same pool of Atoms, it is forward inference that will
propagate the information learned during backward chaining inference, up to the target of the
backward chain.
36.3.6 Coordinating Variable Bindings
Probably the thorniest subtlety that comes up in a PLN implementation is the coordination of
the values assigned to variables, across different micro-level inferences that are supposed to be
coordinated together as part of the same macro-level inference.
For a very simple example, suppose we have a truth-value query with target
A1 = InheritanceLink Bob rich
Suppose the deduction rule R is chosen.
Then if we can find (A2, A3) that look like, say,
A2 = InheritanceLink Bob owns_mansion
A3 = InheritanceLink owns_mansion rich
36.3 Inference Control in PLN 283
then our problem is solved.
But what if no such simple solution is available in the Atomspace? Then we have to
build something like
A2 = InheritanceLink Bob $v1
A3 = InheritanceLink $v1 rich
and try to find something that works to fill in the variable $v1.
But this is tricky, because $v1 now has two constraints (A2 and A3). So, suppose A2 and
A3 are both created as a result of applying BasicBackwardInferenceStep to A1, and thus A2
and A3 both get high STI values. Then both A2 and A3 are going to be acted on by
AttentionalBackwardInferenceStep. But as A2 and A3 are produced via other inputs using backward
inference, it is necessary that the values assigned to $v1 in the context of A2 and A3 remain
consistent with each other.
Note that, according to the operation of the Atomspace, the same VariableAtom will be used
to represent $v1 no matter where it occurs.
For instance, it will be problematic if one inference rule schema tries to instantiate $v1 with
"owns_mansion", but another tries to instantiate $v1 with "lives_in_Manhattan".
That is, we don't want to find
InheritanceLink Bob lives_in_mansion
InheritanceLink lives_in_mansion owns_mansion
|-
InheritanceLink Bob owns_mansion
which binds $v1 to owns_mansion, and
InheritanceLink lives_in_Manhattan lives_in_top_city
InheritanceLink lives_in_top_city rich
|-
InheritanceLink lives_in_Manhattan rich
which binds $v1 to lives_in_Manhattan.
We want A2 and A3 to be derived in ways that bind $v1 to the same thing.
The most straightforward way to avoid confusion in this sort of context is to introduce an
additional kind of inference step:
• Variable-guided backward step. Choose a set V of VariableNodes (which may just be
a single VariableNode $v1), and identify the set S_V of all Atoms involving any of the
variables in V.
- Firstly: If V divides into two sets V1 and V2, so that no Atom contains variables in
both V1 and V2, then launch separate variable-guided backward steps for V1 and V2.
This step is "Problem Decomposition".
- Carry out the basic backward step for all the Atoms in S_V, but restricting the search
for Atoms A2, A3 in such a way that each of the variables in V is consistently instan-
tiated. This is a non-trivial optimization, and more will be said about this below.
• Variable-guided backward step, Atom-triggered. Choose an Atom A1. Identify the set
V of VariableNodes targeted by A1, and then do a variable-guided backward step starting
from V.
This variable guidance may, of course, be incorporated into the AttentionalBackwardInferenceStep
as well. In this case, backward chaining becomes the process of
• Repeatedly executing the VariableGuidedBackwardInferenceStep SchemaNode, starting
from a given target Atom
• Concurrently, repeatedly executing the AttentionalVariableGuidedBackwardInferenceStep
SchemaNode, to ensure that backward inference keeps occurring, regarding Atoms that were
created via Step 1
The hard work here is then done in step 2 of the Variable Guided Backward Step, which has
to search for multiple Atoms, to fulfill the requirements of multiple inference rules, in a way
that keeps consistent variable instantiations. But this same difficulty exists in a conventional
backward chaining framework; it's just arranged differently, and not as neatly encapsulated.
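To make "consistently instantiated" concrete, here is a toy backtracking matcher over tuple-encoded links. A single binding dictionary is threaded through all patterns, so $v1 can never be bound two different ways; the knowledge base encoding and the unify/solve helpers are illustrative stand-ins, not the real Atomspace pattern-matcher API:

```python
def unify(pattern, fact, binding):
    """Try to extend `binding` so that `pattern` matches `fact`.
    Returns the extended binding, or None on mismatch."""
    if len(pattern) != len(fact):
        return None
    b = dict(binding)
    for p, f in zip(pattern, fact):
        if p.startswith("$"):          # a variable like "$v1"
            if p in b and b[p] != f:   # already bound differently: conflict
                return None
            b[p] = f
        elif p != f:                   # constant mismatch
            return None
    return b

def solve(patterns, kb, binding=None):
    """Backtracking search for one binding satisfying all patterns at once."""
    binding = {} if binding is None else binding
    if not patterns:
        return binding
    first, rest = patterns[0], patterns[1:]
    for fact in kb:
        b = unify(first, fact, binding)
        if b is not None:
            result = solve(rest, kb, b)
            if result is not None:
                return result
    return None  # no consistent instantiation exists
```

On the Bob example, solving (Inheritance Bob $v1) and (Inheritance $v1 rich) together forces $v1 = owns_mansion; the inconsistent lives_in_Manhattan binding is rejected during backtracking.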
36.3.7 An Example of Problem Decomposition
Illustrating a point raised above, we now give an example of a case where, given a problem of
finding values to assign a set of variables to make a set of expressions hold simultaneously, the
appropriate course is to divide the set of expressions into two separate parts.
Suppose we have the six expressions
E1 = Inheritance( $v1, Animal )
E2 = Evaluation( $v1, ($v2, Bacon) )
E3 = Inheritance( $v2, $v3 )
E4 = Evaluation( Eat, ($v3, $v1) )
E5 = Evaluation( Eat, ($v7, $v9) )
E6 = Inheritance( $v9, $v6 )
Since the set {E1, E2, E3, E4} doesn't share any variables with {E5, E6}, there is no reason
to consider them all as one problem. Rather we will do better to decompose it into two problems,
one involving {E1, E2, E3, E4} and one involving {E5, E6}.
In general, given a set of expressions, one can divide it into subsets, where each subset S has
the property that: for every variable v contained in S, all occurrences of v in the Atomspace
are in expressions contained in S.
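This decomposition can be sketched as grouping expressions into connected components under the "shares a variable" relation. The nested-tuple encoding of expressions is illustrative, not the Atomspace representation:

```python
def variables_of(expr):
    """Collect the $-variables mentioned anywhere in a nested expression."""
    if isinstance(expr, str):
        return {expr} if expr.startswith("$") else set()
    vs = set()
    for part in expr:
        vs |= variables_of(part)
    return vs

def decompose(exprs):
    """Group expressions into components that share no variables
    with each other (incremental connected-component merging)."""
    groups = []  # list of (variable_set, expression_list) pairs
    for e in exprs:
        vs = variables_of(e)
        merged_vars, merged_exprs = set(vs), [e]
        remaining = []
        for gv, ge in groups:
            if gv & vs:                       # shares a variable: merge
                merged_vars |= gv
                merged_exprs = ge + merged_exprs
            else:
                remaining.append((gv, ge))
        groups = remaining + [(merged_vars, merged_exprs)]
    return [ge for _, ge in groups]
```

Run on the six expressions above, this yields exactly the two sub-problems {E1, E2, E3, E4} and {E5, E6}.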
36.3.8 Example of Casting a Variable Assignment Problem as an
Optimization Problem
Suppose we have the four expressions
E1 = Inheritance( $v1, Animal )
E2 = Evaluation( $v2, ($v1, Bacon) )
E3 = Inheritance( $v2, $v3 )
E4 = Evaluation( Enjoy, ($v1, $v3) )
where Animal, Bacon and Enjoy are specific Atoms.
Suppose the task at hand is to find values for ($v1, $v2, $v3) that will make all of these
expressions confidently true.
If there is some assignment
($v1, $v2, $v3) = (A1, A2, A3)
ready to hand in the Atomspace, that fulfills the equations El, E2, E3, E4, then the Atomspace
API's pattern matcher will find it. For instance,
($v1, $v2, $v3) = (Cat, Eat, Chew)
would work here, since
E1 = Inheritance( Cat, Animal )
E2 = Evaluation( Eat, (Cat, Bacon) )
E3 = Inheritance( Eat, Chew )
E4 = Evaluation( Enjoy, (Cat, Chew) )
are all reasonably true.
If there is no such assignment ready to hand, then one is faced with a search problem. This
can be approached as an optimization problem, e.g. one of maximizing a function
f($v1, $v2, $v3) = sc(E1) * sc(E2) * sc(E3)
where
sc(A) = A.strength * A.confidence
The function f is then a function with signature
f : Atom^3 --> float
f can then be optimized by a host of optimization algorithms. For instance a genetic algorithm
approach might work, but a BOA (Bayesian Optimization Algorithm) approach would probably
be better.
In a GA approach, mutation would work as follows. Suppose one had a candidate
($v1, $v2, $v3) = (A1, A2, A3)
Then one could mutate this candidate by (for instance) replacing A1 with some other Atom
that is similar to A1, e.g. connected to A1 with a high-weight SimilarityLink in the Atomspace.
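A minimal sketch of the fitness function f for the example above. The truth-value table and the convention that unknown expressions score zero are assumptions made for illustration:

```python
# Hypothetical truth-value store mapping a grounded expression
# to (strength, confidence); the entries are invented for this sketch.
TV = {
    ("Inheritance", "Cat", "Animal"): (0.9, 0.9),
    ("Evaluation", "Eat", ("Cat", "Bacon")): (0.8, 0.7),
    ("Inheritance", "Eat", "Chew"): (0.7, 0.6),
    ("Evaluation", "Enjoy", ("Cat", "Chew")): (0.8, 0.8),
}

def sc(expr):
    """sc(A) = A.strength * A.confidence; unknown expressions score 0."""
    s, c = TV.get(expr, (0.0, 0.0))
    return s * c

def ground(template, assignment):
    """Substitute variables throughout a nested template tuple."""
    if isinstance(template, str):
        return assignment.get(template, template)
    return tuple(ground(t, assignment) for t in template)

def fitness(exprs, assignment):
    """f = product of sc(E) over all grounded expressions."""
    f = 1.0
    for e in exprs:
        f *= sc(ground(e, assignment))
    return f
```

A GA or BOA would then maximize `fitness` over assignments; the mutation operator described above corresponds to swapping one Atom in the assignment for a similar one.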
36.3.9 Backward Chaining via Nested Optimization
Given this framework that does inference involving variables via using optimization to solve
simultaneous equations of logical expressions with overlapping variables, "backward chaining"
becomes the iterative launch of repeated optimization problems, each one defined in terms of
the previous ones. We will now illustrate this point by continuing with the {E1, E2, E3, E4}
example from above. Suppose one found an assignment
($v1, $v2, $v3) = (A1, A2, A3)
that worked for every equation except E3. Then there is the problem of finding some way to
make
E3 = Inheritance( A2, A3 )
work.
For instance, what if we have found the assignment
($v1, $v2, $v3) = (Cat, Eat, Chase)
In this case, we have
E1 = Inheritance( Cat, Animal ) -- YES
E2 = Evaluation( Eat, (Cat, Bacon) ) -- YES
E3 = Inheritance( Eat, Chase ) -- NO
E4 = Evaluation( Enjoy, (Cat, Chase) ) -- YES
so the assignment works for every equation except E3. Then there is the problem of finding
some way to make
E3 = Inheritance( Eat, Chase )
work. But if the truth value of
Inheritance( Eat, Chase)
has a low strength and high confidence, this may seem hopeless, so this assignment may not
get followed up on.
On the other hand, we might have the assignment
($v1, $v2, $v3) = (Cat, Eat, SocialActivity)
In this case, for a particular CogPrime instance, we might have
E1 = Inheritance( Cat, Animal ) -- YES
E2 = Evaluation( Eat, (Cat, Bacon) ) -- YES
E3 = Inheritance( Eat, SocialActivity ) -- UNKNOWN
E4 = Evaluation( Enjoy, (Cat, SocialActivity) ) -- YES
The above would hold if the reasoning system knew that cats enjoy social activities, but did not
know whether eating is a social activity. In this case, the reasoning system would have reason
to launch a new inference process aimed at assessing the truth value of
E3 = Inheritance( Eat, SocialActivity )
This is backward chaining: Launching a new inference process to figure out a question raised
by another inference process.
For instance, in this case the inference engine might: Choose an inference Rule (let's say it's
Deduction, for simplicity), and then look for $v4 so that
Inheritance Eat $v4
Inheritance $v4 SocialActivity
are both true. In this case one has spawned a new Variable-Guided Backward Inference
problem, which must be solved in order to make {A1, A2, A3} an OK solution for the problem
of {El, E2, E3, E4}.
Or it might choose the Induction rule, and look for $v4 so that
Inheritance $v4 Eat
Inheritance $v4 SocialActivity
Maybe then it would find that $v4=Dinner works, because it knows that
Inheritance Dinner Eat
Inheritance Dinner SocialActivity
But maybe $v4=Dinner doesn't boost the truth value of
E3 - Inheritance( Eat, SocialActivity)
high enough. In that case it may keep searching for more information about E3 in the context
of this particular variable assignment. It might choose Induction again, and discover e.g. that
Inheritance Lunch Eat
Inheritance Lunch SocialActivity
In this example, we've assumed that some non-backward-chaining heuristic search mechanism
found a solution that almost works, so that backward chaining is only needed on E3. But of
course, one could backward chain on all of El, E2, E3, E4 simultaneously - or various subsets
thereof.
For a simple example, suppose one backward chains on
E1 = Inheritance( $v1, Animal )
E3 = Inheritance( $v2, SocialActivity )
simultaneously. Then one is seeking, say, ($v4, $v5) so that
Inheritance $v1 $v5
Inheritance $v5 Animal
Inheritance $v2 $v4
Inheritance $v4 SocialActivity
This adds no complexity, as the four relations partition into two disjoint sets of two. Separate
chaining processes may be carried out for El and E3.
On the other hand, for a slightly more complex example, what if we backward chain on
E2 = Evaluation( $v2, ($v1, Bacon) )
E3 = Inheritance( $v2, SocialActivity )
simultaneously? (Assuming that a decision has already been made to explore the possibility
$v3 = SocialActivity.) Then we have a somewhat more complex situation. We are trying to find
$v2 that is a SocialActivity, so that $v1 likes to do $v2 in conjunction with Bacon.
If the Member2Evaluation rule is chosen for E2 and the Deduction rule is chosen for E3,
then we have
E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E7 = Member ($v1, Bacon) (SatisfyingSet $v2)
and if the Inheritance2Member rule is then chosen for E7, we have
E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E8 = Inheritance ($v1, Bacon) (SatisfyingSet $v2)
and if Deduction is then chosen for E8 then we have
E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E9 = Inheritance ($v1, Bacon) $v8
E10 = Inheritance $v8 (SatisfyingSet $v2)
Following these steps expands the search to involve more variables and means the inference
engine now gets to deal with
E1 = Inheritance( $v1, Animal )
E4 = Evaluation( Enjoy, ($v1, SocialActivity) )
E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E9 = Inheritance ($v1, Bacon) $v8
E10 = Inheritance $v8 (SatisfyingSet $v2)
or some such; i.e. we have expanded our problem to include more and more simultaneous logical
equations in more and more variables! Which is not necessarily a terrible thing, but it does get
complicated.
We might find, for example, that $v1 = Pig, $v6 = Dance, $v2 = Waltz, $v8 = PiggyWaltz:
E1 = Inheritance( Pig, Animal )
E4 = Evaluation( Enjoy, (Pig, SocialActivity) )
E5 = Inheritance Waltz Dance
E6 = Inheritance Dance SocialActivity
E9 = Inheritance (Pig, Bacon) PiggyWaltz
E10 = Inheritance PiggyWaltz (SatisfyingSet Waltz)
Here PiggyWaltz is a special dance that pigs do with their Bacon, as a SocialActivity!
Of course, this example is extremely contrived. Real inference examples will rarely be this
simple, and will not generally involve Nodes that have simple English names. This example is
just for illustration of the concepts involved.
36.4 Combining Backward and Forward Inference Steps with
Attention Allocation to Achieve the Same Effect as Backward
Chaining (and Even Smarter Inference Dynamics)
Backward chaining is a powerful heuristic, but one can achieve the same effect - and even
smarter inference dynamics - via a combination of:
• heuristic search to satisfy simultaneous expressions
• boosting the STI of expressions being searched
• importance spreading (of STI)
• ongoing background forward inference
These mechanisms combine to yield the same basic effect as backward chaining, but without
explicitly doing backward chaining.
The basic idea is: When a system of expressions involving variables is explored using a GA or
whatever other optimization process is deployed, these expressions also get their STI boosted.
Then, the atoms with high STI are explored by the forward inference process, which is
always acting in the background on the atoms in the Atomspace. Other atoms related to these
also get STI via importance spreading. And these other related Atoms are then acted on by
forward inference as well.
This forward chaining will then lead to the formation of new Atoms, which may make the
solution of the system of expressions easier the next time it is visited by the backward inference
process.
In the above example, this means:
• E1, E2, E3, E4 will all get their STI boosted
• Other Atoms related to these (Animal, Bacon and Enjoy) will also get their STI boosted
• These other Atoms will get forward inference done on them
• This forward inference will then yield new Atoms that can be drawn on when the solution
of the expression-system E1, E2, E3, E4 is pursued the next time
So, for example, if the system did not know that eating is a social activity, it might learn this
during forward inference on SocialActivity. The fact that SocialActivity has high STI would
cause forward inferences such as
Inheritance Dinner Eat
Inheritance Dinner SocialActivity
Inheritance Eat SocialActivity
to get done. These forward inferences would then produce links that could simply be found by
the pattern matcher when trying to find variable assignments to satisfy {E1, E2, E3, E4}.
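The importance-spreading component of this dynamic can be sketched as a weighted diffusion step. The decay constant and the dict-of-links encoding are illustrative assumptions, not ECAN's actual update equations:

```python
def spread_importance(sti, links, decay=0.5):
    """One round of importance spreading: each atom passes a fraction
    (decay * link_weight) of its STI along each outgoing link.
    `sti` maps atom -> STI; `links` maps (src, dst) -> weight."""
    delta = {a: 0.0 for a in sti}
    for (src, dst), weight in links.items():
        delta[dst] += decay * weight * sti[src]
    return {a: sti[a] + delta[a] for a in sti}
```

For example, boosting SocialActivity's STI and iterating the spread lets importance flow to Dinner and then to Eat, which is what brings those Atoms to the attention of the background forward-inference process.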
36.4.1 Breakdown into MindAgents
To make this sort of PLN dynamic work, we require a number of MindAgents to be operating
"ambiently" in the background whenever inference is occurring; to wit:
• attentional forward chaining (i.e. each time this MindAgent is invoked, it chooses high-STI
Atoms and does basic forward chaining on them)
• attention allocation (importance updating is critical, Hebbian learning is also useful)
• attentional (variable guided) backward chaining
On top of this ambient inference, we may then have query-driven backward chaining inferences
submitted by other processes (via these launching backward inference steps and giving the
associated Atoms lots of STI). The ambient inference processes will help the query-driven
inference processes to get fulfilled.
36.5 Hebbian Inference Control
A key aspect of the PLN control mechanism described here is the use of attention allocation
to guide inference. A key aspect here is the use of attention allocation to guide Atom choice in
the course of forward and backward inference. Figure 36.1 gives a simple illustrative example
of the use of attention allocation, via HebbianLinks, for PLN backward chaining.
The semantics of a HebbianLink between A and B is, intuitively: In the past, when A was
important, B was also important. HebbianLinks are created via two basic mechanisms: pattern-
mining of associations between importances in the system's history, and PLN inference based
on HebbianLinks created via pattern mining (and inference). Thus, saying that PLN inference
control relies largely on HebbianLinks is in part saying that PLN inference control relies on
PLN. There is a bit of a recursion here, but it's not a bottomless recursion because it bottoms
out with HebbianLinks learned via pattern mining.

[Fig. 36.1: The Use of Attention Allocation for Guiding Backward Chaining Inference. The
figure depicts backward chaining on (WILBUR is FRIENDLY) via Deduction from (WILBUR
is a PIG) and (PIG is FRIENDLY), alongside a search of episodic memory for episodes with
friendly or unfriendly pigs; importance spreading gives WILBUR, PIG and FRIENDLY high
STI, illustrating declarative-attentional interaction.]
As an example of the Atom-choices to be made by a forward or backward inference agent
in the course of doing inference, consider that to evaluate (Inheritance A C) via the deduction
Rule, some collection of intermediate nodes for the deduction must be chosen. In the case of
higher-order deduction, each deduction may involve a number of complicated subsidiary steps,
so perhaps only a single intermediate node will be chosen. This choice of intermediate nodes
must be made via context-dependent prior probabilities. In the case of other Rules besides
deduction, other similar choices must be made.
The basic means of using HebbianLinks in inferential Atom-choice is simple: If there are
Atoms linked via HebbianLinks with the other Atoms in the inference tree, then these Atoms
should be given preference in the selection process.
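This preference can be sketched as ranking candidate Atoms by their total HebbianLink strength to the Atoms already in the inference tree. The dict-of-pairs encoding of HebbianLinks is an assumption made for illustration:

```python
def rank_candidates(candidates, tree_atoms, hebbian):
    """Order candidate atoms by total HebbianLink strength connecting
    them (in either direction) to atoms already in the inference tree."""
    def score(c):
        return sum(hebbian.get((c, t), 0.0) + hebbian.get((t, c), 0.0)
                   for t in tree_atoms)
    return sorted(candidates, key=score, reverse=True)
```

In practice one would sample from this ranking probabilistically rather than always taking the top candidate, so that weakly-associated Atoms still occasionally get explored.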
Along the same lines but more subtly, another valuable heuristic for guiding inference control
is "on-the-fly associatedness assessment." If there is a chance to apply the chosen Rule via
working with Atoms that are:
• strongly associated with the Atoms in the Atom being evaluated (via HebbianLinks)
• strongly associated with each other via HebbianLinks (hence forming a cohesive set)
then this should be ranked as a good thing.
For instance, it may be the case that, when doing deduction regarding relationships between
humans, using relationships involving other humans as intermediate nodes in the deduction is
often useful. Formally this means that, when doing inference of the form:
AND
    Inheritance A human
    Inheritance A B
    Inheritance C human
    Inheritance C B
|-
Inheritance A C
then it is often valuable to choose B so that:
HebbianLink B human
has high strength. This would follow from the above-mentioned heuristic.
Next, suppose one has noticed a more particular heuristic - that in trying to reason about
humans, it is particularly useful to think about their wants. This suggests that in abductions
of the above form it is often useful to choose B of the form:
SatisfyingSet [ wants(human, $X) ]
This is too fine-grained of a cognitive-control intuition to come from simple association-
following. Instead, it requires fairly specific data-mining of the system's inference history. It
requires the recognition of "Hebbian predicates" of the form:
HebbianImplication
    AND
        Inheritance $A human
        Inheritance $C human
        Similarity
            $B
            SatisfyingSet
                Evaluation wants (human, $X)
    AND
        Inheritance $A $B
        Inheritance $C $B
The semantics of:
HebbianImplication X Y
is that when X is being thought about, it is often valuable to think about Y shortly thereafter.
So what is required to do inference control according to heuristics like "think about humans
according to their wants" is a kind of backward-chaining inference that combines Hebbian
implications with PLN inference rules. PLN inference says that to assess the relationship between
two people, one approach is abduction. But Hebbian learning says that when setting up an
abduction between two people, one useful precondition is if the intermediate term in the ab-
duction regards wants. Then a check can be made whether there are any relevant intermediate
terms regarding wants in the system's memory.
What we see here is that the overall inference control strategy can be quite simple. For each
Rule that can be applied, a check can be made for whether there is any relevant Hebbian knowl-
edge regarding the general constructs involved in the Atoms this Rule would be manipulating.
If so, then the prior probability of this Rule is increased, for the purposes of the Rule-choice
bandit problem. Then, if the Rule is chosen, the specific Atoms this Rule would involve in the
inference can be summoned up, and the relevant Hebbian knowledge regarding these Atoms
can be utilized.
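The Rule-choice step just described can be sketched as bandit-style sampling, with each Rule's base weight multiplied by a boost reflecting relevant Hebbian knowledge. The weight and boost values here are hypothetical:

```python
import random

def choose_rule(rules, base_weights, hebbian_boost, rng=random.random):
    """Sample a rule with probability proportional to
    base_weight * hebbian_boost (boost defaults to 1.0 when no
    relevant Hebbian knowledge was found for the rule)."""
    adjusted = [base_weights[r] * hebbian_boost.get(r, 1.0) for r in rules]
    total = sum(adjusted)
    r = rng() * total
    acc = 0.0
    for rule, w in zip(rules, adjusted):
        acc += w
        if r < acc:
            return rule
    return rules[-1]
```

Once a Rule is sampled, the specific Atoms it would involve are summoned up and the Hebbian knowledge about those Atoms guides the finer-grained choices, as described above.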
To take another similar example, suppose we want to evaluate:
Inheritance pig dog
via the deduction Rule (which also carries out induction and abduction). There are a lot of
possible intermediate terms, but a reasonable heuristic is to ask a few basic questions about
them: How do they move around? What do they eat? How do they reproduce? How intelligent
are they? Some of these standard questions correspond to particular intermediate terms, e.g.
the intelligence question partly boils down to computing:
Inheritance pig intelligent
and:
Inheritance dog intelligent
So a link:
HebbianImplication animal intelligent
may be all that's needed to guide inference to asking this question. This HebbianLink says that
when thinking about animals, it's often interesting to think about intelligence. This should bias
the system to choose "intelligent" as an intermediate node for inference.
On the other hand, the "what do they eat" question is subtler and boils down to asking: Find
$X so that when:
R($X) = SatisfyingSet[$Y] eats($Y, $X)
holds (R($X) is a concept representing the things that eat $X), then we have:
Inheritance pig R($X)
and:
Inheritance dog R($X)
In this case, a HebbianLink from animal to eat would not really be fine-grained enough. Instead
we want a link of the form:
HebbianImplication
    Inheritance $X animal
    SatisfyingSet[$Y] eats($X, $Y)
This says that when thinking about an animal, it's interesting to think about what that animal
eats.
The deduction Rule, when choosing which intermediate nodes to use, needs to look at the
scope of available HebbianLinks and HebbianPredicates and use them to guide its choice. And
if there are no good intermediate nodes available, it may report that it doesn't have enough
experience to assess with any confidence whether it can come up with a good conclusion. As a
consequence of the bandit-problem dynamics, it may be allocated reduced resources, or another
Rule is chosen altogether.
36.6 Inference Pattern Mining
Along with general-purpose attention spreading, it is very useful for PLN processes to receive
specific guidance based on patterns mined from previously performed and stored inferences.
This information is stored in CogPrime in a data repository called the InferencePattern-
Repository - which is, quite simply, a special "data table" containing inference trees extracted
from the system's inference history, and patterns recognized therein. An "inference tree" refers to
a tree whose nodes, called InferenceTreeNodes, are Atoms (or generally Atom-versions, Atoms
with truth value relative to a certain context), and whose links are inference steps (so each link
is labeled with a certain inference rule).
In a large CogPrime system it may not be feasible to store all inference trees; but then a wide
variety of trees should still be retained, including mainly successful ones as well as a sampling
of unsuccessful ones for purpose of comparison.
The InferencePatternRepository may then be used in two ways:
• An inference tree being actively expanded (i.e. utilized within the PLN inference system)
may be compared to inference trees in the repository, in real time, for guidance. That is, if
a node N in an inference tree is being expanded, then the repository can be searched for
nodes similar to N, whose contexts (within their inference trees) are similar to the context
of N within its inference tree. A study can then be made regarding which Rules and Atoms
were most useful in these prior, similar inferences, and the results of this can be used to
guide ongoing inference.
• Patterns can be extracted from the store of inference trees in the InferencePatternRepos-
itory, and stored separately from the actual inference trees (in essence, these patterns are
inference subtrees with variables in place of some of their concrete nodes or links). An infer-
ence tree being expanded can then be compared to these patterns instead of, or in addition
to, the actual trees in the repository. This provides greater efficiency in the case of common
patterns among inference trees.
A reasonable approach may be to first check for inference patterns and see if there are any
close matches; and if there are not, to then search for individual inference trees that are close
matches.
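This two-stage lookup (mined patterns first, then raw trees) can be sketched with a simple set-based similarity. Representing a node's context as a feature set, and the repository as (feature-set, guidance) pairs, is an illustrative assumption:

```python
def jaccard(a, b):
    """Set similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def best_guidance(context, patterns, trees, threshold=0.5):
    """Check mined inference patterns first; fall back to raw
    inference trees only if no pattern matches closely enough."""
    for store in (patterns, trees):
        scored = [(jaccard(context, feats), advice)
                  for feats, advice in store]
        scored.sort(reverse=True, key=lambda pair: pair[0])
        if scored and scored[0][0] >= threshold:
            return scored[0][1]
    return None  # no guidance available; fall back to default control
```

A real InferencePatternRepository would of course use structured subtree matching rather than flat feature sets, but the control flow (patterns before trees, with a match threshold) is the point here.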
Mining patterns from the repository of inference trees is a potentially highly computationally
expensive operation, but this doesn't particularly matter since it can be run periodically in
the background while inference proceeds at its own pace in the foreground, using the mined
patterns. Algorithmically, it may be done either by exhaustive frequent-itemset-mining (as in
deterministic greedy datamining algorithms), or by stochastic greedy mining. These operations
should be carried out by an InferencePatternMiner MindAgent.
36.7 Evolution As an Inference Control Scheme
It is possible to use PEPL (Probabilistic Evolutionary Program Learning, such as MOSES)
as, in essence, an InferenceControl scheme. Suppose we are using an evolutionary learning
mechanism such as MOSES or PLEASURE [Goe08a] to evolve populations of predicates or
schemata. Recall that there are two ways to evaluate procedures in CogPrime : by inference
or by direct evaluation. Consider the case where inference is needed in order to provide high-
confidence estimates of the evaluation or execution relationships involved. Then, there is the
question of how much effort to spend on inference, for each procedure being evaluated as part
of the fitness evaluation process. Spending a small amount of effort on inference means that
one doesn't discover much beyond what's immediately apparent in the AtomSpace. Spending a
large amount of effort on inference means that one is trying very hard to use indirect evidence
to support conjectures regarding the evaluation or execution Links involved.
When one is evolving a large population of procedures, one can't afford to do too much
inference on each candidate procedure being evaluated. Yet, of course, doing more inference
may yield more accurate fitness evaluations, hence decreasing the number of fitness evaluations
required.
Often, a good heuristic is to gradually increase the amount of inference effort spent on
procedure evaluation, during the course of evolution. Specifically, one may make the amount
of inference effort roughly proportional to the overall population fitness. This way, initially,
evolution is doing a cursory search, not thinking too much about each possibility. But once it
has some fairly decent guesses in its population, then it starts thinking hard, applying more
inference to each conjecture.
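A sketch of this schedule, linearly interpolating per-candidate inference effort from a base to a maximum as mean population fitness rises; the constants, and the assumption that fitness is normalized to [0, 1], are illustrative:

```python
def inference_effort(mean_fitness, base_effort=10, max_effort=200):
    """Inference steps to spend evaluating each candidate procedure,
    scaled with the current mean population fitness."""
    mean_fitness = min(max(mean_fitness, 0.0), 1.0)  # clamp to [0, 1]
    return int(base_effort + (max_effort - base_effort) * mean_fitness)

def evaluate_population(population, evaluate, mean_fitness):
    """Evaluate every candidate with the current effort budget."""
    effort = inference_effort(mean_fitness)
    return [evaluate(candidate, effort) for candidate in population]
```

Early generations thus get cheap, cursory evaluations; once the population contains decent guesses, each conjecture receives deeper inference.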
Since the procedures in the population are likely to be interrelated to each other, inferences
done on one procedure are likely to produce intermediate knowledge that's useful for doing
inference on other procedures. Therefore, what one has in this scheme is evolution as a control
mechanism for higher-order inference.
Combined with the use of evolutionary learning to achieve memory across optimization runs,
this is a very subtle approach to inference control, quite different from anything in the domain
of logic-based AI. Rather than guiding individual inference steps on a detailed basis, this type
of control mechanism uses evolutionary logic to guide the general direction of inference, pushing
the vast mass of exploratory inferences in the direction of solving the problem at hand, based
on a flexible usage of prior knowledge.
36.8 Incorporating Other Cognitive Processes into Inference
Hebbian inference control and inference pattern mining are valuable and powerful processes,
but they are not always going to be enough. The solution of some problems that CogPrime
chooses to address via inference will ultimately require the use of other methods, too. In these
cases, one workaround is for inference to call on other cognitive processes to help it out.
This is done via the forward or backward chaining agents identifying specific Atoms deserv-
ing of attention by other cognitive processes, and then spawning Tasks executing these other
cognitive processes on the appropriate Atoms.
Firstly, which Atoms should be selected for this kind of attention? What we want are Infer-
enceTreeNodes that:
• have high STI.
• have the impact to significantly change the overall truth value of the inference tree they are
embedded in (something that can be calculated by hypothetically varying the truth value of
the InferenceTreeNode and seeing how the truth value of the overall conclusion is affected).
• have truth values that are known with low confidence.
Truth values meeting these criteria should be taken as strong candidates for attention by other
cognitive processes.
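The three criteria can be sketched as a filter over InferenceTreeNodes, with impact estimated by hypothetically perturbing a node's truth-value strength and re-evaluating the tree's conclusion. The node encoding, the thresholds, and the tree_eval callback are assumptions for illustration:

```python
def select_for_attention(nodes, tree_eval, sti_min=0.5, conf_max=0.3,
                         impact_min=0.1, delta=0.1):
    """Pick InferenceTreeNodes that (a) have high STI, (b) significantly
    move the tree's conclusion when their strength is hypothetically
    varied, and (c) have low-confidence truth values.
    `tree_eval(node_id, strength)` returns the conclusion strength with
    that node's strength hypothetically set to the given value."""
    chosen = []
    for n in nodes:
        if n["sti"] < sti_min or n["confidence"] > conf_max:
            continue  # fails criterion (a) or (c)
        swing = abs(tree_eval(n["id"], n["strength"] + delta)
                    - tree_eval(n["id"], n["strength"] - delta))
        if swing >= impact_min:  # criterion (b): conclusion is sensitive
            chosen.append(n["id"])
    return chosen
```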
The next question is which other cognitive processes do we apply in which cases?
MOSES in supervised categorization mode can be applied to a candidate InferenceTreeNode
representing a CogPrime Node if it has a sufficient number of members (Atoms linked to it
by MemberLinks); and, a sufficient number of new members have been added to it (or have
had their membership degree significantly changed) since MOSES in supervised categorization
mode was used on it last.
Next, pattern mining can be applied to look for connectivity patterns elsewhere in the Atom-
Table, similar to the connectivity patterns of the candidate Atom, if the candidate Atom has
changed significantly since pattern mining last visited it.
More subtly, what if we try to find whether "cross breed" implies "Ugliness", and we know
that "bad genes" implies Ugliness, but can't find a way, by backward chaining, to prove that
"cross breed" implies "bad genes". Then we could launch a non-backward-chaining algorithm to
measure the overlap of SatisfyingSet(cross breed) and SatisfyingSet(bad genes). Specifically, we
could use MOSES in supervised categorization mode to find relationships characterizing "cross
breed" and other relationships characterizing "bad genes", and then do some forward chaining
inference on these relationships. This would be a general heuristic for what to do when there's
a link with low confidence but high potential importance to the inference tree.
SpeculativeConceptFormation (see Chapter 38) may also be used to create new concepts and
attempt to link them to the Atoms involved in an inference (via subsidiary inference processes,
or HebbianLink formation based on usage in learned procedures, etc.), so that they may be
used in inference.
36.9 PLN and Bayes Nets
Finally, we give some comments on the relationship between PLN and Bayes Nets [Pea88]. We
have not yet implemented such an approach, but it may well be that Bayes Nets methods can
serve as a useful augmentation to PLN for certain sorts of inference (specifically, for inference
on networks of knowledge that are relatively static in nature).
We can't use standard Bayes Nets as the primary way of structuring reasoning in CogPrime
because CogPrime's knowledge network is loopy. The peculiarities that allow belief propagation
to work in standard loopy Bayes nets don't hold up in CogPrime, because of the way one has
to update probabilities when managing a very large network in interaction with a changing
world, where different parts of the network receive different amounts of focus. So in PLN we
use a different mechanism (the "inference trail" mechanism) to avoid "repeated evidence
counting", whereas loopy Bayes nets rely on the fact that in the standard loopy Bayes net
configuration, extra evidence counting occurs in a fairly constant way across the network.
However, when you have within the AtomTable a set of interrelated knowledge items that you
know are going to be static for a while, and you want to be able to query them probabilistically,
then building a Bayes Net of some sort (i.e. "freezing" part of CogPrime's knowledge network
and mapping it into a Bayes Net) may be useful. I.e., one way to accelerate some PLN inference
would be:
296 36 Adaptive, Integrative Inference Control
1. Freeze a subnetwork of the AtomTable which is expected not to change a lot in the near
future
2. Interpret this subnetwork as a loopy Bayes net, and use standard Bayesian belief propaga-
tion to calculate probabilities based on it
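The two-step recipe above can be sketched concretely. The following is a minimal sum-product ("loopy") belief propagation loop over a frozen pairwise network, assuming binary variables and pairwise potentials; all function and variable names are hypothetical, and an "indefinite Bayes net" version would propagate indefinite probabilities rather than the point values used here.

```python
import numpy as np

def loopy_bp(unary, pair, edges, iters=50):
    """Approximate marginals on a (possibly loopy) pairwise network.

    unary: dict node -> length-2 array of unary potentials
    pair:  dict (i, j) -> 2x2 potential matrix, rows indexing i's state
    edges: list of (i, j) tuples
    """
    # Directed messages, initialised uniform; neighbour lists for each node
    msg, nbrs = {}, {}
    for (i, j) in edges:
        msg[(i, j)] = msg[(j, i)] = np.ones(2) / 2
        nbrs.setdefault(i, []).append(j)
        nbrs.setdefault(j, []).append(i)
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            # product of i's unary potential and all incoming messages except j's
            prod = unary[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod = prod * msg[(k, i)]
            pot = pair[(i, j)] if (i, j) in pair else pair[(j, i)].T
            m = pot.T @ prod           # marginalise out i's state
            new[(i, j)] = m / m.sum()  # normalise for numerical stability
        msg = new
    beliefs = {}
    for i in nbrs:
        b = unary[i].copy()
        for k in nbrs[i]:
            b = b * msg[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs
```

On a small loopy example such as a triangle of attractively coupled nodes, biasing one node's unary potential pulls the other nodes' beliefs toward the same state, which is the qualitative behaviour one wants from this kind of background inference.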
This would be a highly efficient form of "background inference" in certain contexts. (Note
that this ideally requires an "indefinite Bayes net" implementation that propagates indefinite
probabilities through the standard Bayes-net local belief propagation algorithms, but this is
not mathematically problematic.)
On the other hand, if you have a very important subset of the Atomspace, then it may
be worthwhile to maintain a Bayes net modeling the conditional probabilities between these
Atoms, but with a dynamically updated structure.
Chapter 37
Pattern Mining
Co-authored with Jade O'Neill
37.1 Introduction
Having discussed inference in depth we now turn to other, simpler but equally important ap-
proaches to creating declarative knowledge. This chapter deals with pattern mining - the
creation of declarative knowledge representing patterns among other knowledge (which may be
declarative, sensory, episodic, procedural, etc.) - and the following chapter deals with specula-
tive concept creation.
Within the scope of pattern mining, we will discuss two basic approaches:
• supervised learning: given a predicate, finding a pattern among the entities that satisfy that
predicate.
• unsupervised learning: undirected search for "interesting patterns".
The supervised learning case is easier and we have done a number of experiments using
MOSES for supervised pattern mining, on biological (microarray gene expression and SNP)
and textual data. In the CogPrime case, the "positive examples" are the elements of the Sat-
isfyingSet of the predicate P, and the "negative examples" are everything else. This can be a
relatively straightforward problem if there are enough positive examples and they actually share
common aspects ... but some trickiness emerges, of course, when the common aspects are, in
each example, complexly intertwined with other aspects.
The unsupervised learning case is considerably trickier. The main issue here regards
the definition of an appropriate fitness function. We are searching for "interesting patterns." So
the question is, what constitutes an interesting pattern?
We will also discuss two basic algorithmic approaches:
• program learning, via MOSES or hillclimbing
• frequent subgraph mining, using greedy algorithms
The value of these various approaches is contingent on the environment and goal set being
such that algorithms of this nature can actually recognize relevant patterns in the world and
mind. Fortunately, the everyday human world does appear to have the property of possessing
multiple relevant patterns that are recognizable using varying levels of sophistication and effort.
It has patterns that can be recognized via simple frequent pattern mining, and other patterns
that are too subtle for this, and are better addressed by a search-based approach. In order for
an environment and goal set to be appropriate for the learning and teaching of a human-level
AI, it should have the same property of possessing multiple relevant patterns recognizable using
varying levels of subtlety.
37.2 Finding Interesting Patterns via Program Learning
As one important case of pattern mining, we now discuss the use of program learning to find
"interesting" patterns in sets of Atoms.
Clearly, "interestingness" is a multidimensional concept. One approach to defining it is em-
pirical, based on observation of which predicates have and have not proved interesting to the
system in the past (based on their long-term importance values, i.e. LTI).
In this approach, one has a supervised categorization problem: learn a rule predicting whether
a predicate will fall into the interesting category or the uninteresting category. Once one has
learned this rule, and expressed it as a predicate itself, one can then use it as the fitness
function for evolutionary learning.
There is also a simpler approach, which defines an objective notion of interestingness. This
objective notion is a weighted sum of two factors:
• Compactness.
• Surprisingness of truth value.
Compactness is easy to understand: all else equal, a predicate embodied in a small Combo tree
is better than a predicate embodied in a big one. There is some work hidden here in Combo
tree reduction; ideally, one would like to find the smallest representation of a given Combo tree,
but this is a computationally formidable problem, so one necessarily approaches it via heuristic
algorithms.
Surprisingness of truth value is a slightly subtler concept. Given a Boolean predicate, one can
envision two extreme ways of evaluating its truth value (represented by two different types of
ProcedureEvaluator). One can use an IndependenceAssumingProcedureEvaluator, which deals
with all AND and OR operators by assuming probabilistic independence. Or, one can use an
ordinary EffortBasedProcedureEvaluator, which uses dependency information wherever feasible
to evaluate the truth values of AND and OR operators. These two approaches will normally
give different truth values; but how different? The more different, the more surprising is the
truth value of the predicate, and the more interesting the predicate may be.
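To illustrate the idea in miniature, the following sketch (hypothetical names, standing in for the two ProcedureEvaluators) computes a Boolean predicate's truth value both ways over a table of data, and takes the gap between the two as the surprisingness:

```python
def eval_empirical(pred, rows):
    """Truth value of a predicate as its empirical frequency over the data."""
    return sum(1 for r in rows if pred(r)) / len(rows)

def eval_exact(expr, row):
    """Evaluate a nested ('and'|'or'|'var', ...) expression on one data row."""
    if expr[0] == 'var':
        return row[expr[1]]
    a, b = eval_exact(expr[1], row), eval_exact(expr[2], row)
    return (a and b) if expr[0] == 'and' else (a or b)

def eval_independent(expr, rows):
    """Truth value computed bottom-up, assuming AND/OR arguments independent."""
    if expr[0] == 'var':
        return eval_empirical(lambda r: r[expr[1]], rows)
    p1, p2 = eval_independent(expr[1], rows), eval_independent(expr[2], rows)
    return p1 * p2 if expr[0] == 'and' else p1 + p2 - p1 * p2

def surprisingness(expr, rows):
    """Gap between dependency-aware and independence-assuming truth values."""
    return abs(eval_empirical(lambda r: eval_exact(expr, r), rows)
               - eval_independent(expr, rows))
```

For perfectly correlated variables A and B, for instance, AND(A, B) has empirical truth value 0.5 but independence-assumed value 0.25, so the predicate registers as surprising.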
In order to explore the power of this kind of approach in a simple context, we have tested
pattern mining using MOSES on Boolean predicates as a data mining algorithm on a number
of different datasets, including some interesting and successful work in the analysis of gene
expression data, and some more experimental work analyzing sociological data from the National
Longitudinal Survey of Youth (NLSY) (http://stats.bls.gov/nls/).
A very simple illustrative result from the analysis of the NLSY data is the pattern:
OR
(NOT(MothersAge(X)) AND NOT(FirstSexAge(X)))
(Wealth(X) AND PIAT(X))
where the domain of X are individuals, meaning that:
• being the child of a young mother correlates with having sex at a younger age;
• being in a wealthier family correlates with better Math (PIAT) scores;
• the two sets previously described tend to be disjoint.
Of course, many data patterns are considerably more complex than the simple illustrative
pattern shown above. However, one of the strengths of the evolutionary learning approach to
pattern mining is its ability to find simple patterns when they do exist, yet without (like some
other mining methods) imposing any specific restrictions on the pattern format.
37.3 Pattern Mining via Frequent/Surprising Subgraph Mining
Probabilistic evolutionary learning is an extremely powerful approach to pattern mining, but
may not always be realistic due to its high computational cost. A cheaper, though also weaker,
alternative is to use frequent subgraph mining algorithms such as [HWP03, KK01], which may
straightforwardly be adapted to hypergraphs such as the Atomspace.
Frequent subgraph mining is a port to the graph domain of the older, simpler idea of frequent
itemset mining, which we now briefly review. There are a number of algorithms in the latter
category; the classic is Apriori [AS94], and an alternative is Relim [Bor05], which is conceptually
similar but seems to give better performance.
The basic goal of frequent itemset mining is to discover frequent subsets ("itemsets") in a
group of sets, whose members are all drawn from some base set of items. One knows that for a
set of N items, there are 2^N - 1 possible subsets. The algorithm operates in several rounds.
Round i heuristically computes frequent i-itemsets (i.e. frequent sets containing i items). A
round has two steps: candidate generation and candidate counting. In the candidate generation
step, the algorithm generates a set of candidate i-itemsets whose support - the fraction of
transactions in which the itemset appears - has not yet been computed. In the candidate-
counting step, the algorithm scans the database, counting the support of the candidate
itemsets. After the scan, the algorithm discards candidates with support lower than the specified
minimum (an algorithm parameter) and retains only the sufficiently frequent i-itemsets. The
algorithm reduces the number of tested subsets by pruning a priori those candidate itemsets that
cannot be frequent, based on the knowledge about infrequent itemsets obtained from previous
rounds. So for instance if {A, B} is a frequent 2-itemset then {A, B, C} will be considered as a
potential 3-itemset; whereas if {A, B} is not a frequent itemset then {A, B, C}, as well
as any superset of {A, B}, will be discarded. Although the worst case of this sort of algorithm is
exponential, practical executions are generally fast, depending essentially on the support limit.
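The rounds described above can be condensed into a short sketch (illustrative only, not an optimized implementation; names are hypothetical):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: returns {frozenset: support} for all frequent itemsets.

    transactions: list of sets of items; min_support: minimum fraction.
    """
    n = len(transactions)
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n
    # Round 1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]): s for i in items
            if (s := support(frozenset([i]))) >= min_support}
    result, k = dict(freq), 1
    while freq:
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets,
        # keeping only candidates whose every k-subset is frequent (pruning)
        cands = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k))}
        # Candidate counting: keep candidates meeting the support threshold
        freq = {c: s for c in cands if (s := support(c)) >= min_support}
        result.update(freq)
        k += 1
    return result
```

With a support threshold of 0.6, for example, a database where {A, B} appears in three of five transactions retains {A, B} but discards any superset appearing in only two.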
Frequent subgraph mining follows the same pattern, but instead of a set of items it deals with
a group of graphs. There are many frequent subgraph mining algorithms in the literature, but
the basic concept underlying nearly all of them is the same: first find small frequent subgraphs.
Then seek to find slightly larger frequent patterns encompassing these small ones. Then seek
to find slightly larger frequent patterns encompassing these, etc. This approach is much faster
than something like MOSES, although management of the large number of subgraphs to be
searched through can require subtle design and implementation of data structures.
If, instead of an ensemble of small graphs, one has a single large graph like the AtomSpace,
one can follow the same approach, via randomly subsampling from the large graph to find the
graphs forming the ensemble to be mined from; see VI I1(11 for a detailed treatment of this sort of
approach. The fact that the AtomSpace is a hypergraph rather than a graph doesn't fundamen-
tally affect the matter since a hypergraph may always be considered a graph via introduction
of an additional node for each hyperedge (at the cost of a potentially great multiplication of
the number of links).
Frequent subgraph mining algorithms appropriately deployed can find subgraphs which occur
repeatedly in the Atomspace, including subgraphs containing Atom-valued variables. Each such
subgraph may be represented as a PredicateNode, and frequent subgraph mining will find such
PredicateNodes that have surprisingly high truth values when evaluated across the Atomspace.
But unlike MOSES when applied as described above, such an algorithm will generally find such
predicates only in a "greedy" way.
For instance, a greedy subgraph mining algorithm would be unlikely to find
OR
(NOT(MothersAge(X)) AND NOT(FirstSexAge(X)))
(Wealth(X) AND PIAT(X))
as a surprising pattern in an AtomSpace, unless at least one (and preferably both) of
Wealth(X) AND PIAT(X)
and
NOT(MothersAge(X)) AND NOT(FirstSexAge(X))
were surprising patterns in that Atomspace on their own.
37.4 Fishgram
Fishgram is a simple example of an algorithm for finding patterns in an Atomspace, instantiating
the general concepts presented in the previous section. It represents patterns as conjunctions
(AndLink) of Links, which usually contain variables. It does a greedy search, so it can quickly
find many patterns. In contrast, algorithms like MOSES are designed to find a small number
of the best patterns. Fishgram works by finding a set of objects that have links in common,
so it will be most effective if the AtomSpace has a lot of raw data, with simple patterns. For
example, it can be used on the perceptions from the virtual world. There are predicates for
basic perceptions (e.g. what kind of object something is, objects being near each other, types
of blocks, and actions being performed by the user or the AI).
The details of the Fishgram code and design are not sufficiently general or scalable to serve as
a robust, omnipurpose pattern mining solution for CogPrime. However, Fishgram is nevertheless
interesting, as an existent, implemented and tested prototype of a greedy frequent/interesting
subhypergraph mining system. A more scalable analogous system, with a similar principle of
operation, has been outlined and is in the process of being designed at time of writing, but will
not be presented here.
37.4.1 Example Patterns
Here is some example output from Fishgram, when run on the virtual agent's memories.
(AndLink
(EvaluationLink is_edible:PredicateNode (ListLink $1000041))
(InheritanceLink $1000041 Battery:ConceptNode))
This means a battery which can be "eaten" by the virtual robot. The variable $1000041 refers
to the object (battery).
Fishgram can also find patterns containing a sequence of events. In this case, there is a list
of EvaluationLinks or InheritanceLinks which describe the objects involved, followed by the
sequence of events.
(AndLink
(InheritanceLink $1007703 Battery:ConceptNode)
(SequentialAndLink
(EvaluationLink isHolding:PredicateNode (ListLink $1008725 $1007703))))
This means the agent was holding a battery. $1007703 is the battery, and there is also a
variable for the agent itself. Many interesting patterns involve more than one object. This
pattern would also include the user (or another AI) holding a battery, because the pattern does
not refer to the AI character specifically.
It can find patterns where it performs an action and achieves a goal. There is code to create
implications based on these conjunctions. After finding many conjunctions, it can produce
ImplicationLinks based on some of them. Here is an example where the AI-controlled virtual
robot discovers how to get energy.
(ImplicationLink
(AndLink
(EvaluationLink is_edible:PredicateNode (ListLink $1011619))
(InheritanceLink $1011619 Battery:ConceptNode))
(PredictiveImplicationLink
(EvaluationLink actionDone:PredicateNode (ListLink
(ExecutionLink eat:GroundedSchemaNode
(ListLink $1011619))))
(EvaluationLink increased:PredicateNode (ListLink
(EvaluationLink
EnergyDemandGoal:PredicateNode)))))
37.4.2 The Fishgram Algorithm
The core Fishgram algorithm, in pseudocode, is as follows:
initial layer = every pair (relation, binding)
while previous layer is not empty:
    foreach (conjunction, binding) in previous layer:
        let incoming = all (relation, binding) pairs containing
                       an object in the conjunction
        let possible_next_events = all (event, binding) pairs where
                       the event happens during or shortly after
                       the last event in conjunction
        foreach (relation, relation_binding) in incoming
                and possible_next_events:
            (new_relation, new_conjunction_binding) =
                map_to_existing_variables(conjunction,
                    binding, relation, relation_binding)
            if new_relation is already in conjunction, skip it
            new_conjunction = conjunction + new_relation
            if new_conjunction has been found already, skip it
            otherwise, add new_conjunction to the current layer

map_to_existing_variables(conjunction, conjunction_binding,
                          relation, relation_binding):
    r', s' = a copy of the relation and binding using new variables
    foreach variable v, object o in relation_binding:
        foreach variable v2, object o2 in conjunction_binding:
            if o == o2:
                change r' and s' to use v2 instead of v
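For concreteness, here is a heavily simplified, runnable analogue of the layer-wise loop above, restricted to a single variable and ignoring event sequences; the facts and all names are hypothetical:

```python
def fishgram_mini(facts, min_count, max_len=3):
    """Breadth-first growth of predicate conjunctions over a single variable.

    facts: dict predicate -> set of objects satisfying it.
    Returns {conjunction_tuple: example_set} for every conjunction whose
    example set has at least min_count members.
    """
    # Layer 1: single predicates with enough examples
    layer = {(p,): objs for p, objs in facts.items() if len(objs) >= min_count}
    found = dict(layer)
    while layer and len(next(iter(layer))) < max_len:
        next_layer = {}
        for conj, examples in layer.items():
            for p, objs in facts.items():
                if p <= conj[-1]:               # canonical order avoids duplicates
                    continue
                new_examples = examples & objs  # bindings still satisfying all
                if len(new_examples) >= min_count:
                    next_layer[conj + (p,)] = new_examples
        found.update(next_layer)
        layer = next_layer
    return found
```

Note how adding a predicate can only shrink a conjunction's example set, which is what makes the frequency threshold an effective pruning criterion.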
37.4.3 Preprocessing
There are several preprocessing steps to make it easier for the main Fishgram search to find
patterns. There is a list of things that have to be variables. For example, any predicate that
refers to an object (including agents) will be given a variable so it can refer to any object. Other
predicates or InheritanceLinks can be added to a pattern, to restrict it to specific kinds of
objects, as shown above. So there is a step which goes through all of the links in the AtomSpace,
and records a list of predicates with variables, such as "X is red" or "X eats Y". This makes the
search part simpler, because it never has to decide whether something should be a variable or
a specific object.
There is also a filter system, so that things which seem irrelevant can be excluded from the
search. There is a combinatorial explosion as patterns become larger. Some predicates may be
redundant with each other, or known not to be very useful. It can also try to find only patterns
in the Al's "attentional focus", which is much smaller than the whole AtomSpace.
The Fishgram algorithm cannot currently handle patterns involving numbers, although it
could be extended to do so. There are two options: either have a separate discretization step,
creating predicates for different ranges of a value; or alternatively, have predicates for
mathematical operators. It would be possible to search for a "split point" as in decision trees:
a number would be chosen, and only things above that value (or only things below that value)
would count for a pattern. It would also be possible to have multiple numbers in a pattern,
and compare them in various ways. It is uncertain how practical this would be in Fishgram.
MOSES is good for finding numeric patterns, so it may be better to simply use those patterns
inside Fishgram.
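For illustration, a decision-tree-style split-point search of the kind just mentioned might look like this (a hypothetical sketch; Fishgram itself does not currently implement it):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

def best_split(values, labels):
    """Find the numeric threshold that best separates the labels.

    Candidates are midpoints between consecutive distinct sorted values,
    as in decision-tree induction; returns (threshold, information_gain).
    """
    pairs = sorted(zip(values, labels))
    base = entropy([l for _, l in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(pairs)
        if gain > best[1]:
            best = (t, gain)
    return best
```

A chosen threshold would then be turned into a Boolean predicate ("X > t") usable like any other relation in a pattern.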
The "increased" predicate is added by a preprocessing step. The goals have a fuzzy TruthValue
representing how well the goal is achieved at any point in time, so e.g. the EnergyDemandGoal
represents how much energy the virtual robot has at some point in time. The predicate records
times that a goal's TruthValue increased. This only happens immediately after doing something
to increase it, which helps avoid finding spurious patterns.
37.4.4 Search Process
Fishgram search is breadth-first. It starts with all predicates (or InheritanceLinks) found by the
preprocessing step. Then it finds pairs of predicates involving the same variable. Then they are
extended to conjunctions of three predicates, and so on. Many relations apply at a specific time,
for example the agent being near an object, or an action being performed. These are included
in a sequence, and are added in the order they occurred.
Fishgram remembers the examples for each pattern. If there is only one variable in the
pattern, an example is a single object; otherwise each example is a vector of objects for each
variable in the pattern. Each time a relation is added to a pattern, if it has no new variables,
some of the examples may be removed, because they don't satisfy the new predicate. It needs
to have at least one variable in common with the previous relations. Otherwise the patterns
would combine many unrelated things.
In frequent itemset mining (for example the APRIORI algorithm), there is effectively one
variable, and adding a new predicate will often decrease the number of items that match. It can
never increase it. The number of possible conjunctions increases with the length, up to some
point, after which it decreases. But when mining for patterns with multiple objects there is
a much larger combinatorial explosion of patterns. Various criteria can be used to prune the
search.
The most basic criterion is the frequency. Only patterns with at least N examples will be
included, where N is an arbitrary constant. You can also set a maximum number of patterns
allowed for each length (number of relations), and only include the best ones. The next level of
the breadth-first search will only search for extensions of those patterns.
One can also use a measure of statistical interestingness, to make sure the relations in a
pattern are correlated with each other. There are many spurious frequent patterns, because
anything which is frequent will occur together with other things, whether they are relevant or
not. For example "breathing while typing" is a frequent pattern, because people breathe at all
times. But "moving your hands while typing" is a much more interesting pattern. As people
only move their hands some of the time, a measure of correlation would prefer the second
pattern. The best measure may be interaction information, which is a generalisation of mutual
information that applies to patterns with more than two predicates. An early-stage AI would
not have much knowledge of cause and effect, so it would rely on statistical measures to find
useful patterns.
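As an illustration of the measure mentioned above, interaction information for three variables can be computed from entropies of the marginal and joint distributions. Sign conventions vary in the literature; this sketch uses the convention under which a synergistic pattern such as XOR comes out positive:

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy of a Counter of observed outcomes."""
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values() if c)

def interaction_information(rows):
    """I(X;Y;Z) for observed (x, y, z) triples, via the alternating entropy sum.

    Uses the convention I(X;Y;Z) = I(X;Y|Z) - I(X;Y): positive means the
    three-way pattern carries more than its pairwise parts.
    """
    def h(*idx):
        return entropy(Counter(tuple(r[i] for i in idx) for r in rows))
    return (-(h(0) + h(1) + h(2))
            + h(0, 1) + h(0, 2) + h(1, 2)
            - h(0, 1, 2))
```

For Z = X XOR Y over uniform random bits, all pairwise mutual informations vanish, yet this measure is a full bit, exactly the kind of jointly-informative pattern a frequency count alone would miss.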
37.4.5 Comparison to other algorithms
Fishgram is more suitable for OpenCogPrime's purposes than existing graph mining algorithms,
most of which were designed with molecular datasets in mind. The OpenCog AtomSpace is a
different graph in various ways. For one, there are many possible relations between nodes (much
like in a semantic network). Many relations involve more than two objects, and there are also
property predicates about a single object. So the relations are effectively directed links of
varying arity. It also has events, and many states can change over time (e.g. an egg changes
state while it's cooking). Fishgram is designed for general knowledge in an embodied agent.
There are other major differences. Fishgram uses a breadth-first search, rather than depth-
first search like most graph mining algorithms. And it does an "embedding-based" search, search-
ing for patterns that can be embedded multiple times in a large graph. Molecular datasets have
many separate graphs for separate molecules, but the embodied perceptions are closer to a
single, fairly well-connected graph. Depth-first search would be very slow on such a graph, as
there are many very long paths through the graph, and the search would mostly find those,
whereas the useful patterns tend to be compact and repeated many times.
Lastly the design of Fishgram makes it easy to experiment with multiple different scoring
functions, from simple ones like frequency to much more sophisticated functions such as inter-
action information.
As mentioned above, the current implementation of Fishgram is not sufficiently scalable to
be utilized for general-purpose Atomspaces. The underlying data structure within Fishgram,
used to store recognized patterns, would need to be replaced, which would lead to various
other modifications within the algorithm. But the general principle and approach illustrated
by Fishgram will persist in any more scalable reimplementation.
Chapter 38
Speculative Concept Formation
38.1 Introduction
One of the hallmarks of general intelligence is its capability to deal with novelty in its envi-
ronment and/or goal-set. And dealing with novelty intrinsically requires creating novelty. It's
impossible to efficiently handle new situations without creating new ideas appropriately. Thus,
in any environment complex and dynamic enough to support human-like general intelligence
(or any other kind of highly powerful general intelligence), the creation of novel ideas will be
paramount. New idea creation takes place in OpenCog via a variety of methods - e.g. inside
MOSES which creates new program trees, PLN which creates new logical relationships, ECAN
which creates new associative relationships, etc. But there is also a role for explicit, purposeful
creation of new Atoms representing new concepts, outside the scope of these other learning
mechanisms.
The human brain gets by, in adulthood, without creating that many new neurons - although
neurogenesis does occur on an ongoing basis. But this is achieved only via great redundancy,
because for the brain it's cheaper to maintain a large number of neurons in memory at the same
time, than to create and delete neurons. Things are different in a digital computer: memory is
more expensive but creation and deletion of object is cheaper. Thus in CogPrime, forgetting
and creation of Atoms is a regularly occurring phenomenon. In this chapter we discuss a key
class of mechanisms for Atom creation, "speculative concept formation." Further methods will
be discussed in following chapters.
The philosophy underlying CogPrime's speculative concept formation is that new things
should be created from pieces of good old things (a form of "evolution", broadly construed), and
that probabilistic extrapolation from experience should be used to guide the creation of new
things (inference). It's clear that these principles are necessary for the creation of new mental
forms but it's not obvious that they're sufficient: this is a nontrivial hypothesis, which may also
be considered a family of hypotheses since there are many different ways to do extrapolation
and intercombination. In the context of mind-world correspondence, the implicit assumption
underlying this sort of mechanism is that the relevant patterns in the world can often be
combined to form other relevant patterns. The everyday human world does quite markedly
display this kind of combinatory structure, and such a property seems basic enough that it's
appropriate for use as an assumption underlying the design of cognitive mechanisms.
In CogPrime we have introduced a variety of heuristics for creating new Atoms - especially
ConceptNodes - which may then be reasoned on and subjected to implicit (via attention
allocation) and explicit (via the application of evolutionary learning to predicates obtained
from concepts via "concept predicatization") evolution. Among these are the node logical operators
described in the book Probabilistic Logic Networks, which allow the creation of new concepts
via AND, OR, XOR and so forth. However, logical heuristics alone are not sufficient. In this
chapter we will review some of the nonlogical heuristics that are used for speculative concept
formation. These operations play an important role in creativity - to use cognitive-psychology
language, they are one of the ways that CogPrime implements the process of blending, which
Fauconnier and Turner (2002) have argued is key to human creativity on many different levels.
Each of these operations may be considered as implicitly associated with a hypothesis that,
in fact, the everyday human world tends to assign utility to patterns that are combinations of
other patterns produced via said operation.
An evolutionary perspective may also be useful here, on a technical level as well as philo-
sophically. As noted in The Hidden Pattern and hinted at in Chapter 3 of Part 1, one way
to think about an AGI system like CogPrime is as a huge evolving ecology. The AtomSpace
is a biosphere of sorts, and the mapping from Atom types into species has some validity to
it (though not complete accuracy: Atom types do not compete with each other, but they do
reproduce with each other, and according to most of the reproduction methods in use, Atoms
of differing type cannot cross-reproduce). Fitness is defined by importance. Reproduction is
defined by various operators that produce new Atoms from old, including the ones discussed in
this chapter, as well as other operators such as inference and explicit evolutionary operators.
New ConceptNode creation may be triggered by a variety of circumstances. If two ConceptN-
odes are created for different purposes, but later the system finds that most of their meanings
overlap, then it may be more efficient to merge the two into one. On the other hand, a node may
become overloaded with different usages, and it is more useful to split it into multiple nodes,
each with a more consistent content. Finally, there may be patterns across large numbers of
nodes that merit encapsulation in individual nodes. For instance, if there are 1000 fairly similar
ConceptNodes, it may be better not to merge them all together, but rather to create a single
node to which they all link, reifying the category that they collectively embody.
In the following sections, we will begin by describing operations that create new ConceptN-
odes from existing ones on a local basis: by mutating individual ConceptNodes or combining
pairs of ConceptNodes. Some of these operations are inspired by evolutionary operators used
in the GA, others are based on the cognitive psychology concept of "blending." Then we will
turn to the use of clustering and formal concept analysis algorithms inside CogPrime to refine
the system's knowledge about existing concepts, and create new concepts.
38.2 Evolutionary Concept Formation
A simple and useful way to combine ConceptNodes is to use GA-inspired evolutionary operators:
crossover and mutation. In mutation, one replaces some number of a Node's links with other
links in the system. In crossover, one takes two nodes and creates a new node containing some
links from one and some links from another.
More concretely, to cross over two ConceptNodes X and Y, one may proceed as follows (in
short, by clustering the union of X's and Y's links):
• Create a series of empty nodes Z1, Z2, ..., Zk
• Form a "link pool" consisting of all X's links and all Y's links, and then divide this pool
into clusters (clustering algorithms will be described below).
• For each cluster with significant cohesion, allocate the links in that cluster to one of the
new nodes Zi
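The procedure above might be sketched as follows, with each link crudely modelled as a set of features, and a toy single-pass grouping standing in for whichever clustering algorithm is actually used (all names hypothetical):

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def crossover(links_x, links_y, threshold=0.3, min_cohesion=2):
    """Cross over two concept nodes by clustering their pooled links.

    links_x, links_y: lists of frozensets, each a stand-in for one link's
    type/endpoints. Single-pass greedy clustering: a link joins the first
    cluster whose pooled feature set it overlaps sufficiently. Every
    cluster with at least min_cohesion links becomes a new node's link set.
    """
    pool = list(links_x) + list(links_y)
    clusters = []
    for link in pool:
        for c in clusters:
            if jaccard(link, frozenset().union(*c)) >= threshold:
                c.append(link)
                break
        else:
            clusters.append([link])
    # each sufficiently cohesive cluster becomes the link set of a new node Zi
    return [c for c in clusters if len(c) >= min_cohesion]
```

Crossing a node with "animal"-flavored links and one with "vehicle"-flavored links, for instance, would yield one new node per coherent cluster while discarding stray links that cohere with neither.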
On the other hand, to mutate a ConceptNode, a number of different mutation processes are
reasonable. For instance, one can
• Cluster the links of a Node, and remove one or more of the clusters, creating a node with
fewer links
• Cluster the links, remove one or more clusters, and then add new links that are similar to
the links in the remaining clusters
The EvolutionaryConceptFormation MindAgent selects pairs of nodes from the system,
where the probability of selecting a pair is determined by
• the average importance of the pair
• the degree of similarity of the pair
• the degree of association of the pair
(Of course, other heuristics are possible too). It then crosses over the pair, and mutates the
result.
Note that, unlike in some GA implementations, the parent node(s) are retained within the
system; they are not replaced by the children. Regardless of how many offspring they generate
by what methods, and regardless of their age, all Nodes compete and cooperate freely forever
according to the fitness criterion defined by the importance updating function. The entire
AtomSpace may be interpreted as a large evolutionary, ecological system, and the action of
CogPrime dynamics, as a whole, is to create fit nodes.
A more advanced variant of the EvolutionaryConceptFormation MindAgent would adapt its
mutation rate in a context-dependent way. But our intuition is that it is best to leave this kind
of refinement for learned cognitive schemata, rather than to hard-wire it into a MindAgent.
To encourage the formation of such schemata, one may introduce elementary schema functions
that embody the basic node-level evolutionary operators:
ConceptNode ConceptCrossover(ConceptNode A, ConceptNode B)
ConceptNode mutate(ConceptNode A, mutationAmount m)
There will also be a role for more abstract schemata that utilize these. An example cognitive
schema of this sort would be one that said: "When all my schema in a certain context seem
unable to achieve their goals, then maybe I need new concepts in this context, so I should
increase the rate of concept mutation and crossover, hoping to trigger some useful concept
formation."
As noted above, this component of CogPrime views the whole AtomSpace as a kind of
genetic algorithm - but the fitness function is "ecological" rather than fixed, and of course
the crossover and mutation operators are highly specialized. Most of the concepts produced
through evolutionary operations are going to be useless nonsense, but will be recognized by the
importance updating process and subsequently forgotten from the system. The useful ones will
link into other concepts and become ongoing aspects of the system's mind. The importance
updating process amounts to fitness evaluation, and it depends implicitly on the sum total of
the cognitive processes going on in CogPrime.
To ensure that importance updating properly functions as fitness evaluation, it is critical
that evolutionarily-created concepts (and other speculatively created Atoms) always comprise
a small percentage of the total concepts in the system. This guarantees that importance will
serve as a meaningful "fitness function" for newly created ConceptNodes. The reason for this
is that the importance measures how useful the newly created node is, in the context of the
previously existing Atoms. If there are too many speculative, possibly useless new ConceptNodes
in the system at once, the importance becomes an extremely noisy fitness measure, as it's
largely measuring the degree to which instances of new nonsense fit in with other instances
of new nonsense. One may find interesting self-organizing phenomena in this way, but in an
AGI context we are not interested in undirected spontaneous pattern-formation, but rather in
harnessing self-organizing phenomena toward system goals. And the latter is achieved by having
a modest but not overwhelming amount of speculative new nodes entering into the system.
Finally, as discussed earlier, evolutionary operations on maps may occur naturally and au-
tomatically as a consequence of other cognitive operations. Maps are continually mutated due
to fluctuations in system dynamics; and maps may combine with other maps with which they
overlap, as a consequence of the nonlinear properties of activation spreading and importance
updating. Map-level evolutionary operations are not closely tied to their Atom-level counter-
parts (a difference from e.g. the close correspondence between map-level logical operations and
underlying Atom-level logical operations).
38.3 Conceptual Blending
The notion of Conceptual Blending (aka Conceptual Integration) was proposed by Gilles Fau-
connier and Mark Turner [FT02] as a general theory of cognition. According to this theory, the
basic operation of creative thought is the "blend" in which elements and relationships from
diverse scenarios are merged together in a judicious way. As a very simple example, we may
consider the blend of "tower" and "snake" to form a new concept of "snake tower" (a tower that
looks somewhat like a snake). However, most examples of blends will not be nearly so obvious.
For instance, the complex numbers could be considered a blend between 2D points and real
numbers. Figure 38.1 gives a conceptual illustration of the blending process.
The production of a blend is generally considered to have three key stages (elucidated via
the example of building a snake-tower out of blocks):
• composition: combining judiciously chosen elements from two or more concept inputs
- Example: Taking the "buildingness" and "verticalness" of a tower, and the "head" and
"mouth" and "tail" of a snake
• completion: adding new elements from implicit background knowledge about the concept
inputs
— Example: Perhaps a mongoose-building will be built out of blocks, poised in a position
indicating it is chasing the snake-tower (incorporating the background knowledge that
mongooses often chase snakes)
• elaboration: fine-tuning, which shapes the elements into a new concept, guided by the desire
to optimize certain criteria
Fig. 38.1: Conceptual Illustration of Conceptual Blending
— Example: The tail of the snake-tower is a part of the building that rests on the ground,
and connects to the main tower. The head of the snake-tower is a portion that sits atop
the main tower, analogous to the restaurant atop the Space Needle.
The "judiciousness" in the composition phase may be partially captured in CogPrime via
PLN inference, via introducing a "consistency criterion" that the elements chosen as part of
the blend should not dramatically decrease in confidence after the blend's relationships are
submitted to PLN inference. One especially doesn't want to choose mutually contradictory
elements from the two inputs. For instance one doesn't want to choose "alive" as an element
of "snake", and "non-living" as an element of "building." This kind of contradictory choice can
be ruled out by inference, because after very few inference steps, this choice would lead to a
drastic confidence reduction for the InheritanceLinks to both "alive" and "non-living."
Aside from consistency, some other criteria considered relevant to evaluating the quality of
a blend are:
• topology principle that relations in the blend should match the relations of their counterparts
in other concepts related to the concept inputs
• web principle that the representation in the blended space should maintain mappings to the
concept inputs
• unpacking principle that, given a blended concept, the interpreter should be able to infer
things about other related concepts
• good reason principle that there should be simple explanations for the elements of the blend
• metonymic tightening that when metonymically related elements are projected into the
blended space, there is pressure to compress the "distance" between them.
While vague-sounding in their verbal formulations, these criteria have been computationally
implemented in the Sapper system, which uses blending theory to model analogy and metaphor
[VC91, VO07]; and in a different form in [Per06]'s framework for computational creativity. In
CogPrime terms, these various criteria essentially boil down to: the new, blended concept
should get a lot of interesting links.
One could implement blending in CogPrime very straightforwardly via an evolutionary ap-
proach: search the space of possible blends, evaluating each one according to its consistency
but also the STI that it achieves when released into the Atomspace. However, this will be quite
computationally expensive, so a wiser approach is to introduce heuristics aimed at increasing
the odds of producing important blends.
A simple heuristic is to calculate, for each candidate blend, the amount of STI that the
blend would possess N cycles later if, at the current time, it was given a certain amount of STI.
A blend that would accumulate more STI in this manner may be considered more promising,
because this means that its components are more richly interconnected. Further, this heuristic
may be used as a guide for greedy heuristics for creating blends: e.g. if one has chosen a certain
element A of the first blend input, then one may seek an element B of the second blend input
that has a strong Hebbian link to A (if such a B exists).
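This look-ahead heuristic can be sketched under a deliberately simple assumption: STI spreads linearly along Hebbian weights, with each node keeping a (1 - decay) fraction of its STI per cycle. The function name and data layout are illustrative, not part of CogPrime's actual economic attention allocation.

```python
def sti_after_n_cycles(weights, seed, sti0=1.0, n=5, decay=0.1):
    """Estimate the STI a candidate blend would hold after n cycles, under
    a toy linear diffusion model: each cycle, every node keeps (1 - decay)
    of its STI and receives STI from neighbours in proportion to Hebbian
    weights.  `weights` maps node -> {neighbour: weight}."""
    sti = {node: 0.0 for node in weights}
    sti[seed] = sti0                       # inject STI into the blend
    for _ in range(n):
        nxt = {node: (1.0 - decay) * s for node, s in sti.items()}
        for node, s in sti.items():
            for nbr, w in weights.get(node, {}).items():
                nxt[nbr] = nxt.get(nbr, 0.0) + decay * w * s
        sti = nxt
    return sti[seed]
```

Under this model a richly interconnected blend retains more STI than an isolated one, because diffused STI flows back along its links, which is exactly the property the heuristic is meant to detect.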
However, it may also be interesting to pursue different sorts of heuristics, using information-
theoretic or other mathematical criteria to preliminarily filter possible blends before they are
evaluated more carefully via integrated cognition and importance dynamics.
38.3.1 Outline of a CogPrime Blending Algorithm
A rough outline of a concept blending algorithm for CogPrime is as follows:
• Choose a pair of concepts C1 and C2, which have a nontrivially-strong HebbianLink between
them, but not an extremely high-strength SimilarityLink between them (i.e. the concepts
should have something to do with each other, but not be extremely similar; blends of
extremely similar things are boring). These parameters may be twiddled.
• Form a new concept C3, which has some of C1's links, and some of C2's links
• If C3 has obvious contradictions, resolve them by pruning links. (For instance, if C1 inherits
from alive to degree .9 and C2 inherits from alive to degree .1, then one of these two
TruthValue versions for the InheritanceLink from alive has got to be pruned...)
• For each of C3's remaining links L, make a vector indicating everything it or its targets are
associated with (via HebbianLinks or other links). This is basically a list of "what's related
to L". Then, assess whether there are a lot of common associations to the links L that came
from C1 and the links L that came from C2
• If the filter in step 4 is passed, then let the PLN forward chainer derive some conclusions
about C3, and see if it comes up with anything interesting (e.g. anything with surprising
truth value, or anything getting high STI, etc.)
Steps 1 and 2 should be repeated over and over. Step 5 is basically "cognition as usual" - i.e.
by the time the blended concept is thrown into the Atomspace and subjected to Step 5, it's
being treated the same as any other ConceptNode.
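Steps 2-4 of the outline can be sketched as follows. Everything here is a simplification for illustration: concepts are dicts of link truth values, the contradiction test is a crude truth-value gap, and `associations` stands in for HebbianLink neighbourhoods.

```python
import random

def blend_concepts(c1, c2, associations, keep=0.6, overlap_min=1, rng=random):
    """Sketch of steps 2-4 of the blending outline.  Concepts are dicts
    {link: truth_value in [0,1]}; `associations` maps a link to the set of
    atoms associated with it.  Returns the blend, or None if the
    common-association filter fails."""
    # Step 2: the blend takes some of C1's links and some of C2's links.
    c3 = {l: tv for l, tv in {**c1, **c2}.items() if rng.random() < keep}
    # Step 3: prune obvious contradictions -- where the parents assign the
    # same link sharply different truth values, keep just one version.
    for link in set(c1) & set(c3) & set(c2):
        if abs(c1[link] - c2[link]) > 0.5:
            c3[link] = rng.choice([c1[link], c2[link]])
    # Step 4: require common associations between the links inherited
    # from C1 and those inherited from C2.
    assoc1 = set().union(*[associations.get(l, set()) for l in c3 if l in c1] or [set()])
    assoc2 = set().union(*[associations.get(l, set()) for l in c3 if l in c2] or [set()])
    return c3 if len(assoc1 & assoc2) >= overlap_min else None
```

Surviving blends would then be handed to step 5, i.e. released into the Atomspace for ordinary PLN processing.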
The above is more of a meta-algorithm than a precise algorithm. Many avenues for variation
exist, including
• Step 1: heuristics for choosing what to try to blend
• Step 3: how far do we go here, at removing contradictions? Do we try simple PLN inference
to see if contradictions are unveiled, or do we just limit the contradiction-check to seeing if
the same exact link is given different truth-values?
• Step 4: there are many different ways to build this association-vector. There are also many
ways to measure whether a set of association-vectors demonstrates "common associations".
Interaction information [Bel03] is one fancy way; there are also simpler ones.
• Step 5: there are various ways to measure whether PLN has come up with anything inter-
esting
38.3.2 Another Example of Blending
To illustrate these ideas further, consider the example of the SUV - a blend of "Car" and "Jeep".
Among the relevant properties of Car are:
• appealing to ordinary consumers
• fuel efficient
• fits in most parking spots
• easy to drive
• 2 wheel drive
Among the relevant properties of Jeep are:
• 4 wheel drive
• rugged
• capable of driving off road
• high clearance
• open or soft top
Obviously, if we want to blend Car and Jeep, we need to choose properties of each that
don't contradict each other. We can't give the Car/Jeep both 2 wheel drive and 4 wheel drive.
4 wheel drive wins for Car/Jeep because sacrificing it would get rid of "capable of driving off
road", which is critical to Jeep-ness; whereas sacrificing 2WD doesn't kill anything that's really
critical to car-ness.
On the other hand, having a soft top would really harm "appealing to consumers", which
from the view of car-makers is a big part of being a successful car. But getting rid of the hard
top doesn't really harm other aspects of jeep-ness in any serious way.
However, what really made the SUV successful was that "rugged" and "high clearance"
turned out to make SUVs look funky to consumers, thus fulfilling the "appealing to ordinary
consumers" feature of Car. In other words, the presence of the links
• looks funky → appealing to ordinary consumers
• rugged & high clearance → looks funky
made a big difference. This is the sort of thing that gets figured out once one starts doing PLN
inference on the links associated with a candidate blend.
However, if one views each feature of the blend as a probability distribution over concept
space - for instance indicating how closely associated each concept is with that feature (e.g.
via HebbianLinks) then we see that the mutual information (and more generally interaction
information) between the features of the blend, is a quick estimate of how likely it is that
inference will lead to interesting conclusions via reasoning about the combination of features
that the blend possesses.
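A minimal sketch of this estimate, under the simplifying assumption that each feature is a binary variable over a finite set of concepts (associated or not) rather than a full distribution over concept space:

```python
from math import log2

def feature_mi(assoc1, assoc2, universe):
    """Mutual information between two blend features, each treated as a
    binary variable over a finite concept `universe`: a concept either is
    or is not associated with the feature.  A quick estimate of how much
    joint reasoning about the two features might reveal."""
    n = len(universe)
    joint = {}
    for c in universe:
        key = (c in assoc1, c in assoc2)
        joint[key] = joint.get(key, 0) + 1 / n
    # marginal distributions of each feature
    p1 = {v: sum(p for (a, _), p in joint.items() if a == v) for v in (True, False)}
    p2 = {v: sum(p for (_, b), p in joint.items() if b == v) for v in (True, False)}
    mi = 0.0
    for (a, b), p in joint.items():
        if p > 0 and p1[a] > 0 and p2[b] > 0:
            mi += p * log2(p / (p1[a] * p2[b]))
    return mi
```

Graded association strengths (e.g. HebbianLink weights) would replace the binary sets in a fuller treatment, and interaction information generalizes the same computation to more than two features.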
38.4 Clustering
Next, a different method for creating new ConceptNodes in CogPrime is to use clustering
algorithms. There are many different clustering algorithms in the statistics and data mining
literature, and no doubt many of them could have value inside CogPrime. We have experi-
mented with several different clustering algorithms in the CogPrime context, and have selected
one, which we call Omniclust [GCPM06], based on its generally robust performance on high-
volume, noisy data. However, other methods such as EM (Expectation-Maximization) clustering
[WF05] would likely serve the purpose very well also.
In the above discussion on evolutionary concept creation, we mentioned the use of a cluster-
ing algorithm to cluster links. The same algorithm we describe here for clustering ConceptNodes
directly and creating new ConceptNodes representing these clusters, can also be used for clus-
tering links in the context of node mutation and crossover.
The application of Omniclust or any other clustering algorithm for ConceptNode creation
in CogPrime is simple. The clustering algorithm is run periodically, and the most significant
clusters that it finds are embodied as ConceptNodes, with InheritanceLinks to their members.
If these significant clusters have subclusters also identified by Omniclust, then these subclusters
are also made into ConceptNodes, etc., with InheritanceLinks between clusters and subclusters.
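This cluster-embodiment step can be sketched as follows; `ConceptNode` and `InheritanceLink` are represented here by plain tuples rather than real Atomspace structures, and the significance test is reduced to a hypothetical minimum cluster size.

```python
def embody_clusters(clusters, min_size=3):
    """Turn significant clusters (lists of member atoms) into new concept
    records, plus inheritance links from each member to its cluster-concept.
    The tuples stand in for real Atomspace node/link objects."""
    atoms, links = [], []
    for i, members in enumerate(clusters):
        if len(members) < min_size:      # skip insignificant clusters
            continue
        concept = ("ConceptNode", f"cluster-{i}")
        atoms.append(concept)
        links.extend(("InheritanceLink", m, concept) for m in members)
    return atoms, links
```

Subclusters would be handled by a recursive application of the same routine, with additional InheritanceLinks from subcluster-concepts to cluster-concepts.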
Clustering technology is famously unreliable, but this unreliability may be mitigated some-
what by using clusters as initial guesses at concepts, and using other methods to refine the
clusters into more useful concepts. For instance, a cluster may be interpreted as a disjunctive
predicate, and a search may be made to determine sub-disjunction about which interesting
PLN conclusions may be drawn.
38.5 Concept Formation via Formal Concept Analysis
Another approach to concept formation is an uncertain version of Formal Concept Analysis
[GSW05]. There are many ways to create such a version; here we describe one approach we
have found interesting, called Fuzzy Concept Formation (FCF).
The general formulation of FCF begins with n objects O1, ..., On, m basic attributes
a1, ..., am, and information that object Oi possesses attribute aj to degree wij ∈ [0,1]. In
CogPrime, the objects and attributes are Atoms, and wij is the strength of the InheritanceLink
pointing from Oi to aj.
In this context, we may define a concept as a fuzzy set of objects, and a derived attribute
as a fuzzy set of attributes.
Fuzzy concept formation (FCF) is, then, a process that produces N "concepts" Cn+1, ..., Cn+N
and M "derived attributes" dm+1, ..., dm+M, based on the initial set of objects and attributes.
We can extend the weight matrix wij to include entries involving concepts and derived at-
tributes as well, so that e.g. wn+3,m+5 indicates the degree to which concept Cn+3 possesses
derived attribute dm+5.
The learning engine underlying FCF is a clustering algorithm clust = clust(X1, ..., Xr; b)
which takes in r vectors Xi ∈ [0,1]^n and outputs b or fewer clusters of these vectors. The
overall FCF process is independent of the particular clustering algorithm involved, though the
interestingness of the concepts and attributes formed will of course vary widely based on the
specific clustering algorithm. Some clustering algorithms will work better with large values of
b, others with smaller values of b.
We then define the process form_concepts(b) to operate as follows. Given a set S = {S1, ..., Sh}
containing objects, concepts, or a combination of objects and concepts, and an attribute vector
of length h with entries in [0,1] corresponding to each Si, one applies clust to find b clusters
of attribute vectors: B1, ..., Bb. Each of these clusters may be considered as a fuzzy set,
for instance by considering the membership of x in cluster B to be 2^(-d(x, centroid(B))) for an
appropriate metric d. These fuzzy sets are the b concepts produced by form_concepts(b).
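As a concrete sketch, form_concepts can be realized with a simple k-means-style clusterer and the 2^(-d) membership rule. The clusterer here is one arbitrary choice of clust, not the Omniclust algorithm mentioned earlier.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def form_concepts(vectors, b, iters=10, rng=random):
    """Toy realization of form_concepts(b): cluster the rows' attribute
    vectors k-means-style, then report each cluster as a fuzzy concept via
    the membership rule 2**(-d(x, centroid))."""
    centroids = rng.sample(vectors, min(b, len(vectors)))
    for _ in range(iters):
        # assign each vector to its nearest centroid
        buckets = [[] for _ in centroids]
        for v in vectors:
            k = min(range(len(centroids)), key=lambda j: dist(v, centroids[j]))
            buckets[k].append(v)
        # recompute centroids (keep the old one if a bucket empties)
        centroids = [tuple(sum(col) / len(bk) for col in zip(*bk)) if bk else c
                     for bk, c in zip(buckets, centroids)]
    # one fuzzy concept per cluster: membership degree of every input row
    return [[2 ** -dist(v, c) for v in vectors] for c in centroids]
```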
38.5.1 Calculating Membership Degrees of New Concepts
The degree to which a concept defined in this way possesses an attribute may be defined in a
number of ways; perhaps the simplest is to take the membership-weighted average of the degrees
to which the members of the concept possess the attribute. For instance, to figure out the degree
to which beautiful women (a concept) are insane (an attribute), one would calculate

  ( Σ_{w ∈ beautiful_women} χ_beautiful_women(w) χ_insane(w) ) / ( Σ_{w ∈ beautiful_women} χ_beautiful_women(w) )

where χ_X(w) denotes the fuzzy membership degree of w in X. One could probably also consider
ExtensionalInheritance beautiful_women insane.
38.5.2 Forming New Attributes
One may define an analogous process form_attributes(b) that begins with a set A = {A1, ..., Ak}
containing (basic and/or derived) attributes, and for each Ai a column vector vi of length h with
entries in [0,1] (the column vector tells the degrees to which various objects possess the attribute
Ai). One applies clust to find b clusters of the vectors vi: B1, ..., Bb. These clusters may be
interpreted as fuzzy sets, which are derived attributes.
38.5.2.1 Calculating Membership Degrees of New, Derived Attributes
One must then define the degree to which an object or concept possesses a derived attribute.
One way to do this is using a geometric mean. For instance, suppose there is a derived attribute
formed by combining the attributes vain, selfish and egocentric. Then, the degree to which the
concept banker possesses this new derived attribute could be defined by

  ( Σ_{b ∈ banker} χ_banker(b) (χ_vain(b) χ_selfish(b) χ_egocentric(b))^(1/3) ) / ( Σ_{b ∈ banker} χ_banker(b) )
38.5.3 Iterating the Fuzzy Concept Formation Process
Given a set S of concepts and/or objects with a set A of attributes, one may define
• append_concepts(S', S) as the result of adding the concepts in the set S' to S, and evalu-
ating all the attributes in A on these concepts, to get an expanded matrix w
• append_attributes(A', A) as the result of adding the attributes in the set A' to A, and
evaluating all the attributes in A' on the concepts and objects in S, to get an expanded
matrix w
• collapse(S, A) is the result of taking (S, A) and eliminating any concept or attribute that
has distance less than c from some other concept or attribute that comes before it in
the lexicographic ordering of concepts or attributes. I.e., collapse removes near-duplicate
concepts or attributes.
Now, one may begin with a set S of objects and attributes, and iteratively run a process
such as
b = n/r   // e.g. r = 2, or r = 1.5
while (b > 1) {
  S = append_concepts(S, form_concepts(S, b))
  S = collapse(S)
  S = append_attributes(S, form_attributes(S, b))
  S = collapse(S)
  b = b/r
}
where r governs the number of iterations. This will terminate in finite time with a
finitely expanded matrix w containing a number of concepts and derived attributes in addition
to the original objects and basic attributes.
Or, one may look at
while (S is different from old_S) {
  old_S = S
  S = append_concepts(S, form_concepts(S, b))
  S = collapse(S)
  S = append_attributes(S, form_attributes(S, b))
  S = collapse(S)
}
This second version raises the mathematical question of the speed with which it will terminate
(as a function of c). I.e., when does the concept and attribute formation process converge, and
how fast? This will surely depend on the clustering algorithm involved.
Section VI
Integrative Learning
Chapter 39
Dimensional Embedding
39.1 Introduction
Among the many key features of the human brain omitted by typical formal neural network
models, one of the foremost is the brain's three-dimensionality. The brain is not just a network of
neurons arranged as an abstract graph; it's a network of neurons arranged in three-dimensional
space, and making use of this three-dimensionality directly and indirectly in various ways and
for various purposes. The somatosensory cortex contains a geometric map reflecting, approxi-
mately, the geometric structure of parts of the body. The visual cortex uses the 2D layout
of cortical sheets to reflect the geometric structure of perceived space: motion detection neu-
rons often fire in the actual physical direction of motion, etc. The degree to which the brain
uses 2D and 3D geometric structure to reflect conceptual rather than perceptual or motoric
knowledge is unclear, but we suspect it is considerable. One well-known idea in this direction is
the "self-organizing map" or Kohonen net [Koh01], a highly effective computer science algo-
rithm that performs automated classification and clustering via projecting higher-dimensional
(perceptual, conceptual or motoric) vectors into a simulated 2D sheet of cortex.
It's not clear that the exploitation of low-dimensional geometric structure is something an
AGI system necessarily must support - there are always many different approaches to any
aspect of the AGI problem. However, the brain does make clear that exploitation of this sort
of structure is a powerful way to integrate various useful heuristics. In the context of mind-
world correspondence theory, there seems clear potential value in having a mind mirror the
dimensional structure of the world, at some level of approximation.
It's also worth emphasizing that the brain's 3D structure has minuses as well as plusses - one
suspects it complexifies and constrains the brain, along with implicitly suggesting various useful
heuristics. Any mathematical graph can be represented in 3 dimensions without links crossing
(unlike in 2 dimensions), but that doesn't mean the representation will always be efficient or
convenient - sometimes it may result in conceptually related, and/or frequently interacting,
entities being positioned far away from each other geometrically. Coupled with noisy signaling
methods such as the brain uses, this sometime lack of alignment between conceptual/pragmatic
and geometric structure can lead to various sorts of confusion (e.g. when neuron A sends a signal
to a physically distant neuron B, this may cause various side-effects along the path, some of which
wouldn't happen if A and B were close to each other).
In the context of CogPrime, the most extreme way to incorporate a brain-like 3D structure
would be to actually embed an Atomspace in a bounded 3D region. Then the Atomspace would
be geometrically something like a brain, but with abstract nodes and links (some having explicit
symbolic content) rather than purely subsymbolic neurons. This would not be a ridiculous thing
to do, and could yield interesting results. However, we are unsure this would be an optimal
approach. Instead we have opted for a more moderate approach: couple the non-dimensional
Atomspace with a dimensional space, containing points corresponding to Atoms. That is, we
perform an embedding of Atoms in the OpenCog AtomSpace into n-dimensional space - a
judicious transformation of (hyper)graphs into vectors.
This embedding has applications to PLN inference control, and to the guidance of instance
generation in PEPL learning of Combo trees. It is also, in itself, a valuable and interesting
heuristic for sculpting the link topology of a CogPrime AtomSpace. The basic dimensional
embedding algorithm described here is fairly simple and not original to CogPrime, but it has
not previously been applied in any similar context.
The intuition underlying this approach is that there are some cases (e.g. PLN control, and
PEPL guidance) where dimensional geometry provides a useful heuristic for constraining a
huge search space, via providing a compact way of storing a large amount of information.
Dimensionally embedding Atoms lets CogPrime be dimensional like the brain when it needs to
be, yet with the freedom of nondimensionality the rest of the time. This dual strategy is one
that may be of value for AGI generally beyond the CogPrime design, and is somewhat related
to (though different in detail from) the way the CLARION cognitive architecture [SZ04] maps
declarative knowledge into knowledge appropriate for its neural net layer.
There is an obvious way to project CogPrime Atoms into n-dimensional space, by assigning
each Atom a numerical vector based on the weights of its links. But this is not a terribly
useful approach, because the vectors obtained in this way will live, potentially, in millions- or
billions-dimensional space. The approach we describe here is a bit different. We are defining
more specific embeddings, each one based on a particular link type or set of link types. And we
are doing the embedding into a space whose dimensionality is high but not too high, e.g. n=50.
This moderate dimensional space could then be projected down into a lower dimensional space,
like a 3D space, if needed.
The philosophy underlying the ideas proposed here is similar to that underlying Principal
Components Analysis (PCA) in statistics [Jol10]. The n-dimensional spaces we define here, like
those used in PCA or LSI (for Latent Semantic Indexing [LMDK07]), are defined by sets of
orthogonal concepts extracted from the original space of concepts. The difference is that PCA and
LSI work on spaces of entities defined by feature vectors, whereas the methods described here
work for entities defined as nodes in weighted graphs. There is no precise notion of orthogonality
for nodes in a weighted graph, but one can introduce a reasonable proxy.
39.2 Link Based Dimensional Embedding
In this section we define the type of dimensional embedding that we will be talking about. For
concreteness we will speak in terms of CogPrime nodes and links, but the discussion applies
much more generally than that.
A link based dimensional embedding is defined as a mapping that maps a set of CogPrime
Atoms into points in an n-dimensional real space, by:
• mapping link strength into coordinate values in an embedding space, and
• representing nodes as points in this embedding space, using the coordinate values defined
by the strengths of their links.
In the usual case, a dimensional embedding is formed from links of a single type, or from links
whose types are very closely related (e.g. from all symmetrical logical links).
Mapping all the link strengths of the links of a given type into coordinate values in a dimen-
sional space is a simple, but not a very effective strategy. The approach described here is based
on strategically choosing a subset of particular links and forming coordinate values from them.
The choice of links is based on the desire for a correspondence between the metric structure of
the embedding space, and the metric structure implicit in the weights of the links of the type
being embedded. The basic idea of metric preservation is depicted in Figure 39.1.
Fig. 39.1: Metric-Preserving Dimensional Embedding. The basic idea of the sort of em-
bedding described here is to map Atoms into numerical vectors, in such a way that, on average,
distance between Atoms roughly correlates with distance between corresponding vectors. (The
picture shows a 3D embedding space for convenience, but in reality the dimension of the em-
bedding space will generally be much higher.)
More formally, let proj(A) denote the point in R^n corresponding to the Atom A. Then if, for
example, we are doing an embedding based on SimilarityLinks, we want there to be a strong
correlation (or rather anticorrelation) between:
(SimilarityLink A B).tv = s
and
dE (proj(A), proj(B))
where dE denotes the Euclidean distance on the embedding space. This is a simple case
because SimilarityLink is symmetric. Dealing with asymmetric links like InheritanceLinks is a
little subtler, and will be done below in the context of inference control.
Larger dimensions generally allow greater correlation, but add complexity. If one chooses the
dimensionality equal to the number of nodes in the graph, there is really no point in doing the
embedding. On the other hand, if one tries to project a huge and complex graph into 1 or 2
dimensions, one is bound to lose a lot of important structure. The optimally useful embedding
will be into a space whose dimension is large, but not too large.
For internal CogPrime inference purposes, we should generally use a moderately high-
dimensional embedding space, say n=50 or n=100.
39.3 Harel and Koren's Dimensional Embedding Algorithm
Our technique for embedding CogPrime Atoms into high-dimensional space is based on an
algorithm suggested by David Harel and Yehuda Koren [HK02]. Their work is concerned with
visualizing large graphs, and they propose a two-phase approach:
1. embed the graph into a high-dimensional real space
2. project the high-dimensional points into 2D or 3D space for visualization
In CogPrime, we don't always require the projection step (step 2); our focus is on the initial
embedding step. Harel and Koren's algorithm for dimensional embedding (step 1) is directly
applicable to the CogPrime context.
Of course this is not the only embedding algorithm that would be reasonable to use in a
CogPrime context; it's just one possibility that seems to make sense.
Their algorithm works as follows.
Suppose one has a graph with symmetric weighted links. Further, assume that between any
two nodes in the graph, there is a way to compute the weight that a link between those two
nodes would have, even if the graph in fact doesn't contain a link between the two nodes.
In the CogPrime context, for instance, the nodes of the graph may be ConceptNodes, and
the links may be SimilarityLinks. We will discuss the extension of the approach to deal with
asymmetric links like InheritanceLinks, later on.
Let n denote the dimension of the embedding space (e.g. n = 50). We wish to map graph
nodes into points in R^n, in such a way that the weight of the graph link between A and B
correlates with the distance between proj(A) and proj(B) in R^n.
39.3.1 Step 1: Choosing Pivot Points
Choose n "pivot points" that are roughly uniformly distributed across the graph.
To do this, one chooses the first pivot point at random and then iteratively chooses the i'th
point to be maximally distant from the previous (i — 1) points chosen.
One may also use additional criteria to govern the selection of pivot points. In CogPrime, for
instance, we may use long-term stability as a secondary criterion for selecting Atoms to serve as
pivot points. Greater computational efficiency is achieved if the pivot-point Atoms don't change
frequently.
39.3.2 Step 2: Similarity Estimation
Estimate the similarity between each Atom being projected, and the n pivot Atoms.
This is expensive. However, the cost is decreased somewhat in the CogPrime case by caching
the similarity values produced in a special table (they may not be important enough otherwise
to be preserved in CogPrime). Then, in cases where neither the pivot Atom nor the Atom
being compared to it have changed recently, the cached value can be reused.
39.3.3 Step 3: Embedding
Create an n-dimensional space by assigning a coordinate axis to each pivot Atom. Then, for an
Atom i, the i'th coordinate value is given by its similarity to the i'th pivot Atom.
After this step, one has transformed one's graph into a collection of n-dimensional vectors.
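Steps 2 and 3 reduce to a single comprehension once a similarity function is available. In this sketch, numbers stand in for Atoms and the `sim` function is a hypothetical stand-in for the (cached) similarity estimates described in Step 2.

```python
def embed(nodes, pivots, sim):
    # Steps 2 + 3: each node's coordinate vector lists its similarity
    # to each of the n pivot Atoms.
    return {v: [sim(v, p) for p in pivots] for v in nodes}

# Toy similarity on numbers standing in for Atoms: closer numbers are
# more similar (illustrative only).
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
vectors = embed([1, 2, 9], pivots=[1, 9], sim=sim)
```

Nodes that are similar to the same pivots end up with nearby coordinate vectors, which is exactly the property the inference-control application below relies on.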
39.4 Embedding Based Inference Control
One important application for dimensional embedding in CogPrime is to help with the control
of
• Logical inference
• Direct evaluation of logical links
We describe how it can be used specifically to stop the CogPrime system from continually trying
to make the same unproductive inferences.
To understand the problem being addressed, suppose the system tries to evaluate the strength
of the relationship
SimilarityLink foot toilet
Assume that no link exists in the system representing this relationship.
Here "foot" and "toilet" are hypothetical ConceptNodes that represent aspects of the concepts
of foot and toilet respectively. In reality these concepts might well be represented by complex
maps rather than individual nodes.
Suppose the system determines that the strength of this Link is very close to zero. Then
(depending on a threshold in the MindAgent), it will probably not create a SimilarityLink
between the "foot" and "toilet" nodes.
Now, suppose that a few cycles later, the system again tries to evaluate the strength of the
same Link,
SimilarityLink foot toilet
Again, very likely, it will find a low strength and not create the Link at all.
The same problem may occur with InheritanceLinks, or any other (first or higher order)
logical link type.
Why would the system try, over and over again, to evaluate the strength of the same nonex-
istent relationship? Because the control strategies of the current forward-chaining inference and
pattern mining MindAgents are simple by design. These MindAgents work by selecting Atoms
from the AtomTable with probability proportional to importance, and trying to build links
between them. If the foot and toilet nodes are both important at the same time, then these
MindAgents will try to build links between them - regardless of how many times they've tried
to build links between these two nodes in the past and failed.
How do we solve this problem using dimensional embedding? Generally:
• one will need a different embedding space for each link type for which one wants to prevent
repeated attempted inference of useless relationships. Sometimes, very closely related link
types might share the same embedding space; this must be decided on a case-by-case basis.
• in the embedding space for a link type L, one only embeds Atoms of a type that can be
related by links of type L.
It is too expensive to create a new embedding very often. Fortunately, when a new Atom is cre-
ated or an old Atom is significantly modified, it's easy to reposition the Atom in the embedding
space by computing its relationship to the pivot Atoms. Once enough change has happened,
however, new pivot Atoms will need to be recomputed, which is a substantial computational
expense. We must update the pivot point set every N cycles, where N is large; or else, whenever
the total amount of change in the system has exceeded a certain threshold.
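The update policy just described — cheap per-Atom repositioning against a fixed pivot set, with an occasional expensive pivot refresh once enough change has accumulated — can be sketched as follows. The class and its names are hypothetical, not part of any actual CogPrime codebase.

```python
class EmbeddingSpace:
    def __init__(self, pivots, sim, refresh_threshold):
        self.pivots = pivots
        self.sim = sim
        self.coords = {}
        self.changes = 0
        self.refresh_threshold = refresh_threshold

    def reposition(self, atom):
        # A new or significantly modified Atom is cheaply re-embedded by
        # recomputing only its similarities to the fixed pivot Atoms.
        self.coords[atom] = [self.sim(atom, p) for p in self.pivots]
        self.changes += 1

    def pivot_refresh_due(self):
        # Recomputing the pivot set is a substantial expense, so it is
        # only triggered once total change exceeds a threshold.
        return self.changes >= self.refresh_threshold
```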
Now, how is this embedding used for inference control? Let's consider the case of similarity
first. Quite simply, one selects a pair of Atoms (A,B) for SimilarityMining (or inference of a
SimilarityLink) based on some criterion such as, for instance:
importance(A) * importance(B) * simproj(A,B)
where
distproj(A,B) = d_E( proj(A), proj(B) )
simproj(A,B) = 2^(-c * distproj(A,B))
and c is an important tunable parameter.
What this means is that, if A and B are far apart in the SimilarityLink embedding space,
the system is unlikely to try to assess their similarity.
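The selection criterion is straightforward to compute. Below is a minimal sketch, taking d_E to be Euclidean distance on the embedding vectors; the function names mirror the formulas above but are otherwise illustrative.

```python
import math

def distproj(va, vb):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

def simproj(va, vb, c=1.0):
    # simproj = 2^(-c * distproj): 1 for coincident points, decaying
    # toward 0 as the points separate; c tunes how sharply.
    return 2.0 ** (-c * distproj(va, vb))

def selection_weight(importance_a, importance_b, va, vb, c=1.0):
    # Criterion for picking the pair (A, B) for SimilarityMining or
    # SimilarityLink inference.
    return importance_a * importance_b * simproj(va, vb, c)
```

Pairs that are far apart in the embedding space get an exponentially small weight, so the system rarely revisits them.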
This approach is tremendously space-efficient: where there are N Atoms and m pivot Atoms,
N^2 similarity relationships are approximately stored in m*N coordinate values. Furthermore,
the cost of computation is m*N times the cost of assessing a single SimilarityLink. By accepting
crude approximations of actual similarity values, one gets away with linear time and space cost.
Because this is just an approximation technique, there are definitely going to be cases where
A and B are not similar, even though they're close together in the embedding space. When
such a case is found, it may be useful for the AtomSpace to explicitly contain a low-strength
SimilarityLink between A and B. This link will prevent the system from making false embedding-
based decisions to explore (SimilarityLink A B) in the future. Putting explicit low-strength
SimilarityLinks in the system in these cases, is obviously much cheaper than using them for all
cases.
We've been talking about SimilarityLinks, but the approach is more broadly applicable. Any
symmetric link type can be dealt with about the same way. For instance, it might be useful to
keep dimensional embedding maps for
• SimilarityLink
• ExtensionalSimilarityLink
• EquivalenceLink
• ExtensionalEquivalenceLink
On the other hand, dealing with asymmetric links in terms of dimensional embedding requires
more subtlety - we turn to this topic below.
39.5 Dimensional Embedding and InheritanceLinks
Next, how can we use dimensional embedding to keep an approximate record of which links
do not inherit from each other? Because inheritance is an asymmetric relationship, whereas
distance in embedding spaces is a symmetrical relationship, there's no direct and simple way
to do so.
However, there is an indirect approach that solves the problem, which involves maintaining
two embedding spaces, and combining information about them in an appropriate way. In this
subsection, we'll discuss an approach that should work for InheritanceLink, SubsetLink, Impli-
cationLink, and ExtensionalImplicationLink and other related link types. But we'll explicitly
present it only for the InheritanceLink case.
Although the embedding algorithm described above was intended for symmetric weighted
graphs, in fact we can use it for asymmetric links in just about the same way. The use of
the embedding graph for inference control differs, but not the basic method of defining the
embedding.
In the InheritanceLink case, we can define pivot Atoms in the same way, and then we can
define two vectors for each Atom A:
proj_parent(A)_i = (InheritanceLink A A_i).tv.s
proj_child(A)_i = (InheritanceLink A_i A).tv.s
where A_i is the i'th pivot Atom.
If generally proj_child(A)_i <= proj_child(B)_i then qualitatively "children of A are children of
B"; and if generally proj_parent(A)_i >= proj_parent(B)_i then qualitatively "parents of B are parents
of A". The combination of these two conditions means heuristically that (InheritanceLink A B) is
likely. So, by combining the two embedding vectors assigned to each Atom, one can get heuristic
guidance regarding inheritance relations, analogous to the case with similarity relationships. One
may produce mathematical formulas estimating the error of this approach under appropriate
conditions, but in practice it will depend on the probability distribution of the vectors.
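The combined child-vector/parent-vector test can be sketched directly. This is a hedged illustration: the vector values below are made up, and a tolerance parameter is added on the assumption that the embedding coordinates are only approximations.

```python
def inheritance_plausible(child_a, parent_a, child_b, parent_b, tol=0.05):
    # Heuristic screen for exploring (InheritanceLink A B):
    #  - children of A are children of B: proj_child(A)_i <= proj_child(B)_i
    #  - parents of B are parents of A:   proj_parent(A)_i >= proj_parent(B)_i
    # tol allows small violations, since the vectors are approximate.
    kids_ok = all(a <= b + tol for a, b in zip(child_a, child_b))
    parents_ok = all(a + tol >= b for a, b in zip(parent_a, parent_b))
    return kids_ok and parents_ok
```

With A = "cat" and B = "animal" (hypothetical values against two pivots), the test passes in the cat-to-animal direction and fails in the reverse direction, as one would want.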
Chapter 40
Mental Simulation and Episodic Memory
40.1 Introduction
This brief chapter deals with two important, coupled cognitive components of CogPrime : the
component concerned with creating internal simulations of situations and episodes in the ex-
ternal physical world, and the one concerned with storing and retrieving memories of situations
and episodes.
These are components that are likely significantly different in CogPrime from anything that
exists in the human brain, yet, the functions they carry out are obviously essential to human
cognition (perhaps more so to human cognition than to CogPrime's cognition, because Cog-
Prime is by design more reliant on formal reasoning than the human brain is).
Much of human thought consists of internal, quasi-sensory "imaging" of the external physical
world - and much of human memory consists of remembering autobiographical situations and
episodes from daily life, or from stories heard from others or absorbed via media. Often this
episodic remembering takes the form of visualization, but not always. Blind people generally
think and remember in terms of non-visual imagery, and many sighted people think in terms
of sounds, tastes or smells in addition to visual images.
So far, the various mechanisms proposed as part of CogPrime do not have much to do with
either internal imagery or episodic remembering, even though both seem to play a large role
in human thought. This is OK, of course, since CogPrime is not intended as a simulacrum of
human thought, but rather as a different sort of intelligence.
However, we believe it will actually be valuable to CogPrime to incorporate both of these
factors. And for that purpose, we propose
• a novel mechanism: the incorporation within the CogPrime system of a 3D physical-world
simulation engine.
• an episodic memory store centrally founded on dimensional embedding, and linked to the
internal simulation model
40.2 Internal Simulations
The current use of virtual worlds for OpenCog is to provide a space in which human-controlled
agents and CogPrime -controlled agents can interact, thus allowing flexible instruction of the
CogPrime system by humans, and flexible embodied, grounded learning by CogPrime systems.
But this very same mechanism may be used internally to CogPrime, i.e. a CogPrime system may
be given an internal simulation world, which serves as a sort of "mind's eye." Any sufficiently
flexible virtual world software may be used for this purpose, for example OpenSim (http://opensim.org).
Atoms encoding percepts may be drawn from memory and used to generate forms within
the internal simulation world. These forms may then interact according to
• the patterns via which they are remembered to act
• the laws of physics, as embodied in the simulation world
This allows a kind of "implicit memory," in that patterns emergent from the world-embedded
interaction of a number of entities need not explicitly be stored in memory, so long as they will
emerge when the entities are re-awakened within the internal simulation world.
The SimulatorMindAgent grabs important perceptual Atoms and uses them to generate
forms within the internal simulation world, which then act according to remembered dynamical
patterns, with the laws of physics filling in the gaps in memory. This provides a sort of running
internal visualization of the world. Just as important, however, are specific schemata that
utilize visualization in appropriate contexts. For instance, if reasoning is having trouble solving
a problem related to physical entities, it may feed these entities to the internal simulation world
to see what can be discovered. Patterns discovered via simulation can then be fed into reasoning
for further analysis.
The process of perceiving events and objects in the simulation world is essentially identical
to the process of perceiving events and objects in the "actual" world.
And of course, an internal simulation world may be used whether the CogPrime system in
question is hooked up to a virtual world like OpenSim, or to a physical robot.
Finally, perhaps the most interesting aspect of internal simulation is the generation of "vir-
tual perceptions" from abstract concepts. Analogical reasoning may be used to generate virtual
perceptions that were never actually perceived, and these may then be visualized. The need for
"reality discrimination" comes up here, and is easier to enforce in CogPrime than in humans.
A PerceptNode that was never actually perceived may be explicitly embedded in a Hypotheti-
calLink, thus avoiding the possibility of confusing virtual percepts with actual ones. How useful
the visualization of virtual perceptions will be to CogPrime cognition, remains to be seen. This
kind of visualization is key to human imagination but this doesn't mean it will play the same
role in CogPrime's quite different cognitive processes. But it is important that CogPrime has
the power to carry out this kind of imagination.
40.3 Episodic Memory
Episodic memory refers to the memory of our own "life history" that each of us has. Loss of this
kind of memory is the most common type of amnesia in fiction - such amnesia is particularly
dramatic because our episodic memories constitute so much of what we consider as our "selves."
To a significant extent, we as humans remember, reason and relate in terms of stories - and the
centerpiece of our understanding of stories is our episodic memory. A CogPrime system need
not be as heavily story-focused as a typical human being (though it could be, potentially) -
but even so, episodic memory is a critical component of any CogPrime system controlling an
agent in a world.
The core idea underlying CogPrime's treatment of episodic memory, is a simple one: two
dimensional embedding spaces dedicated to episodes. An episode - a coherent collection of
happenings, often with causal interrelationships, often (but not always) occurring near the
same spatial or temporal locations as each other - may be represented explicitly as an Atom,
and implicitly as a map whose key is that Atom. These episode-Atoms may then be mapped
into two dedicated embedding spaces:
• one based on a distance metric determined by spatiotemporal proximity
• one based on a distance metric determined by semantic similarity
A story is then a series of episodes - ideally one that, if the episodes in the series become
important sequentially in the AtomSpace, causes a significant important-goal-related (ergo emo-
tional) response in the system. Stories may also be represented as Atoms, in the simplest case
consisting of SequentialAND links joining episode-Atoms. Stories then correspond to paths
through the two episodic embedding spaces. Each path through each embedding space implic-
itly has a sort of "halo" in the space - visualizable as a tube snaking through the space, centered
on the path. This tube contains other paths - other stories - that related to the given center
story, either spatiotemporally or semantically.
The familiar everyday human experience of episodic memory may then be approximatively
emulated via the properties of the dimensional embedding space. For instance, episodic memory
is famously associative - when we think of one episode or story, we think of others that are
spatiotemporally or semantically associated with it. This emerges naturally from the embedding
space approach, due to the natural emergence of distance-based associative memory in an
embedding space.
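The associative-recall dynamic falls out of simple nearest-neighbor retrieval in the episodic embedding space. A minimal sketch, assuming episode-Atoms have already been mapped to vectors (the episode names and coordinates are invented for illustration):

```python
import math

def recall_related(query, episodes, k=2):
    # Associative episodic recall: return the k stored episodes whose
    # embedding vectors lie nearest the query vector -- the episodes in
    # the query's "halo" in the embedding space.
    def dist(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return sorted(episodes, key=lambda name: dist(episodes[name], query))[:k]
```

The same function serves for both episodic embedding spaces; only the distance metric behind the vectors (spatiotemporal vs. semantic) differs.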
Figures 40.1 and 40.2 roughly illustrate the link between episodic/perceptual and declarative
memory.
Fig. 40.1: Relationship Between Episodic, Declarative and Perceptual Memory. The
nodes and links at the bottom depict declarative memory stored in the Atomspace; the picture at
the top illustrates an archetypal episode stored in episodic memory, and linked to the perceptual
hierarchy enabling imagistic simulation.
Fig. 40.2: Relationship Between Episodic, Declarative and Perceptual Memory. Another
example similar to the one in Fig. 40.1, but referring specifically to events occurring in an
OpenCogPrime-controlled agent's virtual world.
Chapter 41
Integrative Procedure Learning
41.1 Introduction
"Procedure learning" - learning step-by-step procedures for carrying out internal or external
operations - is a highly critical aspect of general intelligence, and is carried out in CogPrime
via a complex combination of methods. This somewhat heterogeneous chapter reviews several
advanced aspects of procedure learning in CogPrime, mainly having to do with the integration
between different cognitive processes.
In terms of general cognitive theory and mind-world correspondence, this is some of the
subtlest material in the book. We are not concerned just with how the mind's learning of one
sort of knowledge correlates with the way this sort of knowledge is structured in the mind's
habitual environments, in the context of its habitual goals. Rather, we are concerned with how
various sorts of knowledge intersect and interact with each other. The proposed algorithmic
intersections between, for instance, declarative and procedural learning processes are reflective
of implicit assumptions about how declarative and procedural knowledge are presented in the
world in the context of the system's goals - but these implicit assumptions are not always easy
to tease out and state in a compact way. We will do our best to highlight these assumptions as
they arise throughout the chapter.
Key among these assumptions, however, are that a human-like mind
• is presented with various procedure learning problems at various levels of difficulty (so that
different algorithms may be appropriate depending on the difficulty level). This leads for
instance to the possibility of using various different algorithms like MOSES or hill climbing,
for different procedure learning problems.
• is presented with some procedure learning problems that may be handled in a relatively
isolated way, and others that are extremely heavily dependent on context, often in a way that
recurs across multiple learning instances in similar contexts. This leads to situations where
the value of bringing declarative (PLN) and associative (ECAN) and episodic knowledge into
the procedure learning process, has varying value depending on the situation.
• is presented with a rich variety of procedure learning problems with complex interrelation-
ships, including many problems that are closely related to previously solved problems in
various ways. This highlights the value of using PLN analogical reasoning, and importance
spreading along HebbianLinks learned by ECAN, to help guide procedure learning in various
ways.
• needs to learn some procedures whose execution may be carried out in a relatively isolated
way, and other procedures whose execution requires intensive ongoing interaction with other
cognitive processes
The diversity of procedure learning situations reflected in these assumptions, leads naturally to
the diversity of technical procedure learning approaches described in this chapter. Potentially
one could have a single, unified algorithm covering all the different sorts of procedure learning,
but instead we have found it more practical to articulate a small number of algorithms which
are then combined in different ways to yield the different kinds of procedure learning.
41.1.1 The Diverse Technicalities of Procedure Learning in CogPrime
On a technical level, this chapter discusses two closely related aspects of CogPrime : schema
learning and predicate learning, which we group under the general category of "procedure learn-
ing."
Schema learning - the learning of SchemaNodes and schema maps (explained further in
Chapter 42) - is CogPrime lingo for learning how to do things. Learning how to act, how to
perceive, and how to think - beyond what's explicitly encoded in the system's MindAgents. As
an advanced CogPrime system becomes more profoundly self-modifying, schema learning will
drive more and more of its evolution.
Predicate learning, on the other hand, is the most abstract and general manifestation of
pattern recognition in the CogPrime system. PredicateNodes, along with predicate maps, are
CogPrime's way of representing general patterns (general within the constraints imposed by
the system parameters, which in turn are governed by hardware constraints). Predicate evolu-
tion, predicate mining and higher-order inference - specialized and powerful forms of predicate
learning - are the system's most powerful ways of creating general patterns in the world and in
the mind. Simpler forms of predicate learning are grist for the mill of these processes.
It may be useful to draw an analogy with another (closely related) very hard problem in
CogPrime, discussed in the book Probabilistic Logic Networks: probabilistic logical unification,
which in the CogPrime /PLN framework basically comes down to finding the SatisfyingSets of
given predicates. Hard logical unification problems can often be avoided by breaking down large
predicates into small ones in strategic ways, guided by non-inferential mind processes, and then
doing unification only on the smaller predicates. Our limited experimental experience indicates
that the same "hierarchical breakdown" strategy also works for schema and predicate learning,
to an extent. But still, as with unification, even when one does break down a large schema or
predicate learning problem into a set of smaller problems, one is still in most cases left with a
set of fairly hard problems.
More concretely, CogPrime procedure learning may be generally decomposed into three as-
pects:
1. Converting back and forth between maps and ProcedureNodes (encapsulation and expan-
sion)
2. Learning the Combo Trees to be embedded in grounded ProcedureNodes
3. Learning procedure maps (networks of grounded ProcedureNodes acting in a coordinated
way to carry out procedures)
Each of these three aspects of CogPrime procedure learning mentioned above may be dealt with
somewhat separately, though relying on largely overlapping methods.
CogPrime approaches these problems using a combination of techniques:
• Evolutionary procedure learning and hillclimbing for dealing with brand new procedure
learning problems, requiring the origination of innovative, highly approximate solutions out
of the blue
• Inferential procedure learning for taking approximate solutions and making them exact,
and for dealing with procedure learning problems within domains where closely analogous
procedure learning problems have previously been solved
• Heuristic, probabilistic data mining for the creation of encapsulated procedures (which then
feed into inferential and evolutionary procedure learning), and the expansion of encapsulated
procedures into procedure maps
• PredictiveImplicationLink formation (augmented by PLN inference on such links) as a Cog-
Prime version of goal-directed reinforcement learning
Using these different learning methods together, as a coherently-tuned whole, one arrives at a
holistic procedure learning approach that combines speculation, systematic inference, encapsu-
lation and credit assignment in a single adaptive dynamic process.
We are relying on a combination of techniques to do what none of the techniques can ac-
complish on their own. The combination is far from arbitrary, however. As we will see, each of
the techniques involved plays a unique and important role.
41.1.1.1 Comments on an Alternative Representational Approach
We briefly pause to contrast certain technical aspects of the present approach to analogous
aspects of the Webmind AI Engine (one of CogPrime's predecessor AI systems, briefly discussed
in Chapter 19.1). This predecessor system used a knowledge representation somewhat similar
to the Atomspace, but with various differences; for instance the base types were Node and Link
rather than Atom, and there was a Node type not used in CogPrime called the SchemaIn-
stanceNode (each one corresponding to a particular instance of a SchemaNode, used within a
particular procedure).
In this approach, complex, learned schemata were represented as distributed networks of el-
ementary SchemaInstanceNodes, but these networks were not defined purely by function ap-
plication - they involved explicit passing of variable values through VariableNodes. Special
logic-gate-bearing objects were created to deal with the distinction between arguments of a
SchemaInstanceNode, and predecessor tokens giving a SchemaInstanceNode permission to act.
While this approach is in principle workable, it proved highly complex in practice, and for
the Novamente Cognition Engine and CogPrime we chose to store and manipulate procedural
knowledge separately from declarative knowledge (via Combo trees).
41.2 Preliminary Comments on Procedure Map Encapsulation and
Expansion
Like other knowledge in CogPrime, procedures may be stored in either a localized (Combo
tree) or globalized (procedure map) manner, with the different approaches being appropriate
for different purposes. Activation of a localized procedure may spur activation of a globalized
procedure, and vice versa - so on the overall mind-network level the representation of procedures
is heavily "glocal."
One issue that looms large in this context is the conversion between localized and globalized
procedures - i.e., in CogPrime lingo, the encapsulation and expansion of procedure maps. This
matter will be considered in more detail in Chapter 42 but here we briefly review some key
ideas.
Converting from grounded ProcedureNodes into maps is a relatively simple learning prob-
lem: one enacts the procedure, observes which Atoms are active at what times during the
enaction process, and then creates PredictiveImplicationLinks between the Atoms active at
a certain time and those active at subsequent times. Generally it will be necessary to enact
the procedure multiple times and with different inputs, to build up the appropriate library of
PredictiveImplicationLinks.
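The expansion step just described reduces to counting temporal co-occurrences across enactions. A minimal sketch, where each run is a time-ordered list of active-Atom sets (the atom names are illustrative):

```python
from collections import defaultdict

def expand_procedure_to_map(runs):
    # For each enaction, count how often Atom a active at one time step
    # is followed by Atom b at the next step: accumulated evidence for
    # a (PredictiveImplicationLink a b) record.
    evidence = defaultdict(int)
    for run in runs:
        for now, nxt in zip(run, run[1:]):
            for a in now:
                for b in nxt:
                    evidence[(a, b)] += 1
    return dict(evidence)
```

Multiple runs with different inputs strengthen the genuinely predictive pairs while incidental co-occurrences stay weak, which is why repeated enaction is needed.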
Converting from maps into ProcedureNodes is significantly trickier. First, it involves carrying
out data mining over the network of ProcedureNodes, identifying subnetworks that are coherent
schema or predicate maps. Then it involves translating the control structure of the map into
explicit logical form, so that the encapsulated version will follow the same order of execution
as the map version. This is an important case of the general process of map encapsulation, to
be discussed in Chapter 42
Next, the learning of grounded ProcedureNodes is carried out by a synergistic combination
of multiple mechanisms, including pure procedure learning methods like hillclimbing and evolu-
tionary learning, and logical inference. These two approaches have quite different characteristics.
Evolutionary learning and hillclimbing excel at confronting a problem that the system has no
clue about, and arriving at a reasonably good solution in the form of a schema or predicate.
Inference excels at deploying the system's existing knowledge to form useful schemata or pred-
icates. The choice of the appropriate mechanism for a given problem instance depends largely
on how much relevant knowledge is available.
A relatively simple case of ProcedureNode learning is where one is given a ConceptNode and
wants to find a ProcedureNode matching it. For instance, given a ConceptNode C, one may
wish to find the simplest possible predicate whose corresponding PredicateNode P satisfies
SatisfyingSet(P) = C
On the other hand, given a ConceptNode C involved in inferred ExecutionLinks of the form
ExecutionLink C A_i B_i
i = 1, ..., n
one may wish to find a SchemaNode so that the corresponding schema will fulfill this
same set of ExecutionLinks. It may seem surprising at first that a ConceptNode might be
involved with ExecutionLinks, but remember that a function can be seen as a set of tuples
(ListLink in CogPrime ) where the first elements, the inputs of the function, are associated
with a unique output. These kinds of ProcedureNode learning may be cast as optimization
problems, and carried out by hillclimbing or evolutionary programming. Once procedures are
learned via evolutionary programming or other techniques, they may be refined via inference.
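Casting ProcedureNode learning as optimization can be illustrated with a generic hillclimber. This is a deliberately simplified sketch: real candidates would be Combo trees and the score a fitness over ExecutionLink matches, whereas here an integer and a toy objective stand in for both.

```python
import random

def hillclimb(initial, neighbors, score, steps=2000, seed=0):
    # Repeatedly sample a neighbor of the current candidate and keep it
    # whenever it scores strictly better -- the simplest form of the
    # hillclimbing used for ProcedureNode learning.
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    for _ in range(steps):
        cand = rng.choice(neighbors(best))
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score
```

In CogPrime the same optimization loop would be run with procedure-specific neighborhood moves (subtree mutations on Combo trees), and the crude result then handed to inference for refinement.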
The other case of ProcedureNode learning is goal-driven learning. Here one seeks a Schema-
Node whose execution will cause a given goal (represented by a GoalNode) to be satisfied.
The details of GoalNodes have already been reviewed; but all we need to know here is simply
that a GoalNode presents an objective function, a function to be maximized; and that it poses
the problem of finding schemata whose enaction will cause this function to be maximized in
specified contexts.
The learning of procedure maps, on the other hand, is carried out by reinforcement learn-
ing, augmented by inference. This is a matter of the system learning HebbianLinks between
ProcedureNodes, as will be described below.
41.3 Predicate Schematization
Now we turn to the process called "predicate schematization," by which declarative knowledge
about how to carry out actions may be translated into Combo trees embodying specific procedures
for carrying out actions. This process is straightforward and automatic in some cases,
but in other cases requires significant contextually-savvy inference. This is a critical process
because some procedure knowledge - especially that which is heavily dependent on context in
either its execution or its utility - will be more easily learned via inferential methods than via
pure procedure-learning methods. But, even if a procedure is initially learned via inference (or
is learned by inference based on cruder initial guesses produced by pure procedure learning
methods), it may still be valuable to have this procedure in compact and rapidly executable
form such as Combo provides.
To proceed with the technical description of predicate schematization in CogPrime, we first
need the notion of an "executable predicate". Some predicates are executable in the sense that
they correspond to executable schemata, others are not. There are executable atomic predi-
cates (represented by individual PredicateNodes), and executable compound predicates (which
are link structures). In general, a predicate may be turned into a schema if it is an atomic executable
predicate, or if it is a compound link structure that consists entirely of executable atomic
predicates (e.g. pick_up, walk_to, can_do, etc.) and temporal links (e.g. SimultaneousAND,
PredictiveImplication, etc.).
Records of predicate execution may then be made using ExecutionLinks, e.g.
ExecutionLink pick_up ( me, ball_7)
is a record of the fact that the schema corresponding to the pick_up predicate was executed
on the arguments (me, ball_7).
It is also useful to introduce some special (executable) predicates related to schema execution:
• can_do, which represents the system's perceived ability to do something
• do, which denotes the system actually doing something; this is used to mark actions as
opposed to perceptions
• just_done, which is true of a schema if the schema has very recently been executed.
The general procedure used in figuring out what predicates to schematize, in order to create
a procedure achieving a certain goal, is: Start from the goal and work backwards, following
PredictiveImplications and EventualPredictiveImplications and treating can_do's as transpar-
ent, stopping when you find something that can currently be done, or else when the process
dwindles due to lack of links or lack of sufficiently certain links.
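The backward-working procedure can be sketched as a simple chain walk. This sketch ignores the combinatorial branching discussed below by always taking the first known precondition; the link table and action names are hypothetical.

```python
def schematize_backwards(goal, links, doable):
    # Work backwards from the goal along (Eventual)PredictiveImplication
    # style links, treating can_do as transparent, until reaching
    # something the agent can currently do.  `links` maps an outcome to
    # its known preconditions; a real system would search and prune
    # among many alternatives here.
    chain = [goal]
    current = goal
    while current not in doable:
        preconds = links.get(current)
        if not preconds:          # no (sufficiently certain) link: give up
            return None
        current = preconds[0]
        chain.append(current)
    return list(reversed(chain))  # perception/action series, start first
```

The reversed chain is exactly the ordered PA-series described next, ready for translation into an executable schema.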
In this process, an ordered list of perceptions and actions will be created. The Atoms in this
perception/action-series (PA-series) are linked together via temporal-logical links.
The subtlety of this process, in general, will occur because there may be many different
paths to follow. One has the familiar combinatorial explosion of backward-chaining inference,
and it may be hard to find the best PA-series among all the mess. Experience-guided pruning
is needed here just as with backward-chaining inference.
Specific rules for translating temporal links into executable schemata, used in this process,
are as follows. All these rule-statements assume that B is in the selected PA-series. All node
variables not preceded by do or can_do are assumed to be perceptions. The '→' denotes the
transformation from predicates to executable schemata.
EventualPredictiveImplicationLink (do A) B
→ Repeat (do A) Until B

EventualPredictiveImplicationLink (do A) (can_do B)
→ Repeat
    do A
    do B
  Until
    Evaluation just_done B

the understanding being that the agent may try to do B and fail, and then try again the next
time around the loop.

PredictiveImplicationLink (do A) (can_do B) <time-lag T>
→ do A
  wait T
  do B

SimultaneousImplicationLink A (can_do B)
→ if A then do B

SimultaneousImplicationLink (do A) (can_do B)
→ do A
  do B

PredictiveImplicationLink A (can_do B)
→ if A then do B

SequentialAndLink A1 ... An
→ A1
  ...
  An

SequentialAndLink A1 ... An <time_lag T>
→ A1
  Wait T
  A2
  Wait T
  ...
  Wait T
  An

SimultaneousANDLink A1 ... An
→ A1
  ...
  An
Note how all instances of can_do are stripped out upon conversion from predicate to schema,
and replaced with instances of do.
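The translation rules above can be sketched as a small dispatcher. This is a minimal illustration under assumed encodings: the tuple link representation and the textual schema forms are my assumptions, not the book's implementation.

```python
# Sketch of the predicate-to-schema translation rules: each temporal link
# type maps to an executable schema form, with can_do stripped out and
# replaced by do (assumed tuple encoding of links).
def schematize(link):
    kind = link[0]
    if kind == "EventualPredictiveImplication":
        a, b = link[1], link[2]
        if isinstance(b, tuple) and b[0] == "can_do":
            # can_do is stripped and replaced with do, per the rules above
            return f"repeat {{ do {a}; do {b[1]} }} until just_done({b[1]})"
        return f"repeat {{ do {a} }} until {b}"
    if kind == "PredictiveImplication":          # variant with a time lag T
        a, b, t = link[1], link[2], link[3]
        return f"do {a}; wait {t}; do {b[1]}"
    if kind == "SimultaneousImplication":
        a, b = link[1], link[2]
        return f"if {a} then do {b[1]}"
    raise ValueError(f"no rule for {kind}")

print(schematize(("EventualPredictiveImplication", "walk_to_teacher",
                  ("can_do", "give_ball"))))
# -> repeat { do walk_to_teacher; do give_ball } until just_done(give_ball)
```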
41.3.1 A Concrete Example
For a specific example of this process, consider the knowledge that: "If I walk to the teacher
while whistling, and then give the teacher the ball, I'll get rewarded."
This might be represented by the predicates
walk to the teacher while whistling
A_1
SimultaneousAND
do Walk_to
ExOutLink locate teacher
EvaluationLink do whistle
If I walk to the teacher while whistling, eventually I will be next to the teacher
EventualPredictiveImplication
A_1
Evaluation next_to teacher
While next to the teacher, I can give the teacher the ball
SimultaneousImplication
EvaluationLink next_to teacher
can_do
EvaluationLink give (teacher, ball)
If I give the teacher the ball, I will get rewarded
PredictiveImplication
just_done
EvaluationLink give (teacher, ball)
Evaluation reward
Via goal-driven predicate schematization, these predicates would become the schemata
walk toward the teacher while whistling
Repeat:
do WalkTo
ExOut locate teacher
do Whistle
Until:
next_to(teacher)
if next to the teacher, give the teacher the ball
If:
Evaluation next_to teacher
Then
do give(teacher, ball)
Carrying out these two schemata will lead to the desired behavior of walking toward the teacher
while whistling, and then giving the teacher the ball when next to the teacher.
Note that, in this example:
• The walk_to, whistle, locate and give used in the example schemata are procedures corre-
sponding to the executable predicates walk_to, whistle, locate and give used in the example
predicates
• Next_to is evaluated rather than executed because (unlike the other atomic predicates in
the overall predicate being made executable) it has no "do" or "can_do" next to it
41.4 Concept-Driven Schema and Predicate Creation
In this section we will deal with the "conversion" of ConceptNodes into SchemaNodes or
PredicateNodes. The two cases involve similar but nonidentical methods; we will begin with the
simpler PredicateNode case. Conceptually, the importance of this should be clear: sometimes
knowledge may be gained via concept-learning or linguistic means, but yet may be useful to
the mind in other forms, e.g. as executable schema or evaluable predicates. For instance, the
system may learn conceptually about bicycle-riding, but then may also want to learn executable
procedures allowing it to ride a bicycle. Or it may learn conceptually about criminal individuals,
but may then want to learn evaluable predicates allowing it to quickly evaluate whether a given
individual is a criminal or not.
41.4.1 Concept-Driven Predicate Creation
Suppose we have a ConceptNode C, with a set of links of the form
MemberLink A_i C, i = 1, ..., n
Our goal is to find a PredicateNode so that firstly,
MemberLink X C
is equivalent to
X "within" SatisfyingSet(P)
and secondly,
P is as simple as possible
This is related to the "Occam's Razor" heuristic, rooted in Solomonoff induction, to be presented
later in this chapter.
We now have an optimization problem: search the space of predicates for P that maximize
the objective function f(P,C), defined as for instance
f(P, C) = cp(P) × r(C, P)

where cp(P), the complexity penalty of P, is a positive function that decreases when P gets
larger, and with r(C, P) =
GetStrength
SimilarityLink
C
SatisfyingSet(P)
This is an optimization problem over predicate space, which can be solved in an approximate
way by the evolutionary programming methods described earlier.
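The optimization f(P, C) = cp(P) × r(C, P) can be illustrated with a toy search. This is not the CogPrime implementation: the conjunctive-predicate representation, the Jaccard stand-in for SimilarityLink strength, the 2^(-size) complexity penalty, and all data are assumptions for illustration.

```python
# Toy illustration of concept predicatization: search conjunctive
# predicates P, scoring each by a complexity penalty cp(P) times the
# similarity between SatisfyingSet(P) and the concept's member set.
import itertools

ITEMS = {
    "rex":    {"barks", "furry"},
    "tom":    {"furry", "meows"},
    "tweety": {"sings"},
}
CONCEPT = {"rex", "tom"}            # observed MemberLinks of C
FEATURES = ["barks", "furry", "meows", "sings"]

def satisfying_set(pred):
    """Entities satisfying the conjunction of feature tests in pred."""
    return {x for x, feats in ITEMS.items() if pred <= feats}

def fitness(pred):
    cp = 2.0 ** (-len(pred))        # simpler predicates score higher
    sat = satisfying_set(pred)
    union = sat | CONCEPT
    sim = len(sat & CONCEPT) / len(union) if union else 0.0   # Jaccard stand-in
    return cp * sim

# exhaustive search stands in for evolutionary learning / hillclimbing
best = max((frozenset(c) for n in range(1, 3)
            for c in itertools.combinations(FEATURES, n)), key=fitness)
print(sorted(best))   # -> ['furry']: the single test matching C exactly
```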
The ConceptPredicatization MindAgent selects ConceptNodes based on
• Importance
• Total (truth value based) weight of attached MemberLinks and EvaluationLinks
and launches an evolutionary learning or hillclimbing task focused on learning predicates based
on the nodes it selects.
41.4.2 Concept-Driven Schema Creation
In the schema learning case, instead of a ConceptNode with MemberLinks and EvaluationLinks,
we begin with a ConceptNode C with ExecutionLinks. These ExecutionLinks were presumably
produced by inference (the only CogPrime cognitive process that knows how to create Execu-
tionLinks for non-ProcedureNodes).
The optimization problem we have here is: search the space of schemata for S that maximize
the objective function f (S,C), defined as follows:
f(S, C) = cp(S) × r(S, C)

Let Q(C) be the set of pairs (X, Y) so that ExecutionLink C X Y, and

r(S, C) =
GetStrength
  SubsetLink
    Q(C)
    Graph(S)
where Graph(S) denotes the set of pairs (X, Y) so that ExecutionLink S X Y, where S has
been executed over all valid inputs.
Note that we consider a SubsetLink here because in practice C would have been observed
on a partial set of inputs.
Operationally, the situation here is very similar to that with concept predicatization. The
ConceptSchematization MindAgent must select ConceptNodes based on:
• Importance
• Total (truth value based) weight of ExecutionLinks
and then feed these to evolutionary optimization or hillclimbing.
41.5 Inference-Guided Evolution of Pattern-Embodying Predicates
Now we turn to predicate learning - the learning of PredicateNodes, in particular.
Aside from logical inference and learning predicates to match existing concepts, how does
the system create new predicates? Goal-driven schema learning (via evolution or reinforcement
learning) provides one alternate approach: create predicates in the context of creating use-
ful schema. Pattern mining, discussed in Chapter 37, provides another. Here we will describe
(yet) another complementary dynamic for predicate creation: pattern-oriented, inference-guided
PredicateNode evolution.
In most general terms, the notion pursued here is to form predicates that embody patterns
in itself and in the world. This brings us straight back to the foundations of the patternist
philosophy of mind, in which mind is viewed as a system for recognizing patterns in itself and
in the world, and then embodying these patterns in itself. This general concept is manifested
in many ways in the CogPrime design, and in this section we will discuss two of them:
• Reward of surprisingly probable Predicates
• Evolutionary learning of pattern-embodying Predicates
These are emphatically not the only way pattern-embodying PredicateNodes get into the sys-
tem. Inference and concept-based predicate learning also create PredicateNodes embodying
patterns. But these two mechanisms complete the picture.
41.5.1 Rewarding Surprising Predicates
The TruthValue of a PredicateNode represents the expected TruthValue obtained by averaging
its TruthValue over all its possible legal argument-values. Some Predicates, however, may have
high TruthValue without really being worthwhile. They may not add any information to their
components. We want to identify and reward those Predicates whose TruthValues actually add
information beyond what is implicit in the simple fact of combining their components.
For instance, consider the PredicateNode
AND
InheritanceLink X man
InheritanceLink X ugly
If we assume the man and ugly concepts are independent, then this PredicateNode will have
the TruthValue
man.tv.s × ugly.tv.s
In general, a PredicateNode will be considered interesting if:
1. Its Links are important
2. Its TruthValue differs significantly from what would be expected based on independence
assumptions about its components
It is of value to have interesting Predicates allocated more attention than uninteresting ones.
Factor 1 is already taken into account, in a sense: if the PredicateNode is involved in many
Links this will boost its activation which will boost its importance. On the other hand, Factor
2 is not taken into account by any previously discussed mechanisms.
For instance, we may wish to reward a PredicateNode if it has a surprisingly large or small
strength value. One way to do this is to calculate:
sdiff = |actual strength − strength predicted via independence assumptions|
        × weight_of_evidence
and then increment the value:
K x sdiff
onto the PredicateNode's LongTermImportance value, and similarly increment STI using a
different constant.
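The surprisingness reward can be sketched directly from the formula above; the constant K and the example numbers are assumptions.

```python
# Sketch of the surprisingness reward: the long-term-importance increment
# is K * sdiff, with sdiff as defined in the text above.
def surprise_boost(actual_s, independent_s, weight_of_evidence, k=10.0):
    """Return K * |actual - predicted-under-independence| * weight_of_evidence."""
    sdiff = abs(actual_s - independent_s) * weight_of_evidence
    return k * sdiff

# AND of (man, ugly) with man.tv.s = 0.5 and ugly.tv.s = 0.2 predicts
# strength 0.1 under independence; an observed 0.4 is surprising.
print(surprise_boost(0.4, 0.5 * 0.2, weight_of_evidence=0.8))
```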
Another factor that might usefully be caused to increment LTI is the simplicity of a Predi-
cateNode. Given two Predicates with equal strength, we want the system to prefer the simpler
one over the more complex one. However, the OccamsRazor MindAgent, to be presented below,
rewards simpler Predicates directly in their strength values. Hence if the latter is in use, it
seems unnecessary to reward them for their simplicity in their LTI values as well. This is an
issue that may require some experimentation as the system develops.
Returning to the surprisingness factor, consider the PredicateNode representing
AND
InheritanceLink X cat
EvaluationLink (eats X) fish
If this has a surprisingly high truth value, this means that there are more X known to (or
inferred by) the system, that both inherit from cat and eat fish, than one would expect given
the probabilities of a random X both inheriting from cat and eating fish. Thus, roughly speaking,
the conjunction of inheriting from cat and eating fish may be a pattern in the world.
We now see one very clear sense in which CogPrime dynamics implicitly leads to predicates
representing patterns. Small predicates that have surprising truth values get extra activation,
hence are more likely to stick around in the system. Thus the mind fills up with patterns.
41.5.2 A More Formal Treatment
It is worth taking a little time to clarify the sense in which we have a pattern in the above
example, using the mathematical notion of pattern reviewed in Chapter 3 of Part 1.
Consider the predicate:
pred1(T).tv
equals
GetStrength
AND
  Inheritance $X cat
  Evaluation eats ($X, fish)

where T is some threshold value (e.g. 0.8). Let B = SatisfyingSet(pred1(T)). B is the set of
everything that inherits from cat and eats fish.
Now we will make use of the notion of basic complexity. If one assumes the entire AtomSpace
A constituting a given CogPrime system as given background information, then the basic
complexity c(B||A) may be considered as the number of bits required to list the handles of the
elements of B, for lookup in A; whereas c(B) is the number of bits required to actually list the
elements of B. Now, the formula given above, defining the set B, may be considered as a process
P whose output is the set B. The simplicity c(P||A) is the number of bits needed to describe
this process, which is a fairly small number. We assume A is given as background information,
accessible to the process.
Then the degree to which P is a pattern in B is given by

1 − c(P||A) / c(B||A)
which, if B is a sizable category, is going to be pretty close to 1.
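The pattern-degree formula can be made concrete with a crude complexity proxy. The byte-length stand-in for description length is an assumption; real algorithmic complexity is uncomputable.

```python
# Toy illustration of the degree-of-pattern formula 1 - c(P||A)/c(B||A),
# using string byte length * 8 as a crude proxy for bits of description.
def degree_of_pattern(process_description, target_set_listing):
    c_process = len(process_description.encode()) * 8   # bits to state P
    c_target = len(target_set_listing.encode()) * 8     # bits to list B
    return max(0.0, 1.0 - c_process / c_target)

process = "AND(Inheritance $X cat, Evaluation eats($X, fish))"
listing = ",".join(f"cat_{i:05d}" for i in range(1000))  # a sizable B
d = degree_of_pattern(process, listing)
print(round(d, 3))   # close to 1 for a large category, as the text notes
```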
The key to there being a pattern here is that the relation:
(Inheritance X cat) AND (eats X fish)
has a high strength and also a high count. The high count means that B is a large set,
either by direct observation or by hypothesis (inference). In the case where the count represents
actual pieces of evidence observed by the system and retained in memory, then quite literally
and directly, the PredicateNode represents a pattern in a subset of the system (relative to the
background knowledge consisting of the system as a whole). On the other hand, if the count
value has been obtained indirectly by inference, then it is possible that the system does not
actually know any examples of the relation. In this case, the PredicateNode is not a pattern
in the actual memory store of the system, but it is being hypothesized to be a pattern in the
world in which the system is embedded.
41.6 PredicateNode Mining
We have seen how the natural dynamics of the CogPrime system, with a little help from spe-
cial heuristics, can lead to the evolution of Predicates that embody patterns in the system's
perceived or inferred world. But it is also valuable to more aggressively and directly create
pattern-embodying Predicates. This does not contradict the implicit process, but rather com-
plements it. The explicit process we use is called PredicateNode Mining and is carried out by a
PredicateNodeMiner MindAgent.
Define an Atom structure template as a schema expression corresponding to a CogPrime
Link in which some of the arguments are replaced with variables. For instance,
Inheritance X cat
EvaluationLink (eats X) fish
are Atom structure templates. (Recall that Atom structure templates are important in PLN
inference control, as reviewed in Chapter 36.)
What the PredicateNodeMiner does is to look for Atom structure templates and logical
combinations thereof which
• Minimize PredicateNode size
• Maximize surprisingness of truth value
This is accomplished by a combination of heuristics.
The first step in PredicateNode mining is to find Atom structure templates with high truth
values. This can be done by a fairly simple heuristic search process.
First, note that if one specifies an (Atom, Link type), one is specifying a set of Atom structure
templates. For instance, if one specifies
(cat, InheritanceLink)
then one is specifying the templates
InheritanceLink $X cat
and
InheritanceLink cat $X
One can thus find Atom structure templates as follows. Choose an Atom with high truth value,
and then, for each Link type, tabulate the total truth value of the Links of this type involving
this Atom. When one finds a promising (Atom, Link type) pair, one can then do inference to
test the truth value of the Atom structure template one has found.
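The tabulation step can be sketched as follows; the flat link-tuple table and the names are illustrative assumptions.

```python
# Sketch of the first mining step: for a chosen high-truth-value Atom,
# tabulate total link truth value per (atom, link type) pair, to find
# promising Atom structure templates (assumed link encoding).
from collections import defaultdict

links = [
    ("InheritanceLink", "rex", "cat", 0.9),
    ("InheritanceLink", "tom", "cat", 0.8),
    ("EvaluationLink", "eats", "cat", 0.1),
    ("InheritanceLink", "cat", "animal", 0.9),
]

def template_scores(atom):
    totals = defaultdict(float)
    for link_type, src, tgt, tv in links:
        if atom in (src, tgt):
            totals[(atom, link_type)] += tv
    return dict(totals)

print(template_scores("cat"))
# the (cat, InheritanceLink) templates dominate, so they are worth
# testing further via inference
```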
Next, given high-truth-value Atom structure templates, the PredicateNodeMiner experi-
ments with joining them together using logical connectives. For each potential combination
it assesses the fitness in terms of size and surprisingness. This may be carried out in two ways:
1. By incrementally building up larger combinations from smaller ones, at each incremental
stage keeping only those combinations found to be valuable
2. For large combinations, by evolution of combinations
Option 1 is basically greedy data mining (which may be carried out via various standard al-
gorithms, as discussed in Chapter 37), which has the advantage of being much more rapid
than evolutionary programming, but the disadvantage that it misses large combinations whose
subsets are not as surprising as the combinations themselves. It seems there is room for both
approaches in CogPrime (and potentially many other approaches as well). The PredicateN-
odeMiner MindAgent contains a parameter telling it how much time to spend on stochastic
pattern mining vs. evolution, as well as parameters guiding the processes it invokes.
So far we have discussed the process of finding single-variable Atom structure templates. But
multivariable Atom structure templates may be obtained by combining single-variable ones. For
instance, given
eats $X fish
lives_in $X Antarctica
one may choose to investigate various combinations such as
(eats $X $Y) AND (lives_in $Y Antarctica)
(this particular example will have a predictably low truth value). So, the introduction of multiple
variables may be done in the same process as the creation of single-variable combinations of
Atom structure templates.
When a suitably fit Atom structure template or logical combination thereof is found, then a
PredicateNode is created embodying it, and placed into the AtomSpace.
41.7 Learning Schema Maps
Next we plunge into the issue of procedure maps - schema maps in particular. A schema map
is a simple yet subtle thing - a subnetwork of the AtomSpace consisting of SchemaNodes,
computing some useful quantity or carrying out some useful process in a cooperative way. The
general purpose of schema maps is to allow schema execution to interact with other mental
processes in a more flexible way than is allowed by compact Combo trees with internal hooks
into the AtomSpace. I.e., to handle cases where procedure execution needs to be very highly
interactive, mediated by attention allocation and other CogPrime dynamics in a flexible way.
But how can schema maps be learned? The basic idea is simply reinforcement learning. In
a goal-directed system consisting of interconnected, cooperative elements, you reinforce those
connections and/or those elements that have been helpful for achieving goals, and weaken those
connections that haven't. Thus, over time, you obtain a network of elements that achieves goals
effectively.
The central difficulty in all reinforcement learning approaches is the 'assignment of credit'
problem. If a component of a system has been directly useful for achieving a goal, then rewarding
it is easy. But if the relevance of a component to a goal is indirect, then things aren't so simple.
Measuring indirect usefulness in a large, richly connected system is difficult - inaccuracies creep
into the process easily.
In CogPrime, reinforcement learning is handled via HebbianLinks, acted on by a combination
of cognitive processes. Earlier, in Chapter 23, we reviewed the semantics of HebbianLinks, and
discussed two methods for forming HebbianLinks:
1. Updating HebbianLink strengths via mining of the System Activity Table
2. Logical inference on HebbianLinks, which may also incorporate the use of inference to
combine HebbianLinks with other logical links (for instance, in the reinforcement learning
context, PredictiveImplicationLinks)
We now describe how HebbianLinks, formed and manipulated in this manner, may play a
key role in goal-driven reinforcement learning. In effect, what we will describe is an implicit
integration of the bucket brigade with PLN inference. The addition of robust probabilistic
inference adds a new kind of depth and precision to the reinforcement learning process.
Goal Nodes have an important ability to stimulate a lot of SchemaNode execution activity.
If a goal needs to be fulfilled, it stimulates schemata that are known to make this happen. But
how is it known which schemata tend to fulfill a given goal? A link:
PredictiveImplicationLink S G
means that after schema S has been executed, goal G tends to be fulfilled. If these links between
goals and goal-valuable schemata exist, then activation spreading from goals can serve the
purpose of causing goal-useful schemata to become active.
The trick, then, is to use HebbianLinks and inference thereon to implicitly guess Predic-
tiveImplicationLinks. A HebbianLink between S1 and S2 says that when thinking about S1 was
useful in the past, thinking about S2 was also often useful. This suggests that if doing S2 achieves
goal G, maybe doing S1 is also a good idea. The system may then try to find (by direct lookup
or reasoning) whether, in the current context, there is a PredictiveImplication joining S1 to S2.
In this way Hebbian reinforcement learning is being used as an inference control mechanism to
aid in the construction of a goal-directed chain of PredictiveImplicationLinks, which may then
be schematized into a contextually useful procedure.
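The Hebbian guidance step just described can be sketched as a candidate-ranking heuristic. The link table, threshold, and names are assumptions; this only illustrates how HebbianLink strengths prioritize which PredictiveImplications to check.

```python
# Sketch: given that schema s2 is known to achieve a goal, use HebbianLink
# strengths to rank candidate predecessors s1 worth checking for a
# PredictiveImplicationLink (s1 -> s2). Data and threshold are assumed.
hebbian = {                     # (s1, s2) -> strength of co-usefulness
    ("whistle", "give_ball"): 0.7,
    ("walk_to_teacher", "give_ball"): 0.9,
    ("sleep", "give_ball"): 0.1,
}

def predecessors_to_try(s2, threshold=0.5):
    cands = [(s1, w) for (s1, t), w in hebbian.items()
             if t == s2 and w >= threshold]
    return [s1 for s1, w in sorted(cands, key=lambda p: -p[1])]

print(predecessors_to_try("give_ball"))
# -> ['walk_to_teacher', 'whistle']
```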
Note finally that this process feeds back into itself in an interesting way, via contributing
to ongoing HebbianLink formation. Along the way, while leading to the on-the-fly construction
of context-appropriate procedures that achieve goals, it also reinforces the HebbianLinks that
hold together schema maps, sculpting new schema maps out of the existing field of interlinked
SchemaNodes.
41.7.1 Goal-Directed Schema Evolution
Finally, as a complement to goal-driven reinforcement learning, there is also a process of goal-
directed SchemaNode learning. This combines features of the goal-driven reinforcement learning
and concept-driven schema evolution methods discussed above. Here we use a Goal Node to
provide the fitness function for schema evolution.
The basic idea is that the fitness of a schema is defined by the degree to which enactment
of that schema causes fulfillment of the goal. This requires the introduction of CausalImpli-
cationLinks, as defined in PLN. In the simplest case, a CausalImplicationLink is simply a
PredictiveImplicationLink.
One relatively simple implementation of the idea is as follows. Suppose we have a Goal
Node G, whose satisfaction we desire to have achieved by time T1. Suppose we want to find a
SchemaNode S whose execution at time T2 will cause G to be achieved. We may define a fitness
function for evaluating candidate S by:
f(S, G, T1, T2) = cp(S) × r(S, G, T1, T2)

r(S, G, T1, T2) =
GetStrength
  CausalImplicationLink
    EvaluationLink
      AtTime
        T1
        ExecutionLink S X Y
    EvaluationLink AtTime (T2, G)
Another variant specifies only a relative time lag, not two absolute times:

f(S, G, T) = cp(S) × r(S, G, T)

r(S, G, T) =
AND
  NonEmpty
    SatisfyingSet r(S, G, T1, T2)
  T1 > T2 − T
Using evolutionary learning or hillclimbing to find schemata fulfilling these fitness functions
results in SchemaNodes whose execution is expected to cause the achievement of given goals.
This is a complementary approach to reinforcement-learning based schema learning, and to
schema learning based on PredicateNode concept creation. The strengths and weaknesses of
these different approaches need to be extensively experimentally explored. However, prior ex-
perience with the learning algorithms involved gives us some guidance.
We know that when absolutely nothing is known about an objective function, evolutionary
programming is often the best way to proceed. Even when there is knowledge about an objective
function, the evolution process can take it into account, because the fitness functions involve
logical links, and the evaluation of these logical links may involve inference operations.
On the other hand, when there's a lot of relevant knowledge embodied in previously executed
procedures, using logical reasoning to guide new procedure creation can be cumbersome, due
to the overwhelming number of potentially useful facts to choose from when carrying out inference.
The Hebbian mechanisms used in reinforcement learning may be understood as inferential in
their conceptual foundations (since a HebbianLink is equivalent to an ImplicationLink between
two propositions about importance levels). But in practice they provide a much-streamlined
approach to bringing knowledge implicit in existing procedures to bear on the creation of new
procedures. Reinforcement learning, we believe, will excel at combining existing procedures
to form new ones, and modifying existing procedures to work well in new contexts. Logical
inference can also help here, acting in cooperation with reinforcement learning. But when the
system has no clue how a certain goal might be fulfilled, evolutionary schema learning provides
a relatively time-efficient way for it to find something minimally workable.
Pragmatically, the GoalDrivenSchemaLearning MindAgent handles this aspect of the sys-
tem's operations. It selects Goal Nodes with probability proportional to importance, and then
spawns problems for the Evolutionary Optimization Unit Group accordingly. For a given Goal
Node, PLN control mechanisms are used to study its properties and select between the above
objective functions to use, on a heuristic basis.
41.8 Occam's Razor
Finally we turn to an important cognitive process that fits only loosely into the category of
"CogPrime Procedure learning" - it's not actually a procedure learning process, but rather a
process that utilizes the fruits of procedure learning.
The well-known "Occam's razor" heuristic says that all else being equal, simpler is better.
This notion is embodied mathematically in the Solomonoff-Levin "universal prior," according
to which the a priori probability of a computational entity X is defined as a normalized version
of:
m(X) = Σ_p 2^(−l(p))
where:
• the sum is taken over all programs p that compute X
• l(p) denotes the length of the program p
Normalization is necessary because these values will not automatically sum to 1 over the space
of all X.
Without normalization, m is a semimeasure rather than a measure; with normalization it
becomes the "Solomonoff-Levin measure" [Lev01].
Roughly speaking, Solomonoff's induction theorem [Sol64a, Sol64b] shows that, if one is
trying to learn the computer program underlying a given set of observed data, and one does
Bayesian inference over the set of all programs to try and obtain the answer, then if one uses
the universal prior distribution one will arrive at the correct answer.
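The prior can be computed concretely for a toy program table. The three hand-made programs and their lengths are assumptions purely for illustration; a real sum over all programs is of course infeasible.

```python
# Toy computation of the normalized Solomonoff-Levin-style prior m(X):
# sum 2^(-l(p)) over the programs p that compute X, then normalize.
from collections import defaultdict

# (name, length l(p) in bits, output computed) -- assumed toy data
programs = [("p1", 3, "X"), ("p2", 5, "X"), ("p3", 4, "Y")]

raw = defaultdict(float)
for _, length, output in programs:
    raw[output] += 2.0 ** (-length)   # m(X) = sum over p computing X of 2^-l(p)

z = sum(raw.values())                 # normalization, turning m into a measure
prior = {x: v / z for x, v in raw.items()}
print(prior)
# shorter programs dominate, so X (computed by a 3-bit program) outweighs Y
```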
CogPrime is not a Solomonoff induction engine. The computational cost of actually applying
Solomonoff induction is unrealistically large. However, as we have seen in this chapter, there are
aspects of CogPrime that are reminiscent of Solomonoff induction. In concept-directed schema
and predicate learning, in pattern-based predicate learning - and in causal schema learning, we
are searching for schemata and predicates that minimize complexity while maximizing some
other quality. These processes all implement the Occam's Razor heuristic in a Solomonoffian
style.
Now we will introduce one more method of imposing the heuristic of algorithmic simplicity
on CogPrime Atoms (and hence, indirectly, on CogPrime maps as well). This is simply to give
a higher a priori probability to entities that are more simply computable.
For starters, we may increase the node probability of ProcedureNodes proportionately to
their simplicity. A reasonable formula here is simply:
2^(−r·c(P))

where c(P) denotes the complexity of the ProcedureNode P and r > 0 is a parameter. This means that infinitely complex
P have a priori probability zero, whereas an infinitely simple P has an a priori probability 1.
This is not an exact implementation of the Solomonoff-Levin measure, but it's a decent
heuristic approximation. It is not pragmatically realistic to sum over the lengths of all programs
that do the same thing as a given predicate P. Generally the first term of the Solomonoff-Levin
summation is going to dominate the sum anyway, so if the ProcedureNode P is maximally
compact, then our simplified formula will be a good approximation of the Solomonoff-Levin
summation. These a priori probabilities may be merged with node probability estimates from
other sources, using the revision rule.
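The a priori probability and its merging can be sketched as follows. The weighted-average merge is only a stand-in for the PLN revision rule, and the parameter values are assumptions.

```python
# Sketch of the a priori ProcedureNode probability 2^(-r*c(P)) and a
# weighted-average stand-in for the revision rule (r, weights assumed).
def apriori(complexity, r=0.1):
    return 2.0 ** (-r * complexity)

def revise(s1, w1, s2, w2):
    """Merge two strength estimates by weight of evidence -- a simple
    stand-in for the PLN revision rule."""
    return (s1 * w1 + s2 * w2) / (w1 + w2)

p_simple, p_complex = apriori(5), apriori(50)
assert p_simple > p_complex             # simpler procedures are preferred a priori
print(revise(p_simple, 0.2, 0.6, 0.8))  # prior merged with an observed strength
```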
A similar strategy may be taken with ConceptNodes. We want to reward a ConceptNode
C with a higher a priori probability if C ∈ SatisfyingSet(P) for a simple PredicateNode P. To
achieve this formulaically, let sim(X, Y) denote the strength of the SimilarityLink between X
and Y, and let:

sim'(C, P) = sim(C, SatisfyingSet(P))
We may then define the a priori probability of a ConceptNode as:
pr(C) = Σ_P sim'(C, P) · 2^(−r·c(P))
where the sum goes over all P in the system. In practice of course it's only necessary to
compute the terms of the sum corresponding to P so that sim'(C, P) is large.
As with the a priori PredicateNode probabilities discussed above, these a priori ConceptN-
ode probabilities may be merged with other node probability information, using the revision
rule, and using a default parameter value for the weight of evidence. There is one pragmatic
difference here from the PredicateNode case, though. As the system learns new PredicateNodes,
its best estimate of pr(C) may change. Thus it makes sense for the system to store the a priori
probabilities of ConceptNodes separately from the node probabilities, so that when the a priori
probability is changed, a two step operation can be carried out:
• First, remove the old a priori probability from the node probability estimate, using the
reverse of the revision rule
• Then, add in the new a priori probability
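The two-step update can be sketched with the same weighted-average stand-in for revision; inverting the merge recovers the non-prior component. All weights and strengths here are assumed values.

```python
# Sketch of the two-step a priori update for ConceptNodes: remove the old
# prior via the reverse of a (weighted-average stand-in) revision rule,
# then revise in the new prior. Weights and strengths are assumptions.
def revise(s_old, w_old, s_new, w_new):
    return (s_old * w_old + s_new * w_new) / (w_old + w_new)

def unrevise(s_merged, w_total, s_prior, w_prior):
    """Invert revise(): recover the component that is not the prior."""
    w_rest = w_total - w_prior
    return (s_merged * w_total - s_prior * w_prior) / w_rest

node_s, node_w = revise(0.5, 0.8, 0.9, 0.2), 1.0  # evidence merged with old prior 0.9
base = unrevise(node_s, node_w, 0.9, 0.2)         # step 1: strip the old prior
updated = revise(base, 0.8, 0.7, 0.2)             # step 2: add the new prior 0.7
print(round(updated, 3))
# -> 0.54
```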
Finally, we can take a similar approach to any Atom Y produced by a SchemaNode. We can
construct:

pr(Y) = Σ_(S,X) s(S, X, Y) · 2^(−r·(c(S)+c(X)))

where the sum goes over all pairs (S, X) so that:

ExecutionLink S X Y

and s(S, X, Y) is the strength of this ExecutionLink. Here, we are rewarding Atoms that are
produced by simple schemata based on simple inputs.
The combined result of these heuristics is to cause the system to prefer simpler explanations,
analysis, procedures and ideas. But of course this is only an a priori preference, and if more
complex entities prove more useful, these will quickly gain greater strength and importance in
the system.
Implementationally, these various processes are carried out by the OccamsRazor MindAgent.
This dynamic selects ConceptNodes based on a combination of:
• importance
• time since the a priori probability was last updated (a long time is preferred)
It selects ExecutionLinks based on importance and based on the amount of time since they
were last visited by the OccamsRazor MindAgent. And it selects PredicateNodes based on
importance, filtering out PredicateNodes it has visited before.
EFTA00624497
Chapter 42
Map Formation
Abstract
42.1 Introduction
In Chapter 20 we distinguished the explicit versus implicit aspects of knowledge representa-
tion in CogPrime. The explicit level consists of Atoms with clearly comprehensible meanings,
whereas the implicit level consists of "maps" - collections of Atoms that become important in a
coordinated manner, analogously to cell assemblies in an attractor neural net. The combination
of the two is valuable because the world-patterns useful to human-like minds in achieving their
goals, involve varying degrees of isolation and interpenetration, and their effective goal-oriented
processing involves both symbolic manipulation (for which explicit representation is most valu-
able) and associative creative manipulation (for which distributed, implicit representation is
most valuable).
The chapters since have focused primarily on explicit representation, commenting on the
implicit "map" level only occasionally. There are two reasons for this: one theoretical, one
pragmatic. The theoretical reason is that the majority of map dynamics and representations
are implicit in Atom-level correlates. And the pragmatic reason is that, at this stage, we simply
do not know as much about CogPrime maps as we do about CogPrime Atoms. Maps are
emergent entities and, lacking a detailed theory of CogPrime dynamics, the only way we have
to study them in detail is to run CogPrime systems and mine their System Activity Tables and
logs for information. If CogPrime research goes well, then updated versions of this book may
include more details on observed map dynamics in various contexts.
In this chapter, however, we finally turn our gaze directly to maps and their relationships
to Atoms, and discuss processes that convert Atoms into maps (expansion) and vice versa
(encapsulation). These processes represent a bridge between the concretely-implemented and
emergent aspects of CogPrime's mind.
Map encapsulation is the process of recognizing Atoms that tend to become important in
a coordinated manner, and then creating new Atoms grouping these. As such it is essentially
a form of AtomSpace pattern mining. In terms of patternist philosophy, map encapsulation
is a direct incarnation of the so-called "cognitive equation"; that is, the process by which the
mind recognizes patterns in itself, and then embodies these patterns as new content within
351
EFTA00624498
352 42 Map Formation
itself - an instance of what Hofstadter famously labeled a "strange loop" [Hof79]. In SMEPH
terms, the encapsulation process is how CogPrime explicitly studies its own derived hypergraph
and then works to implement this derived hypergraph more efficiently by recapitulating it at
the concretely-implemented-mind level. This of course may change the derived hypergraph
considerably. Among other things, map encapsulation has the possibility of taking the things
that were the most abstract, highest level patterns in the system and forming new patterns
involving them and their interrelationships - thus building the highest level of patterns in the
system higher and higher. Figures 42.2 and 42.1 illustrate concrete examples of the process.
[Figure panels: "Atom Table Before Map Encapsulation" and "Atom Table After Map Encapsulation"]
Fig. 42.1: Illustration of the process of creating explicit Atoms corresponding to a pattern
previously represented as a distributed "map."
Map expansion, on the other hand, is the process of taking knowledge that is explicitly
represented and causing the AtomSpace to represent it implicitly, on the map level. In many
cases this will happen automatically. For instance, a ConceptNode C may turn into a concept
map if the importance updating process iteratively acts in such a way as to create/reinforce a
map consisting of C and its relata. Or, an Atom-level InheritanceLink may implicitly spawn a
map-level InheritanceEdge (in SMEPH terms). However, there is one important case in which
Atom-to-map conversion must occur explicitly: the expansion of compound ProcedureNodes
into procedure maps. This must occur explicitly because the process graphs inside ProcedureN-
odes have no dynamics going on except evaluation; there is no opportunity for them to manifest
themselves as maps, unless a MindAgent is introduced that explicitly does so. Of course, just
unfolding a Combo tree into a procedure map doesn't intrinsically make it a significant part of
the derived hypergraph - but it opens the door for the inter-cognitive-process integration that
may make this occur.
[Figure annotations: "Guided by the results of map formation, PLN seeks specific, related logical relationships"; an implication joining a ConceptNode symbolizing a map related to the agent talking and a ConceptNode symbolizing a map related to happy people; "Map formation creates new, abstract ConceptNodes symbolizing the patterns of co-importance it has noted"]
Fig. 42.2: Illustration of the process of creating explicit Atoms corresponding to a pattern
previously represented as a distributed "map."
42.2 Map Encapsulation
Returning to encapsulation: it may be viewed as a form of symbolization, in which the system
creates concrete entities to serve as symbols for its own emergent patterns. It can then study
an emergent pattern's interrelationships by studying the interrelationships of the symbol with
other symbols.
For instance, suppose a system has three derived-hypergraph ConceptVertices A, B and C,
and observes that:
InheritanceEdge A B
InheritanceEdge B C
Then encapsulation may create ConceptNodes A', B' and C' for A, B and C, and Inheri-
tanceLinks corresponding to the InheritanceEdges, where e.g. A' is a set containing all the
Atoms contained in the static map A. First-order PLN inference will then immediately con-
clude:
InheritanceLink A' C'
and it may possibly do so with a higher strength than the strength corresponding to the (per-
haps not significant) InheritanceEdge between A and C. But if the encapsulation is done right
then the existence of the new InheritanceLink will indirectly cause the formation of the corre-
sponding:
InheritanceEdge A C
via the further action of inference, which will use (InheritanceLink A' C') to trigger the inference
of further inheritance relationships between members of A' and members of C', which will create
an emergent inheritance between members of A (the map corresponding to A') and C (the map
corresponding to C').
The above example involved the conversion of static maps into ConceptNodes. Another
approach to map encapsulation is to represent the fact that a set of Atoms constitutes a map
as a predicate; for instance if the nodes A, B and C are habitually used together, then the
predicate P may be formed, where:
P
AND
A is used at time T
B is used at time T
C is used at time T
The habitualness of A, B and C being used together will be reflected in the fact that P has
a surprisingly high truth value. By a simple concept formation heuristic, this may be used to
form a link AND(A, B, C), so that:
AND(A, B, C) is used at time T
This composite link AND(A, B, C) is then an embodiment of the map in single-Atom form.
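This heuristic can be sketched concretely. In the toy code below (all names are illustrative; the real system works over Atom handles and truth values), each Atom's usage times are recorded, and a triple is encapsulated into an AND link when its co-usage frequency is surprisingly high relative to what independence would predict:

```python
from itertools import combinations

def co_usage_surprise(usage, atoms, n_slices):
    """Ratio of observed co-usage frequency to the frequency expected
    if the atoms were used independently of one another."""
    joint = set.intersection(*(usage[a] for a in atoms))
    p_joint = len(joint) / n_slices
    p_indep = 1.0
    for a in atoms:
        p_indep *= len(usage[a]) / n_slices
    return p_joint / p_indep if p_indep > 0 else 0.0

def encapsulate_maps(usage, n_slices, threshold):
    """Form an AND link for every atom triple whose co-usage is
    'surprisingly' frequent relative to independence."""
    found = []
    for trio in combinations(sorted(usage), 3):
        if co_usage_surprise(usage, trio, n_slices) >= threshold:
            found.append(("AND",) + trio)
    return found
```

Here the "surprisingly high truth value" of P is approximated by the ratio of joint to independent frequency; the production system would of course use its probabilistic truth value machinery instead.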
Similarly, if a set of schemata is commonly used in a certain series, this may be recognized in
a predicate, and a composite schema may then be created embodying the component schemata.
For instance, suppose it is recognized as a pattern that:
AND
S1 is used at time T on input I1 producing output O1
S2 is used at time T+s on input O1 producing output O2
Then we may explicitly create a schema that consists of S1 taking input and feeding its output
to S2. This cannot be done via any standard concept formation heuristic; it requires a special
process.
One might wonder why this Atom-to-map conversion process is necessary: Why not just
let maps combine to build new maps, hierarchically, rather than artificially transforming some
maps into Atoms and letting maps then form from these map-representing Atoms? It is all a
matter of precision. Operations on the map level are fuzzier and less reliable than operations
on the Atom level. This fuzziness has its positive and its negative aspects. For example, it is
good for spontaneous creativity, but bad for constructing lengthy, confident chains of thought.
42.3 Atom and Predicate Activity Tables
A major role in map formation is played by a collection of special tables. Map encapsulation
takes place, not by data mining directly on the AtomTable, but by data mining on these special
tables constructed from the AtomTable, specifically with efficiency of map mining in mind.
First, there is the Atom Utilization Table, which may be derived from the SystemActivi-
tyTable. The Atom Utilization Table, in its most simple possible version, takes the form shown
in Table 42.1.
Time | Atom Handle H
...  | ...
T    | (effort spent on Atom H at time T, utility derived from Atom H at time T)
...  | ...
Table 42.1: Atom Utilization Table
The calculation of "utility" values for this purpose must be done in a "local" way by MindAgents,
rather than by a global calculation of the degree to which utilizing a certain Atom has led to
the achievement of a certain system goal (this kind of global calculation would be better in
principle, but it would require massive computational effort to calculate for every Atom in
the system at frequent intervals). Each MindAgent needs to estimate how much utility it has
obtained from a given Atom, as well as how much effort it has spent on this Atom, and report
these numbers to the Atom Utilization Table.
The normalization of effort values is simple, since effort can be quantified in terms of time and
space expended. Normalization of utility values is harder, as it is difficult to define a common
scale to span all the different MindAgents, which in some cases carry out very different sorts
of operations. One reasonably "objective" approach is to assign each MindAgent an amount of
"utility credit", at time T, equal to the amount of currency that the MindAgent has spent since
it last disbursed its utility credits. It may then divide up its utility credit among the Atoms it
has utilized. Other reasonable approaches may also be defined.
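A minimal sketch of this utility-credit scheme (class and method names are illustrative, not CogPrime's actual API): the MindAgent accumulates the currency it spends and the raw utility it attributes to each Atom, then at disbursal time splits the spent currency among the Atoms in proportion to their raw utility reports.

```python
class MindAgentCredit:
    """Tracks currency spent by a MindAgent and divides the resulting
    'utility credit' among the Atoms it has utilized."""

    def __init__(self):
        self.spent = 0.0
        self.raw_utility = {}  # atom handle -> unnormalized utility estimate

    def record(self, cost, atom_handle, utility):
        """Report effort spent on, and raw utility obtained from, an Atom."""
        self.spent += cost
        self.raw_utility[atom_handle] = self.raw_utility.get(atom_handle, 0.0) + utility

    def disburse(self):
        """Split the currency spent since the last disbursal among the
        utilized Atoms, proportionally to the raw utility reports."""
        total = sum(self.raw_utility.values())
        credits = ({a: self.spent * u / total for a, u in self.raw_utility.items()}
                   if total > 0 else {})
        self.spent, self.raw_utility = 0.0, {}
        return credits
```

Because the disbursed credits always sum to the currency spent, utility values from very different MindAgents land on a common scale.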
The use of utility and utility credit for Atoms and MindAgents is similar to the stimulus
used in the Attention allocation system. There, MindAgents reward Atoms with stimulus to
indicate that their short and long term importance should be increased. Merging utility and
stimulus is a natural approach to implementing utility in OpenCogPrime.
Note that there are many practical manifestations that the abstract notion of an Activi-
tyTable may take. It could be an ordinary row-and-column style table, but that is not the only
nor the most interesting possibility. An ActivityTable may also be effectively stored as a series
of graphs corresponding to time intervals - one graph for each interval, consisting of Hebbian-
Links formed solely based on importance during that interval. In this case it is basically a set
of graphs, which may be stored for instance in an AtomTable, perhaps with a special index.
Then there is the Procedure Activity Table, which records the inputs and outputs associated
with procedures:
Time | ProcedureNode Handle H
...  | ...
T    | (Inputs to H, Outputs from H)
...  | ...
Table 42.2: Procedure Activity Table for a Particular MindAgent
Data mining on these tables may be carried out by a variety of algorithms (see MapMining)
- the more advanced the algorithm, the fuller the transfer from the derived-hypergraph level
to the concretely-implemented level. There is a tradeoff here similar to that with attention
allocation - if too much time is spent studying the derived hypergraph, then there will not be
any interesting cognitive dynamics going on anymore because other cognitive processes get no
resources, so the map encapsulation process will fail because there is nothing to study!
These same tables may be used in the attention allocation process, for assigning of MindAgent-
specific AttentionValues to Atoms.
42.4 Mining the AtomSpace for Maps
Searching for general maps in a complex AtomSpace is an unrealistically difficult problem, as
the search space is huge. So, the bulk of map-mining activity involves looking for the most
simple and obvious sorts of maps. A certain amount of resources may also be allocated to
looking for subtler maps using more resource-intensive methods.
The following categories of maps can be searched for at relatively low cost:
• Static maps
• Temporal motif maps
Conceptually, a static map is simply a set of Atoms that all tend to be active at the same
time.
Next, by a "temporal motif map" we mean a set of pairs:
(Ai, ti)
of the type:
(Atom, int)
so that for many activation cycle indices T, Ai is highly active at some time very close to index
T + ti. The reason both static maps and temporal motif maps are easy to recognize is that they
are both simply repeated patterns.
Perceptual context formation involves a special case of static and temporal motif mining. In
perceptual context formation, one specifically wishes to mine maps involving perceptual nodes
associated with a single interaction channel (see Chapter 26 for interaction channel). These
maps then represent real-world contexts, that may be useful in guiding real-world-oriented goal
activity (via schema-context-goal triads).
In CogPrime so far we have considered three broad approaches for mining static and temporal
motif maps from AtomSpaces:
• Frequent subgraph mining, frequent itemset mining, or other sorts of datamining on Activity
Tables
• Clustering on the network of HebbianLinks
• Evolutionary Optimization based datamining on Activity Tables
The first two approaches are significantly more time-efficient than the latter, but also signifi-
cantly more limited in the scope of patterns they can find.
Any of these approaches can be used to look for maps subject to several types of constraints,
such as for instance:
• Unconstrained: maps may contain any kinds of Atoms
• Strictly constrained: maps may only contain Atom types contained on a certain list
• Probabilistically constrained: maps must contain Atom types contained on a certain
list, as x% of their elements
• Trigger-constrained: the map must contain an Atom whose type is on a certain list, as
its most active element
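These constraint regimes amount to simple predicates over a candidate map. A minimal sketch, with Atom types given as plain strings and all function names illustrative:

```python
def strictly_constrained(atom_types, allowed):
    """Every Atom in the candidate map has a type on the allowed list."""
    return all(t in allowed for t in atom_types)

def probabilistically_constrained(atom_types, allowed, x_percent):
    """At least x% of the map's elements have a type on the allowed list."""
    share = sum(t in allowed for t in atom_types) / len(atom_types)
    return share >= x_percent / 100.0

def trigger_constrained(types_by_activity, trigger_types):
    """The map's most active element (first in activity-sorted order)
    has a type on the trigger list."""
    return types_by_activity[0] in trigger_types
```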
Different sorts of constraints will lead to different sorts of maps, of course. We don't know at
this stage which sorts of constraints will yield the best results. Some special cases, however, are
reasonably well understood. For instance:
• procedure encapsulation, to be discussed below, involves searching for (strictly-constrained)
maps consisting solely of ProcedureInstanceNodes.
• to enhance goal achievement, it is likely useful to search for trigger-constrained maps trig-
gered by Goal Nodes.
What the MapEncapsulation CIM-Dynamic (Concretely-Implemented-Mind-Dynamic, see
Chapter 19) does once it finds a map, is dependent upon the type of map it's found. In the
special case of procedure encapsulation, it creates a compound ProcedureNode (selecting Sche-
maNode or PredicateNode based on whether the output is a TruthValue or not). For static
maps, it creates a ConceptNode, which links to all members of the map with MemberLinks, the
weight of which is determined by the degree of map membership. For dynamic maps, it creates
Predictivelmplication links depicting the pattern of change.
42.4.1 Frequent Itemset Mining for Map Mining
One class of technique that is useful here is frequent itemset mining (FIM), a process that looks
to find all frequent combinations of items occurring in a set of data. Another useful class of
algorithms is greedy or stochastic itemset mining, which does roughly the same thing as FIM
but without being completely exhaustive (the advantage being greater execution speed). Here
we will discuss FIM, but the basic concepts are the same if one is doing greedy or stochastic
mining instead.
The basic goal of frequent itemset mining is to discover frequent subsets in a group of
items. One knows that for a set of N items, there are 2^N - 1 possible non-empty subsets. To avoid the
exponential explosion of subsets, one may compute the frequent itemsets in several rounds.
Round i computes all frequent i-itemsets.
A round has two steps: candidate generation and candidate counting. In the candidate gen-
eration step, the algorithm generates a set of candidate i-itemsets whose support - the
percentage of events in which the itemset appears - has not yet been computed. In the
candidate-counting step, the algorithm scans its memory database, counting the support of the
candidate itemsets. After the scan, the algorithm discards candidates with support lower than
the specified minimum (an algorithm parameter) and retains only the frequent i-itemsets. The
algorithm reduces the number of tested subsets by pruning a priori those candidate itemsets
that cannot be frequent, based on the knowledge about infrequent itemsets obtained from pre-
vious rounds. So for instance if {A, B} is a frequent 2-itemset then {A, B, C} may possibly
be a frequent 3-itemset, whereas if {A, B} is not a frequent itemset then {A, B, C}, as well as
any superset of {A, B}, will be discarded. Although the worst case of this sort of algorithm is
exponential, practical executions are generally fast, depending essentially on the support limit.
To apply this kind of approach to search for static maps, one simply creates a large set of
sets of Atoms - one set for each time-point. In the set S(t) corresponding to time t, we place all
Atoms that were firing activation at time t. The itemset miner then searches for sets of Atoms
that are subsets of many different S(t) corresponding to many different times t. These are Atom
sets that are frequently co-active.
Table ?? presents a typical example of data prepared for frequent itemset mining, in the
context of context formation via static-map recognition. Columns represent important nodes
and rows indicate time slices. For simplicity, we have thresholded the values and show only
binary activity values, so that a 1 in a cell indicates that the Atom indicated by the column was
being utilized at the time indicated by the row.
In the example, if we assume minimum support as 50 percent, the context nodes C1 = {Q,
R}, and C2 = {Q, T, U} would be created.
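The rounds described above can be sketched concretely. The following toy Apriori-style miner (a simplified sketch, not CogPrime's production code) takes the per-time-slice sets S(t) as transactions; the sample rows in the test are chosen so that the frequent itemsets mirror the two context-node examples above.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Round k computes all frequent k-itemsets: count candidate support
    by scanning the transactions, discard infrequent candidates, then
    generate (k+1)-candidates all of whose k-subsets are frequent."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    frequent = {}  # frozenset -> support
    level = {frozenset([x]) for t in transactions for x in t}
    k = 1
    while level:
        # candidate counting: one scan of the database per round
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        level = {c for c, cnt in counts.items() if cnt / n >= min_support}
        for c in level:
            frequent[c] = counts[c] / n
        # candidate generation with a priori pruning
        k += 1
        level = {a | b for a in level for b in level
                 if len(a | b) == k
                 and all(frozenset(s) in frequent
                         for s in combinations(a | b, k - 1))}
    return frequent
```

With minimum support 0.5, any candidate containing an infrequent pair such as {R, T} is pruned before it is ever counted, which is where the practical speed of the method comes from.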
Using frequent itemset mining to find temporal motif maps is a similar, but slightly more
complex process. Here, one fixes a time-window W. Then, for each activation cycle index t, one
creates a set S(t) consisting of pairs of the form:
(A, s)
where A is an Atom and 0 ≤ s ≤ W is an integer temporal offset. We have:
(A, s) "within" S(t)
if Atom A is firing activation at time t+s. Itemset mining is then used to search for common
subsets among the S(t). These common subsets are common patterns of temporal activation,
i.e. repeated temporal motifs.
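Constructing the transactions S(t) for motif mining can be sketched as follows (a simplified model in which "firing activation" is given as a time-indexed dictionary; names are illustrative). Feeding the resulting sets to the same itemset miner then surfaces repeated temporal motifs.

```python
def motif_transactions(firing, window):
    """firing: dict mapping time index -> set of Atoms firing activation
    at that time. Returns, for each index t, the set S(t) of
    (atom, offset) pairs with 0 <= offset <= window."""
    return {t: {(atom, off)
                for off in range(window + 1)
                for atom in firing.get(t + off, ())}
            for t in firing}
```

In the test below, the motif "A fires, then B fires one step later" recurs at t = 0 and t = 5, so {(A, 0), (B, 1)} shows up as a frequent itemset across the S(t).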
The strength of this approach is its ability to rapidly search through a huge space of possibly
significant subsets. Its weakness is its restriction to finding maps that can be incrementally built
up from smaller maps. How significant this weakness is, depends on the particular statistics of
map occurrence in CogPrime. Intuitively, we believe frequent itemset mining can perform rather
well in this context, and our preliminary experiments have supported this intuition.
Frequent Subgraph Mining for Map Mining
A limitation of FIM techniques, from a CogPrime perspective, is that they are intended for
relational databases (RDBs); but the information about co-activity in a CogPrime instance is
generally going to be more efficiently stored as graphs rather than RDB's. Indeed an Activi-
tyTable may be effectively stored as a series of graphs corresponding to time intervals - one
graph for each interval, consisting of HebbianLinks formed solely based On importance during
that interval. From ActivityTable stores like this, the way to find maps is not frequent itemset
mining but rather frequent subgraph mining - a variant of FIM that is conceptually similar
but algorithmically more subtle, and on which there has arisen a significant literature in re-
cent years. We have already briefly discussed this technology in Chapter 37 on pattern mining
the Atomspace - map mining being an important special case of Atomspace pattern mining.
As noted there, some of the many approaches to frequent subgraph mining are described in
[HWP03, KK01].
42.4.2 Evolutionary Map Detection
Just as general Atomspace pattern mining may be done via evolutionary learning as well as
greedy mining, the same holds for the special case of map mining. Complementary to the itemset
mining approach, the CogPrime design also uses evolutionary optimization to find maps. Here
the data setup is the same as in the itemset mining case, but instead of using an incremental
search approach, one sets up a population of subsets of the sets S(t), and seeks to evolve the
population to find an optimally fit S(t). Fitness is defined simply as high frequency - relative
to the frequency one would expect based on statistical independence assumptions alone.
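Such a fitness function might be sketched as follows (the evolutionary machinery itself - population, variation, selection - is omitted, and the names are illustrative):

```python
def motif_fitness(candidate, transactions):
    """Fitness of a candidate Atom set: its observed frequency divided by
    the frequency expected under statistical independence of its elements."""
    n = len(transactions)
    observed = sum(1 for t in transactions if candidate <= t) / n
    expected = 1.0
    for item in candidate:
        expected *= sum(1 for t in transactions if item in t) / n
    return observed / expected if expected > 0 else 0.0
```

A fitness of 1.0 means the set co-occurs exactly as often as chance predicts; values well above 1.0 mark candidate maps worth keeping in the population.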
In principle one could use evolutionary learning to do all map encapsulation, but this isn't
computationally feasible - it would limit too severely the amount of map encapsulation that
could be done. Instead, evolutionary learning must be supplemented by some more rapid, less
expensive technique.
42.5 Map Dynamics
Assume one has a collection of Atoms, with:
• Importance values I(A), assigned via the economic attention allocation mechanism.
• HebbianLink strengths (HebbianLink A B).tv.s, assigned as (loosely speaking) the proba-
bility of B's importance assuming A's importance.
Then, one way to search for static maps is to look for collections C of Atoms that are strong
clusters according to HebbianLinks. That is, for instance, to find collections C so that:
• The mean strength of (HebbianLink A B).tv.s, where A and B are in the collection C, is
large.
• The mean strength of (HebbianLink A Z).tv.s, where A is in the collection C and Z is not,
is small.
(this is just a very simple cluster quality measurement; there is a variety of other cluster quality
measurements one might use instead.)
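This simple cluster quality measurement might be sketched as follows (names illustrative; HebbianLink strengths are given as a plain dictionary over ordered Atom pairs):

```python
def cluster_quality(cluster, strengths):
    """strengths: dict mapping (A, B) pairs to HebbianLink strength.
    Quality = mean intra-cluster strength minus mean strength of links
    crossing the cluster boundary."""
    intra, boundary = [], []
    for (a, b), s in strengths.items():
        if a in cluster and b in cluster:
            intra.append(s)
        elif (a in cluster) != (b in cluster):  # exactly one endpoint inside
            boundary.append(s)
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(intra) - mean(boundary)
```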
Dynamic maps may be more complex, for instance there might be two collections C1 and C2
so that:
• Mean strength of (HebbianLink A B).s, where A is in C1 and B is in C2
• Mean strength of (HebbianLink B A).s, where B is in C2 and A is in C1
are both very large.
A static map will tend to be an attractor for CogPrime's attention-allocation-based dynamics,
in the sense that when a few elements of the map are acted upon, it is likely that other elements
of the map will soon also come to be acted upon. The reason is that, if a few elements of the map
are acted upon usefully, then their importance values will increase. Node probability inference
based on the HebbianLinks will then cause the importance values of the other nodes in the
map to increase, thus increasing the probability that the other nodes in the map are acted
upon. Critical here is that the HebbianLinks have a higher weight of evidence than the node
importance values. This is because the node importance values are assumed to be ephemeral
- they reflect whether a given node is important at a given moment or not - whereas the
HebbianLinks are assumed to reflect longer-lasting information.
A dynamic map will also be an attractor, but of a more complex kind. The example given
above, with C1 and C2, will be a periodic attractor rather than a fixed-point attractor.
42.6 Procedure Encapsulation and Expansion
One of the most important special cases of map encapsulation is procedure encapsulation.
This refers to the process of taking a schema/predicate map and embodying it in a single
ProcedureNode. This may be done by mining on the Procedure Activity Table, described in
Activity Tables, using either:
• a special variant of itemset mining that seeks for procedures whose outputs serve as inputs
for other procedures.
• Evolutionary optimization with a fitness function that restricts attention to sets of pro-
cedures that form a digraph, where the procedures lie at the vertices and an arrow from
vertex A to vertex B indicates that the outputs of A become the inputs of B.
The reverse of this process, procedure expansion, is also interesting, though algorithmically
easier - here one takes a compound ProcedureNode and expands its internals into a collection
of appropriately interlinked ProcedureNodes. The challenge here is to figure out where to split
a complex Combo tree into subtrees. But if the Combo tree has a hierarchical structure then
this is very simple; the hierarchical subunits may simply be split into separate ProcedureNodes.
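The hierarchical splitting step can be sketched as follows, with a Combo tree modeled as nested tuples (op, child, ...) - a deliberately simplified, hypothetical representation of the actual Combo format:

```python
def subtree_size(tree):
    """Number of nodes in a nested-tuple tree; a leaf counts as 1."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(subtree_size(c) for c in tree[1:])

def expand_procedure(tree, min_size, registry):
    """Replace each sufficiently large subtree with a reference to a new
    sub-procedure stored in `registry`, returning the rewritten tree."""
    if not isinstance(tree, tuple):
        return tree
    op, children = tree[0], tree[1:]
    new_children = []
    for child in children:
        child = expand_procedure(child, min_size, registry)
        if isinstance(child, tuple) and subtree_size(child) >= min_size:
            name = "P%d" % len(registry)  # fresh sub-procedure handle
            registry[name] = child
            child = name
        new_children.append(child)
    return (op,) + tuple(new_children)
```

Each hierarchical subunit above the size threshold becomes its own named sub-procedure, and the parent tree simply refers to it, mirroring the split into separate ProcedureNodes.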
These two processes may be used in sequence to interesting effect: expanding an important
compound ProcedureNode so it can be modified via reinforcement learning, then encapsulating
its modified version for efficient execution, then perhaps expanding this modified version later
on.
To an extent, the existence of these two different representations of procedures is an artifact
of CogPrime's particular software design (and ultimately, a reflection of certain properties of the
von Neumann computing architecture). But it also represents a more fundamental dichotomy,
between:
• Procedures represented in a way that allows them to be dynamically, improvisationally
restructured via interaction with other mental processes during the execution process.
• Procedures represented in a way that is relatively encapsulated and mechanical, allowing
collaboration with other aspects of the mind during execution only in fairly limited ways
Conceptually, we believe that this is a very useful distinction for a mind to make. In nearly
any reasonable cognitive architecture, it's going to be more efficient to execute a procedure
if that procedure is treated as something with a relatively rigid structure, so it can simply
be executed without worrying about interactions except in a few specific regards. This is a
strong motivation for an artificial cognitive system to have a dual (at least) representation of
procedures, or else a subtle representation that is flexible regarding its degree of flexibility, and
automagically translates constraint into efficiency.
42.6.1 Procedure Encapsulation in More Detail
A procedure map is a temporal motif: it is a set of Atoms (ProcedureNodes), which are habit-
ually executed in a particular temporal order, and which implicitly pass arguments amongst
each other. For instance, if procedure A acts to create node X, and procedure B then takes
node X as input, then we may say that A has implicitly passed an argument to B.
The encapsulation process can recognize some very subtle patterns, but a fair fraction of its
activity can be understood in terms of some simple heuristics.
For instance, the map encapsulation process will create a node
h = B f g = f ∘ g = f composed with g
(B as in combinatory logic) when there are many examples in the system of:
ExecutionLink g x y
ExecutionLink f y z
The procedure encapsulation process will also recognize larger repeated subgraphs, and their
patterns of execution over time. But some of its recognition of larger subgraphs may be done
incrementally, by repeated recognition of simple patterns like the ones just described.
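The simple heuristic just described might be sketched as follows, with ExecutionLinks given as (procedure, input, output) triples (an illustrative simplification of the actual Atomspace representation):

```python
from collections import Counter

def find_compositions(execution_links, min_count=2):
    """execution_links: list of (procedure, input, output) triples, one per
    recorded execution. Returns (f, g) pairs where g's output was fed to f
    at least min_count times - candidates for an encapsulated h = B f g."""
    pairs = Counter()
    for g, _x, y in execution_links:          # ExecutionLink g x y
        for f, y2, _z in execution_links:     # ExecutionLink f y z
            if y2 == y and f != g:
                pairs[(f, g)] += 1
    return [p for p, c in pairs.items() if c >= min_count]
```

Repeatedly applying this pairwise recognition, with each discovered composition treated as a new procedure, is one way the larger repeated subgraphs can be built up incrementally.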
42.6.2 Procedure Encapsulation in the Human Brain
Finally, we briefly discuss some conceptual issues regarding the relation between CogPrime
procedure encapsulation and the human brain. Current knowledge of the human brain is weak
in this regard, but we won't be surprised if, in time, it is revealed that the brain stores procedures
in several different ways, that one distinction between these different ways has to do with degree
of openness to interactions, and that the less open ways lead to faster execution.
Generally speaking, there is good evidence for a neural distinction between procedural,
episodic and declarative memory. But knowledge about distinctions between different kinds
of procedural memory, is scanter. It is known that procedural knowledge can be "routinized"
- so that, e.g., once you get good at serving a tennis ball or solving a quadratic equation,
your brain handles the process in a different way than before when you were learning. And it
seems plausible that routinized knowledge, as represented in the brain, has fewer connections
back to the rest of the brain than the pre-routinized knowledge. But there will be much firmer
knowledge about such things in the coming years and decades as brain scanning technology
advances.
Overall, there is more knowledge in cognitive and neural science about motor procedures than
cognitive procedures (see e.g. [SW05]). In the brain, much of motor procedural memory resides
in the pre-motor area of the cortex. The motor plans stored here are not static entities and are
easily modified through feedback, and through interaction with other brain regions. Generally,
a motor plan will be stored in a distributed way across a significant percentage of the premotor
cortex; and a complex or multipart action will tend to involve numerous sub-plans, executed
both in parallel and in serial. Often what we think of as separate/distinct motor-plans may
in fact be just slightly different combinations of subplans (a phenomenon also occurring with
schema maps in CogPrime ).
In the case of motor plans, a great deal of the routinization process has to do with learning the
timing necessary for correct coordination between muscles and motor subplans. This involves
integration of several brain regions - for instance, timing is handled by the cerebellum to a
degree, and some motor-execution decisions are regulated by the basal ganglia.
One can think of many motor plans as involving abstract and concrete sub-plans. The ab-
stract sub-plans are more likely to involve integration with those parts of the cortex dealing
with conceptual thought. The concrete sub-plans have highly optimized timings, based on close
integration with cerebellum, basal ganglia and so forth - but are not closely integrated with
the conceptualization-focused parts of the brain. So, a rough CogPrime model of human motor
procedures might involve schema maps coordinating the abstract aspects of motor procedures,
triggering activity of complex SchemaNodes containing precisely optimized procedures that
interact carefully with external actuators.
42.7 Maps and Focused Attention
The cause of map formation is important to understand. Formation of small maps seems to
follow from the logic of focused attention, along with hierarchical maps of a certain nature. But
the argument for this is somewhat subtle, involving cognitive synergy between PLN inference
and economic attention allocation.
The nature of PLN is that the effectiveness of reasoning is maximized by (among other
strategies) minimizing the number of incorrect independence assumptions. If reasoning on N
nodes, the way to minimize independence assumptions is to use the full inclusion-exclusion
formula to calculate interdependencies between the N nodes. This involves 2N terms, one for
each subset of the N nodes. Very rarely, in practical cases, will one have significant information
about all these subsets. However, the nature of focused attention is that the system seeks to
find out about as many of these subsets as possible, so as to be able to make the most accurate
possible inferences, hence minimizing the use of unjustified independence assumptions. This
implies that focused attention cannot hold too many items within it at one time, because if N
is too big, then doing a decent sampling of the subsets of the N items is no longer realistic.
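A small calculation illustrates why the scope must stay small: the full inclusion-exclusion formula for the union of N events sums one term per non-empty subset of the events, 2^N - 1 terms in all, so the term count doubles with each additional item held in the focus. A sketch:

```python
from itertools import combinations

def union_probability(event_sets, universe_size):
    """P(E1 ∪ ... ∪ EN) via full inclusion-exclusion: one term per
    non-empty subset of the N events - 2**N - 1 terms in all."""
    n = len(event_sets)
    total, terms = 0.0, 0
    for k in range(1, n + 1):
        for subset in combinations(event_sets, k):
            intersection = set.intersection(*subset)
            total += (-1) ** (k + 1) * len(intersection) / universe_size
            terms += 1
    return total, terms
```

For N = 3 the sum has 7 terms; for N = 20 it would already have over a million, which is why a focus of roughly ten items is about the most that can be sampled decently.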
So, suppose that N items have been held within focused attention, meaning that a lot of
predicates embodying combinations of N items have been constructed and evaluated and rea-
soned on. Then, during this extensive process of attentional focus, many of the N items will be
useful in combination with each other - because of the existence of predicates joining the items.
Hence, many HebbianLinks will grow between the N items - causing the set of N items to form
a map.
By this reasoning, it seems that focused attention will implicitly be a map formation process
- even though its immediate purpose is not map formation, but rather accurate inference (infer-
ence that minimizes independence assumptions by computing as many cross terms as is possible
based on available direct and indirect evidence). Furthermore, it will encourage the formation
of maps with a small number of elements in them (say, N<10). However, these elements may
themselves be ConceptNodes grouping other nodes together, perhaps grouping together nodes
that are involved in maps. In this way, one may see the formation of hierarchical maps, formed
of clusters of clusters of clusters..., where each cluster has N<10 elements in it. These hierar-
chical maps manifest the abstract dual network concept that occurs frequently in CogPrime
philosophy.
It is tempting to postulate that any intelligent system must display similar properties - so
that focused attention, in general, has a strictly limited scope and causes the formation of
maps that have central cores of roughly the same size as its scope. If this is indeed a general
principle, it is an important one, because it tells you something about the general structure of
derived hypergraphs associated with intelligent systems, based on the computational resource
constraints of the systems.
The scope of an intelligent system's attentional focus would seem to generally increase log-
arithmically with the system's computational power. This follows immediately if one assumes
that attentional focus involves free intercombination of the items within it. If attentional focus
is the major locus of map formation, then - lapsing into SMEPH-speak - it follows that the
bulk of the ConceptVertices in the intelligent system's derived hypergraphs may correspond
to maps focused on a fairly small number of other ConceptVertices. In other words, derived
hypergraphs may tend to have a fairly localized structure, in which each ConceptVertex has
very strong InheritanceEdges pointing from a handful of other ConceptVertices (corresponding
to the other things that were in the attentional focus when that ConceptVertex was formed).
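The logarithmic-scaling argument can be made concrete with a toy calculation. If attentional focus involves freely intercombining the items within it, the number of candidate combinations grows as 2^N, so the affordable focus size grows only as the logarithm of the available processing budget. A minimal sketch; the notion of a "combination budget" is an invented simplification, not part of the CogPrime design:

```python
def max_focus_size(combination_budget: int) -> int:
    """Largest N such that all 2**N subsets of N attended items
    could be examined within the given budget of evaluations."""
    n = 0
    while 2 ** (n + 1) <= combination_budget:
        n += 1
    return n

# Doubling the budget adds only one item to the affordable focus,
# consistent with a small, slowly-growing attentional scope:
sizes = [max_focus_size(2 ** k) for k in (7, 10, 20)]  # -> [7, 10, 20]
```

Under this assumption, even a millionfold increase in computational power widens the attentional focus by only about twenty items.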
42.8 Recognizing and Creating Self-Referential Structures
Finally, this brief section covers a large and essential topic: how CogPrime will be able to
recognize and create large-scale self-referential structures.
Some of the most essential structures underlying human-level intelligence are self-referential
in nature. These include:
• the phenomenal self (see Thomas Metzinger's book "Being No One")
• the will
• reflective awareness
These structures are arguably not critical for basic survival functionality in natural environ-
ments. However, they are important for adequate functionality within advanced social systems,
and for abstract thinking regarding science, humanities, arts and technology.
Recall that in Chapter 3 of Part 1 these entities are formalized in terms of hypersets and
the following recursive definitions are given:
• "S is conscious of X" is defined as: The declarative content that "S is conscious of X"
correlates with "X is a pattern in S"
• "S wills X" is defined as: The declarative content that "S wills X" causally implies "S does
X"
• "X is part of S's self" is defined as: The declarative content that "X is a part of S's self"
correlates with "X is a persistent pattern in S over time"
Relatedly, one may posit multiple similar processes that are mutually recursive, e.g.
• S is conscious of T and U
• T is conscious of S and U
• U is conscious of S and T
The cognitive importance of this sort of mutual recursion is further discussed in Appendix ??.
According to the philosophy underlying CogPrime, none of these are things that should be
programmed into an artificial mind. Rather, they must emerge in the course of a mind's self-
organization in connection with its environment. However, a mind may be constructed so that,
by design, these sorts of important self-referential structures are encouraged to emerge.
42.8.1 Encouraging the Recognition of Self-Referential Structures in
the AtomSpace
How can we do this - encourage a CogPrime instance to recognize complex self-referential
structures that may exist in its AtomTable? This is important, because, according to the same
logic as map formation: if these structures are explicitly recognized when they exist, they can
then be reasoned on and otherwise further refined, which will then cause them to exist more
definitively, and hence to be explicitly recognized as yet more prominent patterns ... etc. The
same virtuous cycle via which ongoing map recognition and encapsulation is supposed to lead
to concept formation, may be posited on the level of complex self-referential structures, leading
to their refinement, development and ongoing complexity.
One really simple way is to encode self-referential operators in the Combo vocabulary, that
is used to represent the procedures grounding GroundedPredicateNodes.
That way, one can recognize self-referential patterns in the AtomTable via standard Cog-
Prime methods like MOSES and integrative procedure and predicate learning as discussed in
Chapter 41, so long as one uses Combo trees that are allowed to include self-referential operators
at their nodes. All that matters is that one is able to take one of these Combo trees, compare
it to an AtomTable, and assess the degree to which that Combo tree constitutes a pattern in
that AtomTable.
But how can we do this? How can we match a self-referential structure like:
EquivalenceLink
  EvaluationLink will (S,X)
  CausalImplicationLink
    EvaluationLink will (S,X)
    EvaluationLink do (S,X)
against an AtomTable or portion thereof?
The question is whether there is some "map" of Atoms (some set of PredicateNodes) willMap,
so that we may infer the SMEPH (see Chapter 14) relationship:
EquivalenceEdge
  EvaluationEdge willMap (S,X)
  CausalImplicationEdge
    EvaluationEdge willMap (S,X)
    EvaluationEdge doMap (S,X)
as a statistical pattern in the AtomTable's history over the recent past. (Here, doMap is defined
to be the map corresponding to the built-in "do" predicate.)
If so, then this map willMap, may be encapsulated in a single new Node (call it willNode),
which represents the system's will. This willNode may then be explicitly reasoned upon, used
within concept creation, etc. It will lead to the spontaneous formation of a more sophisticated,
fully-fleshed-out will map. And so forth.
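To illustrate what recognizing willMap as a "statistical pattern in the AtomTable's history" might amount to, one can check, over a log of time-stamped events, whether will(S,X) is followed by do(S,X) much more often than do(S,X) occurs at baseline. This is only an illustrative stand-in for the actual SMEPH/PLN computation; the event-log format and predicate names are invented for the example:

```python
def will_do_evidence(events):
    """events: list of (t, predicate, args) tuples over discrete time.
    Returns (P(do at t+1 | will at t), baseline P(do per time-step)),
    rough evidence for a willMap -> doMap CausalImplication pattern."""
    wills = {(t, args) for (t, p, args) in events if p == "will"}
    dos = {(t, args) for (t, p, args) in events if p == "do"}
    followed = sum(1 for (t, a) in wills if (t + 1, a) in dos)
    p_do_given_will = followed / len(wills) if wills else 0.0
    steps = {t for (t, _, _) in events}
    p_do = len(dos) / len(steps) if steps else 0.0
    return p_do_given_will, p_do

log = [(0, "will", ("S", "x")), (1, "do", ("S", "x")),
       (2, "will", ("S", "y")), (3, "do", ("S", "y")),
       (4, "do", ("S", "z"))]
cond, base = will_do_evidence(log)  # cond = 1.0, far above base = 0.6
```

A large gap between the conditional and baseline frequencies is the kind of evidence that would support encapsulating the map in a willNode.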
Now, what is required for this sort of statistical pattern to be recognizable in the AtomTable's
history? What is required is that EquivalenceEdges (which, note, must be part of the Combo
vocabulary in order for any MOSES-related algorithms to recognize patterns involving them)
must be defined according to the logic of hypersets rather than the logic of sets. What is
fascinating is that this is no big deal! In fact, the AtomTable software structures support this
automatically; it's just not the way most people are used to thinking about things. There is no
reason, in terms of the AtomTable, not to create self-referential structures like the one given
above.
The next question, though, is how do we calculate the truth values of structures like those
above. The truth value of a hyperset structure turns out to be an infinite-order probability distribution, which is a complex and peculiar entity [Goe10a]. Infinite-order probability distributions
are partially-ordered, and so one can compare the extent to which two different self-referential
structures apply to a given body of data (e.g. an AtomTable), via comparing the infinite-order
distributions that constitute their truth values. In this way, one can recognize self-referential patterns
in an AtomTable, and carry out encapsulation of self-referential maps. This sounds very abstract
and complicated, but the class of infinite-order distributions defined in the above-referenced pa-
pers actually have their truth values defined by simple matrix mathematics, so there is really
nothing that abstruse involved in practice.
Finally, there is the question of how these hyperset structures are to be logically manipulated
within PLN. The answer is that regular PLN inference can be applied perfectly well to hypersets,
but some additional hyperset operations may also be introduced; these are currently being
researched.
Clearly, with this subtle, currently unimplemented component of the CogPrime design we
have veered rather far from anything the human brain could plausibly be doing in detail. But
yet, some meaningful connections may be drawn. In Chapter 13 of Part 1 we have discussed
how probabilistic logic might emerge from the brain, and also how the brain may embody
self-referential structures like the ones considered here, via (perhaps using the hippocampus)
encoding whole neural nets as inputs to other neural nets. Regarding infinite-order probabilities,
it is certainly the case that the brain is efficient at carrying out operations equivalent to matrix
manipulations (e.g. in vision and audition), and the above-referenced papers reduced infinite-order probabilities to
finite matrix manipulations, so that it's not completely outlandish to posit the brain could be
doing something mathematically analogous. Thus, all in all, it seems at least plausible that the
brain could be doing something roughly analogous to what we've described here, though the
details would obviously be very different.
Section VII
Communication Between Human and Artificial
Minds
Chapter 43
Communication Between Artificial Minds
43.1 Introduction
Language is a key aspect of human intelligence, and seems to be one of two critical factors
separating humans from other intelligent animals - the other being the ability to use tools.
Steven Mithen [Mit96] argues that the key factor in the emergence of the modern human
mind from its predecessors was the coming-together of formerly largely distinct mental modules
for linguistic communication and tool making/use. Other animals do appear to have fairly
sophisticated forms of linguistic communication, which we don't understand very well at present;
but as best we can tell, modern human language has many qualitatively different aspects from
these, which enable it to synergize effectively with tool making and use, and which have enabled
it to co-evolve with various aspects of tool-dependent culture.
Some AGI theorists have argued that, since the human brain is largely the same as that of
apes and other mammals without human-like language, the emulation of human-like language is
not the right place to focus if one wants to build human-level AGI. Rather, this argument goes,
one should proceed in the same order that evolution did - start with motivated perception and
action, and then once these are mastered, human-like language will only be a small additional
step. We suspect this would indeed be a viable approach - but may not be well suited for the
hardware available today. Robot hardware is quite primitive compared to animal bodies, but the kind of motivated perception and action that non-human animals do is extremely body-centric (even more so than is the case in humans). On the other hand, modern computing
technology is quite sophisticated as regards language - we program computers (including AIs)
using languages of a sort, for example. This suggests that on a pragmatic basis, it may make
sense to start working with language at an earlier stage in AGI development, than the analogue
with the evolution of natural organisms would suggest.
The CogPrime architecture is compatible with a variety of different approaches to language
learning and capability, and frankly at this stage we are not sure which approach is best. Our
intention is to experiment with a variety of approaches and proceed pragmatically and empir-
ically. One option is to follow the more "natural" course and let sophisticated non-linguistic
cognition emerge first, before dealing with language in any serious way - and then encourage
human-like language facility to emerge via experience. Another option is to integrate some
sort of traditional computational linguistics system into CogPrime, and then allow CogPrime's
learning algorithms to modify this system based on its experience. Discussion of this latter
option occupies most of this section of the book; it involves many tricks and compromises, but
could potentially constitute a faster route to success. Yet another option is to communicate with
young CogPrime systems using an invented language halfway between the human-language and
programming-language domains, such as Lojban (this possibility is discussed in Appendix E).
In this initial chapter on communication, we will pursue a direction quite different from the
latter chapters, and discuss a kind of communication that we think may be very valuable in the
CogPrime domain, although it has no close analogue among human beings. Many aspects of
CogPrime closely resemble aspects of the human mind; but in the end CogPrime is not intended
as an emulation of human intelligence, and there are some aspects of CogPrime that bear no
resemblance to anything in the human mind, but exploit some of the advantages of digital
computing infrastructure over neural wetware. One of the latter aspects is Psynese, a word we
have introduced to refer to direct mind-to-mind information transfer between artificial minds.
Psynese has some relatively simple practical applications: e.g. it could aid with the use of
linguistic resources and hand-coded or statistical language parsers within a learning-based language system, to be discussed in following chapters. In this use case, one sets up one CogPrime using the traditional NLP approaches, and another CogPrime using a purer learning-based approach, and lets the two systems share mind-stuff in a controlled way. Psynese may also be
useful in the context of intelligent virtual pets, where one may wish to set up a CogPrime
representing "collective knowledge" of multiple virtual pets.
But it also has some grander potential implications, such as the ability to fuse multiple AI
systems into "mindplexes" as discussed in Chapter 12 of Part 1.
One might wonder why a community of two or more CogPrime s would need a language at
all, in order to communicate. After all, unlike humans, CogPrime systems can simply exchange
"brain fragments" - subspaces of their Atomspaces. One CogPrime can just send relevant nodes
and links to another CogPrime (in binary form, or in an XML representation, etc.), bypassing
the linear syntax of language. This is in fact the basis of Psynese: why transmit linear strings
of characters when one can directly transmit Atoms? But the details are subtler than it might at
first seem.
One CogPrime can't simply "transfer a thought" to another CogPrime. The problem is that
the meaning of an Atom consists largely of its relationships with other Atoms, and so to pass
a node to another CogPrime, it also has to pass the Atoms that it is related to, and so on,
and so on. Atomspaces tend to be densely interconnected, and so to transmit one thought
fully accurately, a CogPrime system is going to end up having to transmit a copy of its entire
Atomspace! Even if privacy were not an issue, this form of communication (each utterance
coming packaged with a whole mind-copy) would present rather severe processing load on the
communicators involved.
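The severity of this interconnectedness problem is easy to see in a toy model: with even a modest number of links per Atom, the k-hop neighbourhood of a single Atom engulfs most of the Atomspace within a few hops. The random graph below is invented purely to illustrate the growth rate:

```python
import random

def k_hop_closure(adj, start, k):
    """Atoms reachable from `start` within k link-hops."""
    frontier, seen = {start}, {start}
    for _ in range(k):
        frontier = {m for a in frontier for m in adj[a]} - seen
        seen |= frontier
    return seen

random.seed(0)
n = 200
# toy Atomspace: 200 Atoms, ~6 outgoing links each
adj = {i: random.sample(range(n), 6) for i in range(n)}
sizes = [len(k_hop_closure(adj, 0, k)) for k in (1, 2, 3, 4)]
# sizes climbs rapidly toward n = 200: transmitting one Atom "fully
# accurately" soon requires transmitting nearly the whole Atomspace
```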
The idea of Psynese is to work around this interconnectedness problem by defining a mecha-
nism for CogPrime instances to query each others' minds directly, and explicitly represent each
others' concepts internally. This doesn't involve any unique cognitive operations besides those
required for ordinary individual thought, but it requires some unique ways of wrapping up these
operations and keeping track of their products.
Another idea this leads to is the notion of a PsyneseVocabulary: a collection of Atoms,
associated with a community of CogPrime s, approximating the most important Atoms inside
that community. The combinatorial explosion of direct-Atomspace communication is then halted
by an appeal to standardized Psynese Atoms. Pragmatically, a PsyneseVocabulary might be
contained in a PsyneseVocabulary server, a special CogPrime instance that exists to mediate
communications between other CogPrime s, and provide CogPrime s with information. Psynese
makes sense both as a mechanism for peer-to-peer communication between CogPrime s, and as
a mechanism allowing standardized communication between a community of CogPrime s using
a PsyneseVocabulary server.
43.2 A Simple Example Using a PsyneseVocabulary Server
Suppose CogPrime 1 wanted to tell CogPrime 2 that "Russians are crazy" (with the latter word
meaning something in between "insane" and "impractical"); and suppose that both CogPrime s
are connected to the same Psynese CogPrime with PsyneseVocabulary PV. Then, for instance,
it must find the Atom in PV corresponding to its concept "crazy." To do this it must create an
AtomStructureTemplate such as
Pred1(C1)
equals
ThereExists
  W1, C2, C3, W2, W3
  AND
    ConceptNode: C1
    ReferenceLink C1 W1
    WordNode: W1 #crazy
    ConceptNode: C2
    HebbianLink C1 C2
    ReferenceLink C2 W2
    WordNode: W2 #insane
    ConceptNode: C3
    HebbianLink C1 C3
    ReferenceLink C3 W3
    WordNode: W3 #impractical
encapsulating relevant properties of the Atom it wants to grab from PV. In this example the
properties specified are:
• ConceptNode, linked via a ReferenceLink to the WordNode for "crazy"
• HebbianLinks with ConceptNodes linked via ReferenceLinks to the WordNodes for "insane"
and "impractical"
So, what CogPrime 1 can do is fish in PV for "some concept that is denoted by the word 'crazy'
and is associated with 'insane' and 'impractical'." The association with "insane" provides more
insurance of getting the correct sense of the word "crazy" as opposed to e.g. the one in the
phrase "He was crazy about her" or in "That's crazy, man, crazy" (in the latter slang usage
"crazy" basically means "excellent"). The association with "impractical" biases away from the
interpretation that all Russians are literally psychiatric patients.
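This "fishing" query is essentially a small graph-pattern match. A toy version, with the mini-Atomspace represented as a flat list of typed links (the data and helper names are invented; a real system would use the Atomspace pattern matcher rather than this brute-force scan):

```python
def find_concepts(atoms, word, associated_words):
    """Return ConceptNodes tied by a ReferenceLink to `word` that also
    have HebbianLinks to concepts referenced by every word in
    `associated_words` (cf. the crazy/insane/impractical template)."""
    ref = {(c, w) for (t, c, w) in atoms if t == "ReferenceLink"}
    heb = {(a, b) for (t, a, b) in atoms if t == "HebbianLink"}
    concepts_of = lambda w: {c for (c, ww) in ref if ww == w}
    return [c for c in concepts_of(word)
            if all(any((c, c2) in heb for c2 in concepts_of(w))
                   for w in associated_words)]

pv = [("ReferenceLink", "C1", "crazy"),
      ("ReferenceLink", "C9", "crazy"),      # the slang "excellent" sense
      ("ReferenceLink", "C2", "insane"),
      ("ReferenceLink", "C3", "impractical"),
      ("HebbianLink", "C1", "C2"),
      ("HebbianLink", "C1", "C3")]
match = find_concepts(pv, "crazy", ["insane", "impractical"])  # -> ["C1"]
```

The associated words prune the slang sense exactly as described above: "C9" is denoted by "crazy" but lacks the required HebbianLinks, so only "C1" is retrieved.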
So, suppose that CogPrime 1 has fished the appropriate Atoms for "crazy" and "Russian"
from PV. Then it may represent in its Atomspace something we may denote crudely (a better
notation will be introduced later) as
InheritanceLink PV:477335:1256953732 PV:744444:1256953735 <.8,.6>
• A similar but perhaps more compelling example would be the interpretation of the phrase "the accountant
cooked the books." In this case both "cooked" and "books" are used in atypical senses, but specifying a
HebbianLink to "accounting" would cause the right Nodes to get retrieved from PV.
where e.g. "PV:744444" means "the Atom with Handle 744444 in CogPrime PV at time
1256953735," and may also wish to store additional information such as
PsyneseEvaluationLink <.9>
  PV
  Pred1
  PV:744444:1256953735
meaning that Pred1(PV:744444:1256953735) holds true with truth value <.9> if all the
Atoms referred to within Pred1 are interpreted as existing in PV rather than CogPrime 1.
The InheritanceLink then means: "In the opinion of CogPrime 1, 'Russian' as defined by
PV:477335:1256953732 inherits from 'crazy' as defined by PV:744444:1256953735 with truth
value <.8,.6>."
Suppose CogPrime 1 then sends the InheritanceLink to CogPrime 2. It is going to be mean-
ingfully interpretable by CogPrime 2 to the extent that CogPrime 2 can interpret the relevant
PV Atoms, for instance by finding Atoms of its own that correspond to them. To interpret
these Atoms, CogPrime 2 must carry out the reverse process that CogPrime 1 did to find the
Atoms in the first place. For instance, to figure out what PV:744444:1256953735 means to it,
CogPrime 2 may find some of the important links associated with the Node in PV, and make
a predicate accordingly, e.g.:
Pred2(C1)
equals
ThereExists
  W1, C2, C3, W2, W3
  AND
    ConceptNode: C1
    ReferenceLink C1 W1
    WordNode: W1 #crazy
    ConceptNode: C2
    HebbianLink C1 C2
    ReferenceLink C2 W2
    WordNode: W2 #lunatic
    ConceptNode: C3
    HebbianLink C1 C3
    ReferenceLink C3 W3
    WordNode: W3 #unrealistic
On the other hand, if there is no PsyneseVocabulary involved, then CogPrime 1 can submit
the same query directly to CogPrime 2. There is no problem with this, but if there is a reasonably
large community of CogPrime s it becomes more efficient for them all to agree on a standard
vocabulary of Atoms to be used for communication - just as, at a certain point in human
history, it was recognized as more efficient for people to use dictionaries rather than to rely on
peer-to-peer methods for resolution of linguistic disagreements.
The above examples involve human natural language terms, but this does not have to be
the case. PsyneseVocabularies can contain Atoms representing quantitative or other types of
data, and can also contain purely abstract concepts. The basic idea is the same. A CogPrime
has some Atoms it wants to convey to another CogPrime, and it looks in a PsyneseVocabulary
to see how easily it can approximate these Atoms in terms of "socially understood" Atoms.
This is particularly effective if the CogPrime receiving the communication is familiar with the
PsyneseVocabulary in question. Then the recipient may already know the PsyneseVocabulary
Atoms it is being pointed to; it may have already thought about the difference between these
consensus concepts and its own related concepts. Also, if the sender CogPrime is encapsulating
maps for easy communication, it may specifically seek approximate encapsulations involving
PsyneseVocabulary terms, rather than first encapsulating in its own terms and then translating
into PsyneseVocabulary terms.
43.2.1 The Psynese Match Schema
One way to streamline the above operations is to introduce a Psynese Match Schema, with the
property that
ExOut
  PsyneseMatch PV A
within CogPrime instance CPI, denotes the Atom within CogPrime instance PV that most
closely matches the Atom A in CPI. Note that the PsyneseMatch schema implicitly relies on
various parameters, because it must encapsulate the kind of process described explicitly in the
above example. PsyneseMatch must, internally, decide how many and which Atoms related to
A should be used to formulate a query to PV, and also how to rank the responses to the query
(e.g. by strength x confidence).
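The ranking step inside PsyneseMatch can be sketched directly: among the candidate Atoms PV returns for the query, select the one maximizing strength x confidence. The candidate handles and truth values below are invented:

```python
def psynese_match(candidates):
    """candidates: {handle: (strength, confidence)} returned by PV
    for a query; rank by strength * confidence as suggested above."""
    return max(candidates, key=lambda h: candidates[h][0] * candidates[h][1])

responses = {"PV:744444": (0.9, 0.8),   # "crazy" ~ insane/impractical
             "PV:888001": (0.95, 0.3),  # "crazy" ~ "excellent" (slang)
             "PV:102030": (0.5, 0.9)}
best = psynese_match(responses)  # -> "PV:744444" (0.72 beats 0.285, 0.45)
```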
Using PsyneseMatch, the example written above as
Inheritance PV:477335:1256953732 PV:744444:1256953735 ‹.8.,6>
could be rewritten as
Inheritance <.8,.6>
  ExOut
    PsyneseMatch PV C1
  ExOut
    PsyneseMatch PV C2
where C1 and C2 are the ConceptNodes in CPI corresponding to the intended senses of
"crazy" and "Russian."
43.3 Psynese as a Language
The general definition of a psynese expression for CogPrime is a Set of Atoms that contains
only:
• Nodes from PsyneseVocabularies
• Perceptual nodes (numbers, words, etc.)
• Relationships relating no nodes other than the ones in the above two categories, and relating
no relationships except ones in this category
• Predicates or Schemata involving no relationships or nodes other than the ones in the above
three categories, or in this category
The PsyneseEvaluationLink type indicated earlier forces interpretation of a predicate as a Psy-
nese expression.
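This recursive definition amounts to a closure check: nodes must come from a PsyneseVocabulary or be perceptual, and every link or predicate may relate only Atoms already admitted. A sketch, using an invented representation in which each Atom maps to its list of targets (empty for nodes):

```python
def is_psynese_expression(atoms, vocab, perceptual):
    """atoms: {name: [target names]}; nodes have no targets.
    Returns True iff every Atom satisfies the closure condition."""
    admissible, changed = set(), True
    while changed:
        changed = False
        for name, targets in atoms.items():
            if name in admissible:
                continue
            if not targets:  # a node: must be vocabulary or perceptual
                ok = name in vocab or name in perceptual
            else:            # a link/predicate: only admissible targets
                ok = all(t in admissible for t in targets)
            if ok:
                admissible.add(name)
                changed = True
    return set(atoms) <= admissible

expr = {"crazy": [], "Russian": [], "inh": ["Russian", "crazy"]}
ok = is_psynese_expression(expr, vocab={"crazy", "Russian"}, perceptual=set())
# ok is True; a link to any Atom outside these categories would fail
```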
In what sense is the use of Psynese expressions to communicate a language? Clearly it is a
formal language in the mathematical sense. It is not quite a "human language" as we normally
conceive it, but it is ideally suited to serve the same functions for CogPrime s as human language
serves for humans. The biggest differences from human language are:
• Psynese uses weighted, typed hypergraphs (i.e. Atomspaces) instead of linear strings of
symbols. This eliminates the "parsing" aspect of language (syntax being mainly a way of
projecting graph structures into linear expressions).
• Psynese lacks subtle and ambiguous referential constructions like "this", "it" and so forth.
These are tools allowing complex thoughts to be compactly expressed in a linear way, but
CogPrime s don't need them. Atoms can be named and pointed to directly without complex,
poorly-specified mechanisms mediating the process.
• Psynese has far less ambiguity. There may be Atoms with more than one aspect to their
meanings, but the cost of clarifying such ambiguities is much lower for CogPrime s than for
humans using language, and so habitually there will not be the rampant ambiguity that we
see in human expressions.
On the other hand, mapping Psynese into Lojban - a syntactically formal, semantically highly precise language created for communication between humans - would be much more straightforward than mapping it into a true natural language. Indeed, one could create a PsyneseVocabulary
based on Lojban, which might be ideally suited to serve as an intermediary between different
CogPrime s. And Lojban may be used to create a linearized version of Psynese that looks more
like a natural language. We return to this point in Appendix ??.
43.4 Psynese Mindplexes
We now recall from Chapter 12 of Part 1 the notion of a mindplex: that is, an intelligent system
that:
1. Is composed of a collection of intelligent systems, each of which has its own "theater of
consciousness" and autonomous control system, but which interact tightly, exchanging large
quantities of information frequently
2. Has a powerful control system on the collective level, and an active "theater of consciousness"
on the collective level as well
In informal discussions, we have found that some people, on being introduced to the mindplex
concept, react by contending that either human minds or human social groups are mindplexes.
However, I believe that, while there are significant similarities between mindplexes and minds,
and between mindplexes and social groups, there are also major qualitative differences. It's
true that an individual human mind may be viewed as a collective, both from a theory-of-cognition perspective (e.g. Minsky's "society of mind" theory [Min88]) and from a personality-psychology perspective (e.g. the theory of subpersonalities [Row90]). And it's true that social
groups display some autonomous control and some emergent-level awareness. However, in a
healthy human mind, the collective level rather than the cognitive-agent or subpersonality level
is dominant, the latter existing in service of the former; and in a human social group, the
individual-human level is dominant, the group-mind clearly "cognizing" much more crudely
than its individual-human components, and exerting most of its intelligence via its impact on
individual human minds. A mindplex is a hypothetical intelligent system in which neither level
is dominant, and both levels are extremely powerful. A mindplex is like a human mind in
which the subpersonalities are fully-developed human personalities, with full independence of
thought, and yet the combination of subpersonalities is also an effective personality. Or, from
the other direction, a mindplex is like a human society that has become so integrated and so
cohesive that it displays the kind of consciousness and self-control that we normally associate
with individuals.
There are two mechanisms via which mindplexes may possibly arise in the medium-term
future:
1. Humans becoming more tightly coupled via the advance of communication technologies, and
a communication-centric AI system coming to embody the "emergent conscious theater" of
a human-incorporating mindplex
2. A society of AI systems communicating amongst each other with a richness not possible for
human beings, and coming to form a mindplex rather than merely a society of distinct AIs
The former sort of mindplex relates to the concept of a "global brain" discussed in Chapter
12 of Part 1. Of course, these two sorts of mindplexes are not mutually contradictory, and may
coexist or fuse. The possibility also exists for higher-order mindplexes, meaning mindplexes
whose component minds are themselves mindplexes. This would occur, for example, if one
had a mindplex composed of a family of closely-interacting AI systems, which acted within a
mindplex associated with the global communication network.
Psynese, however, is more directly relevant to the latter form of mindplex. It gives a concrete
mechanism via which such a mindplex might be sculpted.
43.4.1 AGI Mindplexes
How does one get from CogPrime s communicating via Psynese to CogPrime mindplexes?
Clearly, with the Psynese mode of communication, the potential is there for much richer
communication than exists between humans. There are limitations, posed by the private nature
of many concepts - but these limitations are much less onerous than for human language, and
can be overcome to some extent by the learning of complex cognitive schemata for translation
between the "private languages" of individual Atomspaces and the "public languages" of Psynese
servers.
But rich communication does not in itself imply the evolution of mindplexes. It is possible
that a community of Psynese-communicating CogPrime s might spontaneously evolve a mind-
plex structure - at this point, we don't know enough about CogPrime individual or collective
dynamics to say. But it is not necessary to rely on spontaneous evolution. In fact it is possible,
and even architecturally simple, to design a community of CogPrime s in such a way as to
encourage and almost force the emergence of a mindplex structure.
The solution is simple: beef up the PsyneseVocabulary servers. Rather than leaving them as relatively passive receptacles of knowledge from the CogPrime s they serve, allow them to be active, creative entities with their own feelings, goals and motivations.
The PsyneseVocabulary servers serving a community of CogPrime s are absolutely critical
to these CogPrime s. Without them, high-level inter-CogPrime communication is effectively
impossible. And without the concepts the PsyneseVocabularies supply, high-level individual
CogPrime thought will be difficult, because CogPrime s will come to think in Psynese to at
least the same extent to which humans think in language.
Suppose each PsyneseVocabulary server has its own full CogPrime mind, its own "conscious
theater". These minds are in a sense "emergent minds" of the CogPrime community they serve
- because their contents are a kind of "nonlinear weighted average" of the mind-contents of the
community. Furthermore, the actions these minds take will feed back and affect the community
in direct and indirect ways - by affecting the language by which the minds communicate. Clearly,
the definition of a mindplex is fulfilled.
But what will the dynamics of such a CogPrime mindplex be like? What will be the properties
of its cognitive and personality psychology? We could speculate on this here, but would have
very little faith in the possible accuracy of our speculations. The psychology of mindplexes will
reveal itself to us experimentally as our work on AGI engineering, education and socialization
proceeds.
One major issue that arises, however, is that of personality filtering. Put simply: each intelli-
gent agent in a mindplex must somehow decide for itself which knowledge to grab from available
PsyneseVocabulary servers and other minds, and which knowledge to avoid grabbing from oth-
ers in the name of individuality. Different minds may make different choices in this regard. For
instance, one choice could be to, as a matter of routine, take only extremely confident knowledge
from the PsyneseVocabulary server. This corresponds roughly to ingesting "facts" from the col-
lective knowledge pool, but not opinions or speculations. Less confident knowledge would then
be ingested from the collective knowledge pool on a carefully calculated and as-needed basis.
Another choice could be to accept only small networks of Atoms from the collective knowledge
pool, on the principle that these can be reflectively understood as they are ingested, whereas
large networks of Atoms are difficult to deliberate and reflect about. But any policies like this
are merely heuristic ones.
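The first policy mentioned, routinely ingesting only near-"factual" collective knowledge and pulling less confident material on an as-needed basis, can be sketched as a simple confidence filter (the threshold, atom names and truth values are invented for illustration):

```python
def personality_filter(pool, confidence_floor=0.9, requested=()):
    """Ingest atoms from the collective pool only if they are highly
    confident ("facts") or were explicitly requested as needed."""
    return {name: tv for name, tv in pool.items()
            if tv[1] >= confidence_floor or name in requested}

pool = {"water_is_wet": (0.99, 0.97),      # consensus fact: ingested
        "russians_are_crazy": (0.8, 0.6),  # opinion: skipped
        "markets_will_fall": (0.7, 0.4)}   # speculation: only on request
mind = personality_filter(pool, requested={"markets_will_fall"})
```

Different minds choosing different floors, or different request policies, is one concrete way distinct "personalities" could persist within a mindplex.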
43.5 Psynese and Natural Language Processing
Next we review a more near-term, practical application of the Psynese mechanism: the fusion of
two different approaches to natural language processing in CogPrime, the experiential learning
approach and the "engineered NLP subsystem" approach.
In the former approach, language is not given any extremely special role, and CogPrime is
expected to learn language much as it would learn any other complex sort of knowledge. There
may of course be learning biases programmed into the system, to enable it to learn language
based on its experience more rapidly. But there is no concrete linguistic knowledge programmed
in.
In the latter approach, one may use knowledge from statistical corpus analysis, one may use
electronic resources like WordNet and FrameNet, and one may use sophisticated, specialized
tools like natural language parsers with hand-coded grammars. Rather than trying to emulate
the way a human child learns language, one is trying to emulate the way a human adult
comprehends and generates language.
Of course, there is not really a rigid dichotomy between these two approaches. Many linguistic
theorists who focus on experiential learning also believe in some form of universal grammar,
and would advocate for an approach where learning is foundational but is biased by in-built
abstract structures representing universal grammar. There is of course very little knowledge
(and few detailed hypotheses) about how universal grammar might be encoded in the human
brain, though there is reason to think it may be at a very abstract level, due to the significant
overlaps between grammatical structure, social role structure [CE30], and physical reasoning [Cam].
The engineered approach to NLP provides better functionality right "out of the box," and
enables the exploitation of the vast knowledge accumulated by computational linguists in the
past decades. However, we suspect that computational linguistics may have hit a ceiling in some
regards, in terms of the quality of the language comprehension and generation that it can deliver.
It runs up against problems related to the disambiguation of complex syntactic constructs, which
don't seem to be resolvable using either a tractable number of hand-coded rules, or supervised
or unsupervised learning based on a tractably large set of examples. This conclusion may be
disputed, and some researchers believe that statistical computational linguistics can eventually
provide human-level functionality, once the Web becomes a bit larger and the computers used to
analyze it become a bit more powerful. But in our view it is interesting to explore hybridization
between the engineered and experiential approaches, with the motivation that the experiential
approach may provide a level of flexibility and insight at dealing with ambiguity that the
engineered approach apparently lacks.
After all, the way a human child deals with the tricky disambiguation problems that stump
current computational linguistics systems is not via analysis of trillion-word corpuses, but rather
via correlating language with non-linguistic experience. One may argue that the genome implicitly
contains a massive corpus of speech, but it is also to be noted that this is experientially
contextualized speech. And it seems clear from the psycholinguistic evidence [Tom03] that for
young human children, language is part and parcel of social and physical experience, learned in
a manner that's intricately tied up with the learning of many other sorts of skills.
One interesting approach to this sort of hybridization, using Psynese, is to create multiple
CogPrime instances taking different approaches to language learning, and let them communi-
cate. Most simply one may create
• A CogPrime instance that learns language mainly based on experience, with perhaps some
basic in-built structure and some judicious biasing to its learning (let's call this CPexp)
• A CogPrime instance using an engineered NLP system (let's call this CPeng)
In this case, CPexp can use CPeng as a cheap way to test its ideas. For instance, suppose
CPexp thinks it has correctly interpreted a certain sentence S into Atom-set A. Then it can
send its interpretation A to CPeng and see whether CPeng thinks A is a good interpretation of
S, by consulting the truth value of
ReferenceLink
   ExOut
      PsyneseMatch CPeng S
   ExOut
      PsyneseMatch CPeng A
Similarly, if CPexp believes it has found a good way (S) to linguistically express a collection
of Atoms A, it can check whether these two match reasonably well in CPeng.
Of course, this approach could be abused in an inefficient and foolish way, for instance if
CPexp did nothing but randomly generate sentences and then test them against CPeng. In this
case we would have a much less efficient approach than simply using CPeng directly. However,
effectively making use of CPeng as a resource requires a different strategy: throwing CPeng only
a relatively small selection of things that seem to make sense, and using CPeng as a filter to
avoid trying out rough-draft guesses in actual human conversation.
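This division of labor can be sketched in code. The following is a purely hypothetical sketch: the function names and the overlap-based scoring heuristic are inventions for illustration, standing in for a real consultation of the PsyneseMatch truth value. The point is only the control flow: the experiential agent sends the engineered agent a small number of promising candidates, and keeps only those that score well, before risking a guess in real conversation.

```python
# Hypothetical sketch of the CPexp/CPeng filtering strategy.
# psynese_match_score is a stand-in for CPeng's judgment of how well an
# Atom-set interprets a sentence (the real judgment would consult the
# PsyneseMatch truth value, not word overlap).
def psynese_match_score(sentence, atom_set):
    overlap = len(set(sentence.lower().split()) & atom_set)
    return overlap / max(len(atom_set), 1)

def filter_guesses(sentence, candidate_atom_sets, threshold=0.5, max_queries=3):
    """CPexp sends at most max_queries candidate interpretations to CPeng,
    keeping only those whose match score clears the threshold."""
    results = []
    for atoms in candidate_atom_sets[:max_queries]:
        if psynese_match_score(sentence, atoms) >= threshold:
            results.append(atoms)
    return results

cands = [{"cat", "chased", "snake"}, {"dog", "ate", "shoe"}]
kept = filter_guesses("the cat chased a snake", cands)
print(len(kept))  # 1
```

The `max_queries` cap is what prevents the degenerate random-generate-and-test behavior described above: CPeng is consulted sparingly, as a filter rather than as a search oracle.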
This hybrid approach, we suggest, may provide a way of getting the best of both worlds: the
flexibility of an experiential-learning-based language approach, together with the exploitation
of existing linguistic tools and resources. With this in mind, in the following chapters we will
describe both engineering and experiential-learning based approaches to NLP.
43.5.1 Collective Language Learning
Finally we bring the language-learning and mindplex themes together, in the notion of collective
language learning. One of the most interesting uses for a mindplex architecture is to allow
multiple CogPrime agents to share the linguistic knowledge they gain. One may envision a
PsyneseVocabulary server into which a population of CogPrime agents input their linguistic
knowledge specifically, and which these agents then consult when they wish to comprehend or
express something in language and their individual NLP systems are not up to the task.
This could be a very powerful approach to language learning, because it would allow a
potentially very large number of AI systems to effectively act as a single language learning
system. It is an especially appealing approach in the context of CogPrime systems used to
control animated agents in online virtual worlds or multiplayer games. The amount of linguistic
experience undergone by, say, 100,000 virtually embodied CogPrime agents communicating with
human virtual world avatars and game players, would be far more than any single human child
or any single agent could undergo. Thus, to the extent that language learning can be accelerated
by additional experience, this approach could enable language to be learned quite rapidly.
Chapter 44
Natural Language Comprehension
Co-authored with Michael Ross and Linas Vepstas and Ruiting Lian
44.1 Introduction
Two key approaches to endowing AGI systems with linguistic facility exist, as noted above:
• "Experiential" - shorthand here for "gaining most of its linguistic knowledge from interactive
experience, in such a way that language learning is not easily separable from generic learning
how to survive and flourish"
• "Engineered" - shorthand here for "gaining most of its linguistic knowledge front sources
other than the system's own experience in the world" (including learning language front
resources like corpora)
This dichotomy is somewhat fuzzy, since getting experiential language learning to work well
may involve some specialized engineering, and engineered NLP systems may also involve some
learning from experience. However, in spite of the fuzziness, the dichotomy is still real and
important; there are concrete choices to be made in designing an NLP system and this dichotomy
compactly symbolizes some of them. Much of this chapter and the next few will be focused on
the engineering approach, but we will also devote some space to discussing the experience-based
approach. Our overall perspective on the dichotomy is that
• the engineering-based approach, on its own, is unlikely to take us to human-level NLP ... but
this isn't wholly impossible, if the engineering is done in a manner that integrates linguistic
functionality richly with other kinds of experiential learning
• using a combination of experience-based and engineering-based approaches may be the most
practical option
• the engineering approach is useful for guiding the experiential approach, because it tells us
a lot about what kinds of general structures and dynamics may be adequate for intelligent
language processing. To simplify a bit, one can prepare an AGI system for experiential
learning by supplying it with structures and dynamics capable of supporting the key components
of an engineered NLP system - and biased toward learning things similar to known
engineered NLP systems - but requiring all, or the bulk of, the actual linguistic content
to be learned via experience. This approach may be preferable to requiring a system to
learn language based on more abstract structures and dynamics, and may indeed be more
comparable to what human brains do, given the large amount of linguistic biasing that is
probably built into the human genome.
Further distinctions, overlapping with this one, may also be useful. One may distinguish (at
least) five modes of instructing NLP systems, the first three of which are valid only for engineered
NLP systems, but the latter two of which are valid both for engineered and experiential NLP
systems:
• hand-coded rules
• supervised learning on hand-tagged corpuses, or via other mechanisms of explicit human
training
• unsupervised learning from static bodies of data
• unsupervised learning via interactive experience
• supervised learning via interactive experience
Note that, in principle, any of these modes may be used in a pure-language or a socially/phys-
ically embodied language context. Of course, there is also semi-supervised learning which may
be used in place of supervised learning in the above list [CSZ08].
Another key dichotomy related to linguistic facility is language comprehension versus language
generation (each of which is typically divided into a number of different subprocesses).
In language comprehension, we have processes like stemming, part-of-speech tagging, grammar-
based parsing, semantic analysis, reference resolution and discourse analysis. In language gener-
ation, we have semantic analysis, syntactic sentence generation, pragmatic discourse generation,
reference-insertion, and so forth. In this chapter and the next two we will briefly review all these
different topics and explain how they may be embodied in CogPrime. Then, in Chapter ?? we
present a complementary approach to linguistic interaction with AGI systems based on the
invented language Lojban: and in Chapter 48 we discuss the use of CogPrime cognition to
regulate the dialogue process.
A typical, engineered computational NLP system involves hand-coded algorithms carrying
out each of the specific tasks mentioned in the previous paragraph, sometimes with parameters,
rules or number tables that are tuned or learned statistically based on corpuses of data. In
fact, most NLP systems handle only understanding or only generation; systems that cover both
aspects in a unified way are quite rare. The human mind, on the other hand, carries out these
tasks in a much more interconnected way - using separate procedures for the separate tasks, to
some extent, but allowing each of these procedures to be deeply informed by the information
generated by the other procedures. This interconnectedness is what allows the human mind
to really understand language - specifically because human language syntax is complex and
ambiguous enough that the only way to master it is to infuse one's syntactic analyses with
semantic (and to a lesser extent pragmatic) knowledge. In our treatment of NLP we will pay
attention to connections between linguistic functionalities, as well as to linguistic functionalities
in isolation.
It's worth emphasizing that what we mean by an "experience-based" language system is quite
different from corpus-based language systems as are commonplace in computational linguistics
today [MS99] (and from the corpus-based learning algorithm to be discussed in Chapter ??
below). In fact, we feel the distinction between corpus-based and rule-based language processing
systems is often overblown. Whether one hand-codes a set of rules, or carefully marks up a
corpus so that rules can be induced from it, doesn't ultimately make that much difference. For
instance, OpenCogPrime's RelEx system (to be described below) uses hand-coded rules to do
much the same thing that the Stanford parser does using rules induced from a tagged corpus.
But both systems do roughly the same thing. RelEx is currently faster due to using fewer rules,
and it handles some complex cases like comparatives better (presumably because they were not
well covered in the Stanford parser's training corpus); but the Stanford parser may be preferable
in other respects, for instance it's more easily generalizable to languages beyond English (for
a language with structure fairly similar to English, one just has to supply a new marked-up
training corpus; whereas porting RelEx rules to other languages requires more effort).
An unsupervised corpus-based learning system like the one to be described in Chapter ?? is
a little more distinct from rule-based systems, in that it is based on inducing patterns from
natural rather than specially prepared data. But still, it is learning language as a phenomenon
unto itself, rather than learning language as part and parcel of a system's overall experience in
the world.
The key distinction to be made, in our view, is between language systems that learn language
in a social and physical context, versus those that deal with language in isolation. Dealing with
language in context immediately changes the way the linguistics problem appears (to the AI
system, and also to the researcher), and makes hand-coded rules and hand-tagged corpuses less
viable, shifting attention toward experiential learning based approaches.
Ultimately we believe that the "right" way to teach an AGI system language is via semi-
supervised learning in a socially and physically embodied context. That is: talk to the system,
and have it learn both from your reinforcement signals and from unsupervised analysis of the
dialogue. However, we believe that other modes of teaching NLP systems can also contribute,
especially if used in support of a system that also does semi-supervised learning based on
embodied interactive dialogue.
Finally, a note on one aspect of language comprehension that we don't deal with here. We deal
only with text processing, not speech understanding or generation. A CogPrime approach to
speech would be quite feasible to develop, for instance using neural-symbolic hybridization with
DeSTIN or a similar perceptual-motor hierarchy. However, this potential aspect of CogPrime
has not been pursued in detail yet, and we won't devote space to it here.
44.2 Linguistic Atom Types
Explicit representation of linguistic knowledge in terms of Atoms is not a deep issue, more of a
"plumbing" type of issue, but it must be dealt with before moving on to subtler aspects.
In principle, for dealing with linguistic information coming in through ASCII, all we need
besides the generic CogPrime structures and dynamics are two node types and one relationship
type:
• CharacterNode
• CharacterInstanceNode
• a unary relationship concat denoting an externally-observed list of items
Sequences of characters may then be represented in terms of lists and the concat schema. For
instance the word "pig" is represented by the list concat(#p, #i, #g)
The concat operator can be used to help define special NL atom types, such as:
• MorphemeNode / MorphemeInstanceNode
• WordNode / WordInstanceNode
• PhraseNode / PhraseInstanceNode
• SentenceNode / SentenceInstanceNode
• UtteranceNode / UtteranceInstanceNode
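The character-level scheme just described can be made concrete with a few lines of code. The classes below are illustrative stand-ins, not the actual Atomspace API: a word is represented as a concat list of CharacterNodes, from which word-level node types can be built up.

```python
# Illustrative stand-ins for the Atom types described above
# (not the real Atomspace API).
class CharacterNode:
    def __init__(self, ch):
        self.ch = ch
    def __repr__(self):
        return f"#{self.ch}"

def concat(*nodes):
    # the "externally-observed list of items" relationship: an ordered list
    return list(nodes)

class WordNode:
    def __init__(self, word):
        self.name = word
        # the word "pig" becomes the list concat(#p, #i, #g)
        self.characters = concat(*(CharacterNode(c) for c in word))

pig = WordNode("pig")
print(pig.characters)  # [#p, #i, #g]
```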
44.3 The Comprehension and Generation Pipelines
Exactly how the "comprehension pipeline" is broken down into component transformations
depends on one's linguistic theory of choice. The approach taken in OpenCogPrime's engineered
NLP framework, in use from 2008-2012, looked like:
Text --> Tokenizer --> Link Parser -->
Syntactico-Semantic Relationship Extractor (RelEx) -->
Semantic Relationship Extractor (RelEx2Frame) -->
Semantic Nodes & Links
In 2012-13, a new approach has been undertaken, which simplifies things a little and looks
like
Text --> Tokenizer --> Link Parser -->
Syntactico-Semantic Relationship Extractor (Syn2Sem) -->
Semantic Nodes & Links
Note that many other variants of the NL pipeline include a "tagging" stage, which assigns part
of speech tags to words based on the words occurring around them. In our current approach,
tagging is essentially subsumed within parsing; the choice of a POS (part-of-speech) tag for
a word instance is carried out within the link parser. However, it may still be valuable to
derive information about likely POS tags for word instances from other techniques, and use
this information within a link parsing framework by allowing it to bias the probabilities used
in the parsing process.
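One simple way to realize this biasing, sketched below with invented numbers and names, is to add the log-probability of each candidate POS assignment (as estimated by an external tagger) to the parse's own log score, so that a parse requiring an unlikely tag is penalized but not forbidden:

```python
import math

# Toy tagger output: P(tag | word in this context).  These probabilities
# are made up for illustration.
pos_probs = {
    ("snake", "noun"): 0.9,
    ("snake", "verb"): 0.1,
}

def biased_log_score(parse_log_score, tag_choices):
    # combine the parser's own log score with the tagger's log-probabilities
    # for the POS choices that this parse commits to
    score = parse_log_score
    for word, tag in tag_choices:
        score += math.log(pos_probs.get((word, tag), 1e-6))
    return score

# the noun reading of "snake" gets a higher combined score
noun = biased_log_score(-2.0, [("snake", "noun")])
verb = biased_log_score(-2.0, [("snake", "verb")])
print(noun > verb)  # True
```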
None of the processes in this pipeline are terribly difficult to carry out, if one is willing to
use hand-coded rules within each step, or derive rules via supervised learning, to govern their
operation. The truly tricky aspects of NL comprehension are:
• arriving at the rules used by the various subprocesses, in a way that naturally supports
generalization and modification of the rules based on ongoing experience
• allowing semantic understanding to bias the choice of rules in particular contexts
• knowing when to break the rules and be guided by semantic intuition instead
Importing rules straight from linguistic databases results in a system that (like the current
RelEx system) is reasonably linguistically savvy on the surface, but lacks the ability to adapt its
knowledge effectively based on experience, and has trouble comprehending complex language.
Supervised learning based on hand-created corpuses tends to result in rule-bases with similar
problems. This doesn't necessarily mean that hand-coding or supervised learning of linguistic
rules has no place in an AGI system, but it means that if one uses these methods, one must
take extra care to make one's rules modifiable and generalizable based on ongoing experience,
because the initial version of one's rules is not going to be good enough.
Generation is the subject of the following chapter, but for comparison we give here a high-
level overview of the generation pipeline, which may be conceived as:
1. Content determination: figuring out what needs to be said in a given context
2. Discourse planning: overall organization of the information to be communicated
3. Lexicalization: assigning words to concepts
4. Reference generation: linking words in the generated sentences using pronouns and other
kinds of reference
5. Syntactic and morphological realization: the generation of sentences via a process inverse to
parsing, representing the information gathered in the above phases
6. Phonological or orthographic realization: turning the above into spoken or written words,
complete with timing (in the spoken case), punctuation (in the written case), etc.
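The six stages can be viewed as a simple function pipeline, each stage consuming the previous stage's output. The sketch below is purely illustrative: the data shapes and function names are invented for the example, and real generation systems are of course far richer at every stage.

```python
# A skeletal, illustrative rendering of the six generation stages.
def determine_content(context):        # 1. what needs saying
    return {"event": "chase", "agent": "cat", "patient": "snake"}

def plan_discourse(content):           # 2. organize the information
    return [content]                   # one sentence suffices here

def lexicalize(message):               # 3. concepts -> words
    return {"verb": "chased", "subj": "the cat", "obj": "a snake"}

def insert_references(lexicalized):    # 4. pronouns etc. (no-op here)
    return lexicalized

def realize(lexicalized):              # 5. syntactic realization
    return f"{lexicalized['subj']} {lexicalized['verb']} {lexicalized['obj']}"

def orthographic(sentence):            # 6. capitalization and punctuation
    return sentence[0].upper() + sentence[1:] + "."

stages = [determine_content, plan_discourse, lambda m: lexicalize(m[0]),
          insert_references, realize, orthographic]
data = None
for stage in stages:
    data = stage(data)
print(data)  # The cat chased a snake.
```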
In Chapter 46 we explain how this pipeline is realized in OpenCogPrime's current engineered
NL generation system.
44.4 Parsing with Link Grammar
Now we proceed to explain some of the details of OpenCogPrime's engineered NL comprehension
system. This section gives an overview of link grammar, a key part of the current OpenCog
NLP framework, and explains what makes it different from other linguistic formalisms.
We emphasize that this particular grammatical formalism is not, in itself, a critical part
of the CogPrime design. In fact, it should be quite possible to create and teach a CogPrime
AGI system without using any particular grammatical formalism - having it acquire linguistic
knowledge in a purely experiential way. However, a great deal of insight into CogPrime -based
language processing may be obtained by considering the relevant issues in the concrete detail
that the assumption of a specific grammatical formalism provides. This insight is of course
useful if one is building a CogPrime that makes use of that particular grammatical formalism,
but it's also useful to some degree even if one is building a CogPrime that deals with human
language entirely experientially.
This material will be more comprehensible to the reader who has some familiarity with
computational linguistics, e.g. with notions such as parts of speech, feature structures, lexicons,
dependency grammars, and so forth. Excellent references are [MS99, Jac03]. We will try to keep
the discussion relatively elementary, but have opted not to insert a computational linguistics
tutorial.
The essential idea of link grammar is that each word comes with a feature structure consisting
of a set of typed connectors. Parsing consists of matching up connectors from one word with
connectors from another.
To understand this in detail, the best course is to consider an example sentence. We will use
the following example, drawn from the classic paper "Parsing with a Link Grammar" by Sleator
and Temperley [ST93]:
The cat chased a snake
The link grammar parse structure for this sentence is:
              +-----O------+
 +-Ds--+--Ss--+      +-Ds--+
 |     |      |      |     |
the   cat   chased   a   snake
In phrase structure grammar terms, this corresponds loosely to
(S (NP The cat)
   (VP chased
       (NP a snake)))
but the OpenCog linguistic pipeline makes scant use of this kind of phrase structure rendition
(which is fine in this simple example; but in the case of complex sentences, construction of
analogous mappings from link parse structures to phrase structure grammar parse trees can
be complex and problematic). Currently the hierarchical view is used in OpenCog only within
some reference resolution heuristics.
There is a database called the "link grammar dictionary" which contains connectors associ-
ated with all common English words. The notation used to describe feature structures in this
dictionary is quite simple. Different kinds of connectors are denoted by letters or pairs of letters
like S or SX. Then if a word W1 has the connector S+, this means that the word can have an S
link coming out to the right side. If a word W2 has the connector S-, this means that the word
can have an S link coming out to the left side. In this case, if W1 occurs to the left of W2 in a
sentence, then the two words can be joined together with an S link.
The features of the words in our example sentence, as given in the S&T paper, are:
Words          Formula
a, the         D+
snake, cat     D- & (O- or S+)
chased         S- & O+
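To make the connector-matching idea concrete, here is a toy checker for the example sentence. It is our own simplified sketch, not the actual link parser: it merely verifies that a proposed set of links satisfies some disjunct of every word's formula from the table above, plus the planarity and connectivity metarules discussed below.

```python
import itertools

# Toy lexicon: each word lists disjuncts, i.e. alternative connector sets.
# "D+" means "a D link to some word on my right"; "D-" means "on my left".
LEXICON = {
    "the":    [["D+"]],
    "a":      [["D+"]],
    "cat":    [["D-", "S+"], ["D-", "O-"]],
    "snake":  [["D-", "S+"], ["D-", "O-"]],
    "chased": [["S-", "O+"]],
}

def links_ok(words, links):
    """Check a proposed linkage: links are (i, j, type) with i < j.
    Enforces planarity, connectivity, and that every word realizes
    exactly one of its disjuncts."""
    # planarity: no two links may cross
    for (i1, j1, _), (i2, j2, _) in itertools.combinations(links, 2):
        if i1 < i2 < j1 < j2 or i2 < i1 < j2 < j1:
            return False
    # connectivity: the links must join all words into one graph
    parent = list(range(len(words)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for i, j, _ in links:
        parent[find(i)] = find(j)
    if len({find(k) for k in range(len(words))}) != 1:
        return False
    # each word's used connectors must match one of its disjuncts
    for k, w in enumerate(words):
        used = sorted(typ + ("+" if i == k else "-")
                      for (i, j, typ) in links if k in (i, j))
        if not any(sorted(d) == used for d in LEXICON[w]):
            return False
    return True

words = ["the", "cat", "chased", "a", "snake"]
good = [(0, 1, "D"), (1, 2, "S"), (3, 4, "D"), (2, 4, "O")]
bad = [(0, 1, "D"), (1, 2, "S"), (3, 4, "D")]  # disconnected; O unmatched
print(links_ok(words, good))  # True
print(links_ok(words, bad))   # False
```

A real link parser searches for such a linkage rather than merely checking one, but the constraints it must satisfy are exactly those encoded here.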
To illustrate the role of syntactic sense disambiguation, we will use alternate formulas
for one of the words in the example: the verb sense of "snake." We then have
Words          Formula
a, the         D+
snake_N, cat   D- & (O- or S+)
chased         S- & O+
snake_V        S-
The variables to be used in parsing this sentence are, for each word:
1. the features in the Agreement structure of the word (for any of its senses)
2. the words matching each of the connectors of the word
For example,
1. For "snake," there are features for "word that links to D-", "word that links to 0-" and "word
that links to 8+". There are also features for "tense" and "person".
2. For "the", the only feature is "word that links to D+". No features for Agreement are needed.
The nature of linkage imposes constraints on the variable assignments; for instance, if "the"
is assigned as the value of the "word that links to D-" feature of "snake", then "snake" must be
assigned as the value of the "word that links to D+" feature of "the."
The rules of link grammar impose additional constraints — i.e. the planarity, connectivity,
ordering and exclusion metarules described in Sleator and Temperley's papers. Planarity means
that links don't cross - a rule that S&T's parser enforces with absoluteness, whereas we have
found it is probably better to impose it as a probabilistic constraint, since sometimes it's
really nice to let links cross (the representation of conjunctions is one example). Connectivity
means that the links and words of a sentence must form a connected graph - all the words
must be linked into the other words in the sentence via some path. Again connectivity is a
valuable constraint but in some cases one wants to relax it - if one just can't understand
the whole sentence, one may wish to understand at least some parts of it, meaning that one
has a disconnected graph whose components are the phrases of the sentence that have been
successfully comprehended. Finally, linguistic transformations may potentially be applied while
checking if these constraints are fulfilled (that is, instead of just checking if the constraints are
fulfilled, one may check if the constraints are fulfilled after one or more transformations are
performed.)
We will use the term "Agreement" to refer to "person" values or ordered pairs (tense, person),
and NAGR to refer to the number of agreement values (12-40, perhaps, in most realistic linguis-
tic theories). Agreement may be dealt with alongside the connector constraints. For instance,
"chased" has the Agreement values (past, third person), and it has the constraint that its S-
argument must match the person component of its Agreement structure.
Semantic restrictions may be imposed in the same framework. For instance, it may be known
that the subject of "chased" is generally animate. In that case, we'd say
Words          Formula
a, the         D+
snake_N, cat   D- & (O- or S+)
chased         (S- & Inheritance animate <.8>) & O+
snake_V        S-
where we've added the modifier (Inheritance animate) to the S- connector of the verb "chased,"
to indicate that, with strength .8, the word connecting to this S- connector should denote
something inheriting from "animate." In this example, "snake" and "cat" inherit from "animate",
so the probabilistic restriction doesn't help the parser any. If the sentence were instead
The snake in the hat chased the car
then the "animate" constraint would tell the parsing process not to start out by trying to connect
"hat" to "chased", because the connection is semantically unlikely.
44.4.1 Link Grammar vs. Phrase Structure Grammar
Before proceeding further, it's worth making a couple observations about the relationship be-
tween link grammars and typical phrase structure grammars. These could also be formulated as
observations about the relationship between dependency grammars and phrase structure gram-
mars, but that gets a little more complicated as there are many kinds of dependency grammars
with different properties; for simplicity we will restrict our discussion here to the link grammar
that we actually use in OpenCog. Two useful observations may be:
1. Link grammar formulas correspond to grammatical categories. For example, the link structure
for "chased" is "S- & O+." In categorial grammar, this would seem to mean that "
'chased' belongs to the category of words with link structure 'S- & O+'." In other words,
each "formula" in link grammar corresponds to a category of words attached to that formula.
2. Links to words might as well be interpreted as links to phrases headed by those words.
For example, in the sentence "the cat chased a snake", there's an O-link from "chased" to
"snake." This might as well be interpreted as "there's an O-link from the phrase headed
by `chased' to the phrase headed by `snake'." Link grammar simplifies things by implicitly
identifying each phrase by its head.
Based on these observations, one could look at phrase structure as implicit in a link parse; and
this does make sense, but also leads to some linguistic complexities that we won't enter into
here.
Fig. 44.1: Dependency and Phrase-Structure Parses
[figure omitted: a dependency parse (above) and a phrase-structure parse (below) of "the man that came eats bananas with a fork"]
A comparison of dependency (above) and phrase-structure (below) parses. In general, one can be converted to
the other (algorithmically); dependency grammars tend to be easier to understand.
(Image taken from G. Schneider, "Learning to Disambiguate Syntactic Relations", Linguistik online 17, 5/03)
44.5 The RelEx Framework for Natural Language Comprehension
Now we move forward in the pipeline from syntax toward semantics. The NL comprehension
framework provided with OpenCog at its inception in 2008 is RelEx, an English-language se-
mantic relationship extractor, which consists of two main components: the dependency extractor
and the relationship extractor. It can identify subject, object, indirect object and many other
dependency relationships between words in a sentence; it generates dependency trees, resem-
bling those of dependency grammars. In 2012 we are in the process of replacing RelEx with a
different approach that we believe will be more amenable to generalization based on experience.
Here we will describe both approaches.
The overall processing scheme of RelEx is shown in Figure 44.2.
The dependency extractor component carries out dependency grammar parsing via a customized
version of Sleator and Temperley's open-source link parser, as reviewed above. The
link parser outputs several parses, and the dependencies of the best one are taken. The rela-
tionship extractor component is composed of a number of template matching algorithms that
act upon the link parser's output to produce a semantic interpretation of the parse. It contains
three steps:
Fig. 44.2: An Overview of the RelEx Architecture for Language Comprehension
1. Convert the Link Parser output to a feature structure representation
2. Execute the Sentence Algorithm Applier, which contains a series of Sentence Algorithms,
to modify the feature structure.
3. Extract the final output representation by traversing the feature structure.
A feature structure, in the RelEx context, is a directed graph in which each node contains
either a value, or an unordered list of features. A feature is just a labeled link to another node.
The Sentence Algorithm Applier loads a list of SentenceAlgorithms from the algorithm definition
file, and the SentenceAlgorithms are executed in the order they are listed in the file. RelEx
iterates through every single feature node in the feature structure, and attempts to apply the
algorithm to each node. Then the modified feature structures are used to generate the final
RelEx semantic relationships.
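A minimal version of this feature-structure machinery might look as follows. This is a sketch with invented names; the real RelEx sentence algorithms are far richer, but the shape is the same: nodes hold either a value or labeled links, and each algorithm is tried on every node in turn.

```python
# Minimal sketch of a RelEx-style feature structure: each node holds either
# a leaf value or a set of named features (labeled links to other nodes).
class FeatureNode:
    def __init__(self, value=None):
        self.value = value     # leaf value, or None for interior nodes
        self.features = {}     # feature label -> FeatureNode

def all_nodes(node, seen=None):
    # walk the directed graph without revisiting shared nodes
    seen = set() if seen is None else seen
    if id(node) in seen:
        return
    seen.add(id(node))
    yield node
    for child in node.features.values():
        yield from all_nodes(child, seen)

def apply_algorithms(root, algorithms):
    # algorithms run in their listed order; each is tried on every node
    for alg in algorithms:
        for node in list(all_nodes(root)):
            alg(node)

# a hypothetical sentence algorithm: annotate nouns with grammatical number
def mark_number(node):
    pos = node.features.get("pos")
    if pos is not None and pos.value == "noun":
        word = node.features["word"].value
        node.features["number"] = FeatureNode(
            "plural" if word.endswith("s") else "singular")

root = FeatureNode()
root.features["pos"] = FeatureNode("noun")
root.features["word"] = FeatureNode("cats")
apply_algorithms(root, [mark_number])
print(root.features["number"].value)  # plural
```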
44.5.1 RelEx2Frame: Mapping Syntactico-Semantic Relationships
into FrameNet-Based Logical Relationships
Next in the current OpenCog NL comprehension pipeline, the RelEx2Frame component uses
hand-coded rules to map RelEx output into sets of relationships utilizing FrameNet and other
similar semantic resources. This is definitely viewed as a "stopgap" without a role in a human-level
AGI system, but it's described here because it's part of the current OpenCog system and
is now being used together with other OpenCog components in practical projects, including
some with proto-AGI intentions.
The syntax currently used for describing semantic relationships drawn from FrameNet and
other sources is exemplified by the example
^1_Benefit:Benefitor(give,$var1)
The ^n prefix indicates the data source, where 1 is a number indicating that the resource is FrameNet.
The "give" indicates the word in the original sentence from which the relationship is drawn,
that embodies the given semantic relationship. So far the resources we've utilized are:
1. FrameNet
2. Custom relationship names
but using other resources in future is quite possible.
An example using a custom relationship would be:
^2_inheritance($var1,$var2)
which defines an inheritance relationship: something that is part of CogPrime's ontology but
not part of FrameNet.
The "Benefit" part of the first example indicates the frame indicated, and the "Benefitor"
indicates the frame element indicated. This distinction (frame vs. frame element) is particular to
FrameNet; other knowledge resources might use a different sort of identifier. In general, whatever
lies between the underscore and the initial parenthesis should be considered as particular to the
knowledge-resource in question, and may have different format and semantics depending on the
knowledge resource (but shouldn't contain parentheses or underscores unless those are preceded
by an escape character).
As an example, consider:
Put the ball on the table
Here the RelEx output is:
imperative(put) [1]
_obj(put, ball) [1]
on(put, table) [1]
singular(ball) [1]
singular(table) [1]
The relevant FrameNet Mapping Rules are:
$var0 = ball
$var1 = table
# IF imperative(put) THEN ^1_Placing:Agent(put,you)
# IF _obj(put,$var0) THEN ^1_Placing:Theme(put,$var0)
# IF on(put,$var1) & _obj(put,$var0) THEN ^1_Placing:Goal(put,$var1) \
     ^1_Locative_relation:Figure($var0) ^1_Locative_relation:Ground($var1)
Finally, the output FrameNet Mapping is:
^1_Placing:Agent(put,you)
^1_Placing:Theme(put,ball)
^1_Placing:Goal(put,table)
^1_Locative_relation:Figure(put,ball)
^1_Locative_relation:Ground(put,table)
The textual syntax used for the hand-coded rules mapping RelEx to FrameNet, at the
moment, looks like:
# IF imperative(put) THEN ^1_Placing:Agent(put,you)
# IF _obj(put,$var0) THEN ^1_Placing:Theme(put,$var0)
# IF on(put,$var1) & _obj(put,$var0) THEN ^1_Placing:Goal(put,$var1) \
     ^1_Locative_relation:Figure($var0) ^1_Locative_relation:Ground($var1)
Basically, this means each rule looks like
# IF condition THEN action
where the condition is a series of RelEx relationships, and the action is a series of FrameNet
relationships. The arguments of the relationships may be words or may be variables, in which
case their names must start with $. The only variables appearing in the action should be ones
that appeared in the condition.
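The condition-matching and action-instantiation semantics just described can be sketched in Python; this is a hypothetical simplification (the parse_rel, match and apply_rule helpers are ours, not part of RelEx2Frame), treating rules as lists of relationship strings and unifying $-variables against the sentence's ground relationships:

```python
def parse_rel(text):
    # "rel(arg1,arg2)" -> ("rel", ["arg1", "arg2"])
    name, args = text.split("(", 1)
    return name.strip(), [a.strip() for a in args.rstrip(")").split(",")]

def match(condition, relset):
    """Unify condition relations (whose arguments may be $-variables)
    against a set of ground relations; return a binding dict or None."""
    def unify(conds, binding):
        if not conds:
            return binding
        cname, cargs = conds[0]
        for rname, rargs in relset:
            if rname != cname or len(rargs) != len(cargs):
                continue
            b, ok = dict(binding), True
            for ca, ra in zip(cargs, rargs):
                if ca.startswith("$"):
                    if b.setdefault(ca, ra) != ra:  # variable must bind consistently
                        ok = False
                        break
                elif ca != ra:
                    ok = False
                    break
            if ok:
                result = unify(conds[1:], b)
                if result is not None:
                    return result
        return None
    return unify([parse_rel(c) for c in condition], {})

def apply_rule(condition, action, relset):
    """Fire a rule: if the condition matches, instantiate the action."""
    binding = match(condition, relset)
    if binding is None:
        return []
    out = []
    for a in action:
        for var, val in binding.items():
            a = a.replace(var, val)
        out.append(a)
    return out

relex = [parse_rel(r) for r in
         ["imperative(put)", "_obj(put,ball)", "on(put,table)"]]
print(apply_rule(["on(put,$var1)", "_obj(put,$var0)"],
                 ["^1_Placing:Goal(put,$var1)"], relex))
# -> ['^1_Placing:Goal(put,table)']
```

The real RelEx2Frame engine is of course more involved, but the core operation is this kind of variable unification followed by substitution into the action template.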
44.5.2 A Priori Probabilities For Rules
It can be useful to attach a priori, heuristic probabilities to RelEx2Frame rules, say
# IF _obj(put,$var0) THEN ^1_Placing:Theme(put,$var0) <.5>
to denote that the a priori probability for the rule is 0.5.
This is a crude mechanism because the probability of a rule being useful, in reality, depends
so much on context; but it still has some nonzero value.
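Parsing such an annotation out of a rule line is straightforward; a minimal sketch (the function name is ours, and the `<p>` trailing-annotation syntax is taken from the example above):

```python
import re

def parse_rule_probability(line):
    """Split a rule line into (rule text, a priori probability);
    rules without a trailing <p> annotation default to 1.0."""
    m = re.match(r"^(.*?)\s*<([\d.]+)>\s*$", line)
    if m:
        return m.group(1), float(m.group(2))
    return line.strip(), 1.0
```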
44.5.3 Exclusions Between Rules
It may also be useful to specify that two rules can't semantically-consistently be applied to the
same RelEx relationship. To do this, we need to associate rules with labels, and then specify
exclusion relationships such as
# IF on(put,$var1) & _obj(put,$var0) THEN ^1_Placing:Goal(put,$var1) \
     ^1_Locative_relation:Figure($var0) ^1_Locative_relation:Ground($var1) [1]
# IF on(put,$var1) & _subj(put,$var0) THEN \
     ^1_Performing_arts:Performance(put,$var1) \
     ^1_Performing_arts:Performer(put,$var0) [2]
# EXCLUSION 1 2
• An escape character \ must be used to handle cases where the character $ starts a word.
In this example, Rule 1 would apply to "He put the ball on the table", whereas Rule 2 would
apply to "He put on a show". The exclusion says that generally these two rules shouldn't
be applied to the same situation. Of course some jokes, poetic expressions, etc., may involve
applying excluded rules in parallel.
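Enforcing such exclusions might look like the following sketch (the labels and keep-first tie-break are our assumptions; a real system might instead score the competing rule applications, e.g. by their a priori probabilities):

```python
def filter_exclusions(fired, exclusions):
    """fired: dict mapping rule label -> list of output relationships
    (empty list if the rule did not fire on this relation set).
    exclusions: list of (label_a, label_b) pairs. When both members of
    an excluded pair fired on the same relation set, keep only the
    first-listed rule (an arbitrary tie-break for this sketch)."""
    kept = {label: out for label, out in fired.items() if out}
    for a, b in exclusions:
        if a in kept and b in kept:
            del kept[b]
    return kept
```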
44.5.4 Handling Multiple Prepositional Relationships
Finally, one complexity arising in such rules is exemplified by the sentence:
"Bob says killing for the Mafia beats killing for the government"
whose RelEx mapping looks like
uncountable(Bob) [6]
present(says) [6]
_subj(says, Bob) [6]
_that(says, beats) [3]
uncountable(killing) [6]
for(killing, Mafia) [3]
singular(Mafia) [6]
definite(Mafia) [6]
hyp(beats) [3]
present(beats) [5]
_subj(beats, killing) [3]
_obj(beats, killing_1) [5]
uncountable(killing_1) [5]
for(killing_1, government) [2]
definite(government) [6]
In this case there are two instances of "for". The output of RelEx2Frame must thus take care to
distinguish the two different for's (or we might want to modify RelEx to make this distinction).
The mechanism currently used for this is to subscript the for's, as in
uncountable(Bob) [6]
present(says) [6]
_subj(says, Bob) [6]
_that(says, beats) [3]
uncountable(killing) [6]
for(killing, Mafia) [3]
singular(Mafia) [6]
definite(Mafia) [6]
hyp(beats) [3]
present(beats) [6]
_subj(beats, killing) [3]
_obj(beats, killing_1) [5]
uncountable(killing_1) [5]
for_1(killing_1, government) [2]
definite(government) [6]
so that upon applying the rule:
# IF for($var0,$var1) & (present($var0) OR past($var0) OR future($var0)) \
     THEN ^2_Benefit:Benefitor(for,$var1) ^2_Benefit:Act(for,$var0)
we obtain
^2_Benefit:Benefitor(for,Mafia)
^2_Benefit:Act(for,killing)
^2_Benefit:Benefitor(for_1,government)
^2_Benefit:Act(for_1,killing_1)
Here the first argument of the output relationships allows us to correctly associate the dif-
ferent acts of killing with the different benefitors.
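The subscripting mechanism itself is simple to sketch (the function name is ours):

```python
from collections import Counter

def subscript_duplicates(relations):
    """Rename repeated relation names so each occurrence is unique:
    the second 'for' becomes 'for_1', the third 'for_2', and so on.
    relations: list of (name, args) pairs in sentence order."""
    seen = Counter()
    out = []
    for name, args in relations:
        out.append((f"{name}_{seen[name]}" if seen[name] else name, args))
        seen[name] += 1
    return out
```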
44.5.5 Comparatives and Phantom Nodes
Next, a bit of subtlety is needed to deal with sentences like
Mike eats more cookies than Ben.
which RelEx handles via
_subj(eat, Mike)
_obj(eat, cookie)
more(cookie, $cVar0)
$cVar0(Ben)
Then a RelEx2Frame mapping rule such as:
IF
_subj(eat,$var0)
_obj(eat,$var1)
more($var1,$cVar0)
$cVar0($var2)
THEN
^2_AsymmetricEvaluativeComparison:ProfiledItem(more,$var1)
^2_AsymmetricEvaluativeComparison:StandardItem(more,$var1_1)
^2_AsymmetricEvaluativeComparison:Valence(more,more)
^1_Ingestion:Ingestor(eat,$var0)
^1_Ingestion:Ingested(eat,$var1)
^1_Ingestion:Ingestor(eat_1,$var2)
^1_Ingestion:Ingested(eat_1,$var1_1)
applies, which embodies the commonsense intuition about comparisons regarding eating. (Note
that we have introduced a new frame AsymmetricEvaluativeComparison here, by analogy to
the standard FrameNet frame Evaluative_comparison.)
Note also that the above rule may be too specialized, though it's not incorrect. One could
also try more general rules like
EFTA00624538
392 44 Natural Language Comprehension
IF
%Agent($var0)
%Agent($var1)
_subj($var3,$var0)
_obj($var3,$var1)
more($var1,$cVar0)
$cVar0($var2)
THEN
^2_AsymmetricEvaluativeComparison:ProfiledItem(more,$var1)
^2_AsymmetricEvaluativeComparison:StandardItem(more,$var1_1)
^2_AsymmetricEvaluativeComparison:Valence(more,more)
_subj($var3,$var0)
_obj($var3,$var1)
_subj($var3_1,$var2)
_obj($var3_1,$var1_1)
However, this rule is a little different from most RelEx2Frame rules, in that it produces output
that then needs to be processed by the RelEx2Frame rule-base a second time. There's nothing
wrong with this; it's just an added layer of complexity.
44.6 Frame2Atom
The next step in the current OpenCog NLP comprehension pipeline is to translate the output
of RelEx2Frame into Atoms. This may be done in a variety of ways; the current Frame2Atom
script embodies one approach that has proved workable, but is certainly not the only useful
one.
The Node types currently used in Frame2Atom are:
• WordNode
• ConceptNode
  - DefinedFrameNode
  - DefinedLinguisticConceptNode
• PredicateNode
  - DefinedFrameElementNode
  - DefinedLinguisticRelationshipNode
• SpecificEntityNode
The special node types
• DefinedFrameNode
• DefinedFrameElementNode
have been created to correspond to FrameNet frames and elements respectively (or frames and
elements drawn from similar resources to FrameNet, such as our own frame dictionary).
Similarly, the special node types
• DefinedLinguisticConceptNode
• DefinedLinguisticRelationshipNode
have been created to correspond to RelEx unary and binary relationships respectively.
The "defined" is in the names because once we have a more advanced CogPrime system, it
will be able to learn its own frames, frame elements, linguistic concepts and relationships. But
what distinguishes these "defined" Atoms is that they have names which correspond to specific
external resources.
The Link types we need for Frame2Atom are:
• InheritanceLink
• ReferenceLink (currently using WRLink, aka "word reference link")
• FrameElementLink
ReferenceLink is a special link type for connecting concepts to the words that they refer to.
(This could be eliminated via using more complex constructs, but it's a very common case so
for practical purposes it makes sense to define it as a link type.)
FrameElementLink is a special link type connecting a frame to its element. Its semantics
(and how it could be eliminated at cost of increased memory and complexity) will be explained
below.
44.6.1 Examples of Frame2Atom
Below follow some examples to illustrate the nature of the mapping intended. The examples
include a lot of explanatory discussion as well.
Note that, in these examples, [n] denotes an Atom with AtomHandle n. All Atoms have Han-
dles, but Handles are only denoted in cases where this seems useful. (In the XML representation
used in the current OpenCogPrime implementation, these are replaced by UUIDs.)
The notation
WordNode#pig
denotes a WordNode with name pig, and a similar convention is used for other AtomTypes
whose names are useful to know.
These examples pertain to fragments of the parse
Ben slowly ate the fat chickens.
A:_advmod:V(slowly:A, eat:V)
N:_nn:N(fat:N, chicken:N)
N:definite(Ben:N)
N:definite(chicken:N)
N:masculine(Ben:N)
N:person(Ben:N)
N:plural(chicken:N)
N:singular(Ben:N)
V:_obj:N(eat:V, chicken:N)
V:_subj:N(eat:V, Ben:N)
V:past(eat:V)
^1_Ingestion:Ingestor(eat,Ben)
^1_Temporal_colocation:Event(past,eat)
^1_Ingestion:Ingestibles(eat,chicken)
^1_Activity:Agent(subject,Ben)
^1_Activity:Activity(verb,eat)
^1_Transitive_action:Event(verb,eat)
^1_Transitive_action:Patient(object,chicken)
44.6.1.1 Example 1
_obj(eat,chicken)
would map into
EvaluationLink
   DefinedLinguisticRelationshipNode #_obj
   ListLink
      ConceptNode [2]
      ConceptNode [3]
InheritanceLink
   [2]
   ConceptNode [4]
InheritanceLink
   [3]
   ConceptNode [5]
ReferenceLink [6]
   WordNode #eat [8]
   [4]
ReferenceLink [7]
   WordNode #chicken [9]
   [5]
Please note that the Atoms labeled 4,5,6,7,8,9 would not normally have to be created when
entering the relationship
_obj(eat,chicken)
into the AtomTable. They should already be there, assuming the system already knows about
the concepts of eating and chickens. These would need to be newly created only if the system
had never seen these words before.
For instance, the Atom [2] represents the specific instance of "eat" involved in the relationship
being entered into the system. The Atom [4] represents the general concept of "eat", which is
what is linked to the word "eat."
Note that a very simple step of inference, from these Atoms, would lead to the conclusion
EvaluationLink
   DefinedLinguisticRelationshipNode #_obj
   ListLink
      ConceptNode [4]
      ConceptNode [5]
which represents the general statement that chickens are eaten. This is such an obvious and
important step, that perhaps as soon as the relationship _obj(eat, chicken) is entered into the
system, it should immediately be carried out (i.e. that link if not present should be created,
and if present should have its truth value updated). This is a choice to be implemented in the
specific scripts or schema that deal with ingestion of natural language text.
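That immediate generalization step could be sketched as follows; the instance-to-concept map and the count-based stand-in for a truth-value update are simplifying assumptions of this sketch, not CogPrime's actual truth-value revision:

```python
def generalize_relation(instance_rel, concept_of, general_links):
    """instance_rel: (relation, instance_a, instance_b) over instance
    Atoms. concept_of: dict mapping each instance Atom to its general
    concept (i.e. following its InheritanceLink). Create or strengthen
    the corresponding general-level EvaluationLink; here a raw
    observation count stands in for a real truth-value update."""
    rel, a, b = instance_rel
    key = (rel, concept_of[a], concept_of[b])
    general_links[key] = general_links.get(key, 0) + 1
    return key
```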
44.6.1.2 Example 2
masculine(Ben)
would map into
InheritanceLink
   SpecificEntityNode [40]
   DefinedLinguisticConceptNode #masculine
InheritanceLink
   [40]
   [10]
ReferenceLink
   WordNode #Ben
   [10]
44.6.1.3 Example 3
The mapping of the RelEx2Frame output
Ingestion:Ingestor(eat, Ben)
would use the existing Atoms
DefinedFrameNode #Ingestion [11]
DefinedFrameElementNode #Ingestion:Ingestor [12]
which would be related via
FrameElementLink [11] [12]
(Note that FrameElementLink may in principle be reduced to more elementary PLN link types.)
Note that each FrameNet frame contains some core elements and some optional elements.
This may be handled by giving core elements links such as
FrameElementLink F E <1>
and optional ones links such as
FrameElementLink F E <.7>
Getting back to the example at hand, we would then have
InheritanceLink [2] [11]
(recall, [2] is the instance of eating involved in Example 1, and [11] is the Ingestion frame),
which says that this instance of eating is an instance of ingestion. (In principle, some instances
of eating might not be instances of ingestion - or more generally, we can't assume that all
instances of a given concept will always associate with the same FrameNodes. This could be
assumed only if we assumed all word-associated concepts were disambiguated to a single known
FrameNet frame, but this can't be assumed, especially if later on we want to use cognitive
processes to do sense disambiguation.)
We would then also have links denoting the role of Ben as an Ingestor in the frame-instance
[2], i.e.
EvaluationLink
   DefinedFrameElementNode #Ingestion:Ingestor [12]
ListLink
[2]
[40]
This says that the specific instance of Ben observed in that sentence ([40]) served the role of
Ingestion:Ingestor in regard to the frame-instance [2] (which is an instance of eating, which is
known to be an instance of the frame of Ingestion).
44.6.2 Issues Involving Disambiguation
Right now, OpenCogPrime's RelEx2Frame rulebase is far from adequately large (there are
currently around 5000 rules) and the link parser and RelEx are also imperfect. The current
OpenCog NLP system does work, but for complex sentences it tends to generate too many
interpretations of each sentence - "parse selection" or more generally "interpretation selection"
is not yet adequately addressed. This is a tricky issue that can be addressed to some extent via
statistical linguistics methods, but we believe that to solve it convincingly and thoroughly will
require more cognitively sophisticated methods.
The most straightforward way to approach it statistically is to process a large number of
sentences, and then tabulate co-occurrence probabilities of different relationships across all the
sentences. This allows one to calculate the probability of a given interpretation conditional on
the corpus, via looking at the probabilities of the combinations of relationships in the inter-
pretation. This may be done using a Bayes Net or using PLN - in any case the problem is
one of calculating the probability of a conjunction of terms based on knowledge regarding the
probabilities of various sub-conjunctions. As this method doesn't require marked-up training
data, but is rather purely unsupervised, it's feasible to apply it to a very large corpus of text -
the only cost is computer time.
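A toy version of this co-occurrence scoring, under a naive pairwise-independence assumption, might look as follows (the function name and add-alpha smoothing scheme are ours, not the actual Bayes net or PLN treatment):

```python
import math
from itertools import combinations

def interpretation_logprob(relations, pair_counts, total, alpha=1.0):
    """Score an interpretation (a set of relationship strings) by
    summing the log co-occurrence probabilities of its relationship
    pairs, with add-alpha smoothing so unseen combinations are
    penalized rather than zeroed out entirely."""
    score = 0.0
    for a, b in combinations(sorted(relations), 2):
        c = pair_counts.get((a, b), 0)
        score += math.log((c + alpha) / (total + alpha))
    return score
```

Interpretations containing relationship combinations frequently seen together in the corpus then score higher than interpretations built from rare combinations.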
What the statistical approach won't handle, though, are the more conceptually original lin-
guistic constructs, containing combinations that didn't occur frequently in the system's training
corpus. It will rate innovative semantic constructs as unlikely, which will lead it to errors some-
times - errors of choosing an interpretation that seems odd in terms of the sentence's real-world
interpretation, but matches well with things the system has seen before. The only way to solve
this is with genuine understanding - with the system reasoning on each of the interpretations
and seeing which one makes more sense. And this kind of reasoning generally requires some
relevant commonsense background knowledge - which must be gained via experience, reading
and conversing, or from a hand-coded knowledge base, or via some combination of the above.
Related issues also involving disambiguation include word sense disambiguation (words with
multiple meanings) and anaphor resolution (recognizing the referents of pronouns, and of nouns
that refer to other nouns, etc.).
The current RelEx system contains a simple statistical parse ranker (which rates a parse
higher if the links it includes occur more frequently in a large parsed corpus), statistical methods
for word sense disambiguation inspired by those in Rada Mihalcea's work, and an anaphor
resolution algorithm based on the classic Hobbs Algorithm (customized to work with the link
parser) [Hob78]. While reasonably effective in many cases, from an AGI perspective
these must all be considered "stopgaps" to be replaced with code that handles these tasks
using probabilistic inference. It is conceptually straightforward to replace statistical linguistic
algorithms with comparable PLN-based methods; however, significant attention must be paid to
code optimization as using a more general algorithm is rarely as efficient as using a specialized
one. But once one is handling things in PLN and the Atomspace rather than in specialized
computational linguistics code, there is the opportunity to use a variety of inference rules for
generalization, analogy and so forth, which enables a radically more robust form of linguistic
intelligence.
44.7 Syn2Sem: A Semi-Supervised Alternative to RelEx and
RelEx2Frame
This section describes an alternative approach to the RelEx / RelEx2Frame approach described
above, which is in the midst of implementation at time of writing. This alternative represents
a sort of midway point between the rule-based RelEx / RelEx2Frame approach, and a concep-
tually ideal fully experiential learning based approach.
The motivations underlying this alternative approach have been to create an OpenCog NLP
system with the capability to:
• support simple dialogue in a video game like world, and a robot system
• leverage primarily semi-supervised experiential learning
• replace the RelEx2Frame rules, which are currently problematic, with a different way of
mapping syntactic relationships into Atoms, that is still reasoning and learning friendly
• require only relatively modest effort for implementation (not multiple human-years)
The latter requirement ruled out a pure "learn language from experience with no aid from
computational linguistics tools" approach, which may well happen within OpenCog at some
point.
44.8 Mapping Link Parses into Atom Structures
The core idea of the new approach is to learn "Syn2Sem" rules that map link parses into Atom
structures. These rules may then be automatically reversed to form Sem2Syn rules, which may
be used in language generation.
Note that this is different from the RelEx approach as currently pursued (the "old approach"),
which contains
• one set of rules (the RelEx rules) mapping link parses into semantic relation-sets ("RelEx
relation-sets" or rel-sets)
• another set of rules (the RelEx2Frame rules) mapping rel-sets into FrameNet-based relation-
sets
• another set of rules (the Frame2Atom rules) mapping FrameNet-based relation-sets into
Atom-sets
In the old approach, all the rules were hand-coded. In the new approach
• nothing needs to be hand-coded (except the existing link parser dictionary); the rules can be
learned from a corpus of (link-parse, Atom-set) pairs. This corpus may be human-created;
or may be derived via a system's experience in some domain where sentences are heard or
read, and can be correlated with observed nonlinguistic structures that can be described by
Atoms.
• in practice, some hand-coded rules are being created to map RelEx rel-sets into Atom-sets
directly (bypassing RelEx2Frame) in a simple way. These rules will be used, together with
RelEx, to create a large corpus of (link parse, Atom-set) pairs, which will be used as a
training corpus. This training corpus will have more errors than a hand-created corpus, but
will have the compensating advantage of being significantly larger than any hand-created
corpus would feasibly be.
In the old approach, NL generation was done by using a pattern-matching approach, applied
to a corpus of (link parse, rel-set) pairs, to mine rules mapping rel-sets to sets of link parser
links. This worked to an extent, but the process of piecing together the generated sets of link
parser links to form coherent "sentence parses" (that could then be turned into sentences)
turned out to be subtler than expected, and appeared to require an escalatingly complex set of
hand-coded rules to be extended beyond simple cases.
In the new approach, NL generation is done by explicitly reversing the mapping rules learned
for mapping link parses into Atom sets. This is possible because the rules are explicitly given in
a form enabling easy reversal; whereas in the old approach, RelEx transformed link parses into
rel-sets using a process of successively applying many rules to an ornamented tree, each rule
acting on variables ("ornaments") deposited by previous rules. Put simply, RelEx transformed
link parses into rel-sets via imperative programming, whereas in the new approach, link parses
are transformed into Atom-sets using learned rules that are logical in nature. The movement
from imperative to logical style dramatically eases automated rule reversal.
44.8.1 Example Training Pair
For concreteness, an example (link parse, Atom-set) pair would be as follows. For the sentence
"Trains move quickly", the link parse looks like
Sp(trains, move)
MVa(move, quickly)
whereas the Atom-set looks like
Inheritance
   move_1
   move
Evaluation
   move_1
   train
Inheritance
   move_1
   quick
Rule learning proceeds, in the new approach, from a corpus consisting of such pairs.
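As a drastically simplified stand-in for such learning, one can at least tally which link-parser link types co-occur with which Atom link types across the corpus (the function and the flat triple encodings are our assumptions; real learning works over subgraphs, not type counts):

```python
def mine_cooccurrence(corpus):
    """corpus: list of (link_parse, atom_set) pairs, where link_parse
    is a list of (link_type, word1, word2) triples and atom_set is a
    list of (atom_link_type, target, source) triples. Count
    link-type / Atom-link-type co-occurrences across all pairs."""
    counts = {}
    for links, atoms in corpus:
        for ltype, _, _ in links:
            for atype, _, _ in atoms:
                counts[(ltype, atype)] = counts.get((ltype, atype), 0) + 1
    return counts
```

Using the "Trains move quickly" pair above, such a tally would already reveal that Sp links co-occur with both Inheritance and Evaluation Atoms; frequent subgraph mining (Section 44.12) refines this to full structural rules.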
44.9 Making a Training Corpus
44.9.1 Leveraging RelEx to Create a Training Corpus
To create a substantial training corpus for the new approach, we are leveraging the existence
of RelEx. We have a large corpus of sentences parsed by the link parser and then processed by
RelEx. A new collection of rules is being created, RelEx2Atom, that directly translates RelEx
parses into Atoms, in a simple way, embodying the minimal necessary degree of disambiguation
(in a sense to be described just below). Using these RelEx2Atom rules, one can transform a
corpus of (link parse, RelEx rel-set) pairs into a corpus of (link parse, Atom-set) pairs - which
can then be used as training data for learning Syn2Sem rules.
44.9.2 Making an Experience Based Training Corpus
An alternate approach to making a training corpus would be to utilize a virtual world such as
the Unity3D world now being used for OpenCog game AI research and development.
A human game-player could create a training corpus by repeatedly:
• typing in a sentence
• indicating, via the graphic user interface, which entities or events in the virtual world were
referred to by the sentence
Since OpenCog possesses code for transforming entities and events in the virtual world into
Atom-sets, this would implicitly produce a training corpus of (sentence, Atom-set) pairs, which
using the link parser could then be transformed into (link parse, Atom-set) pairs.
44.9.3 Unsupervised, Experience Based Corpus Creation
One could also dispense with the explicit reference-indication GUI, and just have a user type
sentences to the AI agent as the latter proceeds through the virtual world. The AI agent would
then have to figure out what specifically the sentences were referring to - maybe the human-
controlled avatar is pointing at something; maybe one thing recently changed in the game world
and nothing else did; etc. This mode of corpus creation would be reasonably similar to human
first language learning in format (though of course there are many differences from human
first language learning in the overall approach, for instance we are assuming the link parser,
whereas a human language learner has to learn grammar for themselves, based on complex and
ill-understood genetically encoded prior probabilistic knowledge regarding the likely aspects of
the grammar to be learned).
This seems a very interesting direction to explore later on, but at time of writing we are
proceeding with the RelEx-based training corpus, for sake of simplicity and speed of development.
44.10 Limiting the Degree of Disambiguation Attempted
The old approach is in a sense more ambitious than the new approach, because the RelEx2Frame
rules attempt to perform a deeper and more thorough level of semantic disambiguation than
the new rules. However, the RelEx2Frame rule-set in its current state is too "noisy" to be really
useful; it would need dramatic improvement to be helpful in practice. The key difference is that,
• In the new approach, the syntax-to-semantics mapping rules attempt only the disambigua-
tion that needs to be done to get the structure of the resultant Atom-set correct. Any
further disambiguation is left to be done later, by MindAgents acting on the Atom-sets
after they've already been placed in the AtomSpace.
• In the old approach, the RelEx2Frame rules attempted, in many cases, to disambiguate
between different meanings beyond the level needed to disambiguate the structure of the
Atom-set.
To illustrate the difference, consider the sentences
• Love moves quickly.
• Trains move quickly.
These sentences involve different senses of "move" - change in physical location, versus a more
general notion of progress. However, both sentences map to the same basic conceptual structure,
e.g.
Inheritance
   move_1
   move
Evaluation
   move_1
   train
Inheritance
   move_1
   quick
versus
Inheritance
   move_2
   move
Evaluation
   move_2
   love
Inheritance
   move_2
   quick
The RelEx2Frame rules try to distinguish between these cases via, in effect, associating the
two instances move_1 and move_2 with different frames, using hand-coded rules that map
RelEx rel-sets into appropriate Atom-sets defined in terms of FrameNet relations. This is not a
useless thing to do; however, doing it well requires a very large and well-honed rule-base. Cyc's
natural language engine attempts to do something similar, though using a different parser than
the link parser and a different ontology than FrameNet; it does a much better job than the
current version of RelEx2Frame, but still does a surprisingly incomplete job given the massive
amount of effort put into sculpting the relevant rule-sets.
The new approach does not try to perform this kind of disambiguation prior to mapping
things into Atom-sets. Rather, this kind of disambiguation is left for inference to do, after the
relevant Atoms have already been placed in the AtomSpace. The rule of thumb is: Do precisely
the disambiguation needed to map the parse into a compact, simple Atom-set, whose component
nodes correspond to English words. Let the disambiguation of the meaning of the English words
be done by some other process acting on the AtomSpace.
44.11 Rule Format
To represent Syn2Sem rules, it is convenient to represent link parses as Atom-sets. Each element
of the training corpus will then be of the form (Atom set representing link parse, Atom-set
representing semantic interpretation). Syn2Sem rules are then rules mapping Atom-sets to
Atom-sets.
Broadly speaking, the format of a Syn2Sem rule is then
Implication
Atom-set representing portion of link parse
Atom-set representing portion of semantic interpretation
44.11.1 Example Rule
A simple example rule would be
Implication
   Evaluation
      Predicate: Sp
      $V1
      $V2
   Evaluation
      $V2
      $V1
This rule, in essence, maps verbs into predicates that take their subjects as arguments.
On the other hand, a Sem2Syn rule would look like the reverse:
Implication
   Atom-set representing portion of semantic interpretation
   Atom-set representing portion of link parse
Our current approach is to begin with Syn2Sem rules because, due to the nature of natural
language, these rules will tend to be more certain. That is: it is more strongly the case in natural
languages that each syntactic construct maps into a small set of semantic structures, than that
each semantic structure is realizable only via a small set of syntactic constructs. There are
usually more structurally different, reasonably sensible ways to say an arbitrary thought than
there are structurally different, reasonably sensible ways to interpret an arbitrary sentence.
Because of this fact about language, the design of the Atom-sets in the corpus is based on the
principle of finding an Atom structure that most simply represents the meaning of the sentence
corresponding to each given link parse. Thus, there will be many Syn2Sem rules with a high
degree of certitude attached to them. On the other hand, the Sem2Syn rules will tend to have
less certitude, because there may be many different syntactic ways to realize a given semantic
expression.
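Under this representation, reversal itself is mechanical; only the attached certitude needs adjusting. A sketch (the tuple encoding and the 0.9 discount factor are placeholders of ours; real Sem2Syn strengths would be estimated from the corpus):

```python
def reverse_rule(rule):
    """A Syn2Sem rule encoded as (syntactic pattern, semantic pattern,
    strength). Reversal swaps antecedent and consequent; the strength
    is discounted because a semantic structure typically admits more
    syntactic realizations than vice versa."""
    syn, sem, strength = rule
    return (sem, syn, strength * 0.9)  # 0.9 is an assumed, not estimated, discount

syn2sem = (["Sp($V1,$V2)"], ["Evaluation($V2,$V1)"], 1.0)
sem2syn = reverse_rule(syn2sem)
```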
44.12 Rule Learning
Learning of Syn2Sem rules may be done via any algorithm that is able to search rule space for
rules of the proper format with high truth value as evaluated across the training set. Currently
we are experimenting with using OpenCogPrime's frequent subgraph mining algorithm in this
context. MOSES could also potentially be used to learn Syn2Sem rules. One suspects that
MOSES might be better than frequent subgraph mining for learning complex rules, but based
on preliminary experimentation, frequent subgraph mining seems fine for learning the simple
rules involved in simple sentences.
PLN inference may also be used to generate new rules by combining previous ones, and to
generalize rules into more abstract forms.
44.13 Creating a Cyc-Like Database via Text Mining
The discussion of these NL comprehension mechanisms leads naturally to one interesting poten-
tial application of the OpenCog NL comprehension pipeline - which is only indirectly related
to CogPrime, but would create a valuable resource for use by CogPrime if implemented. The
possibility exists to use the OpenCog NL comprehension system to create a vaguely Cyc-like
database of common-sense rules.
The approach would be as follows:
1. Get a corpus of text
2. Parse the text using OpenCog (RelEx or Syn2Sem)
3. Mine logical relationships among Atom relationships from the data thus produced, using
greedy data-mining, MOSES, or other methods
These mined logical relationships will then be loosely analogous to the rules the Cyc team have
programmed in. For instance, there will be many rules like:
# IF _subj(understand,$var0) THEN ^1_Grasp:Cognizer(understand,$var0)
# IF _subj(know,$var0) THEN ^1_Grasp:Cognizer(know,$var0)
So statistical mining would learn rules like
IF ^1_Mental_property(stupid) & ^1_Mental_property:Protagonist($var0) \
   THEN ^1_Grasp:Cognizer(understand,$var0) <.3>
IF ^1_Mental_property(smart) & ^1_Mental_property:Protagonist($var0) \
   THEN ^1_Grasp:Cognizer(understand,$var0) <.8>
which means that stupid people mentally grasp less than smart people do.
Note that these commonsense rules would come out automatically probabilistically quanti-
fied.
Note also that to make such rules come out well, one needs to do some (probabilistic)
synonym-matching on nouns, adverbs and adjectives, e.g. so that mentions of "smart", "intelli-
gent", "clever", etc. will count as instances of
^1_Mental_property(smart)
By combining probabilistic synonym matching on words with mapping RelEx output into
FrameNet input, and doing statistical mining, it should be possible to build a database like Cyc
but far more complete and with coherent probabilistic weightings.
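The synonym-collapsing step might look like the following sketch, a deterministic simplification of the probabilistic matching just described (the synonym table and function name are illustrative):

```python
from collections import Counter

def count_properties(mentions, synonym_classes):
    """Tally property words after collapsing each word to a canonical
    representative of its synonym class, so 'smart', 'intelligent' and
    'clever' all count as instances of the same mined property.
    synonym_classes: dict mapping word -> canonical word."""
    return Counter(synonym_classes.get(w, w) for w in mentions)
```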
Although this way of building a commonsense knowledge base requires a lot of human
engineering, it requires far less than something like Cyc. One "just" needs to build the
RelEx2FrameNet mapping rules, not all the commonsense knowledge relationships directly -
those come from text. We do not advocate this as a solution to the AGI problem, but merely
suggest that it could produce a large amount of useful knowledge to feed into an AGI's brain.
And of course, the better an AI one has, the better one can do the step labeled "Rank
the parses and FrameNet interpretations using inference or heuristics or both." So there is a
potential virtuous cycle here: more commonsense knowledge mined helps create a better AI
mind, which helps mine better commonsense knowledge, etc.
44.14 PROWL Grammar
We have described the crux of the NL comprehension pipeline that is currently in place in
the OpenCog codebase, plus some ideas for fairly moderate modifications or extensions. This
section is a little more speculative, and describes an alternative approach that fits better with
the overall CogPrime design, which however has not yet been implemented. The ideas given
here lead more naturally to a design for experience-based language learning and processing, a
connection that will be pointed out in a later section.
What we describe here is a partially-new theory of language formed via combining ideas
from three sources: Hudson's Word Grammar [Hud90, Hud07a], Sleator and Temperley's link
grammar, and Probabilistic Logic Networks. Reflecting its origin in these three sources, we have
named the new theory PROWL grammar, meaning PRObabilistic Word Link Grammar. We
believe PROWL has value purely as a conceptual approach to understanding language; however,
it has been developed largely from the standpoint of computational linguistics - as part of an
attempt to create a framework for computational language understanding and generation that
both
1. yields broadly adequate behavior based on hand-coding of "expert rules" such as grammat-
ical rules, combined with statistical corpus analysis
2. integrates naturally with a broader Al framework that combines language with embodied
social, experiential learning, that ultimately will allow linguistic rules derived via expert
encoding and statistical corpus analysis to be replaced with comparable, more refined rules
resulting from the system's own experience
PROWL has been developed as part of the larger CogPrime project; but, it is described in this
section mostly in a CogPrime -independent way, and is intended to be independently evaluable
(and, hopefully, valuable).
As an integration of three existing frameworks, PROWL could be presented in various differ-
ent ways. One could choose any one of the three components as an initial foundation, and then
present the combined theory, as an expansion/modification of this component. Here we choose
to present it as an expansion/modification of Word Grammar, as this is the way it originated,
and it is also the most natural approach for readers with a linguistics background. From this
perspective, to simplify a fair bit, one may describe PROWL as consisting of Word Grammar
with three major changes:
1. Word Grammar's network knowledge representation is replaced with a richer PLN-based
network knowledge representation.
a. This includes, for instance, the replacement of Word Grammar's single "isa" relationship type with a more nuanced collection of logically distinct probabilistic inheritance relationship types
2. Going along with the above, Word Grammar's "default inheritance" mechanism is replaced
by an appropriate PLN control mechanism that guides the use of standard PLN inference
rules
a. This allows the same default-inheritance based inferences that Word Grammar relies
upon, but embeds these inferences in a richer probabilistic framework that allows them
to be integrated with a wide variety of other inferences
3. Word Grammar's small set of syntactic link types is replaced with a richer set of syntactic
link types as used in Link Grammar
a. The precise optimal set of link types is not clear; it may be that the link grammar's
syntactic link type vocabulary is larger than necessary, but we also find it clear that
the current version of Word Grammar's syntactic link type vocabulary is smaller than
feasible (at least, without the addition of large, new, and as yet unspecified ideas to
Word Grammar)
In the following subsections we will review these changes in a little more detail. Basic familiarity
with Word Grammar, Link Grammar and PLN is assumed.
Note that in this section we will focus mainly on those issues that are somehow nonobvious.
This means that a host of very important topics that come along with the Word Grammar
/ PLN integration are not even mentioned. The way Word Grammar deals with morphology,
semantics and pragmatics, for instance, seems to us quite sensible and workable - and doesn't
really change at all when you integrate Word Grammar with PLN, except that Word Grammar's
crisp isa links become PLN-style probabilistic Inheritance links.
44.14.1 Brief Review of Word Grammar
Word Grammar is a theory of language structure which Richard Hudson began developing in
the early 1980's [Hud90]. While partly descended from Systemic Functional Grammar, there
are also significant differences. The main ideas of Word Grammar are as follows
• It presents language as a network of knowledge, linking concepts about words, their mean-
ings, etc. - e.g. the word "dog" is linked to the meaning 'dog', to the form /dog/, to the
word-class 'noun', etc.
• If language is a network, then it is possible to decide what kind of network it is (e.g. it
seems to be a scale-free small-world network)
• It is monostratal - only one structure per sentence, no transformations.
• It uses word-word dependencies - e.g. a noun is the subject of a verb.
• It does not use phrase structure - e.g. it does not recognise a noun phrase as the subject of
a clause, though these phrases are implicit in the dependency structure.
† The following list is paraphrased with edits from http://www.phon.ucl.ac.uk/home/dick/wg.htm,
downloaded on June 27 2010
• It shows grammatical relations/functions by explicit labels - e.g. 'subject' and 'object'.
• It uses features only for inflectional contrasts that are mentioned in agreement rules - e.g.
number but not tense or transitivity.
• It uses default inheritance, as a very general way of capturing the contrast between 'basic' or
'underlying' patterns and 'exceptions' or 'transformations' - e.g. by default, English words
follow the word they depend on, but exceptionally subjects precede it; particular cases
'inherit' the default pattern unless it is explicitly overridden by a contradictory rule.
• It views concepts as prototypes rather than 'classical' categories that can be defined by
necessary and sufficient conditions. All characteristics (i.e. all links in the network) have
equal status, though some may for pragmatic reasons be harder to override than others.
• In this network there are no clear boundaries between different areas of knowledge - e.g.
between 'lexicon' and 'grammar', or between 'linguistic meaning' and 'encyclopedic knowledge'; language is not a separate module of cognition.
• In particular, there is no clear boundary between 'internal' and 'external' facts about words,
so a grammar should be able to incorporate sociolinguistic facts - e.g. the speaker of "side-
walk" is an American.
44.14.2 Word Grammar's Logical Network Model
Word Grammar presents an elegant framework in which all the different aspects of language
are encompassed within a single knowledge network. Representationally, this network combines
two key aspects:
1. Inheritance (called is-a) is explicitly represented
2. General relationships between n-ary predicates and their arguments, including syntactic
relationships, are explicitly represented
Dynamically, the network contains two key aspects:
1. An inference rule called "default inheritance"
2. Activation-spreading, similar to that in a neural network or standard semantic network
The similarity between Word Grammar and CogPrime is fairly strong. In the latter, inheritance
and generic predicate-argument relationships are explicitly represented; and, a close analogue of
activation spreading is present in the "attention allocation" subsystem. As in Word Grammar,
important cognitive phenomena are grounded in the symbiotic combination of logical-inference
and activation-spreading dynamics.
At the most general level, the reaction of the Word Grammar network to any situation is
proposed to involve three stages:
1. Node creation and identification: of nodes representing the situation as understood, in its
most relevant aspects
2. Where choices need to be made (e.g. where an identified predicate needs to choose which
other nodes to bind to as arguments), activation spreading is used, and the most active
eligible argument is utilized (this is called "best fit binding")
3. Default inheritance is used to supply new links to the relevant nodes as necessary
Default inheritance is a process that relies on the placement of each node in a directed acyclic
graph hierarchy (dag) of isa links. The basic idea is as follows. Suppose one has a node N, and a
predicate f(N,L), where L is another argument or list of arguments. Then, if the truth value of
f(N,L) is not explicitly stored in the network, N inherits the value from any ancestor A in the
dag such that: f(A,L) is explicitly stored in the network; and there is no node P between
N and A for which f(P,L) is explicitly stored in the network. Note that multiple inheritance is
explicitly supported, and in cases where this leads to multiple assignments of truth values to a
predicate, confusion in the linguistic mind may ensue. In many cases the option coming from
the ancestor with the highest level of activity may be selected.
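The default inheritance procedure just described can be sketched as a breadth-first climb of the isa-dag. The data layout below, and breaking ties between multiple inheritance sources by activation level, are our own simplifications rather than Word Grammar's notation:

```python
def default_inherit(node, feature, facts, parents, activity):
    """Word Grammar-style default inheritance: return the value of
    `feature` stored at `node`, or else at the nearest ancestors in the
    isa-dag; among equally near ancestors, the most active one wins."""
    frontier, seen = [node], set()
    while frontier:
        stored = [(activity.get(n, 0.0), facts[n][feature])
                  for n in frontier if feature in facts.get(n, {})]
        if stored:
            return max(stored)[1]          # highest-activity source wins
        nxt = []
        for n in frontier:
            seen.add(n)
            nxt += [p for p in parents.get(n, []) if p not in seen]
        frontier = nxt
    return None                            # nothing stored anywhere up the dag

# penguin isa bird isa animal; the bird-level default is overridden at penguin
facts = {"bird": {"flies": True}, "penguin": {"flies": False}}
parents = {"penguin": ["bird"], "bird": ["animal"]}
print(default_inherit("penguin", "flies", facts, parents, {}))  # False
print(default_inherit("bird", "flies", facts, parents, {}))     # True
```

Because the nearest generation of ancestors is consulted first, an explicitly stored exception always shadows a default stored higher in the dag, which is the behavior the text describes.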
Our suggestion is that Word Grammar's network representation may be replaced with PLN's
logical network representation without any loss, and with significant gain. Word Grammar's
network representation has not been fleshed out as thoroughly as that of PLN, it does not
handle uncertainty, and it is not associated with general mechanisms for inference. The one
nontrivial issue that must be addressed in porting Word Grammar to the PLN representation
is the role of default inheritance in Word Grammar. This is covered in the following subsection.
The integration of activation spreading and default inheritance proposed in Word Gram-
mar, should be easily achievable within CogPrime assuming a functional attention allocation
subsystem.
44.14.3 Link Grammar Parsing vs Word Grammar Parsing
From a CogPrime /PLN point of view, perhaps the most striking original contribution of Word
Grammar is in the area of syntax parsing. Word Grammar's treatment of morphology and se-
mantics is, basically, exactly what one would expect from representing such things in a richly
structured semantic network. PLN adds much additional richness to Word Grammar via allowing nuanced representation of uncertainty, which is critical on every level of the linguistic
hierarchy - but this doesn't change the fundamental linguistic approach of Word Grammar.
Regarding syntax processing, however, Word Grammar makes some quite specific and unique
hypotheses, which if correct are very valuable contributions.
The conceptual assumption we make here is that syntax processing, while carried out using
generic cognitive processes for uncertain inference and activation spreading, also involves some
highly specific constraints on these processes. The extent to which these constraints are learned
versus inherited is yet unknown, and for the subtleties of this issue the reader is referred to
lEI33- 971. Word Grammar and Link Grammar are then understood as embodying different
hypotheses regarding what these constraints actually are.
It is interesting to consider the contributions of Word Grammar to syntax parsing via com-
paring it to Link Grammar.
Note that Link Grammar, while a less comprehensive conceptual theory than Word Gram-
mar, has been used to produce a state-of-the-art syntax parser, which has been incorporated
into a number of other software systems including OpenCog. So it is clear that the Link Gram-
mar approach has a great deal of pragmatic value. On the other hand, it also seems clear that
Link Grammar has certain theoretical shortcomings. It deals with many linguistic phenomena
very elegantly, but there are other phenomena for which its approach can only be described as
"hacky."
Word Grammar contains fewer hacks than Link Grammar, but has not yet been put to the
test of large-scale computational implementation, so it's not yet clear how many hacks would
need to be added to give it the relatively broad coverage that Link Grammar currently has. Our
own impression is that to make Word Grammar actually work as the foundation for a broad-
coverage grammar parser (whether standalone, or integrated into a broader artificial cognition
framework), one would need to move it somewhat in the direction of link grammar, via adding
a greater number of specialized syntactic link types (more on this shortly). There are in fact
concrete indications of this in [Hud07a].
The Link Grammar framework may be decomposed into three aspects:
1. The link grammar dictionary, which for each word in English, contains a number of links of
different types. Some links point left, some point right, and each link is labeled. Furthermore,
some links are required and others are optional.
2. The "no-links-cross" constraint, which states that the correct parse of a sentence will involve
drawing links between words, in such a way that all the required links of each word are
fulfilled, and no two links cross when the links are depicted in two dimensions
3. A processing algorithm, which involves first searching the space of all possible linkages
among the words in a sentence to find all complete linkages that obey the no-links-cross
constraint; and then applying various postprocessing rules to handle cases (such as con-
junctions) that aren't handled properly by this algorithm
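The no-links-cross constraint of aspect 2 reduces to a simple planarity check over word positions. A minimal sketch, assuming links are given as pairs of word indices:

```python
def links_cross(links):
    """Test the 'no-links-cross' constraint: links (a,b) and (c,d), taken
    over word positions with a < b and c < d, cross iff a < c < b < d."""
    norm = [tuple(sorted(l)) for l in links]
    for i, (a, b) in enumerate(norm):
        for c, d in norm[i + 1:]:
            if a < c < b < d or c < a < d < b:
                return True
    return False

# "the dog ran" with links the-dog (0,1) and dog-ran (1,2): planar
print(links_cross([(0, 1), (1, 2)]))   # False
print(links_cross([(0, 2), (1, 3)]))   # True: the links cross
print(links_cross([(0, 3), (1, 2)]))   # False: nested links are fine
```

Note that sharing an endpoint or full nesting is allowed; only a strict interleaving of endpoints counts as a crossing.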
In PROWL, what we suggest is that
1. The link grammar dictionary is highly valuable and provides a level of linguistic detail that
is not present in Word Grammar; and, we suggest that in order to turn Word Grammar
into a computationally tractable system, one will need something at least halfway between
the currently minimal collection of syntactic link types used in Word Grammar and the
much richer collection used in Link Grammar
2. The no-links-cross constraint is an approximation of a deeper syntactic constraint ("land-
mark transitivity") that has been articulated in the most recent formulations of Word Gram-
mar. Specifically: when a no-links-crossing parse is found, it is correct according to Word
Grammar; but Word Grammar correctly recognizes some parses that violate this constraint
3. The Link Grammar parsing algorithm is not cognitively natural, but is effective in a
standalone-parsing framework. The Word Grammar approach to parsing is cognitively
natural, but as formulated could only be computationally implemented in the context of
an already-very-powerful general intelligence system. Fortunately, various intermediary ap-
proaches to parsing seem possible.
44.14.3.1 Using Landmark Transitivity with the Link Grammar Dictionary
An earlier version of Word Grammar utilized a constraint called "no tangled links" which is
equivalent to the link parser's "no links cross" constraint. In the new version of Word Grammar
this is replaced with a subtler and more permissive constraint called "landmark transitivity."
While in Word Grammar, landmark transitivity is used with a small set of syntactic link types,
there is no reason why it can't be used with the richer set of link types that Link Grammar
provides. In fact, this seems to us a probably effective method of eliminating most or all of the
"postprocessing rules" that exist in the link parser, and that constitute the least elegant aspect
of the Link Grammar framework.
The first foundational concept, on the path to the notion of landmark transitivity, is the
notion of a syntactic parent. In Word Grammar each syntactic link has a parent end and a
child end. In a dependency grammar context, the notion is that the child depends upon the
parent. For instance, in Word Grammar, in the link between a noun and an adjective, the noun
is the parent.
To apply landmark transitivity in the context of the Link Grammar, one needs to provide
some additional information regarding each link in the Link Grammar dictionary. One needs
to specify which end of each of the link grammar links is the "parent" and which is the "child."
Examples of this kind of markup are as follows (with (P) marking the parent end):
S link: subject-noun finite verb (P)
O link: transitive verb (P) direct or indirect object
D link: determiner noun (P)
MV link: verb (P) verb modifier
J link: preposition object (P)
ON link: on time-expression (P)
M link: noun (P) modifiers
In some cases a word may have more than one parent. In this case, the rule is that the landmark
is the one that is superordinate to all the other parents. In the rare case that two words are
each others' parents, then either may serve as the landmark.
The concept of a parent leads naturally into that of a landmark. The first rule regarding
landmarks is that a parent is a landmark for its child. Next, two kinds of landmarks are in-
troduced: Before landmarks (in which the child is before the parent) and After landmarks (in
which the child is after the parent). The Before/After distinction should be obvious in the Link
Grammar examples given above.
The landmark transitivity rule, then, has two parts. If A is a landmark for B, of subtype L
(where L is either Before or After), then
1. Subordinate transitivity says that if B is a landmark for C, then A is also a type-L
landmark for C
2. Sister transitivity says that if A is a landmark for C, then B is also a landmark for C
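These two rules can be sketched as a fixed-point closure over (parent, child, kind) triples. Typing each relation by actual word order, and detecting a contradiction as a derived kind that disagrees with word order, are our own simplifying assumptions here:

```python
def landmark_closure(links, pos):
    """Close a set of direct landmark relations under Word Grammar's
    subordinate and sister transitivity rules, then check consistency.
    links: (parent, child) pairs; pos: position of each word.
    A parent is a Before-landmark if its child precedes it, else After."""
    def kind(p, c):
        return "Before" if pos[c] < pos[p] else "After"
    rels = {(p, c, kind(p, c)) for p, c in links}
    changed = True
    while changed:
        changed = False
        for a, b, k in list(rels):
            for x, c, _ in list(rels):
                new = None
                if x == b and a != c:
                    new = (a, c, k)            # subordinate: A->B, B->C => A->C (same kind)
                elif x == a and c != b:
                    new = (b, c, kind(b, c))   # sister: A->B, A->C => B->C
                if new and new not in rels:
                    rels.add(new)
                    changed = True
    # contradiction: a derived kind that disagrees with actual word order
    consistent = all((k == "Before") == (pos[c] < pos[p]) for p, c, k in rels)
    return rels, consistent

# "the dog ran": ran is parent of dog, dog is parent of the
rels, ok = landmark_closure({("ran", "dog"), ("dog", "the")},
                            {"the": 0, "dog": 1, "ran": 2})
# closure adds ("ran", "the", "Before"); this parse is consistent
```

The closure's consistency test is the shape of the parse-correctness criterion discussed below: a linkage is acceptable when transitivity derives no contradictory landmark relations.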
Finally, there are some special link types that cause a word to depend on its grandparents or
higher ancestors as well as its parents. We note that these are not treated thoroughly in (Hudson,
2007); one needs to look to the earlier, longer and rarer work [Hud90]. Some questions are dealt
with this way. Another example is what in Word Grammar is called a "proxy link", as occurs
between "with" and "whom" in
The person with whom she works
The link parser deals with this particular example via a Jw link:
[link parser diagram for "LEFT-WALL the person.n with whom she works.v is.v silly.a", in which a Jw link joins "with" and "whom"]
So, to apply landmark transitivity in the context of the Link Grammar in this case, it seems
one would need to implement the rule that, in the case of two words connected by a Jw-link,
the child of one of the words is also the child of the other. Handling other special cases like
this in the context of Link Grammar seems conceptually unproblematic, though naturally some
hidden rocks may appear. Basically a list needs to be made of which kinds of link parser links
embody proxy relationships for which other kinds of link parser links.
According to the landmark transitivity approach, then, the criterion for syntactic correctness
of a parse is that, if one takes the links in the parse and applies the landmark transitivity rule
(along with the other special-case "raising" rules we've discussed), one does not arrive at any
contradictions (i.e. no situations where A is a Before landmark of B and B is also a Before
landmark of A).
The main problem with the landmark-transitivity constraint seems to be computational
tractability. The problem exists for both comprehension and generation, but we'll focus on
comprehension here.
To find all possible parses of a sentence using Hudson's landmark-transitivity-based approach,
one needs to find all linkages that don't lead to contradictions when used as premises for reason-
ing based on the landmark-transitivity axioms. This appears to be extremely computationally
intensive! So, it seems that Word Grammar style parsing is only computationally feasible for
a system that has extremely strong semantic understanding, so as to be able to filter out the
vast majority of possible parses on semantic rather than purely syntactic grounds.
On the other hand, it seems possible to apply landmark-transitivity together with no-links-
cross, to provide parsing that is both efficient and general. If applying the no-links-cross con-
straint finds a parse in which no links cross, without using postprocessing rules, then this will
always be a legal parse according to the landmark-transitivity rule.
However, landmark-transitivity also allows a lot of other parses that link grammar either
needs postprocessing rules to handle, or can't find even with postprocessing rules. So, it would
make sense to apply no-links-cross parsing first, but then if this fails, apply landmark-transitivity
parsing starting from the partial parses that the former stage produced. This is the approach
suggested in PROWL, and a similar approach may be suggested for language generation.
44.14.3.2 Overcoming the Current Limitations of Word Grammar
Finally, it is worth noting that expanding the Word Grammar parsing framework to include
the link grammar dictionary, will likely allow us to solve some unsolved problems in Word
Grammar. For instance, II lud0iaj notes that the current formulation of Word Grammar has no
way to distinguish the behavior of last vs. this in
I ate last night
I ate this ham
The issue he sees is that in the first case, night should be considered the parent of last; whereas
in the second case, this should be considered the parent of ham.
The current link parser also fails to handle this issue according to Hudson's intuition:
[link parser diagrams for "LEFT-WALL I.p ate.v last.a night.t" and "LEFT-WALL I.p ate.v this.d ham.n"]
However, the link grammar framework gives us a clear possibility for allowing the kind of
interpretation Hudson wants: just allow this to take a left-going O-link, and (in PROWL) let
it optionally assume the parent role when involved in a D-link relationship. There are no funky
link-crossing or semantic issues here; just a straightforward link-grammar dictionary edit.
This illustrates the syntactic flexibility of the link parsing framework, and also its inelegance
- adding new links to the dictionary generally solves syntactic problems, but at the cost of
creating more complexity to be dealt with further down the pipeline, when the various link
types need to be compressed into a smaller number of semantic relationship types for purposes
of actual comprehension (as is done in RelEx, for example). However, as far as we can tell, this
seems to be a necessary cost for adequately handling the full complexity of natural language
syntax. Word Grammar holds out the hope of possibly avoiding this kind of complexity, but
without filling in enough details to allow a clear estimate of whether this hope can ever be
fulfilled.
44.14.4 Contextually Guided Greedy Parsing and Generation Using
Word Link Grammar
Another difference between Link Grammar as currently utilized, and Word Grammar as
described, is the nature of the parsing algorithm. Link Grammar operates in a manner that is
fairly traditional among contemporary parsing algorithms: given a sentence, it produces a large
set of possible parses, and then it is left to other methods/algorithms to select the right parse,
and to form a semantic interpretation of the selected parse. Parse selection may of course involve
semantic interpretation: one way to choose the right parse is to choose the one that has the
most contextually sensible semantic interpretation. We may call this approach whole-sentence
purely-syntactic parsing, or WSPS parsing.
One of the nice things about Link Grammar, as compared to many other computational
parsing frameworks, is that it produces a relatively small number of parses, compared for
instance to typical head-driven phrase-structure grammar parsers. For simple sentences the
link parser generally produces only a handful of parses. But for complex sentences the link parser
can produce hundreds of parses, which can be computationally costly to sift through.
Word Grammar, on the other hand, presents far fewer constraints regarding which words may
link to other words. Therefore, to apply parsing in the style of the current link parser, in the
context of Word Grammar, would be completely infeasible. The number of possible parses would
be tremendous. The idea of Word Grammar is to pare down parses via semantic/pragmatic
sensibleness, during the course of the syntax parsing process, rather than breaking things down
into two phases (parsing followed by semantic/pragmatic interpretation). Parsing is suggested
to proceed forward through a sentence: when a word is encountered, it is linked to the words
coming before it in the sentence, in a way that makes sense. If this seems impossible, consistently
with the links that have already been drawn in the course of the parsing process, then some
backtracking is done and prior choices may be revisited. This approach is more like what
humans do when parsing a sentence, and does not have the effect of producing a large number
of syntactically possible, semantically/pragmatically absurd parses, and then sorting through
them afterwards. It is what we call a contextually-guided greedy parsing (CGGP) approach.
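A toy sketch of the CGGP idea: scan left to right, attach each word to the best-scoring earlier word, and backtrack when a word cannot be attached. The can_link test, the score function standing in for semantic/pragmatic sensibleness, and the restriction that each word attaches to exactly one earlier word are all hypothetical simplifications:

```python
def cggp_parse(words, can_link, score, threshold=0.5):
    """Greedy contextually-guided parse: link word i to one earlier word,
    preferring semantically sensible attachments; backtrack on failure."""
    def extend(i, links):
        if i == len(words):
            return links                   # every word attached: done
        cands = [j for j in range(i) if can_link(words[j], words[i])]
        cands.sort(key=lambda j: score(words[j], words[i]), reverse=True)
        for j in cands:
            if score(words[j], words[i]) < threshold:
                break                      # prune semantically absurd links
            result = extend(i + 1, links + [(j, i)])
            if result is not None:
                return result
        return None                        # forces backtracking in the caller
    return extend(1, [])

# Hypothetical miniature grammar: determiners link to nouns, nouns to verbs
pos = {"the": "det", "dog": "noun", "ran": "verb"}
ok = {("det", "noun"), ("noun", "verb")}
can_link = lambda a, b: (pos[a], pos[b]) in ok
score = lambda a, b: 1.0                  # stand-in for semantic sensibleness
parse = cggp_parse(["the", "dog", "ran"], can_link, score)
# parse == [(0, 1), (1, 2)]: the->dog, dog->ran
```

The score-ordered candidate list and early pruning are what make this "contextually guided": implausible attachments are never explored, rather than being generated and filtered afterwards as in WSPS parsing.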
For language generation, the link parser and Word Grammar approaches also suggest different
strategies. Link Grammar suggests taking a semantic network, then searching holistically for
a linear sequence of words that, when link-parsed, would give rise to that semantic network
as the interpretation. On the other hand, Word Grammar suggests taking that same semantic
network and iterating through it progressively, verbalizing each node of the network as one
walks through it, and backtracking if one reaches a point where there is no way to verbalize the
current node consistently with how one has already verbalized the previous nodes.
The main observation we want to make here is that Word Grammar, by its nature (due to
the relative paucity of explicit constraints on which syntactic links may be formed), can
operate with CGGP but not WSPS parsing. On the other hand, while Link Grammar is currently
utilized with WSPS parsing, there is no reason one can't use it with CGGP parsing just
as well. There is no objection to using CGGP parsing together with the link-parser dictionary,
nor with the no-links cross constraint rather than the landmark-transitivity constraint (in fact,
as noted above, earlier versions of Word Grammar made use of the no-links-cross constraint).
What we propose in PROWL is to use the link grammar dictionary together with the CGGP
parsing approach. The WSPS parsing approach may perhaps be useful as a fallback for handling
extremely complex and perverted sentences where CGGP takes too long to come to an answer
- it corresponds to sentences that are so obscure one has to do really hard, analytical thinking
to figure out what they mean.
Regarding constraints on link structure, the suggestion in PROWL is to use the no-links-
cross constraint as a first approximation. In comprehension, if no sufficiently high-probability
interpretation obeying the no-links-cross constraint is found, then the scope of investigation
should expand to include link-structures obeying landmark-transitivity but violating no-links-
cross. In generation, things are a little subtler: a list should be kept of link-type combinations
that often correctly violate no-links-cross, and when these combinations are encountered in the
generation process, then constructs that satisfy landmark-transitivity but not no-links-cross
should be considered.
Arguably, the PROWL approach is less elegant than either Link Grammar or Word Gram-
mar considered on its own. However, we are dubious of the proposition that human syntax
processing, with all its surface messiness and complexity, is really generated by a simple, uni-
fied, mathematically elegant underlying framework. Our goal is not to find a maximally elegant
theoretical framework, but rather one that works both as a standalone computational-linguistics
system, and as an integrated component of an adaptively-learning AGI system.
44.15 Aspects of Language Learning
Now we finally turn to language learning - a topic that spans the engineered and experiential
approaches to NLP. In the experiential approach, learning is required to gain even simple lin-
guistic functionality. In the engineered approach, even if a great deal of linguistic functionality
is built in, learning may be used for adding new functionality and modifying the initially given
functionality. In this section we will focus on a few aspects of language learning that would be
required even if the current engineered OpenCog comprehension pipeline were completed to a
high level of functionality. The more thoroughgoing language learning required for the expe-
riential approach will then be discussed in the following section. Further, Chapter 45 will dig
in depth into an aspect of language learning that to some extent cuts across the engineered/experiential dichotomy - unsupervised learning of linguistic structures from large corpora of
text.
44.15.1 Word Sense Creation
In our examples above, we've frequently referred to ReferenceLinks between WordNodes and
ConceptNodes. But, how do these links get built? One aspect of this is the process of word
sense creation.
Suppose we have a WordNode W that has ReferenceLinks to a number of different Con-
ceptNodes. A common case is that these ConceptNodes fall into clusters, each one denoting a
"sense" of the word. The clusters are defined by the following relationships:
1. ConceptNodes within a cluster have high-strength SimilarityLinks to each other
2. ConceptNodes in different clusters have low-strength (i.e. dissimilarity-denoting) Similar-
ityLinks to each other
When a word is first learned, it will normally be linked only to mutually agreeable ConceptN-
odes, i.e. there will only be one sense of the word. As more and more instances of the word
are seen, however, eventually the WordNode will gather more than one sense. Sometimes dif-
ferent senses are different syntactically, other times they are different only semantically, but
are involved in the same syntactic relationships. In the case of a word with multiple senses,
most of the relevant feature structure information will be attached to word-sense-representing
ConceptNodes, not to WordNodes themselves.
The formation of sense-representing ConceptNodes may be done by the standard clustering
and predicate mining processes, which will create such ConceptNodes when there are adequately
many Atoms in the system satisfying the relevant criteria. It may also be valuable to create a
particular SenseMining CIM-Dynamic, which uses the same criteria for node formation as the
clustering and predicate mining CIM-Dynamics, but focuses specifically on creating predicates
related to WordNodes and their nearby ConceptNodes.
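The two clustering criteria above amount to single-linkage clustering over SimilarityLink strengths. A sketch follows, in which the node names, link strengths, and threshold are all hypothetical:

```python
def word_senses(concepts, similarity, threshold=0.5):
    """Partition the ConceptNodes linked to one WordNode into senses:
    nodes joined (directly or transitively) by SimilarityLinks of
    strength >= threshold land in the same cluster (union-find)."""
    parent = {c: c for c in concepts}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c
    for i, a in enumerate(concepts):
        for b in concepts[i + 1:]:
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)
    clusters = {}
    for c in concepts:
        clusters.setdefault(find(c), []).append(c)
    return list(clusters.values())

# Hypothetical SimilarityLink strengths for the word "bank"
sim = {frozenset(p): s for p, s in [
    (("river-edge", "shore"), 0.9),
    (("money-institution", "credit-union"), 0.8),
    (("river-edge", "money-institution"), 0.1),
]}
similarity = lambda a, b: sim.get(frozenset((a, b)), 0.0)
senses = word_senses(["river-edge", "shore", "money-institution", "credit-union"],
                     similarity)
# two senses: [river-edge, shore] and [money-institution, credit-union]
```

High-strength links pull ConceptNodes into one cluster while low-strength (dissimilarity-denoting) links leave them apart, yielding one cluster per word sense as described above.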
44.15.2 Feature Structure Learning
We've mentioned above the obvious fact that, to intelligently use a feature-structure based
grammar, the system needs to be capable of learning new linguistic feature structures. Probing
into this in more detail, we see that there are two distinct but related kinds of feature structure
learning:
1. learning the values that features have for particular word senses.
2. learning new features altogether.
Learning the values that features have for particular word senses must be done when new
senses are created; and even for features imported from resources like the link grammar, the
possibility of corrections must obviously be accepted. This kind of learning can be done by
straightforward inference - inference from examples of word usage, and by analogy from features
for similar words. A simple example to think about, e.g., is learning the verb sense of "fax" when
only the noun sense is known.
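As a toy illustration of the "fax" example, the following Python sketch infers feature values for an unseen word sense by analogy with words that already have both senses. The feature names and mini-lexicon are invented; real learning of this kind would be done probabilistically by PLN rather than by the unanimity vote used here:

```python
# Toy sketch of learning feature values by analogy (names invented):
# "fax" is known only as a noun; we infer plausible verb features by
# analogy with words that have both a noun sense and a verb sense.

FEATURES = {
    ("phone", "noun"): {"countable": True},
    ("phone", "verb"): {"transitive": True, "tense": "regular"},
    ("mail",  "noun"): {"countable": False},
    ("mail",  "verb"): {"transitive": True, "tense": "regular"},
    ("fax",   "noun"): {"countable": True},
}

def infer_by_analogy(word, pos, known):
    """Guess features for (word, pos) from words that already have both
    a noun sense and a `pos` sense, keeping only feature values that
    all the analogous words agree on."""
    analogs = {w for (w, p) in known if p == pos and (w, "noun") in known}
    analogs.discard(word)
    votes = {}
    for w in analogs:
        for feat, val in known[(w, pos)].items():
            votes.setdefault(feat, []).append(val)
    return {f: vs[0] for f, vs in votes.items() if len(set(vs)) == 1}

print(infer_by_analogy("fax", "verb", FEATURES))
# {'transitive': True, 'tense': 'regular'}
```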
Next, the learning of new features can be viewed as a reasoning problem, in that inference
can learn new relations applied to nodes representing syntactic senses of words. In principle,
these "features" may be very general or very specialized, depending on the case. New feature
learning, in practice, requires a lot of examples, and is a more fundamental but less common
kind of learning than learning feature values for known word senses. A good example would be
the learning of "third person" by an agent that knows only first and second person.
In this example, it's clear that information from embodied experience would be extremely
helpful. In principle, it could be learned from corpus analysis alone - but the presence of knowl-
edge that certain words ("him", "her", "they", etc.) tend to occur in association with observed
agents different from the speaker or the hearer, would certainly help a lot with identifying
"third person" as a separate construct. It seems that either a very large number of un-embodied
examples or a relatively small number of embodied examples would be needed to support the
inference of the "third person" feature. And we suspect this example is typical - i.e. that the
most effective route to new feature structure learning involves both embodied social experience
and rather deep commonsense knowledge about the world.
44.15.3 Transformation and Semantic Mapping Rule Learning
Word sense learning and feature structure learning are important parts of language learning,
but they're far from the whole story. An equally important role is played by linguistic trans-
formations, such as the rules used in RelEx and RelEx2Frame. At least some of these must be
learned based on experience, for human-level intelligent language processing to proceed.
Each of these transformations can be straightforwardly cast as an ImplicationLink between
PredicateNodes, and hence formalistically can be learned by PLN inference, combined with one
or another heuristic method for compound predicate creation. The question is what knowledge
exists for PLN to draw on in assessing the strengths of these links, and more critically, to guide
the heuristic predicate formation methods. This is a case that likely requires the full complexity
of "integrative predicate learning" as discussed in Chapter 41. And, as with feature structure
learning, it's a case that will be much more effectively handled using knowledge from social
embodied experience alongside purely linguistic knowledge.
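To illustrate how such a transformation can be cast as an implication between predicates, here is a toy Python sketch of a RelEx2Frame-style rule. The relation and frame names are stand-ins for illustration, not the actual rule content used by RelEx2Frame:

```python
# Minimal sketch: a RelEx2Frame-style transformation treated as an
# implication between predicates (rule content invented for illustration).
# rule: subj(eat, X) & obj(eat, Y)  =>  Ingestion(ingestor=X, ingestible=Y)

def ingestion_rule(relations):
    """Apply the implication to a list of (predicate, head, dependent)
    triples, producing frame instances for every matching X, Y pair."""
    subj = {v for (p, a, v) in relations if p == "subj" and a == "eat"}
    obj  = {v for (p, a, v) in relations if p == "obj" and a == "eat"}
    return [("Ingestion", {"ingestor": x, "ingestible": y})
            for x in subj for y in obj]

rels = [("subj", "eat", "Ben"), ("obj", "eat", "cookie")]
print(ingestion_rule(rels))
# [('Ingestion', {'ingestor': 'Ben', 'ingestible': 'cookie'})]
```

Learning such a rule then amounts to learning the strength of the corresponding ImplicationLink between the antecedent and consequent predicates.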
44.16 Experiential Language Learning
We have talked a great deal about "engineered" approaches to NL comprehension and only
peripherally about experiential approaches. But there has been a not-so-secret plan underlying
this approach. There are many approaches to experiential language learning, ranging from a
"tabula rasa" approach in which language is just treated as raw data, to an approach where the
whole structure of a language comprehension system is programmed in, and "merely" the content
remains to be learned. There isn't much to say about the tabula rasa approach - we have already
discussed CogPrime's approach to learning, and in principle it is just as applicable to language
learning as to any other kind of learning. The more structured approach has more unique aspects
to it, so we will turn attention to it here. Of course, various intermediate approaches may be
constructed by leaving out various structures.
The approach to experiential language learning we consider most promising is based on
the PROWL approach, discussed above. In this approach one programs in a certain amount of
"universal grammar," and then allows the system to learn content via experience that obeys this
universal grammar. In a PROWL approach, the basic linguistic representational infrastructure
is given by the Atomspace that already exists in OpenCog, so the content of "universal grammar"
is basically
• the propensity to identify words
• the propensity to create a small set of asymmetric (i.e. parent/child) labeled relationship
types, to use to label relationships between semantically related word-instances. These are
"syntactic link types."
• the set of constraints on syntactic links implicit in Word Grammar, e.g. landmark transitivity
or no-links-cross
Building in the above items, without building in any particular syntactic links, seems enough
to motivate a system to learn a grammar resembling that of human languages.
Of course, experiential language learning of this nature is very, very different from "tabula
rasa" experiential language learning. But we note that, while PROWL style experiential lan-
guage learning seems like a difficult problem given existing AI technologies, tabula rasa language
learning seems like a nearly unapproachable problem. One could infer from this that current AI
technologies are simply inadequate to approach the problem that the young human child mind
solves. However, there seems to be some solid evidence that the young human child mind does
contain some form of universal grammar guiding its learning. Though we don't yet know what
form this universal prior linguistic knowledge takes in the human mind or brain, the evidence
regarding common structures arising spontaneously in various unrelated Creole languages is
extremely compelling, supporting ideas presented previously based on different lines
of evidence. So we suggest that PROWL based experiential language learning is actually con-
ceptually closer to human child language learning than a tabula rasa approach - although we
certainly don't claim that the PROWL based approach builds in the exact same things as the
human genome does.
What we need to make experiential language learning work, then, is a language-focused
inference-control mechanism that includes, e.g.
• a propensity to look for syntactic link types, as outlined just above
• a propensity to form new word senses, as outlined earlier
• a propensity to search for implications of the general form of RelEx and RelEx2Frame or
Syn2Sem rules
Given these propensities, it seems reasonable to expect a PLN inference system to be able to
"fill in the linguistic content" based on its experience, using links between linguistic and other
experiential content as its guide. This is a very difficult learning problem, to be sure, but it
seems in principle a tractable one, since we have broken it down into a number of interrelated
component learning problems in a manner guided by the structure of language.
Other aspects of language comprehension, such as word sense disambiguation and anaphor
resolution, seem to plausibly follow from applying inference to linguistic data in the context of
embodied experiential data, without requiring especial attention to inference control or supply-
ing prior knowledge.
Chapter ?? presents an elaboration of this sort of perspective, in a limited case which enables
greater clarity: the learning of linguistic content from an unsupervised corpus, based on the
assumption of the linguistic infrastructure just summarized above.
44.17 Which Path(s) Forward?
We have discussed a variety of approaches to achieving human-level NL comprehension in the
CogPrime framework. Which approach do we think is best? All things considered, we suspect
that a tabula rasa experiential approach is impractical, whereas a traditional computational
linguistics approach (whether based on hand-coded rules, corpus analysis, or a combination
thereof) will reach an intelligence ceiling well short of human capability. On the other hand we
believe that all of these options
1. the creation of an engineered NL comprehension system (as we have already done), and
the adaptation and enhancement of this system using learning that incorporates knowledge
from embodied experience
2. the creation of an engineered NL comprehension system via unsupervised learning from a
large corpus, as described in Chapter ?? below
3. the creation of an experiential learning based NL comprehension system using in-built
structures, such as the PROWL based approach described above
4. the creation of an experiential learning based system as described above, using an engineered
system (like the current one) as a "fitness estimation" resource in the manner described at
the end of Chapter 43
have significant promise and are worthy of pursuit. Which of these approaches we focus on in
our ongoing OpenCogPrime implementation work will depend on logistical issues as much as
on theoretical preference.
Chapter 45
Language Learning via Unsupervised Corpus
Analysis
Co-authored with Linas Vepstas
45.1 Introduction
The approach taken to NLP in the OpenCog project up through 2013, in practice, has involved
engineering and integrating rule-based NLP systems as "scaffolding", with a view toward later
replacing the rule content with alternative content learned via an OpenCog system's experience.
In this chapter we present a variant on this approach, in which the rule content of the
existing rule-based NLP system is replaced with new content learned via unsupervised corpus
analysis. This content can then be modified and improved via an OpenCog system's experience,
embodied and otherwise, as needed.
This unsupervised corpus analysis based approach deviates fairly far from human cogni-
tive science. However, as discussed above, language processing is one of those areas where the
pragmatic differences between young humans and early-stage AGI systems may be critical to
consider. The automated learning of language from embodied, social experience is a key part of
the path to AGI, and is one way that CogPrimes and other AGI systems should learn language.
On the other hand, unsupervised corpus-based language learning may perhaps also have a
significant role to play in the path to linguistically savvy AGI, leveraging some advantages that
AGIs have that humans do not, such as direct access to massive amounts of online text (without
the need to filter the text through slow-paced sense-perception systems like eyes).
The learning of language from unannotated text corpora is not a major pursuit within the
computational linguistics community currently. Supervised learning of linguistic structures from
expert-annotated corpora plays a large role, but this is a wholly different sort of pursuit, more
analogous to rule-based NLP, in that it involves humans explicitly specifying formal linguistic
structures (e.g. parse trees for sentences in a corpus). However, we hypothesize that unsuper-
vised corpus-based language learning can be carried out by properly orchestrating the use of
some fairly standard machine learning algorithms (already included in OpenCog / CogPrime),
within an appropriate structured framework (such as OpenCog's current NLP framework).
The review of [KM04] provides a summary of the state of the art in automatic grammar
induction (the third alternative listed above), as it stood a decade ago; it addresses a number
of linguistic issues and difficulties that arise in actual implementations of algorithms. It
is also notable in that it builds a bridge between phrase-structure grammars and dependency
grammars, essentially pointing out that these are more or less equivalent, and that, in fact,
significant progress can be achieved by taking on both points of view at once. Grammar induction
has progressed somewhat since this review was written, and we will mention some of the more
recent work below; yet it is fair to say that there has been no truly dramatic progress in
this direction.

(Footnote: Dr. Vepstas would properly be listed as the first author of this chapter; this
material was developed in a collaboration between Vepstas and Goertzel. However, as with all the
co-authored chapters in this book, final responsibility for any flaws in the presentation of the
material lies with Ben Goertzel, the chief author of the book.)
In this chapter we describe a novel approach to achieving automated grammar induction, i.e.
to machine learning of linguistic content from a large, unannotated text corpus. The methods
described may also be useful for language learning based on embodied experience; and may make
use of content created using hand-coded rules or machine learning from annotated corpora. But
our focus in this chapter will be on learning linguistic content from a large, unannotated text
corpus.
The algorithmic approach given in this chapter is wholly in the spirit of the "PROWL"
approach reviewed above in Chapter 44. However, PROWL is a quite general idea. Here we
present a highly specific PROWL-like algorithm, which is focused on learning from a large
unannotated corpus rather than from embodied experience. Because of the corpus-oriented
focus, it is possible to tie the algorithm of this chapter in with the statistical language learning
literature, more tightly than is possible with PROWL language learning in general. Yet, the
specifics presented here could largely be generalized to a broader PROWL context.
We consider the approach described here as "deep learning" oriented because it is based on
hierarchical pattern recognition in linguistic data: identifying patterns, then patterns among
these patterns, etc., in a hierarchy that allows "higher level" (more abstract) patterns to feed
back down the hierarchy and affect the recognition of lower level patterns. Our approach does not
use conventional deep learning architectures like Deep Boltzmann machines or recurrent neural
networks. Conceptually, our approach is based on a similar intuition to these algorithms, in that
it relies on the presence of hierarchical structure in its input data, and utilizes a hierarchical
pattern recognition structure with copious feedback to adaptively identify this hierarchical
structure. But the specific pattern recognition algorithms we use, and the specific nature of the
hierarchy we construct, are guided by existing knowledge about what works and what doesn't
in (both statistical and rule-based) computational linguistics.
While the overall approach presented here is novel, most of the detailed ideas are extensions
and generalizations of the prior work of multiple authors, which will be referenced and in some
cases discussed below. In our view, the body of ideas needed to enable unsupervised learning of
language from large corpora has been gradually emerging during the last decade. The approach
given here has unique aspects, but also many aspects already validated by the work of others.
For sake of simplicity, we will deal here only with learning from written text. We believe
that conceptually very similar methods can be applied to spoken language as well, but this brings
extra complexities that we will avoid for the purpose of the present document. (In short: Below
we represent syntactic and semantic learning as separate but similarly structured and closely
coupled learning processes. To handle speech input thoroughly, we would suggest phonological
learning as another separate, similarly structured and closely coupled learning process.)
Finally, we stress that the algorithms presented here are intended to be used in conjunction
with a large corpus, and a large amount of processing power. Without a very large corpus, some
of the feedbacks required for the learning process described would be unlikely to happen (e.g.
the ability of syntactic and semantic learning to guide each other). We have not yet sought
to estimate exactly how large a corpus would be required, but our informal estimate is that
Wikipedia might or might not be large enough, and the Web is certainly more than enough.
We don't pretend to know just how far this sort of unsupervised, corpus based learning can
be pushed. To what extent can the content of a natural language like English be learned this
way? How much, if any, ambiguity will be left over once this kind of learning has been thoroughly
done - only pragmatically disambiguable via embodied social learning? Strong opinions on these
sorts of issues abound in the cognitive science, linguistics and AI communities; but the only
apparent way to resolve these questions is empirically.
45.2 Assumed Linguistic Infrastructure
While the approach outlined in this chapter aims to learn the linguistic content of a language
from textual data, it does not aim to learn the idea of language. Implicitly, we assume a model
in which a learning system begins with a basic "linguistic infrastructure" indicating the various
parts of a natural language and how they generally interrelate; and it then learns the linguistic
content characterizing a particular language. In principle, it would also be possible to have an
AI system learn the very concept of a language and build its own linguistic infrastructure.
However, that is not the problem we address here; and we suspect such an approach would
require drastically more computational resources.
The basic linguistic infrastructure assumed here includes:
• A formalism for expressing grammatical (dependency) rules is assumed.
- The ideas given here are not tied to any specific grammatical formalism, but as in
Chapter ?? we find it convenient to make use of a formalism in the style of dependency
grammars [Tes59]. Taking a mathematical perspective, different grammar formalisms can
be translated into one another, using relatively simple rules and algorithms [KM04]. The
primary difference between them is more a matter of taste, perceived linguistic 'natural-
ness', adaptability, and choice of parser algorithm. In particular, categorial grammars
can be converted into link grammars in a straightforward way, and vice versa, but link
grammars provide a more compact dictionary. Link grammars [ST91, ST93] are a type
of dependency grammar; these, in turn, can be converted to and from phrase-structure
grammars. We believe that dependency grammars provide a simpler and more natural
description of linguistic phenomena. We also believe that dependency grammars have a
more natural fit with maximum-entropy ideas, where a dependency relationship can be
literally interpreted as the mutual information between word-pairs [Yur98]. Dependency
grammars also work well with Markov models; dependency parsers can be implemented
as Viterbi decoders. Figure 44.1 illustrates two different formalisms.
- The discussion below assumes the use of a formalism similar to that of Link Grammar,
as described above. In this theory, each word is associated with a set of 'connector
disjuncts', each connector disjunct controlling the possible linkages that the word may
take part in. A disjunct can be thought of as a jig-saw puzzle-piece; valid syntactic word
orders are those for which the puzzle-pieces can be validly connected. A single connector
can be thought of as a single tab on a puzzle-piece (shown in figure ??). Connectors are
thus 'types' X with a + or - sign indicating that they connect to the left or right. For
example, a typical verb disjunct might be S- & O+, indicating that a subject (a noun)
is expected on the left, and an object (also a noun) is expected on the right.
- Some of the discussion below assumes select aspects of (Dick Hudson's) Word Grammar [Hud84,
Hud07]. As reviewed above, Word Grammar theory (implicitly) uses connectors similar
to those of Link Grammar, but allows each connector to be marked as the head of
a link or not. A link then becomes an arrow from a head word to the dependent word.
(Somewhat confusingly, the head of the arrow points at the dependent word; this means
the tail of the arrow is attached to the head word).
- Each word is associated with a "lexical entry"; in Link Grammar, this is the set of
connector disjuncts for that word. It is usually the case that many words share a common
lexical entry; for example, most common nouns are syntactically similar enough that
they can all be grouped under a single lexical entry. Conversely, a single word is allowed
to have multiple lexical entries; so, for example, "saw", the noun, will have a different
lexical entry from "saw", the past tense of the verb "to see". That is, lexical entries can
loosely correspond to traditional dictionary entries. Whether or not a word has multiple
lexical entries is a matter of convenience, rather than a fundamental aspect. Curiously,
a single Link Grammar connector disjunct can be viewed as a very fine-grained part-
of-speech. In this way, it is a stepping stone to the semantic meaning of a word.
• A parser, for extracting syntactic structure from sentences, is assumed. What's more, it is
assumed that the parser is capable of using semantic relationships to guide parsing.
- A paradigmatic example of such a parser is the "Viterbi Link Parser", currently under
development for use with the Link Grammar. This parser is currently operational in a
simple form. The name refers to its use of the general ideas of the Viterbi algorithm.
This algorithm seems biologically plausible, in that it applies only a local analysis of
sentence structure, of limited scope, as opposed to a global optimization, thus roughly
emulating the process of human listening. The current set of legal parses of a sentence is
pruned incrementally and probabilistically, based on flexible criteria. These potentially
include the semantic relationships extractable from the partial parse obtained at a given
point in time. It also allows for parsing to be guided by inter-sentence relationships, such
as pronoun resolution, to disambiguate otherwise ambiguous sentences.
• A formalism for expressing semantic relationships is assumed.
- A semantic relationship generalizes the notion of a lexical entry, to allow for changes
of word order, paraphrasing, tense, number, the presence or absence of modifiers, etc.
An example of such a relationship would be eat(X, Y) - indicating the eating of some
entity Y by some entity X. This abstracts into common form several different syntac-
tic expressions: "Ben ate a cookie", "A cookie will be eaten by Ben", "Ben sat, eating
cookies".
- Nothing particularly special is assumed here regarding semantic relationships, beyond a
basic predicate-argument structure. It is assumed that predicates can have arguments
that are other predicates, and not just atomic terms; this has an explicit impact on how
predicates and arguments are represented. A "semantic representation" of a sentence is
a network of arrows (defining predicates and arguments), each arrow or a small subset
of arrows defining a "semantic relationship". However, the beginning or end of an arrow
is not necessarily a single node, but may land on a subgraph.
- Type constraints seem reasonable, but it's not clear if these must be made explicit, or
if they are the implicit result of learning. Thus, eat(X, Y) requires that X and Y both
be entities, and not, for example, actions or prepositions.
- We have not yet thought through exactly how rich the semantic formalism should be for
handling the full variety of quantifier constructs in complex natural language. But we
suspect that it's OK to just use basic predicate-argument relationships and not build
explicit quantification into the formalism, allowing quantifiers to be treated like other
predicates.
- Obviously, CogPrime's formalism for expressing linguistic structures in terms of Atoms,
presented in Chapter 44, fulfills the requirements of the learning scheme presented in
this chapter. However, we wish to stress that the learning scheme presented here does
not depend on the particulars of CogPrime's representation scheme, though it is very
compatible with them.
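To make the jigsaw-puzzle metaphor introduced above concrete, the following toy Python sketch checks whether a short sentence can be linked, given per-word disjuncts. The mini-lexicon is invented, and a real parser must also enforce constraints such as no-links-cross, which are omitted here for brevity:

```python
# Toy illustration of Link Grammar style disjuncts (mini-lexicon invented).
# A disjunct lists connectors: 'X-' connects to a matching 'X+' on an
# earlier word; 'X+' connects to a matching 'X-' on a later word.

LEXICON = {
    "Ben":     ["S+"],        # noun: offers a subject link to its right
    "ate":     ["S-", "O+"],  # verb: subject on the left, object on the right
    "cookies": ["O-"],        # noun: accepts an object link from its left
}

def links_for(sentence):
    """Pair up matching connectors (same type, opposite sign, correct
    order). Returns the links formed, or None if any connector is left
    unsatisfied. Brute force; ignores no-links-cross, toy-sized only."""
    words = sentence.split()
    need = [(i, c[:-1], c[-1]) for i, w in enumerate(words) for c in LEXICON[w]]
    links, used = [], set()
    for i, typ, sign in need:
        if (i, typ, sign) in used:
            continue
        for j, typ2, sign2 in need:
            if typ2 == typ and (j, typ2, sign2) not in used and \
               ((sign == "+" and sign2 == "-" and i < j) or
                (sign == "-" and sign2 == "+" and j < i)):
                links.append((min(i, j), max(i, j), typ))
                used.add((i, typ, sign))
                used.add((j, typ2, sign2))
                break
        else:
            return None  # a connector went unsatisfied: no valid linkage
    return links

print(links_for("Ben ate cookies"))   # [(0, 1, 'S'), (1, 2, 'O')]
print(links_for("ate Ben cookies"))   # None
```

The second call fails because the verb's S- connector finds no S+ connector to its left, illustrating how disjuncts constrain word order.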
45.3 Linguistic Content To Be Learned
Given the above linguistic infrastructure, what remains for a language learning system to
learn is the linguistic content that characterizes a particular language. Everything included
in OpenCog's existing "scaffolding" rule-based NLP system would, in this approach, be learned
to first approximation via unsupervised corpus analysis.
Specifically, given the assumed framework, key things to be learned include:
• A list of 'link types' that will be used to form 'disjuncts' must be learned.
- An example of a link type is the 'subject' link S. This link typically connects the sub-
ject of a sentence to the head verb. Given the normal English subject-verb word order,
nouns will typically have an S+ connector, indicating that an S link may be formed only
when the noun appears to the left of a word bearing an S- connector. Likewise, verbs
will typically be associated with S- connectors. The current Link Grammar contains
roughly one hundred different link-types, with additional optional subtypes that are
used to further constrain syntactic structure. This number of different link types seems
required simply because there are many relationships between words: there is not just a
subject-verb or verb-object relationship, but also rather fine distinctions, such as those
needed to form grammatical time, date, money, and measurement expressions, punctu-
ation use, including street-addresses, cardinal and ordinal relationships, proper (given)
names, titles and suffixes, and other highly constrained grammatical constructions. This
is in addition to the usual linguistic territory of needing to indicate dependent clauses,
comparatives, subject-verb inversion, and so on. It is expected that a comparable number
of link types will need to be learned.
- Some link types are rather strict, such as those that connect verb subjects and objects, while
other types are considerably more ambiguous, such as those involving prepositions.
This reflects the structure of English, where subject-verb-object order is fairly rigor-
ously enforced, but the ordering and use of prepositions is considerably looser. When
considering the looser cases, it becomes clear that there is no single, inherent 'right
answer' for the creation and assignment of link types, and that several different, yet
linguistically plausible linkage assignments may be made.
- The definition of a good link-type is one that leads the parser - applied across the whole
corpus - to allow parsing to be successful for almost all sentences, and yet not to be
so broad as to enable parsing of word-salads. Significant pressure must be applied to
prevent excess proliferation of link types, yet not so much as to over-simplify things and
provide valid parses for unobserved, ungrammatical sentences.
• Lexical entries for different words must be learned.
- Typically, multiple connectors are needed to define how a word can link syntactically to
others. Thus, for example, many verbs have the disjunct S- & O+, indicating that they
need a subject noun to the left, and an object to the right. All words have at least a
handful of valid disjuncts that they can be used with, and sometimes hundreds or even
more. Thus, a "lexical entry" must be learned for each word, the lexical entry being a
set of disjuncts that can be used with that word.
- Many words are syntactically similar; most common nouns can share a single lexical
entry. Yet, there are many exceptions. Thus, during learning, there is a back-and-forth
process of grouping and ungrouping words; clustering them so that they share lexical
entries, but also splitting apart clusters when it is realized that some words behave dif-
ferently. Thus for example, the words "sing" and "apologize" are both verbs, and thus
share some linguistic structure, but one cannot say "I apologized a song to Vicky"; if
these two verbs were initially grouped together into a common lexical entry, they must
later be split apart.
- The definition of a good lexical entry is much the same as that for a good link type: ob-
served sentences must be parsable; random sentences mostly must not be, and excessive
proliferation and complexity must be prevented.
• Semantic relationships must be learned.
- The semantic relationship eat(X,Y) is prototypical. Foundationally, such a semantic
relationship may be represented as a set whose elements consist of syntactico-semantic
subgraphs. For the relation eat(X, Y), a subgraph may be as simple as a single (syntactic)
disjunct S- & O+ for the normal word order "Ben ate a cookie", but it may also be
a more complex set needed to represent the inverted word order in "a cookie was eaten
by Ben". The set of all of these different subgraphs defines the semantic relationship.
The subgraphs themselves may be syntactic (as in the example above), or they may be
other semantic relationships, or a mixture thereof.
- Not all re-phrasings are semantically equivalent. "Mr. Smith is late" has a rather dif-
ferent meaning from "The late Mr. Smith."
- In general, place-holders like X and Y may be words or category labels. In early stages
of learning, it is expected that X and Y are each just sets of words. At some point,
though, it should become clear that these sets are not specific to this one relationship,
but can appropriately take part in many relationships. In the above example, X and Y
must be entities (physical objects), and, as such, can participate in (most) any other
relationships where entities are called for. More narrowly, X is presumably a person or
animal, while Y is a foodstuff. Furthermore, as entities, it might be inferred when these
refer to the same physical object (see the section 'reference resolution' below).
- Categories can be understood as sets of synonyms, including hyponyms (thus, "grub" is
a synonym for "food", while "cookie" is a hyponym).
• Idioms and set phrases must be learned.
- English has a large number of idiomatic expressions whose meanings cannot be inferred
from the constituent words (such as "to pull one's leg"). In this way, idioms present
a challenge: their sometimes complex syntactic constructions belie their often simpler
semantic content. On the other hand, idioms have a very rigid word-choice and word
order, and are highly invariant. Set phrases take a middle ground: word-choice is not
quite as fixed as for idioms, but, nonetheless, there is a conventional word order that
is usually employed. Note that the manually-constructed Link Grammar dictionaries
contain thousands of lexical entries for idiomatic constructions. In essence, these are
multi-word constructions that are treated as if they were a single word.
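The back-and-forth grouping of words into shared lexical entries, described above, can be sketched as clustering over disjunct-usage distributions. The counts and disjuncts below are invented for illustration. Notably, purely distributional evidence initially groups "sing" and "apologize" together, exactly the kind of cluster that later evidence must split:

```python
# Illustrative sketch (invented counts): grouping words into shared lexical
# entries by the similarity of their observed disjunct-usage distributions.
from math import sqrt

# counts of how often each word was seen used with each disjunct
USAGE = {
    "dog":       {"D- & S+": 40, "D- & O-": 35},
    "cat":       {"D- & S+": 38, "D- & O-": 30},
    "sing":      {"S- & O+": 5,  "S-": 50},
    "apologize": {"S-": 45, "S- & MVp+": 12},
}

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def group_words(usage, threshold=0.8):
    """Greedily merge words whose disjunct distributions are similar;
    each resulting group would share one lexical entry."""
    groups = []
    for w in usage:
        for g in groups:
            if all(cosine(usage[w], usage[m]) >= threshold for m in g):
                g.append(w)
                break
        else:
            groups.append([w])
    return groups

print(group_words(USAGE))
# [['dog', 'cat'], ['sing', 'apologize']]
```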
Each of the above tasks has already been accomplished and described in the literature; for
example, automated learning of synonymous words and phrases has been described by Lin [LP01]
and by Poon & Domingos [PD09]. The authors are not aware of any attempts to learn all of these,
together, in one go, rather than presuming the pre-existence of the layers on which they depend.
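The set-of-subgraphs representation of a semantic relationship, discussed above, can be sketched as follows. The pattern notation and rule content are invented shorthand for illustration, not an actual OpenCog format:

```python
# Hedged sketch of the set-of-subgraphs idea: a semantic relation such as
# eat(X, Y) is represented by the set of syntactic patterns that express it.

EAT = {
    # active voice: "Ben ate a cookie" - left word is subject, right is object
    ("S-", "O+"): lambda left, right: ("eat", left, right),
    # passive voice: "a cookie was eaten by Ben" - left is object, right subject
    ("S-", "by+"): lambda left, right: ("eat", right, left),
}

def extract(pattern, left_word, right_word):
    """Map a matched syntactic pattern to the common semantic relation,
    or None if the pattern is not part of the relation's subgraph set."""
    builder = EAT.get(pattern)
    return builder(left_word, right_word) if builder else None

print(extract(("S-", "O+"), "Ben", "cookie"))   # ('eat', 'Ben', 'cookie')
print(extract(("S-", "by+"), "cookie", "Ben"))  # ('eat', 'Ben', 'cookie')
```

Both word orders normalize to the same predicate-argument structure, which is exactly what the set-of-subgraphs representation is meant to capture.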
45.3.1 Deeper Aspects of Comprehension
While the learning of the above aspects of language is the focus of our discussion here, the search
for semantic structure does not end there; more is possible. In particular, natural language
generation has a vital need for lexical functions, so that appropriate word-choice can be made
when vocalizing ideas. In order to truly understand text, one also needs, as a minimum, to
discern referential structure, and sophisticated understanding requires discerning topics. We
believe automated, unsupervised learning of these aspects is attainable, but is best addressed
after the 'simpler' language learning described above. We are not aware of any prior work
aimed at automatically learning these, aside from relatively simple, unsophisticated (bag-of-
words style) efforts at topic categorization.
45.4 A Methodology for Unsupervised Language Learning from a
Large Corpus
The language learning approach presented here is novel in its overall nature. Each part of it,
however, draws on prior experimental and theoretical research by others on particular aspects
of language learning, as well as on our own previous work building computational linguistic
systems. The goal is to assemble a system out of parts that are already known to work well in
isolation.
Prior published research, from a multitude of authors over the last few decades, has already
demonstrated how many of the items listed above can be learnt in an unsupervised setting
(see e.g. [Yur98, Kli04, LP01, CSW, PD09, Mih07, CSPCB] for relevant background). All
of the previously demonstrated results, however, were obtained in isolation, via research that
assumed the pre-existence of surrounding infrastructure far beyond what we assume above. The
approach proposed here may be understood as a combination, generalization and refinement of
these techniques, to create a system that can learn, more or less ab initio from a large corpus,
with a final result of a working, usable natural language comprehension system.
However, we must caution that the proposed approach is in no way a haphazard mash-up of
techniques. There is a deep algorithmic commonality to the different prior methods we combine,
which has not always been apparent in the prior literature due to the different emphases and
technical vocabularies used in the research papers in question. In parallel with implementing
the ideas presented here, we intend to work on fully formalizing the underlying mathematics
of the undertaking, so that it becomes clear what approximations are being taken, and what
avenues remain unexplored. Some fairly specific directions in this regard suggest themselves.
All of the prior research alluded to above invokes some or another variation of maximum en-
tropy principles, sometimes explicitly, but usually implicitly. In general, entropy maximization
principles provide the foundation for learning systems such as (hidden) Markov models, Markov
networks and Hopfield neural networks, and they connect indirectly with Bayesian probability
based analyses. However, the actual task of maximizing the entropy is an NP-hard problem;
forward progress depends on short-cuts, approximations and clever algorithms, some of which
are of general nature, and some domain-dependent. Part of the task of refining the details of
the language learning methodology presented here, is to explore various short-cuts and approx-
imations to entropy maximization, and discover new, clever algorithms of this nature that are
relevant to the language learning domain. As has been the case in physics and other domains,
we suspect that progress here will be best achieved via a coupled exploration of experimental
and mathematical aspects of the subject matter.
45.4.1 A High Level Perspective on Language Learning
On an abstract conceptual level, the approach proposed here depicts language learning as an
instance of a general learning loop such as:
1. Group together linguistic entities (i.e. words or linguistic relationships, such as those de-
scribed in the previous section) that display similar usage patterns (where one is looking at
usage patterns that are compactly describable given one's meta-language). Many, but not
necessarily all, usage patterns for a given linguistic entity will involve its use in conjunction
with other linguistic entities.
2. For each such grouping make a category label.
3. Add these category labels to one's meta-language
4. Return to Step 1
It stands to reason that the result of this sort of learning loop, if successful, will be a hierarchi-
cally composed collection of linguistic relationships possessing the following
Linguistic Coherence Property: Linguistic entities are reasonably well characterizable in terms
of the compactly describable patterns observable in their relationships with other linguistic
entities.
Note that there is nothing intrinsically "deep" or hierarchical in this sort of linguistic coherence.
However, the ability to learn the patterns relating linguistic entities with others, via a
recursive hierarchical learning loop such as described above, is contingent on the presence of a
fairly marked hierarchical structure in the linguistic data being studied. There is much evidence
that such hierarchical structure does indeed exist in natural languages. The "deep learning" in
our approach is embedded in the repeated cycles through the loop given above - each time one
goes through the loop, the learning gets one level deeper.
This sort of property has been observed to hold for many linguistic entities, an observation dating
back at least to Saussure [dS77] and the start of structuralist linguistics. It is basically a fancier
way of saying that the meanings of words and other linguistic constructs, may be found via their
relationships to other words and linguistic constructs. We are not committed to structuralism
as a theoretical paradigm, and we have considerable respect for the aid that non-linguistic
information - such as the sensorimotor data that comes from embodiment - can add to language,
as should be apparent from the overall discussion in this book. However, the potentially dramatic
utility of non-linguistic information for language learning does not imply the impossibility or
infeasibility of learning language from corpus data alone. It is inarguable that non-linguistic re-
lationships comprise a significant portion of the everyday meaning of linguistic entities; yet
redundancy is prevalent in natural systems, and we believe that purely linguistic relationships
may well provide sufficient data for learning of natural languages. If there are some aspects of
natural language that cannot be learned via corpus analysis, it seems difficult to identify what
these aspects are via armchair theorizing, and likely that they will only be accurately identified
via pushing corpus linguistics as far as it can go.
This generic learning process is a special case of the general process of symbolization, de-
scribed in Chaotic Logic [Goe94] and elsewhere as a key aspect of general intelligence. In this
process, a system finds patterns in itself and its environment, and then symbolizes these patterns
via simple tokens or symbols that become part of the system's native knowledge representation
scheme (and hence parts of its "metalanguage" for describing things to itself). Having repre-
sented a complex pattern as a simple symbolic token, it can then easily look at other patterns
involving this pattern as a component.
Note that in its generic format as stated above, the "language learning loop" is not restricted
to corpus based analysis, but may also include extralinguistic aspects of usage patterns, such
as gestures, tones of voice, and the physical and social context of linguistic communication.
Linguistic and extra-linguistic factors may come together to comprise "usage patterns." How-
ever, the restriction to corpus data does not necessarily denude the language learning loop of
its power: it merely restricts one to particular classes of usage patterns, whose informativeness
must be empirically determined.
In principle, one might be able to create a functional language learning system based only
on a very generic implementation of the above learning loops. In practice, however, biases
toward particular sorts of usage patterns can be very valuable in guiding language learning. In
a computational language learning context, it may be worthwhile to break down the language
learning process into multiple instances of the basic language learning loops, each focused on
different sorts of usage patterns, and coupled with each other in specific ways. This is in fact
what we will propose here.
Specifically, the language learning process proposed here involves:
• One language learning loop for learning purely syntactic linguistic relationships (such as
link types and lexical entries, described above), which are then used to provide input to a
syntax parser.
• One language learning loop for learning higher-level "syntactico-semantic" linguistic rela-
tionships (such as semantic relationships, idioms, and lexical functions, described above),
which are extracted from the output of the syntax parser.
These two loops are not independent of one-another; the second loop can provide feedback to the
first, regarding the correctness of the extracted structures; then as the first loop produces more
correct, confident results, the second loop can in turn become more confident in its output. In
this sense, the two loops attack the same sort of slow-convergence issues that 'deep learning'
tackles in neural-net training.
The syntax parser itself, in this context, is used to extract directed acyclic graphs (dags),
usually trees, from the graph of syntactic relationships associated with a sentence. These dags
represent parses of the sentence. So the overall scope of the learning process proposed here is
to learn a system of syntactic relationships that displays appropriate coherence and
that, when fed into an appropriate parser, will yield parse trees that give rise to a
system of syntactico-semantic relationships that displays appropriate coherence.
45.4.2 Learning Syntax
The process of learning syntax from a corpus may be understood fairly directly in terms of
entropy maximization. As a simple example, consider the measurement of the entropy of the
arrangement of words in a sentence. To a fair degree, this can be approximated by the sum of
the mutual entropy between pairs of words. Yuret showed that by searching for and maximizing
this sum of entropies, one obtains a tree structure that closely resembles that of a dependency
parser [Yur98]. That is, the word pairs with the highest mutual entropy are more or less the same
as the arrows in a dependency parse, such as that shown in figure 44.1. Thus, an initial task
is to create a catalog of word-pairs with a large mutual entropy (mutual information, or MI)
between them. This catalog can then be used to approximate the most-likely dependency parse
of a sentence, although, at this stage, the link-types are as yet unknown.
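This first task can be sketched in a few lines; the following is a minimal illustration only, not Yuret's actual algorithm: the co-occurrence window, tokenization, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter
from math import log2

def word_pair_mi(sentences, max_dist=3):
    """Catalog of pointwise MI for ordered word pairs co-occurring
    within max_dist positions of each other."""
    word_counts, pair_counts = Counter(), Counter()
    n_words = n_pairs = 0
    for sent in sentences:
        words = sent.lower().split()
        word_counts.update(words)
        n_words += len(words)
        for i, w1 in enumerate(words):
            for w2 in words[i + 1:i + 1 + max_dist]:
                pair_counts[(w1, w2)] += 1
                n_pairs += 1
    # Pointwise MI: log2( p(w1,w2) / (p(w1) * p(w2)) )
    return {(w1, w2): log2((c / n_pairs) /
                           ((word_counts[w1] / n_words) *
                            (word_counts[w2] / n_words)))
            for (w1, w2), c in pair_counts.items()}

corpus = ["the cat chased the mouse",
          "the dog chased the cat",
          "the mouse ran"]
catalog = word_pair_mi(corpus)
```

A real implementation would then link, within each sentence, only those pairs present in the high-MI portion of this catalog, approximating the dependency tree.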
Finding dependency links using mutual information is just the first step to building a practical
parser. The generation of high-MI word-pairs works well for isolating which words should be
linked, but it does have several major drawbacks. First and foremost, the word-pairs do not come
with any sort of classification; there is no link type describing the dependency relationship
between two words. Secondly, most words fall into classes (e.g. nouns, verbs, etc.), but the
high-MI links do not tell us what these are. A compact, efficient parser appears to require this
sort of type information.
To discover syntactic link types, it is necessary to start grouping together words that appear
in similar contexts. This can be done with clustering and similarity techniques, which appears
to be sufficient to discover not only basic parts of speech (verbs, nouns, modifiers, determiners),
but also link types. So, for example, the computation of word-pair MI is likely to reveal the
following high-MI word pairs: "big car", "fast car", "expensive car", "red car". It is reasonable
to group together the words big, expensive, fast and red into a single category, interpreted as
modifiers to car. The grouping can be further refined if these same modifiers are observed with
other words (e.g. "big bicycle", "fast bicycle", etc.) This has two effects: it not only reinforces
the correctness of the original grouping of modifiers, but also suggests that perhaps cars and
bicycles should be grouped together. Thus, one has discovered two classes of words: modifiers
and nouns. In essence, one has crudely discovered parts of speech.
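The clustering step just described can be sketched with a toy similarity computation. The pair list, the cosine measure, the threshold, and the greedy grouping scheme below are illustrative assumptions, not a prescribed algorithm:

```python
from collections import defaultdict
from math import sqrt

def context_vectors(pairs):
    # For each left-hand word, count the heads it appears with.
    vecs = defaultdict(lambda: defaultdict(int))
    for left, right in pairs:
        vecs[left][right] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(vecs, threshold=0.9):
    # Greedy single pass: join the first cluster whose representative
    # is similar enough, else start a new cluster.
    clusters = []
    for w in vecs:
        for c in clusters:
            if cosine(vecs[w], vecs[c[0]]) >= threshold:
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters

# Hypothetical high-MI (modifier, head) pairs, as in the example above
pairs = [(m, h) for m in ("big", "fast", "expensive", "red")
         for h in ("car", "bicycle")]
pairs += [("the", "car"), ("the", "bicycle"), ("the", "mouse")]
groups = cluster(context_vectors(pairs))
```

Here big, fast, expensive and red share exactly the same head contexts and so land in one cluster, while "the", whose contexts are broader, does not.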
The link between these two classes carries a type; the type of that link is defined by these two
classes. The use of a pair of word classes to define a link type is a basic premise of categorial
grammar [CSPCB]. In this example, a link between a modifier and a noun would be a type
denoted as M\N in categorial grammar, M denoting the class of modifiers, and N the class
of nouns. In the system of Link Grammar, this is replaced by a simple name, but it's really
one and the same thing. (In this case, the existing dictionaries use the A link for this relation,
with A conjuring up 'adjective' as a mnemonic.) The simple name is a boon for readability, as
categorial grammars usually have very complex-looking link-type names: e.g. (NP\S)/NP for
the simplest transitive verbs. Typing seems to be an inherent part of language; types must be
extracted during the learning process.
The introduction of types here has mathematical underpinnings provided by type theory.
An introduction to type theory can be found in [Pro13], and an application of type theory to
linguistics can be found in [CSPCB]. This is a rather abstract work, but it sheds light on the
nature of link types, word-classes, parts-of-speech and the like as formal types of type theory.
This is useful in dispelling the seeming taint of ad hoc arbitrariness of clustering: in a linguistic
context, it is not so much ad hoc as it is a way of guaranteeing that only certain words can
appear in certain positions in grammatically correct sentences, a sort of constraint that seems
to be an inherent part of language, and seems to be effectively formalizable via type theory.
Word-clustering, as illustrated in the above example, can be viewed as another entropy-
maximization technique. It is essentially a kind of factorization of dependent probabilities
into most likely factors. By classifying a large number of words as 'modifiers of nouns', one
is essentially admitting that they are equi-probable in that role, in the Markovian sense [Ash65]
(equivalently, treating them as equally-weighted priors, in the Bayesian probability sense). That
is, given the word "car", we should treat big, fast, expensive and red as being equi-probable (in
the absence of other information). Equi-probability is an axiom in Bayesian probability (the
axiom of priors), but it derives from the principle of maximum entropy (as any other probability
assignment would have a lower entropy).
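As a minimal numerical check of this last claim: the uniform assignment over the four modifiers of "car" has strictly higher entropy than any skewed assignment. For example:

```python
from math import log2

def entropy(ps):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * log2(p) for p in ps if p > 0)

# Four equi-probable modifiers of "car" versus a skewed assignment:
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]
# The uniform case attains the maximum, log2(4) = 2 bits.
assert entropy(uniform) == 2.0
assert entropy(skewed) < entropy(uniform)
```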
We have described how link types may be learned in an unsupervised setting. Connector
types are then trivially assigned to the left and right words of a word-pair. The dependency
graph, as obtained by linking only those word pairs with a high MI, then allows disjuncts to
be easily extracted, on a sentence-by-sentence basis. At this point, another stage of pattern
recognition may be applied: Given a single word, appearing in many different sentences, one
should presumably find that this word only makes use of a relatively small, limited set of
disjuncts. It is then a counting exercise to determine which disjuncts are occurring the most
often for this word: these then form this word's lexical entry. (This "counting exercise" may
also be thought of as an instance of frequent subgraph mining, as will be elaborated below.)
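The disjunct-counting exercise can be sketched as follows. The representation of a parse as (words, links) triples and the simplified connector naming ("T-" for a link arriving from the left, "T+" for one departing rightward) are assumptions made for the example; the real Link Grammar machinery is richer.

```python
from collections import Counter, defaultdict

def extract_disjuncts(parsed_sentences):
    """parsed_sentences: (words, links) pairs, with links = (i, j, type),
    i < j word positions.  A word's disjunct is the ordered tuple of its
    connectors, and we count disjunct occurrences per word."""
    counts = defaultdict(Counter)
    for words, links in parsed_sentences:
        conns = defaultdict(list)
        for i, j, t in sorted(links):
            conns[i].append(t + "+")   # i links rightward to j
            conns[j].append(t + "-")   # j links leftward to i
        for pos, word in enumerate(words):
            counts[word][tuple(conns[pos])] += 1
    return counts

# Two toy parses: D joins determiner to noun, S joins subject to verb.
sents = [
    (["the", "cat", "runs"], [(0, 1, "D"), (1, 2, "S")]),
    (["the", "dog", "runs"], [(0, 1, "D"), (1, 2, "S")]),
]
dj = extract_disjuncts(sents)
```

Here "cat" is seen once with the disjunct ("D-", "S+"), while "runs" is seen twice with ("S-",); the most frequent disjuncts per word would form its lexical entry.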
A second clustering step may then be applied: it's presumably noticeable that many words
use more-or-less the same disjuncts in syntactic constructions. These can then be grouped into
the same lexical entry. However, we previously generated a different set of word groupings (into
parts of speech), and one may ask: how does that grouping compare to this grouping? is it
close, or can the groupings be refined? If the groupings cannot be harmonized, then perhaps
there is a certain level of detail that was previously missed: perhaps one of the groups should be
split into several parts. Conversely, perhaps one of the groupings was incomplete, and should
be expanded to include more words. Thus, there is a certain back-and-forth feedback between
these different learning steps, with later steps reinforcing or refining earlier steps, forcing a new
revision of the later steps.
45.4.2.1 Loose language
A recognized difficulty with the direct application of Yuret's observation (that the high-MI
word-pair tree is essentially identical to the dependency parse tree) is the flexibility of the
preposition in the English language. The preposition is so widely used, in such a large
variety of situations and contexts, that the mutual information between it and any other word
or word-set is rather low (and thus carries little information). The two-point, pair-
wise mutual entropy provides a poor approximation to what the English language is doing in
this particular case. It appears that the situation can be rescued with the use of a three-point
mutual information (a special case of interaction information).
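The three-point quantity can be computed from the entropies of the joint distribution and its marginals. The sketch below uses the sign convention under which the synergistic XOR distribution (pairwise independent, yet jointly fully dependent) scores +1 bit; other authors use the opposite sign.

```python
from math import log2

def entropy(joint):
    """Shannon entropy of a distribution given as {outcome_tuple: prob}."""
    return -sum(p * log2(p) for p in joint.values() if p > 0)

def marginal(joint, dims):
    out = {}
    for key, p in joint.items():
        k = tuple(key[d] for d in dims)
        out[k] = out.get(k, 0.0) + p
    return out

def interaction_information(joint):
    # I(X;Y;Z) = -[H(X)+H(Y)+H(Z)] + [H(X,Y)+H(X,Z)+H(Y,Z)] - H(X,Y,Z)
    h1 = sum(entropy(marginal(joint, (d,))) for d in range(3))
    h2 = sum(entropy(marginal(joint, dims))
             for dims in ((0, 1), (0, 2), (1, 2)))
    return -h1 + h2 - entropy(joint)

# XOR: every pair of variables is independent (zero pairwise MI),
# yet the triple is fully dependent -- the pattern a two-point
# statistic cannot see.
xor = {(0, 0, 0): 0.25, (0, 1, 1): 0.25,
       (1, 0, 1): 0.25, (1, 1, 0): 0.25}
```

This is the sense in which a preposition's behaviour, nearly invisible to pairwise MI, may still be captured by a three-point statistic over (word, preposition, word) triples.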
The discovery and use of such constructs are described in [PD09]. A similar, related issue can
be termed "the richness of the MV link type in Link Grammar". This one link type, describing
verb modifiers (which includes prepositions) can be applied in a very large class of situations;
as a result, discovering this link type, while at the same time limiting its deployment to only
grammatical sentences, may prove to be a bit of a challenge. Even in the manually maintained
Link Grammar dictionaries, it can present a parsing challenge because so many narrower cases
can often be treated with an MV link. In summary, some constructions in English are so flexible
that it can be difficult to discern a uniform set of rules for describing them; certainly, pair-wise
mutual information seems insufficient to elucidate these cases.
Curiously, these more challenging situations occur primarily with more complex sentence
constructions. Perhaps the flexibility is associated with the difficulty that humans have with
composing complex sentences; short sentences are almost 'set phrases', while longer sentences
can be a semi-grammatical jumble. In any case, some of the trouble might be avoided by initially
limiting the corpus to smaller, easier sentences, perhaps by working with children's literature.
45.4.2.2 Elaboration of the Syntactic Learning Loop
We now reiterate the syntactic learning process described above in a more systematic way. By
getting more concrete, we also make certain assumptions, and restrictions, some of which may
end up getting changed or lifted in the course of implementation and detailed exploration of
the overall approach. What is discussed in this section is merely one simple, initial approach to
concretizing the core language learning loop we envision in a syntactic context.
Syntax, as we consider it here, involves the following basic entities:
• words
• categories of words
• "co-occurrence links", each one defined as (in the simplest case) an ordered pair or triple
of words, labeled with an uncertain truth value
• "syntactic link types", each one defined as a certain set of ordered pairs of words
• "disjuncts", each one associated with a particular word w, and consisting of an ordered
set of link types involving the word w. That is, each of these links contains at least one
word-pair containing w as first or second argument. (This nomenclature here comes from
Link Grammar; each disjunct is a conjunction of link types. A word is associated with a
set of disjuncts. In the course of parsing, one must choose between the multiple disjuncts
associated with a word, to fulfill the constraints required of an appropriate parse structure.)
An elementary version of the basic syntactic language learning loop described above would take
the following form:
1. Search for high-MI word pairs. Define one's usage links as the given co-occurrence links
2. Cluster words into categories based on the similarity of their associated usage links
• Note that this will likely be a tricky instance of clustering, and classical clustering
algorithms may not perform well. One interesting, less standard approach would be to
use OpenCog's MOSES algorithm [Loo06] to learn an array of program trees,
each one serving as a recognizer for a single cluster, in the same general manner done
with Genetic Programming in [BE07].
3. Define initial syntactic link types from categories that are joined by large bundles of usage
links
• That is, if the words in category C1 have a lot of usage links to the words in category C2,
then create a syntactic link type whose elements are (w1, w2), for all w1 ∈ C1, w2 ∈ C2.
4. Associate each word with an extended set of usage links, consisting of: its existing usage
links, plus the syntactic links that one can infer for it based on the categories the word
belongs to. One may also look at chains of (e.g.) 2 syntactic links originating at the word.
• For example, suppose cat ∈ C1 and C1 has syntactic link type L1. Suppose (cat, eat) and
(dog, run) are both in L1. Then if there is a sentence "The cat likes to run", the link
type L1 lets one infer the syntactic link cat → run. The frequency of this syntactic link in a
relevant corpus may be used to assign it an uncertain truth value.
• Given the sentence "The cat likes to run in the park," a chain of syntactic links such
as cat → run → park may be constructed.
5. Return to Step 2, but using the extended set of usage links produced in Step 4, with the
goal of refining both clusters and the set of link types for accuracy. Initially, all categories
contain one word each, and there is a unique link type for each pair of categories. This is
an inefficient representation of language, and so the goal of clustering is to have a relatively
small set of clusters and link types, with many words/word-pairs assigned to each. This can
be done by maximizing the sum of the logarithms of the sizes of the clusters and link types;
that is, by maximizing entropy. Since the category assignments depend on the link types, and
vice versa, a very large number of iterations of the loop are likely to be required. Based on
the current Link Grammar English dictionaries, one expects to discover hundreds of link
types (or more, depending on how subtypes are counted), and perhaps a thousand word
clusters (most of these corresponding to irregular verbs and idiomatic phrases).
Many variants of this same sort of process are conceivable, and it's currently unclear what sort
of variant will work best. But this kind of process is what one obtains when one implements
the basic language learning loop described above on a purely syntactic level.
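Steps 1-3 of the elementary loop above can be sketched end to end with deliberately crude stand-ins: adjacency in place of high-MI pairs, exact shared contexts in place of proper clustering, and a toy four-sentence corpus. All of these substitutions are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

def usage_links(sentences):
    # Step 1 stand-in: adjacent word pairs play the role of high-MI links.
    pairs = Counter()
    for s in sentences:
        w = s.split()
        pairs.update(zip(w, w[1:]))
    return set(pairs)

def categories_from(links):
    # Step 2 stand-in: words sharing a right-hand context form a category;
    # uncovered words get singleton categories of their own.
    by_context = defaultdict(set)
    for left, right in links:
        by_context[right].add(left)
    cats = {frozenset(g) for g in by_context.values() if len(g) > 1}
    covered = set().union(*cats) if cats else set()
    words = {w for p in links for w in p}
    return cats | {frozenset({w}) for w in words - covered}

def link_types(links, cats):
    # Step 3: a link type is a pair of categories bridged by usage links.
    return {(c1, c2) for c1 in cats for c2 in cats
            if any((a, b) in links for a in c1 for b in c2)}

corpus = ["the cat runs", "the dog runs",
          "the cat sleeps", "the dog sleeps"]
links = usage_links(corpus)
cats = categories_from(links)
types = link_types(links, cats)
```

On this corpus, cat and dog fall into one category (they share the contexts runs and sleeps), and three link types emerge: determiner-to-noun, and noun-to-each-verb. Steps 4-5 would then feed the inferred links back in and re-cluster.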
How might one integrate semantic understanding into this syntactic learning loop? Once one
has semantic relationships associated with a word, one uses them to generate new "usage links"
for the word, and includes these usage links in the algorithm from Step 1 onwards. This may
be done in a variety of different ways, and one may give different weightings to syntactic versus
semantic usage links, resulting in the learning of different links.
The above process would produce a large set of syntactic links between words. A further series
of steps then follows; these may be carried out concurrently with the above steps, as soon
as Step 4 has been reached for the first time.
1. This syntactic graph (with nodes as words and syntactic links joining them) may then be
mined, using a variety of graph mining tools, to find common combinations of links. This
gives the "disjuncts" mentioned above.
2. Given the set of disjuncts, one carries out parsing using a process such as link parsing or
word grammar parsing, thus arriving at a set of parses for the sentences in one's reference
corpus. Depending on the nature of one's parser, these parses may be ranked according to
semantic plausibility. Each parse may be viewed as a directed acyclic graph (dag), usually
a tree, with words at the nodes and syntactic-link type labels on the links.
3. One can now define new usage links for each word: namely, the syntactic links occurring
in sentence parses, containing the word in question. These links may be weighted based on
the weights of the parses they occur in.
4. One can now return to Step 2 using the new usage links, alongside the previous ones.
Weighting these usage links relative to the others may be done in various ways.
Several subtleties have been ignored in the above, such as the proper discovery and treatment of
idiomatic phrases, the discovery of sentence boundaries, the handling of embedded data (price
quotes, lists, chapter titles, etc.), as well as the potential speed bump presented by prepositions.
Fleshing out the details of this loop into a workable, efficient design is the primary engineering
challenge. This will take significant time and effort.
45.4.3 Learning Semantics
Syntactic relationships provide only the shallowest interpretation of language; semantics comes
next. One may view semantic relationships (including semantic relationships close to the syntax
level, which we may call "syntactico-semantic" relationships) as ensuing from syntactic relation-
ships, via a similar but separate learning process to the one proposed above. Just as our approach
to syntax learning is heavily influenced by our work with Link Grammar, our approach to seman-
tics is heavily influenced by our work on the RelEx system [RVC03, GPPG06],
which maps the output of the Link Grammar parser into a more abstract, semantic form. Proto-
type systems [Goe10b] have also been written mapping the output of RelEx into even
more abstract semantic form, consistent with the semantics of the Probabilistic Logic Networks
[GIK08] formalism as implemented in CogPrime. These systems are largely based on hand-
coded rules, and thus not in the spirit of language learning pursued in this proposal. However,
they display the same structure that we assume here; the difference being that here we specify a
mechanism for learning the linguistic content that fills in the structure via unsupervised corpus
learning, obviating the need for hand-coding.
Specifically, we suggest that discovery of semantic relations requires the implementation of
something similar to [LP01], except that this work needs to be generalized from 2-point relations
to 3-point and N-point relations, roughly as described in [PD09]. This allows the automatic,
unsupervised recognition of synonymous phrases, such as "Texas borders on Mexico" and "Texas
is next to Mexico", to extract the general semantic relation next_to(X, Y), and the fact that
this relation can be expressed in one of several different ways.
At the simplest level, in this approach, semantic learning proceeds by scanning the corpus
for sentences that use similar or the same words, yet employ them in a different order, or have
point substitutions of single words, or of small phrases. Sentences which are very similar, or
identical, save for one word, offer up candidates for synonyms, or sometimes antonyms. Sentences
which use the same words, but in seemingly different syntactic constructions, are candidates
for synonymous sentences. These may be used to extract semantic relations: the recognition of
sets of different syntactic constructions that carry the same meaning.
In essence, similar contexts must be recognized, and then word and word-order differences
between these other-wise similar contexts must be compared. There are two primary challenges:
how to recognize similar contexts, and how to assign probabilities.
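The single-word-substitution case can be sketched directly: group sentences that are identical except at one slot, and collect the words that alternate there. The template scheme and toy sentences are assumptions for illustration; the harder synonymous-phrase case ("borders on" versus "is next to") requires the more general N-point machinery discussed below.

```python
from collections import defaultdict

def synonym_candidates(sentences):
    """Sentences identical except at one slot: the words alternating in
    that slot become synonym (or antonym) candidates."""
    slots = defaultdict(set)
    for s in sentences:
        words = s.split()
        for i in range(len(words)):
            # Replace position i with a wildcard to form the template.
            template = tuple(words[:i] + ["_"] + words[i + 1:])
            slots[template].add(words[i])
    return {t: ws for t, ws in slots.items() if len(ws) > 1}

sents = ["Texas borders on Mexico",
         "Texas borders on Oklahoma",
         "Texas is next to Mexico"]
cands = synonym_candidates(sents)
```

Here the template ("Texas", "borders", "on", "_") yields the candidate set {Mexico, Oklahoma}; the differently-structured third sentence is untouched by this simple pass.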
The work of [PD09] articulates solutions to both challenges. For the first, it describes a
general framework in which relations such as next_to(X, Y) can be understood as lambda-
expressions λx.λy.next_to(x, y), so that one can employ first-order logic constructions in place
of graphical representations. This is partly a notational trick; it just shows how to split up
input syntactic constructions into atoms and terms, for which probabilities can be assigned.
For the second challenge, they show how probabilities can be assigned to these expressions,
by making explicit use of the notions of conditional random fields (or rather, a certain special
case, termed Markov Logic Networks). Conditional random fields, or Markov networks, are a
certain mathematical formalism that provides the most general framework in which entropy
maximization problems can be solved: roughly speaking, it can be understood as a means of
properly distributing probabilities across networks. Unfortunately, this work is quite abstract
and rather dense. A much easier introduction to the general idea can be obtained from
[LP01]; unfortunately, the latter fails to provide the general N-point case needed for semantic
relations in general, and also fails to consider the use of maximum entropy principles to obtain
similarity measures.
The above can be used to extract synonymous constructions, and, in this way, semantic
relations. However, neither of the above references deal with distinguishing different meanings
for a given word. That is, while eats(X, Y) might be a learnable semantic relation, the sentence
"He ate it" does not necessarily justify its use. Of course, "He ate it" is an idiomatic expression
meaning "he crashed", which should be associated with the semantic relation crash(X), not
eat(X, Y). There are global textual clues that this may be the case: trouble resolving the reference
"it", and a lack of mention of foodstuffs in neighboring sentences. A viable yet simple algorithm
for the disambiguation of meaning is offered by the Mihalcea algorithm [MTF04, SM07].
This is an application of the (Google) PageRank algorithm to word senses, taken across words
appearing in multiple sentences. The premise is that the correct word-sense is the one that is
most strongly supported by senses of nearby words; a graph between word senses is drawn,
and then solved as a Markov chain. In the original formulation, word senses are defined by
appealing to WordNet, and affinity between word-senses is obtained via one of several similarity
measures. Neither of these can be applied in learning a language de novo. Instead, these must
both be deduced by clustering and splitting, again. So, for example, it is known that word senses
correlate fairly strongly with disjuncts (based on the authors' unpublished experiments), and thus,
a reasonable first cut is to presume that every different disjunct in a lexical entry conveys a
different meaning, until proved otherwise. The above-described discovery of synonymous phrases
can then be used to group different disjuncts into a single "word sense". Disjuncts that remain
ungrouped after this process are already considered to have distinct senses, and so can be used
as distinct senses in the Mihalcea network.
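The PageRank-over-senses idea can be sketched as follows. The sense graph below (bell/1 as the object sense supported by ring and chime, bell/2 as a weakly supported alternative) is entirely hypothetical, and the plain power-iteration PageRank is a generic stand-in for the Mihalcea formulation.

```python
def pagerank(graph, damping=0.85, n_iter=50):
    """Power-iteration PageRank; graph maps each sense node to the
    list of sense nodes it supports."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(n_iter):
        new = {}
        for n in nodes:
            incoming = sum(rank[m] / len(graph[m])
                           for m in nodes if n in graph[m])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank

# Hypothetical sense graph: bell/1 is mutually supported by ring/1 and
# chime/1, while bell/2 receives support only from ring/1.
graph = {
    "bell/1": ["ring/1", "chime/1"],
    "bell/2": ["ring/1"],
    "ring/1": ["bell/1", "bell/2", "chime/1"],
    "chime/1": ["bell/1", "ring/1"],
}
rank = pagerank(graph)
```

As expected, the better-supported sense bell/1 ends up with the higher rank, and would be selected as the word's reading in this context.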
Sense similarity measures can then be developed by using the above-discovered senses, and
measuring how well they correlate across different texts. That is, if the word "bell" occurs multi-
ple times in a sequence of paragraphs, it is reasonable to assume that each of these occurrences
are associated with the same meaning. Thus, each distinct disjunct for the word "bell" can
then be presumed to still convey the same sense. One now asks, what words co-occur with
the word "ben"? The frequent appearance of "chime" and "ring" can and should be noted. In
essence, one Ls once-again computing word-pair mutual information, except that now, instead
of limiting word-pairs to be words that are near each other, they can instead involve far-away
words, several sentences apart. One can then expand the word sense of "bell" to include a list of
co-occurring words (and indeed, this is the slippery slope leading to set phrases and eventually
idioms).
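The long-range co-occurrence computation sketched above amounts to counting which word pairs share a paragraph more often than chance, then taking pointwise mutual information. The toy "corpus" below is invented for illustration.

```python
# Sketch of long-range word-pair mutual information: count co-occurrence
# within the same paragraph (not just adjacency), then compute PMI.
import math
from collections import Counter
from itertools import combinations

paragraphs = [
    ["the", "bell", "chimed", "and", "the", "ring", "echoed"],
    ["a", "bell", "will", "ring", "at", "noon"],
    ["he", "chimed", "in", "during", "the", "talk"],
]

word_count = Counter()
pair_count = Counter()
for para in paragraphs:
    vocab = set(para)                 # count each word once per paragraph
    word_count.update(vocab)
    for a, b in combinations(sorted(vocab), 2):
        pair_count[(a, b)] += 1

n = len(paragraphs)

def pmi(a, b):
    """Pointwise mutual information of co-occurring within a paragraph."""
    key = tuple(sorted((a, b)))
    if pair_count[key] == 0:
        return float("-inf")
    p_ab = pair_count[key] / n
    return math.log2(p_ab / ((word_count[a] / n) * (word_count[b] / n)))

print(pmi("bell", "ring"))    # positive: they co-occur more than chance
print(pmi("bell", "during"))  # -inf here: never co-occur in this toy corpus
```

On real text one would smooth the zero counts and threshold the PMI, since, as noted below, the raw co-occurrence list over several paragraphs is mostly noise.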
Failures of co-occurrences can also further strengthen distinct meanings. Consider "he chimed
in" and "the bell chimed". In both cases, chime is a verb. In the first sentence, chime carries
the disjunct S- & K+ (here, K+ is the standard Link Grammar connector to particles) while
the second has only the simpler disjunct S-. Thus, based on disjunct usage alone, one already
suspects that these two have a different meaning. This is strengthened by the lack of occurrence
of words such as "belt' or "ring" in the first case, with a frequent observation of words pertaining
to talking.
There is one final trick that must be applied in order to get reasonably rapid learning; this
can be loosely thought of as "the sigmoid function trick of neural networks", though it may also
be manifested in other ways not utilizing specific neural net mathematics. The key point is that
semantics intrinsically involves a variety of uncertain, probabilistic and fuzzy relationships; but
in order to learn a robust hierarchy of semantic structures, one needs to iteratively crispen these
fuzzy relationships into strict ones.
In much of the above, there is a recurring need to categorize, classify and discover similarity.
The naivest means of doing so is by counting, and applying basic probability (Bayesian, Marko-
vian) to the resulting counts to deduce likelihoods. Unfortunately, such formulas distribute
probabilities in essentially linear ways (i.e. form a linear algebra), and thus have a rather poor
ability to discriminate or distinguish (in the sense of receiver operating characteristics, of dis-
criminating signal from noise). Consider the last example: the list of words co-occurring with
chime, over the space of a few paragraphs, is likely to be tremendous. Most of this is surely
noise. There is a trick to over-coming this that is deeply embedded in the theory of neural
networks, and yet completely ignored in probabilistic (Bayesian, Markovian) networks: the sig-
moid function. The sigmoid function serves to focus in on a single stimulus, and elevate its
importance, and, at the same time, strongly suppress all other stimuli. In essence, the sigmoid
function looks at two probabilities, say 0.55 and 0.45, and says "let's pretend the first one is 0.9
and the second one is 0.1, and move forward from there". It builds in a strong discrimination
to all inputs. In the language of standard, text-book probability theory, such discrimination is
utterly unwarranted; and indeed, it is. However, applying strong discrimination to learning can
help speed learning by converting certain vague impressions into certainties. These certainties
can then be built upon to obtain additional certainties, or to be torn apart, as needed.
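The "sigmoid trick" described above can be made concrete with a steep logistic centered at 0.5, which pushes near-even probabilities toward 0 or 1. The gain value below is an arbitrary illustrative choice; any sufficiently steep monotone squashing function plays the same role.

```python
# Sketch of sigmoid "crispening": a steep logistic centered at 0.5 maps
# 0.55 to roughly 0.88 and 0.45 to roughly 0.12, building in strong
# discrimination. The gain of 40 is an arbitrary choice.
import math

def crispen(p, gain=40.0):
    """Steep logistic centered at 0.5; higher gain means sharper decisions."""
    return 1.0 / (1.0 + math.exp(-gain * (p - 0.5)))

print(round(crispen(0.55), 3))  # 0.881
print(round(crispen(0.45), 3))  # 0.119
```

As the text notes, this transformation is unjustified in textbook probability terms; its value is purely in speeding learning by treating weak preferences as provisional certainties.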
Thus, in all of the above efforts to gauge the similarity between different things, it is useful
to have a sharp yes/no answer, rather than a vague muddling with likelihoods. In some of
the above-described algorithms, this sharpness is already built in: so, Yuret approximates the
mutual information of an entire sentence as the sum of mutual information between word pairs;
the smaller, unlikely corrections are discarded. Clearly, they must also be revived in order to
handle prepositions. Something similar must also be done in the extraction of synonymous
phrases, semantic relations, and meaning; the domain is that much likelier to be noisy, and
thus, the need to discriminate signal from noise is that much more important.
45.4.3.1 Elaboration of the Semantic Learning Loop
We now provide a more detailed elaboration of a simple version of the general semantic learning
process described above. The same caveat applies here as in our elaborated description of
syntactic learning above: the specific algorithmic approach outlined here is a simple instantiation
of the general approach we have in mind, which may well require refinement based on lessons
learned during experimentation and further theoretical analysis.
One way to do semantic learning, according to the approach outlined above, is as follows:
1. An initial semantic corpus is posited, whose elements are parse graphs produced by the
syntactic process described earlier
2. A semantic relationship set (or rel-set) is computed from the semantic corpus, via calculat-
ing the frequent (or otherwise statistically informative) subgraphs occurring in the elements
of the corpus. Each node of such a subgraph may contain a word, a category or a variable;
the links of the subgraph are labeled with (syntactic, or semantic) link types. Each parse
graph is annotated with the semantic graphs associated with the words it contains (ex-
plicitly: each word in a parse graph may be linked via a ReferenceLink to each variable or
literal within a semantic graph that corresponds to that word in the context of the sentence
underlying the parse graph.)
• For instance, the link combination v1 → v2 → v3 may commonly occur (representing
the standard Subject-Verb-Object (SVO) structure)
• In this case, for the sentence "The rock broke the window," we would have ReferenceLinks
such as rock → v1, connecting nodes (such as the "rock" node) in the parse
structure with nodes (such as v1) in the associated semantic subgraph.
3. Rel-sets are divided into categories based on the similarities of their associated semantic
graphs.
• This division into categories manifests the sigmoid-function-style crispening mentioned
above. Each rel-set will have similarities to other rel-sets, to varying fuzzy degrees.
Defining specific categories turns a fuzzy web of similarities into crisp categorial bound-
aries; which involves some loss of information, but also creates a simpler platform for
further steps of learning.
• Two semantic graphs may be called "associated" if they have a nonempty intersection.
The intersection determines the type of association involved. Similarity assessment be-
tween graphs G and H may involve estimation of which graphs G and H are associated
with in which ways.
• For instance, "The cat ate the dog" and "The frog was eaten by the walrus" represent
the semantic structure eat(cat,dog) in two different ways. In link parser terminology,
they do so respectively via the subgraphs g1 = v1 → v2 → v3 and g2 = v1 → v2 →
v3 → v4 → v5. These two semantic graphs will have a lot of the same associations.
For instance, in our corpus we may have "The big cat ate the dog in the morning"
(including big → cat) and also "The big frog was eaten by the walrus in the morning"
(including big → frog), meaning that big → v1 is a graph commonly associated with
both g1 and g2. Due to having many commonly associated graphs like this, g1 and g2
are likely to be assigned to a common cluster.
4. Nodes referring to these categories are added to the parse graphs in the semantic corpus.
Most simply, a category node C is assigned a link of type L pointing to another node x,
if any element of C has a link of type L pointing to x. (More sophisticated methods of
assigning links to category nodes may also be worth exploring.)
• If g1 and g2 have been assigned to a common category C, then "I believe the pig ate
the horse" and "I believe the law was invalidated by the revolution" will both appear
as instantiations of the graph g3 = believe → C. This g3 is compact because of
the recognition of C as a cluster, leading to its representation as a single symbol. The
recognition of g3 will occur in Step 2 the next time around the learning loop.
5. Return to Step 2, with the newly enriched semantic corpus. As before, one wants to discover
not too many and not too few categories; again, the appropriate solution to this problem
appears to be entropy maximization. That is, during the frequent subgraph mining stages,
one maintains counts of how often these occur in the corpus; from these, one constructs the
equivalent of the mutual information associated with the subgraphs; categorization requires
maximizing the sum of the log of the sizes of the categories.
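One pass of Steps 2-4 above can be caricatured as follows: "semantic graphs" are sets of labeled edges, and rel-sets whose associated context edges overlap strongly are merged into a category, after which the corpus can be re-expressed using the category symbol. The data structures and the overlap threshold here are illustrative inventions; real frequent-subgraph mining and entropy-based category sizing are far richer.

```python
# Simplified sketch of one semantic-learning-loop pass: cluster core
# patterns (rel-sets) by shared association edges, as in the g1/g2 example.

# Each corpus element: (core pattern, set of context edges seen with it).
corpus = [
    (("v1", "eat", "v2"), {("big", "mod", "v1"), ("morning", "time", "eat")}),
    (("v1", "eaten-by", "v2"), {("big", "mod", "v1"), ("morning", "time", "eat")}),
    (("v1", "see", "v2"), {("clearly", "mod", "see")}),
]

def cluster_by_association(corpus, min_shared=2):
    """Group core patterns whose context-edge sets share >= min_shared edges."""
    categories = []
    for pattern, context in corpus:
        placed = False
        for cat in categories:
            if len(cat["context"] & context) >= min_shared:
                cat["members"].add(pattern)
                cat["context"] |= context   # category absorbs new associations
                placed = True
                break
        if not placed:
            categories.append({"members": {pattern}, "context": set(context)})
    return categories

cats = cluster_by_association(corpus)
# The active and passive "eat" patterns share two context edges,
# so they land in one category; "see" stands alone.
print([sorted(c["members"]) for c in cats])
```

Each discovered category would then be added back to the corpus as a single node (Step 4), making compound patterns like believe → C discoverable on the next pass.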
As noted earlier, these semantic relationships may be used in the syntactic phase of language
understanding in two ways:
• Semantic graphs associated with words may be considered as "usage links" and thus included
as part of the data used for syntactic category formation.
• During the parsing process, full or partial parses leading to higher-probability semantic
graphs may be favored.
45.5 The Importance of Incremental Learning
The learning process described here builds up complex syntactic and semantic structures from
simpler ones. To start it, all one needs are basic before and after relationships derived from a
corpus. Everything else is built up from there, given the assumption of appropriate syntactic
and semantic formalisms and a semantics-guided syntax parser.
As we have noted, the series of learning steps we propose falls into the broad category of
"deep learning", or of hierarchical modeling. That is, learning must occur at several levels at
once, each reinforcing, and making use of results from another. Link types cannot be identified
until word clusters are found, and word clusters cannot be found until word-pair relationships
are discovered. However, once link-types are known, these can be then used to refine clusters
and the selected word-pair relations. Further, the process of finding word clusters - both pre
and post parsing - relies on a hierarchical build-up of clusters, each phase of clustering utilizing
results of the previous "lower level" phase.
However, for this bootstrapping learning to work well, one will likely need to begin with
simple language, so that the semantic relationships embodied in the text are not that far
removed from the simple before/after relationships. The complexity of the texts may then be
ramped up gradually. For instance, the needed effect might be achieved via sorting a very large
corpus in order of increasing reading level.
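The sorting-by-reading-level idea can be sketched with a crude difficulty proxy, here average sentence length plus average word length. A real curriculum builder would use an established readability metric; this proxy and the sample texts are invented.

```python
# Sketch of ordering a corpus by increasing reading level, using a
# crude difficulty proxy (average sentence length + average word length).
import re

def difficulty(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    avg_sent_len = len(words) / max(len(sentences), 1)
    avg_word_len = sum(len(w.strip(".,!?")) for w in words) / max(len(words), 1)
    return avg_sent_len + avg_word_len

texts = [
    "Notwithstanding prior stipulations, the aforementioned procedure terminates.",
    "The cat sat. The dog ran.",
    "Birds fly south when winter comes to the northern forests.",
]
curriculum = sorted(texts, key=difficulty)
print(curriculum[0])  # simplest text first
```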
45.6 Integrating Language Learned via Corpus Analysis into
CogPrime's Experiential Learning
Supposing everything in this chapter were implemented and tested and worked reasonably well
as envisioned. What would this get us in terms of progress toward AGI?
Arguably, with a relatively modest additional effort, it could get us a natural language
question answering system, answering a variety of questions based on the text corpus available
to it. One would have to use the learned rules for language generation, but the methods of
Chapter 46 would likely suffice for that.
Such a dialogue system would be a valuable achievement in its own right, of scientific,
commercial and humanistic interest - but of course, it wouldn't be AGI. To get something
approaching AGI from this sort of effort, one would have to utilize additional reasoning and
concept creation algorithms to enable the answering of questions based on knowledge not stored
explicitly in the provided corpus. The dialogue system would have to be able to piece together
new answers from various fragmentary, perhaps contradictory, pieces of information contained
in the corpus. Ultimately, we suspect, one would need something like the CogPrime architec-
ture, or something else with a comparable level of sophistication, to appropriately leverage the
information extracted from texts via the learned language rules.
An open question, as indicated above, is how much of language a corpus-based
language learning system like the one outlined here would miss, assuming a massive but realistic
corpus (say, a significant fraction of the Web). This is unresolved and ultimately will only be
determined via experiment. Our suspicion is that a very large percentage of language can be
understood via these corpus-based methods. But there may be exceptions that would require
an unrealistically large corpus size.
As a simple example, consider the ability to interpret vaguely given spatial directions like
"Go right out the door, past a few curves in the road, then when you get to a hill with a big
red house on it (well not that big, but bigger than most of the others you'll see on the walk),
start heading down toward the water, till the brush gets thick, then start heading left.... Follow
the ground as it rises and eventually you'll see the lake." Of course, it is theoretically possible
for an AGI system to learn to interpret directions like this purely via corpus analysis. But it
seems the task would be a lot easier for an AGI endowed with a body so that it could actually
experience routes like the one being described. And space and time are not the only source of
relevant examples; social and emotional reasoning have a similar property. Learning to interpret
language about these from reading is certainly possible, but one will have an easier time and
do a better job if one is out in the world experiencing social and emotional life oneself.
Even if there turn out to be significant limitations regarding what can be learned in practice
about language via corpus analysis, though, it may still prove a valuable contributor to the
mind of a CogPrime system. As compared to hand-coded rules, comparably abstract linguistic
knowledge achieved via statistical corpus analysis should be much easier to integrate with the
results of probabilistic inference and embodied learning, due to its probabilistic weighting and
its connection with the specific examples that gave rise to it.
Chapter 46
Natural Language Generation
Co-authored with Ruiting Lian and Rui Liu
46.1 Introduction
Language generation, unsurprisingly, shares most of the key features of language comprehension
discussed in chapter 44 - after all, the division between generation and comprehension is to
some extent an artificial convention, and the two functions are intimately bound up both in the
human mind and in the CogPrime architecture.
In this chapter we discuss language generation, in a manner similar to the previous chapter's
treatment of language comprehension. First we discuss our currently implemented, "engineered"
language generation system, and then we discuss some alternative approaches:
• how a more experiential-learning based system might be made by retaining the basic struc-
ture of the engineered system but removing the "pre-wired" contents.
• how a "Sem2Syn" system might be made, via reversing the Syn2Sem system described in
Chapter 44. This is the subject of implementation effort, at time of writing.
At the start of Chapter 44 we gave a high-level overview of a typical NL generation pipeline.
Here we will focus largely but not entirely on the "syntactic and morphological realization"
stage, which we refer to for simplicity as "sentence generation" (taking a slight terminological
liberty, as "sentence fragment generation" is also included here). All of the stages of language
generation are important, and there is a nontrivial amount of feedback among them. However,
there is also a significant amount of autonomy, such that it often makes sense to analyze each
one separately and then tease out its interactions with the other stages.
46.2 SegSim for Sentence Generation
The sentence generation approach currently taken in OpenCog (from 2009 to early 2012), which
we call SegSim, is relatively simple and is depicted in Figure 46.1 and described as follows:
1. The NL generation system stores a large set of pairs of the form (semantic structure,
syntactic/morphological realization)
2. When it is given a new semantic structure to express, it first breaks this semantic structure
into natural parts, using a set of simple syntactic-semantic rules
3. For each of these parts, it then matches the parts against its memory to find relevant pairs
(which may be full or partial matches), and uses these pairs to generate a set of syntactic
realizations (which may be sentences or sentence fragments)
4. If the matching has failed, then (a) it returns to Step 2 and carries out the breakdown
into parts again. But if this has happened too many times, then (b) it resorts to a
different algorithm (most likely a search or optimization based approach, which is more
computationally costly) to determine the syntactic realization of the part in question.
5. If the above step generated multiple fragments, they are pieced together, and a certain rating
function is used to judge if this has been done adequately (using criteria of grammaticality
and expected comprehensibility, among others). If this fails, then Step 3 is tried again on
one or more of the parts; or Step 2 is tried again. (Note that one option for piecing the
fragments together is to string together a number of different sentences; but this may not
be judged optimal by the rating function.)
6. Finally, a "cleanup" phase is conducted, in which correct morphological forms are inserted,
and articles and certain other "function words" are inserted.
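The six steps above can be caricatured in a few lines. The relation format, memory contents, decomposition, and joining rule below are all invented simplifications; real SegSim uses subgraph matching over a large parsed corpus and a grammaticality rating function.

```python
# Minimal sketch of the SegSim flow: a paired memory of
# (semantic relations -> surface fragment), per-predicate decomposition,
# lookup, fragment joining, and a trivial cleanup pass.

memory = {
    frozenset({("_subj", "eat", "cat"), ("_obj", "eat", "fish")}):
        "the cat eats the fish",
    frozenset({("_subj", "sleep", "dog")}):
        "the dog sleeps",
}

def decompose(relations):
    """Step 2: group relations into parts, one part per predicate."""
    parts = {}
    for rel_type, pred, arg in relations:
        parts.setdefault(pred, set()).add((rel_type, pred, arg))
    return [frozenset(p) for _, p in sorted(parts.items())]

def generate(relations):
    fragments = []
    for part in decompose(relations):
        match = memory.get(part)          # Step 3: match against memory
        if match is None:
            return None                   # Step 4 would re-decompose or search
        fragments.append(match)
    joined = " and ".join(fragments)      # Step 5: piece fragments together
    return joined[0].upper() + joined[1:] + "."  # Step 6: cleanup

sem = {("_subj", "eat", "cat"), ("_obj", "eat", "fish"),
       ("_subj", "sleep", "dog")}
print(generate(sem))  # The cat eats the fish and the dog sleeps.
```

In the real system the memory lookup is a customized subgraph match allowing partial hits, and the joining step consults the Link Parser's dictionary for dangling connectors rather than naively conjoining fragments.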
The specific OpenCog software implementing the SegSim algorithm is called "NLGen"; this
is an implementation of the SegSim concept that focuses on sentence generation from RelEx
semantic relationships. In the current (early 2012) NLGen version, Step 1 is handled in a very
simple way using a relational database; but this will be modified in future so as to properly
use the AtomSpace. Work is currently underway to replace NLGen with a different "Sem2Syn"
approach, that will be described at the end of this chapter. But discussion of NLGen is still
instructive regarding the intersection of language generation concepts with OpenCog concepts.
The substructure currently used in Step 2 is defined by the predicates of the sentence, i.e.
we define one substructure for each predicate, which can be described as follows:
Predicate(Argument_i (Modify_j))
where
• 1 ≤ i ≤ m and 0 ≤ j ≤ n, where m and n are integers
• "Predicate" stands for the predicate of the sentence, corresponding to the variable $0 of the
RelEx relationship _subj($0, $1) or _obj($0, $1)
• Argument_i is the i-th semantic parameter related with the predicate
• Modify_j is the j-th modifier of Argument_i
If there is more than one predicate, then multiple subnets are extracted analogously.
For instance, given the sentence "I happily study beautiful mathematics in beautiful China
with beautiful people," the substructure can be defined as in Figure 46.2.
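The substructure extraction just described can be sketched as follows. The RelEx-style relation set below is a hand-made approximation of what the parser might emit for the example sentence; the relation names are assumptions for illustration.

```python
# Sketch of Step-2 substructure extraction: from RelEx-style binary
# relations, build a Predicate(Argument_i(Modify_j)) grouping.

relations = [
    ("_subj", "study", "I"),
    ("_obj", "study", "mathematics"),
    ("at", "study", "China"),
    ("with", "study", "people"),
    ("_amod", "mathematics", "beautiful"),
    ("_amod", "China", "beautiful"),
    ("_amod", "people", "beautiful"),
    ("_advmod", "study", "happily"),
]

def extract_substructure(relations, predicate):
    """Collect the predicate's arguments, each with its own modifier list."""
    args = {}
    for rel, head, dep in relations:
        if head == predicate and rel != "_advmod":   # adverbs modify the predicate
            args[dep] = []
    for rel, head, dep in relations:
        if rel == "_amod" and head in args:
            args[head].append(dep)
    return {predicate: args}

print(extract_substructure(relations, "study"))
```

With more than one predicate in a sentence, this extraction would simply be repeated per predicate, yielding one subnet each, as the text notes.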
For each of these substructures, Step 3 is supposed to match the substructures of a sentence
against its global memory (which contains a large body of previously encountered 'semantic
structure, syntactic/morphological realization' pairs) to find the most similar or same substruc-
tures and the relevant syntactic relations to generate a set of syntactic realizations, which may
be sentences or sentence fragments. In our current implementation, a customized subgraph
matching algorithm has been used to match the subnets from the parsed corpus at this step.
If Step 3 generated multiple fragments, they must be pieced together. In Step 4, the Link
Parser's dictionary has been used for detecting the dangling syntactic links corresponding to the
fragments, which can be used to integrate the multiple fragments. For instance, in the example
of Figure 46.3, according to the last 3 steps, SegSim would generate two fragments: "the parser
Fig. 46.1: An Overview of the SegSim Architecture for Language Generation
will ignore the sentence" and "whose length is too long". Then it consults the Link Parser's
dictionary, and finds that "whose" has a connector "Mr-", which is used for relative clauses
involving "whose", to connect to the previous noun "sentence". Analogously, we can integrate
the other fragments into a whole sentence.
Finally, a "cleanup" or "post-processing" phase is conducted, applying the correct inflections
to each word depending on the word properties provided by the input RelEx relations. For
example, we can use the RelEx relation "DEFINITE-FLAG(cover, T)" to insert the article "the"
in front of the word "cover". We have considered five factors in this version of NLGen: article,
Fig. 46.2: Example of a substructure
Fig. 46.3: Linkage of an example
noun plural, verb tense, possessive and query type (the latter only for interrogative
sentences).
In the "cleanup" step, we also use the chunk parser tool from OpenNLP4 for adjusting the
position of an article being inserted. For instance, consider the proto-sentence "I have big red
apple." If we use the RelEx relation "noun_number(apple, singular)" to inflect the word "apple"
directly, the final sentence will be "I have big red an apple", which is not well-formed. So we use
the chunk parser to detect the phrase "big red apple" first, then apply the article rule in front
of the noun phrase. This is a pragmatic approach which may be replaced with something more
elegant and principled in later revisions of the NLGen system.
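The article-placement fix just described can be sketched as follows: before inserting an article for a noun, scan left over the adjective run so the article lands in front of the whole noun phrase. The tiny adjective lexicon here stands in for the OpenNLP chunker and is purely illustrative.

```python
# Sketch of the cleanup step: insert an article before the full noun
# phrase ("a big red apple"), not directly before the noun
# ("big red an apple"). A toy adjective set replaces the chunker.

ADJECTIVES = {"big", "red", "small", "beautiful"}

def insert_article(words, noun, article):
    """Insert `article` before the adjective run preceding `noun`."""
    i = words.index(noun)
    start = i
    while start > 0 and words[start - 1] in ADJECTIVES:
        start -= 1                      # extend left over the noun phrase
    return words[:start] + [article] + words[start:]

proto = ["I", "have", "big", "red", "apple"]
print(" ".join(insert_article(proto, "apple", "a")))  # I have a big red apple
```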
4 http://opennlp.sourceforge.net/
46.2.1 NLGen: Example Results
NLGen is currently in a relatively early stage of development, and does not handle the full range
of linguistic and semantic phenomena that it will when it's completed. However, it can already
express a variety of sentences encapsulating a variety of syntactic and semantic phenomena; in
this section we will give some specific examples of what it can do.
The SegSim approach performs sentence generation by matching portions of propositional
input against a large corpus of parsed sentences; therefore, when successful matches can be found
in the corpus, it can generate similar, well-formed sentences via the relevant syntactic
structures.
There currently do not exist any authoritative evaluation criteria for Natural Language
Generation systems. Among many complicating factors here is the fact that different language
generation systems have different kinds of inputs, depending on many things including their
application area. So it's complicated to compare the results of NLGen with those obtained by
other systems.
It is easier, however, to test whether NLGen is implementing SegSim successfully. One ap-
proach is to take a sentence, run it through RelEx to generate a set of relationships, and see
if NLGen can regenerate the sentence from the relationships. We show here the results of some
simple tests of this nature, performed with the current version of the system using a very small
test corpus for similarity matching. Note: In each of these example results, the input of NLGen
is the RelEx relationships produced by the sentence before "==>"; and the sentence to the
right side of "==>" was one of the sentences generated by NLGen.
Example 1:
I study mathematics at Xiamen University.
==> I study mathematics at Xiamen\_University.
Example 2:
The red block is next to the blue cup.
==> The red block is next\_to the blue cup.
Example 3:
Ruiting walked from the table to the door.
==> Ruiting walked to the door from the table.
Example 4:
I intend to walk very slowly from Ben to Ruiting,
with Hugo's pencil in my hand.
==> With Hugo's pencil in my hand,
I intend to walk very slowly from Ben to Ruiting.
Example 5:
I'd like you to tell me where the red ball is.
==> I tell like you to 'd me where the red ball is.
Example 6:
I study at Xiamen University, which is located in China.
==> I study at Xiamen\_University.
Example 7:
What is next to the tree?
==> What is next to the tree?
Example 8:
Where is the red ball?
==> Where is the red ball?
Example 9:
Pass me the ball.
==> Pass me the ball.
Example 10:
Tell Ben where the red ball is.
==> Tell Ben where the red ball is.
To make the process clearer, for Example 1 we also show the RelEx relationships produced
from the sentence before the "==>":
Figure 46.4 shows the relationships of Example 1 fed to NLGen as input. The types of the
semantic relationships are documented in the RelEx wiki pages†.
These examples illustrate some key points about the current version of NLGen. It works
well on simple, commonplace sentences (Example 1, 2), though it may reorder the sentence
fragments sometimes (Example 3, 4). On the other hand, because of its reliance on matching
against a corpus, NLGen is incapable of forming good sentences with syntactic structures not
found in the corpus (Example 5, 6). On a larger corpus these examples would have given
successful results. In Example 5, the odd error is due to the presence of too many "_subj"
RelEx relationships in the relationship-set corresponding to the sentence, which distracts the
matching process when it attempts to find similar substructures in the small test corpus. Then
from Examples 7 to 10, we can see NLGen still works well for question and imperative
sentences if the substructures we extract can be matched; but the substructures may be similar
to those of assertive sentences, so we need to refine them in the "cleanup" step. For example: the
substructures we extracted for the sentence "are you a student?" are the same as the ones for
"you are a student?", since the two sentences both have the same binary RelEx relationships:
_subj (be, you)
_obj (be, student)
† http://opencog.org/wiki/RelEx#Relations_and_Features
Fig. 46.4: The RelEx relationships of Example 1, fed to NLGen as input
like "TRUTH-QUERY-FLAG(be, T)" which means if that the referent "be" is a verb/event and
the event is involved is a question.
The particular shortcomings demonstrated in these examples are simple to remedy within
the current NLGen framework, via simply expanding the corpus. However, to get truly general
behavior from NLGen it will be necessary to insert some other generation method to cover
those cases where similarity matching fails, as discussed above. The NLGen2 system created by
Blake Lemoine [Len10] is one possibility in this regard: based on RelEx and the link parser,
it carries out rule-based generation using an implementation of Chomsky's Merge operator.
Integration of NLGen with NLGen2 is currently being considered. We note that the Merge
operator is computationally inefficient by nature, so that it will likely never be suitable for the
primary sentence generation method in a language generation system. However, pairing NLGen
for generation of familiar and routine utterances with a Merge-based approach for generation
of complex or unfamiliar utterances, may prove a robust approach.
46.3 Experiential Learning of Language Generation
As in the case of language comprehension, there are multiple ways to create an experiential
learning based language generation system, involving various levels of "wired in" knowledge.
Our best guess is that for generation, as for comprehension, a "tabula rasa" approach will prove
computationally intractable for quite some time to come, and an approach in which some basic
structures and processes are provided, and then filled out with content learned via experience,
will provide the greatest odds of success.
A highly abstracted version of SegSim may be formulated as follows:
1. The Al system stores semantic and syntactic structures, and its control mechanism is biased
to search for, and remember, linkages between them
2. When it is given a new semantic structure to express, it first breaks this semantic structure
into natural parts, using inference based on whatever implications it has in its memory that
will serve this purpose
3. Its inference control mechanism is biased to carry out inferences with the following implica-
tion: For each of these parts, match it against its memory to find relevant pairs (which may
be full or partial matches), and use these pairs to generate a set of syntactic realizations
(which may be sentences or sentence fragments)
4. If the matching has failed to yield results with sufficient confidence, then (a) it returns
to Step 2 and carries out the breakdown into parts again. But if this has happened too
many times, then (b) it uses its ordinary inference control routine to try to determine the
syntactic realization of the part in question.
5. If the above step generated multiple fragments, they are pieced together, and an attempt
is made to infer, based on experience, whether the result will be effectively communicative.
If this fails, then Step 3 is tried again on one or more of the parts; or Step 2 is tried again.
6. Other inference-driven transformations may occur at any step of the process, but are par-
ticularly likely to occur at the end. In some languages these transformations may result in
the insertion of correct morphological forms or other "function words."
What we suggest is that it may be interesting to supply a CogPrime system with this overall
process, and let it fill in the rest by experiential adaptation. In the case that the system is
EFTA00624591
46.5 Conclusion 445
learning to comprehend at the same time as it's learning to generate, this means that its early-
stage generations will be based on its rough, early-stage comprehension of syntax - but that's
OK. Comprehension and generation will then "grow up" together.
46.4 Sem2Syn
A subject of current research is the extension of the Syn2Sem approach mentioned above into
a reverse-order Sem2Syn system for language generation.
Given that the Syn2Sem rules are expressed as ImplicationLinks, they can be reversed auto-
matically and immediately - although the reversed versions will not necessarily have the same
truth values. So if a collection of Syn2Sem rules is learned from a corpus, it can be
used to automatically generate a set of Sem2Syn rules, each tagged with a probabilistic truth
value. Application of the whole set of Sem2Syn rules to a given Atom-set in need of articulation
will result in a collection of link-parse links.
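As a concrete (and deliberately simplified) illustration of this reversal, the sketch below inverts a single probabilistic implication via Bayes' rule. The one-number truth values and the example probabilities are illustrative assumptions, not PLN's actual truth-value format:

```python
# Sketch: reversing a probabilistic implication A -> B into B -> A via
# Bayes' rule, as a stand-in for the automatic inversion of Syn2Sem
# ImplicationLinks into Sem2Syn ones. The truth-value model here
# (a single probability per link) is a deliberate simplification.

def invert_implication(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A|B) given P(B|A), P(A), P(B)."""
    if p_b == 0.0:
        raise ValueError("P(B) must be positive to invert")
    return p_b_given_a * p_a / p_b

# A hypothetical Syn2Sem rule: P(semantic relation | syntactic link) = 0.9,
# with illustrative prior probabilities for the link and the relation.
syn2sem = 0.9
p_syntax, p_semantics = 0.2, 0.3

sem2syn = invert_implication(syn2sem, p_syntax, p_semantics)
print(round(sem2syn, 3))  # 0.6 -- the reversed rule has a different strength
```

The point of the sketch is just that a reversed rule generally carries a different strength than the original, which is why each generated Sem2Syn rule must be re-tagged with its own truth value.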
To produce a sentence from such a collection of link-parse links, another process is also
needed, which will select a subset of the collection that corresponds to a complete sentence,
legally parsable via the link parser. The overall collection might naturally break down into more
than one sentence.
In terms of the abstracted version of SegSim given above, the primary difference between
NLGen and SegSim lies in Step 3: Sem2Syn replaces the SegSim "data-store matching" algo-
rithm with inference based on implications obtained by reversing the implications used for
language comprehension.
46.5 Conclusion
There are many different ways to do language generation within OpenCog, ranging from pure
experiential learning to a database-driven approach like NLGen. Each of these different ways
may have value for certain applications, and it's unclear which ones may be viable in a human-
level AGI context. Conceptually we would favor a pure experiential learning approach. but, we
are currently exploring a "compromise" approach based on Sem2Syn. This is an area where
experimentation is going to tell us more than abstract theory.
Chapter 47
Embodied Language Processing
Co-authored with Samir Araujo and Welter Silva
47.1 Introduction
"Language" is an important abstraction — but one should never forget that it's an abstraction.
Language evolved in the context of embodied action, and even the most abstract language is
full of words and phrases referring to embodied experience. Even our mathematics is heavily
based on our embodied experience - geometry is about space; calculus is about space and
time; algebra is a sort of linguistic manipulation generalized from experience-oriented language,
etc. (see [ELMO] for detailed arguments in this regard). To consider language in the context of
human-like general intelligence, one needs to consider it in the context of embodied experience.
There is a large literature on the importance of embodiment for child language learning,
but perhaps the most eloquent case has been made by Michael Tomasello, in his excellent
book Constructing a Language ??. Citing a host of relevant research by himself and others,
Tomasello gives a very clear summary of the value of social interaction and embodiment for
language learning in human children. And while he doesn't phrase it in these terms, the picture
he portrays includes central roles for reinforcement, imitative and corrective learning. Imitative
learning is obvious: so much of embodied language learning has to do with the learner copying
what it has heard others say in similar contexts. Corrective learning occurs every time a parent
or peer rephrases something for a child.
In this chapter, after some theoretical discussion of the nature of symbolism and the role of
gesture and sound in language, we describe some computational experiments run with OpenCog
controlling virtual pets in a virtual world, regarding the use of embodied experience for anaphor
resolution and question-answering. These comprise an extremely simplistic example of the in-
terplay between language and embodiment, but have the advantage of concreteness, since they
were actually implemented and experimented with. Some of the specific OpenCog tools used in
these experiments are no longer current (e.g. the use of RelEx2Frame, which is now deprecated
in favor of alternative approaches to mapping parses into more abstract semantic relationships);
but the basic principles and flow illustrated here are still relevant to current and future work.
47.2 Semiosis
The foundation of communication is semiosis - the representation between the signifier and the
signified. Often the signified has to do with the external world or the communicating agent's
body; hence the critical role of embodiment in language.
Thus, before turning to the topic of embodied language use and learning per se, we will
briefly treat the related topic of how an AGI system may learn semiosis itself via its embodied
experience. This is a large and rich topic, but we will restrict ourselves to giving a few relatively
simple examples intended to make the principles clear. We will structure our discussion of
semiotic learning according to Charles Sanders Peirce's theory of semiosis [Pei34], in which
there are three basic types of signs: icons, indices and symbols.
In Peirce's ontology of semiosis, an icon is a sign that physically resembles what it stands
for. Representational pictures, for example, are icons because they look like the thing they
represent. Onomatopoeic words are icons, as they sound like the object or fact they signify. The
iconicity of an icon need not be immediately apparent. The fact that "kirikiriki" is iconic for
a rooster's crow is not obvious to English-speakers, yet it is to many Spanish-speakers; and
the converse is true for "cock-a-doodle-doo."
Next, an index is a sign whose occurrence probabilistically implies the occurrence of some
other event or object (for reasons other than the habitual usage of the sign in connection with
the event or object among some community of communicating agents). The index can be the
cause of the signified thing, or its consequence, or merely be correlated to it. For example, a
smile on your face is an index of your happy state of mind. Loud music and the sound of many
people moving and talking in a room is an index for a party in the room. On the whole, more
contextual background knowledge is required to appreciate an index than an icon.
Finally, any sign that is not an icon or index is a symbol. More explicitly, one may say that a
symbol is a sign whose relation to the signified thing is conventional or arbitrary. For instance,
the stop sign is a symbol for the imperative to stop; the word "dog" is a symbol for the concept
it refers to.
The distinction between the various types of signs is not always obvious, and some signs may
have multiple aspects. For instance, the thumbs-up gesture is a symbol for positive emotion
or encouragement. It is not an index: unlike a smile, which is an index for happiness because
smiling is intrinsically biologically tied to happiness, there is no intrinsic connection between the
thumbs-up signal and positive emotion or encouragement. On the other hand, one might argue
that the thumbs-up signal is very weakly iconic, in that its up-ness resembles the subjective
up-ness of a positive emotion (note that in English an idiom for happiness is "feeling up").
Teaching an embodied virtual agent to recognize simple icons is a relatively straightforward
learning task. For instance, suppose one wanted to teach an agent that in order to get the
teacher to give it a certain type of object, it should go to a box full of pictures and select a
picture of an object of that type, and bring it to the teacher. One way this may occur in an
OpenCog-controlled agent is for the agent to learn a rule of the following form:
ImplicationLink
    ANDLink
        ContextLink
            Visual
            SimilarityLink $X $Y
    PredictiveImplicationLink
        SequentialANDLink
            ExecutionLink goto box
            ExecutionLink grab $X
            ExecutionLink goto teacher
        EvaluationLink give me teacher $Y
While not a trivial learning problem, this is straightforward to a CogPrime-controlled agent
that is primed to consider visual similarities as significant (i.e. is primed to consider the visual-
appearance context within its search for patterns in its experience).
Next, proceeding from icons to indices: Suppose one wanted to teach an agent that in order
to get the teacher to give it a certain type of object, it should go to a box full of pictures and
select a picture of an object that has commonly been used together with objects of that type,
and bring it to the teacher. This is a combination of iconic and indexical semiosis, and would
be achieved via the agent learning a rule of the form
Implication
    AND
        Context
            Visual
            Similarity $X $Z
        Context
            Experience
            SpatioTemporalAssociation $Z $Y
    PredictiveImplication
        SequentialAND
            Execution goto box
            Execution grab $X
            Execution goto teacher
        Evaluation give me teacher $Y
Symbolism, finally, may be seen to emerge as a fairly straightforward extension of indexing.
After all, how does an agent come to learn that a certain symbol refers to a certain entity?
An advanced linguistic agent can learn this via explicit verbal instruction, e.g. one may tell it
"The word 'hideous' means 'very, ugly'." But in the early stages of language learning, this sort
of instructional device is not available, and so the way an agent learns that a word is associated
with an object or an action is through spatiotemporal association. For instance, suppose the
teacher wants to teach the agent to dance every time the teacher says the word "dance" - a very
simple example of symbolism. Assuming the agent already knows how to dance, this merely
requires the agent learn the implication
PredictiveImplication
    SequentialAND
        Evaluation say teacher me "dance"
        Execution dance
    Evaluation give teacher me Reward
And, once this has been learned, then simultaneously the relationship
SpatioTemporalAssociation dance "dance"
will be learned. What's interesting is what happens after a number of associations of this nature
have been learned. Then, the system may infer a general rule of the form
Implication
    AND
        SpatioTemporalAssociation $X $Z
        HasType $X GroundedSchema
    PredictiveImplication
        SequentialAND
            Evaluation say teacher me $Z
            Execution $X
        Evaluation give teacher me Reward
This implication represents the general rule that if the teacher says a word corresponding to an
action the agent knows how to do, and the agent does it, then the agent may get a reward from
the teacher. Abstracting this from a number of pertinent examples is a relatively straightforward
feat of probabilistic inference for the PLN inference engine.
Of course, the above implication is overly simplistic, and would lead an agent to stupidly start
walking every time its teacher used the word "walk" in conversation and the agent overheard
it. To be useful in a realistic social context, the implication must be made more complex so as
to include some of the pragmatic surround in which the teacher utters the word or phrase $Z.
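One minimal way to picture how such word-action symbol associations could be abstracted from rewarded episodes is sketched below. The episode format and the support threshold are hypothetical, standing in for PLN's probabilistic inference over many SpatioTemporalAssociation links:

```python
# Sketch: forming SpatioTemporalAssociation-style word->action links by
# counting how often hearing a word alongside performing an action was
# followed by a reward. The episode tuples are a hypothetical format.
from collections import Counter

def learn_word_action_links(episodes, min_support=2):
    """episodes: list of (word_heard, action_done, rewarded) tuples."""
    counts = Counter()
    for word, action, rewarded in episodes:
        if rewarded:
            counts[(word, action)] += 1
    # Keep only associations with enough rewarded evidence.
    return {pair for pair, n in counts.items() if n >= min_support}

episodes = [
    ("dance", "dance", True),
    ("dance", "dance", True),
    ("dance", "sit", False),
    ("fetch", "fetch", True),
]
print(learn_word_action_links(episodes))  # {('dance', 'dance')}
```

A real agent would of course weigh these counts probabilistically and condition them on pragmatic context, exactly as the text above cautions.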
47.3 Teaching Gestural Communication
Based on the ideas described above, it is relatively straightforward to teach virtually embodied
agents the elements of gestural communication. This is important for two reasons: gestural com-
munication is extremely useful unto itself, as one sees from its role in communication among
young children and primates [22]; and gestural communication forms a foundation for verbal
communication, during the typical course of human language learning [23]. Note for instance
the study described in [22], which "reports empirical longitudinal data on the early stages of
language development," concluding that
...the output systems of speech and gesture may draw on underlying brain mechanisms com-
mon to both language and motor functions. We analyze the spontaneous interaction with their
parents of three typically-developing children (2 M, 1 F) videotaped monthly at home between
10 and 23 months of age. Data analyses focused on the production of actions, representational
and deictic gestures and words, and gesture-word combinations. Results indicate that there is
a continuity between the production of the first action schemes, the first gestures and the first
words produced by children. The relationship between gestures and words changes over time.
The onset of two-word speech was preceded by the emergence of gesture-word combinations.
If young children learn language as a continuous outgrowth of gestural communication, per-
haps the same approach may be effective for (virtually or physically) embodied AI's.
An example of an iconic gesture occurs when one smiles explicitly to illustrate to some other
agent that one is happy. Smiling is a natural expression of happiness, but of course one doesn't
always smile when one's happy. The reason that explicit smiling is iconic is that the explicit
smile actually resembles the unintentional smile, which is what it "stands for."
This kind of iconic gesture may emerge in a socially-embedded learning agent through a very
simple logic. Suppose that when the agent is happy, it benefits from its nearby friends being
happy as well, so that they may then do happy things together. And suppose that the agent
has noticed that when it smiles, this has a statistical tendency to make its friends happy. Then,
when it is happy and near its friends, it will have a good reason to smile. So through very
simple probabilistic reasoning, the use of explicit smiling as a communicative tool may result.
But what if the agent is not actually happy, but still wants some other agent to be happy?
Using the reasoning from the prior paragraph, it will likely figure out to smile to make the
other agent happy - even though it isn't actually happy.
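The "smile to make a friend happy" inference can be caricatured as a one-step expected-utility choice; the probabilities below are invented for illustration:

```python
# Sketch: choosing to smile as a communicative act, modeled as a
# one-step expected-utility decision. The probabilities are illustrative
# assumptions about what the agent has observed, not learned values.

def best_action(actions, p_friend_happy, utility_if_happy=1.0):
    """Pick the action maximizing expected utility of friend-happiness."""
    return max(actions, key=lambda a: p_friend_happy[a] * utility_if_happy)

# Observed statistical tendency: smiling often makes nearby friends happy.
p_friend_happy = {"smile": 0.7, "do_nothing": 0.2}
print(best_action(["smile", "do_nothing"], p_friend_happy))  # smile
```

The same calculation goes through whether or not the agent is itself happy, which is exactly why the deliberately "communicative" smile can emerge.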
Another simple example of an iconic gesture would be moving one's hands towards one's
mouth, mimicking the movements of feeding oneself, when one wants to eat. Many analogous
iconic gestures exist, such as doing a small solo part of a two-person dance to indicate that one
wants to do the whole dance together with another person. The general rule an agent needs to
learn in order to generate iconic gestures of this nature is that, in the context of shared activity,
mimicking part of a process will sometimes serve the function of evoking that whole process.
This sort of iconic gesture may be learned in essentially the same way as an indexical gesture
such as a dog repeatedly drawing the owner's attention to the owner's backpack, when the dog
wants to go outside. The dog doesn't actually care about going outside with the backpack - he
would just as soon go outside without it - but he knows the backpack is correlated with going
outside, which is his actual interest.
The general rule here is
R :=
    Implication
        SimultaneousImplication
            Execution $X
            $Y
        PredictiveImplication
            Execution $X
            $Y
I.e., if doing $X often correlates with $Y, then maybe doing $X will bring about $Y. This sort of
rule can bring about a lot of silly "superstitious" behavior but also can be particularly effective
in social contexts, meaning in formal terms that

Context
    near_teacher
    R

holds with a higher truth value than R itself. This is a very small conglomeration of semantic
nodes and links, yet it encapsulates a very important communicational pattern: if you want
something to happen, and you act out part of it - or something historically associated with it -
around your teacher, then the thing may happen.
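In the same simplified spirit, the claim that the contextualized rule holds with a higher truth value than R itself could be checked empirically from logged trials; the trial format here is hypothetical:

```python
# Sketch: estimating the strength of the "act out part of it" rule R
# overall vs. in the near_teacher context, from logged trials. Trial
# tuples (near_teacher, acted_out_part, whole_event_followed) are a
# hypothetical logging format, invented for illustration.

def rule_strength(trials, context=None):
    relevant = [t for t in trials
                if t[1] and (context is None or t[0] == context)]
    if not relevant:
        return 0.0
    return sum(1 for t in relevant if t[2]) / len(relevant)

trials = [
    (True,  True, True),
    (True,  True, True),
    (False, True, False),
    (False, True, True),
]
overall = rule_strength(trials)                      # 0.75
near_teacher = rule_strength(trials, context=True)   # 1.0
print(near_teacher > overall)  # True: R holds more strongly near the teacher
```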
Many other cases of iconic gesture are more complex and mix iconic with symbolic aspects.
For instance, one waves one hand away from oneself, to try to get someone else to go away. The
hand is moving, roughly speaking, in the direction one wants the other to move in. However,
understanding the meaning of this gesture requires a bit of savvy or experience. Once one does
grasp it, however, then one can understand its nuances: for instance, if I wave my hand in an
arc leading from your direction toward the direction of the door, maybe that means I want you
to go out the door.
Purely symbolic (or nearly so) gestures include the thumbs-up symbol mentioned above, and
many others including valence-indicating symbols like a nodded head for YES, a shaken-side-to-
side head for NO, and shrugged shoulders for "I don't know." Each of these valence-indicating
symbols actually indicates a fairly complex concept, which is learned from experience partly via
attention to the symbol itself. So, an agent may learn that the nodded head corresponds with
situations where the teacher gives it a reward, and also with situations where the agent makes
a request and the teacher complies. The cluster of situations corresponding to the nodded-head then forms the agent's initial concept of "positive valence," which encompasses, loosely
speaking, both the good and the true.
Summarizing our discussion of gestural communication: An awful lot of language exists
between intelligent agents even if no word is ever spoken. And, our belief is that these sorts
of non-verbal semiosis form the best possible context for the learning of verbal language, and
that to attack verbal language learning outside this sort of context is to make an intrinsically-
difficult problem even harder than it has to be. And this leads us to the final part of the
chapter, which is a bit more speculative and adventuresome. The material in this section and
the prior ones describes experiments of the sort we are currently carrying out with our virtual
agent control software. We have not yet demonstrated all the forms of semiosis and non-linguistic
communication described in the last section using our virtual agent control system, but we have
demonstrated some of them and are actively working on extending our system's capabilities. In
the following section, we venture a bit further into the realm of hypothesis and describe some
functionalities that are beyond the scope of our current virtual agent control software, but that
we hope to put into place gradually during the next 1-2 years. The basic goal of this work is to
move from non-verbal to verbal communication.
It is interesting to enumerate the aspects in which each of the above components appears to
be capable of tractable adaptation via experiential, embodied learning:
• Words and phrases that are found to be systematically associated with particular objects
in the world, may be added to the "gazetteer list" used by the entity extractor
• The link parser dictionary may be automatically extended. In cases where the agent hears
a sentence that is supposed to describe a certain situation, and realizes that in order for the
sentence to be mapped into a set of logical relationships accurately describing the situation,
it would be necessary for a certain word to have a certain syntactic link that it doesn't
have, then the link parser dictionary may be modified to add the link to the word. (On the
other hand, creating new link parser link types seems like a very difficult sort of learning -
not to say it is unaddressable, but it will not be our focus in the near term.)
• As with the link parser dictionary, if it is apparent that to interpret an utterance in
accordance with reality a RelEx rule must be added or modified, this may be automatically
done. The RelEx rules are expressed in the format of relatively simple logical implications
between Boolean combinations of syntactic and semantic relationships, so that learning and
modifying them is within the scope of a probabilistic logic system such as OpenCogPrime's
PLN inference engine.
• The rules used by RelEx2Frame may be experientially modified quite analogously to those
used by RelEx.
• Our current statistical parse ranker ranks an interpretation of a sentence based on the
frequency of occurrence of its component links across a parsed corpus. A deeper approach,
however, would be to rank an interpretation based on its commonsensical plausibility, as
inferred from experienced-world-knowledge as well as corpus-derived knowledge. Again, this
is within the scope of what an inference engine such as PLN should be able to do.
• Our word sense disambiguation and reference resolution algorithms involve probabilistic
estimations that could be extended to refer to the experienced world as well as to a parsed
corpus. For example, in assessing which sense of the noun "run" is intended in a certain
context, the system could check whether stockings, or sports-events or series-of-events, are
more prominent in the currently-observed situation. In assessing the sentence "The children
kicked the dogs, and then they laughed," the system could map "they" into "children" via
experientially-acquired knowledge that children laugh much more often than dogs.
• NLGen uses the link parser dictionary, treated above, and also uses rules analogous to (but
inverse to) RelEx rules, mapping semantic relations into brief word-sequences. The "gold
standard" for NLGen is whether, when it produces a sentence S from a set R of semantic
relationships, the feeding of S into the language comprehension subsystem produces R
(or a close approximation) as output. Thus, as the semantic mapping rules in RelEx and
RelEx2Frame adapt to experience, the rules used in NLGen must adapt accordingly, which
poses an inference problem unto itself.
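The NLGen "gold standard" mentioned in the last bullet amounts to a round-trip test: generate a sentence from a relation set, re-comprehend it, and compare. A toy sketch follows, in which the generate/comprehend functions are stand-ins, not NLGen's or RelEx's actual APIs:

```python
# Sketch: the NLGen "gold standard" as a round-trip check -- generate a
# sentence from a relation set, re-comprehend it, and compare the result
# to the original relations. Both functions below are toy stand-ins.

def generate(relations):           # toy Sem2Syn: relations -> sentence
    return f"{relations['subj']} {relations['verb']} the {relations['obj']}"

def comprehend(sentence):          # toy Syn2Sem: sentence -> relations
    subj, verb, _, obj = sentence.split()
    return {"subj": subj, "verb": verb, "obj": obj}

def round_trip_ok(relations):
    """Does comprehension of the generated sentence recover the input?"""
    return comprehend(generate(relations)) == relations

print(round_trip_ok({"subj": "dog", "verb": "chases", "obj": "ball"}))  # True
```

In the real system "equality" would be relaxed to "close approximation," since lossless round-trips cannot be expected of natural language.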
All in all, when one delves in detail into the components that make up our hybrid
statistical/rule-based NLP system, one sees there is a strong opportunity for experiential adap-
tive learning to substantially modify nearly every aspect of the NLP system, while leaving the
basic framework intact.
This approach, we suggest, may provide means of dealing with a number of problems that
have systematically vexed existing linguistic approaches. One example is parse ranking for com-
plex sentences: this seems almost entirely a matter of the ability to assess the semantic plausi-
bility of different parses, and doing this based on statistical corpus analysis seems unreasonable.
One needs knowledge about a world to ground reasoning about plausibility.
Another example is preposition disambiguation, a topic that is barely dealt with at all in
the computational linguistics literature (see e.g. [33] for an indication of the state of the art).
Consider the problem of assessing which meaning of "with" is intended in sentences like "I ate
dinner with a fork", "I ate dinner with my sister", "I ate dinner with dessert." In performing
this sort of judgment, an embodied system may use knowledge about which interpretations
have matched observed reality in the case of similar utterances it has processed in the past,
and for which it has directly seen the situations referred to by the utterances. If it has seen
in the past, through direct embodied experience, that when someone said "I ate cereal with a
spoon," they meant that the spoon was their tool not part of their food or their eating-partner;
then when it hears "I ate dinner with a fork," it may match "cereal" to "dinner" and "spoon"
to "fork" (based on probabilistic similarity measurement) and infer that the interpretation
of "with" in the latter sentence should also be to denote a tool. How does this approach to
computational language understanding tie in with gestural and general semiotic learning as we
discussed earlier? The study of child language has shown that early language use is not purely
verbal by any means, but is in fact a complex combination of verbal and gestural communication
[23]. With the exception of the first bullet point (entity extraction) above, every one of our instances
of experiential modification of our language framework listed above involves the use of an
understanding of what situation actually exists in the world, to help the system identify what
the logical relationships output by the NLP system are supposed to be in a certain context.
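The "with"-disambiguation judgment described above can be caricatured as analogy over grounded memories; the category table and the remembered sense pairings below are invented for illustration, standing in for probabilistic similarity over learned concepts:

```python
# Sketch: choosing a sense of "with" by analogy to remembered, grounded
# utterances. Similarity here is a toy category lookup; a real system
# would use probabilistic similarity over experientially learned concepts.

CATEGORY = {"spoon": "utensil", "fork": "utensil",
            "sister": "person", "dessert": "food"}

# Grounded memories: (category of the with-object, observed sense of "with")
MEMORY = [("utensil", "instrument"), ("person", "companion"),
          ("food", "accompaniment")]

def with_sense(noun):
    cat = CATEGORY.get(noun)
    for mem_cat, sense in MEMORY:
        if mem_cat == cat:
            return sense
    return "unknown"

print(with_sense("fork"))    # instrument -- by analogy with "spoon"
print(with_sense("sister"))  # companion
```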
But a large amount of early-stage linguistic communication is social in nature, and a large
amount of the remainder has to do with the body's relationship to physical objects. And, in
understanding "what actually exists in the world" regarding social and physical relationships, a
full understanding of gestural communication is important. So, the overall pathway we propose
for achieving robust, ultimately human-level NLP functionality is as follows:
• The capability for learning diverse instances of semiosis is established
• Gestural communication is mastered, via nonverbal imitative/reinforcement/corrective
learning mechanisms such as we utilized for our embodied virtual agents
• Gestural communication, combined with observation of and action in the world and verbal
interaction with teachers, allows the system to adapt numerous aspects of its initial NLP
engine to allow it to more effectively interpret simple sentences pertaining to social and
physical relationships
• Finally, given the ability to effectively interpret and produce these simple and practical
sentences, probabilistic logical inference allows the system to gradually extend this ability
to more and more complex and abstract senses, incrementally adapting aspects of the NLP
engine as its scope broadens.
In this brief section we will mention another potentially important factor that we have
intentionally omitted in the above analysis - but that may wind up being very important,
and that can certainly be taken into account in our framework if this proves necessary. We
have argued that gesture is an important predecessor to language in human children, and that
incorporating it in AI language learning may be valuable. But there is another aspect of early
language use that plays a similar role to gesture, which we have left out in the above discussion:
this is the acoustic aspects of speech.
Clearly, pre-linguistic children make ample use of communicative sounds of various sorts.
These sounds may be iconic, indexical or symbolic; and they may have a great deal of subtlety. Steven Mithen [Mit96] has argued that non-verbal utterances constitute a kind of proto-
language, and that both music and language evolved out of this. Their role in language learning
is well-known. We are uncertain as to whether an exclusive focus on text rather than speech
would critically impair the language learning process of an AI system. We are fairly strongly
convinced of the importance of gesture because it seems bound up with the importance of
semiosis - gesture, it seems, is how young children learn flexible semiotic communication skills,
and then these skills are gradually ported from the gestural to the verbal domain. Semioti-
cally, on the other hand, phonology doesn't seem to give anything special beyond what gesture
gives. What it does give is an added subtlety of emotional expressiveness - something that is
largely missing from virtual agents as implemented today, due to the lack of really fine-grained
facial expressions. Also, it provides valuable clues to parsing, in that groups of words that are
syntactically bound together are often phrased together acoustically.
If one wished to incorporate acoustics into the framework described above, it would not
be objectionably difficult on a technical level. Speech-to-text and text-to-speech software both
exist, but neither have been developed with a view specifically toward conveyance of emotional
information. One could approach the problem of assessing the emotional state of an utterance
based on its sound as a supervised categorization problem, to be solved via supplying a machine
learning algorithm with training data consisting of human-created pairs of the form (utterance,
emotional valence). Similarly, one could tune the dependence of text-to-speech software for
appropriate emotional expressiveness based on the same training corpus.
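The proposed supervised categorization of utterances by emotional valence could be prototyped with something as simple as a nearest-centroid classifier; the two-dimensional acoustic feature vectors here (say, pitch mean and energy) are purely illustrative:

```python
# Sketch: emotional-valence classification of utterances as supervised
# learning, via a nearest-centroid classifier over hypothetical acoustic
# feature vectors (e.g. pitch mean, energy). Training data is invented.
import math

def centroid(vectors):
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

def train(pairs):
    """pairs: list of (feature_vector, emotion_label)."""
    by_label = {}
    for vec, label in pairs:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def classify(model, vec):
    # Assign the label whose centroid is nearest in Euclidean distance.
    return min(model, key=lambda lbl: math.dist(vec, model[lbl]))

training = [([0.9, 0.8], "happy"), ([0.8, 0.9], "happy"),
            ([0.1, 0.2], "sad"), ([0.2, 0.1], "sad")]
model = train(training)
print(classify(model, [0.7, 0.7]))  # happy
```

Tuning text-to-speech output for emotional expressiveness would be the inverse problem over the same training corpus, as the text suggests.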
47.4 Simple Experiments with Embodiment and Anaphor Resolution
Now we turn to some fairly simple practical work that was done in 2008 with the OpenCog-based
PetBrain software, involving the use of virtually embodied experience to help with interpretation
of linguistic utterances. This work has been superseded somewhat by more recent work using
OpenCog to control virtual agents; but the PetBrain work was especially clear and simple, so
suitable in an expository sense for in-depth discussion here.
One of the two ways the PetBrain related language processing to embodied experience was via
using the latter to resolve anaphoric references in text produced by human-controlled avatars.
The PetBrain controlled agent lived in a world with many objects, each one with their own
characteristics. For example, we could have multiple balls, with varying colors and sizes. We
represent this in the OpenCog Atomspace via using multiple nodes: a single ConceptNode to
represent the concept "ball", a WordNode associated with the word "ball", and numerous SemeNodes representing particular balls. There may of course also be ConceptNodes representing
ball-related ideas not summarized in any natural language word, e.g. "big fat squishy balls,"
"balls that can usefully be hit with a bat", etc.
As the agent interacts with the world, it acquires information about the objects it finds,
through perceptions. The perceptions associated with a given object are stored as other nodes
linked to the node representing the specific object instance. All this information is represented
in the Atomspace using FrameNet-style relationships (exemplified in the next section).
When the user says, e.g., "Grab the red ball", the agent needs to figure out which specific
ball the user is referring to - i.e. it needs to invoke the Reference Resolution (RR) process. RR
uses the information in the sentence to select instances and also a few heuristic rules. Broadly
speaking, Reference Resolution maps nouns in the user's sentences to actual objects in the
virtual world, based on world-knowledge obtained by the agent through perceptions.
In this example, first the brain selects the ConceptNodes related to the word "ball". Then
it examines all individual instances associated with these concepts, using the determiners in
the sentence along with other appropriate restrictions (in this example the determiner is the
adjective "red"; and since the verb is "grab" it also looks for objects that can be fetched). If it
finds more than one "fetchable red ball", a heuristic is used to select one (in this case, it chooses
the nearest instance).
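The Reference Resolution step just described - filter candidate instances by word, properties and affordance, then apply the nearest-instance heuristic - can be sketched as follows; the object-record format is hypothetical, not the PetBrain's actual Atom representation:

```python
# Sketch: PetBrain-style Reference Resolution -- filter known object
# instances by noun, required properties, and action affordance, then
# pick the nearest. The object records are a hypothetical format.

def resolve_reference(objects, noun, properties, affordance):
    candidates = [o for o in objects
                  if o["kind"] == noun
                  and properties <= o["properties"]      # subset test
                  and affordance in o["affordances"]]
    # Heuristic: among remaining candidates, choose the nearest instance.
    return min(candidates, key=lambda o: o["distance"], default=None)

objects = [
    {"id": "ball_1", "kind": "ball", "properties": {"red"},
     "affordances": {"grab"}, "distance": 4.0},
    {"id": "ball_2", "kind": "ball", "properties": {"red"},
     "affordances": {"grab"}, "distance": 1.5},
    {"id": "ball_3", "kind": "ball", "properties": {"blue"},
     "affordances": {"grab"}, "distance": 0.5},
]
# "Grab the red ball" -> the nearest fetchable red ball
print(resolve_reference(objects, "ball", {"red"}, "grab")["id"])  # ball_2
```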
The agent also needs to map pronouns in the sentences to actual objects in the virtual world.
For example, if the user says "I like the red ball. Grab it.", the agent must map the pronoun
"it" to a specific red ball. This process is done in two stages: first using anaphor resolution to
associate the pronoun "it" with the previously heard noun "ball"; then using reference resolution
to associate the noun "ball" with the actual object.
The subtlety of anaphor resolution is that there may be more than one plausible "candidate"
noun corresponding to a given pronoun. As noted above, at time of writing RelEx's anaphor
resolution system is somewhat simplistic and is based on the classical Hobbs algorithm [Hob78].
Basically, when a pronoun (it, he, she, they and so on) is identified in a sentence, the Hobbs
algorithm searches through recent sentences to find the nouns that fit this pronoun according
to number, gender and other characteristics. The Hobbs algorithm is used to create a ranking
of candidate nouns, ordered by time (most recently mentioned nouns come first).
We improved the Hobbs algorithm results by using the agent's world-knowledge to help
choose the best candidate noun. Suppose the agent heard the sentences:
"The ball is red."
"The stick is brown."
and then it receives a third sentence
"Grab it.".
The anaphor resolver will build a list containing two options for the pronoun "it" of the third
sentence: ball and stick. Given that the stick is the most recently mentioned noun, the agent
will grab the stick (as Hobbs would suggest) rather than the ball.
Similarly, if the agent's history contains
"From here I can see a tree and a ball."
"Grab it."
The Hobbs algorithm returns as candidate nouns "tree" and "ball", in this order. But using our
integrative Reference Resolution process, the agent will conclude that a tree cannot be grabbed,
so this candidate is discarded and "ball" is chosen.
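The combined recency-plus-affordance scheme above can be sketched as follows. This is purely an illustrative sketch, not the actual RelEx/PetBrain code: the affordance table, the function names and the noun lists are all assumptions made for the example.

```python
# Sketch: Hobbs-style candidates are ranked by recency, then filtered by
# world knowledge about what actions each object affords.
# The affordance table below is an illustrative assumption.

AFFORDANCES = {
    "ball": {"grab", "fetch", "kick"},
    "stick": {"grab", "fetch"},
    "tree": set(),          # a tree cannot be grabbed
}

def hobbs_candidates(recent_nouns):
    """Candidate antecedents, most recently mentioned first."""
    return list(reversed(recent_nouns))

def resolve_pronoun(recent_nouns, verb):
    """First recency-ranked noun to which the verb can apply, or None."""
    for noun in hobbs_candidates(recent_nouns):
        if verb in AFFORDANCES.get(noun, set()):
            return noun
    return None

# "From here I can see a tree and a ball." ... "Grab it."
print(resolve_pronoun(["tree", "ball"], "grab"))   # ball
```

On the earlier ball/stick example the same code returns the stick, since both objects are grabbable and the stick was mentioned more recently.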
47.5 Simple Experiments with Embodiment and Question Answering
The PetBrain was also capable of answering simple questions about its feelings/emotions (hap-
piness, fear, etc.) and about the environment in which it lives. After a question is asked of the
agent, it is parsed by RelEx and classified as either a truth question or a discursive one. After
that, RelEx rewrites the given question as a list of Frames (based on FrameNet, with some
customizations), which represent its semantic content. The Frames version of the question is
then processed by the agent and the answer is also written in Frames. The answer Frames are
then sent to a module that converts them back to the RelEx format. Finally the answer, in RelEx
format, is processed by the NLGen module, which generates the text of the answer in English.
We will discuss this process here in the context of the simple question "What is next to the
tree?", which in an appropriate environment receives the answer "The red ball is next to the
tree."
Question answering (QA) of course has a long history in AI, and our approach
fits squarely into the tradition of "deep semantic QA systems"; however it is innovative in its
combination of dependency parsing with FrameNet and most importantly in the manner of its
integration of QA with an overall cognitive architecture for agent control.
47.5.1 Preparing/Matching Frames
In order to answer an incoming question, the agent tries to match the Frames list, created by
RelEx, against the Frames stored in its own memory. In general these Frames could come from
a variety of sources, including inference, concept creation and perception; but in the current
PetBrain they primarily come from perception, and simple transformations of perceptions.
However, the agent cannot use the incoming perceptual Frames in their original format
because they lack grounding information (information that connects the mentioned elements to
¹ http://framenet.icsi.berkeley.edu
the real elements of the environment). So, two steps are executed before trying to match
the frames: Reference Resolution (described above) and Frames Rewriting. Frames Rewriting
is a process that changes the values of the incoming Frames' elements into grounded values.
Here is an example:
Incoming Frame (Generated by RelEx)

EvaluationLink
   DefinedFrameElementNode Color:Color
   WordInstanceNode "red@aaa"
EvaluationLink
   DefinedFrameElementNode Color:Entity
   WordInstanceNode "ball@bbb"
ReferenceLink
   WordInstanceNode "red@aaa"
   WordNode "red"

After Reference Resolution

ReferenceLink
   WordInstanceNode "ball@bbb"
   SemeNode "ball_99"

Grounded Frame (After Rewriting)

EvaluationLink
   DefinedFrameElementNode Color:Color
   ConceptNode "red"
EvaluationLink
   DefinedFrameElementNode Color:Entity
   SemeNode "ball_99"
Frame Rewriting serves to convert the incoming Frames to the same structure used by the
Frames stored in the agent's memory. After Rewriting, the new Frames are matched
against the agent's memory; if all frames are found there, the answer is known by the
agent, otherwise it is unknown.
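The matching criterion just described can be sketched roughly as follows, assuming, purely for illustration, that grounded frames are represented as simple tuples rather than the actual Atomspace structures:

```python
# Sketch of the match step: grounded incoming frames are compared against
# the frames in the agent's memory; the answer is "known" only if every
# incoming frame is found there. The tuple encoding is an assumption.

def frames_match(incoming, memory):
    """Each frame is a (frame_name, element, value) triple after grounding."""
    return all(frame in memory for frame in incoming)

memory = {
    ("Color", "Entity", "ball_99"),
    ("Color", "Color", "red"),
}
question = [("Color", "Entity", "ball_99"), ("Color", "Color", "red")]
print(frames_match(question, memory))   # True -> the answer is known
```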
In the PetBrain system, if a truth question was posed and all Frames were matched
successfully, the answer would be "yes"; otherwise the answer is "no". Mapping of ambiguous
matching results into ambiguous responses was not handled in the PetBrain.
If the question requires a discursive answer the process is slightly different. For known answers
the matched Frames are converted into RelEx format by Frames2RelEx and then sent to NLGen,
which prepares the final English text to be answered. There are two types of unknown answers.
The first is when at least one frame cannot be matched against the agent's memory, in which
case the answer is "I don't know". The second type of unknown answer occurs when all Frames
were matched successfully but they cannot be correctly converted into RelEx format or NLGen
cannot identify the incoming relations. In this case the answer is "I know the answer, but I
don't know how to say it".
47.5.2 Frames2RelEx
As mentioned above, this module is responsible for receiving a list of grounded frames and
returning another list containing the relations, in RelEx format, which represent the grammat-
ical form of the sentence described by the given frames. That is, the frame list represents a
sentence that the agent wants to say to another agent. NLGen needs an input in RelEx format
in order to generate an English version of the sentence; Frames2RelEx does this conversion.
Currently, Frames2RelEx is implemented as a rule-based system in which the preconditions
are the required frames and the output is one or more RelEx relations, e.g.
Color(Entity, Color) =>
   present($2) .a($2) adj($2) _predadj($1, $2)
   definite($1) .n($1) noun($1) singular($1)
   .v(be) verb(be) punctuation(.) det(the)
where the precondition comes before the symbol => and Color is a frame which has two
elements: Entity and Color. Each element is interpreted as a variable Entity = $1 and Color =
$2. The effect, or output of the rule, is a list of RelEx relations. As in the case of RelEx2Frame,
the use of hand-coded rules is considered a stopgap, and for a powerful AGI system based on
this framework such rules will need to be learned via experience.
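A rule of this kind might be applied as in the following illustrative sketch; the rule encoding and the helper names are our assumptions for the example, not the actual Frames2RelEx implementation:

```python
# Illustrative sketch of a Frames2RelEx-style rule: the precondition names
# the required frame and its elements; the output is a list of RelEx
# templates whose $-variables are filled from the matched elements.

RULE = {
    # Color(Entity, Color): Entity binds to $1, Color binds to $2
    "precondition": ("Color", ["Entity", "Color"]),
    "output": ["present($2)", ".a($2)", "_predadj($1, $2)",
               "definite($1)", ".n($1)", "singular($1)"],
}

def apply_rule(rule, frame):
    """Return instantiated RelEx relations, or None if the rule doesn't fire."""
    name, elements = rule["precondition"]
    if frame["name"] != name or any(e not in frame["elements"] for e in elements):
        return None
    bindings = {f"${i + 1}": frame["elements"][e] for i, e in enumerate(elements)}
    relations = []
    for template in rule["output"]:
        for var, value in bindings.items():
            template = template.replace(var, value)
        relations.append(template)
    return relations

frame = {"name": "Color", "elements": {"Entity": "ball", "Color": "red"}}
print(apply_rule(RULE, frame))
# ['present(red)', '.a(red)', '_predadj(ball, red)', ...]
```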
47.5.3 Example of the Question Answering Pipeline
Turning to the example "What is next to the tree?", Figure 47.1 illustrates the processes involved:
[Figure: the question "What is next to the tree?" flows from the Multiverse client through RelEx parsing (relations such as _obj and _subj, plus a Locative_relation frame with Ground = tree, Relation_type = next and a variable Figure element), is matched in the PetBrain against stored frames (Figure = ball_99), and the answer frames are converted back to RelEx and rendered by NLGen as "The ball is next to the tree."]

Fig. 47.1: Overview of current PetBrain language comprehension process
The question is parsed by RelEx, which creates the frames indicating that the sentence is
a question regarding a location reference (next) relative to an object (tree). The frame that
represents questions is called Questioning and it contains the elements Manner that indicates
the kind of question (truth-question, what, where, and so on), Message that indicates the main
term of the question and Addressee that indicates the target of the question. To indicate that
the question is related to a location, the Locative_relation frame is also created with a variable
inserted in its element Figure, which represents the expected answer (in this specific case, the
object that is next to the tree).
The question-answer module tries to match the question frames in the Atomspace to fit the
variable element. Suppose that the object that is next to the tree is the red ball. In this way,
the module will match all the frames requested and realize that the answer is the value of
the element Figure of the frame Locative_relation stored in the AtomTable. Then, it creates
location frames indicating the red ball as the answer. These frames will be converted into RelEx
format by the Frames2RelEx rule-based system as described above, and NLGen will generate
the expected sentence "the red ball is next to the tree".
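The variable-fitting step described above can be sketched as follows, with an illustrative dictionary representation standing in for the actual Atomspace frames:

```python
# Sketch of fitting the variable element: the question's Locative_relation
# frame leaves Figure unbound; matching it against a stored frame binds the
# variable. The representation here is an assumption made for illustration.

VAR = "$answer"

def match_frame(query, stored):
    """Return variable bindings if all non-variable elements agree, else None."""
    bindings = {}
    for element, value in query.items():
        if value == VAR:
            bindings[element] = stored.get(element)
        elif stored.get(element) != value:
            return None
    return bindings

stored = {"Figure": "ball_99", "Ground": "tree", "Relation_type": "next"}
query = {"Figure": VAR, "Ground": "tree", "Relation_type": "next"}
print(match_frame(query, stored))   # {'Figure': 'ball_99'}
```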
47.5.4 Example of the PetBrain Language Generation Pipeline
To illustrate the process of language generation using NLGen, as utilized in the context of
PetBrain query response, consider the sentence "The red ball is near the tree". When parsed by
RelEx, this sentence is converted to:
_obj (near, tree)
_subj (near, ball)
imperative (near)
hyp (near)
definite (tree)
singular (tree)
_to-do (be, near)
_subj (be, ball)
present (be)
definite (ball)
singular (ball)
So, if sentences with this format are in the system's experience, these relations are stored by
NLGen and will be used to match future relations that must be converted into natural language.
NLGen matches at an abstract level, so sentences like "The stick is next to the fountain" will
also be matched even if the corpus contains only the sentence "The ball is near the tree".
If the agent wants to say "The red ball is near the tree", it must invoke NLGen with
the above RelEx contents as input. However, the knowledge that the red ball is near the tree
is stored as frames, not in RelEx format. More specifically, in this case the related frame
stored is the Locative_relation one, containing the following elements and respective values:
Figure -> red ball, Ground -> tree, Relation_type -> near.
So we must convert these frames and their elements' values into the RelEx format accepted by
NLGen. For AGI purposes, a system must learn how to perform this conversion in a flexible and
context-appropriate way. In our current system, however, we have implemented a temporary
short-cut: a system of hand-coded rules, in which the preconditions are the required frames and
the output is the corresponding RelEx format that will generate the sentence representing
the frames. The output of a rule may contain variables that must be replaced by the frame
elements' values. For the example above, the output _subj(be, ball) is generated from the rule
output _subj(be, $var1), with the $var1 replaced by the Figure element value.
Considering specifically question-answering (QA), the PetBrain's Language Comprehension
module represents the answer to a question as a list of frames. In this case, we may have the
following situations:
• The frames match a precondition and the RelEx output is correctly recognized by NLGen,
which generates the expected sentence as the answer;
• The frames match a precondition, but NLGen did not recognize the RelEx output generated.
In this case, the answer will be "I know the answer, but I don't know how to say it", which
means that the question was answered correctly by the Language Comprehension module, but
NLGen could not generate the correct sentence;
• The frames didn't match any precondition; then the answer will also be "I know the answer,
but I don't know how to say it".
• Finally, if no frames are generated as answer by the Language Comprehension module, the
agent's answer will be "I don't know".
If the question is a truth-question, then NLGen is not required. In this case, the creation of
frames as an answer is interpreted as "yes"; otherwise, the answer will be "no", because it was
not possible to find the corresponding frames as the answer.
47.6 The Prospect of Massively Multiplayer Language Teaching
Now we tie in the theme of embodied language learning with more general considerations
regarding embodied experiential learning.
Potentially, this may provide a means to facilitate robust language learning on the part of
virtually embodied agents, and lead to an experientially-trained AGI language facility that can
then be used to power other sorts of agents such as virtual babies, and ultimately virtual adult-
human avatars that can communicate with experientially-grounded savvy rather than in the
manner of chat-bots.
As one concrete, evocative example, imagine millions of talking parrots spread across different
online virtual worlds - all communicating in simple English. Each parrot has its own local
memories, its own individual knowledge and habits and likes and dislikes - but there's also
a common knowledge-base underlying all the parrots, which includes a common knowledge of
English.
The interest of many humans in interacting with chatbots suggests that virtual talking
parrots or similar devices would be likely to meet with a large and enthusiastic audience.
Yes, humans interacting with parrots in virtual worlds can be expected to try to teach the
parrots ridiculous things, obscene things, and so forth. But still, when it conies down to it, even
pranksters and jokesters will have more fun with a parrot that can communicate better, and
will prefer a parrot whose statements are comprehensible.
And for a virtual parrot, the test of whether it has used English correctly, in a given instance,
will come down to whether its human friends have rewarded it, and whether it has gotten what
it wanted. If a parrot asks for food incoherently, it's less likely to get food - and since the virtual
parrots will be programmed to want food, they will have motivation to learn to speak correctly.
If a parrot interprets a human-controlled avatar's request "Fetch my hat please" incorrectly, then
it won't get positive feedback from the avatar - and it will be programmed to want positive
feedback.
And of course parrots are not the end of the story. Once the collective wisdom of throngs of
human teachers has induced powerful language understanding in the collective bird-brain, this
language understanding (and the commonsense understanding coming along with it) will be
useful for many, many other purposes as well. Humanoid avatars will follow: both human-baby
avatars that may serve as more rewarding virtual companions than parrots or other virtual animals,
and language-savvy human-adult avatars serving various useful and entertaining functions in online
virtual worlds and games. Once AIs have learned enough that they can flexibly and adaptively
explore online virtual worlds and gather information from human-controlled avatars according
to their own goals using their linguistic facilities, it's easy to envision dramatic acceleration in
their growth and understanding.
A baby AI has numerous disadvantages compared to a baby human being: it lacks the
intricate set of inductive biases built into the human brain, and it also lacks a set of teachers
with a similar form and psyche to it... and for that matter, it lacks a really rich body and world.
However, the presence of thousands to millions of teachers constitutes a large advantage for the
AI over human babies. And a flexible AGI framework will be able to effectively exploit this
advantage. If nonlinguistic learning mechanisms like the ones we've described here, utilized in a
virtually-embodied context, can go beyond enabling interestingly trainable virtual animals and
catalyze the process of language learning - then, within a few years' time, we may find ourselves
significantly further along the path to AGI than most observers of the field currently expect.
Chapter 48
Natural Language Dialogue
Co-authored with Ruiting Lian
48.1 Introduction
Language evolved for dialogue - not for reading, writing or speechifying. So it's natural that
dialogue is broadly considered a critical aspect of humanlike AGI - even to the extent that (for
better or for worse) the conversational "Turing Test" is the standard test of human-level AGI.
Dialogue is a high-level functionality rather than a foundational cognitive process, and in
the CogPrime approach it is something that must largely be learned via experience rather
than being programmed into the system. In that sense, it may seem odd to have a chapter on
dialogue in a book section focused on engineering aspects of general intelligence. One might
think: Dialogue is something that should emerge from an intelligent system in conjunction with
other intelligent systems, not something that should need to be engineered. And this is certainly
a reasonable perspective! We do think that, as a CogPrime system develops, it will develop its
own approach to natural language dialogue, based on its own embodiment, environment and
experience - with similarities and differences to human dialogue.
However, we have also found it interesting to design a natural language dialogue system
based on CogPrime, with the goal not of emulating human conversation, but rather of enabling
interesting and intelligent conversational interaction with CogPrime systems. We call this sys-
tem "ChatPrime" and will describe its architecture in this chapter. The components used in
ChatPrime may also be useful for enabling CogPrime systems to carry out more humanlike
conversation, via their incorporation in learned schemata; but we will not focus on that aspect
here. In addition to its intrinsic interest, consideration of ChatPrime sheds much light on the
conceptual relationship between NLP and other aspects of CogPrime.
We are very aware that there is an active subfield of computational linguistics focused on
dialogue systems [Wal11, LDA05]; however we will not draw significantly on that literature here.
Making practical dialogue systems in the absence of a generally functional cognitive engine is
a subtle and difficult art, which has been addressed in a variety of ways; however, we have
found that designing a dialogue system within the context of an integrative cognitive engine
like CogPrime is a somewhat different sort of endeavor.
48.1.1 Two Phases of Dialogue System Development
In practical terms, we envision the ChatPrime system as possessing two phases of development:
1. Phase 1:
• "Lower levels" of NL comprehension and generation executed by a relatively traditional
approach incorporating statistical and rule-based aspects (the RelEx and NLGen sys-
tems)
• Dialogue control utilizes hand-coded procedures and predicates (SpeechActSchema and
SpeechActTriggers) corresponding to fine-grained types of speech act
• Dialogue control guided by general cognitive control system (OpenPsi, running within
OpenCog)
• SpeechActSchema and SpeechActTriggers, in some cases, will internally consult proba-
bilistic inference, thus supplying a high degree of adaptive intelligence to the conversa-
tion
2. Phase 2:
• "Lower levels" of NL comprehension and generation carried out within primary cogni-
tion engine, in a manner enabling their underlying rules and probabilities to be modified
based on the system's experience. Concretely, one way this could be done in OpenCog would
be via:
- Implementing the RelEx and RelEx2Frame rules as PLN implications in the Atomspace
- Implementing parsing via expressing the link parser dictionary as Atoms in the
Atomspace, and using the SAT link parser to do parsing as an example of logical
unification (carried out by a MindAgent wrapping a SAT solver)
- Implementing NLGen within the OpenCog core, via making NLGen's sentence
database a specially indexed Atomspace, and wrapping the NLGen operations in a
MindAgent
• Reimplement the SpeechActSchema and SpeechActTriggers in an appropriate combina-
tion of Combo and PLN logical link types, so they are susceptible to modification via
inference and evolution
It's worth noting that the work required to move from Phase 1 to Phase 2 is essentially
software development and computer science algorithm optimization work, rather than compu-
tational linguistics or AI theory. Then after the Phase 2 system is built there will, of course,
be significant work involved in "tuning" PLN, MOSES and other cognitive algorithms to ex-
perientially adapt the various portions of the dialogue system that have been moved into the
OpenCog core and refactored for adaptiveness.
48.2 Speech Act Theory and its Elaboration
We review here the very basics of speech act theory, and then the specific variant of speech act
theory that we feel will be most useful for practical OpenCog dialogue system development.
The core notion of speech act theory is to analyze linguistic behavior in terms of discrete
speech acts aimed at achieving specific goals. This is a convenient theoretical approach in
an OpenCog context, because it pushes us to treat speech acts just like any other acts that
an OpenCog system may carry out in its world, and to handle speech acts via the standard
OpenCog action selection mechanism.
Searle, who originated speech act theory, divided speech acts according to the following (by
now well known) ontology:
• Assertives : The speaker commits herself to something being true. The sky is blue.
• Directives: The speaker attempts to get the hearer to do something. Clean your room!
• Commissives: The speaker commits to some future course of action. I will do it.
• Expressives: The speaker expresses some psychological state. I'm sorry.
• Declarations: The speaker brings about a different state of the world. The meeting is
adjourned.
Inspired by this ontology, Twitchell and Nunamaker (in their 2004 paper "Speech Act Pro-
filing: A Probabilistic Method for Analyzing Persistent Conversations and Their Participants")
created a much more fine-grained ontology of 42 kinds of speech acts, called SWBD-DAMSL
(DAMSL = Dialogue Act Markup in Several Layers). Nearly all of their 42 speech act types
can be neatly mapped into one of Searle's 5 high level categories, although a handful don't fit
Searle's view and get categorized as "other." Figures 48.1 and 48.2 depict the 42 acts and their
relationship to Searle's categories.
48.3 Speech Act Schemata and Triggers
In the suggested dialogue system design, multiple SpeechActSchema would be implemented,
corresponding roughly to the 42 SWBD-DAMSL speech acts. The correspondence is "rough"
because
• we may wish to add new speech acts not in their list
• sometimes it may be most convenient to merge 2 or more of their speech acts into a single
SpeechActSchema. For instance, it's probably easiest to merge their YES ANSWER and NO
ANSWER categories into a single TRUTH VALUE ANSWER schema, yielding affirmative,
negative, and intermediate answers like "probably", "probably not", "I'm not sure", etc.
• sometimes it may be best to split one of their speech acts into several, e.g. to separately
consider STATEMENTs which are responses to statements, versus statements that are
unsolicited disbursements of "what's on the agent's mind."
Overall, the SWBD-DAMSL categories should be taken as guidance rather than doctrine. How-
ever, they are valuable guidance due to their roots in detailed analysis of real human conversa-
tions, and their role as a bridge between concrete conversational analysis and the abstractions
of speech act theory.
Each SpeechActSchema would take in an input consisting of a DialogueNode, a Node type
possessing a collection of links to
• a series of past statements by the agent and other conversation participants, with
- each statement labeled according to the utterer
[Figure: table of the 42 SWBD-DAMSL speech act tags with example utterances, including STATEMENT-NON-OPINION, ACKNOWLEDGE, STATEMENT-OPINION, YES-NO-QUESTION, YES ANSWER, NO ANSWER, WH-QUESTION, RHETORICAL-QUESTION, APPRECIATION, THANKING, APOLOGY, and others.]

Fig. 48.1: The 42 DAMSL speech act categories.
[Figure: mapping of the 42 DAMSL speech act tags onto Searle's five higher-level categories (Assertives, Directives, Commissives, Expressives, Declarations), with a residual "Other" group for tags that do not fit Searle's scheme.]

Fig. 48.2: Connecting the 42 DAMSL speech act categories to Searle's 5 higher-level categories.
- each statement uttered by the agent, labeled according to which SpeechActSchema
was used to produce it, plus (see below) which SpeechActTrigger and which response
generator were involved
• a set of Atoms comprising the context of the dialogue. These Atoms may optionally be
linked to some of the Atoms representing some of the past statements. If they are not so
linked, they are considered as general context.
The enaction of SpeechActSchema would be carried out via PredictivelmplicationLinks em-
bodying "Context AND Schema → Goal" schematic implications, of the general form
PredictiveImplication
   AND
      Evaluation
         SpeechActTrigger T
         DialogueNode D
      Execution
         SpeechActSchema S
         DialogueNode D
   Evaluation
      Goal G

with

ExecutionOutput
   SpeechActSchema S
   DialogueNode D
   UtteranceNode U
being created as a result of the enaction of the SpeechActSchema. (An UtteranceNode is a series
of one or more SentenceNodes.)
A single SpeechActSchema may be involved in many such implications, with different prob-
abilistic weights, if it naturally has many different Trigger contexts.
Internally each SpeechActSchema would contain a set of one or more response generators,
each one of which is capable of independently producing a response based on the given input.
These may also be weighted, where the weight determines the probability of a given response
generation process being chosen in preference to the others, once the choice to enact that
particular SpeechActSchema has already been made.
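The weighted choice among response generators might be sketched as follows; the generator list, the weights and the sampling routine are illustrative assumptions, not the actual OpenCog mechanism:

```python
# Sketch: once a SpeechActSchema has been selected, one of its internal
# response generators is picked with probability proportional to its weight.
import random

def choose_generator(generators, rng=random):
    """generators: list of (weight, callable); weights need not sum to 1."""
    total = sum(weight for weight, _ in generators)
    r = rng.uniform(0, total)
    accumulated = 0.0
    for weight, generator in generators:
        accumulated += weight
        if r <= accumulated:
            return generator
    return generators[-1][1]   # guard against floating-point rounding

generators = [(0.7, lambda: "Yes."), (0.3, lambda: "Indeed it is.")]
print(choose_generator(generators)())
```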
48.3.1 Notes Toward Example SpeechActSchema
To make the above ideas more concrete, let's consider a few specific SpeechActSchema. We
won't fully specify them here, but will outline them sufficiently to make the ideas clear.
48.3.1.1 TruthValueAnswer
The TruthValueAnswer SpeechActSchema would encompass SWED-DAMSL's YES ANSWER
and NO ANSWER, and also more flexible truth value based responses.
Trigger context
: when the conversation partner produces an utterance that RelEx maps into a truth-value
query (this is simple as truth-value-query is one of RelEx's relationship types).
Goal
: the simplest goal relevant here is pleasing the conversation partner, since the agent may have
noticed in the past that other agents are pleased when their questions are answered. (More
advanced agents may of course have other goals for answering questions, e.g. providing the
other agent with information that will let it be more useful in future.)
Response generation schema
: for starters, this SpeechActSchema could simply operate as follows. It takes the relationship
(Atom) corresponding to the query, and uses it to launch a query to the pattern matcher or PLN
backward chainer. Then based on the result, it produces a relationship (Atom) embodying the
answer to the query, or else updates the truth value of the existing relationship corresponding
to the answer to the query. This "answer" relationship has a certain truth value. The schema
could then contain a set of rules mapping the truth values into responses, with a list of possible
responses for each truth value range. For example a very high strength and high confidence
truth value would be mapped into a set of responses like (definitely, certainly, surely, yes, in-
deed).
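A minimal sketch of such a truth-value-to-response mapping follows; the thresholds and wordings are purely illustrative assumptions:

```python
# Sketch: the (strength, confidence) truth value of the answer Atom selects
# a band of candidate surface responses. Thresholds here are illustrative.

def tv_to_responses(strength, confidence):
    if confidence < 0.2:
        return ["I'm not sure"]
    if strength > 0.8:
        return ["definitely", "certainly", "surely", "yes", "indeed"]
    if strength > 0.6:
        return ["probably"]
    if strength < 0.2:
        return ["no", "definitely not"]
    if strength < 0.4:
        return ["probably not"]
    return ["maybe"]

print(tv_to_responses(0.95, 0.9))   # high strength & confidence -> "yes"-type responses
```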
This simple case exemplifies the overall Phase 1 approach suggested here. The conversa-
tion will be guided by fairly simple heuristic rules, but with linguistic sophistication in the
comprehension and generation aspects, and potentially subtle inference invoked within the
SpeechActSchema or (less frequently) the Trigger contexts. Then in Phase 2 these simple heuris-
tic rules will be refactored in a manner rendering them susceptible to experiential adaptation.
48.3.1.2 Statement: Answer
The next few SpeechActSchema (plus maybe some similar ones not given here) are intended to
collectively cover the ground of SWBD-DAMSL's STATEMENT OPINION and STATEMENT
NON-OPINION acts.
Trigger context
: The trigger is that the conversation partner asks a wh- question
Goal
: Similar to the case of a TruthValueAnswer, discussed above
Response generation schema
: When a wh- question is received, one reasonable response is to produce a statement comprising
an answer. The question Atom is posed to the pattern matcher or PLN, which responds with an
Atom-set comprising a putative answer. The answer Atoms are then pared down into a series
of sentence-sized Atom-sets, which are articulated as sentences by NLGen. If the answer Atoms
have very low-confidence truth values, or if the Atomspace contains knowledge that other agents
significantly disagree with the agent's truth value assessments, then the answer Atom-set may
have Atoms corresponding to "I think" or "In my opinion" etc. added onto it (this gives an
instance of the STATEMENT OPINION act).
48.3.1.3 Statement: Unsolicited Observation
Trigger context
: when in the presence of another intelligent agent (human or AI) and nothing has been said
for a while, there is a certain probability of choosing to make a "random" statement.
Goal 1
: Unsolicited observations may be made with a goal of pleasing the other agent, as it may have
been observed in the past that other agents are happier when spoken to
Goal 2
: Unsolicited observations may be made with goals of increasing the agent's own pleasure or
novelty or knowledge - because it may have been observed that speaking often triggers conver-
sations, and conversations are often more pleasurable or novel or educational than silence
Response generation schema
: One option is a statement describing something in the mutual environment, another option is
a statement derived from high-STI Atoms in the agent's Atomspace. The particulars are similar
to the "Statement: Answer" case.
48.3.1.4 Statement: External Change Notification
Trigger context
: when in a situation with another intelligent agent, and something significant changes in the
mutually perceived situation, a statement describing it may be made.
Goal 1
: External change notification utterances may be made for the same reasons as Unsolicited
Observations, described above.
Goal 2
: The agent may think a certain external change is important to the other agent it is talking
to, for some particular reason. For instance, if the agent sees a dog steal Bob's property, it may
wish to tell Bob about this.
Goal 3
: The change may be important to the agent itself - and it may want its conversation partner
to do something relevant to an observed external change ... so it may bring the change to the
partner's attention for this reason. For instance, "Our friends are leaving. Please try to make
them come back."
Response generation schema
: The Atom-set for expression characterizes the change observed. The particulars are similar to
the "Statement: Answer" case.
48.3.1.5 Statement: Internal Change Notification
Trigger context 1
: when the importance level of an Atom increases dramatically while in the presence of an-
other intelligent agent, a statement expressing this Atom (and some of its currently relevant
surrounding Atoms) may be made
Trigger context 2
: when the truth value of a reasonably important Atom changes dramatically while in the
presence of another intelligent agent, a statement expressing this Atom and its truth value may
be made
Goal
: Similar goals apply here as to External Change Notification, considered above
Response generation schema
: Similar to the "Statement: External Change Notification" case.
48.3.1.6 WHQuestion
Trigger context
: being in the presence of an intelligent agent thought capable of answering questions
Goal 1
: the general goal of increasing the agent's total knowledge
Goal 2
: the agent notes that, to achieve one of its currently important goals, it would be useful to
possess an Atom fulfilling a certain specification
Response generation schema
: Formulate a query whose answer would be an Atom fulfilling that specification, and then
articulate this logical query as an English question using NLGen
48.4 Probabilistic Mining of Trigger contexts
One question raised by the above design sketch is where the Trigger contexts come from. They
may be hand-coded, but this approach may suffer from excessive brittleness. The approach
suggested by Twitchell and Nunamaker's work (which involved modeling human dialogues rather
than automatically generating intelligent dialogues) is statistical. That is, they suggest marking
up a corpus of human dialogues with tags corresponding to the 42 speech acts, and learning
from this annotated corpus a set of Markov transition probabilities indicating which speech acts
are most likely to follow which others. In their approach the transition probabilities refer only
to series of speech acts.
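Twitchell and Nunamaker's statistical scheme can be sketched as a maximum-likelihood estimate of Markov transition probabilities over speech-act-tagged dialogues. The following minimal Python illustration is an assumption-laden sketch (the function name and data layout are ours, not part of their system):

```python
from collections import Counter, defaultdict

def transition_probabilities(dialogues):
    """Estimate P(next_act | current_act) from speech-act-tagged dialogues.

    `dialogues` is a list of dialogues, each a list of SWBD-DAMSL tags,
    e.g. [["ad", "qw", "sd", "qy", "ny", "ft"], ...].
    """
    counts = defaultdict(Counter)
    for tags in dialogues:
        # Count each adjacent pair of speech acts in the dialogue.
        for cur, nxt in zip(tags, tags[1:]):
            counts[cur][nxt] += 1
    # Normalize counts into conditional probabilities.
    probs = {}
    for cur, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[cur] = {nxt: c / total for nxt, c in nxt_counts.items()}
    return probs
```

Given a sufficiently large annotated corpus, the resulting table says which speech acts most often follow which others, which is exactly the information a response generator can consult when choosing what kind of utterance to produce next.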
In an OpenCog context one could utilize a more sophisticated training corpus in a more
sophisticated way. For instance, suppose one wants to build a dialogue system for a game char-
acter conversing with human characters in a game world. Then one could conduct experiments
in which one human controls a "human" game character, and another human puppeteers an
"AI" game character. That is, the puppeteered character funnels its perceptions to the AI
system, but has its actions and verbalizations controlled by the human puppeteer. Given the
dialogue from this sort of session, one could then perform markup according to the 42 speech
acts.
As a simple example, consider the following brief snippet of annotated conversation:
speaker  utterance             speech act type
Ben      Go get me the ball    ad
AI       Where is it?          qw
Ben      Over there [points]   sd
AI       By the table?         qy
Ben      Yeah                  ny
AI       Thanks                ft
AI       I'll get it now.      commits
A DialogueNode object based on this snippet would contain the information in the table, plus
some physical information about the situation, such as, in this case: predicates describing the
relative locations of the two agents, the ball and the table (e.g. the two agents are very near each
other, the ball and the table are very near each other, but these two groups of entities are only
moderately near each other); and, predicates involving
Then, one could train a machine learning algorithm such as MOSES to predict the probability
of speech act type S1 occurring at a certain point in a dialogue history, based on the prior history
of the dialogue. This prior history could include percepts and cognitions as well as utterances,
since one has a record of the AI system's perceptions and cognitions in the course of the
marked-up dialogue.
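The kind of DialogueNode just described might be rendered as a simple data structure like the following. This is a hypothetical Python sketch for illustration only; the actual OpenCog representation is Atom-based and differs in detail:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class DialogueNode:
    """One annotated dialogue snippet plus its situational context."""
    # (speaker, utterance, SWBD-DAMSL speech act tag) triples
    turns: List[Tuple[str, str, str]]
    # physical-context predicates observed during the snippet,
    # e.g. {"near(ball, table)": True, "near(Ben, AI)": True}
    context: Dict[str, bool] = field(default_factory=dict)

    def act_sequence(self) -> List[str]:
        """The bare speech-act sequence, as used for Markov-style training."""
        return [tag for _, _, tag in self.turns]
```

A learner such as MOSES would then be trained on many such nodes, predicting the next tag in `act_sequence()` from the prior tags together with the `context` predicates.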
One question is whether to use the 42 SWBD-DAMSL speech acts for the creation of the
annotated corpus, or whether instead to use the modified set of speech acts created in design-
ing SpeechActSchema. Either way could work, but we are mildly biased toward the former,
since this specific SWBD-DAMSL markup scheme has already proved its viability for marking
up conversations. It seems unproblematic to map probabilities corresponding to these speech
acts into probabilities corresponding to a slightly refined set of speech acts. Also, this way
the corpus would be valuable independently of ongoing low-level changes in the collection of
SpeechActSchema.
In addition to this sort of supervised training in advance, it will be important to enable the
system to learn Trigger contexts online as a consequence of its life experience. This learning
may take two forms:
1. Most simply, adjustment of the probabilities associated with the PredictivelmplicationLinks
between SpeechActTriggers and SpeechActSchema
2. More sophisticatedly, learning of new SpeechActTrigger predicates, using an algorithm such
as MOSES for predicate learning, based on mining the history of actual dialogues to estimate
fitness
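The first, simpler form of learning can be illustrated as a standard evidence-weighted revision of a (strength, count) truth value. The function below is a hedged sketch; its name and weighting scheme are illustrative assumptions, not the actual PLN revision rule:

```python
def update_trigger_strength(strength, count, success, weight=1.0):
    """Revise the (strength, count) truth value on a PredictiveImplication
    between a SpeechActTrigger and a SpeechActSchema, given whether the
    dialogue episode it participated in fulfilled the system's goals.

    Successful episodes pull strength up; unsuccessful ones pull it down.
    `count` tracks accumulated evidence, so each new episode moves the
    strength less as evidence grows.
    """
    new_count = count + weight
    evidence = 1.0 if success else 0.0
    new_strength = (strength * count + evidence * weight) / new_count
    return new_strength, new_count
```

For example, a trigger at strength 0.5 with one unit of evidence moves to 0.75 after a successful episode, and to 0.25 after an unsuccessful one.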
In both cases the basis for learning is information regarding the extent to which system goals
were fulfilled by each past dialogue. PredictiveImplications that correspond to portions of successful dialogues will have their truth values increased, and those corresponding to portions of
unsuccessful dialogues will have their truth values decreased. Candidate SpeechActTriggers will
be valued based on the observed historical success of the responses they would have generated
based on historically perceived utterances; and (ultimately) more sophisticatedly, based on the
estimated success of the responses they generate. Note that, while somewhat advanced, this kind
of learning is much easier than the procedure learning required to learn new SpeechActSchema.
48.5 Conclusion
While the underlying methods are simple, the above methods appear capable of producing
arbitrarily complex dialogues about any subject that is represented by knowledge in the Atom-
Space. There is no reason why dialogue produced in this manner should be indistinguishable
from human dialogue; but it may nevertheless be humanly comprehensible, intelligent and in-
sightful. What is happening in this sort of dialogue system is somewhat similar to current
natural language query systems that query relational databases, but the "database" in ques-
tion is a dynamically self-adapting weighted labeled hypergraph rather than a static relational
database, and this difference means a much more complex dialogue system is required, as well
as more flexible language comprehension and generation components.
Ultimately, a CogPrime system - if it works as desired - will be able to learn increased
linguistic functionality, and new languages, on its own. But this is not a prerequisite for having
intelligent dialogues with a CogPrime system. Via building a ChatPrime type system, as out-
lined here, intelligent dialogue can occur with a CogPrime system while it is still at relatively
early stages of cognitive development, and even while the underlying implementation of the
CogPrime design is incomplete. This is not closely analogous to human cognitive and linguistic
development, but it can still be pursued in the context of a CogPrime development plan that
follows the overall arc of human developmental psychology.
Section VIII
From Here to AGI
Chapter 49
Summary of Argument for the CogPrime
Approach
49.1 Introduction
By way of conclusion, we now return to the "key claims" that were listed at the end of Chapter
1 of Part 1. Quite simply, this is a list of claims such that - roughly speaking - if the reader
accepts these claims, they should accept that the CogPrime approach to AGI is a viable one.
On the other hand if the reader rejects one or more of these claims, they may well find one
or more aspects of CogPrime unacceptable for some related reason. In Chapter 1 of Part 1 we
merely listed these claims; here we briefly discuss each one in the context of the intervening
chapters, giving each one its own section or subsection.
As we clarified at the start of Part 1, we don't fancy that we have provided an ironclad
argument that the CogPrime approach to AGI is guaranteed to work as hoped, once it's fully
engineered, tuned and taught. Mathematics isn't yet adequate to analyze the real-world behavior
of complex systems like these; and we have not yet implemented, tested and taught enough of
CogPrime to provide convincing empirical validation. So, most of the claims listed here have
not been rigorously demonstrated, but only heuristically argued for. That is the reality of
AGI work right now: one assembles a design based on the best combination of rigorous and
heuristic arguments one can, then proceeds to create and teach a system according to the design,
adjusting the details of the design based on experimental results as one goes along. For an
uncluttered list of the claims, please refer back to Chapter 1 of Part 1; here we will review the
claims integrated into the course of discussion.
The following chapter, aimed at the more mathematically-minded reader, gives a list of
formal propositions echoing many of the ideas in the chapter - propositions such that, if they
are true, then the success of CogPrime as an architecture for general intelligence is likely.
49.2 Multi-Memory Systems
The first of our key claims is that to achieve general intelligence in the context of human-
intelligence-friendly environments and goals using feasible computational resources, it's impor-
tant that an AGI system can handle different kinds of memory (declarative, procedural, episodic,
sensory, intentional, attentional) in customized but interoperable ways. The basic idea is that
478 49 Summary of Argument for the CogPrime Approach
these different kinds of knowledge have very different characteristics, so that trying to handle
them all within a single approach, while surely possible, is likely to be unacceptably inefficient.
The tricky issue in formalizing this claim is that "single approach" is an ambiguous notion:
for instance, if one has a wholly logic-based system that represents all forms of knowledge using
predicate logic, then one may still have specialized inference control heuristics corresponding
to the different kinds of knowledge mentioned in the claim. In this case one has "customized
but interoperable ways" of handling the different kinds of memory, and one doesn't really have
a "single approach" even though one is using logic for everything. To bypass such conceptual
difficulties, one may formalize cognitive synergy using a geometric framework as discussed in
Appendix B, in which different types of knowledge are represented as metrized categories,
and cognitive synergy becomes a statement about paths to goals being shorter in metric spaces
combining multiple knowledge types than in those corresponding to individual knowledge types.
In CogPrime we use a complex combination of representations, including the Atomspace for
declarative, attentional and intentional knowledge and some episodic and sensorimotor knowl-
edge, Combo programs for procedural knowledge, simulations for episodic knowledge, and hi-
erarchical neural nets for some sensorimotor knowledge (and related episodic, attentional and
intentional knowledge). In cases where the same representational mechanism is used for dif-
ferent types of knowledge, different cognitive processes are used, and often different aspects of
the representation (e.g. attentional knowledge is dealt with largely by ECAN acting on AttentionValues and HebbianLinks in the Atomspace; whereas declarative knowledge is dealt with
largely by PLN acting on TruthValues and logical links, also in the AtomSpace). So one has
a mix of the "different representations for different memory types" approach and the "different
control processes on a common representation for different memory types" approach.
It's unclear how closely dependent the need for a multi-memory approach is on the particulars
of "human-friendly environments." We argued in Chapter 9 of Part 1 that one factor militating
in favor of a multi-memory approach is the need for multimodal communication: declarative
knowledge relates to linguistic communication; procedural knowledge relates to demonstrative
communication; attentional knowledge relates to indicative communication; and so forth. But
in fact the multi-memory approach may have a broader importance, even to intelligences with-
out multimodal communication. This is an interesting issue but not particularly critical to the
development of human-like, human-level AGI, since in the latter case we are specifically con-
cerned with creating intelligences that can handle multimodal communication. So if for no other
reason, the multi-memory approach is worthwhile for handling multi-modal communication.
Pragmatically, it is also quite clear that the human brain takes a multi-memory approach,
e.g. with the cerebellum and closely linked cortical regions containing special structures for
handling procedural knowledge, with special structures for handling motivational (intentional)
factors, etc. And (though this point is certainly not definitive, it's meaningful in the light of
the above theoretical discussion) decades of computer science and narrow-AI practice strongly
suggest that the "one memory structure fits all" approach is not capable of leading to effective
real-world approaches.
49.3 Perception, Action and Environment
The more we understand of human intelligence, the clearer it becomes how closely it has evolved
to the particular goals and environments for which the human organism evolved. This is true
in a broad sense, as illustrated by the above issues regarding multi-memory systems, and is
also true in many particulars, as illustrated e.g. by Changizi's [Cha09] evolutionary analysis of
the human visual system. While it might be possible to create a human-like, human-level AGI
by abstracting the relevant biases from human biology and behavior and explicitly encoding
them in one's AGI architecture, it seems this would be an inordinately difficult approach in
practice, leading to the claim that to achieve human-like general intelligence, it's important for
an intelligent agent to have sensory data and motoric affordances that roughly emulate those
available to humans. We don't claim this is a necessity - just a dramatic convenience. And if
one accepts this point, it has major implications for what sorts of paths toward AGI it makes
most sense to follow.
Unfortunately, though, the idea of a "human-like" set of goals and environments is fairly
vague; and when you come right down to it, we don't know exactly how close the emulation
needs to be to form a natural scenario for the maturation of human-like, human-level AGI
systems. One could attempt to resolve this issue via a priori theory, but given the current level
of scientific knowledge it's hard to see how that would be possible in any definitive sense ...
which leads to the conclusion that our AGI systems and platforms need to support fairly flexible
experimentation with virtual-world and/or robotic infrastructures.
Our own intuition is that currently neither current virtual world platforms, nor current
robotic platforms, are quite adequate for the development of human-level, human-like AGI.
Virtual worlds would need to become a lot more like robot simulators, allowing more flexible
interaction with the environment, and more detailed control of the agent. Robots would need
to become more robust at moving and grabbing - e.g. with Big Dog's movement ability but the
grasping capability of the best current grabber arms. We do feel that development of adequate
virtual world or robotics platforms is quite possible using current technology, and could be done
at fairly low cost if someone were to prioritize this. Even without AGI-focused prioritization, it
seems that the needed technological improvements are likely to happen during the next decade
for other reasons. So at this point we feel it makes sense for AGI researchers to focus on AGI
and exploit embodiment-platform improvements as they come along - at least, this makes
sense in the case of AGI approaches (like CogPrime ) that can be primarily developed in an
embodiment-platform-independent manner.
49.4 Developmental Pathways
But if an AGI system is going to live in human-friendly environments, what should it do there?
No doubt very many pathways leading from incompetence to adult-human-level general intel-
ligence exist, but one of them is much better understood than any of the others, and that's
the one normal human children take. Of course, given their somewhat different embodiment, it
doesn't make sense to try to force AGI systems to take exactly the same path as human chil-
dren, but having AGI systems follow a fairly close approximation to the human developmental
path seems the smoothest developmental course ... a point summarized by the claim that: To
work toward adult human-level, roughly human-like general intelligence, one fairly easily com-
prehensible path is to use environments and goals reminiscent of human childhood, and seek to
advance one's AGI system along a path roughly comparable to that followed by human children.
Human children learn via a rich variety of mechanisms; but broadly speaking one conclusion
one may draw from studying human child learning is that it may make sense to teach an
AGI system aimed at roughly human-like general intelligence via a mix of spontaneous learning
and explicit instruction, and to instruct it via a combination of imitation, reinforcement and
correction, and a combination of linguistic and nonlinguistic instruction. We have explored
exactly what this means in Chapter 31 and others, via looking at examples of these types of
learning in the context of virtual pets in virtual worlds, and exploring how specific CogPrime
learning mechanisms can be used to achieve simple examples of these types of learning.
One important case of learning that human children are particularly good at is language
learning; and we have argued that this is a case where it may pay for AGI systems to take
a route somewhat different from the one taken by human children. Humans seem to be born
with a complex system of biases enabling effective language learning, and it's not yet clear
exactly what these biases are nor how they're incorporated into the learning process. It is very
tempting to give AGI systems a "short cut" to language proficiency via making use of existing
rule-based and statistical-corpus-analysis-based NLP systems; and we have fleshed out this
approach sufficiently to have convinced ourselves it makes practical as well as conceptual sense,
in the context of the specific learning mechanisms and NLP tools built into OpenCog. Thus we
have provided a number of detailed arguments and suggestions in support of our claim that one
effective approach to teaching an AGI system human language is to supply it with some in-built
linguistic facility, in the form of rule-based and statistical-linguistics-based NLP systems, and
then allow it to improve and revise this facility based on experience.
49.5 Knowledge Representation
Many knowledge representation approaches have been explored in the AI literature, and ulti-
mately many of these could be workable for human-level AGI if coupled with the right cog-
nitive processes. The key goal for a knowledge representation for AGI should be naturalness
with respect to the AGI's cognitive processes - i.e. the cognitive processes shouldn't need
to undergo complex transformative gymnastics to get information in and out of the knowl-
edge representation in order to do their cognitive work. Toward this end we have come to a
similar conclusion to some other researchers (e.g. Joscha Bach and Stan Franklin), and con-
cluded that given the strengths and weaknesses of current and near-future digital computers,
a (loosely) neural-symbolic network is a good representation for directly storing many kinds of
memory, and interfacing between those that it doesn't store directly. CogPrime's AtomSpace is
a neural-symbolic network designed to work nicely with PLN, MOSES, ECAN and the other
key CogPrime cognitive processes; it supplies them with what they need without causing them
undue complexities. It provides a platform that these cognitive processes can use to adaptively,
automatically construct specialized knowledge representations for particular sorts of knowledge
that they encounter.
49.6 Cognitive Processes
The crux of intelligence is dynamics, learning, adaptation; and so the crux of an AGI design is
the set of cognitive processes that the design provides. These processes must collectively allow
the AGI system to achieve its goals in its environments using the resources at hand. Given
CogPrime's multi-memory design, it's natural to consider CogPrime's cognitive processes in
terms of which memory subsystems they focus on (although this is not a perfect mode of
analysis, since some of the cognitive processes span multiple memory types).
49.6.1 Uncertain Logic for Declarative Knowledge
One major decision made in the creation of CogPrime was that given the strengths and weak-
nesses of current and near-future digital computers, uncertain logic is a good way to handle
declarative knowledge. Of course this is not obvious nor is it the only possible route. Declarative
knowledge can potentially be handled in other ways; e.g. in a hierarchical network architecture,
one can make declarative knowledge emerge automatically from procedural and sensorimotor
knowledge, as is the goal in the Numenta and DeSTIN designs reviewed in Chapter 4 of Part
1. It seems clear that the human brain doesn't contain anything closely parallel to formal logic
- even though one can ground logic operations in neural-net dynamics as explored in Chapter
34, this sort of grounding leads to "uncertain logic enmeshed with a host of other cognitive
dynamics" rather than "uncertain logic as a cleanly separable cognitive process."
But contemporary digital computers are not brains - they lack the human brain's capacity
for cheap massive parallelism, but have a capability for single-operation speed and precision
far exceeding the brain's. In this way computers and formal logic are a natural match (a fact
that's not surprising given that Boolean logic lies at the foundation of digital computer opera-
tions). Using uncertain logic is a sort of compromise between brainlike messiness and fuzziness,
and computerlike precision. An alternative to using uncertain logic is using crisp logic and in-
corporating uncertainty as content within the knowledge base - this is what SOAR does, for
example, and it's not a wholly unworkable approach. But given that the vast mass of knowledge
needed for confronting everyday human reality is highly uncertain, and that this knowledge of-
ten needs to be manipulated efficiently in real-time, it seems to us there is a strong argument
for embedding uncertainty in the logic.
Many approaches to uncertain logic exist in the literature, including probabilistic and fuzzy
approaches, and one conclusion we reached in formulating CogPrime is that none of them was
adequate on its own — leading us, for example, to the conclusion that to deal with the problems
facing a human-level AGI, an uncertain logic must integrate imprecise probability and fuzziness
with a broad scope of logical constructs. The arguments that both fuzziness and probability are
needed seem hard to counter - these two notions of uncertainty are qualitatively different yet
both appear cognitively necessary.
The argument for using probability in an AGI system is assailed by some AGI researchers such
as Pei Wang, but we are swayed by the theoretical arguments in favor of probability theory's
mathematically fundamental nature, as well as the massive demonstrated success of probability
theory in various areas of narrow AI and applied science. However, we are also swayed by
the arguments of Pei Wang, Peter Walley and others that using single-number probabilities to
represent truth values leads to untoward complexities related to the tabulation and manipulation
of amounts of evidence. This has led us to an imprecise probability based approach; and then
technical arguments regarding the limitations of standard imprecise probability formalisms have
led us to develop our own "indefinite probabilities" formalism.
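To make the contrast with single-number probabilities concrete, an indefinite truth value can be sketched as an interval plus a credibility level. This is a deliberately simplified illustration of the idea, not the full indefinite-probabilities formalism:

```python
from dataclasses import dataclass

@dataclass
class IndefiniteTV:
    """An indefinite truth value <[L, U], b>: with credibility level b,
    the true mean probability is believed to lie in [L, U]."""
    L: float
    U: float
    b: float = 0.9

    def mean(self) -> float:
        """A point estimate of the underlying probability."""
        return (self.L + self.U) / 2.0

    def width(self) -> float:
        """Interval width: wider intervals encode less evidence,
        addressing the evidence-tabulation problem single numbers hide."""
        return self.U - self.L
```

Two atoms can thus share the same point probability 0.5 while one, with interval [0.45, 0.55], is backed by far more evidence than another with interval [0.1, 0.9].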
The PLN logic framework is one way of integrating imprecise probability and fuzziness in a
logical formalism that encompasses a broad scope of logical constructs. It integrates term logic
and predicate logic - a feature that we consider not necessary, but very convenient, for AGI.
Either predicate or term logic on its own would suffice, but each is awkward in certain cases,
and integrating them as done in PLN seems to result in more elegant handling of real-world
inference scenarios. Finally, PLN also integrates intensional inference in an elegant manner
that demonstrates integrative intelligence - it defines intension using pattern theory, which
binds inference to pattern recognition and hence to other cognitive processes in a conceptually
appropriate way.
Clearly PLN is not the only possible logical formalism capable of serving a human-level AGI
system; however, we know of no other existing, fleshed-out formalism capable of fitting the bill.
In part this is because PLN has been developed as part of an integrative AGI project whereas
other logical formalisms have mainly been developed for other purposes, or purely theoretically.
Via using PLN to control virtual agents, and integrating PLN with other cognitive processes, we
have tweaked and expanded the PLN formalism to serve all the roles required of the "declarative
cognition" component of an AGI system with reasonable elegance and effectiveness.
49.6.2 Program Learning for Procedural Knowledge
Even more so than declarative knowledge, procedural knowledge is represented in many different
ways in the AI literature. The human brain also apparently uses multiple mechanisms to embody
different kinds of procedures. So the choice of how to represent procedures in an AGI system
is not particularly obvious. However, there is one particular representation of procedures that
is particularly well-suited for current computer systems, and particularly well-tested in this
context: programs. In designing CogPrime, we have acted based on the understanding that
programs are a good way to represent procedures - including both cognitive and physical-action
procedures, but perhaps not including low-level motor-control procedures.
Of course, this begs the question of programs in what programming language, and in this
context we have made a fairly traditional choice, using a special language called Combo that is
essentially a minor variant of LISP, and supplying Combo with a set of customized primitives
intended to reduce the length of the typical programs CogPrime needs to learn and use. What
differentiates this use of LISP from many traditional uses of LISP in AI is that we are only
using the LISP-ish representational style for procedural knowledge, rather than trying to use it
for everything.
One test of whether the use of Combo programs to represent procedural knowledge makes
sense is whether the procedures useful for a CogPrime system in everyday human environments
have short Combo representations. We have worked with Combo enough to validate that they
generally do in the virtual world environment - and also in the physical-world environment
if lower-level motor procedures are supplied as primitives. That is, we are not convinced that
Combo is a good representation for the procedure a robot needs in order to move its fingers to
pick up a cup, coordinating its movements with its visual perceptions. It's certainly possible to
represent this sort of thing in Combo, but Combo may be an awkward tool. However, if one
represents low-level procedures like this using another method, e.g. learned cell assemblies in a
hierarchical network like DeSTIN, then it's very feasible to make Combo programs that invoke
these low-level procedures, and encode higher-level actions like "pick up the cup in front of you
slowly and quietly, then hand it to Jim who is standing next to you."
Having committed to use programs to represent many procedures, the next question is how
to learn programs. One key conclusion we have come to via our empirical work in this area is
that some form of powerful program normalization is essential. Without normalization, it's too
hard for existing learning algorithms to generalize from known, tested programs and draw useful
uncertain conclusions about untested ones. We have worked extensively with a generalization
of Holman's "Elegant Normal Form" in this regard.
For learning normalized programs, we have come to the following conclusions:
• for relatively straightforward procedure learning problems, hillclimbing with random restart
and a strong Occam bias is an effective method
• for more difficult problems that elude hillclimbing, probabilistic evolutionary program learning
is an effective method
The probabilistic evolutionary program learning method we have worked with most in OpenCog
is MOSES, and significant evidence has been gathered showing it to be dramatically more
effective than genetic programming on relevant classes of problems. However, more work needs to
be done to evaluate its progress on complex and difficult procedure learning problems. Alternate,
related probabilistic evolutionary program learning algorithms such as PLEASURE have also
been considered and may be implemented and tested as well.
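The first of the two conclusions above can be illustrated with a generic hillclimber whose fitness subtracts a complexity penalty, so that among equally scoring candidates the shorter one wins. This is a minimal sketch under our own assumptions (parameter names, penalty form); it is not MOSES or OpenCog internals:

```python
def hillclimb(random_candidate, neighbors, score, complexity,
              occam=0.1, restarts=5, steps=200):
    """Hillclimbing with random restarts and an Occam bias.

    `random_candidate` produces a fresh starting point, `neighbors(c)`
    enumerates single-step variants of c, `score(c)` measures raw task
    performance, and `complexity(c)` measures program size.
    """
    def fitness(c):
        # Occam bias: penalize complex candidates.
        return score(c) - occam * complexity(c)

    best = None
    for _ in range(restarts):
        cur = random_candidate()
        for _ in range(steps):
            improved = False
            for n in neighbors(cur):
                if fitness(n) > fitness(cur):
                    cur, improved = n, True
                    break  # greedy: take the first improving neighbor
            if not improved:
                break  # local optimum reached; try another restart
        if best is None or fitness(cur) > fitness(best):
            best = cur
    return best
```

In CogPrime the candidates would be normalized Combo programs and `neighbors` would enumerate syntactic variations of them; here any domain with a neighborhood structure will do.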
49.6.3 Attention Allocation
There is significant evidence that the brain uses some sort of "activation spreading" type method
to allocate attention, and many algorithms in this spirit have been implemented and utilized in
the AI literature. So, we find ourselves in agreement with many others that activation spreading
is a reasonable way to handle attentional knowledge (though other approaches, with greater
overhead cost, may provide better accuracy and may be appropriate in some situations). We
also agree with many others who have chosen Hebbian learning as one route of learning
associative relationships, with more sophisticated methods such as information-geometric
ones potentially also playing a role.
Where CogPrime differs from standard practice is in the use of an economic metaphor to reg-
ulate activation spreading. In this matter CogPrime is broadly in agreement with Eric Baum's
arguments about the value of economic methods in AI, although our specific use of economic
methods is very different from his. Baum's work (e.g. Hayek [Bau04]) embodies more complex
and computationally expensive uses of artificial economics, whereas we believe that in the con-
text of a neural-symbolic network, artificial economics is an effective approach to activation
spreading; and CogPrime's ECAN framework seeks to embody this idea. ECAN can also make
use of more sophisticated and expensive forms of artificial currency when large amounts of system
resources are involved in a single choice, rendering the cost appropriate.
One major choice made in the CogPrime design is to focus on two kinds of attention: proces-
sor (represented by ShortTermImportance) and memory (represented by LongTermImportance).
This is a direct reflection of one of the key differences between the von Neumann architecture
and the human brain: in the former but not the latter, there is a strict separation between mem-
ory and processing in the underlying compute fabric. We carefully considered the possibility of
using a larger variety of attention values, and in Chapter 23 we presented some mathematics
and concepts that could be used in this regard, but for reasons of simplicity and computational
efficiency we are currently using only STI and LTI in our OpenCogPrime implementation, with
the possibility of extending further if experimentation proves it necessary.
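The wage-and-rent dynamic behind STI and LTI can be conveyed with a toy sketch. Everything here (the `Atom` class, the wage and rent figures, the update rules) is a hypothetical simplification for illustration, not the actual OpenCog ECAN implementation:

```python
# Toy sketch of economic attention allocation (hypothetical classes and
# parameters; NOT the actual OpenCog ECAN implementation).

class Atom:
    def __init__(self, name):
        self.name = name
        self.sti = 0.0   # ShortTermImportance: claim on processor time
        self.lti = 0.0   # LongTermImportance: claim on memory

def ecan_cycle(atoms, stimulated, wage=10.0, rent=2.0):
    """One cycle: pay wages to atoms that proved useful, collect rent from all."""
    for a in atoms:
        if a.name in stimulated:
            a.sti += wage                  # reward useful atoms
        a.sti -= rent                      # everyone pays to stay important
        a.lti += max(a.sti, 0.0) * 0.01    # sustained STI slowly builds LTI

def attentional_focus(atoms, threshold=5.0):
    """Atoms rich enough in STI to deserve processor attention."""
    return {a.name for a in atoms if a.sti > threshold}

atoms = [Atom("owner"), Atom("bird"), Atom("box")]
for _ in range(3):
    ecan_cycle(atoms, stimulated={"owner", "bird"})
print(attentional_focus(atoms))   # {'owner', 'bird'}
```

The economic metaphor is visible even at this scale: importance is a conserved-ish currency that must be continually re-earned, so unstimulated atoms drift out of focus without any explicit deletion step.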
49.6.4 Internal Simulation and Episodic Knowledge
For episodic knowledge, as with declarative and procedural knowledge, CogPrime has opted
for a solution motivated by the particular strengths of contemporary digital computers. When
the human brain runs through a "mental movie" of past experiences, it doesn't do any kind of
accurate physical simulation of these experiences. But that's not because the brain wouldn't
benefit from such - it's because the brain doesn't know how to do that sort of thing! On the
other hand, any modern laptop can run a reasonable Newtonian physics simulation of everyday
events, and more fundamentally can recall and manage the relative positions and movements of
items in an internal 3D landscape paralleling remembered or imagined real-world events. With
this in mind, we believe that in an AGI context, simulation is a good way to handle episodic
knowledge; and running an internal "world simulation engine" is an effective way to handle
simulation.
CogPrime can work with many different simulation engines; and since simulation technology
is continually advancing independently of AGI technology, this is an area where AGI can buy
some progressive advancement for free as time goes on. The subtle issues here regard interfacing
between the simulation engine and the rest of the mind: mining meaningful information out of
simulations using pattern mining algorithms; and more subtly, figuring out what simulations
to run at what times in order to answer the questions most relevant to the AGI system in the
context of achieving its goals. We believe we have architected these interactions in a viable way
in the CogPrime design, but we have tested our ideas in this regard only in some fairly simple
contexts regarding virtual pets in a virtual world, and much more remains to be done here.
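As a minimal illustration of episodic knowledge handled via internal simulation, the sketch below stores remembered entity trajectories and replays a simple spatial query over them. The `Episode` class and its methods are invented for illustration; CogPrime's actual world-simulation engine is far richer:

```python
# Illustrative sketch of episodic memory as replayable 3D state
# (hypothetical structures, not CogPrime's world-simulation engine).
import math

class Episode:
    """Remembered trajectory: entity -> list of (t, (x, y, z)) samples."""
    def __init__(self):
        self.tracks = {}

    def record(self, entity, t, pos):
        self.tracks.setdefault(entity, []).append((t, pos))

    def replay_distance(self, a, b):
        """'Mental movie' query: distance between two entities at each instant."""
        return [(t1, math.dist(p1, p2))
                for (t1, p1), (t2, p2) in zip(self.tracks[a], self.tracks[b])]

ep = Episode()
for t in range(3):
    ep.record("owner", t, (0.0, 0.0, 0.0))
    ep.record("bird", t, (float(t), 0.0, 0.0))
print(ep.replay_distance("owner", "bird"))  # [(0, 0.0), (1, 1.0), (2, 2.0)]
```

The point is that questions like "was the bird ever near the owner?" become cheap geometric queries over replayed state, which pattern mining can then consume.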
49.6.5 Low-Level Perception and Action
The centrality or otherwise of low-level perception and action in human intelligence is a matter
of ongoing debate in the AI community. Some feel that the essence of intelligence lies in cognition
and/or language, with perception and action having the status of "peripheral devices." Others
feel that modeling the physical world and one's actions in it is the essence of intelligence, with
cognition and language emerging as side-effects of these more fundamental capabilities. The
CogPrime architecture doesn't need to take sides in this debate. Currently we are experimenting
both in virtual worlds, and with real-world robot control. The value added by robotic versus
virtual embodiment can thus be explored via experiment rather than theory, and may reveal
nuances that no one currently foresees.
As noted above, we are not confident of the capability of CogPrime's generic procedure learning
or pattern recognition algorithms to handle large amounts of raw sensorimotor
data in real time, and so for robotic applications we advocate hybridizing CogPrime with a
separate (but closely cross-linked) system better customized for this sort of data, in line with
our general hypothesis that hybridization of one's integrative neural-symbolic system with a
spatiotemporally hierarchical deep learning system is an effective way to handle representation
and learning of low-level sensorimotor knowledge. While this general principle doesn't depend
on any particular approach, DeSTIN is one example of a deep learning system of this nature
that can be effective in this context.
We have not yet done any sophisticated experiments in this regard - our current experiments
using OpenCog to control robots involve cruder integration of OpenCog with perceptual and
motor subsystems, rather than the tight hybridization described in Chapter 26. Creating such
a hybrid system is largely a matter of software engineering, but testing such a system may lead
to many surprises!
49.6.6 Goals
Given that we have characterized general intelligence as "the ability to achieve complex goals in
complex environments," it should be plain that goals play a central role in our work. However,
we have chosen not to create a separate subsystem for intentional knowledge, and instead have
concluded that one effective way to handle goals is to represent them declaratively, and allocate
attention among them economically. An advantage of this approach is that it automatically
provides integration between the goal system and the declarative and attentional knowledge
systems.
Goals and subgoals are related using logical links as interpreted and manipulated by PLN, and
attention is allocated among goals using the STI dynamics of ECAN, and a specialized variant
based on RFS's (requests for service). Thus the mechanics of goal management is handled using
uncertain inference and artificial economics, whereas the figuring-out of how to achieve goals
is done integratively, relying heavily on procedural and episodic knowledge as well as PLN and
ECAN.
The combination of ECAN and PLN seems to overcome the well-known shortcomings found
with purely neural-net or purely inferential approaches to goals. Neural net approaches gener-
ally have trouble with abstraction, whereas logical approaches are generally poor at real-time
responsiveness and at tuning their details quantitatively based on experience. At least in prin-
ciple, our hybrid approach overcomes all these shortcomings; though at present it has been
tested only in fairly simple cases in the virtual world.
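The flavor of RFS-style goal processing can be conveyed with a toy sketch. The `Goal` class, the fixed 50/50 split, and the link strengths are hypothetical simplifications invented for illustration, not the actual CogPrime mechanism; the point is just that an STI budget flows down logical goal-subgoal links in proportion to their strengths:

```python
# Toy sketch of declarative goals plus economic attention (a hypothetical
# simplification of CogPrime's RFS / ECAN goal machinery).

class Goal:
    def __init__(self, name):
        self.name, self.sti = name, 0.0
        self.subgoals = []   # (child, link_strength) pairs, as if PLN-derived

def request_service(goal, budget):
    """Split an STI budget: the goal keeps half, and subgoals share the rest
    in proportion to the strength of their implication links."""
    goal.sti += budget * 0.5
    total = sum(s for _, s in goal.subgoals) or 1.0
    for child, strength in goal.subgoals:
        request_service(child, budget * 0.5 * strength / total)

play = Goal("get_ball")
fetch, beg = Goal("fetch"), Goal("beg_owner")
play.subgoals = [(fetch, 0.8), (beg, 0.2)]
request_service(play, 100.0)
print(round(play.sti), round(fetch.sti), round(beg.sti))   # 50 20 5
```

Because goals are ordinary declarative nodes here, the same inference that revises link strengths automatically re-routes attention, which is exactly the integration benefit claimed above.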
49.7 Fulfilling the "Cognitive Equation"
A key claim based on the notion of the "Cognitive Equation" posited in Chaotic Logic [Goe94] is
that it is important for an intelligent system to have some way of recognizing large-scale patterns
in itself, and then embodying these patterns as new, localized knowledge items in its memory.
This introduces a feedback dynamic between emergent patterns and the substrate, which
is hypothesized to be critical to general intelligence under feasible computational resources. It
also ties in nicely with the notion of "glocal memory" - essentially positing a localization of
some global memories, which naturally will result in the formation of some glocal memories.
One of the key ideas underlying the CogPrime design is that given the use of a neural-symbolic
network for knowledge representation, a graph-mining based "map formation" heuristic is one
good way to do this.
Map formation seeks to fulfill the Cognitive Equation quite directly, probably more directly
than happens in the brain. Rather than relying on other cognitive processes to implicitly recog-
nize overall system patterns and embody them in the system as localized memories (though this
implicit recognition may also happen), the MapFormation MindAgent explicitly carries out this
process. Mostly this is done using fairly crude greedy pattern mining heuristics, though if really
subtle and important patterns seem to be there, more sophisticated methods like evolutionary
pattern mining may also be invoked.
It seems possible that this sort of explicit approach could be less efficient than purely implicit
approaches; but, there is no evidence for this, and it may actually provide increased efficiency.
And in the context of the overall CogPrime design, the explicit MapFormation approach seems
most natural.
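A minimal sketch of the greedy side of map formation, under the simplifying (and hypothetical) assumption that system history is available as snapshots of attentional focus: mine the most frequently co-occurring pair of atoms and, if it meets a support threshold, it becomes a candidate for encapsulation as a new localized knowledge item:

```python
# Greedy sketch of the MapFormation idea (hypothetical code; the real
# MindAgent mines the Atomspace graph, not focus snapshots).
from itertools import combinations
from collections import Counter

focus_history = [                      # snapshots of attentional focus
    {"owner", "bird", "near"},
    {"owner", "bird", "near", "box"},
    {"toy", "box"},
    {"owner", "bird"},
]

def mine_map(history, min_support=3):
    """Return the most frequent co-occurring pair meeting a support threshold,
    i.e. a large-scale pattern worth encapsulating as a single new atom."""
    counts = Counter()
    for snapshot in history:
        for pair in combinations(sorted(snapshot), 2):
            counts[pair] += 1
    best, n = counts.most_common(1)[0]
    return best if n >= min_support else None

print(mine_map(focus_history))   # ('bird', 'owner')
```

Evolutionary pattern mining, mentioned above for subtler patterns, would replace the exhaustive pair count with a search over larger, more complex candidate maps.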
49.8 Occam's Razor
The key role of "Occam's Razor" or the urge for simplicity in intelligence has been observed by
many before (going back at least to Occam himself, and probably earlier!), and is fully embraced
in the CogPrime design. Our theoretical analysis of intelligence, presented in Chapter 2 of Part
1 and elsewhere, portrays intelligence as closely tied to the creation of procedures that achieve
goals in environments in the simplest possible way. And this quest for simplicity is present in
many places throughout the CogPrime design, for instance:
• In MOSES and hillclimbing, where program compactness is an explicit component of pro-
gram tree fitness
• In PLN, where the backward and forward chainers explicitly favor shorter proof chains,
and intensional inference explicitly characterizes entities in terms of their patterns (where
patterns are defined as compact characterizations)
• In pattern mining heuristics, which search for compact characterizations of data
• In the forgetting mechanism, which seeks the smallest set of Atoms that will allow the
regeneration of a larger set of useful Atoms via modestly-expensive application of cognitive
processes
• Via the encapsulation of procedural and declarative knowledge in simulations, which in
many cases provide a vastly compacted form of storing real-world experiences
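The first bullet above can be sketched concretely. This is illustrative scoring only, not the actual MOSES fitness function; bytecode length stands in (crudely) for program-tree size:

```python
# Hedged sketch of compactness as a fitness component, as in MOSES /
# hillclimbing (illustrative scoring only, not the real MOSES scorer).

def fitness(program, cases, complexity_penalty=0.05):
    """Reward accuracy on the training cases, penalize program size."""
    correct = sum(1 for x, y in cases if program(x) == y)
    accuracy = correct / len(cases)
    size = len(program.__code__.co_code)   # crude proxy for program length
    return accuracy - complexity_penalty * size / 10.0

cases = [(x, 2 * x) for x in range(5)]
simple = lambda x: 2 * x
baroque = lambda x: (x + x + x + x) - (x + x) + (x - x)

# Both programs are perfectly accurate, but the compactness term
# breaks the tie in favor of the simpler one.
print(fitness(simple, cases) > fitness(baroque, cases))  # True
```

This tie-breaking toward compactness is the Occam bias: among behaviorally equivalent candidates, the search keeps the shortest, which also tends to generalize better.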
Like cognitive synergy and emergent networks, Occam's Razor is not something that is imple-
mented in a single place in the CogPrime design, but rather an overall design principle that
underlies nearly every part of the system.
49.8.1 Mind Geometry
The three mind-geometric principles outlined in Appendix ?? are:
• syntax-semantics correlation
• cognitive geometrodynamics
• cognitive synergy
The key role of syntax-semantics correlation in CogPrime is clear. It plays an explicit role
in MOSES. In PLN, it is critical to inference control, to the extent that inference control is
based on the extraction of patterns from previous inferences. The syntactic structures are the
inference trees, and the semantic structures are the inferential conclusions produced by the trees.
History-guided inference control assumes that prior similar trees will be a good starting-point
for getting results similar to prior ones - i.e. it assumes a reasonable degree of syntax-semantics
correlation. Also, without a correlation between the core elements used to generate an episode,
and the whole episode, it would be infeasible to use historical data mining to understand what
core elements to use to generate a new episode - and creation of compact, easily manipulable
seeds for generating episodes would not be feasible.
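Syntax-semantics correlation can be made concrete with a deliberately tiny example: represent "programs" as coefficient pairs (a, b) for a*x + b, and check that ordering program pairs by syntactic distance matches ordering them by behavioral distance. The setup is entirely hypothetical; in CogPrime the syntactic objects are program trees and inference trees:

```python
# Toy check of syntax-semantics correlation (hypothetical setup):
# programs with similar syntax should tend to have similar behavior,
# which is what history-guided inference control relies on.

progs = {"p1": (1, 0), "p2": (1, 1), "p3": (5, 9)}   # (a, b) for a*x + b

def syntactic_distance(p, q):
    """Distance in representation space: compare coefficients."""
    return sum(abs(a - b) for a, b in zip(progs[p], progs[q]))

def semantic_distance(p, q):
    """Distance in behavior space: compare outputs on sample inputs."""
    (a1, b1), (a2, b2) = progs[p], progs[q]
    return sum(abs((a1 * x + b1) - (a2 * x + b2)) for x in range(5))

pairs = [("p1", "p2"), ("p1", "p3"), ("p2", "p3")]
by_syn = sorted(pairs, key=lambda pq: syntactic_distance(*pq))
by_sem = sorted(pairs, key=lambda pq: semantic_distance(*pq))
print(by_syn == by_sem)   # True: syntactic ordering predicts semantic ordering
```

When this correlation holds, reusing a syntactically similar prior inference tree is a rational heuristic; when it fails, history-guided control degrades to blind search.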
Cognitive geometrodynamics is about finding the shortest path from the current state to
a goal state, where distance is judged by an appropriate metric including various aspects of
computational effort. The ECAN and effort management frameworks attempt to enforce this,
via minimizing the amount of effort spent by the system in getting to a certain conclusion.
MindAgents operating primarily on one kind of knowledge (e.g. MOSES, PLN) may for a time
seek to follow the shortest paths within their particular corresponding memory spaces; but then
when they operate more interactively and synergetically, it becomes a matter of finding short
paths in the composite mindspace corresponding to the combination of the various memory
types.
Finally, cognitive synergy is thoroughly and subtly interwoven throughout CogPrime. In a
way the whole design is about cognitive synergy - it's critical for the design's functionality
that the cognitive processes associated with different kinds of memory can
appeal to each other for assistance in overcoming bottlenecks in a manner that: a) works in
"real time"; i.e. on the time scale of the cognitive processes' internal processes; b) enables each
cognitive process to act in a manner that is sensitive to the particularities of each others' internal
representations.
Recapitulating in a bit more depth, recall that another useful way to formulate cognitive
synergy is as follows. Each of the key learning mechanisms underlying CogPrime is susceptible
to combinatorial explosions. As the problems they confront become larger and larger, the per-
formance gets worse and worse at an exponential rate, because the number of combinations
of items that must be considered to solve the problems grows exponentially with the problem
size. This could be viewed as a deficiency of the fundamental design, but we don't view it that
way. Our view is that combinatorial explosion is intrinsic to intelligence. The task at hand is to
dampen it sufficiently that realistically large problems can be solved, rather than to eliminate
it entirely. One possible way to dampen it would be to design a single, really clever learning
algorithm - one that was still susceptible to an exponential increase in computational require-
ments as problem size increases, but with a surprisingly small exponent. Another approach is
the mirrorhouse approach: Design a bunch of learning algorithms, each focusing on different
aspects of the learning process, and design them so that they each help to dampen each others'
combinatorial explosions. This is the approach taken within CogPrime. The component algo-
rithms are clever on their own - they are less susceptible to combinatorial explosion than many
competing approaches in the narrow-AI literature. But the real meat of the design lies in the
intended interactions between the components, manifesting cognitive synergy.
49.9 Cognitive Synergy
To understand more specifically how cognitive synergy works in CogPrime, in the following sub-
sections we will review some synergies related to the key components of CogPrime as discussed
above. These synergies are absolutely critical to the proposed functionality of the CogPrime
system. Without them, the cognitive mechanisms are not going to work adequately well, but
are rather going to succumb to combinatorial explosions. The other aspects of CogPrime - the
cognitive architecture, the knowledge representation, the embodiment framework and associ-
ated developmental teaching methodology - are all critical as well, but none of these will yield
the critical emergence of intelligence without cognitive mechanisms that effectively scale. And,
in the absence of cognitive mechanisms that effectively scale on their own, we must rely on
cognitive mechanisms that effectively help each other to scale. The reasons why we believe these
synergies will exist are essentially qualitative: we have not proved theorems regarding these syn-
ergies, and we have observed them in practice only in simple cases so far. However, we do have
some ideas regarding how to potentially prove theorems related to these synergies, and some of
these are described in Appendix H.
49.9.1 Synergies that Help Inference
The combinatorial explosion in PLN is obvious: forward and backward chaining inference are
both fundamentally explosive processes, reined in only by pruning heuristics. This means that
for nontrivial complex inferences to occur, one needs really, really clever pruning heuristics.
The CogPrime design combines simple heuristics with pattern mining, MOSES and economic
attention allocation as pruning heuristics. Economic attention allocation assigns importance
levels to Atoms, which helps guide pruning. Greedy pattern mining is used to search for patterns
in the stored corpus of inference trees, to see if there are any that can be used as analogies
for the current inference. And MOSES comes in when there is not enough information (from
importance levels or prior inference history) to make a choice, yet exploring a wide variety
of available options is unrealistic. In this case, MOSES tasks may be launched pertaining to
the leaves at the fringe of the inference tree that are under consideration for expansion. For instance,
suppose there is an Atom A at the fringe of the inference tree, and its importance hasn't been
assessed with high confidence, but a number of items B are known so that:
MemberLink B A
Then, MOSES may be used to learn various relationships characterizing A, based on recognizing
patterns across the set of B that are suspected to be members of A. These relationships may
then be used to assess the importance of A more confidently, or perhaps to enable the inference
tree to match one of the patterns identified by pattern mining on the inference tree corpus. For
example, if MOSES figures out that:
SimilarityLink G A
then it may happen that substituting G in place of A in the inference tree, results in something
that pattern mining can identify as being a good (or poor) direction for inference.
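The control flow just described can be sketched as follows. The data structures are hypothetical stand-ins for the real PLN chainer: high-STI fringe atoms whose importance is confidently known are expanded first, while atoms whose importance is known only with low confidence are set aside as candidates for MOSES study:

```python
# Sketch of importance-guided pruning at the fringe of an inference tree
# (hypothetical structures; the real PLN chainer is far more involved).
import heapq

def expand_fringe(fringe, steps=2):
    """fringe: list of (sti, confidence, atom). Expand the highest-STI atoms
    whose importance is confidently assessed; defer the rest to MOSES."""
    heap = [(-sti, atom) for sti, conf, atom in fringe if conf >= 0.5]
    deferred = [atom for sti, conf, atom in fringe if conf < 0.5]
    heapq.heapify(heap)
    chosen = [heapq.heappop(heap)[1] for _ in range(min(steps, len(heap)))]
    return chosen, deferred

fringe = [(40.0, 0.9, "A"), (75.0, 0.8, "B"), (60.0, 0.2, "C"), (10.0, 0.7, "D")]
expanded, to_moses = expand_fringe(fringe)
print(expanded, to_moses)   # ['B', 'A'] ['C']
```

In the full design, the relationships MOSES learns about a deferred atom feed back into its importance estimate, closing the synergy loop described above.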
49.10 Synergies that Help MOSES
MOSES's combinatorial explosion is obvious: the number of possible programs of size N increases
very rapidly with N. The only way to get around this is to utilize prior knowledge, and as much
as possible of it. When solving a particular problem, the search for new solutions must make
use of prior candidate solutions evaluated for that problem, and also prior candidate solutions
(including successful and unsuccessful ones) evaluated for other related problems.
But, extrapolation of this kind is in essence a contextual analogical inference problem. In
some cases it can be solved via fairly straightforward pattern mining; but in subtler cases it will
require inference of the type provided by PLN. Also, attention allocation plays a role in figuring
out, for a given problem A, which problems B are likely to have the property that candidate
solutions for B are useful information when looking for better solutions for A.
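A toy demonstration of why reusing prior candidate solutions pays off, with bit lists standing in (hypothetically) for MOSES's program trees. In this seeded setup, warm-starting hillclimbing from a solution to a related problem cannot need more fitness evaluations than starting cold, since the warm start has strictly fewer wrong bits to fix:

```python
# Sketch of seeding a MOSES-style search with a prior solution to a related
# problem (hypothetical; real MOSES searches over program trees).
import random

def climbs_needed(seed_program, target, rng_seed=7):
    """Greedy bit-flip hillclimbing; counts fitness evaluations until
    the candidate matches the target program exactly."""
    score = lambda s: sum(a == b for a, b in zip(s, target))
    rng = random.Random(rng_seed)
    best, evals = list(seed_program), 0
    while score(best) < len(target):
        cand = list(best)
        cand[rng.randrange(len(cand))] ^= 1   # mutate one position
        evals += 1
        if score(cand) > score(best):
            best = cand                       # keep strict improvements
    return evals

target = [1, 0, 1, 1, 0, 1, 0, 1]
cold = [0] * 8                        # searching from scratch
warm = [1, 0, 1, 1, 0, 1, 0, 0]      # prior solution to a related problem
print(climbs_needed(warm, target) <= climbs_needed(cold, target))  # True
```

Deciding *which* prior problems are related enough to seed from is exactly the contextual analogy problem the text hands to pattern mining, PLN and attention allocation.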
49.10.1 Synergies that Help Attention Allocation
Economic attention allocation, without help from other cognitive processes, is just a very sim-
ple process analogous to "activation spreading" and "Hebbian learning" in a neural network.
The other cognitive processes are the things that allow it to more sensitively understand the
attentional relationships between different knowledge items (e.g. which sorts of items are often
usefully thought about in the same context, and in which order).
49.10.2 Further Synergies Related to Pattern Mining
Statistical, greedy pattern mining is a simple process, but it nevertheless can be biased in
various ways by other, more subtle processes.
For instance, if one has learned a population of programs via MOSES, addressing some
particular fitness function, then one can study which items tend to be utilized in the same
programs in this population. One may then direct pattern mining to find patterns combining
these items found to be in the MOSES population. And conversely, relationships denoted by
pattern mining may be used to probabilistically bias the models used within MOSES.
Statistical pattern mining may also help PLN by supplying it with information to work
on. For instance, conjunctive pattern mining finds conjunctions of items, which may then be
combined with each other using PLN, leading to the formation of more complex predicates.
These conjunctions may also be fed to MOSES as part of an initial population for solving a
relevant problem.
Finally, the main interaction between pattern mining and MOSES/PLN is that the former
may recognize patterns in links created by the latter. These patterns may then be fed back
into MOSES and PLN as data. This virtuous cycle allows pattern mining and the other, more
expensive cognitive processes to guide each other. Attention allocation also gets into the game,
by guiding statistical pattern mining and telling it which terms (and which combinations) to
spend more time on.
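The mining-to-PLN handoff can be sketched minimally. The tabular observations and predicate names below are invented for illustration; real Atomspace pattern mining is structural rather than tabular:

```python
# Toy sketch: conjunctive pattern mining handing PLN a candidate predicate
# (hypothetical data and predicate names).

observations = [
    {"hungry": 1, "near_food": 1, "eats": 1},
    {"hungry": 1, "near_food": 1, "eats": 1},
    {"hungry": 1, "near_food": 0, "eats": 0},
    {"hungry": 0, "near_food": 1, "eats": 0},
    {"hungry": 1, "near_food": 1, "eats": 1},
]

# Step 1 (mining): notice that hungry AND near_food is a frequent conjunction.
conj = [o for o in observations if o["hungry"] and o["near_food"]]

# Step 2 (PLN-style): estimate the strength of (hungry AND near_food) -> eats,
# turning the mined conjunction into an uncertain implication.
strength = sum(o["eats"] for o in conj) / len(conj)
print(len(conj), strength)   # 3 1.0
```

The mined conjunction could equally seed a MOSES population, as the text notes, giving the program search a head start on predicates already known to matter.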
49.10.3 Synergies Related to Map Formation
The essential synergy regarding map formation is obvious: Maps are formed based on the
HebbianLinks created via PLN and simpler attentional dynamics, which are based on which
Atoms are usefully used together, which is based on the dynamics of the cognitive processes
doing the "using." On the other hand, once maps are formed and encapsulated, they feed into
these other cognitive processes. This synergy in particular is critical to the emergence of self
and attention.
What has to happen, for map formation to work well, is that the cognitive processes must
utilize encapsulated maps in a way that gives rise overall to relatively clear clusters in the
network of HebbianLinks. This will happen if the encapsulated maps are not too complex for
the system's other learning operations to understand. So, there must be useful coordinated
attentional patterns whose corresponding encapsulated-map Atoms are not too complicated.
This has to do with the system's overall parameter settings, but largely with the settings of
the attention allocation component. For instance, this is closely tied in with the limited size of
"attentional focus" (the famous 7 +/- 2 number associated with humans' and other mammals'
short term memory capacity). If only a small number of Atoms are typically very important
at a given point in time, then the maps formed by grouping together all simultaneously highly
important things will be relatively small predicates, which will be easily reasoned about - thus
keeping the "virtuous cycle" of map formation and comprehension going effectively.
49.11 Emergent Structures and Dynamics
We have spent much more time in this book on the engineering of cognitive processes and
structures, than on the cognitive processes and structures that must emerge in an intelligent
system for it to display human-level AGI. However, this focus should not be taken to represent
a lack of appreciation for the importance of emergence. Rather, it represents a practical focus:
engineering is what we must do to create a software system potentially capable of AGI, and
emergence is then what happens inside the engineered AGI to allow it to achieve intelligence.
Emergence must however be taken carefully into account when deciding what to engineer!
One of the guiding ideas underlying the CogPrime design is that an AGI system with ade-
quate mechanisms for handling the key types of knowledge mentioned above, and the capability
to explicitly recognize large-scale patterns in itself, should, upon sustained interaction with
an appropriate environment in pursuit of appropriate goals, give rise to a variety of com-
plex structures in its internal knowledge network, including (but not limited to): a hierarchical
network, representing both a spatiotemporal hierarchy and an approximate "default inheritance"
hierarchy, cross-linked; a heterarchical network of associativity, roughly aligned with the hierar-
chical network; a self network which is an approximate micro image of the whole network; and
inter-reflecting networks modeling self and others, reflecting a "mirrorhouse" design pattern.
The dependence of these posited emergences on the environment and goals of the AGI system
should not be underestimated. For instance, PLN and pattern mining don't have to lead to a
hierarchical structured Atomspace, but if the AGI system is placed in an environment which is
itself hierarchically structured, then they very likely will do so. And if this environment consists
of hierarchically structured language and culture, then what one has is a system of minds with
hierarchical networks, each reinforcing the hierarchality of each others' networks. Similarly,
integrated cognition doesn't have to lead to mirrorhouse structures, but integrated cognition
about situations involving other minds studying and predicting and judging each other, is very
likely to do so. What is needed for appropriate emergent structures to arise in a mind, is
mainly that the knowledge representation is sufficiently flexible to allow these structures, and
the cognitive processes are sufficiently intelligent to observe these structures in the environment
and then mirror them internally. Of course, it also doesn't hurt if the internal structures and
processes are at least slightly biased toward the origination of the particular high-level emergent
structures that are characteristic of the system's environment/goals; and this is indeed the case
with CogPrime - biases toward hierarchical, heterarchical, dual and mirrorhouse networks are
woven throughout the system design, in a thoroughgoing though not extremely systematic way.
49.12 Ethical AGI
Creating an AGI with guaranteeably ethical behavior seems an infeasible task; but of course,
no human is guaranteeably ethical either, and in fact it seems almost guaranteed that in any
moderately large group of humans there are going to be some with strong propensities for
extremely unethical behaviors, according to any of the standard human ethical codes. One of
our motivations in developing CogPrime has been the belief that an AGI system, if supplied
with a commonsensically ethical goal system and an intentional component based on rigorous
uncertain inference, should be able to reliably achieve a much higher level of commonsensically
ethical behavior than any human being.
Our explorations in the detailed design of CogPrime's goal system have done nothing to
degrade this belief. While we have not yet developed any CogPrime system to the point where
experimenting with its ethics is meaningful, based on our understanding of the current design
it seems to us that
• a typical CogPrime system will display a much more consistent and less conflicted and
confused motivational system than any human being, due to its explicit orientation toward
carrying out actions that (based on its knowledge) rationally seem most likely to lead to
achievement of its goals
• if a CogPrime system is given goals that are consistent with commonsensical human ethics
(say, articulated in natural language), and then educated in an ethics-friendly environment
such as a virtual or physical school, then it is reasonable to expect the CogPrime system will
ultimately develop an advanced (human adult level or beyond) form of commonsensical
human ethics
Human ethics is itself wracked with inconsistencies, so one cannot expect a rationality-based
AGI system to precisely mirror the ethics of any particular human individual or cultural system.
But given the degree to which general intelligence represents adaptation to its environment, and
interpretation of natural language depends on life history and context, it seems very likely to
us that a CogPrime system, if supplied with a human-commonsense-ethics based goal system
and then raised by compassionate and intelligent humans in a school-type environment, would
arrive at its own variant of human-commonsense-ethics. The AGI system's ethics would then
interact with human ethical systems in complex ways, leading to ongoing evolution of both
systems and the development of new cultural and ethical patterns. Predicting the future is
difficult even in the absence of radical advanced technologies, but our intuition is that this path
has the potential to lead to beneficial outcomes for both human and machine intelligence.
49.13 Toward Superhuman General Intelligence
Human-level AGI is a difficult goal, relative to the current state of scientific understanding
and engineering capability, and most of this book has been focused on our ideas about how to
achieve it. However, we also suspect the CogPrime architecture has the ultimate potential to
push beyond the human level in many ways. As part of this suspicion we advance the claim
that once sufficiently advanced, a CogPrime system should be able to radically self-improve via
a variety of methods, including supercompilation and automated theorem-proving.
Supercompilation allows procedures to be automatically replaced with equivalent but mas-
sively more time-efficient procedures. This is particularly valuable in that it allows AI algorithms
to learn new procedures without much heed to their efficiency, since supercompilation can al-
ways improve the efficiency afterwards. So it is a real boon to automated program learning.
Theorem-proving is difficult for current narrow-AI systems, but for an AGI system with
a deep understanding of the context in which each theorem exists, it should be much easier
than for human mathematicians. So we envision that ultimately an AGI system will be able to
design itself new algorithms and data structures via proving theorems about which ones will
best help it achieve its goals in which situations, based on mathematical models of itself and
its environment. Once this stage is achieved, it seems that machine intelligence may begin to
vastly outdo human intelligence, leading in directions we cannot now envision.
While such projections may seem science-fictional, we note that the CogPrime architecture
explicitly supports such steps. If human-level AGI is achieved within the CogPrime framework,
it seems quite feasible that profoundly self-modifying behavior could be achieved fairly shortly
thereafter. For instance, one could take a human-level CogPrime system and teach it computer
science and mathematics, so that it fully understood the reasoning underlying its own design,
and the whole mathematics curriculum leading up to the algorithms underpinning its cognitive
processes.
49.13.1 Conclusion
What we have sought to do in these pages is, mainly,
• to articulate a theoretical perspective on general intelligence, according to which the cre-
ation of a human-level AGI doesn't require anything that extraordinary, but "merely" an
appropriate combination of closely interoperating algorithms operating on an appropriate
multi-type memory system, utilized to enable a system in an appropriate body and envi-
ronment to figure out how to achieve its given goals
• to describe a software design (CogPrime ) that, according to this somewhat mundane but
theoretically quite well grounded vision of general intelligence, appears likely (according to
a combination of rigorous and heuristic arguments) to be able to lead to human-level AGI
using feasible computational resources
• to describe some of the preliminary lessons we've learned via implementing and experiment-
ing with aspects of the CogPrime design, in the OpenCog system
In this concluding chapter we have focused on the "combination of rigorous and heuristic argu-
ments" that lead us to consider it likely that CogPrime has the potential to lead to human-level
AGI using feasible computational resources.
We also wish to stress that not all of our arguments and ideas need to be 100% correct in order
for the project to succeed. The quest to create AGI is a mix of theory, engineering, and scientific
and unscientific experimentation. If the current CogPrime design turns out to have significant
shortcomings, yet still brings us a significant percentage of the way toward human-level AGI,
the results obtained along the path will very likely give us clues about how to tweak the design
to more effectively get the rest of the way there. And the OpenCog platform is extremely flexible
and extensible, rather than being tied to the particular details of the CogPrime design. While
we do have faith that the CogPrime design as described here has human-level AGI potential,
we are also pleased to have a development strategy and implementation platform that will
allow us to modify and improve the design in whatever ways our ongoing
experimentation suggests.
Many great achievements in history have seemed more magical before their first achievement
than afterwards. Powered flight and spaceflight are the most obvious examples, but there are
many others such as mobile telephony, prosthetic limbs, electronically deliverable books, robotic
factory workers, and so on. We now even have wireless transmission of power (one can recharge
cellphones via wifi), though not yet as ambitiously as Tesla envisioned. We very strongly suspect
that human-level AGI is in the same category as these various examples: an exciting and
amazing achievement, which however is achievable via systematic and careful application of
fairly mundane principles. We believe computationally feasible human-level intelligence is both
complicated (involving many interoperating parts, each sophisticated in their own right) and
complex (in the sense of involving many emergent dynamics and structures whose details are
not easily predictable based on the parts of the system) ... but that neither the complication
nor the complexity is an obstacle to engineering human-level AGI.
Furthermore, while ethical behavior is a complex and subtle matter for humans or machines,
we believe that the production of human-level AGIs that are not only intelligent but also ben-
eficial to humans and other biological sentiences, is something that is probably tractable to
achieve based on a combination of careful AGI design and proper AGI education and "parent-
ing." One of the motivations underlying our design has been to create an artificial mind that
has broadly humanlike intelligence, yet has a more rational and self-controllable motivational
system than humans, thus ultimately having the potential for a greater-than-human degree of
ethical reliability alongside its greater-than-human intelligence.
In our view, what is needed to create human-level AGI is not a new scientific breakthrough,
nor a miracle, but "merely" a sustained effort over a number of years by a moderate-sized
team of appropriately-trained professionals, completing the implementation of the design in
this book and then parenting and educating the resulting implemented system. CogPrime is by
no means the only possible path to human-level AGI, but we believe it is considerably more
fully thought-through and fleshed-out than any available alternatives. Actually, we would love
to see CogPrime and a dozen alternatives simultaneously pursued - this may seem ambitious,
but it would cost a fraction of the money currently spent on other sorts of science or engineering,
let alone the money spent on warfare or decorative luxury items. We strongly suspect that, in
hindsight, our human and digital descendants will feel amazed that their predecessors allocated
so few financial and attentional resources to the creation of powerful AGI, and consequently
took so long to achieve such a fundamentally straightforward thing.
Chapter 50
Build Me Something I Haven't Seen: A CogPrime
Thought Experiment
50.1 Introduction
AGI design necessarily leads one into some rather abstract spaces — but being a human-like
intelligence in the everyday world is a pretty concrete thing. If the CogPrime research program
is successful, it will result not just in abstract ideas and equations, but rather in real AGI
robots carrying out tasks in the world, and AGI agents in virtual worlds and online digital
spaces conducting important business, doing science, entertaining and being entertained by us,
and so forth. With this in mind, in this final chapter we will bring the discussion closer to the
concrete and everyday, and pursue a thought experiment of the form "How would a completed
CogPrime system carry out this specific task?"
The task we will use for this thought-experiment is one we have used as a running example
now and then in the preceding chapters. We consider the case of a robotically or virtually
embodied CogPrime system, operating in a preschool type environment, interacting with a
human whom it already knows and given the task of "Build me something with blocks that I
haven't seen before."
This target task is fairly simple, but it is complex enough to involve essentially every one of
CogPrime's processes, interacting in a unified way. It involves simple, grounded creativity of the
sort that normal human children display every day - and which, we conjecture, is structurally
and dynamically basically the same as the creativity underlying the genius of adult human
creators like Einstein, Dali, Dostoevsky, Hendrix, and so forth ... and as the creativity that will
power massively capable genius machines in future.
We will consider the case of a simple interaction based on the above task where:
1. The human teacher tells the CogPrime agent "Build me something with blocks that I haven't
seen before."
2. After a few false starts, the agent builds something it thinks is appropriate and says "Do
you like it?"
3. The human teacher says "It's beautiful. What is it?"
4. The agent says "It's a car man" [and indeed, the construct has 4 wheels and a chassis vaguely
like a car, but also a torso, arms and head vaguely like a person]
Of course, a complex system like CogPrime could carry out an interaction like this internally
in many different ways, and what is roughly described here is just one among many possibilities.
First we will enumerate a number of CogPrime processes and explain some ways that each
one may help CogPrime carry out the target task. Then we will give a more evocative narrative,
conveying the dynamics that would occur in CogPrime while carrying out the target task, and
mentioning each of the enumerated cognitive processes as it arises in the narrative.
50.2 Roles of Selected Cognitive Processes
Now we review a number of the more interesting CogPrime cognitive processes mentioned in
previous chapters of the book, for each one indicating one or more of the roles it might play in
helping a CogPrime system carry out the target task. Note that this list is incomplete in many
senses, e.g. it doesn't list all the cognitive processes, nor all the roles played by the ones listed.
The purpose is to give an evocative sense of the roles played by the different parts of the design
in carrying out the task.
• Chapter 19 (OpenCog Framework)
- Freezing/defrosting.
• When the agent builds a structure from blocks and decides it's not good enough to
show off to the teacher, what does it do with the detailed ideas and thought process
underlying the structure it built? If it doesn't like the structure so much, it may just
leave this to the generic forgetting process. But if it likes the structure a lot, it may
want to increase the VLTI (Very Long Term Importance) of the Atoms related to
the structure in question, to be sure that these are stored on disk or other long-term
storage, even after they're deemed sufficiently irrelevant to be pushed out of RAM
by the forgetting mechanism.
• When given the target task, the agent may decide to revive from disk the mind-
states it went through when building crowd-pleasing structures from blocks before,
so as to provide it with guidance.
• Chapter 22 (Emotion, Motivation, Attention and Control)
- Cognitive cycle.
• While building with blocks, the agent's cognitive cycle will be dominated by per-
ceiving, acting on, and thinking about the blocks it is building with.
• When interacting with the teacher, then interaction-relevant linguistic, perceptual
and gestural processes will also enter into the cognitive cycle.
- Emotion. The agent's emotions will fluctuate naturally as it carries out the task.
• If it has a goal of pleasing the teacher, then it will experience happiness as its
expectation of pleasing the teacher increases.
• If it has a goal of experiencing novelty, then it will experience happiness as it creates
structures that are novel in its experience.
• If it has a goal of learning, then it will experience happiness as it learns new things
about blocks construction.
• On the other hand, it will experience unhappiness as its experienced or predicted
satisfaction of these goals decreases.
- Action selection
EFTA00624643
50.2 Roles of Selected Cognitive Processes 497
In dialoguing with the teacher, action selection will select one or more DialogueCon-
troller schema to control the conversational interaction (based on which DC schema
have proved most effective in prior similar situations).
When the agent wants to know the teacher's opinion of its construct, what
is happening internally is that the "please teacher" Goal Atom gets a link of
the conceptual form (Implication "find out teacher's opinion of my current con-
struct" "please teacher"). This link may be created by PLN inference, prob-
ably largely by analogy to previously encountered similar situations. Then,
GoalImportance is spread from the "please teacher" Goal Atom to the "find out
teacher's opinion of my current construct" Atom (via the mechanism of sending
an RFS package to the latter Atom). More inference causes a link (Implication
"ask the teacher for their opinion of my current construct" "find out teacher's
opinion of my current construct") to be formed, and the "ask the teacher for
their opinion of my current construct" Atom to get GoalImportance also. Then
PredicateSchematization causes the predicate "ask the teacher for their opinion
of my current construct" to get turned into an actionable schema, which gets
GoalImportance, and which gets pushed into the ActiveSchemaPool via Goal-
driven action selection. Once the schema version of "ask the teacher for their
opinion of my current construct" is in the ActiveSchemaPool, it then invokes
natural language generation Tasks, which lead to the formulation of an English
sentence such as "Do you like it?"
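The backward spreading of GoalImportance along Implication links can be caricatured as follows. This is a deliberate simplification: real RFS packages and link truth values are richer than the single strength number used here, and the goal names are abbreviations of those in the text.

```python
# Hedged sketch of GoalImportance spreading: importance flows from a goal
# backward along Implication links, weighted by each link's strength.
# (Simplified; actual RFS propagation in CogPrime is more involved.)

def spread_goal_importance(goal, implications, importance):
    """implications: list of (antecedent, consequent, strength) triples.
    Returns a dict mapping each reached Atom to the importance it receives."""
    received = {goal: importance}
    frontier = [goal]
    while frontier:
        g = frontier.pop()
        for ante, cons, s in implications:
            if cons == g and ante not in received:
                received[ante] = received[g] * s
                frontier.append(ante)
    return received

links = [("find out teacher's opinion", "please teacher", 0.8),
         ("ask the teacher for their opinion", "find out teacher's opinion", 0.9)]
imp = spread_goal_importance("please teacher", links, importance=1.0)
# "ask the teacher for their opinion" receives 1.0 * 0.8 * 0.9 = 0.72
```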
When the teacher asks "It's beautiful. What is it?", then the NL comprehension
MindAgent identifies this as a question, and the "please teacher" Goal Atom gets
a link of the conceptual form (Implication "answer the question the teacher just
asked" "please teacher"). This follows simply from the knowledge ( Implication
("teacher has just asked a question" AND "I answer the teacher's question")
("please teacher")), or else from more complex knowledge refining this Impli-
cation. From this point, things proceed much as in the case "Do you like it?"
described just above.
Consider a schema such as "pick up a red cube and place it on top of the long red
block currently at the top of the structure" (let's call this P). Once P is placed in
the ActiveSchemaPool, then it runs and generates more specific procedures, such as
the ones needed to find a red cube, to move the agent's arm toward the red cube
and grasp it, etc. But the execution of these specific low-level procedures is done via
the ExecutionManager, analogously to the execution of the specifics of generating
a natural language sentence from a collection of semantic relationships. Loosely
speaking, reaching for the red cube and turning simple relationships into simple
sentences are considered "automated processes" not requiring holistic engagement
of the agent's mind. What the generic, more holistic Action Selection mechanism
does in the present context is to figure out to put P in the ActiveSchemaPool in
the first place. This occurs because of a chain such as: P predictively implies (with
a certain probabilistic weight) "completion of the car-man structure", which in turn
predictively implies "completion of a structure that is novel to the teacher," which in
turn predictively implies "please the teacher," which in turn implies "please others,"
which is assumed an Ubergoal (a top-level system goal).
- Goal Atoms. As the above items make clear, the scenario in question requires the
initial Goal Atoms to be specialized, via the creation of more and more particular
subgoals suiting the situation at hand.
- Context Atoms.
• Knowledge of the context the agent is in can help it disambiguate language it hears,
e.g. knowing the context is blocks-building helps it understand which sense of the
word "blocks" is meant.
• On the other hand, if the context is that the teacher is in a bad mood, then the agent
might know via experience that in this context, the strength of (Implication "ask
the teacher for their opinion of my current construct" "find out teacher's opinion of
my current construct") is lower than in other contexts.
- Context formation.
• A context like "blocks-building" or "teacher in a bad mood" may be formed by cluster-
ing over multiple experience-sets, i.e. forming Atoms that refer to spatiotemporally
grouped sets of percepts/concepts/actions, and grouping together similar Atoms of
this nature into clusters.
• The Atom referring to the cluster of experience-sets involving blocks-building will
then survive as an Atom if it gets involved in relationships that are important or
have surprising truth values. If many relationships have significantly different truth-
value inside the blocks-building context than outside it, this means it's likely that
the blocks-building ConceptNode will remain as an Atom with reasonably high LTI,
so it can be used as a context in future.
- Time-dependence of goals. Many of the agent's goals in this scenario have different
importances over different time scales. For instance "please the teacher" is important
on multiple time-scales: the agent wants to please the teacher in the near term but also
in the longer term. But a goal like "answer the question the teacher just asked" has an
intrinsic time-scale to it; if it's not fulfilled fairly rapidly then its importance goes away.
• Chapter 23 (Attention allocation)
- ShortTermImportance versus LongTermImportance. While conversing, the concepts
immediately involved in the conversation (including the Atoms describing the
agents in the conversation) have very high STI. While building, Atoms representing
the blocks and related ideas about the structures being built (e.g. images of cars and
people perceived or imagined in the past) have very high STI. But the reason these
Atoms are in RAM prior to having their STI boosted due to their involvement in the
agent's activities, is because they had their LTI boosted at some point in the past.
And after these Atoms leave the AttentionalFocus and their STI reduces, they will
have boosted LTI and hence likely remain in RAM for a long while, to be involved in
"background thought", and in case they're useful in the AttentionalFocus again.
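The STI/LTI interplay just described can be illustrated with a toy update rule. The constants and dictionary representation here are invented for illustration; they are not CogPrime's actual ECAN equations.

```python
# Toy sketch of STI/LTI dynamics: use in the AttentionalFocus boosts STI,
# and leaving the focus converts part of that short-term importance into
# a long-term trace. (Illustrative constants, not the real ECAN rules.)

def on_use(atom):
    atom["sti"] += 10            # stimulated while in the AttentionalFocus

def on_leave_focus(atom, decay=0.5, lti_fraction=0.2):
    atom["lti"] += atom["sti"] * lti_fraction   # usage leaves a long-term trace
    atom["sti"] *= decay                        # short-term importance fades

wheel = {"sti": 0.0, "lti": 1.0}
for _ in range(3):
    on_use(wheel)        # "wheel" is repeatedly relevant while building
on_leave_focus(wheel)    # conversation moves on, but LTI keeps it in RAM
```

After the loop the Atom's STI has fallen by half, but its boosted LTI keeps it in RAM for "background thought" and for fast re-entry into the AttentionalFocus.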
- HebbianLink formation. As a single example, the car-man has both wheels and
arms, so now a Hebbian association between wheels and arms will exist in the agent's
memory, to potentially pop up again and guide future thinking. The very idea of a
car-man likely emerged partly due to previously formed HebbianLinks - because people
were often seen sitting in cars, the association between person and car existed, which
made the car concept and the human concept natural candidates for blending.
- Data mining the System Activity Table. The HebbianLinks mentioned above may
have been formed via mining the SystemActivityTable.
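A minimal version of this mining step: scan a SystemActivityTable-style log of which concepts were simultaneously in focus, and link pairs that co-occur often. Real HebbianLink formation works with graded importance levels rather than the boolean sets used here.

```python
# Sketch of HebbianLink formation by mining a co-activity log: concepts
# frequently important at the same time get linked. (Simplified; the real
# MindAgent uses importance values, not boolean membership.)

from collections import Counter
from itertools import combinations

def mine_hebbian(activity_log, min_cooccur=2):
    """activity_log: list of sets of concepts simultaneously in focus.
    Returns the pairs that co-occurred at least min_cooccur times."""
    pairs = Counter()
    for focus in activity_log:
        for a, b in combinations(sorted(focus), 2):
            pairs[(a, b)] += 1
    return {p for p, n in pairs.items() if n >= min_cooccur}

# People are often seen sitting in cars, so car/person co-occur in focus:
log = [{"car", "person"}, {"car", "person", "wheel"}, {"wheel", "arm"}]
links = mine_hebbian(log)
```

The resulting car/person association is exactly the kind of HebbianLink that later makes "car" and "man" natural candidates for conceptual blending.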
- ECAN based associative memory. When the agent thinks about making a car,
this spreads importance to various Atoms related to the car concept, and one thing
this does is lead to the emergence of the car attractor into the AttentionalFocus. The
different aspects of a car are represented by heavily interlinked Atoms, so that when
some of them become important, there's a strong tendency for the others to also become
important - and for "car" to then emerge as an attractor of importance dynamics.
- Schema credit assignment.
• Suppose the agent has a subgoal of placing a certain blue block on top of a certain red
block. It may use a particular motor schema for carrying out this action - involving,
for instance, holding the blue block above the red block and then gradually lowering
it. If this schema results in success (rather than in, say, knocking down the red
block), then it should get rewarded via having its STI and LTI boosted and also
having the strength of the link between it and the subgoal increased.
• Next, suppose that a certain cognitive schema (say, the schema of running multiple
related simulations and averaging the results, to estimate the success probability
of a motor procedure) was used to arrive at the motor schema in question. Then
this cognitive schema may get passed some importance from the motor schema, and
it will get the strength of its link to the goal increased. In this way credit passes
backwards from the goal to the various schema directly or indirectly involved in
fulfilling it.
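The backward passage of credit from goal to contributing schemata can be sketched as a discounted reward pass. The discount scheme and schema names are illustrative assumptions, not the exact CogPrime credit-assignment rule.

```python
# Sketch of schema credit assignment: when a goal is achieved, reward is
# passed backward along the chain of schemata that contributed, attenuated
# at each step. (Illustrative discounting, not the precise CogPrime rule.)

def assign_credit(chain, reward, discount=0.8):
    """chain: schemata ordered from goal-adjacent to most indirect.
    Returns the reward credited to each schema."""
    credit = {}
    r = reward
    for schema in chain:
        credit[schema] = r
        r *= discount
    return credit

# The motor schema that stacked the blue block gets full credit; the
# cognitive schema that produced it gets a discounted share.
credit = assign_credit(["lower-block motor schema",
                        "simulate-and-average cognitive schema"], reward=1.0)
```

In CogPrime terms, the credited reward would translate into STI/LTI boosts and increased link strengths between each schema and the subgoal.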
- Forgetting. If the agent builds many structures from blocks during its lifespan, it will
accumulate a large amount of perceptual memory.
• Chapter 24 (Goal and Action Selection). Much of the use of the material in this chapter
was covered above in the bullet point for Chapter 22, but a few more notes are:
- Transfer of RFS between goals. Above it was noted that the link (Implication "ask
the teacher for their opinion of my current construct" "find out teacher's opinion of
my current construct") might be formed and used as a channel for Goallmportance
spreading.
- Schema Activation. Supposing the agent is building a man-car, it may have car-
building schema and man-building schema in its ActiveSchemaPool at the same time,
and it may enact both of them in an interleaved manner. But if each tend to require
two hands for their real-time enaction, then schema activation will have to pass back
and forth between the two of them, so that at any one time, one is active whereas the
other one is sitting in the ActiveSchemaPool waiting to get activated.
- Goal Based Schema Learning. To take a fairly low-level example, suppose the agent
has the (sub)goal of making an arm for a blocks-based person (or man-car), given the
presence of a blocks-based torso. Suppose it finds a long block that seems suitable to be
an arm. It then has the problem of figuring out how to attach the arm to the body. It
may try out several procedures in its internal simulation world, until it finds one that
works: hold the arm in the right position while one end of it rests on top of some block
that is part of the torso, then place some other block on top of that end, then slightly
release the arm and see if it falls. If it doesn't fall, leave it. If it seems about to fall, then
place something heavier atop it, or shove it further in toward the center of the torso.
The procedure learning process could be MOSES here, or it could be PLN.
• Chapter 25 (Procedure Evaluation)
- Inference Based Procedure Evaluation. A procedure for man-building such as
"first put up feet, then put up legs, then put up torso, then put up arms and head"
may be synthesized from logical knowledge (via predicate schematization) but without
filling in the details of how to carry out the individual steps, such as "put up legs." If
a procedure with abstract (ungrounded) schema like PutUpTorso is chosen for execu-
tion and placed into the ActiveSchemaPool, then in the course of execution, inferential
procedure evaluation must be used to figure out how to make the abstract schema ac-
tionable. The GoalDrivenActionSelection MindAgent must make the choice whether to
put a not-fully-grounded schema into the ActiveSchemaPool, rather than grounding it
first and then making it active; this is the sort of choice that may be made effectively
via learned cognitive schema.
• Chapter 26 (Perception and Action)
- ExperienceDB. No person remembers every blocks structure they ever saw or built,
except maybe some autists. But a CogPrime can store all this information fairly easily,
in its ExperienceDB, even if it doesn't keep it all in RAM in its AtomSpace. It can also
store everything anyone ever said about blocks structures in its vicinity.
- Perceptual Pattern Mining.
- Object Recognition. Recognizing structures made of blocks as cars, people, houses,
etc. requires fairly abstract object recognition, involving identifying the key shapes and
features involved in an object-type, rather than just going by simple visual similarity.
- Hierarchical Perception Networks. If the room is well-lit, it's easy to visually iden-
tify individual blocks within a blocks structure. If the room is darker, then more top-
down processing may be needed - identifying the overall shape of the blocks structure
may guide one in making out the individual blocks.
- Hierarchical Action Networks. Top-down action processing tells the agent that, if
it wants to pick up a block, it should move its arm in such a way as to get its hand
near the block, and then move its hand. But if it's still learning how to do that sort
of motion, more likely it will do this, but then start moving its hand and find that
it's hard to get a grip on the block - and then have to go back and move its arm a
little differently. Iterating between broader arm/hand movements and more fine-grained
hand/finger movements is an instance of information iteratively passing up and down
a hierarchical action network.
- Coupling of Perception and Action Networks. Picking up a block in the dark is
a perfect example of rich coupling of perception and action networks. Feeling the block
with the fingers helps with identifying blocks that can't be clearly seen.
• Chapter 30 (Procedure Learning)
- Specification Based Procedure Learning.
• Suppose the agent has never seen a horse, but the teacher builds a number of blocks
structures and calls them horses, and draws a number of pictures and calls them
horses. This may cause a procedure learning problem to be spawned, where the
fitness function is accuracy at distinguishing horses from non-horses.
• Learning to pick up a block is specification-based procedure learning, where the
specification is to pick up the block and grip it and move it without knocking down
the other stuff near the block.
- Representation Building.
• In the midst of building a procedure to recognize horses, MOSES would experi-
mentally vary program nodes recognizing visual features into other program nodes
recognizing other visual features
• In the midst of building a procedure to pick up blocks, MOSES would experimentally
vary program nodes representing physical movements into other nodes representing
physical movements
• In both of these cases, MOSES would also carry out the standard experimen-
tal variations of mathematical and control operators according to its standard
representation-building framework
• Chapter 31 (Imitative, Reinforcement and Corrective Learning)
- Reinforcement Learning.
• Motor procedures for placing blocks (in simulations or reality) will get rewarded if
they don't result in the blocks structure falling down, punished otherwise.
• Procedures leading to the teacher being pleased, in internal simulations (or in re-
peated trials of scenarios like the one under consideration), will get rewarded; pro-
cedures leading to the teacher being displeased will get punished.
- Imitation Learning. If the agent has seen others build with blocks before, it may
summon these memories and then imitate the actions it has seen others take.
- Corrective Learning. This would occur if the teacher intervened in the agent's block-
building and guided him physically - e.g. steadying his shaky arm to prevent him from
knocking the blocks structure over.
• Chapter 32 (Hillclimbing)
- Complexity Penalty. In learning procedures for manipulating blocks, the complexity
penalty will militate against procedures that contain extraneous steps.
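A complexity-penalized fitness function of this kind is easy to state concretely; the score scale and penalty coefficient below are illustrative choices.

```python
# Sketch of complexity-penalized fitness for hillclimbing over procedures:
# raw task score minus a penalty proportional to program size, so that
# extraneous steps are selected against. (Coefficient is illustrative.)

def penalized_fitness(score, program_size, alpha=0.05):
    return score - alpha * program_size

# Two block-manipulation procedures that perform equally well on the task;
# the one with four extraneous steps loses under the penalty.
concise = penalized_fitness(0.9, program_size=10)
verbose = penalized_fitness(0.9, program_size=14)
```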
• Chapter 33 (Probabilistic Evolutionary Procedure Learning)
- Supplying Evolutionary Learning with Long-Term Memory. Suppose the agent
has previously built people from clay, but never from blocks. It may then have learned a
"classification model" predicting which clay people will look appealing to humans, and
which won't. It may then transfer this knowledge, using PLN, to form a classification
model predicting which blocks-people will look appealing to humans, and which won't.
- Fitness Function Estimation via Integrative Intelligence. To estimate the fitness
of a procedure for, say, putting an arm on a blocks-built human, the agent may try out
the procedure in the internal simulation world; or it may use PLN inference to reason
by analogy to prior physical situations it's observed. These allow fitness to be estimated
without actually trying out the procedure in the environment.
• Chapter 34 (Probabilistic Logic Networks)
- Deduction. This is a tall skinny structure; tall skinny structures fall down easily; thus
this structure may fall down easily.
- Induction. This teacher is talkative; this teacher is friendly; therefore the talkative are
generally friendly.
- Abduction. This structure has a head and arms and torso; a person has a head and
arms and torso; therefore this structure is a person.
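The deduction example above can be made quantitative using PLN's independence-based deduction strength formula; the premise strengths plugged in here are invented for illustration.

```python
# PLN deduction strength (independence-based form): the strength of A->C
# is computed from the strengths of A->B and B->C together with the term
# probabilities of B and C. Premise numbers below are illustrative.

def pln_deduction(sAB, sBC, sB, sC):
    """Strength of Inheritance A->C from A->B and B->C."""
    return sAB * sBC + (1 - sAB) * (sC - sB * sBC) / (1 - sB)

# A = "this structure", B = "tall skinny structure", C = "falls down easily"
s = pln_deduction(sAB=0.9, sBC=0.8, sB=0.2, sC=0.3)
# s is about 0.74: the structure quite probably falls down easily
```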
- PLN forward chaining. What properties might a car-man have, based on inference
from the properties of cars and the properties of men?
- PLN backward chaining.
• An inference target might be: Find X so that X looks something like a wheel and
can be attached to this blocks-chassis, and I can find four fairly similar copies.
• Or: Find the truth value of the proposition that this structure looks like a car.
- Indefinite truth values. Consider the deductive inference "This is a tall skinny struc-
ture; tall skinny structures fall down easily; thus this structure may fall down easily."
In this case, the confidence of the second premise may be greater than the confidence
of the first premise, which may result in an intermediate confidence for the conclusion,
according to the propagation of indefinite probabilities through the PLN deduction rule.
- Intensional inference. Is the blocks-structure a person? According to the definition of
intensional inheritance, it shares many informative properties with people (e.g. having
arms, torso and head), so to a significant extent, it is a person.
- Confidence decay. The agent's confidence in propositions regarding building things
with blocks should remain nearly constant. The agent's confidence in propositions re-
garding the teacher's taste should decay more rapidly. This should occur because the
agent should observe that, in general, propositions regarding physical object manipula-
tion tend to retain fairly constant truth value, whereas propositions regarding human
tastes tend to have more rapidly decaying truth value.
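One simple way to realize per-domain confidence decay is exponential decay with a domain-specific half-life; the half-life values here are invented for illustration.

```python
# Sketch of per-domain confidence decay: confidence decays exponentially,
# with a much longer half-life for stable physical knowledge than for
# volatile knowledge like a person's tastes. (Half-lives are illustrative.)

def decayed_confidence(conf, elapsed_days, half_life_days):
    return conf * 0.5 ** (elapsed_days / half_life_days)

# A month later, knowledge about block physics has barely decayed, while
# knowledge about the teacher's tastes has decayed noticeably.
physics = decayed_confidence(0.9, elapsed_days=30, half_life_days=3650)
tastes  = decayed_confidence(0.9, elapsed_days=30, half_life_days=60)
```

The half-lives themselves would be learned, as the text says, by observing how rapidly truth values in each domain actually drift.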
• Chapter 35 (Spatiotemporal Inference)
- Temporal reasoning. Suppose, after the teacher asks 'What is it?", the agent needs
to think a while to figure out a good answer. But maybe the agent knows that it's rude
to pause too long before answering something to a direct question. Temporal reasoning
helps figure out "how long is too long" to wait before answering.
- Spatial reasoning. Suppose the agent puts shoes on the wheels of the car. This is a
joke relying on the understanding that wheels hold a car up, whereas feet hold a person
up, and the structure is a car-man. But it also relies on the spatial inferences that:
the car's wheels are in the right position for the man's feet (below the torso); and, the
wheels are below the car's chassis just like a person's feet are below its torso.
• Chapter 36 (Inference Control)
- Evaluator Choice as a Bandit Problem. In doing inference regarding how to make
a suitably humanlike arm for the blocks-man, there may be a choice between multiple
inference pathways, perhaps one that relies on analogy to other situations building
arms, versus one that relies on more general reasoning about lengths and weights of
blocks. The choice between these two pathways will be made randomly with a certain
probabilistic bias assigned to each one, via prior experience.
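One standard bandit-style policy for this probabilistically biased choice is softmax sampling over empirical success rates. The pathway names and rates below are illustrative, and softmax is just one of several policies that would fit the description.

```python
# Sketch of evaluator choice as a bandit problem: each inference pathway
# carries an empirical success rate, and one pathway is sampled with
# probability given by a softmax over those rates. (One simple bandit
# policy; others, e.g. Thompson sampling, would also fit.)

import math
import random

def choose_pathway(success_rates, temperature=0.2, rng=random.random):
    names = list(success_rates)
    weights = [math.exp(success_rates[n] / temperature) for n in names]
    total = sum(weights)
    r = rng() * total
    for name, w in zip(names, weights):
        r -= w
        if r <= 0:
            return name
    return names[-1]

pathways = {"analogy to prior arm-building": 0.7, "naive physics": 0.4}
chosen = choose_pathway(pathways)
```

Inference pattern mining would then adjust the stored success rates, nudging up the prior probability of whichever pathway has proved the better guide.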
- Inference Pattern Mining. The probabilities used in choosing which inference path
to take are determined in part by prior experience - e.g. maybe it's the case that in
prior situations of building complex blocks structures, analogy has proved a better guide
than naive physics, thus the prior probability of the analogy inference pathway will be
nudged up.
- PLN and Bayes Nets. What's the probability that the blocks-man's hat will fall
off if the man-car is pushed a little bit to simulate driving? This question could be
resolved in many ways (e.g. by internal simulation), but one possibility is inference. If
this is resolved by inference, it's the sort of conditional probability calculation that could
potentially be done faster if a lot of the probabilistic knowledge from the AtomSpace
were summarized in a Bayes Net. Updating the Bayes net structure can be slow, so this
is probably not appropriate for knowledge that is rapidly shifting; but knowledge about
properties of blocks structures may be fairly persistent after the agent has gained a fair
bit of knowledge by playing with blocks a lot.
• Chapter 37 (Pattern Mining)
- Greedy Pattern Mining.
• "Push a tall structure of blocks and it tends to fall down" is the sort of repetitive
pattern that could easily be extracted from a historical record of perceptions and
(the agent's and others') actions via a simple greedy pattern mining algorithm.
• If there is a block that is shaped like a baby's rattle, with a long slender handle
and then a circular shape at the end, then greedy pattern mining may be helpful
due to having recognized the pattern that structures like this are sometimes rattles
- and also that structures like this are often stuck together, with the handle part
connected sturdily to the circular part.
- Evolutionary Pattern Mining. "Push a tall structure of blocks with a wide base and
a gradual narrowing toward the top and it may not fall too badly" is a more complex
pattern that may not be found via greedy mining, unless the agent has dealt with a lot
of pyramids.
• Chapter 38 (Concept Formation)
- Formal Concept Analysis. Suppose there are many long, slender blocks of different
colors and different shapes (some cylindrical, some purely rectangular for example).
Learning this sort of concept based on common features is exactly what FCA is good
at (and when the features are defined fuzzily or probabilistically, it's exactly what
uncertain FCA is good at). Learning the property of "slender" itself is another example
of something uncertain FCA is good at - it would learn this if there were many concepts
that preferentially involved slender things (even though formed on the basis of concepts
other than slenderness).
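The FCA construction described here is small enough to show directly: a formal concept is a pair (extent, intent) where the extent is a set of objects and the intent is exactly the attributes they share. The blocks and attributes below are invented examples.

```python
# Minimal Formal Concept Analysis sketch: from an object-attribute table,
# derive the formal concept grouping the long, slender blocks regardless
# of color or cross-section. (Crisp FCA; the uncertain variant would use
# fuzzy attribute memberships.)

def common_attributes(objects, table):
    """Intent: attributes shared by every object in the set."""
    return set.intersection(*(table[o] for o in objects)) if objects else set()

def objects_with(attrs, table):
    """Extent: objects possessing every attribute in the set."""
    return {o for o, a in table.items() if attrs <= a}

table = {
    "red cylinder": {"long", "slender", "red", "cylindrical"},
    "blue beam":    {"long", "slender", "blue", "rectangular"},
    "red cube":     {"red", "cubic"},
}
intent = common_attributes({"red cylinder", "blue beam"}, table)
extent = objects_with(intent, table)
```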
- Conceptual Blending. The concept of a "car-man" or "man-car" is an obvious instance
of conceptual blending. The agent knows that building a man won't surprise the teacher,
and nor will building a car ... but both "man" and "car" may pop to the forefront of its
mind (i.e. get a briefly high STI) when it thinks about what to build. But since it knows
it has to do something new or surprising, there may be a cognitive schema that boosts
the amount of funds to the ConceptBlending MindAgent, causing it to be extra-active.
In any event, the ConceptBlending agent seeks to find ways to combine important
concepts; and then PLN explores these to see which ones may be able to achieve the
given goal of surprising the teacher (which includes subgoals such as actually being
buildable).
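At its crudest, the blending step combines the salient properties of the two parent concepts; the property sets here are invented, and a real ConceptBlending MindAgent would of course do far more than take a set union before PLN vets the blend.

```python
# Toy sketch of conceptual blending: combine salient properties of two
# high-STI concepts into a candidate blend, which PLN would then vet for
# novelty and buildability. (Property lists are illustrative; real
# blending resolves clashes rather than just taking a union.)

def blend(name_a, props_a, name_b, props_b):
    return (f"{name_a}-{name_b}", props_a | props_b)

car = {"wheels", "chassis"}
man = {"head", "arms", "torso"}
name, props = blend("car", car, "man", man)
```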
• Chapter 39 (Dimensional Embedding)
- Dimensional Embedding. When the agent needs to search its memory for a previ-
ously seen blocks structure similar to the currently observed one - or for a previously
articulated thought similar to the one it's currently trying to articulate - then it needs
to do a search through its large memory for "an entity similar to X" (where X is a
structure or a thought). This kind of search can be quite computationally difficult - but
if the entities in question have been projected into an embedding space, then it's quite
rapid. (The cost is shifted to the continual maintenance of the embedding space, and
its periodic updating; and there is some error incurred in the projection, but in many
cases this error is not a show-stopper.)
- Embedding Based Inference Control. Rapid search for answers to similarity or
inheritance queries can be key for guiding inference in appropriate directions; for instance
reasoning about how to build a structure with certain properties can benefit greatly
from rapid search for previously-encountered substructures currently structurally or
functionally similar to the substructures one desires to build.
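A minimal sketch of why the embedding shift pays off: once entities are projected to fixed-dimensional vectors, "find something similar to X" becomes a cheap nearest-neighbour scan rather than an expensive structural comparison against every stored item. The vectors and names below are invented toy data, not CogPrime's actual embedding.

```python
# Illustrative nearest-neighbour search in an embedding space.
# Toy data; a real system would maintain and periodically update the
# projection, accepting some projection error in exchange for speed.
import math

def embed_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(query, memory):
    """memory: {name: embedding vector}; returns the name nearest to query."""
    return min(memory, key=lambda name: embed_distance(query, memory[name]))

# Embeddings standing in for previously seen blocks structures.
memory = {
    "tall_tower": [0.9, 0.1, 0.0],
    "flat_wall":  [0.1, 0.9, 0.2],
    "small_arch": [0.4, 0.3, 0.8],
}
assert most_similar([0.85, 0.15, 0.05], memory) == "tall_tower"
```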
• Chapter 40 (Simulation and Episodic Memory)
- Fitness Estimation via Simulation. One way to estimate whether a certain blocks
structure is likely to fall down or not, is to build it in one's "mind's eye" and see if
the physics engine in one's mind's-eye causes it to fall down. This is something that in
many cases will work better for CogPrime than for humans, because CogPrime has a
more mathematically accurate physics engine than the human mind does; however, in
cases that rely heavily on naive physics rather than, say, direct applications of Newton's
Laws, then CogPrime's simulation engine may underperform the typical human mind.
- Concept Formation via Simulation. Objects may be joined into categories using
uncertain FCA, based on features that they are identified to have via "simulation exper-
iments" rather than physical world observations. For instance, it may be observed that
pyramid-shaped structures fall less easily than pencil-shaped tower structures - and
the concepts corresponding to these two categories may be formed - from experiments
run in the internal simulation world, perhaps inspired by isolated observations in the
physical world.
- Episodic Memory. Previous situations in which the agent has seen similar structures
built, or been given similar problems to solve, may be brought to mind as "episodic
movies" playing in the agent's memory. By watching what happens in these replayed
episodic movies, the agent may learn new declarative or procedural knowledge about
what to do. For example, maybe there was some situation in the agent's past where it
saw someone asked to do something surprising, and that someone created something
funny. This might (via a simple PLN step) bias the agent to create something now,
which it has reason to suspect will cause others to laugh.
• Chapter 41 (Integrative Procedure Learning)
- Concept-Driven Procedure Learning. Learning the concept of "horse", as discussed
above in the context of Chapter 30, is an example of this.
- Predicate Schematization. The synthesis of a schema for man-building, as discussed
above in the context of Chapter 25, is an example of this.
• Chapter 42 (Map Formation)
- Map Formation. The notion of a car involves many aspects: the physical appearance
of cars, the way people get in and out of cars, the ways cars drive, the noises they make,
etc. All these aspects are represented by Atoms that are part of the car map, and are
richly interconnected via HebbianLinks as well as other links.
50.2 Roles of Selected Cognitive Processes 505
- Map Encapsulation. The car map forms implicitly via the interaction of multiple
cognitive dynamics, especially ECAN. But then the MapEncapsulation MindAgent may
do its pattern mining and recognize this map explicitly, and form a PredicateNode
encapsulating it. This PredicateNode may then be used in PLN inference, conceptual
blending, and so forth (e.g. helping with the formation of a concept like car-man via
blending).
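The implicit map formation described above can be sketched with a generic Hebbian update (the rule and learning rate are illustrative choices, not CogPrime's exact ECAN formula): Atoms that are repeatedly co-active accumulate strong pairwise links, and the resulting tightly linked cluster is what map encapsulation would later mine out explicitly.

```python
# Hedged sketch of Hebbian link formation between Atoms that are
# simultaneously attended to: each co-active pair has its link weight
# nudged upward, bounded below 1.0.
from itertools import combinations

def hebbian_update(weights, active_atoms, rate=0.1):
    """Strengthen links between every pair of co-active atoms."""
    for a, b in combinations(sorted(active_atoms), 2):
        w = weights.get((a, b), 0.0)
        weights[(a, b)] = w + rate * (1.0 - w)  # stays in [0, 1)
    return weights

weights = {}
# Aspects of "car" repeatedly co-occur in the focus of attention...
for _ in range(20):
    hebbian_update(weights, {"car_shape", "car_noise", "driving"})
# ...so they end up strongly interlinked, forming an implicit "car" map.
assert weights[("car_noise", "car_shape")] > 0.8
```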
• Chapter 44 (Natural Language Comprehension)
- Experience-Based Disambiguation. The particular dialogue involved in the present
example doesn't require any nontrivial word sense disambiguation. But it does require
parse selection, and semantic interpretation selection:
In "Build me something with blocks," the agent has no trouble understanding that
"blocks" means "toy building blocks" rather than, say, "city blocks", based on many
possible mechanisms, but most simply importance spreading.
"Build me something with blocks" has at least three interpretations: the building
could be carried out using blocks with a tool; or the thing built could be presented
alongside blocks; or the thing built could be composed of blocks. The latter is the
most commonsensical interpretation for most humans, but that is because we have
heard the phrase "building with blocks" used in a similarly grounded way before
(as well as other similar phrases such as "playing with Legos", etc., whose meaning
helps militate toward the right interpretation via PLN inference and importance
spreading). So here we have a simple example of experience-based disambiguation,
where experiences at various distances of association from the current one are used
to help select the correct parse.
A subtler form of semantic disambiguation is involved in interpreting the clause
"that I haven't seen before." A literal-minded interpretation would say that this
requirement is fulfilled by any blocks construction that's not precisely identical to
one the teacher has seen before. But of course, any sensible human knows this is an
idiomatic clause that means "significantly different from anything I've seen before."
This could be determined by the CogPrime agent if it has heard the idiomatic
clause before, or if it's heard a similar idiomatic phrase such as "something I've
never done before." Or, even if the agent has never heard such an idiom before, it
could potentially figure out the intended meaning simply because the literal-minded
interpretation would be a pointless thing for the teacher to say. So if it knows the
teacher usually doesn't add useless modificatory clauses onto their statements, then
potentially the agent could guess the correct meaning of the phrase.
• Chapter 46 (Language Generation)
- Experience-Based Knowledge Selection for Language Generation. When the
teacher asks "What is it?", the agent must decide what sort of answer to give. Within
the confines of the QuestionAnswering DialogueController, the agent could answer "A
structure of blocks", or "A part of the physical world", or "A thing", or "Mine." (Or, if it
were running another DC, it could answer more broadly, e.g. "None of your business,"
etc.). However, the QA DC tells it that, in the present context, the most likely desired
answer is one that the teacher doesn't already know; and the most important property
of the structure that the teacher doesn't obviously already know is the fact that it
depicts a "car man." Also, memory of prior conversations may bring up statements like
"It's a horse" in reference to a horse built of blocks, or a drawing of a horse, etc.
- Experience-Based Guidance of Word and Syntax Choice. The choice of phrase
"car man" requires some choices to be made. The agent could just as well say "It's a
man with a car for feet" or "It's a car with a human upper body and head" or "It's
a car centaur," etc. A bias toward simple expressions would lead to "car man." If the
teacher were known to prefer complex expressions, then the agent might be biased
toward expressing the idea in a different way.
• Chapter 48 (Natural Language Dialogue)
- Adaptation of Dialogue Controllers. The QuestionAsking and QuestionAnswer-
ing DialogueControllers both get reinforcement from this interaction, for the specific
internal rules that led to the given statements being made.
50.3 A Semi-Narrative Treatment
Now we describe how a CogPrime system might carry out the specified task in a semi-narrative
form, weaving in the material from the previous section as we go along, and making some more
basic points as well. The semi-narrative covers most but not all of the bullet points from the
previous section, but with some of the technical details removed; and it introduces a handful
of new examples not given in the bullet points.
The reason this is called a semi-narrative rather than a narrative is that there is no particular
linear order to the processes occurring in each phase of the situation described here. CogPrime's
internal cognitive processes do not occur in a linear narrative; rather, what we have is a complex
network of interlocking events. But still, describing some of these events concretely in a manner
correlated with the different stages of a simple interaction, may have some expository value.
The human teacher tells the CogPrime agent "Build me something with blocks
that I haven't seen before."
Upon hearing this, the agent's cognitive cycles are dominated by language processing and
retrieval from episodic and sensory memory.
The agent may decide to revive from disk the mind-states it went through when building
human-pleasing structures from blocks before, so as to provide it with guidance.
It will likely experience the emotion of happiness, because it anticipates the pleasure of
getting rewarded for the task in future.
The ubergoal of pleasing the teacher gets active (gets funded significantly with STI currency),
as it becomes apparent there are fairly clear ways of fulfilling that goal (via the subgoal S of
building blocks structures that will get positive response from the teacher). Other ubergoals
like gaining knowledge are not funded as much with STI currency just now, as they are not
immediately relevant.
Action selection, based on ImplicationLinks derived via PLN (between various possible activ-
ities and the subgoal S) causes it to start experimentally building some blocks structures. Past
experience with building (turned into ImplicationLinks via mining the SystemActivityTable)
tells it that it may want to build a little bit in its internal simulation world before building in
the external world, causing STI currency to flow to the simulation MindAgent.
The Atom corresponding to the context blocks-building gets high STI and is pushed into the
AttentionalFocus, making it likely that many future inferences will occur in this context. Other
Atoms related to this one also get high STI (the ones in the blocks-building map, and others
that are especially related to blocks-building in this particular context).
After a few false starts, the agent builds something it thinks is appropriate and
says "Do you like it?"
Now that the agent has decided what to do to fulfill its well-funded goal, its cognitive cycles
are dominated by action, perception and related memory access and concept creation.
An obvious subgoal is spawned: build a new structure now, and make this particular structure
under construction appealing and novel to the teacher. This subgoal has a shorter time scale
than the high level goal. The subgoal gets some currency from its supergoal using the mechanism
of RFS spreading.
Action selection must tell it when to continue building the same structure and when to try
a new one, as well as more micro level choices.
Atoms related to the currently pursued blocks structure get high STI.
After a failed structure (a "false start") is disassembled, the corresponding Atoms lose STI
dramatically (leaving AF) but may still have significant LTI, so they can be recalled later as
appropriate. They may also have VLTI so they will be saved to disk later on if other things
push them out of RAM due to getting higher LTI.
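The attention-value bookkeeping in the preceding paragraphs can be caricatured as a simple decision rule. The thresholds and field names below are invented for illustration; CogPrime's actual economic attention allocation is more dynamic.

```python
# Illustrative sketch of how STI/LTI/VLTI values might govern an Atom's
# fate: high STI keeps it in the AttentionalFocus, sufficient LTI keeps
# it in RAM, and VLTI marks it for saving to disk rather than deletion.
def disposition(sti, lti, vlti, af_boundary=100, lti_floor=10):
    if sti >= af_boundary:
        return "attentional focus"
    if lti >= lti_floor:
        return "in RAM, out of focus"
    return "save to disk" if vlti else "forget"

# A dismantled false-start structure: STI crashes but LTI persists.
assert disposition(sti=5, lti=50, vlti=False) == "in RAM, out of focus"
assert disposition(sti=150, lti=50, vlti=False) == "attentional focus"
assert disposition(sti=5, lti=2, vlti=True) == "save to disk"
```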
Meanwhile everything that's experienced from the external world goes into the Experi-
enceDB.
Atoms representing different parts or aspects of the same blocks structure will get Hebbian
links between them, which will guide future reasoning and importance spreading.
Importance spreading helps the system go from an idea for something to build (say, a rock
or a car) to the specific plans and ideas about how to build it, via increasing the STI of the
Atoms that will be involved in these plans and ideas.
If something apparently good is done in building a blocks structure, then other processes and
actions that helped lead to or support that good thing, get passed some STI from the Atoms
representing the good thing, and also may get linked to the Goal Atom representing "good" in
this context. This leads to reinforcement learning.
The agent may play with building structures and then seeing what they most look like, thus
exercising abstract object recognition (that uses procedures learned by MOSES or hillclimbing,
or uncertain relations learned by inference, to guess what object category a given observed
collection of percepts most likely falls into).
Since the agent has been asked to come up with something surprising, it knows it should
probably try to formulate some new concepts - because it has learned in the past, via Sys-
temActivityTable mining, that often newly formed concepts are surprising to others. So, more
STI currency is given to concept formation MindAgents, such as the ConceptualBlending MindAgent
(which, along with a lot of stuff that gets thrown out or stored for later use, comes up
with "car-man").
When the notion of "car" is brought to mind, the distributed map of nodes corresponding to
"car" get high STI. When car-man is formed, it is reasoned about (producing new Atoms), but
it also serves as a nexus of importance-spreading, causing the creation of a distributed car-man
map.
If the goal of making an arm for a man-car occurs, then goal-driven schema learning may
be done to learn a procedure for arm-making (where the actual learning is done by MOSES or
hillclimbing).
If the agent is building a man-car, it may have man-building and car-building schema in its
ActiveSchemaPool at the same time, and SchemaActivation may spread back and forth between
the different modules of these two schema.
If the agent wants to build a horse, but has never seen a horse made of blocks (only various
pictures and movies of horses), it may use MOSES or hillclimbing internally to solve the
problem of creating a horse-recognizer or a horse-generator which embodies appropriate abstract
properties of horses. Here as in all cases of procedure learning, a complexity penalty rewards
simpler programs, from among all programs that approximately fulfill the goals of the learning
process.
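The complexity penalty mentioned here can be sketched as a scoring rule. The penalty weight and the "size in nodes" measure are illustrative choices; MOSES's actual scoring differs in detail.

```python
# Sketch of complexity-penalized fitness: among candidate programs that
# roughly fit the data, prefer the simpler one.
def penalized_fitness(accuracy, program_size, penalty=0.01):
    return accuracy - penalty * program_size

candidates = {
    "big_program":   (0.95, 40),  # (accuracy, size in program-tree nodes)
    "small_program": (0.93, 5),
}
best = max(candidates, key=lambda c: penalized_fitness(*candidates[c]))
# 0.93 - 0.05 = 0.88 beats 0.95 - 0.40 = 0.55
assert best == "small_program"
```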
If a procedure being executed has some abstract parts, then these may be executed by
inferential procedure evaluation (which makes the abstract parts concrete on the fly in the
course of execution).
To guess the fitness of a procedure for doing something (say, building an arm or recognizing
a horse), inference or simulation may be used, as well as direct evaluation in the world.
Deductive, inductive and abductive PLN inference may be used in figuring out what a blocks
structure will look or act like before building it (it's tall and thin so it may fall down; it
won't be bilaterally symmetric so it won't look much like a person; etc.)
Backward-chaining inference control will help figure out how to assemble something matching
a certain specification e.g. how to build a chassis based on knowledge of what a chassis looks
like. Forward chaining inference (critically including intensional relationships) will be used to
estimate the properties that the teacher will perceive a given specific structure to have. Spatial
and temporal algebra will be used extensively in this reasoning, within the PLN framework.
Coordinating different parts of the body - say an arm and a hand - will involve importance
spreading (both up and down) within the hierarchical action network, and from this network
to the hierarchical perception network and the heterarchical cognitive network.
In looking up Atoms in the AtomSpace, some have truth values whose confidences have
decayed significantly (e.g. those regarding the teacher's tastes), whereas others have confidences
that have hardly decayed at all (e.g. those regarding general physical properties of blocks).
Finding previous blocks structures similar to the current one (useful for guiding building by
analogy to past experience) may be done rapidly by searching the system's internal dimensional-
embedding space.
As the building process occurs, patterns mined via past experience (tall things often fall
down) are used within various cognitive processes (reasoning, procedure learning, concept cre-
ation, etc.); and new pattern mining also occurs based on the new observations made as different
structures are built and experimented with and destroyed.
Simulation of teacher reactions, based on inference from prior examples, helps with the
evaluation of possible structures, and also of procedures for creating structures.
As the agent does all this, it experiences the emotion of curiosity (likely among other emo-
tions), because as it builds each new structure it has questions about what it will look like and
how the teacher would react to it.
The human teacher says "It's beautiful. What is it?" The agent says "It's a car
man."
Now that the building is done and the teacher says something, the agent's cognitive cycles
are dominated by language understanding and generation. The Atom representing the context
of talking to the teacher gets high STI, and is used as the context for many ensuing inferences.
Comprehension of "it" uses anaphor resolution based on a combination of ECAN and PLN
inference based on a combination of previously interpreted language and observation of the
external world situation.
The agent experiences the emotion of happiness because the teacher has called its creation
beautiful, which it recognizes as a positive evaluation - so the agent knows one of its ubergoals
("please the teacher") has been significantly fulfilled.
The goal of pleasing the teacher causes the system to want to answer the question. So
the QuestionAnswering DialogueController schema gets paid a lot and gets put into the Ac-
tiveSchemaPool. In reaction to the question asked, this DC chooses a semantic graph to speak,
then invokes NL generation to say it.
NL generation chooses the most compact expression that seems to adequately convey the
intended meaning, so it decides on "car man" as the best simple verbalization to match the
newly created conceptual blend that it thinks effectively describes the newly created blocks
structure.
The positive feedback from the user leads to reinforcement of the Atoms and processes that
led to the construction of the blocks structure that has been judged beautiful (via importance
spreading and SystemActivityTable mining).
50.4 Conclusion
The simple situation considered in this chapter is complex enough to involve nearly all the
different cognitive processes in the CogPrime system - and many interactions between these
processes. This fact illustrates one of the main difficulties of designing, building and testing an
artificial mind like CogPrime - until nearly all of the system is built and made to operate in
an integrated way, it's hard to do any meaningful test of the system. Testing PLN or MOSES
or conceptual blending in isolation may be interesting computer science, but it doesn't tell you
much about CogPrime as a design for a thinking machine.
According to the CogPrime approach, getting a simple child-like interaction like "build me
something with blocks that I haven't seen before" to work properly requires a holistic, integrated
cognitive system. Once one has built a system capable of this sort of simple interaction then,
according to the theory underlying CogPrime, one is not that far from a system with adult
human-level intelligence. And once one has an adult human-level AGI built according to a
highly flexible design like CogPrime, given the potential of such systems to self-analyze and
self-modify, one is not far off from a dramatically powerful Genius Machine. Of course there
will be a lot of work to do to get from a child-level system to an adult-level system - it won't
necessarily unfold as "automatically" as seems to happen with a human child, because CogPrime
lacks the suite of developmental processes and mechanisms that the young human brain has.
But still, a child CogPrime mind capable of doing the things outlined in this chapter will have
all the basic components and interactions in place, all the ones that are needed for a much more
advanced artificial mind.
Of course, one could concoct a narrow-AI system carrying out the specific activities described
in this chapter, much more simply than one could build a CogPrime system capable of doing
these activities. But that's not the point — the point of this chapter is not to explain how to
achieve some particular narrow set of activities "by any means necessary", but rather to explain
how these activities might be achieved within the CogPrime framework, which has been designed
with much more generality in mind.
It would be worthwhile to elaborate a number of other situations similar to the one described
in this chapter, and to work through the various cognitive processes and structures in CogPrime
carefully in the context of each of these situations. In fact this sort of exercise has frequently
been carried out informally in the context of developing CogPrime. But this book is already
long enough, so we will end here, and leave the rest for future works - emphasizing that it is
via intimate interplay between concrete considerations like the ones presented in this chapter,
and general algorithmic and conceptual considerations as presented in most of the chapters of
this book, that we have the greatest hope of creating advanced AGI. The value of this sort of
interplay actually follows from the theory of real-world general intelligence presented in Part
1 of the book. Thoroughly general intelligence is only possible given unrealistic computational
resources, so real-world general intelligence is about achieving high generality given limited
resources relative to the specific classes of environments relevant to a given agent. Specific
situations like building surprising things with blocks are particularly important insofar as they
embody broader information about the classes of environments relevant to broadly human-like
general intelligence.
No doubt, once a CogPrime system is completed, the specifics of its handling of the situation
described here will differ somewhat from the treatment presented in this chapter. Furthermore,
the final CogPrime system may differ algorithmically and structurally in some respects from
the specifics given in this book - it would be surprising if the process of building, testing and
interacting with CogPrime didn't teach us some new things about various of the topics covered.
But our conjecture is that, if sufficient effort is deployed appropriately, then a system much like
the CogPrime system described in this book will be able to handle the situation described in
this chapter in a roughly similar manner to the one described in this chapter - and that this
will serve as a natural precursor to much more dramatic AGI achievements.
Appendix A
Glossary
A.1 List of Specialized Acronyms
This includes acronyms that are commonly used in discussing CogPrime, OpenCog and related
ideas, plus some that occur here and there in the text for relatively ephemeral reasons.
• AA: Attention Allocation
• ADF: Automatically Defined Function (in the context of Genetic Programming)
• AF: Attentional Focus
• AGI: Artificial General Intelligence
• AV: Attention Value
• BD: Behavior Description
• C-space: Configuration Space
• CBV: Coherent Blended Volition
• CEV: Coherent Extrapolated Volition
• CGGP: Contextually Guided Greedy Parsing
• CSDLN: Compositional Spatiotemporal Deep Learning Network
• CT: Combo Tree
• ECAN: Economic Attention Network
• ECP: Embodied Communication Prior
• EPW: Experiential Possible Worlds (semantics)
• FCA: Formal Concept Analysis
• FI: Fisher Information
• FIM: Frequent Itemset Mining
• FOI: First Order Inference
• FOPL: First Order Predicate Logic
• FOPLN: First Order PLN
• FS-MOSES: Feature Selection MOSES (i.e. MOSES with feature selection integrated a la
LIFES)
• GA: Genetic Algorithms
• GB: Global Brain
• GEOP: Goal Evaluator Operating Procedure (in a GOLEM context)
• GIS: Geospatial Information System
• GOLEM: Goal-Oriented LEarning Meta-architecture
• GP: Genetic Programming
• HOI: Higher-Order Inference
• HOPLN: Higher-Order PLN
• HR: Historical Repository (in a GOLEM context)
• HTM: Hierarchical Temporal Memory
• IA: (Allen) Interval Algebra (an algebra of temporal intervals)
• IRC: Imitation / Reinforcement Correction (Learning)
• LIFES: Learning-Integrated Feature Selection
• LTI: Long Term Importance
• MA: MindAgent
• MOSES: Meta-Optimizing Semantic Evolutionary Search
• MSH: Mirror System Hypothesis
• NARS: Non-Axiomatic Reasoning System
• NLGen: A specific software component within OpenCog, which provides one way of dealing
with Natural Language Generation
• OCP: OpenCogPrime
• OP: Operating Program (in a GOLEM context)
• PEPL: Probabilistic Evolutionary Procedure Learning (e.g. MOSES)
• PLN: Probabilistic Logic Networks
• RCC: Region Connection Calculus
• RelEx: A specific software component within OpenCog, which provides one way of dealing
with natural language Relationship Extraction
• SAT: Boolean SATisfaction, as a mathematical / computational problem
• SMEPH: Self-Modifying Evolving Probabilistic Hypergraph
• SRAM: Simple Realistic Agents Model
• STI: Short Term Importance
• STV: Simple Truth Value
• TV: Truth Value
• VLTI: Very Long Term Importance
• WSPS: Whole-Sentence Purely-Syntactic Parsing
A.2 Glossary of Specialized Terms
• Abduction: A general form of inference that goes from data describing something to a
hypothesis that accounts for the data. Often in an OpenCog context, this refers to the PLN
abduction rule, a specific First-Order PLN rule (If A implies C, and B implies C, then
maybe A is B), which embodies a simple form of abductive inference. But OpenCog may
also carry out abduction, as a general process, in other ways.
• Action Selection: The process via which the OpenCog system chooses which Schema to
enact, based on its current goals and context.
• Active Schema Pool: The set of Schema currently in the midst of Schema Execution.
• Adaptive Inference Control: Algorithms or heuristics for guiding PLN inference, that
cause inference to be guided differently based on the context in which the inference is taking
place, or based on aspects of the inference that are noted as it proceeds.
• AGI Preschool: A virtual world or robotic scenario roughly similar to the environment
within a typical human preschool, intended for AGIs to learn in via interacting with the
environment and with other intelligent agents.
• Atom: The basic entity used in OpenCog as an element for building representations. Some
Atoms directly represent patterns in the world or mind, others are components of represen-
tations. There are two kinds of Atoms: Nodes and Links.
• Atom, Frozen: See Atom, Saved
• Atom, Realized: An Atom that exists in RAM at a certain point in time.
• Atom, Saved: An Atom that has been saved to disk or other similar media, and is not
actively being processed.
• Atom, Serialized: An Atom that is serialized for transmission from one software process
to another, or for saving to disk, etc.
• Atom2Link: A part of OpenCogPrime's language generation system that transforms
appropriate Atoms into words connected via
link parser link types.
• Atomspace: A collection of Atoms, comprising the central part of the memory of an
OpenCog instance.
• Attention: The aspect of an intelligent system's dynamics focused on guiding which aspects
of an OpenCog system's memory & functionality gets more computational resources at a
certain point in time.
• Attention Allocation: The cognitive process concerned with managing the parameters
and relationships guiding what the system pays attention to, at what points in time. This
is a term inclusive of Importance Updating and Hebbian Learning.
• Attentional Currency: Short Term Importance and Long Term Importance values are
implemented in terms of two different types of artificial money, STICurrency and LTICur-
rency. Theoretically these may be converted to one another.
• Attentional Focus: The Atoms in an OpenCog Atomspace whose ShortTermImportance
values lie above a critical threshold (the AttentionalFocus Boundary). The Attention Allo-
cation subsystem treats these Atoms differently. Qualitatively, these Atoms constitute the
system's main focus of attention during a certain interval of time, i.e. it's a moving bubble
of attention.
• Attentional Memory: A system's memory of what it's useful to pay attention to, in what
contexts. In CogPrime this is managed by the attention allocation subsystem.
• Backward Chainer: A piece of software, wrapped in a MindAgent, that carries out back-
ward chaining inference using PLN.
• CIM-Dynamic: Concretely-Implemented Mind Dynamic, a term for a cognitive process
that is implemented explicitly in OpenCog (as opposed to allowed to emerge implicitly from
other dynamics). Sometimes a CIM-Dynamic will be implemented via a single MindAgent,
sometimes via a set of multiple interrelated MindAgents, occasionally by other means.
• Cognition: In an OpenCog context, this is an imprecise term. Sometimes this term means
any process closely related to intelligence; but more often it's used specifically to refer to
more abstract reasoning/learning/etc, as distinct from lower-level perception and action.
• Cognitive Architecture: This refers to the logical division of an AI system like OpenCog
into interacting parts and processes representing different conceptual aspects of intelligence.
It's different from the software architecture, though of course certain cognitive architectures
and certain software architectures fit more naturally together.
• Cognitive Cycle: The basic "loop" of operations that an OpenCog system, used to control
an agent interacting with a world, goes through rapidly each "subjective moment." Typically
a cognitive cycle should be completed in a second or less. It minimally involves perceiving
data from the world, storing data in memory, and deciding what if any new actions need
to be taken based on the data perceived. It may also involve other processes like deliber-
ative thinking or metacognition. Not all OpenCog processing needs to take place within a
cognitive cycle.
• Cognitive Schematic: An implication of the form "Context AND Procedure IMPLIES
goal". Learning and utilization of these is key to CogPrime's cognitive process.
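A toy illustration of how such schematics might drive action selection: given the current context and active goal, pick the procedure of the best-matching schematic. The schematics and strength numbers below are made up; real CogPrime selection weighs uncertain truth values rather than bare scores.

```python
# Cognitive schematics as (context, procedure, goal, strength) tuples:
# "Context AND Procedure IMPLIES Goal". Action selection picks the
# procedure whose schematic best matches the current context and goal.
schematics = [
    ("blocks_present", "stack_blocks", "please_teacher", 0.8),
    ("blocks_present", "throw_blocks", "please_teacher", 0.1),
    ("teacher_asks_question", "answer_question", "please_teacher", 0.9),
]

def select_procedure(context, goal):
    matches = [(s, p) for c, p, g, s in schematics
               if c == context and g == goal]
    return max(matches)[1] if matches else None

assert select_procedure("blocks_present", "please_teacher") == "stack_blocks"
```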
• Cognitive Synergy: The phenomenon by which different cognitive processes, controlling a
single agent, work together in such a way as to help each other be more intelligent. Typically,
if one has cognitive processes that are individually susceptible to combinatorial explosions,
cognitive synergy involves coupling them together in such a way that they can help one
another overcome each other's internal combinatorial explosions. The CogPrime design is
reliant on the hypothesis that its key learning algorithms will display dramatic cognitive
synergy when utilized for agent control in appropriate environments.
• CogPrime : The name for the AGI design presented in this book, which is designed specifi-
cally for implementation within the OpenCog software framework (and this implementation
is OpenCogPrime).
• CogServer: A piece of software, within OpenCog, that wraps up an Atomspace and a
number of MindAgents, along with other mechanisms like a Scheduler for controlling the
activity of the MindAgents, and code for importing and exporting data from the Atomspace.
• Cognitive Equation: The principle, identified in Ben Goertzel's 1994 book "Chaotic
Logic", that minds are collections of pattern-recognition elements, that work by iteratively
recognizing patterns in each other and then embodying these patterns as new system ele-
ments. This is seen as distinguishing mind from "self-organization" in general, as the latter
is not so focused on continual pattern recognition. Colloquially this means that "a mind is
a system continually creating itself via recognizing patterns in itself."
• Combo: The programming language used internally by MOSES to represent the programs
it evolves. SchemaNodes may refer to Combo programs, whether the latter are learned via
MOSES or via some other means. The textual realization of Combo resembles LISP with
less syntactic sugar. Internally a Combo program is represented as a program tree.
• Composer: In the PLN design, a rule is denoted a composer if it needs premises for
generating its consequent. See generator.
• CogBuntu: An Ubuntu Linux remix that contains all required packages and tools to test
and develop OpenCog.
• Concept Creation: A general term for cognitive processes that create new ConceptNodes,
PredicateNodes or concept maps representing new concepts.
• Conceptual Blending: A process of creating new concepts via judiciously combining
pieces of old concepts. This may occur in OpenCog in many ways, among them the explicit
use of a ConceptBlending MindAgent, that blends two or more ConceptNodes into a new
one.
• Confidence: A component of an OpenCog/PLN TruthValue, which is a scaling into the
interval [0,1] of the weight of evidence associated with a truth value. In the simplest case
(of a probabilistic Simple Truth Value), one uses confidence c = n / (n+k), where n is
A.2 Glossary of Specialized Terms 515
the weight of evidence and k is a parameter. In the case of an Indefinite Truth Value, the
confidence is associated with the width of the probability interval.
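The simple-truth-value case above can be sketched as follows; the default value of k here is purely illustrative, since k is a tunable system parameter rather than a fixed constant:

```python
def confidence(n, k=800.0):
    """Scale weight of evidence n into [0, 1] via c = n / (n + k).

    k is the "personality" parameter; 800 is only an illustrative
    default, not a value mandated by the design.
    """
    return n / (n + k)

# More evidence yields higher confidence, approaching 1 asymptotically.
print(confidence(0))     # no evidence -> confidence 0
print(confidence(800))   # n == k -> confidence 0.5
```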
• Confidence Decay: The process by which the confidence of an Atom decreases over time,
as the observations on which the Atom's truth value is based become increasingly obsolete.
This may be carried out by a special MindAgent. The rate of confidence decay is subtle and
contextually determined, and must be estimated via inference rather than simply assumed
a priori.
• Consciousness: CogPrime is not predicated on any particular conceptual theory of con-
sciousness. Informally, the AttentionalFocus is sometimes referred to as the "conscious"
mind of a CogPrime system, with the rest of the Atomspace as "unconscious"; but this is
just an informal usage, not intended to tie the CogPrime design to any particular theory of
consciousness. The primary originator of the CogPrime design (Ben Goertzel) tends toward panpsychism, as it happens.
• Context: In addition to its general common-sensical meaning, in CogPrime the term Con-
text also refers to an Atom that is used as the first argument of a ContextLink. The second
argument of the ContextLink then contains Links or Nodes, with TruthValues calculated
restricted to the context defined by the first argument. For instance, (ContextLink USA
(InheritanceLink person obese)).
• Core: The MindOS portion of OpenCog, comprising the Atomspace, the CogServer, and
other associated "infrastructural" code.
• Corrective Learning: When an agent learns how to do something, by having another
agent explicitly guide it in doing the thing. For instance, teaching a dog to sit by pushing
its butt to the ground.
• CSDLN: (Compositional Spatiotemporal Deep Learning Network): A hierarchical pattern
recognition network, in which each layer corresponds to a certain spatiotemporal granularity,
the nodes on a given layer correspond to spatiotemporal regions of a given size, and the
children of a node correspond to sub-regions of the region the parent corresponds to. Jeff
Hawkins's HTM is one example of a CSDLN, and Itamar Arel's DeSTIN (currently used in
OpenCog) is another.
• Declarative Knowledge: Semantic knowledge as would be expressed in propositional or
predicate logic facts or beliefs.
• Deduction: In general, this refers to the derivation of conclusions from premises using
logical rules. In PLN in particular, this often refers to the exercise of a specific inference
rule, the PLN Deduction rule (A → B, B → C, therefore A → C).
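The strength component of this rule can be sketched as below, using the independence-based deduction formula discussed in the PLN literature; sB and sC are the term probabilities of B and C, and the clamping and degenerate-case handling are illustrative simplifications:

```python
def deduction_strength(sAB, sBC, sB, sC):
    """Independence-based PLN deduction sketch: estimate s(A->C) from
    s(A->B), s(B->C) and the term probabilities s(B), s(C)."""
    if abs(1.0 - sB) < 1e-12:
        # Degenerate case (B covers everything); fall back to s(C).
        return sC
    sAC = sAB * sBC + (1.0 - sAB) * (sC - sB * sBC) / (1.0 - sB)
    return min(1.0, max(0.0, sAC))  # clamp to a valid probability

print(deduction_strength(0.9, 0.9, 0.5, 0.5))
```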
• Deep Learning: Learning in a network of elements with multiple layers, involving feedfor-
ward and feedback dynamics, and adaptation of the links between the elements. An example
deep learning algorithm is DeSTIN, which is being integrated with OpenCog for perception
processing.
• Defrosting: Restoring, into the RAM portion of an Atomspace, an Atom (or set thereof)
previously saved to disk.
• Demand: In CogPrime's OpenPsi subsystem, this term is used in a manner inherited from
the Psi model of motivated action. A Demand in this context is a quantity whose value the
system is motivated to adjust. Typically the system wants to keep the Demand between
certain minimum and maximum values. An Urge develops when a Demand deviates from
its target range.
• Deme: In MOSES, an "island" of candidate programs, closely clustered together in program
space, being evolved in an attempt to optimize a certain fitness function. The idea is that
within a deme, programs are generally similar enough that reasonable syntax-semantics
correlation obtains.
• Derived Hypergraph: The SMEPH hypergraph obtained via modeling a system in terms
of a hypergraph representing its internal states and their relationships. For instance, a
SMEPH vertex represents a collection of internal states that habitually occur in relation to
similar external situations. A SMEPH edge represents a relationship between two SMEPH
vertices (e.g. a similarity or inheritance relationship). The terminology "edge/vertex" is
used in this context, to distinguish from the "link/node" terminology used in the context
of the Atomspace.
• DeSTIN (Deep SpatioTemporal Inference Network): A specific CSDLN created by
Itamar Arel, tested on visual perception, and appropriate for integration within CogPrime.
• Dialogue: Linguistic interaction between two or more parties. In a CogPrime context, this
may be in English or another natural language, or it may be in Lojban or Psynese.
• Dialogue Control: The process of determining what to say at each juncture in a dialogue.
This is distinguished from the linguistic aspects of dialogue, language comprehension and
language generation. Dialogue control applies to Psynese or Lojban, as well as to human
natural language.
• Dimensional Embedding: The process of embedding entities from some non-dimensional
space (e.g. the Atomspace) into an n-dimensional Euclidean space. This can be useful in an
AI context because some sorts of queries (e.g. "find everything similar to X", "find a path
between X and Y") are much faster to carry out among points in a Euclidean space, than
among entities in a space with less geometric structure.
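One simple way such an embedding can be realized (a hypothetical sketch, not the specific OpenCog implementation) is to pick a few "pivot" entities and use each entity's distance to the pivots as its coordinates:

```python
import math
import random

def embed(entities, distance, k=2, seed=0):
    """Embed entities into k dimensions: coordinate i of an entity is
    its distance to the i-th randomly chosen pivot entity."""
    pivots = random.Random(seed).sample(entities, k)
    return {e: tuple(distance(e, p) for p in pivots) for e in entities}

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy example: entities are numbers, "semantic" distance is |a - b|.
points = embed(list(range(10)), lambda a, b: abs(a - b))
# Similarity queries now reduce to fast geometric comparisons:
nearest_to_3 = min((e for e in points if e != 3),
                   key=lambda e: euclidean(points[e], points[3]))
```

The payoff is that once entities live in Euclidean space, nearest-neighbor and path queries can use standard geometric data structures (k-d trees, etc.) rather than graph traversal.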
• Distributed Atomspace: An implementation of an Atomspace that spans multiple com-
putational processes; generally this is done to enable spreading an Atomspace across mul-
tiple machines.
• Dual Network: A network of mental or informational entities with both a hierarchical
structure and a heterarchical structure, and an alignment among the two structures so that
each one helps with the maintenance of the other. This is hypothesized to be a critical
emergent structure, that must emerge in a mind (e.g. in an Atomspace) in order for it to
achieve a reasonable level of human-like general intelligence (and possibly to achieve a high
level of pragmatic general intelligence in any physical environment).
• Efficient Pragmatic General Intelligence: A formal, mathematical definition of general
intelligence (extending the pragmatic general intelligence), that ultimately boils down to:
the ability to achieve complex goals in complex environments using limited computational
resources (where there is a specifically given weighting function determining which goals
and environments have highest priority). More specifically, the definition takes a weighted
sum of the system's normalized goal-achieving ability over (goal, environment) pairs, where
the weights are given by some assumed measure over these pairs, and where the
normalization is done via dividing by the (space and time) computational resources used
for achieving the goal.
• Elegant Normal Form (ENF): Used in MOSES, this is a way of putting programs in
a normal form while retaining their hierarchical structure. This is critical if one wishes
to probabilistically model the structure of a collection of programs, which is a meaningful
operation if the collection of programs is operating within a region of program space where
syntax-semantics correlation holds to a reasonable degree. The Reduct library is used to
place programs into ENF.
• Embodied Communication Prior: The class of prior distributions over (goal, environment)
pairs that are imposed by placing an intelligent system in an environment where
most of its tasks involve controlling a spatially localized body in a complex world, and in-
teracting with other intelligent spatially localized bodies. It is hypothesized that many key
aspects of human-like intelligence (e.g. the use of different subsystems for different memory
types, and cognitive synergy between the dynamics associated with these subsystems) are
consequences of this prior assumption. This is related to the Mind-World Correspondence
Principle.
• Embodiment: Colloquially, in an OpenCog context, this usually means the use of an AI
software system to control a spatially localized body in a complex (usually 3D) world. There
are also possible "borderline cases" of embodiment, such as a search agent on the Internet.
In a sense any AI is embodied, because it occupies some physical system (e.g. computer
hardware) and has some way of interfacing with the outside world.
• Emergence: A property or pattern in a system is emergent if it arises via the combination
of other system components or aspects, in such a way that its details would be very difficult
(not necessarily impossible in principle) to predict from these other system components or
aspects.
• Emotion: Emotions are system-wide responses to the system's current and predicted state.
Dörner's Psi theory of emotion contains explanations of many human emotions in terms
of underlying dynamics and motivations, and most of these explanations make sense in a
CogPrime context, due to CogPrime's use of OpenPsi (modeled on Psi) for motivation and
action selection.
• Episodic Knowledge: Knowledge about episodes in an agent's life-history, or the life-
history of other agents. CogPrime includes a special dimensional embedding space only for
episodic knowledge, easing organization and recall.
• Evolutionary Learning: Learning that proceeds via the rough process of iterated differen-
tial reproduction based on fitness, incorporating variations of reproduced entities. MOSES
is an explicitly evolutionary-learning-based portion of CogPrime; but CogPrime's dynamics
as a whole may also be conceived as evolutionary.
• Exemplar: (in the context of imitation learning) - When the owner wants to teach an
OpenCog-controlled agent a behavior by imitation, he/she gives the pet an exemplar. To
teach a virtual pet "fetch" for instance, the owner is going to throw a stick, run to it, grab
it with his/her mouth and come back to his/her initial position.
• Exemplar: (in the context of MOSES) - Candidate chosen as the core of a new deme, or
as the central program within a deme, to be varied by representation building for ongoing
exploration of program space.
• Explicit Knowledge Representation: Knowledge representation in which individual,
easily humanly identifiable pieces of knowledge correspond to individual elements in a knowl-
edge store (elements that are explicitly there in the software and accessible via very rapid,
deterministic operations).
• Extension: In PLN, the extension of a node refers to the instances of the category that
the node represents. In contrast is the intension.
• Fishgram (Frequent and Interesting Sub-hypergraph Mining): A pattern mining
algorithm for identifying frequent and/or interesting sub-hypergraphs in the Atomspace.
• First-Order Inference (FOI): The subset of PLN that handles Logical Links not in-
volving VariableAtoms or higher-order functions. The other aspect of PLN, Higher-Order
Inference, uses Truth Value formulas derived from First-Order Inference.
• Forgetting: The process of removing Atoms from the in-RAM portion of AtomSpace, when
RAM gets short and they are judged not as valuable to retain in RAM as other Atoms. This
is commonly done using the LTI values of the Atoms (removing the lowest-LTI Atoms, or more
complex strategies involving the LTI of groups of interconnected Atoms). May be done by
a dedicated Forgetting MindAgent. VLTI may be used to determine the fate of forgotten
Atoms.
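A minimal sketch of the LTI-based strategy (ignoring the more complex group-level strategies and the VLTI decision mentioned above):

```python
def forget(atoms, lti, capacity):
    """Keep the `capacity` highest-LTI atoms in RAM; return (kept, removed).

    `lti` maps each atom to its Long Term Importance value. What happens
    to removed atoms (freezing to disk vs. deletion) would be decided by
    VLTI, which this sketch omits.
    """
    ranked = sorted(atoms, key=lambda a: lti[a], reverse=True)
    return ranked[:capacity], ranked[capacity:]

kept, removed = forget(["a", "b", "c"], {"a": 3.0, "b": 1.0, "c": 2.0}, 2)
# kept == ["a", "c"], removed == ["b"]
```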
• Forward Chainer: A control mechanism (MindAgent) for PLN inference, that works by
taking existing Atoms and deriving conclusions from them using PLN rules, and then iter-
ating this process. The goal is to derive new Atoms that are interesting according to some
given criterion.
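The control loop can be sketched abstractly as follows; here each rule is modeled simply as a function from the current Atom set to new conclusions, and the interestingness criterion is omitted:

```python
def forward_chain(atoms, rules, max_steps=100):
    """Repeatedly apply every rule to the growing set of atoms,
    stopping when a pass derives nothing new (or max_steps is hit)."""
    atoms = set(atoms)
    for _ in range(max_steps):
        new = set()
        for rule in rules:
            new |= set(rule(atoms)) - atoms
        if not new:
            break
        atoms |= new
    return atoms

# Toy transitivity rule over inheritance pairs: (A, B) and (B, C) -> (A, C).
transitivity = lambda fs: {(a, d) for (a, b) in fs for (c, d) in fs if b == c}
result = forward_chain({("A", "B"), ("B", "C")}, [transitivity])
# ("A", "C") is derived.
```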
• Frame2Atom: A simple system of hand-coded rules for translating the output of RelEx2Frame
(logical representation of semantic relationships using FrameNet relationships) into Atoms.
• Freezing: Saving Atoms from the in-RAM AtomSpace to disk.
• General Intelligence: Often used in an informal, commonsensical sense, to mean the
ability to learn and generalize beyond specific problems or contexts. Has been formalized
in various ways as well, including formalizations of the notion of "achieving complex goals
in complex environments" and "achieving complex goals in complex environments using
limited resources." Usually interpreted as a fuzzy concept, according to which absolutely
general intelligence is physically unachievable, and humans have a significant level of general
intelligence, but far from the maximally physically achievable degree.
• Generalized Hypergraph: A hypergraph with some additional features, such as links
that point to links, and nodes that are seen as "containing" whole sub-hypergraphs. This is
the most natural and direct way to mathematically/visually model the Atomspace.
• Generator: In the PLN design, a rule is denoted a generator if it can produce its consequent
without needing premises (e.g. LookupRule, which simply looks its consequent up in the AtomSpace). See
composer.
• Global, Distributed Memory: Memory that stores items as implicit knowledge, with
each memory item spread across multiple components, stored as a pattern of organization
or activity among them.
• Glocal Memory: The storage of items in memory in a way that involves both localized
and global, distributed aspects.
• Goal: An Atom representing a function that a system (like OpenCog) is supposed to spend
a certain non-trivial percentage of its attention optimizing. The goal, informally speaking,
is to maximize the Atom's truth value.
• Goal, Implicit: A goal that an intelligent system, in practice, strives to achieve; but that
is not explicitly represented as a goal in the system's knowledge base.
• Goal, Explicit: A goal that an intelligent system explicitly represents in its knowledge
base and expends some resources trying to achieve. Goal Atoms (which may be Nodes or,
e.g., ImplicationLinks) are used for this purpose in OpenCog.
• Goal-Driven Learning: Learning that is driven by the cognitive schematic i.e. by the quest
of figuring out which procedures can be expected to achieve a certain goal in a certain sort
of context.
• Grounded SchemaNode: See SchemaNode, Grounded.
• Hebbian Learning: An aspect of Attention Allocation, centered on creating and updating
HebbianLinks, which represent the simultaneous importance of the Atoms joined by the
HebbianLink.
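A minimal sketch of the co-occurrence side of this update (decay of links whose endpoints stop co-occurring is omitted, and the learning rate is an arbitrary illustrative value):

```python
from itertools import combinations

def update_hebbian(weights, focus, rate=0.1):
    """Strengthen the symmetric HebbianLink between every pair of atoms
    currently in the AttentionalFocus, moving each weight toward 1."""
    for a, b in combinations(sorted(focus), 2):
        w = weights.get((a, b), 0.0)
        weights[(a, b)] = w + rate * (1.0 - w)
    return weights

weights = {}
update_hebbian(weights, {"cat", "purr"})
update_hebbian(weights, {"cat", "purr"})
# ("cat", "purr") now has weight 0.1 + 0.1 * 0.9, i.e. approximately 0.19
```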
• Hebbian Links: Links recording information about the associative relationship (co-
occurrence) between Atoms. These include symmetric and asymmetric HebbianLinks.
• Heterarchical Network: A network of linked elements in which the semantic relationships
associated with the links are generally symmetrical (e.g. they may be similarity links, or
symmetrical associative links). This is one important sort of subnetwork of an intelligent
system; see Dual Network.
• Hierarchical Network: A network of linked elements in which the semantic relationships
associated with the links are generally asymmetrical, and the parent nodes of a node have
a more general scope and some measure of control over their children (though there may be
important feedback dynamics too). This is one important sort of subnetwork of an intelligent
system; see Dual Network.
• Higher-Order Inference (HOI): PLN inference involving variables or higher-order
functions. In contrast to First-Order Inference (FOI).
• Hillclimbing: A general term for greedy, local optimization techniques, including some
relatively sophisticated ones that involve "mildly nonlocal" jumps.
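The basic greedy scheme (without the "mildly nonlocal" jumps) can be sketched as:

```python
def hillclimb(start, neighbors, score, max_iters=1000):
    """Move to the best-scoring neighbor until none improves the
    current candidate (a local optimum) or the iteration cap is hit."""
    current, best = start, score(start)
    for _ in range(max_iters):
        improved = False
        for n in neighbors(current):
            s = score(n)
            if s > best:
                current, best, improved = n, s, True
        if not improved:
            break
    return current, best

# Maximizing -(x - 3)^2 over the integers, stepping by +/-1 from 0:
top, val = hillclimb(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2)
# top == 3, val == 0
```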
• Human-Level Intelligence: General intelligence that's "as smart as" human general in-
telligence, even if in some respects quite unlike human intelligence. An informal concept,
which generally doesn't come up much in CogPrime work, but is used frequently by some
other AI theorists.
• Human-Like Intelligence: General intelligence with properties and capabilities broadly
resembling those of humans, but not necessarily precisely imitating human beings.
• Hypergraph: A conventional hypergraph is a collection of nodes and links, where each
link may span any number of nodes. OpenCog makes use of generalized hypergraphs (the
Atomspace is one of these).
• Imitation Learning: Learning via copying what some other agent is observed to do.
• Implication: Often refers to an ImplicationLink between two PredicateNodes, indicating
an (extensional, intensional or mixed) logical implication.
• Implicit Knowledge Representation: Representation of knowledge via having easily
humanly identifiable pieces of knowledge correspond to the pattern of organization and/or
dynamics of elements, rather than via having individual elements correspond to easily hu-
manly identifiable pieces of knowledge.
• Importance: A generic term for the Attention Values associated with Atoms. Most com-
monly these are STI (short term importance) and LTI (long term importance) values. Other
importance values corresponding to various different time scales are also possible. In general
an importance value reflects an estimate of the likelihood an Atom will be useful to the
system over some particular future time-horizon. STI is generally relevant to processor time
allocation, whereas LTI is generally relevant to memory allocation.
• Importance Decay: The process of Atom importance values (e.g. STI and LTI) decreasing
over time, if the Atoms are not utilized. Importance decay rates may in general be context-
dependent.
• Importance Spreading: A synonym for Importance Updating, intended to highlight the
similarity with "activation spreading" in neural and semantic networks.
• Importance Updating: The CIM-Dynamic that periodically (frequently) updates the STI
and LTI values of Atoms based on their recent activity and their relationships.
• Imprecise Truth Value: Based on Peter Walley's theory of imprecise probabilities: intervals
interpreted as lower and upper bounds of the means of probability distributions in an envelope
of distributions. In general, the term may be used to refer to any truth value involving
intervals or related constructs, such as indefinite probabilities.
• Indefinite Probability: An extension of a standard imprecise probability, comprising a
credible interval for the means of probability distributions governed by a given second-order
distribution.
• Indefinite Truth Value: An OpenCog TruthValue object wrapping up an indefinite prob-
ability.
• Induction: In PLN, a specific inference rule (A → B, A → C, therefore B → C). In general,
the process of heuristically inferring that what has been seen in multiple examples, will be
seen again in new examples. Induction in the broad sense, may be carried out in OpenCog
by methods other than PLN induction. When emphasis needs to be laid on the particular
PLN inference rule, the phrase "PLN Induction" is used.
• Inference: Generally speaking, the process of deriving conclusions from assumptions. In
an OpenCog context, this often refers to the PLN inference system. Inference in the broad
sense is distinguished from general learning via some specific characteristics, such as the
intrinsically incremental nature of inference: it proceeds step by step.
• Inference Control: A cognitive process that determines what logical inference rule (e.g.
what PLN rule) is applied to what data, at each point in the dynamic operation of an
inference process.
• Integrative AGI: An AGI architecture, like CogPrime, that relies on a number of different
powerful, reasonably general algorithms all cooperating together. This is different from an
AGI architecture that is centered on a single algorithm, and also different from an AGI
architecture that expects intelligent behavior to emerge from the collective interoperation
of a number of simple elements (without any sophisticated algorithms coordinating their
overall behavior).
• Integrative Cognitive Architecture: A cognitive architecture intended to support inte-
grative AGI.
• Intelligence: An informal, natural language concept. "General intelligence" is one slightly
more precise specification of a related concept; "Universal intelligence" is a fully precise
specification of a related concept. Other specifications of related concepts made in the
particular context of CogPrime research are the pragmatic general intelligence and the
efficient pragmatic general intelligence.
• Intension: In PLN, the intension of a node consists of Atoms representing properties of
the entity the node represents.
• Intentional memory: A system's knowledge of its goals and their subgoals, and
associations between these goals and procedures and contexts (e.g. cognitive schematics).
• Internal Simulation World: A simulation engine used to simulate an external environ-
ment (which may be physical or virtual), used by an AGI system as its "mind's eye" in order
to experiment with various action sequences and envision their consequences, or observe
the consequences of various hypothetical situations. Particularly important for dealing with
episodic knowledge.
• Interval Algebra: Allen Interval Algebra, a mathematical theory of the relationships be-
tween time intervals. CogPrime utilizes a fuzzified version of classic Interval Algebra.
• IRC Learning (Imitation, Reinforcement, Correction): Learning via interaction with
a teacher, involving a combination of imitating the teacher, getting explicit reinforcement
signals from the teacher, and having one's incorrect or suboptimal behaviors guided toward
improvement by the teacher in real time. This is a large part of how young humans learn.
• Knowledge Base: A shorthand for the totality of knowledge possessed by an intelligent
system during a certain interval of time (whether or not this knowledge is explicitly rep-
resented). Put differently: this is an intelligence's total memory contents (inclusive of all
types of memory) during an interval of time.
• Language Comprehension: The process of mapping natural language speech or text into
a more "cognitive", largely language-independent representation. In OpenCog this has been
done by various pipelines consisting of dedicated natural language processing tools, e.g. a
pipeline: text → Link Parser → RelEx → RelEx2Frame → Frame2Atom → Atomspace; and
alternatively a pipeline: text → Link Parser → Link2Atom → Atomspace. It would also be
possible to do language comprehension purely via PLN and other generic OpenCog processes,
without using specialized language processing tools.
• Language Generation: The process of mapping (largely language-independent) cognitive
content into speech or text. In OpenCog this has been done by various pipelines consisting of
dedicated natural language processing tools, e.g. a pipeline: Atomspace → NLGen → text;
or more recently Atomspace → Atom2Link → surface realization → text. It would also be
possible to do language generation purely via PLN and other generic OpenCog processes,
without using specialized language processing tools.
• Language Processing: Processing of human language is decomposed, in CogPrime, into
Language Comprehension, Language Generation, and Dialogue Control.
• Learning: In general, the process of a system adapting based on experience, in a way that
increases its intelligence (its ability to achieve its goals). The theory underlying CogPrime
doesn't distinguish learning from reasoning, associating, or other aspects of intelligence.
• Learning Server: In some OpenCog configurations, this refers to a software server that
performs "offline" learning tasks (e.g. using MOSES or hillclimbing), and is in communica-
tion with an Operational Agent Controller software server that performs real-time agent
control and dispatches learning tasks to and receives results from the Learning Server.
• Linguistic Links: A catch-all term for Atoms explicitly representing linguistic content,
e.g. WordNode, SentenceNode, CharacterNode.
• Link: A type of Atom, representing a relationship among one or more Atoms. Links and
Nodes are the two basic kinds of Atoms.
• Link Parser: A natural language syntax parser, created by Sleator and Temperley at
Carnegie-Mellon University, and currently used as part of OpenCogPrime's natural language
comprehension and natural language generation system.
• Link2Atom: A system for translating link parser links into Atoms. It attempts to resolve
precisely as much ambiguity as needed in order to translate a given assemblage of link parser
links into a unique Atom structure.
• Lobe: A term sometimes used to refer to a portion of a distributed Atomspace that lives
in a single computational process. Often different lobes will live on different machines.
• Localized Memory: Memory that stores each item using a small number of closely-
connected elements.
• Logic: In an OpenCog context, this usually refers to a set of formal rules for translating
certain combinations of Atoms into "conclusion" Atoms. The paradigm case at present is the
PLN probabilistic logic system, but OpenCog can also be used together with other logics.
• Logical Links: Any Atoms whose truth values are primarily determined or adjusted via
logical rules, e.g. PLN's InheritanceLink, SimilarityLink, ImplicationLink, etc. The term
isn't usually applied to other links like HebbianLinks whose semantics isn't primarily
logic-based, even though these other links can be processed by (e.g. PLN) logical inference,
via interpreting them logically.
• Lojban: A constructed human language, with a completely formalized syntax and a highly
formalized semantics, and a small but active community of speakers. In principle this seems
an extremely good method for communication between humans and early-stage AGI sys-
tems.
• Lojban++: A variant of Lojban that incorporates English words, enabling more flexible
expression without the need for frequent invention of new Lojban words.
• Long Term Importance (LTI): A value associated with each Atom, indicating roughly
the expected utility to the system of keeping that Atom in RAM rather than saving it to
disk or deleting it. It's possible to have multiple LTI values pertaining to different time
scales, but so far practical implementation and most theory has centered on the option of
a single LTI value.
• LTI: Long Term Importance
• Map: A collection of Atoms that are interconnected in such a way that they tend to be
commonly active (i.e. to have high STI, e.g. enough to be in the AttentionalFocus, at the
same time).
• Map Encapsulation: The process of automatically identifying maps in the Atomspace,
and creating Atoms that "encapsulate" them; the Atom encapsulating a map would link to
all the Atoms in the map. This is a way of making global memory into local memory, thus
making the system's memory glocal and explicitly manifesting the "cognitive equation."
This may be carried out via a dedicated MapEncapsulation MindAgent.
• Map Formation: The process via which maps form in the Atomspace. This need not be
explicit; maps may form implicitly via the action of Hebbian Learning. It will commonly
occur that Atoms frequently co-occurring in the AttentionalFocus, will come to be joined
together in a map.
• Memory Types: In CogPrime this generally refers to the different types of memory that
are embodied in different data structures or processes in the CogPrime architecture, e.g.
declarative (semantic), procedural, attentional, intentional, episodic, sensorimotor.
• Mind-World Correspondence Principle: The principle that, for a mind to display
efficient pragmatic general intelligence relative to a world, it should display many of the
same key structural properties as that world. This can be formalized by modeling the world
and mind as probabilistic state transition graphs, and saying that the categories implicit
in the state transition graphs of the mind and world should be inter-mappable via a high-
probability morphism.
• Mind OS: A synonym for the OpenCog Core.
• MindAgent: An OpenCog software object, residing in the CogServer, that carries out
some processes in interaction with the Atomspace. A given conceptual cognitive process
(e.g. PLN inference, Attention allocation, etc.) may be carried out by a number of different
MindAgents designed to work together.
• Mindspace: A model of the set of states of an intelligent system as a geometrical space,
imposed by assuming some metric on the set of mind-states. This may be used as a tool for
formulating general principles about the dynamics of generally intelligent systems.
• Modulators: Parameters in the Psi model of motivated, emotional cognition, that modu-
late the way a system perceives, reasons about and interacts with the world.
• MOSES (Meta-Optimizing Semantic Evolutionary Search): An algorithm for proce-
dure learning, which in the current implementation learns programs in the Combo language.
MOSES is an evolutionary learning system, which differs from typical genetic programming
systems in multiple aspects, including: a subtler framework for managing multiple "demes"
or "islands" of candidate programs; a library, called Reduct, for placing programs in Elegant
Normal Form; and the use of probabilistic modeling in place of, or in addition to, mutation
and crossover as means of determining which new candidate programs to try.
• Motoric: Pertaining to the control of physical actuators, e.g. those connected to a robot.
May sometimes be used to refer to the control of movements of a virtual character as well.
• Moving Bubble of Attention: The Attentional Focus of a CogPrime system.
• Natural Language Comprehension: See Language Comprehension
• Natural Language Generation: See Language Generation
• Natural Language Processing (NLP): See Language Processing
• NLGen: Software for carrying out the surface realization phase of natural language gen-
eration, via translating collections of RelEx output relationships into English sentences.
Was made functional for simple sentences and some complex sentences; not currently under
active development, as work has shifted to the related Atom2Link approach to language
generation.
• Node: A type of Atom. Links and Nodes are the two basic kinds of Atoms. Nodes, math-
ematically, can be thought of as "0-ary" links. Some types of Nodes refer to external or
mathematical entities (e.g. WordNode, NumberNode); others are purely abstract, e.g. a
ConceptNode is characterized purely by the Links relating it to other Atoms. Grounded-
PredicateNodes and GroundedSchemaNodes connect to explicitly represented procedures
(sometimes in the Combo language); ungrounded PredicateNodes and SchemaNodes are
abstract and, like ConceptNodes, purely characterized by their relationships.
• Node Probability: Many PLN inference rules rely on probabilities associated with Nodes.
Node probabilities are often easiest to interpret in a specific context, e.g. the probability
P(cat) makes obvious sense in the context of a typical American house, or in the context
of the center of the sun. Without any contextual specification, P(A) is taken to mean
the probability that a randomly chosen occasion of the system's experience includes some
instance of A.
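For illustration, this reading of node probability can be sketched as a frequency count over occasions of experience; the set-of-concepts encoding of an "occasion" is a deliberate simplification for this sketch, not OpenCog's actual Atom representation:

```python
def node_probability(concept, occasions, context=None):
    """Estimate P(concept), optionally restricted to a context.

    `occasions` is a list of sets, each holding the concepts present
    in one occasion of the system's experience.
    """
    if context is not None:
        occasions = [o for o in occasions if context in o]
    if not occasions:
        return 0.0
    return sum(1 for o in occasions if concept in o) / len(occasions)

# P(cat) differs sharply between the unconditioned experience stream
# and the stream restricted to the context "house".
experience = [{"house", "cat"}, {"house"}, {"sun"}, {"house", "cat"}]
p_cat = node_probability("cat", experience)                       # 2 of 4 occasions
p_cat_in_house = node_probability("cat", experience, "house")     # 2 of 3 occasions
```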
• Novamente Cognition Engine (NCE): A proprietary proto-AGI software system, the
predecessor to OpenCog. Many parts of the NCE were open-sourced to form portions of
OpenCog, but some NCE code was not included in OpenCog; and now OpenCog includes
multiple aspects and plenty of code that was not in NCE.
• OpenCog: A software framework intended for development of AGI systems, and also for
narrow-AI application using tools that have AGI applications. Co-designed with the Cog-
Prime cognitive architecture, but not exclusively bound to it.
• OpenCog Prime (OCP): The implementation of the CogPrime cognitive architecture
within the OpenCog software framework.
• OpenPsi: CogPrime's architecture for motivation-driven action selection, which is based
on adapting Dörner's Psi model for use in the OpenCog framework.
• Operational Agent Controller (OAC): In some OpenCog configurations, this is a soft-
ware server containing a CogServer devoted to real-time control of an agent (e.g. a virtual
world agent, or a robot). Background, offline learning tasks may then be dispatched to other
software processes, e.g. to a Learning Server.
• Pattern: In a CogPrime context, the term "pattern" is generally used to refer to a process
that produces some entity, and is judged simpler than that entity.
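This definition can be made concrete by taking string length as the simplicity measure (an illustrative simplification; `run` here is just Python's `eval`, standing in for whatever process-execution machinery is at hand):

```python
def is_pattern(program_src, entity, run):
    """A process counts as a pattern in an entity if running it produces
    the entity, and the process is simpler (here: shorter) than the entity."""
    return run(program_src) == entity and len(program_src) < len(entity)

entity = "ab" * 50          # a 100-character string
program = "'ab' * 50"       # a 9-character process that produces it
found = is_pattern(program, entity, eval)   # the short process is a pattern
```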
• Pattern Mining: Pattern mining is the process of extracting an (often large) number of
patterns from some body of information, subject to some criterion regarding which patterns
are of interest. Often (but not exclusively) it refers to algorithms that are rapid or "greedy",
finding a large number of simple patterns relatively inexpensively.
• Pattern Recognition: The process of identifying and representing a pattern in some
substrate (e.g. some collection of Atoms, or some raw perceptual data, etc.).
• Patternism: The philosophical principle holding that, from the perspective of engineering
intelligent systems, it is sufficient and useful to think about mental processes in terms of
(static and dynamical) patterns.
• Perception: The process of understanding data from sensors. When natural language is
ingested in textual format, this is generally not considered perceptual. Perception may be
taken to encompass both pre-processing that prepares sensory data for ingestion into the
Atomspace, processing via specialized perception processing systems like DeSTIN that are
connected to the Atomspace, and more cognitive-level processing within the Atomspace that
is oriented toward understanding what has been sensed.
• Piagetan Stages: A series of stages of cognitive development hypothesized by develop-
mental psychologist Jean Piaget, which are easy to interpret in the context of developing
CogPrime systems. The basic stages are: Infantile, Pre-operational, Concrete Operational
and Formal. Post-formal stages have been discussed by theorists since Piaget and seem
relevant to AGI, especially advanced AGI systems capable of strong self-modification.
• PLN: Short for Probabilistic Logic Networks
• PLN, First-Order: See First-Order Inference
• PLN, Higher-Order: See Higher-Order Inference
• PLN Rules: A PLN Rule takes as input one or more Atoms (the "premises", usually Links),
and outputs an Atom that is a "logical conclusion" of those Atoms. The truth value of the
conclusion is determined by a PLN Formula associated with the Rule.
• PLN Formulas: A PLN Formula, corresponding to a PLN Rule, takes the TruthValues
corresponding to the premises and produces the TruthValue corresponding to the conclusion.
A single Rule may correspond to multiple Formulas, where each Formula deals with a
different sort of TruthValue.
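The Rule/Formula division can be sketched for first-order deduction. The strength formula below is the standard independence-based PLN deduction heuristic; the dict encoding of Atoms and the min-based confidence combination are illustrative assumptions, not OpenCog's actual API or revision formula:

```python
def deduction_strength(s_ab, s_bc, s_b, s_c):
    """Strength Formula for first-order deduction under an independence
    assumption: from Inheritance A B and Inheritance B C, estimate the
    strength of Inheritance A C."""
    if s_b >= 1.0:                     # degenerate case: B covers everything
        return s_c
    return s_ab * s_bc + (1.0 - s_ab) * (s_c - s_b * s_bc) / (1.0 - s_b)

def deduction_rule(premise_ab, premise_bc, term_probs):
    """The Rule side: consume premise Atoms (encoded here as plain dicts),
    apply the Formula to their TruthValues, emit the conclusion Atom."""
    s_ab, c_ab = premise_ab["tv"]
    s_bc, c_bc = premise_bc["tv"]
    s = deduction_strength(s_ab, s_bc, term_probs["B"], term_probs["C"])
    c = min(c_ab, c_bc)               # placeholder confidence combination
    return {"link": ("Inheritance", premise_ab["link"][1], premise_bc["link"][2]),
            "tv": (s, c)}

conclusion = deduction_rule(
    {"link": ("Inheritance", "A", "B"), "tv": (0.8, 0.9)},
    {"link": ("Inheritance", "B", "C"), "tv": (0.9, 0.7)},
    {"B": 0.5, "C": 0.6})
```

Note how one Rule (the premise-to-conclusion structure) could be paired with different Formulas for different TruthValue types, as the entry describes.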
• Pragmatic General Intelligence: A formalization of the concept of general intelligence,
based on the concept that general intelligence is the capability to achieve goals in environ-
ments, calculated as a weighted average over some fuzzy set of goals and environments.
• Predicate Evaluation: The process of determining the Truth Value of a predicate, embodied
in a PredicateNode. This may be recursive, as the predicate referenced internally by a
GroundedPredicateNode (and represented via a Combo program tree) may itself internally
reference other PredicateNodes.
• Probabilistic Logic Networks (PLN): A mathematical and conceptual framework for
reasoning under uncertainty, integrating aspects of predicate and term logic with extensions
of imprecise probability theory. OpenCogPrime's central tool for symbolic reasoning.
• Procedural Knowledge: Knowledge regarding which series of actions (or action-combinations)
are useful for an agent to undertake in which circumstances. In CogPrime these may be
learned in a number of ways, e.g. via PLN or via Hebbian learning of Schema Maps, or via
explicit learning of Combo programs via MOSES or hillclimbing. Procedures are represented
as SchemaNodes or Schema Maps.
• Procedure Evaluation/Execution: A general term encompassing both Schema Execu-
tion and Predicate Evaluation, both of which are similar computational processes involving
manipulation of Combo trees associated with ProcedureNodes.
• Procedure Learning: Learning of procedural knowledge, based on any method, e.g. evo-
lutionary learning (e.g. MOSES), inference (e.g. PLN), reinforcement learning (e.g. Hebbian
learning).
• Procedure Node: A SchemaNode or PredicateNode
• Psi: A model of motivated action and emotion, originated by Dietrich Dörner and further
developed by Joscha Bach, who incorporated it in his proto-AGI system MicroPsi. OpenCog-
Prime's motivated-action component, OpenPsi, is roughly based on the Psi model.
• Psynese: A system enabling different OpenCog instances to communicate without using
natural language, via directly exchanging Atom subgraphs, using a special system to map
references in the speaker's mind into matching references in the listener's mind.
• Psynet Model: An early version of the theory of mind underlying CogPrime, referred to
in some early writings on the Webmind AI Engine and Novamente Cognition Engine. The
concepts underlying the psynet model are still part of the theory underlying CogPrime, but
the name has been deprecated as it never really caught on.
• Reasoning: See inference
• Reduct: A code library, used within MOSES, applying a collection of hand-coded rewrite
rules that transform Combo programs into Elegant Normal Form.
• Region Connection Calculus: A mathematical formalism describing a system of basic
operations among spatial regions. Used in CogPrime as part of spatial inference to provide
relations and rules to be referenced via PLN and potentially other subsystems.
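For intuition, the eight base relations of the RCC-8 calculus (DC, EC, PO, TPP, NTPP, their inverses, and EQ) can be classified for one-dimensional regions; restricting regions to closed intervals is an illustrative simplification, since RCC proper is defined over arbitrary spatial regions:

```python
def rcc8(a, b):
    """Classify the RCC-8 relation between closed intervals a=(a1,a2), b=(b1,b2)."""
    a1, a2 = a
    b1, b2 = b
    if a == b:
        return "EQ"                 # identical regions
    if a2 < b1 or b2 < a1:
        return "DC"                 # disconnected
    if a2 == b1 or b2 == a1:
        return "EC"                 # externally connected (touch at a point)
    if b1 < a1 and a2 < b2:
        return "NTPP"               # a strictly inside b
    if a1 < b1 and b2 < a2:
        return "NTPPi"              # b strictly inside a
    if b1 <= a1 and a2 <= b2:
        return "TPP"                # a inside b, sharing a boundary point
    if a1 <= b1 and b2 <= a2:
        return "TPPi"
    return "PO"                     # partial overlap
```

A PLN rule base over these relations can then, e.g., chain TPP and NTPP to conclude NTPP, following the RCC composition table.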
• Reinforcement Learning: Learning procedures via experience, in a manner explicitly
guided to cause the learning of procedures that will maximize the system's expected future
reward. CogPrime does this implicitly whenever it tries to learn procedures that will maxi-
mize some Goal whose Truth Value is estimated via an expected reward calculation (where
"reward" may mean simply the Truth Value of some Atom defined as "reward"). Goal-driven
learning is more general than reinforcement learning as thus defined; and the learning that
CogPrime does, which is only partially goal-driven, is yet more general.
• RelEx: A software system used in OpenCog as part of natural language comprehension, to
map the output of the link parser into more abstract semantic relationships. These more
abstract relationships may then be entered directly into the Atomspace, or they may be
further abstracted before being entered into the Atomspace, e.g. by RelEx2Frame rules.
• RelEx2Frame: A system of rules for translating RelEx output into Atoms, based on the
FrameNet ontology. The output of the RelEx2Frame rules makes use of the FrameNet library
of semantic relationships. The current (2012) RelEx2Frame rule-base is problematic and
the RelEx2Frame system is deprecated as a result, in favor of Link2Atom. However, the
ideas embodied in these rules may be useful; if cleaned up the rules might profitably be
ported into the Atomspace as ImplicationLinks.
• Representation Building: A stage within MOSES, wherein a candidate Combo program
tree (within a deme) is modified by replacing one or more tree nodes with alternative tree
nodes, thus obtaining a new, different candidate program within that deme. This process
currently relies on hand-coded knowledge regarding which types of tree nodes a given tree
node should be experimentally replaced with (e.g. an AND node might sensibly be replaced
with an OR node, but not so sensibly replaced with a node representing a "kick" action).
• Request for Services (RFS): In CogPrime's Goal-driven action system, an RFS is a
package sent from a Goal Atom to another Atom, offering it a certain amount of STI
currency if it is able to deliver to the goal what it wants (an increase in its Truth Value).
RFS's may be passed on, e.g. from goals to subgoals to sub-subgoals, but eventually an
RFS reaches a GroundedSchemaNode, and when the corresponding Schema is executed,
the payment implicit in the RFS is made.
• Robot Preschool: An AGI Preschool in our physical world, intended for robotically em-
bodied AGIs.
• Robotic Embodiment: Using an AGI to control a robot. The AGI may be running on
hardware physically contained in the robot, or may run elsewhere and control the robot via
networking methods such as wifi.
• Scheduler: Part of the CogServer that controls which processes (e.g. which MindAgents)
get processor time, at which point in time.
• Schema: A "script" describing a process to be carried out. This may be explicit, as in the
case of a GroundedSchemaNode, or implicit, as is the case with Schema Maps or ungrounded
SchemaNodes.
• Schema Encapsulation: The process of automatically recognizing a Schema Map in an
Atomspace, and creating a Combo (or other) program embodying the process carried out
by this Schema Map, and then storing this program in the Procedure Repository and
associating it with a particular SchemaNode. This translates distributed, global procedural
memory into localized procedural memory. It's a special case of Map Encapsulation.
• Schema Execution: The process of "running" a Grounded Schema, similar to running a
computer program. Or, phrased alternately: the process of executing the Schema referenced
by a GroundedSchemaNode. This may be recursive, as the schema referenced internally by
a GroundedSchemaNode (and represented via a Combo program tree) may itself internally
reference other GroundedSchemaNodes.
• Schema, Grounded: A Schema that is associated with a specific executable program
(either a Combo program or, say, C++ code).
• Schema Map: A collection of Atoms, including SchemaNodes, that tend to be enacted
in a certain order (or set of orders), thus habitually enacting the same process. This is a
distributed, globalized way of storing and enacting procedures.
• Schema, Ungrounded: A Schema that represents an abstract procedure, not associated
with any particular executable program.
• Schematic Implication: A general, conceptual name for implications of the form ((Context
AND Procedure) IMPLIES Goal).
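The strength of such an implication can be estimated empirically from experience, as in the sketch below; the episode-tuple encoding (what held, what was done, what was then achieved) is an illustrative assumption, not CogPrime's actual experience representation:

```python
def schematic_implication_strength(episodes, context, procedure, goal):
    """Estimate P(goal | context AND procedure) from a log of episodes.

    Each episode is a tuple (contexts, procedures, goals) of sets.
    """
    relevant = [e for e in episodes
                if context in e[0] and procedure in e[1]]
    if not relevant:
        return None  # no evidence for this (context, procedure) pair
    return sum(1 for e in relevant if goal in e[2]) / len(relevant)

episodes = [
    ({"near_ball"}, {"kick"}, {"ball_moved"}),
    ({"near_ball"}, {"kick"}, {"ball_moved"}),
    ({"near_ball"}, {"wave"}, set()),
    ({"far_from_ball"}, {"kick"}, set()),
]
s = schematic_implication_strength(episodes, "near_ball", "kick", "ball_moved")
```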
• SegSim: A name for the main algorithm underlying the NLGen language generation soft-
ware. The algorithm is based on segmenting a collection of Atoms into small parts, and
matching each part against memory, to find, for each part, cases where similar Atom-
collections already have known linguistic expression.
• Self-Modification: A term generally used for AI systems that can purposefully modify
their core algorithms and representations. Formally and crisply distinguishing this sort of
"strong self-modification" from "mere" learning is a tricky matter.
• Sensorimotor: Pertaining to sensory data, motoric actions, and their combination and
intersection.
• Sensory: Pertaining to data received by the AGI system from the outside world. In a
CogPrime system that perceives language directly as text, the textual input will generally
not be considered as "sensory" (on the other hand, speech audio data would be considered
as "sensory").
• Short Term Importance: A value associated with each Atom, indicating roughly the
expected utility to the system of keeping that Atom in RAM rather than saving it to disk
or deleting it. It's possible to have multiple STI values pertaining to different time scales,
but so far practical implementation and most theory have centered on the option of a single
STI value.
• Similarity: A link type indicating the probabilistic similarity between two different Atoms.
Generically this is a combination of Intensional Similarity (similarity of properties) and
Extensional Similarity (similarity of members).
• Simple Truth Value: A TruthValue consisting of a pair (s, d) indicating strength s (e.g.
probability or fuzzy set membership) and confidence d. The confidence d may be replaced
by other options such as a count n or a weight of evidence w.
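A minimal sketch of such a value, including conversion between an evidence count n and a confidence in [0, 1) via c = n / (n + K); the mapping is the standard PLN-style one, but the specific value of the "lookahead" parameter K below is an arbitrary illustrative choice:

```python
from dataclasses import dataclass

K = 10.0  # lookahead/personality parameter; value chosen for illustration only

@dataclass
class SimpleTruthValue:
    strength: float    # probability or fuzzy membership degree, in [0, 1]
    confidence: float  # weight of evidence mapped into [0, 1)

    @classmethod
    def from_count(cls, strength, n):
        """Build from an evidence count n via c = n / (n + K)."""
        return cls(strength, n / (n + K))

    def count(self):
        """Invert the mapping: n = K * c / (1 - c)."""
        return K * self.confidence / (1.0 - self.confidence)

tv = SimpleTruthValue.from_count(0.9, 10)   # 10 observations -> c = 0.5
```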
• Simulation World: See Internal Simulation World
• SMEPH (Self-Modifying Evolving Probabilistic Hypergraphs): A style of modeling
systems, in which each system is associated with a derived hypergraph.
• SMEPH Edge: A link in a SMEPH derived hypergraph, indicating an empirically observed
relationship (e.g. inheritance or similarity) between two SMEPH Vertices.
• SMEPH Vertex: A node in a SMEPH derived hypergraph representing a system, indicat-
ing a collection of system states empirically observed to arise in conjunction with the same
external stimuli.
• Spatial Inference: PLN reasoning including Atoms that explicitly reference spatial rela-
tionships.
• Spatiotemporal Inference: PLN reasoning including Atoms that explicitly reference spa-
tial and temporal relationships.
• STI: Shorthand for Short Term Importance
• Strength: The main component of a TruthValue object, lying in the interval [0, 1], refer-
ring either to a probability (in cases like InheritanceLink, SimilarityLink, EquivalenceLink,
ImplicationLink, etc.) or a fuzzy value (as in MemberLink, EvaluationLink).
• Strong Self-Modification: This is generally used as synonymous with Self-Modification,
in a CogPrime context.
• Subsymbolic: Involving processing of data using elements that have no correspondence to
natural language terms, nor abstract concepts; and that are not naturally interpreted as
symbolically "standing for" other things. Often used to refer to processes such as perception
processing or motor control, which are concerned with entities like pixels or commands like
"rotate servomotor 15 by 10 degrees theta and 55 degrees phi." The distinction between
"symbolic" and "subsymbolic" is conventional in the history of AI, but seems difficult to
formalize rigorously; logic-based AI systems, for instance, are typically considered "symbolic".
• Supercompilation: A technique for program optimization, which globally rewrites a pro-
gram into a usually very different looking program that does the same thing. A prototype
supercompiler was applied to Combo programs with successful results.
• Surface Realization: The process of taking a collection of Atoms and transforming them
into a series of words in a (usually natural) language. A stage in the overall process of
language generation.
• Symbol Grounding: The mapping of a symbolic term into perceptual or motoric entities
that help define the meaning of the symbolic term. For instance, the concept "Cat" may be
grounded by images of cats, experiences of interactions with cats, imaginations of being a
cat, etc.
• Symbolic: Pertaining to the formation or manipulation of symbols, i.e. mental entities that
are explicitly constructed to represent other entities. Often contrasted with subsymbolic.
• Syntax-Semantics Correlation: In the context of MOSES and program learning more
broadly, this refers to the property via which distance in syntactic space (distance between
the syntactic structure of programs, e.g. if they're represented as program trees) and se-
mantic space (distance between the behaviors of programs, e.g. if they're represented as
sets of input/output pairs) are reasonably well correlated. This can often happen among
sets of programs that are not too widely dispersed in program space. The Reduct library
is used to place Combo programs in Elegant Normal Form, which increases the level of
syntax-semantics correlation between them. The programs in a single MOSES deme are
often closely enough clustered together that they have reasonably high syntax-semantics
correlation.
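The property can be measured directly on a toy program space. The sketch below uses Hamming distance between truth tables as semantic distance and a crude token-set difference as a stand-in for tree-edit distance (both distance choices, and the tiny program set, are illustrative assumptions):

```python
from itertools import product

# Tiny boolean programs over inputs (x, y): (source string, behavior) pairs.
programs = [
    ("and(x y)",      lambda x, y: x and y),
    ("or(x y)",       lambda x, y: x or y),
    ("and(x not(y))", lambda x, y: x and not y),
    ("not(x)",        lambda x, y: not x),
]

def semantic_distance(f, g):
    """Behavioral distance: Hamming distance between the truth tables."""
    return sum(f(x, y) != g(x, y) for x, y in product([False, True], repeat=2))

def syntactic_distance(a, b):
    """Crude proxy for tree-edit distance: symmetric difference of token sets."""
    tok = lambda s: set(s.replace("(", " ").replace(")", " ").split())
    return len(tok(a) ^ tok(b))

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

pairs = [(p, q) for i, p in enumerate(programs) for q in programs[i + 1:]]
syn = [syntactic_distance(p[0], q[0]) for p, q in pairs]
sem = [semantic_distance(p[1], q[1]) for p, q in pairs]
r = pearson(syn, sem)   # positive: syntactically close pairs behave similarly
```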
• System Activity Table: An OpenCog component that records information regarding
what a system did in the past.
• Temporal Inference: Reasoning that heavily involves Atoms representing temporal in-
formation, e.g. information about the duration of events, or their temporal relationship
(before, after, during, beginning, ending). As implemented in CogPrime, makes use of an
uncertain version of Allen Interval Algebra.
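The thirteen crisp base relations of Allen's Interval Algebra, which the uncertain version generalizes, can be classified as follows; encoding events as endpoint tuples is an illustrative simplification:

```python
def allen_relation(a, b):
    """Classify the Allen relation of interval a = (a1, a2) relative to
    b = (b1, b2), assuming a1 < a2 and b1 < b2."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if (a1, a2) == (b1, b2):
        return "equals"
    if a1 == b1:                       # same start, different end
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:                       # same end, different start
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"
```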
• Truth Value: A package of information associated with an Atom, indicating its degree
of truth. SimpleTruthValue and IndefiniteTruthValue are two common, particular kinds.
Multiple truth values associated with the same Atom from different perspectives may be
grouped into CompositeTruthValue objects.
• Universal Intelligence: A technical term introduced by Shane Legg and Marcus Hutter,
describing (roughly speaking) the average capability of a system to carry out computable
goals in computable environments, where goal/environment pairs are weighted via the length
of the shortest program for computing them.
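Legg and Hutter's definition can be written compactly as

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
```

where $\pi$ is the agent, $E$ the set of computable environments, $K(\mu)$ the length of the shortest program computing environment $\mu$ (its Kolmogorov complexity), and $V_{\mu}^{\pi}$ the expected cumulative reward $\pi$ achieves in $\mu$.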
• Urge: In OpenPsi, an Urge develops when a Demand deviates from its target range.
• Very Long Term Importance (VLTI): A bit associated with Atoms, which determines
whether, when an Atom is forgotten (removed from RAM), it is saved to disk (frozen) or
simply deleted.
• Virtual AGI Preschool: A virtual world intended for AGI teaching/training/learning,
bearing broad resemblance to the preschool environments used for young humans.
• Virtual Embodiment: Using an AGI to control an agent living in a virtual world or game
world, typically (but not necessarily) a 3D world with broad similarity to the everyday
human world.
• Webmind AI Engine: A predecessor to the Novamente Cognition Engine and OpenCog,
developed 1997-2001, with many similar concepts (and also some different ones) but quite
different algorithms and software architecture.
EFTA00624675
References 529
References
ABS+11. Itamar Arel, $ Berant, T Slonint, A Nfoyal, B Li, and K Chai Sim. Acoustic spatiotemporal
modeling using deep machine learning for robust phoneme recognition. In Afeka-AVIOS Speech
Processing Conference, 2011.
A1183. James F. Allen. Maintaining knowledge about temporal. Intervals CACM, 26:198-3, 1983.
AMOI. J. S. Albus and A. M. Nfeystel. Engineering of Mind: An Introduction to the Science of Intelligent
Systems. Wiley and Sons, 2001.
Ama85. S. Amari. Differential-geometrical methods in statistics. Lecture notes in statistics, 1985.
Ama98. S. Amari. Natural gradient works efficiently in learning. Neural Computing, 10:251-276, 1998.
ANOO. Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry. ANIS, 2000.
ARC09a. I. Aral, D. Rose, and R. Coop. Destin: A scalable deep learning architecture with application
to high-dimensional robust pattern recognition. Proc. AAA! Workshop on Biologically Inspired
Cognitive Architectures, 2009.
ARC09b. Itamtu. Arel, Derek Rose, and Robert Coop. A biologically-inspired deep learning architecture with
application to high-dimensional pattern recognition. In Biologically Inspired Cognitive Architec-
tures, 2009. AAAI Press, 2009.
ARKO9. I. Arel, D. Rose, and T. Karnowski. A deep learning architecture comprising homogeneous cortical
circuits for scalable spatiotemporal pattern inference. NIPS 2009 Workshop on Deep Learning for
Speech Recognition and Related Applications, 2009.
Arn69. Rudolf Arnheim. Visual Thinking. University of California Press. Berkeley, 1969.
AS94. Rakesh Agrawal and Ra0takrishnan Srilaint. Fast algorithms for mining association rules. In Proc.
20th Int. Conf. Very Large Data Bases, 1994.
Ash65. Robert B. Ash. Information Theory. Dover Publications, 1965.
Bau04. E. B. Baum. What is Thought? MIT Press, 2004.
Bau06. E. Baum. A working hypothesis for general intelligence. In Advances in Artificial General
ligence: Concepts, Architectures and Algorithms, 2006.
BE07. Neven Boric and Pablo A. Estevez. Genetic programming-based clustering using an information
theoretic fitness measure. In Dipti Srinivasan and Lipo Wang, editors, 2007 IEEE Congress on
Evolutionary Computation, pages 31-38, Singapore, 25-28 September 2007. IEEE Computational
Intelligence Society, IEEE Press.
Bel03. Anthony J. Bell. The co-information lattice. Somewhere or other, 2003.
Ben94. Brandon Bennett. Spatial reasoning with propositional logics. In Principles of Knowledge Repre-
sentation and Reasoning: Proceedings of the 4th International Conference (KR94), pages 51-62.
Morgan Kaufmann, 1994.
BF97. A Blum and M Furst. Fast planning through planning graph analysis. Artificial intelligence, 1997.
BH10. Bundzel and Hashimoto. Object identification in dynamic images based on the memory-prediction
theory of brain function. Journal of Intelligent Learning Systems and Applications, 2-4, 2010.
Bic08. Derek Bickerton. Bastard Tongues. Hill and Wang, 2008.
BKLO6. Aline Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In Proc.
International Conference on Machine Learning, 2006.
BL99. Avrim Blum and John Langford. Probabilistic planning in the graphplan framework. In 5th
European Conference on Planning (ECP '99), 1999.
Bor05. Christian Borgelt. Keeping things simple: Finding frequent item sets by recursive elimination. In
Workshop on Open Source Data Mining Software (OSDM'05). Chicago IL, pages 66-70. 2005.
Car06. Pereira Francisco Cara. Creativity and Artificial Intelligence: A Conceptual Blending Approach,
Applications of Cognitive Linguistics. Amsterdam: Mouton de Gruyter, 2006.
Cas04. N. L. Cassimatis. Grammatical processing using the mechanisms of physical inferences. In Pro-
ceedings of the Twentieth-Sixth Annual Conference of the Cognitive Science Society. 2004.
CB00. W. H. Calvin and D. Bickerton. Lingua er Machin. MIT Press, 2000.
CFH97. Eliseo Clementini, Paolino Di Felice, and Daniel HernAindez. Qualitative representation of posi-
tional information. Artificial Intelligence, 95:317-356, 1997.
COPH09. Lucio Coelho, Ben Goertzel, Cassio Pennachin, and Chris Howard. Classifier ensemble based
analysis of a genome-wide snp dataset concerning late-onset alzheimer disease. In Proceedings of
8th IEEE International Conference on Cognitive Informatics., 2009.
Cha0s. Gregory Chaitin. Algorithmic Information Theory. Cambridge University Press, 2008.
EFTA00624676
530 A Glossary
Cha09. Mark Changizi. The Vision Revolution. l3enBella Books, 2009.
Che97. K. Chellapilla. Evolving computer programs without subtree crossover. IEEE Transactions on
Evolutionary Computation, 1997.
Coh95. A.C. Cohn. A hierarchical representation of qualitative shape based on connection and convexity.
In Proc COSIT95, LNCS, pages 311-326. Springer Verlag, 1995.
Cox61. Richard Cox. The Algebra of Probable Inference. Johns Hopkins University Press, 1961.
CS10. Shay B. Cohen and Noah A. Smith. Covariance in unsupervised learning of probabilistic grammars.
Journal of Machine Learning Research, 11:3117-3151, 2010.
CSZ06. Olivier Chapelk, Bernhard Schakopf, and Alexander Zien. Semi-Supervised Learning. MIT Press,
2006.
CXYNI05. Yun Chi, Yi Xia, Yirong Yang, and Richard Ft Muntz. Mining closed and maximal frequent subtrees
from databases of labeled rooted trees IEEE Trans. Knowledge and Data Engineering. 2005.
Dab99. A.C. Dabak. A Geometry for Detection Theory. PhD Thesis, Rice U., 1999.
Dea98. Terrence Deacon. The Symbolic Species. Norton, 1998.
dF37. Bnmo de Finetti. La prevision: ses lois logiques, sea sources subjectives,. Annetta de l'In.stitut
Henri Poincard, 1937.
DP09. Yassine Djouadi and Henri Prade. Interval-valued fuzzy formal concept analysis. In ISMS '09:
Proc. of the 18th International Symposium on Foundations of Intelligent Systems, pages 592-601,
Berlin, Heidelberg, 2009. Springer-Verlag.
dS77. Ferdinand de Saussure. Course in General Linguistics. Flmtana/Collins, 1977. Orig. published
1916 as "emirs de linguistique generale".
EBJ+97. J. Elman, E. Bates, M. Johnson, A. Karmiloff-Smith, D. Parisi. and K. Plunkett. Rethinking
Innateness: A Connectionist Perspective on Development. MIT Press, 1997.
Ede93. Gerald Edelman. Neural darwinism: Selection and reentrant signaling in higher brain function.
Neuron, 10, 1993.
PF92. Christian Freksa and Robert Milton. Temporal reasoning based on semi-intervals. Artificial Intel-
ligence, 54(1-2):199 - 227, 1992.
PLI2. Jeremy Fishel and Gerald Loeb. Bayesian exploration for intelligent identification of textures.
Frontiers in Neurorobotics 6-4, 2012.
Pri98. Roy. FYieden. Physics from Fisher Information. Cambridge U. Press, 1998.
PT02. G. Pauconnier and M. Turner. The Way We Think: Conceptual Blending and the Mind's Hidden
Complexities. Basic, 2002.
Gar00. Peter Gardenfors. Conceptual spaces: the geometry of thought. MIT Press, 2000.
GBK04. S. Gustafson. E. K. Burke, and G. Kendall. Sampling of unique structures and behaviours in
genetic programming. In European Conf. on Genetic Programming. 2004.
CCPM06. Ben Goertzel, Lucio Coelho, Cassio Pennachin, and Mauricio Mudada. Identifying Complex Bio-
logkal Interactions based on Categorical Gene Expression Data. In Proceedings of Conference on
Evolutionary Computing. Vancouver CA. 2006.
CE01. Roop Goyal and Max Egenhofer. S" 'larity in cardinal directions. In in Proc. of the Seventh
International Symposium on Spatial and Temporal Databases, pages 36-55. Springer-Verlag, 2001.
Cea05. Ben Coertzel and et al. Combinations of single nucleotide polymorphisms in neuroendocrine
effector and receptor genes predict chronic fatigue syndrome. Pharmacogenatnics, 2005.
GEA08. Ben Gone! and Cassie. Pennachin Et Al. An integrative methodology, for teaching embodied
non-linguistic agents, applied to virtual animals in second life. In Proc.of the First Conf. on AGL
IOS Press, 2008.
Ceal3. Ben Goertzel and et al. The cogprime architecture for embodied artificial general intelligence. In
Proceedings of IEEE Symposium on Human-Level Al, Singapore, 2013.
CCC+ 11. Ben Goertzel, Nil Geisweiller, Lucio Coelho, Predrag Janicic, and Cassio Pennachin. Real World
Reasoning. Atlantis, 2011.
CH11. N Garg and J Henderson. Temporal restricted boltzmann machines for dependency parsing. In
Proc. ACL, 2011.
GIII. B. Coertzel and M. Ikle. Steps toward a geometry of mind. In J Schmidhuber and K Thorisson,
editors, Subm.to ACI-11. Springer, 2011.
CICH08. B. Goertzel, M. Ikle, I. Goertzel, and A. Heljakka. Probabilistic Logic Networks. Springer, 2008.
CKD89. D. E. Goldberg, B. Korb, and K. Deb. Messy genetic algorithms: Motivation, analysis, and first
results. Complex Systems, 1989.
EFTA00624677
References 531
CLI0. Ben Coertzel and Ruiting Lian. A probabilistic characterization of fuzzy semantics. Proc. of
ICAI-10, Beijing, 2010.
CLdG+ 10. Ben Coertzel, Ruiting Lien, Hugo de Gar's, Shuo Chen, and hamar Arel. World survey of artificial
brains, part ii: Biologically inspired cognitive architectures. Neurocomputing, April 2010.
GM11108. B. Goertzel, I. Coertzel M. lkle, and A. Heljakka. Probabilistic Logic Networks. Springer, 2008.
GN02. Alfonso Gerevini and Bernhard Nebel. Qualitative spatio-temporal reasoning with rcc-8 and alien's
interval calculus: Computational complexity. In Frank van Harmelen, editor, ECAI, pages 312-316.
IOS Press, 2002.
Goe94. Ben Coertzel. Chaotic Logic. Plenum, 1994.
Coe06. Ben Coertzel. The Hidden Pattern. Brown Walker, 2006.
Coe08a. B. Coertzel. The pleasure algorithm. groups.google.com/group/opencog/files, 2008.
Coe08b. Ben Coertzel. A pragmatic path toward endowing virtually-embodied ais with human-level lin-
guistic capability. IEEE World Congress on Computational Intelligence (Wee!), 2008.
Goel0a. Ben Goertzel. Infinite-order probabilities and their application to modeling self-referential seman-
tics. In Proceedings of Conference on Advanced Intelligence 2010, Beijing, 2010.
Goel0b. Ben et al Coertzel. A general intelligence oriented architecture for embodied natural language
processing. In Proc. of the Third Conf. on Artificial General Intelligence (ACI-10). Atlantis
Press, 2010.
Cool 1 a. B Coertzel. Integrating a compositional spatiotemporal deep learning network with symbolic
representation/reasoning within an integrative cognitive architecture via an intermediary semantic
network. In Proceedings of AAA! Symposium on Cognitive Systems„ 2011.
Goel1 b. Ben Goertzel. Imprecise probability as a linking mechanism between deep learning, symbolic
cognition and local feature detection in vision processing. In Proc. of AC!-11, 2011.
CPPG116. Ben Coertzel, Hugo Pinto, Cassio Pennachin, and Izabela Freire Coertzel. Using dependency pars-
ing and probabilistic inference to extract relationships between genes, proteins and malignancies
implicit among multiple biomedical research abstracts. In Proc. of Bio-NLP 2006, 2006.
CR00. Alfonso Gerevini and Jochett Renz. Combining topological and size information for spatial reason-
ing. Artificial Intelligence, 137:2002, 2000.
GSW05. Bernhard Canter, Gerd Stumme, and Rudolf Wille. Formal Concept Analysis: Foundations and
Applications. Springer-Verlag, 2005.
HB06. Jeff Hawkins and Sandra Blakeslee. On Intelligence. Brown Walker, 2006.
HDY+12. Geoffrey Hinton, Li Deng, bong Yu, George bald, Abdel rahman Mohamed, Navdeep Jaitly,
Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, and Tara Sainathand Brian Kingsbury. Deep
neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine,
2012.
141407. Barbara Hammer and Pascal Ritzier, editors. Perspectives of Neural-Symbolic Integration. Studies
in Computational Intelligence, Vol. 77. Springer, 2007.
Hi189. Daniel Hillis. The Connection Machine. MIT Press, 1989.
HK02. David Harel and Yehuda Koren. Graph Drawing by High-Dimensional Embedding. 2002.
Hob78. J. Hobbs. Resolving pronoun references. Lingua, 44:311-338, 1978.
Hof79. Douglas Holstadter. Codel, Escher, Bach: An Eternal Golden Braid. Basic, 1979.
Hol75. J. R. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.
Hud84. Richard Hudson. Word Grammar. Oxford: Blackwell, 1984.
Htic190. Richard Hudson. English Word Grammar. Blackwell Press, 1990.
Hud07a. Richard Hudson. Language Networks. The new Word Grammar. Oxford University Press, 2007.
Hud07b. Richard Hudson. Language Networks: The New Word Grammar. Oxford Linguistics, 2007.
Hut99. G. Hutton. A tutorial on the universality and expressiveness of fold. Journal of anctional
Programming, 1999.
Hut05a. Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Prob-
ability. Springer, 2005.
Hut05b. Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Prob-
ability. Springer, 2005.
HWP03. Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraph in the presence of
isomorphism. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM),
pages 549-552. 2003.
Jac03. Ray Jackendoff. Foundations of Language: Brain, Meaning, Grammar, Evolution. Oxford Uni-
versity Press, 2003.
JL08. D. J. Jilk, C. Lebiere, R. C. O'Reilly, and J. R. Anderson. SAL: An explicitly pluralistic cognitive
architecture. Journal of Experimental and Theoretical Artificial Intelligence, 20:197-218, 2008.
Joh05. Mark Johnson. Developmental Cognitive Neuroscience. Wiley-Blackwell, 2005.
Jol10. I. T. Jolliffe. Principal Component Analysis. Springer, 2010.
KA95. J. R. Koza and D. Andre. Parallel genetic programming on a network of transputers. Technical
report, Stanford University, 1995.
KAR10. Tom Karnowski, Itamar Arel, and D. Rose. Deep spatiotemporal feature learning with application
to image classification. In The 9th International Conference on Machine Learning and Applications
(ICMLA '10), 2010.
KK01. Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In Proceedings of the
2001 IEEE International Conference on Data Mining, pages 313-320. 2001.
KM04. Dan Klein and Christopher D. Manning. Corpus-based induction of syntactic structure: Models of
dependency and constituency. In ACL '04 Proceedings of the 42nd Annual Meeting on Association
for Computational Linguistics, pages 479-486. Association for Computational Linguistics, 2004.
Koh01. Teuvo Kohonen. Self-Organizing Maps. Springer, 2001.
Koz92. J. R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural
Selection. MIT Press, 1992.
Koz94. J. R. Koza. Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press,
1994.
KSPC13. Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, Stephen Pulman, and Bob Coecke. Reasoning about
meaning in natural language with compact closed categories and Frobenius algebras. 2013.
Kur09. Yohei Kurata. 9-intersection calculi for spatial reasoning on the topological relations between
multi-domain objects. USA, June 2009.
Kur12. Ray Kurzweil. How to Create a Mind. Viking, 2012.
LA93. C. Lebiere and J. R. Anderson. A connectionist implementation of the ACT-R production system. In
Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, 1993.
Lai12. John E Laird. The Soar Cognitive Architecture. MIT Press, 2012.
LBH10. Jens Lehmann, Sebastian Bader, and Pascal Hitzler. Extracting reduced logic programs from
artificial neural networks. Applied Intelligence, 2010.
LDA05. Ramon Lopez-Cozar Delgado and Masahiro Araki. Spoken, Multilingual and Multimodal Dialogue
Systems: Development and Assessment. Wiley, 2005.
Lem10. B. Lemoine. NLGen2: a linguistically plausible, general purpose natural language generation system.
http://www.louisiana.edu/??ba12277/WLGen2, 2010.
Lev94. L. Levin. Randomness and nondeterminism. In The International Congress of Mathematicians,
1994.
LGE10. Ruiting Lian, Ben Goertzel, et al. Language generation via glocal similarity matching.
Neurocomputing, 2010.
LGK+12. Ruiting Lian, Ben Goertzel, Shujing Ke, Jade O'Neill, Keyvan Sadeghi, Simon Shin, Dingjie
Wang, Oliver Watkins, and Gino Yu. Syntax-semantic mapping for general intelligence: Language
comprehension as hypergraph homomorphism, language generation as constraint satisfaction. In
Artificial General Intelligence: Lecture Notes in Computer Science Volume 7716. Springer, 2012.
LKP+05. Sung Hee Lee, Junggon Kim, Frank Chongwoo Park, Munsang Kim, and James E. Bobrow.
Newton-type algorithms for dynamics-based robot movement optimization. IEEE Transactions
on Robotics, 21(4):657-667, 2005.
LLR09. Weiming Liu, Sanjiang Li, and Jochen Renz. Combining RCC-8 with qualitative direction calculi:
Algorithms and complexity. In IJCAI, 2009.
LMDK07. Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, and Walter Kintsch. Handbook of
Latent Semantic Analysis. Psychology Press, 2007.
LNOO. George Lakoff and Rafael Nunez. Where Mathematics Comes From. Basic Books, 2000.
Loo06. Moshe Looks. Competent Program Evolution. PhD Thesis, Computer Science Department, Wash-
ington University, 2006.
Loo07a. M. Looks. On the behavioral diversity of random programs. In Genetic and evolutionary compu-
tation conference, 2007.
Loo07b. M. Looks. Scalable estimation-of-distribution program evolution. In Genetic and evolutionary
computation conference, 2007.
Loo07c. Moshe Looks. Meta-optimizing semantic evolutionary search. In Hod Lipson, editor, Genetic and
Evolutionary Computation Conference, GECCO 2007, Proceedings, London, England, UK, July
7-11, 2007, page 626. ACM, 2007.
Low99. David Lowe. Object recognition from local scale-invariant features. In Proc. of the International
Conf. on Computer Vision, pages 1150-1157, 1999.
LP01. Dekang Lin and Patrick Pantel. DIRT: Discovery of inference rules from text. In Proceedings of
the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(KDD'01), pages 323-328. ACM Press, 2001.
LP02. W. B. Langdon and R. Poli. Foundations of Genetic Programming. Springer-Verlag, 2002.
Mai00. Monika Maidl. The common fragment of CTL and LTL. In IEEE Symposium on Foundations of
Computer Science, pages 643-652, 2000.
May04. M. T. Maybury. New Directions in Question Answering. MIT Press, 2004.
Mea07. E. M. Reiman et al. GAB2 alleles modify Alzheimer's risk in APOE e4 carriers. Neuron, 54(5),
2007.
Mih05. Rada Mihalcea. Unsupervised large-vocabulary word sense disambiguation with graph-based algo-
rithms for sequence data labeling. In HLT Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language Processing, pages 411-418, Morristown,
NJ, USA, 2005. Association for Computational Linguistics.
Mih07. Rada Mihalcea. Word sense disambiguation. Encyclopedia of Machine Learning. Springer-Verlag,
2007.
Min88. Marvin Minsky. The Society of Mind. MIT Press, 1988.
Mit96. Steven Mithen. The Prehistory of the Mind. Thames and Hudson, 1996.
MS99. Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Pro-
cessing. MIT Press, 1999.
MTF04. Rada Mihalcea, Paul Tarau, and Elizabeth Figa. Pagerank on semantic networks, with application
to word sense disambiguation. In COLING '04: Proceedings of the 20th international confer-
ence on Computational Linguistics, Morristown, NJ, USA, 2004. Association for Computational
Linguistics.
OCC90. Andrew Ortony, Gerald Clore, and Allan Collins. The Cognitive Structure of Emotions. Cambridge
University Press, 1990.
Ols95. J. R. Olsson. Inductive functional programming using incremental program transformation. Arti-
ficial Intelligence, 1995.
PAF00. H. Park, S. Amari, and K. Fukumizu. Adaptive natural gradient learning algorithms for various
stochastic models. Neural Networks, 13:755-764, 2000.
Pal04. Girish Keshav Palshikar. Fuzzy region connection calculus in finite discrete space domains. Appl.
Soft Comput., 4(1):13-23, 2004.
PCP00. C. Papageorgiou and T. Poggio. A trainable system for object detection. International Journal of
Computer Vision, 38(1), 2000.
PD09. Hoifung Poon and Pedro Domingos. Unsupervised semantic parsing. In Proceedings of the 2009
Conference on Empirical Methods in Natural Language Processing, pages 1-10, Singapore, August
2009. Association for Computational Linguistics.
Pei34. C. Peirce. Collected papers: Volume V. Pragmatism and pragmaticism. Harvard University Press.
Cambridge MA., 1934.
Pel05. Martin Pelikan. Hierarchical Bayesian Optimization Algorithm: Toward a New Generation of
Evolutionary Algorithms. Springer, 2005.
PJ88a. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann, 1988.
PJ88b. Steven Pinker and Jacques Mehler. Connections and Symbols. MIT Press, 1988.
Pro13. The Univalent Foundations Program. Homotopy Type Theory: Univalent Foundations of Mathe-
matics. Institute for Advanced Study, 2013.
RCC93. D. A. Randell, Z. Cui, and A. G. Cohn. A spatial logic based on regions and connection. 1993.
Ree99. J. Reece. Genetic programming acquires solutions by combining top-down and bottom-up refine-
ment. In Foundations of Genetic Programming, 1999.
Row90. John Rowan. Subpersonalities: The People Inside Us. Routledge Press, 1990.
RVG05. Mike Ross, Linas Vepstas, and Ben Goertzel. RelEx semantic relationship extractor.
http://opencog.org/wilci/RelEx, 2005.
Sch06. J. Schmidhuber. Gödel machines: Fully self-referential optimal universal self-improvers. In B. Go-
ertzel and C. Pennachin, editors, Artificial General Intelligence, pages 119-226. 2006.
SDCCK08a. Steven Schockaert, Martine De Cock, Chris Cornelis, and Etienne E. Kerre. Fuzzy region con-
nection calculus: An interpretation based on closeness. Int. J. Approx. Reasoning, 48(1):332-347,
2008.
SDCCK08b. Steven Schockaert, Martine De Cock, Chris Cornelis, and Etienne E. Kerre. Fuzzy region connec-
tion calculus: Representing vague topological information. Int. J. Approx. Reasoning, 48(1):314-
331, 2008.
SM07. Ravi Sinha and Rada Mihalcea. Unsupervised graph-based word sense disambiguation using mea-
sures of word semantic similarity. In ICSC '07: Proceedings of the International Conference on
Semantic Computing, pages 363-369, Washington, DC, USA, 2007. IEEE Computer Society.
SM09. Ravi Sinha and Rada Mihalcea. Unsupervised graph-based word sense disambiguation. In Nicolas
Nicolov and Ruslan Mitkov, editors, Current Issues in Linguistic Theory: Recent Advances in
Natural Language Processing. John Benjamins, 2009.
SMI97. F.-R. Sinot, M. Fernandez, and I. Mackie. Efficient reductions with director strings. Evolutionary
Computation, 1997.
SMK12. Jeremy Stober, Risto Miikkulainen, and Benjamin Kuipers. Learning geometry from sensorimo-
tor experience. In Proceedings of the First Joint Conference on Development and Learning and
Epigenetic Robotics, 2012.
Sol64a. Ray Solomonoff. A Formal Theory of Inductive Inference, Part I. Information and Control, 1964.
Sol64b. Ray Solomonoff. A Formal Theory of Inductive Inference, Part II. Information and Control, 1964.
Spe96. L. Spector. Simultaneous evolution of programs and their control structures. In Advances in
Genetic Programming 2. MIT Press, 1996.
SR04. Murray Shanahan and David A. Randell. A logic-based formulation of active visual perception. In
Knowledge Representation, 2004.
SS03. R. P. Salustowicz and J. Schmidhuber. Probabilistic incremental program evolution. Lecture Notes
in Computer Science vol. 2706, 2003.
ST91. Daniel Sleator and Davy Temperley. Parsing english with a link grammar. Technical report,
Carnegie Mellon University Computer Science technical report CMU-CS-91-196, 1991.
ST93. Daniel Sleator and Davy Temperley. Parsing english with a link grammar. Third International
Workshop on Parsing Technologies., 1993.
SV99. A. J. Storkey and R. Valabregue. The basins of attraction of a new hopfield learning rule. Neural
Networks, 12:869-876, 1999.
SW05. Reza Shadmehr and Steven P. Wise. The Computational Neurobiology of Reaching and Pointing
: A Foundation for Motor Learning. MIT Press, 2005.
SWM90. Timothy Starkweather, Darrell Whitley, and Keith Mathias. Optimization using distributed genetic
algorithms. In Parallel Problem Solving from Nature, Edited by H Schwefel and R Manner, 1990.
SZ04. R. Sun and X. Zhang. Top-down versus bottom-up learning in cognitive skill acquisition. Cognitive
Systems Research, 5, 2004.
Tes59. Lucien Tesnière. Éléments de syntaxe structurale. Klincksieck, Paris, 1959.
Tom03. Michael Tomasello. Constructing a Language: A Usage-Based Theory of Language Acquisition.
2003.
TSH11. Mohamad Tarifi, Meera Sitharam, and Jeffery Ho. Learning hierarchical sparse representations
using iterative dictionary learning and dimension reduction. In Proc. of RICA 2011, 2011.
TVCC05. M. Tomassini, L. Vanneschi, P. Collard, and M. Clergue. A study of fitness distance correlation
as a difficulty measure in genetic programming. Evolutionary Computation, 2005.
VK94. T. Veale and M. T. Keane. Metaphor and Memory and Meaning in Sapper: A Hybrid Model
of Metaphor Interpretation. Proceedings of the workshop on Hybrid Connectionist Systems of
ECAI94, at the 11th European Conference on Artificial Intelligence, 1994.
VO07. Tony Veale and Diarmuid O'Donoghue. Computation and Blending. Cognitive Linguistics, 2007.
Wah06. Wolfgang Wahlster. SmartKom: Foundations of Multimodal Dialogue Systems. Springer, 2006.
Wan06. Pei Wang. Rigid Flexibility: The Logic of Intelligence. Springer, 2006.
WF05. Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, 2005.
Win95. Stephan Winter. Topological relations between discrete regions. In Advances in Spatial Databases:
4th International Symposium, SSD '95, pages 310-327. Springer, 1995.
Win00. Stephan Winter. Uncertain topological relations between imprecise regions. Journal of Geograph-
ical Information Science, 14(5):411-430, 2000.
WKB05. Nico Van De Weghe, Bart Kuijpers, and Peter Bogaert. A qualitative trajectory calculus and the
composition of its relations. In Proc. of GeoS, pages 60-76. Springer-Verlag, 2005.
Yan10. King-Yin Yan. A fuzzy-probabilistic calculus for vagueness. Unpublished manuscript, 2010.
YKL+04. Sanghoon Yeo, Jinwook Kim, Sung Hee Lee, Frank Chongwoo Park, Wooram Park, Junggon Kim,
Changbeom Park, and Intaeck Yeo. A modular object-oriented framework for hierarchical multi-
resolution robot simulation. Robotics, 22(2):141-154, 2004.
Yur98. Denis Yuret. Discovery of Linguistic Relations Using Lexical Attraction. PhD thesis, MIT, 1998.
ZH10. Ruoyu Zou and Lawrence B. Holder. Frequent subgraph mining on a single large graph using
sampling techniques. In International Conference on Knowledge Discovery and Data Mining
archive. Proceedings of the Eighth Workshop on Mining and Learning with Graphs. Washington
DC, pages 171-178, 2010.
ZLLY08. Xiaotong Zhang, Weiming Liu, Sanjiang Li, and Mingsheng Ying. Reasoning with cardinal di-
rections: an efficient algorithm. In AAAI'08: Proc. of the 23rd national conference on Artificial
intelligence, pages 387-392. AAAI Press, 2008.
ZM06. Song-Chun Zhu and David Mumford. A stochastic grammar of images. Foundations and Trends
in Computer Graphics and Vision, 2006.