Ben Goertzel with Cassio Pennachin & Nil Geisweiller &
the OpenCog Team
Engineering General Intelligence, Part 2:
The CogPrime Architecture for Integrative, Embodied
AGI
September 19, 2013
EFTA00624128
This book is dedicated by Ben Goertzel to his beloved,
departed grandfather, Leo Zwell - an amazingly
warm-hearted, giving human being who was also a deep
thinker and excellent scientist, who got Ben started on the
path of science. As a careful experimentalist, Leo would
have been properly skeptical of the big hypotheses made
here - but he would have been eager to see them put to the
test!
Preface
Welcome to the second volume of Engineering General Intelligence! This is the second half of
a two-part technical treatise aimed at outlining a practical approach to engineering software
systems with general intelligence at the human level and ultimately beyond.
Our goal here is an ambitious one, not a modest one: machines with flexible problem-
solving ability, open-ended learning capability, creativity and, eventually, their own kind of
genius.
Part 1 set the stage, dealing with a variety of general conceptual issues related to the
engineering of advanced AGI, as well as presenting a brief overview of the CogPrime design
for Artificial General Intelligence. Now, here in Part 2, we plunge deep into the nitty-gritty, and
describe the multiple aspects of the CogPrime design in a fairly high degree of detail.
First we describe the CogPrime software architecture and knowledge representation in de-
tail; then we review the "cognitive cycle" via which CogPrime perceives and acts in the world
and reflects on itself. We then turn to various forms of learning: procedural, declarative (e.g.
inference), simulative and integrative. Methods of enabling natural language functionality in
CogPrime are then discussed; and the volume concludes with a chapter summarizing the ar-
gument that CogPrime can lead to human-level (and eventually perhaps greater) AGI, and a
chapter giving a "thought experiment" describing the internal dynamics via which a completed
CogPrime system might solve the problem of obeying the request "Build me something with
blocks that I haven't seen before."
Reading this book without having read Engineering General Intelligence, Part 1 first is not especially
recommended, since the prequel not only provides context for this one, but also defines a
number of specific terms and concepts that are used here without explanation (for example,
Part One has an extensive Glossary). However, the impatient reader who has not mastered
Part 1, or the reader who has finished Part 1 but is tempted to hop through Part 2 nonlinearly,
might wish to first skim the final two chapters, and then return to reading in linear order.
While the majority of the text here was written by the lead author Ben Goertzel, the overall
work and underlying ideas have been very much a team effort, with major input from the sec-
ondary authors Cassio Pennachin and Nil Geisweiller, and large contributions from various other
contributors as well. Many chapters have specifically indicated coauthors; but the contributions
from various collaborating researchers and engineers go far beyond these. The creation of the
AGI approach and design presented here is a process that has occurred over a long period of
time among a community of people; and this book is in fact a quite partial view of the existent
body of knowledge and intuition regarding CogPrime. For example, beyond the ideas presented
here, there is a body of work on the OpenCog wiki site, and then the OpenCog codebase itself.
More extensive introductory remarks may be found in the Preface of Part 1, including a brief
history of the book and acknowledgements to some of those who helped inspire it.
Also, one brief comment from the Preface of Part 1 bears repeating: At several places in this
volume, as in its predecessor, we will refer to the "current" CogPrime implementation (in the
OpenCog framework); in all cases this refers to the OpenCog software system as of late 2013.
We fully realize that this book is not "easy reading", and that the level and nature of
exposition varies somewhat from chapter to chapter. We have done our best to present these
very complex ideas as clearly as we could, given our own time constraints, and the lack of
commonly understood vocabularies for discussing many of the concepts and systems involved.
Our hope is that the length of the book, and the conceptual difficulty of some portions, will
be considered as compensated by the interest of the ideas we present. For, make no mistake —
for all their technicality and subtlety, we find the ideas presented here incredibly exciting. We
are talking about no less than the creation of machines with intelligence, creativity and genius
equaling and ultimately exceeding that of human beings.
This is, in the end, the kind of book that we (the authors) all hoped to find when we first
entered the AI field: a reasonably detailed description of how to go about creating thinking
machines. The fact that so few treatises of this nature, and so few projects explicitly aimed
at the creation of advanced AGI, exist, is something that has perplexed us since we entered
the field. Rather than just complain about it, we have taken matters into our own hands, and
worked to create a design and a codebase that we believe capable of leading to human-level
AGI and beyond.
We feel tremendously fortunate to live in times when this sort of pursuit can be discussed in
a serious, scientific way.
Online Appendices
Just one more thing before getting started! This book originally had even more chapters than
the ones currently presented in Parts 1 and 2. In order to decrease length and increase fo-
cus, however, a number of chapters dealing with peripheral - yet still relevant and interest-
ing - matters were moved to online appendices. These may be downloaded in a single PDF
file at http://goertzel.org/engineering_general_Intelligence_appendices_
B-H.pdf. The titles of these appendices are:
• Appendix A: Possible Worlds Semantics and Experiential Semantics
• Appendix B: Steps Toward a Formal Theory of Cognitive Structure and Dynamics
• Appendix C: Emergent Reflexive Mental Structures
• Appendix D: GOLEM: Toward an AGI Meta-Architecture Enabling Both Goal Preservation
and Radical Self-Improvement
• Appendix E: Lojban++: A Novel Linguistic Mechanism for Teaching AGI Systems
• Appendix F: PLN and the Brain
• Appendix G: Possible Worlds Semantics and Experiential Semantics
• Appendix H: Propositions About Environments in Which CogPrime Components are Useful
None of these are critical to understanding the key ideas in the book, which is why they were
relegated to online appendices. However, reading them will deepen your understanding of the
conceptual and formal perspectives underlying the CogPrime design.
September 2013
Ben Goertzel
Contents
Section I Architectural and Representational Mechanisms
19 The OpenCog Framework 3
19.1 Introduction 3
19.1.1 Layers of Abstraction in Describing Artificial Minds 3
19.1.2 The OpenCog Framework 4
19.2 The OpenCog Architecture 5
19.2.1 OpenCog and Hardware Models 5
19.2.2 The Key Components of the OpenCog Framework 6
19.3 The AtomSpace 7
19.3.1 The Knowledge Unit: Atoms 7
19.3.2 AtomSpace Requirements and Properties 8
19.3.3 Accessing the Atomspace 9
19.3.4 Persistence 10
19.3.5 Specialized Knowledge Stores 11
19.4 MindAgents: Cognitive Processes 13
19.4.1 A Conceptual View of CogPrime Cognitive Processes 14
19.4.2 Implementation of MindAgents 15
19.4.3 Tasks 16
19.4.4 Scheduling of MindAgents and Tasks in a Unit 16
19.4.5 The Cognitive Cycle 17
19.5 Distributed AtomSpace and Cognitive Dynamics 18
19.5.1 Distributing the AtomSpace 18
19.5.2 Distributed Processing 23
20 Knowledge Representation Using the Atomspace 27
20.1 Introduction 27
20.2 Denoting Atoms 28
20.2.1 Meta-Language 28
20.2.2 Denoting Atoms 30
20.3 Representing Functions and Predicates 35
20.3.1 Execution Links 36
20.3.2 Denoting Schema and Predicate Variables 39
20.3.3 Variable and Combinator Notation 41
20.3.4 Inheritance Between Higher-Order Types 43
20.3.5 Advanced Schema Manipulation 44
21 Representing Procedural Knowledge 49
21.1 Introduction 49
21.2 Representing Programs 50
21.3 Representational Challenges 51
21.4 What Makes a Representation Tractable? 53
21.5 The Combo Language 55
21.6 Normal Forms Postulated to Provide Tractable Representations 55
21.6.1 A Simple Type System 56
21.6.2 Boolean Normal Form 57
21.6.3 Number Normal Form 57
21.6.4 List Normal Form 57
21.6.5 Tuple Normal Form 57
21.6.6 Enum Normal Form 58
21.6.7 Function Normal Form 58
21.6.8 Action Result Normal Form 58
21.7 Program Transformations 59
21.7.1 Reductions 59
21.7.2 Neutral Transformations 60
21.7.3 Non-Neutral Transformations 62
21.8 Interfacing Between Procedural and Declarative Knowledge 63
21.8.1 Programs Manipulating Atoms 63
21.9 Declarative Representation of Procedures 64
Section II The Cognitive Cycle
22 Emotion, Motivation, Attention and Control 67
22.1 Introduction 67
22.2 A Quick Look at Action Selection 68
22.3 Psi in CogPrime 69
22.4 Implementing Emotion Rules atop Psi's Emotional Dynamics 72
22.4.1 Grounding the Logical Structure of Emotions in the Psi Model 73
22.5 Goals and Contexts 73
22.5.1 Goal Atoms 74
22.6 Context Atoms 76
22.7 Ubergoal Dynamics 77
22.7.1 Implicit Ubergoal Pool Modification 77
22.7.2 Explicit Ubergoal Pool Modification 78
22.8 Goal Formation 78
22.9 Goal Fulfillment and Predicate Schematization 79
22.10 Context Formation 79
22.11 Execution Management 80
22.12 Goals and Time 81
23 Attention Allocation 83
23.1 Introduction 83
23.2 Semantics of Short and Long Term Importance 85
23.2.1 The Precise Semantics of STI and LTI 86
23.2.2 STI, STIFund, and Juju 89
23.2.3 Formalizing LTI 89
23.2.4 Applications of LTIburst versus LTIcont 90
23.3 Defining Burst LTI in Terms of STI 91
23.4 Valuing LTI and STI in terms of a Single Currency 92
23.5 Economic Attention Networks 94
23.5.1 Semantics of Hebbian Links 94
23.5.2 Explicit and Implicit Hebbian Relations 95
23.6 Dynamics of STI and LTI Propagation 95
23.6.1 ECAN Update Equations 96
23.6.2 ECAN as Associative Memory 101
23.7 Glocal Economic Attention Networks 101
23.7.1 Experimental Explorations 102
23.8 Long-Term Importance and Forgetting 102
23.9 Attention Allocation via Data Mining on the System Activity Table 103
23.10 Schema Credit Assignment 104
23.11 Interaction between ECANs and other CogPrime Components 106
23.11.1 Use of PLN and Procedure Learning to Help ECAN 106
23.11.2 Use of ECAN to Help Other Cognitive Processes 106
23.12 MindAgent Importance and Scheduling 107
23.13 Information Geometry for Attention Allocation 108
23.13.1 Brief Review of Information Geometry 108
23.13.2 Information-Geometric Learning for Recurrent Networks: Extending the ANGL Algorithm 109
23.13.3 Information Geometry for Economic Attention Allocation: A Detailed Example 110
24 Economic Goal and Action Selection 113
24.1 Introduction 113
24.2 Transfer of STI "Requests for Services" Between Goals 114
24.3 Feasibility Structures 116
24.4 Goal Based Schema Selection 116
24.4.1 A Game-Theoretic Approach to Action Selection 117
24.5 SchemaActivation 118
24.6 GoalBasedSchemaLearning 119
25 Integrative Procedure Evaluation 121
25.1 Introduction 121
25.2 Procedure Evaluators 121
25.2.1 Simple Procedure Evaluation 122
25.2.2 Effort Based Procedure Evaluation 122
25.2.3 Procedure Evaluation with Adaptive Evaluation Order 123
25.3 The Procedure Evaluation Process 123
25.3.1 Truth Value Evaluation 124
25.3.2 Schema Execution 125
Section III Perception and Action
26 Perceptual and Motor Hierarchies 129
26.1 Introduction 129
26.2 The Generic Perception Process 130
26.2.1 The ExperienceDB 131
26.3 Interfacing CogPrime with a Virtual Agent 131
26.3.1 Perceiving the Virtual World 132
26.3.2 Acting in the Virtual World 133
26.4 Perceptual Pattern Mining 134
26.4.1 Input Data 134
26.4.2 Transaction Graphs 135
26.4.3 Spatiotemporal Conjunctions 135
26.4.4 The Mining Task 136
26.5 The Perceptual-Motor Hierarchy 136
26.6 Object Recognition from Polygonal Meshes 137
26.6.1 Algorithm Overview 138
26.6.2 Recognizing PersistentPolygonNodes (PPNodes) from PolygonNodes 138
26.6.3 Creating Adjacency Graphs from PPNodes 139
26.6.4 Clustering in the Adjacency Graph 140
26.6.5 Discussion 140
26.7 Interfacing the Atomspace with a Deep Learning Based Perception-Action
Hierarchy 140
26.7.1 Hierarchical Perception Action Networks 141
26.7.2 Declarative Memory 142
26.7.3 Sensory Memory 142
26.7.4 Procedural Memory 142
26.7.5 Episodic Memory 143
26.7.6 Action Selection and Attention Allocation 144
26.8 Multiple Interaction Channels 144
27 Integrating CogPrime with a Compositional Spatiotemporal Deep
Learning Network 147
27.1 Introduction 147
27.2 Integrating CSDLNs with Other AI Frameworks 149
27.3 Semantic CSDLN for Perception Processing 149
27.4 Semantic CSDLN for Motor and Sensorimotor Processing 152
27.5 Connecting the Perceptual and Motoric Hierarchies with a Goal Hierarchy 154
28 Making DeSTIN Representationally Transparent 157
28.1 Introduction 157
28.2 Review of DeSTIN Architecture and Dynamics 158
28.2.1 Beyond Gray-Scale Vision 159
28.3 Uniform DeSTIN 159
28.3.1 Translation-Invariant DeSTIN 160
28.3.2 Mapping States of Translation-Invariant DeSTIN into the Atomspace 161
28.3.3 Scale-Invariant DeSTIN 162
28.3.4 Rotation Invariant DeSTIN 163
28.3.5 Temporal Perception 164
28.4 Interpretation of DeSTIN's Activity 164
28.4.1 DeSTIN's Assumption of Hierarchical Decomposability 165
28.4.2 Distance and Utility 165
28.5 Benefits and Costs of Uniform DeSTIN 166
28.6 Imprecise Probability as a Tool for Linking CogPrime and DeSTIN 167
28.6.1 Visual Attention Focusing 167
28.6.2 Using Imprecise Probabilities to Guide Visual Attention Focusing 168
28.6.3 Sketch of Application to DeSTIN 168
29 Bridging the Symbolic/Subsymbolic Gap 171
29.1 Introduction 171
29.2 Simplified OpenCog Workflow 173
29.3 Integrating DeSTIN and OpenCog 174
29.3.1 Mining Patterns from DeSTIN States 175
29.3.2 Probabilistic Inference on Mined Hypergraphs 176
29.3.3 Insertion of OpenCog-Learned Predicates into DeSTIN's Pattern Library 177
29.4 Multisensory Integration, and Perception-Action Integration 178
29.4.1 Perception-Action Integration 179
29.4.2 Thought-Experiment: Eye-Hand Coordination 181
29.5 A Practical Example: Using Subtree Mining to Bridge the Gap Between
DeSTIN and PLN 182
29.5.1 The Importance of Semantic Feedback 184
29.6 Some Simple Experiments with Letters 184
29.6.1 Mining Subtrees from DeSTIN States Induced via Observing Letterforms 184
29.6.2 Mining Subtrees from DeSTIN States Induced via Observing Letterforms 185
29.7 Conclusion 188
Section IV Procedure Learning
30 Procedure Learning as Program Learning 193
30.1 Introduction 193
30.1.1 Program Learning 193
30.2 Representation-Building 195
30.3 Specification Based Procedure Learning 196
31 Learning Procedures via Imitation, Reinforcement and Correction 197
31.1 Introduction 197
31.2 IRC Learning 197
31.2.1 A Simple Example of Imitation/Reinforcement Learning 198
31.2.2 A Simple Example of Corrective Learning 199
31.3 IRC Learning in the PetBrain 201
31.3.1 Introducing Corrective Learning 203
31.4 Applying A Similar IRC Methodology to Spontaneous Learning 203
32 Procedure Learning via Adaptively Biased Hillclimbing 205
32.1 Introduction 205
32.2 Hillclimbing 206
32.3 Entity and Perception Filters 207
32.3.1 Entity filter 207
32.3.2 Entropy perception filter 207
32.4 Using Action Sequences as Building Blocks 208
32.5 Automatically Parametrizing the Program Size Penalty 208
32.5.1 Definition of the complexity penalty 208
32.5.2 Parameterizing the complexity penalty 209
32.5.3 Definition of the Optimization Problem 210
32.6 Some Simple Experimental Results 211
32.7 Conclusion 214
33 Probabilistic Evolutionary Procedure Learning 215
33.1 Introduction 215
33.1.1 Explicit versus Implicit Evolution in CogPrime 217
33.2 Estimation of Distribution Algorithms 218
33.3 Competent Program Evolution via MOSES 219
33.3.1 Statics 219
33.3.2 Dynamics 222
33.3.3 Architecture 223
33.3.4 Example: Artificial Ant Problem 224
33.3.5 Discussion 229
33.3.6 Conclusion 229
33.4 Integrating Feature Selection Into the Learning Process 230
33.4.1 Machine Learning, Feature Selection and AGI 231
33.4.2 Data- and Feature-Focusable Learning Problems 232
33.4.3 Integrating Feature Selection Into Learning 233
33.4.4 Integrating Feature Selection into MOSES Learning 234
33.4.5 Application to Genomic Data Classification 234
33.5 Supplying Evolutionary Learning with Long-Term Memory 236
33.6 Hierarchical Program Learning 237
33.6.1 Hierarchical Modeling of Composite Procedures in the AtomSpace 238
33.6.2 Identifying Hierarchical Structure in Combo Trees via MetaNodes and Dimensional Embedding 239
33.7 Fitness Function Estimation via Integrative Intelligence 242
Section V Declarative Learning
34 Probabilistic Logic Networks 247
34.1 Introduction 247
34.2 A Simple Overview of PLN 248
34.2.1 Forward and Backward Chaining 249
34.3 First Order Probabilistic Logic Networks 250
34.3.1 Core FOPLN Relationships 250
34.3.2 PLN Truth Values 251
34.3.3 Auxiliary FOPLN Relationships 251
34.3.4 PLN Rules and Formulas 252
34.3.5 Inference Trails 253
34.4 Higher-Order PLN 254
34.4.1 Reducing HOPLN to FOPLN 255
34.5 Predictive Implication and Attraction 256
34.6 Confidence Decay 257
34.6.1 An Example 258
34.7 Why is PLN a Good Idea? 260
35 Spatiotemporal Inference 263
35.1 Introduction 263
35.2 Related Work on Spatio-temporal Calculi 264
35.3 Uncertainty with Distributional Fuzzy Values 267
35.4 Spatio-temporal Inference in PLN 270
35.5 Examples 272
35.5.1 Spatiotemporal Rules 272
35.5.2 The Laptop is Safe from the Rain 273
35.5.3 Fetching the Toy Inside the Upper Cupboard 273
35.6 An Integrative Approach to Planning 275
36 Adaptive, Integrative Inference Control 277
36.1 Introduction 277
36.2 High-Level Control Mechanisms 277
36.2.1 The Need for Adaptive Inference Control 278
36.3 Inference Control in PLN 279
36.3.1 Representing PLN Rules as GroundedSchemaNodes 279
36.3.2 Recording Executed PLN Inferences in the Atomspace 279
36.3.3 Anatomy of a Single Inference Step 280
36.3.4 Basic Forward and Backward Inference Steps 281
36.3.5 Interaction of Forward and Backward Inference 282
36.3.6 Coordinating Variable Bindings 282
36.3.7 An Example of Problem Decomposition 284
36.3.8 Example of Casting a Variable Assignment Problem as an Optimization
Problem 284
36.3.9 Backward Chaining via Nested Optimization 285
36.4 Combining Backward and Forward Inference Steps with Attention Allocation
to Achieve the Same Effect as Backward Chaining (and Even Smarter Inference
Dynamics) 288
36.4.1 Breakdown into MindAgents 289
36.5 Hebbian Inference Control 289
36.6 Inference Pattern Mining 293
36.7 Evolution As an Inference Control Scheme 293
36.8 Incorporating Other Cognitive Processes into Inference 294
36.9 PLN and Bayes Nets 295
37 Pattern Mining 297
37.1 Introduction 297
37.2 Finding Interesting Patterns via Program Learning 298
37.3 Pattern Mining via Frequent/Surprising Subgraph Mining 299
37.4 Fishgram 300
37.4.1 Example Patterns 300
37.4.2 The Fishgram Algorithm 301
37.4.3 Preprocessing 302
37.4.4 Search Process 303
37.4.5 Comparison to other algorithms 304
38 Speculative Concept Formation 305
38.1 Introduction 305
38.2 Evolutionary Concept Formation 306
38.3 Conceptual Blending 308
38.3.1 Outline of a CogPrime Blending Algorithm 310
38.3.2 Another Example of Blending 311
38.4 Clustering 312
38.5 Concept Formation via Formal Concept Analysis 312
38.5.1 Calculating Membership Degrees of New Concepts 313
38.5.2 Forming New Attributes 313
38.5.3 Iterating the Fuzzy Concept Formation Process 314
Section VI Integrative Learning
39 Dimensional Embedding 319
39.1 Introduction 319
39.2 Link Based Dimensional Embedding 320
39.3 Harel and Koren's Dimensional Embedding Algorithm 322
39.3.1 Step 1: Choosing Pivot Points 322
39.3.2 Step 2: Similarity Estimation 323
39.3.3 Step 3: Embedding 323
39.4 Embedding Based Inference Control 323
39.5 Dimensional Embedding and InheritanceLinks 325
40 Mental Simulation and Episodic Memory 327
40.1 Introduction 327
40.2 Internal Simulations 328
40.3 Episodic Memory 328
41 Integrative Procedure Learning 333
41.1 Introduction 333
41.1.1 The Diverse Technicalities of Procedure Learning in CogPrime 334
41.2 Preliminary Comments on Procedure Map Encapsulation and Expansion 336
41.3 Predicate Schematization 337
41.3.1 A Concrete Example 339
41.4 Concept-Driven Schema and Predicate Creation 340
41.4.1 Concept-Driven Predicate Creation 340
41.4.2 Concept-Driven Schema Creation 341
41.5 Inference-Guided Evolution of Pattern-Embodying Predicates 342
41.5.1 Rewarding Surprising Predicates 342
41.5.2 A More Formal Treatment 344
41.6 PredicateNode Mining 345
41.7 Learning Schema Maps 346
41.7.1 Goal-Directed Schema Evolution 347
41.8 Occam's Razor 349
42 Map Formation 351
42.1 Introduction 351
42.2 Map Encapsulation 353
42.3 Atom and Predicate Activity Tables 355
42.4 Mining the AtomSpace for Maps 356
42.4.1 Frequent Itemset Mining for Map Mining 357
42.4.2 Evolutionary Map Detection 359
42.5 Map Dynamics 359
42.6 Procedure Encapsulation and Expansion 360
42.6.1 Procedure Encapsulation in More Detail 361
42.6.2 Procedure Encapsulation in the Human Brain 361
42.7 Maps and Focused Attention 362
42.8 Recognizing and Creating Self-Referential Structures 363
42.8.1 Encouraging the Recognition of Self-Referential Structures in the
AtomSpace 364
Section VII Communication Between Human and Artificial Minds
43 Communication Between Artificial Minds 369
43.1 Introduction 369
43.2 A Simple Example Using a PsyneseVocabulary Server 371
43.2.1 The Psynese Match Schema 373
43.3 Psynese as a Language 373
43.4 Psynese Mindplexes 374
43.4.1 AGI Mindplexes 375
43.5 Psynese and Natural Language Processing 376
43.5.1 Collective Language Learning 378
44 Natural Language Comprehension 379
44.1 Introduction 379
44.2 Linguistic Atom Types 381
44.3 The Comprehension and Generation Pipelines 382
44.4 Parsing with Link Grammar 383
44.4.1 Link Grammar vs. Phrase Structure Grammar 385
44.5 The RelEx Framework for Natural Language Comprehension 386
44.5.1 RelEx2Frame: Mapping Syntactico-Semantic Relationships into
FrameNet Based Logical Relationships 387
44.5.2 A Priori Probabilities For Rules 389
44.5.3 Exclusions Between Rules 389
44.5.4 Handling Multiple Prepositional Relationships 390
44.5.5 Comparatives and Phantom Nodes 391
44.6 Frame2Atom 392
44.6.1 Examples of Frame2Atom 393
44.6.2 Issues Involving Disambiguation 396
44.7 Syn2Sem: A Semi-Supervised Alternative to RelEx and RelEx2Frame 397
44.8 Mapping Link Parses into Atom Structures 398
44.8.1 Example Training Pair 399
44.9 Making a Training Corpus 399
44.9.1 Leveraging RelEx to Create a Training Corpus 399
44.9.2 Making an Experience Based Training Corpus 399
44.9.3 Unsupervised, Experience Based Corpus Creation 400
44.10 Limiting the Degree of Disambiguation Attempted 400
44.11 Rule Format 401
44.11.1 Example Rule 402
44.12 Rule Learning 402
44.13 Creating a Cyc-Like Database via Text Mining 403
44.14 PROWL Grammar 404
44.14.1 Brief Review of Word Grammar 405
44.14.2 Word Grammar's Logical Network Model 406
44.14.3 Link Grammar Parsing vs Word Grammar Parsing 407
44.14.4 Contextually Guided Greedy Parsing and Generation Using Word Link Grammar 411
44.15 Aspects of Language Learning 413
44.15.1 Word Sense Creation 413
44.15.2 Feature Structure Learning 414
44.15.3 Transformation and Semantic Mapping Rule Learning 414
44.16 Experiential Language Learning 415
44.17 Which Path(s) Forward? 416
45 Language Learning via Unsupervised Corpus Analysis 417
45.1 Introduction 417
45.2 Assumed Linguistic Infrastructure 419
45.3 Linguistic Content To Be Learned 421
45.3.1 Deeper Aspects of Comprehension 423
45.4 A Methodology for Unsupervised Language Learning from a Large Corpus 423
45.4.1 A High Level Perspective on Language Learning 424
45.4.2 Learning Syntax 426
45.4.3 Learning Semantics 430
45.5 The Importance of Incremental Learning 434
45.6 Integrating Language Learned via Corpus Analysis into CogPrime's
Experiential Learning 435
46 Natural Language Generation 437
46.1 Introduction 437
46.2 SegSim for Sentence Generation 437
46.2.1 NLGen: Example Results 441
46.3 Experiential Learning of Language Generation 444
46.4 Sem2Syn 445
46.5 Conclusion 445
47 Embodied Language Processing 447
47.1 Introduction 447
47.2 Semiosis 448
47.3 Teaching Gestural Communication 450
47.4 Simple Experiments with Embodiment and Anaphor Resolution 455
47.5 Simple Experiments with Embodiment and Question Answering 456
47.5.1 Preparing/Matching Frames 456
47.5.2 Frames2RelEx 458
47.5.3 Example of the Question Answering Pipeline 458
47.5.4 Example of the PetBrain Language Generation Pipeline 459
47.6 The Prospect of Massively Multiplayer Language Teaching 460
48 Natural Language Dialogue 463
48.1 Introduction 463
48.1.1 Two Phases of Dialogue System Development 464
48.2 Speech Act Theory and its Elaboration 464
48.3 Speech Act Schemata and Triggers 465
48.3.1 Notes Toward Example SpeechActSchema 467
48.4 Probabilistic Mining of Trigger Contexts 471
48.5 Conclusion 473
Section VIII From Here to AGI
49 Summary of Argument for the CogPrime Approach 477
49.1 Introduction 477
49.2 Multi-Memory Systems 477
49.3 Perception, Action and Environment 478
49.4 Developmental Pathways 479
49.5 Knowledge Representation 480
49.6 Cognitive Processes 480
49.6.1 Uncertain Logic for Declarative Knowledge 481
49.6.2 Program Learning for Procedural Knowledge 482
49.6.3 Attention Allocation 483
49.6.4 Internal Simulation and Episodic Knowledge 484
49.6.5 Low-Level Perception and Action 484
49.6.6 Goals 485
49.7 Fulfilling the "Cognitive Equation" 485
49.8 Occam's Razor 486
49.8.1 Mind Geometry 486
49.9 Cognitive Synergy 488
49.9.1 Synergies that Help Inference 488
49.10 Synergies that Help MOSES 489
49.10.1 Synergies that Help Attention Allocation 489
49.10.2 Further Synergies Related to Pattern Mining 489
49.10.3 Synergies Related to Map Formation 490
49.11 Emergent Structures and Dynamics 490
49.12 Ethical AGI 491
49.13 Toward Superhuman General Intelligence 492
49.13.1 Conclusion 492
50 Build Me Something I Haven't Seen: A CogPrime Thought Experiment 495
50.1 Introduction 495
50.2 Roles of Selected Cognitive Processes 496
50.3 A Semi-Narrative Treatment 506
50.4 Conclusion 509
A Glossary 511
A.1 List of Specialized Acronyms 511
A.2 Glossary of Specialized Terms 512
References 529
Section I
Architectural and Representational Mechanisms
Chapter 19
The OpenCog Framework
19.1 Introduction
The primary burden of this book is to explain the CogPrime architecture for AGI - the broad
outline of the design, the main dynamics it's intended to display once complete, and the reasons
why we believe it will be capable of leading to general intelligence at the human level and beyond.
The crux of CogPrime lies in its learning algorithms and how they are intended to interact
together synergetically, making use of CogPrime's knowledge representation and other tools.
Before we can get to this, however, we need to elaborate some of the "plumbing" within which
this learning dynamics occurs. We will start out with a brief description of the OpenCog frame-
work in which implementation of CogPrime has been, gradually and incrementally, occurring
for the last few years.
19.1.1 Layers of Abstraction in Describing Artificial Minds
There are multiple layers intervening between a conceptual theory of mind and a body of source
code. How many layers to explicitly discuss is a somewhat arbitrary decision, but one way to
picture it is exemplified in Table 19.1.
In Part 1 of this work we have concerned ourselves mainly with levels 5 and 6 in the table:
mathematical/conceptual modeling of cognition and philosophy of mind (with occasional forays
into levels 3 and 4). Most of Part 2, on the other hand, deals with level 4 (mathematical/concep-
tual AI design), verging into level 3 (high-level software design). This chapter however will focus
on somewhat lower-level material, mostly level 3 with some dips into level 2. We will describe the
basic architecture of CogPrime as a software system, implemented as "OpenCogPrime" within
the OpenCog Framework (OCF). The reader may want to glance back at Chapter 6 of Part 1
before proceeding through this one, to get a memory-refresh on basic CogPrime terminology.
Also, OpenCog and OpenCogPrime are open-source, so the reader who wishes to dig into the
source code (mostly C++, some Python and Scheme) is welcome to; directions to find the code
are on the opencog.org website.
Level  Level of Abstraction            Description/Example
1      Source code
2      Detailed software design
3      Software architecture           Largely programming-language-independent, but not
                                       hardware-architecture-independent: much of the material
                                       in this chapter, for example, and most of the OpenCog
                                       Framework
4      Mathematical and conceptual     e.g., the sort of characterization of CogPrime given in
       AI design                       most of this Part of this book
5      Abstract mathematical           e.g. the SRAM model discussed in Chapter 7 of Part 1,
       modeling of cognition           which could be used to inspire or describe many
                                       different AI systems
6      Philosophy of mind              e.g. Patternism, the Mind-World Correspondence
                                       Principle

Table 19.1: Levels of abstraction in CogPrime's implementation and design
19.1.2 The OpenCog Framework
The OpenCog Framework forms a bridge between the mathematical structures and dynamics
of CogPrime's concretely implemented mind, and the nitty-gritty realities of modern computer
technology. While CogPrime could in principle be implemented in a quite different infrastruc-
ture, in practice the CogPrime design has been developed closely in conjunction with OpenCog,
so that a qualitative understanding of the nature of the OCF is fairly necessary for an under-
standing of how CogPrime is intended to function, and a detailed understanding of the OCF is
necessary for doing concrete implementation work on CogPrime.
Marvin Minsky, in a personal conversation with one of the authors (Goertzel), once expressed
the opinion that a human-level general intelligence could probably be implemented on a 486
PC, if we just knew the algorithm. We doubt this is the case — at least not unless the 486
PC were supplied with masses of external memory and allowed to proceed much, much slower
than any human being - and it is certainly not the case for CogPrime. By current computing
hardware standards, a CogPrime system is a considerable resource hog. And it will remain so
for a number of years, even considering technology progress.
It is one of the jobs of the OCF to manage the system's gluttonous behavior. It is the software
layer that abstracts the real world efficiency compromises from the rest of the system; this is
why we call it a "Mind OS": it provides services, rules, and protection to the Atoms and cognitive
processes (see Section 19.4) that live on top of it, which are then allowed to ignore the software
architecture they live on.
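The "Mind OS" layering can be sketched roughly as follows. This is an illustrative toy, not the actual OpenCog API: the class names (AtomSpace, MindAgent, DecayAgent) and the scheduling loop are hypothetical simplifications. The point it illustrates is that cognitive processes are written against an abstract knowledge-store interface, and a scheduler runs them, so that the processes themselves never manage threads, storage, or distribution.

```python
# Illustrative toy only: these class names and this scheduling loop are
# hypothetical simplifications, NOT the actual OpenCog API.

class AtomSpace:
    """A toy knowledge store. MindAgents see only this interface, never
    how Atoms are stored, cached, or distributed behind it."""

    def __init__(self):
        self._atoms = {}  # atom name -> short-term importance (STI)

    def add(self, name, sti=0.0):
        self._atoms[name] = sti

    def get_sti(self, name):
        return self._atoms[name]

    def set_sti(self, name, sti):
        self._atoms[name] = sti

    def names(self):
        return list(self._atoms)


class MindAgent:
    """Base class for cognitive processes; each run() performs one small
    unit of work against the AtomSpace."""

    def run(self, atomspace):
        raise NotImplementedError


class DecayAgent(MindAgent):
    """Example agent: decays every Atom's STI a little each cycle."""

    def run(self, atomspace):
        for name in atomspace.names():
            atomspace.set_sti(name, atomspace.get_sti(name) * 0.9)


def cognitive_cycle(atomspace, agents, cycles):
    """The 'Mind OS' loop: schedule each agent in turn, so agents never
    deal with the software architecture they run on."""
    for _ in range(cycles):
        for agent in agents:
            agent.run(atomspace)


space = AtomSpace()
space.add("cat", sti=1.0)
cognitive_cycle(space, [DecayAgent()], cycles=3)
print(round(space.get_sti("cat"), 3))  # -> 0.729
```

In the real system the knowledge store is far richer and the scheduling far subtler (see Section 19.4), but the division of labor is the same: agents act through the store's interface, and the OCF handles everything beneath it.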
And so, the nature of the OCF is strongly influenced by the quantitative requirements im-
posed on the system, as well as the general nature of the structure and dynamics that it must
support. The large number and great diversity of Atoms needed to create a significantly
intelligent CogPrime demand that we pay careful attention to such issues as concurrent,
distributed processing, and scalability in general. The number of Nodes and Links that we will need in
order to create a reasonably complete CogPrime is still largely unknown. But our experiments
with learning, natural language processing, and cognition over the past few years have given
us an intuition for the question. We currently believe that we are likely to need billions - but
probably not trillions, and almost surely not quadrillions - of Atoms in order to achieve a high
degree of general intelligence. Hundreds of millions strikes us as possible but overly optimistic.
In fact we have already run CogPrime systems utilizing hundreds of millions of Atoms, though
in a simplified dynamical regime with only a couple very simple processes acting on most of
them.
The operational infrastructure of the OCF is an area where pragmatism must reign over ide-
alism. What we describe here is not the ultimate possible "mind operating system" to underlie
a CogPrime system, but rather a workable practical solution given the hardware, networking
and software infrastructure readily available today at reasonable prices. Along these lines, it
must be emphasized that the ideas presented in this chapter are the result of over a decade of
practical experimentation by the authors and their colleagues with implementations of related
software systems. The journey began in earnest in 1997 with the design and implementation of
the Webmind AI Engine at Intelligenesis Corp., which itself went through a few major design
revisions; and then in 2001-2002 the Novamente Cognition Engine was architected and imple-
mented, and evolved progressively until 2008, when a subset of it was adapted for open-sourcing
as OpenCog. Innumerable mistakes were made, and lessons learned, along this path. The OCF as described here is significantly different from, and better than, these previous architectures, thanks to these lessons, as well as to the changing landscape of concurrent, distributed computing over the past few years.
The design presented here reflects a mix of realism and idealism, and we haven't seen fit
here to describe all the alternatives that were pursued on the route to what we present. We
don't claim the approach we've chosen is ideal, but it's in use now within the OpenCog sys-
tem, and it seems both workable in practice and capable of effectively supporting the entire
CogPrime design. No doubt it will evolve in some respects as implementation progresses; one
of the principles kept in mind during the design and development of OpenCog was modular-
ity, enabling substantial modifications to particular parts of the framework to occur without
requiring wholesale changes throughout the codebase.
19.2 The OpenCog Architecture
19.2.1 OpenCog and Hardware Models
The job of the OCF is closely related to the nature of the hardware on which it runs. The ideal
hardware platform for CogPrime would be a massively parallel hardware architecture, in which
each Atom was given its own processor and memory. The closest thing would have been the
Connection Machine II [Hil89]: a CM5 was once built with 64000 processors and local RAM for
each processor. But even 64000 processors wouldn't be enough for a highly intelligent CogPrime
to run in a fully parallelized manner, since we're sure we need more than 64000 Atoms.
Connection Machine style hardware seems to have perished in favor of more standard SMP (Symmetric Multi-Processing) machines. It is true that each year we see SMP machines with
more and more processors on the market, and more and more cores per processor. However,
the state of the art is still in the hundreds of cores range, many orders of magnitude from what
would be necessary for a one Atom per processor CogPrime implementation.
So, at the present time, technological and financial reasons have pushed us to implement
the OpenCog system using a relatively mundane and standard hardware architecture. If the
CogPrime project is successful in the relatively near term, the first human-level OpenCogPrime
system will most likely live on a network of high-end commodity SMP machines. These are
machines with dozens of gigabytes of RAM and several processor cores, perhaps dozens but
not thousands. A highly intelligent CogPrime would require a cluster of dozens and possibly
hundreds or thousands of such machines. We think it's unlikely that tens of thousands will be
required, and extremely unlikely that hundreds of thousands will be.
Given this sort of architecture, we need effective ways to swap Atoms back and forth be-
tween disk and RAM, and carefully manage the allocation of processor time among the various
cognitive processes that demand it. The use of a widely-distributed network of weaker ma-
chines for peripheral processing is a serious possibility, and we have some detailed software
designs addressing this option: but for the near future we believe that this can best be used as
augmentation to core CogPrime processing, which must remain on a dedicated cluster.
Of course, the use of specialized hardware is also a viable possibility, and we have considered
a host of possibilities such as
• True supercomputers like those created by IBM or Cray (which these days are distributed
systems, but with specialized, particularly efficient interconnection frameworks and overall
control mechanisms)
• GPU supercomputers such as the Nvidia Tesla, which are currently being used for vision processing systems considered for hybridization with OCP, such as DeSTIN and Hugo de Garis's Parcone
• custom chips designed to implement the various CogPrime algorithms and data structures
in hardware
• More speculatively, it might be possible to use evolutionary quantum computing or adiabatic
quantum computing a la Dwave (http://dwave.com) to accelerate CogPrime procedure
learning.
All these possibilities and many more are exciting to envision, but the CogPrime architecture
does not require any of them in order to be successful.
19.2.2 The Key Components of the OpenCog Framework
Given the realities of implementing CogPrime on clustered commodity servers, as we have seen
above, the three key questions that have to be answered in the OCF design are:
1. How do we store CogPrime's knowledge?
2. How do we enable cognitive processes to act on that knowledge, refining and improving it?
3. How do we enable scalable, distributed knowledge storage and cognitive processing of that
knowledge?
The remaining sections of this Chapter are dedicated to answering each of these questions
in more detail.
While the basic landscape of concurrent, distributed processing is largely the same as it was
a decade ago - we're still dealing with distributed networks of multiprocessor von Neumann ma-
chines - we can draw on advancements in both computer architecture and software. The former is materialized in the increasing availability of multiple real and virtual cores in commodity processors. The latter reflects the emergence of a number of tools and architectural patterns,
largely thanks to the rise of "big data" problems and businesses. Companies and projects dealing
with massive datasets face challenges that aren't entirely like those of building CogPrime, but
which share many useful similarities.
These advances are apparent mostly in the architecture of the AtomSpace, a distributed
knowledge store for efficient storage of hypergraphs and its use by CogPrime's cognitive dy-
namics. The AtomSpace, like many NoSQL datastores, is heavily distributed, utilizing local
caches for read and write operations, and a special purpose design for eventual consistency
guarantees.
We also attempt to minimize the complexities of multi-threading in the scheduling of cogni-
tive dynamics, by allowing those to be deployed either as agents sharing a single OS process,
or, preferably, as processes of their own. Cognitive dynamics communicate through message queues, which are provided by a sub-system that hides the deployment decision, so the messages exchanged are the same whether delivered within a process, to another process in the same machine, or to a process in another machine in the cluster.
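To make the deployment-hiding concrete, here is a minimal Python sketch of such a message layer. The class and method names (MessageBus, register, send, receive) are purely illustrative, not the actual OpenCog API, and a real deployment would route remote messages over the network rather than through an in-process queue.

```python
import queue

class MessageBus:
    """Hypothetical queue layer that hides whether a cognitive
    dynamic runs in this process or elsewhere in the cluster."""

    def __init__(self):
        self._queues = {}  # agent name -> its incoming queue

    def register(self, agent_name):
        self._queues[agent_name] = queue.Queue()

    def send(self, recipient, message):
        # The caller uses the same call whether the recipient is a
        # thread here or (in a real system) a remote process.
        self._queues[recipient].put(message)

    def receive(self, agent_name):
        return self._queues[agent_name].get()

bus = MessageBus()
bus.register("InferenceAgent")
bus.send("InferenceAgent", {"type": "Task", "payload": "revise-atom"})
msg = bus.receive("InferenceAgent")
```

The point of the design is that the sender's code is identical in all three delivery cases; only the bus implementation changes.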
19.3 The AtomSpace
As alluded to above and in Chapter 13, and discussed more fully in Chapter 20 below, the
foundation of CogPrime's knowledge representation is the Atom, an object that can be either
a Node or a Link. CogPrime's hypergraph is implemented as the AtomSpace, a specialized
datastore that comes along with an API designed specifically for CogPrime's requirements.
19.3.1 The Knowledge Unit: Atoms
Atoms are used to represent every kind of knowledge in the system's memory in one way or
another. The particulars of Atoms and how they represent knowledge will be discussed in later
chapters; here we present only a minimal description in order to motivate the design of the
AtomSpace. From that perspective, the most important properties of Atoms are:
• Every Atom has an AtomHandle, which is a universal ID across a CogPrime deployment
(possibly involving thousands of networked machines). The AtomHandles are the keys for
accessing Atoms in the AtomSpace, and once a handle is assigned to an Atom it can't be
changed or reused.
• Atoms have TruthValue and AttentionValue entities associated with them, each of which
are small collections of numbers; there are multiple versions of truth values, with varying
degrees of detail. TruthValues are context-dependent, and useful Atoms will typically have
multiple TruthValues, indexed by context.
• Some Atoms are nodes, and may have names.
• Atoms that are links will have a list of targets, of variable size (as in CogPrime's hypergraph
links may connect more than two nodes).
Some Atom attributes are immutable, such as Node names and, most importantly, Link targets, called outgoing sets in AtomSpace lingo. One can remove a Link, but not change its targets. This enables faster implementation of some neighborhood searches, as well as indexing. Truth and attention values, on the other hand, are mutable, an essential requirement for CogPrime.
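As a rough illustration of this data model (not the actual OpenCog implementation), the following Python sketch shows Atoms with immutable handles and outgoing sets but mutable, context-indexed truth values and attention values; all field names are assumptions for exposition.

```python
from dataclasses import dataclass
from itertools import count

_handles = count(1)  # handles are assigned once and never reused

@dataclass
class TruthValue:
    strength: float = 0.0
    confidence: float = 0.0

class Atom:
    def __init__(self):
        self.handle = next(_handles)     # immutable universal ID
        self.tv = {None: TruthValue()}   # TruthValues indexed by context
        self.sti = 0.0                   # AttentionValue components (mutable)
        self.lti = 0.0

class Node(Atom):
    def __init__(self, name):
        super().__init__()
        self.name = name                 # immutable

class Link(Atom):
    def __init__(self, *targets):
        super().__init__()
        self.outgoing = tuple(targets)   # immutable outgoing set

cat, animal = Node("cat"), Node("animal")
inh = Link(cat.handle, animal.handle)
inh.tv[None] = TruthValue(0.9, 0.8)      # truth values may be updated freely
```

Storing the outgoing set as a tuple enforces the immutability described above: a Link can be removed, but its targets can never be rebound.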
For performance reasons, some types of knowledge have alternative representations. These
alternative representations are necessary for space or speed reasons, but knowledge stored that
way can always be translated back into Atoms in the AtomSpace as needed. So, for instance,
procedures are represented as program trees in a ProcedureRepository, which allows for faster
execution, but the trees can be expanded into a set of Nodes and Links if one wants to do
reasoning on a specific program.
19.3.2 AtomSpace Requirements and Properties
The major high-level requirements for the AtomSpace are the following ones:
• Store Atoms indexed by their immutable AtomHandles as compactly as possible, while still
enabling very efficient modification of the mutable properties of an Atom (TruthValues and
AttentionValues).
• Perform queries as fast as possible.
• Keep the working set of all Atoms currently being used by CogPrime's cognitive dynamics
in RAM.
• Save and restore hypergraphs to disk, a more traditional SQL or non-SQL database, or
other structure such as binary files, XML, etc.
• Hold hypergraphs consisting of billions or trillions of Atoms, scaling up to petabytes of
data.
• Be transparently distributable across a cluster of machines.
The design trade-offs in the AtomSpace implementation are driven by the needs of CogPrime.
The datastore is implemented in a way that maximizes the performance of the cognitive dynam-
ics running on top of it. From this perspective, the AtomSpace differs from most datastores, as
the key decisions aren't made in terms of flexibility, consistency, reliability and other common
criteria for databases. It is a very specialized database. Among the factors that motivate the
AtomSpace's design, we can highlight a few:
1. Atoms tend to be small objects, with very few exceptions (links with many targets or Atoms
with many different context-derived TruthValues).
2. Atom creation and deletion are common events, and occur according to complex patterns
that may vary a lot over time, even for a particular CogPrime instance.
3. Atoms involved in CogPrime's cognitive dynamics at any given time need to live in RAM.
However, the system still needs the ability to save sets of Atoms to disk in order to preserve
RAM, and then retrieve them later when they become contextually relevant.
4. Some Atoms will remain around for a really long time, others will be ephemeral and get
removed shortly after they're created. Removal may be to disk, as outlined above, or plain
deletion.
Besides storing Atoms, the AtomSpace also contains a number of indices for fast Atom re-
trieval according to several criteria. It can quickly search for Atoms given their type, importance,
truth value, arity, targets (for Links), name (for Nodes), and any combination of the above.
These are built-in indexes. The AtomSpace also allows cognitive processes to create their own indexes, based on the evaluation of a Procedure over the universe of Atoms, or a subset of that universe specified by the process responsible for the index.
The AtomSpace also allows pattern matching queries for a given Atom structure template,
which allows for fast search for small subgraphs displaying some desirable properties. In ad-
dition to pattern matching, it provides neighborhood searches. Although it doesn't implement
any graph-traversal primitives, it's easy for cognitive processes to do so on top of the pattern
matching and neighborhood primitives.
Note that, since CogPrime's hypergraph is quite different from a regular graph, using a graph
database without modification would probably be inadequate. While it's possible to automati-
cally translate a hypergraph into a regular graph, that process is expensive for large knowledge
bases, and leads to higher space requirements, reducing the overall system's scalability.
In terms of database taxonomy, the AtomSpace lies somewhere between a key-value store
and a document store, as there is some structure in the contents of each value (an Atom's
properties are well defined, and listed above), but no built-in flexibility to add more contents
to an existing Atom.
We will now discuss the above requirements in more detail, starting with querying the Atom-
Space, followed by persistence to disk, and then handling of specific forms of knowledge that
are best handled by specialized stores.
19.3.3 Accessing the Atomspace
The AtomSpace provides an API, which allows the basic operations of creating new Atoms,
updating their mutable properties, searching for Atoms and removing Atoms. More specifically,
the API supports the following operations:
• Create and store a new Atom. There are special methods for Nodes and Links, in the latter
case with multiple convenience versions depending on the number of targets and other
properties of the link.
• Remove an Atom. This requires the validation that no Links currently point to that Atom,
otherwise they'd be left dangling.
• Look up one or more Atoms. This includes several variants, such as:
- Look up an Atom by AtomHandle;
- Look up a Node by name;
- Find links with an Atom as target;
- Pattern matching, i.e., find Atoms satisfying some predicate, which is designated as a "search criterion" by some cognitive process, and results in the creation of a specific
index for that predicate;
- Neighborhood search, i.e., find Atoms that are within some radius of a given centroid
Atom;
- Find Atoms by type (this can be combined with the previous queries, resulting in type
specific versions);
- Find Atoms by some AttentionValue criteria, such as the top N most important Atoms,
or those with importance above some threshold (can also be combined with previous
queries);
- Find Atoms by some TruthValue criteria, similar to the previous one (can also be
combined with other queries);
- Find Atoms based on some temporal or spatial association, a query that relies on the
specialized knowledge stores mentioned below;
Queries can be combined, and the Atom type, AttentionValue and TruthValue criteria are
often used as filters for other queries, preventing the result set size from exploding.
• Manipulate an Atom, retrieving or modifying its AttentionValue and TruthValue. In the
modification case, this causes the respective indexes to be updated.
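The flavor of these combined, filtered lookups can be suggested by a toy in-memory sketch; the method names and signatures below are hypothetical stand-ins, not the real AtomSpace API.

```python
class AtomSpaceSketch:
    """Minimal in-memory sketch of the lookup API described above."""

    def __init__(self):
        self._by_handle = {}

    def add(self, handle, atom_type, name=None, sti=0.0):
        self._by_handle[handle] = {"type": atom_type, "name": name, "sti": sti}
        return handle

    def get(self, handle):
        # Look up an Atom by AtomHandle
        return self._by_handle.get(handle)

    def find(self, atom_type=None, min_sti=None):
        # Type and AttentionValue criteria act as filters that keep
        # result sets from exploding, as noted in the text.
        out = []
        for h, a in self._by_handle.items():
            if atom_type is not None and a["type"] != atom_type:
                continue
            if min_sti is not None and a["sti"] < min_sti:
                continue
            out.append(h)
        return out

space = AtomSpaceSketch()
space.add(1, "ConceptNode", name="cat", sti=0.7)
space.add(2, "ConceptNode", name="dog", sti=0.1)
space.add(3, "InheritanceLink", sti=0.9)
hot_concepts = space.find(atom_type="ConceptNode", min_sti=0.5)
```

Here the type filter and the importance filter compose in a single query, mirroring how the real API combines criteria.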
19.3.4 Persistence
In many planned CogPrime deployment scenarios, the amount of knowledge that needs to be
stored is too vast to fit in RAM, even if one considers a large cluster of machines hosting the
AtomSpace and the cognitive processes. The AtomSpace must then be able to persist subsets
of that knowledge to disk, and reload them later when necessary.
The decision of whether to keep an Atom in RAM or remove it is made based on its At-
tentionValue, through the process of economic attention allocation that is the topic of Chapter
23. AttentionValue determines how important an Atom is to the system, and there are mul-
tiple levels of importance. For the persistence decisions, the ones that matter are Long Term
Importance (LTI) and Very Long Term Importance (VLTI).
LTI is used to estimate the probability that the Atom will be necessary or useful in the not
too distant future. If this value is low, below a threshold i1, then it is safe to remove the Atom
from RAM, a process called forgetting. When the decision to forget an Atom has been made,
VLTI enters the picture. VLTI is used to estimate the probability that the Atom will be useful
eventually at some distant point in the future. If VLTI is high enough, the forgotten Atom is
persisted to disk so it can be reloaded. Otherwise, the Atom is permanently forgotten.
When an Atom has been forgotten, a proxy is kept in its place. The proxy is more compact
than the original Atom, preserving only a crude measure of its LTI. When the proxy's LTI
increases above a second threshold i2, the system understands that the Atom has become
relevant again, and loads it from disk.
Eventually, it may happen that the proxy doesn't become important enough over a very
long period of time. In this case, the system should remove even the proxy, if its Long Term
Importance (LTI) is below a third threshold i3. Other actions, usually taken by the system
administrator, can cause the removal of Atoms and their proxies from RAM. For instance, in a
CogPrime system managing information about a number of users of some information system, the deletion of a user from the system would cause all that user's specific Atoms to be removed.
When Atoms are saved to disk and have no proxies in RAM, they can only be reloaded
by the system administrator. When reloaded, they will be disconnected from the rest of the
AtomSpace, and should be given special attention in order to pursue the creation of new Links
with the other Atoms in the system.
It's important that the values of i1, i2, and i3 be set correctly. Otherwise, one or more of the
following problems may arise:
• If i1 and i2 are too close, the system may spend a lot of resources saving and loading
Atoms.
• If i1 is set too high, important Atoms will be excluded from the system's dynamics, decreasing its intelligence.
• If i3 is set too high, the system will forget very quickly and will have to spend resources re-creating necessary but no longer available evidence.
• If either i1 or i3 is set too low, the system will consume significantly more resources than it needs to on knowledge storage, sacrificing cognitive processes.
Generally, we want to enforce a degree of hysteresis for the freezing and defrosting process.
What we mean is that:
i2 - i1 > ε1 > 0
i1 - i3 > ε2 > 0
This ensures that when Atoms are reloaded, their importance is still above the threshold for
saving, so they will have a chance to be part of cognitive dynamics and become more important,
and won't be removed again too quickly. It also ensures that saved Atoms stay in the system
for a period of time before their proxies are removed and they're definitely forgotten.
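The threshold logic above can be summarized in a short sketch. The numeric values, the field names, and the VLTI cutoff are invented for illustration; only the ordering i3 < i1 < i2 is taken from the text.

```python
from types import SimpleNamespace

# Illustrative threshold values; only the ordering i3 < i1 < i2 matters.
I1, I2, I3 = 0.4, 0.7, 0.1
VLTI_SAVE = 0.5  # invented VLTI cutoff for saving to disk

def on_low_lti(atom):
    """Forget/freeze decision for an in-RAM Atom (fields are hypothetical)."""
    if atom.lti >= I1:
        return "keep in RAM"
    if atom.vlti > VLTI_SAVE:      # worth keeping for the distant future
        return "save to disk, keep proxy"
    return "delete permanently"

def on_proxy_update(proxy_lti):
    """Decision for a forgotten Atom's in-RAM proxy."""
    if proxy_lti > I2:
        return "reload from disk"
    if proxy_lti < I3:
        return "drop proxy"
    return "leave proxy"

decision = on_low_lti(SimpleNamespace(lti=0.2, vlti=0.9))
```

Because I2 exceeds I1 by a margin, a just-reloaded Atom sits comfortably above the forgetting threshold, which is exactly the hysteresis the inequalities enforce.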
Another important consideration is that forgetting individual Atoms makes little sense, be-
cause, as pointed out above, Atoms are relatively small objects. So the forgetting process should
prioritize the removal of clusters of highly interconnected Atoms whenever possible. In that case,
it's possible that a large subset of those Atoms will only have relations within the cluster, so their proxies aren't needed and the memory savings are maximized.
19.3.5 Specialized Knowledge Stores
Some specific kinds of knowledge are best stored in specialized data structures, which allow
big savings in space, query time, or both. The information provided by these specialized stores
isn't as flexible as it would be if the knowledge were stored in full fledged Node and Link form,
but most of the time CogPrime doesn't need the fully flexible format. Translation between the
specialized formats and Nodes and Links is always possible, when necessary.
We note that the ideal set of specialized knowledge stores is application domain specific.
The stores we have deemed necessary reflect the preschool-based roadmap towards AGI, and are likely sufficient to get us through most of that roadmap, but not sufficient nor particularly
adequate for an architecture where self-modification plays a key role. These specialized stores
are a pragmatic compromise between performance and formalism, and their existence and design
would need to be revised once CogPrime is mostly functional.
19.3.5.1 Procedure Repository
Procedural knowledge, meaning knowledge that can be used both for the selection and execution
of actions, has a specialized requirement - this knowledge needs to be executable by the system.
While it will be possible, and conceptually straightforward, to execute a procedure that is stored
as a set of Atoms in the AtomSpace, it is much simpler, faster, and safer to rely on a specialized
repository.
Procedural knowledge in CogPrime is stored as programs in a special-purpose LISP-like
programming language called Combo. The motivation and details of this language are the
subject of Chapter 21.
Each Combo program is associated with a Node (a GroundedProcedureNode, to be more
precise), and the AtomHandle of that Node is used to index the procedure repository, where
the executable version of the program is kept, along with specifications of the necessary inputs
for its evaluation and what kind of output to expect. Combo programs can also be saved to
disk and loaded, like regular Atoms. There is a text representation of Combo for this purpose.
Program execution can be very fast, or, in cognitive dynamics terms, very slow, if it involves
interacting with the external world. Therefore, the procedure repository should also facilitate
the storage of program states during the execution of procedures. Concurrent execution of many
procedures is possible with no significant overhead.
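A minimal sketch of such a repository, keyed by AtomHandle, might look as follows; the Python lambda stands in for a compiled Combo program, and all names and signatures are illustrative rather than the actual OpenCog interface.

```python
class ProcedureRepository:
    """Hypothetical store of executable procedures, indexed by the
    AtomHandle of their GroundedProcedureNode."""

    def __init__(self):
        self._procs = {}  # handle -> (callable, input spec, output type)

    def register(self, handle, fn, input_types, output_type):
        self._procs[handle] = (fn, input_types, output_type)

    def execute(self, handle, *args):
        fn, input_types, _ = self._procs[handle]
        if len(args) != len(input_types):
            raise ValueError("wrong number of inputs")
        return fn(*args)

repo = ProcedureRepository()
# A Python lambda stands in for a compiled Combo program:
repo.register(42, lambda x, y: x + y, ("number", "number"), "number")
result = repo.execute(42, 2, 3)
```

Storing the input specification alongside the callable mirrors the text's point that the repository keeps, with each program, the inputs it needs and the kind of output to expect.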
19.3.5.2 3D Space Map
In the AGI Preschool setting, CogPrime is embodied in a three-dimensional world (either a real
one, in which it controls a robot, or a virtual one, in which it controls an avatar). This requires
the efficient storage and querying of vast amounts of spatial data, including very specialized
queries about the spatial interrelationship between entities. This spatial data is a key form of
knowledge for CogPrime's world perception, and it also needs to be accessible during learning,
action selection, and action execution.
All spatial knowledge is stored in a 3D Space Map, which allows for fast queries about specific
regions of the world, and for queries about the proximity and relative placement of objects and
entities. It can be used to provide a coarse-grained object level perception for the AtomSpace,
or it can be instrumental in supporting a lower level vision layer in which pixels or polygons
are used as the units of perception. In both cases, the knowledge stored in the 3D Space Map
can be translated into full-fledged Atoms and Links through the AtomHandles.
One characteristic feature of spatial perception is that vast amounts of data are generated
constantly, but most of it is very quickly forgotten. The mind abstracts the perceptual data
into the relevant concepts, which are linked with other Atoms, and most of the underlying
information can then be discarded. The process is repeated at a high frequency as long as
something novel is being perceived in the world. The 3D Space Map is thus optimized for quick inserts and deletes.
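One simple structure with these properties is a spatial hash, sketched below. This is an illustrative toy, not the actual 3D Space Map implementation; the proximity query uses a linear scan for brevity, where a real implementation would use the grid cells to restrict candidates.

```python
from collections import defaultdict

class SpaceMap3D:
    """Toy spatial hash: cheap inserts/deletes, coarse proximity lookups."""

    def __init__(self, cell=1.0):
        self.cell = cell
        self._grid = defaultdict(set)  # cell coordinates -> object ids
        self._pos = {}                 # object id -> position

    def _key(self, p):
        return tuple(int(c // self.cell) for c in p)

    def insert(self, obj_id, pos):
        self.remove(obj_id)            # updates are just delete + insert
        self._pos[obj_id] = pos
        self._grid[self._key(pos)].add(obj_id)

    def remove(self, obj_id):
        if obj_id in self._pos:
            self._grid[self._key(self._pos.pop(obj_id))].discard(obj_id)

    def near(self, pos, radius):
        # Linear scan for brevity; the grid would restrict candidates.
        r2 = radius * radius
        return {o for o, p in self._pos.items()
                if sum((a - b) ** 2 for a, b in zip(p, pos)) <= r2}

m = SpaceMap3D()
m.insert("ball", (1.0, 0.0, 0.0))
m.insert("teacher", (5.0, 0.0, 0.0))
close = m.near((0.0, 0.0, 0.0), 2.0)
m.remove("ball")                       # perceptual data is quickly discarded
```

Both insert and remove are constant-time hash operations, matching the requirement that most perceptual data be written and forgotten at high frequency.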
19.3.5.3 Time Server
Similarly to spatial information, temporal information poses challenges for a hypergraph-based
storage. It can be much more compactly stored in specific data structures, which also allow
for very fast querying. The Time Server is the specialized structure for storing and querying
temporal data in CogPrime.
Temporal information can be stored by any cognitive process, based on its own criteria for
determining that some event should be remembered in a specific temporal context in the future.
This can include the perception of specific events, or the agent's participation in them, such as
the first time it meets a new human teacher. It can also include a collection of concepts describing
specific contexts in which a set of actions has been particularly useful. The possibilities are
numerous, but from the Time Server perspective, all equivalent. They add up to associating a
time point or time interval with a set of Atoms.
The Time Server is a bi-directional storage, as AtomHandles can be used as keys, but also
as objects indexed by time points or time intervals. In the former case, the Time Server tells
us when an Atom was associated with temporal data. In the latter case, it tells us, for a given
time point or interval, which Atoms have been marked as relevant.
Temporal indexing can be based on time points or time intervals. A time point can be at any
granularity: from years to sub-seconds could be useful. A time interval is simply a set of two
points, the second necessarily after the first one, but their granularities not necessarily the
same. The temporal indexing inside the Time Server is hierarchical, so one can query for time
points or intervals in granularities other than the ones originally used when the knowledge was
first stored.
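The bidirectional indexing described above can be sketched as follows; the interface names are hypothetical, and a real Time Server would use hierarchical interval structures rather than a flat event list.

```python
from collections import defaultdict

class TimeServer:
    """Bidirectional temporal index sketch: atom -> intervals, and
    time query -> atoms."""

    def __init__(self):
        self._by_atom = defaultdict(list)  # handle -> [(start, end)]
        self._events = []                  # (start, end, handle)

    def record(self, handle, start, end=None):
        if end is None:
            end = start                    # a time point as a degenerate interval
        self._by_atom[handle].append((start, end))
        self._events.append((start, end, handle))

    def when(self, handle):
        """When was this Atom associated with temporal data?"""
        return self._by_atom[handle]

    def during(self, start, end):
        """Which Atoms are marked relevant to [start, end]?"""
        return {h for s, e, h in self._events if s <= end and e >= start}

ts = TimeServer()
ts.record("meet-teacher", 100)             # a time point
ts.record("build-tower", 90, 120)          # a time interval
relevant = ts.during(95, 105)
```

The two query directions correspond to the two uses named in the text: `when` answers "when was this Atom relevant?", and `during` answers "what was relevant then?".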
19.3.5.4 System Activity Table Set
The last relevant specialized store is the System Activity Table Set, which is described in more
detail in Chapter 23. This set of tables records, with fine-grained temporal associations, the most
important activities that take place inside CogPrime. There are different tables for recording
cognitive process activity (at the level of MindAgents, to be described in the next section), for
maintaining a history of the level of achievement of each important goal in the system, and for
recording other important aspects of the system state, such as the most important Atoms and
contexts.
19.4 MindAgents: Cognitive Processes
The AtomSpace holds the system's knowledge, but those Atoms are inert. How is that knowledge
used and useful? That is the province of cognitive dynamics. These dynamics, in a CogPrime
system, can be considered on two levels.
First, we have the cognitive processes explicitly programmed into CogPrime's source code.
These are what we call Concretely-Implemented Mind Dynamics, or CIM-Dynamics. Their
implementation in software happens through objects called MindAgents. We use the term CIM-
Dynamic to discuss a conceptual cognitive process, and the term MindAgents for its actual
implementation and execution dynamics.
The second level corresponds to the dynamics that emerge through the system's self-
organizing dynamics, based on the cooperative activity of the CIM-Dynamics on the shared
AtomSpace.
Most of the material in the following chapters is concerned with particular CIM-Dynamics
in the CogPrime system. In this section we will simply give some generalities about the CIM-
Dynamics as abstract processes and as software processes, which are largely independent of the
actual AI contents of the CIM-Dynamics. In practice the CIM-Dynamics involved in a CogPrime
system are fairly stereotyped in form, although diverse in the actual dynamics they induce.
19.4.1 A Conceptual View of CogPrime Cognitive Processes
We return now to the conceptual trichotomy of cognitive processes presented in Chapter 3 of
Part 1, according to which CogPrime cognitive processes may be divided into:
• Control processes;
• Global cognitive processes;
• Focused cognitive processes.
In practical terms, these may be considered as three categories of CIM-Dynamic.
Control Process CIM-Dynamics are hard to stereotype. Examples are the process of homeo-
static parameter adaptation of the parameters associated with the various other CIM-Dynamics,
and the CIM-Dynamics concerned with the execution of procedures, especially those whose ex-
ecution is made lengthy by the interactions with the external world.
Control Processes tend to focus on a limited and specialized subset of Atoms or other entities,
and carry out specialized mechanical operations on them (e.g. adjusting parameters, interpreting
procedures). To an extent, this may be considered a "grab bag" category containing CIM-
Dynamics that are not global or focused cognitive processes according to the definitions of the
latter two categories. However, it is a nontrivial observation about the CogPrime system that
the CIM-Dynamics that are not global or focused cognitive processes are all explicitly concerned
with system control in some way or another, so this grouping makes sense.
Global and Focused Cognitive Process CIM-Dynamics all have a common aspect to their
structure. Then, there are aspects in which Global versus Focused CIM-Dynamics diverge from
each other in stereotyped ways.
In most cases, the process undertaken by a Global or Focused CIM-Dynamic involves two
parts: a selection process and an actuation process. Schematically, such a CIM-Dynamic typi-
cally looks something like this:
1. Fetch a set of Atoms that it is judged will be useful to process, according to some selection
process.
2. Operate on these Atoms, possibly together with previously selected ones (this is what we
sometimes call the actuation process of the CIM-Dynamic).
3. Go back to step 1.
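The schematic loop above can be written down directly. The selector shown is a Focused-style, importance-based one; everything here is an illustrative toy rather than actual CogPrime code.

```python
def run_cim_dynamic(atomspace, select, act, steps):
    """Generic skeleton: repeatedly select Atoms, then operate on them.
    A Global dynamic uses a broad selector; a Focused one, a narrow one."""
    for _ in range(steps):
        chosen = select(atomspace)   # step 1: the selection process
        act(chosen)                  # step 2: the actuation process

# A Focused-style selector: the few highest-importance Atoms.
def top_importance(space, n=2):
    return sorted(space, key=lambda a: a["sti"], reverse=True)[:n]

space = [{"name": "cat", "sti": 0.9},
         {"name": "dog", "sti": 0.2},
         {"name": "rug", "sti": 0.5}]
processed = []
run_cim_dynamic(space,
                select=top_importance,
                act=lambda atoms: processed.extend(a["name"] for a in atoms),
                steps=1)
```

Swapping `top_importance` for a selector that returns the whole space would turn this Focused dynamic into a Global one, at which point the `act` function would have to be correspondingly cheap.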
The major difference between Global and Focused cognitive processes lies in the selection
process. In the case of a Global process, the selection process is very broad, sometimes yielding the whole AtomSpace, or a significant subset of it. This means that the actuation process must be very simple, or the activation of this CIM-Dynamic must be very infrequent.
On the other hand, in the case of a Focused process, the selection process is very narrow,
yielding only a small number of Atoms, which can then be processed more intensively and
expensively, on a per-Atom basis.
Common selection processes for Focused cognitive processes are fitness-oriented selectors,
which pick one or a set of Atoms from the AtomSpace with a probability based on some nu-
merical quantity associated with the atom, such as properties of TruthValue or AttentionValue.
There are also more specific selection processes, which choose for example Atoms obeying
some particular combination of relationships in relation to some other Atoms; say choosing only
Atoms that inherit from some given Atom already being processed. There is a notion, described
in the PLN book, of an Atom Structure Template; this is basically just a predicate that applies
to Atoms, such as
P(X).tv
equals
((InheritanceLink X cat) AND (EvaluationLink eats (X, cheese))).tv
which is a template that matches everything that inherits from cat and eats cheese. Templates
like this allow a much more refined selection than the above fitness-oriented selection process.
Selection processes can be created by composing a fitness-oriented process with further
restrictions, such as templates, or simpler type-based restrictions.
19.4.2 Implementation of MindAgents
MindAgents follow a very simple design. They need to provide a single method through which
they can be enacted, and they should execute their actions in atomic, incremental steps, where
each step should be relatively quick. This design enables collaborative scheduling of MindAgents,
at the cost of allowing "opportunistic" agents to have more than their fair share of resources.
We rely on CogPrime developers to respect the above guidelines, instead of trying to enforce
exact resource allocations on the software level.
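The contract described above - one enactment method, executed in small atomic increments - can be sketched as follows; the class and method names are our own illustrations, not the real API:

```python
class MindAgent:
    """Sketch of the single-method MindAgent contract: each call to
    run_step() must do one small, quick unit of work, so a cooperative
    scheduler can interleave many agents."""

    def run_step(self, atomspace):
        raise NotImplementedError

class CountingAgent(MindAgent):
    def __init__(self):
        self.steps_done = 0

    def run_step(self, atomspace):
        self.steps_done += 1   # stand-in for one small increment of work

# Collaborative scheduling: round-robin enactment of short steps.
agents = [CountingAgent(), CountingAgent()]
for _ in range(3):
    for agent in agents:
        agent.run_step(atomspace=None)
```

An agent whose run_step() runs long would get more than its fair share of resources, which is exactly the "opportunistic" failure mode the guideline asks developers to avoid.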
Each MindAgent can have a set of system parameters that guide its behavior. For instance, a
MindAgent dedicated to inference can provide drastically different conclusions if its parameters
tell it to select a small set of Atoms for processing each time, but to spend significant time on
each Atom, rather than selecting many Atoms and doing shallow inferences on each one. It's
expected that multiple copies of the same MindAgent will exist in the cluster, but delivering
different dynamics thanks to those parameters.
In addition to their main action method, MindAgents can also communicate with other
MindAgents through message queues. CogPrime has, in its runtime configuration, a list of
available MindAgents and their locations in the cluster. Communications between MindAgents
typically take the form of specific, one-time requests, which we call Tasks.
The default action of MindAgents and the processing of Tasks constitute the cognitive dy-
namics of CogPrime. Nearly everything that takes place within a CogPrime deployment is done
by either a MindAgent (including the control processes), a Task, or specialized code handling
AtomSpace internals or communications with the external world. We now talk about how those
dynamics are scheduled.
MindAgents live inside a process called a CogPrime Unit. One machine in a CogPrime cluster
can contain one or more Units, and one Unit can contain one or more MindAgents. In practice,
given the way the AtomSpace is distributed, which requires a control process in each machine, it
typically makes more sense to have a single Unit per machine, as this enables all MindAgents in
that machine to make direct function calls to the AtomSpace, instead of using more expensive
inter-process communication.
There are exceptions to the above guideline, to accommodate various situations:
1. Very specific MindAgents may not need to communicate with other agents, or only do so
very rarely, so it makes sense to give them their own process.
2. MindAgents whose implementation is a poor fit for the collaborative processing in small
increments design described above also should be given their own process, so they don't
interfere with the overall dynamics in that machine.
3. MindAgents whose priority is either much higher or much lower than that of other agents in
the same machine should be given their own process, so operating system-level scheduling
can be relied upon to reflect those very different priority levels.
19.4.3 Tasks
It is not convenient for CogPrime to do all its work directly via the action of MindAgent objects
embodying CIM-Dynamics. This is especially true for MindAgents embodying focused cognitive
processes. These have their selection algorithms, which are ideally suited to guarantee that, over
the long run, the right Atoms get selected and processed. This, however, doesn't address the
issue that, on many occasions, it may be necessary to quickly process a specific set of Atoms
in order to execute an action or rapidly respond to some demand. These actions tend to be
one-time, rather than the recurring patterns of mind dynamics.
While it would be possible to design MindAgents so that they could both cover their long
term processing needs and rapidly respond to urgent demands, we found it much simpler to
augment the MindAgent framework with an additional scheduling mechanism that we call
the Task framework. In essence, this is a ticketing system, designed to handle cases where
MindAgents or Schema spawn one-off tasks to be executed - things that need to be done only
once, rather than repeatedly and iteratively as with the things embodied in MindAgents.
For instance, "grab the most important Atoms from the AtomSpace and do shallow PLN rea-
soning to derive immediate conclusions from them" is a natural job for a MindAgent. But "do a
search to find entities that satisfy this particular predicate P" is a natural job for a Task.
Tasks have AttentionValues and target MindAgents. When a Task is created it is submitted
to the appropriate Unit and then put in a priority queue. The Unit will schedule some resources
to processing the more important Tasks, as we'll see next.
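A minimal sketch of such a priority queue of Tasks, built on Python's heapq (the field names are our own, not the actual Task structure):

```python
import heapq
import itertools

class TaskQueue:
    """Sketch of the Task ticketing mechanism: one-off requests targeted
    at a MindAgent, ordered by an AttentionValue-style priority."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # FIFO tie-break for equal priorities

    def submit(self, priority, target_agent, payload):
        # heapq is a min-heap, so negate priority to pop the most
        # important Task first.
        heapq.heappush(self._heap,
                       (-priority, next(self._tie), target_agent, payload))

    def pop_most_important(self):
        _, _, agent, payload = heapq.heappop(self._heap)
        return agent, payload

q = TaskQueue()
q.submit(0.2, "PLNAgent", "shallow inference on recent percepts")
q.submit(0.9, "PLNAgent", "find entities satisfying predicate P")
agent, payload = q.pop_most_important()
```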
19.4.4 Scheduling of MindAgents and Tasks in a Unit
Within each Unit we have one or more MindAgents, a Task queue and, optionally, a subset of the
distributed AtomSpace. If that subset isn't held in the unit, it's held in another process running
on the same machine. If there is more than one Unit per machine, their relative priorities are
handled by the operating system's scheduler.
In addition to the Units, CogPrime has an extra maintenance process per machine, whose
job is to handle changes in those priorities as well as reconfigurations caused by MindAgent
migration, and machines joining or leaving the CogPrime cluster.
So, at the Unit level, attention allocation in CogPrime has two aspects: how MindAgents
and Tasks receive attention from CogPrime, and how Atoms receive attention from different
MindAgents and Tasks. The topic of this Section is the former. The latter is dealt with elsewhere,
in two ways:
• in Chapter 23, which discusses the dynamic updating of the AttentionValue structures
associated with Atoms, and how these determine how much attention the various focused
cognitive-process MindAgents pay to them.
• in the discussion of various specific CIM-Dynamics, each of which may make choices of
which Atoms to focus on in its own way (though generally making use of AttentionValue
and TruthValue in doing so).
The attention allocation subsystem is also pertinent to MindAgent scheduling, because it
discusses dynamics that update ShortTermImportance (STI) values associated with
MindAgents, based on the usefulness of MindAgents for achieving system goals. In this chapter, we
will not enter into such cognitive matters, but will merely discuss the mechanics by which these
STI values are used to control processor allocation to MindAgents.
Each instance of a MindAgent has its own AttentionValue, which is used to schedule processor
time within the Unit. That scheduling is done by a Scheduler object which controls a collection
of worker threads, whose size is a system parameter. The Scheduler aims to allocate worker
threads to the MindAgents in a way that's roughly proportional to their STI, but it needs to
account for starvation, as well as the need to process the Tasks in the task queue.
This is an area in which we can safely borrow from reasonably mature computer science
research. The requirements of cognitive dynamics scheduling are far from unique, so this is not
a topic where new ideas need to be invented for OpenCog; rather, designs need to be crafted
meeting CogPrime's specific requirements based on state-of-the-art knowledge and experience.
One example scheduler design has two important inputs: the STI associated with each
MindAgent, and a parameter determining how much resources should go to the MindAgents vs
the Task queue. In the CogPrime implementation, the Scheduler maps the MindAgent STIs to
a set of priority queues, and each queue is run a number of times per cycle. Ideally one wants
to keep the number of queues small, and rely on multiple Units and the OS-level scheduler to
handle widely different priority levels.
When the importance of a MindAgent changes, one just has to reassign it to a new queue,
which is a cheap operation that can be done between cycles. MindAgent insertions and removals
are handled similarly.
Finally, Task execution is currently handled via allocating a certain fixed percentage of pro-
cessor time, each cycle, to executing the top Tasks on the queue. Adaptation of this percentage
may be valuable in the long term but was not yet implemented.
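One way to render this division of a cycle's processor budget - a fixed Task fraction, plus STI binned into a small set of priority queues - is the following sketch; the binning policy and all numbers are illustrative assumptions, not the actual scheduler:

```python
def allocate_cycle(agent_sti, n_queues, task_fraction, budget):
    """Sketch of one scheduler cycle: a fixed fraction of the budget goes
    to the Task queue; the rest is split over MindAgents roughly in
    proportion to STI, via a small number of priority queues."""
    task_share = budget * task_fraction
    agent_budget = budget - task_share
    top = max(agent_sti.values())
    # Map each agent's STI onto one of n_queues priority levels (1..n_queues).
    queue_of = {a: 1 + int((n_queues - 1) * s / top)
                for a, s in agent_sti.items()}
    weight = sum(queue_of.values())
    # Each agent's share is proportional to its queue's priority level.
    return task_share, {a: agent_budget * q / weight
                        for a, q in queue_of.items()}

task_share, shares = allocate_cycle(
    {"pln": 90.0, "moses": 30.0, "forget": 10.0},
    n_queues=3, task_fraction=0.25, budget=100.0)
```

When a MindAgent's STI changes, only its queue assignment needs recomputing, which mirrors the cheap between-cycle reassignment described above.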
Control processes are also implemented as MindAgents, and processed in the same way as
the other kinds of CIM-Dynamics, although they tend to have fairly low importance.
19.4.5 The Cognitive Cycle
We have mentioned the concept of a "cycle" in the discussion about scheduling, without ex-
plaining what we mean. Let's address that now. All the Units in a CogPrime cluster are kept
in sync by a global cognitive cycle, whose purpose is described in Section II.
We mentioned above that each machine in the CogPrime cluster has a housekeeping process.
One of its tasks is to keep track of the cognitive cycle, broadcasting when the machine has
finished its cycle, and listening to similar broadcasts from its counterparts in the cluster. When
all the machines have completed a cycle, a global counter is updated, and each machine is then
free to begin the next cycle.
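Using threads as stand-ins for machines, the barrier-style synchronization just described can be sketched as:

```python
import threading

def run_cycle(n_machines, work):
    """Sketch of one global cognitive cycle: every machine announces
    completion and waits; once all have finished, the shared cycle
    counter advances and all are freed for the next cycle."""
    counter = {"cycle": 0}
    barrier = threading.Barrier(
        n_machines,
        action=lambda: counter.__setitem__("cycle", counter["cycle"] + 1))

    def machine(i):
        work(i)          # this machine's MindAgent processing for the cycle
        barrier.wait()   # broadcast "done" and wait for stragglers

    threads = [threading.Thread(target=machine, args=(i,))
               for i in range(n_machines)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["cycle"]

done = []
cycle = run_cycle(4, work=done.append)  # list.append is thread-safe in CPython
```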
One potential annoyance with this global cognitive cycle is that some machines may complete
their cycle much faster than others, and then sit idly while the stragglers finish their jobs.
CogPrime addresses this issue in two ways:
• Over the long run, a load balancing process will assign MindAgents from overburdened
machines to underutilized ones. The MindAgent migration process is described in the next
section.
• In a shorter time horizon, during which a machine's configuration is fixed, there are two
heuristics to minimize the waste of processor time without breaking the overall cognitive
cycle coordination:
- The Task queue in each of the machine's Units can be processed more extensively than
it would by default; in extreme cases, the machine can go through the whole queue.
- Background process MindAgents can be given extra activations, as their activity is
unlikely to throw the system out of sync, unlike with more focused and goal-oriented
processes.
Both heuristics are implemented by the scheduler inside each unit, which has one boolean
trigger for each heuristic. The triggers are set by the housekeeping process when it observes
that the machine has been frequently idle over the recent past, and then reset if the situation
changes.
19.5 Distributed AtomSpace and Cognitive Dynamics
As hinted above, realistic CogPrime deployments will be spread around reasonably large clus-
ters of co-located machines. This section describes how this distributed deployment scenario
is planned for in the design of the AtomSpace and the MindAgents, and how the cognitive
dynamics take place in such a scenario.
We won't review the standard principles of distributed computing here, but we will focus
on specific issues that arise when CogPrime is spread across a relatively large number of
machines. The two key issues that need to be handled are:
• How to distribute knowledge (i.e., the AtomSpace) in a way that doesn't impose a large
performance penalty?
• How to allocate resources (i.e., machines) to the different cognitive processes (MindAgents)
in a way that's flexible and dynamic?
19.5.1 Distributing the AtomSpace
The design of a distributed AtomSpace was guided by the following high level requirements:
1. Scale up, transparently, to clusters of dozens to hundreds of machines, without requiring a
single central master server.
2. The ability to store portions of an Atom repository on a number of machines in a cluster,
where each machine also runs some MindAgents. The distribution of Atoms across the
machines should benefit from the fact that the cognitive processes on one machine are likely
to access local Atoms more often than remote ones.
3. Provide transparent access to all Atoms in RAM to all machines in the cluster, even if at
different latency and performance levels.
4. For local access to Atoms in the same machine, performance should be as close as possible
to what one would have in a similar, but non-distributed AtomSpace.
5. Allow multiple copies of the same Atom to exist in different machines of the cluster, but
only one copy per machine.
6. As Atoms are updated fairly often by cognitive dynamics, provide a mechanism for even-
tual consistency. This mechanism needs not only to propagate changes to the Atoms, but
sometimes to reconcile incompatible changes, such as when two cognitive processes update
an Atom's TruthValue in opposite ways. Consistency is less important than efficiency, but
should be guaranteed eventually.
7. Resolution of inconsistencies should be guided by the importance of the Atoms involved, so
the more important ones are more quickly resolved.
8. System configuration can explicitly order the placement of some Atoms to specific machines,
and mark a subset of those Atoms as immovable, which should ensure that local copies are
always kept.
9. Atom placement across machines, aside from the immovable Atoms, should be dynamic,
rebalancing based on frequency of access to the Atom by the different machines.
The first requirement follows obviously from our estimates of how many machines CogPrime
will require to display advanced intelligence.
The second requirement above means that we don't have two kinds of machines in the
cluster, where some are processing servers and some are database servers. Rather, we prefer
each machine to store some knowledge and host some processes acting on that knowledge.
This design assumes that there are simple heuristic ways to partition the knowledge across the
machines, resulting in allocations that, most of the time, give the MindAgents local access to
the Atoms they need most often.
Alas, there will always be some cases in which a MindAgent needs an Atom that isn't
available locally. In order to keep the design of the MindAgents simple, this leads to the third
requirement, transparency, and to the fourth one, performance.
This partition design, on the other hand, means that there must be some replication of
knowledge, as there will always be some Atoms that are needed often by MindAgents on differ-
ent machines. This leads to requirement five (allow redundant copies of an Atom). However, as
MindAgents frequently update the mutable components of Atoms, requirements six and seven
are needed, to minimize the impact of conflicts on system performance while striving to guar-
antee that conflicts are eventually solved, and with priority proportional to the importance of
the impacted Atoms.
19.5.1.1 Mechanisms of Managing Distributed Atomspaces
When one digs into the details of distributed AtomSpaces, a number of subtleties emerge. Going
into these in full detail here would not be appropriate, but we will make a few comments, just
to give a flavor of the sorts of issues involved.
To discuss these issues clearly, some special terminology is useful. In this context, it is useful to
reserve the word "Atom" for its pure, theoretical definition, viz: "a Node is uniquely determined
by its name. A Link is uniquely determined by its outgoing set". Atoms sitting in RAM may then
be called "Realized Atoms". Thus, given a single, pure "abstract/theoretical" Atom, there might
be two different Realized Atoms, on two different servers, having the same name/outgoing-set.
It's OK to think of a RealizedAtom as a clone of the pure, abstract Atom, and to talk about
it that way. Analogously, we might call atoms living on disk, or flying on a wire, "Serialized
Atoms"; and, when need be, use specialized terms like "ZMQ-serialized atoms", or " BerkeleyDB-
serialized Atoms", etc.
An important and obvious coherency requirement is: "If a MindAgent asks for the Handle of
an Atom at time A, and then asks, later on, for the Handle of the same Atom, it should receive
the same Handle."
By the "AtomSpace", in general, we mean the container(s) that are used to store the set of
Atoms used in an OpenCog system, both in RAM and on disk. In the case of an Atom space
that is distributed across multiple machines or other data stores, we will call each of these an
"Atom space portion"
Atoms and Handles
Each OpenCog Atom is associated with a Handle object, which is used to identify the Atom
uniquely. The Handle is a sort of "key" used, at the infrastructure level, to compactly identify
the Atom. In a single-machine, non-distributed Atomspace, one can effectively just use long
ints as Handles, and assign successive ints as Handles to successively created new Atoms. In a
distributed Atomspace, it's a little subtler. Perhaps the cleanest approach in this case is to use
a hash of the serialized Atom data as the handle for an Atom. That way, if an Atom is created
in any portion, it will inherently have the same handle as any of its clones.
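This hashing scheme is easy to sketch; the serialization format below is an arbitrary stand-in for whatever canonical format would actually be used:

```python
import hashlib

def global_handle(atom_type, name, outgoing=()):
    """Derive a global Handle as a hash of the serialized Atom data, so a
    clone created in any portion inherently gets the same handle."""
    serialized = "|".join([atom_type, name] + [str(h) for h in outgoing])
    return hashlib.sha256(serialized.encode()).hexdigest()

# The same abstract Atom, realized on two different servers, hashes
# to the same Handle; a different Atom hashes elsewhere.
h1 = global_handle("ConceptNode", "cat")
h2 = global_handle("ConceptNode", "cat")
h3 = global_handle("ConceptNode", "dog")
```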
The issue of Handle collisions then occurs - it is possible, though it will be rare, that two
different Atoms will be assigned the same Handle via the hashing function. This situation can
be identified via checking, when an Atom is imported into a portion, whether there is already
some Atom in that portion with the same Handle but different fundamental aspects. In the rare
occasion where this situation does occur, one of the Atoms must then have its Handle changed.
Changing an Atom's handle everywhere it's referenced in RAM is not a big deal, so long as it
only happens occasionally. However, some sort of global record of Handle changes should be
kept, to avoid confusion in the process of loading saved Atoms from disk. If a loaded Atomspace
contains Atoms that have changed Handle since the file was saved, the Atom loading process
needs to know about this.
The standard mathematics of hash-function collisions shows that if one has a space of H
possible Handles, one will get two Atoms with the same Handle after about 1.25 × √H tries, on
average. Rearranging this, it means we'd need a space of around N^2 Handles to have a space
of Handles for N possible Atoms in which one collision would occur on average. So to have an
expectation of one collision, for N possible Atoms, one would have to use a handle range up to
N^2. The number of bits needed to encode N^2 is twice as many as the number needed to encode
N. So, if one wants to minimize collisions, one may need to make Handles twice as long, thus
taking up more memory.
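The arithmetic here is just the birthday-problem estimate, which we can check numerically:

```python
import math

def expected_tries_to_collision(handle_space):
    # Birthday-problem approximation: about 1.25 * sqrt(H) insertions
    # before two Atoms share a Handle, for H equally likely Handle values.
    return 1.25 * math.sqrt(handle_space)

# With a Handle space of N^2 for N Atoms, the first expected collision
# arrives only after more than N insertions, at the cost of Handles
# twice as long (in bits) as needed to enumerate the N Atoms.
n_atoms = 10**6
tries = expected_tries_to_collision(n_atoms**2)   # 1.25 million
```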
However, this memory cost can be palliated via introducing "local Handles" separate from the
global, system-wide Handles. The local Handles are used internally within each local Atomspace,
and then each local Atomspace contains a translation table going back and forth between local
and global Handles. Local handles may be long ints, allocated sequentially to each new Atom
entered into a portion. Persistence to disk would always use the global Handles.
To understand the memory tradeoffs involved in these solutions, assume that the global
Handles are k times as long as the local Handles, and suppose that the average Handle
occurs r times in the local Atomspace. Then the memory inflation ratio of the local/global
solution, as opposed to a solution using only the shorter local handles, would be

(1 + k + r)/r = 1 + (k + 1)/r

If k=2 and r=10 (each handle is used 10 times on average, which is realistic based on current
real-world OpenCog Atomspaces), then the ratio is just 1.3x - suggesting that using hash codes
for global Handles, and local Handles to save memory in each local AtomSpace, is acceptable
memory-wise.
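The ratio is easy to verify with the numbers from the text:

```python
def inflation_ratio(k, r):
    """Memory inflation of the local+global Handle scheme relative to
    short local Handles alone: the r occurrences each store a local
    handle (cost r), plus one translation-table entry holding one global
    handle (cost k) and one local handle (cost 1)."""
    return (1 + k + r) / r

ratio = inflation_ratio(k=2, r=10)   # the worked example: 1.3x
```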
19.5.1.2 Distribution of Atoms
Given the goal of maximizing the probability that an Atom will be local to the machines of the
MindAgents that need it, the two big decisions are how to allocate Atoms to machines, and
then how to reconcile the results of MindAgents actuating on those Atoms.
The initial allocation of Atoms to machines may be done via explicit system configuration,
for Atoms known to have different levels of importance to specific MindAgents. That is, after
all, how MindAgents are initially allocated to machines as well.
One may, for instance, create a CogPrime cluster where one machine (or group) focuses on
visual perception, one focuses on language processing, one focuses on abstract reasoning, etc.
In that case one can hard-wire the location of Atoms.
What if one wants to have three abstract-reasoning machines in one's cluster? Then one
can define an abstract-reasoning zone consisting of three Atom repository portions. One can
hard-wire that Atoms created by MindAgents in the zone must always remain in that zone -
but can potentially be moved among different portions within that zone, as well as replicated
across two or all of the machines, if need be. By default they would still initially be placed in
the same portion as the MindAgent that created them.
However Atoms are initially placed in portions, sometimes it will be appropriate to move
them. And sometimes it will be appropriate to clone an Atom, so there's a copy of it in a
different portion from where it exists. Various algorithms could work for this, but the following
is one simple mechanism:
• When an Atom A in machine M1 is requested by a MindAgent in machine M2, then a clone
of A is temporarily created in M2.
• When an Atom is forgotten (due to low LTI), then a check is made if it has any clones, and
any links to it are changed into links to its clones.
• The LTI of an Atom may get a boost if that Atom has no clones (the amount of this boost
is a parameter that may be adjusted).
19.5.1.3 MindAgents and the Distributed AtomSpace
In the context of a distributed AtomSpace, the interactions between MindAgents and the knowl-
edge store become subtler, as we'll now discuss.
When a MindAgent wants to create an Atom, it will make this request of the local AtomSpace
process, which hosts a subset of the whole AtomSpace. It can, on Atom creation, specify whether
the Atom is immovable or not. In the former case, it will initially only be accessible by the
MindAgents in the local machine.
The process of assigning the new Atom an AtomHandle needs to be taken care of, in a way
that doesn't introduce a central master. One way to achieve that is to make handles hierarchical,
so the higher order bits indicate the machine. This, however, means that AtomHandles are
no longer immutable. A better idea is to automatically allocate a subset of the AtomHandle
universe to each machine. The initial use of those AtomHandles is the privilege of that machine
but, as Atoms migrate or are cloned, the handles can move through the cluster.
When a MindAgent wants to retrieve one or more Atoms, it will perform a query on the
local AtomSpace subset, just as it would with a single machine repository. Along with the
regular query parameters, it may specify whether the request should be processed locally only,
or globally. Local queries will be fast, but may fail to retrieve the desired Atoms, while global
queries may take a while to return. In the approach outlined above for MindAgent dynamics
and scheduling, this would just cause the MindAgent to wait until results are available.
Queries designed to always return a set of Atoms can have a third mode, which is "prioritize
local Atoms". In this case, the AtomSpace, when processing a query that looks for Atoms that
match a certain pattern, would try to find all local responses before asking other machines.
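The three query modes can be sketched as follows; the function signature and data are illustrative, not the actual AtomSpace query interface:

```python
def query(pattern, local_atoms, remote_atoms, mode="local"):
    """'local' returns only what this portion holds; 'global' also asks
    the (slower) remote machines; 'prioritize_local' falls back to the
    remote machines only when nothing matches locally."""
    local_hits = [a for a in local_atoms if pattern(a)]
    if mode == "local":
        return local_hits
    if mode == "prioritize_local" and local_hits:
        return local_hits
    return local_hits + [a for a in remote_atoms if pattern(a)]

local = ["cat", "mat"]
remote = ["catfish", "dog"]
starts_cat = lambda atom: atom.startswith("cat")
```

For a "prioritize local" query that finds nothing locally, the behavior degrades to a global query, so the caller still gets an answer, just more slowly.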
19.5.1.4 Conflict Resolution
A key design decision when implementing a distributed AtomSpace is the trade-off between
consistency and efficiency. There is no universal answer to this conflict, but the usage scenarios
for CogPrime, current and planned, tend to fall into the same broad category as far as consistency
goes. CogPrime's cognitive processes are relatively indifferent to conflicts and capable of working
well with outdated data, especially if the conflicts are temporary. For applications such as
the AGI Preschool, it is unlikely that outdated properties of single Atoms will have a large,
noticeable impact on the system's behavior; even if that were to happen on rare occasions, this
kind of inconsistency is often present in human behavior as well.
On the other hand, CogPrime assumes fairly fast access to Atoms by the cognitive processes,
so efficiency shouldn't be too heavily penalized. The robustness against mistakes and the need
for performance mean that a distributed AtomSpace should follow the principle of "eventual
consistency". This means that conflicts are allowed to arise, and even to persist for a while, but
a mechanism is needed to reconcile them.
Before describing conflict resolution, which in CogPrime is a bit more complicated than in
most applications, we note that there are two kinds of conflicts. The simple one happens when
an Atom that exists in multiple machines is modified in one machine, and that change isn't
immediately propagated. The less obvious one happens when some process creates a new Atom
in its local AtomSpace repository, but that Atom conceptually "already exists" elsewhere in
the system. Both scenarios are handled in the same way, and can become complicated when,
instead of a single change or creation, one needs to reconcile multiple operations.
The way to handle conflicts is to have a special purpose control process, a reconciliation
MindAgent, with one copy running on each machine in the cluster. This MindAgent keeps
track of all recent write operations in that machine (Atom creations or changes).
Each time the reconciliation MindAgent is called, it processes a certain number of Atoms
in the recent writes list. It chooses the Atoms to process based on a combination of their STI,
LTI and recency of creation/change. Highest priority is given to Atoms with higher STI and
LTI that have been around longer. Lowest priority is given to Atoms with low STI or LTI that
have been very recently changed - both because they may change again in the very near future,
and because they may be forgotten before it's worth solving any conflicts. This will be the case
with most perceptual Atoms, for instance.
By tuning how many Atoms this reconciliation MindAgent processes each time it's activated,
we can tweak the consistency vs. efficiency trade-off.
When the AtomReconciliation agent processes an Atom, what it does is:
• Searches all the machines in the cluster to see if there are other equivalent Atoms (for
Nodes, these are Atoms with the same name and type; for Links, these are Atoms with the
same type and targets).
• If it finds equivalent Atoms, and there are conflicts to be reconciled, such as different
TruthValues or AttentionValues, the decision of how to handle the conflicts is made by a
special probabilistic reasoning rule, called the Rule of Choice (see Chapter 34). Basically,
this means:
- It decides whether to merge the conflicting Atoms. We always merge Links, but some
Nodes may have different semantics, such as Nodes representing different procedures
that have been given the same name.
- In the case that the two Atoms A and B should be merged, it creates a new Atom C
that has all the same immutable properties as A and B. It merges their TruthValues
according to the probabilistic revision rule (see Chapter 34). The AttentionValues are
merged by prioritizing the higher importances.
- In the case that two Nodes should be allowed to remain separate, it allocates one of
them (say, B) a new name. Optionally, it also evaluates whether a SimilarityLink should
be created between the two different Nodes.
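A toy rendering of the merge step, treating each TruthValue as a (strength, count) pair combined by count-weighted averaging - a simplified stand-in for PLN's actual revision rule - and keeping the higher importance:

```python
def revise(tv_a, tv_b):
    """Merge two conflicting TruthValues by count-weighted revision of
    (strength, count) pairs; a simplified stand-in for the PLN rule."""
    (sa, na), (sb, nb) = tv_a, tv_b
    n = na + nb
    return ((sa * na + sb * nb) / n, n)

def merge_atoms(atom_a, atom_b):
    """Merge two equivalent Atoms: revise TruthValues, and keep the
    higher AttentionValue importance. Field names are illustrative."""
    return {"tv": revise(atom_a["tv"], atom_b["tv"]),
            "sti": max(atom_a["sti"], atom_b["sti"])}

a = {"tv": (0.8, 30.0), "sti": 12.0}
b = {"tv": (0.4, 10.0), "sti": 40.0}
merged = merge_atoms(a, b)   # strength 0.7, count 40, STI 40
```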
Another use for the reconciliation MindAgent is maintaining approximate consistency
between clones, which can be created by the AtomSpace itself, as described above in Subsection
19.5.1.2. When the system knows about the multiple clones of an Atom, it keeps note of these
versions in a list, which is processed periodically by a conflict resolution MindAgent, in order
to prevent the clones from drifting too far apart by the actions of local cognitive processes in
each machine.
19.5.2 Distributed Processing
The OCF infrastructure as described above already contains a lot of distributed processing
implicit in it. However, it doesn't tell you how to make the complex cognitive processes that are
part of the CogPrime design distributed unto themselves - say, how to make PLN or MOSES
themselves distributed. This turns out to be quite possible, but becomes quite intricate and
specific depending on the particular algorithms involved. For instance, the current MOSES
implementation is now highly amenable to distributed and multiprocessor implementation, but
in a way that depends subtly on the specifics of MOSES and has little to do with the role of
MOSES in CogPrime as a whole. So we will not delve into these topics here.
Another possibility worth mentioning is broadly distributed processing, in which CogPrime
intelligence is spread across thousands or millions of relatively weak machines networked via the
Internet. Even if none of these machines is exclusively devoted to CogPrime, the total processing
power may be massive, and massively valuable. The use of this kind of broadly distributed
computing resource to help CogPrime is quite possible, but involves numerous additional control
problems which we will not address here.
A simple case is massive global distribution of MOSES fitness evaluation. In the case where
fitness evaluation is isolated and depends only on local data, this is extremely straightforward. In
the more general case where fitness evaluation depends on knowledge stored in a large Atom-
Space, it requires a subtler design, wherein each globally distributed MOSES subpopulation
contains a pool of largely similar genotypes and a cache of relevant parts of the AtomSpace,
which is continually refreshed during the fitness evaluation process. This can work so long as
each globally distributed lobe has a reasonably reliable high bandwidth, low latency connection
to a machine containing a large AtomSpace.
On the more mundane topic of distributed processing within the main CogPrime cluster,
three points are worth discussing:
• Distributed communication and coordination between MindAgents.
• Allocation of machines to functional groups, and MindAgent migration.
• Machines entering and leaving the cluster.
19.5.2.1 Distributed Communication and Coordination
Communications between MindAgents, Units and other CogPrime components are handled by
a message queue subsystem. This subsystem provides a unified API, so the agents involved
are unaware of the location of their partners: distributed messages, inter-process messages in
the same machine, and intra-process messages in the same Unit are sent through the same
API, and delivered to the same target queues. This design enables transparent distribution of
MindAgents and other components.
In the simplest case, of MindAgents within the same Unit, messages are delivered almost
immediately, and will be available for processing by the target agent the next time it's enacted
by the scheduler. In the case of messages sent to other Units or other machines, they're delivered
to the messaging subsystem component of that unit, which has a dedicated thread for message
delivery. That subsystem is scheduled for processing just like any other control process, although
it tends to have a reasonably high importance, to ensure speedy delivery.
The same messaging API and subsystem is used for control-level communications, such as
the coordination of the global cognitive cycle. The cognitive cycle completion message can be
used for other housekeeping purposes as well.
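To make the location-transparent messaging idea concrete, here is a minimal Python sketch. All names (MessageRouter, register_local, and so on) are our own illustrative inventions, not the actual OpenCog messaging API; the point is only that senders use one API regardless of whether the target agent is in the same Unit or on a remote machine.

```python
# Sketch of a location-transparent message queue, as described above.
# Hypothetical names; not the actual OpenCog API.
from collections import deque

class MessageRouter:
    """Delivers messages through one API regardless of destination locality."""

    def __init__(self):
        self.local_queues = {}        # agent_id -> deque; same Unit: immediate delivery
        self.remote_outbox = deque()  # other Units/machines: drained by a delivery thread

    def register_local(self, agent_id):
        self.local_queues[agent_id] = deque()

    def send(self, target_agent, payload):
        # The sender does not know (or care) where the target lives.
        if target_agent in self.local_queues:
            self.local_queues[target_agent].append(payload)      # intra-Unit: immediate
        else:
            self.remote_outbox.append((target_agent, payload))   # queued for delivery thread

    def receive_all(self, agent_id):
        # Called by the scheduler the next time the target agent is enacted.
        q = self.local_queues[agent_id]
        msgs = list(q)
        q.clear()
        return msgs

router = MessageRouter()
router.register_local("PLNAgent")
router.send("PLNAgent", "cycle-complete")     # intra-Unit message
router.send("RemoteAgent", "cycle-complete")  # routed to the remote outbox
```

In a real deployment the remote outbox would be drained by the messaging subsystem's dedicated delivery thread, as described above.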
19.5.2.2 Functional Groups and MindAgent Migration
A CogPrime cluster is composed of groups of machines dedicated to various high-level cognitive
tasks: perception processing, language processing, background reasoning, procedure learning,
action selection and execution, goal achievement planning, etc. Each of these high-level tasks
will probably require a number of machines, which we call functional groups.
Most of the support needed for functional groups is provided transparently by the mechanisms
for distributing the AtomSpace and by the communications layer. The main issue is how many
resources (i.e., how many machines) to allocate to each functional group. The initial allocation
is determined by human administrators via the system configuration - each machine in the
cluster has a local configuration file which tells it exactly which processes to start, along with
the collection of MindAgents to be loaded onto each process and their initial AttentionValues.
Over time, however, it may be necessary to modify this allocation, adding machines to
overworked or highly important functional groups. For instance, one may add more machines
to the natural language and perception processing groups during periods of heavy interaction
with humans in the preschool environment, while repurposing those machines to procedure
learning and background inference during periods in which the agent controlled by CogPrime
is resting or 'sleeping'.
This allocation of machines is driven by attention allocation in much the same way that
processor time is allocated to MindAgents. Functional groups can be represented by Atoms,
and their importance levels are updated according to the importance of the system's top level
goals, and the usefulness of each functional group to their achievement. Thus, once the agent is
engaged by humans, the goals of pleasing them and better understanding them would become
highly important, and would thus drive the STI of the language understanding and language
generation functional groups.
Once there is an imbalance between a functional group's STI and its share of the machines
in the cluster, a control process CIM-Dynamic is triggered to decide how to reconfigure the
cluster. This CIM-Dynamic works approximately as follows:
• First, it decides how many extra machines to allocate to each sub-represented functional
group.
• Then, it ranks the machines not already allocated to those groups based on a combination
of their workload and the aggregate STI of their MindAgents and Units. The goal is to
identify machines that are both relatively unimportant and working under capacity.
• It will then migrate the MindAgents of those machines to other machines in the same
functional group (or just remove them if clones exist), freeing them up.
• Finally, it will decide how best to allocate the new machines to each functional group.
This decision is heavily dependent on the nature of the work done by the MindAgents in
that group, so in CogPrime these decisions will be somewhat hardcoded, as is the set of
functional groups. For instance, background reasoning can be scaled just by adding extra
inference MindAgents to the new machines without too much trouble, but communicating
with humans requires MindAgents responsible for dialog management, and it doesn't make
sense to clone those, so it's better to just give more resources to each MindAgent without
increasing their numbers.
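The steps above can be sketched roughly as follows. The data structures and the proportional-STI targeting are our own illustrative assumptions, not actual CogPrime code; a real CIM-Dynamic would also account for the hardcoded, group-specific allocation policies mentioned above.

```python
# Illustrative sketch of the CIM-Dynamic reallocation heuristic described above.
# All names and data structures are hypothetical.

def reallocate(groups, machines):
    """groups: {name: {"sti": float, "machines": set-of-machine-names}}
       machines: {name: {"group": str, "workload": float, "agent_sti": float}}
       Returns a list of (machine, from_group, to_group) migration moves."""
    total_sti = sum(g["sti"] for g in groups.values())
    total_machines = len(machines)
    moves = []
    for gname, g in groups.items():
        # Target share of machines proportional to the group's STI.
        target = round(total_machines * g["sti"] / total_sti)
        deficit = target - len(g["machines"])
        if deficit <= 0:
            continue
        # Rank machines outside this group: least important and least loaded first
        # (for simplicity, we don't exclude machines already earmarked for a move).
        candidates = sorted(
            (m for m, info in machines.items() if info["group"] != gname),
            key=lambda m: machines[m]["workload"] + machines[m]["agent_sti"])
        for m in candidates[:deficit]:
            moves.append((m, machines[m]["group"], gname))
    return moves
```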
The migration of MindAgents becomes, indirectly, a key driver of Atom migration. As MindA-
gents move or are cloned to new machines, the AtomSpace repository in the source machine
should send clones of the Atoms most recently used by these MindAgents to the target ma-
chine(s), anticipating a very likely distributed request that would create those clones in the near
future anyway. If the MindAgents are moved but not cloned, the local copies of those Atoms in
the source machine can then be (locally) forgotten.
19.5.2.3 Adding and Removing Machines
Given the support for MindAgent migration and cloning outlined above, the issue of adding new
machines to the cluster becomes a specific application of the heuristics just described. When a
new machine is added to the cluster, CogPrime initially decides on a functional group for it,
based both on the importance of each functional group and on their current performance - if
a functional group consistently delays the completion of the cognitive cycle, it should get more
machines, for instance. When the machine is added to a functional group, it is then populated
with the most important or resource starved MindAgents in that group, a decision that is taken
by economic attention allocation.
Removal of a machine follows a similar process. First the system checks if the machine can be
safely removed from its current functional group, without greatly impacting its performance. If
that's the case, the non-cloned MindAgents in that machine are distributed among the remaining
machines in the group, following the heuristic described above for migration. Any local-only
Atoms in that machine's AtomSpace container are migrated as well, provided their LTI is high
enough.
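As a rough illustration of the removal check just described (hypothetical names and data structures, not actual CogPrime code): a machine is safely removable from its group if the group retains other machines, and its non-cloned MindAgents can be redistributed among them.

```python
# Sketch of the machine-removal check described above; illustrative only.

def removal_plan(machine, hosted_agents, group):
    """machine: machine to remove; group: list of machines in its functional group;
       hosted_agents: {machine: [agent names]}.
       Returns (removable, to_migrate): non-cloned agents must migrate to the
       remaining machines; cloned agents can simply be dropped."""
    remaining = [m for m in group if m != machine]
    if not remaining:
        return False, []  # last machine in the group: not safely removable
    to_migrate = [a for a in hosted_agents[machine]
                  if not any(a in hosted_agents[m] for m in remaining)]
    return True, to_migrate
```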
In the situation in which removing a machine M1 would have an intolerable impact on the
functional group's performance, a control process selects another functional group to lose a
machine M2. Then, the MindAgents and Atoms in M1 are migrated to M2, which goes through
the regular removal process first.
In principle, one might use the insertion or removal of machines to perform a global op-
timization of resource allocation within the system, but that process tends to be much more
expensive than the simpler heuristics we just described. We believe these heuristics can give
us most of the benefits of global re-allocation at a fraction of the disturbance for the system's
overall dynamics during their execution.
Chapter 20
Knowledge Representation Using the Atomspace
20.1 Introduction
CogPrime's knowledge representation must be considered on two levels: implicit and explicit.
This chapter considers mainly explicit knowledge representation, with a focus on representation
of declarative knowledge. We will describe the Atom knowledge representation, a generalized
hypergraph formalism which comprises a specific vocabulary of Node and Link types, used to
represent declarative knowledge but also, to a lesser extent, other types of knowledge as well.
Other mechanisms of representing procedural, episodic, attentional, and intentional knowledge
will be handled in later chapters, as will the subtleties of implicit knowledge representation.
The AtomSpace Node and Link formalism is the most obviously distinctive aspect of the
OpenCog architecture, from the point of view of a software developer building AI processes in
the OpenCog framework. And yet, the features of CogPrime that are most important, in terms
of our theoretical reasons for estimating it likely to succeed as an advanced AGI system, are
not really dependent on the particulars of the AtomSpace representation.
What's important about the AtomSpace knowledge representation is mainly that it provides
a flexible means for compactly representing multiple forms of knowledge, in a way that allows
them to interoperate - where by "interoperate" we mean that e.g. a fragment of a chunk of
declarative knowledge can link to a fragment of a chunk of attentional or procedural knowledge;
or a chunk of knowledge in one category can overlap with a chunk of knowledge in another
category (as when the same link has both a (declarative) truth value and an (attentional)
importance value). In short, any representational infrastructure sufficiently flexible to support
• compact representation of all the key categories of knowledge playing dominant roles in
human memory
• the flexible creation of specialized sub-representations for various particular subtypes of
knowledge in all these categories, enabling compact and rapidly manipulable expression of
knowledge of these subtypes
• the overlap and interlinkage of knowledge of various types, including that represented using
specialized sub-representations
will probably be acceptable for CogPrime's purposes. However, precisely formulating these
general requirements is tricky, and is significantly more difficult than simply articulating a
single acceptable representational scheme, like the current OpenCog Atom formalism. The Atom
formalism satisfies the relevant general requirements and has proved workable from a practical
software perspective.
In terms of the Mind-World Correspondence Principle introduced in Chapter 10, the impor-
tant point regarding the Atom representation is that it must be flexible enough to allow the
compact and rapidly manipulable representation of knowledge that has aspects spanning the
multiple common human knowledge categories, in a manner that allows easy implementation
of cognitive processes that will manifest the Mind-World Correspondence Principle in everyday
human-like situations. The actual manifestation of mind-world correspondence is the job of
the cognitive processes acting on the AtomSpace - the job of the AtomSpace is to be an effi-
cient and flexible enough representation that these cognitive processes can manifest mind-world
correspondence in everyday human contexts given highly limited computational resources.
20.2 Denoting Atoms
First we describe the textual notation we'll use to denote various sorts of Atoms throughout the
following chapters. The discussion will also serve to give some particular examples of cognitively
meaningful Atom constructs.
20.2.1 Meta-Language
As always occurs when discussing (even partially) logic-based systems, when discussing Cog-
Prime there is some potential for confusion between logical relationships inside the system, and
logical relationships being used to describe parts of the system. For instance, we can state as
observers that two Atoms inside CogPrime are equivalent, and this is different from stating
that CogPrime itself contains an Equivalence relation between these two Atoms. Our formal
notation needs to reflect this difference.
Since we will not be doing any fancy mathematical analyses of CogPrime structures or
dynamics here, there is no need to formally specify the logic being used for the metalanguage.
Standard predicate logic may be assumed.
So, for example, we will say things like
(IntensionalInheritanceLink Ben monster).TruthValue.strength = .5

This is a metalanguage statement, which means that the strength field of the TruthValue
object associated with the link (IntensionalInheritance Ben monster) is equal to .5. This is
different than saying
EquivalenceLink
    ExOutLink
        GetStrength
        ExOutLink
            GetTruthValue
            IntensionalInheritanceLink Ben monster
    NumberNode 0.5
which refers to an equivalence relation represented inside CogPrime. The former refers to an
equals relationship observed by the authors of the book, but perhaps never represented explicitly
inside CogPrime.
In the first example above we have used the C++ convention
structure_variable_name.field_name
for denoting elements of composite structures; this convention will be stated formally below.
In the second example we have used schema corresponding to TruthValue and Strength; these
schema extract the appropriate fields from the Atoms they're applied to, so that e.g.
ExOutLink
GetTruthValue
A
returns the number
A.TruthValue
Following a convention from mathematical logic, we will also sometimes use the special
symbol
|-
to mean "implies in the metalanguage." For example, the first-order PLN deductive inference
strength rule may be written
InheritanceLink A B <sAB>
InheritanceLink B C <sBC>
|-
InheritanceLink A C <sAC>
where
sAC = sAB sBC + (1-sAB) (sC - sB sBC) / (1 - sB)
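In code, the deduction strength rule reads as follows; a sketch assuming all strengths are probabilities in [0,1] and sB < 1:

```python
# The independence-based first-order PLN deduction strength formula.

def deduction_strength(sAB, sBC, sB, sC):
    """sAC = sAB*sBC + (1 - sAB) * (sC - sB*sBC) / (1 - sB).
       sAB, sBC: link strengths; sB, sC: node (term) strengths."""
    return sAB * sBC + (1 - sAB) * (sC - sB * sBC) / (1 - sB)
```

Note the boundary behavior: when sAB = 1 the formula reduces to sAC = sBC, as expected of a certain inheritance from A to B.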
This is different from saying
ForAll $A, $B, $C, $sAB, $sBC, $sAC
    ExtensionalImplicationLink_HOJ
        AND
            InheritanceLink $A $B <$sAB>
            InheritanceLink $B $C <$sBC>
        AND
            InheritanceLink $A $C <$sAC>
            $sAC = $sAB $sBC + (1-$sAB) ($sC - $sB $sBC) / (1 - $sB)
which is the most natural representation of the independence-based PLN deduction rule (for
strength-only truth values) as a logical statement within CogPrime. In the latter expression
the variables $A, $sAB, and so forth represent actual Variable Atoms within CogPrime. In
the former expression the variables represent concrete, non-Variable Atoms within CogPrime,
which however are being considered as variables within the metalanguage.
(As explained in the PLN book, a link labeled with "HOJ" refers to a "higher order judgment",
meaning a relationship that interprets its relations as entities with particular truth values. For
instance,
ImplicationLink_HOJ
    Inh $X stupid <.9>
    Inh $X rich <.9>
means that if (Inh $X stupid) has a strength of .9, then (Inh $X rich) has a strength of .9).
20.2.2 Denoting Atoms
Atoms are the basic objects making up CogPrime knowledge. They come in various types, and
are associated with various dynamics, which are embodied in MindAgents. Generally speaking
Atoms are endowed with TruthValue and AttentionValue objects. They also sometimes have
names, and other associated Values as previously discussed. In the following subsections we will
explain how these are notated, and then discuss specific notations for Links and Nodes, the two
types of Atoms in the system.
20.2.2.1 Names
In order to denote an Atom in discussion, we have to call it something. Relatedly but separately,
Atoms may also have names within the CogPrime system. (As a matter of implementation, in
the current OpenCog version, no Links have names; whereas, all Nodes have names, but some
Nodes have a null name, which is conceptually the same as not having a name.)
(name,type) pairs must be considered as unique within each Unit within an OpenCog system,
otherwise they can't be used effectively to reference Atoms. It's OK if two different OpenCog
Units both have SchemaNodes named "+", but not if one OpenCog Unit has two SchemaNodes
both named "+" - this latter situation is disallowed on the software level, and is assumed in
discussions not to occur.
Some Atoms have natural names. For instance, the SchemaNode corresponding to the elementary
schema function + may quite naturally be named "+". The NumberNode corresponding
to the number .5 may naturally be named ".5", and the CharacterNode corresponding to the
character c may naturally be named "c". These cases are the minority, however. For instance,
a SpecificEntityNode representing a particular instance of + has no natural name, nor does a
SpecificEntityNode representing a particular instance of c.
Names should not be confused with Handles. Atoms have Handles, which are unique identi-
fiers (in practice, numbers) assigned to them by the OpenCog core system; and these Handles
are how Atoms are referenced internally, within OpenCog, nearly all the time. Accessing of
Atoms by name is a special case - not all Atoms have names, but all Atoms have Handles. An
example of accessing an Atom by name is looking up the CharacterNode representing the letter
"c" by its name "c". There would then be two possible representations for the word "cat":
1. this word might be associated with a ListLink - and the ListLink corresponding to "cat"
would be a list of the Handles of the Atoms of the nodes named "c", "a", and "t".
2. for expedience, the word might be associated with a WordNode named "cat."
In the case where an Atom has multiple versions - which may happen, for instance, if the Atom
is considered in a different context (via a ContextLink) - each version has a VersionHandle, so
that accessing an AtomVersion requires specifying an AtomHandle plus a VersionHandle. See
Chapter 19 for more information on Handles.
OpenCog never assigns Atoms names on its own; in fact, Atom names are assigned only in
the two sorts of cases just mentioned:
1. Via preprocessing of perceptual inputs (e.g. the names of NumberNode, CharacterNodes)
2. Via hard-wiring of names for SchemaNodes and PredicateNodes corresponding to built-in
elementary schema (e.g. +, AND, Say)
If an Atom A has a name n in the system, we may write
A.name = n
On the other hand, if we want to assign an Atom an external name, we may make a meta-
language assertion such as
L1 := (InheritanceLink Ben animal)

indicating that we decided to name that link L1 for our discussions, even though inside OpenCog
it has no name.
In denoting (nameless) Atoms we may use arbitrary names like L1. This is more convenient
than using a Handle-based notation in which Atoms would be referred to as 1, 3433322, etc.; but
sometimes we will use the Handle notation as well.
Some ConceptNodes and conceptual PredicateNodes or SchemaNodes may correspond with
human-language words or phrases like cat, bite, and so forth. This will be the minority case;
most such nodes will correspond to parts of human-language concepts or fuzzy collections of
human-language concepts. In discussions in this book, however, we will often invoke the unusual
case in which Atoms correspond to individual human-language concepts. This is because such
examples are the easiest ones to write about and discuss intuitively. The preponderance of
named Atoms in the examples in the book implies no similar preponderance of named Atoms
in the real OpenCog system. It is merely easier to talk about a hypothetical Atom named "cat"
than it is about a hypothetical Atom with Handle 434. It is not impossible that an OpenCog
system represents "cat" as a single ConceptNode, but it is just as likely that it will represent
"cat" as a map composed of many different nodes without any of these having natural names.
Each OpenCog system works out for itself, implicitly, which concepts to represent as single Atoms and
which in distributed fashion.
For another example,
ListLink
    CharacterNode "c"
    CharacterNode "a"
    CharacterNode "t"
corresponds to the character string
("c", "a", "t")
and would naturally be named using the string cat. In the system itself, however, this ListLink
need not have any name.
20.2.2.2 Types
Atoms also have types. When it is necessary to explicitly indicate the type of an atom, we will
use the keyword Type, as in
A.Type = InheritanceLink
N_345.Type = ConceptNode
On the other hand, there is also a built-in schema HasType which lets us say
EvaluationLink HasType A InheritanceLink
EvaluationLink HasType N_345 ConceptNode
This covers the case in which type evaluation occurs explicitly in the system, which is useful
if the system is analyzing its own emergent structures and dynamics.
Another option currently implemented in OpenCog is to explicitly restrict the type of a
variable using TypedVariableLink such as follows
TypedVariableLink
    VariableNode $X
    VariableTypeNode "ConceptNode"
Note also that we will frequently remove the suffix Link or Node from Atom type names,
writing e.g.

Inheritance
    Concept A
    Concept B

instead of

InheritanceLink
    ConceptNode A
    ConceptNode B
20.2.2.3 Truth Values
The truth value of an atom is a bundle of information describing how true the Atom is, in one
of several different senses depending on the Atom type. It is encased in a TruthValue object
associated with the Atom. Most of the time, we will denote the truth value of an atom in <>'s
following the expression denoting the atom. This very handy notation may be used in several
different ways.
A complication is that some Atoms may have CompositeTruthValues, which consist of differ-
ent estimates of their truth value made by different sources, which for whatever reason have not
been reconciled (maybe no process has gotten around to reconciling them, maybe they corre-
spond to different truth values in different contexts and thus logically need to remain separate,
maybe their reconciliation is being delayed pending accumulation of more evidence, etc.). In
this case we can still assume that an Atom has a default truth value, which corresponds to the
highest-confidence truth value that it has, in the Universal Context.
Most frequently, the notation is used with a single number in the brackets, e.g.
A <.4>
to indicate that the atom A has truth value .4; or

IntensionalInheritanceLink Ben monster <.5>
to indicate that the IntensionalInheritance relation between Ben and monster has truth value
strength .5. In this case, <tv> indicates (roughly speaking) that the truth value of the atom
in question involves a probability distribution with a mean of tv. The precise semantics of the
strength values associated with OpenCog Atoms is described in Probabilistic Logic Networks
(see Chapter 34). Please note, though: This notation does not imply that the only data retained
in the system about the distribution is the single number .5.
If we want to refer to the truth value of an Atom A in the context C, we can use the construct
ContextLink <truth value>
    C
    A
Sometimes, Atoms in OpenCog are labeled with two truth value components as defined by
PLN: strength and weight-of-evidence. To denote these two components, we might write
IntensionalInheritanceLink Ben scary <.9,.1>
indicating that there is a relatively small amount of evidence in favor of the proposition that
Ben is very scary.
We may also put the TruthValue indicator in a different place, e.g. using indent notation,
IntensionalInheritanceLink <.9,.1>
    Ben
    scary
This is mostly useful when dealing with long and complicated constructions.
If we want to denote a composite truth value (whose components correspond to different
"versions" of the Atom), we can use a list notation, e.g.
IntensionalInheritance (<.9,.1>, <.5,.9> [h,123], <.6,.7> [c,655])
    Ben
    scary
where e.g.
<.5,.9> [h,123]
denotes the TruthValue version of the Atom indexed by Handle 123. The h denotes that the
AtomVersion indicated by the VersionHandle h,123 is a Hypothetical Atom, in the sense de-
scribed in the PLN book. Some versions may not have any index Handles.
The semantics of composite TruthValues are described in the PLN book, but roughly they
are as follows. Any version not indexed by a VersionHandle is a "primary TruthValue" that
gives the truth value of the Atom based on some body of evidence. A version indexed by a
VersionHandle is either contextual or hypothetical, as indicated notationally by the c or h in its
VersionHandle. So, for instance, if a TruthValue version for Atom A has VersionHandle h,123
that means it denotes the truth value of Atom A under the hypothetical context represented
by the Atom with handle 123. If a TruthValue version for Atom A has VersionHandle c,655
this means it denotes the truth value of Atom A in the context represented by the Atom with
Handle 655.
Alternately, truth values may be expressed sometimes in <L,U,b> or <L,U,b,N> format,
defined in terms of indefinite probability theory as defined in the PLN book and recalled in
Chapter 34. For instance,
IntensionalInheritanceLink Ben scary <.7,.9,.8,20>
has the semantics that there is an estimated 80% chance that after 20 more observations have
been made, the estimated strength of the link will be in the interval (.7, .9).
The notation may also be used to specify a TruthValue probability distribution, e.g.
A <g(5,7,12)>
would indicate that the truth value of A is given by distribution g with parameters (5,7,12), or
A <M>
where M is a table of numbers, would indicate that the truth value of A is approximated by
the table M.
The <> notation for truth value is an unabashedly incomplete and ambiguous notation, but
it is very convenient. If we want to specify, say, that the truth value strength of IntensionalIn-
heritanceLink Ben monster is in fact the number .5, and no other truth value information is
retained in the system, then we need to say
(IntensionalInheritance Ben monster).TruthValue = [(strength, .5)]
(where a hashtable form is assumed for TruthValue objects, i.e. a list of name-value pairs). But
this kind of issue will rarely arise here and the <> notation will serve us well.
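As an illustration of the <strength, weight-of-evidence> notation, here is a toy Python model. The class and its fields are our own simplification for exposition, not OpenCog's actual TruthValue classes, which carry richer distributional information as discussed above.

```python
# Toy model of the <strength> and <strength, weight-of-evidence> notations.
# Illustrative only; not OpenCog's TruthValue implementation.

class SimpleTruthValue:
    def __init__(self, strength, weight_of_evidence=None):
        self.strength = strength                      # mean of the distribution, e.g. .9
        self.weight_of_evidence = weight_of_evidence  # e.g. .1 (little evidence)

    def __repr__(self):
        if self.weight_of_evidence is None:
            return "<%g>" % self.strength
        return "<%g,%g>" % (self.strength, self.weight_of_evidence)

# "IntensionalInheritanceLink Ben scary <.9,.1>": very scary, on little evidence.
tv = SimpleTruthValue(0.9, 0.1)
```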
20.2.2.4 Attention Values
The AttentionValue object associated with an Atom does not need to be notated nearly as often
as truth value. When it does however we can use similar notational methods.
AttentionValues may have several components, but the two critical ones are called short-
term importance (STI) and long-term importance (LTI). Furthermore, multiple STI values are
retained: for each (Atom, MindAgent) pair there may be a MindAgent-specific STI value for
that Atom. The pragmatic import of these values will become clear in a later chapter when we
discuss attention allocation.
Roughly speaking, the long-term importance is used to control memory usage: when memory
gets scarce, the atoms with the lowest LTI value are removed. On the other hand, the short-term
importance is used to control processor time allocation: MindAgents, when they decide which
Atoms to act on, will generally, but not only, choose the ones that have proved most useful to
them in the recent past, and additionally those that have been useful for other MindAgents in
the recent past.
We will use the double bracket notation <<>> to denote attention value (in the rare cases where
such denotation is necessary). So, for instance,
Cow_7 <<.5>>
will mean the node Cow_7 has an importance of .5; whereas,
Cow_7 <<STI = .1, LTI = .8>>
or simply
Cow_7 <<.1, .8>>
will mean the node Cow_7 has short-term importance = .1 and long-term importance = .8 .
Of course, we can also use the style
(IntensionalInheritanceLink Ben monster).AttentionValue
= [(STI, .1), (LTI, .8)]
where appropriate.
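A toy sketch of how LTI might drive forgetting, per the description above; the data structures and function names are hypothetical, not actual CogPrime code.

```python
# Sketch: LTI controls memory usage -- when memory gets scarce,
# the Atoms with the lowest LTI are removed. Illustrative only.

def forget_lowest_lti(atoms, capacity):
    """atoms: {name: {"STI": float, "LTI": float}}.
       Keep only the `capacity` Atoms with the highest LTI."""
    if len(atoms) <= capacity:
        return atoms
    return dict(sorted(atoms.items(), key=lambda kv: kv[1]["LTI"],
                       reverse=True)[:capacity])

atoms = {"Cow_7": {"STI": .1, "LTI": .8},
         "Dog_3": {"STI": .7, "LTI": .2},
         "Cat_9": {"STI": .4, "LTI": .5}}
kept = forget_lowest_lti(atoms, 2)  # Dog_3 has the lowest LTI and is forgotten
```

Note that Dog_3 is forgotten despite having the highest STI: STI governs processor-time allocation, not memory retention.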
20.2.2.5 Links
Links are represented using a simple notation that has already occurred many times in this
book. For instance,
Inheritance A B
Similarity A B
Note that here the symmetry or otherwise of the link is not implicit in the notation. Similar-
ityLinks are symmetrical, InheritanceLinks are not. When this distinction is necessary, it will
be explicitly made.
20.3 Representing Functions and Predicates
SchemaNodes and PredicateNodes contain functions internally; and Links may also usefully be
considered as functions. We now briefly discuss the representations and notations we will use
to indicate functions in various contexts.
Firstly, we will make some use of the currying notation drawn from combinatory logic, in
which adjacency indicates function application. So, for instance, using currying,
f x
means the function f evaluated at the argument x; and (f x y) means (f(x))(y). If we want to
specify explicitly that a block of terminology is being specified using currying we will use the
notation @[expression], for instance

@[f x y z]

means

((f(x))(y))(z)
We will also frequently use conventional notation to refer to functions, such as f(x,y). Of
course, this is consistent with the currying convention if (x,y) is interpreted as a list and f is
then a function that acts on 2-element lists. We will have many other occasions than this to
use list notation.
Also, we will sometimes use a non-curried notation, most commonly with Links, so that e.g.
InheritanceLink x y
does not mean a curried evaluation but rather means InheritanceLink(x,y).
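The currying convention can be illustrated in Python, where a curried three-argument function is applied one argument at a time, so that the curried notation corresponds to ((f(x))(y))(z):

```python
# Currying: adjacency as function application, one argument at a time.

def curry3(f):
    """Turn a 3-argument function into its curried form."""
    return lambda x: lambda y: lambda z: f(x, y, z)

def add3(x, y, z):
    return x + y + z

f = curry3(add3)
result = f(1)(2)(3)  # the curried application of f to 1, 2, 3
```

The conventional notation f(x,y) is consistent with this if (x,y) is read as a list and f as a function on 2-element lists, as noted above.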
20.3.0.6 Execution Output Links
In the case where f refers to a schema, the occurrence of the combination f x in the system is
represented by
ExOutLink f x
or graphically, as a small tree with an ExOut node whose children are f and x.
Note that, just as when we write
f (g x)
we mean to apply f to the result of applying g to x, similarly when we write
ExOutLink f (ExOutLink g x)
we mean the same thing. So for instance
EvaluationLink (ExOutLink g x) y <.8>
means that the result of applying g to x is a predicate r, so that r(y) evaluates to True with
strength .8.
This approach, in its purest incarnation, does not allow multi-argument schemata. Now,
multi-argument schemata are never actually necessary, because one can use argument currying
to simulate multiple arguments. However, this is often awkward, and things become simpler if
one introduces an explicit tupling operator, which we call ListLink. Simply enough,
ListLink Al ... An
denotes an ordered list (Al, ..., An)
20.3.1 Execution Links
ExecutionLinks give the system an easy way to record acts of schema execution. These are
ternary links of the form:
SchemaNode: S
Atom: A, B
ExecutionLink S A B
In words, this says the procedure represented by SchemaNode S has taken input A and
produced output B.
There may also be schemata that produce no output, or take no input. But these are
treated as PredicateNodes, to be discussed below; their activity is recorded by EvaluationLinks,
not ExecutionLinks.
The TruthValue of an ExecutionLink records how frequently the result encoded in the Exe-
cutionLink occurs. Specifically,
• the TruthValue of (ExecutionLink S A B) tells you the probability of getting B as output,
given that you have run schema S on input A
• the TruthValue of (ExecutionLink S A) tells you the probability that if S is run, it is run
on input A
Often it is useful to record the time at which a given act of schema execution was carried out;
in that case one uses the atTime link, writing e.g.

atTimeLink
    T
    ExecutionLink S A B
where T is a TimeNode, or else one uses an implicit method such as storing the time-stamp of
the ExecutionLink in a core-level data-structure called the TimeServer. The implicit method is
logically equivalent to explicitly using atTime, and is treated the same way by PLN inference,
but provides significant advantages in terms of memory usage and lookup speed.
For purposes of logically reasoning about schema, it is useful to create binary links repre-
senting ExecutionLinks with some of their arguments fixed. We name these as follows:
ExecutionLink1 A B means: X so that ExecutionLink X A B
ExecutionLink2 A B means: X so that ExecutionLink A X B
ExecutionLink3 A B means: X so that ExecutionLink A B X
Finally, a SchemaNode may be associated with a structure called a Graph.
Where S is a SchemaNode,

Graph(S) = { (x,y) : ExecutionLink S x y }
Sometimes, the graph of a SchemaNode may be explicitly embodied as a ConceptNode; other
times, it may be constructed implicitly by a MindAgent in analyzing the SchemaNode (e.g. the
inference MindAgent).
Note that the set of ExecutionLinks describing a SchemaNode may not define that SchemaNode
exactly, because some of them may be derived by inference. This means that the model of
a SchemaNode contained in its ExecutionLinks may not actually be a mathematical function,
in the sense of assigning only one output to each input. One may have
ExecutionLink S X A <.5>
ExecutionLink S X B <.5>
meaning that the system does not know whether S(X) evaluates to A or to B. So the set of
ExecutionLinks modeling a SchemaNode may constitute a non-function relation, even if the
schema inside the Schemallode is a function.
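The Graph(S) construction and the possibility of non-function relations can be illustrated with a small sketch; the tuple-based link store is our own illustrative stand-in, not the actual AtomSpace.

```python
# Sketch of Graph(S) = { (x, y) : ExecutionLink S x y }, built from a
# store of ExecutionLinks. Illustrative data structures only.

execution_links = [
    ("S", "X", "A", 0.5),  # ExecutionLink S X A <.5>
    ("S", "X", "B", 0.5),  # ExecutionLink S X B <.5> -- same input, different output
    ("S", "Y", "C", 1.0),
]

def graph(schema, links):
    """The set of (input, output) pairs recorded for the given schema."""
    return {(x, y) for (s, x, y, tv) in links if s == schema}

def is_function(schema, links):
    """True iff the recorded ExecutionLinks assign one output per input."""
    inputs = [x for (x, _) in graph(schema, links)]
    return len(inputs) == len(set(inputs))
```

Here graph("S", ...) yields a non-function relation, since input X is recorded with two outputs: the system does not know whether S(X) evaluates to A or to B.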
Finally, what of the case where f x represents the action of a built-in system function f on
an argument x? This is an awkward case that would not be necessary if the CogPrime system
were revised so that all cognitive functions were carried out using SchemaNodes. However,
in the current CogPrime version, where most cognitive functions are carried out using C++
MindAgent objects, if we want CogPrime to study its own cognitive behavior in a statistical
way, we need BuiltInSchemaNodes that refer to MindAgents rather than to ComboTrees (or
else, we need to represent MindAgents using ComboTrees, which will become practicable once
we have a sufficiently efficient Combo interpreter). The semantics here is thus basically the
same as where f refers to a schema. For instance we might have
38 20 Knowledge Representation Using the Atomspace
ExecutionLink FirstOrderInferenceMindAgent (L1, L2) L3
where L1, L2 and L3 are links related by
L1
L2
L3
according to the first-order PLN deduction rules.
20.3.1.1 Predicates
Predicates are related but not identical to schema, both conceptually and notationally. PredicateNodes involve predicate schema which output TruthValue objects. But there is a difference between a SchemaNode embodying a predicate schema and a PredicateNode, which is that a PredicateNode doesn't merely output a TruthValue: it adjusts its own TruthValue as a result of the output of its own internal predicate schema.
The record of the activity of a PredicateNode is given not by an ExecutionLink but rather by an
EvaluationLink P A <tv>
where P is a PredicateNode, A is its input, and <tv> is the truth value assumed by the EvaluationLink corresponding to the PredicateNode being fed the input A. There is also the variant
EvaluationLink P <tv>
for the case where the PredicateNode P embodies a schema that takes no inputs¹.
A simple example of a PredicateNode is the predicate GreaterThan. In this case we have, for instance,
EvaluationLink GreaterThan 5 6 <0>
EvaluationLink GreaterThan 5 3 <1>
and we also have:
EquivalenceLink
    GreaterThan
    ExOutLink
        And
        ListLink
            ExOutLink
                Not
                LessThan
            ExOutLink
                Not
                EqualTo
Note how the variables have been stripped out of the expression; see the PLN book for more explanation of this. We will also encounter many commonsense-semantics predicates such as isMale, with e.g.
¹ Actually, if P does take some inputs, EvaluationLink P <tv> is defined too, and tv corresponds to the average of P(X) over all inputs X; this is explained in more depth in the PLN book.
20.3 Representing Functions and Predicates 39
EvaluationLink isMale Ben_Goertzel <1>
Schemata that return no outputs are treated as predicates, and handled using Evaluation-
Links. The truth value of such a predicate, as a default, is considered as True if execution is
successful, and False otherwise.
And, analogously to the Graph operator for SchemaNodes, we have for PredicateNodes the SatisfyingSet operator, defined so that the SatisfyingSet of a predicate is the set whose members are the elements that satisfy the predicate. Formally, that is:
S = SatisfyingSet P
means
TruthValue(MemberLink X S)
equals
TruthValue(EvaluationLink P X)
This operator allows the system to carry out advanced logical operations like higher-order
inference and unification.
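The defining property may be illustrated with a minimal Python sketch, under the assumption that a predicate is represented as a function returning a truth value in [0, 1], and that the SatisfyingSet is represented as a map from candidate members to their membership truth values. The names satisfying_set and greater_than_5 are hypothetical, for exposition only.

```python
def satisfying_set(predicate, atoms):
    """Map each atom X to the truth value of (EvaluationLink P X).

    By the definition above, TruthValue(MemberLink X S) equals
    TruthValue(EvaluationLink P X), so this map *is* the (fuzzy)
    membership function of S = SatisfyingSet(P).
    """
    return {atom: predicate(atom) for atom in atoms}

# A toy GreaterThan-5 predicate returning crisp truth values.
greater_than_5 = lambda x: 1.0 if x > 5 else 0.0

members = satisfying_set(greater_than_5, [3, 6, 9])
# members[6] == 1.0, members[3] == 0.0
```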
20.3.2 Denoting Schema and Predicate Variables
CogPrime sometimes uses variables to represent the expressions inside schemata and predicates, and sometimes uses variable-free, combinatory-logic-based representations. It is important to distinguish between the two sorts of variables that may exist in CogPrime:
• Variable Atoms, which may be quantified (bound to existential or universal quantifiers) or unquantified
• Variables that are used solely as function-arguments or local variables inside the "Combo tree" structures used inside some ProcedureNodes (PredicateNodes or SchemaNodes), to be described below, but are not related to Variable Atoms
Examples of quantified variables represented by Variable Atoms are $X and $Y in:
ForAll $X <.0001>
    ExtensionalImplicationLink
        ExtensionalInheritanceLink $X human
        ThereExists $Y
            AND
                ExtensionalInheritanceLink $Y human
                EvaluationLink parent_of ($X, $Y)
An example of an unquantified Variable Atom is $X in
ExtensionalImplicationLink <.3>
    ExtensionalInheritanceLink $X human
    ThereExists $Y
        AND
            ExtensionalInheritanceLink $Y human
            EvaluationLink parent_of ($X, $Y)
This ImplicationLink says that 30% of humans are parents: a more useful statement than
the ForAll Link given above, which says that it is very very unlikely to be true that all humans
are parents.
We may also say, for instance,
SatisfyingSet( EvaluationLink eats (cat, $X) )
to refer to the set of X such that eats(cat, X).
On the other hand, suppose we have the implication
Implication
    Evaluation f $X
    Evaluation
        f
        ExOut reverse $X
where f is a PredicateNode embodying a mathematical operator acting on pairs of NumberNodes, and reverse is an operator that reverses a list. So, this implication says that the f predicate is commutative. Now, suppose that f is grounded by the formula
f(a,b) = (a > b - 1)
embodied in a Combo Tree object (which is not commutative, but that is not the point), stored in the ProcedureRepository and linked to the PredicateNode for f. These f-internal variables, which are expressed here using the letters a and b, are not VariableNodes in the CogPrime AtomTable. The notation we use for these within the textual Combo language, which goes with the Combo Tree formalism, is to replace a and b in this example with #1 and #2, so the above grounding would be denoted
f -> (#1 > #2 - 1)
In the current version, it is assumed that type restrictions are always crisp, not probabilistically truth-valued. This assumption may be revisited in a later version of the system.
20.3.2.1 Links as Predicates
It is conceptually important to recognize that CogPrime link types may be interpreted as predicates. For instance, when one says
InheritanceLink cat animal <.8>
indicating an Inheritance relation between cat and animal with a strength of .8, effectively one is declaring that one has a predicate giving an output of .8. Depending on the interpretation of InheritanceLink as a predicate, one has either the predicate
InheritanceLink cat $X
acting on the input
animal
or the predicate
InheritanceLink $X animal
acting on the input
cat
or the predicate
InheritanceLink $X $Y
acting on the list input
(cat, animal)
This means that, if we wanted to, we could do away with all Link types except OrderedLink and UnorderedLink, and represent all other Link types as PredicateNodes embodying appropriate predicate schema.
This is not the approach taken in the current codebase. However, the situation is somewhat
similar to that with CIM-Dynamics:
• In the future we will likely create a revision of CogPrime that regularly revises its own vocabulary of Link types, in which case an explicit representation of link types as predicate schema will be appropriate.
• In the shorter term, it can be useful to treat link types as virtual predicates, meaning that one lets the system create SchemaNodes corresponding to them, and hence do some meta-level reasoning about its own link types.
20.3.3 Variable and Combinator Notation
One of the most important aspects of combinatory logic, from a CogPrime perspective, is that
it allows one to represent arbitrarily complex procedures and patterns without using variables
in any direct sense. In CogPrime, variables are optional, and the choice of whether or how to
use them may be made (by CogPrime itself) on a contextual basis.
This section deals with the representation of variable expressions in a variable-free way, in a
CogPrime context. The general theory underlying this is well-known, and is usually expressed
in terms of the elimination of variables from lambda calculus expressions (lambda lifting). Here
we will not present this theory but will restrict ourselves to presenting a simple, hopefully
illustrative example, and then discussing some conceptual implications.
20.3.3.1 Why Eliminating Variables is So Useful
Before launching into the specifics, a few words about the general utility of variable-free ex-
pression may be worthwhile.
Some expressions look simpler to the trained human eye with variables, and some look simpler
without them. However, the main reason why eliminating all variables from an expression is
sometimes very useful, is that there are automated program-manipulation techniques that work
much more nicely on programs (schemata, in CogPrime lingo) without any variables in them.
As will be discussed later (e.g. in Chapter 33 on evolutionary learning, although the same process is also useful for supporting probabilistic reasoning on procedures), in order to mine patterns among multiple schema that all try to do the same (or related) things, we want to put schema into a kind of "hierarchical normal form." The normal form we wish to use generalizes
Holman's Elegant Normal Form (which is discussed in Moshe Looks' PhD thesis) to program
trees rather than just Boolean trees.
But, putting computer programs into a useful, nicely-hierarchically-structured normal form
is a hard problem - it requires one to have a pretty nice and comprehensive set of program
transformations.
But the only general, robust, systematic program transformation methods that exist in the
computer science literature require one to remove the variables from one's programs, so that
one can use the theory of functional programming (which ties in with the theory of monads in
category theory, and a lot of beautiful related math).
In large part, we want to remove variables so we can use functional programming tools to
normalize programs into a standard and pretty hierarchical form, in order to mine patterns
among them effectively.
However, we don't always want to be rid of variables, because sometimes, from a logical
reasoning perspective, theorem-proving is easier with the variables in there. (Sometimes not.)
So, we want to have the option to use variables, or not.
20.3.3.2 An Example of Variable Elimination
Consider the PredicateNode
AND
    InheritanceLink X cat
    eats X mice
Here we have used a syntactically sugared representation involving the variable X. How can we
get rid of the X?
Recall the C combinator (from combinatory logic), defined by
C f x y = f y x
Using this tool,
InheritanceLink X cat
becomes
C InheritanceLink cat X
and
eats X mice
becomes
C eats mice X
so that overall we have
AND
    C InheritanceLink cat
    C eats mice
where the C combinators essentially give instructions as to where the virtual argument X should
go.
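The same elimination can be checked mechanically in any language with first-class functions. The following Python sketch is an illustrative stand-in (the curried "link types" and the toy knowledge base are assumptions, not OpenCog code):

```python
# The C combinator in curried Python: C f x y = f y x.
C = lambda f: lambda x: lambda y: f(y)(x)

# Toy curried relations over a tiny knowledge base (illustrative only).
inheritance = lambda a: lambda b: (a, b) in {("tom", "cat"), ("felix", "cat")}
eats        = lambda a: lambda b: (a, b) in {("tom", "mice")}
AND         = lambda p: lambda q: p and q

# The variable-free body: X -> AND (C InheritanceLink cat X) (C eats mice X).
pred = lambda X: AND(C(inheritance)("cat")(X))(C(eats)("mice")(X))

pred("tom")    # True: tom inherits from cat and eats mice
pred("felix")  # False: felix inherits from cat but does not eat mice
```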
In this case the variable-free representation is basically just as simple as the variable-based
representation, so there is nothing to lose and a lot to gain by getting rid of the variables. This
won't always be the case - sometimes execution efficiency will be significantly enhanced by use
of variables.
20.3.4 Inheritance Between Higher-Order Types
Next, this section deals with the somewhat subtle matter of Inheritance between higher-order
types. This is needed, for example, when one wants to cross over or mutate two complex
schemata, in an evolutionary learning context. One encounters questions like: When mutation
replaces a schema that takes integer input, can it replace it with one that takes general numerical
input? How about vice versa? These questions get more complex when the inputs and outputs
of schema may themselves be schema with complex higher-order types. However, they can be
dealt with elegantly using some basic mathematical rules.
Denote the type of a mapping from type T to type S as T -> S. Use the shorthand inh to mean inherits from. Then the basic rule we use is that
T1 -> S1 inh T2 -> S2
iff
T2 inh T1
S1 inh S2
In other words, we assume higher-order type inheritance is contravariant. The reason is that, if R1 = T1 -> S1 is to be a special case of R2 = T2 -> S2, then one has to be able to use R1 everywhere one uses R2. This means that any input R2 takes has to also be taken by R1 (hence T2 inherits from T1). And it means that the outputs R1 gives must be able to be accepted by any function that accepts outputs of R2 (hence S1 inherits from S2).
This type of issue comes up in programming language design fairly frequently, and there are a number of research papers debating the pros and cons of contravariance versus covariance for complex type inheritance. However, for the purpose of schema type inheritance in CogPrime, the greater logical consistency of the contravariance approach holds sway.
For instance, in this approach, INT -> INT is not a subtype of NO -> INT (where NO denotes NUMBER), because NO -> INT is the type that includes all functions which take a number and return an int, and an INT -> INT function does not take an arbitrary number. Rather, the containment is the other way around: every NO -> INT function is an example of an INT -> INT function. For example, consider the NO -> INT function that takes every number and rounds it up to the nearest integer. Considered as an INT -> INT function, this is simply the identity function: it is the function that takes an integer and rounds it up to the nearest integer, i.e. returns it unchanged.
Of course, tupling of types is different; it's covariant. If one has an ordered pair whose elements are of different types, say (T1, T2), then we have
(T1, S1) inh (T2, S2)
if
T1 inh T2
S1 inh S2
As a mnemonic formula, we may say
(general -> specific) inherits from (specific -> general)
(specific, specific) inherits from (general, general)
In schema learning, we will also have use for abstract type constructions, such as
(T1, T2) where T1 inherits from T2
Notationally, we will refer to variable types as Xv1, Xv2, etc., and then denote the inheritance relationships by using numerical indices, e.g. using
[1 inh 2]
to denote that
Xv1 inh Xv2
So for example,
(INT, VOID) inh (Xv1, Xv2)
is true, because there are no restrictions on the variable types, and we can just assign Xv1 = INT, Xv2 = VOID.
On the other hand,
( INT, VOID ) inh [ ( Xv1, Xv2 ), 1 inh 2 ]
is false, because the restriction Xv1 inh Xv2 is imposed, but it's not true that INT inh VOID.
The following list gives some examples of type inheritance, using the elementary types INT,
FLOAT (FL), NUMBER (NO), CHAR and STRING (STR), with the elementary type inheri-
tance relationships
• INT inh NUMBER
• FLOAT inh NUMBER
• CHAR inh STRING
• ( NO -> FL ) inh ( INT -> FL )
• ( FL -> INT ) inh ( FL -> NO )
• ( ( INT -> FL ) -> ( FL -> INT ) ) inh ( ( NO -> FL ) -> ( FL -> NO ) )
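The contravariance rule is straightforward to implement and check mechanically. The following Python sketch is an illustrative assumption (not code from the CogPrime codebase); it encodes arrow types as tagged tuples and verifies the examples just listed.

```python
# Elementary inheritance relationships, as given in the text:
# INT inh NUMBER (NO), FLOAT (FL) inh NUMBER, CHAR inh STRING (STR).
ELEM = {("INT", "NO"), ("FL", "NO"), ("CHAR", "STR")}

def inh(a, b):
    """True if type a inherits from type b. Arrow types are ('->', T, S)."""
    if a == b:
        return True
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0] == "->":
        # (T1 -> S1) inh (T2 -> S2) iff T2 inh T1 and S1 inh S2:
        # contravariant in the input, covariant in the output.
        return inh(b[1], a[1]) and inh(a[2], b[2])
    return (a, b) in ELEM

arrow = lambda t, s: ("->", t, s)

inh(arrow("NO", "FL"), arrow("INT", "FL"))    # True: (NO -> FL) inh (INT -> FL)
inh(arrow("FL", "INT"), arrow("FL", "NO"))    # True: (FL -> INT) inh (FL -> NO)
inh(arrow("INT", "INT"), arrow("NO", "INT"))  # False, per the rounding example
```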
20.3.5 Advanced Schema Manipulation
Now we describe some special schema for manipulating schema, which seem to be very useful
in certain contexts.
20.3.5.1 Listification
First, there are two ways to represent n-ary relations in CogPrime's Atom-level knowledge representation language: using lists, as in
f_list (x1, ..., xn)
or using currying as in
f_curry x1 ... xn
To make conversion between list and curried forms easier, we have chosen to introduce special
schema (combinators) just for this purpose:
listify f = f_list, so that f_list (x1, ..., xn) = f x1 ... xn
unlistify (listify f) = f
For instance
kick_curry Ben Ken
denotes
(kick_curry Ben) Ken
which means that kick is applied to the argument Ben to yield a predicate schema applied to Ken. This is the curried style. The list style is
kick_list (Ben, Ken)
where kick is viewed as taking as an argument the List (Ben, Ken). The conversion between the two is done by
listify kick_curry = kick_list
unlistify kick_list = kick_curry
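These two combinators have direct analogues as higher-order functions. The sketch below is illustrative only (the actual Combo schema are Atomspace structures, not Python closures), and for simplicity unlistify is shown for the binary case:

```python
from functools import reduce

def listify(f_curry):
    """Convert f x1 ... xn into f_list((x1, ..., xn))."""
    return lambda args: reduce(lambda g, x: g(x), args, f_curry)

def unlistify(f_list):
    """Convert f_list((x1, x2)) into f x1 x2 (binary case only)."""
    return lambda x1: lambda x2: f_list((x1, x2))

# The kick example from the text, with a toy grounding.
kick_curry = lambda agent: lambda target: f"{agent} kicks {target}"
kick_list = listify(kick_curry)

kick_list(("Ben", "Ken"))           # "Ben kicks Ken"
unlistify(kick_list)("Ben")("Ken")  # "Ben kicks Ken", recovering curried style
```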
As a more detailed example of unlistification, let us utilize a simple mathematical example, the function (x - 1)^2. If we use the notations - and pow to denote SchemaNodes embodying the corresponding operations, then this formula may be written in node-and-link form as
ExOutLink
    pow
    ListLink
        ExOutLink
            -
            ListLink
                x
                1
        2
But to get rid of the nasty variable x, we need to first unlistify the functions pow and -, and then apply the C and B combinators a couple of times to move the variable x to the front. The B combinator (see Combinatory Logic REF) is recalled below:
B f g h = f (g h)
This is accomplished as follows (using the standard convention of left-associativity for the application operator, as in the tree representation given in Section 20.3.0.6):
pow(-(x, 1), 2)
(unlistify pow) (-(x, 1)) 2
C (unlistify pow) 2 (-(x, 1))
C (unlistify pow) 2 ((unlistify -) x 1)
C (unlistify pow) 2 (C (unlistify -) 1 x)
B (C (unlistify pow) 2) (C (unlistify -) 1) x
yielding the final schema
(C (unlistify pow) 2) (C (unlistify -) 1)
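The derivation above can be checked mechanically. The following Python sketch uses stand-ins for the Combo schema (the list-style pow and - functions and the unlistify helper are illustrative assumptions), and confirms that the final variable-free schema computes x -> (x - 1)^2:

```python
B = lambda f: lambda g: lambda h: f(g(h))  # B f g h = f (g h)
C = lambda f: lambda x: lambda y: f(y)(x)  # C f x y = f y x

# List-style schema for pow and -, plus binary unlistification.
pow_list   = lambda args: args[0] ** args[1]
minus_list = lambda args: args[0] - args[1]
unlistify  = lambda f: lambda x: lambda y: f((x, y))

# The final schema: B (C (unlistify pow) 2) (C (unlistify -) 1)
schema = B(C(unlistify(pow_list))(2))(C(unlistify(minus_list))(1))

schema(5)  # 16, i.e. (5 - 1)**2
```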
By the way, a variable-free representation of this schema in CogPrime would look like
ExOutLink
    ExOutLink
        B
        ExOutLink
            ExOutLink
                C
                ExOutLink
                    unlistify
                    pow
            2
    ExOutLink
        ExOutLink
            C
            ExOutLink
                unlistify
                -
        1
The main thing to be observed is that the introduction of these extra schema lets us remove the variable x. The size of the schema is increased slightly in this case, but only slightly - an increase that is well justified by the elimination of the many difficulties that explicit variables would bring to the system. Furthermore, there is a shorter rendition, which looks like
ExOutLink
    ExOutLink
        B
        ExOutLink
            ExOutLink
                C
                pow_curried
            2
    ExOutLink
        ExOutLink
            C
            -_curried
        1
This rendition uses alternate variants of the - and pow schema, labeled -_curried and pow_curried, which do not act on lists but are curried in the manner of combinatory logic and Haskell. It is 13 lines whereas the variable-bearing version is 9 lines: a minor increase in length that brings a lot of operational simplification.
20.3.5.2 Argument Permutation
In dealing with List relationships, there will sometimes be use for an argument-permutation operator; let us call it P, defined as follows:
(P p f) (v1, ..., vn) = f (p (v1, ..., vn))
where p is a permutation on n letters. This deals with the case where we want to say, for instance, that
Equivalence parent(x,y) child(y,x)
Instead of positing variable names x and y that span the two relations parent(x,y) and child(y,x), what we can instead say in this example is
Equivalence parent (P (2,1) child)
For the case of two-argument functions, argument permutation is basically doing on the list
level what the C combinator does in the curried function domain. On the other hand, in the
case of n-argument functions with n>2, argument permutation doesn't correspond to any of
the standard combinators.
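The definition of P translates directly into code. In this illustrative Python sketch (the toy parent relation is an assumption for exposition), a permutation is given as a tuple of 1-based argument positions:

```python
def P(p, f):
    """(P p f)(v1, ..., vn) = f(p(v1, ..., vn)), with p a tuple of
    1-based positions specifying the permutation."""
    return lambda args: f(tuple(args[i - 1] for i in p))

# Toy relation: parent(x, y) holds iff child(y, x) holds.
parent = lambda args: args in {("Anna", "Ben")}
child = P((2, 1), parent)  # child = P (2,1) parent, i.e. child(y,x) = parent(x,y)

child(("Ben", "Anna"))   # True: Ben is a child of Anna
child(("Anna", "Ben"))   # False
```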
Finally, let's conclude with a similar example in a more standard predicate logic notation, involving both combinators and the argument-permutation operator introduced above. We will translate the variable-laden predicate
likes(y,x) AND likes(x,y)
into the equivalent combinatory logic tree. Let us first recall the combinator S, whose function is to distribute an argument over two terms:
S f g x = (f x) (g x)
Assume that the two inputs are going to be given to us as a list. Now, the combinatory logic
representation of this is
S (B AND (B (P (2,1) likes))) likes
We now show how this would be evaluated to produce the correct expression:
S (B AND (B (P (2,1) likes))) likes (x,y)
S gets evaluated first, to produce
(B AND (B (P (2,1) likes)) (x,y)) (likes (x,y))
now the first B:
AND ((B (P (2,1) likes)) (x,y)) (likes (x,y))
now the second one:
AND ((P (2,1) likes) (x,y)) (likes (x,y))
now P:
AND (likes (y,x)) (likes (x,y))
which is what we wanted.
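This final equivalence can also be verified in code. Note that in a strictly curried Python encoding, the inner B of the expression above is not needed once likes takes its arguments as a list, so the sketch below checks the equivalent expression S (B AND (P (2,1) likes)) likes; all groundings are illustrative assumptions:

```python
S = lambda f: lambda g: lambda x: f(x)(g(x))  # S f g x = (f x) (g x)
B = lambda f: lambda g: lambda h: f(g(h))     # B f g h = f (g h)
P = lambda p: lambda f: lambda args: f(tuple(args[i - 1] for i in p))

AND = lambda a: lambda b: a and b
likes = lambda args: args in {("Ben", "Ken"), ("Ken", "Ben"), ("Ben", "Nil")}

# Equivalent combinatory form of (x,y) -> likes(y,x) AND likes(x,y).
pred = S(B(AND)(P((2, 1))(likes)))(likes)

pred(("Ben", "Ken"))  # True: Ben and Ken like each other
pred(("Ben", "Nil"))  # False: the liking is not mutual
```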
Chapter 21
Representing Procedural Knowledge
21.1 Introduction
We now turn to CogPrime's representation and manipulation of procedural knowledge. In a
sense this is the most fundamental kind of knowledge - since intelligence is most directly about
action selection, and it is procedures which generate actions.
CogPrime involves multiple representations for procedures, including procedure maps and
(for sensorimotor procedures) neural nets or similar structures. Its most basic procedural knowl-
edge representation, however, is the program. The choice to use programs to represent proce-
dures was made after considerable reflection — they are not of course the only choice, as other
representations such as recurrent neural networks possess identical representational power, and
are preferable in some regards (e.g. resilience with respect to damage). Ultimately, however,
we chose programs due to their consilience with the software and hardware underlying CogPrime (and every other current AI program). CogPrime is a program; current computers and operating systems are optimized for executing and manipulating programs; and we humans
now have many tools for formally and informally analyzing and reasoning about programs. The
human brain probably doesn't represent most procedures as programs in any simple sense, but
CogPrime is not intended to be an emulation of the human brain. So, the representation of
programs as procedures is one major case where CogPrime deviates from the human cogni-
tive architecture in the interest of more effectively exploiting its own hardware and software
infrastructure.
CogPrime represents procedures as programs in an internal programming language called
"Combo." While Combo has a textual representation, described online at the OpenCog wiki,
this isn't one of its more important aspects (and may be redesigned slightly or wholly without
affecting system intelligence or architecture); the essence of Combo programs lies in their tree
representation not their text representation. One could fairly consider Combo as a dialect of
LISP, although it's not equivalent to any standard dialect, and it hasn't particularly been
developed with this in mind. In this chapter we discuss the key concepts underlying the Combo
approach to program representation, seeking to make clear at each step the motivations for
doing things in the manner proposed.
In terms of the overall CogPrime architecture diagram given in Chapter 6 of Part 1, this
chapter is about the box labeled "Procedure Repository." The latter, in OpenCog, is a special-
ized component connected to the AtomSpace, storing Combo tree representations of programs;
each program in the repository is linked to a SchemaNode in the AtomSpace, ensuring full
connectivity between procedural and declarative knowledge.
21.2 Representing Programs
What is a "program" anyway? What distinguishes a program from an arbitrary representation
of a procedure?
The essence of programmatic representations is that they are well-specified, compact, com-
binatorial, and hierarchical:
• Well-specified: unlike sentences in natural language, programs are unambiguous; two distinct
programs can be precisely equivalent.
• Compact: programs allow us to compress data on the basis of their regularities. Accordingly,
for the purposes of this chapter, we do not consider overly constrained representations such
as the well-known conjunctive and disjunctive normal forms for Boolean formulae to be
programmatic. Although they can express any Boolean function (data), they dramatically
limit the range of data that can be expressed compactly, compared to unrestricted Boolean
formulae.
• Combinatorial: programs access the results of running other programs (e.g. via function
application), as well as delete, duplicate, and rearrange these results (e.g., via variables or
combinators).
• Hierarchical: programs have intrinsic hierarchical organization, and may be decomposed
into subprograms.
Eric Baum has advanced a theory "under which one understands a problem when one has mental programs that can solve it and many naturally occurring variations" [Bau06]. In this perspective - which we find an agreeable way to think about procedural knowledge, though perhaps an overly limited perspective on mind as a whole - one of the primary goals of artificial general intelligence is systems that can represent, learn, and reason about such programs [Bau04, Bau06]. Furthermore, integrative AGI systems such as CogPrime may contain
subsystems operating on programmatic representations. Would-be AGI systems with no direct
support for programmatic representation will clearly need to represent procedures and proce-
dural abstractions somehow. Alternatives such as recurrent neural networks have serious down-
sides, including opacity and inefficiency, but also have their advantages (e.g. recurrent neural
nets can be robust with regard to damage, and learnable via biologically plausible algorithms).
Note that the problem of how to represent programs for an AGI system dissolves in the unrealistic case of unbounded computational resources. The solution is algorithmic information theory [Cha08], extended recently to the case of sequential decision theory [Hut05a]. The latter work defines the universal algorithmic agent AIXI, which in effect simulates all possible programs that are in agreement with the agent's set of observations. While AIXI is uncomputable, the related agent AIXItl may be computed, and is superior to any other agent bounded by time t and space l [Hut05a]. The choice of a representational language for programs¹ is of no consequence, as it will merely introduce a bias that will disappear within a constant number of time steps.²
¹ As well as a language for proofs in the case of AIXItl.
² The universal distribution converges quickly.
Our goal in this chapter is to provide practical techniques for approximating the ideal provided by algorithmic probability, based on what Pei Wang has termed the assumption of insufficient knowledge and resources [Wan06], and assuming an AGI architecture that's at least vaguely humanlike in nature, operates largely in everyday human environments, but uses programs to represent many procedures. Given these assumptions, how programs are represented is of paramount importance, as we shall see in the next two sections, where we give a conceptual formulation of what we mean by tractable program representations, and introduce tools for formalizing such representations. The fourth section delves into effective techniques for representing programs. A key concept throughout is syntactic-semantic correlation, meaning that programs which are similar on the syntactic level, within certain constraints, will tend to also be similar in terms of their behavior (i.e. on the semantic level). Lastly, the fifth section changes direction a bit and discusses the translation of programmatic structure into declarative form for the purposes of logical inference.
In the future, we will experimentally validate that these normal forms and heuristic trans-
formations do in fact increase the syntactic-semantic correlation in program spaces, as has
been shown so far only in the Boolean case. We would also like to explore the extent to which
even stronger correlation, and additional tractability properties, can be observed when realistic
probabilistic constraints on "natural" environment and task spaces are imposed.
The importance of a good programmatic representation of procedural knowledge becomes
quite clear when one thinks about it in terms of the Mind-World Correspondence Principle
introduced in Chapter 10. That principle states, roughly, that transition paths between world-
states should map naturally onto transition paths between mind-states. This suggests that there
should be a natural, smooth mapping between real-world action series and the corresponding
series of internal states. Where internal states are driven by explicitly given programs, this
means that the transitions between internal program states should nicely mirror transitions
between the states of the real world as it interacts with the system controlled by the program.
The extent to which this is true will depend on the specifics of the programming language - and
it will be true to a much greater extent, on the whole, if the programming language displays
high syntactic-semantic correlation for behaviors that commonly occur when the program is
used to control the system in the real world. So, the various technical issues mentioned above
and considered below, regarding the qualities desired in a programmatic representation, are
merely the manifestation of the general Mind-World Correspondence Principle in the context of
procedural knowledge, under the assumption that procedures are represented as programs. The
material in this chapter may be viewed as an approach to ensuring the validity of the Mind-
World Correspondence principle for programmatically-represented procedural knowledge, for
CogPrime systems concerned with achieving humanly meaningful goals in everyday human
environments.
21.3 Representational Challenges
Despite the advantages outlined in the previous section, there are a number of challenges in
working with programmatic representations:
• Open-endedness - in contrast to some other knowledge representations current in machine
learning, programs vary in size and "shape", and there is no obvious problem-independent
upper bound on program size. This makes it difficult to represent programs as points in a
fixed-dimensional space, or to learn programs with algorithms that assume such a space.
• Over-representation - often, syntactically distinct programs will be semantically identical (i.e. represent the same underlying behavior or functional mapping). Lacking prior knowledge, many algorithms will inefficiently sample semantically identical programs repeatedly [Loo07a, GIW01].
• Chaotic Execution - programs that are very similar, syntactically, may be very different, semantically. This presents difficulties for many heuristic search algorithms, which require syntactic and semantic distance to be correlated [Loo07b, TVCC05].
• High resource-variance - programs in the same space vary greatly in the space and time
they require to execute.
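The over-representation issue is visible even for tiny Boolean programs. The following Python sketch (the tuple-tree representation is an assumption for illustration, not CogPrime's actual program encoding) compares two programs syntactically and semantically, the latter via their truth tables:

```python
from itertools import product

def ev(tree, env):
    """Evaluate a Boolean program given as a nested tuple."""
    op = tree[0]
    if op == "var":
        return env[tree[1]]
    if op == "not":
        return not ev(tree[1], env)
    if op == "and":
        return ev(tree[1], env) and ev(tree[2], env)
    if op == "or":
        return ev(tree[1], env) or ev(tree[2], env)

def semantics(tree, variables):
    """The program's truth table over all assignments to its variables."""
    return tuple(ev(tree, dict(zip(variables, bits)))
                 for bits in product([False, True], repeat=len(variables)))

# Two syntactically distinct trees, semantically identical (De Morgan):
t1 = ("not", ("and", ("var", "x"), ("var", "y")))
t2 = ("or", ("not", ("var", "x")), ("not", ("var", "y")))

t1 != t2                                               # syntactically distinct
semantics(t1, ["x", "y"]) == semantics(t2, ["x", "y"]) # semantically identical
```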
It's easy to see how the latter two issues may present a challenge for mind-world correspondence! Chaotic execution makes it hard to predict whether a program will indeed manifest state-sequences mapping nicely to corresponding world-sequences; and high resource-variance makes it hard to predict whether, for a given program, this sort of mapping can be achieved for relevant goals given available resources.
Based on these concerns, it is no surprise that search over program spaces quickly succumbs
to combinatorial explosion, and that heuristic search methods are sometimes no better than
random sampling [LP02]. However, alternative representations of procedures also have their
difficulties, and so far we feel the thornier aspects of programmatic representation are generally
an acceptable price to pay in light of the advantages.
For some special cases in CogPrime we have made a different choice - e.g. when we use
DeSTIN for sensory perception (see Chapter 28) we utilize a more specialized representation
comprising a hierarchical network of more specialized elements. DeSTIN doesn't have problems
with resource variance or chaotic execution, though it does suffer from over-representation. It
is not very open-ended, which helps increase its efficiency in the perceptual processing domain,
but may limit its applicability to more abstract cognition. In short we feel that, for general
representation of cognitive procedures, the benefits of programmatic representation outweigh
the costs; but for some special cases such as low-level perception and motor procedures, this
may not be true and one may do better to opt for a more specialized, more rigid but less
problematic representation.
It would be possible to modify CogPrime to use, say, recurrent neural nets for procedure
representation, rather than programs in an explicit language. However, this would rate as a
rather major change in the architecture, and would cause multiple problems in other aspects
of the system. For example, programs are reasonably straightforward to reason about using
PLN inference, whereas reasoning about the internals of recurrent neural nets is drastically
more problematic, though not impossible. The choice of a procedure representation approach
for CogPrime has been made considering not only procedural knowledge in itself, but the
interaction of procedural knowledge with other sorts of knowledge. This reflects the general
synergetic nature of the CogPrime design.
There are also various computation-theoretic issues regarding programs; however, we suspect
these are not particularly relevant to the task of creating human-level AGI, though they may
rear their heads when one gets into the domain of super-human, profoundly self-modifying AGI
systems. For instance, in the context of the difficulties caused by over-representation and
high resource-variance, one might observe that determinations of e.g. programmatic equivalence
for the former, and e.g. halting behavior for the latter, are uncomputable. But we feel that,
given the assumption of insufficient knowledge and resources, these concerns dissolve into the
larger issue of computational intractability and the need for efficient heuristics. Determining the
equivalence of two Boolean formulae over 500 variables by computing and comparing their truth
tables is trivial from a computability standpoint, but, in the words of Leonid Levin, "only math
nerds would call 2^500 finite" [Lev94]. Similarly, a program that never terminates is a special
case of a program that runs too slowly to be of interest to us.
One of the key ideas underlying our treatment of programmatic knowledge is that, in order
to tractably learn and reason about programs, an AI system must have prior knowledge of
programming language semantics. That is, in the approach we advocate, the mechanism whereby
programs are executed is assumed known a priori, and assumed to remain constant across many
problems. One may then craft AI methods that make specific use of the programming language
semantics, in various ways. Of course in the long run a sufficiently powerful AGI system could
modify these aspects of its procedural knowledge representation; but in that case, according
to our approach, it would also need to modify various aspects of its procedure learning and
reasoning code accordingly.
Specifically, we propose to exploit prior knowledge about program structure by requiring
programs to be represented in normal forms that preserve their hierarchical structure, and to
be heuristically simplified based on reduction rules. Accordingly, one formally equivalent pro-
gramming language may be preferred over another by virtue of making these reductions and
transformations more explicit and concise to describe and to implement. The current OpenCog-
Prime system uses a simple LISP-like language called Combo (which takes both tree form and
textual form) to represent procedures, but this is not critical; the main point is using some
language or language variant that is "tractable" in the sense of providing a context in which
the semantically useful reductions and transformations we've identified are naturally expressible
and easily usable.
21.4 What Makes a Representation Tractable?
Creating a comprehensive formalization of the notion of a tractable program representation
would constitute a significant achievement; and we will not answer that summons here. We will,
however, take a step in that direction by enunciating a set of positive principles for tractable
program representations, corresponding closely to the list of representational challenges above.
While the discussion in this section is essentially conceptual rather than formal, we will use a
bit of notation to ensure clarity of expression: S to denote a space of programmatic functions
of the same type (e.g. all pure Lisp λ-expressions mapping from lists to numbers), and B to
denote a metric space of behaviors.
In the case of a deterministic, side-effect-free program, execution maps from programs in S to
points in B, which will have separate dimensions for the function's output across various inputs
of interest, as well as dimensions corresponding to the time and space costs of executing the
program. In the case of a program that interacts with an external environment, or is intrinsically
nondeterministic, execution will map from S to probability distributions over points in B, which
will contain additional dimensions for any side-effects of interest that programs in S might have.
Note the distinction between syntactic distance, measured as e.g. tree-edit distance between
programs in S, and semantic distance, measured between programs' corresponding points in, or
probability distributions over, B. We assume that semantic distance accurately quantifies our
54 21 Representing Procedural Knowledge
preferences in terms of a weighting on the dimensions of B; i.e., if variation along some axis is
of great interest, our metric for semantic distance should reflect this.
Let P be a probability distribution over B that describes our knowledge of what sorts of
problems we expect to encounter, and let R(n) ⊆ S be all the programs in our representation with
(syntactic) size no greater than n. We will say that R(n) d-covers the pair (B, P) to extent
p if the probability that, for a random behavior b ∈ B chosen according to P, there is some
program in R(n) whose behavior is within semantic distance d of b, is greater than or equal to
p. Then, some among the various properties of tractability that seem important based on the
above discussion are as follows:
• for fixed d, p quickly goes to 1 as n increases,
• for fixed p, d quickly goes to 0 as n increases,
• for fixed d and p, the minimal n needed for R(n) to d-cover (B, P) to extent p should be
as small as possible,
• ceteris paribus, syntactic and semantic distance (measured according to P) are highly cor-
related.
This is closely related to the Mind-Brain Correspondence Principle articulated in Chapter
10, and to the geometric formulation of cognitive synergy posited in Appendix ??. Syntactic
distance has to do with distance along paths in mind-space related to formal program structures,
and semantic distance has to do with distance along paths in mind-space and world-space
corresponding to the record of the program's actual behavior. If syntax-semantics correlation
failed, then there would be paths through mind-space (related to formal program structures)
that were poorly matched to their closest corresponding paths through the rest of mind-space
and world-space, hence causing a failure (or significant diminution) of cognitive synergy and
mind-world correspondence.
Since execution time and memory usage considerations may be incorporated into the defini-
tion of program behavior, minimizing chaotic execution and managing resource variance emerge
conceptually here as subcases of maximizing correlation between syntactic and semantic dis-
tance. Minimizing over-representation follows from the desire for small program size: roughly
speaking, the less over-representation there is, the smaller average program size can be achieved.
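To make the syntax-semantics correlation property concrete, here is a toy experiment of ours (not from the CogPrime codebase) that samples random pairs of Boolean formulas over three variables and measures the correlation between a crude syntactic tree distance and truth-table (semantic) distance; all function names and the particular distance measures are illustrative choices.

```python
import itertools, random

VARS = ['x', 'y', 'z']

def random_formula(depth=2):
    # Randomly build a small Boolean formula as a nested tuple.
    if depth == 0 or random.random() < 0.3:
        return random.choice(VARS)
    op = random.choice(['and', 'or', 'not'])
    if op == 'not':
        return ('not', random_formula(depth - 1))
    return (op, random_formula(depth - 1), random_formula(depth - 1))

def evaluate(f, env):
    if isinstance(f, str):
        return env[f]
    op, args = f[0], [evaluate(a, env) for a in f[1:]]
    return not args[0] if op == 'not' else {'and': all, 'or': any}[op](args)

def semantic_distance(f, g):
    # Hamming distance between truth tables: a metric on behavior space B.
    envs = [dict(zip(VARS, bits))
            for bits in itertools.product([False, True], repeat=len(VARS))]
    return sum(evaluate(f, e) != evaluate(g, e) for e in envs)

def size(f):
    return 1 if isinstance(f, str) else 1 + sum(size(a) for a in f[1:])

def syntactic_distance(f, g):
    # Crude stand-in for tree-edit distance.
    if isinstance(f, str) or isinstance(g, str):
        return 0 if f == g else max(size(f), size(g))
    d = 0 if f[0] == g[0] else 1
    d += sum(syntactic_distance(a, b) for a, b in zip(f[1:], g[1:]))
    return d + abs(size(f) - size(g))

random.seed(0)
pairs = [(random_formula(), random_formula()) for _ in range(300)]
pts = [(syntactic_distance(f, g), semantic_distance(f, g)) for f, g in pairs]
xs, ys = zip(*pts)
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in pts)
var = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
print(round(cov / var, 2))   # Pearson correlation of the two distances
```

For a tractable representation we would want this correlation to be high; normal forms such as those introduced later in the chapter are intended to push it upward.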
In some cases one can achieve fairly strong results about tractability of representations
without any special assumptions about P: for example in prior work we have shown that
adoption of an appropriate hierarchical normal form can generically increase correlation between
syntactic and semantic distance in the space of Boolean functions [?, Loo06]. In this case we
may say that we have a generically tractable representation. However, to achieve tractable
representation of more complex programs, some fairly strong assumptions about P will be
necessary. This should not be philosophically disturbing, since it's clear that human intelligence
has evolved in a manner strongly conditioned by certain classes of environments; and similarly,
what we need to do to create a viable program representation system for pragmatic AGI usage,
is to achieve tractability relative to the distribution P corresponding to the actual problems
the AGI is going to need to solve. Formalizing the distributions P of real-world interest is
a difficult problem, and one we will not address here (recall the related, informal discussions
of Chapter 9, where we considered the various important peculiarities of the human everyday
world). However, we hypothesize that the representations presented in the following section may
be tractable to a significant extent irrespective of P,3 and even more powerfully tractable with
3 Specifically, with only weak biases that prefer smaller and faster programs with hierarchical decompositions.
respect to this as-yet unformalized distribution. As weak evidence in favor of this hypothesis,
we note that many of the representations presented have proved useful so far in various narrow
problem-solving situations.
21.5 The Combo Language
The current version of OpenCogPrime uses a simple language called Combo, which is an example
of a language in which the transformations we consider important for AGI-focused program
representation are relatively simple and natural. Here we illustrate the Combo language by
example, referring the reader to the OpenCog wiki site for a formal presentation.
The main use of the Combo language in OpenCog is behind-the-scenes, i.e. using tree repre-
sentations of Combo programs; but there is also a human-readable syntax, and an interpreter
that allows humans to write Combo programs when needed. The main use of Combo, however,
is not for human-coded programs, but rather for programs that are learned via various AI
methods.
In Combo all expressions are in prefix form like LISP, but the left parenthesis is placed after
the operator instead of before, for example:
• +(4 5)
is a 0-ary expression that returns 4 + 5
• and(#1 0<(#2))
is a binary expression of type bool × float → bool that returns true if and only if the first
input is true and the second input positive. #n designates the n-th input.
• fact(#1) := if(0<(#1) *(#1 fact(+(#1 -1))) 1)
is a recursive definition of factorial.
• and_seq(goto(stick) grab(stick) goto(owner) drop)
is a 0-ary expression with side effects; it evaluates a sequence of actions until completion or
failure of one of them. Each action is executed in the environment the agent is connected to
and returns action_success upon success or action_failure otherwise. The action sequence
returns action_success if it completes or action_failure if it does not.
• if(near(owner self)
lick(owner)
and_seq(goto(owner) wag))
is a 0-ary expression with side effects; it means that if at the time of its evaluation the agent
referred to as self (here a virtual pet) is near its owner then lick him/her, otherwise go to the
owner and wag the tail.
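To make the syntax concrete, here is a minimal sketch of a parser and evaluator for this prefix notation. It is our illustration, not the actual OpenCog Combo interpreter, and the handful of operators covered are just those appearing in the examples above.

```python
# A minimal sketch of parsing and evaluating Combo-style prefix expressions
# such as "+(4 5)" or "and(#1 0<(#2))". Not the real OpenCog implementation.
import re

def tokenize(s):
    return re.findall(r'[()]|[^\s()]+', s)

def parse(tokens):
    # Grammar: expr := ATOM | ATOM '(' expr* ')'
    head = tokens.pop(0)
    if tokens and tokens[0] == '(':
        tokens.pop(0)                  # consume '('
        args = []
        while tokens[0] != ')':
            args.append(parse(tokens))
        tokens.pop(0)                  # consume ')'
        return (head, args)
    return (head, [])

OPS = {
    '+': lambda args: sum(args),
    '*': lambda args: args[0] * args[1],
    '0<': lambda args: args[0] > 0,    # Combo's "greater than zero" predicate
    'and': lambda args: all(args),
}

def evaluate(expr, inputs=()):
    head, args = expr
    if head.startswith('#'):           # #n designates the n-th input
        return inputs[int(head[1:]) - 1]
    if not args:
        return float(head)             # a numeric leaf
    vals = [evaluate(a, inputs) for a in args]
    return OPS[head](vals)

print(evaluate(parse(tokenize('+(4 5)'))))                        # 9.0
print(evaluate(parse(tokenize('and(#1 0<(#2))')), (True, 3.0)))   # True
```

Side-effectful operators such as and_seq would additionally need an environment to act in, which this sketch omits.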
21.6 Normal Forms Postulated to Provide Tractable Representations
We now present a series of normal forms for programs, postulated to provide tractable repre-
sentations in the contexts relevant to human-level, roughly human-like general intelligence.
21.6.1 A Simple Type System
We use a simple type system to distinguish between the various normal forms introduced below.
This is necessary to convey the minimal information needed to correctly apply the basic func-
tions in our canonical forms. Various systems and applications may of course augment these
with additional type information, up to and including the satisfaction of arbitrary predicates
(e.g. a type for prime numbers). This can be overlaid on top of our minimalist system to con-
vey additional bias in selecting which transformations to apply, and introducing constraints as
necessary. For instance, a call to a function expecting a prime number, called with a potentially
composite argument, may be wrapped in a conditional testing the argument's primality. A sim-
ilar technique is used in the normal form for functions to deal with list arguments that may be
empty.
Normal forms are provided for Boolean and number primitive types, and the following
parametrized types:
• list types, listT, where T is any type,
• tuple types, tupleT1,T2,...TN, where all Ti are types, and N is a positive natural number,
• enum types, {s1, s2, ... sN}, where N is a positive natural number and all si are unique identifiers,
• function types T1, T2, ... TN → O, where O and all Ti are types,
• action result types.
A list of type listT is an ordered sequence of any number of elements, all of which must have
type T. A tuple of type tupleT1,T2,...TN is an ordered sequence of exactly N elements, where
every ith element is of type Ti. An enum of type {s1, s2, ... sN} is some element si from the set.
Action result types concern side-effectful interaction with some world external to the system
(but perhaps simulated, of course), and will be described in detail in their subsection below.
Other types may certainly be added at a later date, but we believe that those listed above
provide sufficient expressive power to conveniently encompass a wide range of programs, and
serve as a compelling proof of concept.
The normal form for a type T is a set of elementary functions with codomain T, a set
of constants of type T, and a tree grammar. Internal nodes for expressions described by the
grammar are elementary functions, and leaves are either U_var or U_constant, where U is some
type (often U = T).
Sentences in a normal form grammar may be transformed into normal form expressions. The
set of expressions that may be generated is a function of a set of bound variables and a set
of external functions that must be provided (both bound variables and external functions are
typed). The transformation is as follows:
• T_constant leaves are replaced with constants of type T,
• T_var leaves are replaced with either bound variables matching type T, or expressions of the
form f(expr1, expr2, ... exprM), where f is an external function of type T1, T2, ... TM → T,
and each expri is a normal form expression of type Ti (given the available bound variables
and external functions).
21.6.2 Boolean Normal Form
The elementary functions are and, or, and not. The constants are {true, false}. The grammar
is:
bool_root → or_form | and_form | literal | bool_constant
literal → bool_var | not( bool_var )
or_form → or( {and_form | literal}{2,} )
and_form → and( {or_form | literal}{2,} )
The construct foo{x,} refers to x or more matches of foo (e.g. {x | y}{2,} is two or more
items in sequence where each item is either an x or a y).
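The grammar above amounts to requiring that and and or alternate by level, each with at least two children, with not pushed down onto variables. A small checker (our sketch, representing formulas as nested Python tuples) makes the constraint explicit:

```python
# Sketch of a membership test for the Boolean normal form: 'or' and 'and'
# alternate by level with two or more children each, and 'not' may only
# wrap a variable. Formulas are nested tuples like ('or', 'x', ('and', ...)).
def is_literal(f):
    if isinstance(f, str):
        return True                          # a bool_var
    return f[0] == 'not' and len(f) == 2 and isinstance(f[1], str)

def is_junctor_form(f, op):
    # op is 'or' or 'and'; children must be literals or forms of the dual op.
    dual = 'and' if op == 'or' else 'or'
    return (isinstance(f, tuple) and f[0] == op and len(f) >= 3
            and all(is_literal(c) or is_junctor_form(c, dual) for c in f[1:]))

def is_normal(f):
    return (f in (True, False) or is_literal(f)
            or is_junctor_form(f, 'or') or is_junctor_form(f, 'and'))

print(is_normal(('or', 'x', ('and', 'y', ('not', 'z')))))   # True
print(is_normal(('or', 'x', ('or', 'y', 'z'))))             # False: or under or
```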
21.6.3 Number Normal Form
The elementary functions are * (times) and + (plus). The constants are some subset of the
rationals (e.g. those with IEEE single-precision floating-point representations). The grammar
is:
num_root → times_form | plus_form | num_constant | num_var
times_form → *( {num_constant | plus_form} plus_form{1,} )
           | num_var
plus_form → +( {num_constant | times_form} times_form{1,} )
          | num_var
21.6.4 List Normal Form
For list types listT, the elementary functions are list (an n-ary list constructor) and append.
The only constant is the empty list (nil). The grammar is:
list_T_root → append_form | list_form | list_T_var
            | list_T_constant
append_form → append( {list_form | list_T_var}{2,} )
list_form → list( T_root{1,} )
21.6.5 Tuple Normal Form
For tuple types tupleT1,T2,...TN, the only elementary function is the tuple constructor (tuple).
The constants are
T1_constant × T2_constant × ... × TN_constant
The normal form is either a constant, a var, or
tuple( T1_root T2_root ... TN_root )
21.6.6 Enum Normal Form
Enums are atomic tokens with no internal structure - accordingly, there are no elementary
functions. The constants for the enum {s1, s2, ... sN} are the si. The normal form is either a
constant or a variable.
21.6.7 Function Normal Form
For T1, T2, ... TN → O, the normal form is a lambda-expression of arity N whose body is of
type O. The list of variable names for the lambda-expression is not a "proper" argument - it
does not have a normal form of its own. Assuming that none of the Ti is a list type, the body
of the lambda-expression is simply in the normal form for type O (with the possibility of the
lambda-expression's arguments appearing with their appropriate types). If one or more Ti are
list types, then the body is a call to the split function with all arguments in normal form.
Split is a family of functions with type signatures
((T1, listT1, T2, listT2, ... Tk, listTk → O),
tuple(listT1, O), tuple(listT2, O), ... tuple(listTk, O)) → O.
To evaluate split( f tuple(l1, o1) tuple(l2, o2) ... tuple(lk, ok) ), the list arguments l1, l2, ... lk
are examined sequentially. If some li is found that is empty, then the result is the corresponding
value oi. If all li are nonempty, we deconstruct each of them into xi : xsi, where xi is the first
element of the list and xsi is the rest. The result is then f(x1, xs1, x2, xs2, ... xk, xsk). The split
function thus acts as an implicit case statement to deconstruct lists only if they are nonempty.
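The evaluation rule for split can be rendered directly (an illustrative Python sketch, not OpenCog code), with each (list, default) pair supplied as a tuple:

```python
# A sketch of the split function described above: given f and (list, default)
# pairs, return the matching default if any list is empty; otherwise
# deconstruct every list into head and tail and call f on the pieces.
def split(f, *pairs):
    for lst, default in pairs:
        if not lst:
            return default                      # first empty list wins
    heads_tails = []
    for lst, _ in pairs:
        heads_tails.extend([lst[0], lst[1:]])   # xi, xsi
    return f(*heads_tails)

# Example: sum of the heads of two lists, or a default when either is empty.
add_heads = lambda x, xs, y, ys: x + y
print(split(add_heads, ([1, 2], 0), ([10], 0)))   # 11
print(split(add_heads, ([], -1), ([10], 0)))      # -1
```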
21.6.8 Action Result Normal Form
An action result type act corresponds to the result of taking an action in some world. Every
action result type has a corresponding world type, world. Associated with action results and
worlds are two special sorts of functions:
• Perceptions - functions that take a world as their first argument and regular (non-world
and non-action-result) types as their remaining arguments, and return regular types. Unlike
other function types, the result of evaluating a perception call may be different at different
times, because the world will have different configurations at different times.
• Actions - functions that take a world as their first argument and regular types as their
remaining arguments, and return action results (of the type associated with the type of
their world argument). As with perceptions, the result of evaluating an action call may be
different at different times. Furthermore, actions may have side effects in the associated
world that they are called in. Thus, unlike any other sort of function, actions must be
evaluated, even if their return values are ignored.
Other sorts of functions acting on worlds (e.g. ones that take multiple worlds as arguments) are
disallowed.
Note that an action result expression cannot appear nested inside an expression of any other
type. Consequently, there is no way to convert e.g. an action result to a Boolean, although con-
version in the opposite direction is permitted. This is required because mathematical operations
in our language have classical mathematical semantics; x and y must equal y and x, which will
not generally be the case if x or y can have side-effects. Instead, there are special sequential
versions of the logical functions.
The elementary functions for action result types are andseq (sequential and, equivalent to
C's short-circuiting &&), orseq (sequential or, equivalent to C's short-circuiting ||), and fails
(negates success to failure and vice versa). The constants may vary from type to type but must
at least contain success and failure, indicating absolute success/failure in execution.4 The
normal form is as follows:
act_root → orseq_form | andseq_form | seqlit
seqlit → act | fails( act )
act → act_constant | act_var
orseq_form → orseq( {andseq_form | seqlit}{2,} )
andseq_form → andseq( {orseq_form | seqlit}{2,} )
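The semantics of these combinators can be sketched as follows (an illustrative model of ours, with actions as thunks that log their side effects; note how the failing action halts the andseq so later actions are never executed):

```python
# A sketch of sequential action combinators: andseq short-circuits on the
# first failure, orseq on the first success, and fails inverts the result.
# Actions are modeled as thunks returning True (success) or False (failure).
log = []

def action(name, ok=True):
    def run():
        log.append(name)       # the side effect: actions must be executed
        return ok
    return run

def andseq(*acts):
    return lambda: all(a() for a in acts)    # stops at first failure

def orseq(*acts):
    return lambda: any(a() for a in acts)    # stops at first success

def fails(act):
    return lambda: not act()

program = andseq(action('goto_stick'), action('grab', ok=False),
                 action('goto_owner'))
print(program())   # False
print(log)         # ['goto_stick', 'grab']  (goto_owner never executed)
```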
21.7 Program Transformations
A program transformation is any type-preserving mapping from expressions to expressions.
Some transformations may be guaranteed to preserve semantics. When doing program evolution
there is an intermediate category of fitness-preserving transformations, which may alter semantics
but not fitness. In general, the only way fitness-preserving transformations can be uncovered is
by scoring programs whose semantics have potentially been altered, in order to determine their
fitness - which is what most fitness functions do. On the other hand, if the fitness function is
encompassed in the program itself, so that a candidate directly outputs its own fitness, then
only semantics-preserving transformations are needed.
21.7.1 Reductions
These are semantics preserving transformations that do not increase some size measure (typ-
ically number of symbols), and are idempotent. For example, and(x, x, y) → and(x, y) is a
reduction for Boolean expressions. A set of canonical reductions is defined for every type that
has a normal form. For numerical functions, the simplifier in a computer algebra system may
be used. The full list of reductions is omitted for brevity. An expression is reduced if it maps
to itself under all canonical reductions for its type, and all of its children are reduced.
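A reduction engine of this kind can be sketched in a few lines: apply rules bottom-up and re-reduce until a fixpoint is reached. The single rule below (duplicate-argument elimination for and/or) is illustrative, not CogPrime's canonical rule set:

```python
# A tiny reduction engine sketch: apply canonical reductions bottom-up until
# a fixpoint, e.g. and(x, x, y) -> and(x, y). Illustrative rule set only.
def dedupe_junctor(expr):
    # and/or with duplicate arguments: keep the first occurrence of each.
    if isinstance(expr, tuple) and expr[0] in ('and', 'or'):
        seen, args = set(), []
        for a in expr[1:]:
            if a not in seen:
                seen.add(a)
                args.append(a)
        if len(args) == 1:
            return args[0]                    # and(x) -> x
        return (expr[0],) + tuple(args)
    return expr

def reduce_expr(expr):
    if not isinstance(expr, tuple):
        return expr
    expr = (expr[0],) + tuple(reduce_expr(a) for a in expr[1:])
    reduced = dedupe_junctor(expr)
    return reduced if reduced == expr else reduce_expr(reduced)

print(reduce_expr(('and', 'x', 'x', 'y')))                       # ('and', 'x', 'y')
print(reduce_expr(('or', ('and', 'x', 'x'), ('and', 'x', 'x'))))  # 'x'
```

Because each rule never increases size and is idempotent, the fixpoint iteration terminates.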
Another important set of reductions are the compressive abstractions, which reduce or keep
constant the size of expressions by introducing new functions. Consider
list(*(+(a p q) r)
     *(+(b p q) r)
     *(+(c p q) r))
4 A do(arg1, arg2, ... argN) statement (known as progn in Lisp), which evaluates its arguments sequen-
tially regardless of success or failure, is equivalent to andseq(orseq(arg1, success), orseq(arg2, success), ...
orseq(argN, success)).
which contains 19 symbols. Transforming this to
f(x) := *(+(x p q) r)
list(f(a) f(b) f(c))
reduces the total number of symbols to 15. One can generalize this notion to consider com-
pressive abstractions across a set of programs. Compressive abstractions appear to be rather
expensive to uncover, although not prohibitively so; the computation may easily be parallelized,
and may rely heavily on subtree mining [TODO REF].
21.7.1.1 A Simple Example of Reduction
We now give a simple example of how CogPrime's reduction engine can transform a program
into a semantically equivalent but shorter one.
Consider the following program and the chain of reduction:
1. We start with the expression
if(P and_seq (if (P A B) B) and_seq(A B))
2. A reduction rule permits to reduce the conditional if IP A B) to if (true A B). Indeed if
P is true, then the first branch is evaluated and P must still be true.
if(P and_seq (if (true A B) B) and_seq (A B) I
3. Then a rule can reduce if (true A B) to A.
if(P and_seq (A B) and_seq (A B))
4. And finally another rule replaces the conditional by one of its branches since they are
identical
and_seq (A B)
Note that the reduced program is not only smaller (3 symbols instead of 11) but a bit faster
too. Of course it is not generally true that smaller programs are faster but in the restricted
context of our experiments it has often been the case.
21.7.2 Neutral Transformations
Semantics-preserving transformations that are not reductions are not useful on their own - they
can only have value when followed by transformations from some other class. They are thus
more speculative than reductions, and more costly to consider. We will refer to these as neutral
transformations [Ols95].
• Abstraction - given an expression E containing non-overlapping subexpressions E1, E2,
... EN, let E′ be E with all Ei replaced by the unbound variables v1, v2, ... vN. Define the function
f(v1, v2, ... vN) = E′, and replace E with f(E1, E2, ... EN). Abstraction is distinct from
compressive abstraction because only a single call to the new function f is introduced.6
• Inverse abstraction - replace a call to a user-defined function with the body of the
function, with arguments instantiated (note that this can also be used to partially invert a
compressive abstraction).
6 In compressive abstraction there must be at least two calls in order to avoid increasing the number of symbols.
• Distribution - let E be a call to some function f, and let E′ be a subexpression of E's ith
argument that is a call to some function g, such that f is distributive over g's arguments,
or a subset thereof. We shall refer to the actual arguments to g in these positions in E′ as
x1, x2, ... xn. Now, let D(F) be the function that is obtained by evaluating E with its ith
argument (the one containing E′) replaced with the expression F. Distribution is replacing
E with E′, and then replacing each xj (1 ≤ j ≤ n) with D(xj). For example, consider
+(x *(y if(cond a b)))
Since both + and * are distributive over the result branches of if, there are two possible
distribution transformations, giving the expressions
if(cond +(x *(y a)) +(x *(y b)))
+(x if(cond *(y a) *(y b)))
• Inverse distribution (factorization) - the opposite of distribution. This is nearly a
reduction; the exceptions are expressions such as f(g(x)), where f and g are mutually
distributive.
• Arity broadening - given a function f, modify it to take an additional argument of some
type. All calls to f must be correspondingly broadened to pass it an additional argument
of the appropriate type.
• List broadening - given a function f with some ith argument x of type T, modify f to
instead take an argument y of type listT, which gets split into x : xs. All calls to f with
ith argument x′ must be replaced by corresponding calls with ith argument list(x′).
• Conditional insertion - an expression x is replaced by if(true, x, y), where y is some
expression of the same type as x.
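The distribution transformation can be mechanized; the sketch below (ours, not CogPrime code) lifts the innermost conditional all the way out in a single step, which composes the two single-operator distributions in the example above:

```python
# A sketch of distribution for the special case of lifting a nested
# conditional out through enclosing operators. Expressions are tuples
# such as ('+', 'x', ('*', 'y', ('if', 'cond', 'a', 'b'))).
def find_if(expr, path=()):
    # Return the path (child indices) to the first nested ('if', c, a, b).
    if isinstance(expr, tuple):
        if expr[0] == 'if':
            return path
        for i, sub in enumerate(expr[1:], start=1):
            p = find_if(sub, path + (i,))
            if p is not None:
                return p
    return None

def get_at(expr, path):
    for i in path:
        expr = expr[i]
    return expr

def replace_at(expr, path, new):
    if not path:
        return new
    i = path[0]
    return expr[:i] + (replace_at(expr[i], path[1:], new),) + expr[i + 1:]

def distribute(expr):
    path = find_if(expr)
    if path is None or path == ():
        return expr                   # no strictly-nested conditional
    cond, a, b = get_at(expr, path)[1:]
    return ('if', cond, replace_at(expr, path, a), replace_at(expr, path, b))

e = ('+', 'x', ('*', 'y', ('if', 'cond', 'a', 'b')))
print(distribute(e))
# ('if', 'cond', ('+', 'x', ('*', 'y', 'a')), ('+', 'x', ('*', 'y', 'b')))
```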
As a technical note, action result expressions (which may cause side-effects) complicate neu-
tral transformations. Specifically, abstractions and compressive abstractions must take their
arguments lazily (i.e. not evaluate them before the function call itself is evaluated), in order
to be neutral. Furthermore, distribution and inverse distribution may only be applied when f
has no side-effects that will vary (e.g. be duplicated or halved) in the new expression, or affect
the nested computation (e.g. change the result of a condition within a conditional). Another
way to think about this issue is to consider the action result type as a lazy domain-specific
language embedded within a pure functional language (where evaluation order is unspecified).
Spector has performed an empirical study of the tradeoffs in lazy vs. eager function abstraction
for program evolution [Spe96].
The number of neutral transformations applicable to any given program grows quickly with
program size.7 Furthermore, synthesis of complex programs and abstractions does not seem to
be possible without them. Thus, a key hypothesis of any approach to AGI requiring significant
program synthesis, without assuming the currently infeasible computational capacities required
to brute-force the problem, is that the inductive bias to select promising neutral transforma-
tions can be learned and/or programmed. Referring back to the initial discussion of what con-
stitutes a tractable representation, we speculate that perhaps, whereas well-chosen reductions
are valuable for generically increasing program representation tractability, well-chosen neutral
transformations will be valuable for increasing program representation tractability relative to
distributions P to which the transformations have some (possibly subtle) relationship.
6 Analogous tuple-broadening transformations may be defined as well, but are omitted for brevity.
7 Exact calculations are given by Olsson [Ols95].
21.7.3 Non-Neutral Transformations
Non-neutral transformations are the general class defined by removal, replacement, and inser-
tion of subexpressions, acting on expressions in normal form, and preserving the normal form
property. Clearly these transformations are sufficient to convert any normal form expression
into any other. What is desired is a subclass of the non-neutral transformations that is combi-
natorially complete, where each individual transformation is nonetheless a semantically small
step.
The full set of transformations for Boolean expressions is given in [Loo06]. For numerical
expressions, the transcendental functions sin, log, and exp are used to construct transformations.
These obviate the need for division (a/b = e^(log(a) - log(b))), and subtraction (a - b = a + -1 * b).
For lists, transformations are based on insertion of new leaves (e.g. to append function calls),
and "deepening" of the normal form by insertion of subclauses (see [Loo06] for details). For
tuples, we take the union of the transformations of all the subtypes. For other mixed-type
expressions the union of the non-neutral transformations for all types must be considered as
well. For enum types the only transformation is replacing one symbol with another. For function
types, the transformations are based on function composition. For action result types, actions
are inserted/removed/altered, akin to the treatment of Boolean literals for the Boolean type.
We propose an additional class of non-neutral transformations based on the marvelous fold
function:
fold(f, v, l) = if(empty(l), v, f(first(l), fold(f, v, rest(l))))
With fold we can express a wide variety of iterative constructs, with guaranteed termination
and a bias towards low computational complexity. In fact, fold allows us to represent exactly
the primitive recursive functions [?].
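A direct rendering of this definition, with a few of the iterative constructs it subsumes (our illustration):

```python
# A sketch of the fold combinator defined above: iteration with guaranteed
# termination, here used to build sum, length, and map without any other
# explicit recursion.
def fold(f, v, l):
    # fold(f, v, l) = if(empty(l), v, f(first(l), fold(f, v, rest(l))))
    if not l:
        return v
    return f(l[0], fold(f, v, l[1:]))

total  = lambda l: fold(lambda x, acc: x + acc, 0, l)
length = lambda l: fold(lambda _, acc: 1 + acc, 0, l)
mapped = lambda g, l: fold(lambda x, acc: [g(x)] + acc, [], l)

print(total([1, 2, 3, 4]))                  # 10
print(length([7, 7, 7]))                    # 3
print(mapped(lambda x: x * x, [1, 2, 3]))   # [1, 4, 9]
```

Since fold always recurses on a strictly shorter list, every program built from it terminates, which is exactly the bias toward low computational complexity mentioned above.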
Even considering only this reduced space of possible transformations, in many cases there
are still too many possible programs "nearby" some target to effectively consider all of them.
For example many probabilistic model-building algorithms, such as learning the structure of a
Bayesian network from data, can require time cubic in the number of variables (in this context
each independent non-neutral transformation can correspond to a variable). Especially as the
size of the programs we wish to learn grows, and as the number of typologically matching
functions increases, there will be simply too many variables to consider each one intensively, let
alone apply a quadratic-time algorithm.
To alleviate this scaling difficulty, we propose three techniques.
The first is to examine each potential variable (i.e. independent non-neutral transformation)
heuristically, to determine its usefulness in expressing constructive semantic variation. For
example, a Boolean transformation that collapses the overall expression into a tautology is
assumed to be useless.8
The second is heuristic coupling rules that allow us to calculate, for a pair of transformations,
the expected utility of applying them in conjunction.
Finally, while fold is powerful, it may need to be augmented by other methods in order to
provide tractable representation of complex programs that would normally be written using nu-
merous variables with diverse scopes. One approach that we have explored involves application
of ideas about director strings as combinators due to Sinot [?]. In Sinot's approach, special program
8 This is heuristic because such a transformation might be useful together with other transformations.
tree nodes are labeled with director strings, and special algebraic operators interrelate these
strings. One then achieves the representational efficiency of local variables with diverse scopes,
without needing to do any actual variable management. Reductions and other (non-)neutral
transformation rules related to broadening and reducing variable scope may then be defined
using the director string algebra.
21.8 Interfacing Between Procedural and Declarative Knowledge
Finally, another critical aspect of procedural knowledge is its interfacing with declarative knowl-
edge. We now discuss the referencing of declarative knowledge within procedures, and the ref-
erencing of the details of procedural knowledge within CogPrime's declarative knowledge store.
21.8.1 Programs Manipulating Atoms
Above we have used Combo syntax implicitly, referring to Appendix ?? for the formal defi-
nitions. Now we introduce one additional, critical element of Combo syntax: the capability to
explicitly reference declarative knowledge within procedures.
For this purpose Combo must contain the following types:
Atom, Node, Link, TruthValue, AtomType, AtomTable
Atom is the union of Node and Link.
So a type Node within a Combo program refers to a Node in CogPrime's AtomTable. The
mechanisms used to evaluate these entities during program evaluation are discussed in Chapter
25.
For example, suppose one wishes to write a Combo program that creates Atoms embodying
the predicate-argument relationship eats(cat, fish), represented
Evaluation eats (cat, fish)
aka
Evaluation
eats
List
cat
fish
To do this, one could say for instance,
new-link (EvaluationLink
new-node(PredicateNode "eats")
new-link (ListLink
new-node(ConceptNode "cat")
new-node(ConceptNode "fish"))
(new-stv .99 .99))
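The Combo snippet above can be mirrored in a short Python sketch. The Node, Link, TruthValue and AtomTable classes below are toy stand-ins written for illustration only, not the actual OpenCog API; they show how new-node/new-link style calls build a deduplicated hypergraph with an attached truth value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TruthValue:
    strength: float
    confidence: float

@dataclass(frozen=True)
class Node:
    type: str   # e.g. "ConceptNode", "PredicateNode"
    name: str

@dataclass(frozen=True)
class Link:
    type: str        # e.g. "EvaluationLink", "ListLink"
    outgoing: tuple  # ordered tuple of Atoms (Nodes or Links)

class AtomTable:
    """Toy store that deduplicates Atoms, echoing new-node/new-link in Combo."""
    def __init__(self):
        self.atoms = {}  # atom -> TruthValue

    def add(self, atom, tv=TruthValue(1.0, 0.0)):
        self.atoms.setdefault(atom, tv)
        return atom

# Build eats(cat, fish), analogous to the Combo example above
table = AtomTable()
cat  = table.add(Node("ConceptNode", "cat"))
fish = table.add(Node("ConceptNode", "fish"))
eats = table.add(Node("PredicateNode", "eats"))
args = table.add(Link("ListLink", (cat, fish)))
evaluation = table.add(Link("EvaluationLink", (eats, args)),
                       TruthValue(0.99, 0.99))  # like (new-stv .99 .99)
print(len(table.atoms))  # 5 distinct atoms stored
```

Because the Atoms are immutable and hashable, adding the same Node or Link twice simply returns the existing entry, which mimics the AtomTable's role as a canonical store.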
21.9 Declarative Representation of Procedures
Next, we consider the representation of program tree internals using declarative data structures.
This is important if we want OCP to inferentially understand what goes on inside programs.
In itself, it is more of a "bookkeeping" issue than a deep conceptual issue, however.
First, note that each of the entities that can live at an internal node of a program, can also
live in its own Atom. For example, a number in a program tree corresponds to a NumberNode;
an argument in a Combo program already corresponds to some Atom; and an operator in a
program can be wrapped up in a SchemaNode all its own, and considered as a one-leaf program
tree.
Thus, one can build a kind of virtual, distributed program tree by linking a number of
ProcedureNodes (i.e. PredicateNodes or SchemaNodes) together. All one needs in order to
achieve this is an analogue of the @ symbol (as defined in Section 20.3 of Chapter 20) for
relating ProcedureNodes. This is provided by the ExecutionLink type, where
(ExecutionLink f g)
essentially means the same as
f g
in curried notation or
@
/ \
f g
The same generalized evaluation rules used inside program trees may be thought of in terms
of ExecutionLinks; formally, they are crisp ExtensionalImplicationLinks among ExecutionLinks.
Note that we are here using ExecutionLink as a curried function; that is, we are looking at
(ExecutionLink f g) as a function that takes an argument x, where the truth value of
(ExecutionLink f g) x
represents the probability that executing f, on input g, will give output x.
One may then construct combinator expressions linking multiple ExecutionLinks together;
these are the analogues of program trees.
For example, using ExecutionLinks, one equivalent of y = x + x^2 is:
Hypothetical
   SequentialAND
      ExecutionLink
         pow
         List v1 2
         v2
      ExecutionLink
         +
         List v1 v2
         v3
Here the v1, v2, v3 are variables which may be internally represented via combinators. This
AND is sequential in case the evaluation order inside the program interpreter makes a difference.
As a practical matter, it seems there is no purpose to explicitly storing program trees in
conjunction-of-ExecutionLinks form. The information in the ExecutionLink conjunct is already
there in the program tree. However, the PLN reasoning system, when reasoning on program
trees, may carry out this kind of expansion internally as part of its analytical process.
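To make the tree-to-ExecutionLink correspondence concrete, here is a small Python sketch (an illustration, not CogPrime's actual machinery) that flattens a nested program tree such as y = x + x^2 into a list of (operator, inputs, output-variable) triples, with fresh output variables playing the role of the v1, v2, v3 above:

```python
import itertools

def flatten(expr, counter=None, links=None):
    """Flatten a nested (op, arg, ...) program tree into ExecutionLink-style
    triples (op, inputs, output_var) - the declarative analogue of the tree."""
    if counter is None:
        counter = itertools.count(1)
        links = []
    if not isinstance(expr, tuple):  # a leaf: a variable or constant
        return expr, links
    op, *args = expr
    # recursively flatten subtrees, collecting their output variables
    inputs = [flatten(a, counter, links)[0] for a in args]
    out = f"v{next(counter)}"
    links.append((op, tuple(inputs), out))  # ExecutionLink op (List inputs) out
    return out, links

# y = x + x^2, with x named "v0"
result, links = flatten(("+", "v0", ("pow", "v0", 2)))
for op, ins, out in links:
    print(op, ins, "->", out)
# pow ('v0', 2) -> v1
# + ('v0', 'v1') -> v2
```

As the chapter notes, there is usually no point materializing this expansion explicitly; it is the kind of transformation PLN could carry out internally when reasoning about program internals.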
Section II
The Cognitive Cycle
Chapter 22
Emotion, Motivation, Attention and Control
Co-authored with Zhenhua Cai
22.1 Introduction
This chapter begins the heart of the book: the part that explains how the CogPrime design
aims to implement roughly human-like general intelligence, at the human level and ultimately
beyond. First, here in Section II we explain how CogPrime can be used to implement a simplistic
animal-like agent without much learning: an agent that perceives, acts and remembers, and
chooses actions that it thinks will achieve its goals; but doesn't do any sophisticated learning or
reasoning or pattern recognition to help it better perceive, act, remember or figure out how to
achieve its goals. We're not claiming CogPrime is the best way to implement such an animal-like
agent, though we suggest it's not a bad way and depending on the complexity and nature of
the desired behaviors, it could be the best way. We have simply chosen to split off the parts of
CogPrime needed for animal-like behavior and present them first, prior to presenting the various
"knowledge creation" (learning, reasoning and pattern recognition) methods that constitute the
more innovative and interesting part of the design.
In Stan Franklin's terms, what we explain here in Section II is how a basic cognitive cycle
may be achieved within CogPrime. In that sense, the portion of CogPrime explained in this
Section is somewhat similar to the parts of Stan's LIDA architecture that have currently been
worked out in detail. However, while LIDA has not yet been extended in detail (in
theory or implementation) to handle advanced learning, cognition and language, those aspects
of CogPrime have been developed and in fact constitute the largest portion of this book.
Looking back to the integrative diagram from Chapter 5, the cognitive cycle is mainly about
integrating vaguely LIDA-like structures and mechanisms with heavily Psi-like structures and
mechanisms - but doing so in a way that naturally links in with perception and action mecha-
nisms "below," and more abstract and advanced learning mechanisms "above."
In terms of the general theory of general intelligence, the basic CogPrime cognitive cycle can
be seen to have a foundational importance in biasing the CogPrime system toward the problem
of controlling an agent in an environment requiring a variety of real-time and near-real-time
responses based on a variety of kinds of knowledge. Due to its basis in human and animal
cognition, the CogPrime cognitive cycle likely incorporates many useful biases in ways that
are not immediately obvious, but that would become apparent if comparing intelligent agents
controlled by such a cycle versus intelligent agents controlled via other means.
The cognitive cycle also provides a framework in which other cognitive processes, relating
to various aspects of the goals and environments relevant to human-level general intelligence,
may conveniently dynamically interoperate. The "Mind OS" aspect of the CogPrime archi-
tecture provides general mechanisms in which various cognitive processes may interoperate on
a common knowledge store; the cognitive cycle goes further and provides a specific dynamical
pattern in which multiple cognitive processes may intersect. Its effective operation places strong
demands on the cognitive synergy between the various cognitive processes involved, but also
provides a framework that encourages this cognitive synergy to develop and persist.
Finally, it should be stressed that the cognitive cycle is not all-powerful nor wholly pervasive
in CogPrime's dynamics. It's critical for the real-time interaction of a CogPrime-controlled agent
with a virtual or physical world; but there may be many processes within CogPrime that most
naturally operate outside such a cycle. For instance, humans will habitually do deep intellectual
thinking (even something so abstract as mathematical theorem proving) within a cognitive cycle
somewhat similar to the one they use for practical interaction with the external world. But,
there's no reason that CogPrime systems need to be constrained in this way. Deviating from a
cognitive cycle based dynamic may cause a CogPrime system to deviate further from human-
likeness in its intelligence, but may also help it to perform better than humans on some tasks,
e.g. tasks like scientific data analysis or mathematical theorem proving that benefit from styles
of information processing that humans aren't particularly good at.
22.2 A Quick Look at Action Selection
We will begin our exposition of CogPrime's cognitive cycle with a quick look at action selection.
As Stan Franklin likes to point out, the essence of an intelligent agent is that it does things; it
takes actions. The particular mechanisms of action selection in CogPrime are a bit involved and
will be given in Chapter 24; in this chapter we will give the basic idea of the action selection
mechanism and then explain how a variant of the Psi model (described in Chapter 4 of Part 1
above) is used to handle motivation (emotions, drives, goals, etc.) in CogPrime, including the
guidance of action selection.
The crux of CogPrime's action selection mechanism is as follows
• the action selector chooses procedures that seem likely to help achieve important goals in
the current context
- Example: If the goal is to create a block structure that will surprise Bob, and there is
plenty of time, one procedure worth choosing might be a memory search procedure for
remembering situations involving Bob and physical structures. Alternately, if there isn't
much time, one procedure worth choosing might be a procedure for building the base
of a large structure - as this will give something to use as part of whatever structure
is eventually created. Another procedure worth choosing might be one that greedily
assembles structures from blocks without any particular design in mind.
• to support the action selector, the system builds implications of the form Context & Procedure
→ Goal, where Context is a predicate evaluated based on the agent's situation
— Example: If Bob has asked the agent to do something, and it knows that Bob is very
insistent on being obeyed, then implications such as
• "Bob instructed to do X" and "do X" → "please Bob" < .9, .9 >
will be utilized
— Example: If the agent wants to make a tower taller, then implications such as
• "T is a blocks structure" and "place block atop T" → "make T taller" < .9, .9 >
will be utilized
• the truth values of these implications are evaluated based on experience and inference
- Example: The above implication involving Bob could be evaluated based on experience,
by assessing it against remembered episodes involving Bob giving instructions
— Example: The same implication could be evaluated based on inference, using analogy
to experiences with instructions from other individuals similar to Bob; or using things
Bob has explicitly said, combined with knowledge that Bob's self-descriptions tend to
be reasonably accurate
• Importance values are propagated between goals using economic attention allocation (and,
inference is used to learn subgoals from existing goals)
- Example: If Bob has told the agent to do X, and the agent has then derived (from the
goal of pleasing Bob) the goal of doing X, then the "please Bob" goal will direct some
of its currency to the "do X" goal (which the latter goal can then pass to its subgoals,
or spend on executing procedures)
These various processes are carried out in a manner orchestrated by Dörner's Psi model as
refined by Joscha Bach (as reviewed in Chapter 4 above), which supplies (among other features)
• a specific theory regarding what "demands" should be used to spawn the top-level goals
• a set of (four) interrelated system parameters governing overall system state in a useful
manner reminiscent of human and animal psychology
• a systematic theory of how various emotions (wholly or partially) emerge from more fun-
damental underlying phenomena
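The bullets above can be condensed into a toy sketch. The data structures below (implication tuples carrying strength and confidence, a goal-STI table, a score that weights each implication's evidence by the urgency of its goal) are illustrative assumptions, not the actual CogPrime mechanism, which is given in Chapter 24:

```python
# Each implication is (context, procedure, goal, strength, confidence),
# a toy stand-in for "Context & Procedure -> Goal" links.
implications = [
    ("bob_instructed_X", "do_X", "please_bob", 0.9, 0.9),
    ("bob_instructed_X", "ignore_X", "please_bob", 0.1, 0.8),
    ("tower_present", "place_block_on_top", "make_tower_taller", 0.9, 0.9),
]

# ShortTermImportance of each goal (hypothetical values)
goal_sti = {"please_bob": 0.8, "make_tower_taller": 0.3}

def select_procedure(active_contexts):
    """Choose the procedure with the best STI-weighted evidence of
    achieving important goals in the current context."""
    scores = {}
    for ctx, proc, goal, s, c in implications:
        if ctx in active_contexts:
            # expected utility ~ strength * confidence, weighted by goal urgency
            scores[proc] = scores.get(proc, 0.0) + s * c * goal_sti[goal]
    return max(scores, key=scores.get) if scores else None

print(select_procedure({"bob_instructed_X", "tower_present"}))  # do_X
```

With both contexts active, "do_X" wins (0.9 × 0.9 × 0.8 = 0.648) over building on the tower (0.243), reflecting how goal importance steers selection among applicable procedures.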
22.3 Psi in CogPrime
The basic concepts of the Psi approach to motivation, as reviewed in Chapter 4 of Part 1 above,
are incorporated in CogPrime as follows (note that the following list includes many concepts
that will be elaborated in more detail in later chapters):
• Demands are GroundedPredicateNodes (GPNs), i.e. Nodes that have their truth value com-
puted at each time by some internal C++ code or some Combo procedure in the Procedur-
eRepository
- Examples: Alertness, perceived novelty, internal novelty, reward from teachers, social
stimulus
- Humans and other animals have familiar demands such as hunger, thirst and excretion;
to create an AGI closely emulating a human or (say) a dog one may wish to simulate
these in one's AGI system as well
• Urges are also GPNs, with their truth values defined in terms of the truth values of the Nodes
for corresponding Demands. However in CogPrime we have chosen the term "Ubergoal"
instead of Urge, as this is more evocative of the role that these entities play in the system's
dynamics (they are the top-level goals).
• Each system comes with a fixed set of Ubergoals (and only very advanced CogPrime systems
will be able to modify their Ubergoals)
— Example: Stay alert and alive now and in the future; experience and learn new things
now and in the future; get reward from the teachers now and in the future; enjoy rich
social interactions with other minds now and in the future
- A more advanced CogPrime system could have abstract (but experientially grounded)
ethical principles among its Ubergoals, e.g. an Ubergoal to promote joy, an Ubergoal
to promote growth and an Ubergoal to promote choice, in accordance with the ethics
described in [Goe10]
• The ShortTermImportance of an Ubergoal indicates the urgency of the goal, so if the De-
mand corresponding to an Ubergoal is within its target range, then the Ubergoal will have
zero STI. But all Ubergoals can be given maximal LTI to guarantee they don't get deleted.
- Example: If the system is in an environment continually providing an adequate level of
novelty (according to its Ubergoal), then the Ubergoal corresponding to external novelty
will have low STI but high LTI. The system won't expend resources seeking novelty. But
then, if the environment becomes more monotonous, the urgency of the external novelty
goal will increase, and its STI will increase correspondingly, and resources will begin
getting allocated toward improving the novelty of the stimuli received by the agent.
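The novelty example can be captured in a tiny sketch. The specific formula below (STI proportional to how far the Demand strays outside its target range) is a hypothetical choice; the text specifies only the qualitative behaviour of zero urgency inside the range and rising urgency outside it:

```python
def ubergoal_sti(demand_level, target_low, target_high, gain=1.0):
    """Sketch: an Ubergoal's ShortTermImportance is zero while its Demand
    sits inside the target range, and grows with the deviation outside it.
    (Hypothetical formula; 'gain' is an assumed tuning parameter.)"""
    if target_low <= demand_level <= target_high:
        return 0.0
    deviation = (target_low - demand_level if demand_level < target_low
                 else demand_level - target_high)
    return gain * deviation

# Novelty demand satisfied -> no urgency; environment turns monotonous -> urgency rises
print(ubergoal_sti(0.6, 0.5, 0.8))  # 0.0
print(ubergoal_sti(0.2, 0.5, 0.8))  # 0.3
```

LTI would be held at its maximum separately, so the Ubergoal persists even while its STI, and hence its claim on resources, is zero.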
• Pleasure is a GPN, and its internal truth value computing program compares the satisfaction
of Ubergoals to their expected satisfaction
- Of course, there are various mathematical functions (e.g. p'th power averages¹ for different
p) that one can use to average the satisfaction of multiple Ubergoals; and choices
here, i.e. different specific ways of calculating Pleasure, could lead to systems with
different "personalities"
• Goals are Nodes or Links that are on the system's list of goals (the GoalPool). Ubergoals
are automatically Goals, but there will also be many other Goals also
- Example: The Ubergoal of getting reward from teachers might spawn subgoals like
"getting reward from Bob" (if Bob is a teacher), or "making teachers smile" or "create
surprising new structures" (if the latter often garners teacher reward). The subgoal of
"create surprising new structures" might, in the context of a new person entering the
agent's environment with a bag of toys, lead to the creation of a subgoal of asking for
a new toy of the sort that could be used to help create new structures. Etc.
• Psi's memory is CogPrime's AtomTable, with associated structures like the Procedur-
eRepository (explained in Chapter 19), the SpaceServer and TimeServer (explained in
Chapter 26), etc.
— Examples: The knowledge of what blocks look like and the knowledge that tall struc-
tures often fall down, go in the AtomTable; specific procedures for picking up blocks of
different shapes go in the ProcedureRepository; the layout of a room or a pile of blocks
at a specific point in time go in the SpaceServer; the series of events involved in the
building-up of a tower are temporally indexed in the TimeServer.
¹ The p'th power average is defined as (Σ x^p)^(1/p).
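The footnoted p'th power average is easy to sketch; the version below uses the generalized-mean convention (dividing by n before taking the p'th root), which is one common normalization of the footnote's formula, and the satisfaction values are hypothetical:

```python
def power_average(xs, p):
    """p'th power average (generalized mean): (mean of x^p)^(1/p)."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)

# Hypothetical satisfaction levels of three Ubergoals
satisfactions = [0.9, 0.5, 0.1]

# Larger p weights the best-satisfied goals more heavily; different choices
# of p give different ways of computing Pleasure, hence different "personalities".
print(round(power_average(satisfactions, 1), 3))  # 0.5  (arithmetic mean)
print(round(power_average(satisfactions, 4), 3))  # 0.7  (dominated by the 0.9 goal)
```

A "contentment-prone" system might use a large p (one well-satisfied Ubergoal suffices for Pleasure), while a small or negative p yields a system whose Pleasure tracks its worst-satisfied Ubergoal.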
- In Psi and MicroPsi, these same phenomena are stored in memory in a rather different
way, yet the basic Psi motivational dynamics are independent of these representational
choices
• Psi's "motive selection" process is carried out in CogPrime by economic attention allocation,
which allocates ShortTermImportance to Goal nodes
- Example: The flow of importance from "Get reward from teachers" to "get reward from
Bob" to "make an interesting structure with blocks" is an instance of what Psi calls
"motive selection". No action is being taken yet, but choices are being made regarding
what specific goals are going to be used to guide action selection.
• Psi's action selection plays the same role as CogPrime's action selection, with the clari-
fication that in CogPrime this is a matter of selecting which procedures (i.e. schema) to
run, rather than which individual actions to execute. However, this notion exists in Psi as
well, which accounts for "automatized behaviors" that are similar to CogPrime schemata;
the only (minor) difference here is that in CogPrime automatized behaviors are the default
case.
— Example: If the goal "make an interesting structure with blocks" has a high STI, then it
may be used to motivate choice of a procedure to execute, e.g. a procedure that finds an
interesting picture or object seen before and approximates it with blocks, or a procedure
that randomly constructs something and then filters it based on interestingness. Once a
blocks-structure-building procedure is chosen, this procedure may invoke the execution
of sub-procedures such as those involved with picking up and positioning particular
blocks.
• Psi's planning is carried out via various learning processes in CogPrime, including PLN plus
procedure learning methods like MOSES or hillclimbing
— Example: If the agent has decided to build a blocks structure emulating a pyramid
(which it saw in a picture), and it knows how to manipulate and position individual
blocks, then it must figure out a procedure for carrying out individual-block actions that
will result in production of the pyramid. In this case, a very inexperienced agent might
use MOSES or hillclimbing and "guidedly-randomly" fiddle with different construction
procedures until it hit on something workable. A slightly more experienced agent would
use reasoning based on prior structures it had built, to figure out a rational plan (like:
"start with the base, then iteratively pile on layers, each one slightly smaller than the
previous.")
• The modulators are system parameters which may be represented by PredicateNodes, and
which must be incorporated appropriately in the dynamics of various MindAgents, e.g.
- activation affects action selection. For instance this may be effected by a process that,
each cycle, causes a certain amount of STICurrency to pass to schema satisfying certain
properties (those involving physical action, or terminating rapidly). The amount of
currency passed in this way would be proportional to the activation
- resolution level affects perception schema and MindAgents, causing them to expend less
effort in processing perceptual data
- certainty affects inference and pattern mining and concept creation processes, causing
them to place less emphasis on certainty in guiding their activities, i.e. to be more
accepting of uncertain conclusions. To give a single illustrative example: when backward
chaining inference is being used to find values for variables, a "fitness target" of the
form strength × confidence is sometimes used; this may be replaced with strength^p ×
confidence^(2-p), where the certainty parameter affects the exponent p, so that when p tends to 0
confidence is more important, when p tends to 2 strength is more important, and when
p tends to 1 strength and confidence are equally important.
- selection threshold may be used to effect a process that, each cycle, causes a certain
amount of STICurrency (proportional to the selection threshold) to pass to the Goal
Atoms that were wealthiest at the previous cycle.
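The modulated fitness target from the certainty example is a one-liner; the truth-value numbers below are hypothetical, and the function is a direct transcription of the formula in the text rather than actual CogPrime code:

```python
def fitness(strength, confidence, p):
    """Modulated inference fitness target: strength**p * confidence**(2 - p).
    p -> 0 weights confidence, p -> 2 weights strength, p = 1 weights both equally."""
    return strength ** p * confidence ** (2 - p)

s, c = 0.9, 0.4  # a hypothetical high-strength, low-confidence conclusion
print(round(fitness(s, c, 0.0), 4))  # 0.16  (confidence dominates: 0.4^2)
print(round(fitness(s, c, 2.0), 4))  # 0.81  (strength dominates: 0.9^2)
print(round(fitness(s, c, 1.0), 4))  # 0.36  (balanced: 0.9 * 0.4)
```

A low-certainty system state would push p toward 2, letting strong-but-uncertain conclusions rank highly, which is exactly the "more accepting of uncertain conclusions" behaviour the bullet describes.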
Based on this run-down, Psi and CogPrime may seem very similar, but that's because we have
focused here only on the motivation and emotion aspect. Psi uses a very different knowledge
representation than CogPrime; and in the Psi architecture diagram, nearly all of CogPrime is
pushed into the role of "background processes that operate in the memory box." According to the
theoretical framework underlying CogPrime, the multiple synergetic processes operating in the
memory box are actually the crux of general intelligence. But getting the motivation/emotion
framework right is also very important, and Psi seems to do an admirable job of that.
22.4 Implementing Emotion Rules atop Psi's Emotional Dynamics
Human motivations are largely determined by human emotions, which are the result of human-
ity's evolutionary heritage and embodiment, which are quite different than the heritage and
embodiment of current AI systems. So, if we want to create AGI systems that lack humanlike
bodies, and didn't evolve to adapt to the same environments as humans did, yet still have
vaguely human-like emotional and motivational structures, the latter will need to be explicitly
engineered or taught in some way.
For instance, if one wants to make a CogPrime agent display anger, something beyond Psi's
model of emotion needs to be coded into the agent to enable this. After all, the rule that when
angry the agent has some propensity to harm other beings, is not implicit in Psi and needs to be
programmed in. However, making use of Psi's emotion model, anger could be characterized as
an emotion consisting of high arousal, low resolution, strong motive dominance, few background
checks, strong goal-orientedness (as the Psi model suggests) and a propensity to cause harm to
agents or objects. This is much simpler than specifying a large set of detailed rules characterizing
angry behavior.
The "anger" example brings up the point that desirability of giving AGI systems closely
humanlike emotional and motivational systems is questionable. After all we humans cause
ourselves a lot of problems with these aspects of our mind/brains, and we sometimes put
our more ethical and intellectual sides at war with our emotional and motivational systems.
Looking into the future, an AGI with greater power than humans yet a humanlike motivational
and emotional system, could be a very dangerous thing.
On the other hand, if an AGI's motivational and emotional system is too different from
human nature, we might have trouble understanding it, and it understanding us. This problem
shouldn't be overblown - it seems possible that an AGI with a more orderly and rational
motivational system than the human one might be able to understand us intellectually very,
well, and that we might be able to understand it well using our analytical tools. However,
if we want to have mutual empathy with an AGI system, then its motivational and emotional
framework had better have at least some reasonable overlap with our own. The value of empathy
for ethical behavior was stressed extensively in Chapter 12 of Part 1.
This is an area where experimentation is going to be key. Our initial plan is to supply
CogPrime with rough emulations of some but not all human emotions. We see no need to take
explicit pains to simulate emotions like anger, jealousy and hatred. On the other hand, joy,
curiosity, sadness, wonder, fear and a variety of other human emotions seem both natural in
the context of a robotically or virtually embodied CogPrime system, and valuable in terms of
allowing mutual human/CogPrime empathy.
22.4.1 Grounding the Logical Structure of Emotions in the Psi Model
To make this point in a systematic way, we point out that Ortony et al's [OCC90] "cognitive
theory of emotions" can be grounded in CogPrime's version of Psi in a natural way. This theory
captures a wide variety of human and animal emotions in a systematic logical framework, so that
grounding their framework in CogPrime Psi goes a long way toward explaining how CogPrime
Psi accounts for a broad spectrum of human emotions.
The essential idea of the cognitive theory of emotions can be seen in Figure 22.1. What we
see there is that common emotions can be defined in terms of a series of choices:
• Is it positive or negative?
• Is it a response to an agent, an event or an object?
• Is it focused on consequences for oneself, or for another?
- If on another, is it good or bad for the other?
- If on oneself, is it related to some event whose outcome is uncertain?
• If it's related to an uncertain outcome, did the expectation regarding the outcome
get fulfilled or not?
Figure 22.1 shows how each set of answers to these questions leads to a different emotion. For
instance: what is a negative emotion, responding to events, focusing on another, and undesirable
to the other? Pity.
In the list of questions, we see that two of them - positive vs. negative, and expectation ful-
fillment vs. otherwise - are foundational in the Psi model. The other questions are evaluations
that an intelligent agent would naturally make, but aren't bound up with Psi's emotion/moti-
vation infrastructure in such a deep way. Thus, the cognitive theory of emotion emerges as a
combination of some basic Psi factors with some more abstract cognitive properties (good vs.
bad for another; agents vs. events vs. objects).
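The question series amounts to a decision tree, which can be sketched as a lookup table. Only the "pity" path is spelled out in the text; the other two entries are plausible OCC-style assignments included here as assumptions for illustration:

```python
# Toy lookup sketch of the OCC-style decision tree: a tuple of answers to the
# chapter's questions selects an emotion. Keys are
# (valence, response_to, focus, outcome_for_other); None means "not applicable".
occ = {
    ("negative", "event", "other", "undesirable"): "pity",       # from the text
    ("positive", "event", "other", "desirable"):   "happy-for",  # assumed entry
    ("negative", "agent", "other", None):          "reproach",   # assumed entry
}

def classify(valence, response_to, focus, outcome_for_other=None):
    """Map a path through the question tree to an emotion label."""
    return occ.get((valence, response_to, focus, outcome_for_other), "unknown")

print(classify("negative", "event", "other", "undesirable"))  # pity
```

The point of grounding this in Psi is that the first coordinate (valence) and the fulfilled-expectation question come directly from Psi's dynamics, while the remaining coordinates are ordinary cognitive judgments the agent makes anyway.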
22.5 Goals and Contexts
Now we dig deeper into the details of motivation in CogPrime. Just as we have both explicit
(local) and implicit (global) memory in CogPrime, we also have both explicit and implicit
goals. An explicit goal is formulated as a Goal Atom, and then MindAgents specifically orient
the system's activity toward achievement of that goal. An implicit goal is something that the
Fig. 22.1: Ontology of Emotions from [OCC90]
system works toward, but in a more loosely organized way, and without necessarily explicitly
representing the knowledge that it is working toward that goal.
Here we will focus mainly on explicit motivation, beginning with a description of Goal Atoms,
and the Contexts in which Goals are worked toward via executing Procedures. Figure 22.2 gives
a rough depiction of the relationship between goals, procedures and context, in a simple example
relevant to an OpenCogPrime-controlled virtual agent in a game world.
22.5.1 Goal Atoms
A Goal Atom represents a target system state and is true to the extent that the system satisfies
the conditions it represents. A Context Atom represents an observed state of the world/mind,
and is true to the extent that the state it defines is observed. Taken together, these two Atom
types provide the infrastructure CogPrime needs to orient its actions in specific contexts toward
specific goals. Not all of CogPrime's activity is guided by these Atoms; much of it is non-goal-
directed and spontaneous, or ambient as we sometimes call it. But it is important that some
Fig. 22.2: Context, Procedures and Goals. Examples of the basic "goal/context/procedure"
triad in a simple game-agent situation.
of the system's activity - and in some cases, a substantial portion - is controlled explicitly via
goals.
Specifically, a Goal Atom is simply an Atom (usually a PredicateNode, sometimes a Link, and
potentially another type of Atom) that has been selected by the GoalRefinement MindAgent as
one that represents a state of the atom space which the system finds important to achieve. The
extent to which an Atom is considered a Goal Atom at a particular point in time is determined
by how much of a certain kind of financial instrument called an RFS (Request For Service) it
possesses (as will be explained in Chapter 24).
A CogPrime instance must begin with some initial Ubergoals (aka top-level supergoals), but
may then refine these goals in various ways using inference. Immature, "childlike" CogPrime
systems cannot modify their Ubergoals nor add nor delete Ubergoals. Advanced CogPrime
systems may be allowed to modify, add or delete Ubergoals, but this is a critical and subtle
aspect of system dynamics that must be treated with great care.
22.6 Context Atoms
Next, a Context is simply an Atom that is used as the source of a ContextLink, for instance
Context
quantum_computing
Inheritance Ben amateur
or
Context
game_of_fetch
PredictiveAttraction
Evaluation give (ball, teacher)
Satisfaction
The former simply says that Ben is an amateur in the context of quantum computing. The latter
says that in the context of the game of fetch, giving the ball to the teacher implies satisfaction.
A more complex instance pertinent to our running example would be
Context
Evaluation
Recently
List
Minute
Evaluation
Ask
List
Bob
ThereExists $X
And
Evaluation
Build
List
self
$X
Evaluation
surprise
List
$X
Bob
AverageQuantifier $Y
PredictiveAttraction
And
Evaluation
Build
List
self
$Y
Evaluation
surprise
List
$Y
Jim
Satisfaction
which says that, if the context is that Bob has recently asked for something surprising to be
built, then one strategy for getting satisfaction is to build something that seems likely to
surprise Jim.
An implementation-level note: in the current OpenCogPrime implementation of CogPrime,
ContextLinks are implicit rather than explicit entities. An Atom can contain a ComplexTruth-
Value which in turn contains a number of VersionHandles. Each VersionHandle associates a
Context or a Hypothetical with a TruthValue. This accomplishes the same thing as a formal
ContextLink, but without the creation of a ContextLink object. However, we continue to use
ContextLinks in this book and other documents about CogPrime; and it's quite possible that
future CogPrime implementations might handle them differently.
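The implementation note can be sketched concretely. The class below is a toy model of the ComplexTruthValue/VersionHandle idea, with truth values simplified to single strength floats (an assumption; real TruthValues carry strength and confidence), and it is not the actual OpenCog implementation:

```python
class ContextualAtom:
    """Sketch of the ComplexTruthValue idea: one Atom holding a default
    truth value plus per-context versions, instead of explicit ContextLinks."""
    def __init__(self, name, default_tv):
        self.name = name
        self.default_tv = default_tv
        self.versions = {}  # context name -> truth value (a VersionHandle analogue)

    def set_in_context(self, context, tv):
        self.versions[context] = tv

    def tv(self, context=None):
        # fall back to the context-free truth value when no version exists
        return self.versions.get(context, self.default_tv)

# "Ben is an amateur" holds strongly only in the quantum_computing context
inh = ContextualAtom("Inheritance Ben amateur", default_tv=0.2)
inh.set_in_context("quantum_computing", 0.95)
print(inh.tv())                     # 0.2
print(inh.tv("quantum_computing"))  # 0.95
```

Queries phrased with ContextLinks and queries against per-context versions retrieve the same information; the difference is purely one of storage, which is why the book can keep using ContextLink notation regardless of implementation.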
22.7 Ubergoal Dynamics
In the early phases of a CogPrime system's cognitive development, the goal system dynamics
will be quite simple. The Ubergoals are supplied by human programmers, and the system's
adaptive cognition is used to derive subgoals. Attentional currency allocated to the Ubergoals
is then passed along to the subgoals, as judged appropriate.
As the system becomes more advanced, however, more interesting phenomena may arise
regarding Ubergoals: implicit and explicit Ubergoal creation.
22.7.1 Implicit Ubergoal Pool Modification
First of all, implicit Ubergoal creation or destruction may occur. Implicit Ubergoal destruction
may occur when there are multiple Ubergoals in the system, and some prove easier to achieve
than others. The system may then decide not to bother achieving the more difficult Ubergoals.
Appropriate parameter settings may mitigate this phenomenon, of course.
Implicit Ubergoal creation may occur if some Goal Node G arises that inherits as a subgoal
from multiple Ubergoals. This Goal G may then come to act implicitly as an Ubergoal, in that
it may get more attentional currency than any of the Ubergoals.
Also, implicit Ubergoal creation may occur via forgetting. Suppose that G becomes a goal via
inferred inheritance from one or more Ubergoals. Then, suppose G forgets why this inheritance
exists, and that in fact the reason becomes obsolete, but the system doesn't realize that and
keeps the inheritance there. Then, G is an implicit Ubergoal in a strong sense: it gobbles up
a lot of attentional currency, potentially more than any of the actual Ubergoals, but actually
doesn't help achieve the Ubergoals, even though the system thinks it does. This kind of dynamic
is obviously very bad and should be avoided - and can be avoided with appropriate tuning of
system parameters (so that the system pays a lot of attention to making sure that its subgoaling-
related inferences are correct and are updated in a timely way).
78 22 Emotion, Motivation, Attention and Control
22.7.2 Explicit Ubergoal Pool Modification
An advanced CogPrime system may be given the ability to explicitly modify its Ubergoal pool.
This is a very interesting but very subtle type of dynamic, which is not currently well understood
and which potentially could lead to dramatically unpredictable behaviors.
However, modification, creation and deletion of goals is a key aspect of human psychology,
and the granting of this capability to mature CogPrime systems must be seriously considered.
In the case that Ubergoal pool modification is allowed, one useful heuristic may be to make
implicit Ubergoals into explicit Ubergoals. For instance: if an Atom is found to consistently
receive a lot of RFSs, and has a long time-scale associated with it, then the system should
consider making it an Ubergoal. But this heuristic is certainly not sufficient, and any advanced
CogPrime system that is going to modify its own Ubergoals should definitely be tuned to put
a lot of thought into the process!
The science of Ubergoal pool dynamics basically does not exist at the moment, and one
would like to have some nice mathematical models of the process prior to experimenting with it
in any capable, intelligent CogPrime system. Although Schmidhuber's Gödel machine [Sch06]
has the theoretical capability to modify its ubergoal (note that CogPrime is, in some ways, a
Gödel machine), there is currently no mathematics allowing us to assess the time and space
complexity of such a process in a realistic context, given a certain safety confidence target.
22.8 Goal Formation
Goal formation in CogPrime is done via PLN inference. In general, what PLN does for goal
formation is to look for predicates that can be proved to probabilistically imply the existing
goals. These new predicates will then tend to receive RFS currency, according to the logic
of RFS's to be outlined in Chapter 24, which (according to goal-driven attention allocation
dynamics) will make the system more likely to enact procedures that lead to their satisfaction.
As an example of the goal formation process, consider the case where ExternalNovelty is an
Ubergoal. The agent may then learn that whenever Bob gives it a picture to look at, its quest
for external novelty is satisfied to a significant degree. That is, it learns
Attraction
    Evaluation give (Bob, me, picture)
    ExternalNovelty
where Attraction A B measures how much A, versus its absence, implies B (as explained in Chapter 34).
This information allows the agent (the Goal Formation MindAgent) to nominate the atom:
EvaluationLink give (Bob, me, picture)
as a goal (a subgoal of the original Ubergoal). This is an example of goal refinement, which is
one among many ways that PLN can create new goals from existing ones.
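The goal-refinement step just described can be sketched as a simple filter: predicates whose Attraction to an existing goal exceeds a threshold are nominated as subgoals. The function name, the threshold, and the attraction values below are invented for illustration; in CogPrime the attraction values would come from PLN inference.

```python
# Illustrative goal refinement: nominate as subgoals those predicates
# whose attraction to the goal exceeds a threshold. Attraction here
# plays the role of "how much P, versus its absence, implies G".

def nominate_subgoals(goal, attraction_to_goal, threshold=0.3):
    return [pred for pred, a in attraction_to_goal.items() if a >= threshold]

subgoals = nominate_subgoals(
    "ExternalNovelty",
    {"give(Bob, me, picture)": 0.6,   # strongly implies novelty
     "stare_at_wall()": 0.05},        # barely implies novelty
)
# -> ["give(Bob, me, picture)"]
```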
22.9 Goal Fulfillment and Predicate Schematization
When there is a Goal Atom G important in the system (with a lot of RFS), the GoalFulfillment
MindAgent seeks SchemaNodes S that it has reason to believe, if enacted, will cause G to
become true (satisfied). It then adds these to the ActiveSchemaPool, an object to be discussed
below. The dynamics by which the GoalFulfillment process works will be discussed in Chapter
24 below.
For example, if a Context Node C has a high truth value at that time (because it is currently
satisfied), and is involved in a relation:
Attraction
    C
    PredictiveAttraction S G
(for some SchemaNode S and Goal Node G) then this SchemaNode S is likely to be se-
lected by the GoalFulfillment process for execution. This is the fully formalized version of the
Context & Schema → Goal notion discussed frequently above. The process may also allow the
importance of various schemata S to bias its choices of which schemata to execute.
For instance, following up previous examples, we might have

Attraction
    Evaluation
        near
        List
            self
            Bob
    PredictiveAttraction
        Evaluation
            ask
            List
                Bob
                "Show me a picture"
        ExternalNovelty
Of course this is a very simplistic relationship but it's similar to a behavior a young child might
display. A more advanced agent would utilize a more abstract relationship that distinguishes
various situations in which Bob is nearby, and also involves expressing a concept rather than a
particular sentence.
The formation of these schema-context-goal triads may occur according to generic inference
mechanisms. However, a specially-focused PredicateSchematization MindAgent is very useful
here as a mechanism of inference control, increasing the number of such relations that will exist
in the system.
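The selection logic behind these context-schema-goal triads can be sketched as a simple scoring rule: weight each triad's attraction by the current truth value of its context and the importance of its goal. This is a simplification made for illustration; the actual GoalFulfillment dynamics are given in Chapter 24, and all names and numbers below are invented.

```python
# Sketch of GoalFulfillment-style selection over Context & Schema -> Goal
# triads: score = context_satisfaction * attraction * goal_importance.

def rank_schemata(triads, context_tv, goal_sti):
    """triads: list of (context, schema, goal, attraction)."""
    scored = []
    for ctx, schema, goal, attraction in triads:
        score = context_tv.get(ctx, 0.0) * attraction * goal_sti.get(goal, 0.0)
        scored.append((score, schema))
    scored.sort(reverse=True)  # best-scoring schema first
    return [s for _, s in scored]

ranking = rank_schemata(
    [("near(self, Bob)", "ask(Bob, 'Show me a picture')", "ExternalNovelty", 0.7),
     ("near(self, blocks)", "stack_blocks()", "ExternalNovelty", 0.4)],
    context_tv={"near(self, Bob)": 0.9, "near(self, blocks)": 0.2},
    goal_sti={"ExternalNovelty": 1.0},
)
# asking Bob scores 0.63; stacking blocks scores 0.08
```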
22.10 Context Formation
New contexts are formed by a combination of processes:
• The MapEncapsulation MindAgent, which creates Context Nodes embodying repeated patterns in the perceived world. This process encompasses:
- Maps creating Context Nodes involving Atoms that have high STI at the same time
• Example: A large number of Atoms related to towers could be joined into a single
map, which would then be a ConceptNode pointing to "tower-related ideas, proce-
dures and experiences"
- Maps creating Context Nodes that are involved in a temporal activation pattern that
recurs at multiple points in the system's experience.
• Example: There may be a common set of processes involving creating a building
out of blocks: first build the base, then the walls, then the roof. This could be
encapsulated as a temporal map embodying the overall nature of the process. In
this case, the map contains information of the nature: first do things related to this,
then do things related to this, then do things related to this...
• A set of concept creation MindAgents (see Chapter 38), which fuse and split Context Nodes
to create new ones:
- The concept of a building and the concept of a person can be merged to create the
concept of a BuildingMan
- The concept of a truck built with Legos can be subdivided into trucks you can actually
carry Lego blocks with, versus trucks that are "just for show" and can't really be loaded
with objects and then carry them around
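The first of these processes, grouping Atoms whose high-STI episodes tend to co-occur, can be illustrated with a toy co-occurrence counter. The function name, threshold, and data are assumptions for the example; the real MapEncapsulation process operates on richer temporal structure.

```python
# Toy MapEncapsulation: Atoms that repeatedly have high STI at the same
# time become candidates for grouping under a new ContextNode ("map").

from itertools import combinations

def find_maps(episodes, min_cooccur=2):
    """episodes: list of sets of Atoms that had high STI simultaneously.
    Returns Atom pairs that co-occurred at least min_cooccur times."""
    counts = {}
    for ep in episodes:
        for pair in combinations(sorted(ep), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, n in counts.items() if n >= min_cooccur}

maps = find_maps([
    {"tower", "blocks", "tall"},
    {"tower", "blocks", "Bob"},
    {"lunch", "Bob"},
])
# "blocks" and "tower" co-occur twice -> candidate tower-related map
```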
22.11 Execution Management
The GoalFulfillment MindAgent chooses schemata that are found likely to achieve current goals,
but it doesn't actually execute these schemata. What it does is to take these schemata and place
them in a container called the ActiveSchemaPool.
The ActiveSchemaPool contains a set of schemata that have been determined to be reasonably
likely, if enacted, to significantly help with achieving the current goal-set. I.e., everything
in the active schema pool should be a schema S such that it has been concluded that
Attraction
    C
    PredictiveAttraction S G
- where C is a currently applicable context and G is one of the goals in the current goal pool
- has a high truth value compared to what could be obtained from other known schemata, or
from other schemata that could reasonably be expected to be found via reasoning.
The decision of which schemata in the ActiveSchemaPool to enact is made by an object
called the ExecutionManager, which is invoked each time the SchemaActivation MindAgent is
executed. The ExecutionManager is used to select which schemata to execute, based on doing
reasoning and consulting memory regarding which active schemata can usefully be executed
simultaneously without causing destructive interference (and hopefully causing constructive
interference). This process will also sometimes (indirectly) cause new schemata to be created
and/or other schemata from the AtomTable to be made active. This process is described more
fully in Chapter 24 on action selection.
For instance, if the agent is involved in building a blocks structure intended to surprise or
please Bob, then it might simultaneously carry out some blocks-manipulation schema, and also
a schema involving looking at Bob to garner his approval. If it can do the blocks manipulation
without constantly looking at the blocks, this should be unproblematic for the agent.
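A greedy sketch of the ExecutionManager's job, choosing a set of active schemata that do not destructively interfere, might look as follows. In CogPrime the interference judgments would come from reasoning and memory; here they are simply given as a table, and all names and utilities are invented.

```python
# Greedy ExecutionManager sketch: admit active schemata in order of
# expected utility, skipping any that interfere with one already chosen.

def select_compatible(schemata, interferes):
    """schemata: list of (name, expected_utility).
    interferes: set of frozenset pairs that cannot run together."""
    chosen = []
    for name, _ in sorted(schemata, key=lambda x: -x[1]):
        if all(frozenset((name, c)) not in interferes for c in chosen):
            chosen.append(name)
    return chosen

chosen = select_compatible(
    [("manipulate_blocks", 0.9), ("watch_blocks", 0.5), ("glance_at_Bob", 0.4)],
    interferes={frozenset(("watch_blocks", "glance_at_Bob"))},
)
# -> ["manipulate_blocks", "watch_blocks"]; glancing at Bob is excluded
# because it conflicts with watching the blocks
```

A real version would also look for constructive interference, preferring sets of schemata that help each other rather than merely tolerating each other.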
22.12 Goals and Time
The CogPrime system maintains an explicit list of "Ubergoals", which as will be explained
in Chapter 24, receive attentional currency which they may then allocate to their subgoals
according to a particular mechanism.
However, there is one subtle factor involved in the definition of the Ubergoals: time. The
truth value of an Ubergoal is typically defined as the average level of satisfaction of some Demand
over some period of time - but the time scale of this averaging can be very important.
In many cases, it may be worthwhile to have separate Ubergoals corresponding to the same
Demand but doing their truth-value time-averaging over different time scales. For instance,
corresponding to Demands such as Novelty or Health, we may posit both long-term and short-
term versions, leading to Ubergoals such as CurrentNovelty, LongTermNovelty, CurrentHealth,
LongTermHealth, etc. Of course, one could also wrap multiple Ubergoals corresponding to a
single Demand into a single Ubergoal combining estimates over multiple time scales; this is not
a critical issue and the only point of splitting Demands into multiple Ubergoals is that it can
make things slightly simpler for other cognitive processes.
For instance, if the agent has a goal of pleasing Bob, and it knows Bob likes to be presented
with surprising structures and ideas, then the agent has some tricky choices to make. Among
other choices it must balance between focusing on
• creating things and then showing them to Bob
• studying basic knowledge and improving its skills.
Perhaps studying basic knowledge and skills will give it a foundation to surprise Bob much
more dramatically in the mid-term future ... but in the short run will not allow it to surprise
Bob much at all, because Bob already knows all the basic material. This is essentially a variant
of the general "exploration versus exploitation" dichotomy, which lacks any easy solution. Young
children are typically poor at carrying out this kind of balancing act, and tend to focus overly
much on near-term satisfaction. There are also significant cultural differences in the heuristics
with which adult humans face these issues; e.g. in some contexts Eastern cultures tend to focus
more on mid-to-long-term satisfaction, whereas Western cultures are more short-term oriented.
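One concrete way to realize short- and long-term Ubergoals over the same Demand is to track exponential moving averages of the raw satisfaction signal with different decay rates. This is a sketch under assumed parameters, not the actual CogPrime truth-value computation; the decay constants and sample sequence are arbitrary.

```python
# Short- and long-term versions of a Novelty Demand as exponential
# moving averages with different decay rates (parameters invented).

def update_ema(prev, sample, decay):
    return decay * prev + (1.0 - decay) * sample

current_novelty, longterm_novelty = 0.0, 0.0
# Ten time-steps of high novelty, then five with none at all:
for sample in [1.0] * 10 + [0.0] * 5:
    current_novelty = update_ema(current_novelty, sample, decay=0.5)
    longterm_novelty = update_ema(longterm_novelty, sample, decay=0.95)
# CurrentNovelty collapses quickly once novelty stops arriving, while
# LongTermNovelty decays far more slowly.
```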
Chapter 23
Attention Allocation
Co-authored with Joel Pitt, Matt Iklé and Rui Liu
23.1 Introduction
The critical factor shaping real-world general intelligence is resource constraint. Without this
issue, we could just have simplistic program-space-search algorithms like AIXItl instead of
complicated systems like the human brain or CogPrime. Resource constraint is managed implicitly
within various components of CogPrime, for instance in the population size used in evolu-
tionary learning algorithms, and the depth of forward or backward chaining inference trees
in PLN. But there is also a component of CogPrime that manages resources in a global and
cognitive-process-independent manner: the attention allocation component.
The general principles the attention allocation process should follow are easy enough to see:
History should be used as a guide, and an intelligence should make probabilistic judgments
based on its experience, guessing which resource-allocation decisions are likely to maximize
its goal-achievement. The problem is that this is a difficult learning and inference problem,
and to carry it out with excellent accuracy would require a limited-resources intelligent system
to spend nearly all its resources deciding what to pay attention to and nearly none of them
actually paying attention to anything else. Clearly this would be a very poor allocation of an AI
system's attention! So simple heuristics are called for, to be supplemented by more advanced
and expensive procedures on those occasions where time is available and correct decisions are
particularly crucial.
Attention allocation plays, to a large extent, a "meta" role in enabling mind-world corre-
spondence. Without effective attention allocation, the other cognitive processes can't do their
jobs of helping an intelligent agent to achieve its goals in an environment, because they won't
be able to pay attention to the most important parts of the environment, and won't get com-
putational resources at the times when they need it. Of course this need could be addressed in
multiple different ways. For example, in a system with multiple complex cognitive processes,
one could have attention allocation handled separately within each cognitive process, and then
a simple "top layer" of attention allocation managing the resources allocated to each cognitive
process. On the other hand, one could also do attention allocation via a single dynamic, perva-
sive both within and between individual cognitive processes. The CogPrime design gravitates
more toward the latter approach, though also with some specific mechanisms within various
MindAgents; and efforts have been made to have these specific mechanisms modulated by the
generic attention allocation structures and dynamics wherever possible.
In this chapter we will dig into the specifics of how these attention allocation issues are
addressed in the CogPrime design. In short, they are addressed via a set of mechanisms and
equations for dynamically adjusting importance values attached to Atoms and MindAgents. Dif-
ferent importance values exist pertinent to different time scales, most critically the short-term
(STI) and long-term (LTI) importances. The use of two separate time-scales here reflects fun-
damental aspects of human-like general intelligence and real-world computational constraints.
The dynamics of STI is oriented partly toward the need for real-time responsiveness, and
partly toward the more thoroughgoing need for cognitive processes to operate at speeds vaguely
resembling the speed of "real time" social interaction. The dynamics of LTI is based on the fact that some data tends to
be useful over long periods of time, years or decades in the case of human life, but the practical
capability to store large amounts of data in a rapidly accessible way is limited. One could
imagine environments in which very-long-term multiple-type memory was less critical than it is
in typical human-friendly environments; and one could envision AGI systems carrying out tasks
in which real-time responsiveness was unnecessary (though even then some attention focusing
would certainly be necessary). For AGI systems like these, an attention allocation system based
on STI and LTI with CogPrime-like equations would likely be inappropriate. But for an AGI
system intended to control a vaguely human-like agent in an environment vaguely resembling
everyday human environments, the focus on STI and LTI values, and the dynamics proposed
for these values in CogPrime, appear to make sense.
Two basic innovations are involved in the mechanisms attached to these STI and LTI impor-
tance values:
• treating attention allocation as a data mining problem: the system records information
about what it's done in the past and what goals it's achieved in the past, and then recog-
nizes patterns in this history and uses them to guide its future actions via probabilistically
adjusting the (often context-specific) importance values associated with internal terms, ac-
tors and relationships, and adjusting the "effort estimates" associated with Tasks
• using an artificial-economics approach to update the importance values (attached to Atoms,
MindAgents, and other actors in the CogPrime system) that regulate system attention.
(And, more speculatively, using an information geometry based approach to execute the
optimization involved in the artificial economics approach efficiently and accurately.)
The integration of these two aspects is crucial. The former aspect provides fundamental data
about what's of value to the system, and the latter aspect allows this fundamental data to
be leveraged to make sophisticated and integrative judgments rapidly. The need for the latter,
rapid-updating aspect exists partly because of the need for real-time responsiveness, imposed by
the need to control a body in a rapidly dynamic world, and the prominence in the architecture
of an animal-like cognitive cycle. The need for the former, data-mining aspect (or something
functionally equivalent) exists because, in the context of the tasks involved in human-level
general intelligence, the assignment of credit problem is hard - the relations between various
entities in the mind and the mind's goals are complex, and identifying and deploying these
relationships is a difficult learning problem requiring application of sophisticated intelligence.
Both of these aspects of attention allocation dynamics may be used in computationally
lightweight or computationally sophisticated manners:
• For routine use in real-time activity
- "data mining" consists of forming HebbianLinks (involved in the associative memory
and inference control, see Section 23.5), where the weight of the link from Atom A to
Atom B is based on the probability of shared utility of A and B
- economic attention allocation consists of spreading ShortTermImportance and LongTermImportance "artificial currency" values (both grounded in the universal underlying
"juju" currency value defined further below) between Atoms according to specific equa-
tions that somewhat resemble neural net activation equations but respect the conser-
vation of currency
• For use in cases where large amounts of computational resources are at stake based on lo-
calized decisions, hence allocation of substantial resources to specific instances of attention-
allocation is warranted
- "data mining" may be more sophisticated, including use of PLN, MOSES and pattern
mining to recognize patterns regarding what probably deserves more attention in what
contexts
- economic attention allocation may involve more sophisticated economic calculations
involving the expected future values of various "expenditures" of resources
The particular sort of "data mining" going on here is definitely not exactly what the human
brain does, but we believe this is a case where slavish adherence to neuroscience would be badly
suboptimal (even if the relevant neuroscience were well known, which is not the case). Doing
attention allocation entirely in a distributed, formal-neural-net-like way is, we believe, extremely
and unnecessarily inefficient, and given realistic resource constraints it leads to the rather poor
attention allocation that we experience every day in our ordinary waking state of consciousness.
Several aspects of attention allocation can be fruitfully done in a distributed, neural-net-like
way, but not having a logically centralized repository of system-history information (regardless
of whether it's physically distributed or not) seems intrinsically problematic in terms of effective
attention allocation. And we argue that, even for those aspects of attention allocation that are
best addressed in terms of distributed, vaguely neural-net-like dynamics, an artificial-economics
approach has significant advantages over a more strictly neural-net-like approach, due to the
greater ease of integration with other cognitive mechanisms such as forgetting and data mining.
23.2 Semantics of Short and Long Term Importance
We now specify the two types of importance value (short and long term) that play a key role
in CogPrime dynamics. Conceptually, ShortTermImportance (STI) is defined as
STI(A) = P(A will be useful in the near future)
whereas LongTermImportance (LTI) is defined as
LTI(A) = P(A will be useful eventually, in the foreseeable future)
Given a time-scale T, in general we can define an importance value relative to T as
I_T(A) = P(A will be useful during the next T seconds)
In the ECAN module in CogPrime, we deal only with STI and LTI rather than any other
importance values, and the dynamics of STI and LTI are dealt with by treating them as two
separate "artificial currency" values, which however are interconvertible via being mutually
grounded in a common currency called "juju."
For instance, if the agent is intensively concerned with trying to build interesting blocks
structures, then knowledge about interpreting biology research paper abstracts is likely to be
of very little current importance. So its biological knowledge will get low STI, but - assuming
the agent expects to use biology again - it should maintain reasonably high LTI so it can
remain in memory for future use. And if in its brainstorming about what blocks structures
to build, the system decides to use some biological diagrams as inspiration, STI can always
spread to some of the biology-related Atoms, increasing their relevance and getting them more
attention. While the attention allocation system contains mechanisms to convert STI to LTI,
it also has parameter settings biasing it to spend its juju on both kinds of importance - i.e.
it contains an innate bias to both focus its attention judiciously, and manage its long-term
memory conscientiously.
Because in CogPrime most computations involving STI and LTI are required to be very rapid
(as they're done for many Atoms in the memory very frequently), in most cases when dealing
with these quantities, it will be appropriate to sacrifice accuracy for efficiency. On the other
hand, it's useful to occasionally be able to carry out expensive, highly accurate computations
involving importance.
An example where doing expensive computations about attention allocation might pay off,
would be the decision whether to use biology-related or engineering-related metaphors in cre-
ating blocks structures to please a certain person. In this case it could be worth doing a few
steps of inference to figure out whether there's a greater intensional similarity between that
person's interests and biology or engineering; and then using the results to adjust the STI levels
of whichever of the two comes out most similar. This would not be a particularly expensive
inference to carry out, but it's still much more effort than what can be expended on Atoms in
the memory most of the time. Most attention allocation in CogPrime involves simple neural-net
type spreading dynamics rather than explicit reasoning.
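A defining property of this "neural-net type" spreading in CogPrime is that, unlike ordinary activation spreading, it conserves currency: what one Atom passes along HebbianLinks, it loses. The following is a minimal sketch under assumed parameters; the boundary value, spread fraction, and Atom names are all invented for illustration.

```python
# Conservation-respecting STI spreading: Atoms above the attentional
# focus boundary pass a fraction of their STI along HebbianLinks, so the
# total amount of currency in the system stays constant.

def spread_sti(sti, hebbian, focus_boundary=1.0, spread_frac=0.2):
    """sti: {atom: STI}; hebbian: {atom: {target: link_weight}}."""
    new = dict(sti)
    for src, targets in hebbian.items():
        if sti[src] <= focus_boundary or not targets:
            continue
        budget = spread_frac * sti[src]
        total_w = sum(targets.values())
        new[src] -= budget
        for tgt, w in targets.items():
            new[tgt] += budget * w / total_w
    return new

sti = {"biology": 0.1, "blocks": 5.0, "diagram": 0.2}
after = spread_sti(sti, {"blocks": {"diagram": 1.0}})
# blocks drops from 5.0 to 4.0; diagram rises from 0.2 to 1.2;
# total STI is unchanged
```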
Figure 23.1 illustrates the key role of LTI in the forgetting process. Figure 23.2 illustrates
the key role of STI in maintaining a "moving bubble of attention", which we call the system's
AttentionalFocus.
23.2.1 The Precise Semantics of STI and LTI
Now we precisiate the above definitions of STI and LTI.
First, we introduce the notion of reward. Reward is something that Goals give to Atoms. In
principle a Goal might give an Atom reward in various different forms, though in the design
given here, reward will be given in units of a currency called juju. The process by which Goals
assign reward to Atoms is part of the "assignment of credit" process (and we will later discuss
the various time-scales on which assignment of credit may occur and their relationship to the
time-scale parameter within LTI).
Next, we define
J(A, t1, t2, r) = expected amount of reward A will receive between t1 and t2 time-steps in the
future, if its STI has percentile rank r among all Atoms in the AtomTable
Fig. 23.1: LongTermImportance and Forgetting. (The figure depicts the possible fates of an
Atom, depending on its LTI: stay in the AtomSpace, be saved to a backup Atom store, be
removed from the AtomSpace, or be permanently deleted.)
The percentile rank r of an Atom is the rank of that Atom in a list of Atoms ordered by
decreasing STI, divided by the total number of Atoms. The reason for using a percentile rank
instead of the STI itself is that at any given time only a limited number of Atoms can be
given attention, so all Atoms below a certain percentile rank, depending on the amount of
available resources, will simply be ignored.
This is a fine-grained measure of how worthwhile it is expected to be to increase A's STI, in
terms of getting A rewarded by Goals.
For practical purposes it is useful to collapse J(A, t1, t2, r) to a single number:

J(A, t1, t2) = ( Σ_r J(A, t1, t2, r) w_r ) / ( Σ_r w_r )

where w_r weights the different percentile ranks (and should be chosen to be monotone increasing
in r). This is a single-number measure of the responsiveness of an Atom's utility to its STI level.
So for instance if A has a lot of STI and it turns out to be rewarded, then J(A, t1, t2) will be
high. On the other hand if A has little STI, then whether it gets rewarded or not will not
influence J(A, t1, t2) much.
To simplify notation, it's also useful to define a single-time-point version

J(A, t) = J(A, t, t)
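The rank-collapsing formula above can be illustrated numerically. All the J values and weights below are invented for the example; the only point is the weighted-average mechanics, with weights increasing in r as the text prescribes.

```python
# Collapsing J(A, t1, t2, r) over percentile ranks r, with weights w_r
# chosen monotone increasing in r (all numbers invented).

def collapse_J(J_by_rank, weights):
    num = sum(J_by_rank[r] * weights[r] for r in J_by_rank)
    den = sum(weights[r] for r in J_by_rank)
    return num / den

# Expected reward if A's STI rank is low / medium / high:
J_by_rank = {0.25: 0.0, 0.50: 0.2, 0.75: 0.9}
weights = {0.25: 1.0, 0.50: 2.0, 0.75: 4.0}
J = collapse_J(J_by_rank, weights)
# (0*1 + 0.2*2 + 0.9*4) / 7 = 4/7, approximately 0.571
```

Because the high ranks are weighted most heavily, this collapsed value is large exactly when giving A high STI would lead to reward, which is the "responsiveness" interpretation given in the text.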
23.2.1.1 Formalizing STI
Using these definitions, one simple way to make the STI definition precise is:
STI_thresh(A, t) = P( J(A, t, t + t_short) > S_threshold )

where S_threshold demarcates the "attentional focus boundary." This is a way of saying that we
don't want to give STI to Atoms that would not get rewarded even if they were given attention.
Fig. 23.2: Formation of the AttentionalFocus. The dynamics of STI is configured to en-
courage the emergence of richly cross-connected networks of Atoms with high STI (above a
threshold called the AttentionalFocusBoundary), passing STI among each other as long as this
is useful and forming new HebbianLinks among each other. The collection of these Atoms is
called the AttentionalFocus.
Or one could make the STI definition precise in a fuzzier way, and define

STI_fuzzy(A, t) = Σ_{s=0}^{∞} J(A, t + s) c^s

for some appropriate parameter c in (0, 1) (or something similar with a decay function less severe
than exponential).
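A truncated computation of this geometrically discounted sum makes the definition concrete. The J values and discount parameter below are invented; in practice J would be estimated rather than given.

```python
# Truncated fuzzy STI: geometrically discounted expected reward over
# the near future (J values and c are invented for illustration).

def fuzzy_sti(J_future, c):
    """J_future[s] = J(A, t + s); discount parameter c in (0, 1)."""
    return sum(J * c**s for s, J in enumerate(J_future))

sti = fuzzy_sti([1.0, 0.5, 0.25, 0.0, 0.0], c=0.5)
# 1.0 + 0.5*0.5 + 0.25*0.25 = 1.3125
```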
In either case, the goal of the ECAN subsystem, regarding STI, is to assign each Atom A
an STI value that corresponds as closely as possible to the theoretical STI values defined by
whichever one of the above equations is selected (or some other similar equation).
23.2.2 STI, STIFund, and Juju
But how can one estimate these probabilities in practice? In some cases they may be estimated
via explicit inference. But often they must be estimated by heuristics.
The estimative approach taken in the current CogPrime design is an artificial economy, in which
each Atom maintains a certain fund of artificial currency. In the current proposal this currency
is called juju and is the same currency used to value LTI. Let us call the amount of juju owned
by Atom A the STIFund of A. Then, one way to formalize the goal of the artificial economy is
to state that: if one ranks all Atoms by the wealth of their STIFund, and separately ranks all
Atoms by their theoretical STI value, the rankings should be as close as possible to the same.
One may also formalize the goal in terms of value correlation instead of rank correlation, of
course.
Proving conditions under which the STIFund values will actually correlate well with the
theoretical STI values is an open math problem. Heuristically, one may map STIFund values
into theoretical STI values by a mapping such as

A.STI = α + β (A.STIFund − STIFund_min) / (STIFund_max − STIFund_min)

where STIFund_min = min_X X.STIFund and STIFund_max = max_X X.STIFund. However,
we don't currently have rigorous grounding for any particular functional form for such a
mapping; the above is just a heuristic approximation.
The artificial economy approach leads to a variety of supporting heuristics. For instance, one
such heuristic is: if A has been used at time t, then it will probably be useful at time t + s for
small s. Based on this heuristic, whenever a MindAgent uses an Atom A, it may wish to increase
A's STIFund (so as to hopefully increase correlation of A's STIFund with its theoretical STI).
It does so by transferring some of its juju to A's STIFund.
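The juju-transfer step of this heuristic can be sketched in a few lines. The class and method names are invented for illustration, not the OpenCog API; the point is that stimulation is a currency transfer, so the system-wide total is conserved.

```python
# Usage heuristic as a currency transfer: a MindAgent that uses an Atom
# pays some of its own juju into that Atom's STIFund, keeping the total
# amount of currency in the system constant.

class MindAgent:
    def __init__(self, juju):
        self.juju = juju

    def stimulate(self, atom_sti_funds, atom, amount):
        amount = min(amount, self.juju)  # cannot spend more than it owns
        self.juju -= amount
        atom_sti_funds[atom] += amount

funds = {"give(Bob, me, picture)": 0.0}
agent = MindAgent(juju=10.0)
agent.stimulate(funds, "give(Bob, me, picture)", 0.5)
# agent.juju == 9.5, the Atom's fund == 0.5; system total is still 10.0
```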
23.2.3 Formalizing LTI
Similarly to STI, with LTI we will define theoretical LTI values, and posit an LTIFund associated
with each Atom, which seeks to create values correlated with the theoretical LTI values.
For LTI, the theoretical issues are subtler. There is a variety of different ways to precisiate
the above loose conceptual definition of LTI. For instance, one can (and we will below) create
formalizations of both:
1. L77.1(1) = (some time-weighting or normalization of) the expected value of A's total
usefulness over the long-term future
2. LTh,„„i(A) = the probability that A ever becomes highly useful at some point in the long-
term future
(here "cont" stands for "continuous"). Each of these may be formalized, in similar but noniden-
tical ways.
These two forms of LTI may be viewed as extremes along a continuum; one could posit a
host of intermediary LTI values between them. For instance, one could define

LTI_p(A) = the p'th power average¹ of the expectation of the utility of A over brief time intervals,
measured over the long-term future

Then we would have

LTI_burst = LTI_∞
LTI_cont = LTI_1

and could vary p to vary the sharpness of the LTI computation. This might be useful in some
contexts, but our guess is that it's overkill in practice and that looking at LTI_burst and LTI_cont
is enough (or more than enough; the current OCP code uses only one LTI value and that has
not been problematic so far).
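The way the p'th power average interpolates between the two extremes is easy to see numerically: at p = 1 it is the plain mean (continuous, time-averaged usefulness), and as p grows it approaches the maximum (whether the Atom was ever highly useful). The utility values below are invented.

```python
# p'th power average: ((1/n) * sum(x_i^p))^(1/p), interpolating between
# LTI_cont (p = 1, plain mean) and LTI_burst (p -> infinity, maximum).

def p_power_average(xs, p):
    return (sum(x**p for x in xs) / len(xs)) ** (1.0 / p)

# Toy utilities over four brief intervals: mostly idle, one burst of use.
utilities = [0.1, 0.1, 0.9, 0.1]
lti_cont = p_power_average(utilities, 1)         # 0.3: time-averaged use
lti_near_burst = p_power_average(utilities, 50)  # close to the peak 0.9
```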
23.2.4 Applications of LTI_burst versus LTI_cont
It seems that the two forms of LTI discussed above might be of interest in different contexts,
depending on the different ways that Atoms may be used so as to achieve reward.
If an Atom is expected to get rewarded for the results of its being selected by MindAgents
that carry out diffuse, background thinking (and hence often select low-STI Atoms from the
AtomTable), then it may be best associated with LTI_cont.
On the other hand, if an Atom is expected to get rewarded for the results of its being
selected by MindAgents that are focused on intensive foreground thinking (and hence generally
only select Atoms with very high STI), it may be best associated with LTI_burst.
In principle, Atoms could be associated with particular LTI_p values based on the particulars of
the selection mechanisms of the MindAgents expected to lead to their reward. But the issue
with this is that it would result in Atoms carrying around an excessive abundance of different LTI_p
values for various p, resulting in memory bloat; and it would also require complicated analyses
of MindAgent dynamics. If we do need more than one LTI value, one would hope that two will
be enough, for memory conservation reasons.
And of course, if an Atom has only one LTI value associated with it, this can reasonably
be taken to stand in for the other one: either of LTI_burst or LTI_cont may, in the absence of
information to the contrary, be taken as an estimate of the other.
23.2.4.1 LTI with Various Time Lags
The issue of the p value in the average in the definition of LTI is somewhat similar to (though
orthogonal to) the point that there are many different interpretations of LTI, achieved via
considering various time-lags. Our guess is that a small set of time-lags will be sufficient. Perhaps
one wants an exponentially increasing series of time-lags: i.e. to calculate LTI over k cycles where
k is drawn from {r, 2r, 4r, 8r, ...}.

¹ The p'th power average of values x_i is defined as ( (1/n) Σ_i x_i^p )^(1/p).
The time-lag in LTI seems related to the time-lag in the system's goals. If a Goal object is
disseminating juju, and the Goal has an intrinsic time scale of t, then it may be interested in
LTI on time-scale t. So when a MA (MindAgent) is acting in pursuit of that goal, it should
spend a bunch of its juju on LTI on time-scale t.
Complex goals may be interested in multiple time-scales (for instance, a goal might place
greater value on things that occur in the next hour, but still have nonzero interest in things that
occur in a week), and hence may have different levels of interest in LTI on multiple time-scales.
23.2.4.2 Formalizing Burst LTI
Regarding burst LTI, two approaches to formalization seem to be the threshold version

LTI_{burst,thresh}(A) = P(A will receive a total of at least S_{thresh} amount of normalized
stimulus during some time interval of length t_{short} in the next t_{long} time steps)

and the fuzzy version,

LTI_{burst,fuzzy}(A, t) = \sum_{s=0}^{\infty} stim(A, t+s, t_{short}) f(s, t_{long})

where stim(A, t+s, t_{short}) denotes the normalized stimulus A receives during the interval of
length t_{short} beginning at time t+s, and f(s, t_{long}) : R^+ \times R^+ \to R^+ is a nonincreasing
function that remains roughly constant in s up till a point t_{long} steps in the future, and then
begins slowly decaying.
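To make the fuzzy definition concrete, here is a minimal numerical sketch (ours, not CogPrime code) that approximates the sum by truncating it at a finite horizon; the stimulus predictor `stim` and the exponential tail of `f` are illustrative assumptions:

```python
# Illustrative sketch (not CogPrime code): approximating fuzzy burst LTI
# by truncating the infinite sum at a finite horizon.
import math

def f(s, t_long, decay=0.01):
    # Nonincreasing weight: roughly constant up to t_long steps ahead,
    # then slowly (here, exponentially) decaying -- an assumed tail shape.
    return 1.0 if s <= t_long else math.exp(-decay * (s - t_long))

def burst_lti_fuzzy(stim, t, t_short, t_long, horizon=1000):
    # Truncated version of: sum over s of stim(A, t+s, t_short) * f(s, t_long)
    return sum(stim(t + s, t_short) * f(s, t_long) for s in range(horizon))

# Toy predictor: a burst of normalized stimulus expected around step 50.
def stim(step, t_short):
    return 1.0 if 50 <= step < 50 + t_short else 0.0

print(burst_lti_fuzzy(stim, t=0, t_short=5, t_long=100))  # 5.0
```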
23.2.4.3 Formalizing Continuous LTI
The threshold version of continuous LTI is quite simply:

LTI_{cont,thresh}(A, t_{long}) = STI_{thresh}(A, t_{long})

That is, smooth threshold LTI is just like smooth threshold STI, but the time-scale involved is
longer.
On the other hand, the fuzzy version of smooth LTI is:

LTI_{cont,fuzzy}(A, t) = \sum_{s=0}^{\infty} STI_{fuzzy}(A, t+s) f(s, t_{long})
using the same decay function f that was introduced above in the context of burst LTI.
23.3 Defining Burst LTI in Terms of STI
It is straightforward to define burst LTI in terms of STI, rather than directly in terms of juju.
We have
LTI_{burst,thresh}(A, t) = P(\exists\, s \le t_{long} : STI_{thresh}(A, t+s))
Or, using the fuzzy definitions, we obtain instead the approximate equation

LTI_{burst,fuzzy}(A, t) \approx \sum_{s=0}^{\infty} \alpha(s) STI_{fuzzy}(A, t+s) f(s, t_{long})

where

\alpha(s) = \frac{1 - c}{1 - c^{s+1}}
or the more complex exact equation:

LTI_{burst,fuzzy}(A, t) = \sum_{s=0}^{\infty} STI_{fuzzy}(A, t+s) \left( f(s, t_{long}) - \sum_{r=1}^{s} c^r f(s-r, t_{long}) \right)
23.4 Valuing LTI and STI in terms of a Single Currency
We now further discuss the approach of defining LTIFund and STIFund in terms of a single
currency: juju (which, as noted, corresponds in the current ECAN design to normalized
stimulus).
In essence, we can think of STIFund and LTIFund as different forms of financial instrument,
which are both grounded in juju. Each Atom has two financial instruments attached to it:
"STIFund of Atom A" and "LTIFund of Atom A" (or more if multiple versions of LTI are used).
These financial instruments have the peculiarity that, although many agents can put juju into
any one of them, no record is kept of who put juju in which one. Rather, the MA's are acting
so as to satisfy the system's Goals, and are adjusting the STIFund and LTIFund values in a
heuristic manner that is expected to approximately maximize the total utility propagated from
Goals to Atoms.
Finally, each of these financial instruments has a value that gets updated by a specific update
equation.
To understand the logic of this situation better, consider the point of view of a Goal with a
certain amount of resources (juju, to be used as reward), and a certain time-scale on which its
satisfaction is to be measured. Suppose that the goal has a certain amount of juju to expend
on getting itself satisfied.
This Goal clearly should allocate some of its juju toward getting processor time allocated
toward the right Atoms to serve its ends in the near future; and some of its juju toward ensuring
that, in the future, the memory will contain the Atoms it will want to see processor time allocated
to. Thus, it should allocate some of its juju toward boosting the STIFund of Atoms that it
thinks will (if chosen by appropriate MindAgents) serve its needs in the near future, and some
of its juju toward boosting the LTIFund of Atoms that it thinks will serve its needs in the future
(if they remain in RAM). Thus, when a Goal invokes a MindAgent (giving the MindAgent the
juju it needs to access Atoms and carry out its work), it should tell this MindAgent to put some
of its juju into LTIFunds and some into STIFunds.
If a MindAgent receives a certain amount of juju each cycle, independently of what the
system Goals are explicitly telling it, then this should be viewed as reflecting an implicit goal
of "ambient cognition", and the balance of STI and LTI associated with this implicit goal must
be a system parameter.
In general, the trade-off between STI and LTI boils down to the weighting between near and
far future that is intrinsic to a particular Goal. Simplistically: if a Goal values getting processor
allocated to the right stuff immediately 25 times more than getting processor allocated to the
right stuff 20K cycles in the future, then it should be willing to spend 25x more of its juju on
STI than on LTI_{20K cycles}. (This simplistic picture is complicated a little by the relationship
between different time-scales. For instance, boosting LTI_{10K cycles}(A) will have an indirect
effect of increasing the odds that A will still be in memory 20K cycles in the future.)
However, this isn't the whole story, because multiple Goals are setting the importance values
of the same set of Atoms. If M1 pumps all its juju into STI for certain Atoms, then M2 may
decide it's not worthwhile for it to bother competing with M1 in the STI domain, and to spend
its juju on LTI instead.
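The simplistic weighting picture above can be sketched as follows (an illustration, not part of the ECAN design; the function and the numbers are assumptions):

```python
# Minimal sketch: split a Goal's juju budget between STIFund and LTIFund
# boosts in proportion to the Goal's near-term versus far-term weighting.
def split_juju(total_juju, near_weight, far_weight):
    w = near_weight + far_weight
    return (total_juju * near_weight / w,  # toward STIFund boosts
            total_juju * far_weight / w)   # toward LTIFund boosts

# A Goal valuing immediate allocation 25x more than allocation 20K cycles out:
sti_juju, lti_juju = split_juju(260.0, near_weight=25.0, far_weight=1.0)
print(sti_juju, lti_juju)  # 250.0 10.0
```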
Note that the current system doesn't allow a MA to change its mind about LTI allocations.
One can envision a system where a MindAgent could, in January, pay juju to have Atom A
kept around for a year, but then change its mind in June, 6 months later, and ask for some of
the money back. But this would require an expensive accounting procedure, keeping track of
how much of each Atom's LTI had been purchased by which MindAgent; so it seems a poor
approach.
A more interesting alternative would be to allow MA's to retain adjustable "reserve funds" of
juju. This would mean that a MindAgent would never see a purpose to setting LTI_{one year}(A)
instead of repeatedly setting LTI_{one cycle}(A), unless a substantial transaction cost were
incurred with each transaction of adjusting an Atom's LTI. Introducing a transaction cost plus
an adjustable per-MindAgent juju reserve fund, and LTI's on multiple time scales, would give the
LTI framework considerable flexibility. (To prevent MA's from hoarding their juju, one could
place a tax rate on reserve juju.)
The conversion rate between STI and LTI becomes an interesting matter; though it seems
not a critical one, since in the practical dynamics of the system it's juju that is used to produce
STI and LTI. In the current design there is no apparent reason to spread STI of one Atom to
LTI of another Atom, or convert the STI of an Atom into LTI of that same Atom, etc. - but
such an application might come up. (For the rest of this paragraph, let's just consider LTI with
one time scale, for simplicity.) Each Goal will have its own preferred conversion rate between
STI and LTI, based on its own balancing of different time scales. But, each Goal will also have
a limited amount of juju, hence one can only trade a certain amount of STI for LTI, if one
is trading with a specific goal G. One could envision a centralized STI-for-LTI market where
different MA's would trade with each other, but this seems overcomplicated, at least at the
present stage.
As a simpler software design point, this all suggests the value of associating each Goal with a
parameter telling how much of its juju it wants to spend on STI versus LTI. Or, more subtly,
how much of its juju it wants to spend on LTI on various time-scales. On the other hand, in a
simple ECAN implementation this balance may be assumed constant across all Goals.
23.5 Economic Attention Networks
Economic Attention Networks (ECANs) are dynamical systems based on the propagation of
STI and LTI values. They are similar in many respects to Hopfield nets, but are based on a
different conceptual foundation involving the propagation of amounts of (conserved) currency
rather than neural-net activation. Further, ECANs are specifically designed for integration with
a diverse body of cognitive processes as embodied in integrative AI designs such as CogPrime.
A key aspect of the CogPrime design is the imposition of ECAN structure on the CogPrime
AtomSpace.
Specifically, ECANs have been designed to serve two main purposes within CogPrime: to
serve as an associative memory for the network, and to facilitate effective allocation of the
attention of other cognitive processes to appropriate knowledge items.
An ECAN is simply a graph, consisting of un-typed nodes and links, and also "Hebbian" links
that may have types such as HebbianLink, InverseHebbianLink, or SymmetricHebbianLink.
Each node and link in an ECAN is weighted with two currency values, called STI (short-
term importance) and LTI (long-term importance); and each Hebbian link is weighted with a
probabilistic truth value.
The equations of an ECAN explain how the STI, LTI and Hebbian link weights values
get updated over time. As alluded to above, the metaphor underlying these equations is the
interpretation of STI and LTI values as (separate) artificial currencies. The fact that STI and
LTI are currencies means that, except in unusual instances where the ECAN controller decides to
introduce inflation or deflation and explicitly manipulate the amount of currency in circulation,
the total amounts of STI and LTI in the system are conserved. This fact makes the dynamics
of an ECAN dramatically different than that of an attractor neural network.
In addition to STI and LTI as defined above, the ECAN equations also contain the notion
of an Attentional Focus (AF), consisting of those Atoms in the ECAN with the highest STI
values (and represented by the s_{thresh} value in the above equations). These Atoms play a
privileged role in the system and, as such, are treated using an alternate set of equations.
23.5.1 Semantics of Hebbian Links
Conceptually, the probability value of a HebbianLink from A to B is the odds that if A is in
the AF, so is B; and correspondingly, the InverseHebbianLink from A to B is weighted with the
odds that if A is in the AF, then B is not. An ECAN will often be coupled with a "Forgetting"
process that removes low-LTI Atoms from memory according to certain heuristics. A critical
aspect of the ECAN equations is that Atoms periodically spread their STI and LTI to other
Atoms that connect to them via Hebbian and InverseHebbianLinks; this is the ECAN analogue
of activation spreading in neural networks.
Multiple varieties of HebbianLink may be constructed, for instance
• Asymmetric HebbianLinks, whose semantics are as mentioned above: the truth value of
HebbianLink A B denotes the probability that if A is in the AF, so is B
• Symmetric HebbianLinks, whose semantics are that: the truth value of SymmetricHebbian-
Link A B denotes the probability that if one of A or B is in the AF, both are
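These probabilistic semantics suggest a direct, if expensive, way of estimating Hebbian truth values from a history of AF snapshots. The following is our illustrative sketch, not the book's prescribed algorithm:

```python
# Sketch (our illustration): estimating Hebbian link probabilities from a
# history of Attentional Focus (AF) snapshots, per the semantics above.
def hebbian_tv(af_history, a, b):
    # P(B in AF | A in AF): asymmetric HebbianLink A -> B
    with_a = [af for af in af_history if a in af]
    if not with_a:
        return 0.0
    return sum(1 for af in with_a if b in af) / len(with_a)

def symmetric_hebbian_tv(af_history, a, b):
    # P(both in AF | at least one in AF): SymmetricHebbianLink A B
    either = [af for af in af_history if a in af or b in af]
    if not either:
        return 0.0
    return sum(1 for af in either if a in af and b in af) / len(either)

history = [{"dog", "cute"}, {"dog"}, {"cat", "cute"}, {"dog", "cute"}]
print(hebbian_tv(history, "dog", "cute"))            # 0.6666666666666666
print(symmetric_hebbian_tv(history, "dog", "cute"))  # 0.5
```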
It is also worth noting that one can combine ContextLinks with HebbianLinks and express
contextual association such that in context C, there is a strong HebbianLink between A and B.
23.5.2 Explicit and Implicit Hebbian Relations
In addition to explicit HebbianLinks, it can be useful to treat other links implicitly as Heb-
bianLinks. For instance, if ConceptNodes A and B are found to connote similar concepts, and
a SimilarityLink is formed between them, then this gives reason to believe that maybe a Sym-
metricHebbianLink between A and B should exist as well. One could incorporate this insight
in CogPrime in at least three ways:
• creating HebbianLinks paralleling other links (such as SimilarityLinks)
• adding "Hebbian weights" to other links (such as SimilarityLinks)
• implicitly interpreting other links (such as SimilarityLinks) as HebbianLinks
Further, these strategies may potentially be used together.
There are some obvious semantic relationships to be used in interpreting other link types im-
plicitly as HebbianLinks: for instance, Similarity maps into SymmetricHebbian, and Inheritance
A B maps into Hebbian A B. One may express these as inference rules, e.g.
SimilarityLink A B <tv_1>
I-
SymmetricHebbianLink A B <tv_2>
where tv2.s = tv1.s. Clearly, tv2.c < tv1.c; but the precise magnitude of tv2.c must be deter-
mined by a heuristic formula. One option is to set tv2.c = c * tv1.c, where the constant c is set
empirically via data mining the System Activity Tables to be described below.
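For illustration, the conversion rule might be sketched as follows, assuming a (strength, confidence) truth-value pair and a hypothetical discount constant c = 0.5:

```python
# Sketch of the rule above: derive a SymmetricHebbianLink truth value
# (strength, confidence) from a SimilarityLink truth value, discounting the
# confidence by an empirically tuned constant c. The value c = 0.5 is an
# assumption for illustration.
def similarity_to_hebbian(tv1_strength, tv1_confidence, c=0.5):
    assert 0.0 < c < 1.0  # ensures tv2.c < tv1.c
    return tv1_strength, c * tv1_confidence

s, conf = similarity_to_hebbian(0.8, 0.9)
print(s, conf)  # 0.8 0.45
```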
23.6 Dynamics of STI and LTI Propagation
We now get more specific about how some of these ideas are implemented in the currently
implemented ECAN subsystem of CogPrime. We'll discuss mostly STI here because in the
current design LTI works basically the same way.
MindAgents send out stimulus to Atoms whenever they use them (or else, sometimes, just for
the purpose of increasing the Atom's STI); and before these stimulus values are used to update
the STI levels of the receiving Atom, they are normalized: each stimulus value is divided by
the total amount of stimulus sent out by the MindAgent in that cycle, and multiplied by the
total amount of STI currency that the MindAgent decided to spend in that cycle. The resulting
normalized stimulus is what has above been
called juju. This normalization preserves fairness among MA's, and conservation of currency.
(The reason "stimuli" exist, separately from STI, is that stimulus-sending needs to be very
computationally cheap, as in general it's done frequently by each MA each cycle, and we don't
want each action a MA takes to invoke some costly importance-updating computation.)
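The normalization step can be sketched as follows (illustrative code, not the OpenCog implementation):

```python
# Sketch of the normalization described above: each Atom's share of a
# MindAgent's total raw stimulus for the cycle is scaled by the STI currency
# that MindAgent chose to spend. Names are illustrative assumptions.
def normalize_stimulus(raw_stimulus, sti_to_spend):
    # raw_stimulus: {atom: raw stimulus sent this cycle by one MindAgent}
    total = sum(raw_stimulus.values())
    if total == 0:
        return {atom: 0.0 for atom in raw_stimulus}
    return {atom: (s / total) * sti_to_spend
            for atom, s in raw_stimulus.items()}

juju = normalize_stimulus({"A": 3.0, "B": 1.0}, sti_to_spend=20.0)
print(juju)  # {'A': 15.0, 'B': 5.0}
```

Note that each MindAgent injects exactly the currency it chose to spend, however much raw stimulus it sent, which is what preserves fairness and conservation.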
Then, Atoms exchange STI according to certain equations (related to HebbianLinks and
other links), and have their STI values updated according to certain equations (which involve,
among other operations, transferring STI to the "central bank").
23.6.1 ECAN Update Equations
The CogServer is understood to maintain a kind of central bank of STI and LTI funds. When
a non-ECAN MindAgent finds an Atom valuable, it sends that Atom a certain amount of
Stimulus, which results in that Atom's STI and LTI values being increased (via equations to be
presented below, that transfer STI and LTI funds from the CogServer to the Atoms in question).
Then, the ECAN ImportanceUpdating MindAgent carries out multiple operations, including
some that transfer STI and LTI funds from some Atoms back to the CogServer.
There are multiple ways to embody this process equationally; here we briefly describe two
variants.
23.6.1.1 Definition and Analysis of Variant 1
We now define a specific set of equations in accordance with the ECAN conceptual framework
described above. We define s = [s_1, \ldots, s_n] to be the vector of STI values, and

C = \begin{bmatrix} c_{11} & \cdots & c_{1n} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{nn} \end{bmatrix}

to be the connection matrix of Hebbian probability values, where it is assumed that the existence
of a HebbianLink or InverseHebbianLink between A and B are mutually exclusive possibilities.
We also define

C_{LTI} = \begin{bmatrix} g_{11} & \cdots & g_{1n} \\ \vdots & \ddots & \vdots \\ g_{n1} & \cdots & g_{nn} \end{bmatrix}

to be the matrix of LTI values for each of the corresponding links.
We assume an updating scheme in which, periodically, a number of Atoms are allocated
Stimulus amounts, which causes the corresponding STI values to change according to the equa-
tions

\forall i : s_i = s_i - rent_i + wages_i

where rent and wages are given by

rent_i = \begin{cases} \langle Rent \rangle \cdot \max\left(0, \frac{\log(20 s_i / recentMaxSTI)}{2}\right), & \text{if } s_i > e \\ 0, & \text{if } s_i \le e \end{cases}

and

wages_i = \begin{cases} \frac{\langle Wage \rangle \langle Stimulus \rangle}{\sum_{i=1}^{n} p_i}, & \text{if } p_i = 1 \\ \frac{\langle Wage \rangle \langle Stimulus \rangle}{n - \sum_{i=1}^{n} p_i}, & \text{if } p_i = 0 \end{cases}

where P = [p_1, \ldots, p_n], with p_i \in \{0, 1\}, is the cue pattern for the pattern that is to be
retrieved. All quantities enclosed in angle brackets are system parameters, and LTI updating is
accomplished using a completely analogous set of equations.
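As a sketch of how these update equations might look in code (an illustration under the forms given above; the parameter values are assumptions):

```python
# Illustrative sketch of the Variant 1 STI update: each Atom pays rent
# (logarithmic in its STI, zero at or below the threshold e) and earns wages
# split evenly among cue (or non-cue) Atoms. Parameter values are assumed.
import math

def update_sti(s, cue, Rent, Wage, Stimulus, e, recent_max_sti):
    n = len(s)
    n_cue = sum(cue)
    out = []
    for si, pi in zip(s, cue):
        if si > e:
            rent = Rent * max(0.0, math.log(20 * si / recent_max_sti) / 2)
        else:
            rent = 0.0
        wage = (Wage * Stimulus / n_cue if pi == 1
                else Wage * Stimulus / (n - n_cue))
        out.append(si - rent + wage)
    return out

s = update_sti([1.0, 0.2, 0.05], cue=[1, 1, 0],
               Rent=0.1, Wage=1.0, Stimulus=1.0, e=0.1, recent_max_sti=1.0)
```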
The changing STI values then cause updating of the connection matrix, according to the
"conjunction" equations. First define

norm_i = \begin{cases} s_i / recentMaxSTI, & \text{if } s_i \ge 0 \\ s_i / recentMinSTI, & \text{if } s_i < 0 \end{cases}

Next define

conj = Conjunction(s_i, s_j) = norm_i \times norm_j

and

c'_{ij} = \langle ConjDecay \rangle \, conj + (1 - conj) c_{ij}.

Finally update the matrix elements by setting

c_{ij} = c'_{ij}, \text{ if } c_{ij} > 0
c_{ij} = c'_{ji}, \text{ if } c_{ij} < 0
We are currently also experimenting with updating the connection matrix in accordance with
the equations suggested by Storkey (1997, 1998, 1999).
A key property of these equations is that both wages paid to, and rent paid by, each node
are positively correlated to their STI values. That is, the more important nodes are paid more
for their services, but they also pay more in rent.
A fixed percentage of the links with the lowest LTI values is then forgotten (which corresponds
equationally to setting the LTI to 0).
Separately from the above, the process of Hebbian probability updating is carried out via a
diffusion process in which some nodes "trade" STI utilizing a diffusion matrix D, a version of
the connection matrix C normalized so that D is a left stochastic matrix. D acts on a similarly
scaled vector v, normalized so that v is equivalent to a probability vector of STI values.
The decision about which nodes diffuse in each diffusion cycle is carried out via a decision
function. We currently are working with two types of decision functions: a standard threshold
function, by which nodes diffuse if and only if the nodes are in the AF; and a stochastic decision
function in which nodes diffuse with probability \frac{\tanh(shape(s_i - FocusBoundary)) + 1}{2}, where shape and
FocusBoundary are parameters.
The details of the diffusion process are as follows. First, construct the diffusion matrix from
the entries in the connection matrix as follows:
If c_{ij} \ge 0, then d_{ij} = c_{ij};
else, set d_{ji} = -c_{ij}.

Next, we normalize the columns of D to make D a left stochastic matrix. In so doing, we ensure
that each node spreads no more than a \langle MaxSpread \rangle proportion of its STI, by setting:

if \sum_{i=1}^{n} d_{ij} > \langle MaxSpread \rangle :

d_{ij} = d_{ij} \times \frac{\langle MaxSpread \rangle}{\sum_{i=1}^{n} d_{ij}}, \text{ for } i \ne j
d_{jj} = 1 - \langle MaxSpread \rangle

else:

d_{jj} = 1 - \sum_{i=1, i \ne j}^{n} d_{ij}
Now we obtain a scaled STI vector v by setting

minSTI = \min_i s_i \quad \text{and} \quad maxSTI = \max_i s_i

v_i = \frac{s_i - minSTI}{maxSTI - minSTI}

The diffusion matrix is then used to update the node STIs:

v' = Dv

and the STI values are rescaled to the interval [minSTI, maxSTI].
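The diffusion stage can be sketched as follows (our illustration, list-based for clarity rather than efficiency; it assumes nonnegative connection entries):

```python
# Illustrative sketch of the diffusion stage: column-normalize the
# (nonnegative) connection matrix so each node spreads at most max_spread of
# its STI, apply v' = Dv to the min-max scaled STI vector, then rescale the
# result back to [minSTI, maxSTI].
def diffuse(c, s, max_spread):
    n = len(s)
    d = [row[:] for row in c]
    for j in range(n):  # make column j sum to 1 (left stochastic)
        col = sum(d[i][j] for i in range(n) if i != j)
        if col > max_spread:
            for i in range(n):
                if i != j:
                    d[i][j] *= max_spread / col
            d[j][j] = 1.0 - max_spread
        else:
            d[j][j] = 1.0 - col
    lo, hi = min(s), max(s)
    v = [(si - lo) / (hi - lo) for si in s]  # scale STI into [0, 1]
    v2 = [sum(d[i][j] * v[j] for j in range(n)) for i in range(n)]
    lo2, hi2 = min(v2), max(v2)
    span = (hi2 - lo2) or 1.0
    return [lo + (hi - lo) * (x - lo2) / span for x in v2]
```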
In both the rent and wage stage and in the diffusion stage, the total STI and LTI funds
of the system each separately form a conserved quantity: in the case of diffusion, the vector
v is simply the total STI times a probability vector. To maintain overall system funds within
homeostatic bounds, a mid-cycle tax and rent-adjustment can be triggered if necessary; the
equations currently used for this are:

1. \langle Rent \rangle = \langle Rent \rangle \times \frac{\text{recent stimulus awarded before update}}{N_{update} \times \langle Wage \rangle}

2. tax = \frac{x}{n}, where x is the distance from the current AtomSpace funds to the center of the
homeostatic range for AtomSpace funds;

3. \forall i : s_i = s_i - tax
23.6.1.2 Investigation of Convergence Properties of Variant 1
Now we investigate some of the properties that the above ECAN equations display when we
use an ECAN defined by them as an associative memory network in the manner of a Hopfield
network.
We consider a situation where the ECAN is supplied with memories via a "training" phase
in which one imprints it with a series of binary patterns of the form P = [pi , • • • ,p„], with
E {0, 1). Noisy versions of these patterns are then used as cue patterns during the retrieval
process.
We obviously desire that the ECAN retrieve the stored pattern corresponding to a given cue
pattern. In order to achieve this goal, the ECAN must converge to the correct fixed point.
Theorem 23.1. For a given value of e in the STI rent calculation, there is a subset of hyperbolic
decision functions for which the ECAN dynamics converge to an attracting fixed point.

Proof. Rent is zero whenever e \le s_i \le \frac{recentMaxSTI}{20}, so we consider this case first. The updating
process for the rent and wage stage can then be written as f(s) = s + constant. The next stage
is governed by the hyperbolic decision function
g(s) = \frac{\tanh(shape(s - FocusBoundary)) + 1}{2}

The entire updating sequence is obtained by the composition (g \circ f)(s), whose derivative is
then

(g \circ f)'(s) = \frac{sech^2(f(s)) \cdot shape}{2}

which has magnitude less than 1 whenever -2 < shape < 2. We next consider the case s_i >
\frac{recentMaxSTI}{20}. The function f now takes the form

f(s) = s - \frac{\log(20 s / recentMaxSTI)}{2} + constant,

and we have

(g \circ f)'(s) = \frac{sech^2(f(s)) \cdot shape}{2} \left( 1 - \frac{1}{2s} \right)

which has magnitude less than 1 whenever |shape| < \left| \frac{4s}{2s - 1} \right|. Choosing the shape parameter
to satisfy 0 < shape < \min\left(2, \left| \frac{4s}{2s - 1} \right|\right) then guarantees that |(g \circ f)'| < 1. Finally, g \circ f maps
the closed interval [recentMinSTI, recentMaxSTI] into itself, so applying the Contraction
Mapping Theorem completes the proof.
23.6.1.3 Definition and Analysis of Variant 2
The ECAN variant described above has performed completely acceptably in our experiments
so far; however we have also experimented with an alternate variant, with different convergence
properties. In Variant 2, the dynamics of the ECAN are specifically designed so that a certain
conceptually intuitive function serves as a Lyapunov function of the dynamics.
At a given time t, for a given Atom indexed i, we define two quantities: OUT_i(t) = the total
amount that Atom i pays in rent and tax and diffusion during the time-t iteration of ECAN;
IN_i(t) = the total amount that Atom i receives in diffusion, stimulus and welfare during the
time-t iteration of ECAN. Note that welfare is a new concept to be introduced below. We then
define DIFF_i(t) = IN_i(t) - OUT_i(t); and define AVDIFF(t) as the average of DIFF_i(t) over
all i in the ECAN.
The design goal of Variant 2 of the ECAN equations is to ensure that, if the parameters are
tweaked appropriately, AVDIFF can serve as a (deterministic or stochastic, depending on the
details) Lyapunov function for ECAN dynamics. This implies that with appropriate parameters
the ECAN dynamics will converge toward a state where AVDIFF=0, meaning that no Atom
is making any profit or incurring any loss. It must be noted that this kind of convergence is
not always desirable, and sometimes one might want the parameters set otherwise. But if one
wants the STI components of an ECAN to converge to some specific values, as for instance in
a classic associative memory application, Variant 2 can guarantee this easily.
In Variant 2, each ECAN cycle begins with rent collection and welfare distribution, which
occurs via collecting rent via the Variant 1 equation, and then performing the following two
steps:
• Step A: calculate X, defined as the positive part of the total amount by which AVDIFF
has been increased via the overall rent collection process.
• Step B: redistribute X to needy Atoms as follows: For each Atom z, calculate the positive
part of OUT - IN, defined as deficit(z). Distribute X + e wealth among all Atoms z, giving
each Atom a percentage of X that is proportional to deficit(z), but never so much as to
cause OUT < IN for any Atom (the welfare being given counts toward IN). Here e > 0
ensures AVDIFF decrease; e = 0 may be appropriate if convergence is not required in a
certain situation.
Step B is the welfare step, which guarantees that rent collection will decrease AVDIFF. Step
A calculates the amount by which the rich have been made poorer, and uses this to make the
poor richer. In the case that the sum of deficit(z) over all nodes z is less than X, a mid-cycle
rent adjustment may be triggered, calculated so that step B will decrease AVDIFF. (I.e. we cut
rent on the rich, if the poor don't need their money to stay out of deficit.)
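The welfare step might be sketched as follows (illustrative; the function and names are assumptions):

```python
# Illustrative sketch of Step B (welfare): distribute the surplus X (+ e)
# among Atoms in deficit, proportionally to deficit, never pushing any Atom
# past the OUT == IN break-even point.
def welfare_step(in_flow, out_flow, X, e=0.0):
    deficit = {z: max(0.0, out_flow[z] - in_flow[z]) for z in in_flow}
    total = sum(deficit.values())
    if total == 0.0:
        return {z: 0.0 for z in in_flow}
    pool = X + e
    return {z: min(deficit[z], pool * deficit[z] / total) for z in in_flow}

grants = welfare_step({"a": 1.0, "b": 5.0}, {"a": 3.0, "b": 6.0}, X=1.5)
print(grants)  # {'a': 1.0, 'b': 0.5}
```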
Similarly, in each Variant 2 ECAN cycle, there is a wage-paying process, which involves the
wage-paying equation from Variant 1 followed by two steps. Step A: calculate Y, defined as the
positive part of the total amount by which AVDIFF has been increased via the overall wage
payment process. Step B: exert taxation based on the surplus Y as follows: For each Atom
z, calculate the positive part of IN - OUT, defined as surplus(z). Collect Y + e1 wealth from
all Atoms z, collecting from each node a percentage of Y that is proportional to surplus(z), but
never so much as to cause IN < OUT for any node (the new STI being collected counts toward
OUT).
In case the total of surplus(z) over all nodes z is less than Y, one may trigger a mid-cycle
wage adjustment, calculated so that step B will decrease AVDIFF. I.e. we cut wages since there
is not enough surplus to support them.
Finally, in the Variant 2 ECAN cycle, diffusion is done a little differently, via iterating the
following process: If AVDIFF has increased during the diffusion round so far, then choose
a random node whose diffusion would decrease AVDIFF, and let it diffuse; if AVDIFF has
decreased during the diffusion round so far, then choose a random node whose diffusion would
increase AVDIFF, and let it diffuse. In carrying out these steps, we avoid letting the same node
diffuse twice in the same round. This algorithm does not let all Atoms diffuse in each cycle, but
it stochastically lets a lot of diffusion happen in a way that keeps AVDIFF roughly constant. The
iteration may be modified to bias toward an average decrease in AVDIFF.
The random element in the diffusion step, together with the logic of the rent/welfare and
wage/tax steps, combines to yield the result that for Variant 2 of ECAN dynamics, AVDIFF is
a stochastic Lyapunov function. The details of the proof of this will be omitted but the outline
of the argument should be clear from the construction of Variant 2. And note that by setting the
e and e1 parameters to 0, the convergence requirement can be eliminated, allowing the network
to evolve more spontaneously as may be appropriate in some contexts; these parameters allow
one to explicitly adjust the convergence rate.
One may also derive results pertaining to the meaningfulness of the attractors, in various
special cases. For instance, if we have a memory consisting of a set M of m nodes, and we
imprint the memory on the ECAN by stimulating m nodes during an interval of time, then we
want to be able to show that the condition where precisely those m nodes are in the AF is a
fixed-point attractor. However, this is not difficult, because one need only show that if these m
nodes and none others are in the AF, this condition will persist.
23.6.2 ECAN as Associative Memory
We have carried out experiments gauging the performance of Variant 1 of ECAN as an as-
sociative memory, using the implementation of ECAN within CogPrime, and using both the
conventional and Storkey Hebbian updating formulas.
As with a Hopfield net memory, the memory capacity (defined as the number of memories
that can be retrieved from the network with high accuracy) depends on the sparsity of the
network, with denser networks leading to greater capacity. In the ECAN case the capacity also
depends on a variety of parameters of the ECAN equations, and the precise unraveling of these
dependencies is a subject of current research. However, one interesting dependency has already
been uncovered in our preliminary experimentation, which has to do with the size of the AF
versus the size of the memories being stored.
Define the size of a memory (a pattern being imprinted) as the number of nodes that are
stimulated during imprinting of that memory. In a classical Hopfield net experiment, the mean
size of a memory is usually around, say, .2-.5 of the number of neurons. In typical CogPrime
associative memory situations, we believe the mean size of a memory will be one or two orders
of magnitude smaller than that, so that each memory occupies only a relatively small portion
of the overall network.
What we have found is that the memory capacity of an ECAN is generally comparable to
that of a Hopfield net with the same number of nodes and links, if and only if the ECAN
parameters are tuned so that the memories being imprinted can fit into the AF. That is, the
AF threshold or (in the hyperbolic case) shape parameter must be tuned so that the size of
the memories is not so large that the active nodes in a memory cannot stably fit into the AF.
This tuning may be done adaptively by testing the impact of different threshold/shape values
on various memories of the appropriate size; or potentially a theoretical relationship between
these quantities could be derived, but this has not been done yet. This is a reasonably satisfying
result given the cognitive foundation of ECAN: in loose terms what it means is that ECAN
works best for remembering things that fit into its focus of attention during the imprinting
process.
23.7 Glocal Economic Attention Networks
In order to transform ordinary ECANs into glocal ECANs, one may proceed in essentially the
same manner as with glocal Hopfield nets as discussed in Chapter 13 of Part 1. In the language
normally used to describe CogPrime, this would be termed a "map encapsulation" heuristic.
As with glocal Hopfield nets, one may proceed most simply via creating a fixed pool of nodes
intended to provide locally-representative keys for the maps formed as attractors of the network.
Links may then be formed to these key nodes, with weights and STI and LTI values adapted
by the usual ECAN algorithms.
23.7.1 Experimental Explorations
To compare the performance of glocal ECANs with glocal Hopfield networks in a simple context,
we ran experiments using ECAN in the manner of a Hopfield network. That is, a number of
nodes take on the equivalent role of the neurons that are presented patterns to be stored. These
patterns are imprinted by setting the corresponding nodes of active bits to have their STI within
the AF, whereas nodes corresponding to inactive bits of the pattern are below the AF threshold.
Link weight updating then occurs, using one of several update rules; in this case the update
rule of [SV99] was used. Attention was spread using a diffusion approach by representing the
weights of Hebbian links between pattern nodes within a left stochastic Markov matrix, and
multiplying it by the vector of normalised STI values to give a vector representing the new
distribution of STI.
To explore the effects of key nodes on ECAN Hopfield networks, in [Goe08b] we used the
palimpsest testing scenario of [SV99], where all the local neighbours of the imprinted pattern,
within a single bit change, are tested. Each neighbouring pattern is used as input to try and
retrieve the original pattern. If all the retrieved patterns are the same as the original (within
a given tolerance) then the pattern is deemed successfully retrieved and recall of the previous
pattern is attempted via its neighbours. The number of patterns this can repeat for successfully
is called the palimpsest storage of the network.
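The palimpsest test harness can be sketched as follows (our illustration; the toy stub network stands in for an actual ECAN, purely to exercise the harness):

```python
# Illustrative sketch of the palimpsest test: after each imprint, walk back
# through earlier patterns, requiring that every one-bit-flipped cue still
# retrieves the original pattern (within tolerance).
def flip(pattern, i):
    p = list(pattern)
    p[i] = 1 - p[i]
    return p

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def palimpsest_storage(network, patterns, tolerance=0):
    best = 0
    for k in range(len(patterns)):
        network.imprint(patterns[k])
        recalled = 0
        for prev in reversed(patterns[:k + 1]):
            if all(hamming(network.retrieve(flip(prev, i)), prev) <= tolerance
                   for i in range(len(prev))):
                recalled += 1
            else:
                break
        best = max(best, recalled)
    return best

class NearestStub:
    # Toy stand-in with perfect nearest-pattern recall, for testing only.
    def __init__(self):
        self.mem = []
    def imprint(self, p):
        self.mem.append(list(p))
    def retrieve(self, cue):
        return min(self.mem, key=lambda m: hamming(m, cue))

print(palimpsest_storage(NearestStub(), [[0, 0, 0, 0], [1, 1, 1, 1]]))  # 2
```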
As an example, consider one simple experiment that was run with recollection of 10 x 10
pixel patterns (so, 100 nodes, each corresponding to a pixel in the grid), a Hebbian link density
of 30%, and with 1% of links being forgotten before each pattern is imprinted. The results
demonstrated that, when the mean palimpsest storage is calculated for each of 0, 1, 5 and 10
key nodes we find that the storage is 22.6, 22.4, 24.9, and 26.0 patterns respectively, indicating
that key nodes do improve memory recall on average.
23.8 Long-Term Importance and Forgetting
Now we turn to the forgetting process (carried out by the Forgetting MindAgent), which is
driven by LTI dynamics, but has its own properties as well.
Overall, the goal of the "forgetting" process is to maximize the total utility of the Atoms in
the AtomSpace throughout the future. The most basic heuristic toward this end is to remove
the Atoms with the lowest LTI, but this isn't the whole story. Clearly, the decision to remove
an Atom from RAM should depend on factors beyond just the LTI of the Atom. For example,
one should also take into account the expected difficulty in reconstituting the given Atom from
other Atoms. Suppose the system has the relations:
"dogs are animals"
"animals are cute"
"dogs are cute"
and the strength of the third relation is not dissimilar from what would be obtained by deduction
and revision from the first two relations and others in the system. Then, even if the system
judges it will be very useful to know dogs are cute in the future, it may reasonably choose to
remove dogs are cute from memory, anyway, because it knows it can be so easily reconstituted,
by a few inference steps for instance. Thus, as well as removing the lowest-LTI Atoms, the
Forgetting MindAgent should also remove Atoms meeting certain other criteria such as the
combination of:
• low STI
• easy reconstitutability in terms of other Atoms that have LTI not less than its own
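The combined criterion can be sketched as a simple selection routine for the Forgetting MindAgent. The Atom fields and thresholds below are illustrative assumptions, not the actual CogPrime data structures:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Atom:
    name: str
    sti: float                 # short-term importance
    lti: float                 # long-term importance
    # LTI values of the Atoms from which this one could be re-derived;
    # an empty list means no known way to reconstitute it
    reconstituters_lti: List[float] = field(default_factory=list)

def select_for_removal(atoms, lti_floor, sti_floor):
    """Atoms the Forgetting MindAgent may drop from RAM: the lowest-LTI
    Atoms, plus Atoms that are currently unimportant (low STI) and are
    easily reconstituted from Atoms with LTI not less than their own."""
    doomed = []
    for a in atoms:
        if a.lti < lti_floor:
            doomed.append(a)
        elif (a.sti < sti_floor and a.reconstituters_lti
              and min(a.reconstituters_lti) >= a.lti):
            doomed.append(a)
    return doomed
```

In the dogs-are-cute example, the "dogs are cute" Atom would carry the LTIs of the two relations it can be re-derived from, making it eligible for removal despite a healthy LTI of its own.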
23.9 Attention Allocation via Data Mining on the System Activity
Table
In this section we'll discuss an object called the System Activity Table, which contains a number
of subtables recording various activities carried out by the various objects in the CogPrime
system. These tables may be used for sophisticated attention allocation processes, according
to an approach in which importance values and HebbianLink weight values are calculated via
direct data mining of a centralized knowledge store (the System Activity Table). This approach
provides highly accurate attention allocation but at the cost of significant computational effort.
The System Activity Table is actually a set of tables, with multiple components. The precise
definition of the tables will surely be adapted based on experience as the work with CogPrime
progresses; what is described here is a reasonable first approximation.
First, there is a MindAgent Activity Table, which includes, for each MindAgent in the system,
a table such as Table 23.1 (in which the time-points recorded are the last T system cycles, and
the Atom-lists recorded are lists of Handles for Atoms).
System Cycle | Effort Spent | Memory Used | Atom Combo 1 Utilized | Atom Combo 2 Utilized   | ...
Now          | 3.3          | 4000        | Atom21, Atom44        | Atom44, Atom47, Atom345 | ...
Now-1        | 0.4          | 6079        | Atom123, Atom33       | Atom345                 | ...

Table 23.1: Example MindAgent Table
The MindAgent's activity table records, for that MindAgent and for each system cycle, which
Atom-sets were acted on by that MindAgent at that point in time.
Similarly, a table of this nature must be maintained for each Task-type, e.g. InferenceTask,
MOSESCategorizationTask, etc. The Task tables are used to estimate Effort values for various
Tasks, which are used in the procedure execution process. If it can be estimated how much
spatial and temporal resources a Task is likely to use, via comparison to a record of previous
similar tasks (in the Task table), then a MindAgent can decide whether it is appropriate to
carry out this Task (versus some other one, or versus some simpler process not requiring a Task)
at a given point in time, a process to be discussed in a later chapter.
In addition to the MindAgent and Task-type tables, it is convenient if tables are maintained
corresponding to various goals in the system (as shown in Table 23.2), including the Ubergoals
but also potentially derived goals of high importance.
For each goal, at minimum, the degree of achievement of the goal at a given time must be
recorded. Optionally, at each point in time, the degree of achievement of a goal relative to some
System Cycle | Total Achievement | Achievement for Atom44 | Achievement for set {Atom44, Atom233} | ...
Now          | .8                | .4                     | .5                                    | ...
Now-1        | .9                | .5                     | .55                                   | ...

Table 23.2: Example Goal Table
particular Atoms may be recorded. Typically the list of Atom-specific goal-achievements will
be short and will be different for different goals and different time points. Some goals may be
applied to specific Atoms or Atom sets, others may only be applied more generally.
The basic idea is that attention allocation and credit assignment may be effectively carried
out via datamining on these tables.
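As an illustration of the kind of mining intended, the sketch below accumulates Atom co-utilization counts from a toy MindAgent activity table and normalizes them into candidate HebbianLink weights. The class layout and field names are our assumptions, not the actual CogPrime schema:

```python
from collections import defaultdict
from itertools import combinations

class MindAgentActivityTable:
    """One ring-buffer table per MindAgent: for each system cycle,
    the effort spent, memory used, and Atom combos acted on
    (cf. Table 23.1)."""
    def __init__(self, max_cycles=1000):
        self.rows = []          # (cycle, effort, memory, [atom_combos])
        self.max_cycles = max_cycles

    def record(self, cycle, effort, memory, atom_combos):
        self.rows.append((cycle, effort, memory, atom_combos))
        if len(self.rows) > self.max_cycles:
            self.rows.pop(0)

def mine_cooccurrence(table):
    """Count how often two Atoms appear in the same utilized combo;
    the normalized counts can seed HebbianLink weights."""
    counts = defaultdict(int)
    for _, _, _, combos in table.rows:
        for combo in combos:
            for a, b in combinations(sorted(combo), 2):
                counts[(a, b)] += 1
    total = sum(counts.values()) or 1
    return {pair: c / total for pair, c in counts.items()}
```

The same pattern extends to Task and Goal tables: any statistic expressible over rows of these tables can be mined into importance or Hebbian weight updates.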
23.10 Schema Credit Assignment
And how do we apply a similar approach to clarifying the semantics of schema credit assignment?
From the above-described System Activity Tables, one can derive information of the form
Achieve(G,E,T) "Goal G was achieved to extent E at time T"
which may be grounded as, for example:
Similarity
    E
    ExOut
        GetTruthValue
        Evaluation
            atTime
            HypLink G
and more refined versions such as
Achieve(G,E,T,A,P) "Goal G was achieved to extent E using
                    Atoms A (with parameters P) at time T"

Enact(S,I,$T_1$,O,$T_2$) "Schema S was enacted on inputs I
                          at time $T_1$, producing outputs O
                          at time $T_2$"
The problem of schema credit assignment is then, in essence: Given a goal G and a distribution
of times D, figure out what schema to enact in order to cause G's achievement at some time in
the future, where the desirability of times is weighted by D.
The basic method used is the learning of predicates of the form
ImplicationLink
    F(C, P_1, ..., P_n)
    G
where
• the P_i are Enact() statements in which the T_1 and T_2 are variable, and the S, I and O may
  be concrete or variable
• C is a predicate representing a context
• G is an Achieve() statement, whose arguments may be concrete or abstract
• F is a Boolean function
Typically, the variable expressions in the T_1 and T_2 positions will be of the form T + offset,
where offset is a constant value and T is a time value representing the time of inception of the
whole compound schema. T may then be defined as T_0 - offset_0, where offset_0 is a constant
value and T_0 is a variable denoting the time of achievement of the goal.
In CogPrime, these predicates may be learned by a combination of statistical pattern mining,
PLN inference and MOSES or hill-climbing procedure learning.
The choice of what action to take at a given point in time is then a probabilistic decision.
Based on the time-distribution D given, the system will know a certain number of expressions
C = F(C, P_1, ..., P_n) of the type described above. Each of these will be involved in an Implica-
tionLink with a certain estimated strength. It may select the "compound schema" C with the
highest strength.
One might think to introduce other criteria here, e.g. to choose the schema with the highest
strength but the lowest cost of execution. However, it seems better to include all pertinent
criteria in the goal, so that if one wants to consider cost of execution, one assumes the existence
of a goal that incorporates cost of execution (which may be measured in multiple ways, of
course) as part of its internal evaluation function.
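The probabilistic selection step above can be sketched as a softmax over ImplicationLink strengths, which approaches "pick the highest-strength compound schema" as the temperature goes to zero. The function and its parameters are illustrative assumptions, not CogPrime API:

```python
import math
import random

def choose_compound_schema(candidates, temperature=0.1, rng=None):
    """candidates: list of (schema, strength) pairs, where strength is
    the estimated strength of the ImplicationLink from the compound
    schema to the goal.  Samples a schema with probability proportional
    to exp(strength / temperature); as temperature -> 0 this approaches
    simply selecting the highest-strength compound schema."""
    rng = rng or random.Random(0)
    weights = [math.exp(s / temperature) for _, s in candidates]
    r = rng.random() * sum(weights)
    acc = 0.0
    for (schema, _), w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return schema
    return candidates[-1][0]
```

Per the discussion above, execution cost does not appear here as a separate criterion; a goal that cares about cost should fold it into the strengths being compared.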
Another issue that arises is whether to execute multiple C simultaneously. In many cases
this won't be possible because two different C's will contradict each other. It seems simplest
to assume that C's that can be fused together into a single plan of action, are presented to
the schema execution process as a single fused C. In other words, the fusion is done during the
schema learning process rather than the execution process.
A question emerges regarding how this process deals with false causality, e.g. with a schema
that, due to the existence of a common cause, often happens to occur immediately prior to the
occurrence of a given goal. For instance, roosters crowing often occurs prior to the sun rising.
This matter is discussed in more depth in the PLN book and The Hidden Pattern; but in brief,
the answer is: In the current approach, if the system believes that roosters crowing often causes
the sun to rise, then if it wants to cause the sun to rise, it may well cause a rooster to crow. Once this fails,
then the system will no longer hold the false belief, and afterwards will choose a different course
of action. Furthermore, if it holds background knowledge indicating that roosters crowing is
not likely to cause the sun to rise, then this background knowledge will be invoked by inference
to discount the strength of the ImplicationLink pointing from rooster-crowing to sun-rising, so
that the link will never be strong enough to guide schema execution in the first place.
The problem of credit assignment thus becomes a problem of creating appropriate heuristics
to guide inference of ImplicationLinks of the form described above. Assignment of credit is then
implicit in the calculation of truth values for these links. The difficulty is that the predicates F
involved may be large and complex.
23.11 Interaction between ECANs and other CogPrime Components
We have described above a number of interactions between attention allocation and other
aspects of CogPrime; in this section we gather a few comments on these interactions, and some
additional ones.
23.11.1 Use of PLN and Procedure Learning to Help ECAN
MOSES or hillclimbing may be used to help mine the SystemActivityTable for patterns of
usefulness, and create HebbianLinks reflecting these patterns.
PLN inference may be carried out on HebbianLinks by treating (HebbianLink A B) as a
virtual predicate evaluation relationship, i.e. as
EvaluationLink HebbianPredicate (A, B)
PLN inference on HebbianLinks may then be used to update node importance values, because
node importance values are essentially node probabilities corresponding to HebbianLinks. And
similarly, MindAgent-relative node importance values are node probabilities corresponding to
MindAgent-relative HebbianLinks.
Note that conceptually, the nature of this application of PLN is different from most other
uses of PLN in CogPrime. Here, the purpose of PLN is not to draw conclusions about the
outside world, but rather about what the system should focus its resources on in what context.
PLN, used in this context, effectively constitutes a nonlinear-dynamical iteration governing the
flow of attention through the CogPrime system.
Finally, inference on HebbianLinks leads to the emergence of maps, via the recognition of
clusters in the graph of HebbianLinks.
23.11.2 Use of ECAN to Help Other Cognitive Processes
First of all, associative-memory functionality is directly important in CogPrime because it is
used to drive concept creation. The CogPrime heuristic called "map formation" creates new
Nodes corresponding to prominent attractors in the ECAN, a step that (according to our
preliminary results) not only increases the memory capacity of the network beyond what can
be achieved with a pure ECAN but also enables attractors to be explicitly manipulated by PLN
inference.
Equally important to associative memory is the capability of ECANs to facilitate effective
allocation of the attention of other cognitive processes to appropriate knowledge items (Atoms).
For example, one key role of ECANs in CogPrime is to guide the forward and backward chaining
processes of PLN (Probabilistic Logic Network) inference. At each step, the PLN inference
chainer is faced with a great number of inference steps (branches) from which to choose; and
a choice is made using a statistical "bandit problem" mechanism that selects each possible
inference step with a probability proportional to its expected "desirability." In this context,
there is considerable appeal in the heuristic of weighting inference steps using probabilities
proportional to the STI values of the Atoms they contain. One thus arrives at a combined
PLN/ECAN dynamic as follows:
1. An inference step is carried out, involving a choice among multiple possible inference steps,
which is made using STI-based weightings (and made among Atoms that LTI weightings
have deemed valuable enough to remain in RAM)
2. The Atoms involved in the inference step are rewarded with STI and LTI proportionally
to the utility of the inference step (how much it increases the confidence of Atoms in the
system's memory)
3. The ECAN operates, and multiple Atoms' importance values are updated
4. Return to Step 1 if the inference isn't finished
An analogous interplay may occur between ECANs and MOSES.
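The four-step PLN/ECAN loop above can be sketched in simplified form: STI-weighted roulette selection of inference steps, STI reward proportional to step utility, and a uniform decay standing in for the full ECAN rent/wage dynamics. All names and constants below are illustrative:

```python
import random

def pln_ecan_loop(branches, atoms_sti, steps=10, decay=0.95, rng=None):
    """branches: dict mapping an inference-step name to a pair
    (atoms_used, utility).  Each round: (1) choose a step with
    probability proportional to the summed STI of its Atoms,
    (2) reward those Atoms with STI proportional to the step's
    utility, (3) apply a uniform decay standing in for ECAN rent."""
    rng = rng or random.Random(0)
    history = []
    for _ in range(steps):
        names = list(branches)
        weights = [sum(atoms_sti[a] for a in branches[n][0]) + 1e-9
                   for n in names]
        r = rng.random() * sum(weights)
        acc, chosen = 0.0, names[-1]
        for n, w in zip(names, weights):
            acc += w
            if r <= acc:
                chosen = n
                break
        atoms_used, utility = branches[chosen]
        for a in atoms_used:
            atoms_sti[a] += utility          # step 2: STI reward
        for a in atoms_sti:
            atoms_sti[a] *= decay            # step 3: ECAN decay
        history.append(chosen)
    return history
```

Atoms involved in useful inference steps accumulate STI and are therefore sampled more often in later steps, which is the positive-feedback dynamic described in the text.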
It seems intuitively clear that the same attractor-convergence properties highlighted in the
above analysis of associative-memory behavior, will also be highly valuable for the application
of ECANs to attention allocation. If a collection of Atoms is often collectively useful for some
cognitive process (such as PLN), then the associative-memory-type behavior of ECANs means
that once a handful of the Atoms in the collection are found useful in a certain inference
process, the other Atoms in the collection will get their STI significantly boosted, and will be
likely to get chosen in subsequent portions of that same inference process. This is exactly the
sort of dynamics one would like to see occur. Systematic experimentation with these interactions
between ECAN and other CogPrime processes is one of our research priorities going forwards.
23.12 MindAgent Importance and Scheduling
So far we have discussed economic transactions between Atoms and Atoms, and between Atoms
and Units. MindAgents have played an indirect role, via spreading stimulation to Atoms which
causes them to get paid wages by the Unit. Now it is time to discuss the explicit role of
MindAgents in economic transactions. This has to do with the integration of economic attention
allocation with the Scheduler that schedules the core MindAgents involved in the basic cognitive
cycle.
This integration may be done in many ways, but one simple approach is:
1. When a MindAgent utilizes an Atom, this results in sending stimulus to that Atom. (Note
that we don't want to make MindAgents pay for using Atoms individually; that would
penalize MA's that use more Atoms, which doesn't really make much sense.)
2. MindAgents then get currency from the Lobe (as defined in Chapter 19) periodically, and
get extra currency based on usefulness for goal achievement as determined by the credit
assignment process. The Scheduler then gives more processor time to MindAgents with
more STI.
3. However, any MindAgent with LTI above a certain minimum threshold will get some min-
imum amount of processor time (i.e. get scheduled at least once each N cycles).
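A minimal sketch of such a scheduler, under the assumption that processor quanta are allotted proportionally to STI with an LTI-based floor; the function and parameter names are ours, not CogPrime's:

```python
def schedule_quanta(mindagents, total_quanta=100, lti_threshold=1.0):
    """mindagents: list of (name, sti, lti) triples.  Processor quanta
    are allotted proportionally to STI, but any MindAgent whose LTI is
    above the threshold is guaranteed at least one quantum, i.e. it is
    scheduled at least once per round of cycles."""
    total_sti = sum(sti for _, sti, _ in mindagents) or 1.0
    quanta = {name: int(total_quanta * sti / total_sti)
              for name, sti, _ in mindagents}
    for name, _, lti in mindagents:
        if lti >= lti_threshold:
            quanta[name] = max(quanta[name], 1)
    return quanta
```

A high-STI MindAgent dominates the schedule, while a currently quiescent but long-term-important one still runs occasionally rather than starving.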
As a final note: In a multi-Lobe Unit, the Unit may use the different LTI values of MA's in
different Lobes to control the distribution of MA's among Lobes: e.g. a very important (LTI)
MA might get cloned across multiple Lobes.
23.13 Information Geometry for Attention Allocation
Appendix ?? outlines some very broad ideas regarding the potential utilization of information
geometry and related ideas for modeling cognition. In this section, we present some more con-
crete and detailed experiments inspired by the same line of thinking. We model CogPrime's
Economic Attention Networks (ECAN) component using information geometric language, and
then use this model to propose a novel information geometric method of updating ECAN net-
works (based on an extension of Amari's ANGL algorithm). Tests on small networks suggest
that information-geometric methods have the potential to vastly improve ECAN's capability to
shift attention from current preoccupations to desired preoccupations. However, there is a high
computational cost associated with the simplest implementations of these methods, which has
prevented us from carrying out large-scale experiments so far. We are exploring the possibility
of circumventing these issues via using sparse matrix algorithms on GPUs.
23.13.1 Brief Review of Information Geometry
"Information geometry" is a branch of applied mathematics concerned with the application of
differential geometry to spaces of probability distributions. In [GI11] we have suggested some
extensions to traditional information geometry aimed at allowing it to better model general
intelligence. However for the concrete technical work in this Chapter, the traditional formulation
of information geometry will suffice.
One of the core mathematical constructs underlying information geometry is the Fisher
Information, a statistical quantity which has a variety of applications ranging far beyond
statistical data analysis, including physics [Fri98], psychology and AI [AW]. Put simply, FI
is a formal way of measuring the amount of information that an observable random variable
X carries about an unknown parameter $\theta$ upon which the probability of X depends. FI forms
the basis of the Fisher-Rao metric, which has been proved the only Riemannian metric on
the space of probability distributions satisfying certain natural properties regarding invariance
with respect to coordinate transformations. Typically $\theta$ in the FI is considered to be a real
multidimensional vector; however, [Dab00] has presented an FI variant that imposes basically
no restrictions on the form of $\theta$. Here the multidimensional FI will suffice, but the more gen-
eral version is needed if one wishes to apply FI to AGI more broadly, e.g. to declarative and
procedural as well as attentional knowledge.
In the set-up underlying the definition of the ordinary finite-dimensional Fisher information,
the probability function for X, which is also the likelihood function for $\theta \in \mathbb{R}^n$, is a function
$f(X;\theta)$; it is the probability mass (or probability density) of the random variable X conditional
on the value of $\theta$. The partial derivative with respect to $\theta_i$ of the log of the likelihood function
is called the score with respect to $\theta_i$. Under certain regularity conditions, it can be shown that
the first moment of the score is 0. The second moment is the Fisher information:

$$\mathcal{I}(\theta)_i = E\left[\left(\frac{\partial}{\partial \theta_i} \ln f(X;\theta)\right)^{2} \,\Big|\, \theta\right]$$

where, for any given value of $\theta_i$, the expression $E[\,\cdot\,|\theta]$ denotes the conditional expectation over
values for X with respect to the probability function $f(X;\theta)$ given $\theta$. Note that $0 \leq \mathcal{I}(\theta)_i < \infty$.
Also note that, in the usual case where the expectation of the score is zero, the Fisher
information is also the variance of the score.
One can also look at the whole Fisher information matrix

$$\mathcal{I}(\theta)_{ij} = E\left[\frac{\partial \ln f(X;\theta)}{\partial \theta_i}\,\frac{\partial \ln f(X;\theta)}{\partial \theta_j}\,\Big|\,\theta\right]$$

which may be interpreted as a metric $g_{ij}$, which provably is the only "intrinsic" metric on prob-
ability distribution space. In this notation we have $\mathcal{I}(\theta)_i = \mathcal{I}(\theta)_{i,i}$.
Dabak [Dab99] has shown that the geodesic between two parameter vectors $\theta$ and $\theta'$ is given
by the exponentially weighted curve

$$(\gamma(t))(x) = \frac{f(x;\theta)^{1-t}\, f(x;\theta')^{t}}{\int f(y;\theta)^{1-t}\, f(y;\theta')^{t}\, dy},$$

under the weak condition that the log-likelihood ratios with respect to $f(X;\theta)$ and $f(X;\theta')$
are finite. Also, along this sort of curve, the sum of the Kullback-Leibler distances between $\theta$
and $\theta'$, known as the J-divergence, equals the integral of the Fisher information along the
geodesic connecting $\theta$ and $\theta'$. This suggests that if one is attempting to learn a certain
parameter vector based on data, and one has a certain other parameter vector as an initial
value, it may make sense to use algorithms that try to follow the Fisher-Rao geodesic between
the initial condition and the desired conclusion. This is what Amari [Ama85, AN00] calls
"natural gradient" based learning, a conceptually powerful approach which subtly accounts for
dependencies between the components of $\theta$.
23.13.2 Information-Geometric Learning for Recurrent Networks:
Extending the ANGL Algorithm
Now we move on to discuss the practicalities of information-geometric learning within Cog-
Prime's ECAN component. As noted above, Amari [Ama85, AN00] introduced the natural
gradient as a generalization of the direction of steepest descent on the space of loss functions of
the parameter space. Issues with the original implementation include the requirement of calcu-
lating both the Fisher information matrix and its inverse. To resolve these and other practical
considerations, Amari [Ama98] proposed an adaptive version of the algorithm, the Adaptive
Natural Gradient Learning (ANGL) algorithm. Park, Amari, and Fukumizu [PAF00] extended
ANGL to a variety of stochastic models including stochastic neural networks, multi-dimensional
regression, and classification problems.
In particular, they showed that, assuming a particular form of stochastic feedforward neu-
ral network and under a specific set of assumptions concerning the form of the probability
distributions involved, a version of the Fisher information matrix can be written as

$$G(\theta) = E_x\left[\, E_\xi\!\left[\left(\frac{r'(\xi)}{r(\xi)}\right)^{2}\right] (\nabla H)(\nabla H)^{T}\right].$$
Although Park et al. considered only feedforward neural networks, their result also holds
for more general neural networks, including the ECAN network. What is important is the
decomposition of the probability distribution as

$$p(y|x;\theta) = \prod_{i=1}^{L} r\left(y_i - H_i(x;\theta)\right)$$
where

$$y = H(x;\theta) + \xi, \quad y = (y_1, \ldots, y_L)^T, \quad H = (H_1, \ldots, H_L)^T, \quad \xi = (\xi_1, \ldots, \xi_L)^T,$$

where $\xi$ is added noise. If we assume further that each $\xi_i$ has the same form as a Gaussian
distribution with zero mean and standard deviation $\sigma$, then the Fisher information matrix
simplifies further to

$$G(\theta) = \frac{1}{\sigma^2}\, E_x\!\left[(\nabla H)(\nabla H)^{T}\right].$$
The adaptive estimate for $\hat{G}_{t+1}^{-1}$ is given by

$$\hat{G}_{t+1}^{-1} = (1+\epsilon_t)\,\hat{G}_t^{-1} - \epsilon_t\,(\hat{G}_t^{-1}\nabla H)(\hat{G}_t^{-1}\nabla H)^{T},$$
and the loss function for our model takes the form

$$\ell(x, y; \theta) = -\sum_{i=1}^{L} \log r\left(y_i - H_i(x,\theta)\right).$$
The learning algorithm for our connection matrix weights $B$ is then given by

$$B_{t+1} = B_t - \eta_t\, \hat{G}_t^{-1}\, \nabla \ell(x_t, y_t; B_t).$$
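Under the Gaussian-noise simplification, one ANGL step can be sketched as follows for a scalar-output model (so that $\nabla H$ is a vector). This is our illustrative reading of the update equations as reconstructed above, not the actual CogPrime implementation, and all names are ours:

```python
import numpy as np

def angl_step(B, G_inv, x, y, H, grad_H, eps=0.01, lr=0.05, sigma=1.0):
    """One adaptive natural-gradient step for a scalar model
    y = H(x; B) + Gaussian noise.  G_inv is the running estimate of
    the inverse Fisher matrix; grad_H(x, B) returns the gradient of H
    with respect to the weights B, as a flat vector."""
    v = grad_H(x, B)
    Gv = G_inv @ v
    # adaptive inverse-Fisher update:
    #   G^-1 <- (1 + eps) G^-1 - eps (G^-1 grad_H)(G^-1 grad_H)^T
    G_inv = (1 + eps) * G_inv - eps * np.outer(Gv, Gv)
    # gradient of the Gaussian negative log-likelihood loss
    grad_loss = -((y - H(x, B)) / sigma**2) * v
    # natural-gradient weight update: B <- B - lr * G^-1 grad_loss
    return B - lr * (G_inv @ grad_loss), G_inv
```

Iterating this step on data from a simple linear model drives B toward the true weights, with G_inv tracking the inverse of the (scaled) expected outer product of gradients.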
23.13.3 Information Geometry for Economic Attention Allocation: A
Detailed Example
Fig. 23.3: Results from Experiment 1 (sum of squares of errors versus number of nodes)
We now present the results of a series of small-scale, exploratory experiments comparing
the original ECAN process running alone with the ECAN process coupled with ANGL. We
are interested in determining which of these two lines of processing results in focusing attention
more accurately.
The experiment started with base patterns of various sizes to be determined by the two
algorithms. In the training stage, noise was added, generating a number of instances of noisy
base patterns. The learning goal is to identify the underlying base patterns from the noisy
patterns as this will identify how well the different algorithms can focus attention on relevant
versus irrelevant nodes.
Next, the ECAN process was run, resulting in the determination of the connection matrix
C. In order to apply the ANGL algorithm, we need the gradient, VH, of the ECAN training
process, with respect to the input x. While calculating the connection matrix C, we used Monte
Carlo simulation to simultaneously calculate an approximation to VH.
Fig. 23.4: Results from Experiment 2 (sum of squares of errors versus training noise, 16 nodes; curves for ECAN and ECAN+ANGL)
After ECAN training was completed, we bifurcated the experiment. In one branch, we ran
fuzzed cue patterns through the retrieval process. In the other, we first applied the ANGL
algorithm, optimizing the weights in the connection matrix, prior to running the retrieval process
on the same fuzzed cue patterns. At a constant value of $\sigma = 0.8$ we ran several samples through
each branch with pattern sizes of 4 x 4, 7 x 7, 10 x 10, 15 x 15, and 20 x 20. The results are
shown in Figure 23.3. We also ran several experiments comparing the sum of squares of the
errors to the input training noise as measured by the value of $\sigma$; see Figures 23.4 and ??.
These results suggest two major advantages of the ECAN+ANGL combination compared
to ECAN alone. Not only was the performance of the combination better in every trial, save
for one involving a small number of nodes and little noise, but the combination clearly scales
significantly better both as the number of nodes increases, and as the training noise increases.
Chapter 24
Economic Goal and Action Selection
24.1 Introduction
A significant portion of CogPrime's dynamics is explicitly goal-driven - that is, based on trying
(inasmuch as possible within the available resources) to figure out which actions will best help
the system achieve its goals, given the current context. A key aspect of this explicit activity is
guided by the process of "goal and action selection" - prioritizing goals, and then prioritizing
actions based on these goals. We have already outlined the high-level process of action selection,
in Chapter 22 above. Now we dig into the specifics of the process, showing how action selection is
dynamically entwined with goal prioritization, and how both processes are guided by economic
attention allocation as described in Chapter 23.
While the basic structure of CogPrime's action selection aspect is fairly similar to MicroPsi
(due to the common foundation in Dorner's Psi model), the dynamics are less similar. MicroPsi's
dynamics are a little closer to being a formal neural net model, whereas ECAN's economic
foundation tends to push it in different directions. The CogPrime goal and action selection
design involves some simple simulated financial mechanisms, building on the economic metaphor
of ECAN, that are different from, and more complex than, anything in MicroPsi.
The main actors (apart from the usual ones like the AtomTable, economic attention alloca-
tion, etc.) in the tale to be told here are as follows:
• Structures:
- UbergoalPool
- ActiveSchemaPool
• MindAgents:
- GoalBasedSchemaSelection
- GoalBasedSchemaLearning
- GoalAttentionAllocation
- FeasibilityUpdating
- SchemaActivation
The Ubergoal Pool contains the Atoms that the system considers as top-level goals. These
goals must be treated specially by attention allocation: they must be given funding by the Unit
so that they can use it to pay for getting themselves achieved. The weighting among different
top-level goals is achieved via giving them different amounts of currency. STICurrency is the key
kind here, but of course ubergoals must also get some LTICurrency so they won't be forgotten.
(Inadvertently deleting your top-level supergoals from memory is generally considered to be a
bad thing ... it's in a sense a sort of suicide...)
24.2 Transfer of STI "Requests for Services" Between Goals
Transfer of "attentional funds" from goals to subgoals, and schema modules to other schema
modules in the same schema, take place via a mechanism of promises of funding (or 'requests
for service,' to be called 'RFS's' from here on). This mechanism relies upon and interacts
with ordinary economic attention allocation but also has special properties. Note that we will
sometimes say that an Atom "issues" an RFS or "transfers" currency while what we really mean
is that some MindAgent working on that Atom issues an RFS or transfers currency.
The logic of these RFS's is as follows. If agent A issues an RFS of value x to agent B, then
1. When B judges it appropriate, B may redeem the note and ask A to transfer currency of
value x to B.
2. A may withdraw the note from B at any time.
(There is also a little more complexity here, in that we will shortly introduce the notion of
RFS's whose value is defined by a set of constraints. But this complexity does not contradict
the two above points.) The total value of the RFS's possessed by an Atom may be referred
to as its "promise."
A rough schematic depiction of this RFS process is given in Figure 24.1.
Now we explain how RFS's may be passed between goals. Given two predicates A and B,
if A is being considered as a goal, then B may be considered as a subgoal of A (and A the
supergoal of B) if there exists a Link of the form
Predictivelmplication B A
I.e., achieving B may help to achieve A. Of course, the strength of this link and the temporal
characteristics of this link are important in terms of quantifying how strongly and how usefully
B is a subgoal of A.
Supergoals (not only top-level ones, aka ubergoals) allocate RFS's to subgoals as follows.
Supergoal A may issue a RFS to subgoal B if it is judged that achievement (i.e., predicate
satisfaction) of B implies achievement of A. This may proceed recursively: subgoals may allocate
RFS's to subsubgoals according to the same justification.
Unlike actual currency, RFS's are not conserved. However, the actual payment of real cur-
rency upon redemption of RFS's obeys the conservation of real currency. This means that agents
need to be responsible in issuing and withdrawing RFS's. In practice this may be ensured by
having agents follow a couple simple rules in this regard.
1. If B and C are two alternatives for achieving A, and A has x units of currency, then A may
promise both B and C x units of currency. Whoever asks for a redemption of the promise
first, will get the money, and then the promise will be rescinded from the other one.
2. On the other hand, if the achievement of A requires both B and C to be achieved, then B
and C may be granted RFS's that are defined by constraints. If A has x units of currency,
then B and C receive an RFS tagged with the constraint (B+C<=x). This means that in
Fig. 24.1: The RFS Propagation Process. An illustration of the process via which RFS's
propagate from goals to abstract procedures, and finally must get cashed out to pay for the
execution of actual concrete procedures that are estimated relatively likely to lead to goal
fulfillment.
order to redeem the note, either one of B or C must confer with the other one, so that they
can simultaneously request constraint-consistent amounts of money from A.
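Rule 1 (alternative subgoals each promised the full amount, first redeemer wins) can be sketched as follows; Rule 2's constraint-tagged notes are omitted, and the class and method names are illustrative, not CogPrime API:

```python
class Goal:
    """Toy holder of STICurrency that can issue RFS's (requests for
    service) to alternative subgoals, per Rule 1."""
    def __init__(self, name, sti_currency=0.0):
        self.name = name
        self.sti_currency = sti_currency

    def issue_alternative_rfs(self, subgoals, amount):
        """Each alternative subgoal is promised the full amount; RFS's
        are not conserved, only real currency is."""
        return {"amount": amount, "holders": list(subgoals),
                "spent": False}

    def redeem(self, note, subgoal):
        """First redeemer gets the cash; the promise is then rescinded
        for the other holders, conserving real currency."""
        if note["spent"] or subgoal not in note["holders"]:
            return 0.0
        paid = min(note["amount"], self.sti_currency)
        self.sti_currency -= paid
        note["spent"] = True
        return paid
```

Both alternatives hold a promise of the full amount, yet only one redemption is ever paid, so real currency is conserved while promise is not.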
As an example of the role of constraints, consider the goal of playing fetch successfully (a
subgoal of "get reward"). Then suppose it is learned that:
ImplicationLink
    SequentialAND
        get_ball
        deliver_ball
    play_fetch
where SequentialAND A B is the conjunction of A and B but with B occurring after A in time.
Then, if play_fetch has $10 in STICurrency, it may know it has $10 to spend on a combination
of get_ball and deliver_ball. In this case both get_ball and deliver_ball would be given RFS's
labeled with the constraint:

RFS.get_ball + RFS.deliver_ball <= $10
The issuance of RFS's embodying constraints is different from (and generally carried out
prior to) the evaluation of whether the constraints can be fulfilled.
An ubergoal may rescind offers of reward for service at any time. And, generally, if a subgoal
gets achieved and has not spent all the money it needed, the supergoal will not offer any more
funding to the subgoal (until/unless it needs that subgoal achieved again).
As there are no ultimate sources of RFS in OCP besides ubergoals, promise may be considered
as a measure of "goal-related importance."
Transfer of RFS's among Atoms is carried out by the GoalAttentionAllocation MindAgent.
24.3 Feasibility Structures
Next, there is a numerical data structure associated with goal Atoms, which is called the
feasibility structure. The feasibility structure of an Atom G indicates the feasibility of achieving
G as a goal using various amounts of effort. It contains triples of the form (t,p,E) indicating
the truth value t of achieving goal G to degree p using effort E. Feasibility structures must be
updated periodically, via scanning the links coming into an Atom G; this may be done by a
FeasibilityUpdating MindAgent. Feasibility may be calculated for any Atom G for which there
are links of the form:
Implication
    Execution S
    G
for some S. Once a schema has actually been executed on various inputs, its cost of execution
on other inputs may be empirically estimated. But this is not the only case in which feasibility
may be estimated. For example, if goal G inherits from goal G1, and most children (e.g. subgoals)
of G1 are achievable with a certain feasibility, then probably G is achievable with a similar
feasibility as well. This allows feasibility estimation even in cases where no plan for achieving G
yet exists, e.g. if the plan can be produced via predicate schematization, but such schematization
has not yet been carried out.
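The feasibility structure and its inheritance-based fallback might be sketched as follows. All names here are hypothetical, and the averaging rule is our assumption of one simple way to implement the "similar feasibility" heuristic just described:

```python
# Hypothetical sketch of a feasibility structure: (t, p, E) triples attached
# to a goal, meaning "truth value t of achieving the goal to degree p using
# effort E". When a goal has no direct data, fall back to averaging the
# triples of its sibling goals under a shared parent goal.

def estimate_feasibility(goal, structures, parent_of):
    if goal in structures:               # direct, empirically grounded data
        return structures[goal]
    parent = parent_of.get(goal)
    siblings = [g for g, p in parent_of.items() if p == parent and g != goal]
    triples = [tr for g in siblings for tr in structures.get(g, [])]
    if parent is None or not triples:
        return []                        # nothing to generalize from
    n = len(triples)
    return [tuple(sum(tr[i] for tr in triples) / n for i in range(3))]

structures = {"fetch_stick": [(0.8, 0.9, 5.0)], "fetch_frisbee": [(0.6, 0.7, 7.0)]}
parent_of = {"fetch_stick": "fetch", "fetch_frisbee": "fetch", "fetch_ball": "fetch"}
estimate = estimate_feasibility("fetch_ball", structures, parent_of)
# fetch_ball has no direct data, so it inherits its siblings' average triple
```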
Feasibility then connects with importance as follows. Important goals will get more STICurrency
to spend, and thus will be able to spawn more costly schemata. So, the GoalBasedSchemaSelection
MindAgent, when choosing which schemata to push into the ActiveSchemaPool, will be
able to choose more costly schemata corresponding to goals with more STICurrency to spend.
24.4 Goal Based Schema Selection
Next, the GoalBasedSchemaSelection (GBSS) MindAgent selects schemata to be placed into the
ActiveSchemaPool. It does this by choosing goals G, and then choosing schemata that are alleged
to be useful for achieving these goals. It chooses goals via a fitness function that combines
promise and feasibility. This involves solving an optimization problem: figuring out how to
maximize the odds of getting a lot of goal-important stuff done within the available amount
of (memory and space) effort. Potentially this optimization problem can get quite subtle, but
initially some simple heuristics are satisfactory. (One subtlety involves handling dependencies
between goals, as represented by constraint-bearing RFS's.)
Given a goal, the GBSS MindAgent chooses a schema to achieve that goal via the heuristic
of selecting the one that maximizes a fitness function balancing the estimated effort required
to achieve the goal via executing the schema, with the estimated probability that executing the
schema will cause the goal to be achieved.
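A minimal version of such a fitness function might look like this. The particular form, success probability divided by effort and gated by the goal's budget, is our assumption, chosen only to illustrate the balance described above:

```python
# Toy fitness heuristic in the spirit of the text: balance estimated success
# probability against estimated effort, with the goal's STICurrency acting
# as a hard budget. The functional form is an illustrative assumption.

def schema_fitness(p_success, effort, sti_currency):
    """Favor schemata likely to achieve the goal cheaply; a schema whose
    effort exceeds the goal's budget is infeasible (fitness 0)."""
    if effort <= 0 or effort > sti_currency:
        return 0.0
    return p_success / effort

# Two candidate schemata: (estimated success probability, estimated effort)
candidates = {"fetch_plan_a": (0.9, 8.0), "fetch_plan_b": (0.6, 2.0)}
budget = 10.0
best = max(candidates, key=lambda s: schema_fitness(*candidates[s], budget))
# plan_b wins: 0.6/2.0 = 0.3 beats 0.9/8.0 = 0.1125
```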
When searching for schemata to achieve G, and estimating their effort, one factor to be taken
into account is the set of schemata already in the ActiveSchemaPool. Some schemata S may
simultaneously achieve two goals; or two schemata achieving different goals may have significant
overlap of modules. In this case G may be able to get achieved using very little or no effort
(no additional effort. if there is already a schema S in the ActiveSchemaPool that is going to
cause G to be achieved). But if G "decides" it can be achieved via a schema S already in the
ActiveSchemaPool, then it should still notify the ActiveSchemaPool of this, so that G can be
added to S's index (see below). If the other goal G1 that placed S in the ActiveSchemaPool
decides to withdraw S, then S may need to hit up G for money, in order to keep itself in the
ActiveSchemaPool with enough funds to actually execute.
24.4.1 A Game-Theoretic Approach to Action Selection
Min Jiang has observed that, mathematically, the problem of action selection (represented in
CogPrime as the problem of goal-based schema selection) can be modeled in terms of game
theory, as follows:
• the intelligent agent is one player, the world is another player
• the agent's model of the world lets it make probabilistic predictions of how the world may
respond to what the agent does (i.e. to estimate what mixed strategy the world is following,
considering the world as a game player)
• the agent itself chooses schemata probabilistically, so it's also following a mixed strategy
• so, in principle the agent can choose schemata that it thinks will lead to a mixed Nash equilibrium
But the world's responses are very high-dimensional, which means that finding a mixed Nash
equilibrium even approximately is a very hard computational problem. Thus, in a sense, the crux
of the problem seems to come down to feature identification. If the world's response (real or
predicted) can be represented as a low-dimensional set of features, then these features can be
considered as the world's "move" in the game ... and the game theory problem becomes tractable
via approximation schemes. But without the reduction of the world to a low-dimensional set of
features, finding the mixed Nash equilibrium even approximately will not be computationally
tractable...
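The framing above can be made concrete with a toy best-response computation over a low-dimensional feature set. The payoff numbers and move names are invented for illustration; a real CogPrime agent would estimate both the world's mixed strategy and the payoffs from experience:

```python
# Illustrative sketch of the game-theoretic framing: the world's "move" is a
# low-dimensional feature value drawn from an estimated mixed strategy, and
# the agent best-responds in expected utility. All numbers are toy values.

world_strategy = {"ball_thrown": 0.7, "ball_held": 0.3}  # estimated P(world move)
payoff = {  # payoff[agent schema][world move] = estimated goal achievement
    "run_to_ball": {"ball_thrown": 1.0, "ball_held": 0.1},
    "wait":        {"ball_thrown": 0.2, "ball_held": 0.6},
}

def best_response(payoff, world_strategy):
    """Choose the schema maximizing expected payoff against the world's
    estimated mixed strategy."""
    def expected(schema):
        return sum(p * payoff[schema][move] for move, p in world_strategy.items())
    return max(payoff, key=expected)

choice = best_response(payoff, world_strategy)
# 0.7*1.0 + 0.3*0.1 = 0.73 for run_to_ball beats 0.32 for wait
```

Tractability here comes entirely from the world's move space having been reduced to a couple of features; with a raw high-dimensional world state the same computation blows up, which is the point made in the text.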
Some AI theorists would argue that this division into "feature identification" versus "action
selection" is unnecessarily artificial; for instance, Hawkins [HB04] or Arel [ARC09b] might
suggest using a single hierarchical neural network to do both of them. But the brain, after all,
¹ In game theory, a Nash equilibrium is a state in which no player can do better by unilaterally changing its strategy.
contains many different regions, with different architectures and dynamics.... In the visual cor-
tex, it seems that feature extraction and object classification are done separately. And it seems
that in the brain, action selection has a lot to do with the basal ganglia, whereas feature extrac-
tion is done in the cortex. So the neural analogy provides some inspiration for an architecture
in which feature identification and action selection are separated.
There is literature discussing numerical methods for calculating approximate Nash equilibria;
however, this is an extremely tricky topic in the CogPrime context because action selection must
generally be done in real-time. Like perception processing, this may be an area calling for the
use of parallel processing hardware. For instance, a neural network algorithm for finding mixed
Nash equilibria could be implemented on a GPU supercomputer, enabling rapid real-time action
selection based on a reduced-dimensionality model of the world produced by intelligent feature
identification.
Consideration of the application of game theory in this context brings out an important
point, which is that to do reasonably efficient and intelligent action selection, the agent needs
some rapidly-evaluable model of the world, i.e. some way to rapidly evaluate the predicted
response of the world to a hypothetical action by the agent. In the game theory
approach (or any other sufficiently intelligent approach), for the agent to evaluate fitness of a
schema-set S for achieving certain goals in a certain context, it has to (explicitly or implicitly)
estimate
• how the world will respond if the agent does S
• how the agent could usefully respond to the world's response (call this action-set S1)
• how the world will respond to the agent doing Si
• etc.
and so to rapidly evaluate the fitness of S, the agent needs to be able to quickly estimate how
the world will respond. This may be done via simulation, or it may be done via inference (which
however will rarely be fast enough, unless with a very accurate inference control mechanism),
or it may be done by learning some compacted model of the world as represented for instance
in a hierarchical neural network.
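The alternating estimate sketched in the bullet list above amounts to a short rollout against a fast world model. The following is a hedged sketch; the stub world model, goal function and fixed counter-action policy are toy assumptions standing in for simulation, inference, or a learned compact model:

```python
# Hedged sketch of the alternating look-ahead: to score a schema-set S,
# simulate a few rounds of "agent acts, world responds" using a fast world
# model, accumulating estimated progress toward the goal.

def rollout_fitness(schema, world_model, respond, goal_value, depth=3):
    """Estimate the fitness of `schema` by simulating `depth` rounds of
    action -> predicted world response -> agent counter-response."""
    total, action, state = 0.0, schema, None
    for _ in range(depth):
        state = world_model(state, action)  # world's predicted response
        total += goal_value(state)          # progress toward the goal
        action = respond(state)             # counter-action S1, S2, ...
    return total / depth

# Toy instantiation: the "world" accumulates the agent's effort, and goal
# progress saturates at 1.
world_model = lambda s, a: (s or 0) + a
goal_value = lambda s: min(s / 10.0, 1.0)
respond = lambda s: 2
fitness = rollout_fitness(5, world_model, respond, goal_value)  # about 0.7
```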
24.5 SchemaActivation
And what happens with schemata that are actually in the ActiveSchemaPool? Let us assume
that each of these schemata is a collection of modules (subprograms), connected via ActivationLinks,
which have the semantics: (ActivationLink A B) means that if the schema that placed module
A in the schema pool is to be completed, then after A is activated, B should be activated. (We
will have more to say about schemata, and their modularization, in Chapter 25.)
When a goal places a schema in the ActiveSchemaPool, it grants that schema an RFS equal
in value to the total or some fraction of the promissory+real currency it has in its possession.
The heuristics for determining how much currency to grant may become sophisticated; but
initially we may just have a goal give a schema all its promissory currency; or in the case of a
top-level supergoal, all its actual currency.
When a module within a schema actually executes, then it must redeem some of its promis-
sory currency to turn it into actual currency, because executing costs money (paid to the Lobe).
Once a schema is done executing, if it hasn't redeemed all its promissory currency, it gives the
remainder back to the goal that placed it in the ActiveSchemaPool.
When a module finishes executing, it passes promissory currency to the other modules to
which it points with ActivationLinks.
The network of modules in the ActiveSchemaPool is a digraph (whose links are ActivationLinks),
because some modules may be shared among different overall schemata. Each module
must be indexed via which schemata contain it, and each schema must be indexed via which
goal(s) want it in the ActiveSchemaPool.
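The currency flow through the module digraph might be sketched as follows. The class and method names are hypothetical, and the even split of remaining promissory currency among ActivationLink targets is our simplifying assumption:

```python
# Hypothetical sketch of ActiveSchemaPool bookkeeping: modules form a
# digraph of ActivationLinks, a goal grants a schema promissory currency,
# and each module redeems some as it executes and passes the rest downstream.

class ActiveSchemaPool:
    def __init__(self):
        self.links = {}   # module -> list of downstream modules
        self.funds = {}   # module -> promissory currency held

    def add_link(self, a, b):
        """(ActivationLink a b): after a activates, b should activate."""
        self.links.setdefault(a, []).append(b)

    def grant(self, root_module, amount):
        """A goal grants promissory currency to a schema's root module."""
        self.funds[root_module] = self.funds.get(root_module, 0.0) + amount

    def execute(self, module, cost):
        """Redeem `cost` as real currency (executing costs money), then split
        the remainder evenly among ActivationLink targets. With no targets,
        the leftover is returned to the granting goal."""
        remainder = self.funds.pop(module, 0.0) - cost
        targets = self.links.get(module, [])
        for t in targets:
            self.funds[t] = self.funds.get(t, 0.0) + remainder / len(targets)
        return 0.0 if targets else remainder

pool = ActiveSchemaPool()
pool.add_link("locate_ball", "grab_ball")
pool.grant("locate_ball", 10.0)
pool.execute("locate_ball", cost=3.0)  # spends 3, passes 7 to grab_ball
```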
24.6 GoalBasedSchemaLearning
Finally, we have the process of trying to figure out how to achieve goals, i.e. trying to learn
links between ExecutionLinks and goals G. This process should be focused on goals that have a
high importance but for which feasible achievement-methodologies are not yet known. Predicate
schematization is one way of achieving this; another is MOSES procedure evolution.
Chapter 25
Integrative Procedure Evaluation
25.1 Introduction
Procedural knowledge must be learned, an often subtle and difficult process - but it must also
be enacted. Procedure enaction is not as tricky a topic as procedure learning, but still is far
from trivial, and involves the real-time interaction of procedures, during the course of execution,
with other knowledge. In this brief chapter we explain how this process may be most naturally
and flexibly carried out, in the context of CogPrime's representation of procedures as programs
("Combo trees").
While this may seem somewhat of a "mechanical", implementation-level topic, it also involves
some basic conceptual points, on which CogPrime as an AGI design does procedure evaluation
fundamentally differently from narrow-AI systems or conventional programming language in-
terpreters. Basically, what makes CogPrime Combo tree evaluation somewhat subtle is due to
the interfacing between the Combo evaluator itself and the rest of the CogPrime system.
In the CogPrime design, Procedure objects (which contain Combo trees, and are associated
with ProcedureNodes) are evaluated by ProcedureEvaluator objects. Different ProcedureEvaluator
objects may evaluate the same Combo tree in different ways. Here we explain these various
sorts of evaluation - how they work and what they mean.
25.2 Procedure Evaluators
In this section we will mention three different ProcedureEvaluators:
• Simple procedure evaluation
• Effort-based procedure evaluation, which is more complex but is required for integration of
inference with procedure evaluation
• Adaptive evaluation order based procedure evaluation
In the following section we will delve more thoroughly into the interactions between inference
and procedure evaluation.
Another related issue is the modularization of procedures. This issue however is actually
orthogonal to the distinction between the three ProcedureEvaluators mentioned above. Modu-
larity simply requires that particular nodes within a Combo tree be marked as "module roots",
so that they may be extracted from the Combo tree as a whole and treated as separate modules
(in other words, subroutines), if the ExecutionManager judges this appropriate.
25.2.1 Simple Procedure Evaluation
The SimpleComboTreeEvaluator simply does Combo tree evaluation as described earlier. When
an Atom is encountered, it looks into the AtomTable to evaluate the object.
In the case that a Schema refers to an ungrounded SchemaNode (that is not defined by
a ComboTree as defined in Chapter 19), and an appropriate EvaluationLink value isn't in
the AtomTable, there's an evaluation failure, and the whole procedure evaluation returns the
truth value (.5,0): i.e., a zero-weight-of-evidence truth value, which is essentially equivalent to
returning no value.
In the case that a Predicate refers to an ungrounded PredicateNode, and an appropriate
EvaluationLink isn't in the AtomTable, then some very simple "default thinking" is done:
the truth value of the predicate on the given arguments is assigned to be the TruthValue of
the corresponding PredicateNode (which is defined as the mean truth value of the predicate
across all arguments known to CogPrime).
25.2.2 Effort Based Procedure Evaluation
The next step is to introduce the notion of "effort": the amount of effort that the CogPrime
system must undertake in order to carry out a procedure evaluation. The notion of effort
is encapsulated in Effort objects, which may take various forms. The simplest Effort objects
measure only elapsed processing time; more advanced Effort objects take into consideration
other factors such as memory usage.
An effort-based Combo tree evaluator keeps a running total of the effort used in evaluating
the Combo tree. This is necessary if inference is to be used to evaluate Predicates, Schema,
Arguments, etc. Without some control of effort expenditure, the system could do an arbitrarily
large amount of inference to evaluate a single Atom.
The matter of evaluation effort is nontrivial because in many cases a given node of a Combo
tree may be evaluated in more than one way, with a significant effort differential between the
different methodologies. If a Combo tree Node refers to a predicate or schema that is very
costly to evaluate, then the ProcedureEvaluator managing the evaluation of the Combo tree
must decide whether to evaluate it directly (expensive) or estimate the result using inference
(cheaper but less accurate). This decision depends on how much effort the ProcedureEvaluator
has to play with, and what percentage of this effort it finds judicious to apply to the particular
Combo tree Node in question.
In the relevant prototypes we built within OpenCog, this kind of decision was made based on
some simple heuristics inside ProcedureEvaluator objects. However, it's clear that, in general,
more powerful intelligence must be applied here, so that one needs to have ProcedureEvaluators
that - in cases of sub-procedures that are both important and highly expensive - use PLN
inference to figure out how much effort to assign to a given subproblem.
The simplest useful kind of effort-based Combo tree evaluator is the EffortIntervalComboTreeEvaluator,
which utilizes an Effort object that contains three numbers (yes, no, max).
The yes parameter tells it how much effort should be expended to evaluate an Atom if there is a
ready answer in the AtomTable. The no parameter tells it how much effort should be expended
in the case that there is not a ready answer in the AtomTable. The max parameter tells it how
much effort should be expended, at maximum, to evaluate all the Atoms in the Combo tree,
before giving up. Zero effort, in the simplest case, may be heuristically defined as simply looking
into the AtomTable - though in reality this does of course take effort, and a more sophisticated
treatment would incorporate this as a factor as well.
Quantification of amounts of effort is nontrivial, but a simple heuristic guideline is to assign
one unit of effort for each inference step. Thus, for instance,
• (yes, no, max) = (0,5,1000) means that if an Atom can be evaluated by AtomTable lookup,
this is done, but if AtomTable lookup fails, a minimum of 5 inference steps are done to try
to do the evaluation. It also says that no more than 1000 evaluations will be done in the
course of evaluating the Combo tree.
• (yes, no, max) = (3,5,1000) says the same thing, but with the change that even if evaluation
could be done by direct AtomTable lookup, 3 inference steps are tried anyway, to try to
improve the quality of the evaluation.
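The (yes, no, max) policy can be captured in a short sketch. The `lookup` and `infer` callables below are stubs standing in for AtomTable access and PLN inference, and the function names are our own, not OpenCog's:

```python
# Sketch of the (yes, no, max) effort policy: spend `yes` inference steps on
# Atoms that already have an answer in the AtomTable, `no` steps on those
# that don't, and stop once `max_effort` total steps have been spent.

def evaluate_tree(atoms, lookup, infer, yes, no, max_effort):
    """Evaluate each atom in the Combo tree under the effort policy;
    returns the evaluations made and the total effort spent."""
    spent, results = 0, {}
    for atom in atoms:
        cached = lookup(atom)
        steps = yes if cached is not None else no
        if spent + steps > max_effort:
            break  # global budget exhausted: give up on remaining atoms
        spent += steps
        results[atom] = (cached if cached is not None and steps == 0
                         else infer(atom, steps, prior=cached))
    return results, spent

# (yes, no, max) = (0, 5, 1000): cached atoms are looked up for free,
# uncached ones each get 5 inference steps.
table = {"A": 0.8}
infer = lambda atom, steps, prior: 0.5  # stub inference result
results, spent = evaluate_tree(["A", "B"], table.get, infer,
                               yes=0, no=5, max_effort=1000)
```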
25.2.3 Procedure Evaluation with Adaptive Evaluation Order
While tracking effort enables the practical use of inference within Combo tree evaluation, if one
has truly complex Combo trees, then a higher degree of intelligence is necessary to guide the
evaluation process appropriately. The order of evaluation of a Combo tree may be determined
adaptively, based on up to three things:
• The history of evaluation of the Combo tree
• Past history of evaluation of other Combo trees, as stored in a special AtomTable consisting
only of relationships about Combo tree-evaluation-order probabilities
• New information entering into CogPrime's primary AtomTable during the course of evalu-
ation
ProcedureEvaluator objects may be selected at runtime by cognitive schemata, and they may
also utilize schemata and MindAgents internally. The AdaptiveEvaluationOrderComboTreeEvaluator
is more complex than the other ProcedureEvaluators discussed, and will involve various
calls to CogPrime MindAgents, particularly those concerned with PLN inference.
25.3 The Procedure Evaluation Process
Now we give a more thorough treatment of the procedure evaluation process, as embodied in the
effort-based or adaptive-evaluation-order evaluators discussed above. The process of procedure
evaluation is somewhat complex, because it encompasses three interdependent processes:
• The mechanics of procedure evaluation, which in the CogPrime design involves traversing
Combo trees in an appropriate order. When a Combo tree node referring to a predicate or
schema is encountered during Combo tree traversal, the process of predicate evaluation or
schema execution must be invoked.
• The evaluation of the truth values of predicates - which involves a combination of inference
and (in the case of grounded predicates) procedure evaluation.
• The computation of the truth values of schemata - which may involve inference as well as
procedure evaluation.
We now review each of these processes.
25.3.1 Truth Value Evaluation
What happens when the procedure evaluation process encounters a Combo tree Node that
represents a predicate or compound term? The same thing as when some other CogPrime
process decides it wants to evaluate the truth value of a PredicateNode or CompoundTermNode:
the generic process of predicate evaluation is initiated.
This process is carried out by a TruthValueEvaluator object. There are several varieties of
TruthValueEvaluator, which fall into the following hierarchy:
TruthValueEvaluator
    DirectTruthValueEvaluator (abstract)
        SimpleDirectTruthValueEvaluator
    InferentialTruthValueEvaluator (abstract)
        SimpleInferentialTruthValueEvaluator
    MixedTruthValueEvaluator
A DirectTruthValueEvaluator evaluates a grounded predicate by directly executing it on
one or more inputs; an InferentialTruthValueEvaluator evaluates via inference based on the
previously recorded, or specifically elicited, behaviors of other related predicates or compound
terms. A MixedTruthValueEvaluator contains references to a DirectTruthValueEvaluator and an
InferentialTruthValueEvaluator, and contains a weight that tells it how to balance the outputs
from the two.
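The key behavior of this hierarchy, mixing direct execution with inference under a weight parameter, can be sketched as follows. The class name follows the text, but the body and the stub evaluators are illustrative assumptions:

```python
# Sketch of the mixed evaluator's core behavior: a weighted average of
# direct execution and inference for grounded predicates, falling back to
# inference alone for ungrounded ones.

class MixedTruthValueEvaluator:
    def __init__(self, direct, inferential, weight):
        self.direct = direct              # executes the predicate's Combo tree
        self.inferential = inferential    # inference-based estimate (analogy)
        self.weight = weight              # balance between the two outputs

    def evaluate(self, predicate, args, grounded):
        inferred = self.inferential(predicate, args)
        if not grounded:
            return inferred               # ungrounded: inference only
        executed = self.direct(predicate, args)
        return self.weight * executed + (1 - self.weight) * inferred

ev = MixedTruthValueEvaluator(
    direct=lambda p, a: 0.9,       # stub: direct execution result
    inferential=lambda p, a: 0.5,  # stub: inferred estimate
    weight=0.75)
tv = ev.evaluate("near", ("ball",), grounded=True)  # 0.75*0.9 + 0.25*0.5
```

As the text notes, the weight itself could be produced by an adaptively learned SchemaNode rather than being a fixed parameter.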
Direct truth value evaluation has two cases. In one case, there is a given argument for the
predicate; then, one simply plugs this argument in to the predicate's internal Combo tree,
and passes the problem off to an appropriately selected ProcedureEvaluator. In the other case,
there is no given argument, and one is looking for the truth value of the predicate in general.
In this latter case, some estimation is required. It is not plausible to evaluate the truth value
of a predicate on every possible argument, so one must sample a bunch of arguments and
then record the resulting probability distribution. A greater or fewer number of samples may
be taken, based on the amount of effort that's been allocated to the evaluation process. It's
also possible to evaluate the truth value of a predicate in a given context (information that's
recorded via embedding in a ContextLink); in this case the random sampling is restricted to
inputs that lie within the specified context.
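Sampling-based estimation of a predicate's general truth value, with effort controlling sample size and an optional context restriction, might look like this. All names are hypothetical; a real implementation would sample arguments from the AtomTable:

```python
# Sketch of direct evaluation with no given argument: sample arguments
# (optionally restricted to a context) and record the mean outcome, with
# the sample count scaled by the allocated effort.

import random

def sample_truth_value(predicate, argument_space, effort, context=None, seed=0):
    """Estimate a predicate's general truth value by evaluating it on a
    random sample of arguments; more effort buys more samples."""
    rng = random.Random(seed)  # fixed seed for reproducibility of the sketch
    pool = [a for a in argument_space if context is None or context(a)]
    n = min(len(pool), max(1, effort))
    sample = rng.sample(pool, n)
    return sum(predicate(a) for a in sample) / n

is_even = lambda x: 1.0 if x % 2 == 0 else 0.0
tv = sample_truth_value(is_even, range(1000), effort=200,
                        context=lambda x: x % 10 == 0)  # multiples of 10 only
```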
On the other hand, the job of an InferentialTruthValueEvaluator is to use inference rather
than direct evaluation to guess the truth value of a predicate (sometimes on a particular argu-
ment, sometimes in general). There are several different control strategies that may be applied
here, depending on the amount of effort allocated. The simplest strategy is to rely on analogy,
simply searching for similar predicates and using their truth values as guidance. (In the case
where a specific argument is given, one searches for similar predicates that have been evaluated
on similar arguments.) If more effort is available, then a more sophisticated strategy may be
taken. Generally, an InferentialTruthValueEvaluator may invoke a SchemaNode that embodies
an inference control strategy for guiding the truth value estimation process. These SchemaNodes
may then be learned like any others.
Finally, a MixedTruthValueEvaluator operates by consulting a DirectTruthValueEvaluator
and/or an InferentialTruthValueEvaluator as necessary, and merging the results. Specifically,
in the case of an ungrounded PredicateNode, it simply returns the output of the Inferential-
TruthValueEvaluator it has chosen. But in the case of a GroundedPredicateNode, it returns a
weighted average of the directly evaluated and inferred values, where the weight is a parameter.
In general, this weighting may be done by a SchemaNode that is selected by the
MixedTruthValueEvaluator; and these schemata may be adaptively learned.
25.3.2 Schema Execution
Finally, schema execution is handled similarly to truth value evaluation, but it's a bit simpler in
the details. Schemata have their outputs evaluated by SchemaExecutor objects, which in turn
invoke ProcedureEvaluator objects. We have the hierarchy:
SchemaExecutor
    DirectSchemaExecutor (abstract)
        SimpleDirectSchemaExecutor
    InferentialSchemaExecutor (abstract)
        SimpleInferentialSchemaExecutor
    MixedSchemaExecutor
A DirectSchemaExecutor evaluates the output of a schema by directly executing it on some
inputs; an InferentialSchemaExecutor evaluates via inference based on the previously recorded,
or specifically elicited, behaviors of other related schemata. A MixedSchemaExecutor contains
references to a DirectSchemaExecutor and an InferentialSchemaExecutor, and contains a weight
that tells it how to balance the outputs from the two (not always obvious, depending on the
output type in question).
Contexts may be used in schema execution, but they're used only indirectly, via being
passed to TruthValueEvaluators used for evaluating truth values of PredicateNodes or
CompoundTermNodes that occur internally in schemata being executed.
Section III
Perception and Action
Chapter 26
Perceptual and Motor Hierarchies
26.1 Introduction
Having discussed declarative, attentional, intentional and procedural knowledge, we are left only
with sensorimotor and episodic knowledge to complete our treatment of the basic CogPrime
"cognitive cycle" via which a CogPrime system can interact with an environment and seek to
achieve its goals therein.
The cognitive cycle in its most basic form leaves out the most subtle and unique aspects of
CogPrime, which all relate to learning in various forms. But nevertheless it is the foundation
on which CogPrime is built, and within which the various learning processes dealing with the
various forms of memory all interact. The CogPrime cognitive cycle is more complex in many
respects than it would need to be if not for the need to support diverse forms of learning.
And this learning-driven complexity is present to some extent in the contents of the present
chapter as well. If learning weren't an issue, perception and actuation could more likely be
treated as wholly (or near-wholly) distinct modules, operating according to algorithms and
structures independent of cognition. But our suspicion is that this sort of approach is unlikely
to be adequate for achieving high levels of perception and action capability under real-world
conditions. Instead, we suspect, it's necessary to create perception and action processes that
operate fairly effectively on their own, but are capable of cooperating with cognition to achieve
yet higher levels of functionality.
And the benefit in such an approach goes both ways. Cognition helps perception and actu-
ation deal with difficult cases, where the broad generalization that is cognition's specialty is
useful for appropriately biasing perception and actuation based on subtle environmental
regularities. And, the patterns involved in perception and actuation help cognition, via supplying
a rich reservoir of structures and processes to use as analogies for reasoning and learning at
various levels of abstraction. The prominence of visual and other sensory metaphors in abstract
cognition is well known [Arn69, Gar00]; and according to Lakoff and Núñez [LN00] even pure
mathematics is grounded in physical perception and action in very concrete ways.
We begin by discussing the perception and action mechanisms required to interface CogPrime
with an agent operating in a virtual world. We then turn to the more complex mechanisms
needed to effectively interface CogPrime with a robot possessing vaguely humanoid sensors and
actuators, focusing largely on vision processing. This discussion leads up to deeper discussions in
Chapters 27, 28 and 29 where we describe in detail the strategy that would be used to integrate
CogPrime with the DeSTIN framework for AGI perception/action (which was described in some
detail in Chapter 4 of Part 1).
In terms of the integrative cognitive architecture presented in Chapter 5, the material pre-
sented in the chapters in this section has mostly to do with the perceptual and motor hierarchies,
also touching on the pattern recognition and imprinting processes that play a role in the inter-
action between these hierarchies and the conceptual memory. The commitment to a hierarchical
architecture for perception and action is not critical for the CogPrime design as a whole - one
could build a CogPrime with non-hierarchical perception and action modules, and the rest of
the system would be about the same. The role of hierarchy here is a reflection of the obvious
hierarchical structure of the everyday human environment, and of the human body. In a world
marked by hierarchical structure, a hierarchically structured perceptual system is advantageous.
To control a body marked by hierarchical structure, a hierarchically structured action system
is advantageous. It would be possible to create a CogPrime system without this sort of in-built
hierarchical structure, and have it gradually self-adapt in such a way as to grow its own internal
hierarchical structure, based on its experience in the world. However, this might be a case of
pushing the "experiential learning" perspective too far. The human brain definitely has hierarchical
structure built into it; it doesn't need to learn to experience the world in hierarchical terms; and
there seems to be no good reason to complicate an AGI's early development phase by forcing
it to learn the basic facts of the world's and its body's hierarchality.
26.2 The Generic Perception Process
We have already discussed the generic action process of CogPrime, in Chapter 25 on procedure
evaluation. Action sequences are generated by Combo programs, which execute primitive ac-
tions, including those corresponding to actuator control signals as well as those corresponding
to, say, mathematical or cognitive operations. In some cases the actuator control signals may
directly dictate movements; in other cases they may supply inputs and/or parameters to other
software (such as DeSTIN, in the integrated CogBot architecture to be described below).
What about the generic perception process? We distinguish sensation from perception, in a
CogPrime context, by defining
• perception as what occurs when some signal from the outside world registers itself in either:
a CogPrime Atom, or some other sort of node (e.g. a DeSTIN node) that is capable of
serving as the target of a CogPrime Link.
• sensation as any "preprocessing" that occurs between the impact of some signal on some
sensor, and the creation of a corresponding perception
Once perceptual Atoms have been created, various perceptual MindAgents come into play,
taking perceptual schemata (schemata whose arguments are perceptual nodes or relations
therebetween) and applying them to Atoms recently created (creating appropriate ExecutionLinks
to store the results). The need to have special, often modality-specific perception MindAgents
to do this, instead of just leaving it to the generic SchemaExecution MindAgent, has to do
with computational efficiency, scheduling and parameter settings. Perception MindAgents are
doing schema execution urgently, and doing it with parameter settings tuned for perceptual
processing. This means that, except in unusual circumstances, newly received stimuli will be
processed immediately by the appropriate perceptual schemata.
Some newly formed perceptual Atoms will have links to existing atoms, ready-made at their
moment of creation. CharacterInstanceNodes and NumberInstanceNodes are examples; they are
born linked to the appropriate CharacterNodes and NumberNodes. Of course, atoms representing
perceived relationships, perceived groupings, etc., will not have ready-made links and will
have to grow such links via various cognitive processes. Also, the ContextFormation MindAgent
looks at perceptual atom creation events and creates ContextNodes accordingly; and this must
be timed so that the ContextNodes are entered into the system rapidly, so that they can be
used by the processes doing initial-stage link creation for new perceptual Atoms.
In a full CogPrime configuration, newly created perceptual nodes and perceptual schemata
may reside in special perception-oriented Units, so as to ensure that perceptual processes
occur rapidly, not delayed by slower cognitive processes.
26.2.1 The ExperienceDB
Separate from the ordinary perception process, it may also be valuable for there to be a direct
route from the system's sensory sources to a special "ExperienceDB" database that records all
of the system's experience. This does not involve perceptual schemata at all, nor is it left up
to the sensory source; rather, it is carried out by the CogPrime server at the point where it
receives input from the sensory source. This experience database is a record of what the system
has seen in the past, and may be mined by the system in the future for various purposes. The
creation of new perceptual atoms may also be stored in the experience database, but this must
be handled with care as it can pose a large computational expense; it will often be best to store
only a subset of these.
Obviously, such an ExperienceDB is something that has no correlate in the human mind/brain.
This is a case where CogPrime takes advantage of the non-brainlike properties of its digital com-
puter substrate. The CogPrime perception process is intended to work perfectly well without
access to the comprehensive database of experiences potentially stored in the ExperienceDB.
However, a complete record of a mind's experience is a valuable thing, and there seems no
reason for the system not to exploit it fully. Advantages like this allow the CogPrime system to
partially compensate for its lack of some of the strengths of the human brain as an AI platform,
such as massive parallelism.
26.3 Interfacing CogPrime with a Virtual Agent
We now discuss some of the particularities of connecting CogPrime to a virtual world (such as
Second Life, Multiverse, or Unity3D, to name some of the virtual world / gaming platforms to
which OpenCog has already been connected in practice).
26.3.1 Perceiving the Virtual World
The most complex, high-bandwidth sensory data coming in from a typical virtual world is visual
data, so that will be our focus here. We consider three modes in which a virtual world may
present visual data to CogPrime (or any other system):
• Object vision: CogPrime receives information about polygonal objects and their colors,
textures and coordinates (each object is a set of contiguous polygons, and sometimes objects
have "type" information, e.g. cube or sphere)
• Polygon vision: CogPrime receives information about polygons and their colors, textures
and coordinates
• Pixel vision: CogPrime receives information about pixels and their colors and coordinates
In each case, coordinates may be given either in "world coordinates" or in "relative coordinates"
(relative to the gaze). This distinction is not a huge deal since within an architecture like
CogPrime, supplying schemata for coordinate transformation is trivial; and, even if treated as a
machine learning task, this sort of coordinate transformation is not very difficult to learn. Our
current approach is to prefer relative coordinates, as this approach is more natural in terms of
modern Western human psychology; but we note that in some other cultures world coordinates
are preferred and considered more psychologically natural.
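As an illustration of why this transformation is trivial to supply as a schema, here is a minimal Python sketch of converting between world and gaze-relative coordinates, assuming (purely for illustration) that the gaze direction is characterized by a single yaw angle about the vertical axis:

```python
import math

def world_to_relative(point, eye_pos, yaw):
    """Transform world coordinates into gaze-relative coordinates,
    with the gaze direction given by a yaw angle (radians) about
    the vertical (z) axis."""
    dx = point[0] - eye_pos[0]
    dy = point[1] - eye_pos[1]
    dz = point[2] - eye_pos[2]
    c, s = math.cos(yaw), math.sin(yaw)
    # rotate by -yaw so the gaze direction becomes the local x axis
    return (c * dx + s * dy, -s * dx + c * dy, dz)

def relative_to_world(point, eye_pos, yaw):
    """Inverse transform: rotate by +yaw, then translate."""
    c, s = math.cos(yaw), math.sin(yaw)
    rx = c * point[0] - s * point[1]
    ry = s * point[0] + c * point[1]
    return (rx + eye_pos[0], ry + eye_pos[1], point[2] + eye_pos[2])

# round trip: an object 3m north of an agent facing north
w = (0.0, 3.0, 1.0)
r = world_to_relative(w, eye_pos=(0.0, 0.0, 1.0), yaw=math.pi / 2)
back = relative_to_world(r, eye_pos=(0.0, 0.0, 1.0), yaw=math.pi / 2)
```

A full 3D gaze would require pitch and roll as well, but the structure of the schema is the same: a translation followed by a rotation, and its inverse.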
Currently we have not yet done any work with pixel vision in virtual worlds. We have
been using object vision for most of our experiments, and consider a combination of polygon
vision and object vision as the "right" approach for early AGI experiments in a virtual worlds
context. The problem with pure object vision is that it removes the possibility for CogPrime
to understand object segmentation. If, for instance, CogPrime perceives a person as a single
object, then how can it recognize a head as a distinct sub-object? Feeding the system a pre-
figured hierarchy of objects, sub-objects and so forth seems inappropriate in the context of
an experiential learning system. On the other hand, the use of polygon vision instead of pixel
vision seems to meet no such objections. This may take different forms in different platforms.
For instance, in our work with a Minecraft-like world in the Unity3D environment, we have
relied heavily on virtual objects made of blocks, in which case the polygons of most interest are
the faces of the blocks.
Momentarily sticking with the object vision case for simplicity, examples of the perceptions
emanating from the virtual world perceptual preprocessor into CogPrime are things like:
1. I am at world-coordinates $W
2. Object with metadata $M is at world-coordinates $W
3. Part of object with metadata $M is at world-coordinates $W
4. Avatar with metadata $M is at world-coordinates $W
5. Avatar with metadata $M is carrying out animation $A
6. Statements in natural language, from the pet owner
The perceptual preprocessor takes these signals and translates them into Atoms, making use
of the special Atomspace mechanisms for efficiently indexing spatial and temporal information
(the TimeServer and SpaceServer) as appropriate.
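A toy sketch of this preprocessing step, with simplistic stand-ins for the real temporal and spatial indexes (the actual Atomspace uses dedicated structures for this), might look like:

```python
from collections import defaultdict

class MiniAtomspace:
    """Toy stand-in for the Atomspace's spatial/temporal indexing;
    the real system uses dedicated TimeServer/SpaceServer structures."""
    def __init__(self):
        self.atoms = []
        self.time_index = defaultdict(list)   # coarse timestamp -> atom ids
        self.space_index = defaultdict(list)  # coarse grid cell -> atom ids

    def add_percept(self, t, coords, description):
        atom_id = len(self.atoms)
        self.atoms.append((t, coords, description))
        self.time_index[round(t)].append(atom_id)
        cell = tuple(int(c // 10) for c in coords)   # 10x10x10 grid cells
        self.space_index[cell].append(atom_id)
        return atom_id

def preprocess(signal, atomspace):
    """Translate one raw perception signal into an indexed Atom."""
    return atomspace.add_percept(signal["time"], signal["coords"],
                                 "%s:%s" % (signal["kind"], signal["metadata"]))

space = MiniAtomspace()
preprocess({"time": 3.2, "kind": "avatar", "metadata": "owner",
            "coords": (12.0, 5.0, 0.0)}, space)
```

The point of the indexing is that downstream cognitive processes can cheaply ask "what was perceived near here?" or "what was perceived around then?" without scanning all Atoms.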
26.3.1.1 Transforming Real-World Vision into Virtual Vision
One approach to enabling CogPrime to handle visual data coming from the real world is to
transform this data into data of the type CogPrime sees in the virtual world. While this is not
the approach we are taking in our current work, we do consider it a viable strategy, and we
briefly describe it here.
One approach along these lines would involve multiple phases:
• Use a camera eye and a LiDAR (Light Detection And Ranging, used for high-resolution
topographic mapping) sensor in tandem, so as to avoid having to deal with stereo vision
• Using the above two inputs, create a continuous 3D contour map of the perceived visual
world
• Use standard mathematical transforms to polygon-ize the 3D contour map into a large set
of small polygons
• Use heuristics to merge together the small polygons, obtaining a smaller set of larger poly-
gons (but retaining the large set of small polygons for the system to reference in cases where
a high level of detail is necessary)
• Feed the polygons into the perceptual pattern mining subsystem, analogously to the poly-
gons that come in from the virtual world
In this approach, preprocessing is used to make the system see the physical world in a manner
analogous to how it sees the virtual world. This is quite different from the DeSTIN-based
approach to CogPrime vision that we will discuss in Chapter 28, but may well also be feasible.
26.3.2 Acting in the Virtual World
Complementing the perceptual preprocessor is the action postprocessor: code that translates
the actions and action-sequences generated by CogPrime into instructions the virtual world can
understand (such as "launch thus-and-thus animation"). Due to the particularities of current
virtual world architectures, the current OpenCogPrime system carries out actions via executing
pre-programmed high-level procedures, such as "move forward one step", "bend over forward"
and so forth. Example action commands are:
1. Move ($D, $S) : $D is a distance, $S is a speed
2. Turn ($A, $S) : $A is an angle, $S is a speed
3. Pitch ($A, $S) : turn vertically up/down... [for birds only]
4. Jump ($D, $H, $S) : $H is a maximum height, at the center of the jump
5. Say ($T) : $T is text; for agents with linguistic capability, which is not enabled in the
current version
6. pick_up($O) : $O is an object
7. put_down($O)
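The action postprocessor is then essentially a dispatch table from these high-level commands to whatever instructions the virtual world accepts. A minimal Python sketch follows; the particular animation call strings are hypothetical, not the API of any actual virtual-world platform:

```python
# hypothetical mapping from CogPrime action commands to virtual-world calls
ANIMATIONS = {
    "Move": lambda d, s: "walk(distance=%.1f, speed=%.1f)" % (d, s),
    "Turn": lambda a, s: "rotate(angle=%.1f, speed=%.1f)" % (a, s),
    "Say":  lambda text: 'speak("%s")' % text,
}

def postprocess(action, *args):
    """Translate one high-level action into a virtual-world instruction."""
    if action not in ANIMATIONS:
        raise ValueError("no animation for action %r" % action)
    return ANIMATIONS[action](*args)

print(postprocess("Move", 1.0, 0.5))   # walk(distance=1.0, speed=0.5)
```

Action sequences generated by CogPrime would simply be mapped element by element, with the virtual world handling the low-level animation details.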
This is admittedly a crude approach, and if a robot simulator rather than a typical virtual
world were used, it would be possible for CogPrime to emanate detailed servomotor control
commands rather than high-level instructions such as these. However, as noted in Chapter 16
of Part 1, at the moment there is no such thing as a "massive multiplayer robot simulator," and so
the choice is between a multi-participant virtual environment (like the Multiverse environment
currently used with the PetBrain) or a small-scale robot simulator. Our experiments with virtual
worlds so far have used the high-level approach described here; but we are also experimenting
with using physical robots and corresponding simulators, as will be described below.
26.4 Perceptual Pattern Mining
Next we describe how perceptual pattern mining may be carried out, to recognize meaningful
structures in the stream of data produced via perceiving a virtual or physical world.
In this subsection we discuss the representation of knowledge, and then in the following
subsection we discuss the actual mining. We discuss the process in the context of virtual-world
perception as outlined above, but the same processes apply to robotic perception, whether
one takes the "physical world as virtual world" approach described above or a different sort of
approach such as the DeSTIN hybridization approach described below.
26.4.1 Input Data
First, we may assume that each perception is recorded as a set of "transactions", each of which is
of the form
Time, 3D coordinates, object type
or
Time, 3D coordinates, action type
Each transaction may also come with an additional list of (attribute, value) pairs, where the
list of attributes is dependent upon the object or action type. Transactions are represented as
Atoms, and don't need to be a specific Atom type - but are referred to here by the special name
transactions simply to make the discussion clear.
Next, define a transaction template as a transaction with location and time information set
to wild cards - and potentially, some other attributes set to wild cards. (These are implemented
in terms of Atoms involving VariableNodes.)
For instance, some transaction templates in the current virtual-world might be informally
represented as:
• Reward
• Red cube
• kick
• move_forward
• Cube
• Cube, size 5
• inc
• Teacher
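These notions are easy to prototype. In the following Python sketch (with illustrative field names, not the actual Atom representation), a transaction is a flat record and a transaction template is the same record with some fields set to wildcards:

```python
WILD = "*"   # wildcard marker, standing in for a VariableNode

def make_transaction(time, coords, kind, **attrs):
    """A transaction: time, 3D coordinates, type, plus (attribute, value) pairs."""
    return {"time": time, "coords": coords, "kind": kind, **attrs}

def matches(template, transaction):
    """A template matches a transaction if every non-wildcard
    field of the template equals the transaction's field."""
    return all(value == WILD or transaction.get(field) == value
               for field, value in template.items())

cube = make_transaction(4.0, (1, 2, 0), "Cube", size=5, color="red")
# templates leave time/coords (and possibly other attributes) as wildcards
red_cube = {"time": WILD, "coords": WILD, "kind": "Cube", "color": "red"}
any_cube = {"time": WILD, "coords": WILD, "kind": "Cube"}
print(matches(red_cube, cube), matches(any_cube, cube))  # True True
```

So "Cube, size 5" and "Cube" from the list above are simply templates at different levels of specificity, both matching the same observed transaction.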
26.4.2 Transaction Graphs
Next we may conceive a transaction graph, whose nodes are transactions and whose links are
labeled with labels like after, SimAND, SSeqAND (short for SimultaneousSequentialAND),
near, in_front_of, and so forth (and whose links are weighted as well).
We may also conceive a transaction template graph, whose nodes are transaction templates,
and whose links are the same as in the transaction graph. Examples of transaction template
graphs are
near(Cube, Teacher)
SSeqAND(move_forward, Reward)
where Cube, Teacher, etc. are transaction templates, since Time and 3D coordinates are left
unspecified.
And finally, we may conceive a transaction template relationship graph (TTRG), whose nodes
may be any of: transactions; transaction templates; basic spatiotemporal predicates evaluated
at tuples of transactions or transaction templates. For instance
SimAND(near(Cube, Teacher), above(Cube, Chair))
26.4.3 Spatiotemporal Conjunctions
Define a temporal conjunction as a conjunction involving SimultaneousAND and Sequential-
AND operators (including SSeqAND as a special case of SeqAND: the special case that interests
us in the short term). The conjunction is therefore ordered, e.g.
A SSeqAND B SimAND C SSeqAND D
We may assume that the order of operations favors SimAND, so that no parenthesizing is
necessary.
Next, define a basic spatiotemporal conjunction as a temporal conjunction that conjoins
terms that are either
• transactions, or
• transaction templates, or
• basic spatiotemporal predicates applied to tuples of transactions or transaction templates
I.e. a basic spatiotemporal conjunction is a temporal conjunction of nodes from the transaction
template relationship graph.
An example would be:
(hold ball) SimAND ( near(me, teacher) ) SSeqAND Reward
This assumes that the hold action has an attribute that is the type of object held, so that
hold ball
in the above temporal conjunction is a shorthand for the transaction template specified by
action type: hold
object_held_type: ball
This example says that if the agent is holding the ball and is near the teacher then shortly
after that, the agent will get a reward.
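One simple way to check such a conjunction against recorded experience is to extract, for each transaction template, the list of times at which it matched, and then scan for the required temporal relationships. A toy Python sketch, with illustrative time windows:

```python
def sim_and(events_a, events_b, window=1.0):
    """Pairs of events from A and B occurring nearly simultaneously."""
    return [(a, b) for a in events_a for b in events_b
            if abs(a - b) <= window]

def sseq_and(pairs, events_c, max_lag=5.0):
    """A SimAND pair followed shortly afterward by an event from C."""
    return [(a, b, c) for (a, b) in pairs for c in events_c
            if 0 < c - max(a, b) <= max_lag]

# hypothetical match times for each transaction template
hold_ball  = [10.0, 42.0]
near_teach = [10.5, 30.0]
reward     = [13.0]

# evidence for: (hold ball) SimAND (near(me, teacher)) SSeqAND Reward
support = sseq_and(sim_and(hold_ball, near_teach), reward)
print(support)   # [(10.0, 10.5, 13.0)]
```

Each returned triple is one episode supporting the conjunction; the ratio of such episodes to opportunities gives a frequency estimate for the pattern.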
26.4.4 The Mining Task
The perceptual mining task, then, is to find basic spatiotemporal conjunctions that are inter-
esting. What constitutes interestingness is multifactorial, and includes:
• involves important Atoms (e.g. Reward)
• has a high temporal cohesion (i.e. the strength of the time relationships embodied in the
SimAND and SeqAND links is high)
• has a high spatial cohesion (i.e. the near() relationships have high strength)
• has a high frequency
• has a high surprise value (its frequency is far from what would be predicted by its component
sub-conjunctions)
Note that a conjunction can be interesting without satisfying all these criteria; e.g. if it involves
something important and has a high temporal cohesion, we want to find it regardless of its
spatial cohesion.
In preliminary experiments we have worked with a provisional definition of "interestingness"
as the combination of frequency and temporal cohesion, but this must be extended; and one
may even wish to have the combination function optimized over time (slowly) where the fitness
function is defined in terms of the STI and LTI of the concepts generated.
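A provisional combination function of this sort might be sketched as follows; the particular form (a maximum over products of criteria) is purely illustrative, chosen so that a conjunction can score highly on importance plus temporal cohesion alone, regardless of its spatial cohesion:

```python
def interestingness(frequency, temporal_cohesion, spatial_cohesion,
                    importance, surprise):
    """One provisional combination: a conjunction is interesting if it
    scores highly on ANY of several weighted subsets of the criteria,
    so no single criterion is strictly required."""
    return max(frequency * temporal_cohesion,    # frequent and temporally tight
               importance * temporal_cohesion,   # important, even if rare
               surprise,                         # surprising on its own
               frequency * spatial_cohesion)     # frequent and spatially tight

# important + temporally cohesive wins despite zero spatial cohesion
score = interestingness(frequency=0.1, temporal_cohesion=0.9,
                        spatial_cohesion=0.0, importance=0.8, surprise=0.2)
```

Under the scheme suggested in the text, the weights in such a function would themselves be slowly optimized against the STI and LTI of the concepts the mining produces.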
26.4.4.1 A Mining Approach
One tractable approach to perceptual pattern mining is greedy and iterative, involving the
following steps:
1. Build an initial transaction template graph G
2. Greedily mine some interesting basic spatiotemporal conjunctions from it, adding each
interesting conjunction found as a new node in G (so that G becomes a transaction template
relationship graph), repeating step 2 until boredom results or time runs out
The same TTRG may be maintained over time, but of course will require a robust forgetting
mechanism once the history gets long or the environment gets nontrivially complex.
The greedy mining step may involve simply grabbing SeqAND or SimAND links with prob-
ability determined by the (importance and/or interestingness) of their targets, and the prob-
abilistic strength and temporal strength of the temporal AND relationship, and then creating
conjunctions based on these links (which then become new nodes in the TTRG, so they can be
built up into larger conjunctions).
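A minimal sketch of this greedy loop, with an illustrative interestingness function and string-valued nodes standing in for Atoms, might look like:

```python
import random

def greedy_mine(links, interestingness, iterations=100, seed=0):
    """Greedy iterative mining over a transaction template graph: grab
    temporal links with probability weighted by interestingness, and
    promote each conjunction found to a new node so that it can be
    built up into larger conjunctions."""
    rng = random.Random(seed)
    graph = list(links)          # (node_a, operator, node_b) triples
    found = set()
    for _ in range(iterations):
        weights = [interestingness(a, b) for (a, op, b) in graph]
        if sum(weights) == 0:
            break
        a, op, b = rng.choices(graph, weights=weights)[0]
        conj = "(%s %s %s)" % (a, op, b)
        if conj not in found:
            found.add(conj)
            # the new conjunction becomes a node, linked onward in time
            graph.append((conj, "SSeqAND", "Reward"))
    return found

links = [("hold_ball", "SimAND", "near_teacher")]
score = lambda a, b: 1.0 if "Reward" in (a, b) or "hold" in a else 0.1
patterns = greedy_mine(links, score, iterations=10)
```

In a real implementation the probabilities would come from ECAN importance values and link strengths, and the forgetting mechanism would prune the growing TTRG.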
26.5 The Perceptual-Motor Hierarchy
The perceptual pattern mining approach described above is "flat," in the sense that it simply
proposes to recognize patterns in a stream of perceptions, without imposing any kind of ex-
plicitly hierarchical structure on the pattern recognition process or the memory of perceptual
patterns. This is different from how the human visual system works, with its clear hierarchical
structure, and also different from many contemporary vision architectures, such as DeSTIN or
Hawkins' Numenta system, which also utilize hierarchical neural networks.
However, the approach described above may be easily made hierarchical within the CogPrime
architecture, and this is likely the most effective way to deal with complex visual scenes. Most
simply, in this approach, a hierarchy may be constructed corresponding to different spatial
regions, within the visual field. The RegionNodes at the lowest level of the hierarchy correspond
to small spatial regions, the ones at the next level up correspond to slightly larger spatial regions,
and so forth. Each RegionNode also corresponds to a certain interval of time, and there may
be different RegionNodes corresponding to the same spatial region but with different time-
durations attached to them. RegionNodes may correspond to overlapping rather than disjoint
regions.
Within each region mapped by a RegionNode, then, perceptual pattern mining as defined
in the previous section may occur. The patterns recognized in a region are linked to the corre-
sponding RegionNode - and are then fed as inputs to the RegionNodes corresponding to larger,
encompassing regions; and as suggestions-to-guide-pattern-recognition to nearby RegionNodes
on the same level. This architecture involves the fundamental hierarchical structure/dynamic
observed in the human visual cortex. Thus, the hierarchy incurs a dynamic of patterns-within-
patterns-within-patterns, and the heterarchy incurs a dynamic of patterns-spawning-similar-
patterns.
Also, patterns found in a RegionNode should be used to bias the pattern-search in the
RegionNodes corresponding to smaller, contained regions; for instance, if many of the sub-
regions corresponding to a certain region have revealed parts of a face, then the pattern-mining
processes in the remaining sub-regions may be instructed to look for other face-parts.
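The upward flow of patterns, together with the use of already-found patterns as top-down hints, can be sketched as follows; the toy miner that "recognizes a face" once enough parts are reported is purely illustrative:

```python
class RegionNode:
    """A node in the perceptual hierarchy covering a spatial region;
    patterns mined here are fed upward to the enclosing region, while
    patterns already found here bias the search in child regions."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.patterns = []

    def mine(self, mine_fn, hints=()):
        inputs = []
        for child in self.children:
            # pass patterns already found here down as hints
            inputs.extend(child.mine(mine_fn, hints=self.patterns))
        self.patterns = mine_fn(self.name, inputs, hints)
        return self.patterns

# toy miner: a "face" pattern emerges once enough parts are reported
def toy_miner(region, inputs, hints):
    if region in ("left-eye", "right-eye", "mouth"):
        return [region]                       # leaf-level detections
    parts = set(inputs)
    return ["face"] if {"left-eye", "right-eye", "mouth"} <= parts else list(parts)

top = RegionNode("upper-field",
                 [RegionNode("left-eye"), RegionNode("right-eye"),
                  RegionNode("mouth")])
print(top.mine(toy_miner))   # ['face']
```

The heterarchical dynamic mentioned above would correspond to RegionNodes at the same level exchanging hints laterally, in addition to the strictly hierarchical flow shown here.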
This architecture permits the hierarchical dynamics utilized in standard hierarchical vision
models, such as Jeff Hawkins' and other neural net models, but within the context of CogPrime's
pattern-mining approach to perception. It is a good example of the flexibility intrinsic to the
CogPrime architecture.
Finally, why have we called it a perceptual-motor hierarchy above? This is because, due to
the embedding of the perceptual hierarchy in CogPrime's general Atom-network, the percepts in
a certain region will automatically be linked to actions occurring in that region. So, there may
be some perception-cognition-action interplay specific to a region, occurring in parallel with
the dynamics in the hierarchy of multiple regions. Clearly this mirrors some of the complex
dynamics occurring in the human brain, and is also reflected in the structure of sophisticated
perceptual-motor approaches like DeSTIN, to be discussed below.
26.6 Object Recognition from Polygonal Meshes
Next we describe a more specific perceptual pattern recognition algorithm - a strategy for iden-
tifying objects in a visual scene that is perceived as a set of polygons. It is not a thoroughly
detailed algorithmic approach, but rather a high-level description of how this may be done effec-
tively within the CogPrime design. It is offered here largely as an illustration of how specialized
perceptual data processing algorithms may be designed and implemented within the CogPrime
framework.
We deal here with an agent whose perception of the world, at any point in time, is understood
to consist of a set of polygons, each one described in terms of a list of corners. The corners may
be assumed to be described in coordinates relative to the viewing eye of the agent.
What we mean by "identifying objects" here is something very simple. We don't mean iden-
tifying that a particular object is a chair, or is Ben's brown chair, or anything like that - we
simply mean identifying that a given collection of polygons is meaningfully grouped into an ob-
ject. That is the task considered here. The object could be a single block, it could be a person,
or it could be a tower of blocks (which appears as a single object until it is taken apart).
Of course, not all approaches to polygon-based vision processing would require this sort of
phase: it would be possible, as an alternative, to simply compare the set of polygons in the
visual field to a database of prior experience and then do object identification (in the present
sense) based on this database-comparison. But in the approach described in this section, one
begins instead with an automated segmentation of the set of perceived polygons into a set of
objects.
26.6.1 Algorithm Overview
The algorithm described here falls into three stages:
1. Recognizing PersistentPolygonNodes (PPNodes) from PolygonNodes.
2. Creating Adjacency Graphs from PPNodes.
3. Clustering in the Adjacency Graph.
Each of these stages involves a bunch of details, not all of which have been fully resolved: this
section just gives a conceptual overview.
We will speak in terms of objects such as PolygonNode, PPNode and so forth, because inside
the CogPrime AI engine, observed and conceived entities are represented as nodes in a graph.
However, this terminology is not very important here, and what we call a PolygonNode here
could just as well be represented in a host of other ways, within the overall CogPrime framework.
26.6.2 Recognizing PersistentPolygonNodes (PPNodes) from
PolygonNodes
A PolygonNode represents a polygon observed at a point in time. A PPNode represents a series
of PolygonNodes that are heuristically guessed to represent the same persistent polygon at
different moments in time.
Before "object permanence" is learned, the heuristics for recognizing PPNodes will only
work in the case of a persistent polygon that, over an interval of time, is experiencing relative
motion within the visual field, but is never leaving the visual field. For example some reasonable
heuristics are: If P1 occurs at time t, P2 occurs at time s where s is very close to t, and P1
and P2 are similar in shape, size, color and position, then P1 and P2 should be grouped together
into the same PPNode.
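These heuristics might be prototyped roughly as follows; the similarity tolerances are illustrative guesses, and a real implementation would compare actual polygon shapes rather than a single size number:

```python
def similar(p1, p2, shape_tol=0.1, pos_tol=0.5, dt_max=0.2):
    """Heuristic: two polygon observations belong to the same persistent
    polygon if they are close in time, size and position, and share a
    color. All tolerances are illustrative guesses."""
    t1, pos1, size1, color1 = p1
    t2, pos2, size2, color2 = p2
    return (abs(t1 - t2) <= dt_max and color1 == color2
            and abs(size1 - size2) <= shape_tol * max(size1, size2)
            and sum((a - b) ** 2 for a, b in zip(pos1, pos2)) ** 0.5 <= pos_tol)

def group_into_ppnodes(observations):
    """Greedy pass: attach each observation to the first PPNode whose
    latest member it resembles, else start a new PPNode."""
    ppnodes = []
    for obs in sorted(observations):           # time order
        for pp in ppnodes:
            if similar(pp[-1], obs):
                pp.append(obs)
                break
        else:
            ppnodes.append([obs])
    return ppnodes

# (time, position, size, color) tuples: two red frames of one polygon, one blue
obs = [(0.0, (1.0, 1.0), 2.0, "red"), (0.1, (1.2, 1.0), 2.0, "red"),
       (0.1, (5.0, 5.0), 1.0, "blue")]
print(len(group_into_ppnodes(obs)))   # 2
```

Relaxing the color equality test (e.g. into a weighted similarity) is how the more advanced heuristics mentioned below would handle a rapidly color-changing object.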
More advanced heuristics would deal carefully with the case where some of these similarities
did not hold, which would allow us to deal e.g. with the case where an object was rapidly
changing color.
In the case where the polygons are coming from a simulation world like OpenSim, then from
our positions as programmers and world-masters, we can see that what a PPNode is supposed to
correspond to is a certain side of a certain OpenSim object; but it doesn't appear immediately
that way to CogPrime when controlling an agent in OpenSim since CogPrime isn't perceiving
OpenSim objects, it's perceiving polygons. On the other hand, in the case where polygons are
coming from software that postprocesses the output of a LiDAR based vision system, then the
piecing together of PPNodes from PolygonNodes is really necessary.
26.6.3 Creating Adjacency Graphs from PPNodes
Having identified PPNodes, we may then draw a graph between PPNodes, a PPGraph (also
called an "Adjacency Graph"), wherein the links are AdjacencyLinks (with weights indicating
the degree to which the two PPNodes tend to be adjacent, over time). A more refined graph
might also involve SpatialCoordinationLinks (with weights indicating the degree to which the
vector between the centroids of the two PPNodes tends to be consistent over time).
We may then use this graph to do object identification:
• First-level objects may be defined as clusters in the graph of PPNodes.
• One may also make a graph between first-level objects, an ObjectGraph with the same
kinds of links as in the PPGraph. Second-level objects may be defined as clusters in the
ObjectGraph.
The "strength" of an identified object may be assigned as the "quality" of the cluster (measured
in terms of how tight the cluster is, and how well separated from other clusters.)
As an example, consider a robot with two parts: a body and a head. The whole body may
have a moderate strength as a first-level object, but the head and body individually will have
significantly greater strengths as first-level objects. On the other hand, the whole body should
have a fairly high strength as a second-level object.
It seems convenient (though not necessary) to have a PhysicalObjectNode type to represent
the objects recognized via clustering; but the first versus second level object distinction should
not need to be made on the Atom type level.
Building the adjacency graph requires a mathematical formula defining what it means for
two PPNodes to be adjacent. Creating this formula may require a little tuning. For instance,
the adjacency between two PPNodes PP1 and PP2 may be defined as the average over time
of the adjacency of the PolygonNodes PP1(t) and PP2(t) observed at each time t. (A p'th
power average¹ may be used here, and different values of p may be tried.) Then, the adjacency
between two (simultaneous) PolygonNodes P1 and P2 may be defined as the average over all
x in P1 of the minimum over all y in P2 of sim(x,y), where sim(·,·) is an appropriately scaled
similarity function. This latter average could arguably be made a maximum; or perhaps even
better a p'th power average with large p, which approximates a maximum.
¹ The p'th power average of x_1, ..., x_n is defined as ( (1/n) Σ_i x_i^p )^(1/p).
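The formula and its p'th power average variant can be sketched as follows. Here we take, for each corner of P1, the closest corner of P2 (reading the "minimum over y" as a minimum over distances), with an illustrative distance-to-similarity scaling:

```python
def pth_power_average(values, p):
    """Generalized mean: approaches max(values) as p grows large."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def polygon_adjacency(corners1, corners2, scale=1.0, p=1):
    """For each corner x of P1, take the closest corner of P2; adjacency
    is an average of the resulting scaled similarities. p=1 gives the
    plain mean; large p approximates the 'maximum' variant."""
    def closest_sim(x):
        d = min(sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
                for y in corners2)
        return 1.0 / (1.0 + d / scale)          # one possible scaling
    return pth_power_average([closest_sim(x) for x in corners1], p)

square   = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
touching = [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)]
far      = [(9.0, 9.0), (10.0, 9.0), (10.0, 10.0), (9.0, 10.0)]
print(round(polygon_adjacency(square, touching), 2))   # 0.75
```

The adjacency of two PPNodes would then be a (possibly p'th power) time average of this quantity over the interval on which both are observed.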
26.6.4 Clustering in the Adjacency Graph
As noted above, the idea is that objects correspond to clusters in the adjacency graph. This
means we need to implement some hierarchical clustering algorithm that is tailored to find
clusters in symmetric weighted graphs. Probably some decent algorithms of this character exist,
if not it would be fairly easy to define one, e.g. by mapping some standard hierarchical clustering
algorithm to deal with graphs rather than vectors.
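For example, a standard agglomerative scheme maps onto weighted graphs straightforwardly: repeatedly merge the pair of clusters with the highest average inter-cluster link weight, stopping when that weight falls below a threshold. A sketch:

```python
def cluster_graph(nodes, weights, threshold=0.5):
    """Simple agglomerative clustering on a symmetric weighted graph:
    repeatedly merge the two clusters with the strongest average
    inter-cluster link weight, stopping below a threshold."""
    clusters = [{n} for n in nodes]
    w = lambda a, b: weights.get((a, b), weights.get((b, a), 0.0))
    while len(clusters) > 1:
        best, pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                links = [w(a, b) for a in clusters[i] for b in clusters[j]]
                avg = sum(links) / len(links)
                if avg > best:
                    best, pair = avg, (i, j)
        if pair is None or best < threshold:
            break
        i, j = pair
        clusters[i] |= clusters.pop(j)
    return clusters

# head polygons link tightly to each other, weakly to torso polygons
weights = {("h1", "h2"): 0.9, ("t1", "t2"): 0.9, ("h2", "t1"): 0.2}
parts = cluster_graph(["h1", "h2", "t1", "t2"], weights, threshold=0.5)
print(sorted(sorted(c) for c in parts))   # [['h1', 'h2'], ['t1', 't2']]
```

Lowering the threshold would eventually merge head and torso into one cluster, which corresponds to the first-level versus second-level object distinction discussed above.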
Clusters will then be mapped into PhysicalObjectNodes, interlinked appropriately via Phys-
icalPartLinks and AdjacencyLinks. (E.g. there would be a PhysicalPartLink between the Phys-
icalObjectNode representing a head and the PhysicalObjectNode representing a body, where
the body is considered as including the head.)
26.6.5 Discussion
It seems probable that, for simple scenes consisting of a small number of simple objects, clus-
tering for object recognition will be fairly unproblematic. However, there are two cases that are
potentially tricky:
• Sub-objects: e.g. the head and torso of a body, which may move separately; or the nose of
the head, which may wiggle; or the legs of a walking dog; etc.
• Coordinated objects: e.g. if a character's hat is on a table, and then later on his head, then
when it's on his head we basically want to consider him and his hat as the same object, for
some purposes.
These examples show that partitioning a scene into objects is a borderline-cognitive rather than
purely lower-level-perceptual task, which cannot be hard-wired in any very simple way.
We also note that, for complex scenes, clustering may not work perfectly for object recogni-
tion and some reasoning may be needed to aid with the process. Intuitively, these may corre-
spond to scenes that, in human perceptual psychology, require conscious attention and focus in
order to be accurately and usefully perceived.
26.7 Interfacing the Atomspace with a Deep Learning Based
Perception-Action Hierarchy
We have discussed how one may do perception processing such as object recognition within the
Atomspace, and this is indeed a viable strategy. But an alternate approach is also interesting,
and likely more valuable in the case of robotic perception/action: build a separate perceptual-
motor hierarchy, and link it in with the Atomspace. This approach is appealing in large part
because a lot of valuable and successful work has already been done using neural networks
and related architectures for perception and actuation. And it is not necessarily contradictory
to doing perception processing in the Atomspace - obviously, one may have complementary,
synergetic perception processing occurring in two different parts of the architecture.
This section reviews some general ideas regarding the interfacing of CogPrime with deep
learning hierarchies for perception and action; the following chapter then discusses one example
of this in detail, involving the DeSTIN deep learning architecture.
26.7.1 Hierarchical Perception Action Networks
CogPrime could be integrated with a variety of different hierarchical perception/action archi-
tectures. For the purpose of this section, however, we will consider a class of architectures that
is neither completely general nor extremely specific. Many of the ideas to be presented here are
in fact more broadly applicable beyond the architecture described here.
The following assumptions will be made about the HPANs (Hierarchical Perception/Action
Network) to be hybridized with CogPrime. It may be best to use multiple HPANs, at least one
for declarative/sensory/episodic knowledge (we'll call this the "primary HPAN") and one for
procedural knowledge. A HPAN for intentional knowledge (a goal hierarchy; in DeSTIN called
the "critic hierarchy") may be valuable as well. We assume that each HPAN has the properties:
1. It consists of a network of nodes, endowed with a learning algorithm, whose connectivity
pattern is largely but not entirely hierarchical (and whose hierarchy contains feedback,
feedforward and lateral connections)
2. It contains a set of input nodes, receiving perceptual inputs, at the bottom of the hierarchy
3. It has a set of output nodes, which may span multiple levels of the hierarchy. The "output
nodes" indicate informational signals to cognitive processes lying outside the HPAN, or else
control signals to actuators, which may be internal or external.
4. Other nodes besides I/O nodes may potentially be observed or influenced by external pro-
cesses; for instance they may receive stimulation
5. Link weights in the HPAN get updated via some learning algorithm that is roughly speaking
"statistically Hebbian," in the sense that on the whole when a set of nodes get activated
together for a period of time, they will tend to become attractors. By an attractor we mean:
a set S of nodes such that the activation of a subset of S during a brief interval tends to
lead to the activation of the whole set S during a reasonably brief interval to follow
6. As an approximate but not necessarily strict rule, nodes higher in the hierarchy tend to
be involved in attractors corresponding to events or objects localized in larger spacetime
regions
Examples of specific hierarchical architectures broadly satisfying these requirements are the
visual pattern recognition networks constructed by Hawkins and colleagues, and Arel's
DeSTIN system discussed earlier (and in more depth in following chapters). The latter appears
to fit the requirements particularly snugly due to having dynamics very well suited to the
formation of a complex array of attractors, and a richer methodology for producing outputs.
These are all not only HPANs but have a more particular structure that in Chapter 27 is called
a Compositional Spatiotemporal Deep Learning Network or CSDLN.
The particulars of the use of HPANs with OpenCog are perhaps best explained via enumer-
ation of memory types and control operations.
26.7.2 Declarative Memory
The key idea here is linkage of primary HPAN attractors to CogPrime
ConceptNodes via MemberLinks. This is in accordance with the notion of glocal memory, in
the language of which the HPAN attractors are the maps and the corresponding ConceptNodes
are the keys. Put simply, when a HPAN attractor is recognized, MemberLinks are created
between the HPAN nodes comprising the main body of the attractor, and a ConceptNode in
the AtomTable representing the attractor. MemberLink weights may be used to denote fuzzy
attractor membership. Activation may spread from HPAN nodes to ConceptNodes, and STI
may spread from ConceptNodes to HPAN nodes; a conversion rate between HPAN activation
and STI currency must be maintained by the CogPrime central bank (see Chapter 23), for
ECAN purposes.
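The bookkeeping involved (MemberLinks with fuzzy weights, plus a conversion rate between HPAN activation and STI currency) might be sketched as follows; all names are illustrative:

```python
class GlocalLinkage:
    """Toy sketch of keys-and-maps linkage: an HPAN attractor (a set of
    node ids with fuzzy membership weights) is keyed by a ConceptNode,
    and activation flows across the boundary at a fixed exchange rate."""
    def __init__(self, exchange_rate=0.1):
        self.member_links = {}           # concept -> {hpan_node: weight}
        self.rate = exchange_rate        # HPAN activation units per STI unit

    def link_attractor(self, concept, attractor):
        """Create MemberLinks from the attractor's nodes to a concept key."""
        self.member_links[concept] = dict(attractor)

    def activation_to_sti(self, hpan_activations):
        """Spread activation from HPAN nodes up to the concept keys,
        converting to STI at the maintained exchange rate."""
        sti = {}
        for concept, members in self.member_links.items():
            flow = sum(w * hpan_activations.get(n, 0.0)
                       for n, w in members.items())
            sti[concept] = flow / self.rate
        return sti

g = GlocalLinkage(exchange_rate=0.5)
g.link_attractor("EiffelTower", {"n1": 1.0, "n2": 0.7})
sti = g.activation_to_sti({"n1": 0.5, "n2": 1.0})
print(round(sti["EiffelTower"], 2))   # 2.4
```

The reverse direction, spreading STI down from ConceptNodes to stimulate HPAN nodes, would use the same MemberLink weights and the inverse of the exchange rate.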
Both abstract and concrete knowledge may be represented in this way. For instance, the
Eiffel Tower would correspond to one attractor, the general shape of the Eiffel Tower would
correspond to another, and the general notion of a "tower" would correspond to yet another. As
these three examples are increasingly abstract, the corresponding attractors would be weighted
increasingly heavily on the upper levels of the hierarchy.
26.7.3 Sensory Memory
CogPrime may also use its primary HPAN to store memories of sense-perceptions and low-level
abstractions therefrom. MemberLinks may join concepts in the AtomTable to percept-attractors
in the HPAN. If the HPAN is engineered to associate specific neural modules to specific spatial
regions or specific temporal intervals, then this may be accounted for by automatically index-
ing ConceptNodes corresponding to attractors centered in those modules in the AtomTable's
TimeServer and SpaceServer objects, which index Atoms according to time and space.
An attractor representing something specific like the Eiffel Tower, or Bob's face, would be
weighted largely in the lower levels of the hierarchy, and would correspond mainly to sensory
rather than conceptual memory.
26.7.4 Procedural Memory
The procedural HPAN may be used to learn procedures such as low-level motion primitives
that are more easily learned using HPAN training than using more abstract procedure learning
methods. For example, a Combo tree learned by MOSES in CogPrime might contain a primitive
corresponding to the predicate-argument relationship pick_up(ball); but the actual procedure
for controlling a robot hand to pick up a ball, might be expressed as an activity pattern within
the low-level procedural HPAN. A procedure P stored in the low-level procedural HPAN would
be represented in the AtomTable as a ConceptNode C linked to key nodes in the HPAN attractor
corresponding to P. The invocation of P would be accomplished by transferring STI currency
to C and then allowing ECAN to do its work.
On the other hand, CogPrime's interfacing of the high-level procedural HPAN with the Cog-
Prime ProcedureRepository is intimately dependent on the particulars of the MOSES proce-
dure learning algorithm. As will be outlined in more depth in Chapter 33, MOSES is a complex,
multi-stage process that tries to find a program maximizing some specified fitness function, and
that involves doing the following within each "deme" (a deme being an island of roughly-similar
programs):
1. casting program trees into a hierarchical normal form
2. evaluating the program trees on a fitness function
3. building a model distinguishing fit versus unfit program trees, which involves: 3a. figuring
out what program tree features the model should include; 3b. building the model using a
learning algorithm
4. generating new program trees that are inferred likely to give high fitness, based on the
model
5. return to step 1 with these new program trees
There is also a system for managing the creation and deletion of demes.
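The five-step loop above can be sketched in miniature. This is emphatically not the real MOSES: here "programs" are boolean feature tuples, the "normal form" is a sort, the fitness function is a toy, and the model is just per-feature frequency among fit programs. Only the control flow of steps 1-5 is being illustrated.

```python
import random

# Toy sketch of the per-deme MOSES loop described above. Everything here
# (the tuple "programs", sort-based normal form, frequency "model") is an
# illustrative stand-in for the real algorithm.

random.seed(0)

def normalize(prog):                       # step 1: cast into a normal form
    return tuple(sorted(prog))

def fitness(prog):                         # step 2: evaluate on a toy fitness
    return sum(prog)

def build_model(scored, cutoff):           # step 3: model fit vs. unfit programs
    fit = [p for p, f in scored if f >= cutoff]
    n = max(len(fit), 1)
    return [sum(p[i] for p in fit) / n for i in range(3)]

def sample(model):                         # step 4: generate likely-fit programs
    return tuple(int(random.random() < pr) for pr in model)

def run_deme(pop, generations=5):
    for _ in range(generations):           # step 5: return to step 1
        pop = [normalize(p) for p in pop]
        scored = [(p, fitness(p)) for p in pop]
        cutoff = sorted(f for _, f in scored)[len(scored) // 2]
        model = build_model(scored, cutoff)
        pop = [sample(model) for _ in pop]
    return max(pop, key=fitness)

best = run_deme([(0, 0, 1), (1, 0, 0), (0, 1, 1), (1, 1, 0)])
```

With this toy setup the model quickly concentrates on the fittest normal-form program; in the real MOSES, step 3 is far richer, which is exactly the step the text proposes to augment.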
The weakest point in CogPrime's current MOSES-based approach to procedure learning
appears to be step 3. And the main weakness is conceptual rather than algorithmic; what is
needed is to replace the current step 3 with something that uses long-term memory to do
model-building and feature-selection, rather than (like the current code) doing these things
in a manner that's restricted to the population of program trees being evolved to optimize a
particular fitness function.
One promising approach to resolving this issue is via replacing step 3b (and, to a limited
extent, 3a) with an interconnection between MOSES and a procedural HPAN. A HPAN can
do supervised categorization, and can be designed to handle feature selection in a manner
integrated with categorization, and also to integrate long-term memory into its categorization
decisions.
26.7.5 Episodic Memory
In a hybrid CogPrime /HPAN architecture, episodic knowledge may be handled via a combi-
nation of:
1. using a traditional approach to store a large ExperienceDB of actual experienced episodes
(including sensory inputs and actions; and also the states of the most important items in
memory during the experience)
2. using the Atomspace (with its TimeServer and SpaceServer components) to store declarative
knowledge about experiences
3. using dimensional embedding to index the AtomSpace's episodic knowledge in a spatiotem-
porally savvy way, as described in Chapter 40
4. training a large HPAN to summarize the scope of experienced episodes (this could be the
primary HPAN used for declarative and sensory memory, or could potentially be a separate
episodic HPAN)
Such a network should be capable of generating imagined episodes based on cues, as well as
recalling real episodes. The HPAN would serve as a sort of index into the memory of episodes;
there would be HebbianLinks from the AtomTable into the episodic HPAN.
For instance, suppose that the agent once built an extremely tall tower of blocks, taller than
any others in its memory. Perhaps it wants to build another very tall tower, so it wants to
summon up the memory of that previous occasion, to see if there is possibly guidance therein.
It then proceeds by thinking about tallness and towerness at the same time, which stimulates
the relevant episode, because at the time of building the extremely tall tower, the agent was
thinking a lot about tallness (so thoughts of tallness are part of the episodic memory).
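This cue-based recall might be sketched as follows. The link weights, episode names, and the winner-take-all retrieval rule are invented for illustration; the real mechanism would be ECAN importance spreading over HebbianLinks.

```python
# Illustrative sketch of episodic recall by co-stimulating cue concepts:
# STI spreads from "tall" and "tower" over HebbianLinks into the episodic
# store, and the episode most strongly associated with both cues wins.
# Weights and names are invented for illustration.

hebbian_links = {                      # (concept, episode) -> link weight
    ("tall", "ep_tall_tower"): 0.9,
    ("tower", "ep_tall_tower"): 0.8,
    ("tower", "ep_small_tower"): 0.7,
    ("red", "ep_red_ball"): 0.9,
}

def recall(cues, sti_per_cue=1.0):
    """Spread STI from each cue concept into linked episodes; return the
    episode receiving the most total stimulation."""
    received = {}
    for (concept, episode), w in hebbian_links.items():
        if concept in cues:
            received[episode] = received.get(episode, 0.0) + sti_per_cue * w
    return max(received, key=received.get) if received else None

best = recall({"tall", "tower"})   # both cues converge on the tall-tower episode
```

Note that neither cue alone is decisive; it is the convergence of "tallness" and "towerness" stimulation that singles out the extremely tall tower episode, matching the scenario described above.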
26.7.6 Action Selection and Attention Allocation
CogPrime's action selection mechanism chooses procedures based on which ones are estimated
most likely to achieve current goals given current context, and places these in an "active proce-
dure pool" where an ExecutionManager object mediates their execution.
Attention allocation spans all components of CogPrime, including an HPAN if one is in-
tegrated. Attention flows between the two components due to the conversion of STI to and
from HPAN activation. In this manner, assignment of credit flows from GoalNodes into the
HPAN, since simultaneous activation of linked nodes may be viewed as "rewarding" an HPAN
link. Thus the HPAN may receive reward signals from GoalNodes via ECAN: when a ConceptN-
ode gets rewarded, and that ConceptNode points to a set of HPAN nodes, those nodes get some
of the reward.
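The credit-assignment path just described might be sketched as follows. The `share` and `rate` parameters, the link structure, and the proportional splitting are all illustrative assumptions; they stand in for ECAN's STI dynamics, not for any actual OpenCog API.

```python
# Hedged sketch of assignment of credit flowing from a GoalNode into an
# HPAN: a rewarded goal shares its reward with the ConceptNodes it points
# to, and their STI is converted into extra activation on the HPAN nodes
# they index. All parameters and names are illustrative.

def reward_goal(goal, links, hpan_activation, reward, share=0.5, rate=1.0):
    """goal: name of the rewarded GoalNode.
    links: dict mapping a node name to the set of nodes it points to.
    hpan_activation: dict of HPAN node -> activation, updated in place."""
    concept_sti = {}
    for concept in links.get(goal, ()):            # goal rewards its concepts
        concept_sti[concept] = reward * share / len(links[goal])
    for concept, sti in concept_sti.items():       # STI -> HPAN activation
        for hnode in links.get(concept, ()):
            share_per_node = rate * sti / len(links[concept])
            hpan_activation[hnode] = hpan_activation.get(hnode, 0.0) + share_per_node
    return concept_sti

links = {"GoalNode:GetBall": {"Concept:pick_up"},
         "Concept:pick_up": {"h1", "h2"}}
activation = {}
stis = reward_goal("GoalNode:GetBall", links, activation, reward=4.0)
```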
26.8 Multiple Interaction Channels
Now we discuss a broader issue regarding the interfacing between CogPrime and the external
world. The only currently existing embodied OpenCog applications, PetBrain and CogBot, are
based on a loosely human model of perception and action, in which a single CogPrime instance
controls a single mobile body, but this of course is not the only way to do things. More generally,
what we can say is that a variety of external-world events come into a CogPrime system from
physical or virtual world sensors, as well as from other sources such as database interfaces and
Web spiders. The external systems providing CogPrime with data may be
generically referred to as sensory sources (and in the terminology we adopt here, once Atoms
have been created to represent external data, then one is dealing with perceptions rather than
sensations). The question arises how to architect a CogPrime system, in general, for dealing
with a variety of sensory sources.
We introduce the notion of an "interaction channel": a collection of sensory sources that
is intended to be considered as a whole as a synchronous stream, and that is also able to
receive CogPrime actions - in the sense that when CogPrime carries out actions relative to
the interaction channel, this directly affects the perceptions that CogPrime receives from the
interaction channel. A CogPrime meant to have conversations with 10 separate users at once
might have 10 interaction channels. A human mind has only one interaction channel in this
sense (although humans may become moderately adept at processing information from multiple
external-world sources, coming in through the same interaction channel).
Multiple-interaction-channel digital psychology may become extremely complex - and hard
for us, with our single interaction channels, to comprehend. This is one among many cases
where a digital mind, with its more flexible architecture, will have a clear advantage over our
human minds with their fixed and limited neural architectures. For simplicity, however, in the
following chapters we will often focus on the single-interaction-channel case.
Events coming in through an interaction channel are presented to the system as new per-
ceptual Atoms, and relationships amongst these. In the multiple interaction channel case, the
AttentionValues of these newly created Atoms require special treatment. Not only do they re-
quire special rules, they require additional fields to be added to the AttentionValue object,
beyond what has been discussed so far.
We require newly created perceptual Atoms to be given a high initial STI. And we also
require them to be given a high amount of a quantity called "interaction-channel STI." To
support this, the AttentionValue objects of Atoms must be expanded to contain interaction-
channel STI values; and the ImportanceUpdating MindAgent must compute interaction-channel
importance separately from ordinary importance.
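The extended AttentionValue object might look roughly as follows. The field names, the `boost` method, and the shared decay factor are illustrative assumptions, not the actual OpenCog AttentionValue API; in particular, a real ImportanceUpdating agent could well decay channel importance at a different rate than global importance.

```python
# Sketch of an AttentionValue extended with per-interaction-channel STI,
# as described above. Field names and update rules are illustrative.

class AttentionValue:
    def __init__(self, sti=0.0, lti=0.0):
        self.sti = sti
        self.lti = lti
        self.channel_sti = {}               # channel id -> channel-specific STI

    def boost(self, amount, channel=None):
        """A new percept gets high global STI, plus high STI specific to
        the interaction channel that delivered it."""
        self.sti += amount
        if channel is not None:
            self.channel_sti[channel] = self.channel_sti.get(channel, 0.0) + amount

    def decay(self, factor=0.9):
        # Channel importance is computed separately from ordinary
        # importance; for simplicity one decay factor serves both here.
        self.sti *= factor
        for ch in self.channel_sti:
            self.channel_sti[ch] *= factor

av = AttentionValue()
av.boost(10.0, channel="user_3")            # percept arrives on one of 10 channels
av.decay()
```

An Atom hot in one conversation can thus be cold in another, even though both conversations share the same global STI economy.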
And, just as we have channel-specific AttentionValues, we may also have channel-specific
TruthValues. This allows the system to separately account for the frequency of a given percep-
tual item in a given interaction channel. However, no specific mechanism is needed for these:
they are merely contextual truth values, to be interpreted within a ContextNode associated
with the interaction channel.
Chapter 27
Integrating CogPrime with a Compositional
Spatiotemporal Deep Learning Network
27.1 Introduction
Many different approaches to "low-level" perception and action processing are possible within
the overall CogPrime framework. We discussed several in the previous chapter, all elaborations
of the general hierarchical pattern recognition approach. Here we describe one sophisticated ap-
proach to hierarchical pattern recognition based perception in more detail: the tight integration
of CogPrime with a sophisticated hierarchical perception/action oriented learning system such
as the DeSTIN architecture reviewed in Chapter 4 of Part 1.
We introduce here the term "Compositional Spatiotemporal Deep Learning Network" (CS-
DLN), to refer to deep learning networks whose hierarchical structure directly mirrors the
hierarchical structure of spacetime. In the language of Chapter 26, a CSDLN is a special kind
of HPAN (hierarchical perception action network), which has the special property that each
of its nodes refers to a certain spatiotemporal region and is concerned with predicting what
happens inside that region. Current exemplifications of the CSDLN paradigm include the DeS-
TIN architecture that we will focus on here, along with Jeff Hawkins' Numenta "HTM" system
[HB06] 1, Itamar Arel's DeSTIN [ARC09a], Itamar Arel's HDRN 2 system (the proprietary,
closed-source sibling of DeSTIN), Dileep George's spin-off from Numenta 3, and work by Mo-
hamad Tarifi [TSH11], Bundzel and Hashimoto [BH10], and others. CSDLNs are reasonably
well proven as an approach to intelligent sensory data processing, and have also been hypothe-
sized as a broader foundation for artificial general intelligence at the human level and beyond
[HB06, ARC09a].
While CSDLNs have been discussed largely in the context of perception, the specific form
of CSDLN we will pursue here goes beyond perception processing, and involves the coupling
of three separate hierarchies, for perception, action and goals/reinforcement [G14, G+10]. The
"action" CSDLNs discussed here correspond to the procedural HPAN discussed in Chapter
26. Abstract learning and self-understanding are then hypothesized as related to systems of
attractors emerging from the close dynamic coupling of the upper levels of the three hierarchies.
1 While the Numenta system is the best-known CSDLN architecture, other CSDLNs appear more impressively
functional in various respects; and many CSDLN-related ideas existed in the literature well before Numenta's
advent.
2 http://binatix.com
3 http://vicarioussystems.com
DeSTIN is our paradigm case of this sort of CSDLN, but most of the considerations given here
would apply to any CSDLN of this general character.
CSDLNs embody a certain conceptual model of the nature of intelligence, and to integrate
them appropriately with a broader architecture, one must perform the integration not only
on the level of software code but also on the level of conceptual models. Here we focus
on the problem of integrating an extended version of the DeSTIN CSDLN system with the
CogPrime integrative AGI (artificial general intelligence) system. The crux of the issue here is
how to map DeSTIN's attractors into CogPrime's more abstract, probabilistic "weighted, la-
beled hypergraph" representation (called the Atomspace). The main conclusion reached is that
in order to perform this mapping in a conceptually satisfactory way, one requires a system of
intermediary hierarchies mirroring the structure of DeSTIN's network but bearing the semantic
structures of the Atomspace. The DeSTIN perceptual hierarchy is augmented by motor and goal hier-
archies, leading to a tripartite "extended DeSTIN". In this spirit, three "semantic-perceptual"
hierarchies are proposed, corresponding to the three extended-DeSTIN CSDLN hierarchies and
explicitly constituting an intermediate level of representation between attractors in DeSTIN
and the habitual cognitive usage of CogPrime Atoms and Atom-networks. For simple reference
we refer to this as the "Semantic CSDLN" approach.
A "tripartite semantic CSDLN" consisting of interlinked semantic perceptual, motoric and
goal hierarchies could be coupled with DeSTIN or another CSDLN architecture to form a novel
AGI approach; or (our main focus here) it may be used as a glue between a CSDLN and
a more abstract semantic network such as the cognitive Atoms in CogPrime's Atomspace.
One of the core intuitions underlying this integration is that, in order to achieve the desired
level of functionality for tasks like picture interpretation and assembly of complex block struc-
tures, a convenient route is to perform a fairly tight integration of a highly capable CSDLN like
DeSTIN with other CogPrime components. For instance, we believe it's necessary to go deeper
than just using DeSTIN as an input/output layer for CogPrime, by building associative links
between the nodes inside DeSTIN and those inside the Atomspace.
This "tightly linked integration" approach is obviously an instantiation of the general cogni-
tive synergy principle, which hypothesizes particular properties that the interactions between
components in an integrated AGI system should display, in order for the overall system to dis-
play significant general intelligence using limited computational resources. Simply piping output
from a CSDLN to other components, and issuing control signals from these components to the
CSDLN, is likely an inadequate mode of integration, incapable of leveraging the full potential
of CSDLNs; what we are suggesting here is a much tighter and more synergetic integration.
In terms of the general principle of mind-world correspondence, the conceptual justification
for CSDLN/CogPrime integration would be that the everyday human world contains many com-
positional spatiotemporal structures relevant to human goals, but also contains many relevant
patterns that are not most conveniently cast into a compositional spatiotemporal hierarchy.
Thus, in order to most effectively perceive, remember, represent, manipulate and enact the full
variety of relevant patterns in the world, it is sensible to have a cognitive structure containing
a CSDLN as a significant component, but not the only component.
27.2 Integrating CSDLNs with Other AI Frameworks
CSDLNs represent knowledge as attractor patterns spanning multiple levels of hierarchical net-
works, supported by nonlinear dynamics and (at least in the case of the overall DeSTIN design)
involving cooperative activity of perceptual, motor and control networks. These attractors are
learned and adapted via a combination of methods including localized pattern recognition al-
gorithms and probabilistic inference. Other AGI paradigms represent and learn knowledge in a
host of other ways. How then can CSDLNs be integrated with these other paradigms?
A very simple form of integration, obviously, would be to use a CSDLN as a sensorimotor
cortex for another AI system that's focused on more abstract cognition. In this approach, the
CSDLN would stream state-vectors to the abstract cognitive system, and the abstract cognitive
system would stream abstract cognitive inputs to the CSDLN (which would then consider them
together with its other inputs). One thing missing in this approach is the possibility of the
abstract cognitive system's insights biasing the judgments inside the CSDLN. Also, abstract
cognition systems aren't usually well prepared to handle a stream of quantitative state vectors
(even ones representing intelligent compressions of raw data).
An alternate approach is to build a richer intermediate layer, which in effect translates
between the internal language of the CSDLN and the internal language of the other AI system
involved. The particulars, and the viability, of this will depend on the particulars of the other
AI system. What we'll consider here is the case where the other AI system contains explicit
symbolic representations of patterns (including patterns abstracted from observations that may
have no relation to its prior knowledge or any linguistic terms). In this case, we suggest, a
viable approach may be to construct a "semantic CSDLN" to serve as an intermediary. The
semantic CSDLN has the same hierarchical structure as a CSDLN, but inside each node it
contains abstract patterns rather than numerical vectors. This approach has several potential
major advantages: the other AI system is not presented with a large volume of numerical vectors
(which it may be unprepared to deal with effectively); the CSDLN can be guided by the other AI
system, without needing to understand symbolic control signals; and the intermediary semantic
CSDLN can serve as a sort of "blackboard" which the CSDLN and the other AI system can
update in parallel, and be guided by in parallel, thus providing a platform encouraging "cognitive
synergy".
The following sections go into more detail on the concept of semantic CSDLNs. The discussion
mainly concerns the specific context of DeSTIN/CogPrime integration, but the core ideas would
apply to the integration of any CSDLN architecture with any other AI architecture involving
uncertain symbolic representations susceptible to online learning.
27.3 Semantic CSDLN for Perception Processing
In the standard perceptual CSDLN hierarchy, a node N on level k (considering level 1 as the
bottom) corresponds to a spatiotemporal region S with size s_k (s_k increasing monotonically and
usually exponentially with k); and has children on level k-1 corresponding to spatiotemporal
regions that collectively partition S. For example, a node on level 3 might correspond to a 16x16
pixel region S of 2D space over a time period of 10 seconds, and might have 4 level 2 children
corresponding to disjoint 8x8 regions of 2D space over 10 seconds, collectively composing S.
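The fixed spatial bookkeeping of the standard hierarchy can be made concrete with a quadtree-style sketch. The region representation and the factor-of-two scaling are illustrative assumptions; real CSDLNs would also carry the temporal dimension and per-node state.

```python
# Minimal quadtree-style sketch of the standard CSDLN spatial layout: a
# node on level k covers a square region, and its four children on level
# k-1 partition that region. Region encoding (x, y, size) is illustrative.

def children_of(region):
    """Split (x, y, size) into the 4 disjoint sub-regions that compose it."""
    x, y, size = region
    half = size // 2
    return [(x, y, half), (x + half, y, half),
            (x, y + half, half), (x + half, y + half, half)]

def build_hierarchy(top_size, levels):
    """Return {level: list of regions}, level `levels` being the top."""
    tiers = {levels: [(0, 0, top_size)]}
    for k in range(levels - 1, 0, -1):
        tiers[k] = [c for region in tiers[k + 1] for c in children_of(region)]
    return tiers

tiers = build_hierarchy(top_size=16, levels=3)
# level 3: one 16x16 region; level 2: four 8x8 children; level 1: sixteen 4x4
```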
This kind of hierarchy is very effective for recognizing certain types of visual patterns. How-
ever it is cumbersome for recognizing some other types of patterns, e.g. the pattern that a face
typically contains two eyes beside each other, but at variable distance from each other.
One way to remedy this deficiency is to extend the definition of the hierarchy, so that nodes do
not refer to fixed spatial or temporal positions, but only to relative positions. In this approach,
the internals of a node are basically the same as in a CSDLN, and the correspondence of the
nodes on level k with regions of size s_k is retained, but the relationships between the nodes are
quite different. For instance, a variable-position node of this sort could contain several possible
2D pictures of an eye, but be nonspecific about where the eye is located in the 2D input image.
Figure 27.1 depicts this "semantic-perceptual CSDLN" idea heuristically, showing part of a
semantic-perceptual CSDLN indicating the parts of a face, and also the connections between
the semantic-perceptual CSDLN, a standard perceptual CSDLN, and a higher-level cognitive
semantic network like CogPrime's Atomspace.
More formally, in the suggested "semantic-perceptual CSDLN" approach, a node N on level
k, instead of pointing to a set of level k-1 children, points to a small (but not necessarily
connected) semantic network, such that the nodes of the semantic network are (variable-
position) level k-1 nodes; and the edges of the semantic network possess labels repre-
senting spatial or temporal relationships, for example horizontally_aligned, vertically_aligned,
right_side, left_side, above, behind, immediately_right, immediately_left, immediately_above,
immediately_below, after, immediately_after. The edges may also be weighted either with num-
bers or probability distributions, indicating the quantitative weight of the relationship indicated
by the label.
So for example, a level 3 node could have a child network of the form horizontally_aligned(N1, N2),
where N1 and N2 are variable-position level 2 nodes. This would mean that N1 and N2 are along
the same horizontal axis in the 2D input but don't need to be immediately next to each other.
Or one could say, e.g. on_axis_perpendicular_to(N1, N2, N3, N4), meaning that N1 and N2
are on an axis perpendicular to the axis between N3 and N4. It may be that the latter sort of
relationship is fundamentally better in some cases, because horizontally_aligned is still tied to
a specific orientation in an absolute space, whereas on _axis _perpendicular _to is fully relative.
But it may be that both sorts of relationship are useful.
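A semantic-perceptual node with a labeled child network might be encoded along the following lines. The data layout, weights, and the toy relation matcher are illustrative assumptions; only the relation names follow the examples in the text.

```python
# Illustrative encoding of a semantic-perceptual node whose children form
# a small labeled, weighted semantic network rather than a fixed spatial
# grid. The matcher below is a toy, for demonstration only.

face_node = {
    "level": 3,
    "child_network": [
        # (relation, node_a, node_b, weight)
        ("horizontally_aligned", "left_eye", "right_eye", 0.9),
        ("above", "eyes", "nose", 0.8),
        ("above", "nose", "mouth", 0.8),
    ],
}

def relation_holds(relation, pos_a, pos_b):
    """Toy check of a spatial relation between 2D positions (x, y),
    with y increasing downward as in image coordinates."""
    if relation == "horizontally_aligned":
        return pos_a[1] == pos_b[1]
    if relation == "above":
        return pos_a[1] < pos_b[1]
    return False

# Candidate positions for the child nodes in some input image (invented):
positions = {"left_eye": (2, 3), "right_eye": (6, 3),
             "eyes": (4, 3), "nose": (4, 5), "mouth": (4, 7)}

# Score the match: sum the weights of the satisfied labeled edges.
score = sum(w for rel, a, b, w in face_node["child_network"]
            if relation_holds(rel, positions[a], positions[b]))
```

Because the relations constrain only relative position, the same child network matches a face anywhere in the input image, and at variable eye spacing, which is exactly what the fixed-position hierarchy handles poorly.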
Next, development of learning algorithms for semantic CSDLNs seems a tractable research
area. First of all, it would seem that, for instance, the DeSTIN learning algorithms could
straightforwardly be utilized in the semantic CSDLN case, once the local semantic networks
involved in the network are known. So at least for some CSDLN designs, the problem of learning
the semantic networks may be decoupled somewhat from the learning occurring inside the nodes.
DeSTIN nodes deal with clustering of their inputs, and calculation of probabilities based on
these clusters (and based on the parent node states). The difference between the semantic
CSDLN and the traditional DeSTIN CSDLN has to do with what the inputs are.
Fig. 27.1: Simplified depiction of the relationship between a semantic-perceptual CSDLN, a tra-
ditional perceptual CSDLN (like DeSTIN), and a cognitive semantic network (like CogPrime's
AtomSpace).

4 The perceptual CSDLN shown is unrealistically small for complex vision processing (only 4 layers), and only
a fragment of the semantic-perceptual CSDLN is shown (a node corresponding to the category face, and then
a child network containing nodes corresponding to several components of a typical face). In a real semantic-
perceptual CSDLN, there would be many other nodes on the same level as the face node, many other parts
to the face subnetwork besides the eyes, nose and mouth depicted here; the eye, nose and mouth nodes would
also have child subnetworks; there would be links from each semantic node to centroids within a large number of
perceptual nodes; and there would also be many nodes not corresponding clearly to any single English language
concept like eye, nose, face, etc.

Regarding learning the local semantic networks, one relatively straightforward approach
would be to data mine them from a standard CSDLN. That is, if one runs a standard CSDLN
on a stream of inputs, one can then run a frequent pattern mining algorithm to find
semantic networks (using a given vocabulary of semantic relationships) that occur frequently
in the CSDLN as it processes input. A subnetwork that is identified via this sort of mining,
can then be grouped together in the semantic CSDLN, and a parent node can be created and
pointed to it.
Also, the standard CSDLN can be searched for frequent patterns involving the clusters
(referring to DeSTIN here, where the nodes contain clusters of input sequences) inside the
nodes in the semantic CSDLN. Thus, in the "semantic DeSTIN" case, we have a feedback
interaction wherein:
1) the standard CSDLN is formed via processing input;
2) frequent pattern mining on the standard CSDLN is used to create subnetworks and cor-
responding parent nodes in the semantic CSDLN;
3) the newly created nodes in the semantic CSDLN get their internal clusters updated via
standard DeSTIN dynamics;
4) the clusters in the semantic nodes are used as seeds for frequent pattern mining on the
standard CSDLN, returning us to Step 2 above.
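The feedback loop above can be sketched schematically. The mining and clustering steps are stand-ins (frequency counting over symbolic observations, and set union); only the control flow mirrors the four-step description, and all names are illustrative.

```python
from collections import Counter

# Schematic sketch of the four-step "semantic DeSTIN" feedback loop.
# "Observations" are symbolic tokens standing in for CSDLN states.

def process_input(stream):
    # Step 1: the standard CSDLN "forms" by observing the input stream.
    return list(stream)

def mine_frequent(observations, min_count=2):
    # Step 2: frequent pattern mining over the standard CSDLN's states.
    counts = Counter(observations)
    return {pat for pat, n in counts.items() if n >= min_count}

def semantic_loop(stream, iterations=2):
    semantic_nodes = {}                    # parent node -> internal "clusters"
    observations = process_input(stream)   # step 1
    seeds = set()
    for _ in range(iterations):
        for pat in mine_frequent(observations) | seeds:    # step 2 (seeded)
            semantic_nodes.setdefault(pat, set())          # new parent node
        for pat, clusters in semantic_nodes.items():       # step 3: update
            clusters.update(o for o in observations if o == pat)
        seeds = set(semantic_nodes)                        # step 4: reseed step 2
    return semantic_nodes

nodes = semantic_loop(["edge_h", "edge_h", "corner", "edge_v", "edge_v"])
```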
After the semantic CSDLN is formed via mining the perceptual CSDLN, it may be used to
bias the further processing of the perceptual CSDLN. For instance, in DeSTIN each node carries
out probabilistic calculations involving knowledge of the prior probability of the "observation"
coming into that node over a given interval of time. In the current DeSTIN version, this prior
probability is drawn from a uniform distribution, but it would be more effective to draw the
prior probability from the semantic network - observations matching things represented in the
semantic network would get a higher prior probability. One could also use subtler strategies
such as using imprecise probabilities in DeSTIN [Goe11], and assigning a greater confidence
to probabilities involving observations contained in the semantic network.
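The simplest form of this biasing might look as follows. The `boost` factor and the binary matched/unmatched distinction are invented parameters; a real implementation would grade the boost by the strength of the semantic match.

```python
# Hedged sketch of biasing a node's priors with the semantic network:
# instead of a uniform prior over possible observations, observations
# matching semantically known patterns get extra prior mass. The boost
# factor is an invented parameter.

def biased_prior(observations, semantic_patterns, boost=3.0):
    """Return a normalized prior where semantically known observations
    receive `boost` times the weight of unknown ones."""
    weights = {o: (boost if o in semantic_patterns else 1.0)
               for o in observations}
    total = sum(weights.values())
    return {o: w / total for o, w in weights.items()}

prior = biased_prior(["face", "blob_a", "blob_b"], semantic_patterns={"face"})
# "face" now carries 3/5 of the prior mass instead of a uniform 1/3
```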
Finally, we note that the nodes and networks in the semantic CSDLN may either
• be linked into the nodes and links in a semantic network such as CogPrime's AtomSpace
• actually be implemented in terms of an abstract semantic network language like CogPrime's
AtomSpace (the strategy to be suggested in Chapter 29).
This allows us to think of the semantic CSDLN as a kind of bridge between the standard CSDLN
and the cognitive layer of an AI system. In an advanced implementation, the cognitive network
may be used to suggest new relationships between nodes in the semantic CSDLN, based on
knowledge gained via inference or language.
27.4 Semantic CSDLN for Motor and Sensorimotor Processing
Next we consider a semantic CSDLN that focuses on movement rather than sensation. In this
case, rather than a 2D or 3D visual space, one is dealing with an n-dimensional configura-
tion space (C-space). This space has one dimension for each degree of freedom of the agent
in question. The more joints with more freedom of movement an agent has, the higher the
dimensionality of its configuration space.
Using the notion of configuration space, one can construct a semantic-motoric CSDLN hi-
erarchy analogous to the semantic-perceptual CSDLN hierarchy. However, the curse of dimen-
sionality demands a thoughtful approach here. A square of side 2 can be tiled with 4 squares of
side 1, but a 50-dimensional cube of side 2 can be tiled with 2^50 50-dimensional cubes of side
1. If one is to build a CSDLN hierarchy in configuration space analogous to that in perceptual
space, some sort of sparse hierarchy is necessary.
There are many ways to build a sparse hierarchy of this nature, but one simple approach is
to build a hierarchy where the nodes on level k represent motions that combine the motions
represented by nodes on level k — 1. In this case the most natural semantic label predicates
would seem to be things like simultaneously, after, immediately_after, etc. So a level k node
represents a sort of "motion plan" composed by chaining together (serially and/or in parallel)
the motions encoded in level k-1 nodes. Overlapping regions of C-space correspond to different
complex movements that share some of the same component movements, e.g. if one is trying
to slap one person while elbowing another, or run while kicking a soccer ball forwards. Also
note, the semantic CSDLN approach reveals perception and motor control to have essentially
similar hierarchical structures, more so than with the traditional CSDLN approach and its
fixed-position perceptual nodes.
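The serial/parallel chaining of motions might be sketched with two combinators named after the relation labels suggested above. Motions as (name, duration) pairs and the simple start-time scheduler are illustrative assumptions.

```python
# Toy sketch of a level-k "motion plan" chaining level k-1 motions via
# the relation labels suggested above. A plan is a list of
# (start_time, (name, duration)) entries; the scheduler just assigns
# start times. All names and conventions are illustrative.

def after(plan, motion):
    """Schedule `motion` to run once everything in `plan` has finished."""
    end = max((start + dur for start, (_, dur) in plan), default=0.0)
    return plan + [(end, motion)]

def simultaneously(plan, motion, with_motion):
    """Start `motion` at the same time as an already-scheduled motion."""
    for start, (name, _) in plan:
        if name == with_motion:
            return plan + [(start, motion)]
    raise ValueError(with_motion + " not in plan")

plan = []
plan = after(plan, ("reach", 1.0))                             # starts at t=0
plan = after(plan, ("grasp", 0.5))                             # starts at t=1.0
plan = simultaneously(plan, ("gaze_at_target", 0.5), "grasp")  # also t=1.0
```

A level-k node holding this plan thereby encodes a composite movement built from level k-1 motions, without committing to absolute C-space coordinates.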
Just as the semantic-perceptual CSDLN is naturally aligned with a traditional perceptual
CSDLN, similarly a semantic-motoric CSDLN may be naturally aligned with a "motor CS-
DLN". A typical motoric hierarchy in robotics might contain a node corresponding to a robot
arm, with children corresponding to the hand, upper arm and lower arm; the hand node might
then contain child nodes corresponding to each finger, etc. This sort of hierarchy is intrinsically
spatiotemporal because each individual action of each joint of an actuator like an arm is in-
trinsically bounded in space and time. Perhaps the most ambitious attempt along these lines
is [AM01], which shows how perceptual and motoric hierarchies are constructed and aligned in
an architecture for intelligent automated vehicle control.
Figure 27.2 gives a simplified illustration of the potential alignment between a semantic-
motoric CSDLN and a purely motoric hierarchy (like the one posited above in the context of
extended DeSTIN). 5 In the figure, the motoric hierarchy is assumed to operate somewhat like
DeSTIN, with nodes corresponding to (at the lowest level) individual servomotors, and (on
higher levels) natural groupings of servomotors. The node corresponding to a set of servos is
assumed to contain centroids of clusters of trajectories through configuration space. The task
of choosing an appropriate action is then executed by finding the appropriate centroids for
the nodes. Note an asymmetry between perception and action here. In perception the basic
flow is bottom-up, with top-down flow used for modulation and for "imaginative" generation of
percepts. In action, the basic flow is top-down, with bottom-up flow used for modulation and for
imaginative, "fiddling around" style generation of actions. The semantic-motoric hierarchy then
contains abstractions of the C-space centroids from the motoric hierarchy - i.e., actions that
bind together different C-space trajectories that correspond to the same fundamental action
carried out in different contexts or under different constraints. Similarly to in the perceptual
case, the semantic hierarchy here serves as a glue between lower-level function and higher-level
cognitive semantics.
Fig. 27.2: Simplified depiction of the relationship between a semantic-motoric CSDLN, a motor
control hierarchy (illustrated by the hierarchy of servos associated with a robot arm), and a
cognitive semantic network (like CogPrime's AtomSpace).
5 In the figure, only a fragment of the semantic-motoric CSDLN is shown (a node corresponding to the "get
object" action category, and then a child network containing nodes corresponding to several components of the
action). In a real semantic-motoric CSDLN, there would be many other nodes on the same level as the get-object
node, many other parts to the get-object subnetwork besides the ones depicted here; the subnetwork nodes would
also have child subnetworks; there would be links from each semantic node to centroids within a large number
of motoric nodes; and there might also be many nodes not corresponding clearly to any single English language
concept like "grasp object", etc.
27.5 Connecting the Perceptual and Motoric Hierarchies with a Goal
Hierarchy
One way to connect perceptual and motoric CSDLN hierarchies is using a "semantic-goal CS-
DLN" bridging the semantic-perceptual and semantic-motoric CSDLNs. The semantic-goal CS-
DLN would be a "semantic CSDLN" loosely analogous to the perceptual and motor semantic
CSDLNs - and could optionally be linked into the reinforcement hierarchy of a tripartite CS-
DLN like extended DeSTIN. Each node in the semantic-goal CSDLN would contain implications
of the form "Context & Procedure → Goal", where Goal is one of the AI system's overall goals
or a subgoal thereof, and Context and Procedure refer to nodes in the perceptual and motoric
semantic CSDLNs respectively.
For instance, a semantic-goal CSDLN node might contain an implication of the form "I
perceive my hand is near object X & I grasp object X → I possess object X." This would be
useful if "I possess object X" were a subgoal of some higher-level system goal, e.g. if X were a
food object and the system had the higher-level goal of obtaining food.
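A store of such implications, and the lookup that exploits them, might be sketched as follows. The implication strengths, the string-based context matching, and the strongest-wins selection rule are invented for illustration.

```python
# Illustrative store of "Context & Procedure -> Goal" implications, and a
# lookup picking the procedure promising a desired goal in the current
# context. Strengths and names are invented for illustration.

implications = [
    # (context, procedure, goal, strength)
    ("hand_near(X)", "grasp(X)", "possess(X)", 0.9),
    ("possess(food)", "eat(food)", "obtain_energy", 0.8),
    ("see(X)", "reach_toward(X)", "hand_near(X)", 0.7),
]

def choose_procedure(goal, current_context):
    """Return the strongest-implication procedure for `goal` whose
    context clause is satisfied in the current context."""
    candidates = [(s, proc) for ctx, proc, g, s in implications
                  if g == goal and ctx in current_context]
    return max(candidates)[1] if candidates else None

proc = choose_procedure("possess(X)", current_context={"hand_near(X)"})
```

Note that the third implication makes "hand_near(X)" itself achievable, hinting at how such nodes chain into the hierarchy of progressively more localized subgoals discussed below.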
To the extent that the system's goals can be decomposed into hierarchies of progressively
more and more spatiotemporally localized subgoals, this sort of hierarchy will make sense,
leading to a tripartite hierarchy as loosely depicted in Figure 27.3. 6 One could attempt to
construct an overall AGI approach based on a tripartite hierarchy of this nature, counting on
the upper levels of the three hierarchies to come together dynamically to form an integrated
cognitive network, yielding abstract phenomena like language, self, reasoning and mathematics.
On the other hand, one may view this sort of hierarchy as a portion of a larger integrative AGI
architecture, containing a separate cognitive network, with a less rigidly hierarchical structure
and less of a tie to the spatiotemporal structure of physical reality. The latter view is the one
we are primarily taking within the CogPrime AGI approach, viewing perceptual, motoric and
goal hierarchies as "lower level" subsystems connected to a "higher level" system based on the
CogPrime AtomSpace and centered on its abstract cognitive processes.
Learning of the subgoals and implications in the goal hierarchy is of course a complex matter,
which may be addressed via a variety of algorithms, including online clustering (for subgoals or
implications) or supervised learning (for implications, the "supervision" being purely internal
and provided by goal or subgoal achievement).
6 The diagram is simplified in many ways, e.g. only a handful of nodes in each hierarchy is shown (rather than
the whole hierarchy), and lines without arrows are used to indicate bidirectional arrows, and nearly all links are
omitted. The purpose is just to show the general character of interaction between the components in a simplified
context.
[Figure: three panels, labeled SEMANTIC-PERCEPTUAL HTM, SEMANTIC-GOAL HTM and SEMANTIC-MOTORIC HTM]
Fig. 27.3: Simplified illustration of the proposed interoperation of perceptual, motoric and goal
semantic CSDLNs.
Chapter 28
Making DeSTIN Representationally Transparent
Co-authored with Itamar Arel
28.1 Introduction
In this chapter and the next we describe one particular incarnation of the above ideas on
semantic CSDLNs in more depth: the integration of CogPrime with the DeSTIN architecture
reviewed in Chapter 4 of Part 1.
One of the core intuitions underlying this integration is that, in order to achieve the de-
sired level of functionality for tasks like picture interpretation and assembly of complex block
structures, it will be necessary to integrate DeSTIN (or some similar system) and CogPrime
components fairly tightly - going deeper than just using DeSTIN as an input/output layer for
CogPrime, by building a number of explicit linkages between the nodes inside DeSTIN and
CogPrime respectively.
The general DeSTIN design has been described in talks as comprising three crosslinked hierarchies, handling perception, action and reinforcement; but so far only the perceptual hierarchy
(also called the "spatiotemporal inference network") has been implemented or described in de-
tail in publications. In this chapter we will focus on DeSTIN's perception hierarchy. We will
explain DeSTIN's perceptual dynamics and representations as we understand them, more thor-
oughly than was done in the brief review above; and we will describe a series of changes to the
DeSTIN design, made in the spirit of easing DeSTIN/OpenCog integration. In the following
chapter we will draw action and reinforcement into the picture, deviating somewhat in the de-
tails from the manner in which these things would be incorporated into a standalone DeSTIN,
but pursuing the same concepts in an OpenCog integration context.
What we describe here is a way to make a "Uniform DeSTIN", in which the internal repre-
sentation of perceived visual forms is independent of affine transformations (translation, scaling,
rotation and shear). This "representational transparency" means that, when Uniform DeSTIN
perceives a pattern: no matter how that pattern is shifted or linearly transformed, the way Uni-
form DeSTIN represents that pattern internally is going to be basically the same. This makes
it easy to look at a collection of DeSTIN states, obtained by exposing a DeSTIN perception
network to the world at different points in time, and see the commonalities in what they are
perceiving and how they are interpreting it. By contrast, in the original version of DeSTIN
(here called "classic DeSTIN"), it may take significant effort to connect the internal repre-
sentation of a visual pattern and the representation of its translated or linearly transformed
versions. The uniformity of Uniform DeSTIN makes it easier for humans to inspect DeSTIN's
state and understand what's going on, and also (more to the point) makes it easier for other
AI components to recognize patterns in sets of DeSTIN states. The latter fact is critical for the
DeSTIN/OpenCog integration.
28.2 Review of DeSTIN Architecture and Dynamics
The hierarchical architecture of DeSTIN's spatiotemporal inference network consists of multiple
layers of "nodes", each an instantiation of an identical processing unit. Each node corresponds
to a particular spatiotemporal region, and uses a statistical learning algorithm to characterize
the sequences of patterns presented to it by nodes in the layer beneath it.
More specifically, at the very lowest layer of the hierarchy nodes receive as input raw data
(e.g. pixels of an image) and continuously construct a belief state that attempts to characterize
the sequences of patterns viewed. The second layer, and all those above it, receive as input
the belief states of nodes at their corresponding lower layers, and attempt to construct belief
states that capture regularities in their inputs. Each node also receives as input the belief state
of the node above it in the hierarchy (which constitutes "contextual" information, utilized in
the node's prediction process).
Inside each node, an online clustering algorithm is used to identify regularities in the se-
quences received by that node. The centroids of the clusters learned are stored in the node and
comprise the basic visual patterns recognized by that node. The node's "belief" regarding what
it is seeing, is then understood as a probability density function defined over the centroids at
that node. The equations underlying this centroid formation and belief updating process are
identical for every node in the architecture, and were given in their original form in [ARC09a],
though the current open-source DeSTIN codebase reflects some significant improvements not
yet reflected in the publication record.
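A toy version of this per-node loop can be sketched as follows, assuming a plain online k-means-style centroid update and a belief formed by normalizing exp(-distance) scores. The published update equations also fold in the parent node's belief as context, which this sketch omits, so it is not the real DeSTIN algorithm, only a minimal stand-in:

```python
import math
import random

class DestinNodeSketch:
    """Toy stand-in for a DeSTIN node: online centroid update plus a belief,
    i.e. a probability distribution over the node's centroids.  The real
    update rule (Arel et al.) also uses the parent's belief as context."""

    def __init__(self, num_centroids, dim, lr=0.1, seed=0):
        rng = random.Random(seed)
        self.centroids = [[rng.random() for _ in range(dim)]
                          for _ in range(num_centroids)]
        self.lr = lr  # online learning rate (illustrative value)

    def _dist(self, a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def observe(self, x):
        """Move the winning centroid toward x, then return the belief:
        a normalized vector of exp(-distance) scores over all centroids."""
        dists = [self._dist(c, x) for c in self.centroids]
        win = min(range(len(dists)), key=dists.__getitem__)
        self.centroids[win] = [c + self.lr * (xi - c)
                               for c, xi in zip(self.centroids[win], x)]
        scores = [math.exp(-d) for d in dists]
        z = sum(scores)
        return [s / z for s in scores]

node = DestinNodeSketch(num_centroids=4, dim=16)
belief = node.observe([0.5] * 16)  # belief sums to 1 over the 4 centroids
print(len(belief), round(sum(belief), 6))
```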
In short, the way DeSTIN represents an item of knowledge is as a probability distribution
over "network activity patterns" in its hierarchical network. An activity pattern, at each point
in time, comprises an indication of which centroids in each node are most active, meaning they
have been identified as most closely resembling what that node has perceived, as judged in the
context of the perceptions of the other nodes in the system. Based on this methodology, the
DeSTIN perceptual network serves the critical role of building and maintaining a model of the
state of the world as visually perceived.
This methodology allows for powerful unsupervised classification. If shown a variety of real-
world scenes, DeSTIN will automatically form internal structures corresponding to the various
natural categories of objects shown in the scenes, such as trees, chairs, people, etc.; and also
to the various natural categories of events it sees, such as reaching, pointing, falling. In order
to demonstrate the informativeness of these internal structures, experiments have been done
using DeSTIN's states as input feature vectors for supervised learning algorithms, enabling
high-accuracy supervised learning of classification models from labeled image data [KARR]. A
closely related algorithm developed by the same principal researcher (Itamar Arel) has proven
extremely successful at audition tasks such as phoneme recognition [ABS+11].
28.2.1 Beyond Gray-Scale Vision
The DeSTIN approach may easily be extended to other senses beyond gray-scale vision. For
color vision, it suffices to replace the one-dimensional signals coming into DeSTIN's lower layer
with 3D signals representing points in the color spectrum; the rest of the DeSTIN process may
be carried over essentially without modification. Extensions to further senses are also relatively
straightforward on the mathematical and software structure level, though they may of course
require significant additional tuning and refinement of details.
For instance, olfaction does not lend itself well to hierarchical modeling, but audition and
haptics (touch) do:
• for auditory perception, one could use a DeSTIN architecture in which each layer is one-
dimensional rather than two-dimensional, with the dimension representing pitch. Or one could use two
dimensions for pitch and volume. This results in a system quite similar to the DeSTIN-like
system shown to perform outstanding phoneme recognition in [ABS+11], and is conceptually
similar to Hierarchical Hidden Markov Models (HHMMs), which have proven quite successful
in speech recognition and which Ray Kurzweil has argued are the central mechanism
of human intelligence [Kur12]. Note also recent results published by Microsoft Research,
showing dramatic improvements over prior speech recognition results based on use of a
broadly HHMM-like deep learning system.
• for haptic perception, one could use a DeSTIN architecture in which the lower layer of the
network possesses a 2D topology reflecting the topology of the surface of the body. Similar
to the somatosensory cortex in the human brain, the map could be distorted so that more
"pixels" are used for regions of the body from which more data is available (e.g. currently
this might be the fingertips, if these were implemented using Syntouch technology [Fil],
which has proved excellent at touch-based object identification). Input could potentially
be multidimensional if multiple kinds of haptic sensors were available, e.g. temperature,
pressure and movement as in the Syntouch case.
Augmentation of DeSTIN to handle action as well as perception is also possible, and will be
discussed in Chapter 29.
28.3 Uniform DeSTIN
It would be possible to integrate DeSTIN in its original form with OpenCog or other AI systems
with symbolic aspects, by using an unsupervised machine learning algorithm to recognize
patterns in sets of states of the DeSTIN network as originally defined. However, this pattern
recognition task becomes much easier if one suitably modifies DeSTIN, so as to make the com-
monalities between semantically similar states more obviously perceptible. This can be done by
making the library of patterns recognized within each DeSTIN node invariant with respect to
translation, scale, rotation and shear - a modification we call "Uniform DeSTIN." This "uni-
formization" decreases DeSTIN's degree of biological mimicry, but eases integration of DeSTIN
with symbolic AI methods.
28.3.1 Translation-Invariant DeSTIN
The first revision to the "classic DeSTIN" to be suggested here is: All the nodes on the same
level of the DeSTIN hierarchy should share the same library of patterns. In the context of classic
DeSTIN (i.e. in the absence of further changes to DeSTIN to be suggested below, which extend
the type of patterns usable by DeSTIN), this means: the nodes on the same level should share
the same list of centroids. This makes DeSTIN's pattern recognition capability translation-
invariant. This translation invariance can be achieved without any change to the algorithms for
updating centroids and matching inputs to centroids.
In this approach, it's computationally feasible to have a much larger library of patterns
utilized by each node, as compared to in classic DeSTIN. Suppose we have an n x n pixel grid,
where the lowest level has nodes corresponding to 4 x 4 squares. Then, there are (n/4)^2 nodes
on the lowest level, and on the k'th level there are (n/4^k)^2 nodes. This means that, without
increasing computational complexity (actually decreasing it, under reasonable assumptions), in
translation-invariant Uniform DeSTIN we can have a factor of (n/4^k)^2 more centroids on level k.
One can achieve a much greater decrease in computational complexity (with the same amount
of centroid increase) via use of a clever data structure like a cover tree [BKL06] to store the
centroids at each level. Then the nearest-neighbor matching of input patterns to the library
(centroid) patterns would be very rapid, much faster than linearly comparing the input to each
pattern in the list.
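The structural change is small: one centroid list per level instead of one per node. A minimal sketch follows, with brute-force nearest-neighbor search standing in for the cover tree (which would make the query roughly logarithmic in the library size); the library values are made up for illustration:

```python
import math

# One shared library per level (Uniform DeSTIN), instead of one per node.
# Brute force is used here for clarity; a cover tree or similar structure
# would replace the linear scan in a real implementation.
shared_library = {  # level -> list of centroid vectors (illustrative values)
    1: [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
}

def nearest_centroid(level, x):
    """Return (index, distance) of the closest centroid on the given level."""
    lib = shared_library[level]
    d = [math.dist(c, x) for c in lib]
    i = min(range(len(lib)), key=d.__getitem__)
    return i, d[i]

# Every node on level 1 answers identically -- the source of translation
# invariance: "centroid 3" means the same pattern wherever it is matched.
print(nearest_centroid(1, [0.9, 0.2]))
```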
28.3.1.1 Conceptual Justification for Uniform DeSTIN
Generally speaking, one may say that: if the class of images that the system will see is invariant
with respect to linear translations, then without loss of generality, we can assume that the
library of patterns at each node on the same level is the same.
In reality this assumption isn't quite going to hold. For instance, for an eye attached to a
person or humanoid robot, the top of the pixel grid will probably look at a person's hair more
often than the bottom ... because the person stands right-side-up more often than they stand
upside-down, and because they will often fixate the center of their view on a person's face, etc.
For this reason, we can recognize our friend's face better if we're looking at them directly, with
their face centered in our vision.
However, we suggest that this kind of peculiarity is not really essential to vision processing
for general intelligence. There's no reason you can't have an intelligent vision system that
recognizes a face just as well whether it's centered in the visual field or not. (In fact, one could
explicitly introduce this kind of bias within a translation-invariant DeSTIN, but it's not clear
this is a useful direction.)
By and large, in almost all cases, it seems to us that in a DeSTIN system exposed to a wide
variety of real-world inputs in complex situations, the library of patterns in the different nodes
at the same level would turn out to be substantially the same. Even if they weren't exactly
the same, they would be close to the same, embodying essentially the same regularities. But
of course, this sameness would be obscured, because centroid 7 in a certain node X on level 4
might actually be the same as centroid 18 in some other node Y on level 4 ... and there would
be no way to tell that centroid 7 in node X and centroid 18 in node Y were actually referring
to the same pattern, without doing a lot of work.
28.3.1.2 Comments on Biological Realism
Translation-invariant DeSTIN deviates further from human brain structure than classic DeS-
TIN, but this is for good reason.
The brain has a lot of neurons, since adding new neurons was fairly easy and cheap for
evolution; and tends to do things in a massively parallel manner, with great redundancy. For
the brain, it's not so problematically expensive to have the functional equivalent of a lot of
DeSTIN nodes on the same level, all simultaneously using and learning libraries of patterns
that are essentially identical to each other. Using current computer technology, on the other
hand, this sort of strategy is rather inefficient.
In the brain, messaging between separated regions is expensive, whereas replicating function
redundantly is cheap. In most current computers (with some partial exceptions such as GPUs),
messaging between separated regions is fairly cheap (so long as those regions are stored on
the same machine), whereas replicating function redundantly is expensive. Thus, even in cases
where the same concept and abstract mathematical algorithm can be effectively applied in
both the brain and a computer, the specifics needed for efficient implementation may be quite
different.
28.3.2 Mapping States of Translation-Invariant DeSTIN into the
Atomspace
Mapping classic DeSTIN's states into a symbolic pattern-manipulation engine like OpenCog is
possible, but relatively cumbersome. Doing the same thing with Uniform DeSTIN is much more
straightforward.
In Uniform DeSTIN, for example, Cluster 7 means the same thing in ANY node on level 4.
So after a Uniform DeSTIN system has seen a fair number of images, you can be pretty sure its
library of patterns is going to be relatively stable. Some clusters may come and go as learning
progresses, but there's going to be a large and solid library of clusters at each level that persists,
because all of its member clusters occur reasonably often across a variety of inputs.
Define a DeSTIN state-tree as a (quaternary) tree with one node for each DeSTIN node;
and living at each node, a small list of (integer pattern_code, float weight) pairs. That is, at
each node, the state-tree has a short list of the patterns that closely match a given state at that
node. The weights may be assumed between 0 and 1. The integer pattern codes have the same
meaning for every node on the same level.
As you feed DeSTIN inputs, at each point in time it will have a certain state, representable as
a state-tree. So, suppose you have a large database of DeSTIN state-trees, obtained by showing
various inputs to DeSTIN over a long period of time. Then, you can do various kinds of pattern
recognition on this database of state-trees.
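The state-tree definition above might be rendered as follows; the class and helper names are ours, chosen for this sketch:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class StateTreeNode:
    """One node of a DeSTIN state-tree: a short list of
    (pattern_code, weight) pairs plus up to four children (quaternary tree).
    In Uniform DeSTIN pattern codes are level-wide, so code 7 means the
    same pattern in every node on the same level."""
    patterns: List[Tuple[int, float]]            # e.g. [(7, 0.8), (12, 0.15)]
    children: List["StateTreeNode"] = field(default_factory=list)

def top_pattern(node: StateTreeNode) -> int:
    """The single best-matching pattern code at this node."""
    return max(node.patterns, key=lambda pw: pw[1])[0]

leaf = StateTreeNode(patterns=[(7, 0.8), (12, 0.15)])
root = StateTreeNode(patterns=[(3, 0.6)], children=[leaf] * 4)
print(top_pattern(root), top_pattern(root.children[0]))
```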
More formally, define a state-subtree as a (quaternary) tree with a single integer at each node.
Two state-subtrees may have various relationships with each other within a single state-tree -
for instance they may be adjacent to each other, or one may appear atop or below the other, etc.
In these terms, one interesting kind of pattern recognition to do is: Recognize frequent state-
subtrees in the stored library of state-trees; and then recognize frequent relationships between
these frequent state-subtrees. The latter relationships will form a kind of "image grammar,"
conceptually similar and formally related to those described in [ZM07]. Further, temporal patterns may be recognized in the same way as spatial ones, as part of the state-subtree grammar
(e.g. state-subtree A often occurs right before state-subtree B; state-subtree C often occurs
right before and right below state-subtree D; etc.).
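As a minimal illustration of this kind of mining, one can flatten each state-tree into parent-child pattern-code pairs and count the frequent ones. A fuller implementation would mine whole state-subtrees and their spatial and temporal relationships; the data and threshold here are invented for the example:

```python
from collections import Counter

# Each state-tree is encoded here as (code, [children...]); flattening it to
# (parent_code, child_code) edges is a drastic simplification of subtree mining.
def edges(tree):
    code, children = tree
    for child in children:
        yield (code, child[0])
        yield from edges(child)

database = [
    (3, [(7, []), (7, []), (12, []), (7, [])]),
    (3, [(7, []), (12, []), (7, []), (7, [])]),
    (5, [(7, []), (9, []), (9, []), (9, [])]),
]

counts = Counter(e for tree in database for e in edges(tree))
# Keep relationships seen at least 3 times: a crude "image grammar" rule set.
frequent = {pair: n for pair, n in counts.items() if n >= 3}
print(frequent)
```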
The flow of activation from OpenCog back down to DeSTIN is also fairly straightforward in
the context of translation-invariant DeSTIN. If relationships have been stored between concepts
in OpenCogPrime's memory and grammatical patterns between state-subtrees, then whenever
concept C becomes important in OpenCogPrime's memory, this can cause a top-down increase
in the probability of matching inputs to those DeSTIN node centroids whose activation would
cause the DeSTIN state-tree to contain the grammatical patterns corresponding to concept C.
28.3.3 Scale-Invariant DeSTIN
The next step, moving beyond translation invariance, is to make DeSTIN's pattern recognition
mostly (not wholly) scale invariant. We will describe a straightforward way to map centroids
on one level of DeSTIN, into centroids on the other levels of DeSTIN. This means that when a
centroid has been learned on one level, it can be experimentally ported to all the other levels,
to see if it may be useful there too.
To make the explanation of this mapping clear, we reiterate some DeSTIN basics in slightly
different language:
• A centroid on Level N is: a spatial arrangement (e.g. k x k square lattice) of beliefs of Level
N —1. (More generally it is a spatiotemporal arrangement of such beliefs, but we will ignore
this for the moment.)
• A belief on Level N is: a probability distribution over centroids on Level N. For heuristic
purposes one can think about this as a mixture of Gaussians, though this won't always be
the best model.
• Thus, a belief on Level N is: a probability distribution over spatial (or more generally,
spatiotemporal) arrangements of beliefs on Level N — 1
On Level 1, the role of centroids is played by simple k x k squares of pixels. Level 1 beliefs
are probability distributions over these small pixel squares. Level 2 centroids are hence spa-
tial arrangements of probability distributions over small pixel-squares; and Level 2 beliefs are
probability distributions over spatial arrangements of probability distributions over small pixel-
squares.
A small pixel-square S may be mapped into a single pixel P via a heuristic algorithm such
as:
• if S has more black than white pixels, then P is black
• if S has more white than black pixels, then P is white
• if S has an equal number of white and black pixels, then use some heuristic. For instance if
S is 4 x 4 you could look at the central 2 x 2 square and assign P to the color that occurs
most often there. If that is also a tie, then you can just arbitrarily assign P to the color
that occurs in the upper left corner of S.
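The heuristic above translates directly into code (binary pixels, 1 = black, 0 = white; the function name is ours):

```python
def square_to_pixel(square):
    """Map a small binary pixel square (list of rows, 1 = black, 0 = white)
    to a single pixel, following the tie-breaking heuristic in the text."""
    n = len(square)
    flat = [p for row in square for p in row]
    black, white = flat.count(1), flat.count(0)
    if black != white:
        return 1 if black > white else 0
    # Tie: for a 4x4 square, vote among the central 2x2 block.
    centre = [square[i][j] for i in (n // 2 - 1, n // 2)
                           for j in (n // 2 - 1, n // 2)]
    if centre.count(1) != centre.count(0):
        return 1 if centre.count(1) > centre.count(0) else 0
    # Still tied: fall back to the upper-left corner of S.
    return square[0][0]

s = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 0, 0]]
print(square_to_pixel(s))  # 6 black vs 10 white pixels, so the result is white
```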
A probability distribution over small pixel-squares may then be mapped into a probability
distribution over pixel values (B or W). A probability distribution over the two values B and
W may be approximately mapped into a single pixel value - the one that occurs most often
in the distribution, with a random choice made to break a tie. This tells us how to map Level 2
beliefs into spatial arrangements of pixels; and thus, it tells us how to map Level 2 beliefs into
Level 1 beliefs.
But this tells us how to map Level N beliefs into Level N-1 beliefs, inductively. Remember,
a Level N belief is a probability distribution (pdf for short) over spatial arrangements of beliefs
on Level N-1. For example, a Level 3 belief is a pdf over arrangements of Level 2 beliefs. But
since we can map Level 2 beliefs into Level 1 beliefs, this means we can map a Level 3 belief
into a pdf over arrangements of Level 1 beliefs - which means we can map a Level 3 belief into
a Level 2 belief. Etc.
Of course, this also tells us how to map Level N centroids into Level N-1 centroids. A
Level N centroid is a pdf over arrangements of Level N-1 beliefs; a Level N-1 centroid is a
pdf over arrangements of Level N-2 beliefs. But Level N-1 beliefs can be mapped into Level
N-2 beliefs, so Level N centroids can be represented as pdfs over arrangements of Level N-2
beliefs, and hence mapped into Level N-1 centroids.
In practice, one can implement this idea by moving from the bottom up. Given the mapping
from Level 1 "centroids" to pixels, one can iterate through the Level 1 beliefs and identify
which pixels they correspond to. Then one can iterate through the Level 2 beliefs and identify
which Level 1 beliefs they correspond to. Etc. Each Level N belief can be explicitly linked
to a corresponding level N — 1 belief. Synchronously, as one moves up the hierarchy, Level N
centroids can be explicitly linked to corresponding Level N — 1 centroids.
Since there are in principle more possible Level N beliefs than Level N-1 beliefs, the mapping
from Level N beliefs to Level N-1 beliefs is many-to-one. This is a reason not to simply maintain
a single centroid pool across levels. However, when a new centroid C is added to the Level N
pool, it can be mapped into a Level N — 1 centroid to be added to the Level N — 1 pool (if
not there already). And, it can also be used to spawn a Level N + 1 centroid, drawn randomly
from the set of possible Level N +1 centroids that map into C.
Also, note that it is possible to maintain a single centroid numbering system across levels, so
that a reference like "centroid #175" has only one meaning in an entire DeSTIN network, even
though some of these centroids may only be meaningful above a certain level in the network.
28.3.4 Rotation Invariant DeSTIN
With a little more work, one can make DeSTIN rotation and shear invariant as well 1. Consid-
ering rotation first:
• When comparing an input A to a Level N node with a Level N centroid B, consider various rotations of A, and see which rotation gives the closest match.
• When you match a centroid to an input observation-or-belief, record the rotation angle
corresponding to the match.
The second of these points implies the tweaked definitions:
• A centroid on Level N is: a spatial arrangement (e.g. k x k square lattice) of beliefs of Level
N —1
• A belief on Level N is: a probability distribution over (angle, centroid) pairs on Level N.
1 The basic idea in this section, in the context of rotation, is due to Jade O'Neill (private communication)
From these it follows that a belief on Level N is: a probability distribution over (angle, spatial
arrangement of beliefs) pairs on Level N — 1
An additional complexity here is that two different (angle, centroid) pairs (on the same level)
could be (exactly or approximately) equal to each other. This necessitates an additional step of
"centroid simplification", in which ongoing checks are made to see if there are any two centroids
C1, C2 on the same level so that: there exist angles A1, A2 so that (A1, C1) is very close to
(A2, C2). In this case the two centroids may be merged into one.
To apply these same ideas to shear, one may simply replace "rotation angle" in the above
by "(rotation angle, shear factor) pair."
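A sketch of such rotation-aware matching follows, restricted to the four 90-degree lattice rotations so that the comparison stays exact on a square grid. Real Uniform DeSTIN would consider finer-grained angles (and a shear factor); the function names and test patterns are ours:

```python
def rot90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def match_with_rotation(inp, centroid):
    """Compare an input block against a centroid under the four lattice
    rotations, returning (best_angle_degrees, best_squared_distance).
    Recording the winning angle is what the tweaked belief definition
    (angle, centroid) pairs require."""
    best = None
    g = [row[:] for row in inp]
    for angle in (0, 90, 180, 270):
        d = sum((a - b) ** 2
                for ra, rb in zip(g, centroid)
                for a, b in zip(ra, rb))
        if best is None or d < best[1]:
            best = (angle, d)
        g = rot90(g)
    return best

centroid = [[1, 0], [0, 0]]   # "ink" in the top-left corner
inp      = [[0, 0], [0, 1]]   # the same pattern rotated by 180 degrees
print(match_with_rotation(inp, centroid))  # -> (180, 0): exact match at 180
```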
28.3.5 Temporal Perception
Translation and scale invariant DeSTIN can be applied perfectly well if the inputs to DeSTIN,
at level 1, are movies rather than static images. Then, in the simplest version, Level 1 consists
of pixel cubes instead of pixel squares, etc. (the third dimension in the cube representing time).
The scale invariance achieved by the methods described above would then be scale invariance
in time as well as in space.
In this context, one may enable rectangular shapes as well as cubes. That is, one can look
at a Level N centroid consisting of m time-slices of a k x k arrangement of Level N-1 beliefs
- without requiring that m = k. This would make the centroid learning algorithm a little
more complex, because at each level one would want to consider centroids with various values
of m, from m = 1, ..., k (and potentially m > k also).
28.4 Interpretation of DeSTIN's Activity
Uniform DeSTIN constitutes a substantial change in how DeSTIN does its business of recog-
nizing patterns in the world - conceptually as well as technically. To explicate the meaning of
these changes, we briefly present our favored interpretation of DeSTIN's dynamics.
The centroids in the DeSTIN library represent points in "spatial pattern space", i.e. they
represent exemplary spatial patterns. DeSTIN's beliefs, as probability distributions over cen-
troids, represent guesses as to which of the exemplary spatial patterns are the best models of
what's currently being seen in a certain space-time region.
This matching between observations and centroids might seem to be a simple matter of
"nearest neighbor matching"; but the subtle point is, it's not immediately obvious how to
best measure the distance between observations and centroids. The optimal way of measuring
distance is going to depend on context; that is to say, on the actual distribution of observations
in the system's real environment over time.
DeSTIN's algorithm for calculating the belief at a node, based on the observation and cen-
troids at that node plus the beliefs at other nearby nodes, is essentially a way of tweaking the
distance measurement between observations and centroids, so that this measurement accounts
for the context (the historical distribution of observations). There are many possible ways of
doing this tweaking. Ideally one could use probability theory explicitly, but that's not always
going to be computationally feasible, so heuristics may be valuable, and various versions of
DeSTIN have contained various heuristics in this regard.
The various ways of "uniformizing" DeSTIN described above (i.e. making its pattern recognition
activity approximately invariant with respect to affine transformations) don't really affect
this story - they just improve the algorithm's ability to learn based on small amounts of data
(and its rapidity at learning from data in general), by removing the need for the system to
repeatedly re-learn transformed versions of the same patterns. So the uniformization just lets
DeSTIN carry out its basic activity faster and using less data.
28.4.1 DeSTIN's Assumption of Hierarchical Decomposability
Roughly speaking, DeSTIN will work well to the extent that: The average distance between each
part of an actually observed spatial pattern, and the closest centroid pattern, is not too large
(note: the choice of distance measure in this statement is potentially subtle). That is: DeSTIN's
set of centroids is supposed to provide a compact model of the probability distribution of spatial
patterns appearing in the experience of the cognitive system of which DeSTIN is a part.
DeSTIN's effective functionality relies on the assumption that this probability distribution
is hierarchically decomposable - i.e. that the distribution of spatial patterns appearing over a
k x k region can be compactly expressed, to a reasonable degree of approximation, as a spatial
combination of the distributions of spatial patterns appearing over (k/4) x (k/4) regions. This
assumption of hierarchical decomposability greatly simplifies the search problem that DeSTIN
faces, but also restricts DeSTIN's capability to deal with more general spatial patterns that are
not easily hierarchically decomposable. However, the benefits of this approach seem to outweigh
the costs, given that visual patterns in the environments humans naturally encounter do seem
(intuitively at least) to have this hierarchical property.
28.4.2 Distance and Utility
Above we noted that choice of distance measure involved in the assessment of DeSTIN's effec-
tive functionality is subtle. Further above, we observed that the function of DeSTIN's belief
assessment is basically to figure out the contextually best way to measure the distance between
the observation and the centroids at a node. These comments were both getting at the same
point.
But what is the right measure of distance between two spatial patterns? Ultimately, the right
measure is: the probability that the two patterns A and B can be used in the same way. That
is: the system wants to identify observation A with centroid B if it has useful action-patterns
involving B, and it can substitute A for B in these patterns without loss.
This is difficult to calculate in general, though - a rough proxy, which it seems will often be
acceptable, is to measure the distance between A and B in terms of both
• the basic (extensional) distance between the physical patterns they embody (e.g. pixel by
pixel distance)
• the contextual (intensional) distance, i.e. the difference between the contexts in which they
occur
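A hedged sketch of such a two-component distance follows. The context summaries (frequency dicts over co-occurring pattern codes) and the blending weight are illustrative assumptions, a drastic simplification of DeSTIN's parent-belief modulation:

```python
import math

def extensional(a, b):
    """Pixel-by-pixel distance between two flattened patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def intensional(ctx_a, ctx_b):
    """Distance between the contexts two patterns occur in, summarized here
    as frequency dicts over co-occurring pattern codes (an illustrative
    stand-in for DeSTIN's contextual belief information)."""
    keys = set(ctx_a) | set(ctx_b)
    return math.sqrt(sum((ctx_a.get(k, 0.0) - ctx_b.get(k, 0.0)) ** 2
                         for k in keys))

def distance(a, b, ctx_a, ctx_b, w=0.5):
    """Blend of the two components; the weight w is a free parameter that a
    real system would tune to the observed environment."""
    return (1 - w) * extensional(a, b) + w * intensional(ctx_a, ctx_b)

# Two pixel-wise different patterns that occur in very similar contexts
# end up closer than their raw pixel distance alone would suggest.
A, B = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
ctx_A = {"cup": 0.9, "table": 0.1}
ctx_B = {"cup": 0.8, "table": 0.2}
print(round(distance(A, B, ctx_A, ctx_B), 4))
```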
Via enabling the belief in a node's parent to play a role in modulating a certain node's be-
lief, DeSTIN's core algorithm enables contextual/intensional factors to play a role in distance
assessment.
28.5 Benefits and Costs of Uniform DeSTIN
We now summarize the main benefits and costs of Uniform DeSTIN a little more systematically.
The key point we have made here regarding Uniform DeSTIN and representational transparency
may be summarized as follows:
• Define an "affine perceptual equivalence class" as a set of percepts that are equivalent to
each other, or nearly so, under affine transformation. An example would be views of the
same object from different perspectives or distances.
• Suppose one has an embodied agent using DeSTIN for visual perception, whose perceptual
stream tends to include a lot of reasonably large affine perceptual equivalence classes.
• Then, supposing the "mechanics" of DeSTIN can be transferred to the Uniform DeSTIN
case without dramatic loss of performance, Uniform DeSTIN should be able to recognize
patterns based on many fewer examples than classic DeSTIN.
As soon as Uniform DeSTIN has learned to recognize one element of a given affine perceptual
equivalence class, it can recognize all of them. Whereas, classic DeSTIN must learn each element
of the equivalence class separately. So, roughly speaking, the number of cases required for
unsupervised training of Uniform DeSTIN will be less than that for classic DeSTIN, by a ratio
equal to the average size of the affine perceptual equivalence classes in the agent's perceptual
stream.
Counterbalancing this, we have the performance cost of comparing the input to each node
against a much larger set of centroids (in Uniform DeSTIN as opposed to classic DeSTIN).
However, if a cover tree or other efficient data structure is used, this cost is not so onerous.
The cost of nearest neighbor queries in a cover tree storing n items (in this case, n centroids) is
O(c^12 log n), where the constant c represents the "intrinsic dimensionality" (expansion constant)
of the data; and in practice the cover tree search algorithm seems to perform quite well.
for online clustering in Uniform DeSTIN as opposed to DeSTIN, is a factor on the order of the
log of the number of nodes in the DeSTIN tree. We believe this moderate added time cost is
well worth paying, to gain a significant decrease in the number of training examples required
for unsupervised learning.
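To make the cost comparison concrete, here is a minimal sketch of the per-node centroid lookup; the brute-force scan below is exactly what a cover tree (or another metric index) would replace, cutting the per-query cost from O(n) to O(c^12 log n). The function name and data layout are illustrative:

```python
def nearest_centroid(x, centroids):
    # Brute-force nearest-neighbor scan over the node's centroid library:
    # O(n) per query, where n is the number of centroids.  A cover tree
    # over the same set answers the query in O(c^12 log n) time.
    best_idx, best_d2 = None, float("inf")
    for i, c in enumerate(centroids):
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        if d2 < best_d2:
            best_idx, best_d2 = i, d2
    return best_idx
```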
Beyond increases in computational cost, there is also the risk that the online clustering
may just not work as well when one has so many clusters in each node. This is the sort of
problem that can really only be identified, and dealt with, during extensive practice - since the
performance of any clustering algorithm is largely determined by the specific distribution of
the data it's dealing with. It may be necessary to improve DeSTIN's online clustering in some
way to make Uniform DeSTIN work optimally, e.g. improving its ability to form clusters with
markedly non-spherical shapes. This ties in to a point raised in chapter 29 - the possibility of
supplementing traditional clusters with predicates learned by CogPrime, which may live inside
DeSTIN nodes alongside centroids. Each such predicate in effect defines a (generally nonconvex)
"cluster".
28.6 Imprecise Probability as a Tool for Linking CogPrime and
DeSTIN
One key aspect of vision processing is the ability to preferentially focus attention on certain
positions within a perceived visual scene. In this section we describe a novel strategy for enabling
this in a hybrid CogPrime /DeSTIN system, via use of imprecise probabilities. In fact the basic
idea suggested here applies to any probabilistic sensory system, whether deep-learning-based
or not, and whether oriented toward vision or some other sensory modality. However, for sake
of concreteness, we will focus here on the case of DeSTIN/CogPrime integration.
28.6.1 Visual Attention Focusing
Since visual input streams contain vast amounts of data, it's beneficial for a vision system to
be able to focus its attention specifically on the most important parts of its input. Sometimes
knowledge of what's important will come from cognition and long-term memory, but sometimes
it may come from mathematical heuristics applied to the visual data itself.
In the human visual system the latter kind of "low level attention focusing" is achieved
largely in the context of the eye changing its focus frequently, looking preferentially at certain
positions in the scene [Rin09]. This works because the center of the retina corresponds to a greater
density of neurons than the periphery.
So for example, consider a computer vision algorithm like SIFT (Scale-Invariant Feature
Transform) [Low99], which (as shown in Figure 28.1) mathematically isolates certain points in
a visual scene as "keypoints" which are particularly important for identifying what the scene
depicts (e.g. these may be corners, or easily identifiable curves in edges). The human eye, when
looking at a scene, would probably spend a greater percentage of its time focusing on the SIFT
keypoints than on random points in the image.
The human visual system's strategy for low-level attention focusing is obviously workable
(at least in contexts similar to those in which the human eye evolved), but it's also somewhat
complex, requiring the use of subtle temporal processing to interpret even static scenes. We
suggest here that there may be a simpler way to achieve the same thing, in the context of
vision systems that are substantially probabilistic in nature, via using imprecise probabilities.
The crux of the idea is to represent the most important data, e.g. keypoints, using imprecise
probability values with greater confidence.
Similarly, cognition-guided visual attention-focusing occurs when a mind's broader knowledge
of the world tells it that certain parts of the visual input may be more interesting to study than
others. For example, in a picture of a person walking down a dark street, the contours of the
person may not be tremendously striking visually (according to SIFT or similar approaches);
but even so, if the system as a whole knows that it's looking at a person, it may decide to focus
extra visual attention on anything person-like. This sort of cognition guided visual attention
focusing, we suggest, may be achieved similarly to visual attention focusing guided on lower-
level cues - by increasing the confidence of the imprecise probabilities associated with those
aspects of the input that are judged more cognitively significant.
28.6.2 Using Imprecise Probabilities to Guide Visual Attention
Focusing
Suppose one has a vision system that internally constructs probabilistic values corresponding to
small local regions in visual input (these could be pixels or voxels, or something a little larger),
and then (perhaps via a complex process) assigns probabilities to different interpretations of the
input based on combinations of these input-level probabilities. For this sort of vision system,
one may be able to achieve focusing of attention via appropriately replacing the probabilities
with imprecise probabilities. Such an approach may be especially interesting in hierarchical
vision systems, that also involve the calculation of probabilities corresponding to larger regions
of the visual input. Examples of the latter include deep learning based vision systems like HTM
or DeSTIN, which construct nested hierarchies corresponding to larger and larger regions of the
input space, and calculate probabilities associated with each of the regions on each level, based
in part on the probabilities associated with other related regions.
In this context, we now state the basic suggestion of the section:
1. Assign higher confidence to the low-level probabilities that the vision system creates cor-
responding to the local visual regions that one wants to focus attention on (based on cues
from visual preprocessing or cognitive guidance)
2. Carry out the vision system's processing using imprecise probabilities rather than single-
number probabilities
3. Wherever the vision system makes a decision based on "the most probable choice" from a
number of possibilities, change the system to make a decision based on "the choice maxi-
mizing the product (expectation * confidence)".
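A minimal sketch of this three-step recipe, using a two-number (strength, confidence) representation of imprecise probabilities; the class and function names are illustrative, not actual OpenCog or DeSTIN APIs:

```python
from dataclasses import dataclass

@dataclass
class ImpreciseProb:
    strength: float    # point estimate of the probability
    confidence: float  # how much evidence stands behind the estimate

def boost_attended(percepts, attended, factor=2.0):
    # Step 1: raise the confidence of the probabilities attached to regions
    # flagged by preprocessing (e.g. SIFT keypoints) or cognitive guidance.
    return {
        region: ImpreciseProb(
            p.strength,
            min(1.0, p.confidence * factor) if region in attended else p.confidence)
        for region, p in percepts.items()
    }

def choose(hypotheses):
    # Step 3: pick the hypothesis maximizing strength * confidence,
    # instead of strength alone.
    return max(hypotheses,
               key=lambda h: hypotheses[h].strength * hypotheses[h].confidence)
```

Note how `choose` can prefer a hypothesis with lower strength if it rests on higher-confidence (attended) observations, which is the intended effect.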
28.6.3 Sketch of Application to DeSTIN
Internally to DeSTIN, probabilities are assigned to clusters associated with local regions of the
visual input. If a system such as SIFT is run as a preprocessor to DeSTIN, then those small
regions corresponding to SIFT keypoints may be assumed semantically meaningful, and internal
DeSTIN probabilities associated with them can be given a high confidence. A similar strategy
may be taken if a cognitive system such as OpenCog is run together with DeSTIN, feeding
DeSTIN information on which portions of a partially-processed image appear most cognitively
relevant. The probabilistic calculations inside DeSTIN can be replaced with corresponding cal-
culations involving imprecise probabilities. And critically, there is a step in DeSTIN where,
among a set of beliefs about the state in each region of an image (on each of a set of hierarchi-
cal levels), the one with the highest probability is selected. In accordance with the above recipe,
this step should be modified to select the belief with the highest probability * confidence.
28.6.3.1 Conceptual Justification
What is the conceptual justification for this approach?
One justification is obtained by assuming that each percept has a certain probability of being
erroneous, and those percepts that appear to more closely embody the semantic meaning of the
visual scene are less likely to be erroneous. This follows conceptually from the assumption that
the perceived world tends to be patterned and structured, so that being part of a statistically
significant pattern is (perhaps weak) evidence of being real rather than artifactual. Under this
assumption, the proposed approach will maximize the accuracy of the system's judgments.
A related justification is obtained by observing that this algorithmic approach follows from
the consideration of the perceived world as mutable. Consider a vision system that has the
capability to modify even the low-level percepts that it intakes - i.e. to use what it thinks
and knows, to modify what it sees. The human brain certainly has this potential [Cha09]. In
this case, it will make sense for the system to place some constraints regarding which of its
percepts it is more likely to modify. Confidence values semantically embody this - a higher
confidence being sensibly assigned to percepts that the system considers should be less likely to
be modified based on feedback from its higher (more cognitive) processing levels. In that case,
a higher confidence should be given to those percepts that seem to more closely embody the
semantic meaning of the visual scene - which is exactly what we're suggesting here.
28.6.3.2 Enabling Visual Attention Focusing in DeSTIN via Imprecise
Probabilities
We now refer back to the mathematical formulation of DeSTIN summarized in Section 4.3.1
of Chapter 4 above, in the context of which the application of imprecise probability based
attention focusing to DeSTIN is almost immediate.
The probabilities P(ols) may be assigned greater or lesser confidence depending on the
assessed semantic criticality of the observation o in question. So for instance, if one is using
SIFT as a preprocessor to DeSTIN, then one may assign probabilities P(ols) higher confidence
if they correspond to observations o of SIFT keypoints, than if they do not.
These confidence levels may then be propagated throughout DeSTIN's probabilistic mathe-
matics. For instance, if one were using Walley's interval probabilities, then one could carry out
the probabilistic equations using interval arithmetic.
Finally, one wishes to replace Equation 4.3.1.2 in Chapter 4 with
c = arg max_s ((b′(s)).strength · (b′(s)).confidence) ,    (28.1)
or some similar variant. The effect of this is that hypotheses based on high-confidence observa-
tions are more likely to be chosen, which of course has a large impact on the dynamics of the
DeSTIN network.
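To illustrate, interval probabilities can be propagated by interval arithmetic, and the winning belief then chosen per Equation 28.1. The interval-to-(strength, confidence) reading used below (midpoint as strength, one minus width as confidence) is one simple choice among several:

```python
def interval_mult(p, q):
    # Product of two independent probability intervals [pl, pu] and [ql, qu]:
    # with all endpoints in [0, 1], the product interval is [pl*ql, pu*qu].
    return (p[0] * q[0], p[1] * q[1])

def strength_confidence(interval):
    # Read an interval [l, u] as (strength, confidence): midpoint as strength,
    # 1 - width as a crude confidence (a narrow interval = well-evidenced).
    l, u = interval
    return ((l + u) / 2.0, 1.0 - (u - l))

def select_belief(beliefs):
    # DeSTIN's winner-selection step, modified per Equation 28.1:
    # argmax of strength * confidence over the candidate beliefs.
    def score(iv):
        s, c = strength_confidence(iv)
        return s * c
    return max(beliefs, key=lambda k: score(beliefs[k]))
```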
Fig. 28.1: The SIFT algorithm finds keypoints in an image, i.e. localized features that are
particularly useful for identifying the objects in an image. The top row shows images that
are matched against the image in the middle row. The bottom-row image shows some of the
keypoints used to perform the matching (i.e. these keypoints demonstrate the same features
in the top-row images and their transformed middle-row counterparts). SIFT keypoints are
identified via a staged filtering approach. The first stage identifies key locations in scale space
by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each
point is used to generate a feature vector that describes the local image region sampled relative
to its scale-space coordinate frame. The features achieve partial invariance to local variations,
such as affine or 3D projections, by blurring image gradient locations.
Chapter 29
Bridging the Symbolic/Subsymbolic Gap
29.1 Introduction
While it's widely accepted that human beings carry out both symbolic and subsymbolic process-
ing, as integral parts of their general intelligence, the precise definition of "symbolic" versus
"subsymbolic" is a subtle issue, which different AI researchers will approach in different ways
depending on their differing overall perspectives on AI. Nevertheless, the intuitive meaning of
the concepts is commonly understood:
• "subsymbolic" refers to things like pattern recognition in high-dimensional quantitative
sensory data, and real-time coordination of multiple actuators taking multidimensional
control signals
• "symbolic" refers to things like natural language grammar and (certain or uncertain)
logical reasoning, that are naturally modeled in terms of manipulation of symbolic tokens
in terms of particular (perhaps experientially learned) rules
Views on the relationship between these two aspects of intelligence in human and artificial
cognition are quite diverse, including perspectives such as
1. Symbolic representation and reasoning are the core of human-level intelligence; subsymbolic
aspects of intelligence are of secondary importance and can be thought of as pre- or post-
processors to symbolic representation and reasoning
2. Subsymbolic representation and learning are the core of human intelligence; symbolic as-
pects of intelligence
a. emerge from the subsymbolic aspects as needed; or,
b. arise via a relatively simple, thin layer on top of subsymbolic intelligence, that merely
applies subsymbolic intelligence in a slightly different way
3. Symbolic and subsymbolic aspects of intelligence are best considered as different subsystems,
which
a. have a significant degree of independent operation, but also need to coordinate closely
together; or,
b. operate largely separately and can be mostly considered as discrete modules
In evolutionary terms, it is clear that subsymbolic intelligence came first, and that most of
the human brain is concerned with the subsymbolic intelligence that humans share with other
animals. However, this observation doesn't have clear implications regarding the relationship
between symbolic and subsymbolic intelligence in the context of everyday cognition.
In the history of the AI field, the symbolic/subsymbolic distinction was sometimes aligned
with the dichotomy between logic-based and rule-based AI systems (on the symbolic side) and
neural networks (on the subsymbolic side) [RM88b]. However, this dichotomy has become much
blurrier in the last couple decades, with developments such as neural network models of lan-
guage parsing [CH11] and logical reasoning [LB10], and symbolic approaches to perception
and action [SR04]. Integrative approaches have also become more common, with one of the ma-
jor traditional symbolic AI systems, ACT-R, spawning a neural network version [LA93] with
parallel structures and dynamics to the traditional explicitly symbolic version, and a hybridiza-
tion with a computational neuroscience model [JL08]; and another one, SOAR, incorporating
perception processing components as separate modules [Lai12]. The field of "neural-symbolic
computing" has emerged, covering the emergence of symbolic rules from neural networks, and
the hybridization of neural networks with explicitly symbolic systems [HH07].
Our goal here is not to explore the numerous deep issues involved with the symbolic/subsym-
bolic dichotomy, but rather to describe the details of a particular approach to symbolic/sub-
symbolic integration, inspired by Perspective 3a in the above list: the consideration of symbolic
and subsymbolic aspects of intelligence as different subsystems, which have a significant degree
of independent operation, but also need to coordinate closely together. We believe this kind
of integration can serve a key role in the quest to create human-level general intelligence. The
approach presented here is at the beginning rather than end of its practical implementation;
what we are describing here is the initial design intention of a project in progress, which is sure
to be revised in some respects as implementation and testing proceed. We will focus mainly
on the tight integration of a subsymbolic system enabling gray-scale vision processing into a
cognitive architecture with significant symbolic aspects, and will then briefly explain how the
same ideas can be used for color vision, and multi-sensory and perception-action integration.
The approach presented here begins with two separate AI systems, OpenCog (introduced
in Section 6.3) and DeSTIN (introduced in Section 4.3.1) - both currently implemented in
open-source software. Here are the relevant features of each as they pertain to our current effort
of bridging the symbolic/subsymbolic gap:
• OpenCog is centered on a "weighted, labeled hypergraph" knowledge representation called
the Atomspace, and features a number of different, sophisticated cognitive algorithms acting
on the Atomspace. Some of these cognitive algorithms are heavily symbolic in focus (e.g.
a probabilistic logic engine); others are more subsymbolic in nature (e.g. a neural net like
system for allocating attention and assigning credit). However, OpenCog in its current form
cannot deal with high-dimensional perceptual input, nor with detailed real-time control
of complex actuators. OpenCog is now being used to control intelligent characters in an
experimental virtual world, where the perceptual inputs are the 3D coordinate locations of
objects or small blocks; and the actions are movement commands like "step forward", "turn
head to the right."
• DeSTIN is a deep learning system consisting of a hierarchy of processing nodes, in which
the nodes on higher levels correspond to larger regions of space-time, and each node carries
out prediction regarding events in the space-time region to which it corresponds. Feedback
and feedforward dynamics between nodes combine with the predictive activity within nodes,
to create a complex nonlinear dynamical system whose state self-organizes to reflect the
state of the world being perceived. However, the specifics of DeSTIN's dynamics have been
designed in what we consider a particularly powerful way, and the system has shown good
results on small-scale test problems [KAR10]. So far DeSTIN has been utilized only for
vision processing, but a similar proprietary system has been used for auditory data as well;
and DeSTIN was designed to work together with an accompanying action hierarchy.
These two systems were not originally designed to work together, but we will describe a method
for achieving their tight integration via
1. Modifying DeSTIN in several ways, so that
a. the patterns in its states over time will have more easily recognizable regularities
b. its nodes are able to scan their inputs not only for simple statistical patterns (DeSTIN
"centroids"), but also for patterns recognized by routines supplied to it by an external
source (e.g. another AI system such as OpenCog)
2. Utilizing one of OpenCogPrime's cognitive processes (the "Fishgram" frequent subhyper-
graph mining algorithm) to recognize patterns in sets of DeSTIN states, and then recording
these patterns in OpenCogPrime's Atomspace knowledge store
3. Utilizing OpenCogPrime's other cognitive processes to abstract concepts and draw conclu-
sions from the patterns recognized in DeSTIN states by Fishgram
4. Exporting the concepts and conclusions thus formed to DeSTIN, so that its nodes can ex-
plicitly scan for their presence in their inputs, thus allowing the results of symbolic cognition
to explicitly guide subsymbolic perception
5. Creating an action hierarchy corresponding closely to DeSTIN's perceptual hierarchy, and
also corresponding to the actuators of a particular robot. This allows action learning to be
done via an optimization approach ([LKP+05], [YKL+04]), where the optimization algorithm
uses DeSTIN states corresponding to perceived actuator states as part of its inputs.
The ideas presented here are compatible with those described in [Goe11a], but different in
emphasis. That paper described a strategy for integrating OpenCog and DeSTIN via creating an
intermediate "semantic CSDLN" hierarchy to translate between OpenCog and DeSTIN, in both
directions. In the approach suggested here, this semantic CSDLN hierarchy exists conceptually
but not as a separate software object: it exists as the combination of
• OpenCog predicates exported to DeSTIN and used alongside DeSTIN centroids, inside
DeSTIN nodes
• OpenCog predicates living in the OpenCog knowledge repository (AtomSpace), and inter-
connected in a hierarchical way using OpenCog nodes and links (thus reflecting DeSTIN's
hierarchical structure within the AtomSpace).
This hierarchical network of predicates, spanning the two software systems, plays the role of a
semantic CSDLN as described in [Goe11a].
29.2 Simplified OpenCog Workflow
The dynamics inside an OpenCog system may be highly complex, defying simple flowchart-
ing, but from the point of view of OpenCog-DeSTIN integration, one important pattern of
information flow through the system is as follows:
1. Perceptions come into the Atomspace. In the current OpenCog system, these are provided
via a proxy to the game engine where the OpenCog controlled character interacts. In an
OpenCog-DeSTIN hybrid, these will be provided via DeSTIN.
2. Hebbian learning builds HebbianLinks between perceptual Atoms representing percepts
that have frequently co-occurred
3. PLN inference, concept blending and other methods act on these perceptual Atoms and
their HebbianLinks, forming links between them and linking them to other Atoms stored
in the Atomspace reflecting prior experience and generalizations therefrom
4. Attention allocation gives higher short and long term importance values to those Atoms
that appear likely to be useful based on the links they have obtained
5. Based on the system's current goals and subgoals (the latter learned from the top-level goals
using PLN), and the goal-related links in the Atomspace, the OpenPsi mechanism triggers
the PLN-based planner, which chooses a series of high-level actions that are judged likely
to help the system achieve its goals in the current context
6. The chosen high-level actions are transformed into series of lower-level, directly executable
actions. In the current OpenCog system, this is done by a set of hand-coded rules based
on the specific mechanics of the game engine where the OpenCog controlled character
interacts. In an OpenCog-DeSTIN hybrid, the lower-level action sequence will be chosen by
an optimization method acting based on the motor control and perceptual hierarchies.
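The six steps above might be caricatured in code as follows; every helper here is a toy stand-in for a full OpenCog subsystem (Hebbian learning, PLN, attention allocation, OpenPsi), not its actual API:

```python
def cognitive_cycle(percepts, atomspace, goals):
    """Toy rendering of the six-step information flow.  'atomspace' is a
    plain dict standing in for the real Atomspace; all heuristics below
    are placeholders for the corresponding cognitive processes."""
    # 1. Percepts come into the Atomspace.
    atomspace["percepts"] = list(percepts)
    # 2. Hebbian learning links percepts that co-occur in this cycle.
    atomspace["hebbian"] = [(a, b) for a in percepts for b in percepts if a < b]
    # 3. Inference links percepts to prior concepts (toy: substring match).
    atomspace["links"] = [(p, c) for p in percepts
                          for c in atomspace.get("concepts", []) if c in p]
    # 4. Attention allocation: percepts that acquired links gain importance.
    linked = {p for p, _ in atomspace["links"]}
    atomspace["importance"] = {p: (2.0 if p in linked else 1.0) for p in percepts}
    # 5. Goal-driven planning: prefer goal-relevant percepts, by importance.
    relevant = [p for p in percepts if any(g in p for g in goals)] or list(percepts)
    target = max(relevant, key=lambda p: atomspace["importance"][p])
    # 6. Transform the high-level choice into executable low-level actions.
    return ["orient_to:" + target, "approach:" + target]
```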
This pattern of information flow omits numerous aspects of OpenCog cognitive dynamics,
but gives the key parts of the picture in terms of the interaction of OpenCog cognition with
perception and action. Most of the other aspects of the dynamics have to do with the interac-
tion of multiple cognitive processes acting on the Atomspace, and the interaction between the
Atomspace and several associated specialized memory stores, dealing with procedural, episodic,
temporal and spatial aspects of knowledge. From the present point of view, these additional
aspects may be viewed as part of Step 3 above, wrapped up in the phrase "and other methods
act on these perceptual Atoms." However, it's worth noting that in order to act appropriately
on perceptual Atoms, a lot of background cognition regarding more abstract conceptual Atoms
(often generalized from previous perceptual Atoms) may be drawn on. This background infer-
ence incorporates both symbolic and subsymbolic aspects, but goes beyond the scope of the
present discussion, as its particulars do not impinge on the particulars of DeSTIN-OpenCog
integration.
OpenCog also possesses a specialized facility for natural language comprehension and genera-
tion [LCE10] [Goe10b], which may be viewed as a parallel perception/action pathway, bypassing
traditional human-like sense perception and dealing with text directly. Integrating OpenCog-
Prime's current linguistics processes with DeSTIN-based auditory and visual processing is a
deep and important topic, but one we will bypass here, for sake of brevity and because it's not
our current research priority.
29.3 Integrating DeSTIN and OpenCog
The integration of DeSTIN and OpenCog involves two key aspects:
• recognition of patterns in sets of DeSTIN states, and exportation of these patterns into the
OpenCog Atomspace
• use of OpenCog-created concepts within DeSTIN nodes, alongside statistically-derived "cen-
troids"
From here on, unless specified otherwise, when we mention "DeSTIN" we will refer to "Uniform
DeSTIN" as presented in Chapter 28, an extension of "classic DeSTIN" as defined in
[ARC09].
29.3.1 Mining Patterns from DeSTIN States
The first step toward using OpenCog tools to mine patterns from sets of DeSTIN states, is to
represent these states in Atom form in an appropriate way. A simple but workable approach,
restricting attention for the moment to purely spatial patterns, is to use the six predicates:
• hasCentroid(node N, int k)
• hasParentCentroid(node N, int k)
• hasNorthNeighborCentroid(node N, int k)
• hasSouthNeighborCentroid(node N, int k)
• hasEastNeighborCentroid(node N, int k)
• hasWestNeighborCentroid(node N, int k)
For instance,
hasNorthNeighborCentroid(N, 3)
means that N's north neighbor has centroid #3.
One may consider also the predicates
• hasParent(node N, node M)
• hasNorthNeighbor(node N, node M)
• hasSouthNeighbor(node N, node M)
• hasEastNeighbor(node N, node M)
• hasWestNeighbor(node N, node M)
Now suppose we have a stored set of DeSTIN states, saved from the application of DeSTIN
to multiple different inputs. What we want to find are predicates P that are conjunctions of
instances of the above predicates, which occur frequently in the stored set of DeSTIN states.
A simple example of such a predicate would be the conjunction of
• hasNorthNeighbor($N, $M)
• hasParentCentroid($N, 5)
• hasParentCentroid($M, 5)
• hasNorthNeighborCentroid($N, 6)
• hasWestNeighborCentroid($M, 4)
This predicate could be evaluated at any pair of nodes ($N, $M) on the same DeSTIN level.
If it is true for atypically many of these pairs, then it's a "frequent pattern", and should be
detected and stored.
OpenCogPrime's pattern mining component, Fishgram, exists precisely for the purpose of
mining this sort of conjunction from sets of relationships that are stored in the Atomspace. It
may be applied to this problem as follows:
• Translate each DeSTIN state into a set of relationships drawn from: hasNorthNeighbor,
hasSouthNeighbor, hasEastNeighbor, hasWestNeighbor, hasCentroid, hasParent
• Import these relationships, describing each DeSTIN state, into the OpenCog Atomspace
• Run pattern mining on this AtomSpace.
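A toy sketch of the first translation step, for a single DeSTIN layer laid out as a grid of nodes. The node-naming scheme and the convention that "north" is the preceding row are assumptions of this sketch; the real exporter would handle full multi-layer DeSTIN states:

```python
def destin_state_to_relations(grid):
    """Translate one DeSTIN layer, given as a dict {(row, col): centroid_id},
    into the relationship tuples that Fishgram mines over."""
    rels = []
    for (r, c), k in grid.items():
        node = "node_%d_%d" % (r, c)
        rels.append(("hasCentroid", node, k))
        # Assumed convention: north = previous row, west = previous column.
        for name, (dr, dc) in [("hasNorthNeighbor", (-1, 0)),
                               ("hasSouthNeighbor", (1, 0)),
                               ("hasEastNeighbor", (0, 1)),
                               ("hasWestNeighbor", (0, -1))]:
            if (r + dr, c + dc) in grid:
                rels.append((name, node, "node_%d_%d" % (r + dr, c + dc)))
    return rels
```

Each such relation set, one per stored DeSTIN state, would then be imported into the Atomspace as EvaluationLinks for pattern mining.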
29.3.2 Probabilistic Inference on Mined Hypergraphs
Patterns mined from DeSTIN states can then be reasoned on by OpenCogPrime's PLN inference
engine, allowing analogy and generalization.
Suppose centroids 5 and 617 are estimated to be similar - either via DeSTIN's built-in simi-
larity metric, or, more interestingly, via OpenCog inference on the Atom representations of these
centroids. As an example of the latter, consider: 5 could represent a person's nose and 617 could
represent a rabbit's nose. In this case, DeSTIN might not judge the two centroids particularly
similar on a purely visual level, but, OpenCog may know that the images corresponding to
both of these centroids are called "noses" (e.g. perhaps via noticing people indicate these
images in association with the word "nose"), and may thus infer (using a simple chain of PLN
inferences) that these centroids seem probabilistically similar.
If 5 and 617 are estimated to be similar, then a predicate like
ANDLink
EvaluationLink
hasNorthNeighbor
ListLink $N $M
EvaluationLink
hasParentCentroid
ListLink $N 5
EvaluationLink
hasParentCentroid
ListLink $M 5
EvaluationLink
hasNorthNeighborCentroid
ListLink $N 6
EvaluationLink
hasWestNeighborCentroid
ListLink $M 4
mined from DeSTIN states, could be extended via PLN analogical reasoning to
ANDLink
EvaluationLink
hasNorthNeighbor
ListLink $N $M
EvaluationLink
hasParentCentroid
ListLink $N 617
EvaluationLink
hasParentCentroid
ListLink $M 617
EvaluationLink
hasNorthNeighborCentroid
ListLink $N 6
EvaluationLink
hasWestNeighborCentroid
ListLink $M 4
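The analogical step amounts to substituting one centroid for a similar one throughout the mined pattern. On a tuple-based rendering of such patterns (an illustrative encoding, not the Atomspace representation), the transformation is simply:

```python
def substitute_centroid(pattern, old, new):
    # Given a mined pattern as a list of (predicate, arg, arg) tuples,
    # produce the analogous pattern with one centroid id swapped for a
    # similar one -- the substitution PLN's analogical reasoning performs
    # on the Atom version of the pattern.
    return [tuple(new if a == old else a for a in clause) for clause in pattern]
```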
29.3.3 Insertion of OpenCog-Learned Predicates into DeSTIN's
Pattern Library
Suppose one has used Fishgram, as described in the earlier part of this chapter, to recog-
nize predicates embodying frequent or surprising patterns in a set of DeSTIN states or state-
sequences. The next natural step is to add these frequent or surprising patterns to DeSTIN's
pattern library, so that the pattern library contains not only classic DeSTIN centroids, but also
these corresponding "image grammar" style patterns. Then, when a new input comes into a
DeSTIN node, in addition to being compared to the centroids at the node, it can be fed as
input to the predicates associated with the node.
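A sketch of how a node's matching step might consult both kinds of library entries; the scoring scheme, in which a satisfied predicate matches with full similarity, is a simplifying assumption of this sketch:

```python
def match_input(node_input, centroids, predicates):
    """Score a node's input against both its centroids (nearest-neighbor
    similarity) and any OpenCog-exported predicates in its pattern library,
    returning the label of the best-matching entry."""
    best_label, best_score = None, float("-inf")
    for label, c in centroids.items():
        d = sum((x - y) ** 2 for x, y in zip(node_input, c))
        score = 1.0 / (1.0 + d)          # similarity decays with distance
        if score > best_score:
            best_label, best_score = label, score
    for label, pred in predicates.items():
        # Predicates give a binary match; a satisfied predicate scores 1.0,
        # the maximum possible centroid similarity.
        if pred(node_input) and 1.0 > best_score:
            best_label, best_score = label, 1.0
    return best_label
```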
What is the advantage of this approach, compared to DeSTIN without these predicates? The
capability for more compact representation of a variety of spatial patterns. In many cases, a
spatial pattern that would require a large number of DeSTIN centroids to represent, can be
represented by a single, fairly compact predicate. It is an open question whether these sorts
of predicates are really critical for human-like vision processing. However, our intuition is that
they do have a role in human as well as machine vision. In essence, DeSTIN is based on a fancy
version of nearest-neighbor search, applied in a clever way on multiple levels of a hierarchy,
using context-savvy probabilities to bias the matching. But we suspect there are many visual
patterns that are more compactly and intuitively represented using a more flexible language,
such as OpenCog predicates formed by combining elementary predicates involving appropriate
spatial and temporal relations.
For example, consider the archetypal spatial pattern of a face as: either two eyes that are
next to each other, or sunglasses, above a nose, which is in turn above a mouth. (This is an
oversimplified toy example, but we're positing it for illustration only. The same point applies
to more complex and realistic patterns.) One could represent this in OpenCogPrime's Atom
language as something like:
AND
InheritanceLink N B_nose
InheritanceLink M B_mouth
EvaluationLink
above
ListLink E N
EvaluationLink
above
ListLink N M
OR
AND
MemberLink El E
MemberLink E2 E
EvaluationLink
next_to
ListLink El E2
InheritanceLink El B_eye
AND
InheritanceLink E B_sunglasses
where e.g. B_eye is a DeSTIN belief that corresponds roughly to recognition of the spatial
pattern of a human eye. To represent this using ordinary DeSTIN centroids, one couldn't rep-
resent the OR explicitly; instead one would need to split it into two different sets of centroids,
corresponding to the eye case and the sunglasses case - unless the DeSTIN pattern library
contained a belief corresponding to "eyes or sunglasses." But the question then becomes: how
would classic DeSTIN actually learn a belief like this? In the suggested architecture, pattern
mining on the database of DeSTIN states is proposed as an algorithm for learning such beliefs.
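The toy face pattern could be evaluated procedurally along these lines. The region-naming convention ("E/1", "E/2" as sub-regions of E) is a hypothetical stand-in for MemberLink, and the relation sets are illustrative:

```python
def is_face(E, N, M, kinds, above, next_to):
    """Toy evaluation of the face predicate: E is the eye-region, N the nose,
    M the mouth.  'kinds' maps region -> recognized DeSTIN belief label;
    'above' and 'next_to' are sets of (region, region) pairs."""
    if kinds.get(N) != "nose" or kinds.get(M) != "mouth":
        return False
    if (E, N) not in above or (N, M) not in above:
        return False
    # The OR branch: either E is sunglasses...
    if kinds.get(E) == "sunglasses":
        return True
    # ...or E contains two adjacent sub-regions recognized as eyes.
    eyes = [r for r, k in kinds.items() if k == "eye" and r.startswith(E + "/")]
    return any((a, b) in next_to for a in eyes for b in eyes if a != b)
```

The disjunction is explicit here, which is precisely what a flat centroid library cannot express without duplicating the surrounding structure.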
This sort of predicate-enhanced DeSTIN will have advantages over the traditional version,
only if the actual distribution of images observed by the system contains many (reasonably high
probability) images modeled accurately by predicates involving disjunction and/or negations as
well as conjunctions. If the system's perceived world is simpler than this, then good old DeSTIN
will work just as well, and the OpenCog-learned predicates are a needless complication.
Without these sorts of predicates, how might DeSTIN be extended to include beliefs like "eyes
or sunglasses"? One way would be to couple DeSTIN with a reinforcement learning subsystem,
that reinforced the creation of beliefs that were useful for the system as a whole. If reasoning
in terms of faces (independent of whether they have eyes or sunglasses) got the system reward,
presumably it could learn to form the concept "eyes or sunglasses." We believe this would also
be a workable approach, but that given the strengths and weaknesses of contemporary computer
hardware, the proposed DeSTIN-OpenCog approach will prove considerably simpler and more
effective.
29.4 Multisensory Integration, and Perception-Action Integration
In Section 28.2.1 we briefly indicated how DeSTIN could be extended beyond vision
to handle other senses such as audition and touch. If one had multiple perception hierarchies
corresponding to multiple senses, the easiest way to integrate them within an OpenCog context
would be to use OpenCog as the communication nexus - representing DeSTIN centroids in
the various modality-specific hierarchies as OpenCog Atoms (PerceptualCentroidNodes), and
building HebbianLinks in OpenCogPrime's Atomspace between these PerceptualCentroidNodes
as appropriate based on their association. So for instance the sound of a person's footsteps
would correspond to a certain belief (probability distribution over centroids) in the auditory
DeSTIN network, and the sight of a person's feet stepping would correspond to a certain
belief (probability distribution over centroids) in the visual DeSTIN network; and the OpenCog
Atomspace would contain links between the sets of centroids assigned high weights in these
two belief distributions. Importance spreading between these various PerceptualCentroidNodes
would cause a dynamic wherein seeing feet stepping would bias the system to think it was
hearing footsteps, and hearing footsteps would bias it to think it was seeing feet stepping.
And, suppose there are similarities between the belief distributions for the visual appearance
of dogs, and the visual appearance of cats. Via the intermediary of the Atomspace, this would
bias the auditory and haptic DeSTIN hierarchies to assume a similarity between the auditory
and haptic characteristics of dogs, and the analogous characteristics of cats: PLN
analogical reasoning would extrapolate from, e.g.
• HebbianLinks joining cat-related visual PerceptualCentroidNodes and dog-related visual
PerceptualCentroidNodes
• HebbianLinks joining cat-related visual PerceptualCentroidNodes to cat-related haptic Per-
ceptualCentroidNodes; and others joining dog-related visual PerceptualCentroidNodes to
dog-related haptic PerceptualCentroidNodes
to yield HebbianLinks joining cat-related haptic PerceptualCentroidNodes and dog-related hap-
tic PerceptualCentroidNodes. This sort of reasoning would then cause the DeSTIN system to,
for example, upon touching a cat, vaguely expect to maybe hear dog-like things. This sort
of simple analogical reasoning will be right sometimes and wrong sometimes - a cat walking
sounds a fair bit like a dog walking, and cat and dog growls sound fairly similar, but a cat
meowing doesn't sound that much like a dog barking. More refined inferences of the same basic
sort may be used to get the details right as the system explores and understands the world
more accurately.
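A minimal sketch of how such HebbianLinks could be formed from experience: count co-occurrences of active PerceptualCentroidNodes across time-stamped observations. The node names and the simple frequency-based weight are illustrative assumptions, not OpenCog's actual attention-allocation formula.

```python
from collections import Counter
from itertools import combinations

def hebbian_weights(observations):
    """observations: list of sets of node ids active at each time step.
    Returns {(a, b): co-occurrence frequency}, a simple stand-in for
    HebbianLink truth values (pairs keyed with a < b)."""
    pair_counts = Counter()
    for active in observations:
        for a, b in combinations(sorted(active), 2):
            pair_counts[(a, b)] += 1
    n = len(observations)
    return {pair: count / n for pair, count in pair_counts.items()}

# Toy history: feet seen stepping usually coincide with footsteps heard.
history = [
    {"visual:feet_stepping", "audio:footsteps"},
    {"visual:feet_stepping", "audio:footsteps"},
    {"visual:feet_stepping", "audio:footsteps"},
    {"visual:dog", "audio:bark"},
]
links = hebbian_weights(history)
```

Given such links, importance spreading would let the sight of feet stepping bias the system toward expecting the sound of footsteps, as described above.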
29.4.1 Perception-Action Integration
While experimentation with DeSTIN has so far been restricted to perception processing, the sys-
tem was designed from the beginning with robotics applications in mind, involving integration
of perception with action and reinforcement learning. As OpenCog already handles reinforce-
ment learning on a high level (via OpenPsi), our approach to robot control using DeSTIN and
OpenCog involves creating a control hierarchy parallel to DeSTIN's perceptual hierarchy, and
doing motor learning using optimization algorithms guided by reinforcement signals delivered
from OpenPsi and incorporating DeSTIN perceptual states as part of their input information.
Our initial research goal, where action is concerned, is not to equal the best purely control-
theoretic algorithms at fine-grained control of robots carrying out specialized tasks, but rather
to achieve basic perception / control / cognition integration in the rough manner of a young
human child. A two year old child is not particularly well coordinated, but is capable of coor-
dinating actions involving multiple body parts using an integration of perception and action
with unconscious and deliberative reasoning. Current robots, in some cases, can carry out spe-
cialized actions with great accuracy, but they lack this sort of integration, and thus generally
have difficulty effectively carrying out actions in unforeseen environments and circumstances.
We will create an action hierarchy with nodes corresponding to different parts of the robot
body, where e.g. the node corresponding to an arm would have child nodes corresponding to
a shoulder, elbow, wrist and hand; and the node corresponding to a hand would have child
nodes corresponding to the fingers of the hand; etc. Physical self-perception is then achieved
by creating a DeSTIN "action-perception" hierarchy with nodes corresponding to the states
of body components. In the simplest case this means the lowest-level nodes will correspond to
individual servomotors, and their inputs will be numerical vectors characterizing servomotor
states. If one is dealing with a robot endowed with haptic technology, e.g. SynTouch
fingertips, then numerical vectors characterizing haptic inputs may be used alongside these.
The configuration space of an action-perception node, corresponding to the degrees of free-
dom of the servomotors of the body part the node represents, may be approximated by a set of
"centroid" vectors. When an action is learned by the optimization method used for this purpose,
this involves movements of the servomotors corresponding to many different nodes, and thus
creates a series of "configuration vectors" in each node. These configuration vector series may
be subjected to online clustering, similar to percepts in a DeSTIN perceptual hierarchy. The
result is a library of "codewords", corresponding to discrete trajectories of movement, associ-
ated with each node. The libraries may be shared by identical body parts (e.g. shared among
legs, shared among fingers), but will be distinct otherwise. Each coordinated whole-body action
thus results in a series of (node, centroid) pairs, which may be mined for patterns, similarly to
the perception case.
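The codeword-learning step can be sketched as a simple online clustering routine; this is a simplified stand-in for DeSTIN's actual online clustering, with an assumed distance-threshold rule for spawning new centroids.

```python
import math

class OnlineCodebook:
    """Incremental clustering of servo configuration vectors into
    'codewords'. A new vector either updates the nearest centroid (running
    mean) or, if it lies farther than `radius`, starts a new codeword."""

    def __init__(self, radius):
        self.radius = radius
        self.centroids = []   # list of [mean_vector, count]

    def observe(self, v):
        """Assimilate vector v; return the index of its codeword."""
        best, best_d = None, float("inf")
        for c in self.centroids:
            d = math.dist(v, c[0])
            if d < best_d:
                best, best_d = c, d
        if best is None or best_d > self.radius:
            self.centroids.append([list(v), 1])
            return len(self.centroids) - 1
        # running-mean update of the matched centroid
        best[1] += 1
        best[0] = [x + (y - x) / best[1] for x, y in zip(best[0], v)]
        return self.centroids.index(best)
```

A learned action then yields, per node, a series of codeword indices, and coordinated whole-body actions become series of (node, centroid) pairs as described above.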
The set of predicates needed to characterize states in this action-perception hierarchy is
simpler than the one described for visual perception above; here one requires only
• hasCentroid(node N, int k)
• hasParentCentroid(node N, int k)
• hasParent(node N, Node M)
• hasSibling(node N, Node M)
and most of the patterns will involve specific nodes rather than node variables. The different
nodes in a DeSTIN vision hierarchy are more interchangeable (in terms of their involvement in
various patterns) than, say, a leg and a finger.
In a pure DeSTIN implementation, the visual and action-perception hierarchies would be
directly linked. In the context of OpenCog integration, it is simplest to link the two via OpenCog,
in a sense using cognition as a bridge between action and perception. It is unclear whether this
strategy will be sufficient in the long run, but we believe it will be more than adequate for
experimentation with robotic perceptual-motor coordination in a variety of everyday tasks.
OpenCogPrime's Hebbian learning process can be used to find common associations between
action-perception states and visual-perception states, via mining a data store containing time-
stamped state records from both hierarchies.
Importance spreading along the HebbianLinks learned in this way can then be used to bias
the weights in the belief states of the nodes in both hierarchies. So, for example, the action-
perception patterns related to clenching the fist, would be Hebbianly correlated with the visual-
perception patterns related to seeing a clenched fist. When a clenched fist was perceived via
servomotor data, importance spreading would increase the weighting of visual patterns corre-
sponding to clenched fists, within the visual hierarchy. When a clenched fist was perceived via
visual data, importance spreading would increase the weighting of servomotor data patterns
corresponding to clenched fists, within the action-perception hierarchy.
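One step of this cross-hierarchy biasing can be sketched as follows. The linear spreading rule and the link weights are illustrative assumptions; OpenCog's actual economic attention allocation is more elaborate.

```python
def spread_importance(importance, hebbian_links, rate=0.5):
    """Spread a fraction of each node's importance to its Hebbian
    neighbours, symmetrically, scaled by link weight.

    importance:    dict node -> current importance value
    hebbian_links: dict (node_a, node_b) -> link weight
    """
    delta = {node: 0.0 for node in importance}
    for (a, b), w in hebbian_links.items():
        delta[b] += rate * w * importance[a]
        delta[a] += rate * w * importance[b]
    return {node: importance[node] + delta[node] for node in importance}

# Perceiving a clenched fist through servomotor data boosts the
# corresponding visual-hierarchy pattern, and vice versa.
state = {"servo:clenched_fist": 1.0, "visual:clenched_fist": 0.0}
links = {("servo:clenched_fist", "visual:clenched_fist"): 0.8}
state = spread_importance(state, links)
```

After one spreading step the visual fist node has acquired importance from the servomotor side, biasing the visual hierarchy's belief states as described above.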
29.4.2 Thought-Experiment: Eye-Hand Coordination
For example, how would DeSTIN-OpenCog integration as described here carry out a simple
task of eye-hand coordination? Of course the details of such a feat, as actually achieved, would
be too intricate to describe in a brief space, but it still is meaningful to describe the basic ideas.
Consider the case of a robot picking up a block, in plain sight immediately in front of the robot,
via pinching it between two fingers and then lifting it. In this case,
• The visual scene, including the block, is perceived by DeSTIN; and appropriate patterns in
various DeSTIN nodes are formed
• Predicates corresponding to the distribution of patterns among DeSTIN nodes are activated
and exported to the OpenCog Atomspace
• Recognition that a block is present is carried out, either by
- PLN inference within OpenCog, drawing the conclusion that a block is present from
the exported predicates, using ImplicationLinks comprising a working definition of a
"block"
- A predicate comprising the definition of "block", previously imported into DeSTIN from
OpenCog and utilized within DeSTIN nodes as a basic pattern to be scanned for. This
option would obtain only if the system had perceived many blocks in the past, justifying
the automation of block recognition within the perceptual hierarchy.
• OpenCog, motivated by one of its higher-level goals, chooses "picking up the block" as
a subgoal, and allocates effort to finding a procedure whose execution, in the current context,
has a reasonable likelihood of achieving the goal of picking up the block. For instance, the
goal could be curiosity (which might make the robot want to see what lies under the block),
or the desire to please the agent's human teacher (in case the human teacher likes presents,
and will reward the robot for giving it a block as a present), etc.
• OpenCog, based on its experience, uses PLN to reason that "grabbing the block" is a
subgoal of "picking up the block"
• OpenCog utilizes a set of predicates corresponding to the desired state of "grabbing the
block" as a target for an optimization algorithm, designed to figure out a series of
servomotor actions that will move the robot's body from the current state to the target
state. This is a relatively straightforward control theory problem.
• Once the chosen series of servomotor actions has been executed, the robot has its fingers
poised around the block, ready to pick it up. At this point, the action-perception hierarchy
perceives what is happening in the fingers. If the block is really being grabbed properly,
then the fingers are reporting some force, due to the feeling of grabbing the block (haptic
input is another possibility and would be treated similarly, but we will leave that aside for
now). Importance spreads from these action-perception patterns into the Atomspace, and
back down into the visual perception hierarchy, stimulating concepts and percepts related
to "something is being grabbed by the fingers."
• If the fingers aren't receiving enough force, because the agent is actually only poking the
block with one finger and grabbing the air with another finger, then the "something is being
grabbed by the fingers" stimulation doesn't happen, and the agent is less sure it's actually
grabbing anything. In that case it may withdraw its hand a bit, so that it can more easily
assess its hand's state visually, and try the optimization-based movement planning again.
• Once the robot estimates the goal of grabbing the block has been successfully achieved, it
proceeds to the next sub-subgoal, and asks the action-sequence optimizer to find a sequence
of movements that will likely cause the predicates corresponding to "hold the block up" to
obtain. It then executes this movement series and picks the block up in the air.
This simple example is a far cry from the perceptual-motor coordination involved in doing
embroidery, juggling or serving a tennis ball. But we believe it illustrates, in a simple way, the
same basic cognitive structures and dynamics used in these more complex instances.
29.5 A Practical Example: Using Subtree Mining to Bridge the Gap
Between DeSTIN and PLN
In this section we describe some relatively simple practical experiments we have run, exploring
the general ideas described above. The core idea of these experiments is to apply Yun Chi's
Frequent Subtree Mining software to mine frequent patterns from a data-store
of trees representing DeSTIN states. In this application, each frequent subtree represents a
"common visual pattern." These patterns may then be reasoned about using PLN. This approach
may also be extended to include additional quality metrics besides frequency, e.g. interaction
information, which lets one measure how "surprising" a subtree is.
Figure 29.1 illustrates the overall architecture into which the use of frequent subtree mining to
bridge DeSTIN and PLN is intended to fit. This architecture is not yet fully implemented, but
is a straightforward extension of the current OpenCog architecture for processing data from
game worlds [GEA08], and is scheduled for implementation later in 2013 in the course of a
funded project involving the use of DeSTIN and OpenCog for humanoid robot control.
The components intervening between DeSTIN and OpenCog, in this architecture, are:
• DeSTIN State DB: Stores all DeSTIN states the system has experienced, indexed by
time of occurrence
• Frequent Subtree Miner: Recognizes frequent subtrees in the database of DeSTIN states,
and can also filter the frequent subtrees by other criteria such as information-theoretic
surprisingness. These subtrees may sometimes span multiple time points.
• Frequent Subtree Recognizer: Scans DeSTIN output, and recognizes frequent subtrees
therein. These subtrees are the high level "visual patterns" that make their way from DeSTIN
to OpenCog.
• Perception Collector: Linearly normalizes the spatial coordinates associated with its
input subtrees, to compensate for movement of the camera. Filters out perceptions that
didn't change recently (e.g. a static white wall), so that only new visual information is
passed along to OpenCog. Translates the subtrees into Scheme files representing OpenCog
logical Atoms.
• Experience DB: Stores all the normalized subtrees that have actually made their way
into OpenCog
available for download at http://www.nec-labs.com/~ychi/publication/software.html
[Figure: data flow for DeSTIN/OpenCog integration. Video from the robot's cameras enters DeSTIN, which forms hierarchical states embodying patterns in visual input. The Frequent Subtree Recognizer identifies important subtrees in each DeSTIN state; the Perception Collector normalizes their locations in the visual field (correcting for robot head and body movement), filters out perceptions that did not change much recently, and converts the subtrees to OpenCog "Atomese" logical form. The DeSTIN State DB feeds the Frequent Subtree Miner; important new subtrees, tagged with normalized locations, enter the Experience DB and the OpenCog AtomSpace, where cognitive processes operate on them. Semantic feedback provides additional input to DeSTIN nodes.]
Fig. 29.1: Graphical depiction of the architecture for DeSTIN/OpenCog integration using fre-
quent subtree mining as a bridge. Semantic feedback is not yet implemented; and sophisticated
processing of DeSTIN or other visual input is not yet handled by OpenCog's Perception Collec-
tor. The experiments presented here utilized a simplified, preliminary version of the architecture.
• Semantic Feedback: Allows the semantic associations OpenCog makes to a subtree, to
be fed back into DeSTIN as additional inputs to the nodes involved in the subtree. This
allows perception to make use of cognitive information.
29.5.1 The Importance of Semantic Feedback
One aspect of the above architecture not yet implemented, but worthy of note, is semantic
feedback. Without the semantic feedback, we expect to be able to emulate human object and
event recognition insofar as they are done by the human brain in a span of less than 500ms
or so. In this time frame, the brain cannot do much sophisticated cognitive feedback, and
processes perceptual data in an essentially feedforward manner. On the other hand, properly
tuned semantic feedback along with appropriate symbolic reasoning in OpenCog, may allow us
to emulate human object and event recognition as the human brain does it when it has more
time available, and can use its cognitive understanding to guide vision processing.
A simple example of this sort of symbolic reasoning is analogical inference. Given a visual
scene, OpenCog can reason about what the robot has seen in similar situations before, where
its notion of "similarity" draws not only on visual cues but on other contextual information:
what time it is, what else is in the room (even if not currently visible), who has been seen in
the room recently, etc.
For instance, recognizing familiar objects that are largely occluded and in dim light, may be
something requiring semantic feedback, and not achievable via the feedforward dynamics alone.
This can be tested in a robot vision context via showing the robot various objects and events
in various conditions of lighting and occlusion, and observing its capability at recognizing the
objects and events with and without semantic feedback, in each of the conditions.
If the robot sees an occluded object in a half-dark area on a desk, and it knows that a woman
was recently sitting at that desk and then got up and left the room, its symbolic analogical
inference may make it more likely to conclude that the object is a purse. Without this symbolic
inference, it might not be able to recognize the object as a purse based on bottom-up visual
clues alone.
29.6 Some Simple Experiments with Letters
To illustrate the above ideas in an elementary context, we now present results of an experiment
using DeSTIN, subtree mining and PLN together to recognize patterns among a handful of black
and white images comprising simple letter-forms. This is a "toy" example, but exemplifies the
key processes reviewed above. During the next year we will be working on deploying these same
processes in the context of robot vision.
29.6.1 Mining Subtrees from DeSTIN States Induced via Observing
Letterforms
Figure 29.2 shows the 7 input images utilized; Figure 29.3 shows the centroids found on each of
the layers of DeSTIN (note that translation invariance was enabled for these experiments); and
Figure 29.4 shows the most frequent subtrees recognized among the DeSTIN states induced by
observing the 7 input images.
The centroid images shown in Figure 29.3 were generated as follows. For the bottom layer,
centroids were directly represented as 4x4 grayscale images (ignoring the previous and parent
belief sections of the centroid). For higher-level centroids, we proceeded as follows:
• Divide the centroid into 4 sub-arrays. An image is generated for each sub-array by treating
the elements of the sub-array as weights in a weighted sum of the child centroid images.
This weighted sum is used to superpose / blend the child images into one image.
• Then these 4 sub-array images are combined in a square to create the whole centroid image.
• Repeat the process recursively, till one reaches the top level.
The weighted averaging used a p-power approach, i.e. replacing each weight w_i with
w_i^p / (w_1^p + ... + w_n^p) for a given exponent p > 0. The parameter p controls how much
attention is paid to nearby versus distant centroids. In generating Figure 29.3 we used p = 4.
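The p-power weighted averaging used for the centroid visualizations can be sketched as follows; we assume the normalisation w_i → w_i^p / Σ_j w_j^p, under which a large p concentrates the blend on the strongest-weighted child images.

```python
def p_power_blend(weights, images, p=4):
    """Blend child centroid images (flat grayscale pixel lists) using
    p-power normalised weights. Larger p pays more attention to
    high-weight (nearby) centroids and less to distant ones."""
    powered = [w ** p for w in weights]
    total = sum(powered)
    norm = [w / total for w in powered]
    n_pixels = len(images[0])
    return [sum(w * img[i] for w, img in zip(norm, images))
            for i in range(n_pixels)]
```

For example, with weights 1 and 3 and p = 4, the second image receives normalized weight 81/82 ≈ 0.99, so the blend is almost entirely the second child image.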
The relation between subtrees and input images, in this example, was directly given via the
subtree miner as:
tree #0 matches input image: 4 6
tree #1 matches input image: 1 2
tree #2 matches input image: 3 5
tree #3 matches input image: 0 1 2 4
tree #4 matches input image: 3 5
tree #5 matches input image: 4 5
tree #6 matches input image: 1 2
tree #7 matches input image: 0 3 4
29.6.2 Expressing the Mined Subtrees in PLN
The subtree-image relationships listed above may be most directly expressed in PLN syntax/se-
mantics via
Evaluation contained_in (Tree0 Image4)
Evaluation contained_in (Tree0 Image6)
Evaluation contained_in (Tree1 Image1)
Evaluation contained_in (Tree1 Image2)
Evaluation contained_in (Tree2 Image3)
Evaluation contained_in (Tree2 Image5)
Evaluation contained_in (Tree3 Image0)
Evaluation contained_in (Tree3 Image1)
Evaluation contained_in (Tree3 Image2)
Evaluation contained_in (Tree3 Image4)
Evaluation contained_in (Tree4 Image3)
Evaluation contained_in (Tree4 Image5)
Evaluation contained_in (Tree5 Image4)
Evaluation contained_in (Tree5 Image5)
Evaluation contained_in (Tree6 Image1)
Evaluation contained_in (Tree6 Image2)
Evaluation contained_in (Tree7 Image0)
Evaluation contained_in (Tree7 Image3)
Evaluation contained_in (Tree7 Image4)
(a) Image 0 (b) Image 1 (c) Image 2 (d) Image 3 (e) Image 4 (f) Image 5 (g) Image 6
Fig. 29.2: Simple input images fed to DeSTIN for the experiment reported here.
But while this is a perfectly natural way to import such relationships into OpenCog, it is
not necessarily the most convenient form for PLN to use to manipulate them. For some useful
inference chains, it is most convenient for PLN to translate these into the more concise form
Inheritance Image4 hasTree0
Inheritance Image6 hasTree0
Inheritance Image3 hasTree7
Inheritance Image4 hasTree7
PLN performs the translation from Evaluation into Inheritance form via the inference steps
Evaluation contained_in (Tree0 Image4)
--> \\ definition of SatisfyingSet
Member Image4 (SatisfyingSet (Evaluation contained_in (Tree0 *)) )
--> \\ definition of hasTree0
Member Image4 hasTree0
--> \\ M2I, Member to Inheritance inference
Inheritance Image4 hasTree0
(a) Layer 0 (b) Layer 1
(c) Layer 2 (d) Layer 3 (e) Layer 4
(f) Layer 5 (g) Layer 6 (h) Layer 7
Fig. 29.3: Example visualization of the centroids on the 7 layers of the DeSTIN network. Each
picture shows multiple centroids at the corresponding level. Higher level centroids are visualized
as p'th power averages of lower level centroids, with p=4.
Finally, given the Inheritance relations listed above, PLN can draw some simple conclusions
fairly directly, such as:
Similarity Image1 Image2 <1, .375>
Similarity Image3 Image5 <.5, .444>
The PLN truth values above are given in the form " <strength, confidence> ", where strength
is in this case effectively a probability, and confidence represents a scaling into the interval [0,1)
of the amount of evidence on which that strength value is based. The confidence is calculated
using a "personality parameter" of k = 5 (k may vary between 1 and ∞, with higher numbers
indicating less value attached to each individual piece of evidence). For example, the truth value
strength of 1 attached to "Similarity Image1 Image2" indicates that according to the evidence
provided by these subtrees (and ignoring all other evidence), Image1 and Image2 are the same.
Of course they are not the same — one is a C and another is an O — and once more evidence is
given to PLN, it will decrease the strength value of this SimilarityLink. The confidence value
of .375 indicates that PLN is not very certain of the sameness of these two letters.
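The two truth values quoted above can in fact be recomputed directly from the subtree/image table of Section 29.6.1, assuming Jaccard overlap for the strength and n/(n + k) for the confidence, with n the number of distinct subtrees in the evidence. These formulas are our reconstruction, but they reproduce the quoted numbers exactly.

```python
def similarity_tv(matches, img_a, img_b, k=5):
    """matches: {tree_id: set of image ids containing that tree}.
    Strength = Jaccard overlap of the two images' tree sets;
    confidence = n / (n + k), where n = size of the union of evidence."""
    a = {t for t, imgs in matches.items() if img_a in imgs}
    b = {t for t, imgs in matches.items() if img_b in imgs}
    union = a | b
    strength = len(a & b) / len(union) if union else 0.0
    confidence = len(union) / (len(union) + k)
    return strength, confidence

# The tree -> image table reported by the subtree miner in Section 29.6.1:
matches = {0: {4, 6}, 1: {1, 2}, 2: {3, 5}, 3: {0, 1, 2, 4},
           4: {3, 5}, 5: {4, 5}, 6: {1, 2}, 7: {0, 3, 4}}
```

Here Image1 and Image2 contain exactly the same three subtrees (strength 1, confidence 3/8 = .375), while Image3 and Image5 share two of four subtrees in evidence (strength .5, confidence 4/9 ≈ .444).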
What conclusion can we draw from this toy example, practically speaking? The conclusions
drawn by the PLN system are not useful in this case - PLN thinks C and 0 are the same,
as a provisional hypothesis based on this data. But this is not because DeSTIN and PLN are
stupid. Rather, it's because they have not been fed enough data. The hypothesis that a C is an
occluded 0 is actually reasonably intelligent. If we fed these same systems many more pictures,
then the subtree miner would recognize many more frequent subtrees in the larger corpus of
DeSTIN states, and PLN would have a lot more information to go on, and would draw more
commonsensically clever conclusions. We will explore this in our future work.
We present this toy example not as a useful practical achievement, but rather as a very simple
illustration of the process via which subsymbolic knowledge (as in the states of the DeSTIN
deep learning architecture) can be mapped into abstract logical knowledge, which can then be
reasoned on via a probabilistic logical reasoning engine (such as PLN). We believe that the
same process illustrated so simplistically in this example, will also generalize to more realistic
and interesting examples, involving more complex images and inferences. The integration of
DeSTIN and OpenCog described here is being pursued in the context of a project aimed at the
creation of a humanoid robot capable of perceiving, interpreting and acting in its environment
with a high level of general intelligence.
29.7 Conclusion
We have described, at a high level, a novel approach to bridging the symbolic / subsymbolic
gap, via very tightly integrating DeSTIN with OpenCog. We don't claim that this is the only
way to bridge the gap, but we do believe it is a viable way. Given the existing DeSTIN and
OpenCog designs and codebases, the execution of the ideas outlined here seems to be relatively
straightforward, falling closer to the category of "advanced development" than that of blue-sky
research. However, fine-tuning all the details of the approach will surely require substantial
effort.
While we have focused on robotics applications here, the basic ideas described could be im-
plemented and evaluated in a variety of other contexts as well, for example the identification of
objects and events in videos, or intelligent video summarization. Our interests are broad; how-
ever, we feel that robotics is the best place to start - partly due to a general intuition regarding
the deep coupling between human-like intelligence and human-like embodiment; and partly due
to a more specific intuition regarding the value of action for perception, as reflected in Heinz
von Foerster's dictum "if you want to see, learn how to act". We suspect there are important
cognitive reasons why perception in the human brain centrally involves premotor regions. The
coupling of a perceptual deep learning hierarchy and a symbolic AI system doesn't intrinsically
solve the combinatorial explosion problem intrinsic in looking for potential conceptual patterns
in masses of perceptual data. However, a system with particular goals and the desire to act
in such a way as to achieve them, possesses a very natural heuristic for pruning the space of
possible perceptual/conceptual patterns. It allows the mind to focus in on those percepts and
concepts that are useful for action. Of course, there are other ways besides integrating action to
enforce effective pruning, but the integration of perception and action has a variety of desirable
properties that might be difficult to emulate via other methods, such as the natural alignment
of the hierarchical structures of action and reward with that of perception.
The outcome of any complex research project is difficult to foresee in detail. However, our
intuition - based on our experience with OpenCog and DeSTIN, and our work with the math-
ematical and conceptual theories underlying these two systems - is that the hybridization of
OpenCog and DeSTIN as described here will constitute a major step along the path to human-
level AGI. It will enable the creation of an OpenCog instance endowed with the capability of
flexibly interacting with a rich stream of data from the everyday human world. This data will
not only help OpenCog to guide a robot in carrying out everyday tasks, but will also provide
raw material for OpenCogPrime's cognitive processes to generalize from in various ways - e.g.
to use as the basis for the formation of new concepts or analogical inferences.
(a) Subtree 0: (L6,C2,P0) (b) Subtree 1: (L6,C12,P0)
(c) Subtree 2: (L6,C19,P0) (d) Subtree 3: (L6,C12,P1) (e) Subtree 4: (L6,C13,P1)
(f) Subtree 5: (L6,C1,P2) (g) Subtree 6: (L6,C5,P2) (h) Subtree 7: (L6,C20,P3)
Fig. 29.4: Example subtrees extracted from the set of DeSTIN states corresponding to the
input images given above. Each subtree is associated with a triple (level, centroid, position).
The position is one of the four level n squares making up a level n − 1 centroid. In this simple
example, all these frequent subtrees happen to be from Level 6, but this is not generally the
case for more complex images. Some of the centroids look like whitespace, but this is because a
common region of whitespace was recognized among multiple input images.
Section IV
Procedure Learning
Chapter 30
Procedure Learning as Program Learning
30.1 Introduction
Broadly speaking, the learning of predicates and schemata (executable procedures) is done in
CogPrime via a number of different methods, including for example PLN inference and concept
predicatization (to be discussed in later chapters). Most of these methods, however, merely
extrapolate procedures directly from other procedures or concepts in the AtomSpace, in a local
way - a new procedure is derived from a small number of other procedures or concepts. General
intelligence also requires a method for deriving new procedures that are more "fundamentally
new." This is where CogPrime makes recourse to explicit procedure learning algorithms such
as hillclimbing and MOSES, discussed in Chapters 32 and 33 below.
In this brief chapter we formulate the procedure learning problem as a program learning
problem in a general way, and make some high-level observations about it. Conceptually, this
chapter is a follow-up to Chapter 21, which discussed the choice to represent procedures as
programs; here we make some simple observations regarding the implications of this choice for
procedure learning, and the formal representation of procedure learning with CogPrime.
30.1.1 Program Learning
An optimization problem may be defined as follows: a solution space S is specified, together
with some fitness function on solutions, where "solving the problem" corresponds to discovering
a solution in S with a sufficiently high fitness.
In this context, we may define program learning as follows: given a program space P,
a behavior space B, an execution function exec : P → B, and a fitness function on behav-
iors, "solving the problem" corresponds to discovering a program p in P whose corresponding
behavior, exec(p), has a sufficiently high fitness.
In evolutionary learning terms, the program space is the space of genotypes, and the be-
havior space is the space of phenotypes.
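The formalism above can be sketched concretely. Below is a minimal illustration in Python; the program encoding, `exec_fn`, and the fitness target are invented for illustration and are not CogPrime structures:

```python
from itertools import product

# Program space P: a few Boolean formulas over two inputs, as nested tuples.
PROGRAMS = [
    ("and", "x", "y"),
    ("or", "x", "y"),
    ("not", ("and", "x", "y")),
    ("or", ("not", "x"), "y"),
]

def exec_fn(prog, env):
    """Execution function exec : P -> B, evaluating a formula on one input."""
    if isinstance(prog, str):
        return env[prog]
    op = prog[0]
    if op == "not":
        return not exec_fn(prog[1], env)
    a, b = exec_fn(prog[1], env), exec_fn(prog[2], env)
    return (a and b) if op == "and" else (a or b)

def behavior(prog):
    """A program's behavior: its full truth table over all inputs."""
    return tuple(exec_fn(prog, {"x": x, "y": y})
                 for x, y in product([False, True], repeat=2))

# Fitness on behaviors: agreement with a target (here, the implication x -> y).
target = (True, True, False, True)
def fitness(beh):
    return sum(b == t for b, t in zip(beh, target))

# "Solving the problem" = finding a program whose behavior scores highly.
best = max(PROGRAMS, key=lambda p: fitness(behavior(p)))
```

Note that fitness is defined on behaviors, not on programs; the search only touches programs through `exec_fn`, which is exactly the structure the definition above describes.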
This formalism of procedure learning serves well for explicit procedure learning in CogPrime,
setting aside cases like procedure learning within other systems (such as DeSTIN) that may be
hybridized with CogPrime.
Of course, this extended formalism can be entirely vacuous - the behavior space could be
identical to the program space, and the execution function simply the identity, allowing any opti-
mization problem to be cast as a problem of program learning. The utility of this specification
arises when we make interesting assumptions regarding the program and behavior spaces, and
the execution and fitness functions (thus incorporating additional inductive bias):
1. Open-endedness - P has a natural "program size" measure - programs may be enumer-
ated from smallest to largest, and there is no obvious problem-independent upper bound
on program size.
2. Over-representation - exec often maps many programs to the same behavior.
3. Compositional hierarchy - programs themselves have an intrinsic hierarchical organiza-
tion, and may contain subprograms which are themselves members of P or some related
program space. This provides a natural family of distance measures on programs, in terms
of the number and type of compositions / decompositions needed to transform one
program into another (i.e., edit distance).
4. Chaotic Execution - very similar programs (as conceptualized in the previous item) may
have very different behaviors.
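Two of these properties - over-representation and chaotic execution - are easy to illustrate on a toy program space. The following sketch, with invented encodings not drawn from CogPrime, uses Boolean formulas over two inputs:

```python
# Behaviors are truth tables over a 2-bit input.
inputs = [(x, y) for x in (0, 1) for y in (0, 1)]
behavior = lambda prog: tuple(prog(x, y) for x, y in inputs)

# Over-representation: syntactically different programs, identical behavior.
p1 = lambda x, y: x and y
p2 = lambda x, y: not ((not x) or (not y))   # De Morgan variant of p1
assert behavior(p1) == behavior(p2)

# Chaotic execution: a one-token edit ("and" -> "or") changes the
# behavior on half of all inputs.
p3 = lambda x, y: x or y
diff = sum(a != b for a, b in zip(behavior(p1), behavior(p3)))
```

A one-token syntactic change (small edit distance) thus produced a large behavioral change, while a larger syntactic change (p1 to p2) produced none: the program-to-behavior map is neither smooth nor injective.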
Precise mathematical definitions could be given for all of these properties but would provide
little insight - it is more instructive to simply note their ubiquity in symbolic representations:
human programming languages (LISP, C, etc.), Boolean and real-valued formulae, pattern-
matching systems, automata, and many more. The crux of this line of thought is that the
combination of these four factors conspires to scramble fitness functions - even if the mapping
from behaviors to fitness is separable or nearly decomposable, the complex* program space and
chaotic execution function will often quickly lead to intractability as problem size grows. These
properties are not superficial inconveniences that can be circumvented by some particularly
clever encoding. On the contrary, they are the essential characteristics that give programs the
power to compress knowledge and generalize correctly, in contrast to flat, inert representations
such as lookup tables (see Baum [Bau04] for a full treatment of this line of argument).
The consequences of this particular kind of complexity, together with the fact that most
program spaces of interest are combinatorially very large, might lead one to believe that com-
petent program learning is impossible. Not so: real-world program learning tasks of interest
have a compact structure† - they are not "needle in haystack" problems or uncorrelated fitness
landscapes, although they can certainly be encoded as such. The most one can definitively
state is that algorithm foo, methodology bar, or representation baz is unsuitable for express-
ing and exploiting the regularities that occur across interesting program spaces. Some of these
regularities are as follows:
1. Simplicity prior - our prior assigns greater probability mass to smaller programs.
2. Simplicity preference - given two programs mapping to the same behavior, we prefer
the smaller program (this can be seen as a secondary fitness function).
3. Behavioral decomposability - the mapping between behaviors and fitness is separable
or nearly decomposable. Relatedly, fitnesses are more than scalars - there is a partial ordering
corresponding to behavioral dominance, where one behavior dominates another if it exhibits
* Here "complex" means open-ended, over-representing, and hierarchical.
† Otherwise, humans could not write programs significantly more compact than lookup tables.
a strict superset of the latter's desiderata, according to the fitness function.‡ This partial
order will never contradict the total ordering of scalar fitnesses.
4. White box execution - the mechanism of program execution is known a priori, and
remains constant across many problems.
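The dominance partial order in item 3 can be made concrete for supervised classification, where a behavior can be summarized by the set of items a rule classifies correctly. A minimal sketch (names are illustrative):

```python
def dominates(correct_a, correct_b):
    """True iff behavior A correctly classifies a strict superset of B's items."""
    return correct_b < correct_a   # strict-subset comparison on Python sets

# Sets of correctly classified items for three hypothetical rules.
rule_a = {"item1", "item2", "item3"}
rule_b = {"item1", "item2"}
rule_c = {"item1", "item4"}

assert dominates(rule_a, rule_b)       # A dominates B
assert not dominates(rule_a, rule_c)   # A and C are incomparable:
assert not dominates(rule_c, rule_a)   # each gets an item the other misses

# The partial order never contradicts scalar fitness (= number correct):
assert len(rule_a) > len(rule_b)
```

Note that dominance is only a partial order: rules A and C above are incomparable even though scalar fitness would rank them.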
How these regularities may be exploited will be discussed in later sections and chapters.
Another fundamental regularity of great interest for artificial general intelligence is patterns
across related problems that may be solvable with similar programs (e.g., involving common
modules).
30.2 Representation-Building
One important issue in achieving competent program learning is representation building. In an
ideally encoded optimization problem, all prespecified variables would exhibit complete separa-
bility, and could be optimized independently. Problems with hierarchical dependency structure
cannot be encoded this way, but are still tractable by dynamically learning the problem decom-
position (as is done by the BOA and hBOA, described in Chapter 33). For complex problems
with interacting subcomponents, finding an accurate problem decomposition is often tanta-
mount to finding a solution. In an idealized run of a competent optimization algorithm, the
problem decomposition evolves along with the set of solutions being considered, with parallel
convergence to the correct decomposition and the global solution optima. However, this is cer-
tainly contingent on the existence of some compact§ and reasonably correct decomposition in
the space (of decompositions, not solutions) being searched.
Difficulty arises when no such decomposition exists, or when a more effective decomposition
exists that cannot be formulated as a probabilistic model over representational parameters.
Accordingly, one may extend current approaches via either: a more general modeling language
for expressing problem decompositions; or additional mechanisms that modify the representa-
tions on which modeling operates (introducing additional inductive bias). In CogPrime we have
focused on the latter - the former would appear to require qualitatively more computational
capacity than will be available in the near future. If one ignores this constraint, such a
"universal" approach to general problem-solving is indeed possible, e.g. AIXI as discussed in
Section 7.3.
We refer to these additional mechanisms as "representation-building" because they serve
the same purpose as the pre-representational mechanisms employed (typically by humans) in
setting up an optimization problem - to present an optimization algorithm with the salient pa-
rameters needed to build effective problem decompositions and vary solutions along meaningful
dimensions. We return to this issue in detail in Chapter 33 in the context of MOSES, the most
powerful procedure-learning algorithm provided in CogPrime.
‡ For example, in supervised classification one rule dominates another if it correctly classifies all of the items
that the second rule classifies correctly, as well as some which the second rule gets wrong.
§ The decomposition must be compact because in practice only a fairly small sampling of solutions may be
evaluated (relative to the size of the total space) at a time, and the search mechanism for exploring decomposition-
space is greedy and local. This is also in accordance with the general notion of learning corresponding to
compression.
30.3 Specification Based Procedure Learning
Now we explain how procedure learning fits in with the declarative and intentional knowledge
representation in the AtomSpace.
The basic method that CogPrime uses to learn procedures that appear fundamentally new
from the point of view of the AtomSpace at a given point in time is "specification-based pro-
cedure learning". This involves taking a PredicateNode with a ProcedureNode input type as a
specification, and searching for ProcedureNodes that fulfill this specification (in the sense of
making the specification PredicateNode as true as possible). In evolutionary computing lingo,
the specification predicate is a fitness function.
Searching for PredicateNodes that embody patterns in the AtomSpace as a whole is a spe-
cial case of this kind of learning, where the specification PredicateNode embodies a notion of
what constitutes an "interesting pattern". The quantification of interestingness is of course an
interesting and nontrivial topic in itself.
Finding schemata that are likely to achieve goals important to the system is also a special
case of this kind of learning. In this case, the specification predicate is of the form:
F(S) = PredictiveImplicationLink (ExOutLink S) G
This measures the extent to which executing schema S is positively correlated with goal-
predicate G being achieved shortly later.
Given a PredicateNode interpretable as a specification, how do we find a ProcedureNode sat-
isfying the specification? Lacking prior knowledge sufficient to enable an incremental approach
like inference, we must search the space of possible ProcedureNodes, using an appropriate search
heuristic, hopefully one that makes use of the system's existing knowledge as fully as possible.
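Schematically, specification-based procedure learning amounts to using the specification predicate directly as a fitness function over candidate procedures. The following sketch is purely illustrative - `specification_search`, the candidate list, and the toy spec are invented names, not AtomSpace operations:

```python
def specification_search(spec, candidates, threshold=0.9):
    """Search candidate procedures for one making `spec` as true as possible."""
    best, best_tv = None, 0.0
    for proc in candidates:
        tv = spec(proc)            # truth value in [0, 1] of the spec predicate
        if tv > best_tv:
            best, best_tv = proc, tv
        if best_tv >= threshold:   # spec satisfied well enough: stop searching
            break
    return best, best_tv

# Toy usage: "find a procedure whose output on input 4 is 16".
procs = [lambda n: n + 6, lambda n: n * n, lambda n: n * 3]
spec = lambda p: 1.0 if p(4) == 16 else 0.0
proc, tv = specification_search(spec, procs)
```

In CogPrime the candidate stream would come from an explicit procedure learning algorithm (hillclimbing or MOSES) rather than a fixed list, but the role of the specification predicate as fitness function is the same.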
Chapter 31
Learning Procedures via Imitation, Reinforcement
and Correction
Co-authored with Moshe Looks, Samir Araujo and Welter Silva
31.1 Introduction
In procedure learning as elsewhere in cognition, it's not enough to use the right algorithm, one
has to use it in the right way based on the data and context and affordances available. While
Chapters 32 and 33 focus on procedure learning algorithms, this one focuses on procedure
learning methodology. We will delve into the important special case of procedure learning in
which the fitness function involves reinforcement and imitation supplied by a teacher and/or an
environment, and look at examples of this in the context of teaching behaviors to virtual pets
controlled by OpenCogPrime. While this may seem a very narrow context, many of the lessons
learned are applicable more broadly; and the discussion has the advantage of being grounded in
actual experiments done with OpenCogPrime's predecessor system, the Novamente Cognition
Engine, and with an early OpenCog version as well, during the period 2007-2008.
We will focus mainly on learning from a teacher, and then comment on the very similar case
where the environment, rather than some specific agent, is the teacher.
31.2 IRC Learning
Suppose one intelligent agent (the "teacher") has knowledge of how to carry out a certain
behavior, and wants to transfer this knowledge to another intelligent agent (the "student").
But, suppose the student agent lacks the power of language (which might be, for example,
because language is the thing being taught!). How may the knowledge be transferred? At least
three methodologies are possible:
1. Imitative learning: The teacher acts out the behavior, showing the student by example
2. Reinforcement learning: The student tries to do the behavior himself, and the teacher
gives him feedback on how well he did
3. Corrective learning: As the student attempts the behavior, the teacher actively corrects
(i.e. changes) the student's actions, guiding him toward correct performance
Obviously, these three forms of instruction are not exclusive. What we describe here, and call
IRC learning, is a pragmatic methodology for instructing AGI systems that combines these
three forms of instruction. We believe this combination is a potent one, and is certainly implicit
in the way human beings typically teach young children and animals.
For sake of concreteness, we present IRC learning here primarily in the context of virtually
embodied AGI systems — i.e., AGI systems that control virtual agents living in virtual worlds.
There is an obvious extension to physical robots living in the real world and capable of flexible
interaction with humans. In principle, IRC learning is applicable more broadly as well, and could
be explored in various non-embodied contexts such as (for instance) automated theorem-proving.
In general, the term "IRC learning" may be used to describe any teacher/student interaction
that involves a combination of reinforcement, imitation and correction. While we have focused
in our practical work so far on the use of IRC to teach simple "animal-like" behaviors, the
application that interests us more in the medium term is language instruction, to which we will
return in later chapters.
Harking back to Chapter 9, it is clear that an orientation toward effective IRC learning
will be valuable for any system attempting to achieve complex goals in an environment heavily
populated by other intelligences possessing significant goal-relevant knowledge. Everyday human
environments possess this characteristic, and we suggest the best way to create human-level
AGIs will be to allow them to develop in environments possessing this characteristic as well.
31.2.1 A Simple Example of Imitation/Reinforcement Learning
Perhaps the best way to introduce the essential nature of the IRC teaching protocol is to
give a brief snippet from a script that was created to guide the actual training of the virtual
animals controlled by the PetBrain. This snippet involves only I and R; the C will be discussed
afterwards.
This snippet demonstrates a teaching methodology that involves two avatars: Bob, who plays
the teacher, and Jill, who plays an "imitation animal," showing the animal what to do
by example.
1. Bob wants to teach the dog Fido a trick. He calls his friend Jill over. "Jill, can you help me
teach Fido a trick?"
2. Jill comes over. "How much will you pay me for it?"
3. Bob gives her a kiss.
4. "All right," says Jill, "what do you want to teach him?"
5. "Let's start with fetching stuff," replies Bob.
6. So Bob and Jill start teaching Fido to fetch using the Pet language....
7. Bob says: "Fido, I'm going to teach you to play fetch with Jill."
8. Fido sits attentively, looking at Bob.
9. Bob says: "OK, I'm playing fetch now."
10. Bob picks up a stick from the ground and throws it. Jill runs to get the stick and brings it
back to Bob.
11. Bob says: "I'm done fetching."
12. Bob says, "You try it."
13. Bob throws a stick. Fido runs to the stick, gets it, and brings it back.
14. Bob says "Good dog!"
15. Fido looks happy.
16. Bob says: "Ok, we're done with that game of fetch."
17. Bob says, "Now, let's try playing fetch again."
18. This time, Bob throws a stick in a different direction, where there's already a stick lying
on the ground (call the other stick Stick 2).
19. Fido runs and retrieves Stick 2. As soon as he picks it up, Bob says "No." But Fido keeps
on running and brings the stick back to Bob.
20. Bob says "No, that was wrong. That was the wrong stick. Stop trying!"
21. Jill says, "Furry little moron!"
22. Bob says to Jill, "Have some patience, will you? Let's try again."
23. Fido is slowly wandering around, sniffing the ground.
24. Bob says "Fido, stay." Fido returns near Bob and sits.
25. Bob throws Stick 2. Fido starts to get up and Bob repeats "Fido, stay."
26. Bob goes and picks up Stick 1, and walks back to his original position.
27. Bob says "Fido, I'm playing fetch with Jill again."
28. Bob throws the first stick in the direction of stick 2.
29. Jill goes and gets stick 1 and brings it back to Bob.
30. Bob says "I'm done playing fetch with Jill."
31. Bob says "Try playing fetch with me now." He throws stick 1 in another direction, where
stick 3 and stick 4 are lying on the ground, along with some other junk.
32. Fido runs and gets stick 1 and brings it back.
33. Bob and Jill both jump up and down smiling and say "Good dog! Good dog, Fido!! Good
dog!"
34. Fido smiles and jumps up and licks Jill on the face.
35. Bob says, "Fido, we're done practicing fetch."
In the above transcript, Line 7 initiates a formal training session, and Line 35 terminates this
session. The training session is broken into "exemplar" intervals during which exemplars are
being given, and "trial" intervals during which the animal is trying to imitate the exemplars,
following which it receives reinforcement on its success or otherwise. For instance, line 9 initiates
the presentation of an exemplar interval, and line 11 indicates the termination of this interval.
Line 12 indicates the beginning of a trial interval, and line 16 indicates the termination of this
interval.
The above example of combined imitative/reinforcement learning involves two teachers, but
this is of course not the only way things can be done. Jill could be eliminated from the above
teaching example. The result of this would be that, in figuring out how to imitate the exemplars,
Fido would have to figure out which of Bob's actions were "teacher" actions and which were
"simulated student" actions. This is not a particularly hard problem, but it's harder than the
case where Jill carries out all the simulated-student actions. So in the case of teaching fetch
with only one teacher avatar, on average, more reinforcement trials will be required.
31.2.2 A Simple Example of Corrective Learning
Another interesting twist on the imitative/reinforcement teaching methodology described above
is the use of explicit correctional instructions from the teacher to the animal. This is not shown
in the above example but represents an important addition to the methodology shown there. One
good example of the use of corrections would be the problem of teaching an
animal to sit and wait until the teacher says "Get Up," using only a single teacher. Obviously,
using two teachers, this is a much easier problem. Using only one teacher, it's still easy, but
involves a little more subtlety, and becomes much more tractable when corrections are allowed.
One way that human dog owners teach their dogs this sort of behavior is as follows:
1. Tell the dog "sit"
2. Tell the dog "stay"
3. Whenever the dog tries to get up, tell him "no" or "sit", and then he sits down again
4. Eventually, tell the dog to "get up"
The real dog understands, in its own way, that the "no" and "sit" commands said after the "stay"
command are meta-commands rather than part of the "stay" behavior.
In our virtual-pet case, this would be more like
1. Tell the dog "I'm teaching you to stay"
2. Tell the dog "sit"
3. Whenever the dog tries to get up, tell him "no" or "sit", and then he sits down again
4. Eventually, tell the dog to "get up"
5. Tell the dog "I'm done teaching you to stay"
One easy way to do this, which deviates from the pattern of humanlike interaction, would be
to give the agent knowledge about how to interpret an explicit META flag in communications
directed toward it. In this case, the teaching would look like
1. Tell the dog "I'm teaching you to stay"
2. Tell the dog "META: sit"
3. Whenever the dog tries to get up, tell him "META: no" or "META: sit", and then he sits
down again
4. Eventually, tell the dog to "get up"
5. Tell the dog "I'm done teaching you to stay"
Even without the META tag, this behavior (and other comparable ones) is learnable via
CogPrime's learning algorithms within a modest number of reinforcement trials. So we have not
actually implemented the META approach. But it well illustrates the give-and-take relationship
between the sophistication of the teaching methodology and the number of reinforcement trials
required. In many cases, the best way to reduce the number of reinforcement trials required to
learn a behavior is not to increase the sophistication of the learning algorithm, but rather to
increase the information provided during the instruction process. No matter how advanced the
learning algorithm, if the teaching methodology only gives a small amount of information, it's
going to take many reinforcement trials to work through the search space and find one of the
right procedures satisfying the teacher's desires. One of the differences between the real-world
learning that an animal or human child (or adult) experiences, and the learning "experienced"
by standard machine-learning algorithms, is the richness and diversity of information that the
real world teaching environment provides, beyond simple reinforcement signals. Virtual worlds
provide a natural venue in which to experiment with providing this sort of richer feedback to
AI learning systems, which is one among the many reasons why we feel that virtual worlds are
an excellent venue for experimentation with and education of early-stage AGI systems.
31.3 IRC Learning in the PetBrain
Continuing the theme of the previous section, we now discuss "trick learning" in the PetBrain,
as tested using OpenCog and the Multiverse virtual world during 2007-2008. The PetBrain
constitutes a specific cognitive infrastructure implementing the IRC learning methodology in
the virtual-animal context, with some extensibility beyond this context as well.
In the PetBrain, learning itself is carried out by a variant of hillclimbing as described in
Chapter 32, which is a fast learning algorithm but may fail on harder behaviors (in the sense of
requiring an unacceptably large number of reinforcement trials). For more complex behaviors,
MOSES (Chapter 33) would need to be integrated as an alternative. Compared to hillclimbing,
MOSES is much smarter but slower, and may take a few minutes to solve a problem. The two
algorithms (as implemented for the PetBrain) share the same Combo knowledge representation
and some other software components (e.g. normalization rules for placing procedures in an
appropriate hierarchical normal form, as described in Chapter 21).
The big challenge involved in designing the PetBrain system, AI-wise, was that these learning
algorithms, used in a straightforward way with feedback from a human-controlled avatar as
the fitness function, would have needed an excessive number of reinforcement trials to learn
relatively simple behaviors. This would bore the human beings involved with teaching the
animals. This is not a flaw of the particular learning algorithms being proposed, but is a
generic problem that would exist with any AI algorithm. Choosing an appropriate behavior
out of the space of all possible behaviors satisfying reasonable constraints requires more bits
of information than are contained in a handful of reinforcement trials.
Most "animal training" games (e.g. Nintendogs may be considered as a reference case) work
around this "hard problem" by not allowing teaching of novel behaviors. Instead, a behavior
list is made up front by the game designers. The animals have preprogrammed procedures for
carrying out the behaviors on the list. As training proceeds they make fewer errors, till after
enough training they converge "miraculously" on the pre-programmed plan.
This approach only works, however, if all the behaviors the animals will ever learn have been
planned and scripted in advance.
The first key to making learning of non-pre-programmed behaviors work, without an excessive
number of reinforcement trials, is in "fitness estimation" - code that guesses the fitness of a
candidate procedure at fulfilling the teacher's definition of a certain behavior, without actually
having to try out the procedure and see how it works. This is where the I part of IRC learning
comes in.
At an early stage in designing the PetBrain application, we realized it would be best if the
animals were instructed via a methodology where the same behaviors are defined by the teacher
both by demonstration and by reinforcement signals. Learning based on reinforcement signals
only can also be handled, but learning will be much slower.
In evolutionary programming lingo, we have
1. Procedures — genotypes
2. Demonstrated exemplars, and behaviors generated via procedures = phenotypes
3. Reinforcement signals from pet owner = fitness
One method of imitation-based fitness estimation used in the PetBrain involves an internal
simulation world which we'll call CogSim, as discussed in Chapter 40*. CogSim can be visualized
using a simple testing UI, but in the normal course of operations it doesn't require a user
interface; it is an internal simulation world, which allows the PetBrain to experiment and see
what a certain procedure would be likely to do if enacted in the SL virtual world. Of course, the
accuracy of this kind of simulation depends on the nature of the procedure. For procedures that
solely involve moving around and interacting with inanimate objects, it can be very effective. For
procedures involving interaction with human-controlled avatars, other animals, or other complex
objects, it may be unreliable - and making it even moderately reliable would require significant
work that has not yet been done, in terms of endowing CogSim with realistic simulations of
other agents and their internal motivational structures and so forth. But short of this, CogSim
has nonetheless proved useful for estimating the fitness of simple behavioral procedures.
When a procedure is enacted in CogSim, this produces an object called a "behavior descrip-
tion" (BD), which is represented in the AtomSpace knowledge representation format. The BD
generated by the procedure is then compared with the BDs corresponding to the "exemplar"
behaviors that the teacher has generated, and that the student is trying to emulate. Similarities
are calculated - a fairly subtle matter involving some heuristic inferences - yielding an
estimate of the likelihood that the procedure, if executed in the world, will generate a behavior
adequately similar to the exemplar behaviors.
Furthermore, this process of estimation may be extended to make use of the animal's long-
term episodic memory. Suppose a procedure P is being evaluated in the context of exemplar-set
E. Then
1. The episodic memory is mined for pairs (P', E') that are similar to (P,E)
2. The fitness of these pairs (P', E') is gathered from the experience base
3. An estimate of the fitness of (P,E) is then formed
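The three steps above amount to a similarity-weighted estimate over episodic records. A minimal sketch, with an invented inverse-distance similarity standing in for the heuristic/inferential matching described in the text:

```python
def estimate_fitness(pair, episodic_memory, similarity, k=3):
    """Estimate fitness of (P, E) from the k most similar (P', E') records.

    episodic_memory: list of ((P', E'), fitness) records.
    """
    scored = sorted(episodic_memory,
                    key=lambda rec: similarity(pair, rec[0]),
                    reverse=True)[:k]
    weights = [similarity(pair, rec[0]) for rec in scored]
    if not weights or sum(weights) == 0:
        return 0.0
    # Similarity-weighted average of the remembered fitnesses.
    return sum(w * rec[1] for w, rec in zip(weights, scored)) / sum(weights)

# Toy usage, with pairs encoded as feature vectors.
sim = lambda a, b: 1.0 / (1.0 + sum((x - y) ** 2 for x, y in zip(a, b)))
memory = [((1.0, 0.0), 0.9), ((0.0, 1.0), 0.2), ((1.0, 0.1), 0.8)]
est = estimate_fitness((1.0, 0.05), memory, sim)
```

The real subtlety, as the text notes, lies entirely in the similarity measure; the weighted-average aggregation itself is straightforward.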
Of course, if a behavior description corresponding to P has been generated via CogSim, this
may also be used in the similarity matching against long-term memory. The tricky part here,
of course, is the similarity measurement itself, which can be handled via simple heuristics, but
if taken sufficiently seriously becomes a complex problem of uncertain inference.
One thing to note here is that in the PetBrain context, although learning is done by each
animal individually, this learning is subtly guided by collective knowledge within the fitness
estimation process. Internally, we have a "borg mind" with multiple animal bodies, and an
architecture designed to ensure the maintenance of unique personalities on the part of the
individual animals in spite of the collective knowledge and learning underneath.
At time of writing, we have just begun to experiment with the learning system as described
above, and are using it to learn simple behaviors such as playing fetch, basic soccer skills,
doing specific dances as demonstrated by the teacher, and so forth. We have not yet done
enough experimentation to get a solid feel for the limitations of the methodology as currently
implemented.
Note also the possibility of using CogPrime's PLN inference component to allow generaliza-
tion of learned behaviors. For instance, with inference deployed appropriately, a pet that had
learned how to play tag would afterwards have a relatively easy time learning to play "freeze
tag." A pet that had learned how to hunt for Easter eggs would have a relatively easy time
* The few readers familiar with obscure OpenCog documentation may remember that CogSim was previously
called "Third Life", in reference to the Second Life virtual world that was being used to embody the OpenCog
virtual pets at the time.
learning to play hide-and-seek. Episodic memory can be very useful for fitness estimation here,
but explicit use of inference may allow much more rapid and far-reaching inference capabilities.
31.3.1 Introducing Corrective Learning
Next, how may corrections be utilized in the learning process we have described? Obviously, the
corrected behavior description gets added into the knowledge base as an additional exemplar.
And, the fact of the correction acts as a partial reinforcement (up until the time of the correction,
what the animal was doing was correct). But beyond this, what's necessary is to propagate the
correction backward from the BD level to the procedure level. For instance, if the animal is
supposed to be staying in one place, and it starts to get up but is corrected by the teacher (who
says "sit" or physically pushes the animal back down), then the part of the behavior-generating
procedure that directly generated the "sit" command needs to be "punished." How difficult this
is to do, depends on how complex the procedure is. It may be as simple as providing a negative
reinforcement to a specific "program tree node" within the procedure, thus disincentivizing
future procedures generated by the procedure learning algorithm from containing this node.
Or it may be more complex, requiring the solution of an inference problem of the form "Find
a procedure P′ that is as similar as possible to procedure P, but that does not generate the
corrected behavior, but rather generates the behavior that the teacher wanted instead." This
sort of "working backwards from the behavior description to the procedure" is never going to be
perfect except in extremely simple cases, but it is an important part of learning. We have not
yet experimented with this extensively in our virtual animals, but plan to do so as the project
proceeds.
There is also an interesting variant of correction in which the agent's own memory serves
implicitly as the teacher. That is, if a procedure generates a behavior that seems wrong based
on the history of successful behavior descriptions for similar exemplars, then the system may
suppress that particular behavior or replace it with another one that seems more appropriate
- inference based on history thus serving the role of a correcting teacher.
31.4 Applying A Similar IRC Methodology to Spontaneous Learning
We have described the IRC teaching/learning methodology in the context of learning from a
teacher - but in fact a similar approach can be utilized for purely unsupervised learning. In
that case, the animal's intrinsic goal system acts implicitly as a teacher.
For instance, suppose the animal wants to learn how to better get itself fed. In this case,
1. Exemplars are provided by instances in the animal's history when it has successfully gotten
itself fed
2. Reinforcement is provided by whether or not, when executing a certain procedure, it
actually gets itself fed
3. Correction as such doesn't apply, but implicit correction may be used via deploying history-
based inference. If a procedure generates a behavior that seems wrong based on the history
of successful behavior descriptions for the goal of getting fed, then the system may suppress
that particular behavior.
The only real added complexity here lies in identifying the exemplars. In surveying its own
history, the animal must look at each previous instance in which it got fed (or some sample
thereof), and for each one recollect the series of N actions that it carried out prior to getting fed.
It then must figure out how to set N - i.e. which of the actions prior to getting fed were part of
the behavior that led up to getting fed, and which were just other things the animal happened
to be doing a while before getting fed. To the extent that this exemplar mining problem can
be solved adequately, innate-goal-directed spontaneous learning becomes closely analogous to
teacher-driven learning as we've described it. Or in other words: Experience, as is well known,
can serve as a very effective teacher.
Chapter 32
Procedure Learning via Adaptively Biased
Hillclimbing
Primary author: Nil Geisweiller
32.1 Introduction
Having chosen to represent procedures as programs, explicit procedure learning then becomes
a matter of automated program learning. In its most general incarnation, automated program
learning is obviously an intractable problem; so the procedure learning design problem then boils
down to finding procedure learning algorithms that are effective on the class of problems relevant
to CogPrime systems in practice. This is a subtle matter because there is no straightforward
way to map from the vaguely-defined category of real-world "everyday human world like" goals
and environments to any formally-defined class of relevant objective functions for a program
learning algorithm.
However, this difficulty is not a particular artifact of the choice of programs to represent
procedures; similar issues would arise with any known representational mechanism of suitable
power. For instance, if procedures were represented as recurrent neural nets, there would arise
similar questions of how many layers to give the networks, how to determine the connection
statistics, what sorts of neurons to use, which learning algorithm, etc. One can always push
such problems to the meta level and use automated learning to determine which variety of
learning algorithm to use - but then one has to make some decisions on the metalearning level,
based on one's understanding of the specific structure of the space of relevant program learning
algorithms. In the fictitious world of unbounded computational resources no such judicious
choices are necessary, but that's not the world we live in, and it's not relevant to the design of
human-like AGI systems.
At the moment, in CogPrime, we utilize two different procedure learning systems, which
operate on the same knowledge representation and rely on much of the same internal code.
One, which we roughly label "hill climbing", is used for problems that are sufficiently "easy"
in the sense that it's possible for the system to solve them using feasible resources without
(implicitly or explicitly) building any kind of sophisticated model of the space of solutions to
the problem. The other, MOSES, is used for problems that are sufficiently difficult that the
right way to solve them is to progressively build a model of the program space as one tries
out various solutions, and then use this model to guide ongoing search for better and better
solutions. Hillclimbing is treated in this chapter; MOSES in the next.
32.2 Hillclimbing
"Hillclimbing," broadly speaking, is not a specific algorithm but a category of algorithms. It
applies in general to any search problem where there is a large space of possible solutions, which
can be compared as to their solution quality. Here we are interested in applying it specifically
to problems of search through spaces of programs.
In hillclimbing, one starts with a candidate solution to a problem (often a random one, which
may be very low-quality), and iteratively makes small changes to the candidate to generate new
possibilities, hoping one of them will be a better solution. If a new possibility is better than the
current candidate, then the algorithm adopts the new possibility as its new candidate solution.
When the current candidate solution can no longer be improved via small changes, the algorithm
terminates. Ideally, at that point the current candidate solution is close to optimal - but this
is not guaranteed!
Various tweaks to hillclimbing exist, including "restart" which means that one starts hill-
climbing over and over again, taking the best solution from multiple trials; and "backtracking",
which means that if the algorithm terminates at a solution that seems inadequate, then the
search can "backtrack" to a previously considered candidate solution, and try to make different
small changes to that candidate solution, trying previously unexplored possibilities in search of
a new candidate. The value of these and other tweaks depends on the specific problem under
consideration.
In the specific approach to hillclimbing described here, we use a hillclimber with backtrack-
ing, applied to programs that are represented in the same hierarchical normal form used with
MOSES (based on the program normalization ideas presented in Chapter 21). The basic pseu-
docode for the hillclimber may be given as:
Let L be the list (initially empty) of programs explored so far, in decreasing order with respect
to their fitness. Let Np be the neighbors of program p.
1. Take the best program p in L
2. Evaluate all programs of Np
3. Merge Np into L
4. Move p from L to the set of best programs found so far, and repeat from step 1 until time
runs out
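The steps above can be sketched in Python. This is a minimal illustration, not the actual OpenCog implementation: the bit-tuple "programs" and single-bit-flip neighborhood are toy stand-ins for Combo trees and their normal-form neighborhoods.

```python
import random

def hillclimb(initial, neighbors, fitness, max_evals=10000):
    """Hillclimbing with backtracking: keep a pool of explored programs and
    repeatedly expand the best one not yet expanded (steps 1-4 above)."""
    explored = {initial: fitness(initial)}  # program -> fitness (the list L)
    frontier = [initial]                    # programs not yet expanded
    evals = 1
    while frontier and evals < max_evals:
        # step 1: take the best program p in L
        frontier.sort(key=lambda q: explored[q], reverse=True)
        p = frontier.pop(0)
        # steps 2-3: evaluate all programs of Np and merge them into L
        for q in neighbors(p):
            if q not in explored:
                explored[q] = fitness(q)
                frontier.append(q)
                evals += 1
        # step 4: p now sits in `explored` as a best-so-far candidate
    return max(explored, key=explored.get)

# toy program space: bit-tuples; the neighborhood flips a single bit
target = (1, 0, 1, 1, 0, 1)
neighbors = lambda p: [p[:i] + (1 - p[i],) + p[i + 1:] for i in range(len(p))]
score = lambda p: sum(a == b for a, b in zip(p, target))
random.seed(7)
start = tuple(random.choice([0, 1]) for _ in range(6))
assert hillclimb(start, neighbors, score) == target
```

Because the frontier retains every unexpanded candidate, the search can "backtrack" to an older candidate whenever all recent neighborhoods fail to improve.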
In the following sections of this chapter, we show how to speed up the hillclimbing search
for learning procedures via four optimizations, which have been tested fairly extensively. For
concreteness we will refer often to the specific case of using the hill climbing algorithm to control
a virtual agent in a virtual world - and especially the case of teaching a virtual pet tricks via
imitation learning (as in Chapter 31) but the ideas have more general importance. The four
optimizations are:
• reduce candidates to normal form to minimize over-representation and increase the syntactic-
semantic correlation (Chapter 21),
• filter perceptions using an entropy criterion to avoid building candidates that involve nodes
unlikely to be contained in the solution (Section 32.3),
• use sequences of agent actions, observed during the execution of the program, as building
blocks (Section 32.4),
• choose and calibrate a simplicity measure to focus on simpler solutions (the "Occam bias")
first (Section 32.5).
32.3 Entity and Perception Filters
The number of program candidates of a given size increases exponentially with the alphabet of
the language; therefore it is important to narrow that alphabet as much as possible. This is the
role of the two filters explained below, the Entity and the Entropy filter.
32.3.1 Entity filter
This filter is in charge of selecting the entities in the scene the pet should take into account
during an imitation learning session. These entities can be any objects, avatars or other pets.
In general this is a very hard problem, for instance if a bird is flying near the owner while
teaching a trick, should the pet ignore it? Perhaps the owner wants to teach the pet to bark at
them; if so they should not be ignored.
In our current and prior work with OpenCog controlling virtual world agents, we have used
some fairly crude heuristics for entity filtering, which must be hand-tuned depending on the
properties of the virtual world. However, our intention is to replace these heuristics with entity
filtering based on Economic Attention Networks (ECAN) as described in Chapter 23.
32.3.2 Entropy perception filter
The perception filter is in charge of selecting all perceptions in the scene that are reasonably
likely to be part of the solution to the program learning problem posed. A "perception" in the
virtual world context means the evaluation of one of a set of pre-specified perception predicates,
with an argument consisting of one of the entities in the observed environment.
Given N entities (provided by the Entity filter), there are usually O(N^2) potential perceptions
in the Atomspace, due to binary perceptions like
near(owner bird)
inside(toy box)
The perception filter proceeds by computing the entropy of any potential perceptions hap-
pening during a learning session. Indeed, if the entropy of a given perception P is high, that
means that a conditional if(P B1 B2) has a rather balanced probability of taking branch B1
or B2. On the other hand, if the entropy is low then the probability of taking these branches
is unbalanced; for instance the probability of taking B1 may be significantly higher than the
probability of taking B2, and therefore if(P B1 B2) could reasonably be substituted by B1.
For example, assume that during the teaching sessions the predicate near(owner bird) is
false 99% of the time; then near(owner bird) will have a low entropy and will possibly
be discarded by the filter (depending on the threshold). If the bird is always far from the owner
then it will have entropy 0 and will surely be discarded, but if the bird comes and goes it will
have a high entropy and will pass the filter. Let P be such a perception, and let Pt return 1 when
the perception is true at time t and 0 otherwise, where t ranges over the set of instants, of size N,
recorded between the beginning and the end of the demonstrated trick. The calculation goes as
follows
Entropy(P) = H((1/N) Σt Pt)
where H(p) = −p log(p) − (1 − p) log(1 − p). There are additional subtleties when the perception
involves random operators, like near(owner random_object): in that case the entropy is calculated
by taking into account a certain distribution over entities grouped under the term random_object.
The calculation is optimized to ignore instants when the perception relates to objects that have
not moved, which makes the calculation efficient enough; but there is room to improve it in
various ways. For instance, it could be made to choose perceptions based not only on entropy
but also on inferred relevancy with respect to the context, using PLN.
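The entropy computation above can be sketched in a few lines of Python. The threshold value and the list-of-bits trace representation are illustrative assumptions, not the system's actual parameters.

```python
from math import log

def entropy(trace):
    """H(p) of a binary perception trace, where p is the fraction of
    recorded instants at which the perception held."""
    p = sum(trace) / len(trace)
    if p in (0.0, 1.0):
        return 0.0  # constant perception: surely discarded
    return -p * log(p, 2) - (1 - p) * log(1 - p, 2)

def entropy_filter(traces, threshold=0.1):
    """Keep only perceptions whose entropy over the session exceeds the
    threshold; near-constant perceptions are unlikely to drive a useful
    conditional and are discarded."""
    return [name for name, tr in traces.items() if entropy(tr) > threshold]

traces = {
    "near(owner bird)": [0] * 99 + [1],  # true 1% of the time: low entropy
    "near(stick ball)": [0, 1] * 50,     # comes and goes: entropy of 1 bit
}
print(entropy_filter(traces))  # -> ['near(stick ball)']
```

With the 0.1 threshold used here, the perception that is true only 1% of the time (entropy about 0.08 bits) is filtered out, while the alternating one passes.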
32.4 Using Action Sequences as Building Blocks
A heuristic that has been shown to work, in the "virtual pet trick" context, is to consider
sequences of actions that are compatible with the behavior demonstrated by the avatar showing
the trick as building blocks when defining the neighborhood of a candidate. For instance if the
trick is to fetch a ball, compatible sequences would be
goto(ball), grab(ball), goto(owner), drop
goto(random_object), grab(nearest_object), goto(owner), drop
Sub-sequences can be considered as well - though too many building blocks also increase the
neighborhood exponentially, so one has to be careful when doing that. In practice, using the
set of whole compatible sequences worked well. This can, for instance, speed up manyfold the
learning of the trick triple_kick, as shown in Section 32.6.
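As an illustration of the building-block idea, the following sketch (a hypothetical helper, not the OpenCog code) enumerates the whole demonstrated sequence plus its contiguous sub-sequences, which could then be offered as single-step moves when generating a candidate's neighborhood. It does not attempt the harder part discussed above: generalizing operands, e.g. ball to random_object.

```python
def compatible_subsequences(exemplar, min_len=2):
    """Return the whole observed action sequence first, then its contiguous
    sub-sequences of length >= min_len.  Offering these as single moves lets
    the hillclimber insert a multi-action block in one step instead of
    rediscovering it action by action."""
    n = len(exemplar)
    blocks = [tuple(exemplar)]  # the whole compatible sequence first
    for length in range(min_len, n):
        for start in range(n - length + 1):
            blocks.append(tuple(exemplar[start:start + length]))
    return blocks

demo = ["goto(ball)", "grab(ball)", "goto(owner)", "drop"]
blocks = compatible_subsequences(demo)
assert blocks[0] == ("goto(ball)", "grab(ball)", "goto(owner)", "drop")
assert ("grab(ball)", "goto(owner)") in blocks
```

The caution in the text applies directly: the number of sub-sequences grows quadratically with exemplar length, so in practice restricting to whole compatible sequences kept the neighborhood manageable.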
32.5 Automatically Parametrizing the Program Size Penalty
A common heuristic for program learning is an "Occam penalty" that penalizes large programs,
hence biasing search toward compact programs. The function we use to penalize program size
is inspired by Ray Solomonoff's theory of optimal inductive inference [Sol64a, Sol64b]; simply
said, a program is penalized exponentially with respect to its size. One may also say that, since
the number of program candidates grows exponentially with their size, exploring solutions of
greater size must be worth the exponentially greater cost.
In the next subsections we describe the particular penalty function we have used and how
to tune its parameters.
32.5.1 Definition of the complexity penalty
Let p be a program candidate and penalty(p) a function taking values in [0,1] measuring the
complexity of p. If we consider the complexity penalty function penalty(p) as if it denoted
the prior probability of p, and score(p) (the quality of p as utilized within the hill climbing
algorithm) as denoting the conditional probability of the desired behavior given p, then
Bayes rule tells us that
fitness(p) = score(p) x penalty(p)
denotes the conditional probability of p given the right behavior to imitate; this is the fitness
function that we want to maximize.
It happens that in the pet trick learning context which is our main example in this chapter,
score(p) does not denote such a probability; instead it measures how similar the behavior
generated by p and the behavior to imitate are. However, we utilize the above formula anyway,
with a heuristic interpretation. One may construct assumptions under which score(p) does
represent a probability but this would take us too far afield.
The penalty function we use is then given by:
penalty(p) = exp(−a × log(b × |A| + e) × |p|)
where |p| is the program size, |A| its alphabet size and e = exp(1). The reason |A| enters
into the equation is that the alphabet size varies from one problem to another due to the
perception and action filters. Without that constraint the term log(b × |A| + e) could simply
be included in a. The higher a is, the more intense the penalty. The parameter b controls how
that intensity varies with the alphabet size.
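The penalty and the resulting fitness can be written down directly. This sketch uses as defaults the parameter values reported later in this chapter for the pet-trick experiments (a = 0.03, b = 0.34); the function names are illustrative.

```python
from math import exp, log, e

def size_penalty(program_size, alphabet_size, a=0.03, b=0.34):
    """Exponential Occam penalty: exp(-a * log(b * |A| + e) * |p|).
    a controls the overall intensity; b controls how the intensity scales
    with the alphabet size |A|."""
    return exp(-a * log(b * alphabet_size + e) * program_size)

def fitness(score, program_size, alphabet_size, a=0.03, b=0.34):
    # fitness(p) = score(p) * penalty(p): a heuristic Bayesian posterior
    return score * size_penalty(program_size, alphabet_size, a, b)

# a longer program must score enough higher to beat a shorter rival:
# here a size-20 candidate with score 0.95 loses to a size-4 one at 0.9
assert fitness(0.9, 4, 10) > fitness(0.95, 20, 10)
```

Note how the penalty decays exponentially in program size but only logarithmically in alphabet size, so enlarging the perception set (e.g. by weakening the entropy filter) tightens the bias only gently.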
It is important to remark on the difference between such a penalty function and lexicographic
parsimony pressure (roughly stated: everything else being equal, choose the shortest program).
Because of the use of sequences as building blocks, without such a penalty function the algorithm
may rapidly reach a seemingly optimal program (a mere long sequence of actions) and remain
stuck in an apparent optimum while missing the very logic of the action sequence that the human
wants to convey.
32.5.2 Parameterizing the complexity penalty
Due to the nature of the search algorithm (hill climbing with restart), the choice of the candidate
used to restart the search is crucial. In our case we restart with the candidate with the best
fitness so far which has not yet been used to restart. The danger of such an approach is that
when the algorithm enters a region with local optima (like a plateau), it may basically stay
there as long as there exist better candidates in that region than outside of it that have not yet
been used for restart. Longer programs tend to generate larger regions of local optima (because
they have exponentially more syntactic variations that lead to close behaviors), so if the search
enters such a region via an overly complex program it is likely to take a very long time to get out
of it. Introducing randomness into the choice of the restart may help avoid that sort of trap,
but having experimented with that, it turned out not to be significantly better on average for
learning relatively simple things (indeed, although the restart choice is more diverse, it still tends
to occur in large regions of local optima).
However, a significant improvement we have found is to carefully choose the size penalty
function so that the search will tend to restart on simpler programs even if they do not exhibit
• Bayes rule as used here is P(M|D) = P(M)P(D|M)/P(D), where M denotes the model (the program) and D
denotes the data (the behavior to imitate); here P(D) is ignored, that is, the data is assumed to be distributed
uniformly
the best behaviors, but will still be able to reach the optimal solution even if it is a complex
one.
The solution we suggest is to choose a and b such that penalty(p) is:
1. as penalizing as possible, to focus on simpler programs first (although that constraint may
possibly be lightened as the experimentation shows),
2. but still correct in the sense that the optimal solution p maximizes fitness(p).
And we want that to work for all problems we are interested in. That restriction is an
important point because it is likely that in general the second constraint will be too strict to
produce a good penalty function.
We will now formalize the above problem. Let i be an index that ranges over the set of
problems of interest (in our case pet tricks to learn); score_i and fitness_i denote the score and
fitness functions of the i-th problem. Let Θ_i(s) denote the set of programs of score s
Θ_i(s) = {p | score_i(p) = s}
Define a family of partial functions
f_i : [0,1] → N
so that
f_i(s) = min over p ∈ Θ_i(s) of |p|
What this says is that for any given score s we want the size of the shortest program p with
that score. And f_i is partial because there may not be any program returning a given score.
Let g_i be the family of partial functions
g_i : [0,1] → [0,1]
parametrized by a and b such that
g_i(s) = s × exp(−a × log(b × |A| + e) × f_i(s))
That is: given a score s, g_i(s) returns the fitness fitness_i(p) of the shortest program p that
attains that score.
32.5.3 Definition of the Optimization Problem
Let s_i be the highest score obtained for fitness function i (that is, the score of the program
chosen as the current best solution of problem i). Now the optimization problem consists of
finding some a and b such that
∀i : argmax_s g_i(s) = s_i
that is, the highest score also has the highest fitness. We started by choosing a and b as high
as possible; this is a good heuristic but not the best one. The best would be to choose a and b so
that they minimize the number of iterations (number of restarts) to reach a global optimum,
which is a harder problem.
Also, regarding the resolution of the above equation, it is worth noting that we do not need the
analytical expression of score_i(p). Using past learning experiences we can get a partial description
of the fitness landscape of each problem just by looking at the traces of the search.
Overall we have found this optimization works rather well; that is, tricks that would otherwise
take several hours or days of computation can be learned in seconds or minutes. And the method
also enables fast learning for new tricks, in fact all tricks we have experimented with so far could
be learned reasonably fast (seconds or minutes) without the need to retune the penalty function.
In the current CogPrime codebase, the algorithm in charge of calibrating the parameters
of the penalty function has been written in Python. It takes as input the log of the imitation
learning engine, which contains the score, the size, the penalty and the fitness of all candidates
explored for all tricks taken into consideration for the parameterization. The algorithm proceeds
in 2 steps:
1. Reconstitute the partial functions f_i for all fitness functions i already attempted, based on
the traces of these previously optimized fitness functions.
2. Try to find the highest a and b so that
∀i : argmax_s g_i(s) = s_i
For step 2, since there are only two parameters to tune, we have used a 2D grid, enumerating
all points (a, b) and zooming when necessary. So the speed of the process depends largely on
the resolution of the grid, but (on an ordinary 2009 PC processor) it usually does not require
more than 20 minutes to both extract f_i and find a and b with a satisfactory resolution.
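The two calibration steps can be sketched as follows. This is a toy stand-in for the Python calibration utility described above: the "zooming" refinement is omitted, the grid bounds are arbitrary, and the example problem data (alphabet size, score-to-size map) are invented for illustration.

```python
from math import exp, log, e

def g(s, f_i, A, a, b):
    """g_i(s): fitness of the shortest known program with score s."""
    return s * exp(-a * log(b * A + e) * f_i[s])

def calibrate(problems, grid=50, a_max=0.05, b_max=2.0):
    """Enumerate a 2D grid of (a, b) and keep the largest pair such that,
    for every problem, the highest score also has the highest fitness
    (the penalty as strong as possible while still 'correct').  Each
    problem supplies its alphabet size A, the partial map f_i
    (score -> size of the shortest program with that score, reconstituted
    from earlier search traces), and its best score s_i."""
    valid = []
    for i in range(1, grid + 1):
        for j in range(1, grid + 1):
            a, b = a_max * i / grid, b_max * j / grid
            if all(max(f_i, key=lambda s: g(s, f_i, A, a, b)) == s_i
                   for A, f_i, s_i in problems):
                valid.append((a, b))
    return max(valid) if valid else None  # prefer the highest a, then b

# one toy problem: alphabet size 10; shortest-program sizes from traces;
# the optimal (score 1.0) program is long, so the penalty must stay gentle
f1 = {0.5: 3, 0.8: 6, 1.0: 20}
a, b = calibrate([(10, f1, 1.0)])
assert max(f1, key=lambda s: g(s, f1, 10, a, b)) == 1.0
```

Because the optimal program in the toy problem is much larger than its rivals, only mild (a, b) pairs survive the correctness check, which mirrors why overly strong penalties (such as conf8 in the experiments below) can prevent the optimum from being ranked first.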
32.6 Some Simple Experimental Results
To test the above ideas in a simple context, we initially used them to enable an OpenCog
powered virtual world agent to learn a variety of simple "dog tricks" based on imitation and
reinforcement learning in the Multiverse virtual world. We have since deployed them on a variety
of other applications in various domains.
We began these experiments by running learning on two tricks, fetch_ball and triple_kick,
to be described below, in order to calibrate the size penalty function:
1. fetch_ball, which corresponds to the Combo program
and_seq(goto(ball)
grab(ball)
goto(owner)
drop)
2. triple_kick: if the stick is near the ball then kick 3 times with the left leg, and otherwise 3
times with the right leg. So for that trick the owner had to provide 2 exemplars, one for
kickL (with the stick near the ball) and one for kickR, and move the ball away from the
stick before showing the second exemplar. Below is the Combo program of triple_kick
if(near(stick ball)
and_seq(kickL kickL kickL)
and_seq(kickR kickR kickR))
Before choosing an exponential size penalty function and calibrating it, fetch_ball would be
learned rather rapidly, in a few seconds, but triple_kick would take more than an hour. After
calibration both fetch_ball and triple_kick would be learned rapidly, the latter in less than a
minute.
Then we experimented with a few new tricks, some simpler, like sit_under_tree
and_seq(goto(tree) sit)
and others more complex, like double_dance, where the trick consists of dancing until the
owner emits the message "stop dancing", changing the dance according to the owner's actions
while(not(says(owner "stop dancing"))
if(last_action(owner "kickL")
tap_dance
lean_rock_dance))
That is the pet performs a tap_dance when the last action of the owner is kickL, and
otherwise performs a lean_rock_dance.
We tested learning for 3 tricks: fetch_ball, triple_kick and double_dance. Each trick was
tested in 10 settings, denoted conf1 to conf10, summarized in Table 32.1.
• conf1 is the default configuration of the system; the parameters of the size penalty function
are a = 0.03 and b = 0.34, which is actually not what is returned by the calibration
technique, but close to it. That is because in practice we have found that on average learning
works slightly faster with these values.
• conf2 is the configuration with the exact values returned by the calibration, that is a = 0.05,
b = 0.94.
• conf3 has the reduction engine disabled.
• conf4 has the entropy filter disabled (the threshold is null so all perceptions pass the filter).
• conf5 has the intensity of the penalty function set to 0.
• conf6 has the penalty function set with low intensity.
• conf7 and conf8 have the penalty function set with high intensity.
• conf9 has the action sequence building block enabled.
• conf10 has the action sequence building block enabled, but with a slightly lower intensity of
the size penalty function than normal.
Reduct ActSeq Entropy a      b    Setting
On     Off    0.1     0.03   0.34 conf1
On     Off    0.1     0.05   0.94 conf2
Off    Off    0.1     0.03   0.34 conf3
On     Off    0       0.03   0.34 conf4
On     Off    0.1     0      0.34 conf5
On     Off    0.1     0.0003 0.34 conf6
On     Off    0.1     0.3    0.34 conf7
On     Off    0.1     3      0.34 conf8
On     On     0.1     0.03   0.34 conf9
On     On     0.1     0.025  0.34 conf10
Table 32.1: Settings for each learning experiment
Setting Percep Restart Eval    Time
conf1   3      4       653     5s21
conf2   3      3       245     2s
conf3   3      3       1073    8s42
conf4   136    3       28287   4mn7s
conf5   3      >700    >500000 >1h
conf6   3      3       653     5s21
conf7   3      8       3121    23s42
conf8   3      147     65948   8mn10s
conf9   3      0       89      410ms
conf10  3      0       33      161ms
Table 32.2: Learning time for fetch_ball
Setting Percep Restart Eval  Time
conf1   1      18      2783  21s47
conf2   1      110     11426 1mn53s2
conf3   1      49      15069 2mn15s2
conf4   124    ∞       ∞     ∞
conf5   1      >800    >200K >1h
conf6   1      7       1191  9s67
conf7   1      >2500   >200K >1h
conf8   1      >2500   >200K >1h
conf9   1      0       107   146ms
conf10  1      0       101   164ms
Table 32.3: Learning time for triple_kick
Setting Eval  Time
conf1   113   4s
conf2   113   4s
conf3   150   6s20ms
conf4   >60K  >1h
conf5   113   4s
conf6   113   4s
conf7   113   4s
conf8   >300K >1h
conf9   138   4s191ms
conf10  219K  56mn3s
Table 32.4: Learning time for double_dance
Tables 32.2, 32.3 and 32.4 contain the results of the learning experiments for the three tricks,
fetch_ball, triple_kick and double_dance. In each table the column Percep gives the number
of perceptions taken into account for the learning. Restart gives the number of restarts
hill climbing had to do before reaching the solution. Eval gives the number of evaluations, and
Time the search time.
In Tables 32.2 and 32.4 we can see that fetch_ball and double_dance are learned in a few
seconds both in conf1 and conf2. In Table 32.3, however, learning is about five times faster with
conf1 than with conf2, which was the motivation to go with conf1 as the default configuration;
but conf2 still performs well.
As Tables 32.2, 32.3 and 32.4 demonstrate for setting conf3, the reduction engine speeds the
search up by a factor of less than two for fetch_ball and double_dance, and many times over
for triple_kick.
The results for conf4 show the importance of the filtering function: learning is dramatically
slowed down without it. A simple trick like fetch_ball took a few minutes instead of seconds,
double_dance could not be learned after an hour, and triple_kick might never be learned
because it did not focus on the right perception from the start.
The results for conf5 show that without any kind of complexity penalty, learning can be
dramatically slowed down, for the reasons explained in Section 32.5; that is, the search loses
itself in large regions of sub-optima. Only double_dance was not affected by this, which is
probably explained by the fact that only one restart occurred in double_dance, and it happened
to be the right one.
The results for conf6 show that when the action sequence building-block is disabled, the intensity
of the penalty function could be set even lower. For instance triple_kick is learned faster (9s67
instead of 21s47 for conf1). Conversely, the results for conf10 show that when the action sequence
building-block is enabled, if the Occam's razor is too weak it can dramatically slow down the
search. That is because in this circumstance the search is misled by longer candidates that fit,
and takes a very long time before it can reach the optimal, more compact solution.
32.7 Conclusion
In our experimentation with hillclimbing for learning pet tricks in a virtual world, we have
shown that the combination of
1. candidate reduction into normal form,
2. filtering operators to narrow the alphabet,
3. using action sequences that are compatible with the shown behavior as building blocks,
4. adequately choosing and calibrating the complexity penalty function,
can speed up imitation learning so that moderately complex tricks can be learned within seconds
to minutes instead of hours, using a simple "hill climbing with restarts" learning algorithm.
While we have discussed these ideas in the context of pet tricks, they have of course been
developed with more general applications in mind, and have been applied in many additional
contexts. Combo can be used to represent any sort of procedure, and both the hillclimbing
algorithm and the optimization heuristics described here appear broad in their relevance.
Natural extensions of the approach described here include the following directions:
1. improving the Entity and Entropy filters using ECAN and PLN, so that filtering is based not
only on entropy but also on relevancy with respect to the context and background knowledge,
2. using transfer learning (see Section 33.5 of Chapter 33) to tune the parameters of the
algorithm using contextual and background knowledge.
Indeed these improvements are under active investigation at the time of writing, and some may
well have been implemented and tested by the time you read this.
Chapter 33
Probabilistic Evolutionary Procedure Learning
Co-authored with Moshe Looks*
33.1 Introduction
The CogPrime architecture fundamentally requires, as one of its components, some powerful
algorithm for automated program learning. This algorithm must be able to solve procedure
learning problems relevant to achieving human-like goals in everyday human environments, re-
lying on the support of other cognitive processes, and providing them with support in turn. The
requirement is not that complex human behaviors need to be learnable via program induction
alone, but rather that when the best way for the system to achieve a certain goal seems to be
the acquisition of a chunk of procedural knowledge, the program learning component should be
able to acquire the requisite procedural knowledge.
As CogPrime is a fairly broadly-defined architecture overall, there are no extremely precise
requirements for its procedure learning component. There could be variants of CogPrime in
which procedure learning carried more or less weight, relative to other components.
Some guidance here may be provided by looking at which tasks are generally handled by
humans primarily using procedural learning, a topic on which cognitive psychology has a fair
amount to say, and which is also relatively amenable to commonsense understanding based on
our introspective and social experience of being human. When we know how to do something,
but can't explain very clearly to ourselves or others how we do it, the chances are high that we
have acquired this knowledge using some form of "procedure learning" as opposed to declarative
learning. This is especially the case if we can do this same sort of thing in many different
contexts, each time displaying a conceptually similar series of actions, but adapted to the specific
situation. We would like CogPrime to be able to carry out procedural learning in roughly the
same situations ordinary humans can (and potentially other situations as well: maybe even at
the start, and definitely as development proceeds), largely via action of its program learning
component.
In practical terms, our intuition (based on considerable experience with automated program
learning, in OpenCog and other contexts) is that one requires a program learning component
capable of learning programs with between dozens and hundreds of program tree nodes, in
Combo or some similar representation - not able to learn arbitrary programs of this size, but
rather able to solve problems arising in everyday human situations in which the simplest ac-
ceptable solutions involve programs of this size. We also suggest that the majority of procedure
" First author
learning problems arising in everyday human situations can be solved via programs with hierar-
chical structure, so that it likely suffices to be able to learn programs with between dozens and
hundreds of program tree nodes, where the programs have a modular structure, consisting of
modules each possessing no more than dozens of program tree nodes. Roughly speaking, with
only a few dozen Combo tree nodes, complex behaviors seem only achievable via using very
subtle algorithmic tricks that aren't the sort of thing a human-like mind in the early stages of
development could be expected to figure out; whereas, getting beyond a few hundred Combo
tree nodes, one seems to get into the domain where an automated program learning approach is
likely infeasible without rather strong restrictions on the program structure, so that a more ap-
propriate approach within CogPrime would be to use PLN, concept creation or other methods
to fuse together the results of multiple smaller procedure learning runs.
While simple program learning techniques like hillclimbing (as discussed in Chapter 32 above)
can be surprisingly powerful, they do have fundamental limitations, and our experience and
intuition both indicate that they are not adequate for serving as CogPrime's primary program
learning component. This chapter describes an algorithm that we do believe is thus capable -
CogPrime's most powerful and general procedure learning algorithm, MOSES, an integrative
probabilistic evolutionary program learning algorithm that was briefly overviewed in Chapter
6 of Part I.
While MOSES as currently designed and implemented embodies a number of specific algo-
rithmic and structural choices, at bottom it embodies two fundamental insights that are critical
to generally intelligent procedure learning:
• Evolution is the right approach to the learning of difficult procedures
• Enhancing evolution with probabilistic methods is necessary. Pure evolution, in the vein of
the evolution of organisms and species, is too slow for broad use within cognition; so what
is required is a hybridization of evolutionary and probabilistic methods, where probabilistic
methods provide a more directed approach to generating candidate solutions than is possible
with typical evolutionary heuristics like crossover and mutation
We summarize these insights in the phrase Probabilistic Evolutionary Program Learning (PEPL);
MOSES is then one particular PEPL algorithm, and in our view a very good one. We have also
considered other related algorithms such as the PLEASURE algorithm [Goe08a] (which may
also be hybridized with MOSES), but for the time being it appears to us that MOSES satisfies
CogPrime's needs.
Our views on the fundamental role of evolutionary dynamics in intelligence were briefly pre-
sented in Chapter 3 of Part 1. Terrence Deacon said it even more emphatically: "At every step
the design logic of brains is a Darwinian logic: overproduction, variation, competition, selec-
tion ... it should not come as a surprise that this same logic is also the basis for the normal
millisecond-by-millisecond information processing that continues to adapt neural software to
the world." [Dea98] He has articulated ways in which, during neurodevelopment, different computations
compete with each other (e.g., to determine which brain regions are responsible for
motor control). More generally, he posits a kind of continuous flux as control shifts between
competing brain regions, again, based on high-level "cognitive demand."
Deacon's intuition is similar to the one that led Edelman to propose Neural Darwinism
[Ede93], and Calvin and Bickerton [CB00] to propose the notion of mind as a "Darwin Machine".
The latter have given plausible neural mechanisms ("Darwin Machines") for synthesizing short
"programs". These programs are for tasks such as rock throwing and sentence generation, which
are represented as coherent firing patterns in the cerebral cortex. A population of such patterns,
33.1 Introduction 217
competing for neurocomputational territory, replicates with variations, under selection pressure
to conform to background knowledge and constraints.
To incorporate these insights, a system is needed that can recombine existing solutions in
a non-local synthetic fashion, learning nested and sequential structures, and incorporate back-
ground knowledge (e.g. previously learned routines). MOSES is a particular kind of program
evolution intended to satisfy these goals, using a combination of probability theory with ideas
drawn from genetic programming, and also incorporating some ideas we have seen in previous
chapters such as program normalization.
The main conceptual assumption about CogPrime's world, implicit in the suggestion of
MOSES as the primary program learning component, is that the goal-relevant knowledge that
cannot effectively be acquired by the other methods at CogPrime's disposal (PLN, ECAN, etc.)
forms a body of knowledge that can effectively be induced via probabilistic modeling on
the space of programs for controlling a CogPrime agent. If this is not true, then MOSES will
provide no advantage over simple methods like well-tuned hillclimbing as described in Chapter
32. If it is true, then the effort of deploying a complicated algorithm like MOSES is worthwhile.
In essence, the assumption is that there are relatively simple regularities among the programs
implementing those procedures that are most critical for a human-like intelligence to acquire
via procedure learning rather than other methods.
33.1.1 Explicit versus Implicit Evolution in CogPrime
Of course, the general importance of evolutionary dynamics for intelligence does not imply the
need to use explicit evolutionary algorithms in one's AGI system. Evolution can occur in an
intelligent system whether or not the low-level implementation layer of the system involves any
explicitly evolutionary processes. For instance it's clear that the human mind/brain involves
evolution in this sense on the emergent level - we create new ideas and procedures by varying
and combining ones that we've found useful in the past, and this occurs on a variety of levels
of abstraction in the mind. In CogPrime, however, we have chosen to implement evolutionary
dynamics explicitly, as well as encouraging them to occur implicitly.
CogPrime is intended to display evolutionary dynamics on the derived-hypergraph level,
and this is intended to be a consequence of both explicitly-evolutionary and not-explicitly-
evolutionary dynamics. Cognitive processes such as PLN inference may lead to emergent evo-
lutionary dynamics (as useful logical relationships are reasoned on and combined, leading to
new logical relationships in an evolutionary manner); even though PLN in itself is not explicitly
evolutionary in character, it becomes emergently evolutionary via its coupling with CogPrime's
attention allocation subsystem, which gives more cognitive attention to Atoms with more importance,
and hence creates an evolutionary dynamic with importance as the fitness criterion and
the whole constellation of MindAgents as the novelty-generation mechanism. However, MOSES
explicitly embodies evolutionary dynamics for the learning of new patterns and procedures that
are too complex for hillclimbing or other simple heuristics to handle. And this evolutionary
learning subsystem naturally also contributes to the creation of evolutionary patterns on the
emergent, derived-hypergraph level.
218 33 Probabilistic Evolutionary Procedure Learning
33.2 Estimation of Distribution Algorithms
There is a long history in AI of applying evolution-derived methods to practical problem-solving;
John Holland's genetic algorithm [Hol75], initially a theoretical model, has been adapted successfully
to a wide variety of applications (see e.g. the proceedings of the GECCO conferences).
Briefly, the methodology applied is as follows:
1. generate a random population of solutions to a problem
2. evaluate the solutions in the population using a predefined fitness function
3. select solutions from the population proportionately to their fitness
4. recombine/mutate them to generate a new population
5. go to step 2
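The five steps above can be sketched for fixed-length bit strings as follows. This is a minimal illustration rather than Holland's original formulation; the OneMax fitness used in the last line (maximize the number of 1-bits) is a standard toy problem, and the fitness is assumed non-negative for roulette selection to make sense.

```python
import random

def genetic_algorithm(fitness, length, pop_size=100, generations=50,
                      mutation_rate=0.01):
    """Minimal fixed-length bit-string GA following the five steps above."""
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]                                # step 1
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]                      # step 2
        total = sum(scores)                                         # (non-negative)

        def select():                                               # step 3
            r, acc = random.uniform(0, total), 0.0
            for ind, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return ind
            return pop[-1]

        nxt = []
        while len(nxt) < pop_size:                                  # step 4
            a, b = select(), select()
            cut = random.randrange(1, length)                       # crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mutation_rate)        # mutation
                     for bit in child]
            nxt.append(child)
        pop = nxt                                                   # step 5
    return max(pop, key=fitness)

best = genetic_algorithm(sum, length=20)  # OneMax: maximize number of 1s
```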
Holland's paradigm has been adapted from the case of fixed-length strings to the evolution of
variable-sized and shaped trees (typically Lisp S-expressions), which in principle can represent
arbitrary computer programs [Koz92, Koz94].
Recently, replacements-for/extensions-of the genetic algorithm have been developed (for
fixed-length strings) which may be described as estimation-of-distribution algorithms (see
[Pel02] for an overview). These methods, which outperform genetic algorithms and related
techniques across a range of problems, maintain centralized probabilistic models of the pop-
ulation learned with sophisticated datamining techniques. One of the most powerful of these
methods is the Bayesian optimization algorithm (BOA) [Pel05].
The basic steps of the BOA are:
1. generate a random population of solutions to a problem
2. evaluate the solutions in the population using a predefined fitness function
3. from the promising solutions in the population, learn a generative model
4. create new solutions using the model, and merge them into the existing population
5. go to step 2.
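The same loop can be illustrated with the simplest estimation-of-distribution model: a vector of independent bit marginals (a UMDA-style univariate EDA). The BOA replaces this univariate model with a learned Bayesian network, but the model-build/model-sample control flow is identical to the five steps above. Parameter values here are illustrative.

```python
import random

def umda(fitness, length, pop_size=200, select_frac=0.3, generations=60):
    """Univariate EDA: sample a model, refit it on the promising solutions,
    and repeat (the BOA substitutes a Bayesian network for the marginals)."""
    probs = [0.5] * length                 # initial model: uniform bits
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        pop = [[int(random.random() < p) for p in probs]    # sample the model
               for _ in range(pop_size)]
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) > best_fit:
            best, best_fit = pop[0], fitness(pop[0])
        elite = pop[:int(pop_size * select_frac)]           # promising subset
        probs = [sum(ind[i] for ind in elite) / len(elite)  # refit marginals
                 for i in range(length)]
        probs = [min(max(p, 0.05), 0.95) for p in probs]    # avoid fixation
    return best

best = umda(sum, length=30)  # OneMax again
```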
The neurological implausibility of this sort of algorithm is readily apparent - yet recall that in
CogPrime we are attempting to roughly emulate human cognition on the level of behavior not
structure or dynamics.
Fundamentally, the BOA and its ilk (the competent adaptive optimization algorithms) dif-
fer from classic selecto-recombinative search by attempting to dynamically learn a problem
decomposition, in terms of the variables that have been pre-specified. The BOA represents
this decomposition as a Bayesian network (directed acyclic graph with the variables as nodes,
and an edge from x to y indicating that y is probabilistically dependent on x). An extension,
the hierarchical Bayesian optimization algorithm (hBOA) uses a Bayesian network with lo-
cal structure to more accurately represent hierarchical dependency relationships. The BOA and
hBOA are scalable and robust to noise across the range of nearly decomposable functions. They
are also effective, empirically, on real-world problems with unknown decompositions, which may
or may not be effectively representable by the algorithms; robust, high-quality results have been
obtained for Ising spin glasses and MaxSAT, as well as a variety of real-world problems.
33.3 Competent Program Evolution via MOSES
In this section we summarize meta-optimizing semantic evolutionary search (MOSES), a system
for competent program evolution, described more thoroughly in [Loo06]. Based on the viewpoint
developed in the previous section, MOSES is designed around the central and unprecedented
capability of competent optimization algorithms such as the hBOA, to generate new solutions
that simultaneously combine sets of promising assignments from previous solutions according
to a dynamically learned problem decomposition. The novel aspects of MOSES described herein
are built around this core to exploit the unique properties of program learning problems. This
facilitates effective problem decomposition (and thus competent optimization).
33.3.1 Statics
The basic goal of MOSES is to exploit the regularities in program spaces outlined in the previ-
ous section, most critically behavioral decomposability and white box execution, to dynamically
construct representations that limit and transform the program space being searched into a
relevant subspace with a compact problem decomposition. These representations will evolve as
the search progresses.
33.3.1.1 An Example
Let's start with an easy example. What knobs (meaningful parameters to vary) exist for the
family of programs depicted in Figure ?? on the left? We can assume, in accordance with the
principle of white box execution, that all symbols have their standard mathematical interpre-
tations, and that x, y, and z are real-valued variables.
In this case, all three programs correspond to variations on the behavior represented graph-
ically on the right in the figure. Based on the principle of behavioral decomposability, good
knobs should express plausible evolutionary variation and recombination of features in behav-
ior space, regardless of the nature of the corresponding changes in program space. It's worth
repeating once more that this goal cannot be meaningfully addressed on a syntactic level - it
requires us to leverage background knowledge of what the symbols in our vocabulary (cos, +,
0.35, etc.) actually mean.
A good set of knobs will also be orthogonal. Since we are searching through the space of
combinations of knob settings (not a single change at a time, but a set of changes), any knob
whose effects are equivalent to another knob or combination of knobs is undesirable.† Correspondingly,
our set of knobs should span all of the given programs (i.e., be able to represent
them as various knob settings).
A small basis for these programs could be the 3-dimensional parameter space, x1 ∈ {x, z, 0}
(left argument of the root node), x2 ∈ {y, x} (argument of cos), and x3 ∈ [0.3, 0.4] (multiplier
for the cos-expression). However, this is a very limiting view, and overly tied to the particulars
† First, because this will increase the number of samples needed to effectively model the structure of knob-space;
and second, because this modeling will typically be quadratic with the number of knobs, at least for the BOA
or hBOA.
of how these three programs happen to be encoded. Considering the space behaviorally (right of
Figure ??), a number of additional knobs can be imagined which might be turned in meaningful
ways, such as:
1. numerical constants modifying the phase and frequency of the cosine expression,
2. considering some weighted average of x and y instead of one or the other,
3. multiplying the entire expression by a constant,
4. adjusting the relative weightings of the two arguments to +.
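To make the knob idea concrete, here is a hypothetical encoding of the three-knob basis described above, for a program family of the form x1 + x3*cos(x2). The knob names and value sets follow the text; the encoding itself (dict of knobs, callables as behaviors) is purely illustrative.

```python
import math
from itertools import product

# Hypothetical encoding of the 3-knob space described in the text: each
# combination of settings yields a concrete program x1 + x3*cos(x2).
KNOBS = {
    "x1": ["x", "z", "0"],        # left argument of the root +
    "x2": ["y", "x"],             # argument of cos
    "x3": [0.3, 0.4],             # multiplier of the cos term
}

def build_program(x1, x2, x3):
    """Return the behavior (a callable on x, y, z) for one knob setting."""
    def prog(x, y, z):
        env = {"x": x, "y": y, "z": z, "0": 0.0}
        return env[x1] + x3 * math.cos(env[x2])
    return prog

# Enumerate the whole 3 * 2 * 2 = 12-program subspace.
space = [build_program(*s) for s in product(*KNOBS.values())]
```

Searching over knob settings in this space is then an optimization over a small discrete vector, rather than over raw program syntax.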
33.3.1.2 Syntax and Semantics
This kind of representation-building calls for a correspondence between syntactic and semantic
variation. The properties of program spaces that make this difficult are over-representation
and chaotic execution, which lead to non-orthogonality, oversampling of distant behaviors, and
undersampling of nearby behaviors, all of which can directly impede effective program evolution.
Non-orthogonality is caused by over-representation. For example, based on the properties
of commutativity and associativity, a1 + a2 + ... + an may be expressed in exponentially many
different ways, if + is treated as a non-commutative and non-associative binary operator.
Similarly, operations such as addition of zero and multiplication by one have no effect, the
successive addition of two constants is equivalent to the addition of their sum, etc. These effects
are not quirks of real-valued expressions; similar redundancies appear in Boolean formulae (x
AND x ≡ x), list manipulation (cdr(cons(x, L)) ≡ L), and conditionals (if x then y else
z ≡ if NOT x then z else y).
Without the ability to exploit these identities, we are forced to work in a greatly expanded
space which represents equivalent expressions in many different ways, and will therefore be very
far from orthogonality. Completely eliminating redundancy is infeasible, and typically NP-hard
(in the domain of Boolean formulae it is reducible to the satisfiability problem, for instance),
but one can go quite far with a heuristic approach.
Oversampling of distant behaviors is caused directly by chaotic execution, as well as a
somewhat subtle effect of over-representation, which can lead to simpler programs being heavily
oversampled. Simplicity is defined relative to a given program space in terms of minimal length,
the number of symbols in the shortest program that produces the same behavior.
Undersampling of nearby behaviors is the flip side of the oversampling of distant
behaviors. As we have seen, syntactically diverse programs can have the same behavior; this
can be attributed to redundancy, as well as non-redundant programs that simply compute the
same result by different means. For example, 3*x can also be computed as x+x+x; the first
version uses fewer symbols, but neither contains any obvious "bloat" such as addition of zero or
multiplication by one. Note however that the nearby behavior of 3.1*x is syntactically close
to the former, and relatively far from the latter. The converse is the case for the behavior
of x+x+y. In a sense, these two expressions can be said to exemplify differing organizational
principles, or points of view, on the underlying function.
Differing organizational principles lead to different biases in sampling nearby behaviors. A
superior organizational principle (one leading to higher-fitness syntactically nearby programs
for a particular problem) might be considered a metaptation (adaptation at the second tier).
Since equivalent programs organized according to different principles will have identical fitness,
some methodology beyond selection for high fitness must be employed to search for good
organizational principles. Thus, the resolution of undersampling of nearby behaviors revolves
around the management of neutrality in search, a complex topic beyond the scope of this
chapter.
These three properties of program spaces greatly affect the performance of evolutionary
methods based solely on syntactic variation and recombination operators, such as local search
or genetic programming. In fact, when quantified in terms of various fitness-distance correlation
measures, they can be effective predictors of algorithm performance, although they are of course
not the whole story. A semantic search procedure will address these concerns in terms of the
underlying behavioral effects of and interactions between a language's basic operators; the
general scheme for doing so in MOSES is the topic of the next subsection.
33.3.1.3 Neighborhoods and Normal Forms
The procedure MOSES uses to construct a set of knobs for a given program (or family of
structurally related programs) is based on three conceptual steps: reduction to normal form,
neighborhood enumeration, and neighborhood reduction.
Reduction to normal form
- Redundancy is heuristically eliminated by reducing programs to a normal form. Typically, this
will be via the iterative application of a series of local rewrite rules (e.g., ∀x, x + 0 → x), until
the target program no longer changes. Note that the well-known conjunctive and disjunctive
normal forms for Boolean formulae are generally unsuitable for this purpose; they destroy the
hierarchical structure of formulae, and dramatically limit the range of behaviors (in this case
Boolean functions) that can be expressed compactly. Rather, hierarchical normal forms for
programs are required.
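A toy version of this rewrite-to-fixed-point process can be written for arithmetic expressions encoded as nested tuples. The rule set here (x+0 → x, x*1 → x, x*0 → 0) is illustrative only; MOSES's actual normal forms are richer and hierarchical, as the text notes.

```python
# Local rewrite rules as (match, apply) pairs over tuple expressions like
# ('+', 'x', 0). All rules assume binary operators.
RULES = [
    (lambda e: e[0] == '+' and e[2] == 0, lambda e: e[1]),       # x + 0 -> x
    (lambda e: e[0] == '+' and e[1] == 0, lambda e: e[2]),       # 0 + x -> x
    (lambda e: e[0] == '*' and e[2] == 1, lambda e: e[1]),       # x * 1 -> x
    (lambda e: e[0] == '*' and 0 in (e[1], e[2]), lambda e: 0),  # x * 0 -> 0
]

def rewrite_once(expr):
    """One bottom-up pass: reduce children, then try each rule at the root."""
    if not isinstance(expr, tuple):
        return expr
    expr = (expr[0],) + tuple(rewrite_once(a) for a in expr[1:])
    for matches, apply in RULES:
        if matches(expr):
            return apply(expr)
    return expr

def normalize(expr):
    """Apply local rewrites until the expression stops changing."""
    while True:
        reduced = rewrite_once(expr)
        if reduced == expr:
            return expr
        expr = reduced

# ('+', ('*', 'x', 1), ('*', 'y', 0)) normalizes all the way down to 'x'.
```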
Neighborhood enumeration
- A set of possible atomic perturbations is generated for all programs under consideration (the
overall perturbation set will be the union of these). The goal is to heuristically generate new
programs that correspond to behaviorally nearby variations on the source program, in such a
way that arbitrary sets of perturbations may be composed combinatorially to generate novel
valid programs.
Neighborhood reduction
- Redundant perturbations are heuristically culled to reach a more orthogonal set. A straightfor-
ward way to do this is to exploit the reduction to normal form outlined above; if multiple knobs
lead to the same normal-form program, only one of them is actually needed. Additionally,
note that the number of symbols in the normal form of a program can be used as a heuristic
approximation for its minimal length - if the reduction to normal form of the program resulting
from twiddling some knob significantly decreases its size, it can be assumed to be a source of
oversampling, and hence eliminated from consideration. A slightly smaller program is typically
a meaningful change to make, but a large reduction in complexity will rarely be useful (and
if so, can be accomplished through a combination of knobs that individually produce small
changes).
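The culling step just described might look like the following sketch, where `normalize` is any normal-form function (such as the fixed-point rewriter idea above) and both heuristics (normal-form deduplication, and dropping knobs whose normal form collapses drastically) follow the text. The names and the 0.5 shrink threshold are illustrative assumptions.

```python
def size(expr):
    """Count the symbols in a tuple-encoded expression tree."""
    if not isinstance(expr, tuple):
        return 1
    return 1 + sum(size(a) for a in expr[1:])

def cull_knobs(candidates, normalize, shrink_limit=0.5):
    """candidates: dict mapping knob name -> program produced by twiddling
    that knob. Returns the knobs kept, mapped to their normal forms."""
    kept, seen = {}, set()
    for knob, prog in candidates.items():
        nf = normalize(prog)
        key = repr(nf)
        if key in seen:
            continue                       # redundant with an earlier knob
        if size(nf) < shrink_limit * size(prog):
            continue                       # large collapse: oversampling source
        seen.add(key)
        kept[knob] = nf
    return kept
```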
At the end of this process, we will be left with a set of knobs defining a subspace of programs
centered around a particular region in program space and heuristically centered around the cor-
responding region in behavior space as well. This is part of the meta aspect of MOSES, which
seeks not to evaluate variations on existing programs itself, but to construct parameterized
program subspaces (representations) containing meaningful variations, guided by background
knowledge. These representations are used as search spaces within which an optimization algo-
rithm can be applied.
33.3.2 Dynamics
As described above, the representation-building component of MOSES constructs a parameterized
representation of a particular region of program space, centered around a single program
or family of closely related programs. This is consistent with the line of thought developed above,
that a representation constructed across an arbitrary region of program space (e.g., all programs
containing less than n symbols), or spanning an arbitrary collection of unrelated programs, is
unlikely to produce a meaningful parameterization (i.e., one leading to a compact problem
decomposition).
A sample of programs within a region derived from representation-building together with
the corresponding set of knobs will be referred to herein as a deme;† a set of demes (together
spanning an arbitrary area within program space in a patchwork fashion) will be referred to as
a metapopulation.‡ MOSES operates on a metapopulation, adaptively creating, removing, and
allocating optimization effort to various demes. Deme management is the second fundamental
meta aspect of MOSES, after (and above) representation-building; it essentially corresponds
to the problem of effectively allocating computational resources to competing regions, and hence
to competing programmatic organizational-representational schemes.
33.3.2.1 Algorithmic Sketch
The salient aspects of programs and program learning lead to requirements for competent
program evolution that can be addressed via a representation-building process such as the one
shown above, combined with effective deme management. The following sketch of MOSES,
elaborating Figure 33.1 repeated here from Chapter 8 of Part 1, presents a simple control flow
that dynamically integrates these processes into an overall program evolution procedure:
1. Construct an initial set of knobs based on some prior (e.g., based on an empty program)
and use it to generate an initial random sampling of programs. Add this deme to the
metapopulation.
2. Select a deme from the metapopulation and update its sample, as follows:
† A term borrowed from biology, referring to a somewhat isolated local population of a species.
‡ Another term borrowed from biology, referring to a group of somewhat separate populations (the demes) that
nonetheless interact.
[Diagram: component boxes including Representation-Building, Random Sampling, and Optimization]
Fig. 33.1: The top-level architectural components of MOSES, with directed edges indicating the
flow of information and program control.
a. Select some promising programs from the deme's existing sample to use for modeling,
according to the fitness function.
b. Considering the promising programs as collections of knob settings, generate new col-
lections of knob settings by applying some (competent) optimization algorithm.
c. Convert the new collections of knob settings into their corresponding programs, reduce
the programs to normal form, evaluate their fitnesses, and integrate them into the deme's
sample, replacing less promising programs.
3. For each new program that meets the criteria for creating a new deme, if any:
a. Construct a new set of knobs (via representation-building) to define a region centered
around the program (the deme's exemplar), and use it to generate a new random sampling
of programs, producing a new deme.
b. Integrate the new deme into the metapopulation, possibly displacing less promising
demes.
4. Repeat from step 2.
The criterion for creating a new deme is behavioral non-dominance (programs which are not
dominated by the exemplars of any existing demes are used as exemplars to create new demes),
which can be defined in a domain-specific fashion. As a default, the fitness function may be
used to induce dominance, in which case the set of exemplar programs for demes corresponds
to the set of top-fitness programs.
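The control flow of steps 1-4 can be condensed into a short sketch. Here `build_representation`, `optimize`, and the deme objects (which expose only an `exemplar` attribute) are placeholders standing in for MOSES's representation-building and hBOA components, and top fitness stands in for non-dominance; none of these names are from the MOSES codebase.

```python
def moses_loop(fitness, build_representation, optimize, seed_program,
               max_demes=8, iterations=20):
    """Top-level deme-management loop sketched from steps 1-4 above."""
    demes = [build_representation(seed_program)]          # step 1
    best = seed_program
    for _ in range(iterations):
        # step 2: allocate optimization effort to the most promising deme
        deme = max(demes, key=lambda d: fitness(d.exemplar))
        for prog in optimize(deme, fitness):              # steps 2a-2c
            if fitness(prog) > fitness(best):             # step 3: new deme
                best = prog
                demes.append(build_representation(prog))
        demes.sort(key=lambda d: fitness(d.exemplar), reverse=True)
        demes = demes[:max_demes]                         # displace weak demes
    return best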
33.3.3 Architecture
The preceding algorithmic sketch of MOSES leads to the top-level architecture depicted in
Figure 33.1. Of the four top-level components, only the fitness function is problem-specific. The
representation-building process is domain-specific, while the random sampling methodology
and optimization algorithm are domain-general. There is of course the possibility of improving
performance by incorporating domain and/or problem-specific bias into random sampling and
optimization as well.
33.3.4 Example: Artificial Ant Problem
Let's go through all of the steps that are needed to apply MOSES to a small problem, the
artificial ant on the Santa Fe trail [Koz92], and describe the search process. The artificial ant
domain is a two-dimensional grid landscape where each cell may or may not contain a piece of
food. The artificial ant has a location (a cell) and orientation (facing up, down, left, or right),
and navigates the landscape via a primitive sensor, which detects whether or not there is food
in the cell that the ant is facing, and primitive actuators move (take a single step forward),
right (rotate 90 degrees clockwise), and left (rotate 90 degrees counter-clockwise). The Santa
Fe trail problem is a particular 32 x 32 toroidal grid with food scattered on it (Figure 4), and a
fitness function counting the number of unique pieces of food the ant eats (by entering the cell
containing the food) within 600 steps (movement and 90 degree rotations are considered single
steps).
Programs are composed of the primitive actions taking no arguments, a conditional (if-food-ahead),†
which takes two arguments and evaluates one or the other based on whether or not
there is food ahead, and progn, which takes a variable number of arguments and sequentially
evaluates all of them from left to right. To evaluate the fitness of a program, it is executed continuously until
600 time steps have passed, or all of the food is eaten (whichever comes first). Thus for example,
the program if-food-ahead(m, r) moves forward as long as there is food ahead of it, at which
point it rotates clockwise until food is again spotted. It can successfully navigate the first two
turns of the Santa Fe trail, but cannot cross "gaps" in the trail, giving it a final fitness of
11.
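The evaluation just described can be sketched as follows, with programs encoded as nested tuples. The grid size and food layouts used here are toy stand-ins, not the actual Santa Fe trail, and the encoding is our own illustration.

```python
DIRS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # up, right, down, left

def ant_fitness(program, food, width=8, height=8, budget=600):
    """Run `program` repeatedly until the step budget is spent or all food
    is eaten; return the number of food pieces eaten (the fitness)."""
    food = set(food)
    x, y, heading, eaten, steps = 0, 0, 0, 0, 0  # start at origin, facing up

    def ahead():
        dx, dy = DIRS[heading]
        return (x + dx) % width, (y + dy) % height   # toroidal grid

    def run(node):
        nonlocal x, y, heading, eaten, steps
        if steps >= budget or not food:
            return
        if node == 'move':
            steps += 1
            x, y = ahead()
            if (x, y) in food:
                food.discard((x, y))
                eaten += 1
        elif node == 'left':
            steps += 1
            heading = (heading - 1) % 4
        elif node == 'right':
            steps += 1
            heading = (heading + 1) % 4
        elif node[0] == 'if-food-ahead':
            run(node[1] if ahead() in food else node[2])
        else:                                   # progn: sequential evaluation
            for child in node[1:]:
                run(child)

    while steps < budget and food:
        before = steps
        run(program)
        if steps == before:      # program takes no actions; avoid looping
            break
    return eaten

# The looping program from the text: move while food is ahead, else rotate.
looper = ('if-food-ahead', 'move', 'right')
```

On a straight line of food this program eats everything; introduce a gap and it rotates in place until the budget runs out, mirroring the behavior described in the text.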
The first step in applying MOSES is to decide what our reduction rules should look like.
This program space has several clear sources of redundancy leading to over-representation that
we can eliminate, leading to the following reduction rules:
1. Any sequence of rotations may be reduced to either a left rotation, a right rotation, or a
reversal, for example:
progn(left, left, left)
reduces to
right
2. Any if-food-ahead statement which is the child of an if-food-ahead statement may be eliminated,
as one of its branches is clearly irrelevant, for example:
if-food-ahead(m, if-food-ahead(l, r))
reduces to
if-food-ahead(m, r)
† This formulation is equivalent to using a general three-argument if-then-else statement with a predicate as
the first argument, as there is only a single predicate (food-ahead) for the ant problem.
3. Any progn statement which is the child of a progn statement may be eliminated and replaced
by its children, for example:
progn(progn(left, move), move)
reduces to
progn(left, move, move)
The representation language for the ant problem is simple enough that these are the only
three rules needed - in principle there could be many more. The first rule may be seen as a
consequence of general domain-knowledge pertaining to rotation. The second and third rules
are fully general simplification rules based on the semantics of if-then-else statements and
associative functions (such as progn), respectively.
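These three rules can be implemented directly on tuple-encoded programs. The sketch below applies them bottom-up; the encoding and helper names are ours, not MOSES's, and a reversal is represented as two right rotations since it is not a primitive.

```python
def reduce_ant(node):
    """Apply the three ant-domain reduction rules bottom-up."""
    if isinstance(node, str):
        return node
    if node[0] == 'if-food-ahead':
        then, els = reduce_ant(node[1]), reduce_ant(node[2])
        # Rule 2: a directly nested conditional sees the same sensor value,
        # so only one of its branches is reachable.
        if isinstance(then, tuple) and then[0] == 'if-food-ahead':
            then = then[1]
        if isinstance(els, tuple) and els[0] == 'if-food-ahead':
            els = els[2]
        return ('if-food-ahead', then, els)
    # progn: Rule 3, splice nested progns into the parent.
    children = []
    for child in map(reduce_ant, node[1:]):
        if isinstance(child, tuple) and child[0] == 'progn':
            children.extend(child[1:])
        else:
            children.append(child)
    # Rule 1: collapse each run of consecutive rotations to its net rotation.
    out, net = [], 0

    def flush():
        nonlocal net
        net %= 4
        out.extend({0: [], 1: ['right'],
                    2: ['right', 'right'], 3: ['left']}[net])
        net = 0

    for child in children:
        if child in ('left', 'right'):
            net += 1 if child == 'right' else -1
        else:
            flush()
            out.append(child)
    flush()
    if len(out) == 1:
        return out[0]
    return ('progn',) + tuple(out)
```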
These rules allow us to naturally parameterize a knob space corresponding to a given program
(note that the arguments to the progn and if-food-ahead functions will be recursively reduced
and parameterized according to the same procedure). Rotations will correspond to knobs with
four possibilities (left, right, reversal, no rotation). Movement commands will correspond to
knobs with two possibilities (move, no movement). There is also the possibility of introducing a
new command in between, before, or after existing commands. Some convention (a "canonical
form") for our space is needed to determine how the knobs for new commands will be introduced.
A representation consists of a rotation knob, followed by a conditional knob, followed by a
movement knob, followed by a rotation knob, etc.†
The structure of the space (how large and what shape) and default knob values will be
determined by the "exemplar" program used to construct it. The default values are used to
bias the initial sampling to focus around the prototype associated with the exemplar: all of the n
direct neighbors of the prototype are first added to the sample, followed by a random selection
of n programs at a distance of two from the prototype, n programs at a distance of three,
etc., until the entire sample is filled. Note that the hBOA can of course effectively recombine
this sample to generate novel programs at any distance from the initial prototype. The empty
program progn (which can be used as the initial exemplar for MOSES), for example, leads to
the following prototype:
progn(
   rotate? [default no rotation],
   if-food-ahead(
      progn(
         rotate? [default no rotation],
         move? [default no movement]),
      progn(
         rotate? [default no rotation],
         move? [default no movement])),
   move? [default no movement])
There are six parameters here, three of which are quaternary (rotate), and three of which are
binary (move). So the program
† That there is some fixed ordering on the knobs is important, so that two rotation knobs are not placed next to
each other (as this would introduce redundancy). In this case, the precise ordering chosen (rotation, conditional,
movement) does not appear to be critical.
progn(left, if-food-ahead(move, left))
would be encoded in the space as
[left, no rotation, move, left, no movement, no movement]
with knobs ordered according to a pre-order left-to-right traversal of the program's parse tree
(this is merely for exposition; the ordering of the parameters has no effect on MOSES). For
a prototype program already containing an if-food-ahead statement, nested conditionals would
be considered.
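Decoding such a knob-setting vector back into a program can be sketched as follows, using the six-knob prototype above with knobs in pre-order [rotate, rotate, move, rotate, move, move]. The helper names and the use of 'none' for the default settings are our own conventions.

```python
# A reversal is not a primitive, so it decodes to two right rotations.
ROTATION = {'left': 'left', 'right': 'right',
            'reversal': ('progn', 'right', 'right')}

def decode(settings):
    """settings: [r1, r2, m1, r3, m2, m3] per the prototype's pre-order;
    'none' means the knob's default (no rotation / no movement)."""
    r1, r2, m1, r3, m2, m3 = settings

    def acts(*knobs):
        return [ROTATION.get(k, k) for k in knobs if k != 'none']

    def progn_of(children):          # unwrap singleton sequences
        if len(children) == 1:
            return children[0]
        return ('progn',) + tuple(children)

    then = progn_of(acts(r2, m1))
    els = progn_of(acts(r3, m2))
    body = acts(r1) + [('if-food-ahead', then, els)] + acts(m3)
    return ('progn',) + tuple(body)

# [left, none, move, left, none, none] decodes to
# progn(left, if-food-ahead(move, left)), the example from the text.
```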
[Histogram: number of global optima found vs. number of program evaluations]
Technique Computational Effort
Genetic Programming 450,000 evaluations
Evolutionary Programming 136,000 evaluations
MOSES 23,000 evaluations
Fig. 33.2: On the top, histogram of the number of global optima found after a given number of
program evaluations for 100 runs of MOSES on the artificial ant problem (each run is counted
once for the first global optimum reached). On the bottom, computational effort required to
find an optimal solution for various techniques with probability p=.99 (for MOSES p=1, since
an optimal solution was found in all runs).
A space with six parameters in it is small enough that MOSES can reliably find the optimum
(the program progn(right, if-food-ahead(progn(left), move))), with a very small population. After
no further improvements have been made in the search for a specified number of generations
(calculated based on the size of the space, using a model derived from [Pel05] that is general
to the hBOA and not at all tuned for the artificial ant problem), a new representation is
constructed centered around this program.† Additional knobs are introduced "in between" all
† MOSES reduces the exemplar program to normal form before constructing the representation; in this
particular case however, no transformations are needed. Similarly, in general neighborhood reduction would be
existing ones (e.g., an optional move in between the first rotation and the first conditional),
and possible nested conditionals are considered (a nested conditional occurring in a sequence
after some other action has been taken is not redundant). The resulting space has 39 knobs,
still quite tractable for hBOA, which typically finds a global optimum within a few genera-
tions. If the optimum were not to be found, MOSES would construct a new (possibly larger
or smaller) representation, centered around the best program that was found, and the process
would repeat.
The artificial ant problem is well-studied, with published benchmark results available for
genetic programming as well as evolutionary programming based solely on mutation (i.e., a
form of population-based stochastic hill climbing). Furthermore, an extensive analysis of the
search space has been carried out by Langdon and Poli [LP02], with the authors concluding:
1. The problem is "deceptive at all levels", meaning that the partial solutions that must be
recombined to solve the problem to optimality have lower average fitness than the partial
solutions that lead to inferior local optima.
2. The search space contains many symmetries (e.g., between left and right rotations);
3. There is an unusually high density of global optima in the space (relative to other common
test problems);
4. Even though current evolutionary methods can solve the problem, they are not significantly
more effective (in terms of the number of program evaluations required) than random sampling;
5. "If real program spaces have the above characteristics (we expect them to do so but be still
worse) then it is important to be able to demonstrate scalable techniques on such problem
spaces".
33.3.4.1 Test Results
Koza [Koz92] reports on a set of 148 runs of genetic programming with a population size of
500, which had a 16% success rate after 51 generations, when the runs were terminated (a total
of 25,500 program evaluations per run). The minimal "computational effort" needed to achieve
success with 99% probability, attained by processing through generation 14, was 450,000
(based on parallel independent runs). Chellapilla [Che97] reports 47 out of 50 successful runs
with a minimal computational effort (again, for success with 99% probability) of 136,000 for
his stochastic hill climbing method.
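The "computational effort" statistic used in these comparisons is Koza's standard measure. A
minimal sketch of its computation, assuming one has estimated the cumulative probability of
success P(M, i) by generation i for runs with population size M:

```python
import math

# Koza-style computational effort: R(z) independent runs, each stopped
# at generation i, give overall success probability z when a single run
# succeeds by generation i with probability p. The effort is the minimum
# over i of the total evaluation count M * (i + 1) * R(z).

def runs_required(p_success, z=0.99):
    return math.ceil(math.log(1.0 - z) / math.log(1.0 - p_success))

def computational_effort(M, cumulative_p, z=0.99):
    # cumulative_p[i]: probability a single run has succeeded by generation i
    return min(M * (i + 1) * runs_required(p, z)
               for i, p in enumerate(cumulative_p) if 0.0 < p < 1.0)
```

For example, a run that succeeds with probability 0.5 by some generation requires 7 independent
runs for 99% confidence, since ln(0.01)/ln(0.5) ≈ 6.64.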
In our experiment with the artificial ant problem, one hundred runs of MOSES were executed.
Beyond the domain knowledge embodied in the reduction and knob construction procedure, the
only parameter that needed to be set was the population scaling factor, which was set to 30
(MOSES automatically adjusts to generate a larger population as the size of the representa-
tion grows, with the base case determined by this factor). Based on these "factory" settings,
MOSES found optimal solutions on every run out of 100 trials, within a maximum of 23,000
program evaluations (the computational effort figure corresponding to 100% success). The av-
erage number of program evaluations required was 6952, with 95% confidence intervals of ±856
evaluations.
used to eliminate any extraneous knobs (based on domain-specific heuristics). For the ant
domain, however, no such reductions are necessary.
228 33 Probabilistic Evolutionary Procedure Learning
Why does MOSES outperform other techniques? One factor to consider is that the language
in which programs are evolved is slightly more expressive than that used for the other
techniques; specifically, a progn is allowed to have no children (if all of its possible children are
"turned off"), leading to the possibility of if-food-ahead statements which do nothing if food
is present (or not present). Indeed, many of the smallest solutions found by MOSES exploit
this feature. This can be tested by inserting a "do nothing" operation into the terminal set for
genetic programming (for example). Indeed, this reduces the computational effort to 272,000;
an interesting effect, but still over an order of magnitude short of the results obtained with
MOSES (the success rate after 50 generations is still only 20%).
Another possibility is that the reductions in the search space via simplification of programs
alone are responsible. However, the results of past attempts at introducing program simplification
into genetic programming systems [27, 28] have been mixed; although the system may be sped
up (because programs are smaller), no dramatic improvements in results have been noted.
To be fair, these results have been primarily focused on the symbolic regression domain; we are
not aware of any results for the artificial ant problem.
The final contributor to consider is the sampling mechanism (knowledge-driven knob-creation
followed by probabilistic model-building). We can test to what extent model-building con-
tributes to the bottom line by simply disabling it and assuming probabilistic independence
between all knobs. The result here is of interest because model-building can be quite expensive
(O(n²N) per generation, where n is the problem size and N is the population size††). In 50
independent runs of MOSES without model-building, a global optimum was still discovered in
all runs. However, the variance in the number of evaluations required was much higher (in two
cases over 100,000 evaluations were needed). The new average was 26,355 evaluations to reach
an optimum (about 3.5 times more than required with model-building). The contribution of
model-building to the performance of MOSES is expected to be even greater for more difficult
problems.
Applying MOSES without model-building (i.e., a model assuming no interactions between
variables) is a way to test the combination of representation-building with an approach
resembling the probabilistic incremental program learning (PIPE) algorithm [SS03], which learns
programs based on a probabilistic model without any interactions. PIPE has now been shown
to provide results competitive with genetic programming on a number of problems (regression,
agent control, etc.).
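The model-free variant just described can be sketched as follows - a univariate distribution over
Boolean knobs, updated from the fitter part of the population, with no linkage learning. This is
an illustrative simplification in the spirit of PIPE, not the actual PIPE or MOSES code; all
parameter choices are arbitrary.

```python
import random

# Univariate probabilistic model over Boolean knobs: each knob has an
# independent probability of being "on"; no interactions are modeled.

def sample(probs):
    return [1 if random.random() < p else 0 for p in probs]

def update(probs, selected, rate=0.5):
    # Pull each knob's probability toward its frequency among the
    # selected (fitter) individuals; knobs are treated independently.
    n = len(selected)
    return [(1.0 - rate) * p + rate * sum(ind[i] for ind in selected) / n
            for i, p in enumerate(probs)]

def univariate_search(fitness, n_knobs, pop_size=40, gens=30):
    probs = [0.5] * n_knobs
    best, best_fit = None, float("-inf")
    for _ in range(gens):
        population = sorted((sample(probs) for _ in range(pop_size)),
                            key=fitness, reverse=True)
        if fitness(population[0]) > best_fit:
            best, best_fit = population[0], fitness(population[0])
        probs = update(probs, population[:pop_size // 2])
    return best
```

Replacing `update` with a model that learns dependencies between knobs is what distinguishes
full model-building (as in hBOA) from this independence assumption.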
It is additionally possible to look inside the models that the hBOA constructs (based on the
empirical statistics of successful programs) to see what sorts of linkages between knobs are being
learned.‡‡ For the 6-knob model given above, for instance, an analysis of the linkages learned
shows that the three most common pairwise dependencies uncovered, occurring in over 90% of
the models across 100 runs, are between the rotation knobs. No other individual dependencies
occurred in more than 32% of the models. This preliminary finding is quite significant given
Langdon and Poli's findings on symmetry, and their observation that "[t]hese symmetries lead
to essentially the same solutions appearing to be the opposite of each other. E.g. either a pair
of Right or pair of Left terminals at a particular location may be important."
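The sort of linkage analysis described can be sketched as follows: estimate the mutual
information between pairs of Boolean knob settings across a sample of fit programs, and report
the strongest pairs. The data layout here is invented for illustration; hBOA's actual models are
Bayesian networks with local structure, which carry more information than pairwise scores.

```python
import math
from itertools import combinations

def mutual_information(xs, ys):
    # Plug-in estimate of I(X;Y) in bits for two Boolean samples.
    n = len(xs)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pxy = sum(1 for x, y in zip(xs, ys) if (x, y) == (a, b)) / n
            px = sum(1 for x in xs if x == a) / n
            py = sum(1 for y in ys if y == b) / n
            if pxy > 0.0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi

def strongest_linkages(population, top=3):
    # population: knob-setting vectors taken from fit programs
    scores = {}
    for i, j in combinations(range(len(population[0])), 2):
        scores[(i, j)] = mutual_information([ind[i] for ind in population],
                                            [ind[j] for ind in population])
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

On a sample where two knobs always agree and a third varies freely, the perfectly correlated
pair is reported first.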
In this relatively simple case, all of the components of MOSES appear to mesh together to
provide superior performance - which is promising, though it of course does not prove that
these same advantages will apply across the range of problems relevant to human-level AGI.
†† The fact that reduction to normal form tends to reduce the problem size is another synergy between it and the
application of probabilistic model-building.
‡‡ There is in fact even more information available in the hBOA models concerning hierarchy and direction of
dependence, but this is difficult to analyze.
33.3.5 Discussion
The overall MOSES design is unique. However, it is instructive at this point to compare its
two primary unique facets (representation-building and deme management) to related work in
evolutionary computation.
Rosca's adaptive representation architecture [Ros99] is an approach to program evolution
which also alternates between separate representation-building and optimization stages. It is
based on Koza's genetic programming, and modifies the representation based on a syntactic
analysis driven by the fitness function, as well as a modularity bias. The representation-building
that takes place consists of introducing new compound operators, and hence modifying the
implicit distance function in tree-space. This modification is uniform, in the sense that the new
operators can be placed in any context, without regard for semantics.
In contrast to Rosca's work and other approaches to representation-building such as Koza's
automatically defined functions [KA95], MOSES explicitly addresses the underlying (semantic)
structure of program space independently of the search for any kind of modularity or problem
decomposition. This preliminary stage critically changes neighborhood structures (syntactic
similarity) and other aggregate properties of programs.
Regarding deme management, the embedding of an evolutionary algorithm within a superordinate
procedure maintaining a metapopulation is most commonly associated with "island
model" architectures [SWM90]. One of the motivations articulated for using island models has
been to allow distinct islands to (usually implicitly) explore different regions of the search space,
as MOSES does explicitly. MOSES can thus be seen as a very particular kind of island model
architecture, where programs never migrate between islands (demes), and islands are created
and destroyed dynamically as the search progresses.
In MOSES, optimization does not operate directly on program space, but rather on a sub-
space defined by the representation-building process. This subspace may be considered as being
defined by a sort of template assigning values to some of the underlying dimensions (e.g., it
restricts the size and shape of any resulting trees). The messy genetic algorithm [GKD89],
an early competent optimization algorithm, uses a similar mechanism - a common
"competitive template" is used to evaluate candidate solutions to the optimization problem which are
themselves underspecified. Search consequently centers on the template(s), much as search in
MOSES centers on the programs used to create new demes (and thereby new representations).
The issue of deme management can thus be seen as analogous to the issue of template selection
in the messy genetic algorithm.
33.3.6 Conclusion
Competent evolutionary optimization algorithms are a pivotal development, allowing encoded
problems with compact decompositions to be tractably solved according to normative princi-
ples. We are still faced with the problem of representation-building - casting a problem in terms
of knobs that can be twiddled to solve it. Hopefully, the chosen encoding will allow for a com-
pact problem decomposition. Program learning problems in particular rarely possess compact
decompositions, due to particular features generally present in program spaces (and in the map-
ping between programs and behaviors). This often leads to intractable problem formulations,
even if the mapping between behaviors and fitness has an intrinsic separable or nearly decom-
posable structure. As a consequence, practitioners must often resort to manually carrying out
the analogue of representation-building, on a problem-specific basis. Working under the thesis
that the properties of programs and program spaces can be leveraged as inductive bias to remove
the burden of manual representation-building, leading to competent program evolution, we have
developed the MOSES system, and explored its properties.
While the discussion above has highlighted many of the features that make MOSES uniquely
powerful, in a sense it has told only half the story. Part of what makes MOSES valuable for
CogPrime is that it's good on its own; and the other part is that it cooperates well with the
other cognitive processes within CogPrime. We have discussed aspects of this already in Chapter
8 of Part 1, especially in regard to the MOSES/PLN relationship. In the following section we
proceed further to explore the interaction of MOSES with other aspects of the CogPrime system
— a topic that will arise repeatedly in later chapters as well.
33.4 Integrating Feature Selection Into the Learning Process
In the typical workflow of applied machine learning, one begins with a large number of features,
each applicable to some or all of the entities one wishes to learn about; then one applies some
feature selection heuristics to whittle down the large set of features into a smaller one; then
one applies a learning algorithm to the reduced set of features. The reason for this approach is
that the more powerful among the existing machine learning algorithms tend to get confused
when supplied with too many features. The problem with this approach is that sometimes one
winds up throwing out potentially very useful information during the feature selection phase.
This same sort of problem exists with MOSES in its simplest form, as described above.
The human mind, as best we understand it, does things a bit differently than this standard
"feature selection followed by learning" process. It does seem to perform operations analogous
to feature selection, and operations analogous to the application of a machine learning algorithm
to a reduced feature set - but it also involves feedback from these "machine learning like"
operations to the "feature selection like" operations, so that the intermediate results of learning
can cause the introduction into the learning process of features beyond those initially
selected, thus allowing better learning results.
Compositional spatiotemporal deep learning (CSDLN) architectures like HTM [HB06] or
DeSTIN [ARC09a], as discussed in Chapter 27, incorporate this same sort of feedback. The lower levels
of such an architecture, in effect, carry out "feature selection" for the upper levels - but then
feedback from the upper to the lower levels also occurs, thus in effect modulating the "feature
selection like" activity at the lower levels based on the more abstract learning activity on the
upper levels. However, such CSDLN architectures are specifically biased toward recognition of
certain sorts of patterns - an aspect that may be considered a bug or a feature of this class of
learning architecture, depending on the context. For visual pattern recognition, it appears to be
a feature, since the hierarchical structure of such algorithms roughly mimics the architecture of
visual cortex. For automated learning of computer programs carrying out symbolic tasks, on the
other hand, CSDLN architectures are awkward at best and probably generally inappropriate.
For cases like language learning or abstract conceptual inference, the jury is out.
In this section we explore the question of how to introduce an appropriate feedback between
feature selection and learning in the case of machine learning algorithms with general scope
and without explicit hierarchical structure - such as MOSES. We introduce a specific technique
enabling this, which we call LIFES, short for Learning-Integrated Feature Selection. We argue
that LIFES is particularly applicable to learning problems that possess the conjunction of two
properties that we call data focusability and feature focusability. We illustrate LIFES in a
MOSES context, by describing a specific incarnation of the LIFES technique that does feature
selection repeatedly during the MOSES learning process, rather than just doing it initially prior
to MOSES learning.
33.4.1 Machine Learning, Feature Selection and AGI
The relation between feature selection and machine learning appears to be an excellent example of
the way that, even when the same basic technique is useful in both narrow AI and AGI, the
method of utilization is often quite different. In most applied machine learning tasks, the need
to customize feature selection heuristics for each application domain (and in some cases, each
particular problem) is not a major difficulty. This need does limit the practical utilization of
machine learning algorithms, because it means that many ML applications require an expert
user who understands something about machine learning, both to deal with feature selection
issues and to interpret the results. But it doesn't stand in the way of ML's fundamental usability.
On the other hand, in an AGI context the situation is different, and the need for human-crafted,
context-appropriate feature selection does stand in the way of the straightforward insertion of
most ML algorithms into an integrative AGI system.
For instance, in the OpenCog integrative AGI architecture that we have co-architected
[Gea13], the MOSES automated program learning algorithm plays a key role. It is OpenCog's
main algorithm for acquiring procedural knowledge, and is used for generating some sorts of
declarative knowledge as well. However, when MOSES tasks are launched automatically via
the OpenCog scheduler based on an OpenCog agent's goals, there is no opportunity for the
clever choice of feature selection heuristics based on the particular data involved. And crude
feature selection heuristics based on elementary statistics are often insufficiently effective, as
they rule out too many valuable features (and sometimes rule out the most critical features).
In this context, having a variant of MOSES that can sift through the scope of possible features
in the course of its learning is very important.
An example from the virtual dog domain pursued in [Gea08] would be as follows. Each
procedure learned by the virtual dog combines a number of different actions, such as "step
forward", "bark", "turn around", "look right", "lift left front leg," etc. In the virtual dog ex-
periments done previously, the number of different actions permitted to the dog was less than
100, so that feature selection was not a major issue. However, this was an artifact of the rela-
tively simplistic nature of the experiments conducted. For a real organism, or for a robot that
learns its own behavioral procedures (say, via a deep learning algorithm) rather than using a
pre-configured set of "animated" behaviors, the number of possible behavioral procedures to
potentially be combined using a MOSES-learned program may be very large. In this case, one
must either use some crude feature selection heuristic, have a human select the features, or use
something like the LIFES approach described here. LIFES addresses a key problem in moving
from the relatively simple virtual dog work done before, to related work with virtual agents
displaying greater general intelligence.
As another example, suppose an OpenCog-controlled agent is using MOSES to learn pro-
cedures for navigating in a dynamic environment. The features that candidate navigation pro-
cedures will want to pay attention to, may be different in a well-lit environment than in a
dark environment. However, if the MOSES learning process is being launched internally via
OpenCog's goal system, there is no opportunity for a human to adjust the feature selection
heuristics based on the amount of light in the environment. Instead, MOSES has got to figure
out what features to pay attention to all by itself. LIFES is designed to allow MOSES (or other
comparable learning algorithms) to do this.
So far we have tested LIFES in genomics and other narrow-AI application areas, as a way of
initially exploring and validating the technique. As our OpenCog work proceeds, we will explore
more AGI-oriented applications of MOSES-LIFES. This will be relatively straightforward on a
software level as MOSES is fully integrated with OpenCog.
33.4.2 Data- and Feature-Focusable Learning Problems
Learning-integrated feature selection as described here is applicable across multiple domain
areas and types of learning problem - but it is not universally applicable. Rather, it
is most appropriate for learning problems possessing two properties we call data focusability
and feature focusability. While these properties can be defined with mathematical rigor, here
we will not be proving any theorems about them, so we will content ourselves with semi-formal
definitions, sufficient to guide practical work.
We consider a fitness function φ, defined on a space of programs f whose inputs are
features defined on elements of a reference dataset S, and whose outputs lie in the inter-
val [0,1]. The features are construed as functions mapping elements of S into [0,1]. Where
F(x) = (F_1(x), ..., F_n(x)) is the set of features evaluated on x ∈ S, we use f(x) as a short-
hand for f(F(x)).
We are specifically interested in φ which are "data focusable", in the sense that, for a large
number of highly fit programs f, there is some subset S_f on which f is highly concentrated
(note that S_f will be different for different f). By "concentrated" it is meant that
E_{x ∈ S_f} f(x) / E_{x ∈ S} f(x)
is large. A simple case is where f is Boolean and f(x) = 1 ⟺ x ∈ S_f.
One important case is where φ is "property-based", in the sense that each element x ∈ S has
some Boolean or numeric property p(x), and the fitness function φ(f) rewards f for predicting
p(x) given x for x ∈ S_f, where S_f is some non-trivial subset of S. For example, each element
of S might belong to some category, and the fitness function might represent the problem of
placing elements of S into the proper category - but with the twist that f gets rewarded if
it accurately places some subset S_f of elements in S into the proper category, even if it has
nothing to say about the elements in S but not in S_f.
For instance, consider the case where S is a set of images. Suppose the function p(x) indicates
whether the image x contains a picture of a cat or not. Then, a suitable fitness function φ would
be one measuring whether there is some non-trivially large set of images S_f so that if x ∈ S_f,
then f can accurately predict whether x contains a picture of a cat or not. A key point is that
the fitness function doesn't care whether f can accurately predict whether x contains a picture
of a cat or not, for x outside S_f.
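A small sketch of the concentration measure behind data focusability: the mean score of f over
its focus set divided by its mean score over the whole dataset. The helper names are ours, for
illustration only.

```python
# Concentration of a scoring function f on a focus subset S_f of the
# dataset S: mean score on S_f divided by mean score on all of S.
# Here f is any function mapping elements of S into [0, 1].

def concentration(f, S, S_f):
    mean_focus = sum(f(x) for x in S_f) / len(S_f)
    mean_all = sum(f(x) for x in S) / len(S)
    return mean_focus / mean_all

# Boolean special case: if f(x) = 1 exactly on S_f, the ratio is |S|/|S_f|.
S = list(range(100))
S_f = set(range(10))
f = lambda x: 1.0 if x in S_f else 0.0
```

With these toy values, f scores 1 on the 10 focus elements and 0 elsewhere, so the concentration
is 100/10 = 10: the larger the ratio, the more sharply f is focused on S_f.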
Or, consider the case where S is a discrete series of time points, and p(x) indicates the
value of some quantity (say, a person's EEG) at a certain point in time. Then a suitable fitness
function φ might measure whether there is some non-trivially large set of time-points S_f so
that if x ∈ S_f, then f can accurately predict whether x will be above a certain level L or not.
Finally, in addition to the property of data-focusability introduced above, we will concern
ourselves with the complementary property of "feature-focusability." This means that, while
the elements of S are each characterized by a potentially large set of features, there are many
highly fit programs f that utilize only a small subset of this large set of features. The case of
most interest here is where there are various highly fit programs f, each utilizing a different
small subset of the overall large set of features. In this case one has (loosely speaking) a pattern
recognition problem, with approximate solutions comprising various patterns that combine
various different features in various different ways. For example, this would be the case if
there were many different programs for recognizing pictures containing cats, each one utilizing
different features of cats and hence applying to different subsets of the overall database of
images.
There may, of course, be many important learning problems that are neither data nor feature
focusable. However, the LIFES technique presented here for integrating feature selection into
learning is specifically applicable to objective functions that are both data and feature focusable.
In this sense, the conjunction of data and feature focusability appears to be a kind of "tractabil-
ity" that allows one to bypass the troublesome separation of feature selection and learning, and
straightforwardly combine the two into a single integrated process. Being property-based in the
sense described above does not seem to be necessary for the application of LIFES, though most
practical problems do seem to be property-based.
33.4.3 Integrating Feature Selection Into Learning
The essential idea proposed here is a simple one. Suppose one has a learning problem involving
a fitness function that is both data and feature focusable. And suppose that, in the course
of learning according to some learning algorithm, one has a candidate program f, which is
reasonably fit but merits improvement. Suppose that f uses a subset F_f of the total set F of
possible input features. Then, one may do a special feature selection step, customized just for
f. Namely, one may look at the total set F of possible features, and ask which features or
small feature-sets display desirable properties on the set S_f. This will lead to a new set
of features potentially worthy of exploration; let us call it F'_f. We can then attempt to improve
f by creating variants of f introducing some of the features in F'_f - either replacing features
in F_f or augmenting them. The process of creating and refining these variants will then lead
to new candidate programs g, potentially concentrated on sets S_g different from S_f, in which
case the process may be repeated. This is what we call LIFES - Learning-Integrated Feature
Selection.
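Schematically, the loop just described might look as follows; every helper name here (`learn`,
`focus_set`, `score_feature`) is a hypothetical placeholder for algorithm-specific machinery,
not part of any actual MOSES/LIFES code.

```python
# Sketch of the LIFES loop: learn a candidate from the current feature
# set, find the data subset S_f it focuses on, score the unused features
# on S_f, and re-learn with the most promising ones added.

def lifes(learn, focus_set, score_feature, all_features, initial,
          rounds=3, k=5):
    features = list(initial)
    candidate = learn(features)
    for _ in range(rounds):
        S_f = focus_set(candidate)      # subset on which the candidate is concentrated
        unused = [F for F in all_features if F not in features]
        ranked = sorted(unused, key=lambda F: score_feature(F, S_f),
                        reverse=True)
        features += ranked[:k]          # F_f augmented with F'_f
        candidate = learn(features)     # refine with the enlarged feature set
    return candidate
```

Note that feature scoring happens on S_f, not on all of S, so a feature useless globally but
informative on the candidate's focus set can still be brought into play.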
As described above the LIFES process is quite general, and applies to a variety of learning
algorithms - basically any learning algorithm that includes the capability to refine a candidate
solution via the introduction of novel features. The nature of the "desirable properties" used
to evaluate candidate features or feature-sets on S_f needs to be specified, but a variety of
standard techniques may be used here (along with more advanced ideas) - for instance, in the
case where the fitness function is defined in terms of some property mapping p as described
above, then given a feature F_i, one can calculate the mutual information of F_i with p over S_f.
Other measures than mutual information may be used here as well.
The LIFES process doesn't necessarily obviate the need for up-front feature selection. What
it does, is prevent up-front feature selection from limiting the ultimate feature usage of the
learning algorithm. It allows the initially selected features to be used as a rough initial guide
to learning - and for the candidates learned using these initial features, to then be refined and
improved using additional features chosen opportunistically along the learning path. In some
cases, the best programs ultimately learned via this approach might not end up involving any
of the initially selected features.
33.4.4 Integrating Feature Selection into MOSES Learning
The application of the general LIFES process in the MOSES context is relatively straightforward.
Quite simply, given a reasonably fit program f produced within a deme, one then isolates
the set S_f on which f is concentrated, and identifies a set F'_f of features within F that displays
desirable properties relative to S_f. One then creates a new deme, with exemplar f, and with
a set of potential input features consisting of F_f ∪ F'_f.
What does it mean to create a deme with a certain set of "potential input features"
F_f ∪ F'_f? Abstractly, it means that the deme's feature set is F_f ∪ F'_f. Concretely, it means that the knobs in the
deme's exemplar f must be supplied with settings corresponding to the elements of F_f ∪ F'_f.
The right way to do this will depend on the semantics of the features.
For instance, it may be that the overall feature space F is naturally divided into groups of
features. In that case, each new feature F_i in F'_f would be added, as a potential knob setting,
to any knob in f corresponding to a feature in the same group as F_i.
On the other hand, if there is no knob in f corresponding to features in F_i's knob group,
then one has a different situation, and it is necessary to "mutate" f by adding a new node
with a new kind of knob corresponding to F_i, or replacing an existing node with a new one
corresponding to F_i.
33.4.5 Application to Genomic Data Classification
To illustrate the effectiveness of LIFES in a MOSES context, we now briefly describe an exam-
ple application, in the genomics domain. The application of MOSES to gene expression data is
described in more detail in [Loo07], and is only very briefly summarized here. To obtain the
results summarized here, we have used MOSES, with and without LIFES, to analyze two differ-
ent genomics datasets: an Alzheimer's SNP (single nucleotide polymorphism) dataset [Mea]
previously analyzed using ensemble genetic programming [CCPH09]. The dataset is of the form
"Case vs. Control", where the Case category consists of data from individuals with Alzheimer's
and Control consists of matched controls. MOSES was used to learn Boolean program trees
embodying predictive models that take in a subset of the genes in an individual, and output a
Boolean combination of their discretized expression values, that is interpreted as a prediction of
whether the individual is in the Case or Control category. Prior to feeding them into MOSES,
expression values were first Q-normalized, and then discretized via comparison to the median
expression measured across all genes on a per-individual basis (1 for greater than the median,
0 for less than). Fitness was taken as precision, with a penalty factor restricting attention to
program trees with recall above a specified minimum level.
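The preprocessing and fitness just described can be sketched as follows. The subtractive penalty
form is our assumption, since the text specifies only "precision with a minimum-recall
constraint"; the actual penalty used may differ.

```python
import statistics

def discretize(expression_values):
    # 1 if above the individual's median expression value, else 0.
    med = statistics.median(expression_values)
    return [1 if v > med else 0 for v in expression_values]

def fitness(predictions, labels, min_recall=0.5, penalty=1.0):
    # Precision, penalized when recall falls below the minimum level.
    # predictions and labels are parallel 0/1 sequences (1 = Case).
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision - penalty * max(0.0, min_recall - recall)
```

A model whose recall meets the floor is scored purely on precision; below the floor, its score
drops in proportion to the shortfall.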
This study was carried out, not merely for testing MOSES and LIFES, but as part of a
practical investigation into which genes and gene combinations may be the best drug targets
for Alzheimer's Disease. The overall methodology for the biological investigation, as described
in [GCPN06], is to find a (hopefully diverse) ensemble of accurate classification models, and
then statistically observe which genes tend to occur most often in this ensemble, and which
combinations of genes tend to co-occur most often in the models in the ensemble. These most
frequent genes and combinations are taken as potential therapeutic targets for the Case cate-
gory of the underlying classification problem (which in this case denotes inflammation). This
methodology has been biologically validated by follow-up lab work in a number of cases; see
e.g. [Gea05], where this approach resulted in the first evidence of a genetic basis for Chronic
Fatigue Syndrome. A significant body of unpublished commercial work along these lines has
been done by Biomind LLC (http://biomind.com) for its various customers.
Comparing MOSES-LIFES to MOSES with conventional feature selection, we find that the
former finds model ensembles combining greater diversity with greater precision, and equivalent
recall. This is because conventional feature selection eliminates numerous genes that actually
have predictive value for the phenotype of inflammation, so that MOSES never gets to see them.
LIFES exposes MOSES to a much greater number of genes, some of which MOSES finds useful.
And LIFES enables MOSES to explore this larger space of genes without getting bollixed by
the potential combinatorial explosion of possibilities.
Algorithm    Train. Precision Train. Recall Test Precision Test Recall Selection criterion
MOSES              .81            .51           .65           .42      best training precision
MOSES              .80            .52           .69           .43      best test precision
MOSES-LIFES        .84            .51           .68           .38      best training precision
MOSES-LIFES        .82            .51           .72           .48      best test precision
Table 33.1: Impact of LIFES on MOSES classification of Alzheimer's Disease SNP data. The fitness
function sought to maximize precision subject to a constraint of recall being at least
0.5. Precision and recall figures are averages over 10 folds, using 10-fold cross-validation.
The results shown here are drawn from a larger set of runs, and are selected according to
two criteria: best training precision (the fair way to do it) and best test precision (just for
comparison). We see that use of LIFES increases precision by around 3% in these tests, which
is highly statistically significant according to permutation analysis.
The genomics example shows that LIFES makes sense and works in the context of MOSES,
broadly speaking. It seems very plausible that LIFES will also work effectively with MOSES
in an integrative AGI context, for instance in OpenCog deployments where MOSES is used
to drive procedure learning, with fitness functions supplied by other OpenCog components.
However, the empirical validation of this plausible conjecture remains for future work.
33.5 Supplying Evolutionary Learning with Long-Term Memory
This section introduces an important enhancement to evolutionary learning, which extends the
basic PEPL framework, by forming an adaptive hybridization of PEPL optimization with PLN
inference (rather than merely using PLN inference within evolutionary learning to aid with
modeling).
The first idea here is the use of PLN to supply evolutionary learning with a long-term memory.
Evolutionary learning approaches each problem as an isolated entity, but in reality, a CogPrime
system will be confronting a long series of optimization problems, with subtle interrelationships.
When trying to optimize the function f, CogPrime may make use of its experience in optimizing
other functions g.
Inference allows optimizers of g to be analogically transformed into optimizers of f, for
instance it allows one to conclude:
Inheritance f g
EvaluationLink f x
EvaluationLink g x
However, less obviously, inference also allows patterns in populations of optimizers of g to be
analogically transformed into patterns in populations of optimizers of f. For example, if pat is
a pattern in good optimizers of f, then we have:
InheritanceLink f g
ImplicationLink
EvaluationLink f x
EvaluationLink pat x
ImplicationLink
EvaluationLink g x
EvaluationLink pat x
(with appropriate probabilistic truth values); an inference which says that patterns in the
population of f-optimizers should also be patterns in the population of g-optimizers.
Note that we can write the previous example more briefly as:
InheritanceLink f g
ImplicationLink (EvaluationLink f) (EvaluationLink pat)
ImplicationLink (EvaluationLink g) (EvaluationLink pat)
A similar formula holds for SimilarityLinks.
We may also infer:
ImplicationLink (EvaluationLink g) (EvaluationLink pat_g)
ImplicationLink (EvaluationLink f) (EvaluationLink pat_f)
ImplicationLink
(EvaluationLink (g AND f))
(EvaluationLink (pat_g AND pat_f))
and:
ImplicationLink (EvaluationLink f) (EvaluationLink pat)
ImplicationLink (EvaluationLink -f) (EvaluationLink -pat)
Through these sorts of inferences, PLN inference can be used to give evolutionary learning
a long-term memory, allowing knowledge about population models to be transferred from one
optimization problem to another. This complements the more obvious use of inference to transfer
knowledge about specific solutions from one optimization problem to another.
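As an illustrative sketch of this long-term-memory idea (the names `seed_model` and `pattern_memory` are hypothetical, not OpenCog API), one can seed the population model for a new fitness function f with pattern statistics remembered from the most similar previously solved function g, discounted by the strength of the similarity:

```python
# Hypothetical sketch (not OpenCog API): seed the population model for a new
# fitness function f with pattern frequencies remembered from the most
# similar previously solved function g, discounted by similarity strength.

def seed_model(new_problem, pattern_memory, similarity, blend=0.5):
    best_g = max(pattern_memory, key=lambda g: similarity(new_problem, g))
    discount = similarity(new_problem, best_g) * blend
    # Mix remembered pattern probabilities with an uninformative 0.5 prior.
    return {pat: discount * p + (1 - discount) * 0.5
            for pat, p in pattern_memory[best_g].items()}

# Patterns seen in good optimizers of g, transferred to the similar f.
memory = {"g": {"uses_conditional": 0.9, "deep_nesting": 0.2}}
model = seed_model("f", memory, similarity=lambda f, g: 0.8)
```

The discounting plays the role of the probabilistic truth values in the inference above: the weaker the Inheritance between f and g, the less the remembered patterns override the prior.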
For instance, in the problem of finding a compact program generating a given sequence of
bits, the system might have noticed that when the number of 0s roughly balances the number of
1s (let us call this property STR_BALANCE), successful optimizers tend to be biased
toward conditionals involving comparisons of the number of 0s and 1s inside the condition (let
us call this property over optimizers COMP_CARD_DIGIT_BIAS). This can be expressed in
PLN as follows:
AverageQuantifierLink (tv)
ListLink
$X
$Y
ImplicationLink
ANDLink
InheritanceLink
STR_BALANCE
$X
EvaluationLink
SUCCESSFUL_OPTIMIZER_OF
ListLink
$Y
$X
InheritanceLink
COMP_CARD_DIGIT_BIAS
$Y
which translates as: if the problem $X inherits from STR_BALANCE and $Y is a successful
optimizer of $X, then, with probability p calculated according to tv, $Y tends to be biased
according to the property described by COMP_CARD_DIGIT_BIAS.
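As a concrete illustration (a sketch, not OpenCog code; the function names are ours), the STR_BALANCE property of a target bit sequence is easy to detect, and detecting it can then shift a hypothetical model parameter toward count-comparing conditionals:

```python
# Illustrative sketch: detect the STR_BALANCE property of a target bit
# sequence, and use it to bias instance generation toward conditionals
# that compare the counts of 0s and 1s (COMP_CARD_DIGIT_BIAS).

def str_balance(bits, tol=0.1):
    """True if the fraction of 1s is within tol of 0.5, i.e. the
    number of 0s roughly balances the number of 1s."""
    ones = sum(bits) / len(bits)
    return abs(ones - 0.5) <= tol

def comp_card_digit_bias(bits, base=0.1, boosted=0.6):
    """Hypothetical model parameter: probability mass given to
    conditionals comparing the counts of 0s and 1s."""
    return boosted if str_balance(bits) else base
```

The specific bias values are placeholders; in the PLN formulation above they would be derived from the truth value tv of the learned implication.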
33.6 Hierarchical Program Learning
Next we discuss hierarchical program structure, and its reflection in probabilistic modeling, in
more depth. This is a surprisingly subtle and critical topic, which may be approached from
several different complementary angles. To an extent, hierarchical structure is automatically
accounted for in MOSES, but it may also be valuable to pay more explicit mind to it.
In human-created software projects, one common approach for dealing with the existence of
complex interdependencies between parts of a program is to give the program a hierarchical
structure. The program is then a hierarchical arrangement of programs within programs within
programs, each one of which has relatively simple dependencies between its parts (however its
parts may themselves be hierarchical composites). This notion of hierarchy is essential to such
programming methodologies as modular programming and object-oriented design.
Pelikan and Goldberg discuss the hierarchical nature of human problem-solving, in the context
of the hBOA (hierarchical BOA) version of BOA. However, the hBOA algorithm does not
incorporate hierarchical program structure nearly as deeply and thoroughly as the hierarchical
procedure learning approach proposed here. In hBOA the hierarchy is implicit in the models of
the evolving population, but the population instances themselves are not necessarily explicitly
hierarchical in structure. In hierarchical PEPL as we describe it here, the population consists of
hierarchically structured Combo trees, and the hierarchy of the probabilistic models corresponds
directly to this hierarchical program structure.
The ideas presented here have some commonalities with John Koza's ADFs and related tricks for
putting reusable subroutines in GP trees, but there are also some very substantial differences,
which we believe will make the current approach far more effective (though also involving
considerably more computational overhead).
We believe that this sort of hierarchically-savvy modeling is what will be needed to get
probabilistic evolutionary learning to scale to large and complex programs, just as hierarchy-
based methodologies like modular and object-oriented programming are needed to get human
software engineering to scale to large and complex programs.
33.6.1 Hierarchical Modeling of Composite Procedures in the
AtomSpace
The possibility of hierarchically structured programs is (intentionally) present in the CogPrime
design, even without any special effort to build hierarchy into the PEPL framework. Combo
trees may contain Nodes that point to PredicateNodes, which may in turn contain Combo trees,
etc. However, our current framework for learning Combo trees does not take advantage of this
hierarchy. What is needed, in order to do so, is for the models used for instance generation to
include events of the form:
Combo tree Node at position x has type PredicateNode; and the PredicateNode at position x
contains a Combo tree that possesses property P.
where x is a position in a Combo tree and P is a property that may or may not be true of
any given Combo tree. Using events like this, a relatively small program explicitly incorporating
only short-range dependencies may implicitly encapsulate long-range dependencies via the
properties P.
But where do these properties P come from? These properties should be patterns learned as
part of the probabilistic modeling of the Combo tree inside the PredicateNode at position x.
For example, if one is using a decision tree modeling framework, then the properties might be
of the form decision tree D evaluates to True. Note that not all of these properties have to be
statistically correlated with the fitness of the PredicateNode at position x (although some of
them surely will be).
Thus we have a multi-level probabilistic modeling strategy. The top-level Combo tree has
a probabilistic model whose events may refer to patterns that are parts of the probabilistic
models of Combo trees that occur within it and so on down.
In instance generation, when a newly generated Combo tree is given a PredicateNode at
position x, two possibilities exist:
• There is already a model for PredicateNodes at position x in Combo trees in the given
population, in which case a population of PredicateNodes potentially living at that position
is drawn from the known model, and evaluated.
• There is no such model (because it has never been tried to create a PredicateNode at
position x in this population before), in which case a new population of Combo trees is
created corresponding to the position, and evaluated.
Note that the fitness of a Combo tree that is not at the top level of the overall process is assessed
indirectly, in terms of the fitness of the higher-level Combo tree in which it is embedded, due to
the requirement of having certain properties, etc.
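The two instance-generation cases above can be sketched as follows (an illustrative toy, not the actual MOSES implementation; class and method names are ours):

```python
import random

# Illustrative sketch (not the MOSES implementation): per-position models
# of embedded PredicateNode subtrees, reused when present, created when not.

class HierarchicalGenerator:
    def __init__(self):
        self.position_models = {}  # position -> population of known subtrees

    def subtree_for(self, position, fresh_population):
        if position in self.position_models:
            # Case 1: a model already exists at this position; draw from it.
            return random.choice(self.position_models[position])
        # Case 2: no model yet; create and remember a new population.
        self.position_models[position] = list(fresh_population)
        return random.choice(self.position_models[position])

gen = HierarchicalGenerator()
first = gen.subtree_for("pos_x", ["if(a, b, c)"])
```

In the real setting the stored population would itself be governed by a probabilistic model, and draws would be biased by that model rather than uniform.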
Suppose each Combo tree in the hierarchy has on average R adaptable sub-programs (rep-
resented as Nodes pointing to PredicateNodes containing Combo trees to be learned). Suppose
the hierarchy is K levels deep. Then we will have about R x K program tree populations in the
tree. This suggests that hierarchies shouldn't get too big, and indeed, they shouldn't need to,
for the same essential reason that human-created software programs, if well-designed, tend not
to require extremely deep and complex hierarchical structures.
One may also introduce a notion of reusable components across various program learning
runs, or across several portions of the same hierarchical program. Here one is learning patterns
of the form:
If property P1(C,x) applies to a Combo tree C and a node x within it, then it is often good
for node x to refer to a PredicateNode containing a Combo tree with property P2.
These patterns may be assigned probabilities and may be used in instance generation. They
are general or specialized programming guidelines, which may be learned over time.
33.6.2 Identifying Hierarchical Structure In Combo trees via
MetaNodes and Dimensional Embedding
One may also apply the concepts of the previous section to model a population of CTs that
doesn't explicitly have a hierarchical structure, by introducing hierarchical structure during
the evolutionary process, through the introduction of special extra Combo tree nodes called
MetaNodes. For instance, MetaNodes may represent subtrees of Combo trees which have proved
useful enough that it seems justifiable to extract them as "macros." This concept may be
implemented in a couple of different ways; here we will introduce a simple way of doing this based
on dimensional embedding, and then in the next section we will allude to a more sophisticated
approach that uses inference instead.
The basic idea is to couple decision tree modeling with dimensional embedding of subtrees,
a trick that enables small decision tree models to cover large regions of a CT in an approximate
way, and which leads naturally to a form of probabilistically-guided crossover.
The approach as described here works most simply for CTs that have many subtrees that can
be viewed as mapping numerical inputs into numerical outputs. There are clear generalizations
to other sorts of CTs, but it seems advisable to test the approach on this relatively simple case
first.
The first part of the idea is to represent subtrees of a CT as numerical vectors in a relatively
low-dimensional space (say N=50 dimensions). This can be done using our existing dimensional
embedding algorithm, which maps any metric space of entities into a dimensional space. All
that's required is that we define a way of measuring distance between subtrees. If we look at
subtrees with numerical inputs and outputs, this is easy. Such a subtree can be viewed as a
function mapping R^n into R^m, and there are many standard ways to calculate the distance
between two functions of this sort; for instance one can make a Monte Carlo estimate of the
Lp metric, which is defined as:
[ Sum_x |f(x) - g(x)|^p ]^(1/p)
Of course, the same idea works for subtrees with non-numerical inputs and outputs; the
tuning and implementation are just a little trickier.
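A Monte Carlo estimate of this metric is straightforward to sketch (a hypothetical helper, assuming scalar-valued subtree functions; the sum is taken over the sample, matching the formula above):

```python
import random

# Hypothetical helper: Monte Carlo estimate of the Lp metric between two
# scalar subtree functions, from a sample of input points.

def lp_distance(f, g, sample_inputs, p=2):
    return sum(abs(f(x) - g(x)) ** p for x in sample_inputs) ** (1.0 / p)

# Two functions a constant 0.1 apart, sampled at 1000 random points.
xs = [random.uniform(-1, 1) for _ in range(1000)]
d = lp_distance(lambda x: x, lambda x: x + 0.1, xs)
```

Distances computed this way are exactly what the dimensional embedding algorithm needs as input: it only requires a metric on the space of subtrees, not the subtrees' internal structure.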
Next, one can augment a CT with meta-nodes that correspond to subtrees. Each meta-node is
of a special CT node type MetaNode, and comes tagged with an N-dimensional vector. Exactly
which subtrees to replace with MetaNodes is an interesting question that must be solved via
some heuristics.
Then, in the course of executing the PEPL algorithm, one does decision tree modeling as
usual, but making use of MetaNodes as well as ordinary CT nodes. The modeling of MetaNodes
is quite similar to the modeling of Nodes representing ConceptNodes and PredicateNodes using
embedding vectors. In this way, one can use standard, small decision tree models to model fairly
large portions of CTs (because portions of CTs are approximately represented by MetaNodes).
But how does one do instance generation, in this scheme? What happens when one tries to do
instance generation using a model that predicts a MetaNode existing in a certain location in a
CT? Then, the instance generation process has got to find some CT subtree to put in the place
where the MetaNode is predicted. It needs to find a subtree whose corresponding embedding
vector is close to the embedding vector stored in the MetaNode. But how can it find such a
subtree?
There seem to be two ways:
1. A reasonable solution is to look at the database of subtrees that have been seen before
in the evolving population, and choose one from this database, with the probability of
choosing subtree X decreasing with the distance between X's embedding vector and
the embedding vector stored in the MetaNode.
2. One can simply choose good subtrees, where the goodness of a subtree is judged by the
average fitness of the instances containing the target subtree.
One can use a combination of both of these processes during instance generation.
But of course, what this means is that we're in a sense doing a form of crossover, because we're
generating new instances that combine subtrees from previous instances. But we're combining
subtrees in a judicious way guided by probabilistic modeling, rather than in a random way as
in GP-style crossover.
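A minimal sketch of this probabilistically guided choice (names hypothetical, not OpenCog API): candidate subtrees from the database are sampled with weight decreasing in embedding distance from the meta-node's stored vector:

```python
import math
import random

# Hypothetical sketch: sample a stored subtree for a predicted meta-node,
# weighting candidates by exp(-distance) so that subtrees whose embedding
# vectors lie closer to the meta-node's stored vector are chosen more often.

def choose_subtree(candidates, embed, target_vec, temperature=1.0):
    def dist(v, w):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))
    weights = [math.exp(-dist(embed(c), target_vec) / temperature)
               for c in candidates]
    r = random.uniform(0, sum(weights))
    for c, w in zip(candidates, weights):
        r -= w
        if r <= 0:
            return c
    return candidates[-1]

# With a single stored subtree, it is always the one chosen.
pick = choose_subtree(["+(x, y)"], lambda s: [0.0, 0.0], [0.0, 0.0])
```

The exponential weighting is one arbitrary choice among many; any monotone decreasing function of distance gives the qualitative behavior described in the text.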
33.6.2.1 Inferential MetaNodes
MetaNodes are an interesting and potentially powerful technique, but we don't believe that
they, or any other algorithmic trick, will be the solution to the problem of learning
hierarchical procedures. We believe that this is a cognitive science problem that probably isn't
amenable to a purely computer-science-oriented solution. In other words, we suspect that the
correct way to break a Combo tree down into hierarchical components depends on context;
algorithms are of course required, but they're algorithms for relating a CT to its context rather
than pure CT-manipulation algorithms. Dimensional embedding is arguably a tool for capturing
contextual relationships, but it's a very crude one.
Generally speaking, what we need to be learning are patterns of the form "A subtree meeting
requirements X is often fit when linked to a subtree meeting requirements Y, when solving a
problem of type Z". Here the context requirements Y will not pertain to absolute tree position
but rather to abstract properties of a subtree.
The MetaNode approach as outlined above is a kind of halfway measure toward this goal,
good because of its relative computational efficiency, but ultimately too limited in its power
to deal with really hard hierarchical learning problems. The reason the MetaNode approach is
crude is simply that it involves describing subtrees via points in an embedding space. We
believe that the correct (but computationally expensive) approach is indeed to use MetaNodes,
but with each MetaNode tagged, not with coordinates in an embedding space, but with a
set of logical relationships describing the subtree that the MetaNode stands for. A candidate
subtree's similarity to the MetaNode may then be determined by inference rather than by the
simple computation of a distance between points in the embedding space. (And, note that we
may have a hierarchy of MetaNodes, with small subtrees corresponding to MetaNodes, larger
subtrees comprising networks of small subtrees also corresponding to MetaNodes, etc.)
The question then becomes which logical relationships one tries to look for, when characterizing
a MetaNode. This may be partially domain-specific, in the sense that different properties
will be more interesting when studying motor-control procedures than when studying cognitive
procedures.
To intuitively understand the nature of this idea, let's consider some abstract but common-
sense examples. Firstly, suppose one is learning procedures for serving a ball in tennis. Suppose
all the successful procedures work by first throwing the ball up really high, then doing other
stuff. The internal details of the different procedures for throwing the ball up really high may
be wildly different. What we need is to learn the pattern
Implication
Inheritance X "throwing the ball up really high"
"X then Y" is fit
Here X and Y are MetaNodes. But the question is how do we learn to break trees down into
MetaNodes according to the formula "tree = 'X then Y' where X inherits from 'throwing the ball
up really high'"?
Similarly, suppose one is learning procedures to do first-order inference. What we need is to
learn a pattern such as:
Implication
AND
F involves grabbing pairs from the AtomTable
G involves applying an inference rule to each such pair
H involves putting the results back in the AtomTable
"F(G(H))" is fit
Here we need MetaNodes for F, G and H, but we need to characterize e.g. the MetaNode F
by a relationship such as "involves grabbing pairs from the AtomTable."
Until we can characterize MetaNodes using abstract descriptors like this, one might argue
we're just doing "statistical learning" rather than "general intelligence style" procedure learning.
But to do this kind of abstraction intelligently seems to require some background knowledge
about the domain.
In the "throwing the ball up really high" case the assignment of a descriptive relationship
to a subtree involves looking, not at the internals of the subtree itself, but at the state of the
world after the subtree has been executed.
In the "grabbing pairs from the AtomTable" case it's a bit simpler but still requires some
kind of abstract model of what the subtree is doing, i.e. a model involving a logic expression
such as "The output of F is a set S such that if P belongs to S, then P is a set of two Atoms A1
and A2, and both A1 and A2 were produced via the getAtom operator."
How can this kind of abstraction be learned? It seems unlikely that abstractions like this will
be found via evolutionary search over the space of all possible predicates describing program
subtrees. Rather, they need to be found via probabilistic reasoning based on the terms combined
in subtrees, put together with background knowledge about the domain in which the fitness
function exists. In short, integrative cognition is required to learn hierarchically structured pro-
grams in a truly effective way, because the appropriate hierarchical breakdowns are contextual
in nature, and to search for appropriate hierarchical breakdowns without using inference to take
context into account, involves intractably large search spaces.
33.7 Fitness Function Estimation via Integrative Intelligence
If instance generation is very cheap and fitness evaluation is very expensive (as is the case
in many applications of evolutionary learning in CogPrime), one can accelerate evolutionary
learning via a "fitness function estimation" approach. Given a fitness function embodied in a
predicate P, the goal is to learn a predicate Q so that:
1. Q is much cheaper than P to evaluate, and
2. There is a high-strength relationship:
Similarity Q P
or else
ContextLink C (Similarity Q P)
where C is a relevant context.
Given such a predicate Q, one could proceed to optimize P by ignoring evolutionary learning
altogether and just repeatedly following the algorithm:
• Randomly generate N candidate solutions.
• Evaluate each of the N candidate solutions according to Q.
• Take the k·N solutions (for some small fraction k < 1) that satisfy Q best, and evaluate them according to P.
The predicate Q may then be adaptively improved based on the new evaluations of P that are done. Of course, this would not be as good
as incorporating fitness function estimation into an overall evolutionary learning framework.
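One round of this loop might be sketched as follows (illustrative names, not OpenCog API; Q and P are passed in as ordinary functions):

```python
# Illustrative sketch: spend cheap Q evaluations on all candidates, and
# expensive P evaluations only on the fraction that Q ranks best.

def estimate_then_evaluate(generate, Q, P, n=100, k=0.1):
    candidates = [generate() for _ in range(n)]
    candidates.sort(key=Q, reverse=True)   # rank by the cheap estimator Q
    elite = candidates[:max(1, int(k * n))]
    return [(c, P(c)) for c in elite]      # costly evaluations on elite only

# Toy usage: Q happens to rank perfectly here, so the top 2 of the
# candidates 0..9 receive the expensive P evaluation.
vals = iter(range(10))
results = estimate_then_evaluate(lambda: next(vals), Q=lambda c: c,
                                 P=lambda c: c * c, n=10, k=0.2)
```

The returned (candidate, P-value) pairs are exactly the new evidence from which an improved estimator Q could be learned on the next round.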
Heavy utilization of fitness function estimation may be appropriate, for example, if the
entities being evolved are schemata intended to control an agent's actions in a real or simulated
environment. In this case the specification predicate P, in order to evaluate P(S), has to actually
use the schema S to control the agent in the environment. So one may search for Q that do
not involve any simulated environment, but are constrained to be relatively small predicates
involving only cheap-to-evaluate terms (e.g. one may allow standard combinators, numbers,
strings, ConceptNodes, and predicates built up recursively from these). Then Q will be an
abstract predictor of concrete environment success.
We have left open the all-important question of how to find the "specification approximating
predicate" Q.
One approach is to use evolutionary learning. In this case, one has a population of predi-
cates, which are candidates for Q. The fitness of each candidate Q is judged by how well it
approximates P over the set of candidate solutions for P that have already been evaluated. If
one uses evolutionary learning to evolve Qs, then one is learning a probabilistic model of the
set of Qs, which tries to predict what sort of Qs will better solve the optimization problem
of approximating P's behavior. Of course, using evolutionary learning for this purpose potentially
initiates an infinite regress, but the regress can be stopped by, at some level, finding Qs
using a non-evolutionary learning based technique such as genetic programming, or a simple
evolutionary learning based technique like standard BOA programming.
Another approach to finding Q is to use inference based on background knowledge. Of course,
this is complementary rather than contradictory to using evolutionary learning for finding Q.
There may be information in the knowledge base that can be used to "analogize" regarding
which Qs may match P. Indeed, this will generally be the case in the example given above,
where P involves controlling actions in a simulated environment but Q does not.
An important point is that, if one uses a certain Q1 within fitness estimation, the evidence
one gains by trying Q1 on numerous fitness cases may be utilized in future inferences regarding
other Q2 that may serve the role of Q. So, once inference gets into the picture, the quality
of fitness estimators may progressively improve via ongoing analogical inference based on the
internal structures of the previously attempted fitness estimators.
Section V
Declarative Learning
Chapter 34
Probabilistic Logic Networks
Co-authored with Matthew Iklé
34.1 Introduction
Now we turn to CogPrime's methods for handling declarative knowledge - beginning with
a series of chapters discussing the Probabilistic Logic Networks (PLN) [GMIH08] approach
to uncertain logical reasoning, and then turning to chapters on pattern mining and concept
creation. In this first of the chapters on PLN, we give a high-level overview, summarizing
material given in the book Probabilistic Logic Networks [GMIH08] more compactly and in
a somewhat differently-organized way. For a more thorough treatment of the concepts and
motivations underlying PLN, the reader is encouraged to read [GMIH08].
PLN is a mathematical and software framework for uncertain inference, operative within
the CogPrime software framework and intended to enable the combination of probabilistic
truth values with general logical reasoning rules. Some of the key requirements underlying the
development of PLN were the following:
• To enable uncertainty-savvy versions of all known varieties of logical reasoning, including for
instance higher-order reasoning involving quantifiers, higher-order functions, and so forth
• To reduce to crisp "theorem prover" style behavior in the limiting case where uncertainty
tends to zero
• To encompass inductive and abductive as well as deductive reasoning
• To agree with probability theory in those reasoning cases where probability theory, in its
current state of development, provides solutions within reasonable calculational effort based
on assumptions that are plausible in the context of real-world embodied software systems
• To gracefully incorporate heuristics not explicitly based on probability theory, in cases
where probability theory, at its current state of development, does not provide adequate
pragmatic solutions
• To provide "scalable" reasoning, in the sense of being able to carry out inferences involving
billions of premises.
• To easily accept input from, and send input to, natural language processing software systems
In practice, PLN consists of
• a set of inference rules (e.g. deduction, Bayes rule, variable unification, modus ponens, etc.),
each of which takes one or more logical relationships or terms (represented as CogPrime
Atoms) as inputs, and produces others as outputs
• specific mathematical formulas for calculating the probability value of the conclusion of
an inference rule based on the probability values of the premises plus (in some cases)
appropriate background assumptions.
PLN also involves a particular approach to estimating the confidence values with which
these probability values are held (weight of evidence, or second-order uncertainty). Finally,
the implementation of PLN in software requires important choices regarding the structural
representation of inference rules, and also regarding "inference control" - the strategies required
to decide what inferences to do in what order, in each particular practical situation. Currently
PLN is being utilized to enable an animated agent to achieve goals via combining actions in
a game world. For example, it can figure out that to obtain an object located on top of a
wall, it may want to build stairs leading from the floor to the top of the wall. Earlier PLN
applications have involved simpler animated agent control problems, and also other domains,
such as reasoning based on information extracted from biomedical text using a language parser.
For all its sophistication, however, PLN falls prey to the same key weakness as other logical
inference systems: combinatorial explosion. In trying to find a logical chain of reasoning leading
to a desired conclusion, or to evaluate the consequences of a given set of premises, PLN may
need to explore an unwieldy number of possible combinations of the Atoms in CogPrime's
memory. For PLN to be practical beyond relatively simple and constrained problems (and most
definitely, for it to be useful for AGI at the human level or beyond), it must be coupled with a
powerful method for "inference tree pruning" - for paring down the space of possible inferences
that the PLN engine must evaluate as it goes about its business in pursuing a given goal in a
certain context. Inference control will be addressed in Chapter 36.
34.2 A Simple Overview of PLN
The key elements of PLN are its rules and formulas. In general, a PLN rule has
• Input: A tuple of Atoms (which must satisfy certain criteria, specific to the Rule)
• Output: A tuple of Atoms
Actually, in nearly all cases, the output is a single Atom; and the input is a single Atom or a
pair of Atoms.
The prototypical example is the DeductionRule. Its input must look like
X_Link A B
X_Link B C
And its output then looks like
X_Link A C
Here, X_Link may be either InheritanceLink, SubsetLink, ImplicationLink or
ExtensionalImplicationLink.
A PLN formula goes along with a PLN rule, and tells the uncertain truth value of the output,
based on the uncertain truth value of the input. For example, if we have
X_Link A B <sAB>
X_Link B C <sBC>
then the standard PLN deduction formula tells us
X_Link A C <sAC>
with
sAC = sAB * sBC + (1 - sAB) * (sC - sB * sBC) / (1 - sB)
where e.g. sA denotes the strength of the truth value of node A.
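In code, the deduction strength formula reads as follows (a direct transcription; the guard against sB = 1 is our addition to avoid division by zero):

```python
def pln_deduction_strength(sAB, sBC, sB, sC):
    """PLN deduction strength:
    sAC = sAB*sBC + (1 - sAB)*(sC - sB*sBC)/(1 - sB).
    Inputs are the two premise strengths plus the node strengths sB, sC."""
    if sB >= 1.0:
        return sC  # degenerate case: B has probability 1
    return sAB * sBC + (1.0 - sAB) * (sC - sB * sBC) / (1.0 - sB)

# InheritanceLink A B <0.8>, InheritanceLink B C <0.7>, with sB=0.5, sC=0.6:
sAC = pln_deduction_strength(0.8, 0.7, 0.5, 0.6)
```

Note the sanity checks implicit in the formula: when sAB = 1 the result collapses to sBC, and when sAB = 0 the conclusion depends only on the node strengths.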
In this example, the uncertain truth value of each Atom is given as a single "strength" number.
In general, uncertain truth values in PLN may take multiple forms, such as
• Single strength values like .8, which may indicate probability or fuzzy truth value, depending
on the Atom type
• (strength, confidence) pairs like (.8, .4)
• (strength, count) pairs like (.8, 15)
• indefinite probabilities like ([.6, .9], .95), which indicate credible intervals of probabilities
34.2.1 Forward and Backward Chaining
Typical patterns of usage of PLN are forward-chaining and backward-chaining inference.
Forward chaining basically means:
1. Given a pool (a list) of Atoms of interest
2. One applies PLN rules to these Atoms, to generate new Atoms, hopefully also of interest
3. Adding these new Atoms to the pool, one returns to Step 1
EXAMPLE: "People are animals" and "animals breathe" are in the pool of Atoms. These are
combined by the Deduction rule to form the conclusion "people breathe".
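The forward-chaining loop can be sketched as follows (an illustrative toy, not the OpenCog chainer; the pool maps Inheritance pairs to strengths, and a simple product rule stands in for the full deduction formula):

```python
# Minimal forward-chaining sketch: repeatedly apply a deduction rule to
# every compatible pair of Inheritance links in the pool, adding the new
# conclusions back into the pool until nothing new is produced.

def forward_chain(atoms, deduce, rounds=3):
    pool = dict(atoms)  # (A, B) -> strength of "InheritanceLink A B"
    for _ in range(rounds):
        new = {}
        for (a, b1), s1 in pool.items():
            for (b2, c), s2 in pool.items():
                if b1 == b2 and (a, c) not in pool:
                    new[(a, c)] = deduce(s1, s2)
        if not new:
            break
        pool.update(new)
    return pool

# "People are animals" and "animals breathe" yield "people breathe".
pool = forward_chain({("people", "animals"): 0.95,
                      ("animals", "breathe"): 0.9},
                     deduce=lambda s1, s2: s1 * s2)
```

The combinatorial-explosion problem discussed later in the chapter is visible even here: each round scans all pairs in the pool, which is exactly what inference control must prune in a large Atomspace.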
Backward chaining falls into two cases. First:
• "'Truth value query."' Given a target Atom whose truth value is not known (or is too
uncertainly known), plus a pool of Atoms, find a way to estimate the truth value of the
target Atom, via combining the Atoms in the pool using the inference Rules
EXAMPLE: The target is "do people breathe?" (InheritanceLink people breathe). The truth
value of the target is estimated via doing the inference "People are animals, animals breathe,
therefore people breathe."
Second:
• "'Variable fulfillment query"'. Given a target Link (Atoms may be Nodes or Links) with
one or more VariableAtoms among its targets, figure out what Atoms may be put in place
of these VariableAtoms, so as to give the target Link a high strength* confidence (i.e. a
"high truth value").
EXAMPLE: The target is "what breathes?", i.e. "InheritanceLink $X breathe"... Direct lookup
into the Atomspace reveals the Atom "InheritanceLink animal breathe", indicating that the slot
$X may be filled by "animal". Inference reveals that "Inheritance people breathe" so that the
slot $X may also be filled by "people".
EXAMPLE: the target is "what breathes and adds", i.e. "(InheritanceLink $X breathe) AND
(InheritanceLink $X add)". Inference reveals that the slot $X may be filled by "people" but not
"cats" or "computers."
Common-sense inference may involve a combination of backward chaining and forward chain-
ing.
The hardest part of inference is "inference control" - that is, knowing which among the
many possible inference steps to take, in order to obtain the desired information (in backward
chaining) or to obtain interesting new information (in forward chaining). In an Atomspace
with a large number of (often quite uncertain) Atoms, there are many, many possibilities and
powerful heuristics are needed to choose between them. The best guide to inference control is
some sort of induction based on the system's past history of which inferences have been useful.
But of course, a young system doesn't have much history to go on. And relying on indirectly
relevant history is, itself, an inference problem - which can be solved best by a system with
some history to draw on!
34.3 First Order Probabilistic Logic Networks
We now review the essentials of PLN in a more formal way. PLN is divided into first-order and
higher-order sub-theories (FOPLN and HOPLN). These terms are used in a nonstandard way,
drawn conceptually from NARS [Wan06]. We develop FOPLN first, and then derive HOPLN
therefrom.
FOPLN is a term logic, involving terms and relationships (links) between terms. It is an
uncertain logic, in the sense that both terms and relationships are associated with truth value
objects, which may come in multiple varieties ranging from single numbers to complex structures
like indefinite probabilities. Terms may be either elementary observations, or abstract tokens
drawn from a token-set T.
34.3.1 Core FOPLN Relationships
"Core FOPLN" involves relationships drawn from the set: negation; Inheritance and probabilis-
tic conjunction and disjunction; Member and fuzzy conjunction and disjunction. Elementary
observations can have only Member links, while token terms can have any kinds of links. PLN
makes clear distinctions, via link type semantics, between probabilistic relationships and fuzzy
set relationships. Member semantics are usually fuzzy relationships (though they can also be
crisp), whereas Inheritance relationships are probabilistic, and there are rules governing the
interoperation of the two types.
Suppose a virtual agent makes an elementary VisualObservation o of a creature named Fluffy.
The agent might classify o as belonging, with degree 0.9, to the fuzzy set of furry objects. The
agent might also classify o as belonging with degree 0.8 to the fuzzy set of animals. The agent
could then build the following links in its memory:
Member o furry < 0.9 >
Member o animals < 0.8 >
The agent may later wish to refine its knowledge, by combining these MemberLinks. Using
the minimum fuzzy conjunction operator, the agent would conclude:
fuzzyAND < 0.8 >
Member o furry
Member o animals
meaning that the observation o is a visual observation of a fairly furry, animal object.
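In code, the min/max fuzzy operators used in this example can be sketched as follows (an illustrative Python sketch; the function names are ours, not an actual OpenCog API):

```python
# Minimal sketch of min/max fuzzy operators, as applied to Member links.
# Function names are illustrative, not part of any OpenCog/PLN API.

def fuzzy_and(*degrees):
    """Fuzzy conjunction via the minimum operator."""
    return min(degrees)

def fuzzy_or(*degrees):
    """Fuzzy disjunction via the maximum operator."""
    return max(degrees)

# (Member o furry <0.9>) combined with (Member o animals <0.8>)
print(fuzzy_and(0.9, 0.8))  # -> 0.8
```

The minimum operator is the most common choice of t-norm, but as the text notes, other fuzzy conjunction operators could be substituted without changing the overall framework.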
The semantics of (extensional) Inheritance are quite different from, though related to, those
of the MemberLink. ExtensionalInheritance represents a purely conditional probabilistic subset
relationship and is represented through the Subset relationship. If A is Fluffy and B is the set
of cats, then the statement
Subset < 0.9 >
A
B
means that
P(x is in the set B | x is in the set A) = 0.9.
34.3.2 PLN Truth Values
PLN is equipped with a variety of different truth-value types. In order of increasing
information about the full probability distribution, they are:
• strength truth-values, which consist of single numbers; e.g., < s > or < .8 >. Usually
strength values denote probabilities but this is not always the case.
• SimpleTruthValues, consisting of pairs of numbers. These pairs come in two forms: < s, w >,
where s is a strength and w is a "weight of evidence" and < s, N >, where N is a "count."
"Weight of evidence" is a qualitative measure of belief, while "count" is a quantitative mea-
sure of accumulated evidence.
• IndefiniteTruthValues, which quantify truth-values in terms of an interval [L, U], a credibil-
ity level b, and an integer k (called the lookahead). IndefiniteTruthValues quantify the idea
that after k more observations there is a probability b that the conclusion of the inference
will appear to lie in [L, U].
• DistributionalTruthValues, which are discretized approximations to entire probability dis-
tributions.
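The relationship between "count" and "weight of evidence" can be illustrated with a small sketch, assuming the common PLN convention w = N/(N + k), where k is a system parameter; the class and attribute names here are ours, not an actual implementation:

```python
# Sketch of a PLN-style SimpleTruthValue. The count-to-confidence mapping
# w = N / (N + k) is a common PLN convention, with k a system "personality"
# parameter; class and attribute names are illustrative.

class SimpleTruthValue:
    def __init__(self, strength, count, k=1.0):
        self.strength = strength  # s: estimated probability
        self.count = count        # N: quantitative accumulated evidence
        self.k = k                # lookahead / personality parameter

    @property
    def confidence(self):
        """Weight of evidence w derived from the count N."""
        return self.count / (self.count + self.k)

tv = SimpleTruthValue(strength=0.8, count=9.0, k=1.0)
print(tv.confidence)  # -> 0.9
```

Note how the mapping is monotone and saturating: confidence approaches 1 as evidence accumulates, but never reaches it.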
34.3.3 Auxiliary FOPLN Relationships
Beyond the core FOPLN relationships, FOPLN involves additional relationship types of two
varieties. There are simple ones like Similarity, defined by
Similarity A B
We say a relationship R is simple if the truth value of R A B can be calculated solely in terms of
the truth values of core FOPLN relationships between A and B. There are also complex "aux-
iliary" relationships like IntensionalInheritance, which, as discussed in depth in Appendix
??, measures the extensional inheritance between the set of properties or patterns associated
with one term and the corresponding set associated with another.
Returning to our example, the agent may observe that two properties of cats are that they
are furry, and purr. Since Fluffy is also a furry animal, the agent might then obtain, for
example
IntensionalInheritance < 0.5 >
Fluffy
cat
meaning that Fluffy shares about 50% of the properties of cat. Building upon this relation-
ship even further, PLN also has a mixed intensional/extensional Inheritance relationship which
is defined simply as the disjunction of the Subset and IntensionalInheritance relationships.
As this example illustrates, for a complex auxiliary relationship R, the truth value of R A B
is defined in terms of the truth values of a number of different FOPLN relationships among
different terms (other than A and B), specified by a certain mathematical formula.
34.3.4 PLN Rules and Formulas
A distinction is made in PLN between rules and formulas. PLN logical inferences take the
form of "syllogistic rules," which give patterns for combining statements with matching terms.
Examples of PLN rules include, but are not limited to,
• deduction ((A → B) ∧ (B → C) ⊢ (A → C)),
• induction ((A → B) ∧ (A → C) ⊢ (B → C)),
• abduction ((A → C) ∧ (B → C) ⊢ (A → B)),
• revision, which merges two versions of the same logical relationship that have different truth
values,
• inversion ((A → B) ⊢ (B → A)).
The basic schematic of the first four of these rules is shown in Figure 34.1. We can see that the
first three rules represent the natural ways of doing inference on three interrelated terms. We
can also see that induction and abduction can be obtained from the combination of deduction
and inversion, a fact utilized in PLN's truth value formulas.
Related to each rule is a formula which calculates the truth value resulting from application
of the rule. As an example, suppose sA, sB, sC, sAB, and sBC represent the truth values for the
terms A, B, C, as well as the truth values of the relationships A → B and B → C, respectively.
Then, under suitable conditions imposed upon these input truth values, the formula for the
deduction rule is given by:

sAC = sAB sBC + (1 - sAB)(sC - sB sBC) / (1 - sB)
Fig. 34.1: The four most basic first-order PLN inference rules
where sAC represents the truth value of the relationship A → C. This formula is directly derived
from probability theory given the assumption that A → B and B → C are independent.
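The independence-based deduction formula can be transcribed directly into code; the following sketch (the function name is ours) applies it to illustrative input strengths:

```python
# Independence-based PLN deduction strength formula:
#   sAC = sAB*sBC + (1 - sAB) * (sC - sB*sBC) / (1 - sB)
# Valid only under suitable consistency conditions on the inputs
# (in particular sB != 1); argument names mirror the text.

def deduction_strength(s_ab, s_bc, s_b, s_c):
    return s_ab * s_bc + (1.0 - s_ab) * (s_c - s_b * s_bc) / (1.0 - s_b)

# Example: A->B with strength 0.8, B->C with 0.9, P(B)=0.5, P(C)=0.6
print(deduction_strength(0.8, 0.9, 0.5, 0.6))  # -> 0.78 (up to float rounding)
```

A useful sanity check: when sAB = 1 (A is entirely contained in B), the formula collapses to sAC = sBC, as probability theory requires.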
For inferences involving solely fuzzy operators, the default version of PLN uses standard
fuzzy logic with min/max truth value formulas (though alternatives may also be employed
consistently with the overall PLN framework). Finally, the semantics of combining fuzzy and
probabilistic operators is hinted at in [GMIH08] but addressed more rigorously in subsequent
work, which gives a precise semantics for constructs of the form
Inheritance A B
where A and B are characterized by relationships of the form Member C A, Member D B, etc.
It is easy to see that, in the crisp case, where all MemberLinks and InheritanceLinks have
strength 0 or 1, FOPLN reduces to standard propositional logic. Where inheritance is crisp
but membership isn't, FOPLN reduces to higher-order fuzzy logic (including fuzzy statements
about terms, or about other fuzzy statements, etc.).
34.3.5 Inference Trails
Inference trails are a mechanism used in some implementations of PLN, borrowed from the
NARS inference engine [Wan06]. In this approach, each Atom contains a trail structure, which
keeps a record of which Atoms were used in deriving the given Atom's TruthValue. In its simplest
form, the trail can just be a list of Atoms. The total set of Atoms involved in a given trail, in
principle, could be very large; but one can in practice cap Trail size at 50 or some other similar
number. In a more sophisticated version, one can record the rules used with the Atoms in the
Trail as well, allowing recapitulation of the whole inference history producing an Atom's truth
value. If the PLN MindAgents store all the inferences they do in some global inference history
structure, then trails are obviated, as the information in the trail can be found via consulting
this history structure.
The purpose of keeping inference trails is to avoid errors due to double-counting of evidence.
If links L1 and L2 are both derived largely based on link L0, and L1 and L2 both lead to L4
as a consequence - do we want to count this as two separate, independent pieces of evidence
about L4? Not really, because most of the information involved comes from the single Atom L0
anyway. If all the Atoms maintain trails then this sort of overlapping evidence can be identified
easily; otherwise it will be opaque to the reasoning system.
While Trails can be a useful tool, there is reason to believe they're not strictly necessary.
If one just keeps doing probabilistic inference iteratively without using trails, eventually the
dependencies and overlapping evidence bases will tend to be accounted for, much as in a loopy
Bayes net. The key question then comes down to: how long is "eventually" and can the reasoning
system afford to wait? A reasonable strategy seems to be
• Use Trails for high-STI Atoms that are being reasoned about intensively, to minimize the
amount of error
• For lower-STI Atoms that are being reasoned on more casually in the background, allow
the double-counting to exist in the short term, figuring it will eventually "come out in the
wash" so it's not worth spending precious compute resources to more rigorously avoid it in
the short term
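The capped trail structure and the overlap check that motivates it can be sketched as follows (an illustrative structure; the names Trail and MAX_TRAIL are ours, not an actual implementation):

```python
# Sketch of a capped inference trail, used to detect overlapping evidence.
# Names (Trail, MAX_TRAIL) are illustrative, not an actual OpenCog API.

MAX_TRAIL = 50  # cap on trail size, as suggested in the text

class Trail:
    """Record of which Atoms a truth value was derived from."""
    def __init__(self, sources=()):
        self.sources = frozenset(sources)

    def extend(self, other):
        """Trail of a conclusion derived from self and other, capped."""
        merged = sorted(self.sources | other.sources)
        return Trail(merged[:MAX_TRAIL])

    def overlaps(self, other):
        """True if two conclusions share evidence (double-counting risk)."""
        return bool(self.sources & other.sources)

L1 = Trail({"L0"})      # L1 derived largely from L0
L2 = Trail({"L0"})      # L2 likewise
print(L1.overlaps(L2))  # -> True: not two independent pieces of evidence
```

When the overlap check fires, a revision step would discount the shared evidence rather than count it twice.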
34.4 Higher-Order PLN
Higher-order PLN (HOPLN) is defined as the subset of PLN that applies to predicates (con-
sidered as functions mapping arguments into truth values). It includes mechanisms for dealing
with variable-bearing expressions and higher-order functions.
A predicate, in PLN, is a special kind of term that embodies a function mapping terms or
relationships into truth-values. HOPLN contains several relationships that act upon predicates,
including Evaluation, Implication, and several types of quantifiers. These relationships can involve
constant terms, variables, or a mixture.
The Evaluation relationship, for example, evaluates a predicate on an input term. An agent
can thus create a relationship of the form
Evaluation
near
(Bob's house, Fluffy)
or, as an example involving variables,
Evaluation
near
(X, Fluffy)
The Implication relationship is a particularly simple kind of HOPLN relationship in that it
behaves very much like FOPLN relationships, via substitution of predicates in place of simple
terms. Since our agent knows, for example,
Implication
is_Fluffy
AND is_furry purrs
and
Implication
AND is_furry purrs
is_cat
the agent could then use the deduction rule to conclude
Implication is_Fluffy is_cat
PLN supports a variety of quantifiers, including traditional crisp and fuzzy quantifiers, plus
the AverageQuantifier defined so that the truth value of
AverageQuantifier X F(X)
is a weighted average of F(X) over all relevant inputs X. AverageQuantifier is used implicitly
in PLN to handle logical relationships between predicates, so that e.g. the conclusion of the
above deduction is implicitly interpreted as
AverageQuantifier X
Implication
Evaluation is_Fluffy X
Evaluation is_cat X
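The weighted-average semantics of AverageQuantifier can be sketched numerically (the particular weighting scheme shown is an illustrative assumption):

```python
# Sketch of AverageQuantifier semantics: the truth value of
# "AverageQuantifier X F(X)" as a weighted average of F(X) over
# relevant inputs X. The weighting scheme is an illustrative assumption.

def average_quantifier(f_values, weights):
    """Weighted average of F(X) over the relevant inputs X."""
    total = sum(weights)
    return sum(f * w for f, w in zip(f_values, weights)) / total

# F(X) evaluated on three relevant inputs, weighted by their relevance
print(average_quantifier([1.0, 0.5, 0.0], [2.0, 1.0, 1.0]))  # -> 0.625
```

In practice the weights would reflect how relevant each input X is to the agent's experience, rather than being fixed constants.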
We can now connect PLN with the SRAM model (defined in Chapter 7 of Part 1).
Suppose for instance that the agent observes Fluffy from across the room, and that it has
previously learned a Fetch procedure that tells it how to obtain an entity once it sees that
entity. Then, if the agent has the goal of finding a cat, and it has concluded based on the above
deduction that Fluffy is indeed a cat (since it is observed to be furry and to purr), the cognitive
schematic (knowledge of the form Context & Procedure → Goal, as explained in Chapter 8 of
Part 1) may suggest that it execute the Fetch procedure.
34.4.1 Reducing HOPLN to FOPLN
In [GMIH08] it is shown that in principle, over any finite observation set, HOPLN reduces to
FOPLN. The key ideas of this reduction are the elimination of variables via use of higher-order
functions, and the use of the set-theoretic definition of function embodied in the SatisfyingSet
operator to map function-argument relationships into set-member relationships.
As an example, consider the Implication link. In HOPLN, where X is a variable
Implication
R1 A X
R2 B X
may be reduced to
Inheritance
SatisfyingSet(R1 A X)
SatisfyingSet(R2 B X)
where e.g. SatisfyingSet(R1 A X) is the fuzzy set of all X satisfying the relationship R1(A, X).
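The set-theoretic reading of SatisfyingSet can be sketched over a finite observation domain (all names and membership degrees below are illustrative):

```python
# Sketch of the SatisfyingSet reduction: over a finite domain, a predicate
# becomes a fuzzy set, so Implication between predicates reduces to
# Inheritance between satisfying sets. Names and degrees are illustrative.

def satisfying_set(predicate, domain):
    """Fuzzy set {x: degree} of all x satisfying the predicate."""
    return {x: predicate(x) for x in domain}

domain = ["fluffy", "rex", "tweety"]
is_furry = lambda x: {"fluffy": 0.9, "rex": 0.8, "tweety": 0.1}[x]
print(satisfying_set(is_furry, domain)["fluffy"])  # -> 0.9
```

An Inheritance link between two such fuzzy sets then carries the same information as the original Implication between the predicates, which is the essence of the reduction.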
Furthermore in Appendix ??, we show how experience-based possible world semantics can
be used to reduce PLN's existential and universal quantifiers to standard higher order PLN
relationships using AverageQuantifier relationships. This completes the reduction of HOPLN to
FOPLN in the SRAM context.
One may then wonder why it makes sense to think about HOPLN at all. The answer is that it
provides compact expression of a specific subset of FOPLN expressions, which is useful in cases
where agents have limited memory and these particular expressions provide practical value
(biasing the agent's reasoning to perform as well at higher orders as at first order).
34.5 Predictive Implication and Attraction
This section briefly reviews the notions of predictive implication and predictive attraction,
which are critical to many aspects of CogPrime dynamics including goal-oriented behavior.
Define
Attraction A B <s>
as P(B|A) - P(B|¬A) = s, or in node and link terms
s = (Inheritance A B).s - (Inheritance ¬A B).s
For instance
(Attraction fat pig).s =
(Inheritance fat pig).s - (Inheritance ¬fat pig).s
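Estimating an Attraction strength from raw counts can be sketched as follows (the counts and names are illustrative):

```python
# Sketch of the Attraction strength s = P(B|A) - P(B|not A),
# estimated from a joint contingency table. Counts are illustrative.

def attraction(n_ab, n_a, n_b_not_a, n_not_a):
    """P(B|A) - P(B|not A) from raw counts."""
    return n_ab / n_a - n_b_not_a / n_not_a

# Of 10 fat things observed, 8 are pigs; of 40 non-fat things, 4 are pigs.
print(attraction(8, 10, 4, 40))  # -> 0.7 (up to float rounding)
```

A positive value means A raises the probability of B relative to its absence; zero means A carries no differential information about B.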
Relatedly, in the temporal domain, we have the link type PredictiveImplication, where
Predictivelmplication A B <s>
roughly means that s is the probability that
Implication A B <s>
holds and also A occurs before B. More sophisticated versions of Predictivelmplication come
along with more specific information regarding the time lag between A and B: for instance a
time interval T in which the lag must lie, or a probability distribution governing the lag between
the two events.
We may then introduce
PredictiveAttraction A B <s>
to mean
s = (PredictiveImplication A B).s - (PredictiveImplication ¬A B).s
For instance
(PredictiveAttraction kiss_Ben be_happy).s =
(PredictiveImplication kiss_Ben be_happy).s
- (PredictiveImplication ¬kiss_Ben be_happy).s
This is what really matters in terms of determining whether kissing Ben is worth doing in
pursuit of the goal of being happy: not just how likely you are to be happy if you kiss Ben, but
how differentially likely you are to be happy if you kiss Ben.
Along with predictive implication and attraction, sequential logical operations are important,
represented by operators such as SequentialAND, SimultaneousAND and SimultaneousOR. For
instance:
PredictiveAttraction
SequentialAND
Teacher says 'fetch'
I get the ball
I bring the ball to the teacher
I get a reward
combines SequentialAND and PredictiveAttraction. In this manner, an arbitrarily complex
system of serial and parallel temporal events can be constructed.
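A crisp check of SequentialAND over time-stamped events can be sketched as follows (the event representation is an illustrative assumption; a full treatment would use uncertain truth values and lag distributions as described above):

```python
# Sketch of crisp SequentialAND over time-stamped events: the conjunction
# holds when every event occurs and each strictly precedes the next.
# The (name, time) event representation is an illustrative assumption.

def sequential_and(events):
    """events: list of (name, time) pairs; True if strictly ordered."""
    times = [t for _, t in events]
    return all(a < b for a, b in zip(times, times[1:]))

episode = [("teacher_says_fetch", 0.0), ("get_ball", 1.2),
           ("bring_ball", 3.5), ("reward", 4.0)]
print(sequential_and(episode))  # -> True
```

SimultaneousAND would instead require the event times to fall within a common interval rather than in strict succession.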
34.6 Confidence Decay
PLN is all about uncertain truth values, yet there is an important kind of uncertainty it doesn't
handle explicitly and completely in its standard truth value representations: the decay of infor-
mation with time.
PLN does have an elegant mechanism for handling this: in the <s,d> formalism for truth
values, strength s may remain untouched by time (except as new evidence specifically corrects
it), but d may decay over time. So, our confidence in our old observations decreases with time.
In the indefinite probability formalism, what this means is that old truth value intervals get
wider, but retain the same mean as they had back in the good old days.
But the tricky question is: How fast does this decay happen?
This can be highly context-dependent.
For instance, 20 years ago we learned that the electric guitar is the most popular instrument
in the world, and also that there are more bacteria than humans on Earth. The former fact is
no longer true (keyboard synthesizers have outpaced electric guitars), but the latter is. And,
if you'd asked us 20 years ago which fact would be more likely to become obsolete, we would
have answered the former - because we knew particulars of technology would likely change far
faster than basic facts of earthly ecology.
On a smaller scale, it seems that estimating confidence decay rates for different sorts of
knowledge in different contexts is a tractable data mining problem, that can be solved via
the system keeping a record of the observed truth values of a random sampling of Atoms as
they change over time. (Operationally, this record may be maintained in parallel with the
SystemActivityTable and other tables maintained for purposes of effort estimation, attention
allocation and credit assignment.) If the truth values of a certain sort of Atom in a certain
context change a lot, then the confidence decay rate for Atoms of that sort should be increased.
This can be quantified nicely using the indefinite probabilities framework.
For instance, we can calculate, for a given sort of Atom in a given context, separate b-level
credible intervals for the L and U components of the Atom's truth value at time t-r, centered
about the corresponding values at time t. (This would be computed by averaging over all t
values in the relevant past, where the relevant past is defined as some particular multiple of r;
and over a number of Atoms of the same sort in the same context.)
Since historically-estimated credible-intervals won't be available for every exact value of r,
interpolation will have to be used between the values calculated for specific values of r.
Also, while separate intervals for L and U would be kept for maximum accuracy, for reasons
of pragmatic memory efficiency one might want to maintain only a single number x, considered
as the radius of the confidence interval about both L and U. This could be obtained by averaging
together the empirically obtained intervals for L and U.
Then, when updating an Atom's truth value based on a new observation, one performs a
revision of the old TV with the new, but before doing so, one first widens the interval for the
old one by the amounts indicated by the above-mentioned credible intervals.
For instance, if one gets a new observation about A with TV (L_new, U_new), and the prior
TV of A, namely (L_old, U_old), is 2 weeks old, then one may calculate that L_old should
really be considered as:

(L_old - x, L_old + x)

and U_old should really be considered as:

(U_old - x, U_old + x)

so that (L_new, U_new) should actually be revised with:

(L_old - x, U_old + x)

to get the total:

(L, U)

for the Atom after the new observation.
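The widening-then-revising procedure can be sketched as follows; note that the simple midpoint revision used here is an illustrative placeholder for PLN's actual revision rule, and all names are ours:

```python
# Sketch of the interval-widening step for confidence decay: before
# revising an old indefinite truth value with a new observation, widen
# the old interval's bounds by the empirically estimated radius x.
# The midpoint "revision" below is an illustrative placeholder for
# PLN's actual revision rule.

def widen(interval, x):
    L, U = interval
    return (L - x, U + x)

def revise(old, new):
    """Placeholder revision: midpoint of the two intervals' bounds."""
    return ((old[0] + new[0]) / 2, (old[1] + new[1]) / 2)

old_tv = (0.6, 0.8)   # interval observed two weeks ago
x = 0.05              # decay radius estimated by data mining
new_tv = (0.7, 0.9)
print(revise(widen(old_tv, x), new_tv))  # roughly (0.625, 0.875)
```

The key point is the ordering: widening happens first, so older evidence enters the revision with an honestly broadened interval rather than its original, now-overconfident one.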
Note that we have referred fuzzily to "sort of Atom" rather than "type of Atom" in the above.
This is because Atom type is not really the right level of specificity to be looking at. Rather -
as in the guitar vs. bacteria example above - confidence decay rates may depend on semantic
categories, not just syntactic (Atom type) categories. To give another example, confidence in
the location of a person should decay more quickly than confidence in the location of a building.
So ultimately confidence decay needs to be managed by a pool of learned predicates, which are
applied periodically. These predicates are mainly to be learned by data mining, but inference
may also play a role in some cases.
The ConfidenceDecay MindAgent must take care of applying the confidence-decaying pred-
icates to the Atoms in the AtomTable, periodically.
The ConfidenceDecayUpdater MindAgent must take care of:
• forming new confidence-decaying predicates via data mining, and then revising them with
the existing relevant confidence-decaying predicates.
• flagging confidence-decaying predicates which pertain to important Atoms but are uncon-
fident, by giving them STICurrency, so as to make it likely that they will be visited by
inference.
34.6.1 An Example
As an example of the above issues, consider that the confidence decay of:
Inh Ari male
should be low whereas that of:
Inh Ari tired
should be higher, because we know that for humans, being male tends to be a more permanent
condition than being tired.
This suggests that concepts should have context-dependent decay rates, e.g. in the context
of humans, the default decay rate of maleness is low whereas the default decay rate of tired-ness
is high.
However, these defaults can be overridden. For instance, one can say "As he passed through
his 80's, Grandpa just got tired, and eventually he died." This kind of tiredness, even in the
context of humans, does not have a rapid decay rate. This example indicates why the confidence
decay rate of a particular Atom needs to be able to override the default.
In terms of implementation, one mechanism to achieve the above example would be as follows.
One could incorporate an interval confidence decay rate as an optional component of a truth
value. As noted above one can keep two separate intervals for the L and U bounds; or to simplify
things one can keep a single interval and apply it to both bounds separately.
Then, e.g., to define the decay rate for tiredness among humans, we could say:
ImplicationLink_HOJ
InheritanceLink $X human
InheritanceLink $X tired <confidenceDecay = [0,.1]>
or else (preferably):
ContextLink
human
InheritanceLink $X tired <confidenceDecay = [0,.1]>
Similarly, regarding maleness we could say:
ContextLink
human
Inh $X male <confidenceDecay = [0,.00001]>
Then one way to express the violation of the default in the case of grandpa's tiredness would
be:
InheritanceLink grandpa tired <confidenceDecay = [0,.001]>
(Another way to handle the violation from default, of course, would be to create a separate
Atom:
tired_from_old_age
and consider this as a separate sense of "tired" from the normal one, with its own confidence
decay setting.)
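The default-plus-override lookup suggested by these examples can be sketched as follows (the table layout and all names are ours, not an actual implementation):

```python
# Sketch of context-dependent confidence-decay lookup with per-Atom
# overrides, mirroring the ContextLink examples in the text.
# Table layout and names are illustrative assumptions.

DEFAULT_DECAY = {("human", "male"): 0.00001,
                 ("human", "tired"): 0.1}

OVERRIDES = {("grandpa", "tired"): 0.001}  # the tired-from-old-age case

def decay_rate(subject, predicate, context):
    """Per-Atom override first, then the contextual default."""
    if (subject, predicate) in OVERRIDES:
        return OVERRIDES[(subject, predicate)]
    return DEFAULT_DECAY.get((context, predicate), 0.0)

print(decay_rate("ari", "tired", "human"))      # -> 0.1
print(decay_rate("grandpa", "tired", "human"))  # -> 0.001
```

In the actual system, both tables would be populated and revised by learned predicates, as described below, rather than hand-coded.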
In this example we see that, when a new Atom is created (e.g. InheritanceLink Ari tired),
it needs to be assigned a confidence decay rate via inference based on relations such as the
ones given above (this might be done e.g. by placing it on the queue for immediate attention
by the ConfidenceDecayUpdater MindAgent). And periodically its confidence decay rate could
be updated based on ongoing inferences (in case relevant abstract knowledge about confidence
decay rates changes). Making this sort of inference reasonably efficient might require creating a
special index containing abstract relationships that tell you something about confidence decay
adjustment, such as the examples given above.
34.7 Why is PLN a Good Idea?
We have explored the intersection of the family of conceptual and formal structures that is
PLN, with a specific formal model of intelligent agents (SRAM) and its extension using the
cognitive schematic. The result is a simple and explicit formulation of PLN as a system by
which an agent can manipulate tokens in its memory, and thus represent observed and conjectured
relationships (between its observations and between other relationships), in a way that assists
it in choosing actions according to the cognitive schematic.
We have not, however, rigorously answered the question: What is the contribution of PLN
to intelligence, within the formal agents framework introduced above? This is a quite subtle
question, to which we can currently offer only an intuitive answer, not a rigorous one.
Firstly, there is the question of whether probability theory is really the best way to manage
uncertainty, in a practical context. Theoretical results like those of Cox [Cox61] and de Finetti
[dF37] demonstrate that probability theory is the optimal way to handle uncertainty, if one
makes certain reasonable assumptions. However, these reasonable assumptions don't actually
apply to real-world intelligent systems, which must operate with relatively severe computa-
tional resource constraints. For example, one of Cox's axioms dictates that a reasoning system
must assign the same truth value to a statement, regardless of the route it uses to derive the
statement. This is a nice idealization, but it can't be expected of any real-world, finite-resources
reasoning system dealing with a complex environment. So an open question exists, as to whether
probability theory is actually the best way for practical AGI systems to manage uncertainty.
Most contemporary AI researchers assume the answer is yes, and probabilistic AI has achieved
increasing popularity in recent years. However, there are also significant voices of dissent, such
as Pei Wang [Wan06] in the AGI community, and many within the fuzzy logic community.
PLN is not strictly probabilistic, in the sense that it combines formulas derived rigorously
from probability theory with others that are frankly heuristic in nature. PLN was created in a
spirit of open-mindedness regarding whether probability theory is actually the optimal approach
to reasoning under uncertainty using limited resources, versus merely an approximation to the
optimal approach in this case. Future versions of PLN might become either more or less strictly
probabilistic, depending on theoretical and practical advances.
Next, aside from the question of the practical value of probability theory, there is the question
of whether PLN in particular is a good approach to carrying out significant parts of what an
AGI system needs to do, to achieve human-like goals in environments similar to everyday human
environments.
Within a cognitive architecture where explicit utilization of the cognitive schematic (Context
& Procedure → Goal) is useful, clearly PLN is useful if it works reasonably well - so this
question partially reduces to: in what environments are agents relying on the cognitive
schematic intelligent, according to formal intelligence measures like those defined in Chapter
7 of Part 1? And then there is the possibility that some uncertain reasoning formalism besides
PLN could be even more useful in the context of the cognitive schematic.
In particular, the question arises: What are the unique, peculiar aspects of PLN that make
it more useful, in the context of the cognitive schematic, than some other, more straightforward
approach to probabilistic inference? Actually there are multiple such aspects that we believe
make it particularly useful. One is the indefinite probability approach to truth values, which
we believe is more robust for AGI than known alternatives. Another is the clean reduction of
higher order logic (as defined in PLN) to first-order logic (as defined in PLN), and the utilization
EFTA00624407
34.7 Why is PLN a Good Idea? 261
of term logic instead of predicate logic wherever possible — these aspects make PLN inferences
relatively simple in most cases where, according to human common sense, they should be simple.
A relatively subtle issue in this regard has to do with PLN intension. The cognitive schematic
is formulated in terms of PredictiveExtensionalImplication (or an equivalent form such as
PredictiveExtensionalAttraction), which means that intensional PLN links are not required
for handling it. The hypothesis of the usefulness of intensional PLN links embodies a subtle
assumption about the nature of the environments that intelligent agents are operating in. As
discussed in [Goe06], it requires an assumption related to Peirce's philosophical axiom of the
"tendency to take habits," which posits that in the real world, entities possessing some similar
patterns have a probabilistically surprising tendency to have more similar patterns.
Reflecting on these various theoretical subtleties and uncertainties, one may get the feeling
that the justification for applying PLN in practice is quite insecure! However, it must be noted
that no other AI formalism has a significantly better foundation, at present. Every AI method
involves certain heuristic assumptions, and the applicability of these assumptions in real life is
nearly always a matter of informal judgment and copious debate. Even a very rigorous technique,
like a crisp logic formalism or support vector machines for classification, requires non-rigorous
heuristic assumptions to be applied to the real world (how do sensation and actuation get
translated into logic formulas, or SVM feature vectors?). It would be great if it were possible to
use rigorous mathematical theory to derive an AGI design, but that's not the case right now,
and the development of this sort of mathematical theory seems quite a long way off. So for now,
we must proceed via a combination of mathematics, practice and intuition.
In terms of demonstrated practical utility, PLN has not yet confronted any really ambitious
AGI-type problems, but it has shown itself capable of simple practical problem-solving in areas
such as virtual agent control and natural-language-based scientific reasoning [GMIH08]. The
current PLN implementation within CogPrime can be used to learn to play fetch or tag, draw
analogies based on observed objects, or figure out how to carry out tasks like finding a cat.
We expect that further practical applications, as well as very ambitious AGI development, can
be successfully undertaken with PLN without a theoretical understanding of exactly which
properties of the environments and goals involved allow PLN to be effective. However,
we expect that a deeper theoretical understanding may enable various aspects of PLN to be
adjusted in a more effective manner.
Chapter 35
Spatiotemporal Inference
35.1 Introduction
Most of the problems and situations humans confront every day involve space and time explicitly
and centrally. Thus, any AGI system aspiring to humanlike general intelligence must have some
reasonably efficient and general capability to solve spatiotemporal problems. Regarding how
this capability might get into the system, there is a spectrum of possibilities, ranging from rigid
hard-coding to tabula rasa experiential learning. Our bias in this regard is that it's probably
sensible to somehow "wire into" CogPrime some knowledge regarding space and time - these
being, after all, very basic categories for any embodied mind confronting the world.
It's arguable whether the explicit insertion of prior knowledge about spacetime is necessary
for achieving humanlike AGI using feasible resources. As an argument against the necessity
of this sort of prior knowledge, Ben Kuipers and his colleagues [SMK11] have shown that
an AI system can learn via experience that its perceptual stream comes from a world with
three, rather than two or four dimensions. There is a long way from learning the number of
dimensions in the world to learning the full scope of practical knowledge needed for effectively
reasoning about the world - but it does seem plausible, from their work, that a broad variety
of spatiotemporal knowledge could be inferred from raw experiential data. On the other hand,
it also seems clear that the human brain does not do it this way, and that a rich fund of
spatiotemporal knowledge is "hard-coded" into the brain by evolution - often in ways so low-
level that we take them for granted, e.g. the way some motion detection neurons fire in the
physical direction of motion, and the way somatosensory cortex presents a distorted map of the
body's surface. On a psychological level, it is known that some fundamental intuition for space
and time is hard-coded into the human infant's brain [Joh05]. So while we consider the learning
of basic spatiotemporal knowledge from raw experience a worthy research direction, and fully
compatible with the CogPrime vision, for our main current research we have chosen to
hard-wire some basic spatiotemporal knowledge.
If one does wish to hard-wire some basic spatiotemporal knowledge into one's AI system,
multiple alternate or complementary methodologies may be used to achieve this, including spa-
tiotemporal logical inference, internal simulation, or techniques like recurrent neural nets whose
dynamics defy simple analytic explanation. Though our focus in this chapter is on inference, we
must emphasize that inference, even very broadly conceived, is not the only way for an intelli-
gent agent to solve spatiotemporal problems occurring in its life. For instance, if the agent has
a detailed map of its environment, it may be able to answer some spatiotemporal questions by
directly retrieving information from the map. Or, logical inference may be substituted or aug-
mented by (implicitly or explicitly) building a model that satisfies the initial knowledge - either
abstractly or via incorporating "visualization" connected to sensory memory - and then inter-
preting new knowledge over that model instead of inferring it. The latter is one way to interpret
what DeSTIN and other CSDLNs do; indeed, DeSTIN's perceptual hierarchy is often referred to
as a "state inference hierarchy." Any CSDLN contains biasing toward the commonsense struc-
ture of space and time, in its spatiotemporal hierarchical structure. It seems plausible that the
human mind uses a combination of multiple methods for spatiotemporal understanding, just as
we intend CogPrime to do.
In this chapter we focus on spatiotemporal logical inference, addressing the problem of creat-
ing a spatiotemporal logic adequate for use within an AGI system that confronts the same sort
of real-world problems that humans typically do. The idea is not to fully specify the system's understanding of space and time in advance, but rather to provide some basic spatiotemporal logic
rules, with parameters to be adjusted based on experience, and the opportunity for augmenting
the logic over time with experientially-acquired rules. Most of the ideas in this chapter are
reviewed in more detail, with more explanation, in the book Real World Reasoning [MC+11]; this chapter represents a concise summary, compiled with the AGI context specifically in mind.
A great deal of excellent work has already been done in the areas of spatial, temporal and
spatiotemporal reasoning; however, this work does not quite provide an adequate foundation
for a logic-incorporating AGI system to do spatiotemporal reasoning, because it does not ade-
quately incorporate uncertainty. Our focus here is to extend existing spatiotemporal calculi to
appropriately encompass uncertainty, which we argue is sufficient to transform them into an
AGI-ready spatiotemporal reasoning framework. We also find that a simple extension of the
standard PLN uncertainty representations, inspired by P(Z)-logic [Yan10], allows more elegant
expression of probabilistic fuzzy predicates such as arise naturally in spatiotemporal logic.
In the final section of the chapter, we discuss the problem of planning, which has been considered extensively in the AI literature. We describe an approach to planning that incorporates PLN inference using spatiotemporal logic, along with MOSES as a search method, and some record-keeping methods inspired by traditional AI planning algorithms.
35.2 Related Work on Spatio-temporal Calculi
We now review several calculi that have previously been introduced for representing and reasoning about space, time, and space-time combined.
Spatial Calculi
Calculi dealing with space usually model three types of relationships between spatial regions:
topological, directional and metric.
The most popular calculus dealing with topology is the Region Connection Calculus (RCC) [RCC92], relying on a base relationship C (for Connected) and building up other relationships from it, like P (for PartOf) or O (for Overlap). For instance P(X, Y), meaning X is a part of Y, can be defined using C as follows:
P(X, Y) iff ∀Z ∈ U, C(Z, X) → C(Z, Y)   (35.1)
where U is the universe of regions. RCC-8 models eight base relationships; see Figure 35.1.

Fig. 35.1: The eight base relationships of RCC-8: DC(X, Y), EC(X, Y), TPP(X, Y), NTPP(X, Y), PO(X, Y), EQ(X, Y), TPPi(X, Y), NTPPi(X, Y)

Using the notion of convexity, it is possible to model more relationships such as inside, partially inside and outside; see Figure 35.2. For instance RCC-23 is an extension of RCC-8 using relationships based on the notion of convexity.

Fig. 35.2: Additional relationships using convexity: Inside(X, Y), P-Inside(X, Y), Outside(X, Y)

The 9-intersection calculus [WM95, Kur09] is another calculus for reasoning on topological relationships, but handling relationships between
heterogeneous objects, points, lines, surfaces.
Regarding reasoning about direction, the Cardinal Direction Calculus [GE01, ZLL08] considers directional relationships between regions, to express propositions such as "region A is to the north of region B".
And finally regarding metric reasoning, spatial reasoning involving qualitative distance (such
as close, medium, far) and direction combined is considered in [CFH97].
Some work has also been done to extend and combine these various calculi, such as combining RCC-8 and the Cardinal Direction Calculus [MAM], or using size [GR00] or shape [Coh95] information in RCC.
Temporal Calculi
The best known temporal calculus is Allen's Interval Algebra, which considers 13 relationships over time intervals, such as Before, During, Overlap, Meet, etc. For instance one can express that digestion occurs after or right after eating by

Before(Eat, Digest) ∨ Meet(Eat, Digest)

equivalently denoted Eat{Before, Meet}Digest. There also exists a generalization of Allen's Interval Algebra that works on semi-intervals [Fre92], which are intervals with possibly undefined start or end.
There are modal temporal logics such as LTL and CTL, mostly used to check temporal properties of concurrent systems, such as deadlock-freedom or fairness, using Model Checking [Mai00].
Calculi with Space and Time Combined
There exist calculi combining space and time, first of all those obtained by "temporizing" spatial
calculus, that is tagging spatial predicates with timestamps or time intervals. For instance STCC
(for Spatio-temporal Constraint Calculus) [GN02] is basically RCC-8 combined with Allen's
Algebra. With STCC one can express spatiotemporal propositions such as
Meet(DC( Finger, Key), EC( Finger, Key))
which means that the interval during which the finger is away from the key meets the interval
during which the finger is against the key.
Another way to combine space and time is by modeling motion; e.g. the Qualitative Trajectory Calculus (QTC) [WKB05] can be used to express whether two objects are going forward/backward or left/right relative to each other.
Uncertainty in Spatio-temporal Calculi
In many situations it is worthwhile or even necessary to consider non-crisp extensions of these
calculi. For example it is not obvious how one should consider in practice whether two regions
are connected or disconnected. A desk against the wall would probably be considered connected
to it, even if there is a small gap between the wall and the desk. Or if A is not entirely part of B, it may still be valuable to consider to what extent it is, rather than formally rejecting PartOf(A, B). There are several ways to deal with such phenomena; one way is to consider probabilistic or fuzzy extensions of spatiotemporal calculi.
For instance in [SDCK08b, SDCK08a] the RCC relationship C (for Connected) is replaced by a fuzzy predicate representing closeness between regions, and all other relationships based on it are extended accordingly. So e.g. DC (for Disconnected) is defined as follows

DC(X, Y) = 1 − C(X, Y)   (35.2)

P (for PartOf) is defined as

P(X, Y) = inf_{Z ∈ U} I(C(Z, X), C(Z, Y))   (35.3)

where I is a fuzzy implication with some natural properties (usually I(x1, x2) = max(1 − x1, x2)). Or, EQ (for Equal) is defined as
EQ(X, Y) = min(P(X, Y), P(Y, X)) (35.4)
and so on.
However, the inference rules cannot determine the exact fuzzy values of the resulting relationships, but only a lower bound; for instance

T(P(X, Y), P(Y, Z)) ≤ P(X, Z)   (35.5)

where T(x1, x2) = max(0, x1 + x2 − 1). This is to be expected, since in order to know the resulting
fuzzy value one would need to know the exact spatial configuration. For instance Figure 35.3
depicts 2 possible configurations that would result in 2 different values of P(X, Z).
Fig. 35.3: Depending on where Z is (in dashed line), P(X, Z) gets a different value.
One way to address this difficulty is to reason with interval-valued fuzzy logic [DP00], with the downside of ending up with wide intervals. For example, applying the same inference rule from Equation 35.5 in the case depicted in Figure 35.4 would result in the interval [0, 1], corresponding to a state of total ignorance. This is the main reason why, as explained in the next section, we have decided to use distributional fuzzy values for our AGI-oriented spatiotemporal reasoning.
There also exist attempts to use probability with RCC. For instance in [Win00], RCC relationships are extracted from computer images and weighted based on their likelihood as estimated by a shape recognition algorithm. However, to the best of our knowledge, no one has used distributional fuzzy values [Yan10] in the context of spatiotemporal reasoning; and we believe this is important for the adaptation of spatiotemporal calculi to the AGI context.
35.3 Uncertainty with Distributional Fuzzy Values
P(Z)-logic [Yan10] is an extension of fuzzy logic that considers distributions of fuzzy values rather than mere fuzzy values. That is, fuzzy connectors are extended to apply over probability density functions of fuzzy truth values. For instance the negation connector ¬ (often defined as ¬x = 1 − x) is extended such that the resulting distribution µ¬ : [0,1] → R+ is

µ¬(x) = µ(1 − x)   (35.6)

where µ is the probability density function of the unique argument. Similarly, one can define µ∧ : [0,1] → R+ as the resulting density function of the connector x1 ∧ x2 = min(x1, x2) over the 2 arguments µ1 : [0,1] → R+ and µ2 : [0,1] → R+:
µ∧(x) = µ1(x) ∫_x^1 µ2(x2) dx2 + µ2(x) ∫_x^1 µ1(x1) dx1   (35.7)
See [Yan10] for the justification of Equations 35.6 and 35.7.
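As a concrete illustration of Equations 35.6 and 35.7 (our own sketch, not code from the P(Z)-logic literature), the extended connectors can be computed directly on densities discretized over a uniform grid on [0, 1]:

```python
def normalize(mu, dx):
    """Rescale a sampled density so it integrates (approximately) to 1 over [0, 1]."""
    total = sum(mu) * dx
    return [v / total for v in mu]

def extend_not(mu):
    """Equation 35.6: the density of the negation is mu_not(x) = mu(1 - x);
    on a uniform grid this is just the reversed list of samples."""
    return list(reversed(mu))

def extend_and(mu1, mu2, dx):
    """Equation 35.7: the density of min(x1, x2) for independent arguments,
    mu_and(x) = mu1(x) * int_x^1 mu2 + mu2(x) * int_x^1 mu1."""
    n = len(mu1)

    def upper_tails(mu):
        # tails[i] approximates the integral of mu from x_i to 1
        acc, tails = 0.0, [0.0] * n
        for i in range(n - 1, -1, -1):
            acc += mu[i] * dx
            tails[i] = acc
        return tails

    t1, t2 = upper_tails(mu1), upper_tails(mu2)
    return [mu1[i] * t2[i] + mu2[i] * t1[i] for i in range(n)]
```

With uniform input densities, the output of extend_and concentrates near 0, as expected for a min; refining the grid shrinks the discretization error.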
Besides extending the traditional fuzzy operators, one can also define a wider class of connectors that can fully modulate the output distribution. Let F : [0,1]^n → ([0,1] → R+) be an n-ary connector that takes n fuzzy values and returns a probability density function. In that case the probability density function resulting from the extension of F over distributional fuzzy values is:

µ_F(x) = ∫_0^1 ··· ∫_0^1 F(x1, ..., xn)(x) µ1(x1) ··· µn(xn) dx1 ··· dxn   (35.8)

where µ1, ..., µn are the n input arguments. That is, it is the average of all density functions output by F applied over all fuzzy input values. Let us call this type of connector fuzzy-probabilistic.
In the following we give an example of such a fuzzy-probabilistic connector.
Example with PartOf
Let us consider the RCC relationship PartOf (P for short), as defined in Equation 35.1. A typical inference rule in the crisp case would be:

P(X, Y)   P(Y, Z)
―――――――――――――――   (35.9)
P(X, Z)

expressing the transitivity of P. But using distributions of fuzzy values we would have the following rule:

P(X, Y) (µ1)   P(Y, Z) (µ2)
―――――――――――――――   (35.10)
P(X, Z) (µPOT)

POT stands for PartOf Transitivity. The definition of µPOT for that particular inference rule may depend on many assumptions, like the shapes and sizes of the regions X, Y and Z. In the following we will give an example of a definition of µPOT with respect to some oversimplified assumptions, chosen to keep the example short.
Let us define the fuzzy variant of PartOf(X, Y) as the proportion of X which is part of Y (as suggested in [Yan10]). Let us also assume that every region is a unit circle. In this case, the required proportion depends solely on the distance dXY between the centers of X and Y, so we may define a function f that takes that distance and returns the corresponding fuzzy value; that is, f(dXY) = P(X, Y):
f(dXY) = (4α − 2 dXY sin(α)) / (2π)   if 0 ≤ dXY ≤ 2
f(dXY) = 0                            if dXY > 2
(35.11)
where α = cos⁻¹(dXY / 2).
For 0 ≤ dXY ≤ 2, f(dXY) is monotone decreasing, so the inverse of f, which takes a fuzzy value and returns a distance, is a function f⁻¹ : [0,1] → [0,2].
Let xXY = P(X, Y), xYZ = P(Y, Z), x = P(X, Z), dXY = f⁻¹(xXY), dYZ = f⁻¹(xYZ), l = |dXY − dYZ| and u = dXY + dYZ. For dXY and dYZ fixed, let g : [0, π] → [l, u] be a function that takes as input the angle θ between the two lines from the center of Y to X and from Y to Z (as depicted in Figure 35.4) and returns the distance dXZ. g is defined as follows:

g(θ) = √((dXY − dYZ cos(θ))² + (dYZ sin(θ))²)

So l ≤ dXZ ≤ u. It is easy to see that g is monotone increasing and surjective; therefore there exists an inverse function g⁻¹ : [l, u] → [0, π]. Let h = f ∘ g, so h takes an angle as input and
Fig. 35.4: dXZ, in dashed line, for 3 different angles
returns a fuzzy value, h : [0, π] → [0, 1]. Since f is monotone decreasing and g is monotone increasing, h is monotone decreasing. Note that the codomain of h is [0, f(l)] if l < 2, or {0} otherwise. Assuming that l < 2, the inverse of h is a function with the signature h⁻¹ : [0, f(l)] → [0, π]. Using h⁻¹, and assuming that the probability of picking θ ∈ [0, π] is uniform, we can define the binary connector POT. Let us define v = POT(xXY, xYZ), recalling that POT returns a density function, and assuming x < f(l):
v(x) = lim_{δ→0} (h⁻¹(x − δ) − h⁻¹(x + δ)) / (2πδ) = −h⁻¹′(x) / π   (35.12)

where h⁻¹′ is the derivative of h⁻¹. If x ≥ f(l) then v(x) = 0. For the sake of simplicity the exact expressions of h⁻¹ and v(x) have been left out, and the case where one of the fuzzy arguments
xXY, xYZ, or both are null has not been considered, but would be treated similarly, assuming some probability distribution over the distances dXY and dYZ.
It is now possible to define µPOT in rule 35.10 (following Equation 35.8):

µPOT(x) = ∫_0^1 ∫_0^1 POT(x1, x2)(x) µ1(x1) µ2(x2) dx1 dx2   (35.13)
Obviously, assuming that regions are unit circles is crude; in practice, regions might be of very different shapes and sizes. In fact it might be so difficult to choose the right assumptions (and, once chosen, to define POT correctly), that in a complex practical context it may be best to start with overly simplistic assumptions and then learn POT based on the experience of the agent. So the agent would initially perform spatial reasoning not too accurately, but would improve over time by adjusting POT, as well as the other connectors corresponding to other inference rules.
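To make the unit-circle example concrete, here is a Python sketch (our own illustration; the function names are ours) of f from Equation 35.11, its inverse by bisection, and Monte Carlo sampling of the POT output distribution by drawing the angle θ uniformly from [0, π]:

```python
import math
import random

def f(d):
    """Equation 35.11: fuzzy PartOf value for two unit circles whose centers
    are a distance d apart - the proportion of one circle covered by the other."""
    if d >= 2.0:
        return 0.0
    alpha = math.acos(d / 2.0)
    return (4.0 * alpha - 2.0 * d * math.sin(alpha)) / (2.0 * math.pi)

def f_inverse(x):
    """Invert f on [0, 2] by bisection; f is monotone decreasing there."""
    lo, hi = 0.0, 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if f(mid) > x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def sample_pot(x_xy, x_yz, n=10000, rng=None):
    """Sample the POT(x_xy, x_yz) output distribution: fix d_XY and d_YZ from
    the premise fuzzy values, draw the angle theta uniformly from [0, pi],
    compute d_XZ via the function g of the text, and map back through f."""
    rng = rng or random.Random(0)
    d_xy, d_yz = f_inverse(x_xy), f_inverse(x_yz)
    samples = []
    for _ in range(n):
        theta = rng.uniform(0.0, math.pi)
        d_xz = math.sqrt((d_xy - d_yz * math.cos(theta)) ** 2
                         + (d_yz * math.sin(theta)) ** 2)
        samples.append(f(d_xz))
    return samples
```

A histogram of sample_pot's output approximates the density v of Equation 35.12.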
It may also be useful to have more premises containing information about the sizes (e.g. Big(X)) and shapes (e.g. Long(Y)) of the regions, like

B(X) (µ1)   L(Y) (µ2)   P(X, Y) (µ3)   P(Y, Z) (µ4)
―――――――――――――――――――――――――――
P(X, Z) (µ)

where B and L stand respectively for Big and Long.
Simplifying Numerical Calculation
Using probability densities as described above is computationally expensive, and in many practical cases it is overkill. To decrease computational cost, several cruder approaches are possible, such as discretizing the probability density functions with a coarse resolution, or restricting attention to beta distributions and treating only their means and variances (as in [Yan10]). The right way to simplify depends on the fuzzy-probabilistic connector involved and on how much inaccuracy can be tolerated in practice.
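As an illustration of the beta-distribution simplification (a sketch under our own naming, not code from [Yan10]), a connector can be approximated by moment matching: sample the exact connector, then keep only the mean and variance of the result:

```python
import random

def beta_params(mean, var):
    """Method-of-moments fit of a Beta(a, b) distribution to a given mean and
    variance (valid when var < mean * (1 - mean))."""
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

def approx_min_connector(m1, v1, m2, v2, n=20000, rng=None):
    """Approximate the distributional extension of min over two beta-distributed
    fuzzy values, returning only the (mean, variance) of the output, which can
    then be refitted to a beta distribution via beta_params."""
    rng = rng or random.Random(0)
    a1, b1 = beta_params(m1, v1)
    a2, b2 = beta_params(m2, v2)
    xs = [min(rng.betavariate(a1, b1), rng.betavariate(a2, b2))
          for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var
```

The design choice here is the usual one for moment-matching schemes: the full output density is never materialized, only its first two moments, trading accuracy for constant-size truth values.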
35.4 Spatio-temporal Inference in PLN
We have discussed the representation of spatiotemporal knowledge, including associated uncertainty. But ultimately what matters is what an intelligent agent can do with this knowledge.
We now turn to uncertain reasoning based on uncertain spatiotemporal knowledge, using the
integration of the above-discussed calculi into the Probabilistic Logic Networks reasoning sys-
tem, an uncertain inference framework designed specifically for AGI and integrated into the
OpenCog AGI framework.
We give here a few examples of spatiotemporal inference rules coded in PLN. Although the current implementation of PLN incorporates both fuzziness and probability, it does not have a built-in truth value to represent distributional fuzzy values - or rather a distribution of distributions of fuzzy values, as this is how, in essence, confidence is represented in PLN. At this point, depending on design choices and experimentation, it is not clear whether we want to use the existing truth values and treat them as distributional truth values, or implement a new type of truth value dedicated for that purpose; so for our present theoretical purposes we will just call it DF Truth Value.
Due to the highly flexible HOJ formalism (Higher Order Judgment, explained in detail in the PLN book) we can express the inference rule for the relationship PartOf directly as Nodes and Links, as follows:
ForAllLink $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            PartOf($X, $Y) (tv1)
            PartOf($Y, $Z) (tv2)
        ANDLink
            tv3 = µPOT(tv1, tv2)
            PartOf($X, $Z) (tv3)
(35.14)
where µPOT is defined in Equation 35.13, but extended over the domain of PLN DF Truth Values instead of P(Z) distributional fuzzy values. Note that PartOf($X, $Y) (tv) is a shorthand for

EvaluationLink (tv)
    PartOf
    ListLink
        $X
        $Y
(35.15)
and ForAllLink $X $Y $Z is a shorthand for

ForAllLink
    ListLink
        $X
        $Y
        $Z
(35.16)
Of course one advantage of expressing the inference rule directly in Nodes and Links, rather than as a built-in PLN inference rule, is that we can use OpenCog itself to improve and refine it, or even create new spatiotemporal rules based on its experience. In the next 2 examples the fuzzy-probabilistic connectors are ignored (so no DF Truth Value is indicated), but one could define them similarly to µPOT.
First consider a temporal rule from Allen's Interval Algebra. For instance "if $I1 meets $I2 and $I3 is during $I2 then $I3 is after $I1" would be expressed as
ForAllLink $I1 $I2 $I3
    ImplicationLink
        ANDLink
            Meet($I1, $I2)
            During($I3, $I2)
        After($I3, $I1)
(35.17)
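The temporal rule above can be sanity-checked with a small Python sketch (our own illustration), modeling crisp intervals as (start, end) pairs and the three Allen relations involved:

```python
def meets(i1, i2):
    """Allen's Meet: i1 ends exactly where i2 starts."""
    return i1[1] == i2[0]

def during(i1, i2):
    """Allen's During: i1 lies strictly inside i2."""
    return i2[0] < i1[0] and i1[1] < i2[1]

def after(i1, i2):
    """Allen's After: i1 starts after i2 ends."""
    return i1[0] > i2[1]

# Rule 35.17: if $I1 meets $I2 and $I3 is during $I2, then $I3 is after $I1.
i1, i2, i3 = (0, 2), (2, 10), (4, 6)
assert meets(i1, i2) and during(i3, i2)
assert after(i3, i1)
```

In the PLN setting these crisp checks would of course be replaced by fuzzy-probabilistic connectors over DF Truth Values, as discussed above.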
And a last example with a metric predicate could be "if $X is near $Y and $X is far from $Z then $Y is far from $Z":
ForAllLink $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            Near($X, $Y)
            Far($X, $Z)
        Far($Y, $Z)
(35.18)
That is only a small and partial illustrative example - for instance, other rules may be used to specify that Near and Far are reflexive and symmetric.
35.5 Examples
The ideas presented here have extremely broad applicability; but for sake of concreteness, we
now give a handful of examples illustrating applications to commonsense reasoning problems.
35.5.1 Spatiotemporal Rules
The rules provided here are reduced to the strict minimum needed for the examples:
1. At $T, if $X is inside $Y and $Y is inside $Z then $X is inside $Z

ForAllLink $T $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            atTime($T, Inside($X, $Y))
            atTime($T, Inside($Y, $Z))
        atTime($T, Inside($X, $Z))
2. If a small object $X is over $Y and $Y is far from $Z then $X is far from $Z

ForAllLink $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            Small($X)
            Over($X, $Y)
            Far($Y, $Z)
        Far($X, $Z)
That rule is expressed in a crisp way but again is to be understood in an uncertain way, although
we haven't worked out the exact formulae.
35.5.2 The Laptop is Safe from the Rain
A laptop is over the desk in the hotel room, and the desk is far from the window; we want to assess to what extent the laptop is far from the window, and therefore safe from the rain.
Note that the truth values are ignored, but each concept is to be understood as fuzzy, that is, as having a PLN Fuzzy Truth Value; the numerical calculations are left out.
We want to assess how far the laptop is from the window:
Far(Window, Laptop)
Assuming the following
1. The laptop is small
Small(Laptop)
2. The laptop is over the desk
Over(Laptop, Desk)
3. The desk is far from the window
Far(Desk, Window)
Now we can show an inference trail that leads to the conclusion; the numerical calculations are left for later.
1. using axioms 1, 2, 3 and PLN AND rule
ANDLink
Small(Laptop)
Over(Laptop, Desk)
Far(Desk, Window)
2. using spatiotemporal rule 2, instantiated with $X = Laptop, $Y = Desk and $Z = Window
ImplicationLink_HOJ
ANDLink
Small(Laptop)
Over(Laptop, Desk)
Far(Desk, Window)
Far(Laptop, Window)
3. using the result of previous step as premise with PLN implication rule
Far(Laptop, Window)
35.5.3 Fetching the Toy Inside the Upper Cupboard
Suppose we know that there is a toy in an upper cupboard, near a bag, and we want to assess to what extent climbing on the pillow is going to bring us near the toy. Here are the assumptions:
1. The toy is near the bag and inside the cupboard. The pillow is near and below the cupboard
Near(toy, bag) (tv1)
Inside(toy, cupboard) (tv2)
Below(pillow, cupboard) (tv3)
Near(pillow, cupboard) (tv4)
2. The toy is near the bag inside the cupboard; how near is the toy to the edge of the cupboard?

ImplicationLink_HOJ
    ANDLink
        Near(toy, bag) (tv1)
        Inside(toy, cupboard) (tv2)
    ANDLink
        tv3 = µF1(tv1, tv2)
        Near(toy, cupboard_edge) (tv3)
3. If I climb on the pillow, then shortly after I'll be on the pillow
PredictiveImplicationLink
    Climb_on(pillow)
    Over(self, pillow)
4. If I am on the pillow near the edge of the cupboard, how near am I to the toy?

ImplicationLink_HOJ
    ANDLink
        Below(pillow, cupboard) (tv1)
        Near(pillow, cupboard) (tv2)
        Over(self, pillow) (tv3)
        Near(toy, cupboard_edge) (tv4)
    ANDLink
        tv5 = µF2(tv1, tv2, tv3, tv4)
        Near(self, toy) (tv5)
The target theorem is "how near am I to the toy if I climb on the pillow?"

PredictiveImplicationLink
    Climb_on(pillow)
    Near(self, toy) (?)
And the inference chain is as follows:
1. Axiom 2 with axiom 1

Near(toy, cupboard_edge) (tv6)

2. Step 1 with axioms 1 and 3

PredictiveImplicationLink
    Climb_on(pillow)
    ANDLink
        Below(pillow, cupboard) (tv3)
        Near(pillow, cupboard) (tv4)
        Over(self, pillow) (tv7)
        Near(toy, cupboard_edge) (tv6)
3. Step 2 with axiom 4, giving the target theorem (how near am I to the toy if I climb on the pillow):

PredictiveImplicationLink
    Climb_on(pillow)
    Near(self, toy) (tv8)
35.6 An Integrative Approach to Planning
Planning is a major research area in the mainstream AI community, and planning algorithms have advanced dramatically in the last decade. However, the best-of-breed planning algorithms are still not able to deal with planning in complex environments in the face of a high level of uncertainty, which is the sort of situation routinely faced by humans in everyday life. Really powerful planning, we suggest, requires an approach different from any of the dedicated planning algorithms, involving spatiotemporal logic combined with a sophisticated search mechanism (such as MOSES).
It may be valuable (or even necessary) for an intelligent system involved in planning-intensive
goals to maintain a specialized planning-focused data structure to guide general learning mech-
anisms toward more efficient learning in a planning context. But even if so, we believe planning
must ultimately be done as a case of more general learning, rather than via a specialized algo-
rithm.
The basic approach we suggest here is to
• use MOSES for the core plan learning algorithm. That is, MOSES would maintain a popu-
lation of "candidate partial plans", and evolve this population in an effort to find effective
complete plans.
• use PLN to help in the fitness evaluation of candidate partial plans. That is, PLN would
be used to estimate the probability that a partial plan can be extended into a high-quality
complete plan. This requires PLN to make heavy use of spatiotemporal logic, as described
in the previous sections of this chapter.
• use a GraphPlan-style [BF97] planning graph, to record information about candidate plans, and to propagate information about mutual exclusion between actions. The planning graph may be used to help guide both MOSES and PLN.
In essence, the planning graph simply records different states of the world that may be achievable, with a high-strength PredictiveImplicationLink pointing between states X and Y if X can sensibly serve as a predecessor to Y; and a low-strength (but potentially high-confidence) PredictiveImplicationLink between X and Y if the former excludes the latter. This may be a subgraph of the Atomspace or it may be separately cached; but in either case it must be frequently accessed via PLN, in order for the latter to avoid making a massive number of unproductive inferences in the course of assisting with planning.
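A toy sketch of such a planning graph (our own illustration; the class and method names are assumptions, not OpenCog APIs) might store predecessor links with strengths and exclusion links, and report which states plausibly precede a given goal state:

```python
from collections import defaultdict

class PlanningGraph:
    """Toy planning graph: high-strength edges mark plausible predecessor
    states; exclusion edges mark mutual exclusion. In the text both are stored
    as PredictiveImplication-style links with high or low strength."""

    def __init__(self):
        self.successors = defaultdict(dict)   # state -> {next_state: strength}
        self.exclusions = defaultdict(set)    # state -> set of excluded states

    def add_predecessor(self, x, y, strength=0.9):
        # X can sensibly serve as a predecessor to Y, with the given strength.
        self.successors[x][y] = strength

    def add_exclusion(self, x, y):
        # X and Y mutually exclude each other.
        self.exclusions[x].add(y)
        self.exclusions[y].add(x)

    def plausible_predecessors(self, goal, min_strength=0.5):
        """States that can sensibly precede `goal` and are not excluded by it."""
        return [x for x, nxt in self.successors.items()
                if nxt.get(goal, 0.0) >= min_strength
                and goal not in self.exclusions[x]]

g = PlanningGraph()
g.add_predecessor("near_pillow", "on_pillow", 0.9)
g.add_predecessor("on_pillow", "near_toy", 0.8)
g.add_exclusion("on_pillow", "on_chair")
```

MOSES could consult plausible_predecessors when extending a partial plan, and PLN could restrict its inferences to the variables appearing along such edges.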
One can think of this as being a bit like PGraphPlan [BL99], except that
• MOSES is being used in place of forward or backward chaining search, enabling a more
global search of the plan space (mixing forward and backward learning freely)
• PLN is being used to estimate the value of partial plans, replacing heuristic methods of
value propagation
Regarding PLN, one possibility would be to (explicitly, or in effect) create a special API
function looking something like
EstimateSuccessProbability(PartialPlan PP, Goal G)
(assuming the goal statement contains information about the time allotted to achieve the
goal). The PartialPlan is simply a predicate composed of predicates linked together via temporal
links such as PredictiveImplication and SimultaneousAND. Of course, such a function could be used within many non-MOSES approaches to planning also.
Put simply, the estimation of the success probability is "just" a matter of asking the PLN backward-chainer to figure out the truth value of a certain ImplicationLink, i.e.

PredictiveImplicationLink [time-lag T]
    EvaluationLink do PP
    G
But of course, this may be a very difficult inference without some special guidance to help
the backward chainer. The GraphPlan-style planning graph could be used by PLN to guide the inference, telling it which variables to look at. This sort
of reasoning also requires PLN to have a fairly robust capability to reason about time intervals
and events occurring therein (i.e., basic temporal inference).
Regarding MOSES, given a candidate plan, it could look into the planning graph to aid
with program tree expansion. That is, given a population of partial plans, MOSES would
progressively add new nodes to each plan, representing predecessors or successors to the actions
already described in the plans. In choosing which nodes to add, it could be probabilistically
biased toward adding nodes suggested by the planning graph.
So, overall what we have is an approach to doing planning via MOSES, with PLN for fitness
estimation - but using a GraphPlan-style planning graph to guide MOSES's exploration of the
neighborhood of partial plans, and to guide PLN's inferences regarding the success likelihood
of partial plans.
Chapter 36
Adaptive, Integrative Inference Control
36.1 Introduction
The subtlest and most difficult aspect of logical inference is not the logical rule-set nor the
management of uncertainty, but the control of inference: the choice of which inference steps
to take, in what order, in which contexts. Without effective inference control methods, logical
inference is an unscalable and infeasible approach to learning declarative knowledge. One of the
key ideas underlying the CogPrime design is that inference control cannot effectively be handled
by looking at logic alone. Instead, effective inference control must arise from the intersection
between logical methods and other cognitive processes. In this chapter we describe some of the
general principles used for inference control in the CogPrime design.
Logic itself is quite abstract and relatively (though not entirely) independent of the specific
environment and goals with respect to which a system's intelligence is oriented. Inference con-
trol, however, is (among other things) a way of adapting a logic system to operate effectively
with respect to a specific environment and goal-set. So, the reliance of CogPrime's inference
control methods on the integration between multiple cognitive processes, is a reflection of the
foundation of CogPrime on the assumption (articulated in Chapter 9) that the relevant en-
vironment and goals embody interactions between world-structures and interaction-structures
best addressed by these various processes.
36.2 High-Level Control Mechanisms
The PLN implementation in CogPrime is complex and lends itself to utilization via many
different methods. However, a convenient way to think about it is in terms of three basic
backward-focused query operations:
• findtv, which takes in an expression and tries to find its truth value.
• findExamples, which takes an expression containing variables and tries to find concrete
terms to fill in for the variables.
• createExamples, which takes an expression containing variables and tries to create new
Atoms to fill in for the variables, using concept creation heuristics as discussed in Chapter
38, coupled with inference for evaluating the products of concept creation.
and one forward-chaining operation:
• findConclusions, which takes a set of Atoms and seeks to draw the most interesting
possible set of conclusions via combining them with each other and with other knowledge
in the AtomTable.
These inference operations may of course call themselves and each other recursively, thus cre-
ating lengthy chains of diverse inference.
Findtv is quite straightforward, at the high level of discussion adopted here. Various inference
rules may match the Atom; in our current PLN implementation, loosely described below, these
inference rules are executed by objects called Rules. In the course of executing findtv, a decision
must be made regarding how much attention to allocate to each one of these Rule objects, and
some choices must be made by the objects themselves - issues that involve processes beyond
pure inference, and will be discussed later in this chapter. Depending on the inference rules
chosen, findtv may lead to the construction of inferences involving variable expressions, which
may then be evaluated via findExamples or createExamples queries.
The findExamples operation sometimes reduces to a simple search through the AtomSpace.
On the other hand, it can also be done in a subtler way. If the findExamples Rule wants to find
examples of $X so that F($X), but can't find any, then it can perform some sort of heuristic
search, or else it can run another findExamples query, looking for $G so that
Implication $G F
and then running findExamples on G rather than F. But what if this findExamples query
doesn't come up with anything? Then it needs to run a createExamples query on the same
implication, trying to build a $G satisfying the implication.
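The control flow just described (direct search, then search through implications, then createExamples) can be sketched as follows; this is an illustrative Python pseudo-implementation under our own naming, not the actual OpenCog API:

```python
def find_examples(atomspace, predicate, create_examples=None):
    """Toy sketch of the findExamples control flow described in the text:
    1. direct search of the AtomSpace for $X satisfying predicate($X);
    2. failing that, look for a $G with (Implication $G predicate) and
       search for satisfiers of $G instead;
    3. failing that, fall back on a createExamples callback."""
    # Step 1: simple search through the AtomSpace.
    found = [x for x in atomspace.get("atoms", []) if predicate(x)]
    if found:
        return found
    # Step 2: look for $G implying the target predicate, and search via $G.
    for g, f in atomspace.get("implications", []):
        if f is predicate:
            found = [x for x in atomspace.get("atoms", []) if g(x)]
            if found:
                return found
    # Step 3: try to create a new Atom satisfying the predicate.
    if create_examples is not None:
        return create_examples(predicate)
    return []
```

In CogPrime these steps would of course be interleaved with attention allocation, rather than tried in a fixed order as in this sketch.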
Finally, forward-chaining inference (findConclusions) may be conceived of as a special heuristic for handling special kinds of findExamples problems. Suppose we have K Atoms and want to
find out what consequences logically ensue from these K Atoms, taken together. We can form
the conjunction of the K Atoms (let's call it C), and then look for SD so that
Implication C SD
Conceptually, this can be approached via findExamples, which defaults to createExamples in
cases where nothing is found. However, this sort of findExamples problem is special, involving
appropriate heuristics for combining the conjuncts contained in the expression C, which embody
the basic logic of forward-chaining rather than backward-chaining inference.
36.2.1 The Need for Adaptive Inference Control
It is clear that in humans, inference control is all about context. We use different inference
strategies in different contexts, and learning these strategies is most of what learning to think
is all about. One might think to approach this aspect of cognition, in the CogPrime design, by
introducing a variety of different inference control heuristics, each one giving a certain algorithm
for choosing which inferences to carry out in which order in a certain context. (This is similar to
what has been done within Cyc, for example http://cyc.com.) However, in keeping with the
integrated intelligence theme that pervades CogPrime, we have chosen an alternate strategy for
PLN. We have one inference control scheme, which is quite simple, but which relies partially on
structures coming from outside PLN proper. The requisite variety of inference control strategies
is provided by variety in the non-PLN structures, such as:
• HebbianLinks existing in the AtomTable;
• patterns recognized via pattern-mining in the corpus of prior inference trails.
36.3 Inference Control in PLN
We will now describe the basic "inference control" loop of PLN in CogPrime. Pre-2013 OpenCog
versions used a somewhat different scheme, more similar to a traditional logic engine. The ap-
proach presented here is more cognitive synergy oriented, achieving PLN control via a combi-
nation of logic engine style methods and integration with attention allocation.
36.3.1 Representing PLN Rules as GroundedSchemaNodes
PLN inference rules may be represented as GroundedSchemaNodes. So for instance the PLN
Deduction Rule becomes a GroundedSchemaNode with the properties:
• Input: a pair of links (L1, L2), where L1 and L2 are the same type, which must be one of
InheritanceLink, ImplicationLink, SubsetLink or ExtensionalImplicationLink
• Output: a single link, of the same type as the input
The actual PLN Rules and Formulas are then packed into the internal execution methods of
GroundedSchemaNodes.
In the current PLN code, each inference rule has a Rule class and a separate Formula class.
So then, e.g. the PLNDeductionRule GroundedSchemaNode invokes a function of the general
form
Link PLNDeductionRule(Link L1, Link L2)
which calculates the deductive consequence of two links. This function then invokes a function
of the form
TruthValue PLNDeductionFormula(TruthValue tAB, TruthValue tBC, TruthValue tA, TruthValue tB, TruthValue tC)
which in turn invokes functions such as
SimpleTruthValue SimplePLNDeductionFormula(SimpleTruthValue tAB, SimpleTruthValue tBC, SimpleTruthValue tA, SimpleTruthValue tB, SimpleTruthValue tC)
IndefiniteTruthValue IndefinitePLNDeductionFormula(IndefiniteTruthValue tAB, IndefiniteTruthValue tBC, IndefiniteTruthValue tA, IndefiniteTruthValue tB, IndefiniteTruthValue tC)
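As a concrete illustration, the strength part of SimplePLNDeductionFormula can be sketched as
follows. This is a minimal Python sketch, not the actual OpenCog code: the strength update is the
standard PLN deduction formula, while the confidence combination shown is only a placeholder
heuristic assumed for illustration.

```python
from dataclasses import dataclass

@dataclass
class SimpleTruthValue:
    strength: float    # probability estimate in [0, 1]
    confidence: float  # weight-of-evidence measure in [0, 1]

def simple_pln_deduction_formula(tAB, tBC, tA, tB, tC):
    # Standard PLN deduction strength formula:
    #   sAC = sAB*sBC + (1 - sAB)*(sC - sB*sBC)/(1 - sB)
    # (tA is carried for interface uniformity; this strength form does not use it.)
    if tB.strength >= 1.0:
        s_ac = tC.strength  # degenerate case: B is (nearly) universal
    else:
        s_ac = (tAB.strength * tBC.strength
                + (1.0 - tAB.strength)
                * (tC.strength - tB.strength * tBC.strength)
                / (1.0 - tB.strength))
    s_ac = min(1.0, max(0.0, s_ac))
    # Placeholder heuristic: the conclusion is no more confident than the
    # less confident of the two premise links.
    c_ac = min(tAB.confidence, tBC.confidence)
    return SimpleTruthValue(s_ac, c_ac)
```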
36.3.2 Recording Executed PLN Inferences in the Atomspace
Once an inference has been carried out, it can be represented in the Atomspace, e.g. as
ExecutionLink
    GroundedSchemaNode: PLNDeductionRule
    ListLink
        HypotheticalLink
            InheritanceLink people animal <tv1>
        HypotheticalLink
            InheritanceLink animal breathe <tv2>
        HypotheticalLink
            InheritanceLink people breathe <tv3>
Note that a link such as
InheritanceLink people breathe <.8,.2>
will have its truth value stored as a truth value version within a CompositeTruthValue object.
In the above, e.g.
InheritanceLink people animal
is used as shorthand for
InheritanceLink C1 C2
where C1 and C2 are ConceptNodes representing "people" and "animal" respectively.
We can also have records of inferences involving variables, such as
ExecutionLink
    GroundedSchemaNode: PLNDeductionRule
    ListLink
        HypotheticalLink
            InheritanceLink $V1 animal <tv1>
        HypotheticalLink
            InheritanceLink animal breathe <tv2>
        HypotheticalLink
            InheritanceLink $V1 breathe <tv3>
where $V1 is a specific VariableNode.
36.3.3 Anatomy of a Single Inference Step
A single inference step, then, may be viewed as follows:
1. Choose an inference rule R, and a tuple of Atoms that collectively match the input conditions
of the rule
2. Apply the chosen rule R to the chosen input Atoms
3. Create an ExecutionLink recording the output found
4. In addition to retaining this ExecutionLink in the Atomspace, also save a copy of it to the
InferenceRepository (this is not needed for the very first implementation, but will be very
useful once PLN is in regular use)
The InferenceRepository, referred to here, is a special Atomspace that exists just to save a
record of PLN inferences. It can be mined, after the fact, to learn inference patterns, which can
be used to guide future inferences.
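The four steps above can be sketched in Python. The Rule interface (`matching_input_tuples`,
`apply`) and the list-based Atomspace and repository are hypothetical stand-ins for illustration,
not the actual OpenCog API:

```python
import random

def single_inference_step(atomspace, rules, inference_repository, rng=random):
    """One micro-step of PLN-style inference, following the four steps above."""
    # Step 1: choose a rule and a tuple of Atoms matching its input conditions
    # (here: uniformly at random among all matching (rule, inputs) pairs).
    candidates = [(rule, inputs)
                  for rule in rules
                  for inputs in rule.matching_input_tuples(atomspace)]
    if not candidates:
        return None
    rule, inputs = rng.choice(candidates)
    # Step 2: apply the chosen rule to the chosen input Atoms.
    output = rule.apply(inputs)
    # Step 3: record the inference as an ExecutionLink-style record.
    record = ("ExecutionLink", rule.name, inputs, output)
    atomspace.append(record)
    # Step 4: archive a copy for later mining of inference patterns.
    inference_repository.append(record)
    return output
```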
36.3.4 Basic Forward and Backward Inference Steps
The choice of an inference step, at the microscopic level, may be done in a number of ways, of
which perhaps the simplest are:
• "Basic forward step": Choose an Atom A1, then choose a rule R. If R only takes one input,
then apply R to A1. If R applies to two Atoms, then find another Atom A2 so that (A1,
A2) may be taken as the inputs of R.
• "Basic backward step": Choose an Atom A1, then choose a rule R. If R takes only one
input, then find an Atom A2 so that applying R to A2 yields A1 as output. If R takes two
inputs, then find two Atoms (A2, A3) so that applying R to (A2, A3) yields A1 as output.
Given a target Atom such as
A1 = Inheritance $V1 breathe
the VariableAbstractionRule will do inferences such as
ExecutionLink
    VariableAbstractionRule
    HypotheticalLink
        Inheritance people breathe
    HypotheticalLink
        Inheritance $V1 breathe
This allows the basic backward step to carry out variable fulfillment queries as well as truth
value queries. We may encapsulate these processes in the Atomspace as
GroundedSchemaNode: BasicForwardInferenceStep
GroundedSchemaNode: BasicBackwardInferenceStep
which take as input some Atom A1,
and also as
GroundedSchemaNode: AttentionalForwardInferenceStep
GroundedSchemaNode: AttentionalBackwardInferenceStep
which automatically choose the Atom A1 they start with, via choosing some Atom within the
AttentionalFocus, with probability proportional to STI.
Forward chaining, in its simplest form, then becomes the process of repeatedly executing
the AttentionalForwardInferenceStep SchemaNode.
Backward chaining, in the simplest case (we will discuss more complex cases below), becomes
the process of
1. Repeatedly executing the BasicBackwardInferenceStep SchemaNode, starting from a given
target Atom
2. Concurrently, repeatedly executing the AttentionalBackwardInferenceStep SchemaNode, to
ensure that backward inference keeps occurring, regarding Atoms that were created via Step
1
Inside the BasicForwardStep or BasicBackwardStep schema, there are two choices to be
made: choosing a rule R, and then choosing additional Atoms A2 and possibly A3.
The choice of the rule R should be made probabilistically, choosing each rule with probability
proportional to a certain weight associated with each rule. Initially we can assign these weights
generically, by hand, separately for each application domain. Later on they should be chosen
adaptively, based on information mined from the InferenceRepository, regarding which rules
have been better in which contexts.
The choice of the additional Atoms A2 and A3 is subtler, and should be done using STI
values as a guide:
• First the AttentionalFocus is searched, to find all the Atoms there that fit the input criteria
of the rule R. Among all the Atoms found, an Atom is chosen with probability proportional
to STI.
• If the AttentionalFocus doesn't contain anything suitable, then an effort may be made to
search the rest of the Atomspace to find something suitable. If multiple candidates are
found within the amount of effort allotted, then one should be chosen with probability
proportional to STI.
If an Atom A is produced as output of a forward inference step, or is chosen as the input of a
backward inference step, then the STI of this Atom A should be incremented. This will increase
the probability of A being chosen for ongoing inference. In this way, attention allocation is used
to guide the course of ongoing inference.
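The STI-guided choice of the additional Atoms can be sketched as follows. The list-of-pairs
representation of the AttentionalFocus and Atomspace, and the `fits_rule` predicate, are
simplifying assumptions made for illustration:

```python
import random

def select_input_atom(attentional_focus, atomspace, fits_rule, rng=random):
    # First search the AttentionalFocus for Atoms matching the rule's
    # input criteria; fall back to the wider Atomspace if nothing fits.
    # (In practice the Atomspace search would be bounded by an effort budget.)
    pool = [(atom, sti) for atom, sti in attentional_focus if fits_rule(atom)]
    if not pool:
        pool = [(atom, sti) for atom, sti in atomspace if fits_rule(atom)]
    if not pool:
        return None
    atoms = [atom for atom, _ in pool]
    weights = [max(sti, 0.0) for _, sti in pool]
    if sum(weights) <= 0.0:
        return rng.choice(atoms)  # no positive STI: fall back to uniform choice
    # Choose with probability proportional to STI.
    return rng.choices(atoms, weights=weights, k=1)[0]
```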
36.3.5 Interaction of Forward and Backward Inference
Starting from a target, a series of backward inferences can figure out ways to estimate the truth
value of that target, or fill in the variables within that target.
However, once the backward-going chain of inferences is done (to some reasonable degree
of satisfaction), there is still the remaining task of using all the conclusions drawn during the
series of backward inferences, to actually update the target.
Elegantly, this can be done via forward inference. So if forward and backward inference
are both operating concurrently on the same pool of Atoms, it is forward inference that will
propagate the information learned during backward chaining inference, up to the target of the
backward chain.
36.3.6 Coordinating Variable Bindings
Probably the thorniest subtlety that comes up in a PLN implementation is the coordination of
the values assigned to variables, across different micro-level inferences that are supposed to be
coordinated together as part of the same macro-level inference.
For a very simple example, suppose we have a truth-value query with target
A1 = InheritanceLink Bob rich
Suppose the deduction rule R is chosen.
Then if we can find (A2, A3) that look like, say,
A2 = InheritanceLink Bob owns_mansion
A3 = InheritanceLink owns_mansion rich
then our problem is solved.
But what if no such simple solution is available in the Atomspace? Then we have to
build something like
A2 = InheritanceLink Bob $v1
A3 = InheritanceLink $v1 rich
and try to find something that works to fill in the variable $v1.
But this is tricky, because $v1 now has two constraints (A2 and A3). So, suppose A2 and
A3 are both created as a result of applying BasicBackwardInferenceStep to A1, and thus A2
and A3 both get high STI values. Then both A2 and A3 are going to be acted on by
AttentionalBackwardInferenceStep. But as A2 and A3 are produced via other inputs using
backward inference, it is necessary that the values assigned to $v1 in the context of A2 and
A3 remain consistent with each other.
Note that, according to the operation of the Atomspace, the same VariableAtom will be used
to represent $v1 no matter where it occurs.
For instance, it will be problematic if one inference rule schema tries to instantiate $v1 with
"owns_mansion", but another tries to instantiate $v1 with "lives_in_Manhattan".
That is, we don't want to find
InheritanceLink Bob lives_in_mansion
InheritanceLink lives_in_mansion owns_mansion
|-
InheritanceLink Bob owns_mansion
which binds $v1 to owns_mansion, and
InheritanceLink lives_in_Manhattan lives_in_top_city
InheritanceLink lives_in_top_city rich
|-
InheritanceLink lives_in_Manhattan rich
which binds $v1 to lives_in_Manhattan.
We want A2 and A3 to be derived in ways that bind $v1 to the same thing.
The most straightforward way to avoid confusion in this sort of context is to introduce an
additional kind of inference step:
• "Variable-guided backward step": Choose a set V of VariableNodes (which may just be
a single VariableNode $v1), and identify the set S_V of all Atoms involving any of the
variables in V.
- Firstly: If V divides into two sets V1 and V2, so that no Atom contains variables in
both V1 and V2, then launch separate variable-guided backward steps for V1 and V2.
(This step is "Problem Decomposition".)
- Secondly: Carry out the basic backward step for all the Atoms in S_V, but restricting
the search for Atoms A2, A3 in such a way that each of the variables in V is consistently
instantiated. This is a non-trivial optimization, and more will be said about this below.
• "Variable-guided backward step, Atom-triggered": Choose an Atom A1. Identify the set
V of VariableNodes targeted by A1, and then do a variable-guided backward step starting
from V.
This variable guidance may, of course, be incorporated into the
AttentionalBackwardInferenceStep as well. In this case, backward chaining becomes the
process of
1. Repeatedly executing the VariableGuidedBackwardInferenceStep SchemaNode, starting
from a given target Atom
2. Concurrently, repeatedly executing the AttentionalVariableGuidedBackwardInferenceStep
SchemaNode, to ensure that backward inference keeps occurring, regarding Atoms that were
created via Step 1
The hard work here is then done in the second part of the variable-guided backward step, which
has to search for multiple Atoms, to fulfill the requirements of multiple inference rules, in a way
that keeps consistent variable instantiations. But this same difficulty exists in a conventional
backward chaining framework; it's just arranged differently, and not as neatly encapsulated.
36.3.7 An Example of Problem Decomposition
Illustrating a point raised above, we now give an example of a case where, given a problem of
finding values to assign to a set of variables to make a set of expressions hold simultaneously,
appropriate course is to divide the set of expressions into two separate parts.
Suppose we have the six expressions
E1 = Inheritance( $v1, Animal )
E2 = Evaluation( $v1, ($v2, Bacon) )
E3 = Inheritance( $v2, $v3 )
E4 = Evaluation( Eat, ($v3, $v1) )
E5 = Evaluation( Eat, ($v7, $v9) )
E6 = Inheritance $v9 $v6
Since the set {E1, E2, E3, E4} doesn't share any variables with {E5, E6}, there is no reason
to consider them all as one problem. Rather, we will do better to decompose it into two problems,
one involving {E1, E2, E3, E4} and one involving {E5, E6}.
In general, given a set of expressions, one can divide it into subsets, where each subset S has
the property that: for every variable v contained in S, all occurrences of v in the Atomspace
are in expressions contained in S.
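This decomposition criterion amounts to finding connected components in the graph whose
nodes are expressions and whose edges are shared variables. A sketch, representing each
expression simply by its set of variables (an illustrative simplification of the Atomspace
representation):

```python
def decompose_by_shared_variables(expressions):
    """Split expressions into independent subproblems: two expressions
    belong to the same subproblem iff they are connected (possibly
    transitively) through shared variables. `expressions` maps a label
    to its set of variable names."""
    labels = list(expressions)
    parent = {label: label for label in labels}

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union any two expressions that share a variable.
    var_owner = {}
    for label in labels:
        for v in expressions[label]:
            if v in var_owner:
                parent[find(label)] = find(var_owner[v])
            else:
                var_owner[v] = label

    groups = {}
    for label in labels:
        groups.setdefault(find(label), []).append(label)
    return sorted(sorted(group) for group in groups.values())
```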
36.3.8 Example of Casting a Variable Assignment Problem as an
Optimization Problem
Suppose we have the four expressions
E1 = Inheritance( $v1, Animal )
E2 = Evaluation( $v2, ($v1, Bacon) )
E3 = Inheritance( $v2, $v3 )
E4 = Evaluation( Enjoy, ($v1, $v3) )
where Animal, Bacon and Enjoy are specific Atoms.
Suppose the task at hand is to find values for ($v1, $v2, $v3) that will make all of these
expressions confidently true.
If there is some assignment
($v1, $v2, $v3) = (A1, A2, A3)
ready to hand in the Atomspace, that fulfills the equations E1, E2, E3, E4, then the Atomspace
API's pattern matcher will find it. For instance,
($v1, $v2, $v3) = (Cat, Eat, Chew)
would work here, since
E1 = Inheritance( Cat, Animal )
E2 = Evaluation( Eat, (Cat, Bacon) )
E3 = Inheritance( Eat, Chew )
E4 = Evaluation( Enjoy, (Cat, Chew) )
are all reasonably true.
If there is no such assignment ready to hand, then one is faced with a search problem. This
can be approached as an optimization problem, e.g. one of maximizing a function
f($v1, $v2, $v3) = sc(E1) * sc(E2) * sc(E3) * sc(E4)
where
sc(A) = A.strength * A.confidence
The function f is then a function with signature
f: Atom^3 --> float
f can then be optimized by a host of optimization algorithms. For instance a genetic algorithm
approach might work, but a BOA (Bayesian Optimization Algorithm) approach would probably
be better.
In a GA approach, mutation would work as follows. Suppose one had a candidate
($v1, $v2, $v3) = (A1, A2, A3)
Then one could mutate this candidate by (for instance) replacing A1 with some other Atom
that is similar to A1, e.g. connected to A1 with a high-weight SimilarityLink in the Atomspace.
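A sketch of this optimization framing follows. The `evaluate` hook (returning a (strength,
confidence) pair for an expression under a candidate assignment) and the `similar_atoms`
table stand in for Atomspace truth-value lookup and SimilarityLink traversal; both are
assumptions made for illustration:

```python
import random

def sc(strength, confidence):
    return strength * confidence

def fitness(assignment, expressions, evaluate):
    # f = product of strength*confidence over all target expressions,
    # evaluated under the candidate variable assignment.
    value = 1.0
    for expr in expressions:
        strength, confidence = evaluate(expr, assignment)
        value *= sc(strength, confidence)
    return value

def mutate(assignment, similar_atoms, rng=random):
    # GA-style mutation: replace one variable's Atom with a similar Atom
    # (e.g. one connected by a high-weight SimilarityLink).
    variable = rng.choice(sorted(assignment))
    mutated = dict(assignment)
    neighbours = similar_atoms.get(assignment[variable], [])
    if neighbours:
        mutated[variable] = rng.choice(neighbours)
    return mutated
```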
36.3.9 Backward Chaining via Nested Optimization
Given this framework that does inference involving variables by using optimization to solve
simultaneous equations of logical expressions with overlapping variables, "backward chaining"
becomes the iterative launch of repeated optimization problems, each one defined in terms of
the previous ones. We will now illustrate this point by continuing with the {E1, E2, E3, E4}
example from above. Suppose one found an assignment
($v1, $v2, $v3) = (A1, A2, A3)
that worked for every equation except E3. Then there is the problem of finding some way to
make
E3 = Inheritance( A2, A3 )
work.
For instance, what if we have found the assignment
($v1, $v2, $v3) = (Cat, Eat, Chase)
In this case, we have
E1 = Inheritance( Cat, Animal ) -- YES
E2 = Evaluation( Eat, (Cat, Bacon) ) -- YES
E3 = Inheritance( Eat, Chase ) -- NO
E4 = Evaluation( Enjoy, (Cat, Chase) ) -- YES
so the assignment works for every equation except E3. Then there is the problem of finding
some way to make
E3 = Inheritance( Eat, Chase )
work. But if the truth value of
Inheritance( Eat, Chase )
has a low strength and high confidence, this may seem hopeless, so this assignment may not
get followed up on.
On the other hand, we might have the assignment
($v1, $v2, $v3) = (Cat, Eat, SocialActivity)
In this case, for a particular CogPrime instance, we might have
E1 = Inheritance( Cat, Animal ) -- YES
E2 = Evaluation( Eat, (Cat, Bacon) ) -- YES
E3 = Inheritance( Eat, SocialActivity ) -- UNKNOWN
E4 = Evaluation( Enjoy, (Cat, SocialActivity) ) -- YES
The above would hold if the reasoning system knew that cats enjoy social activities, but did not
know whether eating is a social activity. In this case, the reasoning system would have reason
to launch a new inference process aimed at assessing the truth value of
E3 = Inheritance( Eat, SocialActivity )
This is backward chaining: Launching a new inference process to figure out a question raised
by another inference process.
For instance, in this case the inference engine might: Choose an inference Rule (let's say it's
Deduction, for simplicity), and then look for $v4 so that
Inheritance Eat $v4
Inheritance $v4 SocialActivity
are both true. In this case one has spawned a new Variable-Guided Backward Inference
problem, which must be solved in order to make {A1, A2, A3} an OK solution for the problem
of {E1, E2, E3, E4}.
Or it might choose the Induction rule, and look for $v4 so that
Inheritance $v4 Eat
Inheritance $v4 SocialActivity
Maybe then it would find that $v4=Dinner works, because it knows that
Inheritance Dinner Eat
Inheritance Dinner SocialActivity
But maybe $v4=Dinner doesn't boost the truth value of
E3 = Inheritance( Eat, SocialActivity )
high enough. In that case it may keep searching for more information about E3 in the context
of this particular variable assignment. It might choose Induction again, and discover e.g. that
Inheritance Lunch Eat
Inheritance Lunch SocialActivity
In this example, we've assumed that some non-backward-chaining heuristic search mechanism
found a solution that almost works, so that backward chaining is only needed on E3. But of
course, one could backward chain on all of E1, E2, E3, E4 simultaneously - or various subsets
thereof.
For a simple example, suppose one backward chains on
E1 = Inheritance( $v1, Animal )
E3 = Inheritance( $v2, SocialActivity )
simultaneously. Then one is seeking, say, ($v4, $v5) so that
Inheritance $v1 $v5
Inheritance $v5 Animal
Inheritance $v2 $v4
Inheritance $v4 SocialActivity
This adds no complexity, as the four relations partition into two disjoint sets of two. Separate
chaining processes may be carried out for E1 and E3.
On the other hand, for a slightly more complex example, what if we backward chain on
E2 = Evaluation( $v2, ($v1, Bacon) )
E3 = Inheritance( $v2, SocialActivity )
simultaneously? (Assuming that a decision has already been made to explore the possibility
$v3 = SocialActivity.) Then we have a somewhat more complex situation. We are trying to find
a $v2 that is a SocialActivity, so that $v1 likes to do $v2 in conjunction with Bacon.
If the Member2Evaluation rule is chosen for E2 and the Deduction rule is chosen for E3,
then we have
E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E7 = Member ($v1, Bacon) (SatisfyingSet $v2)
and if the Inheritance2Member rule is then chosen for E7, we have
E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E8 = Inheritance ($v1, Bacon) (SatisfyingSet $v2)
and if Deduction is then chosen for E8 then we have
E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E9 = Inheritance ($v1, Bacon) $v8
E10 = Inheritance $v8 (SatisfyingSet $v2)
Following these steps expands the search to involve more variables and means the inference
engine now gets to deal with
E1 = Inheritance( $v1, Animal )
E4 = Evaluation( Enjoy, ($v1, SocialActivity) )
E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E9 = Inheritance ($v1, Bacon) $v8
E10 = Inheritance $v8 (SatisfyingSet $v2)
or some such; i.e., we have expanded our problem to include more and more simultaneous logical
equations in more and more variables! Which is not necessarily a terrible thing, but it does get
complicated.
We might find, for example, that $v1 = Pig, $v6 = Dance, $v2 = Waltz, $v8 = PiggyWaltz:
E1 = Inheritance( Pig, Animal )
E4 = Evaluation( Enjoy, (Pig, SocialActivity) )
E5 = Inheritance Waltz Dance
E6 = Inheritance Dance SocialActivity
E9 = Inheritance (Pig, Bacon) PiggyWaltz
E10 = Inheritance PiggyWaltz (SatisfyingSet Waltz)
Here PiggyWaltz is a special dance that pigs do with their Bacon, as a SocialActivity!
Of course, this example is extremely contrived. Real inference examples will rarely be this
simple, and will not generally involve Nodes that have simple English names. This example is
just for illustration of the concepts involved.
36.4 Combining Backward and Forward Inference Steps with
Attention Allocation to Achieve the Same Effect as Backward
Chaining (and Even Smarter Inference Dynamics)
Backward chaining is a powerful heuristic, but one can achieve the same basic effect - and even
smarter inference dynamics - without explicitly doing backward chaining, via a combination of:
• heuristic search to satisfy simultaneous expressions
• boosting the STI of expressions being searched
• importance spreading (of STI)
• ongoing background forward inference
The basic idea is: When a system of expressions involving variables is explored using a GA or
whatever other optimization process is deployed, these expressions also get their STI boosted.
Then, the atoms with high STI are explored by the forward inference process, which is
always acting in the background on the atoms in the Atomspace. Other atoms related to these
also get STI via importance spreading. And these other related Atoms are then acted on by
forward inference as well.
This forward chaining will then lead to the formation of new Atoms, which may make the
solution of the system of expressions easier the next time it is visited by the backward inference
process.
In the above example, this means:
• E1, E2, E3, E4 will all get their STI boosted
• Other Atoms related to these (Animal, Bacon and Enjoy) will also get their STI boosted
• These other Atoms will get forward inference done on them
• This forward inference will then yield new Atoms that can be drawn on when the solution
of the expression-system El, E2, E3, E4 is pursued the next time
So, for example, if the system did not know that eating is a social activity, it might learn this
during forward inference on SocialActivity. The fact that SocialActivity has high STI would
cause forward inferences such as
Inheritance Dinner Eat
Inheritance Dinner SocialActivity
Inheritance Eat SocialActivity
to get done. These forward inferences would then produce links that could simply be found by
the pattern matcher when trying to find variable assignments to satisfy {E1, E2, E3, E4}.
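The interplay described in this example can be sketched as a single "ambient" cycle combining
STI-weighted atom choice, forward inference, and importance spreading. All the data structures
and the `forward_step` hook are illustrative stand-ins, not the OpenCog interfaces:

```python
import random

def ambient_inference_cycle(atoms, sti, hebbian, forward_step,
                            decay=0.9, spread=0.2, rng=random):
    """One background cycle: forward inference on a high-STI atom, plus
    simple importance spreading along Hebbian links."""
    # Pick an atom with probability proportional to its STI.
    pool = sorted(atoms)
    weights = [max(sti.get(a, 0.0), 0.0) for a in pool]
    if sum(weights) <= 0.0:
        return None
    atom = rng.choices(pool, weights=weights, k=1)[0]
    # Forward inference on the chosen atom; new conclusions inherit STI,
    # so they become candidates for further inference (and for the
    # pattern matcher the next time the expression system is visited).
    new_atom = forward_step(atom)
    if new_atom is not None:
        atoms.add(new_atom)
        sti[new_atom] = sti.get(new_atom, 0.0) + sti[atom]
    # Importance spreading: Hebbian neighbours receive a share of STI.
    for neighbour in hebbian.get(atom, []):
        sti[neighbour] = sti.get(neighbour, 0.0) + spread * sti[atom]
    sti[atom] *= decay  # gradual decay of the source atom's importance
    return new_atom
```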
36.4.1 Breakdown into MindAgents
To make this sort of PLN dynamic work, we require a number of MindAgents to be operating
"ambiently" in the background whenever inference is occurring; to wit:
• attentional forward chaining (i.e. each time this MindAgent is invoked, it chooses high-STI
Atoms and does basic forward chaining on them)
• attention allocation (importance updating is critical, Hebbian learning is also useful)
• attentional (variable guided) backward chaining
On top of this ambient inference, we may then have query-driven backward chaining inferences
submitted by other processes (via these launching backward inference steps and giving the
associated Atoms lots of STI). The ambient inference processes will help the query-driven
inference processes to get fulfilled.
36.5 Hebbian Inference Control
A key aspect of the PLN control mechanism described here is the use of attention allocation
to guide Atom choice in the course of forward and backward inference. Figure 36.1 gives a
simple illustrative example of the use of attention allocation, via HebbianLinks, for PLN
backward chaining.
of the use of attention allocation, via HebbianLinks, for PLN backward chaining.
The semantics of a HebbianLink between A and B is, intuitively: In the past, when A was
important, B was also important. HebbianLinks are created via two basic mechanisms: pattern-
mining of associations between importances in the system's history, and PLN inference based
on HebbianLinks created via pattern mining (and inference). Thus, saying that PLN inference
control relies largely on HebbianLinks is in part saying that PLN inference control relies on
[Figure 36.1 diagram: backward chaining applies the Deduction rule to (WILBUR isa PIG) and
(PIG isa FRIENDLY) to assess (WILBUR isa FRIENDLY); in declarative-attentional interaction,
importance spreading gives WILBUR, PIG and FRIENDLY high STI, and episodic memory is
searched for episodes with friendly or unfriendly pigs.]
Fig. 36.1: The Use of Attention Allocation for Guiding Backward Chaining Inference.
PLN. There is a bit of a recursion here, but it's not a bottomless recursion because it bottoms
out with HebbianLinks learned via pattern mining.
As an example of the Atom-choices to be made by a forward or backward inference agent
in the course of doing inference, consider that to evaluate (Inheritance A C) via the deduction
Rule, some collection of intermediate nodes for the deduction must be chosen. In the case of
higher-order deduction, each deduction may involve a number of complicated subsidiary steps,
so perhaps only a single intermediate node will be chosen. This choice of intermediate nodes
must be made via context-dependent prior probabilities. In the case of other Rules besides
deduction, other similar choices must be made.
The basic means of using HebbianLinks in inferential Atom-choice is simple: If there are
Atoms linked via HebbianLinks with the other Atoms in the inference tree, then these Atoms
should be given preference in the selection process.
Along the same lines but more subtly, another valuable heuristic for guiding inference control
is "on-the-fly associatedness assessment." If there is a chance to apply the chosen Rule via
working with Atoms that are:
• strongly associated with the Atoms in the Atom being evaluated (via HebbianLinks)
• strongly associated with each other via HebbianLinks (hence forming a cohesive set)
then this should be ranked as a good thing.
For instance, it may be the case that, when doing deduction regarding relationships between
humans, using relationships involving other humans as intermediate nodes in the deduction is
often useful. Formally this means that, when doing inference of the form:
AND
Inheritance A human
Inheritance A B
Inheritance C human
Inheritance C B
|-
Inheritance A C
then it is often valuable to choose B so that:
HebbianLink B human
has high strength. This would follow from the above-mentioned heuristic.
Next, suppose one has noticed a more particular heuristic - that in trying to reason about
humans, it is particularly useful to think about their wants. This suggests that in abductions
of the above form it is often useful to choose B of the form:
SatisfyingSet [ wants(human, $X) ]
This is too fine-grained of a cognitive-control intuition to come from simple association-
following. Instead, it requires fairly specific data-mining of the system's inference history. It
requires the recognition of "Hebbian predicates" of the form:
HebbianImplication
    AND
        Inheritance $A human
        Inheritance $C human
        Similarity
            $B
            SatisfyingSet
                Evaluation wants(human, $X)
    AND
        Inheritance $A $B
        Inheritance $C $B
The semantics of:
HebbianImplication X Y
is that when X is being thought about, it is often valuable to think about Y shortly thereafter.
So what is required to do inference control according to heuristics like "think about humans
according to their wants" is a kind of backward-chaining inference that combines Hebbian
implications with PLN inference rules. PLN inference says that to assess the relationship between
two people, one approach is abduction. But Hebbian learning says that when setting up an
abduction between two people, one useful precondition is if the intermediate term in the ab-
duction regards wants. Then a check can be made whether there are any relevant intermediate
terms regarding wants in the system's memory.
What we see here is that the overall inference control strategy can be quite simple. For each
Rule that can be applied, a check can be made for whether there is any relevant Hebbian knowl-
edge regarding the general constructs involved in the Atoms this Rule would be manipulating.
If so, then the prior probability of this Rule is increased, for the purposes of the Rule-choice
bandit problem. Then, if the Rule is chosen, the specific Atoms this Rule would involve in the
inference can be summoned up, and the relevant Hebbian knowledge regarding these Atoms
can be utilized.
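This rule-choice scheme can be sketched as a weighted bandit-style selection, where the Hebbian
check multiplies a rule's prior probability. The `hebbian_boost` hook stands in for the lookup of
relevant Hebbian knowledge described above; it is an assumption made for illustration:

```python
import random

def choose_rule(rules, prior_weights, hebbian_boost, rng=random):
    # Each rule's prior weight is multiplied by (1 + boost), where the
    # boost is positive when relevant Hebbian knowledge exists for the
    # constructs this rule would manipulate, and can be negative when
    # such knowledge argues against the rule.
    weights = [prior_weights[rule] * (1.0 + hebbian_boost(rule))
               for rule in rules]
    if sum(weights) <= 0.0:
        return rng.choice(rules)  # no evidence either way: choose uniformly
    return rng.choices(rules, weights=weights, k=1)[0]
```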
To take another similar example, suppose we want to evaluate:
Inheritance pig dog
via the deduction Rule (which also carries out induction and abduction). There are a lot of
possible intermediate terms, but a reasonable heuristic is to ask a few basic questions about
them: How do they move around? What do they eat? How do they reproduce? How intelligent
are they? Some of these standard questions correspond to particular intermediate terms, e.g.
the intelligence question partly boils down to computing:
Inheritance pig intelligent
and:
Inheritance dog intelligent
So a link:
HebbianImplication animal intelligent
may be all that's needed to guide inference to asking this question. This HebbianLink says that
when thinking about animals, it's often interesting to think about intelligence. This should bias
the system to choose "intelligent" as an intermediate node for inference.
On the other hand, the "what do they eat" question is subtler, and boils down to asking: Find
$X so that when:
R($X) = SatisfyingSet[$Y] eats($Y, $X)
holds (R($X) is a concept representing the things that eat $X), then we have:
Inheritance pig R($X)
and:
Inheritance dog R($X)
In this case, a HebbianLink from animal to eat would not really be fine-grained enough. Instead
we want a link of the form:
HebbianImplication
    Inheritance $X animal
    SatisfyingSet[$Y] eats($X, $Y)
This says that when thinking about an animal, it's interesting to think about what that animal
eats.
The deduction Rule, when choosing which intermediate nodes to use, needs to look at the
scope of available HebbianLinks and HebbianPredicates and use them to guide its choice. And
if there are no good intermediate nodes available, it may report that it doesn't have enough
experience to assess with any confidence whether it can come up with a good conclusion. As a
consequence of the bandit-problem dynamics, it may be allocated reduced resources, or another
Rule may be chosen altogether.
36.6 Inference Pattern Mining
Along with general-purpose attention spreading, it is very useful for PLN processes to receive
specific guidance based on patterns mined from previously performed and stored inferences.
This information is stored in CogPrime in a data repository called the InferencePatternRepository
- which is, quite simply, a special "data table" containing inference trees extracted
from the system's inference history, and patterns recognized therein. An "inference tree" refers to
a tree whose nodes, called InferenceTreeNodes, are Atoms (or generally Atom-versions, Atoms
with truth value relative to a certain context), and whose links are inference steps (so each link
is labeled with a certain inference rule).
In a large CogPrime system it may not be feasible to store all inference trees; but then a wide
variety of trees should still be retained, including mainly successful ones as well as a sampling
of unsuccessful ones for purposes of comparison.
The InferencePatternRepository may then be used in two ways:
• An inference tree being actively expanded (i.e. utilized within the PLN inference system)
may be compared to inference trees in the repository, in real time, for guidance. That is, if
a node N in an inference tree is being expanded, then the repository can be searched for
nodes similar to N, whose contexts (within their inference trees) are similar to the context
of N within its inference tree. A study can then be made regarding which Rules and Atoms
were most useful in these prior, similar inferences, and the results of this can be used to
guide ongoing inference.
• Patterns can be extracted from the store of inference trees in the InferencePatternRepos-
itory, and stored separately from the actual inference trees (in essence, these patterns are
inference subtrees with variables in place of some of their concrete nodes or links). An infer-
ence tree being expanded can then be compared to these patterns instead of, or in addition
to, the actual trees in the repository. This provides greater efficiency in the case of common
patterns among inference trees.
A reasonable approach may be to first check for inference patterns and see if there are any
close matches; and if there are not, to then search for individual inference trees that are close
matches.
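This two-stage lookup can be sketched in a few lines of Python. The sketch below is purely illustrative: the function names and data attributes are hypothetical stand-ins, not actual CogPrime APIs, and the similarity measure is left abstract.

```python
# Hypothetical sketch of the two-stage lookup described above: consult the
# mined inference patterns first (cheap), fall back to scanning the raw
# stored inference trees (expensive). Nothing here is a real CogPrime API.

def find_guidance(node_context, mined_patterns, stored_trees,
                  similarity, threshold=0.8):
    """Return a prior (Rule, Atoms) suggestion for expanding an inference node.

    node_context   -- feature summary of the node being expanded
    mined_patterns -- generalized subtrees (variables replacing concrete Atoms)
    stored_trees   -- concrete inference trees from the inference history
    similarity     -- function scoring how well a stored context matches
    """
    # Stage 1: cheap check against generalized patterns.
    matches = [(similarity(node_context, p.context), p.suggestion)
               for p in mined_patterns]
    best = max(matches, key=lambda m: m[0], default=(0.0, None))
    if best[0] >= threshold:
        return best[1]

    # Stage 2: no close pattern match; scan individual stored trees.
    matches = [(similarity(node_context, t.context), t.suggestion)
               for t in stored_trees]
    best = max(matches, key=lambda m: m[0], default=(0.0, None))
    return best[1] if best[0] >= threshold else None
```

The design point illustrated is simply that the pattern store acts as a cache of generalizations, so the expensive per-tree scan runs only when no pattern is a close match.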
Mining patterns from the repository of inference trees is a potentially highly computationally
expensive operation, but this doesn't particularly matter since it can be run periodically in
the background while inference proceeds at its own pace in the foreground, using the mined
patterns. Algorithmically, it may be done either by exhaustive frequent-itemset-mining (as in
deterministic greedy datamining algorithms), or by stochastic greedy mining. These operations
should be carried out by an InferencePatternMiner MindAgent.
36.7 Evolution As an Inference Control Scheme
It is possible to use PEPL (Probabilistic Evolutionary Program Learning, such as MOSES)
as, in essence, an InferenceControl scheme. Suppose we are using an evolutionary learning
mechanism such as MOSES or PLEASURE [Goe08a] to evolve populations of predicates or
schemata. Recall that there are two ways to evaluate procedures in CogPrime : by inference
or by direct evaluation. Consider the case where inference is needed in order to provide high-
confidence estimates of the evaluation or execution relationships involved. Then, there is the
question of how much effort to spend on inference, for each procedure being evaluated as part
of the fitness evaluation process. Spending a small amount of effort on inference means that
one doesn't discover much beyond what's immediately apparent in the AtomSpace. Spending a
large amount of effort on inference means that one is trying very hard to use indirect evidence
to support conjectures regarding the evaluation or execution Links involved.
When one is evolving a large population of procedures, one can't afford to do too much
inference on each candidate procedure being evaluated. Yet, of course, doing more inference
may yield more accurate fitness evaluations, hence decreasing the number of fitness evaluations
required.
Often, a good heuristic is to gradually increase the amount of inference effort spent on
procedure evaluation, during the course of evolution. Specifically, one may make the amount
of inference effort roughly proportional to the overall population fitness. This way, initially,
evolution is doing a cursory search, not thinking too much about each possibility. But once it
has some fairly decent guesses in its population, then it starts thinking hard, applying more
inference to each conjecture.
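As a rough illustration of this heuristic, mean population fitness can be mapped to a per-candidate inference budget. The sketch below is a hedged illustration only; the linear mapping and the parameter names are assumptions, not part of MOSES.

```python
# Illustrative sketch (not actual MOSES code): scale the inference effort
# spent evaluating each candidate procedure in proportion to the overall
# population fitness, so evolution "thinks harder" as its guesses improve.

def inference_effort(population_fitness, min_effort=1, max_effort=100):
    """Map mean population fitness in [0, 1] to an inference-step budget.

    Early on (low mean fitness) the search is cursory; once the population
    contains fairly decent guesses, more inference is applied per candidate.
    """
    mean_fitness = sum(population_fitness) / len(population_fitness)
    return int(round(min_effort + mean_fitness * (max_effort - min_effort)))
```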
Since the procedures in the population are likely to be interrelated to each other, inferences
done on one procedure are likely to produce intermediate knowledge that's useful for doing
inference on other procedures. Therefore, what one has in this scheme is evolution as a control
mechanism for higher-order inference.
Combined with the use of evolutionary learning to achieve memory across optimization runs,
this is a very subtle approach to inference control, quite different from anything in the domain
of logic-based AI. Rather than guiding individual inference steps on a detailed basis, this type
of control mechanism uses evolutionary logic to guide the general direction of inference, pushing
the vast mass of exploratory inferences in the direction of solving the problem at hand, based
on a flexible usage of prior knowledge.
36.8 Incorporating Other Cognitive Processes into Inference
Hebbian inference control and inference pattern mining are valuable and powerful processes,
but they are not always going to be enough. The solution of some problems that CogPrime
chooses to address via inference will ultimately require the use of other methods, too. In these
cases, one workaround is for inference to call on other cognitive processes to help it out.
This is done via the forward or backward chaining agents identifying specific Atoms deserv-
ing of attention by other cognitive processes, and then spawning Tasks executing these other
cognitive processes on the appropriate Atoms.
Firstly, which Atoms should be selected for this kind of attention? What we want are Infer-
enceTreeNodes that:
• have high STI.
• have the potential to significantly change the overall truth value of the inference tree they are
embedded in (something that can be calculated by hypothetically varying the truth value of
the InferenceTreeNode and seeing how the truth value of the overall conclusion is affected).
• have truth values that are known with low confidence.
InferenceTreeNodes meeting these criteria should be taken as strong candidates for attention
by other cognitive processes.
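The three criteria above can be combined into a single candidate score, for instance multiplicatively. The Python sketch below is illustrative only; the attribute names, the impact measure, and the scaling constant are all hypothetical.

```python
# Hedged sketch of the selection criteria above: high STI, high impact on
# the tree's conclusion, low confidence. Field names are illustrative.

def attention_score(node, tree_impact, sti_scale=100.0):
    """Score an InferenceTreeNode as a candidate for attention by other
    cognitive processes. Higher scores mean stronger candidates.

    node.sti        -- short-term importance
    node.confidence -- confidence of its truth value, in [0, 1]
    tree_impact     -- |change in conclusion TV when node's TV is perturbed|
    """
    return (node.sti / sti_scale) * tree_impact * (1.0 - node.confidence)

def select_candidates(nodes, impact, top_n=3):
    """Return the top_n nodes by attention score; `impact` computes the
    hypothetical-perturbation measure described in the text."""
    scored = sorted(nodes, key=lambda n: attention_score(n, impact(n)),
                    reverse=True)
    return scored[:top_n]
```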
The next question is which other cognitive processes do we apply in which cases?
MOSES in supervised categorization mode can be applied to a candidate InferenceTreeNode
representing a CogPrime Node if it has a sufficient number of members (Atoms linked to it
by MemberLinks); and, a sufficient number of new members have been added to it (or have
had their membership degree significantly changed) since MOSES in supervised categorization
mode was used on it last.
Next, pattern mining can be applied to look for connectivity patterns elsewhere in the Atom-
Table, similar to the connectivity patterns of the candidate Atom, if the candidate Atom has
changed significantly since pattern mining last visited it.
More subtly, what if we try to find whether "cross breed" implies "Ugliness", and we know
that "bad genes" implies "Ugliness", but can't find a way, by backward chaining, to prove that
"cross breed" implies "bad genes"? Then we could launch a non-backward-chaining algorithm to
measure the overlap of SatisfyingSet(cross breed) and SatisfyingSet(bad genes). Specifically, we
could use MOSES in supervised categorization mode to find relationships characterizing "cross
breed" and other relationships characterizing "bad genes", and then do some forward chaining
inference on these relationships. This would be a general heuristic for what to do when there's
a link with low confidence but high potential importance to the inference tree.
SpeculativeConceptFormation (see Chapter 38) may also be used to create new concepts and
attempt to link them to the Atoms involved in an inference (via subsidiary inference processes,
or HebbianLink formation based on usage in learned procedures, etc.), so that they may be
used in inference.
36.9 PLN and Bayes Nets
Finally, we give some comments on the relationship between PLN and Bayes Nets [Pea88]. We
have not yet implemented such an approach, but it may well be that Bayes Nets methods can
serve as a useful augmentation to PLN for certain sorts of inference (specifically, for inference
on networks of knowledge that are relatively static in nature).
We can't use standard Bayes Nets as the primary way of structuring reasoning in CogPrime
because CogPrime's knowledge network is loopy. The peculiarities that allow standard Bayes
net belief propagation to work in standard loopy Bayes nets don't hold up in CogPrime, because
of the way one must update probabilities when managing a very large network in interaction
with a changing world, whose different parts receive different amounts of focus. So in PLN we
use a different mechanism (the "inference trail" mechanism) to avoid "repeated evidence
counting", whereas loopy Bayes nets rely on the fact that in the standard loopy Bayes net
configuration, extra evidence counting occurs in a fairly constant way across the network.
However, when you have within the AtomTable a set of interrelated knowledge items that you
know are going to be static for a while, and you want to be able to query them probabilistically,
then building a Bayes Net of some sort (i.e. "freezing" part of CogPrime's knowledge network
and mapping it into a Bayes Net) may be useful. I.e., one way to accelerate some PLN inference
would be:
1. Freeze a subnetwork of the AtomTable which is expected not to change a lot in the near
future
2. Interpret this subnetwork as a loopy Bayes net, and use standard Bayesian belief propaga-
tion to calculate probabilities based on it
This would be a highly efficient form of "background inference" in certain contexts. (Note
that this ideally requires an "indefinite Bayes net" implementation that propagates indefinite
probabilities through the standard Bayes-net local belief propagation algorithms, but this is
not mathematically problematic.)
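To make the two-step procedure concrete, here is a minimal Python sketch. For brevity it queries the "frozen" subnetwork by exact enumeration rather than loopy belief propagation, and its data structures are simple illustrative stand-ins, not CogPrime types.

```python
import itertools

# Sketch of the procedure above: "freeze" a stable subnetwork into a set of
# conditional probability tables, then query it probabilistically. Exact
# enumeration stands in for belief propagation purely for brevity.

def freeze(parents, cpt):
    """parents: {var: [parent vars]};
    cpt: {var: {parent-assignment tuple: P(var = 1)}}"""
    return {"parents": parents, "cpt": cpt}

def joint_prob(net, assignment):
    """Probability of one full 0/1 assignment under the frozen network."""
    p = 1.0
    for var, par in net["parents"].items():
        key = tuple(assignment[q] for q in par)
        p1 = net["cpt"][var][key]
        p *= p1 if assignment[var] == 1 else (1.0 - p1)
    return p

def query(net, target, evidence):
    """P(target = 1 | evidence) over the frozen subnetwork."""
    free = [v for v in net["parents"] if v not in evidence and v != target]
    num = den = 0.0
    for bits in itertools.product([0, 1], repeat=len(free)):
        a = dict(evidence, **dict(zip(free, bits)))
        for t in (0, 1):
            a[target] = t
            p = joint_prob(net, a)
            den += p
            if t == 1:
                num += p
    return num / den
```

In a real deployment the enumeration would of course be replaced by (indefinite-probability) local belief propagation, which is the whole point of freezing the subnetwork.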
On the other hand, if you have a very important subset of the Atomspace, then it may
be worthwhile to maintain a Bayes net modeling the conditional probabilities between these
Atoms, but with a dynamically updated structure.
Chapter 37
Pattern Mining
Co-authored with Jade O'Neill
37.1 Introduction
Having discussed inference in depth, we now turn to other, simpler but equally important ap-
proaches to creating declarative knowledge. This chapter deals with pattern mining - the
creation of declarative knowledge representing patterns among other knowledge (which may be
declarative, sensory, episodic, procedural, etc.) - and the following chapter deals with specula-
tive concept creation.
Within the scope of pattern mining, we will discuss two basic approaches:
• supervised learning: given a predicate, finding a pattern among the entities that satisfy that
predicate.
• unsupervised learning: undirected search for "interesting patterns".
The supervised learning case is easier and we have done a number of experiments using
MOSES for supervised pattern mining, on biological (microarray gene expression and SNP)
and textual data. In the CogPrime case, the "positive examples" are the elements of the Sat-
isfyingSet of the predicate P, and the "negative examples" are everything else. This can be a
relatively straightforward problem if there are enough positive examples and they actually share
common aspects ... but some trickiness emerges, of course, when the common aspects are, in
each example, complexly intertwined with other aspects.
The unsupervised learning case is considerably trickier. The main problem issue here regards
the definition of an appropriate fitness function. We are searching for "interesting patterns." So
the question is, what constitutes an interesting pattern?
We will also discuss two basic algorithmic approaches:
• program learning, via MOSES or hillclimbing
• frequent subgraph mining, using greedy algorithms
The value of these various approaches is contingent on the environment and goal set being
such that algorithms of this nature can actually recognize relevant patterns in the world and
mind. Fortunately, the everyday human world does appear to have the property of possessing
multiple relevant patterns that are recognizable using varying levels of sophistication and effort.
It has patterns that can be recognized via simple frequent pattern mining, and other patterns
that are too subtle for this, and are better addressed by a search-based approach. In order for
an environment and goal set to be appropriate for the learning and teaching of a human-level
AI, it should have the same property of possessing multiple relevant patterns recognizable using
varying levels of subtlety.
37.2 Finding Interesting Patterns via Program Learning
As one important case of pattern mining, we now discuss the use of program learning to find
"interesting" patterns in sets of Atoms.
Clearly, "interestingness" is a multidimensional concept. One approach to defining it is em-
pirical, based on observation of which predicates have and have not proved interesting to the
system in the past (based on their long-term importance values, i.e. LTI).
In this approach, one has a supervised categorization problem: learn a rule predicting whether
a predicate will fall into the interesting category or the uninteresting category. Once one has
learned this rule, expressed as a predicate itself, one can then use it
as the fitness function for evolutionary learning.
There is also a simpler approach, which defines an objective notion of interestingness. This
objective notion is a weighted sum of two factors:
• Compactness.
• Surprisingness of truth value.
Compactness is easy to understand: all else equal, a predicate embodied in a small Combo tree
is better than a predicate embodied in a big one. There is some work hidden here in Combo
tree reduction; ideally, one would like to find the smallest representation of a given Combo tree,
but this is a computationally formidable problem, so one necessarily approaches it via heuristic
algorithms.
Surprisingness of truth value is a slightly subtler concept. Given a Boolean predicate, one can
envision two extreme ways of evaluating its truth value (represented by two different types of
ProcedureEvaluator). One can use an IndependenceAssumingProcedureEvaluator, which deals
with all AND and OR operators by assuming probabilistic independence. Or, one can use an
ordinary EffortBasedProcedureEvaluator, which uses dependency information wherever feasible
to evaluate the truth values of AND and OR operators. These two approaches will normally
give different truth values but, how different? The more different, the more surprising is the
truth value of the predicate, and the more interesting may the predicate be.
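A minimal sketch of this surprisingness measure, for a simple two-term conjunction over Boolean data, might look as follows. It is illustrative only: real Combo predicates are arbitrary Boolean programs, and the two ProcedureEvaluator types are stood in for by a product formula and a direct count.

```python
# Sketch of "surprisingness of truth value": evaluate an AND under a
# probabilistic-independence assumption (IndependenceAssumingProcedureEvaluator
# style), evaluate it against the actual data (EffortBasedProcedureEvaluator
# style), and take the difference as the surprisingness.

def p_and_independent(p_a, p_b):
    """Independence-assuming estimate of P(a AND b)."""
    return p_a * p_b

def surprisingness(rows, a, b):
    """rows: list of dicts of Boolean (0/1) attributes; a, b: attribute names."""
    n = len(rows)
    p_a = sum(r[a] for r in rows) / n
    p_b = sum(r[b] for r in rows) / n
    p_actual = sum(r[a] and r[b] for r in rows) / n
    return abs(p_actual - p_and_independent(p_a, p_b))
```

For perfectly correlated attributes with marginal probability 0.5, the independence estimate is 0.25 while the actual conjunction probability is 0.5, giving a surprisingness of 0.25.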
In order to explore the power of this kind of approach in a simple context, we have tested
pattern mining using MOSES on Boolean predicates as a data mining algorithm on a number
of different datasets, including some interesting and successful work in the analysis of gene
expression data, and some more experimental work analyzing sociological data from the National
Longitudinal Survey of Youth (NLSY) (http://stats.bls.gov/nls/).
A very simple illustrative result from the analysis of the NLSY data is the pattern:
OR
    (NOT(MothersAge(X)) AND NOT(FirstSexAge(X)))
    (Wealth(X) AND PIAT(X))
where the domain of X are individuals, meaning that:
• being the child of a young mother correlates with having sex at a younger age;
• being in a wealthier family correlates with better Math (PIAT) scores;
• the two sets previously described tend to be disjoint.
Of course, many data patterns are several times more complex than the simple illustrative
pattern shown above. However, one of the strengths of the evolutionary learning approach to
pattern mining is its ability to find simple patterns when they do exist, yet without (like some
other mining methods) imposing any specific restrictions on the pattern format.
37.3 Pattern Mining via Frequent/Surprising Subgraph Mining
Probabilistic evolutionary learning is an extremely powerful approach to pattern mining, but,
may not always be realistic due to its high computational cost. A cheaper, though also weaker,
alternative is to use frequent subgraph mining algorithms such as [HWP03, KK01], which may
straightforwardly be adapted to hypergraphs such as the Atomspace.
Frequent subgraph mining is a port to the graph domain of the older, simpler idea of frequent
itemset mining, which we now briefly review. There are a number of algorithms in the latter
category; the classic is Apriori [AS94], and an alternative is Relim [Bor05], which is conceptually
similar but seems to give better performance.
The basic goal of frequent itemset mining is to discover frequent subsets ("itemsets") in a
group of sets, whose members are all drawn from some base set of items. One knows that for a
set of N items, there are 2^N - 1 possible nonempty subsets. The algorithm operates in several rounds.
Round i heuristically computes frequent i-itemsets (i.e. frequent sets containing i items). A
round has two steps: candidate generation and candidate counting. In the candidate generation
step, the algorithm generates a set of candidate i-itemsets whose support - the percentage of
sets in which the itemset appears - has not yet been computed. In the candidate-
counting step, the algorithm scans the database, counting the support of the candidate
itemsets. After the scan, the algorithm discards candidates with support lower than the specified
minimum (an algorithm parameter) and retains only the sufficiently frequent i-itemsets. The
algorithm reduces the number of tested subsets by pruning apriori those candidate itemsets that
cannot be frequent, based on the knowledge about infrequent itemsets obtained from previous
rounds. So, for instance, if {A, B} is a frequent 2-itemset then {A, B, C} will be considered as a
potential 3-itemset; conversely, if {A, B} is not a frequent itemset then {A, B, C}, as well
as any superset of {A, B}, will be discarded. Although the worst case of this sort of algorithm is
exponential, practical executions are generally fast, depending essentially on the support limit.
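The candidate-generation and candidate-counting rounds described above can be sketched compactly in Python. This is an unoptimized, illustrative rendering of the Apriori idea, not production mining code.

```python
from itertools import combinations

# Illustrative Apriori-style frequent itemset miner: alternate rounds of
# candidate counting and candidate generation, pruning any candidate that
# has an infrequent subset (the "apriori" trick described in the text).

def apriori(transactions, min_support):
    """transactions: list of sets of items; min_support: minimum fraction of
    transactions an itemset must appear in. Returns {itemset: support}."""
    n = len(transactions)
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while current:
        # Candidate counting: scan the database once per round.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: k / n for c, k in counts.items()
                     if k / n >= min_support}
        frequent.update(survivors)
        # Candidate generation: join surviving i-itemsets into (i+1)-itemsets,
        # discarding any candidate with an infrequent i-item subset.
        keys = list(survivors)
        current = set()
        for x, y in combinations(keys, 2):
            cand = x | y
            if len(cand) == len(x) + 1 and all(
                    frozenset(s) in survivors
                    for s in combinations(cand, len(cand) - 1)):
                current.add(cand)
    return frequent
```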
Frequent subgraph mining follows the same pattern, but instead of a set of items it deals with
a group of graphs. There are many frequent subgraph mining algorithms in the literature, but
the basic concept underlying nearly all of them is the same: first find small frequent subgraphs.
Then seek to find slightly larger frequent patterns encompassing these small ones. Then seek
to find slightly larger frequent patterns encompassing these, etc. This approach is much faster
than something like MOSES, although management of the large number of subgraphs to be
searched through can require subtle design and implementation of data structures.
If, instead of an ensemble of small graphs, one has a single large graph like the AtomSpace,
one can follow the same approach, via randomly subsampling from the large graph to find the
graphs forming the ensemble to be mined from; see [VH11] for a detailed treatment of this sort of
approach. The fact that the AtomSpace is a hypergraph rather than a graph doesn't fundamen-
tally affect the matter since a hypergraph may always be considered a graph via introduction
of an additional node for each hyperedge (at the cost of a potentially great multiplication of
the number of links).
Frequent subgraph mining algorithms appropriately deployed can find subgraphs which occur
repeatedly in the Atomspace, including subgraphs containing Atom-valued variables. Each such
subgraph may be represented as a PredicateNode, and frequent subgraph mining will find such
PredicateNodes that have surprisingly high truth values when evaluated across the Atomspace.
But unlike MOSES when applied as described above, such an algorithm will generally find such
predicates only in a "greedy" way.
For instance, a greedy subgraph mining algorithm would be unlikely to find
OR
    (NOT(MothersAge(X)) AND NOT(FirstSexAge(X)))
    (Wealth(X) AND PIAT(X))
as a surprising pattern in an AtomSpace, unless at least one (and preferably both) of
Wealth(X) AND PIAT(X)
and
NOT(MothersAge(X)) AND NOT(FirstSexAge(X))
were surprising patterns in that Atomspace on their own.
37.4 Fishgram
Fishgram is a simple example of an algorithm for finding patterns in an Atomspace, instantiating
the general concepts presented in the previous section. It represents patterns as conjunctions
(AndLink) of Links, which usually contain variables. It does a greedy search, so it can quickly
find many patterns. In contrast, algorithms like MOSES are designed to find a small number
of the best patterns. Fishgram works by finding a set of objects that have links in common,
so it will be most effective if the AtomSpace has a lot of raw data, with simple patterns. For
example, it can be used on the perceptions from the virtual world. There are predicates for
basic perceptions (e.g. what kind of object something is, objects being near each other, types
of blocks, and actions being performed by the user or the AI).
The details of the Fishgram code and design are not sufficiently general or scalable to serve as
a robust, omnipurpose pattern mining solution for CogPrime. However, Fishgram is nevertheless
interesting, as an existent, implemented and tested prototype of a greedy frequent/interesting
subhypergraph mining system. A more scalable analogous system, with a similar principle of
operation, has been outlined and is in the process of being designed at time of writing, but will
not be presented here.
37.4.1 Example Patterns
Here is some example output from Fishgram, when run on the virtual agent's memories.
(AndLink
    (EvaluationLink is_edible:PredicateNode (ListLink $1000041))
    (InheritanceLink $1000041 Battery:ConceptNode))
This means a battery which can be "eaten" by the virtual robot. The variable $1000041 refers
to the object (battery).
Fishgram can also find patterns containing a sequence of events. In this case, there is a list
of EvaluationLinks or InheritanceLinks which describe the objects involved, followed by the
sequence of events.
(AndLink
    (InheritanceLink $1007703 Battery:ConceptNode)
    (SequentialAndLink
        (EvaluationLink isHolding:PredicateNode (ListLink $1008725 $1007703))))
This means the agent was holding a battery. $1007703 is the battery, and there is also a
variable for the agent itself. Many interesting patterns involve more than one object. This
pattern would also include the user (or another AI) holding a battery, because the pattern does
not refer to the AI character specifically.
It can also find patterns where the agent performs an action and achieves a goal. There is code
to create implications based on these conjunctions: after finding many conjunctions, it can
produce ImplicationLinks based on some of them. Here is an example where the AI-controlled virtual
robot discovers how to get energy.
(ImplicationLink
    (AndLink
        (EvaluationLink is_edible:PredicateNode (ListLink $1011619))
        (InheritanceLink $1011619 Battery:ConceptNode))
    (PredictiveImplicationLink
        (EvaluationLink actionDone:PredicateNode (ListLink
            (ExecutionLink eat:GroundedSchemaNode
                (ListLink $1011619))))
        (EvaluationLink increased:PredicateNode (ListLink
            (EvaluationLink
                EnergyDemandGoal:PredicateNode)))))
37.4.2 The Fishgram Algorithm
The core Fishgram algorithm, in pseudocode, is as follows:
initial layer = every pair (relation, binding)
while previous layer is not empty:
foreach (conjunction, binding) in previous layer:
let incoming = all (relation, binding) pairs containing an object
in the conjunction
let possible_next_events = all (event, binding) pairs where
the event happens during or shortly
after the last event in conjunction
foreach (relation, relation_binding) in incoming
and possible_next_events:
(new_relation, new_conjunction_binding) =
    map_to_existing_variables(conjunction,
        binding, relation, relation_binding)
if new_relation is already in conjunction, skip it
new_conjunction = conjunction + new_relation
if new_conjunction has been found already, skip it
otherwise, add new_conjunction to the current layer
map_to_existing_variables(conjunction, conjunction_binding,
relation, relation_binding)
r', s' = a copy of the relation and binding using new variables
foreach variable v, object o in relation_binding:
foreach variable v2, object o2 in conjunction_binding:
if o == o2:
change r' and s' to use v2 instead of v
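The map_to_existing_variables step above can be made concrete with a small runnable sketch, using a simplified representation in which relations are tuples and bindings are dictionaries (illustrative only; the real Fishgram data structures differ).

```python
# Runnable sketch of the map_to_existing_variables step in the pseudocode
# above: when a new relation's binding maps a fresh variable to an object
# already bound in the growing conjunction, reuse the existing variable so
# the pattern stays connected.

def map_to_existing_variables(conj_binding, relation, rel_binding):
    """conj_binding / rel_binding: {variable: object}; relation: a tuple
    whose elements may be variables. Returns (renamed_relation, merged_binding)."""
    rename = {}
    for v, obj in rel_binding.items():
        for v2, obj2 in conj_binding.items():
            if obj == obj2:
                rename[v] = v2        # reuse the conjunction's variable
    renamed = tuple(rename.get(t, t) for t in relation)
    merged = dict(conj_binding)
    for v, obj in rel_binding.items():
        merged[rename.get(v, v)] = obj
    return renamed, merged
```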
37.4.3 Preprocessing
There are several preprocessing steps to make it easier for the main Fishgram search to find
patterns. There is a list of things that have to be variables. For example, any predicate that
refers to an object (including agents) will be given a variable so it can refer to any object. Other
predicates or InheritanceLinks can be added to a pattern, to restrict it to specific kinds of
objects, as shown above. So there is a step which goes through all of the links in the AtomSpace
and records a list of predicates with variables, such as "X is red" or "X eats Y". This makes the
search part simpler, because it never has to decide whether something should be a variable or
a specific object.
There is also a filter system, so that things which seem irrelevant can be excluded from the
search. There is a combinatorial explosion as patterns become larger. Some predicates may be
redundant with each other, or known not to be very useful. It can also try to find only patterns
in the Al's "attentional focus", which is much smaller than the whole AtomSpace.
The Fishgram algorithm cannot currently handle patterns involving numbers, although it
could be extended to do so. One option would be a separate discretization
step, creating predicates for different ranges of a value; another would be predicates for
mathematical operators. It would be possible to search for a "split point" as in decision trees.
So a number would be chosen, and only things above that value (or only things below that value)
would count for a pattern. It would also be possible to have multiple numbers in a pattern,
and compare them in various ways. It is uncertain how practical this would be in Fishgram.
MOSES is good for finding numeric patterns, so it may be better to simply use those patterns
inside Fishgram.
The "increased" predicate is added by a preprocessing step. The goals have a fuzzy TruthValue
representing how well the goal is achieved at any point in time, so e.g. the EnergyDemandGoal
represents how much energy the virtual robot has at some point in time. The predicate records
times that a goal's TruthValue increased. This only happens immediately after doing something
to increase it, which helps avoid finding spurious patterns.
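This preprocessing step might be sketched as follows, treating a goal's TruthValue history as a simple time series. The representation and the epsilon threshold are assumptions for illustration; CogPrime stores the results as "increased" predicate Atoms rather than a Python list.

```python
# Sketch of the preprocessing step described above: scan a goal's fuzzy
# TruthValue time series and record the times at which it increased, so
# that "increased(goal)" events can be matched against preceding actions.

def increase_events(samples, epsilon=1e-6):
    """samples: list of (time, truth_value) pairs, in time order.
    Returns the times at which the goal's value rose (beyond epsilon)."""
    events = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 > v0 + epsilon:
            events.append(t1)
    return events
```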
37.4.4 Search Process
Fishgram search is breadth-first. It starts with all predicates (or InheritanceLinks) found by the
preprocessing step. Then it finds pairs of predicates involving the same variable. Then they are
extended to conjunctions of three predicates, and so on. Many relations apply at a specific time,
for example the agent being near an object, or an action being performed. These are included
in a sequence, and are added in the order they occurred.
Fishgram remembers the examples for each pattern. If there is only one variable in the
pattern, an example is a single object; otherwise each example is a vector of objects for each
variable in the pattern. Each time a relation is added to a pattern, if it has no new variables,
some of the examples may be removed, because they don't satisfy the new predicate. It needs
to have at least one variable in common with the previous relations. Otherwise the patterns
would combine many unrelated things.
In frequent itemset mining (for example the APRIORI algorithm), there is effectively one
variable, and adding a new predicate will often decrease the number of items that match. It can
never increase it. The number of possible conjunctions increases with the length, up to some
point, after which it decreases. But when mining for patterns with multiple objects there is
a much larger combinatorial explosion of patterns. Various criteria can be used to prune the
search.
The most basic criterion is the frequency. Only patterns with at least N examples will be
included, where N is an arbitrary constant. You can also set a maximum number of patterns
allowed for each length (number of relations), and only include the best ones. The next level of
the breadth-first search will only search for extensions of those patterns.
One can also use a measure of statistical interestingness, to make sure the relations in a
pattern are correlated with each other. There are many spurious frequent patterns, because
anything which is frequent will occur together with other things, whether they are relevant or
not. For example "breathing while typing" is a frequent pattern, because people breathe at all
times. But "moving your hands while typing" is a much more interesting pattern. As people
only move their hands some of the time, a measure of correlation would prefer the second
pattern. The best measure may be interaction information, which is a generalisation of mutual
information that applies to patterns with more than two predicates. An early-stage AI would
not have much knowledge of cause and effect, so it would rely on statistical measures to find
useful patterns.
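For three Boolean predicates, interaction information can be computed from co-occurrence counts as below. Note that sign conventions for interaction information vary in the literature; this sketch uses II(X;Y;Z) = I(X;Y|Z) - I(X;Y), under which the XOR relationship scores 1 bit even though every pairwise mutual information is zero.

```python
from math import log2
from collections import Counter

# Hedged sketch of interaction information for three Boolean predicates,
# computed from joint entropies over a list of observation rows.

def entropy(rows, vars_):
    """Joint Shannon entropy (in bits) of the given variables over rows."""
    counts = Counter(tuple(r[v] for v in vars_) for r in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def interaction_information(rows, x, y, z):
    """II(X;Y;Z) = I(X;Y|Z) - I(X;Y), expanded into joint entropies:
    H(XY) + H(XZ) + H(YZ) - H(X) - H(Y) - H(Z) - H(XYZ)."""
    h = lambda *vs: entropy(rows, vs)
    return (h(x, y) + h(x, z) + h(y, z)
            - h(x) - h(y) - h(z) - h(x, y, z))
```

A scoring function like this would prefer "moving your hands while typing" over "breathing while typing", since the former's relations are genuinely correlated while the latter's are not.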
37.4.5 Comparison to other algorithms
Fishgram is more suitable for OpenCogPrime's purposes than existing graph mining algorithms,
most of which were designed with molecular datasets in mind. The OpenCog AtomSpace is a
different graph in various ways. For one, there are many possible relations between nodes (much
like in a semantic network). Many relations involve more than two objects, and there are also
property predicates about a single object. So the relations are effectively directed links of
varying arity. It also has events, and many states can change over time (e.g. an egg changes
state while it's cooking). Fishgram is designed for general knowledge in an embodied agent.
There are other major differences. Fishgram uses a breadth-first search, rather than depth-
first search like most graph mining algorithms. And it does an "embedding-based" search, search-
ing for patterns that can be embedded multiple times in a large graph. Molecular datasets have
many separate graphs for separate molecules, but the embodied perceptions are closer to a
single, fairly well-connected graph. Depth-first search would be very slow on such a graph, as
there are many very long paths through the graph, and the search would mostly find those,
whereas the useful patterns tend to be compact and repeated many times.
Lastly, the design of Fishgram makes it easy to experiment with multiple different scoring
functions, from simple ones like frequency to much more sophisticated functions such as inter-
action information.
As mentioned above, the current implementation of Fishgram is not sufficiently scalable to
be utilized for general-purpose Atomspaces. The underlying data structure within Fishgram,
used to store recognized patterns, would need to be replaced, which would lead to various
other modifications within the algorithm. But the general principle and approach illustrated
by Fishgram will persist in any more scalable reimplementation.
Chapter 38
Speculative Concept Formation
38.1 Introduction
One of the hallmarks of general intelligence is its capability to deal with novelty in its envi-
ronment and/or goal-set. And dealing with novelty intrinsically requires creating novelty. It's
impossible to efficiently handle new situations without creating new ideas appropriately. Thus,
in any environment complex and dynamic enough to support human-like general intelligence
(or any other kind of highly powerful general intelligence), the creation of novel ideas will be
paramount. New idea creation takes place in OpenCog via a variety of methods - e.g. inside
MOSES which creates new program trees, PLN which creates new logical relationships, ECAN
which creates new associative relationships, etc. But there is also a role for explicit, purposeful
creation of new Atoms representing new concepts, outside the scope of these other learning
mechanisms.
The human brain gets by, in adulthood, without creating many new neurons - although neurogenesis does occur on an ongoing basis. But this is achieved only via great redundancy, because for the brain it's cheaper to maintain a large number of neurons in memory at the same time than to create and delete neurons. Things are different in a digital computer: memory is more expensive, but creation and deletion of objects is cheaper. Thus in CogPrime, forgetting
and creation of Atoms is a regularly occurring phenomenon. In this chapter we discuss a key
class of mechanisms for Atom creation, "speculative concept formation." Further methods will
be discussed in following chapters.
The philosophy underlying CogPrime's speculative concept formation is that new things
should be created from pieces of good old things (a form of "evolution", broadly construed), and
that probabilistic extrapolation from experience should be used to guide the creation of new
things (inference). It's clear that these principles are necessary for the creation of new mental forms, but it's not obvious that they're sufficient: this is a nontrivial hypothesis, which may also
be considered a family of hypotheses since there are many different ways to do extrapolation
and intercombination. In the context of mind-world correspondence, the implicit assumption
underlying this sort of mechanism is that the relevant patterns in the world can often be
combined to form other relevant patterns. The everyday human world does quite markedly
display this kind of combinatory structure, and such a property seems basic enough that it's
appropriate for use as an assumption underlying the design of cognitive mechanisms.
In CogPrime we have introduced a variety of heuristics for creating new Atoms - especially
ConceptNodes - which may then be reasoned on and subjected to implicit (via attention allo-
cation) and explicit (via the application of evolutionary learning to predicates obtained from concepts via "concept predicatization") evolution. Among these are the node logical operators
described in the book Probabilistic Logic Networks, which allow the creation of new concepts
via AND, OR, XOR and so forth. However, logical heuristics alone are not sufficient. In this
chapter we will review some of the nonlogical heuristics that are used for speculative concept
formation. These operations play an important role in creativity - to use cognitive-psychology
language, they are one of the ways that CogPrime implements the process of blending, which
Fauconnier and Turner (2002) have argued is key to human creativity on many different levels.
Each of these operations may be considered as implicitly associated with a hypothesis that,
in fact, the everyday human world tends to assign utility to patterns that are combinations of
other patterns produced via said operation.
An evolutionary perspective may also be useful here, on a technical level as well as philo-
sophically. As noted in The Hidden Pattern and hinted at in Chapter 3 of Part 1, one way
to think about an AGI system like CogPrime is as a huge evolving ecology. The AtomSpace
is a biosphere of sorts, and the mapping from Atom types into species has some validity to
it (though not complete accuracy: Atom types do not compete with each other; but they do
reproduce with each other, and according to most of the reproduction methods in use, Atoms
of differing type cannot cross-reproduce). Fitness is defined by importance. Reproduction is
defined by various operators that produce new Atoms from old, including the ones discussed in
this chapter, as well as other operators such as inference and explicit evolutionary operators.
New ConceptNode creation may be triggered by a variety of circumstances. If two ConceptN-
odes are created for different purposes, but later the system finds that most of their meanings
overlap, then it may be more efficient to merge the two into one. On the other hand, a node may
become overloaded with different usages, and it is more useful to split it into multiple nodes,
each with a more consistent content. Finally, there may be patterns across large numbers of
nodes that merit encapsulation in individual nodes. For instance, if there are 1000 fairly similar
ConceptNodes, it may be better not to merge them all together, but rather to create a single
node to which they all link, reifying the category that they collectively embody.
In the following sections, we will begin by describing operations that create new ConceptN-
odes from existing ones on a local basis: by mutating individual ConceptNodes or combining
pairs of ConceptNodes. Some of these operations are inspired by evolutionary operators used
in the GA, others are based on the cognitive psychology concept of "blending." Then we will
turn to the use of clustering and formal concept analysis algorithms inside CogPrime to refine
the system's knowledge about existing concepts, and create new concepts.
38.2 Evolutionary Concept Formation
A simple and useful way to combine ConceptNodes is to use GA-inspired evolutionary operators:
crossover and mutation. In mutation, one replaces some number of a Node's links with other
links in the system. In crossover, one takes two nodes and creates a new node containing some
links from one and some links from another.
More concretely, to cross over two ConceptNodes X and Y, one may proceed as follows (in short, clustering the union of X's and Y's links):
• Create a series of empty nodes Z1, Z2, ..., Zk
• Form a "link pool" consisting of all X's links and all Y's links, and then divide this pool
into clusters (clustering algorithms will be described below).
• For each cluster with significant cohesion, allocate the links in that cluster to one of the
new nodes Zi
On the other hand, to mutate a ConceptNode, a number of different mutation processes are
reasonable. For instance, one can
• Cluster the links of a Node, and remove one or more of the clusters, creating a node with fewer links
• Cluster the links, remove one or more clusters, and then add new links that are similar to
the links in the remaining clusters
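The crossover and mutation operations above can be sketched concretely in Python. Here a concept is reduced to a dict mapping link targets to weights, and a trivial round-robin partition stands in for a real link-clustering algorithm; all names (`concept_crossover`, `mutate_concept`, `cluster_links`) are illustrative assumptions, not actual OpenCog API.

```python
# Toy sketch of ConceptNode crossover and mutation over a link-pool.
import random

def cluster_links(links, n_clusters):
    """Trivially partition links into n_clusters round-robin groups
    (a stand-in for a real clustering algorithm)."""
    clusters = [dict() for _ in range(n_clusters)]
    for i, (target, weight) in enumerate(sorted(links.items())):
        clusters[i % n_clusters][target] = weight
    return [c for c in clusters if c]

def concept_crossover(x_links, y_links, k=2):
    """Pool the links of X and Y, cluster the pool, and spawn one child
    concept per cluster."""
    pool = dict(x_links)
    pool.update(y_links)
    return cluster_links(pool, k)

def mutate_concept(links, drop_fraction=0.3, rng=None):
    """Remove a random subset of a concept's links (the cluster-removal
    mutation, simplified here to random removal)."""
    rng = rng or random.Random(0)
    return {t: w for t, w in links.items() if rng.random() > drop_fraction}
```

In a real implementation the clustering step would be the same algorithm used for ConceptNode clustering later in this chapter, and cluster "cohesion" would gate which clusters become child nodes.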
The EvolutionaryConceptFormation MindAgent selects pairs of nodes from the system,
where the probability of selecting a pair is determined by
• the average importance of the pair
• the degree of similarity of the pair
• the degree of association of the pair
(Of course, other heuristics are possible too). It then crosses over the pair, and mutates the
result.
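The pair-selection heuristic can be rendered as roulette-wheel selection over candidate pairs, weighting each pair by a product of its mean importance, similarity and association. The product is just one of many reasonable ways to combine the three factors, and the function names here are hypothetical, not CogPrime API.

```python
# Sketch of MindAgent pair selection; all scores assumed to lie in [0,1].
def pair_selection_weight(importance_a, importance_b, similarity, association):
    mean_importance = (importance_a + importance_b) / 2.0
    return mean_importance * similarity * association

def select_pair(pairs, rng):
    """Roulette-wheel selection over candidate pairs.
    `pairs` maps (nodeA, nodeB) -> (impA, impB, sim, assoc)."""
    weights = {p: pair_selection_weight(*scores) for p, scores in pairs.items()}
    total = sum(weights.values())
    r = rng.random() * total
    for p, w in weights.items():
        r -= w
        if r <= 0:
            return p
    return p  # fallback for floating-point leftovers
```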
Note that, unlike in some GA implementations, the parent node(s) are retained within the
system; they are not replaced by the children. Regardless of how many offspring they generate
by what methods, and regardless of their age, all Nodes compete and cooperate freely forever
according to the fitness criterion defined by the importance updating function. The entire
AtomSpace may be interpreted as a large evolutionary, ecological system, and the action of
CogPrime dynamics, as a whole, is to create fit nodes.
A more advanced variant of the EvolutionaryConceptFormation MindAgent would adapt its
mutation rate in a context-dependent way. But our intuition is that it is best to leave this kind
of refinement for learned cognitive schemata, rather than to hard-wire it into a MindAgent.
To encourage the formation of such schemata, one may introduce elementary schema functions
that embody the basic node-level evolutionary operators:
ConceptNode ConceptCrossover(ConceptNode A, ConceptNode B)
ConceptNode mutate(ConceptNode A, mutationAmount m)
There will also be a role for more abstract schemata that utilize these. An example cognitive
schema of this sort would be one that said: "When all my schema in a certain context seem
unable to achieve their goals, then maybe I need new concepts in this context, so I should
increase the rate of concept mutation and crossover, hoping to trigger some useful concept
formation."
As noted above, this component of CogPrime views the whole AtomSpace as a kind of
genetic algorithm - but the fitness function is "ecological" rather than fixed, and of course
the crossover and mutation operators are highly specialized. Most of the concepts produced
through evolutionary operations are going to be useless nonsense, but these will be recognized by the importance updating process and subsequently forgotten from the system. The useful ones will
link into other concepts and become ongoing aspects of the system's mind. The importance
updating process amounts to fitness evaluation, and it depends implicitly on the sum total of
the cognitive processes going on in CogPrime.
To ensure that importance updating properly functions as fitness evaluation, it is critical
that evolutionarily-created concepts (and other speculatively created Atoms) always comprise
a small percentage of the total concepts in the system. This guarantees that importance will
serve as a meaningful "fitness function" for newly created ConceptNodes. The reason for this
is that the importance measures how useful the newly created node is, in the context of the
previously existing Atoms. If there are too many speculative, possibly useless new ConceptNodes
in the system at once, the importance becomes an extremely noisy fitness measure, as it's
largely measuring the degree to which instances of new nonsense fit in with other instances
of new nonsense. One may find interesting self-organizing phenomena in this way, but in an
AGI context we are not interested in undirected spontaneous pattern-formation, but rather in
harnessing self-organizing phenomena toward system goals. And the latter is achieved by having
a modest but not overwhelming amount of speculative new nodes entering into the system.
Finally, as discussed earlier, evolutionary operations on maps may occur naturally and au-
tomatically as a consequence of other cognitive operations. Maps are continually mutated due
to fluctuations in system dynamics; and maps may combine with other maps with which they
overlap, as a consequence of the nonlinear properties of activation spreading and importance
updating. Map-level evolutionary operations are not closely tied to their Atom-level counter-
parts (a difference from e.g. the close correspondence between map-level logical operations and
underlying Atom-level logical operations).
38.3 Conceptual Blending
The notion of Conceptual Blending (aka Conceptual Integration) was proposed by Gilles Fauconnier and Mark Turner [FT02] as a general theory of cognition. According to this theory, the basic operation of creative thought is the "blend", in which elements and relationships from diverse scenarios are merged together in a judicious way. As a very simple example, we may consider the blend of "tower" and "snake" to form a new concept of "snake tower" (a tower that looks somewhat like a snake). However, most examples of blends will not be nearly so obvious.
For instance, the complex numbers could be considered a blend between 2D points and real
numbers. Figure 38.1 gives a conceptual illustration of the blending process.
The production of a blend is generally considered to have three key stages (elucidated via
the example of building a snake-tower out of blocks):
• composition: combining judiciously chosen elements from two or more concept inputs
- Example: Taking the "buildingness" and "verticalness" of a tower, and the "head" and
"mouth" and "tail" of a snake
• completion: adding new elements from implicit background knowledge about the concept
inputs
— Example: Perhaps a mongoose-building will be built out of blocks, poised in a position indicating it is chasing the snake-tower (incorporating the background knowledge that mongooses often chase snakes)
• elaboration: fine-tuning, which shapes the elements into a new concept, guided by the desire
to optimize certain criteria
Fig. 38.1: Conceptual Illustration of Conceptual Blending
— Example: The tail of the snake-tower is a part of the building that rests on the ground,
and connects to the main tower. The head of the snake-tower is a portion that sits atop
the main tower, analogous to the restaurant atop the Space Needle.
The "judiciousness" in the composition phase may be partially captured in CogPrime via
PLN inference, via introducing a "consistency criterion" that the elements chosen as part of
the blend should not dramatically decrease in confidence after the blend's relationships are
submitted to PLN inference. One especially doesn't want to choose mutually contradictory
elements from the two inputs. For instance one doesn't want to choose "alive" as an element
of "snake", and "non-living" as an element of "building." This kind of contradictory choice can
be ruled out by inference, because after very few inference steps, this choice would lead to a
drastic confidence reduction for the InheritanceLinks to both "alive" and "non-living."
Aside from consistency, some other criteria considered relevant to evaluating the quality of a blend are:
• topology principle that relations in the blend should match the relations of their counterparts
in other concepts related to the concept inputs
• web principle that the representation in the blended space should maintain mappings to the
concept inputs
• unpacking principle that, given a blended concept, the interpreter should be able to infer
things about other related concepts
• good reason principle that there should be simple explanations for the elements of the blend
• metonymic tightening that when metonymically related elements are projected into the
blended space, there is pressure to compress the "distance" between them.
While vague-sounding in their verbal formulations, these criteria have been computationally implemented in the Sapper system, which uses blending theory to model analogy and metaphor [VlC91, VO07]; and in a different form in the framework for computational creativity of [rat-06]. In
CogPrime terms, these various criteria essentially boil down to: the new, blended concept
should get a lot of interesting links.
One could implement blending in CogPrime very straightforwardly via an evolutionary ap-
proach: search the space of possible blends, evaluating each one according to its consistency
but also the STI that it achieves when released into the Atomspace. However, this will be quite
computationally expensive, so a wiser approach is to introduce heuristics aimed at increasing
the odds of producing important blends.
A simple heuristic is to calculate, for each candidate blend, the amount of STI that the
blend would possess N cycles later if, at the current time, it was given a certain amount of STI.
A blend that would accumulate more STI in this manner may be considered more promising,
because this means that its components are more richly interconnected. Further, this heuristic
may be used as a guide for greedy heuristics for creating blends: e.g. if one has chosen a certain
element A of the first blend input, then one may seek an element B of the second blend input
that has a strong Hebbian link to A (if such a B exists).
However, it may also be interesting to pursue different sorts of heuristics, using information-
theoretic or other mathematical criteria to preliminarily filter possible blends before they are
evaluated more carefully via integrated cognition and importance dynamics.
38.3.1 Outline of a CogPrime Blending Algorithm
A rough outline of a concept blending algorithm for CogPrime is as follows:
• Choose a pair of concepts C1 and C2, which have a nontrivially-strong HebbianLink between
them, but not an extremely high-strength SimilarityLink between them (i.e. the concepts
should have something to do with each other, but not be extremely similar; blends of
extremely similar things are boring). These parameters may be twiddled.
• Form a new concept C3, which has some of C1's links, and some of C2's links
• If C3 has obvious contradictions, resolve them by pruning links. (For instance, if C1 inherits from alive to degree .9 and C2 inherits from alive to degree .1, then one of these two TruthValue versions for the InheritanceLink from alive has got to be pruned...)
• For each of C3's remaining links L, make a vector indicating everything it or its targets are associated with (via HebbianLinks or other links). This is basically a list of "what's related to L". Then, assess whether there are a lot of common associations between the links L that came from C1 and the links L that came from C2
• If the filter in step 4 is passed, then let the PLN forward chainer derive some conclusions
about C3, and see if it comes up with anything interesting (e.g. anything with surprising
truth value, or anything getting high STI, etc.)
Steps 1 and 2 should be repeated over and over. Step 5 is basically "cognition as usual" - i.e.
by the time the blended concept is thrown into the Atomspace and subjected to Step 5, it's
being treated the same as any other ConceptNode.
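The first three steps of the outline can be made concrete in a few lines of Python. Concepts are reduced to dicts of links (target -> strength); the thresholds and the contradiction test (same target, very different strengths) are illustrative assumptions.

```python
# Sketch of Steps 1-3 of the blending meta-algorithm.
def candidate_pair_ok(hebbian_strength, similarity,
                      min_hebbian=0.3, max_similarity=0.8):
    # Step 1: related, but not nearly identical
    return hebbian_strength >= min_hebbian and similarity <= max_similarity

def blend(c1, c2, contradiction_gap=0.6):
    # Step 2: the child takes links from both parents
    c3 = dict(c1)
    c3.update(c2)
    # Step 3: prune obvious contradictions -- where both parents link to the
    # same target with very different strengths, keep only the first's value
    for target in set(c1) & set(c2):
        if abs(c1[target] - c2[target]) > contradiction_gap:
            c3[target] = c1[target]
    return c3
```

Steps 4 and 5 (association-vector filtering and PLN elaboration) would then operate on the surviving candidates.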
The above is more of a meta-algorithm than a precise algorithm. Many avenues for variation
exist, including
• Step 1: heuristics for choosing what to try to blend
• Step 3: how far do we go here, at removing contradictions? Do we try simple PLN inference
to see if contradictions are unveiled, or do we just limit the contradiction-check to seeing if
the same exact link is given different truth-values?
• Step 4: there are many different ways to build this association-vector. There are also many ways to measure whether a set of association-vectors demonstrates "common associations". Interaction information [Bel03] is one fancy way; there are also simpler ones.
• Step 5: there are various ways to measure whether PLN has come up with anything inter-
esting
38.3.2 Another Example of Blending
To illustrate these ideas further, consider the example of the SUV - a blend of "Car" and "Jeep". Among the relevant properties of Car are:
• appealing to ordinary consumers
• fuel efficient
• fits in most parking spots
• easy to drive
• 2 wheel drive
Among the relevant properties of Jeep are:
• 4 wheel drive
• rugged
• capable of driving off road
• high clearance
• open or soft top
Obviously, if we want to blend Car and Jeep, we need to choose properties of each that
don't contradict each other. We can't give the Car/Jeep both 2 wheel drive and 4 wheel drive.
4 wheel drive wins for Car/Jeep because sacrificing it would get rid of "capable of driving off
road", which is critical to Jeep-ness; whereas sacrificing 2WD doesn't kill anything that's really
critical to car-ness.
On the other hand, having a soft top would really harm "appealing to consumers", which from the view of car-makers is a big part of being a successful car. But giving up the soft top doesn't really harm other aspects of jeep-ness in any serious way.
However, what really made the SUV successful was that "rugged" and "high clearance"
turned out to make SUVs look funky to consumers, thus fulfilling the "appealing to ordinary
consumers" feature of Car. In other words, the presence of the links
• looks funky —> appealing to ordinary consumers
• rugged & high clearance —> looks funky
made a big difference. This is the sort of thing that gets figured out once one starts doing PLN
inference on the links associated with a candidate blend.
However, if one views each feature of the blend as a probability distribution over concept space - for instance, indicating how closely associated each concept is with that feature (e.g. via HebbianLinks) - then we see that the mutual information (and more generally, interaction information) between the features of the blend is a quick estimate of how likely it is that inference will lead to interesting conclusions via reasoning about the combination of features that the blend possesses.
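As a concrete rendering of this estimate, one can compute mutual information from a joint co-association table over a pair of blend features (each cell giving the normalized probability that a concept is associated with both features). This is an illustrative proxy for the scoring described above, not an exact specification of it.

```python
# Mutual information of a joint distribution over two blend features.
import math

def mutual_information(p_joint):
    """p_joint: 2D list, a normalized joint distribution over
    (feature1 association, feature2 association)."""
    px = [sum(row) for row in p_joint]
    py = [sum(col) for col in zip(*p_joint)]
    mi = 0.0
    for i, row in enumerate(p_joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (px[i] * py[j]))
    return mi
```

Independent features score zero; features whose associations coincide (as "rugged & high clearance" coincided with "looks funky") score higher, flagging the blend as worth handing to PLN.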
38.4 Clustering
Next, a different method for creating new ConceptNodes in CogPrime is to use clustering algorithms. There are many different clustering algorithms in the statistics and data mining literature, and no doubt many of them could have value inside CogPrime. We have experimented with several different clustering algorithms in the CogPrime context, and have selected one, which we call Omniclust [GCPM06], based on its generally robust performance on high-volume, noisy data. However, other methods such as EM (Expectation-Maximization) clustering [WF05] would likely serve the purpose very well also.
In the above discussion on evolutionary concept creation, we mentioned the use of a cluster-
ing algorithm to cluster links. The same algorithm we describe here for clustering ConceptNodes
directly and creating new ConceptNodes representing these clusters, can also be used for clus-
tering links in the context of node mutation and crossover.
The application of Omniclust or any other clustering algorithm for ConceptNode creation
in CogPrime is simple. The clustering algorithm is run periodically, and the most significant
clusters that it finds are embodied as ConceptNodes, with InheritanceLinks to their members.
If these significant clusters have subclusters also identified by Omniclust, then these subclusters
are also made into ConceptNodes, etc., with InheritanceLinks between clusters and subclusters.
Clustering technology is famously unreliable, but this unreliability may be mitigated some-
what by using clusters as initial guesses at concepts, and using other methods to refine the
clusters into more useful concepts. For instance, a cluster may be interpreted as a disjunctive
predicate, and a search may be made for sub-disjunctions about which interesting PLN conclusions may be drawn.
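The reification step is mechanically simple, as the following sketch shows. The Atomspace here is a plain list of link tuples and the significance thresholds are illustrative assumptions; a real implementation would create actual ConceptNodes and InheritanceLinks.

```python
# Turn significant clusters into new cluster-concepts with inheritance
# links from their members.
def reify_clusters(clusters, min_size=3, min_cohesion=0.5):
    """clusters: list of (member_list, cohesion) pairs.
    Returns (link_type, source, target, strength) tuples."""
    atoms = []
    for idx, (members, cohesion) in enumerate(clusters):
        if len(members) < min_size or cohesion < min_cohesion:
            continue  # only significant clusters become concepts
        concept = "Cluster_%d" % idx
        for m in members:
            # each member inherits from the new cluster-concept
            atoms.append(("InheritanceLink", m, concept, cohesion))
    return atoms
```

Subclusters would be handled the same way, with an additional InheritanceLink from each subcluster-concept to its parent cluster-concept.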
38.5 Concept Formation via Formal Concept Analysis
Another approach to concept formation is an uncertain version of Formal Concept Analysis [GSW05]. There are many ways to create such a version; here we describe one approach we have found interesting, called Fuzzy Concept Formation (FCF).
The general formulation of FCF begins with n objects O_1, ..., O_n, and m basic attributes a_1, ..., a_m, together with information that object O_i possesses attribute a_j to degree w_ij ∈ [0,1]. In CogPrime, the objects and attributes are Atoms, and w_ij is the strength of the InheritanceLink pointing from O_i to a_j.
In this context, we may define a concept as a fuzzy set of objects, and a derived attribute
as a fuzzy set of attributes.
Fuzzy concept formation (FCF) is, then, a process that produces N "concepts" C_{n+1}, ..., C_{n+N} and M "derived attributes" d_{m+1}, ..., d_{m+M}, based on the initial set of objects and attributes.
We can extend the weight matrix w_ij to include entries involving concepts and derived attributes as well, so that e.g. w_{n+3, m+5} indicates the degree to which concept C_{n+3} possesses derived attribute d_{m+5}.
The learning engine underlying FCF is a clustering algorithm clust(X_1, ..., X_r; b), which takes in r vectors X_i ∈ [0,1]^n and outputs b or fewer clusters of these vectors. The overall FCF process is independent of the particular clustering algorithm involved, though the interestingness of the concepts and attributes formed will of course vary widely based on the specific clustering algorithm. Some clustering algorithms will work better with large values of b, others with smaller values of b.
We then define the process form_concepts(b) to operate as follows. Given a set S = {S_1, ..., S_h} containing objects, concepts, or a combination of objects and concepts, each S_i has an associated attribute vector w_i with entries in [0,1]. One applies clust to find b clusters of these attribute vectors: B_1, ..., B_b. Each of these clusters may be considered as a fuzzy set, for instance by considering the membership of x in cluster B to be 2^{-d(x, centroid(B))} for an appropriate metric d. These fuzzy sets are the b concepts produced by form_concepts(b).
38.5.1 Calculating Membership Degrees of New Concepts
The degree to which a concept defined in this way possesses an attribute may be defined in a number of ways; maybe the simplest is weighted-summing the degree to which the members of the concept possess the attribute. For instance, to figure out the degree to which beautiful women (a concept) are insane (an attribute), one would calculate

( Σ_{w ∈ beautiful_women} χ_{beautiful_women}(w) χ_{insane}(w) ) / ( Σ_{w ∈ beautiful_women} χ_{beautiful_women}(w) )

where χ_X(w) denotes the fuzzy membership degree of w in X. One could probably also consider this to be the strength of (ExtensionalInheritance beautiful_women insane).
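The weighted-sum definition translates directly into code. The representation here (plain dicts from objects to fuzzy membership degrees) is an illustrative stand-in for Atoms and link strengths.

```python
# Degree to which a concept (fuzzy set of objects) possesses an attribute:
# the membership-weighted mean of the attribute over the concept's members.
def concept_has_attribute(concept_membership, attribute_membership):
    """Both arguments: dict object -> fuzzy membership degree in [0,1]."""
    num = sum(c * attribute_membership.get(w, 0.0)
              for w, c in concept_membership.items())
    den = sum(concept_membership.values())
    return num / den if den else 0.0
```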
38.5.2 Forming New Attributes
One may define an analogous process form_attributes(b) that begins with a set A = {A_1, ..., A_k} containing (basic and/or derived) attributes, and for each A_i a column vector v_i
of length n with entries in [0,1] (the column vector tells the degrees to which the various objects possess the attribute A_i). One applies clust to find b clusters of the vectors v_i: B_1, ..., B_b. These clusters may be interpreted as fuzzy sets, which are derived attributes.
38.5.2.1 Calculating Membership Degrees of New, Derived Attributes
One must then define the degree to which an object or concept possesses a derived attribute. One way to do this is using a geometric mean. For instance, suppose there is a derived attribute formed by combining the attributes vain, selfish and egocentric. Then the degree to which the concept banker possesses this new derived attribute could be defined by

( Σ_{b ∈ banker} χ_{banker}(b) ( χ_{vain}(b) χ_{selfish}(b) χ_{egocentric}(b) )^{1/3} ) / ( Σ_{b ∈ banker} χ_{banker}(b) )
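The geometric-mean variant generalizes the weighted sum: the component attributes of the derived attribute are combined multiplicatively and the k-th root taken, before weighting by concept membership. Again the dict representation is an illustrative stand-in.

```python
# Degree to which a concept possesses a derived attribute, via the
# geometric mean of the derived attribute's k component attributes.
def concept_has_derived_attribute(concept_membership, component_attributes):
    """component_attributes: list of dicts, each object -> membership."""
    k = len(component_attributes)
    num = 0.0
    for obj, c in concept_membership.items():
        prod = 1.0
        for attr in component_attributes:
            prod *= attr.get(obj, 0.0)
        num += c * prod ** (1.0 / k)  # geometric mean of the k components
    den = sum(concept_membership.values())
    return num / den if den else 0.0
```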
38.5.3 Iterating the Fuzzy Concept Formation Process
Given a set S of concepts and/or objects with a set A of attributes, one may define
• append_concepts(S', S) as the result of adding the concepts in the set S' to S, and evaluating all the attributes in A on these concepts, to get an expanded matrix w
• append_attributes(A', A) as the result of adding the attributes in the set A' to A, and evaluating all the attributes in A' on the concepts and objects in S, to get an expanded matrix w
• collapse(S, A) as the result of taking (S, A) and eliminating any concept or attribute that has distance less than ε from some other concept or attribute that comes before it in the lexicographic ordering of concepts or attributes. I.e., collapse removes near-duplicate concepts or attributes.
Now, one may begin with a set S of objects and attributes, and iteratively run a process such as

b = n/r // e.g. r=2, or r=1.5
while (b > 1) {
S = append_concepts(S, form_concepts(S, b))
S = collapse(S)
S = append_attributes(S, form_attributes(S, b))
S = collapse(S)
b = b/r
}

This will terminate in finite time (after roughly log_r n iterations), with a finitely expanded matrix w containing a number of concepts and derived attributes in addition to the original objects and basic attributes.
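The first loop can be rendered as a small executable sketch. To keep it self-contained, the attribute-formation half is omitted and stub implementations are used: a "space" is a dict of named attribute vectors, form_concepts merely averages pairs of vectors (a stand-in for real clustering), and collapse drops near-duplicates. All details are illustrative assumptions.

```python
# Toy rendering of the iterated fuzzy-concept-formation loop.
def form_concepts(space, b):
    """Stub clustering: merge consecutive pairs of vectors by averaging."""
    objs = list(space.values())
    new = {}
    for i in range(0, min(2 * b, len(objs) - 1), 2):
        merged = {a: (objs[i].get(a, 0) + objs[i + 1].get(a, 0)) / 2.0
                  for a in set(objs[i]) | set(objs[i + 1])}
        new["C%d" % i] = merged
    return new

def collapse(space, eps=1e-6):
    """Drop any vector within eps (per attribute) of an earlier-kept one."""
    kept = {}
    for name, vec in space.items():
        if not any(all(abs(vec.get(a, 0) - v.get(a, 0)) < eps
                       for a in set(vec) | set(v)) for v in kept.values()):
            kept[name] = vec
    return kept

def iterate_fcf(space, b, r=2.0):
    while b > 1:
        space = dict(space, **form_concepts(space, int(b)))
        space = collapse(space)
        b = b / r
    return space
```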
Or, one may look at
while (S is different from old_S) {
old_S = S
S = append_concepts(S, form_concepts(S, b))
S = collapse(S)
S = append_attributes(S, form_attributes(S, b))
S = collapse(S)
}
This second version raises the mathematical question of the speed with which it will terminate (as a function of ε). I.e., when does the concept and attribute formation process converge, and how fast? This will surely depend on the clustering algorithm involved.
Section VI
Integrative Learning
Chapter 39
Dimensional Embedding
39.1 Introduction
Among the many key features of the human brain omitted by typical formal neural network
models, one of the foremost is the brain's three-dimensionality. The brain is not just a network of
neurons arranged as an abstract graph; it's a network of neurons arranged in three-dimensional
space, and making use of this three-dimensionality directly and indirectly in various ways and
for various purposes. The somatosensory cortex contains a geometric map reflecting, approximately, the geometric structure of parts of the body. The visual cortex uses the 2D layout of cortical sheets to reflect the geometric structure of perceived space: motion detection neurons often fire in the actual physical direction of motion, etc. The degree to which the brain uses 2D and 3D geometric structure to reflect conceptual rather than perceptual or motoric knowledge is unclear, but we suspect it is considerable. One well-known idea in this direction is the "self-organizing map" or Kohonen net [Koh01], a highly effective computer science algorithm that performs automated classification and clustering via projecting higher-dimensional (perceptual, conceptual or motoric) vectors onto a simulated 2D sheet of cortex.
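A minimal self-organizing map in the spirit of the Kohonen net can be sketched in a few lines: high-dimensional vectors are projected onto a 2D grid of units, with the best-matching unit and its grid neighbors pulled toward each input. The grid size, learning rate and Gaussian neighborhood below are arbitrary illustrative choices.

```python
# Minimal SOM sketch: project vectors onto a 2D grid of weight vectors.
import math, random

def train_som(data, grid=4, dim=3, epochs=20, lr=0.5, radius=1.0, seed=0):
    rng = random.Random(seed)
    units = {(i, j): [rng.random() for _ in range(dim)]
             for i in range(grid) for j in range(grid)}
    for _ in range(epochs):
        for x in data:
            # best-matching unit: closest weight vector to the input
            bmu = min(units, key=lambda u: sum((a - b) ** 2
                                               for a, b in zip(units[u], x)))
            for u, w in units.items():
                d2 = (u[0] - bmu[0]) ** 2 + (u[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))  # grid neighborhood
                units[u] = [a + lr * h * (b - a) for a, b in zip(w, x)]
    return units

def project(units, x):
    """Map a vector to its 2D grid coordinate (the best-matching unit)."""
    return min(units, key=lambda u: sum((a - b) ** 2
                                        for a, b in zip(units[u], x)))
```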
It's not clear that the exploitation of low-dimensional geometric structure is something an
AGI system necessarily must support - there are always many different approaches to any
aspect of the AGI problem. However, the brain does make clear that exploitation of this sort
of structure is a powerful way to integrate various useful heuristics. In the context of mind-
world correspondence theory, there seems clear potential value in having a mind mirror the
dimensional structure of the world, at some level of approximation.
It's also worth emphasizing that the brain's 3D structure has minuses as well as plusses - one
suspects it complexifies and constrains the brain, along with implicitly suggesting various useful
heuristics. Any mathematical graph can be represented in 3 dimensions without links crossing
(unlike in 2 dimensions), but that doesn't mean the representation will always be efficient or
convenient - sometimes it may result in conceptually related, and/or frequently interacting,
entities being positioned far away from each other geometrically. Coupled with noisy signaling
methods such as the brain uses, this sometime lack of alignment between conceptual/pragmatic
and geometric structure can lead to various sorts of confusion (i.e. when neuron A sends a signal
to physical distant neurons B, this may cause various side-effects along the path, sonic of which
wouldn't happen if A and B were close to each other).
In the context of CogPrime, the most extreme way to incorporate a brain-like 3D structure
would be to actually embed an Atomspace in a bounded 3D region. Then the Atomspace would
be geometrically something like a brain, but with abstract nodes and links (some having explicit symbolic content) rather than purely subsymbolic neurons. This would not be a ridiculous thing
to do, and could yield interesting results. However, we are unsure this would be an optimal
approach. Instead we have opted for a more moderate approach: couple the non-dimensional
Atomspace with a dimensional space, containing points corresponding to Atoms. That is, we
perform an embedding of Atoms in the OpenCog AtomSpace into n-dimensional space - a
judicious transformation of (hyper)graphs into vectors.
This embedding has applications to PLN inference control, and to the guidance of instance
generation in PEPL learning of Combo trees. It is also, in itself, a valuable and interesting
heuristic for sculpting the link topology of a CogPrime AtomSpace. The basic dimensional
embedding algorithm described here is fairly simple and not original to CogPrime, but it has
not previously been applied in any similar context.
The intuition underlying this approach is that there are some cases (e.g. PLN control, and
PEPL guidance) where dimensional geometry provides a useful heuristic for constraining a
huge search space, via providing a compact way of storing a large amount of information.
Dimensionally embedding Atoms lets CogPrime be dimensional like the brain when it needs to
be, yet with the freedom of nondimensionality the rest of the time. This dual strategy is one
that may be of value for AGI generally beyond the CogPrime design, and is somewhat related
to (though different in detail from) the way the CLARION cognitive architecture maps declarative knowledge into knowledge appropriate for its neural net layer.
There is an obvious way to project CogPrime Atoms into n-dimensional space, by assigning
each Atom a numerical vector based on the weights of its links. But this is not a terribly
useful approach, because the vectors obtained in this way will live, potentially, in millions- or
billions-dimensional space. The approach we describe here is a bit different. We are defining
more specific embeddings, each one based on a particular link type or set of link types. And we
are doing the embedding into a space whose dimensionality is high but not too high, e.g. n=50.
This moderate dimensional space could then be projected down into a lower dimensional space,
like a 3D space, if needed.
The philosophy underlying the ideas proposed here is similar to that underlying Principal Components Analysis (PCA) in statistics. The n-dimensional spaces we define here, like those used in PCA or LSI (Latent Semantic Indexing), are defined by sets of orthogonal concepts extracted from the original space of concepts. The difference is that PCA and
LSI work on spaces of entities defined by feature vectors, whereas the methods described here
work for entities defined as nodes in weighted graphs. There is no precise notion of orthogonality
for nodes in a weighted graph, but one can introduce a reasonable proxy.
39.2 Link Based Dimensional Embedding
In this section we define the type of dimensional embedding that we will be talking about. For
concreteness we will speak in terms of CogPrime nodes and links, but the discussion applies
much more generally than that.
A link based dimensional embedding is defined as a mapping that maps a set of CogPrime
Atoms into points in an n-dimensional real space, by:
• mapping link strength into coordinate values in an embedding space, and
• representing nodes as points in this embedding space, using the coordinate values defined
by the strengths of their links.
In the usual case, a dimensional embedding is formed from links of a single type, or from links
whose types are very closely related (e.g. from all symmetrical logical links).
Mapping all the link strengths of the links of a given type into coordinate values in a dimen-
sional space is a simple, but not a very effective strategy. The approach described here is based
on strategically choosing a subset of particular links and forming coordinate values from them.
The choice of links is based on the desire for a correspondence between the metric structure of
the embedding space, and the metric structure implicit in the weights of the links of the type
being embedded. The basic idea of metric preservation is depicted in Figure 39.1.
Fig. 39.1: Metric-Preserving Dimensional Embedding. The basic idea of the sort of em-
bedding described here is to map Atoms into numerical vectors, in such a way that, on average,
distance between Atoms roughly correlates with distance between corresponding vectors. (The
picture shows a 3D embedding space for convenience, but in reality the dimension of the em-
bedding space will generally be much higher.)
More formally, let proj(A) denote the point in R^n corresponding to the Atom A. Then if, for example, we are doing an embedding based on SimilarityLinks, we want there to be a strong correlation (or rather anticorrelation) between:
(SimilarityLink A B).tv.s
and
dE (proj(A), proj(B))
where dE denotes the Euclidean distance on the embedding space. This is a simple case
because SimilarityLink is symmetric. Dealing with asymmetric links like InheritanceLinks is a
little subtler, and will be done below in the context of inference control.
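The metric-preservation criterion can be made concrete by measuring the correlation between SimilarityLink strengths and embedding distances over a sample of Atom pairs; a good embedding should yield a strongly negative value. The following is an illustrative sketch only, not part of any CogPrime codebase; all function names are hypothetical:

```python
import math

def euclidean(u, v):
    # dE: ordinary Euclidean distance between two embedding vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson(xs, ys):
    # Plain Pearson correlation coefficient of two equal-length samples
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def metric_preservation(pairs, strength, proj):
    # For each sampled Atom pair, compare the SimilarityLink strength
    # with the distance between the projected vectors; values near -1
    # mean the embedding preserves the similarity structure well.
    s = [strength[p] for p in pairs]
    d = [euclidean(proj[p[0]], proj[p[1]]) for p in pairs]
    return pearson(s, d)
```

Such a statistic could, in principle, be used to tune the embedding dimension n: one would look for the smallest n at which the anticorrelation stops improving appreciably.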
Larger dimensions generally allow greater correlation, but add complexity. If one chooses the
dimensionality equal to the number of nodes in the graph, there is really no point in doing the
embedding. On the other hand, if one tries to project a huge and complex graph into 1 or 2
dimensions, one is bound to lose a lot of important structure. The optimally useful embedding
will be into a space whose dimension is low but not too large.
For internal CogPrime inference purposes, we should generally use a moderately high-
dimensional embedding space, say n=50 or n=100.
39.3 Harel and Koren's Dimensional Embedding Algorithm
Our technique for embedding CogPrime Atoms into high-dimensional space is based on an algorithm suggested by David Harel and Yehuda Koren [HK02]. Their work is concerned with
visualizing large graphs, and they propose a two-phase approach:
1. embed the graph into a high-dimensional real space
2. project the high-dimensional points into 2D or 3D space for visualization
In CogPrime, we don't always require the projection step (step 2); our focus is on the initial
embedding step. Harel and Koren's algorithm for dimensional embedding (step 1) is directly
applicable to the CogPrime context.
Of course this is not the only embedding algorithm that would be reasonable to use in a
CogPrime context; it's just one possibility that seems to make sense.
Their algorithm works as follows.
Suppose one has a graph with symmetric weighted links. Further, assume that between any
two nodes in the graph, there is a way to compute the weight that a link between those two
nodes would have, even if the graph in fact doesn't contain a link between the two nodes.
In the CogPrime context, for instance, the nodes of the graph may be ConceptNodes, and
the links may be SimilarityLinks. We will discuss the extension of the approach to deal with
asymmetric links like InheritanceLinks, later on.
Let n denote the dimension of the embedding space (e.g. n = 50). We wish to map graph nodes into points in R^n, in such a way that the weight of the graph link between A and B correlates (inversely) with the distance between proj(A) and proj(B) in R^n.
39.3.1 Step 1: Choosing Pivot Points
Choose n "pivot points" that are roughly uniformly distributed across the graph.
To do this, one chooses the first pivot point at random, and then iteratively chooses the i'th point to be maximally distant from the previous (i-1) points chosen.
One may also use additional criteria to govern the selection of pivot points. In CogPrime, for
instance, we may use long-term stability as a secondary criterion for selecting Atoms to serve as
pivot points. Greater computational efficiency is achieved if the pivot-point Atoms don't change
frequently.
39.3.2 Step 2: Similarity Estimation
Estimate the similarity between each Atom being projected, and the n pivot Atoms.
This is expensive. However, the cost is decreased somewhat in the CogPrime case by caching
the similarity values produced in a special table (they may not be important enough otherwise
to be preserved in CogPrime). Then, in cases where neither the pivot Atom nor the Atom
being compared to it have changed recently, the cached value can be reused.
39.3.3 Step 3: Embedding
Create an n-dimensional space by assigning a coordinate axis to each pivot Atom. Then, for an Atom A, the i'th coordinate value is given by A's similarity to the i'th pivot Atom.
After this step, one has transformed one's graph into a collection of n-dimensional vectors.
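The three steps above can be sketched compactly. This is a toy illustration under the assumption that a similarity oracle between Atoms is available (in CogPrime these values would come from the cached similarity table mentioned earlier); the function names are invented for the example:

```python
def choose_pivots(atoms, similarity, n):
    # Step 1: greedy pivot selection. Start from an arbitrary Atom,
    # then repeatedly add the Atom that is maximally distant from
    # (i.e. minimally similar to) the pivots chosen so far.
    pivots = [atoms[0]]
    while len(pivots) < n:
        best = max((a for a in atoms if a not in pivots),
                   key=lambda a: -max(similarity(a, p) for p in pivots))
        pivots.append(best)
    return pivots

def embed(atoms, similarity, n):
    # Steps 2-3: assign one coordinate axis per pivot; an Atom's i'th
    # coordinate is its (possibly cached) similarity to the i'th pivot.
    pivots = choose_pivots(atoms, similarity, n)
    return {a: [similarity(a, p) for p in pivots] for a in atoms}
```

The output is exactly the "collection of n-dimensional vectors" described in the text: one vector per Atom, with one coordinate per pivot.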
39.4 Embedding Based Inference Control
One important application for dimensional embedding in CogPrime is to help with the control of
• Logical inference
• Direct evaluation of logical links
We describe how it can be used specifically to stop the CogPrime system from continually trying
to make the same unproductive inferences.
To understand the problem being addressed, suppose the system tries to evaluate the strength
of the relationship
SimilarityLink foot toilet
Assume that no link exists in the system representing this relationship.
Here "foot" and "toilet" are hypothetical ConceptNodes that represent aspects of the concepts
of foot and toilet respectively. In reality these concepts might well be represented by complex
maps rather than individual nodes.
Suppose the system determines that the strength of this Link is very close to zero. Then
(depending on a threshold in the MindAgent), it will probably not create a SimilarityLink
between the "foot" and "toilet" nodes.
Now, suppose that a few cycles later, the system again tries to evaluate the strength of the
same Link,
SimilarityLink foot toilet
Again, very likely, it will find a low strength and not create the Link at all.
The same problem may occur with InheritanceLinks, or any other (first or higher order)
logical link type.
Why would the system try, over and over again, to evaluate the strength of the same nonex-
istent relationship? Because the control strategies of the current forward-chaining inference and
pattern mining MindAgents are simple by design. These MindAgents work by selecting Atoms
from the AtomTable with probability proportional to importance, and trying to build links
between them. If the foot and toilet nodes are both important at the same time, then these
MindAgents will try to build links between them - regardless of how many times they've tried
to build links between these two nodes in the past and failed.
How do we solve this problem using dimensional embedding? Generally:
• one will need a different embedding space for each link type for which one wants to prevent
repeated attempted inference of useless relationships. Sometimes, very closely related link
types might share the same embedding space; this must be decided on a case-by-case basis.
• in the embedding space for a link type L, one only embeds Atoms of a type that can be
related by links of type L
It is too expensive to create a new embedding very often. Fortunately, when a new Atom is cre-
ated or an old Atom is significantly modified, it's easy to reposition the Atom in the embedding
space by computing its relationship to the pivot Atoms. Once enough change has happened,
however, new pivot Atoms will need to be recomputed, which is a substantial computational
expense. We must update the pivot point set every N cycles, where N is large; or else, whenever
the total amount of change in the system has exceeded a certain threshold.
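The update policy just described - cheap repositioning of changed Atoms against a fixed pivot set, with the expensive re-pivoting pass deferred until accumulated change crosses a threshold - might be sketched as follows. All names here are illustrative, not actual OpenCog APIs:

```python
class EmbeddingSpace:
    # Sketch of the embedding-maintenance policy: reposition() is the
    # cheap per-Atom operation; repivot() stands in for the expensive
    # full recomputation of the pivot set.
    def __init__(self, pivots, similarity, repivot_threshold):
        self.pivots = pivots
        self.similarity = similarity
        self.coords = {}
        self.change = 0.0
        self.repivot_threshold = repivot_threshold
        self.repivot_count = 0

    def reposition(self, atom, change_amount=1.0):
        # Cheap: recompute this Atom's coordinates against fixed pivots,
        # and track how much the system has drifted since the last repivot.
        self.coords[atom] = [self.similarity(atom, p) for p in self.pivots]
        self.change += change_amount
        if self.change >= self.repivot_threshold:
            self.repivot()

    def repivot(self):
        # Expensive: in a full system this would re-run pivot selection
        # over the whole AtomSpace; here we just record the event.
        self.repivot_count += 1
        self.change = 0.0
```

The alternative trigger mentioned in the text - repivoting every N cycles for large N - would simply replace the change-accumulator test with a cycle counter.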
Now, how is this embedding used for inference control? Let's consider the case of similarity
first. Quite simply, one selects a pair of Atoms (A,B) for SimilarityMining (or inference of a
SimilarityLink) based on some criterion such as, for instance:
importance(A) * importance(B) * simproj(A,B)
where
distproj(A,B) = dE( proj(A), proj(B) )
simproj(A,B) = 2^(-c * distproj(A,B))
and c is an important tunable parameter.
What this means is that, if A and B are far apart in the SimilarityLink embedding space,
the system is unlikely to try to assess their similarity.
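As an illustration, the selection weight might be computed as follows; the concrete function names and the default value of c are invented for this sketch:

```python
import math

def simproj(a, b, proj, c=1.0):
    # simproj(A,B) = 2^(-c * distproj(A,B)): Atoms that are close in
    # the embedding space score near 1, distant Atoms score near 0.
    # c is the tunable parameter controlling how fast interest decays
    # with embedding distance.
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(proj[a], proj[b])))
    return 2.0 ** (-c * dist)

def selection_weight(a, b, importance, proj, c=1.0):
    # Weight used to choose candidate pairs for similarity mining:
    # pairs of important Atoms that are nearby in the embedding
    # space are the ones worth spending inference effort on.
    return importance[a] * importance[b] * simproj(a, b, proj, c)
```

A MindAgent would then sample pairs with probability proportional to this weight, rather than by importance alone, which is exactly what suppresses repeated evaluation of pairs like "foot" and "toilet".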
There is a tremendous space efficiency to this approach in that, where there are N Atoms and m pivot Atoms, N^2 similarity relationships are being approximately stored in m*N coordinate values. Furthermore, the cost of computation is m*N times the cost of assessing a single SimilarityLink. By accepting crude approximations of actual similarity values, one gets away with linear time and space cost.
Because this is just an approximation technique, there are definitely going to be cases where
A and B are not similar, even though they're close together in the embedding space. When
such a case is found, it may be useful for the AtomSpace to explicitly contain a low-strength
SimilarityLink between A and B. This link will prevent the system from making false embedding-
based decisions to explore (SimilarityLink A B) in the future. Putting explicit low-strength
SimilarityLinks in the system in these cases is obviously much cheaper than using them for all
cases.
We've been talking about SimilarityLinks, but the approach is more broadly applicable. Any
symmetric link type can be dealt with about the same way. For instance, it might be useful to
keep dimensional embedding maps for
• SimilarityLink
• ExtensionalSimilarityLink
• EquivalenceLink
• ExtensionalEquivalenceLink
On the other hand, dealing with asymmetric links in terms of dimensional embedding requires
more subtlety - we turn to this topic below.
39.5 Dimensional Embedding and InheritanceLinks
Next, how can we use dimensional embedding to keep an approximate record of which Atoms do not inherit from each other? Because inheritance is an asymmetric relationship, whereas
distance in embedding spaces is a symmetrical relationship, there's no direct and simple way
to do so.
However, there is an indirect approach that solves the problem, which involves maintaining
two embedding spaces, and combining information about them in an appropriate way. In this
subsection, we'll discuss an approach that should work for InheritanceLink, SubsetLink, Impli-
cationLink, and ExtensionalImplicationLink and other related link types. But we'll explicitly
present it only for the InheritanceLink case.
Although the embedding algorithm described above was intended for symmetric weighted graphs, in fact we can use it for asymmetric links in just about the same way. The use of
the embedding graph for inference control differs, but not the basic method of defining the
embedding.
In the InheritanceLink case, we can define pivot Atoms in the same way, and then we can
define two vectors for each Atom A:
proj_parent(A)_i = (InheritanceLink A A_i).tv.s
proj_child(A)_i = (InheritanceLink A_i A).tv.s
where A_i is the i'th pivot Atom.
If generally proj_child(A)_i ≤ proj_child(B)_i, then qualitatively "children of A are children of B"; and if generally proj_parent(A)_i ≥ proj_parent(B)_i, then qualitatively "parents of B are parents of A". The combination of these two conditions means heuristically that (InheritanceLink A B) is likely. So, by combining the two embedding vectors assigned to each Atom, one can get heuristic guidance regarding inheritance relations, analogous to the case with similarity relationships. One
may produce mathematical formulas estimating the error of this approach under appropriate
conditions, but in practice it will depend on the probability distribution of the vectors.
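A rough sketch of this combined heuristic, assuming the two projection vectors defined above are available as plain lists of strengths (the function name and the optional tolerance margin are invented here):

```python
def likely_inheritance(a, b, proj_child, proj_parent, margin=0.0):
    # Heuristic test for (InheritanceLink A B):
    #   - children of A should be children of B, i.e. A's child
    #     coordinates should not exceed B's;
    #   - parents of B should be parents of A, i.e. A's parent
    #     coordinates should not fall below B's.
    # proj_child[x][i]  corresponds to (InheritanceLink A_i x).tv.s
    # proj_parent[x][i] corresponds to (InheritanceLink x A_i).tv.s
    child_ok = all(ca <= cb + margin
                   for ca, cb in zip(proj_child[a], proj_child[b]))
    parent_ok = all(pa + margin >= pb
                    for pa, pb in zip(proj_parent[a], proj_parent[b]))
    return child_ok and parent_ok
```

In practice one would use a soft, quantitative version of these comparisons (e.g. the fraction of coordinates satisfying each inequality) rather than a strict all-coordinates test, for robustness against noise in the strength estimates.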
Chapter 40
Mental Simulation and Episodic Memory
40.1 Introduction
This brief chapter deals with two important, coupled cognitive components of CogPrime: the
component concerned with creating internal simulations of situations and episodes in the ex-
ternal physical world, and the one concerned with storing and retrieving memories of situations
and episodes.
These are components that are likely significantly different in CogPrime from anything that
exists in the human brain, yet, the functions they carry out are obviously essential to human
cognition (perhaps more so to human cognition than to CogPrime's cognition, because Cog-
Prime is by design more reliant on formal reasoning than the human brain is).
Much of human thought consists of internal, quasi-sensory "imaging" of the external physical
world - and much of human memory consists of remembering autobiographical situations and
episodes from daily life, or from stories heard from others or absorbed via media. Often this
episodic remembering takes the form of visualization, but not always. Blind people generally
think and remember in terms of non-visual imagery, and many sighted people think in terms
of sounds, tastes or smells in addition to visual images.
So far, the various mechanisms proposed as part of CogPrime do not have much to do with
either internal imagery or episodic remembering, even though both seem to play a large role
in human thought. This is OK, of course, since CogPrime is not intended as a simulacrum of
human thought, but rather as a different sort of intelligence.
However, we believe it will actually be valuable to CogPrime to incorporate both of these
factors. And for that purpose, we propose
• a novel mechanism: the incorporation within the CogPrime system of a 3D physical-world
simulation engine.
• an episodic memory store centrally founded on dimensional embedding, and linked to the
internal simulation model
40.2 Internal Simulations
The current use of virtual worlds for OpenCog is to provide a space in which human-controlled
agents and CogPrime -controlled agents can interact, thus allowing flexible instruction of the
CogPrime system by humans, and flexible embodied, grounded learning by CogPrime systems.
But this very same mechanism may be used internally to CogPrime, i.e. a CogPrime system may
be given an internal simulation world, which serves as a sort of "mind's eye." Any sufficiently
flexible virtual world software may be used for this purpose, for example OpenSim (http://opensim.org).
Atoms encoding percepts may be drawn from memory and used to generate forms within
the internal simulation world. These forms may then interact according to
• the patterns via which they are remembered to act
• the laws of physics, as embodied in the simulation world
This allows a kind of "implicit memory," in that patterns emergent from the world-embedded
interaction of a number of entities need not explicitly be stored in memory, so long as they will
emerge when the entities are re-awakened within the internal simulation world.
The SimulatorMindAgent grabs important perceptual Atoms and uses them to generate
forms within the internal simulation world, which then act according to remembered dynamical
patterns, with the laws of physics filling in the gaps in memory. This provides a sort of running
internal visualization of the world. Just as important, however, are specific schemata that
utilize visualization in appropriate contexts. For instance, if reasoning is having trouble solving
a problem related to physical entities, it may feed these entities to the internal simulation world
to see what can be discovered. Patterns discovered via simulation can then be fed into reasoning
for further analysis.
The process of perceiving events and objects in the simulation world is essentially identical
to the process of perceiving events and objects in the "actual" world.
And of course, an internal simulation world may be used whether the CogPrime system in
question is hooked up to a virtual world like OpenSim, or to a physical robot.
Finally, perhaps the most interesting aspect of internal simulation is the generation of "vir-
tual perceptions" from abstract concepts. Analogical reasoning may be used to generate virtual
perceptions that were never actually perceived, and these may then be visualized. The need for
"reality discrimination" comes up here, and is easier to enforce in CogPrime than in humans.
A PerceptNode that was never actually perceived may be explicitly embedded in a Hypotheti-
calLink, thus avoiding the possibility of confusing virtual percepts with actual ones. How useful
the visualization of virtual perceptions will be to CogPrime cognition, remains to be seen. This
kind of visualization is key to human imagination but this doesn't mean it will play the same
role in CogPrime's quite different cognitive processes. But it is important that CogPrime has
the power to carry out this kind of imagination.
40.3 Episodic Memory
Episodic memory refers to the memory of our own "life history" that each of us has. Loss of this
kind of memory is the most common type of amnesia in fiction - such amnesia is particularly
dramatic because our episodic memories constitute so much of what we consider as our "selves."
To a significant extent, we as humans remember, reason and relate in terms of stories - and the
centerpiece of our understanding of stories is our episodic memory. A CogPrime system need
not be as heavily story-focused as a typical human being (though it could be, potentially) -
but even so, episodic memory is a critical component of any CogPrime system controlling an
agent in a world.
The core idea underlying CogPrime's treatment of episodic memory is a simple one: two dimensional embedding spaces dedicated to episodes. An episode - a coherent collection of
happenings, often with causal interrelationships, often (but not always) occurring near the
same spatial or temporal locations as each other - may be represented explicitly as an Atom,
and implicitly as a map whose key is that Atom. These episode-Atoms may then be mapped
into two dedicated embedding spaces:
• one based on a distance metric determined by spatiotemporal proximity
• one based on a distance metric determined by semantic similarity
A story is then a series of episodes - ideally one that, if the episodes in the series become
important sequentially in the AtomSpace, causes a significant important-goal-related (ergo emo-
tional) response in the system. Stories may also be represented as Atoms, in the simplest case
consisting of SequentialAND links joining episode-Atoms. Stories then correspond to paths
through the two episodic embedding spaces. Each path through each embedding space implic-
itly has a sort of "halo" in the space - visualizable as a tube snaking through the space, centered
on the path. This tube contains other paths - other stories - that relate to the given center
story, either spatiotemporally or semantically.
The familiar everyday human experience of episodic memory may then be approximately
emulated via the properties of the dimensional embedding space. For instance, episodic memory
is famously associative - when we think of one episode or story, we think of others that are
spatiotemporally or semantically associated with it. This emerges naturally from the embedding
space approach, due to the natural emergence of distance-based associative memory in an
embedding space.
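A minimal sketch of this kind of dual-space associative recall, assuming episodes have already been embedded as vectors in the two spaces (all names here are hypothetical, and the episode labels are toy data):

```python
import math

def nearest_episodes(query, space, k=2):
    # Associative recall within one embedding space: return the k
    # episodes whose vectors lie closest to the query episode's vector.
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    others = [e for e in space if e != query]
    return sorted(others, key=lambda e: dist(space[e], space[query]))[:k]

def recall(episode, spatiotemporal, semantic, k=2):
    # An episode cues associations in both dedicated spaces: episodes
    # that happened nearby in space/time, and episodes that were about
    # similar things.
    return {
        "spatiotemporal": nearest_episodes(episode, spatiotemporal, k),
        "semantic": nearest_episodes(episode, semantic, k),
    }
```

The "halo" around a story path described above corresponds, in this sketch, to taking the union of such neighborhoods along each episode in the story's sequence.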
Figures 40.1 and 40.2 roughly illustrate the link between episodic/perceptual and declarative memory.
Fig. 40.1: Relationship Between Episodic, Declarative and Perceptual Memory. The
nodes and links at the bottom depict declarative memory stored in the Atomspace; the picture at
the top illustrates an archetypal episode stored in episodic memory, and linked to the perceptual
hierarchy enabling imagistic simulation.
Fig. 40.2: Relationship Between Episodic, Declarative and Perceptual Memory. Another example similar to the one in Figure 40.1, but referring specifically to events occurring in an OpenCogPrime-controlled agent's virtual world.
Chapter 41
Integrative Procedure Learning
41.1 Introduction
"Procedure learning" - learning step-by-step procedures for carrying out internal or external
operations - is a highly critical aspect of general intelligence, and is carried out in CogPrime
via a complex combination of methods. This somewhat heterogeneous chapter reviews several
advanced aspects of procedure learning in CogPrime, mainly having to do with the integration
between different cognitive processes.
In terms of general cognitive theory and mind-world correspondence, this is some of the
subtlest material in the book. We are not concerned just with how the mind's learning of one sort of knowledge correlates with the way this sort of knowledge is structured in the mind's habitual environments, in the context of its habitual goals. Rather, we are concerned with how various sorts of knowledge intersect and interact with each other. The proposed algorithmic intersections between, for instance, declarative and procedural learning processes are reflective of implicit assumptions about how declarative and procedural knowledge are presented in the
world in the context of the system's goals - but these implicit assumptions are not always easy
to tease out and state in a compact way. We will do our best to highlight these assumptions as
they arise throughout the chapter.
Key among these assumptions, however, are that a human-like mind
• is presented with various procedure learning problems at various levels of difficulty (so that
different algorithms may be appropriate depending on the difficulty level). This leads for
instance to the possibility of using various different algorithms like MOSES or hill climbing,
for different procedure learning problems.
• is presented with some procedure learning problems that may be handled in a relatively
isolated way, and others that are extremely heavily dependent on context, often in a way that
recurs across multiple learning instances in similar contexts. This leads to situations where
the value of bringing declarative (PLN) and associative (ECAN) and episodic knowledge into
the procedure learning process, has varying value depending on the situation.
• is presented with a rich variety of procedure learning problems with complex interrelation-
ships, including many problems that are closely related to previously solved problems in
various ways. This highlights the value of using PLN analogical reasoning, and importance
spreading along HebbianLinks learned by ECAN, to help guide procedure learning in various
ways.
• needs to learn some procedures whose execution may be carried out in a relatively isolated
way, and other procedures whose execution requires intensive ongoing interaction with other
cognitive processes
The diversity of procedure learning situations reflected in these assumptions, leads naturally to
the diversity of technical procedure learning approaches described in this chapter. Potentially
one could have a single, unified algorithm covering all the different sorts of procedure learning,
but instead we have found it more practical to articulate a small number of algorithms which
are then combined in different ways to yield the different kinds of procedure learning.
41.1.1 The Diverse Technicalities of Procedure Learning in CogPrime
On a technical level, this chapter discusses two closely related aspects of CogPrime : schema
learning and predicate learning, which we group under the general category of "procedure learn-
ing."
Schema learning - the learning of SchemaNodes and schema maps (explained further in Chapter 42) - is CogPrime lingo for learning how to do things. Learning how to act, how to
perceive, and how to think - beyond what's explicitly encoded in the system's MindAgents. As
an advanced CogPrime system becomes more profoundly self-modifying, schema learning will
drive more and more of its evolution.
Predicate learning, on the other hand, is the most abstract and general manifestation of
pattern recognition in the CogPrime system. PredicateNodes, along with predicate maps, are
CogPrime's way of representing general patterns (general within the constraints imposed by
the system parameters, which in turn are governed by hardware constraints). Predicate evolu-
tion, predicate mining and higher-order inference - specialized and powerful forms of predicate
learning - are the system's most powerful ways of creating general patterns in the world and in
the mind. Simpler forms of predicate learning are grist for the mill of these processes.
It may be useful to draw an analogy with another (closely related) very hard problem in
CogPrime, discussed in the book Probabilistic Logic Networks: probabilistic logical unification,
which in the CogPrime /PLN framework basically comes down to finding the SatisfyingSets of
given predicates. Hard logical unification problems can often be avoided by breaking down large
predicates into small ones in strategic ways, guided by non-inferential mind processes, and then
doing unification only on the smaller predicates. Our limited experimental experience indicates
that the same "hierarchical breakdown" strategy also works for schema and predicate learning,
to an extent. But still, as with unification, even when one does break down a large schema or
predicate learning problem into a set of smaller problems, one is still in most cases left with a
set of fairly hard problems.
More concretely, CogPrime procedure learning may be generally decomposed into three as-
pects:
1. Converting back and forth between maps and ProcedureNodes (encapsulation and expan-
sion)
2. Learning the Combo Trees to be embedded in grounded ProcedureNodes
3. Learning procedure maps (networks of grounded ProcedureNodes acting in a coordinated
way to carry out procedures)
Each of these three aspects of CogPrime procedure learning mentioned above may be dealt with
somewhat separately, though relying on largely overlapping methods.
CogPrime approaches these problems using a combination of techniques:
• Evolutionary procedure learning and hillclimbing for dealing with brand new procedure
learning problems, requiring the origination of innovative, highly approximate solutions out
of the blue
• Inferential procedure learning for taking approximate solutions and making them exact,
and for dealing with procedure learning problems within domains where closely analogous
procedure learning problems have previously been solved
• Heuristic, probabilistic data mining for the creation of encapsulated procedures (which then feed into inferential and evolutionary procedure learning), and the expansion of encapsulated procedures into procedure maps
• PredictiveImplicationLink formation (augmented by PLN inference on such links) as a CogPrime version of goal-directed reinforcement learning
Using these different learning methods together, as a coherently-tuned whole, one arrives at a
holistic procedure learning approach that combines speculation, systematic inference, encapsu-
lation and credit assignment in a single adaptive dynamic process.
We are relying on a combination of techniques to do what none of the techniques can ac-
complish on their own. The combination is far from arbitrary, however. As we will see, each of
the techniques involved plays a unique and important role.
41.1.1.1 Comments on an Alternative Representational Approach
We briefly pause to contrast certain technical aspects of the present approach to analogous aspects of the Webmind AI Engine (one of CogPrime's predecessor AI systems, briefly discussed in Chapter 19.1). This predecessor system used a knowledge representation somewhat similar to the Atomspace, but with various differences; for instance the base types were Node and Link rather than Atom, and there was a Node type not used in CogPrime called the SchemaInstanceNode (each one corresponding to a particular instance of a SchemaNode, used within a particular procedure).
In this approach, complex, learned schemata were represented as distributed networks of elementary SchemaInstanceNodes, but these networks were not defined purely by function application - they involved explicit passing of variable values through VariableNodes. Special logic-gate-bearing objects were created to deal with the distinction between the arguments of a SchemaInstanceNode, and the predecessor tokens giving a SchemaInstanceNode permission to act.
While this approach is in principle workable, it proved highly complex in practice, and for
the Novamente Cognition Engine and CogPrime we chose to store and manipulate procedural
knowledge separately from declarative knowledge (via Combo trees).
336 41 Integrative Procedure Learning
41.2 Preliminary Comments on Procedure Map Encapsulation and
Expansion
Like other knowledge in CogPrime, procedures may be stored in either a localized (Combo
tree) or globalized (procedure map) manner, with the different approaches being appropriate
for different purposes. Activation of a localized procedure may spur activation of a globalized
procedure, and vice versa - so on the overall mind-network level the representation of procedures
is heavily "glocal."
One issue that looms large in this context is the conversion between localized and globalized
procedures - i.e., in CogPrime lingo, the encapsulation and expansion of procedure maps. This
matter will be considered in more detail in Chapter 42, but here we briefly review some key
ideas.
Converting from grounded ProcedureNodes into maps is a relatively simple learning
problem: one enacts the procedure, observes which Atoms are active at what times during the
enaction process, and then creates PredictiveImplicationLinks between the Atoms active at
a certain time and those active at subsequent times. Generally it will be necessary to enact
the procedure multiple times and with different inputs, to build up the appropriate library of
PredictiveImplicationLinks.
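The enact-observe-link procedure just described can be sketched in Python. This is a minimal illustration under strong simplifying assumptions: Atoms are plain strings, a trace is a list of sets of Atoms active at successive time steps, and the strength of a candidate PredictiveImplicationLink is simply the fraction of traces in which the pair was observed (the real Atomspace machinery is of course far richer):

```python
from collections import defaultdict

def encapsulation_links(enaction_traces):
    """Count, over several enactions of a procedure, how often each
    (Atom active at time t) -> (Atom active at time t+1) pair occurs;
    the counts serve as evidence for PredictiveImplicationLinks."""
    evidence = defaultdict(int)
    for trace in enaction_traces:
        for t in range(len(trace) - 1):
            for a in trace[t]:
                for b in trace[t + 1]:
                    evidence[(a, b)] += 1
    n = max(1, len(enaction_traces))
    # strength = fraction of enactions in which the pair was observed
    return {pair: min(1.0, count / n) for pair, count in evidence.items()}

# Two enactions of the same procedure on different inputs
traces = [
    [{"grab"}, {"lift"}, {"release"}],
    [{"grab"}, {"lift", "wobble"}, {"release"}],
]
links = encapsulation_links(traces)
```

Enacting the procedure several times with different inputs, as the text recommends, is what pushes incidental pairs (wobble before release, seen in only one trace) below the reliable ones.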
Converting from maps into ProcedureNodes is significantly trickier. First, it involves carrying
out data mining over the network of ProcedureNodes, identifying subnetworks that are coherent
schema or predicate maps. Then it involves translating the control structure of the map into
explicit logical form, so that the encapsulated version will follow the same order of execution
as the map version. This is an important case of the general process of map encapsulation, to
be discussed in Chapter 42.
Next, the learning of grounded ProcedureNodes is carried out by a synergistic combination
of multiple mechanisms, including pure procedure learning methods like hillclimbing and evolu-
tionary learning, and logical inference. These two approaches have quite different characteristics.
Evolutionary learning and hillclimbing excel at confronting a problem that the system has no
clue about, and arriving at a reasonably good solution in the form of a schema or predicate.
Inference excels at deploying the system's existing knowledge to form useful schemata or pred-
icates. The choice of the appropriate mechanism for a given problem instance depends largely
on how much relevant knowledge is available.
A relatively simple case of ProcedureNode learning is where one is given a ConceptNode and
wants to find a ProcedureNode matching it. For instance, given a ConceptNode C, one may
wish to find the simplest possible predicate whose corresponding PredicateNode P satisfies
SatisfyingSet(P) = C
On the other hand, given a ConceptNode C involved in inferred ExecutionLinks of the form
ExecutionLink C A_i B_i
i = 1, ..., n
one may wish to find a SchemaNode so that the corresponding schema will fulfill this
same set of ExecutionLinks. It may seem surprising at first that a ConceptNode might be
involved with ExecutionLinks, but remember that a function can be seen as a set of tuples
(ListLink in CogPrime ) where the first elements, the inputs of the function, are associated
with a unique output. These kinds of ProcedureNode learning may be cast as optimization
problems, and carried out by hillclimbing or evolutionary programming. Once procedures are
learned via evolutionary programming or other techniques, they may be refined via inference.
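As a concrete toy instance of casting this as optimization, the following sketch hillclimbs over conjunctions of features, scoring each candidate predicate by its similarity to the target concept's member set damped by a simple complexity penalty. The universe, the features, and the 0.1 penalty weight are all hypothetical choices made for illustration, not part of the CogPrime design:

```python
# Toy universe: each item has a set of features
UNIVERSE = {
    "a": {"furry", "meows"},
    "b": {"furry", "barks"},
    "c": {"furry", "meows", "small"},
    "d": {"scaly"},
}
C = {"a", "c"}  # members of the target ConceptNode

def satisfying_set(conjunction):
    """Items satisfying a predicate given as a conjunction of features."""
    return {x for x, feats in UNIVERSE.items() if conjunction <= feats}

def fitness(conjunction):
    """Jaccard similarity to C, damped by a size (complexity) penalty."""
    s = satisfying_set(conjunction)
    jac = len(s & C) / len(s | C) if s | C else 1.0
    return jac / (1 + 0.1 * len(conjunction))

def hillclimb(features, steps=100):
    """Greedy local search over conjunctions: add or drop one feature."""
    current = frozenset()
    best = fitness(current)
    for _ in range(steps):
        neighbours = [current | {f} for f in features - current]
        neighbours += [current - {f} for f in current]
        candidate = max(neighbours, key=fitness)
        if fitness(candidate) <= best:
            break
        current, best = candidate, fitness(candidate)
    return current

P = hillclimb(set().union(*UNIVERSE.values()))
```

Here the search settles on the single-feature conjunction {meows}, whose satisfying set exactly matches C; the larger conjunction {furry, meows} matches too but is penalized for size.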
The other case of ProcedureNode learning is goal-driven learning. Here one seeks a
SchemaNode whose execution will cause a given goal (represented by a Goal Node) to be satisfied.
The details of Goal Nodes have already been reviewed; all we need to know here is that
a Goal Node presents an objective function, a function to be maximized; and that it poses
the problem of finding schemata whose enaction will cause this function to be maximized in
specified contexts.
The learning of procedure maps, on the other hand, is carried out by reinforcement learn-
ing, augmented by inference. This is a matter of the system learning HebbianLinks between
ProcedureNodes, as will be described below.
41.3 Predicate Schematization
Now we turn to the process called "predicate schematization," by which declarative knowledge
about how to carry out actions may be translated into Combo trees embodying specific
procedures for carrying out actions. This process is straightforward and automatic in some cases,
but in other cases requires significant contextually-savvy inference. This is a critical process
because some procedure knowledge - especially that which is heavily dependent on context in
either its execution or its utility - will be more easily learned via inferential methods than via
pure procedure-learning methods. But, even if a procedure is initially learned via inference (or
is learned by inference based on cruder initial guesses produced by pure procedure learning
methods), it may still be valuable to have this procedure in compact and rapidly executable
form such as Combo provides.
To proceed with the technical description of predicate schematization in CogPrime, we first
need the notion of an "executable predicate". Some predicates are executable in the sense that
they correspond to executable schemata, others are not. There are executable atomic predicates
(represented by individual PredicateNodes), and executable compound predicates (which are link
structures). In general, a predicate may be turned into a schema if it is an atomic executable
predicate, or if it is a compound link structure that consists entirely of executable atomic
predicates (e.g. pick_up, walk_to, can_do, etc.) and temporal links (e.g. SimultaneousAND,
PredictiveImplication, etc.).
Records of predicate execution may then be made using ExecutionLinks, e.g.
ExecutionLink pick_up (me, ball_7)
is a record of the fact that the schema corresponding to the pick_up predicate was executed
on the arguments (me, ball_7).
It is also useful to introduce some special (executable) predicates related to schema execution:
• can_do, which represents the system's perceived ability to do something
• do, which denotes the system actually doing something; this is used to mark actions as
opposed to perceptions
• just_done, which is true of a schema if the schema has very recently been executed.
The general procedure used in figuring out what predicates to schematize, in order to create
a procedure achieving a certain goal, is: Start from the goal and work backwards, following
PredictiveImplications and EventualPredictiveImplications and treating can_do's as transparent,
stopping when you find something that can currently be done, or else when the process
dwindles due to lack of links or lack of sufficiently certain links.
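A minimal sketch of this backward pass, under the assumption that implications are stored as a simple target-to-preconditions table and that can_do preconditions are chained straight through (a real CogPrime system would consult link strengths and prune by experience rather than use a bare depth bound):

```python
def plan_backwards(goal, implications, doable, depth=10):
    """Work backwards from the goal along stored implications.
    implications: target -> list of preconditions, each ("do", action)
    or ("can_do", subgoal); "can_do" is treated as transparent, so we
    chain straight through it. Returns an ordered perception/action
    series ending at the goal, or None if the search dwindles."""
    if depth == 0:
        return None
    for kind, atom in implications.get(goal, []):
        if kind == "do" and atom in doable:
            return [("do", atom), ("reach", goal)]
        sub = plan_backwards(atom, implications, doable, depth - 1)
        if sub is not None:
            return sub + [("reach", goal)]
    return None

implications = {
    "rewarded": [("can_do", "give_ball")],
    "give_ball": [("do", "walk_to_teacher")],
}
plan = plan_backwards("rewarded", implications, doable={"walk_to_teacher"})
```

The depth bound stands in, very crudely, for the experience-guided pruning the text calls for; without pruning of some kind the branching of candidate paths explodes just as in backward-chaining inference.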
In this process, an ordered list of perceptions and actions will be created. The Atoms in this
perception/action-series (PA-series) are linked together via temporal-logical links.
The subtlety of this process, in general, will occur because there may be many different
paths to follow. One has the familiar combinatorial explosion of backward-chaining inference,
and it may be hard to find the best PA-series among all the mess. Experience-guided pruning
is needed here just as with backward-chaining inference.
Specific rules for translating temporal links into executable schemata, used in this process,
are as follows. All these rule-statements assume that B is in the selected PA-series. All node
variables not preceded by do or can_do are assumed to be perceptions. The —> denotes the
transformation from predicates to executable schemata.
EventualPredictiveImplicationLink (do A) B
—>
Repeat (do A) Until B

EventualPredictiveImplicationLink (do A) (can_do B)
—>
Repeat
do A
do B
Until
Evaluation just_done B
the understanding being that the agent may try to do B and fail, and then try again the next
time around the loop
PredictiveImplicationLink (do A) (can_do B) <time-lag T>
—>
do A
wait T
do B
SimultaneousImplicationLink A (can_do B)
—>
if A then do B
SimultaneousImplicationLink (do A) (can_do B)
—>
do A
do B
PredictiveImplicationLink A (can_do B)
—>
if A then do B
SequentialAndLink A1 ... An
—>
A1
...
An
SequentialAndLink A1 ... An <time_lag T>
—>
A1
Wait T
A2
Wait T
...
Wait T
An
SimultaneousANDLink A1 ... An
—>
A1
...
An
Note how all instances of can_do are stripped out upon conversion from predicate to schema,
and replaced with instances of do.
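The rules above amount to a small dispatch table from temporal link types to schema fragments. The following sketch covers an illustrative subset only; links are encoded as tagged tuples and the fragment representation is invented for this example, not taken from the Combo implementation:

```python
def schematize(link):
    """Translate one temporal link (a tagged tuple) into a schema
    fragment, following the rules above: can_do targets become plain
    do's. Only an illustrative subset of the rules is covered."""
    kind = link[0]
    if kind == "EventualPredictiveImplication":
        _, a, b = link
        if isinstance(b, tuple) and b[0] == "can_do":
            return ["Repeat", ("do", a), ("do", b[1]),
                    "Until", ("just_done", b[1])]
        return ["Repeat", ("do", a), "Until", b]
    if kind == "PredictiveImplication":          # (do A) (can_do B) <lag T>
        _, a, b, lag = link
        return [("do", a), ("wait", lag), ("do", b[1])]
    if kind == "SimultaneousImplication":        # A (can_do B)
        _, a, b = link
        return ["If", a, "Then", ("do", b[1])]
    raise ValueError(f"no rule for {kind}")

frag = schematize(("PredictiveImplication", "press_button",
                   ("can_do", "open_door"), 2))
```

Note how the can_do wrapper is discarded and replaced by a plain do in every branch, mirroring the stripping described in the text.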
41.3.1 A Concrete Example
For a specific example of this process, consider the knowledge that: "If I walk to the teacher
while whistling, and then give the teacher the ball, I'll get rewarded."
This might be represented by the predicates
walk to the teacher while whistling
A_1
SimultaneousAND
do Walk_to
ExOutLink locate teacher
EvaluationLink do whistle
If I walk to the teacher while whistling, eventually I will be next to the teacher
EventualPredictiveImplication
A_1
Evaluation next_to teacher
While next to the teacher, I can give the teacher the ball
SimultaneousImplication
EvaluationLink next_to teacher
can_do
EvaluationLink give (teacher, ball)
If I give the teacher the ball, I will get rewarded
PredictiveImplication
just_done
EvaluationLink give (teacher, ball)
Evaluation reward
Via goal-driven predicate schematization, these predicates would become the schemata
walk toward the teacher while whistling
Repeat:
do WalkTo
ExOut locate teacher
do Whistle
Until:
next_to(teacher)
if next to the teacher, give the teacher the ball
If:
Evaluation next_to teacher
Then
do give(teacher, ball)
Carrying out these two schemata will lead to the desired behavior of walking toward the teacher
while whistling, and then giving the teacher the ball when next to the teacher.
Note that, in this example:
• The walk_to, whistle, locate and give used in the example schemata are procedures corre-
sponding to the executable predicates walk_to, whistle, locate and give used in the example
predicates
• Next_to is evaluated rather than executed because (unlike the other atomic predicates in
the overall predicate being made executable) it has no "do" or "can_do" next to it
41.4 Concept-Driven Schema and Predicate Creation
In this section we will deal with the "conversion" of ConceptNodes into SchemaNodes or
PredicateNodes. The two cases involve similar but nonidentical methods; we will begin with the
simpler PredicateNode case. Conceptually, the importance of this should be clear: sometimes
knowledge may be gained via concept-learning or linguistic means, yet may be useful to
the mind in other forms, e.g. as executable schema or evaluable predicates. For instance, the
system may learn conceptually about bicycle-riding, but then may also want to learn executable
procedures allowing it to ride a bicycle. Or it may learn conceptually about criminal individuals,
but may then want to learn evaluable predicates allowing it to quickly evaluate whether a given
individual is a criminal or not.
41.4.1 Concept-Driven Predicate Creation
Suppose we have a ConceptNode C, with a set of links of the form
MemberLink A_i C, i = 1, ..., n
Our goal is to find a PredicateNode so that firstly,
MemberLink X C
is equivalent to
X "within" SatisfyingSet(P)
and secondly,
P is as simple as possible
This is related to the "Occam's Razor," Solomonoff-induction-related heuristic to be presented
later in this chapter.
We now have an optimization problem: search the space of predicates for P that maximize
the objective function f(P,C), defined for instance as

f(P,C) = cp(P) × r(C,P)

where cp(P), the complexity penalty of P, is a positive function that decreases when P gets
larger, and with

r(C,P) =
GetStrength
SimilarityLink
C
SatisfyingSet(P)
This is an optimization problem over predicate space, which can be solved in an approximate
way by the evolutionary programming methods described earlier.
The ConceptPredicatization MindAgent selects ConceptNodes based on
• Importance
• Total (truth value based) weight of attached MemberLinks and EvaluationLinks
and launches an evolutionary learning or hillclimbing task focused on learning predicates based
on the nodes it selects.
41.4.2 Concept-Driven Schema Creation
In the schema learning case, instead of a ConceptNode with MemberLinks and EvaluationLinks,
we begin with a ConceptNode C with ExecutionLinks. These ExecutionLinks were presumably
produced by inference (the only CogPrime cognitive process that knows how to create Execu-
tionLinks for non-ProcedureNodes).
The optimization problem we have here is: search the space of schemata for S that maximize
the objective function f(S,C), defined as follows:

f(S,C) = cp(S) × r(S,C)
Let Q(C) be the set of pairs (X, Y) so that ExecutionLink C X Y, and
r(S,C) =
GetStrength
SubsetLink
Q(C)
Graph(S)
where Graph(S) denotes the set of pairs (X, Y) so that ExecutionLink S X Y, where S has
been executed over all valid inputs.
Note that we consider a SubsetLink here because in practice C would have been observed
on a partial set of inputs.
Operationally, the situation here is very similar to that with concept predicatization. The
ConceptSchematization MindAgent must select ConceptNodes based on:
• Importance
• Total (truth value based) weight of ExecutionLinks
and then feed these to evolutionary optimization or hillclimbing.
41.5 Inference-Guided Evolution of Pattern-Embodying Predicates
Now we turn to predicate learning - the learning of PredicateNodes, in particular.
Aside from logical inference and learning predicates to match existing concepts, how does
the system create new predicates? Goal-driven schema learning (via evolution or reinforcement
learning) provides one alternate approach: create predicates in the context of creating use-
ful schema. Pattern mining, discussed in Chapter 37, provides another. Here we will describe
(yet) another complementary dynamic for predicate creation: pattern-oriented, inference-guided
PredicateNode evolution.
In most general terms, the notion pursued here is to form predicates that embody patterns
in the system itself and in the world. This brings us straight back to the foundations of the patternist
philosophy of mind, in which mind is viewed as a system for recognizing patterns in itself and
in the world, and then embodying these patterns in itself. This general concept is manifested
in many ways in the CogPrime design, and in this section we will discuss two of them:
• Reward of surprisingly probable Predicates
• Evolutionary learning of pattern-embodying Predicates
These are emphatically not the only ways pattern-embodying PredicateNodes get into the
system. Inference and concept-based predicate learning also create PredicateNodes embodying
patterns. But these two mechanisms complete the picture.
41.5.1 Rewarding Surprising Predicates
The TruthValue of a PredicateNode represents the expected TruthValue obtained by averaging
its TruthValue over all its possible legal argument-values. Some Predicates, however, may have
high TruthValue without really being worthwhile. They may not add any information to their
EFTA00624489
41.5 Inference-Cuided Evolution of Pattern-Embodying Predicates 343
components. We want to identify and reward those Predicates whose TruthValues actually add
information beyond what is implicit in the simple fact of combining their components.
For instance, consider the PredicateNode
AND
InheritanceLink X man
InheritanceLink X ugly
If we assume the man and ugly concepts are independent, then this PredicateNode will have
the TruthValue
man.tv.s × ugly.tv.s
In general, a PredicateNode will be considered interesting if:
1. Its Links are important
2. Its TruthValue differs significantly from what would be expected based on independence
assumptions about its components
It is of value to have interesting Predicates allocated more attention than uninteresting ones.
Factor 1 is already taken into account, in a sense: if the PredicateNode is involved in many
Links this will boost its activation which will boost its importance. On the other hand, Factor
2 is not taken into account by any previously discussed mechanisms.
For instance, we may wish to reward a PredicateNode if it has a surprisingly large or small
strength value. One way to do this is to calculate:
sdiff = |actual strength − strength predicted via independence assumptions|
× weight of evidence
and then increment the value:
K x sdiff
onto the PredicateNode's LongTermImportance (LTI) value, and similarly increment STI using a
different constant.
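The increment just described is easy to state concretely. In this sketch the independence prediction is the product of component strengths, as in the man/ugly example above; the constant K and the sample numbers are arbitrary illustrations:

```python
def surprise_boost(actual_strength, component_strengths,
                   weight_of_evidence, k=1.0):
    """sdiff = |actual strength - strength predicted under component
    independence| * weight of evidence; return the importance
    increment K * sdiff."""
    predicted = 1.0
    for s in component_strengths:
        predicted *= s            # independence assumption
    sdiff = abs(actual_strength - predicted) * weight_of_evidence
    return k * sdiff

# man.tv.s = 0.5 and ugly.tv.s = 0.4 predict 0.2 under independence;
# an observed strength of 0.35 is therefore surprising
lti_increment = surprise_boost(0.35, [0.5, 0.4], weight_of_evidence=0.8)
```

A predicate whose observed strength matches the independence prediction earns no boost at all, which is exactly the intended behavior: only informative conjunctions attract extra importance.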
Another factor that might usefully be caused to increment LTI is the simplicity of a Predi-
cateNode. Given two Predicates with equal strength, we want the system to prefer the simpler
one over the more complex one. However, the OccamsRazor MindAgent, to be presented below,
rewards simpler Predicates directly in their strength values. Hence if the latter is in use, it
seems unnecessary to reward them for their simplicity in their LTI values as well. This is an
issue that may require some experimentation as the system develops.
Returning to the surprisingness factor, consider the PredicateNode representing
AND
InheritanceLink X cat
EvaluationLink (eats X) fish
If this has a surprisingly high truth value, this means that there are more X known to (or
inferred by) the system, that both inherit from cat and eat fish, than one would expect given
the probabilities of a random X both inheriting from cat and eating fish. Thus, roughly speaking,
the conjunction of inheriting from cat and eating fish may be a pattern in the world.
We now see one very clear sense in which CogPrime dynamics implicitly leads to predicates
representing patterns. Small predicates that have surprising truth values get extra activation,
hence are more likely to stick around in the system. Thus the mind fills up with patterns.
41.5.2 A More Formal Treatment
It is worth taking a little time to clarify the sense in which we have a pattern in the above
example, using the mathematical notion of pattern reviewed in Chapter 3 of Part 1.
Consider the predicate:

pred1(T).tv equals
GetStrength
AND
Inheritance $X cat
Evaluation eats ($X, fish)

where T is some threshold value (e.g. .8). Let B = SatisfyingSet(pred1(T)). B is the set of
everything that inherits from cat and eats fish.
Now we will make use of the notion of basic complexity. If one assumes the entire AtomSpace
A constituting a given CogPrime system as given background information, then the basic
complexity c(B||A) may be considered as the number of bits required to list the handles of the
elements of B, for lookup in A; whereas c(B) is the number of bits required to actually list the
elements of B. Now, the formula given above, defining the set B, may be considered as a process
P whose output is the set B. The simplicity c(P||A) is the number of bits needed to describe
this process, which is a fairly small number. We assume A is given as background information,
accessible to the process.
Then the degree to which P is a pattern in B is given by
1 − c(P||A) / c(B||A)
which, if B is a sizable category, is going to be pretty close to 1.
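A quick numerical illustration of this formula, with hypothetical bit counts chosen only for the sake of the example:

```python
def pattern_degree(process_bits, basic_complexity_bits):
    """Degree to which process P is a pattern in B given background A:
    1 - c(P||A) / c(B||A)."""
    return 1.0 - process_bits / basic_complexity_bits

# Say B holds 1000 Atoms addressed by 32-bit handles, so
# c(B||A) = 32000 bits, while the defining formula needs about 200 bits
degree = pattern_degree(200, 1000 * 32)
```

With c(P||A) = 200 bits against c(B||A) = 32000 bits the degree comes out just under 1, as the text anticipates for a sizable category; a small B would shrink the degree toward zero and the formula would cease to count as a pattern.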
The key to there being a pattern here is that the relation:
(Inheritance X cat) AND (eats X fish)
has a high strength and also a high count. The high count means that B is a large set,
either by direct observation or by hypothesis (inference). In the case where the count represents
actual pieces of evidence observed by the system and retained in memory, then quite literally
and directly, the PredicateNode represents a pattern in a subset of the system (relative to the
background knowledge consisting of the system as a whole). On the other hand, if the count
value has been obtained indirectly by inference, then it is possible that the system does not
actually know any examples of the relation. In this case, the PredicateNode is not a pattern
in the actual memory store of the system, but it is being hypothesized to be a pattern in the
world in which the system is embedded.
41.6 PredicateNode Mining
We have seen how the natural dynamics of the CogPrime system, with a little help from spe-
cial heuristics, can lead to the evolution of Predicates that embody patterns in the system's
perceived or inferred world. But it is also valuable to more aggressively and directly create
pattern-embodying Predicates. This does not contradict the implicit process, but rather
complements it. The explicit process we use is called PredicateNode Mining and is carried out by a
PredicateNodeMiner MindAgent.
Define an Atom structure template as a schema expression corresponding to a CogPrime
Link in which some of the arguments are replaced with variables. For instance,
Inheritance $X cat
EvaluationLink (eats $X) fish
are Atom structure templates. (Recall that Atom structure templates are important in PLN
inference control, as reviewed in Chapter 36.)
What the PredicateNodeMiner does is to look for Atom structure templates and logical
combinations thereof which
• Minimize PredicateNode size
• Maximize surprisingness of truth value
This is accomplished by a combination of heuristics.
The first step in PredicateNode mining is to find Atom structure templates with high truth
values. This can be done by a fairly simple heuristic search process.
First, note that if one specifies an (Atom, Link type), one is specifying a set of Atom structure
templates. For instance, if one specifies
(cat, InheritanceLink)
then one is specifying the templates
InheritanceLink $X cat
and
InheritanceLink cat $X
One can thus find Atom structure templates as follows. Choose an Atom with high truth value,
and then, for each Link type, tabulate the total truth value of the Links of this type involving
this Atom. When one finds a promising (Atom, Link type) pair, one can then do inference to
test the truth value of the Atom structure template one has found.
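This tabulation step is straightforward to sketch. Here links are (type, source, target, strength) tuples, and a promising (Atom, Link type) pair is one whose total attached strength crosses a threshold; the data and the threshold are invented for illustration:

```python
from collections import defaultdict

def promising_pairs(links, min_total_strength):
    """links: (link_type, source, target, strength) tuples. Tabulate
    total strength per (Atom, link type) pair; each pair returned
    stands for the templates (type $X atom) and (type atom $X)."""
    totals = defaultdict(float)
    for ltype, src, tgt, s in links:
        totals[(src, ltype)] += s
        totals[(tgt, ltype)] += s
    return {pair for pair, tot in totals.items() if tot >= min_total_strength}

links = [
    ("Inheritance", "tom", "cat", 0.9),
    ("Inheritance", "felix", "cat", 0.8),
    ("Inheritance", "rex", "dog", 0.4),
]
hits = promising_pairs(links, min_total_strength=1.5)
```

Only (cat, Inheritance) accumulates enough strength here; it would then be handed to inference to test the corresponding templates, as the text describes.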
Next, given high-truth-value Atom structure templates, the PredicateNodeMiner experi-
ments with joining them together using logical connectives. For each potential combination
it assesses the fitness in terms of size and surprisingness. This may be carried out in two ways:
1. By incrementally building up larger combinations from smaller ones, at each incremental
stage keeping only those combinations found to be valuable
2. For large combinations, by evolution of combinations
Option 1 is basically greedy data mining (which may be carried out via various standard al-
gorithms, as discussed in Chapter 37), which has the advantage of being much more rapid
than evolutionary programming, but the disadvantage that it misses large combinations whose
subsets are not as surprising as the combinations themselves. It seems there is room for both
approaches in CogPrime (and potentially many other approaches as well). The PredicateN-
odeMiner MindAgent contains a parameter telling it how much time to spend on stochastic
pattern mining vs. evolution, as well as parameters guiding the processes it invokes.
So far we have discussed the process of finding single-variable Atom structure templates. But
multivariable Atom structure templates may be obtained by combining single-variable ones. For
instance, given
eats $X fish
lives_in $X Antarctica
one may choose to investigate various combinations such as
(eats $X $Y) AND (lives_in $Y Antarctica)
(this particular example will have a predictably low truth value). So, the introduction of multiple
variables may be done in the same process as the creation of single-variable combinations of
Atom structure templates.
When a suitably fit Atom structure template or logical combination thereof is found, then a
PredicateNode is created embodying it, and placed into the AtomSpace.
41.7 Learning Schema Maps
Next we plunge into the issue of procedure maps - schema maps in particular. A schema map
is a simple yet subtle thing - a subnetwork of the AtomSpace consisting of SchemaNodes,
computing some useful quantity or carrying out some useful process in a cooperative way. The
general purpose of schema maps is to allow schema execution to interact with other mental
processes in a more flexible way than is allowed by compact Combo trees with internal hooks
into the AtomSpace. I.e., to handle cases where procedure execution needs to be very highly
interactive, mediated by attention allocation and other CogPrime dynamics in a flexible way.
But how can schema maps be learned? The basic idea is simply reinforcement learning. In
a goal-directed system consisting of interconnected, cooperative elements, you reinforce those
connections and/or those elements that have been helpful for achieving goals, and weaken those
connections that haven't. Thus, over time, you obtain a network of elements that achieves goals
effectively.
The central difficulty in all reinforcement learning approaches is the 'assignment of credit'
problem. If a component of a system has been directly useful for achieving a goal, then rewarding
it is easy. But if the relevance of a component to a goal is indirect, then things aren't so simple.
Measuring indirect usefulness in a large, richly connected system is difficult - inaccuracies creep
into the process easily.
In CogPrime, reinforcement learning is handled via HebbianLinks, acted on by a combination
of cognitive processes. Earlier, in Chapter 23, we reviewed the semantics of HebbianLinks, and
discussed two methods for forming HebbianLinks:
1. Updating HebbianLink strengths via mining of the System Activity Table
2. Logical inference on HebbianLinks, which may also incorporate the use of inference to
combine HebbianLinks with other logical links (for instance, in the reinforcement learning
context, PredictiveImplicationLinks)
We now describe how HebbianLinks, formed and manipulated in this manner, may play a
key role in goal-driven reinforcement learning. In effect, what we will describe is an implicit
integration of the bucket brigade with PLN inference. The addition of robust probabilistic
inference adds a new kind of depth and precision to the reinforcement learning process.
Goal Nodes have an important ability to stimulate a lot of SchemaNode execution activity.
If a goal needs to be fulfilled, it stimulates schemata that are known to make this happen. But
how is it known which schemata tend to fulfill a given goal? A link:
PredictiveImplicationLink S G
means that after schema S has been executed, goal G tends to be fulfilled. If these links between
goals and goal-valuable schemata exist, then activation spreading from goals can serve the
purpose of causing goal-useful schemata to become active.
The trick, then, is to use HebbianLinks and inference thereon to implicitly guess
PredictiveImplicationLinks. A HebbianLink between S1 and S2 says that when thinking about S1 was
useful in the past, thinking about S2 was also often useful. This suggests that if doing S2 achieves
goal G, maybe doing S1 is also a good idea. The system may then try to find (by direct lookup
or reasoning) whether, in the current context, there is a PredictiveImplication joining S1 to S2.
In this way Hebbian reinforcement learning is being used as an inference control mechanism to
aid in the construction of a goal-directed chain of PredictiveImplicationLinks, which may then
be schematized into a contextually useful procedure.
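A minimal sketch of this lookup-guided guessing, with HebbianLinks as a weight dictionary and a plain set standing in for what would really be a PLN query for PredictiveImplications (the schema names and the 0.5 threshold are invented for illustration):

```python
def guess_implication_heads(s2, hebbian, known_implications, threshold=0.5):
    """If doing s2 achieves the goal, any s1 strongly Hebbian-linked to
    s2 is a candidate; keep those for which a PredictiveImplication
    s1 -> s2 can actually be found (here by direct lookup, standing in
    for PLN reasoning)."""
    candidates = [s1 for (s1, tgt), w in hebbian.items()
                  if tgt == s2 and w > threshold]
    return [s1 for s1 in candidates if (s1, s2) in known_implications]

hebbian = {("whistle", "give_ball"): 0.7, ("sleep", "give_ball"): 0.2}
heads = guess_implication_heads(
    "give_ball", hebbian, known_implications={("whistle", "give_ball")})
```

The Hebbian weights here serve purely as inference control: they prune which candidate implications get checked at all, rather than themselves asserting any predictive relationship.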
Note finally that this process feeds back into itself in an interesting way, via contributing
to ongoing HebbianLink formation. Along the way, while leading to the on-the-fly construction
of context-appropriate procedures that achieve goals, it also reinforces the HebbianLinks that
hold together schema maps, sculpting new schema maps out of the existing field of interlinked
SchemaNodes.
41.7.1 Goal-Directed Schema Evolution
Finally, as a complement to goal-driven reinforcement learning, there is also a process of goal-
directed SchemaNode learning. This combines features of the goal-driven reinforcement learning
and concept-driven schema evolution methods discussed above. Here we use a Goal Node to
provide the fitness function for schema evolution.
The basic idea is that the fitness of a schema is defined by the degree to which enactment
of that schema causes fulfillment of the goal. This requires the introduction of
CausalImplicationLinks, as defined in PLN. In the simplest case, a CausalImplicationLink is simply a
PredictiveImplicationLink.
One relatively simple implementation of the idea is as follows. Suppose we have a Goal
Node G, whose satisfaction we desire to have achieved by time T2. Suppose we want to find a
SchemaNode S whose execution at time T1 will cause G to be achieved. We may define a fitness
function for evaluating candidate S by:
f(S,G,T1,T2) = cp(S) × r(S,G,T1,T2)
r(S,G,T1,T2) =
GetStrength
CausalImplicationLink
EvaluationLink
AtTime
T1
ExecutionLink S X Y
EvaluationLink AtTime (T2, G)
Another variant specifies only a relative time lag, not two absolute times:

f(S,G,T) = cp(S) × r(S,G,T)

r(S,G,T) =
AND
NonEmpty
SatisfyingSet r(S,G,T1,T2)
T1 > T2 − T
Using evolutionary learning or hillclimbing to find schemata fulfilling these fitness functions
results in SchemaNodes whose execution is expected to cause the achievement of given goals.
This is a complementary approach to reinforcement-learning based schema learning, and to
schema learning based on PredicateNode concept creation. The strengths and weaknesses of
these different approaches need to be extensively experimentally explored. However, prior ex-
perience with the learning algorithms involved gives us some guidance.
We know that when absolutely nothing is known about an objective function, evolutionary
programming is often the best way to proceed. Even when there is knowledge about an objective
function, the evolution process can take it into account, because the fitness functions involve
logical links, and the evaluation of these logical links may involve inference operations.
On the other hand, when there's a lot of relevant knowledge embodied in previously executed
procedures, using logical reasoning to guide new procedure creation can be cumbersome, due
to the overwhelming number of potentially useful facts to choose among when carrying out inference.
The Hebbian mechanisms used in reinforcement learning may be understood as inferential in
their conceptual foundations (since a HebbianLink is equivalent to an ImplicationLink between
two propositions about importance levels). But in practice they provide a much-streamlined
approach to bringing knowledge implicit in existing procedures to bear on the creation of new
procedures. Reinforcement learning, we believe, will excel at combining existing procedures
to form new ones, and modifying existing procedures to work well in new contexts. Logical
inference can also help here, acting in cooperation with reinforcement learning. But when the
system has no clue how a certain goal might be fulfilled, evolutionary schema learning provides
a relatively time-efficient way for it to find something minimally workable.
Pragmatically, the GoalDrivenSchemaLearning MindAgent handles this aspect of the sys-
tem's operations. It selects Goal Nodes with probability proportional to importance, and then
spawns problems for the Evolutionary Optimization Unit Group accordingly. For a given Goal
Node, PLN control mechanisms are used to study its properties and select among the above
objective functions, on a heuristic basis.
41.8 Occam's Razor
Finally we turn to an important cognitive process that fits only loosely into the category of
"CogPrime Procedure learning" - it's not actually a procedure learning process, but rather a
process that utilizes the fruits of procedure learning.
The well-known "Occam's razor" heuristic says that all else being equal, simpler is better.
This notion is embodied mathematically in the Solomonoff-Levin "universal prior," according
to which the a priori probability of a computational entity X is defined as a normalized version
of:
m(X) = Σ_p 2^{-l(p)}
where:
• the sum is taken over all programs p that compute X
• l(p) denotes the length of the program p
Normalization is necessary because these values will not automatically sum to 1 over the space
of all X.
Without normalization, m is a semimeasure rather than a measure; with normalization it
becomes the "Solomonoff-Levin measure" [Lev01].
Roughly speaking, Solomonoff's induction theorem [Sol64a, Sol64b] shows that, if one is
trying to learn the computer program underlying a given set of observed data, and one does
Bayesian inference over the set of all programs to try and obtain the answer, then if one uses
the universal prior distribution one will arrive at the correct answer.
CogPrime is not a Solomonoff induction engine. The computational cost of actually applying
Solomonoff induction is unrealistically large. However, as we have seen in this chapter, there are
aspects of CogPrime that are reminiscent of Solomonoff induction. In concept-directed schema
and predicate learning, in pattern-based predicate learning, and in causal schema learning, we
are searching for schemata and predicates that minimize complexity while maximizing some
other quality. These processes all implement the Occam's Razor heuristic in a Solomonoffian
style.
Now we will introduce one more method of imposing the heuristic of algorithmic simplicity
on CogPrime Atoms (and hence, indirectly, on CogPrime maps as well). This is simply to give
a higher a priori probability to entities that are more simply computable.
For starters, we may increase the node probability of ProcedureNodes proportionately to
their simplicity. A reasonable formula here is simply:
2^{-r c(P)}
where P is the ProcedureNode and r > 0 is a parameter. This means that infinitely complex
P have a priori probability zero, whereas an infinitely simple P has an a priori probability 1.
This is not an exact implementation of the Solomonoff-Levin measure, but it's a decent
heuristic approximation. It is not pragmatically realistic to sum over the lengths of all programs
that do the same thing as a given predicate P. Generally the first term of the Solomonoff-Levin
summation is going to dominate the sum anyway, so if the ProcedureNode P is maximally
compact, then our simplified formula will be a good approximation of the Solomonoff-Levin
summation. These a priori probabilities may be merged with node probability estimates from
other sources, using the revision rule.
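The heuristic just described can be sketched in code as follows. This is a minimal sketch, assuming a generic complexity measure c(P) and a simple count-weighted revision formula with a small default evidence weight; the function names and the revision formula are illustrative assumptions, not the actual CogPrime implementation:

```python
def simplicity_prior(complexity, r=0.1):
    """A priori probability 2^(-r*c(P)): approaches 0 for infinitely
    complex procedures, and equals 1 for a procedure of zero complexity."""
    return 2.0 ** (-r * complexity)

def revise(s1, n1, s2, n2):
    """Count-weighted revision of two (strength, evidence count) estimates."""
    n = n1 + n2
    return (s1 * n1 + s2 * n2) / n, n

# Merge the simplicity prior (with a default evidence weight of 2) into
# a ProcedureNode's existing strength estimate from other sources.
prior = simplicity_prior(20.0)                 # 2^(-2) = 0.25
strength, count = revise(0.6, 8.0, prior, 2.0)
```

Here a complexity of 20 yields a prior of 0.25, which revision blends with the existing estimate of 0.6 in proportion to the two evidence counts.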
A similar strategy may be taken with ConceptNodes. We want to reward a ConceptNode
C with a higher a priori probability if C E SatisfyingSet(P) for a simple PredicateNode P. To
achieve this formulaically, let sim(X, Y) denote the strength of the SimilarityLink between X
and Y, and let:
sim'(C, P) = sim(C, SatisfyingSet(P))
We may then define the a priori probability of a ConceptNode as:
pr(C) = Σ_P sim'(C, P) 2^{-r c(P)}
where the sum goes over all P in the system. In practice of course it's only necessary to
compute the terms of the sum corresponding to P so that sim'(C, P) is large.
As with the a priori PredicateNode probabilities discussed above, these a priori ConceptN-
ode probabilities may be merged with other node probability information, using the revision
rule, and using a default parameter value for the weight of evidence. There is one pragmatic
difference here from the PredicateNode case, though. As the system learns new PredicateNodes,
its best estimate of pr(C) may change. Thus it makes sense for the system to store the a priori
probabilities of ConceptNodes separately from the node probabilities, so that when the a priori
probability is changed, a two step operation can be carried out:
• First, remove the old a priori probability from the node probability estimate, using the
reverse of the revision rule
• Then, add in the new a priori probability
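The two-step update might be sketched as below. The count-weighted revision formula, its reverse, and the default evidence weight of 2 are illustrative assumptions rather than the actual CogPrime code:

```python
def concept_prior(terms, r=0.1):
    """pr(C) = sum over P of sim'(C,P) * 2^(-r*c(P)); in practice only
    the terms where sim'(C,P) is large need to be computed."""
    return sum(sim * 2.0 ** (-r * c) for sim, c in terms)

def revise(s1, n1, s2, n2):
    """Count-weighted revision of two (strength, evidence count) estimates."""
    n = n1 + n2
    return (s1 * n1 + s2 * n2) / n, n

def unrevise(s, n, s_old, n_old):
    """Reverse of the revision rule: subtract out an old component."""
    n_rest = n - n_old
    return (s * n - s_old * n_old) / n_rest, n_rest

# Two-step update for a ConceptNode whose stored a priori probability was
# 0.2 (default evidence weight 2): remove the old prior, add the new one.
s, n = 0.5, 10.0
s, n = unrevise(s, n, 0.2, 2.0)
new_prior = concept_prior([(0.9, 10.0), (0.4, 30.0)])  # 0.45 + 0.05
s, n = revise(s, n, new_prior, 2.0)
```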
Finally, we can take a similar approach to any Atom Y produced by a SchemaNode. We can
construct:
pr(Y) = Σ_{S,X} s(S, X, Y) 2^{-r (c(S)+c(X))}
S,X
where the sum goes over all pairs (S, X) so that:
ExecutionLink S X Y
and s(S, X, Y) is the strength of this ExecutionLink. Here, we are rewarding Atoms that are
produced by simple schemata based on simple inputs.
The combined result of these heuristics is to cause the system to prefer simpler explanations,
analyses, procedures and ideas. But of course this is only an a priori preference, and if more
complex entities prove more useful, these will quickly gain greater strength and importance in
the system.
Implementationally, these various processes are carried out by the OccamsRazor MindAgent.
This dynamic selects ConceptNodes based on a combination of:
• importance
• time since the a priori probability was last updated (a long time is preferred)
It selects ExecutionLinks based on importance and based on the amount of time since they
were last visited by the OccamsRazor MindAgent. And it selects PredicateNodes based on
importance, filtering out PredicateNodes it has visited before.
Chapter 42
Map Formation
42.1 Introduction
In Chapter 20 we distinguished the explicit versus implicit aspects of knowledge representa-
tion in CogPrime. The explicit level consists of Atoms with clearly comprehensible meanings,
whereas the implicit level consists of "maps" - collections of Atoms that become important in a
coordinated manner, analogously to cell assemblies in an attractor neural net. The combination
of the two is valuable because the world-patterns useful to human-like minds in achieving their
goals, involve varying degrees of isolation and interpenetration, and their effective goal-oriented
processing involves both symbolic manipulation (for which explicit representation is most valu-
able) and associative creative manipulation (for which distributed, implicit representation is
most valuable).
The chapters since have focused primarily on explicit representation, commenting on the
implicit "map" level only occasionally. There are two reasons for this: one theoretical, one
pragmatic. The theoretical reason is that the majority of map dynamics and representations
are implicit in Atom-level correlates. And the pragmatic reason is that, at this stage, we simply
do not know as much about CogPrime maps as we do about CogPrime Atoms. Maps are
emergent entities and, lacking a detailed theory of CogPrime dynamics, the only way we have
to study them in detail is to run CogPrime systems and mine their System Activity Tables and
logs for information. If CogPrime research goes well, then updated versions of this book may
include more details on observed map dynamics in various contexts.
In this chapter, however, we finally turn our gaze directly to maps and their relationships
to Atoms, and discuss processes that convert Atoms into maps (expansion) and vice versa
(encapsulation). These processes represent a bridge between the concretely-implemented and
emergent aspects of CogPrime's mind.
Map encapsulation is the process of recognizing Atoms that tend to become important in
a coordinated manner, and then creating new Atoms grouping these. As such it is essentially
a form of AtomSpace pattern mining. In terms of patternist philosophy, map encapsulation
is a direct incarnation of the so-called "cognitive equation"; that is, the process by which the
mind recognizes patterns in itself, and then embodies these patterns as new content within
itself - an instance of what Hofstadter famously labeled a "strange loop" [Hof79]. In SMEPH
terms, the encapsulation process is how CogPrime explicitly studies its own derived hypergraph
and then works to implement this derived hypergraph more efficiently by recapitulating it at
the concretely-implemented-mind level. This of course may change the derived hypergraph
considerably. Among other things, map encapsulation has the possibility of taking the things
that were the most abstract, highest level patterns in the system and forming new patterns
involving them and their interrelationships - thus building the highest level of patterns in the
system higher and higher. Figures 42.2 and 42.1 illustrate concrete examples of the process.
Fig. 42.1: Illustration of the process of creating explicit Atoms corresponding to a pattern
previously represented as a distributed "map." (The figure contrasts the AtomTable before
and after map encapsulation.)
Map expansion, on the other hand, is the process of taking knowledge that is explicitly
represented and causing the AtomSpace to represent it implicitly, on the map level. In many
cases this will happen automatically. For instance, a ConceptNode C may turn into a concept
map if the importance updating process iteratively acts in such a way as to create/reinforce a
map consisting of C and its relata. Or, an Atom-level InheritanceLink may implicitly spawn a
map-level InheritanceEdge (in SMEPH terms). However, there is one important case in which
Atom-to-map conversion must occur explicitly: the expansion of compound ProcedureNodes
into procedure maps. This must occur explicitly because the process graphs inside ProcedureN-
odes have no dynamics going on except evaluation; there is no opportunity for them to manifest
themselves as maps, unless a MindAgent is introduced that explicitly does so. Of course, just
unfolding a Combo tree into a procedure map doesn't intrinsically make it a significant part of
the derived hypergraph - but it opens the door for the inter-cognitive-process integration that
may make this occur.
Fig. 42.2: Illustration of the process of creating explicit Atoms corresponding to a pattern
previously represented as a distributed "map." (In the figure, map formation creates new,
abstract ConceptNodes symbolizing the patterns of co-importance it has noted - e.g. one node
symbolizing a map related to the agent talking, another symbolizing a map related to happy
people - and, guided by the results of map formation, PLN then seeks specific, related logical
relationships, such as an Implication between the two.)
42.2 Map Encapsulation
Returning to encapsulation: it may be viewed as a form of symbolization, in which the system
creates concrete entities to serve as symbols for its own emergent patterns. It can then study
an emergent pattern's interrelationships by studying the interrelationships of the symbol with
other symbols.
For instance, suppose a system has three derived-hypergraph ConceptVertices A, B and C,
and observes that:
InheritanceEdge A B
InheritanceEdge B C
Then encapsulation may create ConceptNodes A', B' and C' for A, B and C, and Inheri-
tanceLinks corresponding to the InheritanceEdges, where e.g. A' is a set containing all the
Atoms contained in the static map A. First-order PLN inference will then immediately con-
clude:
InheritanceLink A' C'
and it may possibly do so with a higher strength than the strength corresponding to the (per-
haps not significant) InheritanceEdge between A and C. But if the encapsulation is done right
then the existence of the new InheritanceLink will indirectly cause the formation of the corre-
sponding:
InheritanceEdge A C
via the further action of inference, which will use (InheritanceLink A' C') to trigger the inference
of further inheritance relationships between members of A' and members of C', which will create
an emergent inheritance between members of A (the map corresponding to A') and C (the map
corresponding to C').
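The inference step on the encapsulated symbols can be sketched as follows. The link store is hypothetical toy data, and the strength product is a crude stand-in for the full PLN deduction formula:

```python
# Toy store of InheritanceLinks: (source, target) -> strength
links = {("A'", "B'"): 0.8, ("B'", "C'"): 0.9}

def deduce(links, a, b, c):
    """First-order deduction sketch: derive Inheritance a->c from a->b and
    b->c. (The strength product simplifies the real PLN deduction rule.)"""
    if (a, b) in links and (b, c) in links:
        links[(a, c)] = links[(a, b)] * links[(b, c)]
    return links

deduce(links, "A'", "B'", "C'")
```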
The above example involved the conversion of static maps into ConceptNodes. Another
approach to map encapsulation is to represent the fact that a set of Atoms constitutes a map
as a predicate; for instance if the nodes A, B and C are habitually used together, then the
predicate P may be formed, where:
P
AND
A is used at time T
B is used at time T
C is used at time T
The habitualness of A, B and C being used together will be reflected in the fact that P has
a surprisingly high truth value. By a simple concept formation heuristic, this may be used to
form a link AND(A, B, C), so that:
AND(A, B, C) is used at time T
This composite link AND(A, B, C) is then an embodiment of the map in single-Atom form.
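Detecting such habitually co-used Atom sets can be sketched as below, on toy data with a raw frequency threshold; in CogPrime the truth value of P would be judged surprisingly high relative to an independence-based expectation rather than against a fixed cutoff:

```python
from itertools import combinations

# usage[t] = the set of Atoms used at time t (toy data)
usage = [
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"A", "B", "C"},
    {"D", "E"}, {"A", "B", "C", "E"}, {"D"},
]

def coactive_sets(usage, size=3, min_freq=0.5):
    """Return atom groups whose co-usage frequency is high - candidates
    for encapsulation as a single AND(...) link."""
    atoms = sorted(set().union(*usage))
    found = []
    for group in combinations(atoms, size):
        freq = sum(set(group) <= row for row in usage) / len(usage)
        if freq >= min_freq:
            found.append((group, freq))
    return found

candidates = coactive_sets(usage)
```

On this data only the group (A, B, C), co-used in four of six time slices, passes the threshold.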
Similarly, if a set of schemata is commonly used in a certain series, this may be recognized in
a predicate, and a composite schema may then be created embodying the component schemata.
For instance, suppose it is recognized as a pattern that:
AND
S1 is used at time T on input I1 producing output O1
S2 is used at time T+s on input O1 producing output O2
Then we may explicitly create a schema that consists of S1 taking input and feeding its output
to S2. This cannot be done via any standard concept formation heuristic; it requires a special
process.
One might wonder why this Atom-to-map conversion process is necessary: Why not just
let maps combine to build new maps, hierarchically, rather than artificially transforming some
maps into Atoms and letting maps then form from these map-representing Atoms? It is all a
matter of precision. Operations on the map level are fuzzier and less reliable than operations
on the Atom level. This fuzziness has its positive and its negative aspects. For example, it is
good for spontaneous creativity, but bad for constructing lengthy, confident chains of thought.
42.3 Atom and Predicate Activity Tables
A major role in map formation is played by a collection of special tables. Map encapsulation
takes place, not by data mining directly on the AtomTable, but by data mining on these special
tables constructed from the AtomTable, specifically with efficiency of map mining in mind.
First, there is the Atom Utilization Table, which may be derived from the SystemActivi-
tyTable. The Atom Utilization Table, in its most simple possible version, takes the form shown
in Table 42.1.
Time  Atom Handle H
...   ...
T     (Effort spent on Atom H at time T, utility derived from Atom H at time T)
...   ...

Table 42.1: Atom Utilization Table
The calculation of "utility" values for this purpose must be done in a "local" way by MindAgents,
rather than by a global calculation of the degree to which utilizing a certain Atom has led to
the achievement of a certain system goal (this kind of global calculation would be better in
principle, but it would require massive computational effort to calculate for every Atom in
the system at frequent intervals). Each MindAgent needs to estimate how much utility it has
obtained from a given Atom, as well as how much effort it has spent on this Atom, and report
these numbers to the Atom Utilization Table.
The normalization of effort values is simple, since effort can be quantified in terms of time and
space expended. Normalization of utility values is harder, as it is difficult to define a common
scale to span all the different MindAgents, which in some cases carry out very different sorts
of operations. One reasonably "objective" approach is to assign each MindAgent an amount of
"utility credit", at time T, equal to the amount of currency that the MindAgent has spent since
it last disbursed its utility credits. It may then divide up its utility credit among the Atoms it
has utilized. Other reasonable approaches may also be defined.
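One way to implement the utility-credit scheme is sketched below; dividing the credit proportionally to effort is just one possible policy, since the text leaves the division rule open:

```python
def disburse_utility_credit(currency_spent, effort_per_atom):
    """A MindAgent's utility credit equals the currency it has spent since
    its last disbursal; here the credit is divided among the Atoms the
    agent utilized, proportionally to the effort spent on each."""
    total = sum(effort_per_atom.values())
    return {atom: currency_spent * effort / total
            for atom, effort in effort_per_atom.items()}

credits = disburse_utility_credit(10.0, {"X": 3.0, "Y": 1.0, "Z": 1.0})
```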
The use of utility and utility credit for Atoms and MindAgents is similar to the stimulus
used in the Attention allocation system. There, MindAgents reward Atoms with stimulus to
indicate that their short and long term importance should be increased. Merging utility and
stimulus is a natural approach to implementing utility in OpenCogPrime.
Note that there are many practical forms that the abstract notion of an ActivityTable
may take. It could be an ordinary row-and-column style table, but that is not the only
nor the most interesting possibility. An ActivityTable may also be effectively stored as a series
of graphs corresponding to time intervals - one graph for each interval, consisting of Hebbian-
Links formed solely based on importance during that interval. In this case it is basically a set
of graphs, which may be stored for instance in an AtomTable, perhaps with a special index.
Then there is the Procedure Activity Table, which records the inputs and outputs associated
with procedures:
Time  ProcedureNode Handle H
...   ...
T     (Inputs to H, Outputs from H)
...   ...

Table 42.2: Procedure Activity Table for a Particular MindAgent
Data mining on these tables may be carried out by a variety of algorithms (see Section 42.4)
- the more advanced the algorithm, the fuller the transfer from the derived-hypergraph level
to the concretely-implemented level. There is a tradeoff here similar to that with attention
allocation - if too much time is spent studying the derived hypergraph, then there will not be
any interesting cognitive dynamics going on anymore because other cognitive processes get no
resources, so the map encapsulation process will fail because there is nothing to study!
These same tables may be used in the attention allocation process, for assigning of MindAgent-
specific AttentionValues to Atoms.
42.4 Mining the AtomSpace for Maps
Searching for general maps in a complex AtomSpace is an unrealistically difficult problem, as
the search space is huge. So, the bulk of map-mining activity involves looking for the most
simple and obvious sorts of maps. A certain amount of resources may also be allocated to
looking for subtler maps using more resource-intensive methods.
The following categories of maps can be searched for at relatively low cost:
• Static maps
• Temporal motif maps
Conceptually, a static map is simply a set of Atoms that all tend to be active at the same
time.
Next, by a "temporal motif map" we mean a set of pairs:
(A_i, t_i)
of the type:
(Atom, int)
so that for many activation cycle indices T, A_i is highly active at some time very close to index
T + t_i. The reason both static maps and temporal motif maps are easy to recognize is that they
are both simply repeated patterns.
Perceptual context formation involves a special case of static and temporal motif mining. In
perceptual context formation, one specifically wishes to mine maps involving perceptual nodes
associated with a single interaction channel (see Chapter 26 for interaction channel). These
maps then represent real-world contexts, that may be useful in guiding real-world-oriented goal
activity (via schema-context-goal triads).
In CogPrime so far we have considered three broad approaches for mining static and temporal
motif maps from AtomSpaces:
• Frequent subgraph mining, frequent itemset mining, or other sorts of datamining on Activity
Tables
• Clustering on the network of HebbianLinks
• Evolutionary Optimization based datamining on Activity Tables
The first two approaches are significantly more time-efficient than the latter, but also signifi-
cantly more limited in the scope of patterns they can find.
Any of these approaches can be used to look for maps subject to several types of constraints,
such as for instance:
• Unconstrained: maps may contain any kinds of Atoms
• Strictly constrained: maps may only contain Atom types contained on a certain list
• Probabilistically constrained: maps must contain Atom types contained on a certain
list, as x% of their elements
• Trigger-constrained: the map must contain an Atom whose type is on a certain list, as
its most active element
Different sorts of constraints will lead to different sorts of maps, of course. We don't know at
this stage which sorts of constraints will yield the best results. Some special cases, however, are
reasonably well understood. For instance:
• procedure encapsulation, to be discussed below, involves searching for (strictly-constrained)
maps consisting solely of ProcedureInstanceNodes.
• to enhance goal achievement, it is likely useful to search for trigger-constrained maps trig-
gered by Goal Nodes.
What the MapEncapsulation CIM-Dynamic (Concretely-Implemented-Mind-Dynamic, see
Chapter 19) does once it finds a map depends upon the type of map found. In the
special case of procedure encapsulation, it creates a compound ProcedureNode (selecting
SchemaNode or PredicateNode based on whether the output is a TruthValue or not). For static
maps, it creates a ConceptNode, which links to all members of the map with MemberLinks, the
weight of which is determined by the degree of map membership. For dynamic maps, it creates
PredictiveImplication links depicting the pattern of change.
42.4.1 Frequent Itemset Mining for Map Mining
One class of technique that is useful here is frequent itemset mining (FIM), a process that looks
to find all frequent combinations of items occurring in a set of data. Another useful class of
algorithms is greedy or stochastic itemset mining, which does roughly the same thing as FIM
but without being completely exhaustive (the advantage being greater execution speed). Here
we will discuss FIM, but the basic concepts are the same if one is doing greedy or stochastic
mining instead.
The basic goal of frequent itemset mining is to discover frequent subsets in a group of
items. One knows that for a set of N items, there are 2^N - 1 possible subgroups. To avoid the
exponential explosion of subsets, one may compute the frequent itemsets in several rounds.
Round i computes all frequent i-itemsets.
A round has two steps: candidate generation and candidate counting. In the candidate
generation step, the algorithm generates a set of candidate i-itemsets whose support - the
percentage of events in which the itemset appears - has not yet been computed. In the
candidate-counting step, the algorithm scans its memory database, counting the support of the
candidate itemsets. After the scan, the algorithm discards candidates with support lower than
the specified minimum (an algorithm parameter) and retains only the frequent i-itemsets. The
algorithm reduces the number of tested subsets by pruning a priori those candidate itemsets
that cannot be frequent, based on the knowledge about infrequent itemsets obtained from
previous rounds. So for instance if {A, B} is a frequent 2-itemset then {A, B, C} may possibly
be a frequent 3-itemset; on the contrary, if {A, B} is not a frequent itemset then {A, B, C}, as
well as any superset of {A, B}, will be discarded. Although the worst case of this sort of algorithm is
exponential, practical executions are generally fast, depending essentially on the support limit.
To apply this kind of approach to search for static maps, one simply creates a large set of
sets of Atoms - one set for each time-point. In the set S(t) corresponding to time t, we place all
Atoms that were firing activation at time t. The itemset miner then searches for sets of Atoms
that are subsets of many different S(t) corresponding to many different times t. These are Atom
sets that are frequently co-active.
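The round structure described above can be sketched as follows; this is a bare-bones Apriori-style miner over toy S(t) data, not the production mining code:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Round-based frequent itemset mining: round k counts candidate
    k-itemsets, keeps those meeting min_support, then joins them into
    (k+1)-item candidates, pruning any whose k-subsets are not all
    frequent (the a priori pruning step)."""
    n = len(transactions)
    frequent = {}
    current = {frozenset([i]) for t in transactions for i in t}
    k = 1
    while current:
        level = {}
        for cand in current:
            support = sum(cand <= t for t in transactions) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        k += 1
        current = {a | b for a in level for b in level
                   if len(a | b) == k
                   and all(frozenset(sub) in level
                           for sub in combinations(a | b, k - 1))}
    return frequent

# S(t): the set of Atoms firing activation at each of four time points
S = [{"Q", "R"}, {"Q", "R", "T"}, {"Q", "T", "U"}, {"Q", "T", "U"}]
freq = apriori([frozenset(s) for s in S], min_support=0.5)
```

On this toy data, for instance, {Q, R} and {Q, T, U} both come out frequent at support 0.5, while {R, T} is discarded in the second round.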
Table ?? presents a typical example of data prepared for frequent itemset mining, in the
context of context formation via static-map recognition. Columns represent important nodes
and rows indicate time slices. For simplicity, we have thresholded the values and show only
binary activity values, so that a 1 in a cell indicates that the Atom indicated by the column
was being utilized at the time indicated by the row.
In the example, if we assume a minimum support of 50 percent, the context nodes C1 = {Q,
R} and C2 = {Q, T, U} would be created.
Using frequent itemset mining to find temporal motif maps is a similar, but slightly more
complex process. Here, one fixes a time-window W. Then, for each activation cycle index t, one
creates a set S(t) consisting of pairs of the form:
(A, s)
where A is an Atom and 0 ≤ s ≤ W is an integer temporal offset. We have:
(A, s) "within" S(t)
if Atom A is firing activation at time t+s. Itemset mining is then used to search for common
subsets among the S(t). These common subsets are common patterns of temporal activation,
i.e. repeated temporal motifs.
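Building the offset sets S(t) can be sketched as below, on a toy activity trace; the final lines simply count how many S(t) contain a given candidate motif:

```python
def offset_sets(activity, window):
    """For each activation cycle t, build S(t) = {(A, s) : 0 <= s <= window
    and Atom A fired at time t + s}. Common subsets of the S(t) are then
    repeated temporal motifs."""
    return [frozenset((atom, s)
                      for s in range(window + 1)
                      for atom in activity[t + s])
            for t in range(len(activity) - window)]

# Toy trace: the motif "A, then B one cycle later" recurs twice
activity = [{"A"}, {"B"}, set(), {"A"}, {"B"}, set()]
sets = offset_sets(activity, window=1)
motif = frozenset({("A", 0), ("B", 1)})
hits = sum(motif <= s for s in sets)
```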
The strength of this approach is its ability to rapidly search through a huge space of possibly
significant subsets. Its weakness is its restriction to finding maps that can be incrementally built
up from smaller maps. How significant this weakness is, depends on the particular statistics of
map occurrence in CogPrime. Intuitively, we believe frequent itemset mining can perform rather
well in this context, and our preliminary experiments have supported this intuition.
Frequent Subgraph Mining for Map Mining
A limitation of FIM techniques, from a CogPrime perspective, is that they are intended for
relational databases (RDBs); but the information about co-activity in a CogPrime instance is
generally going to be more efficiently stored as graphs rather than RDB's. Indeed an Activi-
tyTable may be effectively stored as a series of graphs corresponding to time intervals - one
graph for each interval, consisting of HebbianLinks formed solely based on importance during
that interval. From ActivityTable stores like this, the way to find maps is not frequent itemset
mining but rather frequent subgraph mining - a variant of FIM that is conceptually similar
but algorithmically more subtle, and on which there has arisen a significant literature in re-
cent years. We have already briefly discussed this technology in Chapter 37 on pattern mining
the Atomspace - map mining being an important special case of Atomspace pattern mining.
As noted there, some of the many approaches to frequent subgraph mining are described in
[HWP03, KK01].
42.4.2 Evolutionary Map Detection
Just as general Atomspace pattern mining may be done via evolutionary learning as well as
greedy mining, the same holds for the special case of map mining. Complementary to the itemset
mining approach, the CogPrime design also uses evolutionary optimization to find maps. Here
the data setup is the same as in the itemset mining case, but instead of using an incremental
search approach, one sets up a population of subsets of the sets S(t), and seeks to evolve the
population to find optimally fit subsets. Fitness is defined simply as high frequency - relative
to the frequency one would expect based on statistical independence assumptions alone.
In principle one could use evolutionary learning to do all map encapsulation, but this isn't
computationally feasible - it would limit too severely the amount of map encapsulation that
could be done. Instead, evolutionary learning must be supplemented by some more rapid, less
expensive technique.
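The fitness function just described - observed frequency relative to the expectation under statistical independence - can be sketched as follows, on toy data:

```python
def motif_fitness(candidate, transactions):
    """Fitness of a candidate Atom set: its observed co-activity frequency
    divided by the frequency expected if its members fired independently."""
    n = len(transactions)
    observed = sum(candidate <= t for t in transactions) / n
    expected = 1.0
    for atom in candidate:
        expected *= sum(atom in t for t in transactions) / n
    return observed / expected if expected > 0 else 0.0

# Toy S(t) data: T and U co-fire more often than independence predicts
S = [{"Q", "R"}, {"Q", "R", "T"}, {"Q", "T", "U"}, {"Q", "T", "U"}]
fit = motif_fitness({"T", "U"}, S)   # observed 0.5 vs expected 0.375
```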
42.5 Map Dynamics
Assume one has a collection of Atoms, with:
• Importance values I(A), assigned via the economic attention allocation mechanism.
• HebbianLink strengths (HebbianLink A B).tv.s, assigned as (loosely speaking) the proba-
bility of B's importance assuming A's importance.
Then, one way to search for static maps is to look for collections C of Atoms that are strong
clusters according to HebbianLinks. That is, for instance, to find collections C so that:
• The mean strength of (HebbianLink A B).tv.s, where A and B are in the collection C, is
large.
• The mean strength of (HebbianLink A Z).tv.s, where A is in the collection C and Z is not,
is small.
(this is just a very simple cluster quality measurement; there is a variety of other cluster quality
measurements one might use instead.)
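This simple cluster quality measure can be sketched directly; here missing HebbianLinks are treated as strength zero, whereas a real AtomSpace would supply default truth values:

```python
def cluster_quality(strength, cluster, all_atoms):
    """Mean HebbianLink strength inside the cluster, minus the mean
    strength of links crossing the cluster boundary."""
    def mean(pairs):
        vals = [strength.get(p, 0.0) for p in pairs]
        return sum(vals) / len(vals) if vals else 0.0
    inside = [(a, b) for a in cluster for b in cluster if a != b]
    crossing = [(a, z) for a in cluster for z in all_atoms - cluster]
    return mean(inside) - mean(crossing)

# Toy data: A and B link strongly to each other, weakly to outsider Z
strength = {("A", "B"): 0.9, ("B", "A"): 0.8, ("A", "Z"): 0.1}
quality = cluster_quality(strength, {"A", "B"}, {"A", "B", "Z"})
```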
Dynamic maps may be more complex; for instance there might be two collections C1 and C2
so that:
• Mean strength of (HebbianLink A B).tv.s, where A is in C1 and B is in C2
• Mean strength of (HebbianLink B A).tv.s, where B is in C2 and A is in C1
are both very large.
A static map will tend to be an attractor for CogPrime's attention-allocation-based dynamics,
in the sense that when a few elements of the map are acted upon, it is likely that other elements
of the map will soon also come to be acted upon. The reason is that, if a few elements of the map
are acted upon usefully, then their importance values will increase. Node probability inference
based on the HebbianLinks will then cause the importance values of the other nodes in the
map to increase, thus increasing the probability that the other nodes in the map are acted
upon. Critical here is that the HebbianLinks have a higher weight of evidence than the node
importance values. This is because the node importance values are assumed to be ephemeral
- they reflect whether a given node is important at a given moment or not - whereas the
HebbianLinks are assumed to reflect longer-lasting information.
A dynamic map will also be an attractor, but of a more complex kind. The example given
above, with C1 and C2, will be a periodic attractor rather than a fixed-point attractor.
42.6 Procedure Encapsulation and Expansion
One of the most important special cases of map encapsulation is procedure encapsulation.
This refers to the process of taking a schema/predicate map and embodying it in a single
ProcedureNode. This may be done by mining on the Procedure Activity Table, described in
Activity Tables, using either:
• a special variant of itemset mining that seeks for procedures whose outputs serve as inputs
for other procedures.
• Evolutionary optimization with a fitness function that restricts attention to sets of
procedures that form a digraph, where the procedures lie at the vertices and an arrow from
vertex A to vertex B indicates that the outputs of A become the inputs of B.
The reverse of this process, procedure expansion, is also interesting, though algorithmically
easier - here one takes a compound ProcedureNode and expands its internals into a collection
of appropriately interlinked ProcedureNodes. The challenge here is to figure out where to split
a complex Combo tree into subtrees. But if the Combo tree has a hierarchical structure then
this is very simple; the hierarchical subunits may simply be split into separate ProcedureNodes.
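Splitting a hierarchically structured tree into separate, mutually referring procedures can be sketched as below; the nested (operator, children...) tuple encoding is a stand-in for a real Combo tree:

```python
from itertools import count

def expand(tree, store, ids=None):
    """Recursively split a nested (operator, child, ...) tree: each
    hierarchical subunit becomes its own named procedure whose body
    refers to the procedures created for its children."""
    ids = count(1) if ids is None else ids
    if not isinstance(tree, tuple):        # leaf: an argument or constant
        return tree
    op, *children = tree
    body = (op, *(expand(c, store, ids) for c in children))
    name = f"proc_{next(ids)}"
    store[name] = body
    return name

store = {}
root = expand(("seq", ("grab", "ball"), ("walk", "north")), store)
```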
These two processes may be used in sequence to interesting effect: expanding an important
compound ProcedureNode so it can be modified via reinforcement learning, then encapsulating
its modified version for efficient execution, then perhaps expanding this modified version later
on.
To an extent, the existence of these two different representations of procedures is an artifact
of CogPrime's particular software design (and ultimately, a reflection of certain properties of the
von Neumann computing architecture). But it also represents a more fundamental dichotomy,
between:
• Procedures represented in a way that allows them to be dynamically, improvisationally
restructured via interaction with other mental processes during the execution process.
• Procedures represented in a way that is relatively encapsulated and mechanical, allowing
collaboration with other aspects of the mind during execution only in fairly limited ways
Conceptually, we believe that this is a very useful distinction for a mind to make. In nearly
any reasonable cognitive architecture, it's going to be more efficient to execute a procedure
if that procedure is treated as something with a relatively rigid structure, so it can simply
be executed without worrying about interactions except in a few specific regards. This is a
strong motivation for an artificial cognitive system to have a dual (at least) representation of
procedures, or else a subtle representation that is flexible regarding its degree of flexibility, and
automagically translates constraint into efficiency.
42.6.1 Procedure Encapsulation in More Detail
A procedure map is a temporal motif: it is a set of Atoms (ProcedureNodes) which are habitually
executed in a particular temporal order, and which implicitly pass arguments amongst
each other. For instance, if procedure A acts to create node X, and procedure B then takes
node X as input, then we may say that A has implicitly passed an argument to B.
The encapsulation process can recognize some very subtle patterns, but a fair fraction of its
activity can be understood in terms of some simple heuristics.
For instance, the map encapsulation process will create a node
h = B f g = f ∘ g = f composed with g
(B as in combinatory logic) when there are many examples in the system of:
ExecutionLink g x y
ExecutionLink f y z
The procedure encapsulation process will also recognize larger repeated subgraphs, and their
patterns of execution over time. But some of its recognition of larger subgraphs may be done
incrementally, by repeated recognition of simple patterns like the ones just described.
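The composition heuristic just described can be sketched concretely. The following Python toy (the tuple representation of ExecutionLinks and the count threshold are assumptions, not CogPrime internals) scans a log of execution observations and proposes the encapsulated node h = f ∘ g whenever g's outputs are repeatedly consumed as f's inputs:

```python
# Illustrative sketch of the composition heuristic, not actual CogPrime code.
from collections import Counter

def propose_compositions(execution_links, min_count=3):
    """execution_links: list of (procedure, input, output) observations.
    Returns candidate (f, g) pairs worth encapsulating as h = f o g."""
    pairs = Counter()
    by_input = {}
    for proc, inp, out in execution_links:
        by_input.setdefault(inp, []).append(proc)
    for proc, inp, out in execution_links:
        for f in by_input.get(out, []):     # f consumed g's output
            pairs[(f, proc)] += 1           # candidate h = f o g
    return [fg for fg, n in pairs.items() if n >= min_count]

# Four observed runs where g's output y_i became f's input:
links = [("g", "x%d" % i, "y%d" % i) for i in range(4)] + \
        [("f", "y%d" % i, "z%d" % i) for i in range(4)]
print(propose_compositions(links))  # -> [('f', 'g')]
```

Repeated application of a recognizer like this, as the text notes, can build up larger encapsulated subgraphs incrementally.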
42.6.2 Procedure Encapsulation in the Human Brain
Finally, we briefly discuss some conceptual issues regarding the relation between CogPrime
procedure encapsulation and the human brain. Current knowledge of the human brain is weak
in this regard, but we won't be surprised if, in time, it is revealed that the brain stores procedures
in several different ways, that one distinction between these different ways has to do with degree
of openness to interactions, and that the less open ways lead to faster execution.
Generally speaking, there is good evidence for a neural distinction between procedural,
episodic and declarative memory. But knowledge about distinctions between different kinds
of procedural memory is scantier. It is known that procedural knowledge can be "routinized"
- so that, e.g., once you get good at serving a tennis ball or solving a quadratic equation,
your brain handles the process in a different way than before when you were learning. And it
seems plausible that routinized knowledge, as represented in the brain, has fewer connections
back to the rest of the brain than the pre-routinized knowledge. But there will be much firmer
knowledge about such things in the coming years and decades as brain scanning technology
advances.
Overall, there is more knowledge in cognitive and neural science about motor procedures than
cognitive procedures (see e.g. [SW05]). In the brain, much of motor procedural memory resides
in the pre-motor area of the cortex. The motor plans stored here are not static entities and are
easily modified through feedback, and through interaction with other brain regions. Generally,
a motor plan will be stored in a distributed way across a significant percentage of the premotor
cortex; and complex or multipart actions will tend to involve numerous sub-plans, executed
both in parallel and in serial. Often what we think of as separate/distinct motor-plans may
in fact be just slightly different combinations of subplans (a phenomenon also occurring with
schema maps in CogPrime ).
In the case of motor plans, a great deal of the routinization process has to do with learning the
timing necessary for correct coordination between muscles and motor subplans. This involves
integration of several brain regions - for instance, timing is handled by the cerebellum to a
degree, and some motor-execution decisions are regulated by the basal ganglia.
One can think of many motor plans as involving abstract and concrete sub-plans. The ab-
stract sub-plans are more likely to involve integration with those parts of the cortex dealing
with conceptual thought. The concrete sub-plans have highly optimized timings, based on close
integration with cerebellum, basal ganglia and so forth - but are not closely integrated with
the conceptualization-focused parts of the brain. So, a rough CogPrime model of human motor
procedures might involve schema maps coordinating the abstract aspects of motor procedures,
triggering activity of complex SchemaNodes containing precisely optimized procedures that
interact carefully with external actuators.
42.7 Maps and Focused Attention
The cause of map formation is important to understand. Formation of small maps seems to
follow from the logic of focused attention, along with hierarchical maps of a certain nature. But
the argument for this is somewhat subtle, involving cognitive synergy between PLN inference
and economic attention allocation.
The nature of PLN is that the effectiveness of reasoning is maximized by (among other
strategies) minimizing the number of incorrect independence assumptions. If reasoning on N
nodes, the way to minimize independence assumptions is to use the full inclusion-exclusion
formula to calculate interdependencies between the N nodes. This involves 2^N terms, one for
each subset of the N nodes. Very rarely, in practical cases, will one have significant information
about all these subsets. However, the nature of focused attention is that the system seeks to
find out about as many of these subsets as possible, so as to be able to make the most accurate
possible inferences, hence minimizing the use of unjustified independence assumptions. This
implies that focused attention cannot hold too many items within it at one time, because if N
is too big, then doing a decent sampling of the subsets of the N items is no longer realistic.
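The combinatorial pressure behind this argument is easy to make concrete. The short Python toy below (purely illustrative) enumerates the non-empty subsets that a full inclusion-exclusion calculation over N items would need evidence about, showing why the attentional focus cannot hold many items at once:

```python
# Toy illustration: exact inclusion-exclusion over N items requires
# evidence about all 2^N - 1 non-empty subsets of the items.
from itertools import combinations

def interdependency_terms(items):
    """All non-empty subsets that full inclusion-exclusion must weigh."""
    subsets = []
    for k in range(1, len(items) + 1):
        subsets.extend(combinations(items, k))
    return subsets

for n in (3, 10, 20):
    print(n, 2 ** n - 1)   # subset count grows exponentially

print(len(interdependency_terms(list("abc"))))  # -> 7
```

For N = 3 there are only 7 subsets to sample; for N = 20 there are over a million, which is why a focus of roughly N < 10 items is the realistic regime.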
So, suppose that N items have been held within focused attention, meaning that a lot of
predicates embodying combinations of N items have been constructed and evaluated and rea-
soned on. Then, during this extensive process of attentional focus, many of the N items will be
useful in combination with each other - because of the existence of predicates joining the items.
Hence, many HebbianLinks will grow between the N items - causing the set of N items to form
a map.
By this reasoning, it seems that focused attention will implicitly be a map formation process
- even though its immediate purpose is not map formation, but rather accurate inference (infer-
ence that minimizes independence assumptions by computing as many cross terms as is possible
based on available direct and indirect evidence). Furthermore, it will encourage the formation
of maps with a small number of elements in them (say, N<10). However, these elements may
themselves be ConceptNodes grouping other nodes together, perhaps grouping together nodes
that are involved in maps. In this way, one may see the formation of hierarchical maps, formed
of clusters of clusters of clusters..., where each cluster has N<10 elements in it. These hierar-
chical maps manifest the abstract dual network concept that occurs frequently in CogPrime
philosophy.
It is tempting to postulate that any intelligent system must display similar properties - so
that focused attention, in general, has a strictly limited scope and causes the formation of
maps that have central cores of roughly the same size as its scope. If this is indeed a general
principle, it is an important one, because it tells you something about the general structure of
derived hypergraphs associated with intelligent systems, based on the computational resource
constraints of the systems.
The scope of an intelligent system's attentional focus would seem to generally increase log-
arithmically with the system's computational power. This follows immediately if one assumes
that attentional focus involves free intercombination of the items within it. If attentional focus
is the major locus of map formation, then - lapsing into SMEPH-speak - it follows that the
bulk of the ConceptVertices in the intelligent system's derived hypergraphs may correspond
to maps focused on a fairly small number of other ConceptVertices. In other words, derived
hypergraphs may tend to have a fairly localized structure, in which each ConceptVertex has
very strong InheritanceEdges pointing from a handful of other ConceptVertices (corresponding
to the other things that were in the attentional focus when that ConceptVertex was formed).
42.8 Recognizing and Creating Self-Referential Structures
Finally, this brief section covers a large and essential topic: how CogPrime will be able to
recognize and create large-scale self-referential structures.
Some of the most essential structures underlying human-level intelligence are self-referential
in nature. These include:
• the phenomenal self (see Thomas Metzinger's book "Being No One")
• the will
• reflective awareness
These structures are arguably not critical for basic survival functionality in natural environ-
ments. However, they are important for adequate functionality within advanced social systems,
and for abstract thinking regarding science, humanities, arts and technology.
Recall that in Chapter 3 of Part 1 these entities are formalized in terms of hypersets, and
the following recursive definitions are given:
• "S is conscious of X" is defined as: The declarative content that "S is conscious of X"
correlates with "X is a pattern in S"
• "S wills X" is defined as: The declarative content that "S wills X" causally implies "S does
X"
• "X is part of S's self" is defined as: The declarative content that "X is a part of S's self"
correlates with "X is a persistent pattern in S over time"
Relatedly, one may posit multiple similar processes that are mutually recursive, e.g.
• S is conscious of T and U
• T is conscious of S and U
• U is conscious of S and T
The cognitive importance of this sort of mutual recursion is further discussed in Appendix ??.
According to the philosophy underlying CogPrime, none of these are things that should be
programmed into an artificial mind. Rather, they must emerge in the course of a mind's self-
organization in connection with its environment. However, a mind may be constructed so that,
by design, these sorts of important self-referential structures are encouraged to emerge.
42.8.1 Encouraging the Recognition of Self-Referential Structures in
the AtomSpace
How can we do this - encourage a CogPrime instance to recognize complex self-referential
structures that may exist in its AtomTable? This is important, because, according to the same
logic as map formation: if these structures are explicitly recognized when they exist, they can
then be reasoned on and otherwise further refined, which will then cause them to exist more
definitively, and hence to be explicitly recognized as yet more prominent patterns ... etc. The
same virtuous cycle via which ongoing map recognition and encapsulation is supposed to lead
to concept formation, may be posited on the level of complex self-referential structures, leading
to their refinement, development and ongoing complexity.
One really simple way is to encode self-referential operators in the Combo vocabulary, that
is used to represent the procedures grounding GroundedPredicateNodes.
That way, one can recognize self-referential patterns in the AtomTable via standard Cog-
Prime methods like MOSES and integrative procedure and predicate learning as discussed in
Chapter 41, so long as one uses Combo trees that are allowed to include self-referential operators
at their nodes. All that matters is that one is able to take one of these Combo trees, compare
it to an AtomTable, and assess the degree to which that Combo tree constitutes a pattern in
that AtomTable.
But how can we do this? How can we match a self-referential structure like:
EquivalenceLink
    EvaluationLink will (S,X)
    CausalImplicationLink
        EvaluationLink will (S,X)
        EvaluationLink do (S,X)
against an AtomTable or portion thereof?
The question is whether there is some "map" of Atoms (some set of PredicateNodes) willMap,
so that we may infer the SMEPH (see Chapter 14) relationship:
EquivalenceEdge
    EvaluationEdge willMap (S,X)
    CausalImplicationEdge
        EvaluationEdge willMap (S,X)
        EvaluationEdge doMap (S,X)
as a statistical pattern in the AtomTable's history over the recent past. (Here, doMap is defined
to be the map corresponding to the built-in "do" predicate.)
If so, then this map willMap, may be encapsulated in a single new Node (call it willNode),
which represents the system's will. This willNode may then be explicitly reasoned upon, used
within concept creation, etc. It will lead to the spontaneous formation of a more sophisticated,
fully-fleshed-out will map. And so forth.
Now, what is required for this sort of statistical pattern to be recognizable in the AtomTable's
history? What is required is that EquivalenceEdges (which, note, must be part of the Combo
vocabulary in order for any MOSES-related algorithms to recognize patterns involving them)
must be defined according to the logic of hypersets rather than the logic of sets. What is
fascinating is that this is no big deal! In fact, the AtomTable software structures support this
automatically; it's just not the way most people are used to thinking about things. There is no
reason, in terms of the AtomTable, not to create self-referential structures like the one given
above.
The next question, though, is how to calculate the truth values of structures like those
above. The truth value of a hyperset structure turns out to be an infinite-order probability
distribution, which is a complex and peculiar entity [Goe10a]. Infinite-order probability distributions
are partially-ordered, and so one can compare the extent to which two different self-referential
structures apply to a given body of data (e.g. an AtomTable), via comparing the infinite-order
distros that constitute their truth values. In this way, one can recognize self-referential patterns
in an AtomTable, and carry out encapsulation of self-referential maps. This sounds very abstract
and complicated, but the class of infinite-order distributions defined in the above-referenced pa-
pers actually have their truth values defined by simple matrix mathematics, so there is really
nothing that abstruse involved in practice.
Finally, there is the question of how these hyperset structures are to be logically manipulated
within PLN. The answer is that regular PLN inference can be applied perfectly well to hypersets,
but some additional hyperset operations may also be introduced; these are currently being
researched.
Clearly, with this subtle, currently unimplemented component of the CogPrime design we
have veered rather far from anything the human brain could plausibly be doing in detail. But
yet, some meaningful connections may be drawn. In Chapter 13 of Part 1 we have discussed
how probabilistic logic might emerge from the brain, and also how the brain may embody
self-referential structures like the ones considered here, via (perhaps using the hippocampus)
encoding whole neural nets as inputs to other neural nets. Regarding infinite-order probabilities,
it is certainly the case that the brain is efficient at carrying out operations equivalent to matrix
manipulations (e.g. in vision and audition), and [Goe10a] reduced infinite-order probabilities to
finite matrix manipulations, so that it's not completely outlandish to posit the brain could be
doing something mathematically analogous. Thus, all in all, it seems at least plausible that the
brain could be doing something roughly analogous to what we've described here, though the
details would obviously be very different.
Section VII
Communication Between Human and Artificial
Minds
Chapter 43
Communication Between Artificial Minds
43.1 Introduction
Language is a key aspect of human intelligence, and seems to be one of two critical factors
separating humans from other intelligent animals - the other being the ability to use tools.
Steven Mithen [Mit96] argues that the key factor in the emergence of the modern human
mind from its predecessors was the coming-together of formerly largely distinct mental modules
for linguistic communication and tool making/use. Other animals do appear to have fairly
sophisticated forms of linguistic communication, which we don't understand very well at present;
but as best we can tell, modern human language has many qualitatively different aspects from
these, which enable it to synergize effectively with tool making and use, and which have enabled
it to co-evolve with various aspects of tool-dependent culture.
Some AGI theorists have argued that, since the human brain is largely the same as that of
apes and other mammals without human-like language, the emulation of human-like language is
not the right place to focus if one wants to build human-level AGI. Rather, this argument goes,
one should proceed in the same order that evolution did - start with motivated perception and
action, and then once these are mastered, human-like language will only be a small additional
step. We suspect this would indeed be a viable approach - but may not be well suited for the
hardware available today. Robot hardware is quite primitive compared to animal bodies, but
the kind of motivated perception and action that non-human animals do is extremely body-
centric (even more so than is the case in humans). On the other hand, modern computing
technology is quite sophisticated as regards language - we program computers (including AIs)
using languages of a sort, for example. This suggests that on a pragmatic basis, it may make
sense to start working with language at an earlier stage in AGI development, than the analogue
with the evolution of natural organisms would suggest.
The CogPrime architecture is compatible with a variety of different approaches to language
learning and capability, and frankly at this stage we are not sure which approach is best. Our
intention is to experiment with a variety of approaches and proceed pragmatically and empir-
ically. One option is to follow the more "natural" course and let sophisticated non-linguistic
cognition emerge first, before dealing with language in any serious way - and then encourage
human-like language facility to emerge via experience. Another option is to integrate some
sort of traditional computational linguistics system into CogPrime, and then allow CogPrime's
learning algorithms to modify this system based on its experience. Discussion of this latter
option occupies most of this section of the book - it involves many tricks and compromises, but
could potentially constitute a faster route to success. Yet another option is to communicate with
young CogPrime systems using an invented language halfway between the human-language and
programming-language domains, such as Lojban (this possibility is discussed in Appendix E).
In this initial chapter on communication, we will pursue a direction quite different from the
latter chapters, and discuss a kind of communication that we think may be very valuable in the
CogPrime domain, although it has no close analogue among human beings. Many aspects of
CogPrime closely resemble aspects of the human mind; but in the end CogPrime is not intended
as an emulation of human intelligence, and there are some aspects of CogPrime that bear no
resemblance to anything in the human mind, but exploit some of the advantages of digital
computing infrastructure over neural wetware. One of the latter aspects is Psynese, a word we
have introduced to refer to direct mind-to-mind information transfer between artificial minds.
Psynese has some relatively simple practical applications: e.g. it could aid with the use of
linguistic resources and hand-coded or statistical language parsers within a learning-based lan-
guage system, to be discussed in following chapters. In this use case, one sets up one CogPrime
using the traditional NLP approaches, and another CogPrime using a purer learning-based ap-
proach, and lets the two systems share mind-stuff in a controlled way. Psynese may also be
useful in the context of intelligent virtual pets, where one may wish to set up a CogPrime
representing "collective knowledge" of multiple virtual pets.
But it also has some grander potential implications, such as the ability to fuse multiple AI
systems into "mindplexes" as discussed in Chapter 12 of Part 1.
One might wonder why a community of two or more CogPrime s would need a language at
all, in order to communicate. After all, unlike humans, CogPrime systems can simply exchange
"brain fragments" - subspaces of their Atomspaces. One CogPrime can just send relevant nodes
and links to another CogPrime (in binary form, or in an XML representation, etc.), bypassing
the linear syntax of language. This is in fact the basis of Psynese: why transmit linear strings
of characters when one can directly transmit Atoms? But the details are subtler than it might at
first seem.
One CogPrime can't simply "transfer a thought" to another CogPrime. The problem is that
the meaning of an Atom consists largely of its relationships with other Atoms, and so to pass
a node to another CogPrime, it also has to pass the Atoms that it is related to, and so on,
and so on. Atomspaces tend to be densely interconnected, and so to transmit one thought
fully accurately, a CogPrime system is going to end up having to transmit a copy of its entire
Atomspace! Even if privacy were not an issue, this form of communication (each utterance
coming packaged with a whole mind-copy) would impose a rather severe processing load on the
communicators involved.
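The interconnectedness problem can be made concrete with a small graph sketch. In the Python toy below (the dictionary representation of Atom links is an assumption for illustration), transmitting an Atom "fully" means shipping its whole reachable neighborhood, so any practical transfer must truncate at some hop depth:

```python
# Illustrative sketch: the set of Atoms dragged along when transmitting
# one Atom grows with each hop of relationship-following.

def neighborhood(atom, links, depth):
    """Atoms reachable from `atom` in at most `depth` hops.
    links: dict mapping each atom to the atoms it is directly related to."""
    seen, frontier = {atom}, {atom}
    for _ in range(depth):
        frontier = {b for a in frontier for b in links.get(a, [])} - seen
        seen |= frontier
    return seen

links = {"crazy": ["insane", "impractical"],
         "insane": ["asylum"], "impractical": ["plan"]}
print(sorted(neighborhood("crazy", links, 1)))
# -> ['crazy', 'impractical', 'insane']
print(len(neighborhood("crazy", links, 2)))  # -> 5
```

In a densely interconnected Atomspace the depth-k neighborhood quickly approaches the whole Atomspace, which is exactly why Psynese replaces raw neighborhood transfer with queries against shared vocabulary.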
The idea of Psynese is to work around this interconnectedness problem by defining a mecha-
nism for CogPrime instances to query each others' minds directly, and explicitly represent each
others' concepts internally. This doesn't involve any unique cognitive operations besides those
required for ordinary individual thought, but it requires some unique ways of wrapping up these
operations and keeping track of their products.
Another idea this leads to is the notion of a PsyneseVocabulary: a collection of Atoms,
associated with a community of CogPrime s, approximating the most important Atoms inside
that community. The combinatorial explosion of direct-Atomspace communication is then halted
by an appeal to standardized Psynese Atoms. Pragmatically, a PsyneseVocabulary might be
contained in a PsyneseVocabulary server, a special CogPrime instance that exists to mediate
communications between other CogPrime s, and provide CogPrime s with information. Psynese
makes sense both as a mechanism for peer-to-peer communication between CogPrime s, and as
a mechanism allowing standardized communication between a community of CogPrime s using
a PsyneseVocabulary server.
43.2 A Simple Example Using a PsyneseVocabulary Server
Suppose CogPrime 1 wanted to tell CogPrime 2 that "Russians are crazy" (with the latter word
meaning something inbetween "insane" and "impractical"); and suppose that both CogPrime s
are connected to the same Psynese CogPrime with PsyneseVocabulary PV. Then, for instance,
it must find the Atom in PV corresponding to its concept "crazy." To do this it must create an
AtomStructureTemplate such as
Pred1(C1)
equals
ThereExists
    W1, C2, C3, W2, W3
    AND
        ConceptNode: C1
        ReferenceLink C1 W1
        WordNode: W1 #crazy
        ConceptNode: C2
        HebbianLink C1 C2
        ReferenceLink C2 W2
        WordNode: W2 #insane
        ConceptNode: C3
        HebbianLink C1 C3
        ReferenceLink C3 W3
        WordNode: W3 #impractical
encapsulating relevant properties of the Atom it wants to grab from PV. In this example the
properties specified are:
• ConceptNode, linked via a ReferenceLink to the WordNode for "crazy"
• HebbianLinks with ConceptNodes linked via ReferenceLinks to the WordNodes for "insane"
and "impractical"
So, what CogPrime 1 can do is fish in PV for "some concept that is denoted by the word 'crazy'
and is associated with 'insane' and 'impractical'." The association with "insane" provides more
insurance of getting the correct sense of the word "crazy" as opposed to e.g. the one in the
phrase "He was crazy about her" or in "That's crazy, man, crazy" (in the latter slang usage
"crazy" basically means "excellent"). The association with "impractical" biases away from the
interpretation that all Russians are literally psychiatric patients.
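This "fishing" operation is essentially a template match against the vocabulary. The Python toy below (the vocabulary representation and function name are invented for illustration) finds a ConceptNode referenced by a given word whose Hebbian neighbors are referenced by the disambiguating words:

```python
# Hypothetical sketch of fishing for a word sense in a PsyneseVocabulary.
# The vocabulary representation below is invented for illustration.

def fish_for_concept(vocab, word, associated_words):
    """vocab: {concept_id: {"word": str, "hebbian": [concept_id, ...]}}
    Returns the first concept denoted by `word` whose Hebbian neighbors
    cover all of `associated_words`."""
    for cid, entry in vocab.items():
        if entry["word"] != word:
            continue
        neighbor_words = {vocab[n]["word"] for n in entry["hebbian"]}
        if set(associated_words) <= neighbor_words:
            return cid
    return None

pv = {
    "c1": {"word": "crazy", "hebbian": ["c2", "c3"]},   # insane/impractical sense
    "c2": {"word": "insane", "hebbian": []},
    "c3": {"word": "impractical", "hebbian": []},
    "c4": {"word": "crazy", "hebbian": ["c5"]},          # slang "excellent" sense
    "c5": {"word": "excellent", "hebbian": []},
}
print(fish_for_concept(pv, "crazy", ["insane", "impractical"]))  # -> c1
print(fish_for_concept(pv, "crazy", ["excellent"]))              # -> c4
```

Different disambiguating associations retrieve different senses of the same word, which is the point of including the HebbianLink clauses in the template.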
So, suppose that CogPrime 1 has fished the appropriate Atoms for "crazy" and "Russian"
from PV. Then it may represent in its Atomspace something we may denote crudely (a better
notation will be introduced later) as
InheritanceLink PV:477335:1256953732 PV:744444:1256953735 <.8,.6>
* A similar but perhaps more compelling example would be the interpretation of the phrase "the accountant
cooked the books." In this case both "cooked" and "books" are used in atypical senses, but specifying a
HebbianLink to "accounting" would cause the right Nodes to get retrieved from PV.
where e.g. "PV:744444" means "the Atom with Handle 744444 in CogPrime PV at time
1256953735," and may also wish to store additional information such as
PsyneseEvaluationLink <.9>
PV
Pred1
PV:744444:1256953735
meaning that Pred1(PV:744444:1256953735) holds true with truth value <.9> if all the
Atoms referred to within Pred1 are interpreted as existing in PV rather than CogPrime 1.
The InheritanceLink then means: "In the opinion of CogPrime 1, 'Russian' as defined by
PV:477335:1256953732 inherits from 'crazy' as defined by PV:744444:1256953735 with truth
value <.8,.6>."
Suppose CogPrime 1 then sends the InheritanceLink to CogPrime 2. It is going to be mean-
ingfully interpretable by CogPrime 2 to the extent that CogPrime 2 can interpret the relevant
PV Atoms, for instance by finding Atoms of its own that correspond to them. To interpret
these Atoms, CogPrime 2 must carry out the reverse process that CogPrime 1 did to find the
Atoms in the first place. For instance, to figure out what PV:744444:1256953735 means to it,
CogPrime 2 may find some of the important links associated with the Node in PV, and make
a predicate accordingly, e.g.:
Pred2(C1)
equals
ThereExists
    W1, C2, C3, W2, W3
    AND
        ConceptNode: C1
        ReferenceLink C1 W1
        WordNode: W1 #crazy
        ConceptNode: C2
        HebbianLink C1 C2
        ReferenceLink C2 W2
        WordNode: W2 #lunatic
        ConceptNode: C3
        HebbianLink C1 C3
        ReferenceLink C3 W3
        WordNode: W3 #unrealistic
On the other hand, if there is no PsyneseVocabulary involved, then CogPrime 1 can submit
the same query directly to CogPrime 2. There is no problem with this, but if there is a reasonably
large community of CogPrime s it becomes more efficient for them all to agree on a standard
vocabulary of Atoms to be used for communication - just as, at a certain point in human
history, it was recognized as more efficient for people to use dictionaries rather than to rely on
peer-to-peer methods for resolution of linguistic disagreements.
The above examples involve human natural language terms, but this does not have to be
the case. PsyneseVocabularies can contain Atoms representing quantitative or other types of
data, and can also contain purely abstract concepts. The basic idea is the same. A CogPrime
has some Atoms it wants to convey to another CogPrime, and it looks in a PsyneseVocabulary
to see how easily it can approximate these Atoms in terms of "socially understood" Atoms.
This is particularly effective if the CogPrime receiving the communication is familiar with the
PsyneseVocabulary in question. Then the recipient may already know the PsyneseVocabulary
Atoms it is being pointed to; it may have already thought about the difference between these
consensus concepts and its own related concepts. Also, if the sender CogPrime is encapsulating
maps for easy communication, it may specifically seek approximate encapsulations involving
PsyneseVocabulary terms, rather than first encapsulating in its own terms and then translating
into PsyneseVocabulary terms.
43.2.1 The Psynese Match Schema
One way to streamline the above operations is to introduce a Psynese Match Schema, with
property that
ExOut
PsyneseMatch PV A
within CogPrime instance CP1, denotes the Atom within CogPrime instance PV that most
closely matches the Atom A in CP1. Note that the PsyneseMatch schema implicitly relies on
various parameters, because it must encapsulate the kind of process described explicitly in the
above example. PsyneseMatch must, internally, decide how many and which Atoms related to
A should be used to formulate a query to PV, and also how to rank the responses to the query
(e.g. by strength x confidence).
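The ranking step inside such a schema is simple to sketch. In the Python toy below (names and the tuple format are assumptions for illustration), candidate Atoms returned by the remote query are scored by strength × confidence and the best is kept:

```python
# Illustrative sketch of the response-ranking step of a PsyneseMatch-like
# schema; the (atom_id, strength, confidence) tuples are an assumption.

def best_match(candidates):
    """Pick the candidate maximizing strength * confidence."""
    return max(candidates, key=lambda c: c[1] * c[2])[0]

responses = [("PV:744444", 0.8, 0.9),   # strong and well supported
             ("PV:744445", 0.95, 0.3),  # strong but poorly supported
             ("PV:744446", 0.5, 0.5)]
print(best_match(responses))  # -> PV:744444
```

Weighting by confidence as well as strength keeps a poorly-evidenced but high-strength candidate from winning the match.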
Using PsyneseMatch, the example written above as
Inheritance PV:477335:1256953732 PV:744444:1256953735 <.8,.6>
could be rewritten as
Inheritance c.8.,G>
Inheritance <.8,.6>
    ExOut
        PsyneseMatch PV C1
    ExOut
        PsyneseMatch PV C2
where C1 and C2 are the ConceptNodes in CP1 corresponding to the intended senses of
"crazy" and "Russian."
43.3 Psynese as a Language
The general definition of a psynese expression for CogPrime is a Set of Atoms that contains
only:
• Nodes from PsyneseVocabularies
• Perceptual nodes (numbers, words, etc.)
• Relationships relating no nodes other than the ones in the above two categories, and relating
no relationships except ones in this category
• Predicates or Schemata involving no relationships or nodes other than the ones in the above
three categories, or in this category
The PsyneseEvaluationLink type indicated earlier forces interpretation of a predicate as a Psy-
nese expression.
In what sense is the use of Psynese expressions to communicate a language? Clearly it is a
formal language in the mathematical sense. It is not quite a "human language" as we normally
conceive it, but it is ideally suited to serve the same functions for CogPrime s as human language
serves for humans. The biggest differences from human language are:
• Psynese uses weighted, typed hypergraphs (i.e. Atomspaces) instead of linear strings of
symbols. This eliminates the "parsing" aspect of language (syntax being mainly a way of
projecting graph structures into linear expressions).
• Psynese lacks subtle and ambiguous referential constructions like "this", "it" and so forth.
These are tools allowing complex thoughts to be compactly expressed in a linear way, but
CogPrime s don't need them. Atoms can be named and pointed to directly without complex,
poorly-specified mechanisms mediating the process.
• Psynese has far less ambiguity. There may be Atoms with more than one aspect to their
meanings, but the cost of clarifying such ambiguities is much lower for CogPrime s than for
humans using language, and so habitually there will not be the rampant ambiguity that we
see in human expressions.
On the other hand, mapping Psynese into Lojban - a syntactically formal, semantically
highly precise language created for communication between humans - would be much more
straightforward than mapping it into a true natural language. Indeed, one could create a PsyneseVocabulary
based on Lojban, which might be ideally suited to serve as an intermediary between different
CogPrime s. And Lojban may be used to create a linearized version of Psynese that looks more
like a natural language. We return to this point in Appendix ??.
43.4 Psynese Mindplexes
We now recall from Chapter 12 of Part 1 the notion of a mindplex: that is, an intelligent system
that:
1. Is composed of a collection of intelligent systems, each of which has its own "theater of
consciousness" and autonomous control system, but which interact tightly, exchanging large
quantities of information frequently
2. Has a powerful control system on the collective level, and an active "theater of consciousness"
on the collective level as well
In informal discussions, we have found that some people, on being introduced to the mindplex
concept, react by contending that either human minds or human social groups are mindplexes.
However, we believe that, while there are significant similarities between mindplexes and minds,
and between mindplexes and social groups, there are also major qualitative differences. It's
true that an individual human mind may be viewed as a collective, both from a theory-of-
cognition perspective (e.g. Minsky's "society of mind" theory [MM88]) and from a personality-
psychology perspective (e.g. the theory of subpersonalities [Row90]). And it's true that social
groups display some autonomous control and some emergent-level awareness. However, in a
healthy human mind, the collective level rather than the cognitive-agent or subpersonality level
is dominant, the latter existing in service of the former; and in a human social group, the
individual-human level is dominant, the group-mind clearly "cognizing" much more crudely
than its individual-human components, and exerting most of its intelligence via its impact on
individual human minds. A mindplex is a hypothetical intelligent system in which neither level
is dominant, and both levels are extremely powerful. A mindplex is like a human mind in
which the subpersonalities are fully-developed human personalities, with full independence of
thought, and yet the combination of subpersonalities is also an effective personality. Or, from
the other direction, a mindplex is like a human society that has become so integrated and so
cohesive that it displays the kind of consciousness and self-control that we normally associate
with individuals.
There are two mechanisms via which mindplexes may possibly arise in the medium-term
future:
1. Humans becoming more tightly coupled via the advance of communication technologies, and
a communication-centric AI system coming to embody the "emergent conscious theater" of
a human-incorporating mindplex
2. A society of AI systems communicating amongst each other with a richness not possible for
human beings, and coming to form a mindplex rather than merely a society of distinct AIs
The former sort of mindplex relates to the concept of a "global brain" discussed in Chapter
12 of Part 1. Of course, these two sorts of mindplexes are not mutually contradictory, and may
coexist or fuse. The possibility also exists for higher-order mindplexes, meaning mindplexes
whose component minds are themselves mindplexes. This would occur, for example, if one
had a mindplex composed of a family of closely-interacting AI systems, which acted within a
mindplex associated with the global communication network.
Psynese, however, is more directly relevant to the latter form of mindplex. It gives a concrete
mechanism via which such a mindplex might be sculpted.
43.4.1 AGI Mindplexes
How does one get from CogPrimes communicating via Psynese to CogPrime mindplexes?
Clearly, with the Psynese mode of communication, the potential is there for much richer
communication than exists between humans. There are limitations, posed by the private nature
of many concepts - but these limitations are much less onerous than for human language, and
can be overcome to some extent by the learning of complex cognitive schemata for translation
between the "private languages" of individual Atomspaces and the "public languages" of Psynese
servers.
But rich communication does not in itself imply the evolution of mindplexes. It is possible
that a community of Psynese-communicating CogPrimes might spontaneously evolve a mind-
plex structure - at this point, we don't know enough about CogPrime individual or collective
dynamics to say. But it is not necessary to rely on spontaneous evolution. In fact it is possible,
and even architecturally simple, to design a community of CogPrimes in such a way as to
encourage and almost force the emergence of a mindplex structure.
The solution is straightforward: beef up the PsyneseVocabulary servers. Rather than leaving
them as relatively passive receptacles of knowledge from the CogPrimes they serve, allow them
to be active, creative entities, with their own feelings, goals and motivations.
The PsyneseVocabulary servers serving a community of CogPrimes are absolutely critical
to these CogPrimes. Without them, high-level inter-CogPrime communication is effectively
impossible. And without the concepts the PsyneseVocabularies supply, high-level individual
CogPrime thought will be difficult, because CogPrimes will come to think in Psynese to at
least the same extent to which humans think in language.
Suppose each PsyneseVocabulary server has its own full CogPrime mind, its own "conscious
theater". These minds are in a sense "emergent minds" of the CogPrime community they serve
- because their contents are a kind of "nonlinear weighted average" of the mind-contents of the
community. Furthermore, the actions these minds take will feed back and affect the community
in direct and indirect ways - by affecting the language by which the minds communicate. Clearly,
the definition of a mindplex is fulfilled.
But what will the dynamics of such a CogPrime mindplex be like? What will be the properties
of its cognitive and personality psychology? We could speculate on this here, but would have
very little faith in the possible accuracy of our speculations. The psychology of mindplexes will
reveal itself to us experimentally as our work on AGI engineering, education and socialization
proceeds.
One major issue that arises, however, is that of personality filtering. Put simply: each intelli-
gent agent in a mindplex must somehow decide for itself which knowledge to grab from available
PsyneseVocabulary servers and other minds, and which knowledge to avoid grabbing from oth-
ers in the name of individuality. Different minds may make different choices in this regard. For
instance, one choice could be to, as a matter of routine, take only extremely confident knowledge
from the PsyneseVocabulary server. This corresponds roughly to ingesting "facts" from the col-
lective knowledge pool, but not opinions or speculations. Less confident knowledge would then
be ingested from the collective knowledge pool on a carefully calculated and as-needed basis.
Another choice could be to accept only small networks of Atoms from the collective knowledge
pool, on the principle that these can be reflectively understood as they are ingested, whereas
large networks of Atoms are difficult to deliberate and reflect about. But any policies like this
are merely heuristic ones.
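Such a personality-filtering policy can be sketched in a few lines. The atom fields, function names and thresholds below are illustrative assumptions, not part of the CogPrime specification:

```python
# Hypothetical sketch of a "personality filter": an agent decides
# which knowledge to ingest from a shared PsyneseVocabulary server.
# Field names and thresholds are illustrative assumptions.

def accept_atom(atom, confidence_floor=0.9, max_network_size=5):
    """Ingest only high-confidence knowledge ("facts" rather than
    opinions or speculations), and only networks of Atoms small
    enough to be reflectively understood as they arrive."""
    high_confidence = atom["confidence"] >= confidence_floor
    small_network = atom["network_size"] <= max_network_size
    return high_confidence and small_network

pool = [
    {"name": "fact",        "confidence": 0.95, "network_size": 3},
    {"name": "speculation", "confidence": 0.40, "network_size": 2},
    {"name": "big_theory",  "confidence": 0.95, "network_size": 400},
]

ingested = [a["name"] for a in pool if accept_atom(a)]
```

Different agents would tune such thresholds differently, which is exactly how individuality of mind-content would be preserved within the mindplex.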
43.5 Psynese and Natural Language Processing
Next we review a more near-term, practical application of the Psynese mechanism: the fusion of
two different approaches to natural language processing in CogPrime, the experiential learning
approach and the "engineered NLP subsystem" approach.
In the former approach, language is not given any extremely special role, and CogPrime is
expected to learn language much as it would learn any other complex sort of knowledge. There
may of course be learning biases programmed into the system, to enable it to learn language
based on its experience more rapidly. But there is no concrete linguistic knowledge programmed
in.
In the latter approach, one may use knowledge from statistical corpus analysis, one may use
electronic resources like WordNet and FrameNet, and one may use sophisticated, specialized
tools like natural language parsers with hand-coded grammars. Rather than trying to emulate
the way a human child learns language, one is trying to emulate the way a human adult
comprehends and generates language.
Of course, there is not really a rigid dichotomy between these two approaches. Many linguistic
theorists who focus on experiential learning also believe in some form of universal grammar,
and would advocate for an approach where learning is foundational but is biased by in-built
abstract structures representing universal grammar. There is of course very little knowledge
(and few detailed hypotheses) about how universal grammar might be encoded in the human
brain, though there is reason to think it may be at a very abstract level, due to the significant
overlaps between grammatical structure, social role structure [CE300], and physical reasoning
[Cam].
The engineered approach to NLP provides better functionality right "out of the box," and
enables the exploitation of the vast knowledge accumulated by computational linguists in the
past decades. However, we suspect that computational linguistics may have hit a ceiling in some
regards, in terms of the quality of the language comprehension and generation that it can deliver.
It runs up against problems related to the disambiguation of complex syntactic constructs, which
don't seem to be resolvable using either a tractable number of hand-coded rules, or supervised
or unsupervised learning based on a tractably large set of examples. This conclusion may be
disputed, and some researchers believe that statistical computational linguistics can eventually
provide human-level functionality, once the Web becomes a bit larger and the computers used to
analyze it become a bit more powerful. But in our view it is interesting to explore hybridization
between the engineered and experiential approaches, with the motivation that the experiential
approach may provide a level of flexibility and insight at dealing with ambiguity that the
engineered approach apparently lacks.
After all, the way a human child deals with the tricky disambiguation problems that stump
current computational linguistics systems is not via analysis of trillion-word corpuses, but rather
via correlating language with non-linguistic experience. One may argue that the genome implic-
itly contains a massive corpus of speech, but it's also to be noted that this is experientially
contextualized speech. And it seems clear from the psycholinguistic evidence [Tom03] that for
young human children, language is part and parcel of social and physical experience, learned in
a manner that's intricately tied up with the learning of many other sorts of skills.
One interesting approach to this sort of hybridization, using Psynese, is to create multiple
CogPrime instances taking different approaches to language learning, and let them communi-
cate. Most simply one may create:
• A CogPrime instance that learns language mainly based on experience, with perhaps some
basic in-built structure and some judicious biasing to its learning (let's call this CPexp)
• A CogPrime instance using an engineered NLP system (let's call this CPeng)
In this case, CPexp can use CPeng as a cheap way to test its ideas. For instance, suppose
CPexp thinks it has correctly interpreted a certain sentence S into Atom-set A. Then it can
send its interpretation A to CPeng and see whether CPeng thinks A is a good interpretation of
S, by consulting CPeng's truth value for
ReferenceLink
    ExOut
        PsyneseMatch CPeng S
    ExOut
        PsyneseMatch CPeng A
Similarly, if CPexp believes it has found a good way (S) to linguistically express a collection
of Atoms A, it can check whether these two match reasonably well in CPeng.
Of course, this approach could be abused in an inefficient and foolish way, for instance if
CPexp did nothing but randomly generate sentences and then test them against CPeng. In this
case we would have a much less efficient approach than simply using CPeng directly. However,
effectively making use of CPeng as a resource requires a different strategy: throwing CPeng only
a relatively small selection of things that seem to make sense, and using CPeng as a filter to
avoid trying out rough-draft guesses in actual human conversation.
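This filtering strategy can be sketched as follows. The cpeng_match function is a hypothetical stand-in for CPeng's actual PsyneseMatch-based judgment, and all names here are illustrative:

```python
# Illustrative sketch of CPexp using CPeng as a filter before
# risking a rough-draft guess in live human conversation.

def cpeng_match(sentence, atom_set):
    """Stand-in for CPeng's score (0..1) of how well atom_set
    interprets sentence.  A real implementation would consult the
    truth value of the PsyneseMatch-based ReferenceLink."""
    return 1.0 if atom_set == {"chase": ("cat", "snake")} else 0.1

def filter_guesses(guesses, threshold=0.5):
    """CPexp proposes only interpretations that already seem to make
    sense, and keeps those CPeng agrees with."""
    return [(s, a) for (s, a) in guesses
            if cpeng_match(s, a) >= threshold]

guesses = [
    ("The cat chased a snake", {"chase": ("cat", "snake")}),
    ("The cat chased a snake", {"snake": ("chased",)}),  # bad draft
]
survivors = filter_guesses(guesses)
```

The point of the strategy is the small, pre-screened candidate set: exhaustively generating and testing guesses would be slower than just using CPeng directly.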
This hybrid approach, we suggest, may provide a way of getting the best of both worlds: the
flexibility of an experiential-learning-based language approach, together with the exploitation
of existing linguistic tools and resources. With this in mind, in the following chapters we will
describe both engineering and experiential-learning based approaches to NLP.
43.5.1 Collective Language Learning
Finally we bring the language-learning and mindplex themes together, in the notion of collective
language learning. One of the most interesting uses for a mindplex architecture is to allow
multiple CogPrime agents to share the linguistic knowledge they gain. One may envision a
PsyneseVocabulary server into which a population of CogPrime agents input their linguistic
knowledge specifically, and which these agents then consult when they wish to comprehend or
express something in language and their individual NLP systems are not up to the task.
This could be a very powerful approach to language learning, because it would allow a
potentially very large number of AI systems to effectively act as a single language learning
system. It is an especially appealing approach in the context of CogPrime systems used to
control animated agents in online virtual worlds or multiplayer games. The amount of linguistic
experience undergone by, say, 100,000 virtually embodied CogPrime agents communicating with
human virtual world avatars and game players, would be far more than any single human child
or any single agent could undergo. Thus, to the extent that language learning can be accelerated
by additional experience, this approach could enable language to be learned quite rapidly.
Chapter 44
Natural Language Comprehension
Co-authored with Michael Ross and Linas Vepstas and Ruiting Lian
44.1 Introduction
Two key approaches to endowing AGI systems with linguistic facility exist, as noted above:
• "Experiential" - shorthand here for "gaining most of its linguistic knowledge from interactive
experience, in such a way that language learning is not easily separable from generic learning
how to survive and flourish"
• "Engineered" - shorthand here for "gaining most of its linguistic knowledge from sources
other than the system's own experience in the world" (including learning language from
resources like corpora)
This dichotomy is somewhat fuzzy, since getting experiential language learning to work well
may involve some specialized engineering, and engineered NLP systems may also involve some
learning from experience. However, in spite of the fuzziness, the dichotomy is still real and
important; there are concrete choices to be made in designing an NLP system and this dichotomy
compactly symbolizes some of them. Much of this chapter and the next few will be focused on
the engineering approach, but we will also devote some space to discussing the experience-based
approach. Our overall perspective on the dichotomy is that
• the engineering-based approach, on its own, is unlikely to take us to human-level NLP ... but
this isn't wholly impossible, if the engineering is done in a manner that integrates linguistic
functionality richly with other kinds of experiential learning
• using a combination of experience-based and engineering-based approaches may be the most
practical option
• the engineering approach is useful for guiding the experiential approach, because it tells us
a lot about what kinds of general structures and dynamics may be adequate for intelligent
language processing. To simplify a bit, one can prepare an AGI system for experiential
learning by supplying it with structures and dynamics capable of supporting the key com-
ponents of an engineered NLP system - and biased toward learning things similar to known
engineered NLP systems - but requiring all, or the bulk of, the actual linguistic content
to be learned via experience. This approach may be preferable to requiring a system to
learn language based on more abstract structures and dynamics, and may indeed be more
comparable to what human brains do, given the large amount of linguistic biasing that is
probably built into the human genome.
Further distinctions, overlapping with this one, may also be useful. One may distinguish (at
least) five modes of instructing NLP systems, the first three of which are valid only for engineered
NLP systems, but the latter two of which are valid both for engineered and experiential NLP
systems:
• hand-coded rules
• supervised learning on hand-tagged corpuses, or via other mechanisms of explicit human
training
• unsupervised learning from static bodies of data
• unsupervised learning via interactive experience
• supervised learning via interactive experience
Note that, in principle, any of these modes may be used in a pure-language or a socially/phys-
ically embodied language context. Of course, there is also semi-supervised learning which may
be used in place of supervised learning in the above list [CSZ06].
Another key dichotomy related to linguistic facility is language comprehension versus lan-
guage generation (each of which is typically divided into a number of different subprocesses).
In language comprehension, we have processes like stemming, part-of-speech tagging, grammar-
based parsing, semantic analysis, reference resolution and discourse analysis. In language gener-
ation, we have semantic analysis, syntactic sentence generation, pragmatic discourse generation,
reference-insertion, and so forth. In this chapter and the next two we will briefly review all these
different topics and explain how they may be embodied in CogPrime. Then, in Chapter ?? we
present a complementary approach to linguistic interaction with AGI systems based on the
invented language Lojban: and in Chapter 48 we discuss the use of CogPrime cognition to
regulate the dialogue process.
A typical, engineered computational NLP system involves hand-coded algorithms carrying
out each of the specific tasks mentioned in the previous paragraph, sometimes with parameters,
rules or number tables that are tuned or learned statistically based on corpuses of data. In
fact, most NLP systems handle only understanding or only generation; systems that cover both
aspects in a unified way are quite rare. The human mind, on the other hand, carries out these
tasks in a much more interconnected way - using separate procedures for the separate tasks, to
some extent, but allowing each of these procedures to be deeply informed by the information
generated by the other procedures. This interconnectedness is what allows the human mind
to really understand language - specifically because human language syntax is complex and
ambiguous enough that the only way to master it is to infuse one's syntactic analyses with
semantic (and to a lesser extent pragmatic) knowledge. In our treatment of NLP we will pay
attention to connections between linguistic functionalities, as well as to linguistic functionalities
in isolation.
It's worth emphasizing that what we mean by an "experience-based" language system is quite
different from corpus-based language systems as are commonplace in computational linguistics
today [MS99] (and from the corpus-based learning algorithm to be discussed in Chapter ??
below). In fact, we feel the distinction between corpus-based and rule-based language processing
systems is often overblown. Whether one hand-codes a set of rules, or carefully marks up a
corpus so that rules can be induced from it, doesn't ultimately make that much difference. For
instance, OpenCogPrime's RelEx system (to be described below) uses hand-coded rules to do
much the same thing that the Stanford parser does using rules induced from a tagged corpus.
But both systems do roughly the same thing. RelEx is currently faster due to using fewer rules,
and it handles some complex cases like comparatives better (presumably because they were not
well covered in the Stanford parser's training corpus); but the Stanford parser may be preferable
in other respects, for instance it's more easily generalizable to languages beyond English (for
a language with structure fairly similar to English, one just has to supply a new marked-up
training corpus; whereas porting RelEx rules to other languages requires more effort).
An unsupervised corpus-based learning system like the one to be described in Chapter ?? is
a little more distinct from rule-based systems, in that it is based on inducing patterns from
natural rather than specially prepared data. But still, it is learning language as a phenomenon
unto itself, rather than learning language as part and parcel of a system's overall experience in
the world.
The key distinction to be made, in our view, is between language systems that learn language
in a social and physical context, versus those that deal with language in isolation. Dealing with
language in context immediately changes the way the linguistics problem appears (to the AI
system, and also to the researcher), and makes hand-coded rules and hand-tagged corpuses less
viable, shifting attention toward experiential learning based approaches.
Ultimately we believe that the "right" way to teach an AGI system language is via semi-
supervised learning in a socially and physically embodied context. That is: talk to the system,
and have it learn both from your reinforcement signals and from unsupervised analysis of the
dialogue. However, we believe that other modes of teaching NLP systems can also contribute,
especially if used in support of a system that also does semi-supervised learning based on
embodied interactive dialogue.
Finally, a note on one aspect of language comprehension that we don't deal with here. We deal
only with text processing, not speech understanding or generation. A CogPrime approach to
speech would be quite feasible to develop, for instance using neural-symbolic hybridization with
DeSTIN or a similar perceptual-motor hierarchy. However, this potential aspect of CogPrime
has not been pursued in detail yet, and we won't devote space to it here.
44.2 Linguistic Atom Types
Explicit representation of linguistic knowledge in terms of Atoms is not a deep issue, more of a
"plumbing" type of issue, but it must be dealt with before moving on to subtler aspects.
In principle, for dealing with linguistic information coming in through ASCII, all we need
besides the generic CogPrime structures and dynamics are two node types and one relationship
type:
• CharacterNode
• CharacterInstanceNode
• a unary relationship concat denoting an externally-observed list of items
Sequences of characters may then be represented in terms of lists and the concat schema. For
instance the word "pig" is represented by the list concat(#p, #i, #g).
The concat operator can be used to help define special NL atom types, such as:
• MorphemeNode / MorphemeInstanceNode
• WordNode / WordInstanceNode
• PhraseNode / PhraseInstanceNode
• SentenceNode / SentenceInstanceNode
• UtteranceNode / UtteranceInstanceNode
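The character-sequence representation just described can be sketched as follows. The concat helper and the dictionary-based "nodes" are illustrative simplifications, not the actual OpenCog data structures:

```python
# Minimal sketch of words as lists built by a "concat" schema over
# CharacterNodes.  Node representations here are simplified
# stand-ins for real Atom types.

def concat(*items):
    """Unary list-forming schema: an externally observed sequence
    of items (here, characters)."""
    return tuple(items)

# CharacterNodes for 'p', 'i', 'g' (written #p, #i, #g in the text)
p, i, g = "#p", "#i", "#g"

# The word "pig" as a list of CharacterNodes
pig_word = concat(p, i, g)

# Higher-level node types (WordNode, PhraseNode, ...) can then be
# defined over such sequences:
word_node = {"type": "WordNode", "name": "pig", "characters": pig_word}
```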
44.3 The Comprehension and Generation Pipelines
Exactly how the "comprehension pipeline" is broken down into component transformations
depends on one's linguistic theory of choice. The approach taken in OpenCogPrime's engineered
NLP framework, in use from 2008-2012, looked like:
Text --> Tokenizer --> Link Parser -->
Syntactico-Semantic Relationship Extractor (RelEx) -->
Semantic Relationship Extractor (RelEx2Frame) -->
Semantic Nodes & Links
In 2012-13, a new approach has been undertaken, which simplifies things a little and looks
like
Text --> Tokenizer --> Link Parser -->
Syntactico-Semantic Relationship Extractor (Syn2Sem) -->
Semantic Nodes & Links
Note that many other variants of the NL pipeline include a "tagging" stage, which assigns part
of speech tags to words based on the words occurring around them. In our current approach,
tagging is essentially subsumed within parsing; the choice of a POS (part-of-speech) tag for
a word instance is carried out within the link parser. However, it may still be valuable to
derive information about likely POS tags for word instances from other techniques, and use
this information within a link parsing framework by allowing it to bias the probabilities used
in the parsing process.
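As a toy illustration, the 2012-13 pipeline's shape can be sketched as composed stages. Every function body below is a hypothetical stand-in; only the stage ordering (tokenizer, link parser, Syn2Sem) comes from the text:

```python
# Toy sketch of the comprehension pipeline as composed stages.
# Each body is a trivial placeholder, not real parser logic.

def tokenize(text):
    return text.lower().rstrip(".").split()

def link_parse(tokens):
    # Stand-in: pretend the parser returned labeled links.  A real
    # link parser chooses links by matching typed connectors.
    return [("S", tokens[1], tokens[2]), ("O", tokens[2], tokens[4])]

def syn2sem(links):
    # Stand-in: map syntactic link types to semantic relations.
    rel = {"S": "agent", "O": "patient"}
    return [(rel[t], a, b) for (t, a, b) in links]

semantics = syn2sem(link_parse(tokenize("The cat chased a snake.")))
```

The composition makes the architectural point explicit: each stage consumes the previous stage's output, and semantic Nodes and Links fall out at the end.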
None of the processes in this pipeline are terribly difficult to carry out, if one is willing to
use hand-coded rules within each step, or derive rules via supervised learning, to govern their
operation. The truly tricky aspects of NL comprehension are:
• arriving at the rules used by the various subprocesses, in a way that naturally supports
generalization and modification of the rules based on ongoing experience
• allowing semantic understanding to bias the choice of rules in particular contexts
• knowing when to break the rules and be guided by semantic intuition instead
Importing rules straight from linguistic databases results in a system that (like the current
RelEx system) is reasonably linguistically savvy on the surface, but lacks the ability to adapt its
knowledge effectively based on experience, and has trouble comprehending complex language.
Supervised learning based on hand-created corpuses tends to result in rule-bases with similar
problems. This doesn't necessarily mean that hand-coding or supervised learning of linguistic
rules has no place in an AGI system, but it means that if one uses these methods, one must
take extra care to make one's rules modifiable and generalizable based on ongoing experience,
because the initial version of one's rules is not going to be good enough.
Generation is the subject of the following chapter, but for comparison we give here a high-
level overview of the generation pipeline, which may be conceived as:
1. Content determination: figuring out what needs to be said in a given context
2. Discourse planning: overall organization of the information to be communicated
3. Lexicalization: assigning words to concepts
4. Reference generation: linking words in the generated sentences using pronouns and other
kinds of reference
5. Syntactic and morphological realization: the generation of sentences via a process inverse to
parsing, representing the information gathered in the above phases
6. Phonological or orthographic realization: turning the above into spoken or written words,
complete with timing (in the spoken case), punctuation (in the written case), etc.
In Chapter 46 we explain how this pipeline is realized in OpenCogPrime's current engineered
NL generation system.
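As a toy sketch, the six generation stages can be chained as functions. Only the stage names come from the list above; every body is an illustrative placeholder:

```python
# Toy sketch of the six-stage generation pipeline.  Each stage is a
# hypothetical stand-in keyed to one numbered step in the text.

def content_determination(goal):       # 1. what needs to be said
    return {"rel": "chase", "agent": "cat", "patient": "snake"}

def discourse_planning(content):       # 2. organize the information
    return [content]                   # single-sentence "discourse"

def lexicalize(plan):                  # 3. assign words to concepts
    c = plan[0]
    return ["the", c["agent"], c["rel"] + "d", "a", c["patient"]]

def insert_references(words):          # 4. pronouns etc. (no-op here)
    return words

def realize(words):                    # 5. syntactic/morphological
    return " ".join(words)

def orthographic(sentence):            # 6. punctuation and casing
    return sentence.capitalize() + "."

text = orthographic(realize(insert_references(
    lexicalize(discourse_planning(content_determination(None))))))
```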
44.4 Parsing with Link Grammar
Now we proceed to explain some of the details of OpenCogPrime's engineered NL comprehension
system. This section gives an overview of link grammar, a key part of the current OpenCog
NLP framework, and explains what makes it different from other linguistic formalisms.
We emphasize that this particular grammatical formalism is not, in itself, a critical part
of the CogPrime design. In fact, it should be quite possible to create and teach a CogPrime
AGI system without using any particular grammatical formalism - having it acquire linguistic
knowledge in a purely experiential way. However, a great deal of insight into CogPrime -based
language processing may be obtained by considering the relevant issues in the concrete detail
that the assumption of a specific grammatical formalism provides. This insight is of course
useful if one is building a CogPrime that makes use of that particular grammatical formalism,
but it's also useful to some degree even if one is building a CogPrime that deals with human
language entirely experientially.
This material will be more comprehensible to the reader who has some familiarity with
computational linguistics, e.g. with notions such as parts of speech, feature structures, lexicons,
dependency grammars, and so forth. Excellent references are [MS99, Jac03]. We will try to keep
the discussion relatively elementary, but have opted not to insert a computational linguistics
tutorial.
The essential idea of link grammar is that each word comes with a feature structure consisting
of a set of typed connectors. Parsing consists of matching up connectors from one word with
connectors from another.
To understand this in detail, the best course is to consider an example sentence. We will use
the following example, drawn from the classic paper "Parsing with a Link Grammar" by Sleator
and Temperley [ST93]:
The cat chased a snake
The link grammar parse structure for this sentence is:
         +---O----+
+--D-+-S-+      +D+
|    |   |      | |
the  cat chased a snake
In phrase structure grammar terms, this corresponds loosely to
(S (NP The cat)
(VP chased
(NP a snake)))
but the OpenCog linguistic pipeline makes scant use of this kind of phrase structure rendition
(which is fine in this simple example; but in the case of complex sentences, construction of
analogous mappings from link parse structures to phrase structure grammar parse trees can
be complex and problematic). Currently the hierarchical view is used in OpenCog only within
some reference resolution heuristics.
There is a database called the "link grammar dictionary" which contains connectors associ-
ated with all common English words. The notation used to describe feature structures in this
dictionary is quite simple. Different kinds of connectors are denoted by letters or pairs of letters
like S or SX. Then if a word W1 has the connector S+, this means that the word can have an S
link coming out to the right side. If a word W2 has the connector S-, this means that the word
can have an S link coming out to the left side. In this case, if W1 occurs to the left of W2 in a
sentence, then the two words can be joined together with an S link.
The features of the words in our example sentence, as given in the S&T paper, are:
Words        Formula
a, the       D+
snake, cat   D- & (O- or S+)
chased       S- & O+
To illustrate the role of syntactic sense disambiguation, we will use alternate formulas
for one of the words in the example: the verb sense of "snake." We then have:
Words                 Formula
a, the                D+
snake_N, cat, ran_N   D- & (O- or S+)
chased                S- & O+
snake_V               S-
The variables to be used in parsing this sentence are, for each word:
1. the features in the Agreement structure of the word (for any of its senses)
2. the words matching each of the connectors of the word
For example,
1. For "snake," there are features for "word that links to D-", "word that links to O-" and "word
that links to S+". There are also features for "tense" and "person".
2. For "the", the only feature is "word that links to D+". No features for Agreement are needed.
The nature of linkage imposes constraints on the variable assignments; for instance, if "the"
is assigned as the value of the "word that links to D-" feature of "snake", then "snake" must be
assigned as the value of the "word that links to D+" feature of "the."
The rules of link grammar impose additional constraints — i.e. the planarity, connectivity,
ordering and exclusion metarules described in Sleator and Temperley's papers. Planarity means
that links don't cross - a rule that S&T's parser enforces with absoluteness, whereas we have
found it is probably better to impose it as a probabilistic constraint, since sometimes it's
really nice to let links cross (the representation of conjunctions is one example). Connectivity
means that the links and words of a sentence mast form a connected graph - all the words
must be linked into the other words in the sentence via some path. Again connectivity is a
valuable constraint but in some cases one wants to relax it - if one just can't understand
the whole sentence, one may wish to understand at least some parts of it, meaning that one
has a disconnected graph whose components are the phrases of the sentence that have been
EFTA00624531
4,1.4 Parsing with Link Grammar 385
successfully comprehended. Finally, linguistic transformations may potentially be applied while
checking if these constraints are fulfilled (that is, instead of just checking if the constraints are
fulfilled, one may check if the constraints are fulfilled after one or more transformations are
performed.)
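The planarity and connectivity metarules are easy to state operationally. Here is a minimal Python sketch (our own illustration, not the actual link parser code) that checks both for a candidate linkage, with links represented as pairs of word indices:

```python
# A minimal sketch of the planarity and connectivity metarules,
# with links given as pairs of word indices (a, b), a < b.

def is_planar(links):
    """True if no two links cross when drawn above the sentence."""
    for a, b in links:
        for c, d in links:
            # (a,b) and (c,d) cross iff exactly one endpoint of one link
            # lies strictly inside the span of the other.
            if a < c < b < d or c < a < d < b:
                return False
    return True

def is_connected(links, num_words):
    """True if the links join all words into a single connected graph."""
    adj = {i: set() for i in range(num_words)}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for nbr in adj[stack.pop()]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return len(seen) == num_words

# "the snake chased the cat": D, S, D and O links over word indices 0..4
links = [(0, 1), (1, 2), (3, 4), (2, 4)]
print(is_planar(links), is_connected(links, 5))  # True True
```

A probabilistic version of planarity, as suggested above, would score the number of crossings rather than rejecting a crossing linkage outright.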
We will use the term "Agreement" to refer to "person" values or ordered pairs (tense, person),
and NAGR to refer to the number of agreement values (12-40, perhaps, in most realistic linguis-
tic theories). Agreement may be dealt with alongside the connector constraints. For instance,
"chased" has the Agreement values (past, third person), and it has the constraint that its S-
argument must match the person component of its Agreement structure.
Semantic restrictions may be imposed in the same framework. For instance, it may be known
that the subject of "chased" is generally animate. In that case, we'd say
Words          Formula
a, the         D+
snake_N, cat   D- & (O- or S+)
chased         (S-, Inheritance animate <.8>) & O+
snake_V        S-
where we've added the modifier (Inheritance animate) to the S- connector of the verb "chased,"
to indicate that with strength .8, the word connecting to this S- connector should denote
something inheriting from "animate." In this example, "snake" and "cat" inherit from "animate",
so the probabilistic restriction doesn't help the parser any. If the sentence were instead
The snake in the hat chased the cat
then the "animate" constraint would tell the parsing process not to start out by trying to connect
"hat" to "chased", because the connection is semantically unlikely.
44.4.1 Link Grammar vs. Phrase Structure Grammar
Before proceeding further, it's worth making a couple observations about the relationship be-
tween link grammars and typical phrase structure grammars. These could also be formulated as
observations about the relationship between dependency grammars and phrase structure gram-
mars, but that gets a little more complicated as there are many kinds of dependency grammars
with different properties; for simplicity we will restrict our discussion here to the link grammar
that we actually use in OpenCog. Two useful observations may be:
1. Link grammar formulas correspond to grammatical categories. For example, the link struc-
ture for "chased" is "S- & O+." In categorial grammar terms, this would mean that "
'chased' belongs to the category of words with link structure 'S- & O+'." In other words,
each "formula" in link grammar corresponds to a category of words attached to that formula.
2. Links to words might as well be interpreted as links to phrases headed by those words.
For example, in the sentence "the cat chased a snake", there's an O-link from "chased" to
"snake." This might as well be interpreted as "there's an O-link from the phrase headed
by `chased' to the phrase headed by `snake'." Link grammar simplifies things by implicitly
identifying each phrase by its head.
Based on these observations, one could look at phrase structure as implicit in a link parse; and
this does make sense, but also leads to some linguistic complexities that we won't enter into
here.
Fig. 44.1: Dependency and Phrase-Structure Parses
[Figure omitted: parse trees of "the man that came eats bananas with a fork".]
A comparison of dependency (above) and phrase-structure (below) parses. In general, one can be converted to
the other (algorithmically); dependency grammars tend to be easier to understand.
(Image taken from G. Schneider, "Learning to Disambiguate Syntactic Relations", Linguistik online 17, 5/03)
44.5 The RelEx Framework for Natural Language Comprehension
Now we move forward in the pipeline from syntax toward semantics. The NL comprehension
framework provided with OpenCog at its inception in 2008 is RelEx, an English-language se-
mantic relationship extractor, which consists of two main components: the dependency extractor
and the relationship extractor. It can identify subject, object, indirect object and many other
dependency relationships between words in a sentence; it generates dependency trees, resem-
bling those of dependency grammars. In 2012 we are in the process of replacing RelEx with a
different approach that we believe will be more amenable to generalization based on experience.
Here we will describe both approaches.
The overall processing scheme of RelEx is shown in Figure 44.2.
The dependency extractor component carries out dependency grammar parsing via a cus-
tomized version of the open-source Sleator and Temperley's link parser, as reviewed above. The
link parser outputs several parses, and the dependencies of the best one are taken. The rela-
tionship extractor component is composed of a number of template matching algorithms that
act upon the link parser's output to produce a semantic interpretation of the parse. It contains
three steps:
Fig. 44.2: An Overview of the RelEx Architecture for Language Comprehension
1. Convert the Link Parser output to a feature structure representation
2. Execute the Sentence Algorithm Applier, which contains a series of Sentence Algorithms,
to modify the feature structure.
3. Extract the final output representation by traversing the feature structure.
A feature structure, in the RelEx context, is a directed graph in which each node contains
either a value, or an unordered list of features. A feature is just a labeled link to another node.
The Sentence Algorithm Applier loads a list of SentenceAlgorithms from the algorithm definition
file, and the SentenceAlgorithms are executed in the order they are listed in the file. RelEx
iterates through every single feature node in the feature structure, and attempts to apply the
algorithm to each node. Then the modified feature structures are used to generate the final
RelEx semantic relationships.
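As a rough illustration of this processing loop (not RelEx's actual data structures; the node layout and the mark_nouns "sentence algorithm" below are invented for the example), a feature structure can be sketched as nested dictionaries, with each algorithm applied in order to every reachable node:

```python
# Sketch of RelEx-style feature structures: each node is a dict of named
# features, where a feature is a labeled link to a value or another node.

def iter_nodes(node, seen=None):
    """Yield every feature node reachable from the root, each exactly once."""
    seen = set() if seen is None else seen
    if id(node) in seen or not isinstance(node, dict):
        return
    seen.add(id(node))
    yield node
    for child in node.values():
        yield from iter_nodes(child, seen)

def apply_algorithms(root, algorithms):
    """Apply each sentence algorithm, in listed order, to every feature node."""
    for algo in algorithms:
        for node in list(iter_nodes(root)):
            algo(node)

def mark_nouns(node):
    # Illustrative algorithm: annotate nodes whose "pos" feature is "noun".
    if node.get("pos") == "noun":
        node["entity"] = "yes"

root = {"word": "cat", "pos": "noun",
        "obj-of": {"word": "chased", "pos": "verb"}}
apply_algorithms(root, [mark_nouns])
print(root["entity"])  # yes
```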
44.5.1 RelEx2Frame: Mapping Syntactico-Semantic Relationships
into FrameNet Based Logical Relationships
Next in the current OpenCog NL comprehension pipeline, the RelEx2Frame component uses
hand-coded rules to map RelEx output into sets of relationships utilizing FrameNet and other
similar semantic resources. This is definitely viewed as a "stopgap" without a role in a human-
level AGI system, but it's described here because it's part of the current OpenCog system and
is now being used together with other OpenCog components in practical projects, including
some with proto-AGI intentions.
The syntax currently used for describing semantic relationships drawn from FrameNet and
other sources is exemplified by:
^1_Benefit:Benefitor(give,$var1)
The ^n prefix indicates the data source; here 1 indicates that the resource is FrameNet.
The "give" indicates the word in the original sentence from which the relationship is drawn,
i.e. the word that embodies the given semantic relationship. So far the resources we've utilized are:
1. FrameNet
2. Custom relationship names
but using other resources in future is quite possible.
An example using a custom relationship would be:
^2_inheritance($var1,$var2)
which defines an inheritance relationship: something that is part of CogPrime's ontology but
not part of FrameNet.
The "Benefit" part of the first example indicates the frame, and the "Benefitor" indicates
the frame element. This distinction (frame vs. frame element) is particular to
FrameNet; other knowledge resources might use a different sort of identifier. In general, whatever
lies between the underscore and the initial parenthesis should be considered as particular to the
knowledge resource in question, and may have different format and semantics depending on the
knowledge resource (but shouldn't contain parentheses or underscores unless those are preceded
by an escape character).
As an example, consider:
Put the ball on the table
Here the RelEx output is:
imperative(put) [1]
_obj(put, ball) [1]
on(put, table) [1]
singular(ball) [1]
singular(table) [1]
The relevant FrameNet mapping rules, with the resulting variable bindings, are:
$var0 = ball
$var1 = table
# IF imperative(put) THEN ^1_Placing:Agent(put,you)
# IF _obj(put,$var0) THEN ^1_Placing:Theme(put,$var0)
# IF on(put,$var1) & _obj(put,$var0) THEN ^1_Placing:Goal(put,$var1) \
^1_Locative_relation:Figure($var0) ^1_Locative_relation:Ground($var1)
Finally, the output FrameNet mapping is:
^1_Placing:Agent(put,you)
^1_Placing:Theme(put,ball)
^1_Placing:Goal(put,table)
^1_Locative_relation:Figure(put,ball)
^1_Locative_relation:Ground(put,table)
The textual syntax used for the hand-coded rules mapping RelEx to FrameNet, at the mo-
ment, looks like:
# IF imperative(put) THEN ^1_Placing:Agent(put,you)
# IF _obj(put,$var0) THEN ^1_Placing:Theme(put,$var0)
# IF on(put,$var1) & _obj(put,$var0) THEN ^1_Placing:Goal(put,$var1) \
^1_Locative_relation:Figure($var0) ^1_Locative_relation:Ground($var1)
Basically, this means each rule looks like
# IF condition THEN action
where the condition is a series of RelEx relationships, and the action is a series of FrameNet
relationships. The arguments of the relationships may be words or may be variables, in which
case their names must start with $. The only variables appearing in the action should be ones
that appeared in the condition.
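The condition-matching just described amounts to unifying $-variables against ground relationships. The following Python sketch is our own simplified matcher, not the actual RelEx2Frame code; relationships are represented as plain tuples:

```python
# A sketch of RelEx2Frame-style rule application. Relationships are
# (name, arg, ...) tuples; condition terms containing $-variables are
# unified against them, and bindings instantiate the actions.

def match_term(term, fact, binds):
    """Try to unify one condition term with one fact, extending binds."""
    if term[0] != fact[0] or len(term) != len(fact):
        return None
    out = dict(binds)
    for t, f in zip(term[1:], fact[1:]):
        if t.startswith("$"):
            if out.setdefault(t, f) != f:
                return None  # variable already bound to something else
        elif t != f:
            return None  # constant mismatch
    return out

def apply_rule(conditions, actions, facts):
    """Instantiate the actions for every binding satisfying all conditions."""
    bindings = [{}]
    for cond in conditions:
        bindings = [b2 for b in bindings for fact in facts
                    if (b2 := match_term(cond, fact, b)) is not None]
    return [tuple(b.get(a, a) for a in act)
            for b in bindings for act in actions]

# "Put the ball on the table", as RelEx relationships:
facts = [("imperative", "put"), ("_obj", "put", "ball"), ("on", "put", "table")]

print(apply_rule(
    [("on", "put", "$var1"), ("_obj", "put", "$var0")],
    [("^1_Placing:Goal", "put", "$var1"),
     ("^1_Locative_relation:Figure", "$var0"),
     ("^1_Locative_relation:Ground", "$var1")],
    facts))
# [('^1_Placing:Goal', 'put', 'table'), ('^1_Locative_relation:Figure', 'ball'),
#  ('^1_Locative_relation:Ground', 'table')]
```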
44.5.2 A Priori Probabilities For Rules
It can be useful to attach a priori, heuristic probabilities to RelEx2Frame rules, say
# IF _obj(put,$var0) THEN ^1_Placing:Theme(put,$var0) <.5>
to denote that the a priori probability for the rule is 0.5.
This is a crude mechanism because the probability of a rule being useful, in reality, depends
so much on context; but it still has some nonzero value.
44.5.3 Exclusions Between Rules
It may be also useful to specify that two rules can't semantically-consistently be applied to the
same RelEx relationship. To do this, we need to associate rules with labels, and then specify
exclusion relationships such as
# IF on(put,$var1) & _obj(put,$var0) THEN ^1_Placing:Goal(put,$var1) \
^1_Locative_relation:Figure($var0) ^1_Locative_relation:Ground($var1) [1]
# IF on(put,$var1) & _subj(put,$var0) THEN \
^1_Performing_arts:Performance(put,$var1) \
^1_Performing_arts:Performer(put,$var0) [2]
# EXCLUSION 1 2
An escape character "\" must be used to handle cases where the character "$" starts a word.
In this example, Rule 1 would apply to "He put the ball on the table", whereas Rule 2 would
apply to "He put on a show". The exclusion says that generally these two rules shouldn't
be applied to the same situation. Of course some jokes, poetic expressions, etc., may involve
applying excluded rules in parallel.
44.5.4 Handling Multiple Prepositional Relationships
Finally, one complexity arising in such rules is exemplified by the sentence:
"Bob says killing for the Mafia beats killing for the government"
whose RelEx mapping looks like
uncountable(Bob) [6]
present(says) [6]
_subj(says, Bob) [6]
_that(says, beats) [3]
uncountable(killing) [6]
for(killing, Mafia) [3]
singular(Mafia) [6]
definite(Mafia) [6]
hyp(beats) [3]
present(beats) [5]
_subj(beats, killing) [3]
_obj(beats, killing_1) [5]
uncountable(killing_1) [5]
for(killing_1, government) [2]
definite(government) [6]
In this case there are two instances of "for". The output of RelEx2Frame must thus take care to
distinguish the two different for's (or we might want to modify RelEx to make this distinction).
The mechanism currently used for this is to subscript the for's, as in
uncountable(Bob) [6]
present(says) [6]
_subj(says, Bob) [6]
_that(says, beats) [3]
uncountable(killing) [6]
for(killing, Mafia) [3]
singular(Mafia) [6]
definite(Mafia) [6]
hyp(beats) [3]
present(beats) [6]
_subj(beats, killing) [3]
_obj(beats, killing_1) [5]
uncountable(killing_1) [5]
for_1(killing_1, government) [2]
definite(government) [6]
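The subscripting step itself is mechanical. A possible sketch (our own, with an invented function name) renames repeated relation names in order of appearance so later rules can tell the instances apart:

```python
from collections import Counter

# Sketch of the subscripting step: distinct instances of the same relation
# name get suffixes name, name_1, name_2, ... in order of appearance.

def subscript_duplicates(relations):
    seen = Counter()
    out = []
    for name, *args in relations:
        n = seen[name]
        seen[name] += 1
        out.append((name if n == 0 else f"{name}_{n}", *args))
    return out

rels = [("for", "killing", "Mafia"), ("for", "killing_1", "government")]
print(subscript_duplicates(rels))
# [('for', 'killing', 'Mafia'), ('for_1', 'killing_1', 'government')]
```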
so that upon applying the rule:
# IF for($var0, $var1) & (present($var0) OR past($var0) OR future($var0)) \
THEN ^2_Benefit:Benefitor(for,$var1) ^2_Benefit:Act(for,$var0)
we obtain
^2_Benefit:Benefitor(for,Mafia)
^2_Benefit:Act(for,killing)
^2_Benefit:Benefitor(for_1,government)
^2_Benefit:Act(for_1,killing_1)
Here the first argument of the output relationships allows us to correctly associate the dif-
ferent acts of killing with the different benefitors.
44.5.5 Comparatives and Phantom Nodes
Next, a bit of subtlety is needed to deal with sentences like
Mike eats more cookies than Ben.
which RelEx handles via
_subj(eat, Mike)
_obj(eat, cookie)
more(cookie, $cVar0)
$cVar0(Ben)
Then a RelEx2Frame mapping rule such as:
IF
    _subj(eat,$var0)
    _obj(eat,$var1)
    more($var1,$cVar0)
    $cVar0($var2)
THEN
    ^2_AsymmetricEvaluativeComparison:ProfiledItem(more,$var1)
    ^2_AsymmetricEvaluativeComparison:StandardItem(more,$var1_1)
    ^2_AsymmetricEvaluativeComparison:Valence(more,more)
    ^1_Ingestion:Ingestor(eat,$var0)
    ^1_Ingestion:Ingested(eat,$var1)
    ^1_Ingestion:Ingestor(eat_1,$var2)
    ^1_Ingestion:Ingested(eat_1,$var1_1)
applies, which embodies the commonsense intuition about comparisons regarding eating. (Note
that we have introduced a new frame AsymmetricEvaluativeComparison here, by analogy to
the standard FrameNet frame Evaluative_comparison.)
Note also that the above rule may be too specialized, though it's not incorrect. One could
also try more general rules like
IF
    %Agent($var0)
    %Agent($var1)
    _subj($var3,$var0)
    _obj($var3,$var1)
    more($var1,$cVar0)
    $cVar0($var2)
THEN
    ^2_AsymmetricEvaluativeComparison:ProfiledItem(more,$var1)
    ^2_AsymmetricEvaluativeComparison:StandardItem(more,$var1_1)
    ^2_AsymmetricEvaluativeComparison:Valence(more,more)
    _subj($var3,$var0)
    _obj($var3,$var1)
    _subj($var3_1,$var2)
    _obj($var3_1,$var1_1)
However, this rule is a little different from most RelEx2Frame rules, in that it produces output
that then needs to be processed by the RelEx2Frame rule-base a second time. There's nothing
wrong with this; it's just an added layer of complexity.
44.6 Frame2Atom
The next step in the current OpenCog NLP comprehension pipeline is to translate the output
of RelEx2Frame into Atoms. This may be done in a variety of ways; the current Frame2Atom
script embodies one approach that has proved workable, but is certainly not the only useful
one.
The Node types currently used in Frame2Atom are:
• WordNode
• ConceptNode
- DefinedFrameNode
- DefinedLinguisticConceptNode
• PredicateNode
- DefinedFrameElementNode
- DefinedLinguisticRelationshipNode
• SpecificEntityNode
The special node types
• DefinedFrameNode
• DefinedFrameElementNode
have been created to correspond to FrameNet frames and elements respectively (or frames and
elements drawn from similar resources to FrameNet, such as our own frame dictionary).
Similarly, the special node types
• DefinedLinguisticConceptNode
• DefinedLinguisticRelationshipNode
have been created to correspond to RelEx unary and binary relationships respectively.
The "defined" is in the names because once we have a more advanced CogPrime system, it
will be able to learn its own frames, frame elements, linguistic concepts and relationships. But
what distinguishes these "defined" Atoms is that they have names which correspond to specific
external resources.
The Link types we need for Frame2Atom are:
• InheritanceLink
• ReferenceLink (currently using WRLink aka "word reference link")
• FrameElementLink
ReferenceLink is a special link type for connecting concepts to the words that they refer to.
(This could be eliminated via using more complex constructs, but it's a very common case so
for practical purposes it makes sense to define it as a link type.)
FrameElementLink is a special link type connecting a frame to its element. Its semantics
(and how it could be eliminated at cost of increased memory and complexity) will be explained
below.
44.6.1 Examples of Frame2Atom
Below follow some examples to illustrate the nature of the mapping intended. The examples
include a lot of explanatory discussion as well.
Note that, in these examples, [n] denotes an Atom with AtomHandle n. All Atoms have Han-
dles, but Handles are only denoted in cases where this seems useful. (In the XML representation
used in the current OpenCogPrime implementation, these are replaced by UUIDs.)
The notation
WordNode#pig
denotes a WordNode with name pig, and a similar convention is used for other AtomTypes
whose names are useful to know.
These examples pertain to fragments of the parse
Ben slowly ate the fat chickens.
A:_advmod:V(slowly:A, eat:V)
N:_nn:N(fat:N, chicken:N)
N:definite(Ben:N)
N:definite(chicken:N)
N:masculine(Ben:N)
N:person(Ben:N)
N:plural(chicken:N)
N:singular(Ben:N)
V:_obj:N(eat:V, chicken:N)
V:_subj:N(eat:V, Ben:N)
V:past(eat:V)
^1_Ingestion:Ingestor(eat,Ben)
^1_Temporal_colocation:Event(past,eat)
^1_Ingestion:Ingestibles(eat,chicken)
^1_Activity:Agent(subject,Ben)
^1_Activity:Activity(verb,eat)
^1_Transitive_action:Event(verb,eat)
^1_Transitive_action:Patient(object,chicken)
44.6.1.1 Example 1
_obj(eat,chicken)
would map into
EvaluationLink
    DefinedLinguisticRelationshipNode #_obj [1]
    ListLink
        ConceptNode [2]
        ConceptNode [3]
InheritanceLink
    [2]
    ConceptNode [4]
InheritanceLink
    [3]
    ConceptNode [5]
ReferenceLink [6]
    WordNode #eat [8]
    [4]
ReferenceLink [7]
    WordNode #chicken [9]
    [5]
Please note that the Atoms labeled 4,5,6,7,8,9 would not normally have to be created when
entering the relationship
_obj(eat,chicken)
into the AtomTable. They should already be there, assuming the system already knows about
the concepts of eating and chickens. These would need to be newly created only if the system
had never seen these words before.
For instance, the Atom [2] represents the specific instance of "eat" involved in the relationship
being entered into the system. The Atom [4] represents the general concept of "eat", which is
what is linked to the word "eat."
Note that a very simple step of inference, from these Atoms, would lead to the conclusion
EvaluationLink
    DefinedLinguisticRelationshipNode #_obj
    ListLink
        ConceptNode [4]
        ConceptNode [5]
which represents the general statement that chickens are eaten. This is such an obvious and
important step, that perhaps as soon as the relationship _obj(eat, chicken) is entered into the
system, it should immediately be carried out (i.e. that link if not present should be created,
and if present should have its truth value updated). This is a choice to be implemented in the
specific scripts or schema t hat deal with ingestion of natural language text.
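That generalization step - lifting the instance-level relationship to the general concepts the instances inherit from - can be sketched as follows (illustrative code, not the actual OpenCog schema; the `inherit` map plays the role of the InheritanceLinks from [2] to [4] and from [3] to [5]):

```python
# Sketch of the immediate generalization step: lift an instance-level
# relation to the general concepts its arguments inherit from.

def generalize(eval_links, inherit):
    out = []
    for rel, a, b in eval_links:
        if a in inherit and b in inherit:
            out.append((rel, inherit[a], inherit[b]))
    return out

# eat@2 and chicken@3 play the roles of the instance Atoms [2] and [3].
inherit = {"eat@2": "eat", "chicken@3": "chicken"}
print(generalize([("_obj", "eat@2", "chicken@3")], inherit))
# [('_obj', 'eat', 'chicken')]
```

In a real system the created general link would have its truth value revised against any existing copy, rather than simply being appended.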
44.6.1.2 Example 2
masculine(Ben)
would map into
InheritanceLink
    SpecificEntityNode [40]
    DefinedLinguisticConceptNode #masculine
InheritanceLink
    [40]
    [10]
ReferenceLink
    WordNode #Ben
    [10]
44.6.1.3 Example 3
The mapping of the RelEx2Frame output
Ingestion:Ingestor(eat, Ben)
would use the existing Atoms
DefinedFrameNode #Ingestion [11]
DefinedFrameElementNode #Ingestion:Ingestor [12]
which would be related via
FrameElementLink [11] [12]
(Note that FrameElementLink may in principle be reduced to more elementary PLN link types.)
Note that each FrameNet frame contains some core elements and some optional elements.
This may be handled by giving core elements links such as
FrameElementLink F E <1>
and optional ones links such as
FrameElementLink F E <.7>
Getting back to the example at hand, we would then have
InheritanceLink [2] [11]
(recall, [2] is the instance of eating involved in Example 1, and [11] is the Ingestion frame),
which says that this instance of eating is an instance of ingestion. (In principle, some instances
of eating might not be instances of ingestion - or more generally, we can't assume that all
instances of a given concept will always associate with the same FrameNodes. This could be
assumed only if we assumed all word-associated concepts were disambiguated to a single known
FrameNet frame, but this can't be assumed, especially if later on we want to use cognitive
processes to do sense disambiguation.)
We would then also have links denoting the role of Ben as an Ingestor in the frame-instance
[2], i.e.
EvaluationLink
    DefinedFrameElementNode #Ingestion:Ingestor [12]
    ListLink
        [2]
        [40]
This says that the specific instance of Ben observed in that sentence ([40]) served the role of
Ingestion:Ingestor in regard to the frame-instance [2] (which is an instance of eating, which is
known to be an instance of the frame of Ingestion).
44.6.2 Issues Involving Disambiguation
Right now, OpenCogPrime's RelEx2Frame rulebase is far from adequately large (there are
currently around 5000 rules) and the link parser and RelEx are also imperfect. The current
OpenCog NLP system does work, but for complex sentences it tends to generate too many
interpretations of each sentence - "parse selection" or more generally "interpretation selection"
is not yet adequately addressed. This is a tricky issue that can be addressed to some extent via
statistical linguistics methods, but we believe that to solve it convincingly and thoroughly will
require more cognitively sophisticated methods.
The most straightforward way to approach it statistically is to process a large number of
sentences, and then tabulate co-occurrence probabilities of different relationships across all the
sentences. This allows one to calculate the probability of a given interpretation conditional on
the corpus, via looking at the probabilities of the combinations of relationships in the inter-
pretation. This may be done using a Bayes Net or using PLN - in any case the problem is
one of calculating the probability of a conjunction of terms based on knowledge regarding the
probabilities of various sub-conjunctions. As this method doesn't require marked-up training
data, but is rather purely unsupervised, it's feasible to apply it to a very large corpus of text -
the only cost is computer time.
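Under a naive independence assumption, the co-occurrence-based scoring just described might look like the following sketch (our own illustration; the product over pairs stands in for the Bayes-net or PLN conjunction calculation, and all probabilities and relationship names are invented):

```python
from math import prod

# Sketch of the unsupervised interpretation ranker: score an interpretation
# by the corpus co-occurrence probabilities of the pairs of relationships
# it contains, with a small default for unseen pairs.

def interpretation_score(rels, pair_prob, default=1e-6):
    """Product of pairwise co-occurrence probabilities over the relation set."""
    pairs = [frozenset((rels[i], rels[j]))
             for i in range(len(rels)) for j in range(i + 1, len(rels))]
    return prod(pair_prob.get(p, default) for p in pairs)

# Toy corpus statistics: snakes chase things far more often than hats do.
pair_prob = {
    frozenset({"_subj(chased,snake)", "_obj(chased,cat)"}): 0.02,
    frozenset({"_subj(chased,hat)", "_obj(chased,cat)"}): 1e-5,
}
good = ["_subj(chased,snake)", "_obj(chased,cat)"]
odd = ["_subj(chased,hat)", "_obj(chased,cat)"]
print(interpretation_score(good, pair_prob) > interpretation_score(odd, pair_prob))  # True
```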
What the statistical approach won't handle, though, are the more conceptually original lin-
guistic constructs, containing combinations that didn't occur frequently in the system's training
corpus. It will rate innovative semantic constructs as unlikely, which will lead it to errors some-
times - errors of choosing an interpretation that seems odd in terms of the sentence's real-world
interpretation, but matches well with things the system has seen before. The only way to solve
this is with genuine understanding - with the system reasoning on each of the interpretations
and seeing which one makes more sense. And this kind of reasoning generally requires some
relevant commonsense background knowledge - which must be gained via experience, reading
and conversing, or from a hand-coded knowledge base, or via some combination of the above.
Related issues also involving disambiguation include word sense disambiguation (words with
multiple meanings) and anaphor resolution (recognizing the referents of pronouns, and of nouns
that refer to other nouns, etc.).
The current RelEx system contains a simple statistical parse ranker (which rates a parse
higher if the links it includes occur more frequently in a large parsed corpus), statistical methods
for word sense disambiguation inspired by those in Rada Mihalcea's work, and
an anaphor resolution algorithm based on the classic Hobbs Algorithm (customized to work with
the link parser) [Hob78]. While reasonably effective in many cases, from an AGI perspective
these must all be considered "stopgaps" to be replaced with code that handles these tasks
using probabilistic inference. It is conceptually straightforward to replace statistical linguistic
algorithms with comparable PLN-based methods, however significant attention must be paid to
code optimization as using a more general algorithm is rarely as efficient as using a specialized
one. But once one is handling things in PLN and the Atomspace rather than in specialized
computational linguistics code, there is the opportunity to use a variety of inference rules for
generalization, analogy and so forth, which enables a radically more robust form of linguistic
intelligence.
44.7 Syn2Sem: A Semi-Supervised Alternative to RelEx and
RelEx2Frame
This section describes an alternative approach to the RelEx / RelEx2Frame approach described
above, which is in the midst of implementation at time of writing. This alternative represents
a sort of midway point between the rule-based RelEx / RelEx2Frame approach, and a concep-
tually ideal fully experiential learning based approach.
The motivations underlying this alternative approach have been to create an OpenCog NLP
system with the capability to:
• support simple dialogue in a video game like world, and a robot system
• leverage primarily semi-supervised experiential learning
• replace the RelEx2Frame rules, which are currently problematic, with a different way of
mapping syntactic relationships into Atoms, that is still reasoning and learning friendly
• require only relatively modest effort for implementation (not multiple human-years)
The latter requirement ruled out a pure "learn language from experience with no aid from
computational linguistics tools" approach, which may well happen within OpenCog at some
point.
44.8 Mapping Link Parses into Atom Structures
The core idea of the new approach is to learn "Syn2Sem" rules that map link parses into Atom
structures. These rules may then be automatically reversed to form Sem2Syn rules, which may
be used in language generation.
Note that this is different from the RelEx approach as currently pursued (the "old approach"),
which contains
• one set of rules (the RelEx rules) mapping link parses into semantic relation-sets ("RelEx
relation-sets" or rel-sets)
• another set of rules (the RelEx2Frame rules) mapping rel-sets into FrameNet-based relation-
sets
• another set of rules (the Frame2Atom rules) mapping FrameNet-based relation-sets into
Atom-sets
In the old approach, all the rules were hand-coded. In the new approach
• nothing needs to be hand-coded (except the existing link parser dictionary); the rules can be
learned from a corpus of (link-parse, Atom-set) pairs. This corpus may be human-created;
or may be derived via a system's experience in some domain where sentences are heard or
read, and can be correlated with observed nonlinguistic structures that can be described by
Atoms.
• in practice, some hand-coded rules are being created to map RelEx rel-sets into Atom-sets
directly (bypassing RelEx2Frame) in a simple way. These rules will be used, together with
RelEx, to create a large corpus of (link parse, Atom-set) pairs, which will be used as a
training corpus. This training corpus will have more errors than a hand-created corpus, but
will have the compensating advantage of being significantly larger than any hand-created
corpus would feasibly be.
In the old approach, NL generation was done by using a pattern-matching approach, applied
to a corpus of (link parse, rel-set) pairs, to mine rules mapping rel-sets to sets of link parser
links. This worked to an extent, but the process of piecing together the generated sets of link
parser links to form coherent "sentence parses" (that could then be turned into sentences)
turned out to be subtler than expected, and appeared to require an escalatingly complex set of
hand-coded rules, to be extended beyond simple cases.
In the new approach, NL generation is done by explicitly reversing the mapping rules learned
for mapping link parses into Atom sets. This is possible because the rules are explicitly given in
a form enabling easy reversal; whereas in the old approach, RelEx transformed link parses into
rel-sets using a process of successively applying many rules to an ornamented tree, each rule
acting on variables ("ornaments") deposited by previous rules. Put simply, RelEx transformed
link parses into rel-sets via imperative programming, whereas in the new approach, link parses
are transformed into Atom-sets using learned rules that are logical in nature. The movement
from imperative to logical style dramatically eases automated rule reversal.
44.8.1 Example Training Pair
For concreteness, an example (link parse, Atom-set) pair would be as follows. For the sentence
"Trains move quickly", the link parse looks like
Sp(trains, move)
MVa(move, quickly)
whereas the Atom-set looks like
Inheritance
    move_1
    move
Evaluation
    move_1
    train
Inheritance
    move_1
    quick
Rule learning proceeds, in the new approach, from a corpus consisting of such pairs.
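Because such a rule is just a pair of patterns over shared variables, the same data structure can be run forward (parse to Atoms) or backward (Atoms to parse). The following sketch uses our own illustrative rule format, not the actual Syn2Sem syntax, and uses surface forms where the example above uses lemmas like "train" and "quick":

```python
# Sketch of a declarative, reversible Syn2Sem rule: a (syntax-pattern,
# atom-pattern) pair over shared $-variables, where $x_1 denotes an
# instance of the concept bound to $x.

RULE = {
    "syn": [("Sp", "$noun", "$verb"), ("MVa", "$verb", "$adv")],
    "sem": [("Inheritance", "$verb_1", "$verb"),
            ("Evaluation", "$verb_1", "$noun"),
            ("Inheritance", "$verb_1", "$adv")],
}

def rewrite(pats_in, pats_out, triples):
    """Bind pats_in against triples (in order), then instantiate pats_out."""
    binds = {}
    for pat, tri in zip(pats_in, triples):
        assert pat[0] == tri[0]  # relation names must line up
        for p, t in zip(pat[1:], tri[1:]):
            binds[p] = t
    # Derive instance variables like $verb_1 from $verb where not yet bound.
    for var, val in list(binds.items()):
        binds.setdefault(var + "_1", val + "_1")
    return [tuple(binds.get(x, x) for x in pat) for pat in pats_out]

parse = [("Sp", "trains", "move"), ("MVa", "move", "quickly")]
atoms = rewrite(RULE["syn"], RULE["sem"], parse)   # forward: parse -> Atoms
print(atoms)
print(rewrite(RULE["sem"], RULE["syn"], atoms) == parse)  # reversed: True
```

The point of the sketch is the symmetry: the same RULE drives both directions, which is exactly what the imperative RelEx pipeline could not offer.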
44.9 Making a Training Corpus
44.9.1 Leveraging RelEx to Create a Training Corpus
To create a substantial training corpus for the new approach, we are leveraging the existence
of RelEx. We have a large corpus of sentences parsed by the link parser and then processed by
RelEx. A new collection of rules is being created, RelEx2Atom, that directly translates RelEx
parses into Atoms, in a simple way, embodying the minimal necessary degree of disambiguation
(in a sense to be described just below). Using these RelEx2Atom rules, one can transform a
corpus of (sentence, link parse, RelEx rel-set) triples into a corpus of (link parse, Atom-set) pairs - which
can then be used as training data for learning Syn2Sem rules.
44.9.2 Making an Experience Based Training Corpus
An alternate approach to making a training corpus would be to utilize a virtual world such as
the Unity3D world now being used for OpenCog game AI research and development.
A human game-player could create a training corpus by repeatedly:
• typing in a sentence
• indicating, via the graphic user interface, which entities or events in the virtual world were
referred to by the sentence
Since OpenCog possesses code for transforming entities and events in the virtual world into
Atom-sets, this would implicitly produce a training corpus of (sentence, Atom-set) pairs, which
using the link parser could then be transformed into (link parse, Atom-set) pairs.
44.9.3 Unsupervised, Experience Based Corpus Creation
One could also dispense with the explicit reference-indication GUI, and just have a user type
sentences to the AI agent as the latter proceeds through the virtual world. The AI agent would
then have to figure out what specifically the sentences were referring to - maybe the human-
controlled avatar is pointing at something; maybe one thing recently changed in the game world
and nothing else did; etc. This mode of corpus creation would be reasonably similar to human
first language learning in format (though of course there are many differences from human
first language learning in the overall approach, for instance we are assuming the link parser,
whereas a human language learner has to learn grammar for themselves, based on complex and
ill-understood genetically encoded prior probabilistic knowledge regarding the likely aspects of
the grammar to be learned).
This seems a very interesting direction to explore later on, but at time of writing we are pro-
ceeding with the RelEx-based training corpus, for sake of simplicity and speed of development.
44.10 Limiting the Degree of Disambiguation Attempted
The old approach is in a sense more ambitious than the new approach, because the RelEx2Frame
rules attempt to perform a deeper and more thorough level of semantic disambiguation than
the new rules. However, the RelEx2Frame rule-set in its current state is too "noisy" to be really
useful; it would need dramatic improvement to be helpful in practice. The key difference is that,
• In the new approach, the syntax-to-semantics mapping rules attempt only the disambigua-
tion that needs to be done to get the structure of the resultant Atom-set correct. Any
further disambiguation is left to be done later, by MindAgents acting on the Atom-sets
after they've already been placed in the AtomSpace.
• In the old approach, the RelEx2Frame rules attempted, in many cases, to disambiguate between different meanings beyond the level needed to disambiguate the structure of the Atom-set.
To illustrate the difference, consider the sentences
• Love moves quickly.
• Trains move quickly.
These sentences involve different senses of "move" - change in physical location, versus a more
general notion of progress. However, both sentences map to the same basic conceptual structure,
e.g.
Inheritance
    move_1
    move
Evaluation
    move_1
    train
Inheritance
    move_1
    quick
versus
Inheritance
    move_2
    move
Evaluation
    move_2
    love
Inheritance
    move_2
    quick
The RelEx2Frame rules try to distinguish between these cases via, in effect, associating the two instances move_1 and move_2 with different frames, using hand-coded rules that map RelEx rel-sets into appropriate Atom-sets defined in terms of FrameNet relations. This is not a
useless thing to do; however, doing it well requires a very large and well-honed rule-base. Cyc's
natural language engine attempts to do something similar, though using a different parser than
the link parser and a different ontology than FrameNet; it does a much better job than the
current version of RelEx2Frame, but still does a surprisingly incomplete job given the massive
amount of effort put into sculpting the relevant rule-sets.
The new approach does not try to perform this kind of disambiguation prior to mapping
things into Atom-sets. Rather, this kind of disambiguation is left for inference to do, after the
relevant Atoms have already been placed in the AtomSpace. The rule of thumb is: Do precisely
the disambiguation needed to map the parse into a compact, simple Atom-set, whose component
nodes correspond to English words. Let the disambiguation of the meaning of the English words
be done by some other process acting on the AtomSpace.
44.11 Rule Format
To represent Syn2Sem rules, it is convenient to represent link parses as Atom-sets. Each element
of the training corpus will then be of the form (Atom set representing link parse, Atom-set
representing semantic interpretation). Syn2Sem rules are then rules mapping Atom-sets to
Atom-sets.
Broadly speaking, the format of a Syn2Sem rule is then
Implication
Atom-set representing portion of link parse
Atom-set representing portion of semantic interpretation
44.11.1 Example Rule
A simple example rule would be
Implication
    Evaluation
        Predicate: Sp
        $V1
        $V2
    Evaluation
        $V2
        $V1
This rule, in essence, maps verbs into predicates that take their subjects as arguments.
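As a rough illustration of how such a rule might be applied mechanically, the sketch below treats Atoms as plain tuples and rules as (antecedent, consequent) template pairs; this is a toy matcher, not the actual CogPrime rule engine:

```python
# Illustrative application of the example Syn2Sem rule: an Sp link between
# $V1 (subject noun) and $V2 (verb) rewrites into an Evaluation in which
# the verb is the predicate and the subject its argument. Atoms are plain
# tuples here; this is a toy matcher, not the actual CogPrime rule engine.

SYN2SEM_RULES = [
    # (antecedent template, consequent template); "$"-strings are variables
    (("Evaluation", "Sp", "$V1", "$V2"), ("Evaluation", "$V2", "$V1")),
]

def match(pattern, atom):
    """Return variable bindings if pattern matches atom, else None."""
    if len(pattern) != len(atom):
        return None
    bindings = {}
    for p, a in zip(pattern, atom):
        if p.startswith("$"):
            if bindings.setdefault(p, a) != a:
                return None
        elif p != a:
            return None
    return bindings

def apply_rules(parse_atoms):
    """Apply every rule to every atom of a link-parse Atom-set."""
    out = []
    for atom in parse_atoms:
        for antecedent, consequent in SYN2SEM_RULES:
            b = match(antecedent, atom)
            if b is not None:
                out.append(tuple(b.get(x, x) for x in consequent))
    return out

print(apply_rules([("Evaluation", "Sp", "train", "move")]))
# prints [('Evaluation', 'move', 'train')]
```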
On the other hand, a Sem2Syn rule would look like the reverse:
Implication
    Atom-set representing portion of semantic interpretation
    Atom-set representing portion of link parse
Our current approach is to begin with Syn2Sem rules, because, due to the nature of natural
language, these rules will tend to be more certain. That is: it is more strongly the case in natural
languages that each syntactic construct maps into a small set of semantic structures, than that
each semantic structure is realizable only via a small set of syntactic constructs. There are usually more structurally different, reasonably sensible ways to say an arbitrary thought than there are structurally different, reasonably sensible ways to interpret an arbitrary sentence.
Because of this fact about language, the design of the Atom-sets in the corpus is based on the
principle of finding an Atom structure that most simply represents the meaning of the sentence
corresponding to each given link parse. Thus, there will be many Syn2Sem rules with a high
degree of certitude attached to them. On the other hand, the Sem2Syn rules will tend to have
less certitude, because there may be many different syntactic ways to realize a given semantic
expression.
44.12 Rule Learning
Learning of Syn2Sem rules may be done via any algorithm that is able to search rule space for
rules of the proper format with high truth value as evaluated across the training set. Currently
we are experimenting with using OpenCogPrime's frequent subgraph mining algorithm in this
context. MOSES could also potentially be used to learn Syn2Sem rules. One suspects that
MOSES might be better than frequent subgraph mining for learning complex rules, but based on preliminary experimentation, frequent subgraph mining seems fine for learning the simple rules involved in simple sentences.
PLN inference may also be used to generate new rules by combining previous ones, and to
generalize rules into more abstract forms.
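To make the learning setup concrete, here is a deliberately simplified stand-in for such rule learning. Rather than true frequent subgraph mining, it abstracts single atoms into variable-bearing patterns and keeps (syntactic pattern, semantic pattern) pairs that recur across the corpus; the corpus and the abstraction heuristic are invented:

```python
# A simplified stand-in for Syn2Sem rule learning: abstract each atom into
# a variable-bearing pattern, then keep (syntactic pattern, semantic
# pattern) pairs that recur across the (link parse, Atom-set) corpus.
# Real CogPrime mines whole subgraphs (or uses MOSES); the corpus and the
# abstraction heuristic here are invented.
from collections import Counter

def abstract(atom):
    """Replace word arguments with variables. Toy heuristic: a capitalized
    second element (e.g. a link type such as 'Sp') stays in the pattern
    head; everything after the head becomes a variable."""
    keep = 2 if len(atom) > 1 and atom[1][0].isupper() else 1
    return atom[:keep] + tuple(f"$V{i}" for i in range(len(atom) - keep))

def learn_rules(corpus, min_support=2):
    counts = Counter()
    for parse_atoms, sem_atoms in corpus:
        for p in parse_atoms:
            for s in sem_atoms:
                counts[(abstract(p), abstract(s))] += 1
    return [pair for pair, n in counts.items() if n >= min_support]

corpus = [
    ([("Evaluation", "Sp", "train", "move")], [("Evaluation", "move", "train")]),
    ([("Evaluation", "Sp", "love", "move")], [("Evaluation", "move", "love")]),
]
rules = learn_rules(corpus)
```

The single rule recovered here is exactly the subject-link rule of the previous section, with words abstracted into variables.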
44.13 Creating a Cyc-Like Database via Text Mining
The discussion of these NL comprehension mechanisms leads naturally to one interesting potential application of the OpenCog NL comprehension pipeline - which is only indirectly related
to CogPrime, but would create a valuable resource for use by CogPrime if implemented. The
possibility exists to use the OpenCog NL comprehension system to create a vaguely Cyc-like
database of common-sense rules.
The approach would be as follows:
1. Get a corpus of text
2. Parse the text using OpenCog (RelEx or Syn2Sem)
3. Mine logical relationships among Atoms from the data thus produced, using greedy data-mining, MOSES, or other methods
These mined logical relationships will then be loosely analogous to the rules the Cyc team have
programmed in. For instance, there will be many rules like:
IF _subj(understand,$var0) THEN ^1_Grasp:Cognizer(understand,$var0)
IF _subj(know,$var0) THEN ^1_Grasp:Cognizer(understand,$var0)
So statistical mining would learn rules like
IF ^1_Mental_property(stupid) & ^1_Mental_property:Protagonist($var0)
THEN ^1_Grasp:Cognizer(understand,$var0) <.3>
IF ^1_Mental_property(smart) & ^1_Mental_property:Protagonist($var0)
THEN ^1_Grasp:Cognizer(understand,$var0) <.8>
which means that stupid people mentally grasp less than smart people do.
Note that these commonsense rules would come out automatically probabilistically quanti-
fied.
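One simple way such probabilistic weights could arise is as conditional frequencies over the mined relation instances; the per-person predicate sets below are invented purely to show the arithmetic:

```python
# How the probabilistic weights on mined rules (the <.3> and <.8>
# annotations above) can arise: as conditional frequencies over mined
# relation instances. The data below is invented for illustration.

def rule_strength(instances, antecedent, consequent):
    """Estimate P(consequent | antecedent) by simple counting."""
    with_ante = [x for x, props in instances.items() if antecedent in props]
    if not with_ante:
        return 0.0
    with_both = [x for x in with_ante if consequent in instances[x]]
    return len(with_both) / len(with_ante)

# invented mined data: which frame predicates were observed per person
instances = {
    "p1": {"Mental_property(smart)", "Grasp:Cognizer"},
    "p2": {"Mental_property(smart)", "Grasp:Cognizer"},
    "p3": {"Mental_property(smart)"},
    "p4": {"Mental_property(stupid)"},
    "p5": {"Mental_property(stupid)", "Grasp:Cognizer"},
}
smart = rule_strength(instances, "Mental_property(smart)", "Grasp:Cognizer")
stupid = rule_strength(instances, "Mental_property(stupid)", "Grasp:Cognizer")
```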
Note also that to make such rules come out well, one needs to do some (probabilistic)
synonym-matching on nouns, adverbs and adjectives, e.g. so that mentions of "smart", "intelli-
gent", "clever", etc. will count as instances of
^1_Mental_property(smart)
By combining probabilistic synonym matching on words, with mapping RelEx output into FrameNet input, and doing statistical mining, it should be possible to build a database like Cyc
but far more complete and with coherent probabilistic weightings.
Although this way of building a commonsense knowledge base requires a lot of human
engineering, it requires far less than something like Cyc. One "just" needs to build the
RelEx2FrameNet mapping rules, not all the commonsense knowledge relationships directly —
those come from text. We do not advocate this as a solution to the AGI problem, but merely
suggest that it could produce a large amount of useful knowledge to feed into an AGI's brain.
And of course, the better an AI one has, the better one can rank the parses and FrameNet interpretations using inference or heuristics or both. So there is a
potential virtuous cycle here: more commonsense knowledge mined helps create a better AI
mind, which helps mine better commonsense knowledge, etc.
44.14 PROWL Grammar
We have described the crux of the NL comprehension pipeline that is currently in place in
the OpenCog codebase, plus some ideas for fairly moderate modifications or extensions. This
section is a little more speculative, and describes an alternative approach that fits better with
the overall CogPrime design, which however has not yet been implemented. The ideas given
here lead more naturally to a design for experience-based language learning and processing, a
connection that will be pointed out in a later section.
What we describe here is a partially-new theory of language formed via combining ideas
from three sources: Hudson's Word Grammar [Hud90, Hud07a], Sleator and Temperley's link grammar, and Probabilistic Logic Networks. Reflecting its origin in these three sources, we have
named the new theory PROWL grammar, meaning PRObabilistic Word Link Grammar. We
believe PROWL has value purely as a conceptual approach to understanding language; however,
it has been developed largely from the standpoint of computational linguistics - as part of an
attempt to create a framework for computational language understanding and generation that
both
1. yields broadly adequate behavior based on hand-coding of "expert rules" such as grammat-
ical rules, combined with statistical corpus analysis
2. integrates naturally with a broader Al framework that combines language with embodied
social, experiential learning, that ultimately will allow linguistic rules derived via expert
encoding and statistical corpus analysis to be replaced with comparable, more refined rules
resulting from the system's own experience
PROWL has been developed as part of the larger CogPrime project; but, it is described in this
section mostly in a CogPrime -independent way, and is intended to be independently evaluable
(and, hopefully, valuable).
As an integration of three existing frameworks, PROWL could be presented in various differ-
ent ways. One could choose any one of the three components as an initial foundation, and then
present the combined theory, as an expansion/modification of this component. Here we choose
to present it as an expansion/modification of Word Grammar, as this is the way it originated,
and it is also the most natural approach for readers with a linguistics background. From this
perspective, to simplify a fair bit, one may describe PROWL as consisting of Word Grammar
with three major changes:
1. Word Grammar's network knowledge representation is replaced with a richer PLN-based
network knowledge representation.
a. This includes, for instance, the replacement of Word Grammar's single "isa" relationship type with a more nuanced collection of logically distinct probabilistic inheritance
relationship types
2. Going along with the above, Word Grammar's "default inheritance" mechanism is replaced
by an appropriate PLN control mechanism that guides the use of standard PLN inference
rules
a. This allows the same default-inheritance based inferences that Word Grammar relies
upon, but embeds these inferences in a richer probabilistic framework that allows them
to be integrated with a wide variety of other inferences
3. Word Grammar's small set of syntactic link types is replaced with a richer set of syntactic
link types as used in Link Grammar
a. The precise optimal set of link types is not clear; it may be that the link grammar's
syntactic link type vocabulary is larger than necessary, but we also find it clear that
the current version of Word Grammar's syntactic link type vocabulary is smaller than
feasible (at least, without the addition of large, new, and as yet unspecified ideas to
Word Grammar)
In the following subsections we will review these changes in a little more detail. Basic familiarity
with Word Grammar, Link Grammar and PLN is assumed.
Note that in this section we will focus mainly on those issues that are somehow nonobvious.
This means that a host of very important topics that come along with the Word Grammar
/ PLN integration are not even mentioned. The way Word Grammar deals with morphology,
semantics and pragmatics, for instance, seems to us quite sensible and workable - and doesn't
really change at all when you integrate Word Grammar with PLN, except that Word Grammar's
crisp isa links become PLN-style probabilistic Inheritance links.
44.14.1 Brief Review of Word Grammar
Word Grammar is a theory of language structure which Richard Hudson began developing in
the early 1980's [Hud90]. While partly descended from Systemic Functional Grammar, there are also significant differences. The main ideas of Word Grammar are as follows:
• It presents language as a network of knowledge, linking concepts about words, their mean-
ings, etc. - e.g. the word "dog" is linked to the meaning 'dog', to the form /dog/, to the
word-class 'noun', etc.
• If language is a network, then it is possible to decide what kind of network it is (e.g. it
seems to be a scale-free small-world network)
• It is monostratal - only one structure per sentence, no transformations.
• It uses word-word dependencies - e.g. a noun is the subject of a verb.
• It does not use phrase structure - e.g. it does not recognise a noun phrase as the subject of
a clause, though these phrases are implicit in the dependency structure.
(The following list is paraphrased, with edits, from http://www.phon.ucl.ac.uk/home/dick/wg.htm, downloaded on June 27, 2010.)
• It shows grammatical relations/functions by explicit labels - e.g. 'subject' and 'object'.
• It uses features only for inflectional contrasts that are mentioned in agreement rules - e.g.
number but not tense or transitivity.
• It uses default inheritance, as a very general way of capturing the contrast between 'basic' or
'underlying' patterns and 'exceptions' or 'transformations' - e.g. by default, English words
follow the word they depend on, but exceptionally subjects precede it; particular cases
'inherit' the default pattern unless it is explicitly overridden by a contradictory rule.
• It views concepts as prototypes rather than 'classical' categories that can be defined by
necessary and sufficient conditions. All characteristics (i.e. all links in the network) have
equal status, though some may for pragmatic reasons be harder to override than others.
• In this network there are no clear boundaries between different areas of knowledge - e.g.
between 'lexicon' and 'grammar', or between 'linguistic meaning' and 'encyclopedic knowledge'; language is not a separate module of cognition.
• In particular, there is no clear boundary between 'internal' and 'external' facts about words,
so a grammar should be able to incorporate sociolinguistic facts - e.g. the speaker of "side-
walk" is an American.
44.14.2 Word Grammar's Logical Network Model
Word Grammar presents an elegant framework in which all the different aspects of language
are encompassed within a single knowledge network. Representationally, this network combines
two key aspects:
1. Inheritance (called is-a) is explicitly represented
2. General relationships between n-ary predicates and their arguments, including syntactic
relationships, are explicitly represented
Dynamically, the network contains two key aspects:
1. An inference rule called "default inheritance"
2. Activation-spreading, similar to that in a neural network or standard semantic network
The similarity between Word Grammar and CogPrime is fairly strong. In the latter, inheritance
and generic predicate-argument relationships are explicitly represented; and, a close analogue of
activation spreading is present in the "attention allocation" subsystem. As in Word Grammar,
important cognitive phenomena are grounded in the symbiotic combination of logical-inference
and activation-spreading dynamics.
At the most general level, the reaction of the Word Grammar network to any situation is
proposed to involve three stages:
1. Node creation and identification: of nodes representing the situation as understood, in its
most relevant aspects
2. Where choices need to be made (e.g. where an identified predicate needs to choose which
other nodes to bind to as arguments), activation spreading is used, and the most active
eligible argument is utilized (this is called "best fit binding")
3. Default inheritance is used to supply new links to the relevant nodes as necessary
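Stage 2, best-fit binding, is easy to sketch: among the type-eligible candidates, the node with the highest current activation wins. The entities and activation values below are invented:

```python
# A minimal sketch of "best fit binding": among the type-eligible argument
# candidates, the node with the highest current activation is chosen.
# Activation values are invented; in CogPrime the analogous quantities
# would come from the attention allocation subsystem.

def best_fit(eligible, activation):
    """Pick the eligible node with the highest current activation."""
    return max(eligible, key=lambda node: activation.get(node, 0.0))

activation = {"ball_7": 0.9, "ball_2": 0.2, "table_1": 0.5}
# a "kick" predicate needs a patient; suppose only the balls are eligible
chosen = best_fit(["ball_7", "ball_2"], activation)
```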
Default inheritance is a process that relies on the placement of each node in a directed acyclic
graph hierarchy (dag) of isa links. The basic idea is as follows. Suppose one has a node N, and a
predicate f(N,L), where L is another argument or list of arguments. Then, if the truth value of
f(N,L) is not explicitly stored in the network, N inherits the value from any ancestor A in the
dag so that: f(A,L) is explicitly stored in the network; and there is not any node P inbetween
N and A for which f(P,L) is explicitly stored in the network. Note that multiple inheritance is
explicitly supported, and in cases where this leads to multiple assignments of truth values to a
predicate, confusion in the linguistic mind may ensue. In many cases the option coming from
the ancestor with the highest level of activity may be selected.
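The procedure just described can be sketched as a breadth-first walk up the isa dag, with nearer (more specific) stored values overriding more distant ones. The penguin/bird example is the classic illustration of a default with an exception, not an example from Word Grammar itself:

```python
# Sketch of Word Grammar style default inheritance: to evaluate f(N, L),
# look on N itself, else on the nearest ancestor in the isa dag that
# stores a value; nearer ancestors override more distant ones.

def default_inherit(node, key, isa, facts):
    """Breadth-first walk up the isa dag; nearest stored values win.
    Returning more than one value signals a multiple-inheritance clash."""
    frontier, seen = [node], set()
    while frontier:
        hits = [facts[(n, key)] for n in frontier if (n, key) in facts]
        if hits:
            return hits
        seen.update(frontier)
        frontier = [a for n in frontier for a in isa.get(n, []) if a not in seen]
    return []

isa = {"penguin": ["bird"], "bird": ["animal"]}
facts = {("bird", "flies"): True, ("penguin", "flies"): False}
assert default_inherit("penguin", "flies", isa, facts) == [False]  # exception
assert default_inherit("bird", "flies", isa, facts) == [True]      # default
```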
Our suggestion is that Word Grammar's network representation may be replaced with PLN's
logical network representation without any loss, and with significant gain. Word Grammar's
network representation has not been fleshed out as thoroughly as that of PLN, it does not
handle uncertainty, and it is not associated with general mechanisms for inference. The one
nontrivial issue that must be addressed in porting Word Grammar to the PLN representation
is the role of default inheritance in Word Grammar. This is covered in the following subsection.
The integration of activation spreading and default inheritance proposed in Word Gram-
mar, should be easily achievable within CogPrime assuming a functional attention allocation
subsystem.
44.14.3 Link Grammar Parsing vs Word Grammar Parsing
From a CogPrime /PLN point of view, perhaps the most striking original contribution of Word
Grammar is in the area of syntax parsing. Word Grammar's treatment of morphology and se-
mantics is, basically, exactly what one would expect from representing such things in a richly
structured semantic network. PLN adds much additional richness to Word Grammar via allowing nuanced representation of uncertainty, which is critical on every level of the linguistic
hierarchy - but this doesn't change the fundamental linguistic approach of Word Grammar.
Regarding syntax processing, however, Word Grammar makes some quite specific and unique
hypotheses, which if correct are very valuable contributions.
The conceptual assumption we make here is that syntax processing, while carried out using
generic cognitive processes for uncertain inference and activation spreading, also involves some
highly specific constraints on these processes. The extent to which these constraints are learned
versus inherited is as yet unknown; for the subtleties of this issue the reader is referred to the literature on innateness in language acquisition. Word Grammar and Link Grammar are then understood as embodying different
hypotheses regarding what these constraints actually are.
It is interesting to consider the contributions of Word Grammar to syntax parsing via com-
paring it to Link Grammar.
Note that Link Grammar, while a less comprehensive conceptual theory than Word Gram-
mar, has been used to produce a state-of-the-art syntax parser, which has been incorporated
into a number of other software systems including OpenCog. So it is clear that the Link Gram-
mar approach has a great deal of pragmatic value. On the other hand, it also seems clear that
Link Grammar has certain theoretical shortcomings. It deals with many linguistic phenomena
very elegantly, but there are other phenomena for which its approach can only be described as
"hacky."
Word Grammar contains fewer hacks than Link Grammar, but has not yet been put to the
test of large-scale computational implementation, so it's not yet clear how many hacks would
need to be added to give it the relatively broad coverage that Link Grammar currently has. Our
own impression is that to make Word Grammar actually work as the foundation for a broad-
coverage grammar parser (whether standalone, or integrated into a broader artificial cognition
framework), one would need to move it somewhat in the direction of link grammar, via adding
a greater number of specialized syntactic link types (more on this shortly). There are in fact
concrete indications of this in [Hud07a].
The Link Grammar framework may be decomposed into three aspects:
1. The link grammar dictionary, which for each word in English, contains a number of links of
different types. Some links point left, some point right, and each link is labeled. Furthermore,
some links are required and others are optional.
2. The "no-links-cross" constraint, which states that the correct parse of a sentence will involve
drawing links between words, in such a way that all the required links of each word are
fulfilled, and no two links cross when the links are depicted in two dimensions
3. A processing algorithm, which involves first searching the space of all possible linkages
among the words in a sentence to find all complete linkages that obey the no-links-cross
constraint; and then applying various postprocessing rules to handle cases (such as con-
junctions) that aren't handled properly by this algorithm
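The no-links-cross constraint of aspect 2 is simple to state in code: treating words as integer positions, two links cross iff one starts strictly inside the other and ends strictly outside it. A minimal sketch:

```python
# The no-links-cross constraint in code: with words at integer positions,
# links (i, j) and (k, l) cross iff one starts strictly inside the other
# and ends strictly outside it.

def links_cross(a, b):
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j

def linkage_is_planar(links):
    return not any(links_cross(a, b)
                   for idx, a in enumerate(links)
                   for b in links[idx + 1:])

# "the dog ran": the-dog (D link), dog-ran (S link) -- planar
assert linkage_is_planar([(0, 1), (1, 2)])
# links 0-2 and 1-3 cross
assert not linkage_is_planar([(0, 2), (1, 3)])
```

Nested links (e.g. 0-3 enclosing 1-2) do not count as crossing, which is what lets planar linkages express embedded phrases.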
In PROWL, what we suggest is that
1. The link grammar dictionary is highly valuable and provides a level of linguistic detail that
is not present in Word Grammar; and, we suggest that in order to turn Word Grammar
into a computationally tractable system, one will need something at least halfway between
the currently minimal collection of syntactic link types used in Word Grammar and the
much richer collection used in Link Grammar
2. The no-links-cross constraint is an approximation of a deeper syntactic constraint ("land-
mark transitivity") that has been articulated in the most recent formulations of Word Gram-
mar. Specifically: when a no-links-crossing parse is found, it is correct according to Word
Grammar; but Word Grammar correctly recognizes some parses that violate this constraint
3. The Link Grammar parsing algorithm is not cognitively natural, but is effective in a
standalone-parsing framework. The Word Grammar approach to parsing is cognitively
natural, but as formulated could only be computationally implemented in the context of
an already-very-powerful general intelligence system. Fortunately, various intermediary ap-
proaches to parsing seem possible.
44.14.3.1 Using Landmark Transitivity with the Link Grammar Dictionary
An earlier version of Word Grammar utilized a constraint called "no tangled links" which is
equivalent to the link parser's "no links cross" constraint. In the new version of Word Grammar
this is replaced with a subtler and more permissive constraint called "landmark transitivity."
While in Word Grammar. landmark transitivity is used with a small set of syntactic link types,
there is no reason why it can't be used with the richer set of link types that Link Grammar
provides. In fact, this seems to us a probably effective method of eliminating most or all of the
"postprocessing rules" that exist in the link parser, and that constitute the least elegant aspect
of the Link Grammar framework.
The first foundational concept, on the path to the notion of landmark transitivity, is the
notion of a syntactic parent. In Word Grammar each syntactic link has a parent end and a
child end. In a dependency grammar context, the notion is that the child depends upon the
parent. For instance, in Word Grammar, in the link between a noun and an adjective, the noun
is the parent.
To apply landmark transitivity in the context of the Link Grammar, one needs to provide
some additional information regarding each link in the Link Grammar dictionary. One needs
to specify which end of each of the link grammar links is the "parent" and which is the "child."
Examples of this kind of markup are as follows (with (P) marking the parent):
S link: subject-noun, finite verb (P)
O link: transitive verb (P), direct or indirect object
D link: determiner, noun (P)
MV link: verb (P), verb modifier
J link: preposition, object (P)
ON link: on, time-expression (P)
M link: noun (P), modifiers
In some cases a word may have more than one parent. In this case, the rule is that the landmark
is the one that is superordinate to all the other parents. In the rare case that two words are
each other's parents, then either may serve as the landmark.
The concept of a parent leads naturally into that of a landmark. The first rule regarding
landmarks is that a parent is a landmark for its child. Next, two kinds of landmarks are in-
troduced: Before landmarks (in which the child is before the parent) and After landmarks (in
which the child is after the parent). The Before/After distinction should be obvious in the Link
Grammar examples given above.
The landmark transitivity rule, then, has two parts. If A is a landmark for B, of subtype L
(where L is either Before or After), then
1. Subordinate transitivity says that if B is a landmark for C, then A is also a type-L
landmark for C
2. Sister transitivity says that if A is a landmark for C, then B is also a landmark for C
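The subordinate-transitivity rule can be sketched as a fixpoint closure over (landmark, word, subtype) facts, rejecting any linkage in which some pair ends up as both a Before and an After landmark; the sister rule is omitted for brevity, and the examples are invented:

```python
# Sketch of parse checking via landmark transitivity: close the directly
# asserted (landmark, word, 'Before'|'After') facts under the subordinate
# transitivity rule, and reject any linkage in which some pair ends up as
# both a Before and an After landmark. The sister rule is omitted for
# brevity; examples are invented.

def close_landmarks(facts):
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, b, l) in list(closed):
            for (b2, c, _) in list(closed):
                if b2 == b and (a, c, l) not in closed:
                    closed.add((a, c, l))   # A is a type-L landmark for C
                    changed = True
    return closed

def consistent(facts):
    closed = close_landmarks(facts)
    return not any((a, b, "After") in closed
                   for (a, b, l) in closed if l == "Before")

# "she works": the parent "works" is a Before landmark for "she"
assert consistent({("works", "she", "Before")})
# contradiction via transitivity: a-Before-c is inherited, a-After-c asserted
assert not consistent({("a", "b", "Before"), ("b", "c", "Before"),
                       ("a", "c", "After")})
```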
Finally, there are some special link types that cause a word to depend on its grandparents or
higher ancestors as well as its parents. Note that these are not treated thoroughly in (Hudson, 2007); one needs to look to the earlier, longer and rarer work [Hud90]. Some questions are dealt with this way. Another example is what in Word Grammar is called a "proxy link", as occurs between "with" and "whom" in
The person with whom she works
The link parser deals with this particular example via a Jw link
[link parse diagram for "LEFT-WALL the person.n with whom she works.v is.v silly.a", in which "with" is linked to "whom"]
so to apply landmark transitivity in the context of the Link Grammar, in this case, it seems
one would need to implement the rule that in the case of two words connected by a Jw-link,
the child of one of the words is also the child of the other. Handling other special cases like
this in the context of Link Grammar seems conceptually unproblematic, though naturally some
hidden rocks may appear. Basically a list needs to be made of which kinds of link parser links
embody proxy relationships for which other kinds of link parser links.
According to the landmark transitivity approach, then, the criterion for syntactic correctness of a parse is that, if one takes the links in the parse and applies the landmark transitivity rule (along with the other special-case "raising" rules we've discussed), one does not arrive at any contradictions (i.e. no situations where A is both a Before landmark and an After landmark of B).
The main problem with the landmark-transitivity constraint seems to be computational
tractability. The problem exists for both comprehension and generation, but we'll focus on
comprehension here.
To find all possible parses of a sentence using Hudson's landmark-transitivity-based approach,
one needs to find all linkages that don't lead to contradictions when used as premises for reason-
ing based on the landmark-transitivity axioms. This appears to be extremely computationally
intensive! So, it seems that Word Grammar style parsing is only computationally feasible for
a system that has extremely strong semantic understanding, so as to be able to filter out the
vast majority of possible parses on semantic rather than purely syntactic grounds.
On the other hand, it seems possible to apply landmark-transitivity together with no-links-
cross, to provide parsing that is both efficient and general. If applying the no-links-cross con-
straint finds a parse in which no links cross, without using postprocessing rules, then this will
always be a legal parse according to the landmark-transitivity rule.
However, landmark-transitivity also allows a lot of other parses that link grammar either
needs postprocessing rules to handle, or can't find even with postprocessing rules. So, it would
make sense to apply no-links-cross parsing first, but then if this fails, apply landmark-transitivity
parsing starting from the partial parses that the former stage produced. This is the approach
suggested in PROWL, and a similar approach may be suggested for language generation.
44.14.3.2 Overcoming the Current Limitations of Word Grammar
Finally, it is worth noting that expanding the Word Grammar parsing framework to include
the link grammar dictionary, will likely allow us to solve some unsolved problems in Word
Grammar. For instance, II lud0iaj notes that the current formulation of Word Grammar has no
way to distinguish the behavior of last vs. this in
I ate last night
I ate this ham
The issue he sees is that in the first case, night should be considered the parent of last; whereas
in the second case, this should be considered the parent of ham.
The current link parser also fails to handle this issue according to Hudson's intuition:
[link parse diagrams for "LEFT-WALL I.p ate.v last.a night.t" and "LEFT-WALL I.p ate.v this.d ham.n ."]
However, the link grammar framework gives us a clear possibility for allowing the kind of
interpretation Hudson wants: just allow this to take a left-going O-link, and (in PROWL) let it optionally assume the parent role when involved in a D-link relationship. There are no funky link-crossing or semantic issues here; just a straightforward link-grammar dictionary edit.
This illustrates the syntactic flexibility of the link parsing framework, and also its inelegance: adding new links to the dictionary generally solves syntactic problems, but at the cost of creating more complexity to be dealt with further down the pipeline, when the various link
types need to be compressed into a smaller number of semantic relationship types for purposes
of actual comprehension (as is done in RelEx, for example). However, as far as we can tell, this
seems to be a necessary cost for adequately handling the full complexity of natural language
syntax. Word Grammar holds out the hope of possibly avoiding this kind of complexity, but
without filling in enough details to allow a clear estimate of whether this hope can ever be
fulfilled.
44.14.4 Contextually Guided Greedy Parsing and Generation Using
Word Link Grammar
Another difference between Link Grammar as currently utilized, and Word Grammar as described, is the nature of the parsing algorithm. Link Grammar operates in a manner that is
fairly traditional among contemporary parsing algorithms: given a sentence, it produces a large
set of possible parses, and then it is left to other methods/algorithms to select the right parse,
and to form a semantic interpretation of the selected parse. Parse selection may of course involve
semantic interpretation: one way to choose the right parse is to choose the one that has the
most contextually sensible semantic interpretation. We may call this approach whole-sentence
purely-syntactic parsing, or WSPS parsing.
One of the nice things about Link Grammar, as compared to many other computational
parsing frameworks, is that it produces a relatively small number of parses, compared for
instance to typical head-driven phrase-structure grammar parsers. For simple sentences the
link parser generally produces only a handful of parses. But for complex sentences the link parser
can produce hundreds of parses, which can be computationally costly to sift through.
412 44 Natural Language Comprehension
Word Grammar, on the other hand, presents far fewer constraints regarding which words may
link to other words. Therefore, to apply parsing in the style of the current link parser, in the
context of Word Grammar, would be completely infeasible. The number of possible parses would
be tremendous. The idea of Word Grammar is to pare down parses via semantic/pragmatic
sensibleness, during the course of the syntax parsing process, rather than breaking things down
into two phases (parsing followed by semantic/pragmatic interpretation). Parsing is suggested
to proceed forward through a sentence: when a word is encountered, it is linked to the words
coming before it in the sentence, in a way that makes sense. If this proves impossible, consistent
with the links that have already been drawn in the course of the parsing process, then some
backtracking is done and prior choices may be revisited. This approach is more like what
humans do when parsing a sentence, and does not have the effect of producing a large number
of syntactically possible, semantically/pragmatically absurd parses, and then sorting through
them afterwards. It is what we call a contextually-guided greedy parsing (CGGP) approach.
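To make the CGGP idea concrete, here is a minimal sketch in Python. The toy lexicon, the inventory of allowed part-of-speech pairings, and the plausibility ordering (prefer near attachments first) are all illustrative assumptions, not part of Link Grammar or Word Grammar:

```python
# A minimal sketch of contextually-guided greedy parsing (CGGP): proceed
# forward through the sentence, link each word to an earlier word that makes
# sense, and backtrack when no consistent choice exists. The lexicon and
# allowed pairings below are toy assumptions for illustration only.

ALLOWED = {("verb", "noun"), ("noun", "det"), ("verb", "adv")}
LEXICON = {"the": "det", "dog": "noun", "ran": "verb", "quickly": "adv"}

def cggp_parse(words, pos=0, links=None):
    """Link each word to some earlier word; backtrack if no choice works."""
    links = links or []
    if pos == len(words):
        return links                      # every word after the first is linked
    if pos == 0:
        return cggp_parse(words, 1, links)
    for prev in range(pos - 1, -1, -1):   # prefer near attachments first
        a, b = LEXICON[words[prev]], LEXICON[words[pos]]
        if (a, b) in ALLOWED or (b, a) in ALLOWED:
            result = cggp_parse(words, pos + 1, links + [(prev, pos)])
            if result is not None:        # success downstream; keep this choice
                return result
            # otherwise backtrack: try attaching this word elsewhere
    return None

print(cggp_parse(["the", "dog", "ran", "quickly"]))
# → [(0, 1), (1, 2), (2, 3)]
```

If a chosen attachment cannot be extended to a complete parse, the recursive call returns `None` and the loop tries the next earlier word, mirroring the revisiting of prior choices described above.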
For language generation, the link parser and Word Grammar approaches also suggest different
strategies. Link Grammar suggests taking a semantic network, then searching holistically for
a linear sequence of words that, when link-parsed, would give rise to that semantic network
as the interpretation. On the other hand, Word Grammar suggests taking that same semantic
network and iterating through it progressively, verbalizing each node of the network as one
walks through it, and backtracking if one reaches a point where there is no way to verbalize the
current node consistently with how one has already verbalized the previous nodes.
The main observation we want to make here is that Word Grammar, by its nature (due to
the relative paucity of explicit constraints on which syntactic links may be formed), can
operate with CGGP but not WSPS parsing. On the other hand, while Link Grammar is cur-
rently utilized with WSPS parsing, there is no reason one can't use it with CGGP parsing just
as well. There is no objection to using CGGP parsing together with the link-parser dictionary,
nor with the no-links-cross constraint rather than the landmark-transitivity constraint (in fact,
as noted above, earlier versions of Word Grammar made use of the no-links-cross constraint).
What we propose in PROWL is to use the link grammar dictionary together with the CGGP
parsing approach. The WSPS parsing approach may perhaps be useful as a fallback for handling
extremely complex and perverse sentences where CGGP takes too long to come to an answer
- it corresponds to sentences that are so obscure one has to do really hard, analytical thinking
to figure out what they mean.
Regarding constraints on link structure, the suggestion in PROWL is to use the no-links-
cross constraint as a first approximation. In comprehension, if no sufficiently high-probability
interpretation obeying the no-links-cross constraint is found, then the scope of investigation
should expand to include link-structures obeying landmark-transitivity but violating no-links-
cross. In generation, things are a little subtler: a list should be kept of link-type combinations
that often correctly violate no-links-cross, and when these combinations are encountered in the
generation process, then constructs that satisfy landmark-transitivity but not no-links-cross
should be considered.
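The no-links-cross constraint used as the first approximation above is easy to operationalize. Here is a small Python sketch, with links represented as pairs of word indices; this representation is our assumption, not part of the link parser's API:

```python
# A sketch of the no-links-cross (projectivity) constraint: links are
# (left, right) word-index pairs, and two links cross when one starts strictly
# inside the other and ends strictly outside it.

def crosses(l1, l2):
    (a, b), (c, d) = sorted(l1), sorted(l2)
    return (a < c < b < d) or (c < a < d < b)

def obeys_no_links_cross(links):
    return not any(crosses(links[i], links[j])
                   for i in range(len(links))
                   for j in range(i + 1, len(links)))

print(obeys_no_links_cross([(0, 1), (1, 2)]))   # adjacent links: True
print(obeys_no_links_cross([(0, 2), (1, 3)]))   # (1,3) crosses (0,2): False
```

A comprehension pipeline could first search only link-structures passing this check, and widen to landmark-transitive structures when that search fails, as suggested above.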
Arguably, the PROWL approach is less elegant than either Link Grammar or Word Gram-
mar considered on its own. However, we are dubious of the proposition that human syntax
processing, with all its surface messiness and complexity, is really generated by a simple, uni-
fied, mathematically elegant underlying framework. Our goal is not to find a maximally elegant
theoretical framework, but rather one that works both as a standalone computational-linguistics
system, and as an integrated component of an adaptively-learning AGI system.
44.15 Aspects of Language Learning
Now we finally turn to language learning - a topic that spans the engineered and experiential
approaches to NLP. In the experiential approach, learning is required to gain even simple lin-
guistic functionality. In the engineered approach, even if a great deal of linguistic functionality
is built in, learning may be used for adding new functionality and modifying the initially given
functionality. In this section we will focus on a few aspects of language learning that would be
required even if the current engineered OpenCog comprehension pipeline were completed to a
high level of functionality. The more thoroughgoing language learning required for the expe-
riential approach will then be discussed in the following section. Further, Chapter 45 will dig
in depth into an aspect of language learning that to some extent cuts across the engineered/-
experiential dichotomy - unsupervised learning of linguistic structures from large corpora of
text.
44.15.1 Word Sense Creation
In our examples above, we've frequently referred to ReferenceLinks between WordNodes and
ConceptNodes. But, how do these links get built? One aspect of this is the process of word
sense creation.
Suppose we have a WordNode W that has ReferenceLinks to a number of different Con-
ceptNodes. A common case is that these ConceptNodes fall into clusters, each one denoting a
"sense" of the word. The clusters are defined by the following relationships:
1. ConceptNodes within a cluster have high-strength SimilarityLinks to each other
2. ConceptNodes in different clusters have low-strength (i.e. dissimilarity-denoting) Similar-
ityLinks to each other
When a word is first learned, it will normally be linked only to mutually agreeable ConceptN-
odes, i.e. there will only be one sense of the word. As more and more instances of the word
are seen, however, eventually the WordNode will gather more than one sense. Sometimes dif-
ferent senses are different syntactically, other times they are different only semantically, but
are involved in the same syntactic relationships. In the case of a word with multiple senses,
most of the relevant feature structure information will be attached to word-sense-representing
ConceptNodes, not to WordNodes themselves.
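The clustering criteria above can be sketched as follows. The similarity strengths, threshold, and node names are illustrative assumptions; a real implementation would operate on SimilarityLinks in the Atomspace:

```python
# A sketch of word-sense creation by clustering: ConceptNodes linked to one
# WordNode are grouped so that high-strength SimilarityLinks fall within a
# cluster and low-strength ones fall between clusters. The similarity table
# and threshold are illustrative assumptions.

SIM = {
    ("bank/river", "bank/shore"): 0.9,
    ("bank/money", "bank/loan"): 0.8,
    ("bank/river", "bank/money"): 0.1,
    ("bank/river", "bank/loan"): 0.1,
    ("bank/shore", "bank/money"): 0.1,
    ("bank/shore", "bank/loan"): 0.1,
}

def sim(a, b):
    return SIM.get((a, b)) or SIM.get((b, a)) or 0.0

def cluster_senses(nodes, threshold=0.5):
    """Greedy single-link clustering: each cluster is one word sense."""
    clusters = []
    for n in nodes:
        for c in clusters:
            if any(sim(n, m) >= threshold for m in c):
                c.append(n)
                break
        else:
            clusters.append([n])
    return clusters

print(cluster_senses(["bank/river", "bank/shore", "bank/money", "bank/loan"]))
# → [['bank/river', 'bank/shore'], ['bank/money', 'bank/loan']]
```

Each resulting cluster would then get its own sense-representing ConceptNode, to which the relevant feature structure information can be attached.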
The formation of sense-representing ConceptNodes may be done by the standard clustering
and predicate mining processes, which will create such ConceptNodes when there are adequately
many Atoms in the system satisfying the relevant criteria. It may also be valuable to create a
particular SenseMining CIM-Dynamic, which uses the same criteria for node formation as the
clustering and predicate mining CIM-Dynamics, but focuses specifically on creating predicates
related to WordNodes and their nearby ConceptNodes.
44.15.2 Feature Structure Learning
We've mentioned above the obvious fact that, to intelligently use a feature-structure based
grammar, the system needs to be capable of learning new linguistic feature structures. Probing
into this in more detail, we see that there are two distinct but related kinds of feature structure
learning:
1. learning the values that features have for particular word senses.
2. learning new features altogether.
Learning the values that features have for particular word senses must be done when new
senses are created; and even for features imported from resources like the link grammar, the
possibility of corrections must obviously be accepted. This kind of learning can be done by
straightforward inference - inference from examples of word usage, and by analogy from features
for similar words. A simple example to think about, e.g., is learning the verb sense of "fax" when
only the noun sense is known.
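As a cartoon of this kind of analogical inference, the following sketch guesses verb-sense feature values for a word like "fax" by majority vote over the verb senses of analogous words. The feature tables and feature names are purely illustrative assumptions, not RelEx or Link Grammar content:

```python
# A sketch of learning feature values for a new word sense by analogy: to
# guess the verb features of "fax" (known only as a noun), vote over the verb
# features of words whose noun senses resemble "fax". All entries here are
# toy assumptions for illustration.

KNOWN = {
    # word: {sense: {feature: value}}
    "mail":  {"noun": {"countable": False}, "verb": {"transitive": True, "takes_recipient": True}},
    "email": {"noun": {"countable": True},  "verb": {"transitive": True, "takes_recipient": True}},
    "phone": {"noun": {"countable": True},  "verb": {"transitive": True, "takes_recipient": False}},
}

def infer_verb_features(analogs):
    """Majority-vote each feature value across the verb senses of analogs."""
    votes = {}
    for w in analogs:
        for feat, val in KNOWN[w]["verb"].items():
            votes.setdefault(feat, []).append(val)
    return {f: max(set(vs), key=vs.count) for f, vs in votes.items()}

print(infer_verb_features(["mail", "email", "phone"]))
# → {'transitive': True, 'takes_recipient': True}
```

In CogPrime this corresponds to simple PLN analogy over similar words' feature structures, rather than an explicit voting table.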
Next, the learning of new features can be viewed as a reasoning problem, in that inference
can learn new relations applied to nodes representing syntactic senses of words. In principle,
these "features" may be very general or very specialized, depending on the case. New feature
learning, in practice, requires a lot of examples, and is a more fundamental but less common
kind of learning than learning feature values for known word senses. A good example would be
the learning of "third person" by an agent that knows only first and second person.
In this example, it's clear that information from embodied experience would be extremely
helpful. In principle, it could be learned from corpus analysis alone - but the presence of knowledge
that certain words ("him", "her", "they", etc.) tend to occur in association with observed
agents different from the speaker or the hearer, would certainly help a lot with identifying
"third person" as a separate construct. It seems that either a very large number of un-embodied
examples or a relatively small number of embodied examples would be needed to support the
inference of the "third person" feature. And we suspect this example is typical - i.e. that the
most effective route to new feature structure learning involves both embodied social experience
and rather deep commonsense knowledge about the world.
44.15.3 Transformation and Semantic Mapping Rule Learning
Word sense learning and feature structure learning are important parts of language learning,
but they're far from the whole story. An equally important role is played by linguistic trans-
formations, such as the rules used in RelEx and RelEx2Frame. At least some of these must be
learned based on experience, for human-level intelligent language processing to proceed.
Each of these transformations can be straightforwardly cast as an ImplicationLink between
PredicateNodes, and hence formalistically can be learned by PLN inference, combined with one
or another heuristic methods for compound predicate creation. The question is what knowledge
exists for PLN to draw on in assessing the strengths of these links, and more critically, to guide
the heuristic predicate formation methods. This is a case that likely requires the full complexity
of "integrative predicate learning" as discussed in Chapter 41. And, as with feature structure
learning, it's a case that will be much more effectively handled using knowledge from social
embodied experience alongside purely linguistic knowledge.
44.16 Experiential Language Learning
We have talked a great deal about "engineered" approaches to NL comprehension and only
peripherally about experiential approaches. But there has been a not-so-secret plan underlying
this approach. There are many approaches to experiential language learning, ranging from a
"tabula rasa" approach in which language is just treated as raw data, to an approach where the
whole structure of a language comprehension system is programmed in, and "merely" the content
remains to be learned. There isn't much to say about the tabula rasa approach - we have already
discussed CogPrime's approach to learning, and in principle it is just as applicable to language
learning as to any other kind of learning. The more structured approach has more unique aspects
to it, so we will turn attention to it here. Of course, various intermediate approaches may be
constructed by leaving out various structures.
The approach to experiential language learning we consider most promising is based on
the PROWL approach, discussed above. In this approach one programs in a certain amount of
"universal grammar," and then allows the system to learn content via experience that obeys this
universal grammar. In a PROWL approach, the basic linguistic representational infrastructure
is given by the Atomspace that already exists in OpenCog, so the content of "universal grammar"
is basically
• the propensity to identify words
• the propensity to create a small set of asymmetric (i.e. parent/child) labeled relationship
types, to use to label relationships between semantically related word-instances. These are
"syntactic link types."
• the set of constraints on syntactic links implicit in word grammar, e.g. landmark transitivity
or no-links-cross
Building in the above items, without building in any particular syntactic links, seems enough
to motivate a system to learn a grammar resembling that of human languages.
Of course, experiential language learning of this nature is very, very different from "tabula
rasa" experiential language learning. But we note that, while PROWL style experiential lan-
guage learning seems like a difficult problem given existing AI technologies, tabula rasa language
learning seems like a nearly unapproachable problem. One could infer from this that current AI
technologies are simply inadequate to approach the problem that the young human child mind
solves. However, there seems to be some solid evidence that the young human child mind does
contain some form of universal grammar guiding its learning. Though we don't yet know what
form this universal prior linguistic knowledge takes in the human mind or brain, the evidence
regarding common structures arising spontaneously in various unrelated Creole languages is
extremely compelling, supporting ideas presented previously based on different lines
of evidence. So we suggest that PROWL based experiential language learning is actually con-
ceptually closer to human child language learning than a tabula rasa approach - although we
certainly don't claim that the PROWL based approach builds in the exact same things as the
human genome does.
What we need to make experiential language learning work, then, is a language-focused
inference-control mechanism that includes, e.g.
• a propensity to look for syntactic link types, as outlined just above
• a propensity to form new word senses, as outlined earlier
• a propensity to search for implications of the general form of RelEx and RelEx2Frame or
Syn2Sem rules
Given these propensities, it seems reasonable to expect a PLN inference system to be able to
"fill in the linguistic content" based on its experience, using links between linguistic and other
experiential content as its guide. This is a very difficult learning problem, to be sure, but it
seems in principle a tractable one, since we have broken it down into a number of interrelated
component learning problems in a manner guided by the structure of language.
Other aspects of language comprehension, such as word sense disambiguation and anaphor
resolution, seem to plausibly follow from applying inference to linguistic data in the context of
embodied experiential data, without requiring especial attention to inference control or supply-
ing prior knowledge.
Chapter ?? presents an elaboration of this sort of perspective, in a limited case which enables
greater clarity: the learning of linguistic content from an unsupervised corpus, based on the
assumption of the linguistic infrastructure just summarized above.
44.17 Which Path(s) Forward?
We have discussed a variety of approaches to achieving human-level NL comprehension in the
CogPrime framework. Which approach do we think is best? All things considered, we suspect
that a tabula rasa experiential approach is impractical, whereas a traditional computational
linguistics approach (whether based on hand-coded rules, corpus analysis, or a combination
thereof) will reach an intelligence ceiling well short of human capability. On the other hand we
believe that all of these options
1. the creation of an engineered NL comprehension system (as we have already done), and
the adaptation and enhancement of this system using learning that incorporates knowledge
from embodied experience
2. the creation of an engineered NL comprehension system via unsupervised learning from a
large corpus, as described in Chapter ?? below
3. the creation of an experiential learning based NL comprehension system using in-built
structures, such as the PROWL based approach described above
4. the creation of an experiential learning based system as described above, using an engineered
system (like the current one) as a "fitness estimation" resource in the manner described at
the end of Chapter 43
have significant promise and are worthy of pursuit. Which of these approaches we focus on in
our ongoing OpenCogPrime implementation work will depend on logistical issues as much as
on theoretical preference.
Chapter 45
Language Learning via Unsupervised Corpus
Analysis
Co-authored with Linas Vepstas
45.1 Introduction
The approach taken to NLP in the OpenCog project up through 2013, in practice, has involved
engineering and integrating rule-based NLP systems as "scaffolding", with a view toward later
replacing the rule content with alternative content learned via an OpenCog system's experience.
In this chapter we present a variant on this approach, in which the rule content of the
existing rule-based NLP system is replaced with new content learned via unsupervised corpus
analysis. This content can then be modified and improved via an OpenCog system's experience,
embodied and otherwise, as needed.
This unsupervised corpus analysis based approach deviates fairly far from human cogni-
tive science. However, as discussed above, language processing is one of those areas where the
pragmatic differences between young humans and early-stage AGI systems may be critical to
consider. The automated learning of language from embodied, social experience is a key part of
the path to AGI, and is one way that CogPrimes and other AGI systems should learn language.
On the other hand, unsupervised corpus-based language learning may perhaps also have a
significant role to play in the path to linguistically savvy AGI, leveraging some advantages that
AGIs have that humans do not, such as direct access to massive amounts of online text (without
the need to filter the text through slow-paced sense-perception systems like eyes).
The learning of language from unannotated text corpora is not a major pursuit within the
computational linguistics community currently. Supervised learning of linguistic structures from
expert-annotated corpora plays a large role, but this is a wholly different sort of pursuit, more
analogous to rule-based NLP, in that it involves humans explicitly specifying formal linguistic
structures (e.g. parse trees for sentences in a corpus). However, we hypothesize that unsuper-
vised corpus-based language learning can be carried out by properly orchestrating the use of
some fairly standard machine learning algorithms (already included in OpenCog / CogPrime),
within an appropriate structured framework (such as OpenCog's current NLP framework).
Dr. Vepstas would properly be listed as the first author of this chapter; this material was developed in a
collaboration between Vepstas and Goertzel. However, as with all the co-authored chapters in this book, final
responsibility for any flaws in the presentation of the material lies with Ben Goertzel, the chief author of the
book.
The review of [KM04] provides a summary of the state of the art in automatic grammar
induction (the third alternative listed above), as it stood a decade ago: it addresses a number
of linguistic issues and difficulties that arise in actual implementations of algorithms. It
is also notable in that it builds a bridge between phrase-structure grammars and dependency
grammars, essentially pointing out that these are more or less equivalent, and that, in fact, sig-
nificant progress can be achieved by taking on both points of view at once. Grammar induction
has progressed somewhat since this review was written, and we will mention some of the more
recent work below; yet it is fair to say that there has been no truly dramatic progress in
this direction.
In this chapter we describe a novel approach to achieving automated grammar induction, i.e.
to machine learning of linguistic content from a large, unannotated text corpus. The methods
described may also be useful for language learning based on embodied experience; and may make
use of content created using hand-coded rules or machine learning from annotated corpora. But
our focus in this chapter will be on learning linguistic content from a large, unannotated text
corpus.
The algorithmic approach given in this chapter is wholly in the spirit of the "PROWL"
approach reviewed above in Chapter 44. However, PROWL is a quite general idea. Here we
present a highly specific PROWL-like algorithm, which is focused on learning from a large
unannotated corpus rather than from embodied experience. Because of the corpus-oriented
focus, it is possible to tie the algorithm of this chapter in with the statistical language learning
literature, more tightly than is possible with PROWL language learning in general. Yet, the
specifics presented here could largely be generalized to a broader PROWL context.
We consider the approach described here as "deep learning" oriented because it is based on
hierarchical pattern recognition in linguistic data: identifying patterns, then patterns among
these patterns, etc., in a hierarchy that allows "higher level" (more abstract) patterns to feed
back down the hierarchy and affect the recognition of lower level patterns. Our approach does not
use conventional deep learning architectures like Deep Boltzmann machines or recurrent neural
networks. Conceptually, our approach is based on a similar intuition to these algorithms, in that
it relies on the presence of hierarchical structure in its input data, and utilizes a hierarchical
pattern recognition structure with copious feedback to adaptively identify this hierarchical
structure. But the specific pattern recognition algorithms we use, and the specific nature of the
hierarchy we construct, are guided by existing knowledge about what works and what doesn't
in (both statistical and rule-based) computational linguistics.
While the overall approach presented here is novel, most of the detailed ideas are extensions
and generalizations of the prior work of multiple authors, which will be referenced and in some
cases discussed below. In our view, the body of ideas needed to enable unsupervised learning of
language from large corpora has been gradually emerging during the last decade. The approach
given here has unique aspects, but also many aspects already validated by the work of others.
For the sake of simplicity, we will deal here only with learning from written text. We believe
that conceptually very similar methods can be applied to spoken language as well, but this brings
extra complexities that we will avoid for the purpose of the present document. (In short: Below
we represent syntactic and semantic learning as separate but similarly structured and closely
coupled learning processes. To handle speech input thoroughly, we would suggest phonological
learning as another separate, similarly structured and closely coupled learning process.)
Finally, we stress that the algorithms presented here are intended to be used in conjunction
with a large corpus, and a large amount of processing power. Without a very large corpus, some
of the feedbacks required for the learning process described would be unlikely to happen (e.g.
the ability of syntactic and semantic learning to guide each other). We have not yet sought
to estimate exactly how large a corpus would be required, but our informal estimate is that
Wikipedia might or might not be large enough, and the Web is certainly more than enough.
We don't pretend to know just how far this sort of unsupervised, corpus based learning can
be pushed. To what extent can the content of a natural language like English be learned this
way? How much, if any, ambiguity will be left over once this kind of learning has been thoroughly
done - only pragmatically disambiguable via embodied social learning? Strong opinions on these
sorts of issues abound in the cognitive science, linguistics and AI communities; but the only
apparent way to resolve these questions is empirically.
45.2 Assumed Linguistic Infrastructure
While the approach outlined in this chapter aims to learn the linguistic content of a language
from textual data, it does not aim to learn the idea of language. Implicitly, we assume a model
in which a learning system begins with a basic "linguistic infrastructure" indicating the various
parts of a natural language and how they generally interrelate; and it then learns the linguistic
content characterizing a particular language. In principle, it would also be possible to have an
AI system learn the very concept of a language and build its own linguistic infrastructure.
However, that is not the problem we address here; and we suspect such an approach would
require drastically more computational resources.
The basic linguistic infrastructure assumed here includes:
• A formalism for expressing grammatical (dependency) rules is assumed.
- The ideas given here are not tied to any specific grammatical formalism, but as in
Chapter ?? we find it convenient to make use of a formalism in the style of dependency
grammars [Tes59]. Taking a mathematical perspective, different grammar formalisms can
be translated into one another, using relatively simple rules and algorithms [KM04]. The
primary difference between them is more a matter of taste, perceived linguistic 'natural-
ness', adaptability, and choice of parser algorithm. In particular, categorial grammars
can be converted into link grammars in a straightforward way, and vice versa, but link
grammars provide a more compact dictionary. Link grammars [ST91, ST93] are a type
of dependency grammar; these, in turn, can be converted to and from phrase-structure
grammars. We believe that dependency grammars provide a simpler and more natural
description of linguistic phenomena. We also believe that dependency grammars have a
more natural fit with maximum-entropy ideas, where a dependency relationship can be
literally interpreted as the mutual information between word-pairs [Yur98]. Dependency
grammars also work well with Markov models; dependency parsers can be implemented
as Viterbi decoders. Figure 44.1 illustrates two different formalisms.
- The discussion below assumes the use of a formalism similar to that of Link Grammar,
as described above. In this theory, each word is associated with a set of 'connector
disjuncts', each connector disjunct controlling the possible linkages that the word may
take part in. A disjunct can be thought of as a jig-saw puzzle-piece; valid syntactic word
orders are those for which the puzzle-pieces can be validly connected. A single connector
can be thought of as a single tab on a puzzle-piece (shown in figure ??). Connectors are
thus 'types' X with a + or - sign indicating that they connect to the left or right. For
example, a typical verb disjunct might be S- & O+, indicating that a subject (a noun)
is expected on the left, and an object (also a noun) is expected on the right.
- Some of the discussion below assumes select aspects of (Dick Hudson's) Word Grammar [Hud84,
Hud07]. As reviewed above, Word Grammar theory (implicitly) uses connectors simi-
lar to those of Link Grammar, but allows each connector to be marked as the head of
a link or not. A link then becomes an arrow from a head word to the dependent word.
(Somewhat confusingly, the head of the arrow points at the dependent word; this means
the tail of the arrow is attached to the head word).
- Each word is associated with a "lexical entry"; in Link Grammar, this is the set of
connector disjuncts for that word. It is usually the case that many words share a common
lexical entry; for example, most common nouns are syntactically similar enough that
they can all be grouped under a single lexical entry. Conversely, a single word is allowed
to have multiple lexical entries; so, for example, "saw", the noun, will have a different
lexical entry from "saw", the past tense of the verb "to see". That is, lexical entries can
loosely correspond to traditional dictionary entries. Whether or not a word has multiple
lexical entries is a matter of convenience, rather than a fundamental aspect. Curiously,
a single Link Grammar connector disjunct can be viewed as a very fine-grained
part-of-speech. In this way, it is a stepping stone to the semantic meaning of a word.
• A parser, for extracting syntactic structure from sentences, is assumed. What's more, it is
assumed that the parser is capable of using semantic relationships to guide parsing.
- A paradigmatic example of such a parser is the "Viterbi Link Parser", currently under
development for use with the Link Grammar. This parser is currently operational in a
simple form. The name refers to its use of the general ideas of the Viterbi algorithm.
This algorithm seems biologically plausible, in that it applies only a local analysis of
sentence structure, of limited scope, as opposed to a global optimization, thus roughly
emulating the process of human listening. The current set of legal parses of a sentence is
pruned incrementally and probabilistically, based on flexible criteria. These potentially
include the semantic relationships extractable from the partial parse obtained at a given
point in time. It also allows for parsing to be guided by inter-sentence relationships, such
as pronoun resolution, to disambiguate otherwise ambiguous sentences.
• A formalism for expressing semantic relationships is assumed.
- A semantic relationship generalizes the notion of a lexical entry, to allow for changes
of word order, paraphrasing, tense, number, the presence or absence of modifiers, etc.
An example of such a relationship would be eat(X, Y) - indicating the eating of some
entity Y by some entity X. This abstracts into common form several different syntac-
tic expressions: "Ben ate a cookie", "A cookie will be eaten by Ben", "Ben sat, eating
cookies".
- Nothing particularly special is assumed here regarding semantic relationships, beyond a
basic predicate-argument structure. It is assumed that predicates can have arguments
that are other predicates, and not just atomic terms; this has an explicit impact on how
predicates and arguments are represented. A "semantic representation" of a sentence is
a network of arrows (defining predicates and arguments), each arrow or a small subset
of arrows defining a "semantic relationship". However, the beginning or end of an arrow
is not necessarily a single node, but may land on a subgraph.
- Type constraints seem reasonable, but it's not clear if these must be made explicit, or
if they are the implicit result of learning. Thus, eat(X, Y) requires that X and Y both
be entities, and not, for example, actions or prepositions.
- We have not yet thought through exactly how rich the semantic formalism should be for
handling the full variety of quantifier constructs in complex natural language. But we
suspect that it's OK to just use basic predicate-argument relationships and not build
explicit quantification into the formalism, allowing quantifiers to be treated like other
predicates.
- Obviously, CogPrime's formalism for expressing linguistic structures in terms of Atoms,
presented in Chapter 44, fulfills the requirements of the learning scheme presented in
this chapter. However, we wish to stress that the learning scheme presented here does
not depend on the particulars of CogPrime's representation scheme, though it is very
compatible with them.
45.3 Linguistic Content To Be Learned
Given the above linguistic infrastructure, what remains for a language learning system to
learn is the linguistic content that characterizes a particular language. Everything included
in OpenCog's existing "scaffolding" rule-based NLP system would, in this approach, be learned
to first approximation via unsupervised corpus analysis.
Specifically, given the assumed framework, key things to be learned include:
• A list of 'link types' that will be used to form 'disjuncts' must be learned.
- An example of a link type is the 'subject' link S. This link typically connects the sub-
ject of a sentence to the head verb. Given the normal English subject-verb word order,
nouns will typically have an S+ connector, indicating that an S link may be formed only
when the noun appears to the left of a word bearing an S- connector. Likewise, verbs
will typically be associated with S- connectors. The current Link Grammar contains
roughly one hundred different link-types, with additional optional subtypes that are
used to further constrain syntactic structure. This number of different link types seems
required simply because there are many relationships between words: there is not just a
subject-verb or verb-object relationship, but also rather fine distinctions, such as those
needed to form grammatical time, date, money, and measurement expressions, punctu-
ation use, including street-addresses, cardinal and ordinal relationships, proper (given)
names, titles and suffixes, and other highly constrained grammatical constructions. This
is in addition to the usual linguistic territory of needing to indicate dependent clauses,
comparatives, subject-verb inversion, and so on. It is expected that a comparable number
of link types will need to be learned.
- Some link types are rather strict, such as those connecting verb subjects and objects, while
other types are considerably more ambiguous, such as those involving prepositions.
This reflects the structure of English, where subject-verb-object order is fairly rigor-
ously enforced, but the ordering and use of prepositions is considerably looser. When
considering the looser cases, it becomes clear that there is no single, inherent 'right
answer' for the creation and assignment of link types, and that several different, yet
linguistically plausible linkage assignments may be made.
- The definition of a good link-type is one that leads the parser - applied across the whole
corpus - to allow parsing to be successful for almost all sentences, and yet not to be
so broad as to enable parsing of word-salads. Significant pressure must be applied to
prevent excess proliferation of link types, yet not so much as to over-simplify things, and
provide valid parses for unobserved, ungrammatical sentences.
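As a toy illustration of this trade-off, one can score a candidate grammar by rewarding corpus coverage while penalizing word-salad coverage and grammar size. The sketch below assumes a parser callable `parses(sentence) -> bool` and hypothetical weighting knobs; it is not a proposal for the actual objective function:

```python
def grammar_score(parses, corpus, n_link_types, alpha=1.0, beta=0.01):
    """Toy objective for a candidate set of link types.

    parses     -- assumed callable: parses(sentence) -> bool
    corpus     -- list of sentences (each a list of words)
    alpha/beta -- hypothetical weights on overgeneration and grammar size
    """
    covered = sum(parses(s) for s in corpus) / len(corpus)
    # Word salads: the same words, reversed -- a crude, deterministic
    # stand-in for random permutations of each sentence.
    salads = [list(reversed(s)) for s in corpus]
    overgenerated = sum(parses(s) for s in salads) / len(salads)
    # Good grammars parse the corpus, reject salads, and stay small.
    return covered - alpha * overgenerated - beta * n_link_types
```

A grammar that parses everything (including salads) scores no better than one that parses nothing, capturing the "not so broad as to enable parsing of word-salads" criterion above.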
• Lexical entries for different words must be learned.
- Typically, multiple connectors are needed to define how a word can link syntactically to
others. Thus, for example, many verbs have the disjunct S- & O+, indicating that they
need a subject noun to the left, and an object to the right. All words have at least a
handful of valid disjuncts that they can be used with, and sometimes hundreds or even
more. Thus, a "lexical entry" must be learned for each word, the lexical entry being a
set of disjuncts that can be used with that word.
- Many words are syntactically similar; most common nouns can share a single lexical
entry. Yet, there are many exceptions. Thus, during learning, there is a back-and-forth
process of grouping and regrouping words; clustering them so that they share lexical
entries, but also splitting apart clusters when it's realized that some words behave dif-
ferently. Thus for example, the words "sing" and "apologize" are both verbs, and thus
share some linguistic structure, but one cannot say "I apologized a song to Vicky"; if
these two verbs were initially grouped together into a common lexical entry, they must
later be split apart.
- The definition of a good lexical entry is much the same as that for a good link type: ob-
served sentences must be parsable; random sentences mostly must not be, and excessive
proliferation and complexity must be prevented.
• Semantic relationships must be learned.
- The semantic relationship eat(X,Y) is prototypical. Foundationally, such a semantic
relationship may be represented as a set whose elements consist of syntactico-semantic
subgraphs. For the relation eat(X, Y), a subgraph may be as simple as a single (syntactic)
disjunct S- & O+ for the normal word order "Ben ate a cookie", but it may also be
a more complex set needed to represent the inverted word order in "a cookie was eaten
by Ben". The set of all of these different subgraphs defines the semantic relationship.
The subgraphs themselves may be syntactic (as in the example above), or they may be
other semantic relationships, or a mixture thereof.
- Not all re-phrasings are semantically equivalent. "Mr. Smith is late" has a rather dif-
ferent meaning from "The late Mr. Smith."
- In general, place-holders like X and Y may be words or category labels. In early stages
of learning, it is expected that X and Y are each just sets of words. At some point,
though, it should become clear that these sets are not specific to this one relationship,
but can appropriately take part in many relationships. In the above example, X and Y
must be entities (physical objects), and, as such, can participate in (most) any other
relationships where entities are called for. More narrowly, X is presumably a person or
animal, while Y is a foodstuff. Furthermore, as entities, it might be inferred when these
refer to the same physical object (see the section 'reference resolution' below).
- Categories can be understood as sets of synonyms, including hyponyms (thus, "grub" is
a synonym for "food", while "cookie" is a hyponym).
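The set-of-subgraphs representation of eat(X, Y) can be sketched as a map from alternative syntactic frames to argument roles. The connector names below loosely follow Link Grammar, but the exact inventory and the frame format here are hypothetical:

```python
# eat(X, Y) as a set of alternative syntactico-semantic frames,
# each mapping connector slots to argument roles (illustrative only;
# the connector names are toy stand-ins, not real dictionary entries).
eat = {
    ("S-", "O+"):    {"S-": "X", "O+": "Y"},      # "Ben ate a cookie"
    ("Pv-", "MVp+"): {"Pv-": "Y", "MVp+": "X"},   # "a cookie was eaten by Ben"
}

def roles(frame, fillers):
    """Map surface word fillers onto eat's argument roles."""
    return {eat[frame][conn]: word for conn, word in fillers.items()}
```

Both the active frame with fillers {"S-": "Ben", "O+": "cookie"} and the passive frame with fillers {"Pv-": "cookie", "MVp+": "Ben"} recover X = Ben, Y = cookie, which is exactly the sense in which the set of subgraphs defines one relationship.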
• Idioms and set phrases must be learned.
- English has a large number of idiomatic expressions whose meanings cannot be inferred
from the constituent words (such as "to pull one's leg"). In this way, idioms present
a challenge: their sometimes complex syntactic constructions belie their often simpler
semantic content. On the other hand, idioms have a very rigid word-choice and word
order, and are highly invariant. Set phrases take a middle ground: word-choice is not
quite as fixed as for idioms, but, nonetheless, there is a conventional word order that
is usually employed. Note that the manually-constructed Link Grammar dictionaries
contain thousands of lexical entries for idiomatic constructions. In essence, these are
multi-word constructions that are treated as if they were a single word.
Each of the above tasks has already been accomplished and described in the literature; for
example, automated learning of synonymous words and phrases has been described by Lin
[LP01] and by Poon & Domingos [PD09]. The authors are not aware of any attempts to learn
all of these, together, in one go, rather than presuming the pre-existence of dependent layers.
45.3.1 Deeper Aspects of Comprehension
While the learning of the above aspects of language is the focus of our discussion here, the search
for semantic structure does not end there; more is possible. In particular, natural language
generation has a vital need for lexical functions, so that appropriate word-choice can be made
when vocalizing ideas. In order to truly understand text, one also needs, as a minimum, to
discern referential structure, and sophisticated understanding requires discerning topics. We
believe automated, unsupervised learning of these aspects is attainable, but is best addressed
after the 'simpler' language learning described above. We are not aware of any prior work
aimed at automatically learning these, aside from relatively simple, unsophisticated (bag-of-
words style) efforts at topic categorization.
45.4 A Methodology for Unsupervised Language Learning from a
Large Corpus
The language learning approach presented here is novel in its overall nature. Each part of it,
however, draws on prior experimental and theoretical research by others on particular aspects
of language learning, as well as on our own previous work building computational linguistic
systems. The goal is to assemble a system out of parts that are already known to work well in
isolation.
Prior published research, from a multitude of authors over the last few decades, has already
demonstrated how many of the items listed above can be learnt in an unsupervised setting
(see e.g. [Yur98, K4104, LP01, CSW, PD09, AlihO7, KSPC13] for relevant background). All
of the previously demonstrated results, however, were obtained in isolation, via research that
assumed the pre-existence of surrounding infrastructure far beyond what we assume above. The
approach proposed here may be understood as a combination, generalization and refinement
of these techniques, to create a system that can learn, more or less ab initio from a large corpus,
with a final result of a working, usable natural language comprehension system.
However, we must caution that the proposed approach is in no way a haphazard mash-up of
techniques. There is a deep algorithmic commonality to the different prior methods we combine,
which has not always been apparent in the prior literature due to the different emphases and
technical vocabularies used in the research papers in question. In parallel with implementing
the ideas presented here, we intend to work on fully formalizing the underlying mathematics
of the undertaking, so that it becomes clear what approximations are being taken, and what
avenues remain unexplored. Some fairly specific directions in this regard suggest themselves.
All of the prior research alluded to above invokes one or another variation of maximum en-
tropy principles, sometimes explicitly, but usually implicitly. In general, entropy maximization
principles provide the foundation for learning systems such as (hidden) Markov models, Markov
networks and Hopfield neural networks, and they connect indirectly with Bayesian probability
based analyses. However, the actual task of maximizing the entropy is an NP-hard problem;
forward progress depends on short-cuts, approximations and clever algorithms, some of which
are of general nature, and some domain-dependent. Part of the task of refining the details of
the language learning methodology presented here, is to explore various short-cuts and approx-
imations to entropy maximization, and discover new, clever algorithms of this nature that are
relevant to the language learning domain. As has been the case in physics and other domains,
we suspect that progress here will be best achieved via a coupled exploration of experimental
and mathematical aspects of the subject matter.
45.4.1 A High Level Perspective on Language Learning
On an abstract conceptual level, the approach proposed here depicts language learning as an
instance of a general learning loop such as:
1. Group together linguistic entities (i.e. words or linguistic relationships, such as those de-
scribed in the previous section) that display similar usage patterns (where one is looking at
usage patterns that are compactly describable given one's meta-language). Many, but not
necessarily all, usage patterns for a given linguistic entity will involve its use in conjunction
with other linguistic entities.
2. For each such grouping, make a category label.
3. Add these category labels to one's meta-language.
4. Return to Step 1
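The loop above can be sketched as follows. Here `signature` stands in for whatever "compactly describable usage pattern" extractor one actually uses, and all names are illustrative placeholders:

```python
from collections import defaultdict

def learning_loop(entities, signature, rounds=2):
    """Skeleton of the generic grouping loop (Steps 1-4 above).

    signature -- assumed callable mapping a label to a compact
                 description of its usage pattern (illustrative only)
    """
    meta = {e: e for e in entities}  # entity -> current category label
    for level in range(rounds):
        groups = defaultdict(list)                # Step 1: group by pattern
        for e in entities:
            groups[signature(meta[e])].append(e)
        for i, key in enumerate(sorted(groups)):  # Step 2: make labels
            for e in groups[key]:
                meta[e] = f"cat{level}_{i}"       # Step 3: extend meta-language
    return meta                                   # (Step 4 is the for-loop)
```

Each pass relabels entities in terms of the previous pass's labels, so the categories found on later rounds are one level "deeper" than those found earlier.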
It stands to reason that the result of this sort of learning loop, if successful, will be a hierarchi-
cally composed collection of linguistic relationships possessing the following
Linguistic Coherence Property: Linguistic entities are reasonably well characterizable in terms
of the compactly describable patterns observable in their relationships with other linguistic
entities.
Note that there is nothing intrinsically "deep" or hierarchical in this sort of linguistic co-
herence. However, the ability to learn the patterns relating linguistic entities with others, via a
recursive hierarchical learning loop such as described above, is contingent on the presence of a
fairly marked hierarchical structure in the linguistic data being studied. There is much evidence
that such hierarchical structure does indeed exist in natural languages. The "deep learning" in
our approach is embedded in the repeated cycles through the loop given above - each time one
goes through the loop, the learning gets one level deeper.
This sort of property has been observed to hold for many linguistic entities, an observation dating
back at least to Saussure [dS77] and the start of structuralist linguistics. It is basically a fancier
way of saying that the meanings of words and other linguistic constructs, may be found via their
relationships to other words and linguistic constructs. We are not committed to structuralism
as a theoretical paradigm, and we have considerable respect for the aid that non-linguistic in-
formation - such as the sensorimotor data that comes from embodiment - can add to language,
as should be apparent from the overall discussion in this book. However, the potential dramatic
utility of non-linguistic information for language learning does not imply the impossibility or
infeasibility of learning language from corpus data alone. It is inarguable that non-linguistic re-
lationships comprise a significant portion of the everyday meaning of linguistic entities; and yet,
redundancy is prevalent in natural systems, and we believe that purely linguistic relationships
may well provide sufficient data for learning of natural languages. If there are some aspects of
natural language that cannot be learned via corpus analysis, it seems difficult to identify what
these aspects are via armchair theorizing, and likely that they will only be accurately identified
via pushing corpus linguistics as far as it can go.
This generic learning process is a special case of the general process of symbolization, de-
scribed in Chaotic Logic [Goe94] and elsewhere as a key aspect of general intelligence. In this
process, a system finds patterns in itself and its environment, and then symbolizes these patterns
via simple tokens or symbols that become part of the system's native knowledge representation
scheme (and hence parts of its "metalanguage" for describing things to itself). Having repre-
sented a complex pattern as a simple symbolic token, it can then easily look at other patterns
involving this pattern as a component.
Note that in its generic format as stated above, the "language learning loop" is not restricted
to corpus based analysis, but may also include extralinguistic aspects of usage patterns, such
as gestures, tones of voice, and the physical and social context of linguistic communication.
Linguistic and extra-linguistic factors may come together to comprise "usage patterns." How-
ever, the restriction to corpus data does not necessarily denude the language learning loop of
its power: it merely restricts one to particular classes of usage patterns, whose informativeness
must be empirically determined.
In principle, one might be able to create a functional language learning system based only
on a very generic implementation of the above learning loop. In practice, however, biases
toward particular sorts of usage patterns can be very valuable in guiding language learning. In
a computational language learning context, it may be worthwhile to break down the language
learning process into multiple instances of the basic language learning loop, each focused on
different sorts of usage patterns, and coupled with each other in specific ways. This is in fact
what we will propose here.
Specifically, the language learning process proposed here involves:
• One language learning loop for learning purely syntactic linguistic relationships (such as
link types and lexical entries, described above), which are then used to provide input to a
syntax parser.
• One language learning loop for learning higher-level "syntactico-semantic" linguistic rela-
tionships (such as semantic relationships, idioms, and lexical functions, described above),
which are extracted from the output of the syntax parser.
These two loops are not independent of one another; the second loop can provide feedback to the
first, regarding the correctness of the extracted structures; then as the first loop produces more
correct, confident results, the second loop can in turn become more confident in its output. In
this sense, the two loops attack the same sort of slow-convergence issues that 'deep learning'
tackles in neural-net training.
The syntax parser itself, in this context, is used to extract directed acyclic graphs (dags),
usually trees, from the graph of syntactic relationships associated with a sentence. These dags
represent parses of the sentence. So the overall scope of the learning process proposed here is
to learn a system of syntactic relationships that displays appropriate coherence and
that, when fed into an appropriate parser, will yield parse trees that give rise to a
system of syntactico-semantic relationships that displays appropriate coherence.
45.4.2 Learning Syntax
The process of learning syntax from a corpus may be understood fairly directly in terms of
entropy maximization. As a simple example, consider the measurement of the entropy of the
arrangement of words in a sentence. To a fair degree, this can be approximated by the sum of
the mutual entropy between pairs of words. Yuret showed that by searching for and maximizing
this sum of entropies, one obtains a tree structure that closely resembles that of a dependency
parser [Yur98]. That is, the word pairs with the highest mutual entropy are more or less the same
as the arrows in a dependency parse, such as that shown in figure 44.1. Thus, an initial task
is to create a catalog of word-pairs with a large mutual entropy (mutual information, or MI)
between them. This catalog can then be used to approximate the most-likely dependency parse
of a sentence, although, at this stage, the link-types are as yet unknown.
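A crude version of this first step can be sketched as follows, estimating pointwise mutual information from sentence-level co-occurrence counts. This is a deliberate simplification of Yuret's within-sentence pair statistics; the estimator and its (absent) smoothing are placeholders:

```python
import math
from collections import Counter
from itertools import combinations

def pair_mi(corpus):
    """Estimate MI for word pairs from sentence-level co-occurrence.

    corpus -- list of sentences, each a list of words.
    Returns {(word_a, word_b): mi} for pairs seen together at least once.
    """
    word_n, pair_n, total = Counter(), Counter(), 0
    for sent in corpus:
        total += 1
        for w in set(sent):
            word_n[w] += 1
        for a, b in combinations(sorted(set(sent)), 2):
            pair_n[(a, b)] += 1
    # MI(a,b) = log2( p(a,b) / (p(a) p(b)) ), with sentence-level counts
    return {(a, b): math.log2(n * total / (word_n[a] * word_n[b]))
            for (a, b), n in pair_n.items()}
```

High-MI pairs from such a table approximate the arrows of a dependency parse; linking each word to its highest-MI partners in the sentence yields a first, unlabeled dependency skeleton.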
Finding dependency links using mutual information is just the first step to building a practical
parser. The generation of high-MI word-pairs works well for isolating which words should be
linked, but it does have several major drawbacks. First and foremost, the word-pairs do not come
with any sort of classification; there is no link type describing the dependency relationship
between two words. Secondly, most words fall into classes (e.g. nouns, verbs, etc.), but the
high-MI links do not tell us what these are. A compact, efficient parser appears to require this
sort of type information.
To discover syntactic link types, it is necessary to start grouping together words that appear
in similar contexts. This can be done with clustering and similarity techniques, which appears
to be sufficient to discover not only basic parts of speech (verbs, nouns, modifiers, determiners),
but also link types. So, for example, the computation of word-pair MI is likely to reveal the
following high-MI word pairs: "big car", "fast car", "expensive car", "red car". It is reasonable
to group together the words big, expensive, fast and red into a single category, interpreted as
modifiers to car. The grouping can be further refined if these same modifiers are observed with
other words (e.g. "big bicycle", "fast bicycle", etc.). This has two effects: it not only reinforces
the correctness of the original grouping of modifiers, but also suggests that perhaps cars and
bicycles should be grouped together. Thus, one has discovered two classes of words: modifiers
and nouns. In essence, one has crudely discovered parts of speech.
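This grouping step can be caricatured in a few lines: cluster words that share exactly the same set of high-MI partners. Real clustering would of course use graded similarity rather than exact match; everything here is illustrative:

```python
from collections import defaultdict

def cluster_by_partners(pairs):
    """Group left-hand words that share the same set of right-hand
    partners. `pairs` is an iterable of (modifier, head) word pairs."""
    partners = defaultdict(set)
    for mod, head in pairs:
        partners[mod].add(head)
    groups = defaultdict(list)
    for word, heads in partners.items():
        groups[frozenset(heads)].append(word)
    return sorted(sorted(g) for g in groups.values())

pairs = [("big", "car"), ("fast", "car"), ("red", "car"),
         ("big", "bicycle"), ("fast", "bicycle")]
# big and fast modify both car and bicycle; red modifies only car,
# so it lands in its own (smaller) cluster.
```

Refining with graded similarity would merge "red" back in once "red bicycle" is observed, which is the reinforcement effect described above.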
The link between these two classes carries a type; the type of that link is defined by these two
classes. The use of a pair of word classes to define a link type is a basic premise of categorial
grammar [CSW]. In this example, a link between a modifier and a noun would be a type
denoted as M\N in categorial grammar, M denoting the class of modifiers, and N the class
of nouns. In the system of Link Grammar, this is replaced by a simple name, but it's really
one and the same thing. (In this case, the existing dictionaries use the A link for this relation,
with A conjuring up 'adjective' as a mnemonic.) The simple name is a boon for readability, as
categorial grammars usually have very complex-looking link-type names: e.g. (NP\S)/NP for
the simplest transitive verbs. Typing seems to be an inherent part of language; types must be
extracted during the learning process.
The introduction of types here has mathematical underpinnings provided by type theory.
An introduction to type theory can be found in [Pro13], and an application of type theory to
linguistics can be found in [KSPC13]. This is a rather abstract work, but it sheds light on the
nature of link types, word-classes, parts-of-speech and the like as formal types of type theory.
This is useful in dispelling the seeming taint of ad hoc arbitrariness of clustering: in a linguistic
context, it is not so much ad hoc as it is a way of guaranteeing that only certain words can
appear in certain positions in grammatically correct sentences, a sort of constraint that seems
to be an inherent part of language, and seems to be effectively formalizable via type theory.
Word-clustering, as worked through in the above example, can be viewed as another entropy-
maximization technique. It is essentially a kind of factorization of dependent probabilities
into most likely factors. By classifying a large number of words as 'modifiers of nouns', one
is essentially admitting that they are equi-probable in that role, in the Markovian sense [Ash65]
(equivalently, treating them as equally-weighted priors, in the Bayesian probability sense). That
is, given the word "car", we should treat big, fast, expensive and red as being equi-probable (in
the absence of other information). Equi-probability is an axiom in Bayesian probability (the
axiom of priors), but it derives from the principle of maximum entropy (as any other probability
assignment would have a lower entropy).
We have described how link types may be learned in an unsupervised setting. Connector
types are then trivially assigned to the left and right words of a word-pair. The dependency
graph, as obtained by linking only those word pairs with a high MI, then allows disjuncts to
be easily extracted, on a sentence-by-sentence basis. At this point, another stage of pattern
recognition may be applied: Given a single word, appearing in many different sentences, one
should presumably find that this word only makes use of a relatively small, limited set of
disjuncts. It is then a counting exercise to determine which disjuncts are occurring the most
often for this word: these then form this word's lexical entry. (This "counting exercise" may
also be thought of as an instance of frequent subgraph mining, as will be elaborated below.)
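In Link Grammar notation, this counting exercise looks roughly like the following. The triple format for parsed links and the connector names are toy stand-ins for real parser output:

```python
from collections import Counter

def disjunct(word, links):
    """Extract the connector sequence a word uses in one parsed sentence.

    links -- list of (left_word, link_type, right_word) triples,
             a toy stand-in for real Link Grammar parser output.
    """
    conns = []
    for left, ltype, right in links:
        if right == word:
            conns.append(ltype + "-")   # link arriving from the left
        elif left == word:
            conns.append(ltype + "+")   # link departing to the right
    return tuple(conns)

# Across many parses, count which disjuncts "ate" actually uses:
parses = [
    [("ben", "S", "ate"), ("ate", "O", "cookie")],
    [("sue", "S", "ate"), ("ate", "O", "pie")],
]
counts = Counter(disjunct("ate", p) for p in parses)
```

The most frequent disjuncts, here ("S-", "O+"), then form the word's lexical entry.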
A second clustering step may then be applied: it's presumably noticeable that many words
use more-or-less the same disjuncts in syntactic constructions. These can then be grouped into
the same lexical entry. However, we previously generated a different set of word groupings (into
parts of speech), and one may ask: how does that grouping compare to this grouping? Is it
close, or can the groupings be refined? If the groupings cannot be harmonized, then perhaps
there is a certain level of detail that was previously missed: perhaps one of the groups should be
split into several parts. Conversely, perhaps one of the groupings was incomplete, and should
be expanded to include more words. Thus, there is a certain back-and-forth feedback between
these different learning steps, with later steps reinforcing or refining earlier steps, forcing a new
revision of the later steps.
45.4.2.1 Loose language
A recognized difficulty with the direct application of Yuret's observation (that the high-MI
word-pair tree is essentially identical to the dependency parse tree) is the flexibility of the
preposition in the English language. The preposition is so widely used, in such a large
variety of situations and contexts, that the mutual information between it and any other word
or word-set is rather low (its occurrence is unsurprising, and thus carries little information).
The two-point, pair-wise mutual entropy provides a poor approximation to what the English
language is doing in this particular case. It appears that the situation can be rescued with the
use of a three-point mutual information (a special case of interaction information).
The discovery and use of such constructs is described in [PD09]. A similar, related issue can
be termed "the richness of the MV link type in Link Grammar". This one link type, describing
verb modifiers (which includes prepositions) can be applied in a very large class of situations;
as a result, discovering this link type, while at the same time limiting its deployment to only
grammatical sentences, may prove to be a bit of a challenge. Even in the manually maintained
Link Grammar dictionaries, it can present a parsing challenge because so many narrower cases
can often be treated with an MV link. In summary, some constructions in English are so flexible
that it can be difficult to discern a uniform set of rules for describing them; certainly, pair-wise
mutual information seems insufficient to elucidate these cases.
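For reference, the two-point and three-point quantities at issue can be written out explicitly (following the standard definitions; sign conventions for interaction information vary across the literature):

```latex
\mathrm{MI}(x;y) \;=\; \log \frac{p(x,y)}{p(x)\,p(y)},
\qquad
I(X;Y;Z) \;=\; I(X;Y) \;-\; I(X;Y \mid Z)
```

The three-point quantity captures, roughly, how much knowing a third word (say, the object of a preposition) changes the information shared between the other two; it is this extra term that can rescue the low pairwise MI of prepositions.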
Curiously, these more challenging situations occur primarily with more complex sentence
constructions. Perhaps the flexibility is associated with the difficulty that humans have with
composing complex sentences; short sentences are almost 'set phrases', while longer sentences
can be a semi-grammatical jumble. In any case, some of the trouble might be avoided by limiting
the corpus to smaller, easier sentences, perhaps by working with children's literature at
first.
45.4.2.2 Elaboration of the Syntactic Learning Loop
We now reiterate the syntactic learning process described above in a more systematic way. By
getting more concrete, we also make certain assumptions and restrictions, some of which may
end up getting changed or lifted in the course of implementation and detailed exploration of
the overall approach. What is discussed in this section is merely one simple, initial approach to
concretizing the core language learning loop we envision in a syntactic context.
Syntax, as we consider it here, involves the following basic entities:
• words
• categories of words
• "co-occurrence links", each one defined as (in the simplest case) an ordered pair or triple
of words, labeled with an uncertain truth value
• "syntactic link types", each one defined as a certain set of ordered pairs of words
• "disjuncts", each one associated with a particular word w, and consisting of an ordered
set of link types involving the word w. That is, each of these links contains at least one
word-pair containing w as first or second argument. (This nomenclature here comes from
Link Grammar; each disjunct is a conjunction of link types. A word is associated with a
set of disjuncts. In the course of parsing, one must choose between the multiple disjuncts
associated with a word, to fulfill the constraints required of an appropriate parse structure.)
An elementary version of the basic syntactic language learning loop described above would take
the following form:
1. Search for high-MI word pairs. Define one's usage links as the given co-occurrence links.
2. Cluster words into categories based on the similarity of their associated usage links
• Note that this will likely be a tricky instance of clustering, and classical clustering
algorithms may not perform well. One interesting, less standard approach would be to
use OpenCog's MOSES algorithm [Loo06, Loo07] to learn an array of program trees,
each one serving as a recognizer for a single cluster, in the same general manner done
with Genetic Programming in [BE07].
3. Define initial syntactic link types from categories that are joined by large bundles of usage
links
• That is, if the words in category C1 have a lot of usage links to the words in category C2,
then create a syntactic link type whose elements are (w1, w2), for all w1 ∈ C1, w2 ∈ C2.
4. Associate each word with an extended set of usage links, consisting of: its existing usage
links, plus the syntactic links that one can infer for it based on the categories the word
belongs to. One may also look at chains of (e.g.) 2 syntactic links originating at the word.
• For example, suppose cat ∈ C1 and C1 has syntactic link L1. Suppose (cat, eat) and
(dog, run) are both in L1. Then if there is a sentence "The cat likes to run", the link
L1 lets one infer the syntactic link cat → run. The frequency of this syntactic link in a
relevant corpus may be used to assign it an uncertain truth value.
• Given the sentence "The cat likes to run in the park," a chain of syntactic links such
as cat → run → park may be constructed.
5. Return to Step 2, but using the extended set of usage links produced in Step 4, with the
goal of refining both clusters and the set of link types for accuracy. Initially, all categories
contain one word each, and there is a unique link type for each pair of categories. This is
an inefficient representation of language, and so the goal of clustering is to have a relatively
small set of clusters and link types, with many words/word-pairs assigned to each. This can
be done by maximizing the sum of the logarithms of the sizes of the clusters and link types;
that is, by maximizing entropy. Since the category assignments depend on the link types, and
vice versa, a very large number of iterations of the loop are likely to be required. Based on
the current Link Grammar English dictionaries, one expects to discover hundreds of link
types (or more, depending on how subtypes are counted), and perhaps a thousand word
clusters (most of these corresponding to irregular verbs and idiomatic phrases).
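The compression criterion in Step 5 can be stated as an explicit objective: over assignments of words to categories C_i and of word-pairs to link types L_j, maximize

```latex
\sum_i \log |C_i| \;+\; \sum_j \log |L_j|
```

subject to the constraint that the resulting grammar still parses the observed corpus (and still rejects word-salads). This is one concrete reading of the entropy-maximization view of the clustering step; the exact constraint handling is left open here.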
Many variants of this same sort of process are conceivable, and it's currently unclear what sort
of variant will work best. But this kind of process is what one obtains when one implements
the basic language learning loop described above on a purely syntactic level.
How might one integrate semantic understanding into this syntactic learning loop? Once one
has semantic relationships associated with a word, one uses them to generate new "usage links"
for the word, and includes these usage links in the algorithm from Step 1 onwards. This may
be done in a variety of different ways, and one may give different weightings to syntactic versus
semantic usage links, resulting in the learning of different links.
The above process would produce a large set of syntactic links between words. Next comes
a further series of steps. These may be carried out concurrently with the above steps, as soon
as Step 4 has been reached for the first time.
1. This syntactic graph (with nodes as words and syntactic links joining them) may then be
mined, using a variety of graph mining tools, to find common combinations of links. This
gives the "disjuncts" mentioned above.
2. Given the set of disjuncts, one carries out parsing using a process such as link parsing or
word grammar parsing, thus arriving at a set of parses for the sentences in one's reference
corpus. Depending on the nature of one's parser, these parses may be ranked according to
semantic plausibility. Each parse may be viewed as a directed acyclic graph (dag), usually
a tree, with words at the nodes and syntactic-link type labels on the links.
3. One can now define new usage links for each word: namely, the syntactic links occurring
in sentence parses, containing the word in question. These links may be weighted based on
the weights of the parses they occur in.
4. One can now return to Step 2 using the new usage links, alongside the previous ones.
Weighting these usage links relative to the others may be done in various ways.
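The disjunct mining of Step 1 can be illustrated with a small sketch. Here a parse is simply a word list plus labeled links, and a word's disjunct is the concatenation of its left ("-") and right ("+") connectors, following Link Grammar notation; the parse below is a toy example, not real parser output.

```python
# Hypothetical sketch of Step 1: mining disjuncts from parsed sentences.
# Each parse is (words, links) with links as (i, j, label), i < j.
from collections import Counter

def disjunct(parse, k):
    """The disjunct of word k: its left connectors, then its right connectors."""
    words, links = parse
    left  = [lab + "-" for (i, j, lab) in links if j == k]
    right = [lab + "+" for (i, j, lab) in links if i == k]
    return " & ".join(left + right)

def mine_disjuncts(parses):
    counts = Counter()
    for parse in parses:
        words, _ = parse
        for k in range(len(words)):
            counts[(words[k], disjunct(parse, k))] += 1
    return counts

# "the bell chimed": a D link joins the->bell, an S link joins bell->chimed
parse = (["the", "bell", "chimed"], [(0, 1, "D"), (1, 2, "S")])
counts = mine_disjuncts([parse])
print(counts[("chimed", "S-")])      # 1: chimed carries a single left S connector
print(counts[("bell", "D- & S+")])   # 1: bell connects left to "the", right to "chimed"
```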
Several subtleties have been ignored in the above, such as the proper discovery and treatment of
idiomatic phrases, the discovery of sentence boundaries, the handling of embedded data (price
quotes, lists, chapter titles, etc.), as well as the potential speed bump that prepositions present.
Fleshing out the details of this loop into a workable, efficient design is the primary engineering
challenge. This will take significant time and effort.
45.4.3 Learning Semantics
Syntactic relationships provide only the shallowest interpretation of language; semantics comes
next. One may view semantic relationships (including semantic relationships close to the syntax
level, which we may call "syntactico-semantic" relationships) as ensuing from syntactic relation-
ships, via a similar but separate learning process to the one proposed above. Just as our approach
to syntax learning is heavily influenced by our work with Link Grammar, our approach to semantics
is heavily influenced by our work on the RelEx system [RVC03, GPPG06],
which maps the output of the Link Grammar parser into a more abstract, semantic form. Prototype
systems [Goe10b] have also been written mapping the output of RelEx into even
more abstract semantic form, consistent with the semantics of the Probabilistic Logic Networks
[GIGH08] formalism as implemented in CogPrime. These systems are largely based on hand-coded
rules, and thus not in the spirit of language learning pursued in this proposal. However,
they display the same structure that we assume here; the difference being that here we specify a
mechanism for learning the linguistic content that fills in the structure via unsupervised corpus
learning, obviating the need for hand-coding.
Specifically, we suggest that discovery of semantic relations requires the implementation of
something similar to [LP01], except that this work needs to be generalized from 2-point relations
to 3-point and N-point relations, roughly as described in [PD09]. This allows the automatic,
unsupervised recognition of synonymous phrases, such as "Texas borders on Mexico" and "Texas
is next to Mexico", to extract the general semantic relation next_to(X, Y), and the fact that
this relation can be expressed in one of several different ways.
45.4 A Methodology for Unsupervised Language Learning from a Large Corpus 431
At the simplest level, in this approach, semantic learning proceeds by scanning the corpus
for sentences that use similar or the same words, yet employ them in a different order, or have
point substitutions of single words, or of small phrases. Sentences which are very similar, or
identical, save for one word, offer up candidates for synonyms, or sometimes antonyms. Sentences
which use the same words, but in seemingly different syntactic constructions, are candidates
for synonymous sentences. These may be used to extract semantic relations: the recognition of
sets of different syntactic constructions that carry the same meaning.
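This scanning procedure can be sketched very simply. The following toy illustration finds sentence pairs of equal length differing in exactly one position, yielding candidate synonym (or antonym) pairs; a realistic version would of course use parse structure rather than raw token positions.

```python
# Hypothetical sketch: scanning a corpus for sentence pairs that are
# identical except at one position -- candidate synonym (or antonym) pairs.
from itertools import combinations

def synonym_candidates(sentences):
    pairs = set()
    for a, b in combinations(sentences, 2):
        ta, tb = a.lower().split(), b.lower().split()
        if len(ta) != len(tb):
            continue
        diffs = [(x, y) for x, y in zip(ta, tb) if x != y]
        if len(diffs) == 1:          # same frame, one substituted word
            pairs.add(tuple(sorted(diffs[0])))
    return pairs

corpus = ["Texas borders on Mexico",
          "Texas verges on Mexico",
          "The cat ate the dog"]
print(synonym_candidates(corpus))  # {('borders', 'verges')}
```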
In essence, similar contexts must be recognized, and then word and word-order differences
between these other-wise similar contexts must be compared. There are two primary challenges:
how to recognize similar contexts, and how to assign probabilities.
The work of [PD09] articulates solutions to both challenges. For the first, it describes a
general framework in which relations such as next_to(X, Y) can be understood as lambda-expressions
λx.λy.next_to(x, y), so that one can employ first-order logic constructions in place
of graphical representations. This is partly a notational trick; it just shows how to split up
input syntactic constructions into atoms and terms, for which probabilities can be assigned.
For the second challenge, they show how probabilities can be assigned to these expressions,
by making explicit use of the notions of conditional random fields (or rather, a certain special
case, termed Markov Logic Networks). Conditional random fields, or Markov networks, are a
certain mathematical formalism that provides the most general framework in which entropy
maximization problems can be solved: roughly speaking, it can be understood as a means of
properly distributing probabilities across networks. Unfortunately, this work is quite abstract
and rather dense. A much easier introduction to the general idea can be obtained from
[LP01]; unfortunately, the latter fails to provide the general N-point case needed for semantic
relations in general, and also fails to consider the use of maximum entropy principles to obtain
similarity measures.
The above can be used to extract synonymous constructions, and, in this way, semantic
relations. However, neither of the above references deal with distinguishing different meanings
for a given word. That is, while eats(X, Y) might be a learnable semantic relation, the sentence
"He ate it" does not necvsbarily justify its use. Of course: "He ate it" is an idiomatic expression
meaning "he crashed", which should be associated with the semantic relation crash(X), not
eat(X, Y). There are global textual clues that this may be the case: trouble resolving the reference
"it", and a lack of mention of foodstuffs in neighboring sentences. A viable yet simple algorithm
for the disambiguation of meaning is offered by the Mihalcea algorithm [MTF04, SM07].
This is an application of the (Google) PageRank algorithm to word senses, taken across words
appearing in multiple sentences. The premise is that the correct word-sense is the one that is
most strongly supported by senses of nearby words; a graph between word senses is drawn,
and then solved as a Markov chain. In the original formulation, word senses are defined by
appealing to WordNet, and affinity between word-senses is obtained via one of several similarity
measures. Neither of these can be applied in learning a language de novo. Instead, these must
both be deduced by clustering and splitting, again. So, for example, it is known that word senses
correlate fairly strongly with disjuncts (based on the authors' unpublished experiments), and thus,
a reasonable first cut is to presume that every different disjunct in a lexical entry conveys a
different meaning, until proved otherwise. The above-described discovery of synonymous phrases
can then be used to group different disjuncts into a single "word sense". Disjuncts that remain
ungrouped after this process are already considered to have distinct senses, and so can be used
as distinct senses in the Mihalcea network.
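A minimal sketch of the Mihalcea-style procedure may clarify the idea. Senses of nearby words form a weighted graph, and a PageRank-style power iteration ranks each word's senses by the support they receive from neighbors; the sense labels and edge weights below are invented toy data, not output of the real system.

```python
# Hedged sketch of Mihalcea-style word sense disambiguation: senses form a
# graph, and PageRank picks the sense most supported by nearby senses.
def pagerank(nodes, edges, damping=0.85, iters=50):
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_w = {n: sum(w for (a, b, w) in edges if a == n) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for a, b, w in edges:
            if out_w[a] > 0:
                new[b] += damping * rank[a] * w / out_w[a]
        rank = new
    return rank

# Two senses of "bell"; the instrument sense is linked to senses of
# co-occurring words "ring" and "chime", the diving-bell sense is not.
nodes = ["bell/instrument", "bell/diving", "ring/sound", "chime/sound"]
edges = [("ring/sound", "bell/instrument", 1.0),
         ("chime/sound", "bell/instrument", 1.0),
         ("bell/instrument", "ring/sound", 1.0),
         ("bell/instrument", "chime/sound", 1.0)]
rank = pagerank(nodes, edges)
print(rank["bell/instrument"] > rank["bell/diving"])  # True
```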
Sense similarity measures can then be developed by using the above-discovered senses, and
measuring how well they correlate across different texts. That is, if the word "bell" occurs multi-
ple times in a sequence of paragraphs, it is reasonable to assume that each of these occurrences
are associated with the same meaning. Thus, each distinct disjunct for the word "bell" can
then be presumed to still convey the same sense. One now asks, what words co-occur with
the word "ben"? The frequent appearance of "chime" and "ring" can and should be noted. In
essence, one Ls once-again computing word-pair mutual information, except that now, instead
of limiting word-pairs to be words that are near each other, they can instead involve far-away
words, several sentences apart. One can then expand the word sense of "bell" to include a list of
co-occurring words (and indeed, this is the slippery slope leading to set phrases and eventually
idioms).
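The long-range co-occurrence computation can be sketched as a windowed pointwise mutual information, with whole paragraphs as windows; the paragraphs below are toy data.

```python
# Hypothetical sketch: mutual information between word pairs computed over
# wide windows (whole paragraphs) rather than adjacent positions, as used
# above to attach "chime" and "ring" to the instrument sense of "bell".
import math

def windowed_pmi(paragraphs, w1, w2):
    n = len(paragraphs)
    c1 = sum(1 for p in paragraphs if w1 in p)
    c2 = sum(1 for p in paragraphs if w2 in p)
    c12 = sum(1 for p in paragraphs if w1 in p and w2 in p)
    if 0 in (c1, c2, c12):
        return float("-inf")   # never co-occur: no evidence of association
    return math.log((c12 / n) / ((c1 / n) * (c2 / n)))

paras = [{"bell", "chime", "ring"}, {"bell", "ring"},
         {"tax", "law"}, {"tax", "audit"}]
print(windowed_pmi(paras, "bell", "ring") > windowed_pmi(paras, "bell", "tax"))
# True: "ring" co-occurs with "bell"; "tax" never does
```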
Failures of co-occurrences can also further strengthen distinct meanings. Consider "he chimed
in" and "the bell chimed". In both cases, chime is a verb. In the first sentence, chime carries
the disjunct S- & K+ (here, K+ is the standard Link Grammar connector to particles) while
the second has only the simpler disjunct S-. Thus, based on disjunct usage alone, one already
suspects that these two have a different meaning. This is strengthened by the lack of occurrence
of words such as "bell" or "ring" in the first case, with a frequent observation of words pertaining
to talking.
There is one final trick that must be applied in order to get reasonably rapid learning; this
can be loosely thought of as "the sigmoid function trick of neural networks", though it may also
be manifested in other ways not utilizing specific neural net mathematics. The key point is that
semantics intrinsically involves a variety of uncertain, probabilistic and fuzzy relationships; but
in order to learn a robust hierarchy of semantic structures, one needs to iteratively crispen these
fuzzy relationships into strict ones.
In much of the above, there is a recurring need to categorize, classify and discover similarity.
The most naive means of doing so is by counting, and applying basic probability (Bayesian, Markovian)
to the resulting counts to deduce likelihoods. Unfortunately, such formulas distribute
probabilities in essentially linear ways (i.e. form a linear algebra), and thus have a rather poor
ability to discriminate or distinguish (in the sense of receiver operating characteristics, of dis-
criminating signal from noise). Consider the last example: the list of words co-occurring with
chime, over the space of a few paragraphs, is likely to be tremendous. Most of this is surely
noise. There is a trick to overcoming this that is deeply embedded in the theory of neural
networks, and yet completely ignored in probabilistic (Bayesian, Markovian) networks: the sigmoid
function. The sigmoid function serves to focus in on a single stimulus, and elevate its
importance, and, at the same time, strongly suppress all other stimuli. In essence, the sigmoid
function looks at two probabilities, say 0.55 and 0.45, and says "let's pretend the first one is 0.9
and the second one is 0.1, and move forward from there". It builds in a strong discrimination
to all inputs. In the language of standard, text-book probability theory, such discrimination is
utterly unwarranted; and indeed, it is. However, applying strong discrimination to learning can
help speed learning by converting certain vague impressions into certainties. These certainties
can then be built upon to obtain additional certainties, or be torn apart, as needed.
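The "sigmoid trick" itself is a one-liner; here is a minimal sketch. The gain parameter is our own assumption (the text only requires that discrimination be strong); with a steep enough gain, 0.55 and 0.45 are pushed to roughly 0.9 and 0.1, as described above.

```python
# Minimal sketch of the "sigmoid trick": pushing soft probabilities toward
# 0 or 1 before the next learning iteration. The gain value is an assumption.
import math

def crispen(p, gain=45.0):
    """Sharpen a probability about the 0.5 decision boundary."""
    return 1.0 / (1.0 + math.exp(-gain * (p - 0.5)))

print(round(crispen(0.55), 1))  # 0.9: a slight lean becomes a strong one
print(round(crispen(0.45), 1))  # 0.1
```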
Thus, in all of the above efforts to gauge the similarity between different things, it is useful
to have a sharp yes/no answer, rather than a vague muddling with likelihoods. In some of
the above-described algorithms, this sharpness is already built in: so, Yuret approximates the
mutual information of an entire sentence as the sum of mutual information between word pairs;
the smaller, unlikely corrections are discarded. Clearly, they must also be revived in order to
handle prepositions. Something similar must also be done in the extraction of synonymous
phrases, semantic relations, and meaning; the domain is that much likelier to be noisy, and
thus, the need to discriminate signal from noise that much more important.
45.4.3.1 Elaboration of the Semantic Learning Loop
We now provide a more detailed elaboration of a simple version of the general semantic learning
process described above. The same caveat applies here as in our elaborated description of
syntactic learning above: the specific algorithmic approach outlined here is a simple instantiation
of the general approach we have in mind, which may well require refinement based on lessons
learned during experimentation and further theoretical analysis.
One way to do semantic learning, according to the approach outlined above, is as follows:
1. An initial semantic corpus is posited, whose elements are parse graphs produced by the
syntactic process described earlier
2. A semantic relationship set (or rel-set) is computed from the semantic corpus, via calculat-
ing the frequent (or otherwise statistically informative) subgraphs occurring in the elements
of the corpus. Each node of such a subgraph may contain a word, a category or a variable;
the links of the subgraph are labeled with (syntactic, or semantic) link types. Each parse
graph is annotated with the semantic graphs associated with the words it contains (explicitly:
each word in a parse graph may be linked via a ReferenceLink to each variable or
literal, within a semantic graph, that corresponds to that word in the context of the sentence
underlying the parse graph.)
• For instance, the link combination v1 → v2 → v3 may commonly occur (representing
the standard Subject-Verb-Object (SVO) structure)
• In this case, for the sentence "The rock broke the window," we would have links such
as ReferenceLink(rock, v1), connecting the nodes (such as the "rock" node) in the parse
structure with nodes (such as v1) in the associated semantic subgraph.
3. Rel-sets are divided into categories based on the similarities of their associated semantic
graphs.
• This division into categories manifests the sigmoid-function-style crispening mentioned
above. Each rel-set will have similarities to other rel-sets, to varying fuzzy degrees.
Defining specific categories turns a fuzzy web of similarities into crisp categorial bound-
aries; which involves some loss of information, but also creates a simpler platform for
further steps of learning.
• Two semantic graphs may be called "associated" if they have a nonempty intersection.
The intersection determines the type of association involved. Similarity assessment between
graphs G and H may involve estimation of which graphs G and H are associated
with, and in which ways.
• For instance, "The cat ate the dog" and "The frog was eaten by the walrus" represent
the semantic structure eat(cat,dog) in two different ways. In link parser terminology,
they do so respectively via the subgraphs g, = v, 4 vs Or v3 and 02 = vi 4 vs 4
UV 1
v3 v4 -) v5. These two semantic graphs will have a lot of the same associations.
For iastance, in our corpus we may have "The big cat ate the dog in the morning"
(including big 4 cat) and also "The big frog was eaten by the walrus in the morning"
EFTA00624580
434 45 Language Learning via Unsupervised Corpus Analysis
(including big 4 frog), meaning that big 4 v5 is a graph commonly associated with
both g, and J Cz. Due to having many commonly associated graphs like this, gi and 02
are likely to be associated to a common cluster.
4. Nodes referring to these categories are added to the parse graphs in the semantic corpus.
Most simply, a category node C is assigned a link of type L pointing to another node x,
if any element of C has a link of type L pointing to x. (More sophisticated methods of
assigning links to category nodes may also be worth exploring.)
• If g1 and g2 have been assigned to a common category C, then "I believe the pig ate
the horse" and "I believe the law was invalidated by the revolution" will both appear
as instantiations of the graph g3 = believe → C. This g3 is compact because of
the recognition of C as a cluster, leading to its representation as a single symbol. The
recognition of g3 will occur in Step 2 the next time around the learning loop.
5. Return to Step 2, with the newly enriched semantic corpus. As before, one wants to discover
not too many and not too few categories; again, the appropriate solution to this problem
appears to be entropy maximization. That is, during the frequent subgraph mining stages,
one maintains counts of how often these occur in the corpus; from these, one constructs the
equivalent of the mutual information associated with the subgraphs; categorization requires
maximizing the sum of the log of the sizes of the categories.
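The link-lifting rule of Step 4 can be sketched directly: a category node receives every (link type, target) pair possessed by any of its members. The member names and links below are toy data.

```python
# Hypothetical sketch of Step 4: lifting links from category members to the
# category node itself (C gets a link of type L to x if any member has one).
def category_links(members, member_links):
    """members: elements of category C; member_links: name -> set of (type, target)."""
    lifted = set()
    for m in members:
        lifted |= member_links.get(m, set())
    return lifted

member_links = {"g1": {("subj", "believe")},
                "g2": {("subj", "believe"), ("tense", "past")}}
print(sorted(category_links(["g1", "g2"], member_links)))
# [('subj', 'believe'), ('tense', 'past')]
```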
As noted earlier, these semantic relationships may be used in the syntactic phase of language
understanding in two ways:
• Semantic graphs associated with words may be considered as "usage links" and thus included
as part of the data used for syntactic category formation.
• During the parsing process, full or partial parses leading to higher-probability semantic
graphs may be favored.
45.5 The Importance of Incremental Learning
The learning process described here builds up complex syntactic and semantic structures from
simpler ones. To start it, all one needs are basic before and after relationships derived from a
corpus. Everything else is built up from there, given the assumption of appropriate syntactic
and semantic formalisms and a semantics-guided syntax parser.
As we have noted, the series of learning steps we propose falls into the broad category of
"deep learning", or of hierarchical modeling. That is, learning must occur at several levels at
once, each reinforcing, and making use of results from another. Link types cannot be identified
until word clusters are found, and word clusters cannot be found until word-pair relationships
are discovered. However, once link-types are known, these can be then used to refine clusters
and the selected word-pair relations. Further, the process of finding word clusters - both pre
and post parsing - relies on a hierarchical build-up of clusters, each phase of clustering utilizing
results of the previous "lower level" phase.
However, for this bootstrapping learning to work well, one will likely need to begin with
simple language, so that the semantic relationships embodied in the text are not that far
removed from the simple before/after relationships. The complexity of the texts may then be
ramped up gradually. For instance, the needed effect might be achieved via sorting a very large
corpus in order of increasing reading level.
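The corpus-ordering idea can be sketched with a crude reading-level proxy (mean sentence length in words); a real implementation would use an established readability formula such as Flesch-Kincaid, but the principle is the same.

```python
# Hedged sketch of sorting a corpus by increasing reading level, using mean
# sentence length as a toy stand-in for a real readability formula.
import re

def reading_level(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    return len(words) / max(len(sentences), 1)

corpus = ["The cat sat. The dog ran.",
          "Conditional random fields provide a general framework in which "
          "entropy maximization problems can be solved."]
corpus.sort(key=reading_level)   # simplest texts first
print(reading_level(corpus[0]) < reading_level(corpus[1]))  # True
```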
45.6 Integrating Language Learned via Corpus Analysis into
CogPrime's Experiential Learning
Supposing everything in this chapter were implemented and tested and worked reasonably well
as envisioned. What would this get us in terms of progress toward AGI?
Arguably, with a relatively modest additional effort, it could get us a natural language
question answering system, answering a variety of questions based on the text corpus available
to it. One would have to use the learned rules for language generation, but the methods of
Chapter 46 would likely suffice for that.
Such a dialogue system would be a valuable achievement in its own right, of scientific,
commercial and humanistic interest - but of course, it wouldn't be AGI. To get something
approaching AGI from this sort of effort, one would have to utilize additional reasoning and
concept creation algorithms to enable the answering of questions based on knowledge not stored
explicitly in the provided corpus. The dialogue system would have to be able to piece together
new answers from various fragmentary, perhaps contradictory, pieces of information contained
in the corpus. Ultimately, we suspect, one would need something like the CogPrime architec-
ture, or something else with a comparable level of sophistication, to appropriately leverage the
information extracted from texts via the learned language rules.
An open question, as indicated above, is how much of language a corpus-based
language learning system like the one outlined here would miss, assuming a massive but realistic
corpus (say, a significant fraction of the Web). This is unresolved and ultimately will only be
determined via experiment. Our suspicion is that a very large percentage of language can be
understood via these corpus-based methods. But there may be exceptions that would require
an unrealistically large corpus size.
As a simple example, consider the ability to interpret vaguely given spatial directions like
"Go right out the door, past a few curves in the road, then when you get to a hill with a big
red house on it (well not that big, but bigger than most of the others you'll see on the walk),
start heading down toward the water, till the brush gets thick, then start heading left.... Follow
the ground as it rises and eventually you'll see the lake." Of course, it is theoretically possible
for an AGI system to learn to interpret directions like this purely via corpus analysis. But it
seems the task would be a lot easier for an AGI endowed with a body so that it could actually
experience routes like the one being described. And space and time are not the only source of
relevant examples; social and emotional reasoning have a similar property. Learning to interpret
language about these from reading is certainly possible, but one will have an easier time and
do a better job if one is out in the world experiencing social and emotional life oneself.
Even if there turn out to be significant limitations regarding what can be learned in practice
about language via corpus analysis, though, it may still prove a valuable contributor to the
mind of a CogPrime system. As compared to hand-coded rules, comparably abstract linguistic
knowledge achieved via statistical corpus analysis should be much easier to integrate with the
results of probabilistic inference and embodied learning, due to its probabilistic weighting and
its connection with the specific examples that gave rise to it.
Chapter 46
Natural Language Generation
Co-authored with Ruiting Lian and Rui Liu
46.1 Introduction
Language generation, unsurprisingly, shares most of the key features of language comprehension
discussed in chapter 44 - after all, the division between generation and comprehension is to
some extent an artificial convention, and the two functions are intimately bound up both in the
human mind and in the CogPrime architecture.
In this chapter we discuss language generation, in a manner similar to the previous chapter's
treatment of language comprehension. First we discuss our currently implemented, "engineered"
language generation system, and then we discuss some alternative approaches:
• how a more experiential-learning based system might be made by retaining the basic struc-
ture of the engineered system but removing the "pre-wired" contents.
• how a "Sem2Syn" system might be made, via reversing the Syn2Sem system described in
Chapter 44. This is the subject of implementation effort at the time of writing.
At the start of Chapter 44 we gave a high-level overview of a typical NL generation pipeline.
Here we will focus largely but not entirely on the "syntactic and morphological realization"
stage, which we refer to for simplicity as "sentence generation" (taking a slight terminological
liberty, as "sentence fragment generation" is also included here). All of the stages of language
generation are important, and there is a nontrivial amount of feedback among them. However,
there is also a significant amount of autonomy, such that it often makes sense to analyze each
one separately and then tease out its interactions with the other stages.
46.2 SegSim for Sentence Generation
The sentence generation approach currently taken in OpenCog (from 2009 to early 2012), which
we call SegSim, is relatively simple and is depicted in Figure 46.1 and described as follows:
1. The NL generation system stores a large set of pairs of the form (semantic structure,
syntactic/morphological realization)
2. When it is given a new semantic structure to express, it first breaks this semantic structure
into natural parts, using a set of simple syntactic-semantic rules
3. For each of these parts, it then matches the parts against its memory to find relevant pairs
(which may be full or partial matches), and uses these pairs to generate a set of syntactic
realizations (which may be sentences or sentence fragments)
4. If the matching has failed, then (a) it returns to Step 2 and carries out the breakdown
into parts again. But if this has happened too many times, then (b) it resorts to a
different algorithm (most likely a search or optimization based approach, which is more
computationally costly) to determine the syntactic realization of the part in question.
5. If the above step generated multiple fragments, they are pieced together, and a certain rating
function is used to judge if this has been done adequately (using criteria of grammaticality
and expected comprehensibility, among others). If this fails, then Step 3 is tried again on
one or more of the parts; or Step 2 is tried again. (Note that one option for piecing the
fragments together is to string together a number of different sentences; but this may not
be judged optimal by the rating function.)
6. Finally, a "cleanup" phase is conducted, in which correct morphological forms are inserted,
and articles and certain other "function words" are inserted.
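The control flow of the six steps can be summarized in a structural sketch (this is our own skeleton, not the NLGen implementation); the pair store, splitter, matcher, stitcher and rating function are stubs supplied by the caller.

```python
# Structural sketch of the SegSim loop described above; all components
# are caller-supplied stubs, and the cleanup phase is a trivial stand-in.
def segsim(semantic_structure, pair_store, split, match, stitch, rate,
           max_retries=3):
    for _ in range(max_retries):
        parts = split(semantic_structure)      # Step 2: break into natural parts
        fragments = []
        for part in parts:
            hit = match(part, pair_store)      # Step 3: match against stored pairs
            if hit is None:
                break                          # Step 4a: re-split and retry
            fragments.append(hit)
        else:
            sentence = stitch(fragments)       # Step 5: piece fragments together
            if rate(sentence):
                return cleanup(sentence)       # Step 6: morphology, articles
    return None  # Step 4b: a costlier search-based fallback would go here

def cleanup(sentence):
    # stand-in for the real morphological/function-word cleanup phase
    return sentence[0].upper() + sentence[1:] + "."

store = {"eat(cat, dog)": "the cat ate the dog"}
out = segsim("eat(cat, dog)", store,
             split=lambda s: [s],
             match=lambda part, st: st.get(part),
             stitch=lambda frags: " ".join(frags),
             rate=lambda s: True)
print(out)  # The cat ate the dog.
```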
The specific OpenCog software implementing the SegSim algorithm is called "NLGen"; this
is an implementation of the SegSim concept that focuses on sentence generation from RelEx
semantic relationships. In the current (early 2012) NLGen version, Step 1 is handled in a very
simple way using a relational database; but this will be modified in future so as to properly
use the AtomSpace. Work is currently underway to replace NLGen with a different "Sem2Syn"
approach, which will be described at the end of this chapter. But discussion of NLGen is still
instructive regarding the intersection of language generation concepts with OpenCog concepts.
The substructure currently used in Step 2 is defined by the predicates of the sentence, i.e.
we define one substructure for each predicate, which can be described as follows:
Predicate(Argument_i(Modify_j))
where
• 1 ≤ i ≤ m and 0 ≤ j ≤ n, and m, n are integers
• "Predicate" stands for the predicate of the sentence, corresponding to the variable $0 of the
RelEx relationship _subj($0, $1) or _obj($0, $1)
• Argument_i is the i-th semantic parameter related with the predicate
• Modify_j is the j-th modifier of Argument_i
If there is more than one predicate, then multiple subnets are extracted analogously.
For instance, given the sentence "I happily study beautiful mathematics in beautiful China
with beautiful people," the substructure can be defined as in Figure 46.2.
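As an illustrative sketch (using a simplified subset of the sentence's relations, with relation names following RelEx conventions such as _subj, _obj and _amod), the substructure extraction might look like:

```python
# Hypothetical sketch of the Step-2 substructure: one nested record per
# predicate, built from RelEx-style binary relations. The parse is toy data.
from collections import defaultdict

def substructures(relations):
    preds = defaultdict(lambda: defaultdict(list))
    args = [(p, a) for (r, p, a) in relations if r in ("_subj", "_obj")]
    for pred, arg in args:
        # attach each argument's modifiers (here, adjectival _amod relations)
        mods = [m for (r, h, m) in relations if r == "_amod" and h == arg]
        preds[pred][arg] = mods
    return {p: dict(a) for p, a in preds.items()}

rels = [("_subj", "study", "I"),
        ("_obj", "study", "mathematics"),
        ("_amod", "mathematics", "beautiful")]
print(substructures(rels))
# {'study': {'I': [], 'mathematics': ['beautiful']}}
```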
For each of these substructures, Step 3 is supposed to match the substructures of a sentence
against its global memory (which contains a large body of previously encountered 'semantic
structure, syntactic/morphological realization' pairs) to find the most similar or same substruc-
tures and the relevant syntactic relations to generate a set of syntactic realizations, which may
be sentences or sentence fragments. In our current implementation, a customized subgraph
matching algorithm has been used to match the subnets from the parsed corpus at this step.
If Step 3 generated multiple fragments, they must be pieced together. In Step 4, the Link
Parser's dictionary has been used for detecting the dangling syntactic links corresponding to the
fragments, which can be used to integrate the multiple fragments. For instance, in the example
of Figure 46.3, according to the last 3 steps, SegSim would generate two fragments: "the parser
Fig. 46.1: An Overview of the SegSim Architecture for Language Generation
will ignore the sentence" and "whose length is too long". Then it consults the Link Parser's
dictionary, and finds that "whose" has a connector "Mr-", which is used for relative clauses
involving "whose", to connect to the previous noun "sentence". Analogously, we can integrate
the other fragments into a whole sentence.
Finally, a "cleanup" or "post-processing" phase is conducted, applying the correct inflections
to each word depending on the word properties provided by the input RelEx relations. For
example, we can use the RelEx relation "DEFINITE-FLAG(cover, T)" to insert the article "the"
in front of the word "cover". We have considered five factors in this version of NLGen: article,
Fig. 46.2: Example of a substructure
Fig. 46.3: Linkage of an example
noun plural, verb tense, possessive and query type (the latter only for interrogative
sentences).
In the "cleanup" step, we also use the chunk parser tool from OpenNLP4 for adjusting the
position of an article being inserted. For instance, consider the proto-sentence "I have big red
apple." If we use the RelEx relation "noun_number(apple, singular)" to inflect the word "apple"
directly, the final sentence will be "I have big red an apple", which is not well-formed. So we use
the chunk parser to detect the phrase "big red apple" first, then apply the article rule in front
of the noun phrase. This is a pragmatic approach which may be replaced with something more
elegant and principled in later revisions of the NLGen system.
4 http://opennlp.sourceforge.net/
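The article-placement rule described above can be sketched as follows; the phrase boundary that a real chunker (OpenNLP, in NLGen) would supply is here passed in by hand.

```python
# Toy sketch of the cleanup rule: insert an article before the start of the
# whole noun phrase, not directly before the noun. The chunk boundary
# (np_start) stands in for what a real chunker would detect.
def insert_article(words, np_start, article):
    return words[:np_start] + [article] + words[np_start:]

words = ["I", "have", "big", "red", "apple"]
# naive placement before the noun itself would yield the ill-formed
# "I have big red a apple"; chunking places it before "big red apple"
print(" ".join(insert_article(words, 2, "a")))  # I have a big red apple
```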
46.2.1 NLGen: Example Results
NLGen is currently in a relatively early stage of development, and does not handle the full range
of linguistic and semantic phenomena that it will when it's completed. However, it can already
express a variety of sentences encapsulating a variety of syntactic and semantic phenomena; in
this section we will give some specific examples of what it can do.
The SegSim approach performs sentence generation by matching portions of propositional
input to a large corpus of parsed sentences; therefore, when successful matches can be found
in the corpus, it can generate similar, well-formed sentences via the relevant syntactic
structures.
There currently do not exist any authoritative evaluation criteria for Natural Language
Generation systems. Among many complicating factors here is the fact that different language
generation systems have different kinds of inputs, depending on many things including their
application area. So it's complicated to compare the results of NLGen with those obtained by
other systems.
It is easier however to test whether NLGen is implementing SegSim successfully. One ap-
proach is to take a sentence, run it through RelEx to generate a set of relationships, and see
if NLGen can regenerate the sentence from the relationship. We show here the results of some
simple tests of this nature, performed with the current version of the system using a very small
test corpus for similarity matching. Note: In each of these example results, the input of NLGen
is the RelEx relationships produced by the sentence before "==>"; and the sentence to the
right side of "==>" was one of the sentences generated by NLGen.
Example 1:
I study mathematics at Xiamen University.
==> I study mathematics at Xiamen_University.
Example 2:
The red block is next to the blue cup.
==> The red block is next_to the blue cup.
Example 3:
Ruiting walked from the table to the door.
==> Ruiting walked to the door from the table.
Example 4:
I intend to walk very slowly from Ben to Ruiting,
with Hugo's pencil in my hand.
==> With Hugo's pencil in my hand,
I intend to walk very slowly from Ben to Ruiting.
Example 5:
I'd like you to tell me where the red ball is.
==> I tell like you to 'd me where the red ball is.
Example 6:
I study at Xiamen University, which is located in China.
==> I study at Xiamen_University.
Example 7:
What is next to the tree?
==> What is next to the tree?
Example 8:
Where is the red ball?
==> Where is the red ball?
Example 9:
Pass me the ball.
==> Pass me the ball.
Example 10:
Tell Ben where the red ball is.
==> Tell Ben where the red ball is.
To make the process clearer, for Example 1 we also show, in Figure 46.4, the RelEx relationships
produced from the sentence before the "==>".
For instance, in the example of Figure 46.3, according to the last 3 steps, SegSim would
generate two fragments: "the parser will ignore the sentence" and "whose length is too long".
Then it consults the Link Parser's dictionary, and finds that "whose" has a connector "Mr-",
which is used for relative clauses involving "whose", to connect to the previous noun "sentence".
Analogously, we can integrate the other fragments into a whole sentence.
Figure 46.4 shows the relationships of Example 1 fed to NLGen as input. The types of the
semantic relationships are documented on the RelEx wiki pages†.
These examples illustrate some key points about the current version of NLGen. It works
well on simple, commonplace sentences (Examples 1, 2), though it may sometimes reorder the
sentence fragments (Examples 3, 4). On the other hand, because of its reliance on matching
against a corpus, NLGen is incapable of forming good sentences with syntactic structures not
found in the corpus (Examples 5, 6); with a larger corpus these examples would have given
successful results. In Example 5, the odd error is due to the presence of too many "_subj"
RelEx relationships in the relationship-set corresponding to the sentence, which distracts the
matching process when it attempts to find similar substructures in the small test corpus.
Examples 7 to 10 show that NLGen still works well for interrogative and imperative sentences
if the extracted substructures can be matched; but these substructures may be similar to those
of assertive sentences, so they must be refined in the "cleanup" step. For example, the
substructures extracted for the sentence "are you a student?" are the same as the ones for
"you are a student?", since the two sentences both have the same binary RelEx relationships:
_subj (be, you)
_obj (be, student)
† http://opencog.org/wiki/RelEx#Relations_and_Features
Such question sentences are distinguished by flags like "TRUTH-QUERY-FLAG(be, T)", which means
that the referent "be" is a verb/event and the event involved is a question.
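The role of such a flag in the cleanup step can be illustrated with a tiny sketch; the function and its surface-ordering rule are invented for illustration, whereas the real system operates on link-parse structures:

```python
def realize_copula(subj, obj, truth_query=False):
    """Choose surface word order for a copula sentence.  The binary relations
    _subj(be, subj) and _obj(be, obj) are identical for the assertion and the
    question, so the TRUTH-QUERY-FLAG alone must decide whether to apply
    subject-auxiliary inversion."""
    if truth_query:
        return "are {} {}?".format(subj, obj)   # inverted: question
    return "{} are {}.".format(subj, obj)       # plain: assertion
```
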
The particular shortcomings demonstrated in these examples are straightforward to remedy within
the current NLGen framework, simply by expanding the corpus. However, to get truly general
behavior from NLGen it will be necessary to insert some other generation method to cover
those cases where similarity matching fails, as discussed above. The NLGen2 system created by
Blake Lemoine [Lem10] is one possibility in this regard: based on RelEx and the link parser,
it carries out rule-based generation using an implementation of Chomsky's Merge operator.
Integration of NLGen with NLGen2 is currently being considered. We note that the Merge
operator is computationally inefficient by nature, so that it will likely never be suitable as the
primary sentence generation method in a language generation system. However, pairing NLGen
for generation of familiar and routine utterances with a Merge-based approach for generation
of complex or unfamiliar utterances, may prove a robust approach.
46.3 Experiential Learning of Language Generation
As in the case of language comprehension, there are multiple ways to create an experiential
learning based language generation system, involving various levels of "wired in" knowledge.
Our best guess is that for generation, as for comprehension, a "tabula rasa" approach will prove
computationally intractable for quite some time to come, and that an approach in which some basic
structures and processes are provided, and then filled out with content learned via experience,
will provide the greatest odds of success.
A highly abstracted version of SegSim may be formulated as follows:
1. The AI system stores semantic and syntactic structures, and its control mechanism is biased
to search for, and remember, linkages between them
2. When it is given a new semantic structure to express, it first breaks this semantic structure
into natural parts, using inference based on whatever implications it has in its memory that
will serve this purpose
3. Its inference control mechanism is biased to carry out inferences with the following implica-
tion: For each of these parts, match it against its memory to find relevant pairs (which may
be full or partial matches), and use these pairs to generate a set of syntactic realizations
(which may be sentences or sentence fragments)
4. If the matching has failed to yield results with sufficient confidence, then (a) it returns
to Step 2 and carries out the breakdown into parts again. But if this has happened too
many times, then (b) it uses its ordinary inference control routine to try to determine the
syntactic realization of the part in question.
5. If the above step generated multiple fragments, they are pieced together, and an attempt
is made to infer, based on experience, whether the result will be effectively communicative.
If this fails, then Step 3 is tried again on one or more of the parts; or Step 2 is tried again.
6. Other inference-driven transformations may occur at any step of the process, but are par-
ticularly likely to occur at the end. In some languages these transformations may result in
the insertion of correct morphological forms or other "function words."
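The steps above can be condensed into a schematic control loop. In this sketch the stub functions (split, match, assemble, and so on) merely stand in for the inference and matching machinery, and all names are invented for illustration:

```python
def express(semantics, split, match, assemble, acceptable,
            infer_directly, max_retries=3):
    """Schematic SegSim loop: split the semantics into parts (step 2), realize
    each part by memory matching or, failing that, by direct inference
    (steps 3-4), assemble and check the result (step 5), and retry the whole
    attempt if it fails (steps 4a/5)."""
    for _ in range(max_retries):
        fragments = []
        for part in split(semantics):          # step 2: break into parts
            frag = match(part)                 # step 3: match against memory
            if frag is None:
                frag = infer_directly(part)    # step 4b: inference fallback
            fragments.append(frag)
        sentence = assemble(fragments)         # step 5: piece together
        if acceptable(sentence):               # step 5: communicative check
            return sentence
    return None

# A toy instantiation:
fragment_memory = {"happy": "I am happy", "hungry": "I am hungry"}
out = express("happy and hungry",
              split=lambda s: s.split(" and "),
              match=fragment_memory.get,
              assemble=lambda fs: ", and ".join(fs) + ".",
              acceptable=lambda s: s.endswith("."),
              infer_directly=lambda p: p)
```
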
What we suggest is that it may be interesting to supply a CogPrime system with this overall
process, and let it fill in the rest by experiential adaptation. In the case that the system is
learning to comprehend at the same time as it's learning to generate, this means that its early-
stage generations will be based on its rough, early-stage comprehension of syntax - but that's
OK. Comprehension and generation will then "grow up" together.
46.4 Sem2Syn
A subject of current research is the extension of the Syn2Sem approach mentioned above into
a reverse-order Sem2Syn system for language generation.
Given that the Syn2Sem rules are expressed as ImplicationLinks, they can be reversed automatically
and immediately, although the reversed versions will not necessarily have the same
truth values. So if a collection of Syn2Sem rules are learned from a corpus, then they can be
used to automatically generate a set of Sem2Syn rules, each tagged with a probabilistic truth
value. Application of the whole set of Sem2Syn rules to a given Atom-set in need of articulation,
will result in a collection of link-parse links.
To produce a sentence from such a collection of link-parse links, another process is also
needed, which will select a subset of the collection that corresponds to a complete sentence,
legally parsable via the link parser. The overall collection might naturally break down into more
than one sentence.
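The truth-value change under reversal follows standard probabilistic inversion: if the Syn2Sem rule A → B has strength s, the reversed Sem2Syn rule B → A gets strength s·P(A)/P(B) by Bayes' rule. A one-line sketch, with the numbers invented for illustration:

```python
def invert_rule(s_ab, p_a, p_b):
    """Strength of the reversed implication B->A, via Bayes' rule:
    s(B->A) = s(A->B) * P(A) / P(B), capped at 1."""
    return min(1.0, s_ab * p_a / p_b)

# E.g. a syntactic pattern seen in 10% of parses, implying (with strength 0.9)
# a semantic relation seen in 20% of parses:
reversed_strength = invert_rule(0.9, 0.1, 0.2)
```
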
In terms of the abstracted version of SegSim given above, the primary difference between
NLGen and Sem2Syn lies in Step 3: Sem2Syn replaces the SegSim "data-store matching" algorithm
with inference based on implications obtained from reversing the implications used for
language comprehension.
46.5 Conclusion
There are many different ways to do language generation within OpenCog, ranging from pure
experiential learning to a database-driven approach like NLGen. Each of these different ways
may have value for certain applications, and it's unclear which ones may be viable in a human-level
AGI context. Conceptually we would favor a pure experiential learning approach, but we
are currently exploring a "compromise" approach based on Sem2Syn. This is an area where
experimentation is going to tell us more than abstract theory.
Chapter 47
Embodied Language Processing
Co-authored with Samir Araujo and Welter Silva
47.1 Introduction
"Language" is an important abstraction — but one should never forget that it's an abstraction.
Language evolved in the context of embodied action, and even the most abstract language is
full of words and phrases referring to embodied experience. Even our mathematics is heavily
based on our embodied experience - geometry is about space; calculus is about space and
time; algebra is a sort of linguistic manipulation generalized from experience-oriented language,
etc. (see [ELM0] for detailed arguments in this regard). To consider language in the context of
human-like general intelligence, one needs to consider it in the context of embodied experience.
There is a large literature on the importance of embodiment for child language learning,
but perhaps the most eloquent case has been made by Michael Tomasello, in his excellent
book Constructing a Language [Tom03]. Citing a host of relevant research by himself and others,
Tomasello gives a very clear summary of the value of social interaction and embodiment for
language learning in human children. And while he doesn't phrase it in these terms, the picture
he portrays includes central roles for reinforcement, imitative and corrective learning. Imitative
learning is obvious: so much of embodied language learning has to do with the learner copying
what it has heard others say in similar contexts. Corrective learning occurs every time a parent
or peer rephrases something for a child.
In this chapter, after some theoretical discussion of the nature of symbolism and the role of
gesture and sound in language, we describe some computational experiments run with OpenCog
controlling virtual pets in a virtual world, regarding the use of embodied experience for anaphor
resolution and question-answering. These comprise an extremely simplistic example of the in-
terplay between language and embodiment, but have the advantage of concreteness, since they
were actually implemented and experimented with. Some of the specific OpenCog tools used in
these experiments are no longer current (e.g. the use of RelEx2Frame, which is now deprecated
in favor of alternative approaches to mapping parses into more abstract semantic relationships);
but the basic principles and flow illustrated here are still relevant to current and future work.
47.2 Semiosis
The foundation of communication is semiosis - the representation between the signifier and the
signified. Often the signified has to do with the external world or the communicating agent's
body; hence the critical role of embodiment in language.
Thus, before turning to the topic of embodied language use and learning per se, we will
briefly treat the related topic of how an AGI system may learn semiosis itself via its embodied
experience. This is a large and rich topic, but we will restrict ourselves to giving a few relatively
simple examples intended to make the principles clear. We will structure our discussion of
semiotic learning according to Charles Sanders Peirce's theory of semiosis [Pei34], in which
there are three basic types of signs: icons, indices and symbols.
In Peirce's ontology of semiosis, an icon is a sign that physically resembles what it stands
for. Representational pictures, for example, are icons because they look like the thing they
represent. Onomatopoeic words are icons, as they sound like the object or fact they signify. The
iconicity of an icon need not be immediately apparent. The fact that "kirikiriki" is iconic for
a rooster's crow is not obvious to English-speakers, yet it is to many Spanish-speakers; and
the converse is true for "cock-a-doodle-doo."
Next, an index is a sign whose occurrence probabilistically implies the occurrence of some
other event or object (for reasons other than the habitual usage of the sign in connection with
the event or object among some community of communicating agents). The index can be the
cause of the signified thing, or its consequence, or merely be correlated to it. For example, a
smile on your face is an index of your happy state of mind. Loud music and the sound of many
people moving and talking in a room is an index for a party in the room. On the whole, more
contextual background knowledge is required to appreciate an index than an icon.
Finally, any sign that is not an icon or index is a symbol. More explicitly, one may say that a
symbol is a sign whose relation to the signified thing is conventional or arbitrary. For instance,
the stop sign is a symbol for the imperative to stop; the word "dog" is a symbol for the concept
it refers to.
The distinction between the various types of signs is not always obvious, and some signs may
have multiple aspects. For instance, the thumbs-up gesture is a symbol for positive emotion
or encouragement. It is not an index: unlike a smile, which is an index for happiness because
smiling is intrinsically biologically tied to happiness, there is no intrinsic connection between the
thumbs-up signal and positive emotion or encouragement. On the other hand, one might argue
that the thumbs-up signal is very weakly iconic, in that its up-ness resembles the subjective
up-ness of a positive emotion (note that in English an idiom for happiness is "feeling up").
Teaching an embodied virtual agent to recognize simple icons is a relatively straightforward
learning task. For instance, suppose one wanted to teach an agent that in order to get the
teacher to give it a certain type of object, it should go to a box full of pictures and select a
picture of an object of that type, and bring it to the teacher. One way this may occur in an
OpenCog-controlled agent is for the agent to learn a rule of the following form:
ImplicationLink
    ANDLink
        ContextLink
            Visual
            SimilarityLink $X $Y
    PredictiveImplicationLink
        SequentialANDLink
            ExecutionLink goto box
            ExecutionLink grab $X
            ExecutionLink goto teacher
        EvaluationLink give me teacher $Y
While not a trivial learning problem, this is straightforward to a CogPrime -controlled agent
that is primed to consider visual similarities as significant (i.e. is primed to consider the visual-
appearance context within its search for patterns in its experience).
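Applying the learned implication amounts to maximizing visual similarity over the contents of the box. As a toy sketch, with the feature sets and the Jaccard similarity measure invented purely for illustration:

```python
def visual_similarity(a, b):
    """Jaccard overlap between two sets of visual features."""
    return len(a & b) / len(a | b)

def pick_picture(desired_features, box):
    """Return the name of the picture in the box most visually similar to the
    desired object -- i.e. the $X maximizing Similarity($X, $Y)."""
    return max(box, key=lambda name: visual_similarity(desired_features,
                                                       box[name]))

box = {
    "apple_picture":  {"red", "round", "small"},
    "banana_picture": {"yellow", "long", "curved"},
}
choice = pick_picture({"red", "round", "shiny"}, box)
```
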
Next, proceeding from icons to indices: Suppose one wanted to teach an agent that in order
to get the teacher to give it a certain type of object, it should go to a box full of pictures and
select a picture of an object that has commonly been used together with objects of that type,
and bring it to the teacher. This is a combination of iconic and indexical semiosis, and would
be achieved via the agent learning a rule of the form
Implication
    AND
        Context
            Visual
            Similarity $X $Z
        Context
            Experience
            SpatioTemporalAssociation $Z $Y
    PredictiveImplication
        SequentialAnd
            Execution goto box
            Execution grab $X
            Execution goto teacher
        Evaluation give me teacher $Y
Symbolism, finally, may be seen to emerge as a fairly straightforward extension of indexing.
After all, how does an agent come to learn that a certain symbol refers to a certain entity?
An advanced linguistic agent can learn this via explicit verbal instruction, e.g. one may tell it
"The word 'hideous' means 'very, ugly'." But in the early stages of language learning, this sort
of instructional device is not available, and so the way an agent learns that a word is associated
with an object or an action is through spatiotemporal association. For instance, suppose the
teacher wants to teach the agent to dance every time the teacher says the word "dance" - a very
simple example of symbolism. Assuming the agent already knows how to dance, this merely
requires the agent learn the implication
PredictiveImplication
    SequentialAND
        Evaluation say teacher me "dance"
        Execution dance
    Evaluation give teacher me Reward
And, once this has been learned, then simultaneously the relationship
SpatioTemporalAssociation dance "dance"
will be learned. What's interesting is what happens after a number of associations of this nature
have been learned. Then, the system may infer a general rule of the form
Implication
    AND
        SpatioTemporalAssociation $X $Z
        HasType $X GroundedSchema
    PredictiveImplication
        SequentialAND
            Evaluation say teacher me $Z
            Execution $X
        Evaluation give teacher me Reward
This implication represents the general rule that if the teacher says a word corresponding to an
action the agent knows how to do, and the agent does it, then the agent may get a reward from
the teacher. Abstracting this from a number of pertinent examples is a relatively straightforward
feat of probabilistic inference for the PLN inference engine.
Of course, the above implication is overly simplistic, and would lead an agent to stupidly start
walking every time its teacher used the word "walk" in conversation and the agent overheard
it. To be useful in a realistic social context, the implication must be made more complex so as
to include some of the pragmatic surround in which the teacher utters the word or phrase $Z.
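The abstraction step can be caricatured as simple statistics over observed episodes. The episode format and the thresholds below are invented for illustration; in CogPrime the analogous generalization is performed by PLN over Atoms such as those above:

```python
from collections import defaultdict

def learn_word_action_rules(episodes, min_count=2, min_rate=0.75):
    """Each episode records (word the teacher said, action the agent took,
    whether a reward followed).  Return the {word: action} pairs whose reward
    rate clears the threshold -- a crude stand-in for PLN's abstraction of
    the word->action->Reward implication."""
    stats = defaultdict(lambda: [0, 0])   # (word, action) -> [rewards, trials]
    for word, action, rewarded in episodes:
        stats[(word, action)][1] += 1
        stats[(word, action)][0] += rewarded
    return {w: a for (w, a), (r, n) in stats.items()
            if n >= min_count and r / n >= min_rate}

episodes = [("dance", "dance", True), ("dance", "dance", True),
            ("walk", "walk", True), ("walk", "walk", False),
            ("walk", "sit", False)]
rules = learn_word_action_rules(episodes)
```

Raising the thresholds is one crude guard against the "superstitious" over-triggering discussed above, though the real remedy is conditioning on pragmatic context.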
47.3 Teaching Gestural Communication
Based on the ideas described above, it is relatively straightforward to teach virtually embodied
agents the elements of gestural communication. This is important for two reasons: gestural
communication is extremely useful unto itself, as one sees from its role in communication among
young children and primates [22]; and gestural communication forms a foundation for verbal
communication, during the typical course of human language learning [23]. Note for instance
the study described in [22], which "reports empirical longitudinal data on the early stages of
language development," concluding that
...the output systems of speech and gesture may draw on underlying brain mechanisms com-
mon to both language and motor functions. We analyze the spontaneous interaction with their
parents of three typically-developing children (2 M, 1 F) videotaped monthly at home between
10 and 23 months of age. Data analyses focused on the production of actions, representational
and deictic gestures and words, and gesture-word combinations. Results indicate that there is
a continuity between the production of the first action schemes, the first gestures and the first
words produced by children. The relationship between gestures and words changes over time.
The onset of two-word speech was preceded by the emergence of gesture-word combinations.
If young children learn language as a continuous outgrowth of gestural communication, per-
haps the same approach may be effective for (virtually or physically) embodied AI's.
An example of an iconic gesture occurs when one smiles explicitly to illustrate to some other
agent that one is happy. Smiling is a natural expression of happiness, but of course one doesn't
always smile when one's happy. The reason that explicit smiling is iconic is that the explicit
smile actually resembles the unintentional smile, which is what it "stands for."
This kind of iconic gesture may emerge in a socially-embedded learning agent through a very
simple logic. Suppose that when the agent is happy, it benefits from its nearby friends being
happy as well, so that they may then do happy things together. And suppose that the agent
has noticed that when it smiles, this has a statistical tendency to make its friends happy. Then,
when it is happy and near its friends, it will have a good reason to smile. So through very
simple probabilistic reasoning, the use of explicit smiling as a communicative tool may result.
But what if the agent is not actually happy, but still wants some other agent to be happy?
Using the reasoning from the prior paragraph, it will likely figure out to smile to make the
other agent happy - even though it isn't actually happy.
Another simple example of an iconic gesture would be moving one's hands towards one's
mouth, mimicking the movements of feeding oneself, when one wants to eat. Many analogous
iconic gestures exist, such as doing a small solo part of a two-person dance to indicate that one
wants to do the whole dance together with another person. The general rule an agent needs to
learn in order to generate iconic gestures of this nature is that, in the context of shared activity,
mimicking part of a process will sometimes serve the function of evoking that whole process.
This sort of iconic gesture may be learned in essentially the same way as an indexical gesture
such as a dog repeatedly drawing the owner's attention to the owner's backpack, when the dog
wants to go outside. The dog doesn't actually care about going outside with the backpack - he
would just as soon go outside without it - but he knows the backpack is correlated with going
outside, which is his actual interest.
The general rule here is
R :=
Implication
    SimultaneousImplication
        Execution $X
        $Y
    PredictiveImplication
        Execution $X
        $Y
I.e., if doing $X often correlates with $Y, then maybe doing $X will bring about $Y. This sort of
rule can bring about a lot of silly "superstitious" behavior but also can be particularly effective
in social contexts, meaning in formal terms that
Context
    near_teacher
    R

holds with a higher truth value than R itself. This is a very small conglomeration of semantic
nodes and links, yet it encapsulates a very important communicational pattern: that if you want
something to happen, and act out part of it - or something historically associated with it -
around your teacher, then the thing may happen.
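The claim that R holds more strongly in the near-teacher context is just a statement about conditional probabilities, which can be sketched as follows (the observation data is invented for illustration):

```python
def rule_strength(observations, context=None):
    """P(whole process followed | part acted out), optionally restricted to
    trials where a given context label held."""
    trials = [(ctxs, followed) for ctxs, followed in observations
              if context is None or context in ctxs]
    if not trials:
        return 0.0
    return sum(followed for _, followed in trials) / len(trials)

# Each observation: (context labels, did the whole process follow?)
obs = [({"near_teacher"}, True), ({"near_teacher"}, True),
       ({"near_teacher"}, False), ({"alone"}, False), ({"alone"}, False)]
overall = rule_strength(obs)                    # truth value of R itself
social = rule_strength(obs, "near_teacher")     # R within Context near_teacher
```

Here the contextual strength exceeds the overall strength, which is exactly the relationship asserted above.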
Many other cases of iconic gesture are more complex and mix iconic with symbolic aspects.
For instance, one waves one hand away from oneself, to try to get someone else to go away. The
hand is moving, roughly speaking, in the direction one wants the other to move in. However,
understanding the meaning of this gesture requires a bit of savvy or experience. Once one does
grasp it, however, one can understand its nuances: for instance, if I wave my hand in an
arc leading from your direction toward the direction of the door, maybe that means I want you
to go out the door.
Purely symbolic (or nearly so) gestures include the thumbs-up symbol mentioned above, and
many others including valence-indicating symbols like a nodded head for YES, a shaken-side-to-
side head for NO, and shrugged shoulders for "I don't know." Each of these valence-indicating
symbols actually indicates a fairly complex concept, which is learned from experience partly via
attention to the symbol itself. So, an agent may learn that the nodded head corresponds with
situations where the teacher gives it a reward, and also with situations where the agent makes
a request and the teacher complies. The cluster of situations corresponding to the nodded
head then forms the agent's initial concept of "positive valence," which encompasses, loosely
speaking, both the good and the true.
Summarizing our discussion of gestural communication: An awful lot of language exists
between intelligent agents even if no word is ever spoken. And, our belief is that these sorts
of non-verbal semiosis form the best possible context for the learning of verbal language, and
that to attack verbal language learning outside this sort of context is to make an intrinsically-
difficult problem even harder than it has to be. And this leads us to the final part of the
chapter, which is a bit more speculative and adventuresome. The material in this section and
the prior ones describes experiments of the sort we are currently carrying out with our virtual
agent control software. We have not yet demonstrated all the forms of semiosis and non-linguistic
communication described in the last section using our virtual agent control system, but we have
demonstrated some of them and are actively working on extending our system's capabilities. In
the following section, we venture a bit further into the realm of hypothesis and describe some
functionalities that are beyond the scope of our current virtual agent control software, but that
we hope to put into place gradually during the next 1-2 years. The basic goal of this work is to
move from non-verbal to verbal communication.
It is interesting to enumerate the aspects in which each of the above components appears to
be capable of tractable adaptation via experiential, embodied learning:
• Words and phrases that are found to be systematically associated with particular objects
in the world, may be added to the "gazetteer list" used by the entity extractor
• The link parser dictionary may be automatically extended. In cases where the agent hears
a sentence that is supposed to describe a certain situation, and realizes that in order for the
sentence to be mapped into a set of logical relationships accurately describing the situation,
it would be necessary for a certain word to have a certain syntactic link that it doesn't
have, then the link parser dictionary may be modified to add the link to the word. (On the
other hand, creating new link parser link types seems like a very difficult sort of learning -
not to say it is unaddressable, but it will not be our focus in the near term.)
• Similar to with the link parser dictionary, if it is apparent that to interpret an utterance in
accordance with reality a RelEx rule must be added or modified, this may be automatically
done. The RelEx rules are expressed in the format of relatively simple logical implications
between Boolean combinations of syntactic and semantic relationships, so that learning and
modifying them is within the scope of a probabilistic logic system such as OpenCogPrime's
PLN inference engine.
• The rules used by RelEx2Frame may be experientially modified quite analogously to those
used by RelEx
• Our current statistical parse ranker ranks an interpretation of a sentence based on the
frequency of occurrence of its component links across a parsed corpus. A deeper approach,
however, would be to rank an interpretation based on its commonsensical plausibility, as
inferred from experienced-world-knowledge as well as corpus-derived knowledge. Again, this
is within the scope of what an inference engine such as PLN should be able to do.
• Our word sense disambiguation and reference resolution algorithms involve probabilistic
estimations that could be extended to refer to the experienced world as well as to a parsed
corpus. For example, in assessing which sense of the noun "run" is intended in a certain
context, the system could check whether stockings, or sports-events or series-of-events, are
more prominent in the currently-observed situation. In assessing the sentence "The children
kicked the dogs, and then they laughed," the system could map "they" into "children" via
experientially-acquired knowledge that children laugh much more often than dogs.
• NLGen uses the link parser dictionary, treated above, and also uses rules analogous to (but
inverse to) RelEx rules, mapping semantic relations into brief word-sequences. The "gold
standard" for NLGen is whether, when it produces a sentence S from a set R of semantic
relationships, the feeding of S into the language comprehension subsystem produces R
(or a close approximation) as output. Thus, as the semantic mapping rules in RelEx and
RelEx2Frame adapt to experience, the rules used in NLGen must adapt accordingly, which
poses an inference problem unto itself.
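As a concrete caricature of the experience-grounded word sense disambiguation suggested in the bullets above: pick the sense of an ambiguous word whose associated objects are most prominent in the currently observed situation. The sense inventories and the overlap score here are invented for illustration:

```python
# Toy sense inventory: each sense of "run" is associated with objects that
# experience has linked to it (entries invented for this example).
SENSES = {
    "run": {
        "stocking_tear": {"stocking", "fabric", "leg"},
        "footrace":      {"track", "runner", "finish_line"},
    },
}

def disambiguate(word, observed_objects):
    """Return the sense whose associated objects overlap most with the
    currently observed scene."""
    return max(SENSES[word],
               key=lambda s: len(SENSES[word][s] & observed_objects))
```

The same scene-overlap idea extends to reference resolution, e.g. preferring "children" over "dogs" as the referent of "they" in a laughing context.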
All in all, when one delves in detail into the components that make up our hybrid
statistical/rule-based NLP system, one sees there is a strong opportunity for experiential adap-
tive learning to substantially modify nearly every aspect of the NLP system, while leaving the
basic framework intact.
This approach, we suggest, may provide means of dealing with a number of problems that
have systematically vexed existing linguistic approaches. One example is parse ranking for com-
plex sentences: this seems almost entirely a matter of the ability to assess the semantic plausi-
bility of different parses, and doing this based on statistical corpus analysis seems unreasonable.
One needs knowledge about a world to ground reasoning about plausibility.
Another example is preposition disambiguation, a topic that is barely dealt with at all in
the computational linguistics literature (see e.g. [33] for an indication of the state of the art).
Consider the problem of assessing which meaning of "with" is intended in sentences like "I ate
dinner with a fork", "I ate dinner with my sister", "I ate dinner with dessert." In performing
this sort of judgment, an embodied system may use knowledge about which interpretations
have matched observed reality in the case of similar utterances it has processed in the past,
and for which it has directly seen the situations referred to by the utterances. If it has seen
in the past, through direct embodied experience, that when someone said "I ate cereal with a
spoon," they meant that the spoon was their tool not part of their food or their eating-partner;
then when it hears "I ate dinner with a fork," it may match "cereal" to "dinner" and "spoon"
to "fork" (based on probabilistic similarity measurement) and infer that the interpretation
of "with" in the latter sentence should also be to denote a tool. How does this approach to
computational language understanding tie in with gestural and general semiotic learning as we
discussed earlier? The study of child language has shown that early language use is not purely
verbal by any means, but is in fact a complex combination of verbal and gestural communication
[23]. With the exception of the first bullet point (entity extraction) above, every one of our instances
of experiential modification of our language framework listed above involves the use of an
understanding of what situation actually exists in the world, to help the system identify what
the logical relationships output by the NLP system are supposed to be in a certain context.
But a large amount of early-stage linguistic communication is social in nature, and a large
amount of the remainder has to do with the body's relationship to physical objects. And, in
understanding "what actually exists in the world" regarding social and physical relationships, a
full understanding of gestural communication is important. So, the overall pathway we propose
for achieving robust, ultimately human-level NLP functionality is as follows:
• The capability for learning diverse instances of semiosis is established
• Gestural communication is mastered, via nonverbal imitative/reinforcement/corrective
learning mechanisms such as we utilized for our embodied virtual agents
• Gestural communication, combined with observation of and action in the world and verbal
interaction with teachers, allows the system to adapt numerous aspects of its initial NLP
engine to allow it to more effectively interpret simple sentences pertaining to social and
physical relationships
• Finally, given the ability to effectively interpret and produce these simple and practical
sentences, probabilistic logical inference allows the system to gradually extend this ability
to more and more complex and abstract senses, incrementally adapting aspects of the NLP
engine as its scope broadens.
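The kind of experience-driven disambiguation described above, inferring that "with a fork" denotes a tool by analogy with the directly observed "with a spoon", can be sketched roughly as follows. All data structures, similarity scores and function names here are invented for illustration; they stand in for the system's learned probabilistic similarity measurement:

```python
# Hypothetical sketch: transferring an experientially learned interpretation
# of "with" to a new sentence via probabilistic word similarity.

# Interpretations of "with" observed through embodied experience,
# keyed by (verb, object, complement) context.
observed = {
    ("ate", "cereal", "spoon"): "instrument",      # the spoon was a tool
    ("walked", "park", "Mary"): "accompaniment",
}

# Invented similarity table standing in for a learned similarity measure.
similarity = {
    ("cereal", "dinner"): 0.8,
    ("spoon", "fork"): 0.9,
    ("cereal", "park"): 0.1,
    ("spoon", "Mary"): 0.05,
}

def sim(a, b):
    if a == b:
        return 1.0
    return similarity.get((a, b), similarity.get((b, a), 0.0))

def interpret_with(verb, obj, comp):
    """Pick the interpretation whose remembered context best matches."""
    best, best_score = None, 0.0
    for (v, o, c), label in observed.items():
        # Discount matches where even the verb differs.
        score = sim(o, obj) * sim(c, comp) * (1.0 if v == verb else 0.5)
        if score > best_score:
            best, best_score = label, score
    return best

print(interpret_with("ate", "dinner", "fork"))  # instrument
```

Here "cereal" matches "dinner" and "spoon" matches "fork" strongly enough that the tool interpretation of "with" carries over, exactly as in the example in the text.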
In this brief section we will mention another potentially important factor that we have
intentionally omitted in the above analysis - but that may wind up being very important,
and that can certainly be taken into account in our framework if this proves necessary. We
have argued that gesture is an important predecessor to language in human children, and that
incorporating it in AI language learning may be valuable. But there is another aspect of early
language use that plays a similar role to gesture, which we have left out in the above discussion:
this is the acoustic aspects of speech.
Clearly, pre-linguistic children make ample use of communicative sounds of various sorts.
These sounds may be iconic, indexical or symbolic; and they may have a great deal of sub-
tlety. Steven Mithen [Mit96] has argued that non-verbal utterances constitute a kind of proto-
language, and that both music and language evolved out of this. Their role in language learning
is well-known. We are uncertain as to whether an exclusive focus on text rather than speech
would critically impair the language learning process of an AI system. We are fairly strongly
convinced of the importance of gesture because it seems bound up with the importance of
semiosis - gesture, it seems, is how young children learn flexible semiotic communication skills,
and then these skills are gradually ported from the gestural to the verbal domain. Semioti-
cally, on the other hand, phonology doesn't seem to give anything special beyond what gesture
gives. What it does give is an added subtlety of emotional expressiveness - something that is
largely missing from virtual agents as implemented today, due to the lack of really fine-grained
facial expressions. Also, it provides valuable clues to parsing, in that groups of words that are
syntactically bound together are often phrased together acoustically.
If one wished to incorporate acoustics into the framework described above, it would not
be objectionably difficult on a technical level. Speech-to-text and text-to-speech software both
exist, but neither have been developed with a view specifically toward conveyance of emotional
information. One could approach the problem of assessing the emotional state of an utterance
based on its sound as a supervised categorization problem, to be solved via supplying a machine
learning algorithm with training data consisting of human-created pairs of the form (utterance,
emotional valence). Similarly, one could tune the dependence of text-to-speech software for
appropriate emotional expressiveness based on the same training corpus.
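The supervised-categorization approach proposed here can be sketched very simply. The acoustic features, training pairs and labels below are all invented for illustration, and a nearest-centroid classifier stands in for whatever machine learning algorithm one actually supplies with the (utterance, emotional valence) training data:

```python
# Sketch of learning emotional valence from acoustic features of utterances.
# Each training pair: (acoustic feature vector, emotional valence label).
# Features are assumed to be (mean_pitch_hz, pitch_variance, loudness).
training = [
    ((310.0, 80.0, 0.9), "excited"),
    ((300.0, 75.0, 0.8), "excited"),
    ((140.0, 10.0, 0.3), "sad"),
    ((150.0, 12.0, 0.2), "sad"),
]

def centroids(pairs):
    """Average the feature vectors belonging to each label."""
    sums, counts = {}, {}
    for vec, label in pairs:
        s = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            s[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: tuple(x / counts[lab] for x in s) for lab, s in sums.items()}

def classify(vec, cents):
    """Assign the label whose centroid is nearest in feature space."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(cents, key=lambda lab: dist(vec, cents[lab]))

cents = centroids(training)
print(classify((290.0, 70.0, 0.85), cents))  # excited
```

The same labeled corpus could then be used in the reverse direction, to tune text-to-speech output toward the acoustic profile of the desired emotional state.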
47.4 Simple Experiments with Embodiment and Anaphor Resolution
Now we turn to some fairly simple practical work that was done in 2008 with the OpenCog-based
PetBrain software, involving the use of virtually embodied experience to help with interpretation
of linguistic utterances. This work has been superseded somewhat by more recent work using
OpenCog to control virtual agents; but the PetBrain work was especially clear and simple, so
suitable in an expository sense for in-depth discussion here.
One of the two ways the PetBrain related language processing to embodied experience was via
using the latter to resolve anaphoric references in text produced by human-controlled avatars.
The PetBrain controlled agent lived in a world with many objects, each one with their own
characteristics. For example, we could have multiple balls, with varying colors and sizes. We
represent this in the OpenCog Atomspace via using multiple nodes: a single ConceptNode to
represent the concept "ball", a WordNode associated with the word "ball", and numerous
SemeNodes representing particular balls. There may of course also be ConceptNodes representing
ball-related ideas not summarized in any natural language word, e.g. "big fat squishy balls,"
"balls that can usefully be hit with a bat", etc.
As the agent interacts with the world, it acquires information about the objects it finds,
through perceptions. The perceptions associated with a given object are stored as other nodes
linked to the node representing the specific object instance. All this information is represented
in the Atomspace using FrameNet-style relationships (exemplified in the next section).
When the user says, e.g., "Grab the red ball", the agent needs to figure out which specific
ball the user is referring to - i.e. it needs to invoke the Reference Resolution (RR) process. RR
uses the information in the sentence to select instances and also a few heuristic rules. Broadly
speaking, Reference Resolution maps nouns in the user's sentences to actual objects in the
virtual world, based on world-knowledge obtained by the agent through perceptions.
In this example, first the brain selects the ConceptNodes related to the word "ball". Then
it examines all individual instances associated with these concepts, using the determiners in
the sentence along with other appropriate restrictions (in this example the determiner is the
adjective "red"; and since the verb is "grab" it also looks for objects that can be fetched). If it
finds more than one "fetchable red ball", a heuristic is used to select one (in this case, it chooses
the nearest instance).
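The Reference Resolution process just described can be sketched as a filter-then-rank procedure. The object records, field names and the `resolve_reference` function are illustrative assumptions, not the actual OpenCog representation:

```python
# Minimal sketch of the Reference Resolution heuristic: filter object
# instances by the sentence's determiners and verb-derived restrictions,
# then pick the nearest remaining instance.
objects = [
    {"id": "ball_99", "kind": "ball", "color": "red",  "fetchable": True, "dist": 2.0},
    {"id": "ball_07", "kind": "ball", "color": "red",  "fetchable": True, "dist": 5.0},
    {"id": "ball_12", "kind": "ball", "color": "blue", "fetchable": True, "dist": 1.0},
]

def resolve_reference(kind, color=None, must_be_fetchable=False):
    candidates = [o for o in objects if o["kind"] == kind]
    if color:
        candidates = [o for o in candidates if o["color"] == color]
    if must_be_fetchable:
        candidates = [o for o in candidates if o["fetchable"]]
    if not candidates:
        return None
    # Heuristic from the text: choose the nearest instance.
    return min(candidates, key=lambda o: o["dist"])["id"]

# "Grab the red ball" -> determiner "red", verb "grab" implies fetchability.
print(resolve_reference("ball", color="red", must_be_fetchable=True))  # ball_99
```

Note that the blue ball is nearer but is excluded by the determiner before the distance heuristic is ever consulted.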
The agent also needs to map pronouns in the sentences to actual objects in the virtual world.
For example, if the user says "I like the red ball. Grab it.", the agent must map the pronoun
"it" to a specific red ball. This process is done in two stages: first using anaphor resolution to
associate the pronoun "it" with the previously heard noun "ball"; then using reference resolution
to associate the noun "ball" with the actual object.
The subtlety of anaphor resolution is that there may be more than one plausible "candidate"
noun corresponding to a given pronoun. As noted above, at the time of writing RelEx's anaphor
resolution system is somewhat simplistic and is based on the classical Hobbs algorithm [Hob78].
Basically, when a pronoun (it, he, she, they and so on) is identified in a sentence, the Hobbs
algorithm searches through recent sentences to find the nouns that fit this pronoun according
to number, gender and other characteristics. The Hobbs algorithm is used to create a ranking
of candidate nouns, ordered by time (most recently mentioned nouns come first).
We improved the Hobbs algorithm results by using the agent's world-knowledge to help
choose the best candidate noun. Suppose the agent heard the sentences:
"The ball is red."
"The stick is brown."
and then it receives a third sentence
"Grab it.".
The anaphor resolver will build a list containing two options for the pronoun "it" of the third
sentence: ball and stick. Given that the stick corresponds to the most recently mentioned noun,
the agent will grab it, as the Hobbs ranking would suggest, instead of the ball.
Similarly, if the agent's history contains
"From here I can see a tree and a ball."
"Grab it."
The Hobbs algorithm returns as candidate nouns "tree" and "ball", in this order. But using our
integrative Reference Resolution process, the agent will conclude that a tree cannot be grabbed,
so this candidate is discarded and "ball" is chosen.
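This improvement, taking the Hobbs-style recency ranking and discarding candidates that world-knowledge rules out, can be sketched as follows. The affordance table and function names are illustrative assumptions:

```python
# Sketch of anaphor resolution improved by embodied world-knowledge:
# walk down the Hobbs candidate ranking and return the first noun whose
# referent actually affords the requested action.
affordances = {
    "ball": {"grab"},
    "stick": {"grab"},
    "tree": set(),     # a tree cannot be grabbed
}

def resolve_pronoun(hobbs_ranking, verb):
    """hobbs_ranking: candidate nouns, best-ranked (most salient) first."""
    for noun in hobbs_ranking:
        if verb in affordances.get(noun, set()):
            return noun
    return None

# "From here I can see a tree and a ball." / "Grab it."
print(resolve_pronoun(["tree", "ball"], "grab"))  # ball
```

When both candidates afford the action, the plain Hobbs ordering is preserved; world-knowledge only prunes impossible referents.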
47.5 Simple Experiments with Embodiment and Question Answering
The PetBrain was also capable of answering simple questions about its feelings/emotions (hap-
piness, fear, etc.) and about the environment in which it lives. After a question is asked to the
agent, it is parsed by RelEx and classified as either a truth question or a discursive one. After
that, RelEx rewrites the given question as a list of Frames (based on FrameNet, with some
customizations), which represent its semantic content. The Frames version of the question is
then processed by the agent and the answer is also written in Frames. The answer Frames are
then sent to a module that converts it back to the RelEx format. Finally the answer, in RelEx
format, is processed by the NLGen module, that generates the text of the answer in English.
We will discuss this process here in the context of the simple question "What is next to the
tree?", which in an appropriate environment receives the answer "The red ball is next to the
tree."
Question answering (QA) of course has a long history in AI [May04], and our approach
fits squarely into the tradition of "deep semantic QA systems"; however it is innovative in its
combination of dependency parsing with FrameNet and most importantly in the manner of its
integration of QA with an overall cognitive architecture for agent control.
47.5.1 Preparing/Matching Frames
In order to answer an incoming question, the agent tries to match the Frames list, created by
RelEx, against the Frames stored in its own memory. In general these Frames could come from
a variety of sources, including inference, concept creation and perception; but in the current
PetBrain they primarily come from perception, and simple transformations of perceptions.
However, the agent cannot use the incoming perceptual Frames in their original format
because they lack grounding information (information that connects the mentioned elements to
• http://framenet.icsi.berkeley.edu
the real elements of the environment). So, two steps are then executed before trying to match
the Frames: Reference Resolution (described above) and Frames Rewriting. Frames Rewriting
is a process that changes the values of the incoming Frame elements into grounded values.
Here is an example:
Incoming Frame (Generated by RelEx)

EvaluationLink
   DefinedFrameElementNode Color:Color
   WordInstanceNode "red@aaa"
EvaluationLink
   DefinedFrameElementNode Color:Entity
   WordInstanceNode "ball@bbb"
ReferenceLink
   WordInstanceNode "red@aaa"
   WordNode "red"

After Reference Resolution

ReferenceLink
   WordInstanceNode "ball@bbb"
   SemeNode "ball_99"

Grounded Frame (After Rewriting)

EvaluationLink
   DefinedFrameElementNode Color:Color
   ConceptNode "red"
EvaluationLink
   DefinedFrameElementNode Color:Entity
   SemeNode "ball_99"
Frame Rewriting serves to convert the incoming Frames to the same structure used by the
Frames stored in the agent's memory. After Rewriting, the new Frames are then matched
against the agent's memory; if all Frames were found in it, the answer is known by the
agent, otherwise it is unknown.
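The Frames Rewriting step illustrated above, swapping ungrounded word-instance values for grounded nodes, amounts to a simple substitution pass. The dictionary-based layout below is an assumption made for illustration; the node names follow the example in the text:

```python
# Sketch of Frames Rewriting: replace WordInstanceNode values in the
# incoming frame elements with the grounded nodes found by Reference
# Resolution (SemeNodes for world objects, ConceptNodes for properties).
incoming_frame = {
    "Color:Color":  ("WordInstanceNode", "red@aaa"),
    "Color:Entity": ("WordInstanceNode", "ball@bbb"),
}

# Grounding links produced by Reference Resolution and the lexicon.
grounding = {
    "red@aaa":  ("ConceptNode", "red"),
    "ball@bbb": ("SemeNode", "ball_99"),
}

def rewrite(frame, grounding):
    out = {}
    for element, (_node_type, name) in frame.items():
        out[element] = grounding[name]  # swap in the grounded node
    return out

print(rewrite(incoming_frame, grounding))
```

After this pass the frame has the same structure as the frames stored in the agent's memory, so matching reduces to structural comparison.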
In the PetBrain system, if a truth question was posed and all Frames were matched suc-
cessfully, the answer would be "yes"; otherwise the answer is "no". Mapping of ambiguous
matching results into ambiguous responses was not handled in the PetBrain.
If the question requires a discursive answer the process is slightly different. For known answers
the matched Frames are converted into RelEx format by Frames2RelEx and then sent to NLGen,
which prepares the final English text to be answered. There are two types of unknown answers.
The first one is when at least one name cannot be matched against the agent's memory and
the answer is "I don't know". And the second type of unknown answer occurs when all Frames
were matched successfully but they cannot be correctly converted into RelEx format or NLGen
cannot identify the incoming relations. In this case the answer is "I know the answer, but I
don't know how to say it".
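The answer-selection logic just described can be summarized as a small decision procedure. The boolean inputs and function name below are illustrative, not the actual OpenCog API:

```python
# Sketch of the PetBrain question-answering decision logic described above.
def answer(all_frames_matched, frames_converted, nlgen_ok, truth_question):
    if truth_question:
        # Truth questions: successful matching means "yes".
        return "yes" if all_frames_matched else "no"
    if not all_frames_matched:
        # At least one Frame absent from the agent's memory.
        return "I don't know"
    if not (frames_converted and nlgen_ok):
        # Matched, but conversion to RelEx or generation failed.
        return "I know the answer, but I don't know how to say it"
    return "<generated sentence>"

print(answer(True, True, True, truth_question=False))
print(answer(False, False, False, truth_question=False))
print(answer(True, False, False, truth_question=False))
```

The placeholder `"<generated sentence>"` stands for the English text produced by NLGen in the successful case.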
47.5.2 Frames2RelEx
As mentioned above, this module is responsible for receiving a list of grounded Frames and
returning another list containing the relations, in RelEx format, which represent the grammat-
ical form of the sentence described by the given Frames. That is, the Frames list represents a
sentence that the agent wants to say to another agent. NLGen needs an input in RelEx Format
in order to generate an English version of the sentence; Frames2RelEx does this conversion.
Currently, Frames2RelEx is implemented as a rule-based system in which the preconditions
are the required frames and the output is one or more RelEx relations e.g.
Color(Entity, Color) =>
   present($2) .a($2) adj($2) _predadj($1, $2)
   definite($1) .n($1) noun($1) singular($1)
   .v(be) verb(be) punctuation(.) det(the)
where the precondition comes before the symbol => and Color is a frame which has two
elements: Entity and Color. Each element is interpreted as a variable Entity = $1 and Color =
$2. The effect, or output of the rule, is a list of RelEx relations. As in the case of RelEx2Frame,
the use of hand-coded rules is considered a stopgap, and for a powerful AGI system based on
this framework such rules will need to be learned via experience.
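A rule firing of this kind can be sketched as precondition matching plus variable substitution. The rule encoding and function names below are illustrative assumptions mimicking the example rule above, not the actual Frames2RelEx implementation:

```python
# Sketch of a Frames2RelEx rule: the precondition names a frame type whose
# elements bind to variables ($1, $2, ...), and the output is a list of
# RelEx relation templates with those variables substituted.
rule = {
    "precondition": ("Color", ["Entity", "Color"]),   # Entity=$1, Color=$2
    "output": ["present($2)", "_predadj($1, $2)",
               "definite($1)", "singular($1)"],
}

def apply_rule(rule, frame):
    frame_type, elements = rule["precondition"]
    if frame["type"] != frame_type:
        return None                      # precondition not satisfied
    bindings = {f"${i + 1}": frame[el] for i, el in enumerate(elements)}
    out = []
    for template in rule["output"]:
        s = template
        for var, val in bindings.items():
            s = s.replace(var, val)      # substitute element values
        out.append(s)
    return out

frame = {"type": "Color", "Entity": "ball", "Color": "red"}
print(apply_rule(rule, frame))
```

For the frame {Entity: ball, Color: red} this yields relations such as `_predadj(ball, red)`, which NLGen can then turn into "The ball is red."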
47.5.3 Example of the Question Answering Pipeline
Turning to the example "What is next to the tree?", Figure 47.1 illustrates the processes involved:
Fig. 47.1: Overview of the current PetBrain language comprehension process
The question is parsed by RelEx, which creates the frames indicating that the sentence is
a question regarding a location reference (next) relative to an object (tree). The frame that
represents questions is called Questioning and it contains the elements Manner that indicates
the kind of question (truth-question, what, where, and so on), Message that indicates the main
term of the question and Addressee that indicates the target of the question. To indicate that
the question is related to a location, the Locative_relation frame is also created with a variable
inserted in its element Figure, which represents the expected answer (in this specific case, the
object that is next to the tree).
The question-answer module tries to match the question frames in the Atomspace to fit the
variable element. Suppose that the object that is next to the tree is the red ball. In this way,
the module will match all the frames requested and realize that the answer is the value of
the element Figure of the frame Locative_relation stored in the AtomTable. Then, it creates
location frames indicating the red ball as the answer. These frames will be converted into RelEx
format by the Frames2RelEx rule-based system as described above, and NLGen will generate
the expected sentence "the red ball is next to the tree".
47.5.4 Example of the PetBrain Language Generation Pipeline
To illustrate the process of language generation using NLGen, as utilized in the context of
PetBrain query response, consider the sentence "The red ball is near the tree". When parsed by
RelEx, this sentence is converted to:
_obj (near, tree)
_subj (near, ball)
imperative (near)
hyp (near)
definite (tree)
singular (tree)
_to-do (be, near)
_subj (be, ball)
present (be)
definite (ball)
singular (ball)
So, if sentences with this format are in the system's experience, these relations are stored by
NLGen and will be used to match future relations that must be converted into natural language.
NLGen matches at an abstract level, so sentences like "The stick is next to the fountain" will
also be matched even if the corpus contains only the sentence "The ball is near the tree".
If the agent wants to say that "The red ball is near the tree", it must invoke NLGen with
the above RelEx contents as input. However, the knowledge that the red ball is near the tree
is stored as frames, and not as RelEx format. More specifically, in this case the related frame
stored is the Locative_relation one, containing the following elements and respective values:
Figure -> red ball, Ground -> tree, Relation_type -> near.
So we must convert these frames and their elements' values into the RelEx format accepted by
NLGen. For AGI purposes, a system must learn how to perform this conversion in a flexible and
context-appropriate way. In our current system, however, we have implemented a temporary
short-cut: a system of hand-coded rules, in which the preconditions are the required frames and
the output is the corresponding RelEx format that will generate the sentence that represents
the frames. The output of a rule may contain variables that must be replaced by the frame
elements' values. For the example above, the output _subj(be, ball) is generated from the rule
output _subj(be, $var1) with the $var1 replaced by the Figure element value.
Considering specifically question-answering (QA), the PetBrain's Language Comprehension
module represents the answer to a question as a list of frames. In this case, we may have the
following situations:
• The frames match a precondition and the RelEx output is correctly recognized by NLGen,
which generates the expected sentence as the answer;
• The frames match a precondition, but NLGen did not recognize the RelEx output generated.
In this case, the answer will be "I know the answer, but I don't know how to say it", which
means that the question was answered correctly by the Language Comprehension module, but
NLGen could not generate the correct sentence;
• The frames didn't match any precondition; then the answer will also be "I know the answer,
but I don't know how to say it".
• Finally, if no frames are generated as answer by the Language Comprehension module, the
agent's answer will be "I don't know".
If the question is a truth-question, then NLGen is not required. In this case, the creation of
frames as answer is considered as a "Yes", otherwise, the answer will be "No" because it was not
possible to find the corresponding frames as the answer.
47.6 The Prospect of Massively Multiplayer Language Teaching
Now we tie in the theme of embodied language learning with more general considerations
regarding embodied experiential learning.
Potentially, this may provide a means to facilitate robust language learning on the part of
virtually embodied agents, and lead to an experientially-trained AGI language facility that can
then be used to power other sorts of agents such as virtual babies, and ultimately virtual adult-
human avatars that can communicate with experientially-grounded savvy rather than in the
manner of chat-bots.
As one concrete, evocative example, imagine millions of talking parrots spread across different
online virtual worlds - all communicating in simple English. Each parrot has its own local
memories, its own individual knowledge and habits and likes and dislikes - but there's also
a common knowledge-base underlying all the parrots, which includes a common knowledge of
English.
The interest of many humans in interacting with chatbots suggests that virtual talking
parrots or similar devices would be likely to meet with a large and enthusiastic audience.
Yes, humans interacting with parrots in virtual worlds can be expected to try to teach the
parrots ridiculous things, obscene things, and so forth. But still, when it comes down to it, even
pranksters and jokesters will have more fun with a parrot that can communicate better, and
will prefer a parrot whose statements are comprehensible.
And for a virtual parrot, the test of whether it has used English correctly, in a given instance,
will come down to whether its human friends have rewarded it, and whether it has gotten what
it wanted. If a parrot asks for food incoherently, it's less likely to get food - and since the virtual
parrots will be programmed to want food, they will have motivation to learn to speak correctly.
If a parrot interprets a human-controlled avatar's request "Fetch my hat please" incorrectly, then
it won't get positive feedback from the avatar - and it will be programmed to want positive
feedback.
And of course parrots are not the end of the story. Once the collective wisdom of throngs of
human teachers has induced powerful language understanding in the collective bird-brain, this
language understanding (and the commonsense understanding coming along with it) will be
useful for many, many other purposes as well. Humanoid avatars - both human-baby avatars
that may serve as more rewarding virtual companions than parrots or other virtual animals; and
language-savvy human-adult avatars serving various useful and entertaining functions in online
virtual worlds and games. Once AIs have learned enough that they can flexibly and adaptively
explore online virtual worlds and gather information from human-controlled avatars according
to their own goals using their linguistic facilities, it's easy to envision dramatic acceleration in
their growth and understanding.
A baby AI has numerous disadvantages compared to a baby human being: it lacks the
intricate set of inductive biases built into the human brain, and it also lacks a set of teachers
with a similar form and psyche to it... and for that matter, it lacks a really rich body and world.
However, the presence of thousands to millions of teachers constitutes a large advantage for the
AI over human babies. And a flexible AGI framework will be able to effectively exploit this
advantage. If nonlinguistic learning mechanisms like the ones we've described here, utilized in a
virtually-embodied context, can go beyond enabling interestingly trainable virtual animals and
catalyze the process of language learning - then, within a few years' time, we may find ourselves
significantly further along the path to AGI than most observers of the field currently expect.
Chapter 48
Natural Language Dialogue
Co-authored with Ruiting Lian
48.1 Introduction
Language evolved for dialogue - not for reading, writing or speechifying. So it's natural that
dialogue is broadly considered a critical aspect of humanlike AGI - even to the extent that (for
better or for worse) the conversational "Turing Test" is the standard test of human-level AGI.
Dialogue is a high-level functionality rather than a foundational cognitive process, and in
the CogPrime approach it is something that must largely be learned via experience rather
than being programmed into the system. In that sense, it may seem odd to have a chapter on
dialogue in a book section focused on engineering aspects of general intelligence. One might
think: Dialogue is something that should emerge from an intelligent system in conjunction with
other intelligent systems, not something that should need to be engineered. And this is certainly
a reasonable perspective! We do think that, as a CogPrime system develops, it will develop its
own approach to natural language dialogue, based on its own embodiment, environment and
experience - with similarities and differences to human dialogue.
However, we have also found it interesting to design a natural language dialogue system
based on CogPrime, with the goal not of emulating human conversation, but rather of enabling
interesting and intelligent conversational interaction with CogPrime systems. We call this sys-
tem "ChatPrime" and will describe its architecture in this chapter. The components used in
ChatPrime may also be useful for enabling CogPrime systems to carry out more humanlike
conversation, via their incorporation in learned schemata; but we will not focus on that aspect
here. In addition to its intrinsic interest, consideration of ChatPrime sheds much light on the
conceptual relationship between NLP and other aspects of CogPrime.
We are very aware that there is an active subfield of computational linguistics focused on
dialogue systems [Wal11, LDA05], however we will not draw significantly on that literature here.
Making practical dialogue systems in the absence of a generally functional cognitive engine is
a subtle and difficult art, which has been addressed in a variety of ways; however, we have
found that designing a dialogue system within the context of an integrative cognitive engine
like CogPrime is a somewhat different sort of endeavor.
48.1.1 Two Phases of Dialogue System Development
In practical terms, we envision the ChatPrime system as possessing two phases of development:
1. Phase 1:
• "Lower levels" of NL comprehension and generation executed by a relatively traditional
approach incorporating statistical and rule-based aspects (the RelEx and NLGen sys-
tems)
• Dialogue control utilizes hand-coded procedures and predicates (SpeechActSchema and
SpeechActTriggers) corresponding to fine-grained types of speech act
• Dialogue control guided by general cognitive control system (OpenPsi, running within
OpenCog)
• SpeechActSchema and SpeechActTriggers, in some cases, will internally consult proba-
bilistic inference, thus supplying a high degree of adaptive intelligence to the conversa-
tion
2. Phase 2:
• "Lower levels" of NL comprehension and generation carried out within the primary cogni-
tion engine, in a manner enabling their underlying rules and probabilities to be modified
based on the system's experience. Concretely, one way this could be done in OpenCog would
be via
- Implementing the RelEx and RelEx2Frame rules as PLN implications in the Atom-
space
- Implementing parsing via expressing the link parser dictionary as Atoms in the
Atomspace, and using the SAT link parser to do parsing as an example of logical
unification (carried out by a MindAgent wrapping a SAT solver)
— Implementing NLGen within the OpenCog core, via making NLGen's sentence
database a specially indexed Atomspace, and wrapping the NLGen operations in a
MindAgent
• Reimplementing the SpeechActSchema and SpeechActTriggers in an appropriate combina-
tion of Combo and PLN logical link types, so they are susceptible to modification via
inference and evolution
It's worth noting that the work required to move from Phase 1 to Phase 2 is essentially
software development and computer science algorithm optimization work, rather than compu-
tational linguistics or AI theory. Then after the Phase 2 system is built there will, of course,
be significant work involved in "tuning" PLN, MOSES and other cognitive algorithms to ex-
perientially adapt the various portions of the dialogue system that have been moved into the
OpenCog core and refactored for adaptiveness.
48.2 Speech Act Theory and its Elaboration
We review here the very basics of speech act theory, and then the specific variant of speech act
theory that we feel will be most useful for practical OpenCog dialogue system development.
The core notion of speech act theory is to analyze linguistic behavior in terms of discrete
speech acts aimed at achieving specific goals. This is a convenient theoretical approach in
an OpenCog context, because it pushes us to treat speech acts just like any other acts that
an OpenCog system may carry out in its world, and to handle speech acts via the standard
OpenCog action selection mechanism.
Searle, who originated speech act theory, divided speech acts according to the following (by
now well known) ontology:
• Assertives : The speaker commits herself to something being true. The sky is blue.
• Directives: The speaker attempts to get the hearer to do something. Clean your room!
• Commissives: The speaker commits to some future course of action. I will do it.
• Expressives: The speaker expresses some psychological state. I'm sorry.
• Declarations: The speaker brings about a different state of the world. The meeting is
adjourned.
Inspired by this ontology, Twitchell and Nunamaker (in their 2004 paper "Speech Act Pro-
filing: A Probabilistic Method for Analyzing Persistent Conversations and Their Participants")
created a much more fine-grained ontology of 42 kinds of speech acts, called SWBD-DAMSL
(DAMSL = Dialogue Act Markup in Several Layers). Nearly all of their 42 speech act types
can be neatly mapped into one of Searle's 5 high level categories, although a handful don't fit
Searle's view and get categorized as "other." Figures 48.1 and 48.2 depict the 42 acts and their
relationship to Searle's categories.
48.3 Speech Act Schemata and Triggers
In the suggested dialogue system design, multiple SpeechActSchema would be implemented,
corresponding roughly to the 42 SWBD-DAMSL speech acts. The correspondence is "rough"
because
• we may wish to add new speech acts not in their list
• sometimes it may be most convenient to merge 2 or more of their speech acts into a single
SpeechActSchema. For instance, it's probably easiest to merge their YES ANSWER and NO
ANSWER categories into a single TRUTH VALUE ANSWER schema, yielding affirmative,
negative, and intermediate answers like "probably", "probably not", "I'm not sure", etc.
• sometimes it may be best to split one of their speech acts into several, e.g. to separately
consider STATEMENTs which are responses to statements, versus statements that are
unsolicited disbursements of "what's on the agent's mind."
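The proposed merger of YES ANSWER and NO ANSWER into a single TRUTH VALUE ANSWER schema can be sketched as a mapping from a probabilistic truth value to a graded English reply. The thresholds and function name here are invented for illustration:

```python
# Sketch of a TRUTH-VALUE-ANSWER schema: map a (strength, confidence)
# truth value to a graded affirmative/negative/uncertain reply.
def truth_value_answer(strength, confidence):
    if confidence < 0.2:
        return "I'm not sure"          # too little evidence either way
    if strength > 0.8:
        return "yes"
    if strength > 0.6:
        return "probably"
    if strength < 0.2:
        return "no"
    if strength < 0.4:
        return "probably not"
    return "maybe"

print(truth_value_answer(0.9, 0.8))   # yes
print(truth_value_answer(0.3, 0.7))   # probably not
```

This keeps the two DAMSL categories as special cases of one schema while also covering the intermediate answers mentioned above.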
Overall, the SWBD-DAMSL categories should be taken as guidance rather than doctrine. How-
ever, they are valuable guidance due to their roots in detailed analysis of real human conversa-
tions, and their role as a bridge between concrete conversational analysis and the abstractions
of speech act theory.
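The bridge between the fine-grained SWBD-DAMSL tags and Searle's five categories can be represented as a simple lookup, with unmapped tags falling into "other". Only a few of the 42 tags are shown below; the selection and spellings are an illustrative partial example, not the full table from Twitchell and Nunamaker:

```python
# Illustrative partial mapping from SWBD-DAMSL speech act tags to
# Searle's five categories (plus "other").
searle_category = {
    "STATEMENT-NON-OPINION":  "assertive",
    "STATEMENT-OPINION":      "assertive",
    "ACTION-DIRECTIVE":       "directive",
    "YES-NO-QUESTION":        "directive",
    "OFFERS-OPTIONS-COMMITS": "commissive",
    "APPRECIATION":           "expressive",
    "THANKING":               "expressive",
    "SELF-TALK":              "other",
}

def classify_act(tag):
    """Map a fine-grained DAMSL tag to a high-level Searle category."""
    return searle_category.get(tag, "other")

print(classify_act("ACTION-DIRECTIVE"))  # directive
```

In the dialogue system, each SpeechActSchema would carry its DAMSL tag, so this mapping lets higher-level cognitive processes reason at Searle's coarser granularity when convenient.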
Each SpeechActSchema would take in an input consisting of a DialogueNode, a Node type
possessing a collection of links to
• a series of past statements by the agent and other conversation participants, with
- each statement labeled according to the utterer
Fig. 48.1: The 42 DAMSL speech act categories.
Fig. 48.2: Connecting the 42 DAMSL speech act categories to Searle's 5 higher-level categories.
- each statement uttered by the agent, labeled according to which SpeechActSchema
was used to produce it, plus (see below) which SpeechActTrigger and which response
generator was involved
48.3 Speech Act Schemata and Triggers 467
• a set of Atoms comprising the context of the dialogue. These Atoms may optionally be
linked to some of the Atoms representing some of the past statements. If they are not so
linked, they are considered as general context.
The enaction of a SpeechActSchema would be carried out via PredictiveImplicationLinks
embodying "Context AND Schema → Goal" schematic implications, of the general form
PredictiveImplication
    AND
        Evaluation
            SpeechActTrigger T
            DialogueNode D
        Execution
            SpeechActSchema S
            DialogueNode D
    Evaluation
        Goal G
with
ExecutionOutput
    SpeechActSchema S
    DialogueNode D
    UtteranceNode U
being created as a result of the enaction of the SpeechActSchema. (An UtteranceNode is a series
of one or more SentenceNodes.)
A single SpeechActSchema may be involved in many such implications, with different prob-
abilistic weights, if it naturally has many different Trigger contexts.
Internally each SpeechActSchema would contain a set of one or more response generators,
each one of which is capable of independently producing a response based on the given input.
These may also be weighted, where the weight determines the probability of a given response
generation process being chosen in preference to the others, once the choice to enact that
particular SpeechActSchema has already been made.
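The weighted choice among response generators can be sketched as follows. This is an illustrative Python sketch, not OpenCog code; the class and method names are hypothetical:

```python
import random

class SpeechActSchema:
    """Illustrative sketch: a SpeechActSchema holding weighted response generators."""

    def __init__(self, name):
        self.name = name
        self.generators = []  # list of (weight, generator_fn) pairs

    def add_generator(self, weight, generator_fn):
        self.generators.append((weight, generator_fn))

    def respond(self, dialogue_node):
        # Once the decision to enact this schema has been made, choose one
        # generator with probability proportional to its weight.
        weights = [w for w, _ in self.generators]
        _, chosen = random.choices(self.generators, weights=weights, k=1)[0]
        return chosen(dialogue_node)

# Usage: a toy TruthValueAnswer schema with two generators.
schema = SpeechActSchema("TruthValueAnswer")
schema.add_generator(0.7, lambda d: "Yes.")
schema.add_generator(0.3, lambda d: "Definitely.")
print(schema.respond({}))
```

The same structure extends naturally to the case where generator weights are themselves adapted from experience, as discussed in Section 48.4.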
48.3.1 Notes Toward Example SpeechActSchema
To make the above ideas more concrete, let's consider a few specific SpeechActSchema. We
won't fully specify them here, but will outline them sufficiently to make the ideas clear.
48.3.1.1 TruthValueAnswer
The TruthValueAnswer SpeechActSchema would encompass SWBD-DAMSL's YES ANSWER
and NO ANSWER, and also more flexible truth value based responses.
Trigger context
: when the conversation partner produces an utterance that RelEx maps into a truth-value
query (this is simple, as truth-value-query is one of RelEx's relationship types).
Goal
: the simplest goal relevant here is pleasing the conversation partner, since the agent may have
noticed in the past that other agents are pleased when their questions are answered. (More
advanced agents may of course have other goals for answering questions, e.g. providing the
other agent with information that will let it be more useful in future.)
Response generation schema
: for starters, this SpeechActSchema could simply operate as follows. It takes the relationship
(Atom) corresponding to the query, and uses it to launch a query to the pattern matcher or PLN
backward chainer. Then based on the result, it produces a relationship (Atom) embodying the
answer to the query, or else updates the truth value of the existing relationship corresponding
to the answer to the query. This "answer" relationship has a certain truth value. The schema
could then contain a set of rules mapping the truth values into responses, with a list of possible
responses for each truth value range. For example a very high strength and high confidence
truth value would be mapped into a set of responses like (definitely, certainly, surely, yes,
indeed).
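A mapping of this kind might look like the following sketch; the thresholds and wordings here are purely illustrative assumptions, not the actual rules:

```python
def truth_value_to_responses(strength, confidence):
    """Map a (strength, confidence) truth value to candidate responses.
    The ranges and wordings are illustrative assumptions only."""
    if confidence < 0.2:
        return ["I'm not sure", "I really don't know"]
    if strength > 0.8:
        return ["definitely", "certainly", "surely", "yes", "indeed"]
    if strength > 0.6:
        return ["probably", "I think so"]
    if strength < 0.2:
        return ["no", "definitely not"]
    if strength < 0.4:
        return ["probably not", "I doubt it"]
    return ["maybe", "possibly"]

# High strength, high confidence:
print(truth_value_to_responses(0.95, 0.9))
# ['definitely', 'certainly', 'surely', 'yes', 'indeed']
```

The chosen response word would then be inserted into a sentence frame and articulated by the language generation component.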
This simple case exemplifies the overall Phase 1 approach suggested here. The conversa-
tion will be guided by fairly simple heuristic rules, but with linguistic sophistication in the
comprehension and generation aspects, and potentially subtle inference invoked within the
SpeechActSchema or (less frequently) the Trigger contexts. Then in Phase 2 these simple heuris-
tic rules will be refactored in a manner rendering them susceptible to experiential adaptation.
48.3.1.2 Statement: Answer
The next few SpeechActSchema (plus maybe some similar ones not given here) are intended to
collectively cover the ground of SWBD-DAMSL's STATEMENT OPINION and STATEMENT
NON-OPINION acts.
Trigger context
: The trigger is that the conversation partner asks a wh- question
Goal
: Similar to the case of a TruthValueAnswer, discussed above
Response generation schema
: When a wh- question is received, one reasonable response is to produce a statement comprising
an answer. The question Atom is posed to the pattern matcher or PLN, which responds with an
Atom-set comprising a putative answer. The answer Atoms are then pared down into a series
of sentence-sized Atom-sets, which are articulated as sentences by NLGen. If the answer Atoms
have very low-confidence truth values, or if the Atomspace contains knowledge that other agents
significantly disagree with the agent's truth value assessments, then the answer Atom-set may
have Atoms corresponding to "I think" or "In my opinion" etc. added onto it (this gives an
instance of the STATEMENT OPINION act).
48.3.1.3 Statement: Unsolicited Observation
Trigger context
: when in the presence of another intelligent agent (human or AI) and nothing has been said
for a while, there is a certain probability of choosing to make a "random" statement.
Goal 1
: Unsolicited observations may be made with a goal of pleasing the other agent, as it may have
been observed in the past that other agents are happier when spoken to
Goal 2
: Unsolicited observations may be made with goals of increasing the agent's own pleasure or
novelty or knowledge - because it may have been observed that speaking often triggers conver-
sations, and conversations are often more pleasurable or novel or educational than silence
Response generation schema
: One option is a statement describing something in the mutual environment, another option is
a statement derived from high-STI Atoms in the agent's Atomspace. The particulars are similar
to the "Statement: Answer" case.
48.3.1.4 Statement: External Change Notification
Trigger context
: when in a situation with another intelligent agent, and something significant changes in the
mutually perceived situation, a statement describing it may be made.
Goal 1
: External change notification utterances may be made for the same reasons as Unsolicited
Observations, described above.
Goal 2
: The agent may think a certain external change is important to the other agent it is talking
to, for some particular reason. For instance, if the agent sees a dog steal Bob's property, it may
wish to tell Bob about this.
Goal 3
: The change may be important to the agent itself - and it may want its conversation partner
to do something relevant to an observed external change ... so it may bring the change to the
partner's attention for this reason. For instance, "Our friends are leaving. Please try to make
them come back."
Response generation schema
: The Atom-set for expression characterizes the change observed. The particulars are similar to
the "Statement: Answer" case.
48.3.1.5 Statement: Internal Change Notification
Trigger context 1
: when the importance level of an Atom increases dramatically while in the presence of an-
other intelligent agent, a statement expressing this Atom (and some of its currently relevant
surrounding Atoms) may be made
Trigger context 2
: when the truth value of a reasonably important Atom changes dramatically while in the
presence of another intelligent agent, a statement expressing this Atom and its truth value may
be made
Goal
: Similar goals apply here as to External Change Notification, considered above
Response generation schema
: Similar to the "Statement: External Change Notification" case.
48.3.1.6 WHQuestion
Trigger context
: being in the presence of an intelligent agent thought capable of answering questions
Goal 1
: the general goal of increasing the agent's total knowledge
Goal 2
: the agent notes that, to achieve one of its currently important goals, it would be useful to
possess an Atom fulfilling a certain specification
Response generation schema
: Formulate a query whose answer would be an Atom fulfilling that specification, and then
articulate this logical query as an English question using NLGen
48.4 Probabilistic Mining of Trigger contexts
One question raised by the above design sketch is where the Trigger contexts come from. They
may be hand-coded, but this approach may suffer from excessive brittleness. The approach
suggested by Twitchell and Nunamaker's work (which involved modeling human dialogues rather
than automatically generating intelligent dialogues) is statistical. That is, they suggest marking
up a corpus of human dialogues with tags corresponding to the 42 speech acts, and learning
from this annotated corpus a set of Markov transition probabilities indicating which speech acts
are most likely to follow which others. In their approach the transition probabilities refer only
to series of speech acts.
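Estimating such Markov transition probabilities from a tagged corpus can be sketched as follows; the tag names follow the SWBD-DAMSL style used in this chapter, but the code itself is an illustration, not part of any existing system:

```python
from collections import Counter, defaultdict

def transition_probabilities(tagged_dialogues):
    """Estimate first-order Markov transition probabilities between speech
    act tags, from a corpus given as one list of tags per dialogue."""
    counts = defaultdict(Counter)
    for tags in tagged_dialogues:
        for prev, curr in zip(tags, tags[1:]):
            counts[prev][curr] += 1
    probs = {}
    for prev, counter in counts.items():
        total = sum(counter.values())
        probs[prev] = {curr: n / total for curr, n in counter.items()}
    return probs

# Toy corpus using SWBD-DAMSL-style tags: ad = action-directive,
# qw = wh-question, sd = statement, qy = yes-no question, ny = yes answer.
corpus = [["ad", "qw", "sd", "qy", "ny"], ["qw", "sd", "qw", "sd"]]
probs = transition_probabilities(corpus)
print(probs["qw"])  # {'sd': 1.0} -- here a wh-question is always followed by a statement
```

In Twitchell and Nunamaker's setting these probabilities describe human dialogue; in the OpenCog setting they would serve as priors over which SpeechActSchema to consider next.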
In an OpenCog context one could utilize a more sophisticated training corpus in a more
sophisticated way. For instance, suppose one wants to build a dialogue system for a game char-
acter conversing with human characters in a game world. Then one could conduct experiments
in which one human controls a "human" game character, and another human puppeteers an
"AI" game character. That is, the puppeteered character funnels its perceptions to the AI
system, but has its actions and verbalizations controlled by the human puppeteer. Given the
dialogue from this sort of session, one could then perform markup according to the 42 speech
acts.
As a simple example, consider the following brief snippet of annotated conversation:
speaker  utterance             speech act type
Ben      Go get me the ball    ad
AI       Where is it?          qw
Ben      Over there [points]   sd
AI       By the table?         qy
Ben      Yeah                  ny
AI       Thanks                ft
AI       I'll get it now.      commits
A DialogueNode object based on this snippet would contain the information in the table, plus
some physical information about the situation, such as, in this case: predicates describing the
relative locations of the two agents, the ball and the table (e.g. the two agents are very near each
other, the ball and the table are very near each other, but these two groups of entities are only
moderately near each other); and, predicates involving
Then, one could train a machine learning algorithm such as MOSES to predict the probability
of speech act type S1 occurring at a certain point in a dialogue history, based on the prior history
of the dialogue. This prior history could include percepts and cognitions as well as utterances,
since one has a record of the AI system's perceptions and cognitions in the course of the
marked-up dialogue.
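MOSES itself is beyond the scope of a short sketch, but the shape of this prediction problem can be illustrated with a simple count-based predictor conditioned on the previous speech act and a summary of concurrent percepts. All details here, including the context encoding, are illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_predictor(histories):
    """Each history is a list of (speech_act, context) pairs, where context is
    a hashable summary of concurrent percepts/cognitions (an illustrative
    stand-in for the richer features a learner like MOSES would consume).
    Returns a function predicting the most likely next speech act."""
    counts = defaultdict(Counter)
    for history in histories:
        for (prev_act, ctx), (next_act, _) in zip(history, history[1:]):
            counts[(prev_act, ctx)][next_act] += 1

    def predict(prev_act, ctx):
        c = counts.get((prev_act, ctx))
        return c.most_common(1)[0][0] if c else None

    return predict

# Toy data: after a wh-question ("qw") with the ball visible, a statement follows.
histories = [[("ad", "ball-visible"), ("qw", "ball-visible"), ("sd", "ball-visible")]]
predict = train_predictor(histories)
print(predict("qw", "ball-visible"))  # sd
```

A real learner would generalize over contexts rather than matching them exactly; the point of the sketch is only that perceptual features enter the conditioning alongside the utterance history.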
One question is whether to use the 42 SWBD-DAMSL speech acts for the creation of the
annotated corpus, or whether instead to use the modified set of speech acts created in design-
ing SpeechActSchema. Either way could work, but we are mildly biased toward the former,
since this specific SWBD-DAMSL markup scheme has already proved its viability for marking
up conversations. It seems unproblematic to map probabilities corresponding to these speech
acts into probabilities corresponding to a slightly refined set of speech acts. Also, this way
the corpus would be valuable independently of ongoing low-level changes in the collection of
SpeechActSchema.
In addition to this sort of supervised training in advance, it will be important to enable the
system to learn Trigger contexts online as a consequence of its life experience. This learning
may take two forms:
1. Most simply, adjustment of the probabilities associated with the PredictiveImplicationLinks
between SpeechActTriggers and SpeechActSchema
2. More sophisticatedly, learning of new SpeechActTrigger predicates, using an algorithm such
as MOSES for predicate learning, based on mining the history of actual dialogues to estimate
fitness
In both cases the basis for learning is information regarding the extent to which system goals
were fulfilled by each past dialogue. PredictiveImplications that correspond to portions of suc-
cessful dialogues will have their truth values increased, and those corresponding to portions of
unsuccessful dialogues will have their truth values decreased. Candidate SpeechActTriggers will
be valued based on the observed historical success of the responses they would have generated
based on historically perceived utterances; and (ultimately) more sophisticatedly, based on the
estimated success of the responses they generate. Note that, while somewhat advanced, this kind
of learning is much easier than the procedure learning required to learn new SpeechActSchema.
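The first, simpler form of learning might amount to something like the following update rule, sketched here as a plain exponential moving average. The actual truth value revision machinery in OpenCog is more involved; this rule is only an illustrative assumption:

```python
def reinforce(link_strength, goal_fulfillment, learning_rate=0.1):
    """Nudge the probabilistic strength of a (hypothetical)
    PredictiveImplicationLink toward the observed degree of goal
    fulfillment, as a simple exponential moving average."""
    return link_strength + learning_rate * (goal_fulfillment - link_strength)

# A link that participated in mostly successful dialogues drifts upward.
s = 0.5
for outcome in [1.0, 1.0, 0.0, 1.0]:
    s = reinforce(s, outcome)
print(round(s, 3))  # 0.582
```

The second form of learning, trigger predicate discovery, has no such one-line analogue: there the search is over the space of predicates themselves, with historical goal fulfillment supplying the fitness signal.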
48.5 Conclusion
While the underlying mechanisms are simple, the methods described above appear capable of
producing arbitrarily complex dialogues about any subject that is represented by knowledge in
the AtomSpace. There is no reason to expect dialogue produced in this manner to be indistin-
guishable from human dialogue; but it may nevertheless be humanly comprehensible, intelligent
and insightful. What is happening in this sort of dialogue system is somewhat similar to current
natural language query systems that query relational databases, but the "database" in ques-
tion is a dynamically self-adapting weighted labeled hypergraph rather than a static relational
database, and this difference means a much more complex dialogue system is required, as well
as more flexible language comprehension and generation components.
Ultimately, a CogPrime system - if it works as desired - will be able to learn increased
linguistic functionality, and new languages, on its own. But this is not a prerequisite for having
intelligent dialogues with a CogPrime system. Via building a ChatPrime type system, as out-
lined here, intelligent dialogue can occur with a CogPrime system while it is still at relatively
early stages of cognitive development, and even while the underlying implementation of the
CogPrime design is incomplete. This is not closely analogous to human cognitive and linguistic
development, but it can still be pursued in the context of a CogPrime development plan that
follows the overall arc of human developmental psychology.
Section VIII
From Here to AGI
Chapter 49
Summary of Argument for the CogPrime
Approach
49.1 Introduction
By way of conclusion, we now return to the "key claims" that were listed at the end of Chapter
1 of Part 1. Quite simply, this is a list of claims such that - roughly speaking - if the reader
accepts these claims, they should accept that the CogPrime approach to AGI is a viable one.
On the other hand if the reader rejects one or more of these claims, they may well find one
or more aspects of CogPrime unacceptable for some related reason. In Chapter 1 of Part 1 we
merely listed these claims; here we briefly discuss each one in the context of the intervening
chapters, giving each one its own section or subsection.
As we clarified at the start of Part 1, we don't fancy that we have provided an ironclad
argument that the CogPrime approach to AGI is guaranteed to work as hoped, once it's fully
engineered, tuned and taught. Mathematics isn't yet adequate to analyze the real-world behavior
of complex systems like these; and we have not yet implemented, tested and taught enough of
CogPrime to provide convincing empirical validation. So, most of the claims listed here have
not been rigorously demonstrated, but only heuristically argued for. That is the reality of
AGI work right now: one assembles a design based on the best combination of rigorous and
heuristic arguments one can, then proceeds to create and teach a system according to the design,
adjusting the details of the design based on experimental results as one goes along. For an
uncluttered list of the claims, please refer back to Chapter 1 of Part 1; here we will review the
claims integrated into the course of discussion.
The following chapter, aimed at the more mathematically-minded reader, gives a list of
formal propositions echoing many of the ideas in the chapter - propositions such that, if they
are true, then the success of CogPrime as an architecture for general intelligence is likely.
49.2 Multi-Memory Systems
The first of our key claims is that to achieve general intelligence in the context of human-
intelligence-friendly environments and goals using feasible computational resources, it's impor-
tant that an AGI system can handle different kinds of memory (declarative, procedural, episodic,
sensory, intentional, attentional) in customized but interoperable ways. The basic idea is that
these different kinds of knowledge have very different characteristics, so that trying to handle
them all within a single approach, while surely possible, is likely to be unacceptably inefficient.
The tricky issue in formalizing this claim is that "single approach" is an ambiguous notion:
for instance, if one has a wholly logic-based system that represents all forms of knowledge using
predicate logic, then one may still have specialized inference control heuristics corresponding
to the different kinds of knowledge mentioned in the claim. In this case one has "customized
but interoperable ways" of handling the different kinds of memory, and one doesn't really have
a "single approach" even though one is using logic for everything. To bypass such conceptual
difficulties, one may formalize cognitive synergy using a geometric framework as discussed in
Appendix B, in which different types of knowledge are represented as metrized categories,
and cognitive synergy becomes a statement about paths to goals being shorter in metric spaces
combining multiple knowledge types than in those corresponding to individual knowledge types.
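In simplified notation (ours, not a verbatim formula from Appendix B), letting d_i(x, g) denote the shortest-path length from cognitive state x to goal g within the metric space associated with knowledge type i, and d_*(x, g) the corresponding length in the combined space, the synergy claim reads:

```latex
% Cognitive synergy: working across combined knowledge types yields
% shorter paths to goals than any single knowledge type alone.
d_{*}(x, g) \;<\; \min_{i} \, d_{i}(x, g)
\quad \text{for typical goal-relevant pairs } (x, g)
```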
In CogPrime we use a complex combination of representations, including the Atomspace for
declarative, attentional and intentional knowledge and some episodic and sensorimotor knowl-
edge, Combo programs for procedural knowledge, simulations for episodic knowledge, and hi-
erarchical neural nets for some sensorimotor knowledge (and related episodic, attentional and
intentional knowledge). In cases where the same representational mechanism is used for dif-
ferent types of knowledge, different cognitive processes are used, and often different aspects of
the representation (e.g. attentional knowledge is dealt with largely by ECAN acting on
AttentionValues and HebbianLinks in the AtomSpace, whereas declarative knowledge is dealt
with largely by PLN acting on TruthValues and logical links, also in the AtomSpace). So one
has a mix of the "different representations for different memory types" approach and the
"different control processes on a common representation for different memory types" approach.
It's unclear how closely dependent the need for a multi-memory approach is on the particulars
of "human-friendly environments." We argued in Chapter 9 of Part 1 that one factor militating
in favor of a multi-memory approach is the need for multimodal communication: declarative
knowledge relates to linguistic communication; procedural knowledge relates to demonstrative
communication; attentional knowledge relates to indicative communication; and so forth. But
in fact the multi-memory approach may have a broader importance, even to intelligences with-
out multimodal communication. This is an interesting issue but not particularly critical to the
development of human-like, human-level AGI, since in the latter case we are specifically con-
cerned with creating intelligences that can handle multimodal communication. So if for no other
reason, the multi-memory approach is worthwhile for handling multi-modal communication.
Pragmatically, it is also quite clear that the human brain takes a multi-memory approach,
e.g. with the cerebellum and closely linked cortical regions containing special structures for
handling procedural knowledge, with special structures for handling motivational (intentional)
factors, etc. And (though this point is certainly not definitive, it's meaningful in the light of
the above theoretical discussion) decades of computer science and narrow-AI practice strongly
suggest that the "one memory structure fits all" approach is not capable of leading to effective
real-world approaches.
49.3 Perception, Action and Environment
The more we understand of human intelligence, the clearer it becomes how closely it has evolved
to the particular goals and environments for which the human organism evolved. This is true
in a broad sense, as illustrated by the above issues regarding multi-memory systems, and is
also true in many particulars, as illustrated e.g. by Changizi's [Cha09] evolutionary analysis of
the human visual system. While it might be possible to create a human-like, human-level AGI
by abstracting the relevant biases from human biology and behavior and explicitly encoding
them in one's AGI architecture, it seems this would be an inordinately difficult approach in
practice, leading to the claim that to achieve human-like general intelligence, it's important for
an intelligent agent to have sensory data and motoric affordances that roughly emulate those
available to humans. We don't claim this is a necessity - just a dramatic convenience. And if
one accepts this point, it has major implications for what sorts of paths toward AGI it makes
most sense to follow.
Unfortunately, though, the idea of a "human-like" set of goals and environments is fairly
vague; and when you come right down to it, we don't know exactly how close the emulation
needs to be to form a natural scenario for the maturation of human-like, human-level AGI
systems. One could attempt to resolve this issue via a priori theory, but given the current level
of scientific knowledge it's hard to see how that would be possible in any definitive sense ...
which leads to the conclusion that our AGI systems and platforms need to support fairly flexible
experimentation with virtual-world and/or robotic infrastructures.
Our own intuition is that neither current virtual world platforms nor current robotic platforms
are quite adequate for the development of human-level, human-like AGI.
Virtual worlds would need to become a lot more like robot simulators, allowing more flexible
interaction with the environment, and more detailed control of the agent. Robots would need
to become more robust at moving and grabbing - e.g. with Big Dog's movement ability but the
grasping capability of the best current grabber arms. We do feel that development of adequate
virtual world or robotics platforms is quite possible using current technology, and could be done
at fairly low cost if someone were to prioritize this. Even without AGI-focused prioritization, it
seems that the needed technological improvements are likely to happen during the next decade
for other reasons. So at this point we feel it makes sense for AGI researchers to focus on AGI
and exploit embodiment-platform improvements as they come along - at least, this makes
sense in the case of AGI approaches (like CogPrime ) that can be primarily developed in an
embodiment-platform-independent manner.
49.4 Developmental Pathways
But if an AGI system is going to live in human-friendly environments, what should it do there?
No doubt very many pathways leading from incompetence to adult-human-level general intel-
ligence exist, but one of them is much better understood than any of the others, and that's
the one normal human children take. Of course, given their somewhat different embodiment, it
doesn't make sense to try to force AGI systems to take exactly the same path as human chil-
dren, but having AGI systems follow a fairly close approximation to the human developmental
path seems the smoothest developmental course ... a point summarized by the claim that: To
work toward adult human-level, roughly human-like general intelligence, one fairly easily com-
prehensible path is to use environments and goals reminiscent of human childhood, and seek to
advance one's AGI system along a path roughly comparable to that followed by human children.
Human children learn via a rich variety of mechanisms; but broadly speaking one conclusion
one may draw from studying human child learning is that it may make sense to teach an
AGI system aimed at roughly human-like general intelligence via a mix of spontaneous learning
and explicit instruction, and to instruct it via a combination of imitation, reinforcement and
correction, and a combination of linguistic and nonlinguistic instruction. We have explored
exactly what this means in Chapter 31 and others, via looking at examples of these types of
learning in the context of virtual pets in virtual worlds, and exploring how specific CogPrime
learning mechanisms can be used to achieve simple examples of these types of learning.
One important case of learning that human children are particularly good at is language
learning; and we have argued that this is a case where it may pay for AGI systems to take
a route somewhat different from the one taken by human children. Humans seem to be born
with a complex system of biases enabling effective language learning, and it's not yet clear
exactly what these biases are nor how they're incorporated into the learning process. It is very
tempting to give AGI systems a "short cut" to language proficiency via making use of existing
rule-based and statistical-corpus-analysis-based NLP systems; and we have fleshed out this
approach sufficiently to have convinced ourselves it makes practical as well as conceptual sense,
in the context of the specific learning mechanisms and NLP tools built into OpenCog. Thus we
have provided a number of detailed arguments and suggestions in support of our claim that one
effective approach to teaching an AGI system human language is to supply it with some in-built
linguistic facility, in the form of rule-based and statistical-linguistics-based NLP systems, and
then allow it to improve and revise this facility based on experience.
49.5 Knowledge Representation
Many knowledge representation approaches have been explored in the AI literature, and ulti-
mately many of these could be workable for human-level AGI if coupled with the right cog-
nitive processes. The key goal for a knowledge representation for AGI should be naturalness
with respect to the AGI's cognitive processes - i.e. the cognitive processes shouldn't need
to undergo complex transformative gymnastics to get information in and out of the knowl-
edge representation in order to do their cognitive work. Toward this end we have come to a
similar conclusion to some other researchers (e.g. Joscha Bach and Stan Franklin), and con-
cluded that given the strengths and weaknesses of current and near-future digital computers,
a (loosely) neural-symbolic network is a good representation for directly storing many kinds of
memory, and interfacing between those that it doesn't store directly. CogPrime's AtomSpace is
a neural-symbolic network designed to work nicely with PLN, MOSES, ECAN and the other
key CogPrime cognitive processes; it supplies them with what they need without causing them
undue complexities. It provides a platform that these cognitive processes can use to adaptively,
automatically construct specialized knowledge representations for particular sorts of knowledge
that they encounter.
49.6 Cognitive Processes
The crux of intelligence is dynamics, learning, adaptation; and so the crux of an AGI design is
the set of cognitive processes that the design provides. These processes must collectively allow
the AGI system to achieve its goals in its environments using the resources at hand. Given
CogPrime's multi-memory design, it's natural to consider CogPrime's cognitive processes in
terms of which memory subsystems they focus on (although this is not a perfect mode of
analysis, since some of the cognitive processes span multiple memory types).
49.6.1 Uncertain Logic for Declarative Knowledge
One major decision made in the creation of CogPrime was that given the strengths and weak-
nesses of current and near-future digital computers, uncertain logic is a good way to handle
declarative knowledge. Of course this is not obvious nor is it the only possible route. Declarative
knowledge can potentially be handled in other ways; e.g. in a hierarchical network architecture,
one can make declarative knowledge emerge automatically from procedural and sensorimotor
knowledge, as is the goal in the Numenta and DeSTIN designs reviewed in Chapter 4 of Part
1. It seems clear that the human brain doesn't contain anything closely parallel to formal logic
- even though one can ground logic operations in neural-net dynamics as explored in Chapter
34, this sort of grounding leads to "uncertain logic enmeshed with a host of other cognitive
dynamics" rather than "uncertain logic as a cleanly separable cognitive process."
But contemporary digital computers are not brains - they lack the human brain's capacity
for cheap massive parallelism, but have a capability for single-operation speed and precision
far exceeding the brain's. In this way computers and formal logic are a natural match (a fact
that's not surprising given that Boolean logic lies at the foundation of digital computer opera-
tions). Using uncertain logic is a sort of compromise between brainlike messiness and fuzziness,
and computerlike precision. An alternative to using uncertain logic is using crisp logic and in-
corporating uncertainty as content within the knowledge base - this is what SOAR does, for
example, and it's not a wholly unworkable approach. But given that the vast mass of knowledge
needed for confronting everyday human reality is highly uncertain, and that this knowledge of-
ten needs to be manipulated efficiently in real-time, it seems to us there is a strong argument
for embedding uncertainty in the logic.
Many approaches to uncertain logic exist in the literature, including probabilistic and fuzzy
approaches, and one conclusion we reached in formulating CogPrime is that none of them was
adequate on its own — leading us, for example, to the conclusion that to deal with the problems
facing a human-level AGI, an uncertain logic must integrate imprecise probability and fuzziness
with a broad scope of logical constructs. The arguments that both fuzziness and probability are
needed seem hard to counter - these two notions of uncertainty are qualitatively different yet
both appear cognitively necessary.
The argument for using probability in an AGI system is assailed by some AGI researchers such
as Pei Wang, but we are swayed by the theoretical arguments in favor of probability theory's
mathematically fundamental nature, as well as the massive demonstrated success of probability
theory in various areas of narrow AI and applied science. However, we are also swayed by
the arguments of Pei Wang, Peter Walley and others that using single-number probabilities to
represent truth values leads to untoward complexities related to the tabulation and manipulation
of amounts of evidence. This has led us to an imprecise probability based approach; and then
technical arguments regarding the limitations of standard imprecise probability formalisms have
led us to develop our own "indefinite probabilities" formalism.
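For concreteness, the basic flavor of an indefinite truth value can be sketched in a few lines of Python. The class below holds an interval [L, U] asserted to contain an event's true probability with credibility b; the conjunction rule shown is a naive independence-assuming interval product, standing in for PLN's more careful procedure based on second-order distributions. All names here are illustrative, not the actual PLN implementation.

```python
from dataclasses import dataclass

@dataclass
class IndefiniteTV:
    """Sketch of an indefinite truth value: the interval [L, U] is
    asserted to contain the true probability, with credibility b."""
    L: float
    U: float
    b: float = 0.9   # credibility that the true probability lies in [L, U]

    def mean(self) -> float:
        return (self.L + self.U) / 2.0   # simple point estimate

    def width(self) -> float:
        return self.U - self.L           # wider interval = less evidence

def conj(x: IndefiniteTV, y: IndefiniteTV) -> IndefiniteTV:
    """Naive conjunction: multiply interval endpoints, assuming
    independence; real PLN samples over second-order distributions."""
    return IndefiniteTV(x.L * y.L, x.U * y.U, min(x.b, y.b))

wet = IndefiniteTV(0.6, 0.8)     # "the grass is wet"
rain = IndefiniteTV(0.5, 0.7)    # "it rained last night"
both = conj(wet, rain)           # roughly the interval [0.30, 0.56]
```

Note how ignorance propagates: the conjunction's interval is wider, relative to its mean, than either input's, which is the behavior that single-number probabilities cannot capture.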
The PLN logic framework is one way of integrating imprecise probability and fuzziness in a
logical formalism that encompasses a broad scope of logical constructs. It integrates term logic
482 49 Summary of Argument for the CogPrime Approach
and predicate logic - a feature that we consider not necessary, but very convenient, for AGI.
Either predicate or term logic on its own would suffice, but each is awkward in certain cases,
and integrating them as done in PLN seems to result in more elegant handling of real-world
inference scenarios. Finally, PLN also integrates intensional inference in an elegant manner
that demonstrates integrative intelligence - it defines intension using pattern theory, which
binds inference to pattern recognition and hence to other cognitive processes in a conceptually
appropriate way.
Clearly PLN is not the only possible logical formalism capable of serving a human-level AGI
system; however, we know of no other existing, fleshed-out formalism capable of fitting the bill.
In part this is because PLN has been developed as part of an integrative AGI project whereas
other logical formalisms have mainly been developed for other purposes, or purely theoretically.
Via using PLN to control virtual agents, and integrating PLN with other cognitive processes, we
have tweaked and expanded the PLN formalism to serve all the roles required of the "declarative
cognition" component of an AGI system with reasonable elegance and effectiveness.
49.6.2 Program Learning for Procedural Knowledge
Even more so than declarative knowledge, procedural knowledge is represented in many different
ways in the AI literature. The human brain also apparently uses multiple mechanisms to embody
different kinds of procedures. So the choice of how to represent procedures in an AGI system
is not particularly obvious. However, there is one particular representation of procedures that
is particularly well-suited for current computer systems, and particularly well-tested in this
context: programs. In designing CogPrime, we have acted based on the understanding that
programs are a good way to represent procedures - including both cognitive and physical-action
procedures, but perhaps not including low-level motor-control procedures.
Of course, this raises the question of which programming language to use, and in this
context we have made a fairly traditional choice, using a special language called Combo that is
essentially a minor variant of LISP, and supplying Combo with a set of customized primitives
intended to reduce the length of the typical programs CogPrime needs to learn and use. What
differentiates this use of LISP from many traditional uses of LISP in AI is that we are only
using the LISP-ish representational style for procedural knowledge, rather than trying to use it
for everything.
One test of whether the use of Combo programs to represent procedural knowledge makes
sense is whether the procedures useful for a CogPrime system in everyday human environments
have short Combo representations. We have worked with Combo enough to validate that they
generally do in the virtual world environment - and also in the physical-world environment
if lower-level motor procedures are supplied as primitives. That is, we are not convinced that
Combo is a good representation for the procedure a robot needs to move its fingers to
pick up a cup, coordinating its movements with its visual perceptions. It's certainly possible to
represent this sort of thing in Combo, but Combo may be an awkward tool. However, if one
represents low-level procedures like this using another method, e.g. learned cell assemblies in a
hierarchical network like DeSTIN, then it's very feasible to make Combo programs that invoke
these low-level procedures, and encode higher-level actions like "pick up the cup in front of you
slowly and quietly, then hand it to Jim who is standing next to you."
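For illustration, the spirit of this layering can be sketched in Python: a Combo-like program is just a nested s-expression whose leaves name low-level behaviors supplied as opaque primitives (in a robot, these might be learned controllers in a system like DeSTIN). The primitive names and the tuple encoding here are our own illustrative inventions, not the actual Combo language.

```python
# A Combo-like program as a nested (operator, arg, ...) tuple, with
# low-level motor behaviors supplied as opaque primitives.

def make_primitives(log):
    # Each primitive just records that it ran; a robot backend would
    # instead trigger a learned low-level controller.
    return {
        "seq":     lambda *steps: list(steps),
        "pick_up": lambda obj, manner: log.append(("pick_up", obj, manner)),
        "hand_to": lambda obj, person: log.append(("hand_to", obj, person)),
    }

def evaluate(expr, prims):
    """Evaluate a nested tuple bottom-up; atoms evaluate to themselves.
    Note that arguments are evaluated left to right, so the steps of a
    "seq" have already executed by the time "seq" itself is applied."""
    if not isinstance(expr, tuple):
        return expr                      # atom: object name, manner, etc.
    op, *args = expr
    vals = [evaluate(a, prims) for a in args]
    return prims[op](*vals)

log = []
prims = make_primitives(log)
program = ("seq",
           ("pick_up", "cup", "slowly_and_quietly"),
           ("hand_to", "cup", "Jim"))
evaluate(program, prims)
```

The point of the sketch is the division of labor: the s-expression layer stays short and learnable, while the messy sensorimotor details live behind the primitive boundary.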
Having committed to use programs to represent many procedures, the next question is how
to learn programs. One key conclusion we have come to via our empirical work in this area is
that some form of powerful program normalization is essential. Without normalization, it's too
hard for existing learning algorithms to generalize from known, tested programs and draw useful
uncertain conclusions about untested ones. We have worked extensively with a generalization
of Holman's "Elegant Normal Form" in this regard.
For learning normalized programs, we have come to the following conclusions:
• for relatively straightforward procedure learning problems, hillclimbing with random restart
and a strong Occam bias is an effective method
• for more difficult problems that elude hillclimbing, probabilistic evolutionary program learn-
ing is an effective method
The probabilistic evolutionary program learning method we have worked with most in OpenCog
is MOSES, and significant evidence has been gathered showing it to be dramatically more
effective than genetic programming on relevant classes of problems. However, more work needs to
be done to evaluate its progress on complex and difficult procedure learning problems. Alternate,
related probabilistic evolutionary program learning algorithms such as PLEASURE have also
been considered and may be implemented and tested as well.
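To make the first of these conclusions concrete, here is a toy Python sketch of hillclimbing with random restart and an explicit Occam bias: a candidate's score is its task fitness minus a penalty proportional to its size. The bit-list "programs" and the target-matching fitness function are stand-ins for real procedure learning; this is illustrative, not OpenCog's actual hillclimber.

```python
import random

def hillclimb_occam(fitness, neighbors, init, alpha=0.1, restarts=5,
                    steps=200, seed=0):
    """Hillclimbing with random restarts and an Occam bias:
    score = fitness - alpha * size, so shorter solutions win ties."""
    rng = random.Random(seed)
    def score(c):
        return fitness(c) - alpha * len(c)   # explicit complexity penalty
    best = init(rng)
    for _ in range(restarts):
        cur = init(rng)
        for _ in range(steps):
            cand = max(neighbors(cur, rng), key=score)
            if score(cand) <= score(cur):
                break                        # local optimum; restart
            cur = cand
        if score(cur) > score(best):
            best = cur
    return best

TARGET = [1, 0, 1]

def fitness(s):
    # reward matching the target prefix; any extra length is pure cost
    return sum(1 for i in range(min(len(s), len(TARGET)))
               if s[i] == TARGET[i])

def init(rng):
    return [rng.randint(0, 1) for _ in range(8)]

def neighbors(s, rng):
    flips = [s[:i] + [1 - s[i]] + s[i + 1:] for i in range(len(s))]
    return flips + ([s[:-1]] if len(s) > 1 else [])   # flips plus "shrink"

best = hillclimb_occam(fitness, neighbors, init)
```

Because the Occam penalty makes every useless trailing bit cost something, the climber first fixes the mismatched bits and then shrinks the program down to the minimal solution.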
49.6.3 Attention Allocation
There is significant evidence that the brain uses some sort of "activation spreading" type method
to allocate attention, and many algorithms in this spirit have been implemented and utilized in
the AI literature. So, we find ourselves in agreement with many others that activation spreading
is a reasonable way to handle attentional knowledge (though other approaches, with greater
overhead cost, may provide better accuracy and may be appropriate in some situations). We
also agree with many others who have chosen Hebbian learning as one route of learning
associative relationships, with more sophisticated methods such as information-geometric
ones potentially also playing a role.
Where CogPrime differs from standard practice is in the use of an economic metaphor to reg-
ulate activation spreading. In this matter CogPrime is broadly in agreement with Eric Baum's
arguments about the value of economic methods in AI, although our specific use of economic
methods is very different from his. Baum's work (e.g. the Hayek system) embodies more complex
and computationally expensive uses of artificial economics, whereas we believe that in the con-
text of a neural-symbolic network, artificial economics is an effective approach to activation
spreading; and CogPrime's ECAN framework seeks to embody this idea. ECAN can also make
use of more sophisticated and expensive forms of artificial currency when large amounts of system
resources are involved in a single choice, rendering the cost appropriate.
One major choice made in the CogPrime design is to focus on two kinds of attention: proces-
sor (represented by ShortTermImportance) and memory (represented by LongTermImportance).
This is a direct reflection of one of the key differences between the von Neumann architecture
and the human brain: in the former but not the latter, there is a strict separation between mem-
ory and processing in the underlying compute fabric. We carefully considered the possibility of
using a larger variety of attention values, and in Chapter 23 we presented some mathematics
and concepts that could be used in this regard, but for reasons of simplicity and computational
efficiency we are currently using only STI and LTI in our OpenCogPrime implementation, with
the possibility of extending further if experimentation proves it necessary.
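A toy version of the two-currency idea can be sketched as follows: atoms earn STI "wages" when a cognitive process uses them, pay "rent" back to a central bank each cycle, and accumulate LTI as a record of long-run usefulness, so total currency is conserved. All names and numbers are illustrative, not the OpenCog ECAN implementation.

```python
class EcanSketch:
    """Toy economic attention allocation: atoms hold ShortTermImportance
    (STI) and LongTermImportance (LTI) as currency drawn from a bank."""

    def __init__(self, atoms, rent=1, wage=10, funds=1000):
        self.sti = {a: 0 for a in atoms}
        self.lti = {a: 0 for a in atoms}
        self.bank = funds
        self.rent, self.wage = rent, wage

    def stimulate(self, atom):
        # wage payment: the atom was just useful to some process
        self.sti[atom] += self.wage
        self.bank -= self.wage
        self.lti[atom] += 1          # LTI tracks long-run usefulness

    def tick(self):
        # rent collection: holding attention has an ongoing cost
        for a in self.sti:
            paid = min(self.rent, self.sti[a])
            self.sti[a] -= paid
            self.bank += paid

    def attentional_focus(self, threshold=5):
        # the currently-important atoms, i.e. those worth processing now
        return {a for a, s in self.sti.items() if s >= threshold}

ecan = EcanSketch(["cup", "Jim", "walk"])
ecan.stimulate("cup"); ecan.stimulate("cup"); ecan.stimulate("Jim")
for _ in range(3):
    ecan.tick()
```

After a few cycles, recently-used atoms remain in the attentional focus while unused ones decay out of it, and the sum of all STI plus the bank's funds is unchanged, which is the conservation property that distinguishes the economic metaphor from plain activation spreading.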
49.6.4 Internal Simulation and Episodic Knowledge
For episodic knowledge, as with declarative and procedural knowledge, CogPrime has opted
for a solution motivated by the particular strengths of contemporary digital computers. When
the human brain runs through a "mental movie" of past experiences, it doesn't do any kind of
accurate physical simulation of these experiences. But that's not because the brain wouldn't
benefit from such - it's because the brain doesn't know how to do that sort of thing! On the
other hand, any modern laptop can run a reasonable Newtonian physics simulation of everyday
events, and more fundamentally can recall and manage the relative positions and movements of
items in an internal 3D landscape paralleling remembered or imagined real-world events. With
this in mind, we believe that in an AGI context, simulation is a good way to handle episodic
knowledge; and running an internal "world simulation engine" is an effective way to handle
simulation.
CogPrime can work with many different simulation engines; and since simulation technology
is continually advancing independently of AGI technology, this is an area where AGI can buy
some progressive advancement for free as time goes on. The subtle issues here regard interfacing
between the simulation engine and the rest of the mind: mining meaningful information out of
simulations using pattern mining algorithms; and more subtly, figuring out what simulations
to run at what times in order to answer the questions most relevant to the AGI system in the
context of achieving its goals. We believe we have architected these interactions in a viable way
in the CogPrime design, but we have tested our ideas in this regard only in some fairly simple
contexts regarding virtual pets in a virtual world, and much more remains to be done here.
49.6.5 Low-Level Perception and Action
The centrality or otherwise of low-level perception and action in human intelligence is a matter
of ongoing debate in the AI community. Some feel that the essence of intelligence lies in cognition
and/or language, with perception and action having the status of "peripheral devices." Others
feel that modeling the physical world and one's actions in it is the essence of intelligence, with
cognition and language emerging as side-effects of these more fundamental capabilities. The
CogPrime architecture doesn't need to take sides in this debate. Currently we are experimenting
both in virtual worlds, and with real-world robot control. The value added by robotic versus
virtual embodiment can thus be explored via experiment rather than theory, and may reveal
nuances that no one currently foresees.
As noted above, we are unconfident of CogPrime's generic procedure learning or pattern
recognition algorithms in terms of their capabilities to handle large amounts of raw sensorimotor
data in real time, and so for robotic applications we advocate hybridizing CogPrime with a
separate (but closely cross-linked) system better customized for this sort of data, in line with
our general hypothesis that hybridization of one's integrative neural-symbolic system with a
spatiotemporally hierarchical deep learning system is an effective way to handle representation
and learning of low-level sensorimotor knowledge. While this general principle doesn't depend
on any particular approach, DeSTIN is one example of a deep learning system of this nature
that can be effective in this context.
We have not yet done any sophisticated experiments in this regard - our current experiments
using OpenCog to control robots involve cruder integration of OpenCog with perceptual and
motor subsystems, rather than the tight hybridization described in Chapter 26. Creating such
a hybrid system is "just" a matter of software engineering, but testing such a system may lead
to many surprises!
49.6.6 Goals
Given that we have characterized general intelligence as "the ability to achieve complex goals in
complex environments," it should be plain that goals play a central role in our work. However,
we have chosen not to create a separate subsystem for intentional knowledge, and instead have
concluded that one effective way to handle goals is to represent them declaratively, and allocate
attention among them economically. An advantage of this approach is that it automatically
provides integration between the goal system and the declarative and attentional knowledge
systems.
Goals and subgoals are related using logical links as interpreted and manipulated by PLN, and
attention is allocated among goals using the STI dynamics of ECAN, and a specialized variant
based on RFS's (requests for service). Thus the mechanics of goal management is handled using
uncertain inference and artificial economics, whereas the figuring-out of how to achieve goals
is done integratively, relying heavily on procedural and episodic knowledge as well as PLN and
ECAN.
The combination of ECAN and PLN seems to overcome the well-known shortcomings found
with purely neural-net or purely inferential approaches to goals. Neural net approaches gener-
ally have trouble with abstraction, whereas logical approaches are generally poor at real-time
responsiveness and at tuning their details quantitatively based on experience. At least in prin-
ciple, our hybrid approach overcomes all these shortcomings; though of course, it has been
tested only in fairly simple cases in the virtual world.
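The RFS idea can be illustrated with a small sketch: a goal passes importance down to its subgoals in proportion to each subgoal's estimated usefulness, as recorded on weighted subgoal links. The goal names and weights below are invented for illustration, and CogPrime's actual RFS mechanics are considerably richer; in particular, this sketch parks all the importance on leaf goals.

```python
def propagate_rfs(goal, subgoal_links, sti, amount):
    """Send `amount` of STI from `goal` down to its subgoals, split in
    proportion to each subgoal's link weight (estimated usefulness).
    Leaf goals accumulate the STI; interior goals just pass it on."""
    children = subgoal_links.get(goal, [])
    if not children:
        sti[goal] = sti.get(goal, 0.0) + amount
        return
    total = sum(w for _, w in children)
    for child, w in children:
        propagate_rfs(child, subgoal_links, sti, amount * w / total)

# hypothetical goal hierarchy: weights encode how useful each subgoal
# is believed to be for achieving its parent
links = {
    "please_owner": [("fetch_cup", 3.0), ("tell_joke", 1.0)],
    "fetch_cup":    [("locate_cup", 1.0), ("grasp_cup", 1.0)],
}
sti = {}
propagate_rfs("please_owner", links, sti, 100.0)
```

The resulting STI distribution over action-level goals then feeds directly into ECAN's ordinary dynamics, which is the sense in which goal management reduces to uncertain inference plus artificial economics.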
49.7 Fulfilling the "Cognitive Equation"
A key claim based on the notion of the "Cognitive Equation" posited in Chaotic Logic [Goe94] is
that it is important for an intelligent system to have some way of recognizing large-scale patterns
in itself, and then embodying these patterns as new, localized knowledge items in its memory.
This introduces a feedback loop between emergent pattern and substrate, which
is hypothesized to be critical to general intelligence under feasible computational resources. It
also ties in nicely with the notion of "glocal memory" - essentially positing a localization of
some global memories, which naturally will result in the formation of some glocal memories.
One of the key ideas underlying the CogPrime design is that given the use of a neural-symbolic
network for knowledge representation, a graph-mining based "map formation" heuristic is one
good way to do this.
Map formation seeks to fulfill the Cognitive Equation quite directly, probably more directly
than happens in the brain. Rather than relying on other cognitive processes to implicitly recog-
nize overall system patterns and embody them in the system as localized memories (though this
implicit recognition may also happen), the MapFormation MindAgent explicitly carries out this
process. Mostly this is done using fairly crude greedy pattern mining heuristics, though if really
subtle and important patterns seem to be there, more sophisticated methods like evolutionary
pattern mining may also be invoked.
It seems possible that this sort of explicit approach could be less efficient than purely implicit
approaches; but there is no evidence for this, and it may actually provide increased efficiency.
And in the context of the overall CogPrime design, the explicit MapFormation approach seems
most natural.
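A minimal version of such a graph-mining heuristic can be sketched directly on attentional-focus snapshots: sets of atoms that are frequently important at the same time get encapsulated as new "map" atoms. The real MapFormation MindAgent mines the HebbianLink graph; this sketch just counts co-occurrences, and the atom names are invented for illustration.

```python
from itertools import combinations
from collections import Counter

def mine_maps(focus_snapshots, min_support=2, size=2):
    """Greedy map-formation sketch: count how often each size-`size`
    set of atoms appears together in the attentional focus, and
    encapsulate the frequent ones as new localized 'map' atoms."""
    counts = Counter()
    for snap in focus_snapshots:
        for combo in combinations(sorted(snap), size):
            counts[combo] += 1
    # each frequent co-active set becomes a single new knowledge item
    return {combo: f"MAP[{'+'.join(combo)}]"
            for combo, freq in counts.items() if freq >= min_support}

snapshots = [{"cup", "grasp", "Jim"},   # successive attentional-focus states
             {"cup", "grasp"},
             {"Jim", "joke"}]
maps = mine_maps(snapshots)
```

The encapsulated map atom can then be fed back into PLN, MOSES and ECAN as an ordinary node, which is exactly the emergent-pattern-to-localized-memory feedback the Cognitive Equation calls for.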
49.8 Occam's Razor
The key role of "Occam's Razor" or the urge for simplicity in intelligence has been observed by
many before (going back at least to Occam himself, and probably earlier!), and is fully embraced
in the CogPrime design. Our theoretical analysis of intelligence, presented in Chapter 2 of Part
1 and elsewhere, portrays intelligence as closely tied to the creation of procedures that achieve
goals in environments in the simplest possible way. And this quest for simplicity is present in
many places throughout the CogPrime design, for instance
• In MOSES and hillclimbing, where program compactness is an explicit component of pro-
gram tree fitness
• In PLN, where the backward and forward chainers explicitly favor shorter proof chains,
and intensional inference explicitly characterizes entities in terms of their patterns (where
patterns are defined as compact characterizations)
• In pattern mining heuristics, which search for compact characterizations of data
• In the forgetting mechanism, which seeks the smallest set of Atoms that will allow the
regeneration of a larger set of useful Atoms via modestly-expensive application of cognitive
processes
• Via the encapsulation of procedural and declarative knowledge in simulations, which in
many cases provide a vastly compacted form of storing real-world experiences
Like cognitive synergy and emergent networks, Occam's Razor is not something that is imple-
mented in a single place in the CogPrime design, but rather an overall design principle that
underlies nearly every part of the system.
49.8.1 Mind Geometry
The three mind-geometric principles outlined in Appendix ?? are:
• syntax-semantics correlation
• cognitive geometrodynamics
• cognitive synergy
The key role of syntax-semantics correlation in CogPrime is clear. It plays an explicit role
in MOSES. In PLN, it is critical to inference control, to the extent that inference control is
based on the extraction of patterns from previous inferences. The syntactic structures are the
inference trees, and the semantic structures are the inferential conclusions produced by the trees.
History-guided inference control assumes that prior similar trees will be a good starting-point
for getting results similar to prior ones - i.e. it assumes a reasonable degree of syntax-semantics
correlation. Also, without a correlation between the core elements used to generate an episode,
and the whole episode, it would be infeasible to use historical data mining to understand what
core elements to use to generate a new episode - and creation of compact, easily manipulable
seeds for generating episodes would not be feasible.
Cognitive geometrodynamics is about finding the shortest path from the current state to
a goal state, where distance is judged by an appropriate metric including various aspects of
computational effort. The ECAN and effort management frameworks attempt to enforce this,
via minimizing the amount of effort spent by the system in getting to a certain conclusion.
MindAgents operating primarily on one kind of knowledge (e.g. MOSES, PLN) may for a time
seek to follow the shortest paths within their particular corresponding memory spaces; but then
when they operate more interactively and synergetically, it becomes a matter of finding short
paths in the composite mindspace corresponding to the combination of the various memory
types.
Finally, cognitive synergy is thoroughly and subtly interwoven throughout CogPrime. In a
way the whole design is about cognitive synergy - it's critical for the design's functionality
that the cognitive processes associated with different kinds of memory can
appeal to each other for assistance in overcoming bottlenecks in a manner that: a) works in
"real time", i.e. on the time scale of the cognitive processes' internal processes; b) enables each
cognitive process to act in a manner that is sensitive to the particularities of each others' internal
representations.
Recapitulating in a bit more depth, recall that another useful way to formulate cognitive
synergy is as follows. Each of the key learning mechanisms underlying CogPrime is susceptible
to combinatorial explosions. As the problems they confront become larger and larger, the per-
formance gets worse and worse at an exponential rate, because the number of combinations
of items that must be considered to solve the problems grows exponentially with the problem
size. This could be viewed as a deficiency of the fundamental design, but we don't view it that
way. Our view is that combinatorial explosion is intrinsic to intelligence. The task at hand is to
dampen it sufficiently that realistically large problems can be solved, rather than to eliminate
it entirely. One possible way to dampen it would be to design a single, really clever learning
algorithm - one that was still susceptible to an exponential increase in computational require-
ments as problem size increases, but with a surprisingly small exponent. Another approach is
the mirrorhouse approach: Design a bunch of learning algorithms, each focusing on different
aspects of the learning process, and design them so that they each help to dampen each others'
combinatorial explosions. This is the approach taken within CogPrime. The component algo-
rithms are clever on their own - they are less susceptible to combinatorial explosion than many
competing approaches in the narrow-AI literature. But the real meat of the design lies in the
intended interactions between the components, manifesting cognitive synergy.
49.9 Cognitive Synergy
To understand more specifically how cognitive synergy works in CogPrime, in the following sub-
sections we will review some synergies related to the key components of CogPrime as discussed
above. These synergies are absolutely critical to the proposed functionality of the CogPrime
system. Without them, the cognitive mechanisms are not going to work adequately well, but
are rather going to succumb to combinatorial explosions. The other aspects of CogPrime - the
cognitive architecture, the knowledge representation, the embodiment framework and associ-
ated developmental teaching methodology - are all critical as well, but none of these will yield
the critical emergence of intelligence without cognitive mechanisms that effectively scale. And,
in the absence of cognitive mechanisms that effectively scale on their own, we must rely on
cognitive mechanisms that effectively help each other to scale. The reasons why we believe these
synergies will exist are essentially qualitative: we have not proved theorems regarding these syn-
ergies, and we have observed them in practice only in simple cases so far. However, we do have
some ideas regarding how to potentially prove theorems related to these synergies, and some of
these are described in Appendix H.
49.9.1 Synergies that Help Inference
The combinatorial explosion in PLN is obvious: forward and backward chaining inference are
both fundamentally explosive processes, reined in only by pruning heuristics. This means that
for nontrivial complex inferences to occur, one needs really, really clever pruning heuristics.
The CogPrime design combines simple heuristics with pattern mining, MOSES and economic
attention allocation as pruning heuristics. Economic attention allocation assigns importance
levels to Atoms, which helps guide pruning. Greedy pattern mining is used to search for patterns
in the stored corpus of inference trees, to see if there are any that can be used as analogies
for the current inference. And MOSES comes in when there is not enough information (from
importance levels or prior inference history) to make a choice, yet exploring a wide variety
of available options is unrealistic. In this case, MOSES tasks may be launched pertaining to
the leaves at the fringe of the inference tree that are under consideration for expansion. For instance,
suppose there is an Atom A at the fringe of the inference tree, and its importance hasn't been
assessed with high confidence, but a number of items B are known so that:
MemberLink A B
Then, MOSES may be used to learn various relationships characterizing A, based on recognizing
patterns across the set of B that are suspected to be members of A. These relationships may
then be used to assess the importance of A more confidently, or perhaps to enable the inference
tree to match one of the patterns identified by pattern mining on the inference tree corpus. For
example, if MOSES figures out that:
SimilarityLink G A
then it may happen that substituting G in place of A in the inference tree, results in something
that pattern mining can identify as being a good (or poor) direction for inference.
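The way these knowledge sources combine in pruning can be caricatured in a few lines: each candidate expansion at the fringe of the inference tree is scored by blending its ECAN importance with a prior-success score mined from past inference trees, and expansions are tried best-first. The weights and scores below are invented for illustration; this is not PLN's actual inference-control code.

```python
def rank_fringe(fringe, sti, history_score, w_sti=0.5, w_hist=0.5):
    """Order fringe atoms best-first for expansion, blending ECAN
    importance (sti) with a score mined from past inference trees
    (history_score). Atoms missing from either map default to 0."""
    def score(atom):
        return (w_sti * sti.get(atom, 0.0)
                + w_hist * history_score.get(atom, 0.0))
    return sorted(fringe, key=score, reverse=True)

fringe = ["A", "B", "C"]                     # candidate expansion points
sti = {"A": 0.2, "B": 0.9, "C": 0.9}         # ECAN importance levels
history = {"A": 0.9, "B": 0.1, "C": 0.4}     # how often similar steps paid off
ranking = rank_fringe(fringe, sti, history)
```

When both scores are uninformative for an atom, this is exactly the point where the text above proposes launching a MOSES task to characterize the atom before committing to an expansion.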
49.10 Synergies that Help MOSES
MOSES's combinatorial explosion is obvious: the number of possible programs of size N increases
very rapidly with N. The only way to get around this is to utilize prior knowledge, and as much
as possible of it. When solving a particular problem, the search for new solutions must make
use of prior candidate solutions evaluated for that problem, and also prior candidate solutions
(including successful and unsuccessful ones) evaluated for other related problems.
But, extrapolation of this kind is in essence a contextual analogical inference problem. In
some cases it can be solved via fairly straightforward pattern mining; but in subtler cases it will
require inference of the type provided by PLN. Also, attention allocation plays a role in figuring
out, for a given problem A, which problems B are likely to have the property that candidate
solutions for B are useful information when looking for better solutions for A.
49.10.1 Synergies that Help Attention Allocation
Economic attention allocation, without help from other cognitive processes, is just a very sim-
ple process analogous to "activation spreading" and "Hebbian learning" in a neural network.
The other cognitive processes are the things that allow it to more sensitively understand the
attentional relationships between different knowledge items (e.g. which sorts of items are often
usefully thought about in the same context, and in which order).
49.10.2 Further Synergies Related to Pattern Mining
Statistical, greedy pattern mining is a simple process, but it nevertheless can be biased in
various ways by other, more subtle processes.
For instance, if one has learned a population of programs via MOSES, addressing some
particular fitness function, then one can study which items tend to be utilized in the same
programs in this population. One may then direct pattern mining to find patterns combining
these items found to be in the MOSES population. And conversely, relationships denoted by
pattern mining may be used to probabilistically bias the models used within MOSES.
Statistical pattern mining may also help PLN by supplying it with information to work
on. For instance, conjunctive pattern mining finds conjunctions of items, which may then be
combined with each other using PLN, leading to the formation of more complex predicates.
These conjunctions may also be fed to MOSES as part of an initial population for solving a
relevant problem.
Finally, the main interaction between pattern mining and MOSES/PLN is that the former
may recognize patterns in links created by the latter. These patterns may then be fed back
into MOSES and PLN as data. This virtuous cycle allows pattern mining and the other, more
expensive cognitive processes to guide each other. Attention allocation also gets into the game,
by guiding statistical pattern mining and telling it which terms (and which combinations) to
spend more time on.
49.10.3 Synergies Related to Map Formation
The essential synergy regarding map formation is obvious: Maps are formed based on the
HebbianLinks created via PLN and simpler attentional dynamics, which are based on which
Atoms are usefully used together, which is based on the dynamics of the cognitive processes
doing the "using." On the other hand, once maps are formed and encapsulated, they feed into
these other cognitive processes. This synergy in particular is critical to the emergence of self
and attention.
What has to happen, for map formation to work well, is that the cognitive processes must
utilize encapsulated maps in a way that gives rise overall to relatively clear clusters in the
network of HebbianLinks. This will happen if the encapsulated maps are not too complex for
the system's other learning operations to understand. So, there must be useful coordinated
attentional patterns whose corresponding encapsulated-map Atoms are not too complicated.
This has to do with the system's overall parameter settings, but largely with the settings of
the attention allocation component. For instance, this is closely tied in with the limited size of
"attentional focus" (the famous 7 +/- 2 number associated with humans' and other mammals'
short-term memory capacity). If only a small number of Atoms are typically very important
at a given point in time, then the maps formed by grouping together all simultaneously highly
important things will be relatively small predicates, which will be easily reasoned about - thus
keeping the "virtuous cycle" of map formation and comprehension going effectively.
49.11 Emergent Structures and Dynamics
We have spent much more time in this book on the engineering of cognitive processes and
structures, than on the cognitive processes and structures that must emerge in an intelligent
system for it to display human-level AGI. However, this focus should not be taken to represent
a lack of appreciation for the importance of emergence. Rather, it represents a practical focus:
engineering is what we must do to create a software system potentially capable of AGI, and
emergence is then what happens inside the engineered AGI to allow it to achieve intelligence.
Emergence must however be taken carefully into account when deciding what to engineer!
One of the guiding ideas underlying the CogPrime design is that an AGI system with ade-
quate mechanisms for handling the key types of knowledge mentioned above, and the capability
to explicitly recognize large-scale patterns in itself, should, upon sustained interaction with
an appropriate environment in pursuit of appropriate goals, give rise to the emergence of a variety of com-
plex structures in its internal knowledge network, including (but not limited to): a hierarchical
network, representing both a spatiotemporal hierarchy and an approximate "default inheritance"
hierarchy, cross-linked; a heterarchical network of associativity, roughly aligned with the hierar-
chical network; a self network which is an approximate micro image of the whole network; and
inter-reflecting networks modeling self and others, reflecting a "mirrorhouse" design pattern.
The dependence of these posited emergences on the environment and goals of the AGI system
should not be underestimated. For instance, PLN and pattern mining don't have to lead to a
hierarchical structured Atomspace, but if the AGI system is placed in an environment which is
itself hierarchically structured, then they very likely will do so. And if this environment consists
of hierarchically structured language and culture, then what one has is a system of minds with
hierarchical networks, each reinforcing the hierarchality of each others' networks. Similarly,
integrated cognition doesn't have to lead to mirrorhouse structures, but integrated cognition
about situations involving other minds studying and predicting and judging each other, is very
likely to do so. What is needed for appropriate emergent structures to arise in a mind, is
mainly that the knowledge representation is sufficiently flexible to allow these structures, and
the cognitive processes are sufficiently intelligent to observe these structures in the environment
and then mirror them internally. Of course, it also doesn't hurt if the internal structures and
processes are at least slightly biased toward the origination of the particular high-level emergent
structures that are characteristic of the system's environment/goals; and this is indeed the case
with CogPrime - biases toward hierarchical, heterarchical, dual and mirrorhouse networks are
woven throughout the system design, in a thoroughgoing though not extremely systematic way.
49.12 Ethical AGI
Creating an AGI with guaranteeably ethical behavior seems an infeasible task; but of course,
no human is guaranteeably ethical either, and in fact it seems almost guaranteed that in any
moderately large group of humans there are going to be some with strong propensities for
extremely unethical behaviors, according to any of the standard human ethical codes. One of
our motivations in developing CogPrime has been the belief that an AGI system, if supplied
with a commonsensically ethical goal system and an intentional component based on rigorous
uncertain inference, should be able to reliably achieve a much higher level of commonsensically
ethical behavior than any human being.
Our explorations in the detailed design of CogPrime's goal system have done nothing to
degrade this belief. While we have not yet developed any CogPrime system to the point where
experimenting with its ethics is meaningful, based on our understanding of the current design
it seems to us that
• a typical CogPrime system will display a much more consistent and less conflicted and
confused motivational system than any human being, due to its explicit orientation toward
carrying out actions that (based on its knowledge) rationally seem most likely to lead to
achievement of its goals
• if a CogPrime system is given goals that are consistent with commonsensical human ethics
(say, articulated in natural language), and then educated in an ethics-friendly environment
such as a virtual or physical school, then it is reasonable to expect the CogPrime system will
ultimately develop an advanced (human adult level or beyond) form of commonsensical
human ethics
Human ethics is itself wracked with inconsistencies, so one cannot expect a rationality-based
AGI system to precisely mirror the ethics of any particular human individual or cultural system.
But given the degree to which general intelligence represents adaptation to its environment, and
interpretation of natural language depends on life history and context, it seems very likely to
us that a CogPrime system, if supplied with a human-commonsense-ethics based goal system
and then raised by compassionate and intelligent humans in a school-type environment, would
arrive at its own variant of human-commonsense-ethics. The AGI system's ethics would then
interact with human ethical systems in complex ways, leading to ongoing evolution of both
systems and the development of new cultural and ethical patterns. Predicting the future is
492 49 Summary of Argument for the CogPrime Approach
difficult even in the absence of radical advanced technologies, but our intuition is that this path
has the potential to lead to beneficial outcomes for both human and machine intelligence.
49.13 Toward Superhuman General Intelligence
Human-level AGI is a difficult goal, relative to the current state of scientific understanding
and engineering capability, and most of this book has been focused on our ideas about how to
achieve it. However, we also suspect the CogPrime architecture has the ultimate potential to
push beyond the human level in many ways. As part of this suspicion we advance the claim
that once sufficiently advanced, a CogPrime system should be able to radically self-improve via
a variety of methods, including supercompilation and automated theorem-proving.
Supercompilation allows procedures to be automatically replaced with equivalent but mas-
sively more time-efficient procedures. This is particularly valuable in that it allows AI algorithms
to learn new procedures without much heed to their efficiency, since supercompilation can al-
ways improve the efficiency afterwards. So it is a real boon to automated program learning.
Theorem-proving is difficult for current narrow-AI systems, but for an AGI system with
a deep understanding of the context in which each theorem exists, it should be much easier
than for human mathematicians. So we envision that ultimately an AGI system will be able to
design itself new algorithms and data structures via proving theorems about which ones will
best help it achieve its goals in which situations, based on mathematical models of itself and
its environment. Once this stage is achieved, it seems that machine intelligence may begin to
vastly outdo human intelligence, leading in directions we cannot now envision.
While such projections may seem science-fictional, we note that the CogPrime architecture
explicitly supports such steps. If human-level AGI is achieved within the CogPrime framework,
it seems quite feasible that profoundly self-modifying behavior could be achieved fairly shortly
thereafter. For instance, one could take a human-level CogPrime system and teach it computer
science and mathematics, so that it fully understood the reasoning underlying its own design,
and the whole mathematics curriculum leading up to the algorithms underpinning its cognitive
processes.
49.13.1 Conclusion
What we have sought to do in these pages is, mainly,
• to articulate a theoretical perspective on general intelligence, according to which the cre-
ation of a human-level AGI doesn't require anything that extraordinary, but "merely" an
appropriate combination of closely interoperating algorithms operating on an appropriate
multi-type memory system, utilized to enable a system in an appropriate body and envi-
ronment to figure out how to achieve its given goals
• to describe a software design (CogPrime ) that, according to this somewhat mundane but
theoretically quite well grounded vision of general intelligence, appears likely (according to
a combination of rigorous and heuristic arguments) to be able to lead to human-level AGI
using feasible computational resources
• to describe some of the preliminary lessons we've learned via implementing and experiment-
ing with aspects of the CogPrime design, in the OpenCog system
In this concluding chapter we have focused on the "combination of rigorous and heuristic argu-
ments" that lead us to consider it likely that CogPrime has the potential to lead to human-level
AGI using feasible computational resources.
We also wish to stress that not all of our arguments and ideas need to be 100% correct in order
for the project to succeed. The quest to create AGI is a mix of theory, engineering, and scientific
and unscientific experimentation. If the current CogPrime design turns out to have significant
shortcomings, yet still brings us a significant percentage of the way toward human-level AGI,
the results obtained along the path will very likely give us clues about how to tweak the design
to more effectively get the rest of the way there. And the OpenCog platform is extremely flexible
and extensible, rather than being tied to the particular details of the CogPrime design. While
we do have faith that the CogPrime design as described here has human-level AGI potential,
we are also pleased to have a development strategy and implementation platform that will
allow us to modify and improve the design in whatever ways are suggested by our ongoing
experimentation.
Many great achievements in history have seemed more magical before their first achievement
than afterwards. Powered flight and spaceflight are the most obvious examples, but there are
many others such as mobile telephony, prosthetic limbs, electronically deliverable books, robotic
factory workers, and so on. We now even have wireless transmission of power (one can recharge
cellphones via wifi), though not yet as ambitiously as Tesla envisioned. We very strongly suspect
that human-level AGI is in the same category as these various examples: an exciting and
amazing achievement, which however is achievable via systematic and careful application of
fairly mundane principles. We believe computationally feasible human-level intelligence is both
complicated (involving many interoperating parts, each sophisticated in their own right) and
complex (in the sense of involving many emergent dynamics and structures whose details are
not easily predictable based on the parts of the system) ... but that neither the complication
nor the complexity is an obstacle to engineering human-level AGI.
Furthermore, while ethical behavior is a complex and subtle matter for humans or machines,
we believe that the production of human-level AGIs that are not only intelligent but also ben-
eficial to humans and other biological sentiences, is something that is probably tractable to
achieve based on a combination of careful AGI design and proper AGI education and "parent-
ing." One of the motivations underlying our design has been to create an artificial mind that
has broadly humanlike intelligence, yet has a more rational and self-controllable motivational
system than humans, thus ultimately having the potential for a greater-than-human degree of
ethical reliability alongside its greater-than-human intelligence.
In our view, what is needed to create human-level AGI is not a new scientific breakthrough,
nor a miracle, but "merely" a sustained effort over a number of years by a moderate-sized
team of appropriately-trained professionals, completing the implementation of the design in
this book and then parenting and educating the resulting implemented system. CogPrime is by
no means the only possible path to human-level AGI, but we believe it is considerably more
fully thought-through and fleshed-out than any available alternatives. Actually, we would love
to see CogPrime and a dozen alternatives simultaneously pursued - this may seem ambitious,
but it would cost a fraction of the money currently spent on other sorts of science or engineering,
let alone the money spent on warfare or decorative luxury items. We strongly suspect that, in
hindsight, our human and digital descendants will feel amazed that their predecessors allocated
so few financial and attentional resources to the creation of powerful AGI, and consequently
took so long to achieve such a fundamentally straightforward thing.
Chapter 50
Build Me Something I Haven't Seen: A CogPrime
Thought Experiment
50.1 Introduction
AGI design necessarily leads one into some rather abstract spaces — but being a human-like
intelligence in the everyday world is a pretty concrete thing. If the CogPrime research program
is successful, it will result not just in abstract ideas and equations, but rather in real AGI
robots carrying out tasks in the world, and AGI agents in virtual worlds and online digital
spaces conducting important business, doing science, entertaining and being entertained by us,
and so forth. With this in mind, in this final chapter we will bring the discussion closer to the
concrete and everyday, and pursue a thought experiment of the form "How would a completed
CogPrime system carry out this specific task?"
The task we will use for this thought-experiment is one we have used as a running example
now and then in the preceding chapters. We consider the case of a robotically or virtually
embodied CogPrime system, operating in a preschool type environment, interacting with a
human whom it already knows and given the task of "Build me something with blocks that I
haven't seen before."
This target task is fairly simple, but it is complex enough to involve essentially every one of
CogPrime's processes, interacting in a unified way. It involves simple, grounded creativity of the
sort that normal human children display every day - and which, we conjecture, is structurally
and dynamically basically the same as the creativity underlying the genius of adult human
creators like Einstein, Dali, Dostoevsky, Hendrix, and so forth ... and as the creativity that will
power massively capable genius machines in future.
We will consider the case of a simple interaction based on the above task where:
1. The human teacher tells the CogPrime agent "Build me something with blocks that I haven't
seen before."
2. After a few false starts, the agent builds something it thinks is appropriate and says "Do
you like it?"
3. The human teacher says "It's beautiful. What is it?"
4. The agent says "It's a car man" [and indeed, the construct has 4 wheels and a chassis vaguely
like a car, but also a torso, arms and head vaguely like a person]
Of course, a complex system like CogPrime could carry out an interaction like this internally
in many different ways, and what is roughly described here is just one among many possibilities.
First we will enumerate a number of CogPrime processes and explain some ways that each
one may help CogPrime carry out the target task. Then we will give a more evocative narrative,
conveying the dynamics that would occur in CogPrime while carrying out the target task, and
mentioning each of the enumerated cognitive processes as it arises in the narrative.
50.2 Roles of Selected Cognitive Processes
Now we review a number of the more interesting CogPrime cognitive processes mentioned in
previous chapters of the book, for each one indicating one or more of the roles it might play in
helping a CogPrime system carry out the target task. Note that this list is incomplete in many
senses, e.g. it doesn't list all the cognitive processes, nor all the roles played by the ones listed.
The purpose is to give an evocative sense of the roles played by the different parts of the design
in carrying out the task.
• Chapter 19 (OpenCog Framework)
- Freezing/defrosting.
• When the agent builds a structure from blocks and decides it's not good enough to
show off to the teacher, what does it do with the detailed ideas and thought process
underlying the structure it built? If it doesn't like the structure so much, it may just
leave this to the generic forgetting process. But if it likes the structure a lot, it may
want to increase the VLTI (Very Long Term Importance) of the Atoms related to
the structure in question, to be sure that these are stored on disk or other long-term
storage, even after they're deemed sufficiently irrelevant to be pushed out of RAM
by the forgetting mechanism.
• When given the target task, the agent may decide to revive from disk the mind-
states it went through when building crowd-pleasing structures from blocks before,
so as to provide it with guidance.
• Chapter 22 (Emotion, Motivation, Attention and Control)
- Cognitive cycle.
• While building with blocks, the agent's cognitive cycle will be dominated by per-
ceiving, acting on, and thinking about the blocks it is building with.
• When interacting with the teacher, then interaction-relevant linguistic, perceptual
and gestural processes will also enter into the cognitive cycle.
- Emotion. The agent's emotions will fluctuate naturally as it carries out the task.
• If it has a goal of pleasing the teacher, then it will experience happiness as its
expectation of pleasing the teacher increases.
• If it has a goal of experiencing novelty, then it will experience happiness as it creates
structures that are novel in its experience.
• If it has a goal of learning, then it will experience happiness as it learns new things
about blocks construction.
• On the other hand, it will experience unhappiness as its experienced or predicted
satisfaction of these goals decreases.
- Action selection
In dialoguing with the teacher, action selection will select one or more DialogueController
schema to control the conversational interaction (based on which DC schema have proved
most effective in prior similar situations).
When the agent wants to know the teacher's opinion of its construct, what
is happening internally is that the "please teacher" Goal Atom gets a link of
the conceptual form (Implication "find out teacher's opinion of my current con-
struct" "please teacher"). This link may be created by PLN inference, prob-
ably largely by analogy to previously encountered similar situations. Then,
GoalImportance is spread from the "please teacher" Goal Atom to the "find out
teacher's opinion of my current construct" Atom (via the mechanism of sending
an RFS package to the latter Atom). More inference causes a link (Implication
"ask the teacher for their opinion of my current construct" "find out teacher's
opinion of my current construct") to be formed, and the "ask the teacher for
their opinion of my current construct" Atom to get GoalImportance also. Then
PredicateSchematization causes the predicate "ask the teacher for their opinion
of my current construct" to get turned into an actionable schema, which gets
GoalImportance, and which gets pushed into the ActiveSchemaPool via goal-
driven action selection. Once the schema version of "ask the teacher for their
opinion of my current construct" is in the ActiveSchemaPool, it then invokes
natural language generation Tasks, which lead to the formulation of an English
sentence such as "Do you like it?"
When the teacher asks "It's beautiful. What is it?", then the NL comprehension
MindAgent identifies this as a question, and the "please teacher" Goal Atom gets
a link of the conceptual form (Implication "answer the question the teacher just
asked" "please teacher"). This follows simply from the knowledge ( Implication
("teacher has just asked a question" AND "I answer the teacher's question")
("please teacher")), or else from more complex knowledge refining this Impli-
cation. From this point, things proceed much as in the case "Do you like it?"
described just above.
Consider a schema such as "pick up a red cube and place it on top of the long red
block currently at the top of the structure" (let's call this P). Once P is placed in
the ActiveSchemaPool, then it runs and generates more specific procedures, such as
the ones needed to find a red cube, to move the agent's arm toward the red cube
and grasp it, etc. But the execution of these specific low-level procedures is done via
the ExecutionManager, analogously to the execution of the specifics of generating
a natural language sentence from a collection of semantic relationships. Loosely
speaking, reaching for the red cube and turning simple relationships into simple
sentences are considered "automated processes" not requiring holistic engagement
of the agent's mind. What the generic, more holistic Action Selection mechanism
does in the present context is to decide to put P in the ActiveSchemaPool in
the first place. This occurs because of a chain such as: P predictively implies (with
a certain probabilistic weight) "completion of the car-man structure", which in turn
predictively implies "completion of a structure that is novel to the teacher," which in
turn predictively implies "please the teacher," which in turn implies "please others,"
which is assumed an Ubergoal (a top-level system goal).
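The goal-to-action chain just described can be sketched in miniature. The following Python fragment is purely illustrative (the names Atom, Implication and spread_goal_importance are hypothetical stand-ins, not OpenCog's actual API); it shows the bookkeeping by which GoalImportance flows backward from the "please teacher" Ubergoal, along Implication links, to an actionable predicate.

```python
# Hypothetical sketch of backward GoalImportance spreading along Implication
# links (a crude stand-in for sending RFS packages to antecedent Atoms).

class Atom:
    def __init__(self, name):
        self.name = name
        self.goal_importance = 0.0

class Implication:
    def __init__(self, antecedent, consequent, strength):
        self.antecedent = antecedent
        self.consequent = consequent
        self.strength = strength

def spread_goal_importance(goal, implications):
    """Propagate importance backward from a goal to every Atom that
    (according to some Implication link) helps achieve it."""
    frontier = [goal]
    while frontier:
        g = frontier.pop()
        for imp in implications:
            if imp.consequent is g:
                imp.antecedent.goal_importance += g.goal_importance * imp.strength
                frontier.append(imp.antecedent)

please = Atom("please teacher"); please.goal_importance = 1.0
find_out = Atom("find out teacher's opinion of my current construct")
ask = Atom("ask the teacher for their opinion of my current construct")
links = [Implication(find_out, please, 0.8), Implication(ask, find_out, 0.9)]
spread_goal_importance(please, links)
# "ask..." now carries enough importance for PredicateSchematization to turn
# it into an actionable schema and push it into the ActiveSchemaPool.
```

The illustrative link strengths (0.8, 0.9) are arbitrary; in CogPrime they would be truth values produced by PLN inference.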
- Goal Atoms. As the above items make clear, the scenario in question requires the
initial Goal Atoms to be specialized, via the creation of more and more particular
subgoals suiting the situation at hand.
- Context Atoms.
• Knowledge of the context the agent is in can help it disambiguate language it hears,
e.g. knowing the context is blocks-building helps it understand which sense of the
word "blocks" is meant.
• On the other hand, if the context is that the teacher is in a bad mood, then the agent
might know via experience that in this context, the strength of (Implication "ask
the teacher for their opinion of my current construct" "find out teacher's opinion of
my current construct") is lower than in other contexts.
- Context formation.
• A context like "blocks-building" or "teacher in a bad mood" may be formed by cluster-
ing over multiple experience-sets, i.e. forming Atoms that refer to spatiotemporally
grouped sets of percepts/concepts/actions, and grouping together similar Atoms of
this nature into clusters.
• The Atom referring to the cluster of experience-sets involving blocks-building will
then survive as an Atom if it gets involved in relationships that are important or
have surprising truth values. If many relationships have significantly different truth-
value inside the blocks-building context than outside it, this means it's likely that
the blocks-building ConceptNode will remain as an Atom with reasonably high LTI,
so it can be used as a context in future.
- Time-dependence of goals. Many of the agent's goals in this scenario have different
importances over different time scales. For instance "please the teacher" is important
on multiple time-scales: the agent wants to please the teacher in the near term but also
in the longer term. But a goal like "answer the question the teacher just asked" has an
intrinsic time-scale to it; if it's not fulfilled fairly rapidly then its importance goes away.
• Chapter 23 (Attention allocation)
- ShortTermImportance versus LongTermImportance. While conversing, the con-
cepts immediately involved in the conversation (including the Atoms describing the
agents in the conversation) have very high STI. While building, Atoms representing
the blocks and related ideas about the structures being built (e.g. images of cars and
people perceived or imagined in the past) have very high STI. But the reason these
Atoms are in RAM prior to having their STI boosted due to their involvement in the
agent's activities, is because they had their LTI boosted at some point in the past.
And after these Atoms leave the AttentionalFocus and their STI reduces, they will
have boosted LTI and hence likely remain in RAM for a long while, to be involved in
"background thought", and in case they're useful in the AttentionalFocus again.
- HebbianLink formation. As a single example, the car-man has both wheels and
arms, so now a Hebbian association between wheels and arms will exist in the agent's
memory, to potentially pop up again and guide future thinking. The very idea of a
car-man likely emerged partly due to previously formed HebbianLinks - because people
were often seen sitting in cars, the association between person and car existed, which
made the car concept and the human concept natural candidates for blending.
- Data mining the System Activity Table. The HebbianLinks mentioned above may
have been formed via mining the SystemActivityTable.
- ECAN based associative memory. When the agent thinks about making a car,
this spreads importance to various Atoms related to the car concept, and one thing
this does is lead to the emergence of the car attractor into the AttentionalFocus. The
different aspects of a car are represented by heavily interlinked Atoms, so that when
some of them become important, there's a strong tendency for the others to also become
important - and for "car" to then emerge as an attractor of importance dynamics.
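This attractor dynamic can be sketched as a toy importance-spreading loop. The code below is a simplification with hypothetical names (spread_sti is not ECAN's actual update rule): each Atom passes a fraction of its STI to its Hebbian neighbours, so the heavily interlinked car-related Atoms pull each other toward the AttentionalFocus.

```python
# Toy sketch of ECAN-style STI spreading over HebbianLinks (illustrative
# update rule and parameters, not the actual OpenCog implementation).

def spread_sti(sti, hebbian, rounds=3, fraction=0.5):
    """Each round, every Atom passes `fraction` of its STI to its Hebbian
    neighbours, weighted by link strength; total STI is conserved."""
    for _ in range(rounds):
        delta = {a: 0.0 for a in sti}
        for (a, b), w in hebbian.items():
            give = sti[a] * fraction * w
            delta[a] -= give
            delta[b] += give
        for a in sti:
            sti[a] += delta[a]
    return sti

# "car" starts out important; its interlinked aspects soak up importance.
sti = {"car": 10.0, "wheel": 0.0, "chassis": 0.0}
hebbian = {("car", "wheel"): 0.6, ("car", "chassis"): 0.4,
           ("wheel", "car"): 0.5, ("chassis", "car"): 0.5}
spread_sti(sti, hebbian)
# After a few rounds "wheel" and "chassis" have drawn importance from "car":
# the car-related cluster behaves as an attractor of importance dynamics.
```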
- Schema credit assignment.
• Suppose the agent has a subgoal of placing a certain blue block on top of a certain red
block. It may use a particular motor schema for carrying out this action - involving,
for instance, holding the blue block above the red block and then gradually lowering
it. If this schema results in success (rather than in, say, knocking down the red
block), then it should get rewarded via having its STI and LTI boosted and also
having the strength of the link between it and the subgoal increased.
• Next, suppose that a certain cognitive schema (say, the schema of running multiple
related simulations and averaging the results, to estimate the success probability
of a motor procedure) was used to arrive at the motor schema in question. Then
this cognitive schema may get passed some importance from the motor schema, and
it will get the strength of its link to the goal increased. In this way credit passes
backwards from the goal to the various schema directly or indirectly involved in
fulfilling it.
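The backward flow of credit described above can be sketched as follows; the function and the decay scheme are hypothetical illustrations of the idea, not CogPrime's actual credit-assignment code.

```python
# Illustrative sketch: pass reward backward along a chain of schemata, from
# the motor schema that directly achieved the goal to the cognitive schema
# that selected it, with geometric decay at each step.

def assign_credit(goal_reward, chain, decay=0.8):
    """Return a credit share for each schema in the chain, ordered from the
    schema closest to the goal to the most indirect contributor."""
    credit = {}
    r = goal_reward
    for schema in chain:
        credit[schema] = r
        r *= decay
    return credit

chain = ["lower blue block onto red block",         # motor schema
         "average results of related simulations"]  # cognitive schema
credit = assign_credit(1.0, chain)
# The motor schema receives full reward; the cognitive schema that chose it
# receives a decayed share, mirroring the backward credit flow in the text.
```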
- Forgetting. If the agent builds many structures from blocks during its lifespan, it will
accumulate a large amount of perceptual memory, much of which must eventually be
pushed out of RAM by the forgetting mechanism (with especially valuable Atoms saved
to disk via high VLTI).
• Chapter 24 (Goal and Action Selection). Much of the use of the material in this chapter
was covered above in the bullet point for Chapter 22, but a few more notes are:
- Transfer of RFS between goals. Above it was noted that the link (Implication "ask
the teacher for their opinion of my current construct" "find out teacher's opinion of
my current construct") might be formed and used as a channel for Goallmportance
spreading.
- Schema Activation. Supposing the agent is building a man-car, it may have car-
building schema and man-building schema in its ActiveSchemaPool at the same time,
and it may enact both of them in an interleaved manner. But if each tends to require
two hands for its real-time enaction, then schema activation will have to pass back
and forth between the two of them, so that at any one time, one is active whereas the
other one is sitting in the ActiveSchemaPool waiting to get activated.
- Goal Based Schema Learning. To take a fairly low-level example, suppose the agent
has the (sub)goal of making an arm for a blocks-based person (or man-car), given the
presence of a blocks-based torso. Suppose it finds a long block that seems suitable to be
an arm. It then has the problem of figuring out how to attach the arm to the body. It
may try out several procedures in its internal simulation world, until it finds one that
works: hold the arm in the right position while one end of it rests on top of some block
that is part of the torso, then place some other block on top of that end, then slightly
release the arm and see if it falls. If it doesn't fall, leave it. If it seems about to fall, then
place something heavier atop it, or shove it further in toward the center of the torso.
The procedure learning process could be MOSES here, or it could be PLN.
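A stripped-down version of this trial loop might look as follows. Everything here is a toy stand-in: the candidate procedures, the simulator, and the function name learn_attachment_schema are hypothetical, and the real search would be driven by MOSES or PLN rather than linear enumeration.

```python
# Illustrative sketch: try candidate arm-attachment procedures in an internal
# simulation until one keeps the arm stable (hypothetical names throughout).

def learn_attachment_schema(candidates, simulate):
    """Return the first candidate procedure whose simulated outcome is
    stable; a stand-in for goal-based schema learning via MOSES or PLN."""
    for proc in candidates:
        if simulate(proc):
            return proc
    return None

# Toy simulator: the arm stays up only if it both rests on the torso and has
# its inner end weighed down, as in the procedure described in the text.
def simulate(proc):
    return "rest on torso" in proc and "weigh down end" in proc

candidates = [("hold in air",),
              ("rest on torso",),
              ("rest on torso", "weigh down end")]
best = learn_attachment_schema(candidates, simulate)
```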
• Chapter 25 (Procedure Evaluation)
- Inference Based Procedure Evaluation. A procedure for man-building such as
"first put up feet, then put up legs, then put up torso, then put up arms and head"
may be synthesized from logical knowledge (via predicate schematization) but without
filling in the details of how to carry out the individual steps, such as "put up legs." If
a procedure with abstract (ungrounded) schema like PutUpTorso is chosen for execu-
tion and placed into the ActiveSchemaPool, then in the course of execution, inferential
procedure evaluation must be used to figure out how to make the abstract schema ac-
tionable. The GoalDrivenActionSelection MindAgent must make the choice whether to
put a not-fully-grounded schema into the ActiveSchemaPool, rather than grounding it
first and then making it active; this is the sort of choice that may be made effectively
via learned cognitive schema.
• Chapter 26 (Perception and Action)
- ExperienceDB. No person remembers every blocks structure they ever saw or built,
except maybe some autists. But a CogPrime can store all this information fairly easily,
in its ExperienceDB, even if it doesn't keep it all in RAM in its AtomSpace. It can also
store everything anyone ever said about blocks structures in its vicinity.
- Perceptual Pattern Mining.
- Object Recognition. Recognizing structures made of blocks as cars, people, houses,
etc. requires fairly abstract object recognition, involving identifying the key shapes and
features involved in an object-type, rather than just going by simple visual similarity.
- Hierarchical Perception Networks. If the room is well-lit, it's easy to visually iden-
tify individual blocks within a blocks structure. If the room is darker, then more top-
down processing may be needed - identifying the overall shape of the blocks structure
may guide one in making out the individual blocks.
- Hierarchical Action Networks. Top-down action processing tells the agent that, if
it wants to pick up a block, it should move its arm in such a way as to get its hand
near the block, and then move its hand. But if it's still learning how to do that sort
of motion, more likely it will do this, but then start moving its hand and find that
it's hard to get a grip on the block - and then have to go back and move its arm a
little differently. Iterating between broader arm/hand movements and more fine-grained
hand/finger movements is an instance of information iteratively passing up and down
a hierarchical action network.
- Coupling of Perception and Action Networks. Picking up a block in the dark is
a perfect example of rich coupling of perception and action networks. Feeling the block
with the fingers helps with identifying blocks that can't be clearly seen.
• Chapter 30 (Procedure Learning)
- Specification Based Procedure Learning.
• Suppose the agent has never seen a horse, but the teacher builds a number of blocks
structures and calls them horses, and draws a number of pictures and calls them
horses. This may cause a procedure learning problem to be spawned, where the
fitness function is accuracy at distinguishing horses from non-horses.
• Learning to pick up a block is specification-based procedure learning, where the
specification is to pick up the block and grip it and move it without knocking down
the other stuff near the block.
- Representation Building.
• In the midst of building a procedure to recognize horses, MOSES would experi-
mentally vary program nodes recognizing visual features into other program nodes
recognizing other visual features
• In the midst of building a procedure to pick up blocks, MOSES would experimentally
vary program nodes representing physical movements into other nodes representing
physical movements
• In both of these cases, MOSES would also carry out the standard experimen-
tal variations of mathematical and control operators according to its standard
representation-building framework
• Chapter 31 (Imitative, Reinforcement and Corrective Learning)
- Reinforcement Learning.
• Motor procedures for placing blocks (in simulations or reality) will get rewarded if
they don't result in the blocks structure falling down, punished otherwise.
• Procedures leading to the teacher being pleased, in internal simulations (or in re-
peated trials of scenarios like the one under consideration), will get rewarded; pro-
cedures leading to the teacher being displeased will get punished.
- Imitation Learning. If the agent has seen others build with blocks before, it may
summon these memories and then imitate the actions it has seen others take.
- Corrective Learning. This would occur if the teacher intervened in the agent's block-
building and guided him physically - e.g. steadying his shaky arm to prevent him from
knocking the blocks structure over.
• Chapter 32 (Hillclimbing)
- Complexity Penalty. In learning procedures for manipulating blocks, the complexity
penalty will militate against procedures that contain extraneous steps.
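The effect of such a penalty is easy to see in a minimal fitness function. The linear penalty form and the coefficient below are illustrative assumptions, not the exact scheme CogPrime uses.

```python
# Sketch of a complexity-penalized fitness for hillclimbing over blocks-
# manipulation procedures (penalty form and alpha are illustrative).

def penalized_fitness(accuracy, program_size, alpha=0.05):
    """Score a candidate procedure: raw task accuracy minus a penalty
    proportional to program size, so extraneous steps lower the score."""
    return accuracy - alpha * program_size

# Two procedures achieving the same accuracy: the one containing extraneous
# steps is militated against by the penalty.
lean = penalized_fitness(accuracy=0.9, program_size=5)
bloated = penalized_fitness(accuracy=0.9, program_size=12)
```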
• Chapter 33 (Probabilistic Evolutionary Procedure Learning)
- Supplying Evolutionary Learning with Long-Term Memory. Suppose the agent
has previously built people from clay, but never from blocks. It may then have learned a
"classification model" predicting which clay people will look appealing to humans, and
which won't. It may then transfer this knowledge, using PLN, to form a classification
model predicting which blocks-people will look appealing to humans, and which won't.
- Fitness Function Estimation via Integrative Intelligence. To estimate the fitness
of a procedure for, say, putting an arm on a blocks-built human, the agent may try out
the procedure in the internal simulation world; or it may use PLN inference to reason
by analogy to prior physical situations it's observed. These allow fitness to be estimated
without actually trying out the procedure in the environment.
• Chapter 34 (Probabilistic Logic Networks)
- Deduction. This is a tall skinny structure; tall skinny structures fall down easily; thus
this structure may fall down easily.
- Induction. This teacher is talkative; this teacher is friendly; therefore the talkative are
generally friendly.
- Abduction. This structure has a head and arms and torso; a person has a head and
arms and torso; therefore this structure is a person.
502 50 Build Me Something I Haven't Seen: A CogPrime Thought Experiment
- PLN forward chaining. What properties might a car-man have, based on inference
from the properties of cars and the properties of men?
- PLN backward chaining.
• An inference target might be: Find X so that X looks something like a wheel and
can be attached to this blocks-chassis, and I can find four fairly similar copies.
• Or: Find the truth value of the proposition that this structure looks like a car.
- Indefinite truth values. Consider the deductive inference "This is a tall skinny
structure; tall skinny structures fall down easily; thus this structure may fall down easily."
In this case, the confidence of the second premise may be greater than the confidence
of the first premise, which may result in an intermediate confidence for the conclusion,
according to the propagation of indefinite probabilities through the PLN deduction rule.
- Intensional inference. Is the blocks-structure a person? According to the definition of
intensional inheritance, it shares many informative properties with people (e.g. having
arms, torso and head), so to a significant extent, it is a person.
- Confidence decay. The agent's confidence in propositions regarding building things
with blocks should remain nearly constant. The agent's confidence in propositions re-
garding the teacher's taste should decay more rapidly. This should occur because the
agent should observe that, in general, propositions regarding physical object manipula-
tion tend to retain fairly constant truth value, whereas propositions regarding human
tastes tend to have more rapidly decaying truth value.
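The deduction rule and the context-dependent confidence decay just described can be sketched concretely. The following is a minimal illustration assuming the standard independence-based form of the PLN deduction strength formula and a simple exponential decay model; the function names and the half-life parameter are our own illustrative choices, not OpenCog API calls.

```python
import math

def deduction_strength(sAB, sBC, sB, sC):
    """Strength of A->C from A->B and B->C under the independence
    assumption; each sX is the strength (probability) of the term/link."""
    if sB >= 1.0:  # degenerate case: B covers everything
        return sC
    return sAB * sBC + (1.0 - sAB) * (sC - sB * sBC) / (1.0 - sB)

def decayed_confidence(conf, elapsed, half_life):
    """Exponentially decay a confidence value. A short half-life suits
    volatile knowledge (a teacher's tastes); a long one suits stable
    physical knowledge (properties of blocks)."""
    return conf * 0.5 ** (elapsed / half_life)
```

On this model, confidence in propositions about the teacher's tastes (short half-life) fades quickly, while confidence in "tall skinny structures fall down easily" (long half-life) stays nearly constant, matching the behavior described above.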
• Chapter 35 (Spatiotemporal Inference)
- Temporal reasoning. Suppose, after the teacher asks "What is it?", the agent needs
to think a while to figure out a good answer. But maybe the agent knows that it's rude
to pause too long before answering something to a direct question. Temporal reasoning
helps figure out "how long is too long" to wait before answering.
- Spatial reasoning. Suppose the agent puts shoes on the wheels of the car. This is a
joke relying on the understanding that wheels hold a car up, whereas feet hold a person
up, and the structure is a car-man. But it also relies on the spatial inferences that:
the car's wheels are in the right position for the man's feet (below the torso); and, the
wheels are below the car's chassis just like a person's feet are below its torso.
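Qualitative temporal relations of this kind are commonly formalized via Allen's interval algebra (the "IA" of the glossary). The sketch below shows only a few of the thirteen Allen relations, with illustrative names, and how "how long is too long" might reduce to a simple check on the gap between two intervals; the three-second threshold is purely an assumption.

```python
def interval_relation(a, b):
    """Classify the relation between intervals a=(start, end), b=(start, end)."""
    a0, a1 = a
    b0, b1 = b
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "meets"
    if a0 == b0 and a1 == b1:
        return "equal"
    if b0 < a0 and a1 < b1:
        return "during"
    return "overlaps-or-other"

def pause_is_rude(question_end, answer_start, max_gap=3.0):
    """A pause that extends past the socially acceptable window counts as
    rude; max_gap is an invented placeholder for learned social knowledge."""
    return (answer_start - question_end) > max_gap
```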
• Chapter 36 (Inference Control)
- Evaluator Choice as a Bandit Problem. In doing inference regarding how to make
a suitably humanlike arm for the blocks-man, there may be a choice between multiple
inference pathways, perhaps one that relies on analogy to other situations building
arms, versus one that relies on more general reasoning about lengths and weights of
blocks. The choice between these two pathways will be made randomly with a certain
probabilistic bias assigned to each one, via prior experience.
- Inference Pattern Mining. The probabilities used in choosing which inference path
to take are determined in part by prior experience - e.g. maybe it's the case that in
prior situations of building complex blocks structures, analogy has proved a better guide
than naive physics, thus the prior probability of the analogy inference pathway will be
nudged up.
- PLN and Bayes Nets. What's the probability that the blocks-man's hat will fall
off if the man-car is pushed a little bit to simulate driving? This question could be
resolved in many ways (e.g. by internal simulation), but one possibility is inference. If
50.2 Roles of Selected Cognitive Processes 503
this is resolved by inference, it's the sort of conditional probability calculation that could
potentially be done faster if a lot of the probabilistic knowledge from the AtomSpace
were summarized in a Bayes Net. Updating the Bayes net structure can be slow, so this
is probably not appropriate for knowledge that is rapidly shifting; but knowledge about
properties of blocks structures may be fairly persistent after the agent has gained a fair
bit of knowledge by playing with blocks a lot.
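The bandit-style evaluator choice described above can be sketched as a softmax draw over Laplace-smoothed success rates mined from prior inference episodes. The statistics format and the temperature parameter are illustrative assumptions, not the actual CogPrime inference-control code.

```python
import math
import random

def choose_pathway(stats, temperature=0.5, rng=random):
    """stats: {pathway_name: (successes, trials)}. Returns a pathway name,
    sampled with probability increasing in its estimated success rate."""
    names = list(stats)
    # Laplace-smoothed success-rate estimates serve as prior probabilities.
    rates = [(s + 1) / (t + 2) for s, t in (stats[n] for n in names)]
    weights = [math.exp(r / temperature) for r in rates]
    total = sum(weights)
    r = rng.random() * total
    for name, w in zip(names, weights):
        r -= w
        if r <= 0:
            return name
    return names[-1]
```

With a history in which analogy succeeded far more often than naive physics, the draw is biased toward the analogy pathway, while still occasionally exploring the alternative.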
• Chapter 37 (Pattern Mining)
- Greedy Pattern Mining.
• "Push a tall structure of blocks and it tends to fall down" is the sort of repetitive
pattern that could easily be extracted from a historical record of perceptions and
(the agent's and others') actions via a simple greedy pattern mining algorithm.
• If there is a block that is shaped like a baby's rattle, with a long slender handle
and then a circular shape at the end, then greedy pattern mining may be helpful
due to having recognized the pattern that structures like this are sometimes rattles
- and also that structures like this are often stuck together, with the handle part
connected sturdily to the circular part.
- Evolutionary Pattern Mining. "Push a tall structure of blocks with a wide base and
a gradual narrowing toward the top and it may not fall too badly" is a more complex
pattern that may not be found via greedy mining, unless the agent has dealt with a lot
of pyramids.
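Greedy pattern mining of the "push a tall structure and it falls" kind can be illustrated with a simple co-occurrence counter over a historical record of events. The triple representation and support threshold below are illustrative simplifications, not the actual pattern-mining component.

```python
from collections import Counter

def mine_patterns(events, min_support=2):
    """events: iterable of (action, property, outcome) triples.
    Returns patterns seen at least min_support times, most frequent first."""
    counts = Counter(events)
    return [(pattern, n) for pattern, n in counts.most_common()
            if n >= min_support]

history = [
    ("push", "tall", "falls"),
    ("push", "tall", "falls"),
    ("push", "wide-base", "stands"),
    ("push", "tall", "falls"),
]
patterns = mine_patterns(history)
# The repetitive "push a tall structure and it falls" pattern dominates.
```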
• Chapter 38 (Concept Formation)
- Formal Concept Analysis. Suppose there are many long, slender blocks of different
colors and different shapes (some cylindrical, some purely rectangular for example).
Learning this sort of concept based on common features is exactly what FCA is good
at (and when the features are defined fuzzily or probabilistically, it's exactly what
uncertain FCA is good at). Learning the property of "slender" itself is another example
of something uncertain FCA is good at - it would learn this if there were many concepts
that preferentially involved slender things (even though formed on the basis of concepts
other than slenderness).
- Conceptual Blending. The concept of a "car-man" or "man-car" is an obvious instance
of conceptual blending. The agent knows that building a man won't surprise the teacher,
and neither will building a car ... but both "man" and "car" may pop to the forefront of its
mind (i.e. get a briefly high STI) when it thinks about what to build. But since it knows
it has to do something new or surprising, there may be a cognitive schema that boosts
the amount of funds to the ConceptBlending MindAgent, causing it to be extra-active.
In any event, the ConceptBlending agent seeks to find ways to combine important
concepts; and then PLN explores these to see which ones may be able to achieve the
given goal of surprising the teacher (which includes subgoals such as actually being
buildable).
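The crisp core of Formal Concept Analysis used above is just a pair of derivation operators between object sets and attribute sets; replacing set membership with graded membership yields the uncertain variant. A minimal sketch, over an invented toy context of blocks:

```python
def extent(context, attributes):
    """Objects possessing every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}

def intent(context, objects):
    """Attributes shared by every object in `objects`."""
    attr_sets = [context[o] for o in objects]
    return set.intersection(*attr_sets) if attr_sets else set()

blocks = {
    "block1": {"long", "slender", "red", "cylindrical"},
    "block2": {"long", "slender", "blue", "rectangular"},
    "block3": {"short", "wide", "red", "rectangular"},
}
slender_things = extent(blocks, {"slender"})  # the concept's extent
shared = intent(blocks, slender_things)       # its intent
```

Here the concept of "slender things" emerges from common features across differently colored and shaped blocks, as in the example above.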
• Chapter 39 (Dimensional Embedding)
- Dimensional Embedding. When the agent needs to search its memory for a previ-
ously seen blocks structure similar to the currently observed one - or for a previously
articulated thought similar to the one it's currently trying to articulate - then it needs
to do a search through its large memory for "an entity similar to X" (where X is a
structure or a thought). This kind of search can be quite computationally difficult - but
if the entities in question have been projected into an embedding space, then it's quite
rapid. (The cost is shifted to the continual maintenance of the embedding space, and
its periodic updating; and there is some error incurred in the projection, but in many
cases this error is not a show-stopper.)
- Embedding-Based Inference Control. Rapid search for answers to similarity or
inheritance queries can be key for guiding inference in appropriate directions; for instance,
reasoning about how to build a structure with certain properties can benefit greatly
from rapid search for previously-encountered substructures currently structurally or
functionally similar to the substructures one desires to build.
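The speedup claimed for dimensional embedding can be seen in miniature: once entities are projected to vectors, "find an entity similar to X" becomes a nearest-neighbor query under (say) cosine similarity. The toy embeddings below are invented for illustration; a real system would learn them from pairwise similarity judgments and update them periodically.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar(query, embeddings):
    """embeddings: {name: vector}. Returns the name nearest to `query`."""
    return max(embeddings, key=lambda name: cosine(query, embeddings[name]))

structures = {
    "tower":   [0.9, 0.1, 0.0],
    "pyramid": [0.2, 0.8, 0.1],
    "car":     [0.1, 0.2, 0.9],
}
# A new tall, thin structure lands near "tower" in the embedding space.
```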
• Chapter 40 (Simulation and Episodic Memory)
- Fitness Estimation via Simulation. One way to estimate whether a certain blocks
structure is likely to fall down or not, is to build it in one's "mind's eye" and see if
the physics engine in one's mind's-eye causes it to fall down. This is something that in
many cases will work better for CogPrime than for humans, because CogPrime has a
more mathematically accurate physics engine than the human mind does; however, in
cases that rely heavily on naive physics rather than, say, direct applications of Newton's
Laws, then CogPrime's simulation engine may tmderperform the typical human mind.
- Concept Formation via Simulation. Objects may be joined into categories using
uncertain FCA, based on features that they are identified to have via "simulation exper-
iments" rather than physical world observations. For instance, it may be observed that
pyramid-shaped structures fall less easily than pencil-shaped tower structures - and
the concepts corresponding to these two categories may be formed - from experiments
run in the internal simulation world, perhaps inspired by isolated observations in the
physical world.
- Episodic Memory. Previous situations in which the agent has seen similar structures
built, or been given similar problems to solve, may be brought to mind as "episodic
movies" playing in the agent's memory. By watching what happens in these replayed
episodic movies, the agent may learn new declarative or procedural knowledge about
what to do. For example, maybe there was some situation in the agent's past where it
saw someone asked to do something surprising, and that someone created something
funny. This might (via a simple PLN step) bias the agent to create something now,
which it has reason to suspect will cause others to laugh.
• Chapter 41 (Integrative Procedure Learning)
- Concept-Driven Procedure Learning. Learning the concept of "horse", as discussed
above in the context of Chapter 30, is an example of this.
- Predicate Schematization. The synthesis of a schema for man-building, as discussed
above in the context of Chapter 25, is an example of this.
• Chapter 42 (Map Formation)
- Map Formation. The notion of a car involves many aspects: the physical appearance
of cars, the way people get in and out of cars, the ways cars drive, the noises they make,
etc. All these aspects are represented by Atoms that are part of the car map, and are
richly interconnected via HebbianLinks as well as other links.
- Map Encapsulation. The car map forms implicitly via the interaction of multiple
cognitive dynamics, especially ECAN. But then the MapEncapsulation MindAgent may
do its pattern mining and recognize this map explicitly, and form a PredicateNode
encapsulating it. This PredicateNode may then be used in PLN inference, conceptual
blending, and so forth (e.g. helping with the formation of a concept like car-man via
blending).
• Chapter 44 (Natural Language Comprehension)
- Experience-Based Disambiguation. The particular dialogue involved in the present
example doesn't require any nontrivial word sense disambiguation. But it does require
parse selection, and semantic interpretation selection:
• In "Build me something with blocks," the agent has no trouble understanding that
"blocks" means "toy building blocks" rather than, say, "city blocks", based on many
possible mechanisms, but most simply importance spreading.
• "Build me something with blocks" has at least three interpretations: the building
could be carried out using blocks with a tool; or the thing built could be presented
alongside blocks; or the thing built could be composed of blocks. The latter is the
most commonsensical interpretation for most humans, but that is because we have
heard the phrase "building with blocks" used in a similarly grounded way before
(as well as other similar phrases such as "playing with Legos", etc., whose meaning
helps militate toward the right interpretation via PLN inference and importance
spreading). So here we have a simple example of experience-based disambiguation,
where experiences at various distances of association from the current one are used
to help select the correct parse.
• A subtler form of semantic disambiguation is involved in interpreting the clause
"that I haven't seen before." A literal-minded interpretation would say that this
requirement is fulfilled by any blocks construction that's not precisely identical to
one the teacher has seen before. But of course, any sensible human knows this is an
idiomatic clause that means "significantly different from anything I've seen before."
This could be determined by the CogPrime agent if it has heard the idiomatic
clause before, or if it's heard a similar idiomatic phrase such as "something I've
never done before." Or, even if the agent has never heard such an idiom before, it
could potentially figure out the intended meaning simply because the literal-minded
interpretation would be a pointless thing for the teacher to say. So if it knows the
teacher usually doesn't add useless modificatory clauses onto their statements, then
potentially the agent could guess the correct meaning of the phrase.
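The experience-based interpretation selection sketched in these examples can be caricatured as evidence counting: each candidate reading is scored by how often similar grounded usages (e.g. "building with blocks", "playing with Legos") have been observed. The interpretation labels and counts below are illustrative assumptions, not the actual comprehension pipeline.

```python
def select_interpretation(candidates, experience_counts, prior=1):
    """candidates: list of interpretation labels.
    experience_counts: {label: times a similar grounded usage was seen}.
    Returns the label with the most supporting experience (Laplace prior)."""
    return max(candidates,
               key=lambda c: experience_counts.get(c, 0) + prior)

# Readings of "build me something WITH blocks":
candidates = ["instrument", "accompaniment", "material"]
# Remembered groundings of "building with blocks", "playing with Legos", ...
experience = {"material": 17, "instrument": 1}
chosen = select_interpretation(candidates, experience)
```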
• Chapter 46 (Language Generation)
- Experience-Based Knowledge Selection for Language Generation. When the
teacher asks "What is it?", the agent must decide what sort of answer to give. Within
the confines of the QuestionAnswering DialogueController, the agent could answer "A
structure of blocks", or "A part of the physical world", or "A thing", or "Mine." (Or, if it
were running another DC, it could answer more broadly, e.g. "None of your business,"
etc.). However, the QA DC tells it that, in the present context, the most likely desired
answer is one that the teacher doesn't already know; and the most important property
of the structure that the teacher doesn't obviously already know is the fact that it
depicts a "car man." Also, memory of prior conversations may bring up statements like
"It's a horse" in reference to a horse built of blocks, or a drawing of a horse, etc.
- Experience-Based Guidance of Word and Syntax Choice. The choice of phrase
"car man" requires some choices to be made. The agent could just as well say "It's a
man with a car for feet" or "It's a car with a human upper body and head" or "It's
a car centaur," etc. A bias toward simple expressions would lead to "car man." If the
teacher were known to prefer complex expressions, then the agent might be biased
toward expressing the idea in a different way.
• Chapter 48 (Natural Language Dialogue)
- Adaptation of Dialogue Controllers. The QuestionAsking and QuestionAnswer-
ing DialogueControllers both get reinforcement from this interaction, for the specific
internal rules that led to the given statements being made.
50.3 A Semi-Narrative Treatment
Now we describe how a CogPrime system might carry out the specified task in a semi-narrative
form, weaving in the material from the previous section as we go along, and making some more
basic points as well. The semi-narrative covers most but not all of the bullet points from the
previous section, but with some of the technical details removed; and it introduces a handful
of new examples not given in the bullet points.
The reason this is called a semi-narrative rather than a narrative is that there is no particular
linear order to the processes occurring in each phase of the situation described here. CogPrime's
internal cognitive processes do not occur in a linear narrative; rather, what we have is a complex
network of interlocking events. But still, describing some of these events concretely in a manner
correlated with the different stages of a simple interaction, may have some expository value.
The human teacher tells the CogPrime agent "Build me something with blocks
that I haven't seen before."
Upon hearing this, the agent's cognitive cycles are dominated by language processing and
retrieval from episodic and sensory memory.
The agent may decide to revive from disk the mind-states it went through when building
human-pleasing structures from blocks before, so as to provide it with guidance.
It will likely experience the emotion of happiness, because it anticipates the pleasure of
getting rewarded for the task in future.
The ubergoal of pleasing the teacher gets active (gets funded significantly with STI currency),
as it becomes apparent there are fairly clear ways of fulfilling that goal (via the subgoal S of
building blocks structures that will get positive response from the teacher). Other ubergoals
like gaining knowledge are not funded as much with STI currency just now, as they are not
immediately relevant.
Action selection, based on ImplicationLinks derived via PLN (between various possible activ-
ities and the subgoal S) causes it to start experimentally building some blocks structures. Past
experience with building (turned into ImplicationLinks via mining the SystemActivityTable)
tells it that it may want to build a little bit in its internal simulation world before building in
the external world, causing STI currency to flow to the simulation MindAgent.
The Atom corresponding to the context blocks-building gets high STI and is pushed into the
AttentionalFocus, making it likely that many future inferences will occur in this context. Other
Atoms related to this one also get high STI (the ones in the blocks-building map, and others
that are especially related to blocks-building in this particular context).
After a few false starts, the agent builds something it thinks is appropriate and
says "Do you like it?"
Now that the agent has decided what to do to fulfill its well-funded goal, its cognitive cycles
are dominated by action, perception and related memory access and concept creation.
An obvious subgoal is spawned: build a new structure now, and make this particular structure
under construction appealing and novel to the teacher. This subgoal has a shorter time scale
than the high level goal. The subgoal gets some currency from its supergoal using the mechanism
of RFS spreading.
Action selection must tell it when to continue building the same structure and when to try
a new one, as well as more micro level choices.
Atoms related to the currently pursued blocks structure get high STI.
After a failed structure (a "false start") is disassembled, the corresponding Atoms lose STI
dramatically (leaving AF) but may still have significant LTI, so they can be recalled later as
appropriate. They may also have VLTI so they will be saved to disk later on if other things
push them out of RAM due to getting higher LTI.
Meanwhile everything that's experienced from the external world goes into the Experi-
enceDB.
Atoms representing different parts or aspects of the same blocks structure will get Hebbian
links between them, which will guide future reasoning and importance spreading.
Importance spreading helps the system go from an idea for something to build (say, a rock
or a car) to the specific plans and ideas about how to build it, via increasing the STI of the
Atoms that will be involved in these plans and ideas.
If something apparently good is done in building a blocks structure, then other processes and
actions that helped lead to or support that good thing, get passed some STI from the Atoms
representing the good thing, and also may get linked to the Goal Atom representing "good" in
this context. This leads to reinforcement learning.
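The STI dynamics running through this narrative, with currency flowing from important Atoms to their Hebbian associates, can be sketched as a simple conserved-flow update. The spread fraction and the toy car map are illustrative assumptions, not the actual ECAN update equations.

```python
def spread_sti(sti, hebbian, fraction=0.2):
    """sti: {atom: short-term importance}.
    hebbian: {atom: [(neighbor, link_weight), ...]}.
    Each atom gives away `fraction` of its STI, split by link weight,
    so total STI currency is conserved."""
    new_sti = dict(sti)
    for atom, links in hebbian.items():
        total_w = sum(w for _, w in links)
        if not total_w or sti.get(atom, 0) <= 0:
            continue
        outflow = sti[atom] * fraction
        new_sti[atom] -= outflow
        for neighbor, w in links:
            new_sti[neighbor] = new_sti.get(neighbor, 0) + outflow * w / total_w
    return new_sti

sti = {"car": 100.0, "wheel": 0.0, "chassis": 0.0}
links = {"car": [("wheel", 3.0), ("chassis", 1.0)]}
after = spread_sti(sti, links)
# STI flows from the active "car" concept to its map members.
```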
The agent may play with building structures and then seeing what they most look like, thus
exercising abstract object recognition (that uses procedures learned by MOSES or hillclimbing,
or uncertain relations learned by inference, to guess what object category a given observed
collection of percepts most likely falls into).
Since the agent has been asked to come up with something surprising, it knows it should
probably try to formulate some new concepts - because it has learned in the past, via Sys-
temActivityTable mining, that often newly formed concepts are surprising to others. So, more
STI currency is given to concept formation MindAgents, such as the ConceptualBlending
MindAgent (which, along with a lot of stuff that gets thrown out or stored for later use, comes up
with "car-man").
When the notion of "car" is brought to mind, the distributed map of nodes corresponding to
"car" get high STI. When car-man is formed, it is reasoned about (producing new Atoms), but
it also serves as a nexus of importance-spreading, causing the creation of a distributed car-man
map.
If the goal of making an arm for a man-car occurs, then goal-driven schema learning may
be done to learn a procedure for arm-making (where the actual learning is done by MOSES or
hillclimbing).
If the agent is building a man-car, it may have man-building and car-building schema in its
ActiveSchemaPool at the same time, and SchemaActivation may spread back and forth between
the different modules of these two schema.
If the agent wants to build a horse, but has never seen a horse made of blocks (only various
pictures and movies of horses), it may use MOSES or hillclimbing internally to solve the
problem of creating a horse-recognizer or a horse-generator which embodies appropriate abstract
properties of horses. Here as in all cases of procedure learning, a complexity penalty rewards
simpler programs, from among all programs that approximately fulfill the goals of the learning
process.
If a procedure being executed has some abstract parts, then these may be executed by
inferential procedure evaluation (which makes the abstract parts concrete on the fly in the
course of execution).
To guess the fitness of a procedure for doing something (say, building an arm or recognizing
a horse), inference or simulation may be used, as well as direct evaluation in the world.
Deductive, inductive and abductive PLN inference may be used in figuring out what a blocks
structure will look or act like before building it (it's tall and thin so it may fall down; it
won't be bilaterally symmetric so it won't look much like a person; etc.)
Backward-chaining inference control will help figure out how to assemble something matching
a certain specification e.g. how to build a chassis based on knowledge of what a chassis looks
like. Forward chaining inference (critically including intensional relationships) will be used to
estimate the properties that the teacher will perceive a given specific structure to have. Spatial
and temporal algebra will be used extensively in this reasoning, within the PLN framework.
Coordinating different parts of the body - say an arm and a hand - will involve importance
spreading (both up and down) within the hierarchical action network, and from this network
to the hierarchical perception network and the heterarchical cognitive network.
In looking up Atoms in the AtomSpace, some have truth values whose confidences have
decayed significantly (e.g. those regarding the teacher's tastes), whereas others have confidences
that have hardly decayed at all (e.g. those regarding general physical properties of blocks).
Finding previous blocks structures similar to the current one (useful for guiding building by
analogy to past experience) may be done rapidly by searching the system's internal dimensional-
embedding space.
As the building process occurs, patterns mined via past experience (tall things often fall
down) are used within various cognitive processes (reasoning, procedure learning, concept cre-
ation, etc.); and new pattern mining also occurs based on the new observations made as different
structures are built and experimented with and destroyed.
Simulation of teacher reactions, based on inference from prior examples, helps with the
evaluation of possible structures, and also of procedures for creating structures.
As the agent does all this, it experiences the emotion of curiosity (likely among other emo-
tions), because as it builds each new structure it has questions about what it will look like and
how the teacher would react to it.
The human teacher says "It's beautiful. What is it?" The agent says "It's a car
man"
Now that the building is done and the teacher says something, the agent's cognitive cycles
are dominated by language understanding and generation. The Atom representing the context
of talking to the teacher gets high STI, and is used as the context for many ensuing inferences.
Comprehension of "it" uses anaphor resolution, based on a combination of ECAN and PLN
inference applied to previously interpreted language and observation of the external world
situation.
The agent experiences the emotion of happiness because the teacher has called its creation
beautiful, which it recognizes as a positive evaluation - so the agent knows one of its ubergoals
("please the teacher") has been significantly fulfilled.
The goal of pleasing the teacher causes the system to want to answer the question. So
the QuestionAnswering DialogueController schema gets paid a lot and gets put into the Ac-
tiveSchemaPool. In reaction to the question asked, this DC chooses a semantic graph to speak,
then invokes NL generation to say it.
NL generation chooses the most compact expression that seems to adequately convey the
intended meaning, so it decides on "car man" as the best simple verbalization to match the
newly created conceptual blend that it thinks effectively describes the newly created blocks
structure.
The positive feedback from the user leads to reinforcement of the Atoms and processes that
led to the construction of the blocks structure that has been judged beautiful (via importance
spreading and SystemActivityTable mining).
50.4 Conclusion
The simple situation considered in this chapter is complex enough to involve nearly all the
different cognitive processes in the CogPrime system - and many interactions between these
processes. This fact illustrates one of the main difficulties of designing, building and testing an
artificial mind like CogPrime - until nearly all of the system is built and made to operate in
an integrated way, it's hard to do any meaningful test of the system. Testing PLN or MOSES
or conceptual blending in isolation may be interesting computer science, but it doesn't tell you
much about CogPrime as a design for a thinking machine.
According to the CogPrime approach, getting a simple child-like interaction like "build me
something with blocks that I haven't seen before" to work properly requires a holistic, integrated
cognitive system. Once one has built a system capable of this sort of simple interaction then,
according to the theory underlying CogPrime, one is not that far from a system with adult
human-level intelligence. And once one has an adult human-level AGI built according to a
highly flexible design like CogPrime, given the potential of such systems to self-analyze and
self-modify, one is not far off from a dramatically powerful Genius Machine. Of course there
will be a lot of work to do to get from a child-level system to an adult-level system - it won't
necessarily unfold as "automatically" as seems to happen with a human child, because CogPrime
lacks the suite of developmental processes and mechanisms that the young human brain has.
But still, a child CogPrime mind capable of doing the things outlined in this chapter will have
all the basic components and interactions in place, all the ones that are needed for a much more
advanced artificial mind.
Of course, one could concoct a narrow-AI system carrying out the specific activities described
in this chapter, much more simply than one could build a CogPrime system capable of doing
these activities. But that's not the point — the point of this chapter is not to explain how to
achieve some particular narrow set of activities "by any means necessary", but rather to explain
how these activities might be achieved within the CogPrime framework, which has been designed
with much more generality in mind.
It would be worthwhile to elaborate a number of other situations similar to the one described
in this chapter, and to work through the various cognitive processes and structures in CogPrime
carefully in the context of each of these situations. In fact this sort of exercise has frequently
been carried out informally in the context of developing CogPrime. But this book is already
long enough, so we will end here, and leave the rest for future works - emphasizing that it is
via intimate interplay between concrete considerations like the ones presented in this chapter,
and general algorithmic and conceptual considerations as presented in most of the chapters of
this book, that we have the greatest hope of creating advanced AGI. The value of this sort of
interplay actually follows from the theory of real-world general intelligence presented in Part
1 of the book. Thoroughly general intelligence is only possible given unrealistic computational
resources, so real-world general intelligence is about achieving high generality given limited
resources relative to the specific classes of environments relevant to a given agent. Specific
situations like building surprising things with blocks are particularly important insofar as they
embody broader information about the classes of environments relevant to broadly human-like
general intelligence.
No doubt, once a CogPrime system is completed, the specifics of its handling of the situation
described here will differ somewhat from the treatment presented in this chapter. Furthermore,
the final CogPrime system may differ algorithmically and structurally in some respects from
the specifics given in this book - it would be surprising if the process of building, testing and
interacting with CogPrime didn't teach us some new things about various of the topics covered.
But our conjecture is that, if sufficient effort is deployed appropriately, then a system much like
the CogPrime system described in this book will be able to handle the situation described in
this chapter in a roughly similar manner to the one described in this chapter - and that this
will serve as a natural precursor to much more dramatic AGI achievements.
Appendix A
Glossary
A.1 List of Specialized Acronyms
This includes acronyms that are commonly used in discussing CogPrime, OpenCog and related
ideas, plus some that occur here and there in the text for relatively ephemeral reasons.
• AA: Attention Allocation
• ADF: Automatically Defined Function (in the context of Genetic Programming)
• AF: Attentional Focus
• AGI: Artificial General Intelligence
• AV: Attention Value
• BD: Behavior Description
• C-space: Configuration Space
• CBV: Coherent Blended Volition
• CEV: Coherent Extrapolated Volition
• CGGP: Contextually Guided Greedy Parsing
• CSDLN: Compositional Spatiotemporal Deep Learning Network
• CT: Combo Tree
• ECAN: Economic Attention Network
• ECP: Embodied Communication Prior
• EPW: Experiential Possible Worlds (semantics)
• FCA: Formal Concept Analysis
• FI: Fisher Information
• FIM: Frequent Itemset Mining
• FOI: First Order Inference
• FOPL: First Order Predicate Logic
• FOPLN: First Order PLN
• FS-MOSES: Feature Selection MOSES (i.e. MOSES with feature selection integrated a la
LIFES)
• GA: Genetic Algorithms
• GB: Global Brain
• GEOP: Goal Evaluator Operating Procedure (in a GOLEM context)
• GIS: Geospatial Information System
• GOLEM: Goal-Oriented LEarning Meta-architecture
• GP: Genetic Programming
• HOI: Higher-Order Inference
• HOPLN: Higher-Order PLN
• HR: Historical Repository (in a GOLEM context)
• HTM: Hierarchical Temporal Memory
• IA: (Allen) Interval Algebra (an algebra of temporal intervals)
• IRC: Imitation / Reinforcement Correction (Learning)
• LIFES: Learning-Integrated Feature Selection
• LTI: Long Term Importance
• MA: MindAgent
• MOSES: Meta-Optimizing Semantic Evolutionary Search
• MSH: Mirror System Hypothesis
• NARS: Non-Axiomatic Reasoning System
• NLGen: A specific software component within OpenCog, which provides one way of dealing
with Natural Language Generation
• OCP: OpenCogPrime
• OP: Operating Program (in a GOLEM context)
• PEPL: Probabilistic Evolutionary Procedure Learning (e.g. MOSES)
• PLN: Probabilistic Logic Networks
• RCC: Region Connection Calculus
• RelEx: A specific software component within OpenCog, which provides one way of dealing
with natural language Relationship Extraction
• SAT: Boolean SATisfaction, as a mathematical / computational problem
• SMEPH: Self-Modifying Evolving Probabilistic Hypergraph
• SRAM: Simple Realistic Agents Model
• STI: Short Term Importance
• STV: Simple Truth Value
• TV: Truth Value
• VLTI: Very Long Term Importance
• WSPS: Whole-Sentence Purely-Syntactic Parsing
A.2 Glossary of Specialized Terms
• Abduction: A general form of inference that goes from data describing something to a
hypothesis that accounts for the data. Often in an OpenCog context, this refers to the PLN
abduction rule, a specific First-Order PLN rule (if A implies C, and B implies C, then
perhaps A implies B), which embodies a simple form of abductive inference. But OpenCog may
also carry out abduction, as a general process, in other ways.
• Action Selection: The process via which the OpenCog system chooses which Schema to
enact, based on its current goals and context.
• Active Schema Pool: The set of Schema currently in the midst of Schema Execution.
• Adaptive Inference Control: Algorithms or heuristics for guiding PLN inference, that
cause inference to be guided differently based on the context in which the inference is taking
place, or based on aspects of the inference that are noted as it proceeds.
• AGI Preschool: A virtual world or robotic scenario roughly similar to the environment
within a typical human preschool, intended for AGIs to learn in via interacting with the
environment and with other intelligent agents.
• Atom: The basic entity used in OpenCog as an element for building representations. Some
Atoms directly represent patterns in the world or mind, others are components of represen-
tations. There are two kinds of Atoms: Nodes and Links.
• Atom, Frozen: See Atom, Saved
• Atom, Realized: An Atom that exists in RAM at a certain point in time.
• Atom, Saved: An Atom that has been saved to disk or other similar media, and is not
actively being processed.
• Atom, Serialized: An Atom that is serialized for transmission from one software process
to another, or for saving to disk, etc.
• Atom2Link: A part of OpenCogPrime's language generation system, that transforms
appropriate Atoms into words connected via link parser link types.
• Atomspace: A collection of Atoms, comprising the central part of the memory of an
OpenCog instance.
• Attention: The aspect of an intelligent system's dynamics focused on guiding which aspects
of an OpenCog system's memory and functionality get more computational resources at a
certain point in time.
• Attention Allocation: The cognitive process concerned with managing the parameters
and relationships guiding what the system pays attention to, at what points in time. This
is a term inclusive of Importance Updating and Hebbian Learning.
• Attentional Currency: Short Term Importance and Long Term Importance values are
implemented in terms of two different types of artificial money, STICurrency and LTICur-
rency. Theoretically these may be converted to one another.
• Attentional Focus: The Atoms in an OpenCog Atomspace whose ShortTermImportance
values lie above a critical threshold (the AttentionalFocus Boundary). The Attention Allo-
cation subsystem treats these Atoms differently. Qualitatively, these Atoms constitute the
system's main focus of attention during a certain interval of time, i.e. it's a moving bubble
of attention.
• Attentional Memory: A system's memory of what it's useful to pay attention to, in what
contexts. In CogPrime this is managed by the attention allocation subsystem.
• Backward Chainer: A piece of software, wrapped in a MindAgent, that carries out back-
ward chaining inference using PLN.
• CIM-Dynamic: Concretely-Implemented Mind Dynamic, a term for a cognitive process
that is implemented explicitly in OpenCog (as opposed to allowed to emerge implicitly from
other dynamics). Sometimes a CIM-Dynamic will be implemented via a single MindAgent,
sometimes via a set of multiple interrelated MindAgents, occasionally by other means.
• Cognition: In an OpenCog context, this is an imprecise term. Sometimes this term means
any process closely related to intelligence; but more often it's used specifically to refer to
more abstract reasoning/learning/etc, as distinct from lower-level perception and action.
• Cognitive Architecture: This refers to the logical division of an AI system like OpenCog
into interacting parts and processes representing different conceptual aspects of intelligence.
It's different from the software architecture, though of course certain cognitive architectures
and certain software architectures fit more naturally together.
• Cognitive Cycle: The basic "loop" of operations that an OpenCog system, used to control
an agent interacting with a world, goes through rapidly each "subjective moment." Typically
a cognitive cycle should be completed in a second or less. It minimally involves perceiving
data from the world, storing data in memory, and deciding what if any new actions need
to be taken based on the data perceived. It may also involve other processes like deliber-
ative thinking or metacognition. Not all OpenCog processing needs to take place within a
cognitive cycle.
• Cognitive Schematic: An implication of the form "Context AND Procedure IMPLIES
goal". Learning and utilization of these is key to CogPrime's cognitive process.
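The cognitive schematic can be made concrete with a small sketch. The schematic structure (Context, Procedure, Goal, plus a strength) is from the text; the class and function names, the string-valued predicates, and the selection-by-maximum-strength heuristic are illustrative simplifications of CogPrime's actual action selection, not its implementation:

```python
from dataclasses import dataclass

@dataclass
class CognitiveSchematic:
    """Context AND Procedure IMPLIES Goal, with a strength estimate."""
    context: str      # predicate describing the situation
    procedure: str    # schema the agent could execute
    goal: str         # goal the procedure is expected to achieve
    strength: float   # estimated P(goal | context, procedure)

def select_procedure(schematics, current_context, goal):
    """Toy action selection: pick the procedure whose schematic best
    predicts the goal in the current context."""
    matching = [s for s in schematics
                if s.context == current_context and s.goal == goal]
    if not matching:
        return None
    return max(matching, key=lambda s: s.strength).procedure

schematics = [
    CognitiveSchematic("owner_says_fetch", "run_to_stick", "please_owner", 0.7),
    CognitiveSchematic("owner_says_fetch", "sit_still", "please_owner", 0.2),
]
print(select_procedure(schematics, "owner_says_fetch", "please_owner"))
# run_to_stick
```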
• Cognitive Synergy: The phenomenon by which different cognitive processes, controlling a
single agent, work together in such a way as to help each other be more intelligent. Typically,
if one has cognitive processes that are individually susceptible to combinatorial explosions,
cognitive synergy involves coupling them together in such a way that they can help one
another overcome each other's internal combinatorial explosions. The CogPrime design is
reliant on the hypothesis that its key learning algorithms will display dramatic cognitive
synergy when utilized for agent control in appropriate environments.
• CogPrime : The name for the AGI design presented in this book, which is designed specifi-
cally for implementation within the OpenCog software framework (and this implementation
is OpenCogPrime).
• CogServer: A piece of software, within OpenCog, that wraps up an Atomspace and a
number of MindAgents, along with other mechanisms like a Scheduler for controlling the
activity of the MindAgents, and code for importing and exporting data from the Atomspace.
• Cognitive Equation: The principle, identified in Ben Goertzel's 1994 book "Chaotic
Logic", that minds are collections of pattern-recognition elements, that work by iteratively
recognizing patterns in each other and then embodying these patterns as new system ele-
ments. This is seen as distinguishing mind from "self-organization" in general, as the latter
is not so focused on continual pattern recognition. Colloquially this means that "a mind is
a system continually creating itself via recognizing patterns in itself."
• Combo: The programming language used internally by MOSES to represent the programs
it evolves. SchemaNodes may refer to Combo programs, whether the latter are learned via
MOSES or via some other means. The textual realization of Combo resembles LISP with
less syntactic sugar. Internally a Combo program is represented as a program tree.
• Composer: In the PLN design, a rule is denoted a composer if it needs premises for
generating its consequent. See generator.
• CogBuntu: an Ubuntu Linux remix that contains all required packages and tools to test
and develop OpenCog.
• Concept Creation: A general term for cognitive processes that create new ConceptNodes,
PredicateNodes or concept maps representing new concepts.
• Conceptual Blending: A process of creating new concepts via judiciously combining
pieces of old concepts. This may occur in OpenCog in many ways, among them the explicit
use of a ConceptBlending MindAgent, that blends two or more ConceptNodes into a new
one.
• Confidence: A component of an OpenCog/PLN TruthValue, which is a scaling into the
interval [0,1] of the weight of evidence associated with a truth value. In the simplest case
(of a probabilistic Simple Truth Value), one uses confidence c = n / (n+k), where n is
the weight of evidence and k is a parameter. In the case of an Indefinite Truth Value, the
confidence is associated with the width of the probability interval.
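The simple-truth-value case above is easy to sketch directly; the formula c = n / (n+k) is from the entry, while the particular value k = 800 used here is purely illustrative, not a fixed OpenCog constant:

```python
def confidence(n, k=800.0):
    """Simple Truth Value confidence c = n / (n + k), where n is the
    weight of evidence (e.g. an observation count) and k is a
    personality parameter (k = 800 is an illustrative value)."""
    return n / (n + k)

# Confidence rises monotonically toward 1 as evidence accumulates,
# and equals 0.5 when n == k.
print(confidence(10))     # low evidence -> low confidence
print(confidence(10000))  # much evidence -> confidence near 1
```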
• Confidence Decay: The process by which the confidence of an Atom decreases over time,
as the observations on which the Atom's truth value is based become increasingly obsolete.
This may be carried out by a special MindAgent. The rate of confidence decay is subtle and
contextually determined, and must be estimated via inference rather than simply assumed
a priori.
• Consciousness: CogPrime is not predicated on any particular conceptual theory of con-
sciousness. Informally, the AttentionalFocus is sometimes referred to as the "conscious"
mind of a CogPrime system, with the rest of the Atomspace as "unconscious" but this is
just an informal usage, not intended to tie the CogPrime design to any particular theory of
consciousness. The primary originator of the CogPrime design (Ben Goertzel) tends toward
panpsychism, as it happens.
• Context: In addition to its general common-sensical meaning, in CogPrime the term Con-
text also refers to an Atom that is used as the first argument of a ContextLink. The second
argument of the ContextLink then contains Links or Nodes, with TruthValues calculated
restricted to the context defined by the first argument. For instance, (ContextLink USA
(InheritanceLink person obese)).
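One way to read the example is that the strength of the contextualized InheritanceLink is a conditional frequency computed only over entities in the context. The following is a frequentist toy illustration of that reading, not the actual PLN contextualization formula; the data and representation (sets of property labels) are invented for the example:

```python
def contextual_strength(entities, context, a, b):
    """Toy strength for (ContextLink context (InheritanceLink a b)):
    P(b | a), counted only over entities satisfying the context.
    Each entity is modeled as a set of property labels."""
    in_ctx = [e for e in entities if context in e]
    a_in_ctx = [e for e in in_ctx if a in e]
    if not a_in_ctx:
        return 0.0
    return sum(1 for e in a_in_ctx if b in e) / len(a_in_ctx)

people = [
    {"USA", "person", "obese"},
    {"USA", "person"},
    {"France", "person"},
    {"France", "person"},
]
print(contextual_strength(people, "USA", "person", "obese"))  # 0.5
```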
• Core: The MindOS portion of OpenCog, comprising the Atomspace, the CogServer, and
other associated "infrastructural" code.
• Corrective Learning: When an agent learns how to do something, by having another
agent explicitly guide it in doing the thing. For instance, teaching a dog to sit by pushing
its butt to the ground.
• CSDLN: (Compositional Spatiotemporal Deep Learning Network): A hierarchical pattern
recognition network, in which each layer corresponds to a certain spatiotemporal granularity,
the nodes on a given layer correspond to spatiotemporal regions of a given size, and the
children of a node correspond to sub-regions of the region the parent corresponds to. Jeff
Hawkins's HTM is one example CSDLN, and Itamar Arel's DeSTIN (currently used in
OpenCog) is another.
• Declarative Knowledge: Semantic knowledge as would be expressed in propositional or
predicate logic facts or beliefs.
• Deduction: In general, this refers to the derivation of conclusions from premises using
logical rules. In PLN in particular, this often refers to the exercise of a specific inference
rule, the PLN Deduction rule (A → B, B → C, therefore A → C).
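As a worked illustration, the strength component of the PLN deduction rule under the usual independence assumption can be sketched as below. This is a sketch of one commonly cited form of the formula; the actual rule also produces a confidence value and enforces consistency conditions that are omitted here:

```python
def deduction_strength(sAB, sBC, sB, sC):
    """Strength of A -> C from A -> B and B -> C under the PLN
    independence assumption:
        sAC = sAB*sBC + (1 - sAB)*(sC - sB*sBC) / (1 - sB)
    sB and sC are the node strengths of B and C."""
    if sB >= 1.0:  # degenerate case: B covers everything
        return sC
    return sAB * sBC + (1.0 - sAB) * (sC - sB * sBC) / (1.0 - sB)

# When A -> B holds with strength 1, A -> C inherits the strength of B -> C.
print(deduction_strength(1.0, 0.8, 0.5, 0.6))  # 0.8
```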
• Deep Learning: Learning in a network of elements with multiple layers, involving feedfor-
ward and feedback dynamics, and adaptation of the links between the elements. An example
deep learning algorithm is DeSTIN, which is being integrated with OpenCog for perception
processing.
• Defrosting: Restoring, into the RAM portion of an Atomspace, an Atom (or set thereof)
previously saved to disk.
• Demand: In CogPrime's OpenPsi subsystem, this term is used in a manner inherited from
the Psi model of motivated action. A Demand in this context is a quantity whose value the
system is motivated to adjust. Typically the system wants to keep the Demand between
certain minimum and maximum values. An Urge develops when a Demand deviates from
its target range.
• Deme: In MOSES, an "island" of candidate programs, closely clustered together in program
space, being evolved in an attempt to optimize a certain fitness function. The idea is that
within a deme, programs are generally similar enough that reasonable syntax-semantics
correlation obtains.
• Derived Hypergraph: The SMEPH hypergraph obtained via modeling a system in terms
of a hypergraph representing its internal states and their relationships. For instance, a
SMEPH vertex represents a collection of internal states that habitually occur in relation to
similar external situations. A SMEPH edge represents a relationship between two SMEPH
vertices (e.g. a similarity or inheritance relationship). The terminology "edge/vertex" is
used in this context to distinguish from the "link/node" terminology used in the context
of the Atomspace.
• DeSTIN — Deep SpatioTemporal Inference Network: A specific CSDLN created by
Itamar Arel, tested on visual perception, and appropriate for integration within CogPrime.
• Dialogue: Linguistic interaction between two or more parties. In a CogPrime context, this
may be in English or another natural language, or it may be in Lojban or Psynese.
• Dialogue Control: The process of determining what to say at each juncture in a dialogue.
This is distinguished from the linguistic aspects of dialogue, language comprehension and
language generation. Dialogue control applies to Psynese or Lojban, as well as to human
natural language.
• Dimensional Embedding: The process of embedding entities from some non-dimensional
space (e.g. the Atomspace) into an n-dimensional Euclidean space. This can be useful in an
AI context because some sorts of queries (e.g. "find everything similar to X", "find a path
between X and Y") are much faster to carry out among points in a Euclidean space, than
among entities in a space with less geometric structure.
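The speedup comes from similarity queries reducing to nearest-neighbor search over points. A minimal sketch, assuming a hypothetical 2-D embedding of a few Atoms (real embeddings would have more dimensions, be learned from the Atomspace, and use a spatial index such as a k-d tree rather than brute force):

```python
import math

# Hypothetical embedding: Atom handle -> point in R^2.
embedding = {
    "cat": (0.9, 0.1),
    "dog": (0.85, 0.2),
    "car": (0.1, 0.95),
}

def most_similar(atom):
    """'Find everything similar to X' as a nearest-neighbor query
    in Euclidean space (brute force, for illustration)."""
    x = embedding[atom]
    candidates = [a for a in embedding if a != atom]
    return min(candidates, key=lambda a: math.dist(x, embedding[a]))

print(most_similar("cat"))  # dog
```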
• Distributed Atomspace: An implementation of an Atomspace that spans multiple com-
putational processes; generally this is done to enable spreading an Atomspace across mul-
tiple machines.
• Dual Network: A network of mental or informational entities with both a hierarchical
structure and a heterarchical structure, and an alignment among the two structures so that
each one helps with the maintenance of the other. This is hypothesized to be a critical
emergent structure, that must emerge in a mind (e.g. in an Atomspace) in order for it to
achieve a reasonable level of human-like general intelligence (and possibly to achieve a high
level of pragmatic general intelligence in any physical environment).
• Efficient Pragmatic General Intelligence: A formal, mathematical definition of general
intelligence (extending the pragmatic general intelligence), that ultimately boils down to:
the ability to achieve complex goals in complex environments using limited computational
resources (where there is a specifically given weighting function determining which goals
and environments have highest priority). More specifically, the definition weighted-sums the
system's normalized goal-achieving ability over (goal, environment) pairs, where the
weights are given by some assumed measure over (goal, environment) pairs, and where the
normalization is done via dividing by the (space and time) computational resources used
for achieving the goal.
• Elegant Normal Form (ENF): Used in MOSES, this is a way of putting programs in
a normal form while retaining their hierarchical structure. This is critical if one wishes
to probabilistically model the structure of a collection of programs, which is a meaningful
operation if the collection of programs is operating within a region of program space where
syntax-semantics correlation holds to a reasonable degree. The Reduct library is used to
place programs into ENF.
• Embodied Communication Prior: The class of prior distributions over (goal, environment)
pairs that are imposed by placing an intelligent system in an environment where
most of its tasks involve controlling a spatially localized body in a complex world, and in-
teracting with other intelligent spatially localized bodies. It is hypothesized that many key
aspects of human-like intelligence (e.g. the use of different subsystems for different memory
types, and cognitive synergy between the dynamics associated with these subsystems) are
consequences of this prior assumption. This is related to the Mind-World Correspondence
Principle.
• Embodiment: Colloquially, in an OpenCog context, this usually means the use of an AI
software system to control a spatially localized body in a complex (usually 3D) world. There
are also possible "borderline cases" of embodiment, such as a search agent on the Internet.
In a sense any AI is embodied, because it occupies some physical system (e.g. computer
hardware) and has some way of interfacing with the outside world.
• Emergence: A property or pattern in a system is emergent if it arises via the combination
of other system components or aspects, in such a way that its details would be very difficult
(not necessarily impossible in principle) to predict from these other system components or
aspects.
• Emotion: Emotions are system-wide responses to the system's current and predicted state.
Dorner's Psi theory of emotion contains explanations of many human emotions in terms
of underlying dynamics and motivations, and most of these explanations make sense in a
CogPrime context, due to CogPrime's use of OpenPsi (modeled on Psi) for motivation and
action selection.
• Episodic Knowledge: Knowledge about episodes in an agent's life-history, or the life-
history of other agents. CogPrime includes a special dimensional embedding space only for
episodic knowledge, easing organization and recall.
• Evolutionary Learning: Learning that proceeds via the rough process of iterated differen-
tial reproduction based on fitness, incorporating variations of reproduced entities. MOSES
is an explicitly evolutionary-learning-based portion of CogPrime; but CogPrime's dynamics
as a whole may also be conceived as evolutionary.
• Exemplar: (in the context of imitation learning) - When the owner wants to teach an
OpenCog-controlled agent a behavior by imitation, he/she gives the pet an exemplar. To
teach a virtual pet "fetch", for instance, the owner throws a stick, runs to it, grabs
it with his/her mouth, and comes back to the initial position.
• Exemplar: (in the context of MOSES) - Candidate chosen as the core of a new deme, or
as the central program within a deme, to be varied by representation building for ongoing
exploration of program space.
• Explicit Knowledge Representation: Knowledge representation in which individual,
easily humanly identifiable pieces of knowledge correspond to individual elements in a knowl-
edge store (elements that are explicitly there in the software and accessible via very rapid,
deterministic operations)
• Extension: In PLN, the extension of a node refers to the instances of the category that
the node represents. In contrast is the intension.
• Fishgram (Frequent and Interesting Sub-hypergraph Mining): A pattern mining
algorithm for identifying frequent and/or interesting sub-hypergraphs in the Atomspace.
• First-Order Inference (FOI): The subset of PLN that handles Logical Links not in-
volving VariableAtoms or higher-order functions. The other aspect of PLN, Higher-Order
Inference, uses Truth Value formulas derived from First-Order Inference.
• Forgetting: The process of removing Atoms from the in-RAM portion of AtomSpace, when
RAM gets short and they are judged not as valuable to retain in RAM as other Atoms. This
is commonly done using the LTI values of the Atoms (removing lowest LTI-Atoms, or more
complex strategies involving the LTI of groups of interconnected Atoms). May be done by
a dedicated Forgetting MindAgent. VLTI may be used to determine the fate of forgotten
Atoms.
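The simplest strategy named above (removing the lowest-LTI Atoms) can be sketched as follows. This is a toy stand-in for a Forgetting MindAgent; representing the Atomspace as a dict of handle-to-LTI values, and the function name, are assumptions made for the example:

```python
def forget(atom_lti, capacity):
    """Toy Forgetting pass: keep the 'capacity' Atoms with highest
    LTI in RAM and forget the rest. (In the real system, VLTI would
    decide whether a forgotten Atom is saved to disk or dropped.)"""
    ranked = sorted(atom_lti.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:capacity])
    forgotten = [handle for handle, _ in ranked[capacity:]]
    return kept, forgotten

atoms = {"A1": 40, "A2": 3, "A3": 75, "A4": 12}
kept, forgotten = forget(atoms, capacity=2)
print(sorted(kept))       # ['A1', 'A3']
print(sorted(forgotten))  # ['A2', 'A4']
```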
• Forward Chainer: A control mechanism (MindAgent) for PLN inference, that works by
taking existing Atoms and deriving conclusions from them using PLN rules, and then iter-
ating this process. The goal is to derive new Atoms that are interesting according to some
given criterion.
• Frame2Atom: A simple system of hand-coded rules for translating the output of RelEx2Frame
(logical representation of semantic relationships using FrameNet relationships) into Atoms.
• Freezing: Saving Atoms from the in-RAM AtomSpace to disk.
• General Intelligence: Often used in an informal, commonsensical sense, to mean the
ability to learn and generalize beyond specific problems or contexts. Has been formalized
in various ways as well, including formalizations of the notion of "achieving complex goals
in complex environments" and "achieving complex goals in complex environments using
limited resources." Usually interpreted as a fuzzy concept, according to which absolutely
general intelligence is physically unachievable, and humans have a significant level of general
intelligence, but far from the maximally physically achievable degree.
• Generalized Hypergraph: A hypergraph with some additional features, such as links
that point to links, and nodes that are seen as "containing" whole sub-hypergraphs. This is
the most natural and direct way to mathematically/visually model the Atomspace.
• Generator: In the PLN design, a rule is denoted a generator if it can produce its consequent
without needing premises (e.g. LookupRule, which just looks it up in the AtomSpace). See
composer.
• Global, Distributed Memory: Memory that stores items as implicit knowledge, with
each memory item spread across multiple components, stored as a pattern of organization
or activity among them.
• Glocal Memory: The storage of items in memory in a way that involves both localized
and global, distributed aspects.
• Goal: An Atom representing a function that a system (like OpenCog) is supposed to spend
a certain non-trivial percentage of its attention optimizing. The goal, informally speaking,
is to maximize the Atom's truth value.
• Goal, Implicit: A goal that an intelligent system, in practice, strives to achieve; but that
is not explicitly represented as a goal in the system's knowledge base.
• Goal, Explicit: A goal that an intelligent system explicitly represents in its knowledge
base and expends some resources trying to achieve. Goal Atoms (which may be Nodes or,
e.g., ImplicationLinks) are used for this purpose in OpenCog.
• Goal-Driven Learning: Learning that is driven by the cognitive schematic i.e. by the quest
of figuring out which procedures can be expected to achieve a certain goal in a certain sort
of context.
• Grounded SchemaNode: See SchemaNode, Grounded.
• Hebbian Learning: An aspect of Attention Allocation, centered on creating and updating
HebbianLinks, which represent the simultaneous importance of the Atoms joined by the
HebbianLink.
• Hebbian Links: Links recording information about the associative relationship (co-
occurrence) between Atoms. These include symmetric and asymmetric HebbianLinks.
• Heterarchical Network: A network of linked elements in which the semantic relationships
associated with the links are generally symmetrical (e.g. they may be similarity links, or
symmetrical associative links). This is one important sort of subnetwork of an intelligent
system; see Dual Network.
• Hierarchical Network: A network of linked elements in which the semantic relationships
associated with the links are generally asymmetrical, and the parent nodes of a node have
a more general scope and some measure of control over their children (though there may be
important feedback dynamics too). This is one important sort of subnetwork of an intelligent
system; see Dual Network.
• Higher-Order Inference (HOI): PLN inference involving variables or higher-order
functions. In contrast to First-Order Inference (FOI).
• Hillclimbing: A general term for greedy, local optimization techniques, including some
relatively sophisticated ones that involve "mildly nonlocal" jumps.
• Human-Level Intelligence: General intelligence that's "as smart as" human general in-
telligence, even if in some respects quite unlike human intelligence. An informal concept,
which generally doesn't come up much in CogPrime work, but is used frequently by some
other AI theorists.
• Human-Like Intelligence: General intelligence with properties and capabilities broadly
resembling those of humans, but not necessarily precisely imitating human beings.
• Hypergraph: A conventional hypergraph is a collection of nodes and links, where each
link may span any number of nodes. OpenCog makes use of generalized hypergraphs (the
Atomspace is one of these).
• Imitation Learning: Learning via copying what some other agent is observed to do.
• Implication: Often refers to an ImplicationLink between two PredicateNodes, indicating
an (extensional, intensional or mixed) logical implication.
• Implicit Knowledge Representation: Representation of knowledge via having easily
humanly identifiable pieces of knowledge correspond to the pattern of organization and/or
dynamics of elements, rather than via having individual elements correspond to easily hu-
manly identifiable pieces of knowledge.
• Importance: A generic term for the Attention Values associated with Atoms. Most com-
monly these are STI (short term importance) and LTI (long term importance) values. Other
importance values corresponding to various different time scales are also possible. In general
an importance value reflects an estimate of the likelihood an Atom will be useful to the
system over some particular future time-horizon. STI is generally relevant to processor time
allocation, whereas LTI is generally relevant to memory allocation.
• Importance Decay: The process of Atom importance values (e.g. STI and LTI) decreasing
over time, if the Atoms are not utilized. Importance decay rates may in general be context-
dependent.
• Importance Spreading: A synonym for Importance Updating, intended to highlight the
similarity with "activation spreading" in neural and semantic networks.
• Importance Updating: The CIM-Dynamic that periodically (frequently) updates the STI
and LTI values of Atoms based on their recent activity and their relationships.
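A single pass of such an update can be sketched in the economic terms used by CogPrime's attention allocation (rent and wages in STI currency). The function, parameter names, and numeric values are illustrative assumptions, not the exact OpenCog equations:

```python
def update_sti(sti, rent, wage, stimulated):
    """One toy ECAN-style Importance Updating pass: every Atom pays
    'rent' in STI currency; Atoms that were recently useful (in the
    'stimulated' set) earn a 'wage'."""
    return {handle: value - rent + (wage if handle in stimulated else 0.0)
            for handle, value in sti.items()}

sti = {"A": 10.0, "B": 10.0}
sti = update_sti(sti, rent=1.0, wage=3.0, stimulated={"A"})
print(sti)  # {'A': 12.0, 'B': 9.0}
```

Iterating such a pass makes unused Atoms' STI decay steadily while useful Atoms stay above the AttentionalFocus boundary.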
• Imprecise Truth Value: Peter Walley's imprecise probabilities are intervals interpreted
as lower and upper bounds of the means of probability distributions in an envelope
of distributions. In general, the term may be used to refer to any truth value involving
intervals or related constructs, such as indefinite probabilities.
• Indefinite Probability: An extension of a standard imprecise probability, comprising a
credible interval for the means of probability distributions governed by a given second-order
distribution.
• Indefinite Truth Value: An OpenCog TruthValue object wrapping up an indefinite prob-
ability
• Induction: In PLN, a specific inference rule (A → B, A → C, therefore B → C). In general,
the process of heuristically inferring that what has been seen in multiple examples, will be
seen again in new examples. Induction in the broad sense, may be carried out in OpenCog
by methods other than PLN induction. When emphasis needs to be laid on the particular
PLN inference rule, the phrase "PLN Induction" is used.
• Inference: Generally speaking, the process of deriving conclusions from assumptions. In
an OpenCog context, this often refers to the PLN inference system. Inference in the broad
sense is distinguished from general learning via some specific characteristics, such as the
intrinsically incremental nature of inference: it proceeds step by step.
• Inference Control: A cognitive process that determines what logical inference rule (e.g.
what PLN rule) is applied to what data, at each point in the dynamic operation of an
inference process.
• Integrative AGI: An AGI architecture, like CogPrime, that relies on a number of different
powerful, reasonably general algorithms all cooperating together. This is different from an
AGI architecture that is centered on a single algorithm, and also different than an AGI
architecture that expects intelligent behavior to emerge from the collective interoperation
of a number of simple elements (without any sophisticated algorithms coordinating their
overall behavior).
• Integrative Cognitive Architecture: A cognitive architecture intended to support inte-
grative AGI.
• Intelligence: An informal, natural language concept. "General intelligence" is one slightly
more precise specification of a related concept; "Universal intelligence" is a fully precise
specification of a related concept. Other specifications of related concepts made in the
particular context of CogPrime research are the pragmatic general intelligence and the
efficient pragmatic general intelligence.
• Intension: In PLN, the intension of a node consists of Atoms representing properties of
the entity the node represents.
• Intentional memory: A system's knowledge of its goals and their subgoals, and associations
between these goals and procedures and contexts (e.g. cognitive schematics).
• Internal Simulation World: A simulation engine used to simulate an external environ-
ment (which may be physical or virtual), used by an AGI system as its "mind's eye" in order
to experiment with various action sequences and envision their consequences, or observe
the consequences of various hypothetical situations. Particularly important for dealing with
episodic knowledge.
• Interval Algebra: Allen Interval Algebra, a mathematical theory of the relationships be-
tween time intervals. CogPrime utilizes a fuzzified version of classic Interval Algebra.
• IRC Learning (Imitation, Reinforcement, Correction): Learning via interaction with
a teacher, involving a combination of imitating the teacher, getting explicit reinforcement
signals from the teacher, and having one's incorrect or suboptimal behaviors guided toward
betterness by the teacher in real-time. This is a large part of how young humans learn.
• Knowledge Base: A shorthand for the totality of knowledge possessed by an intelligent
system during a certain interval of time (whether or not this knowledge is explicitly rep-
resented). Put differently: this is an intelligence's total memory contents (inclusive of all
types of memory) during an interval of time.
• Language Comprehension: The process of mapping natural language speech or text into
a more "cognitive", largely language-independent representation. In OpenCog this has been
done by various pipelines consisting of dedicated natural language processing tools, e.g. the
pipeline: text → Link Parser → RelEx → RelEx2Frame → Frame2Atom → Atomspace; and
alternatively the pipeline: text → Link Parser → Link2Atom → Atomspace. It would also be
possible to do language comprehension purely via PLN and other generic OpenCog processes,
without using specialized language processing tools.
• Language Generation: The process of mapping (largely language-independent) cognitive
content into speech or text. In OpenCog this has been done by various pipelines consisting of
dedicated natural language processing tools, e.g. the pipeline: Atomspace → NLGen → text;
or more recently Atomspace → Atom2Link → surface realization → text. It would also be
possible to do language generation purely via PLN and other generic OpenCog processes,
without using specialized language processing tools.
• Language Processing: Processing of human language is decomposed, in CogPrime, into
Language Comprehension, Language Generation, and Dialogue Control.
• Learning: In general, the process of a system adapting based on experience, in a way that
increases its intelligence (its ability to achieve its goals). The theory underlying CogPrime
doesn't distinguish learning from reasoning, associating, or other aspects of intelligence.
• Learning Server: In some OpenCog configurations, this refers to a software server that
performs "offline" learning tasks (e.g. using MOSES or hillclimbing), and is in communication
with an Operational Agent Controller software server that performs real-time agent
control and dispatches learning tasks to, and receives results from, the Learning Server.
• Linguistic Links: A catch-all term for Atoms explicitly representing linguistic content,
e.g. WordNode, SentenceNode, CharacterNode.
• Link: A type of Atom, representing a relationship among one or more Atoms. Links and
Nodes are the two basic kinds of Atoms.
• Link Parser: A natural language syntax parser, created by Sleator and Temperley at
Carnegie-Mellon University, and currently used as part of OpenCogPrime's natural language
comprehension and natural language generation system.
• Link2Atom: A system for translating link parser links into Atoms. It attempts to resolve
precisely as much ambiguity as needed in order to translate a given assemblage of link parser
links into a unique Atom structure.
• Lobe: A term sometimes used to refer to a portion of a distributed Atomspace that lives
in a single computational process. Often different lobes will live on different machines.
• Localized Memory: Memory that stores each item using a small number of closely-
connected elements.
• Logic: In an OpenCog context, this usually refers to a set of formal rules for translating
certain combinations of Atoms into "conclusion" Atoms. The paradigm case at present is the
PLN probabilistic logic system, but OpenCog can also be used together with other logics.
• Logical Links: Any Atoms whose truth values are primarily determined or adjusted via
logical rules, e.g. PLN's InheritanceLink, SimilarityLink, ImplicationLink, etc. The term
isn't usually applied to other links like HebbianLinks whose semantics isn't primarily
logic-based, even though these other links can be processed via logical inference (e.g. via
PLN) by interpreting them logically.
• Lojban: A constructed human language, with a completely formalized syntax and a highly
formalized semantics, and a small but active community of speakers. In principle this seems
an extremely good method for communication between humans and early-stage AGI sys-
tems.
• Lojban++: A variant of Lojban that incorporates English words, enabling more flexible
expression without the need for frequent invention of new Lojban words.
• Long Term Importance (LTI): A value associated with each Atom, indicating roughly
the expected utility to the system of keeping that Atom in RAM rather than saving it to
disk or deleting it. It's possible to have multiple LTI values pertaining to different time
scales, but so far practical implementation and most theory have centered on the option of
a single LTI value.
• LTI: Long Term Importance
• Map: A collection of Atoms that are interconnected in such a way that they tend to be
commonly active (i.e. to have high STI, e.g. enough to be in the AttentionalFocus, at the
same time).
• Map Encapsulation: The process of automatically identifying maps in the Atomspace,
and creating Atoms that "encapsulate" them; the Atom encapsulating a map would link to
all the Atoms in the map. This is a way of making global memory into local memory, thus
making the system's memory glocal and explicitly manifesting the "cognitive equation."
This may be carried out via a dedicated MapEncapsulation MindAgent.
• Map Formation: The process via which maps form in the Atomspace. This need not be
explicit; maps may form implicitly via the action of Hebbian Learning. It will commonly
occur that Atoms frequently co-occurring in the AttentionalFocus will come to be joined
together in a map.
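The co-occurrence dynamics behind map formation can be sketched as follows (an illustrative toy, not OpenCog's actual Hebbian-learning code): atoms that repeatedly appear together in AttentionalFocus snapshots accumulate pairwise "Hebbian" weights, and a cluster of strongly linked atoms is a candidate map.

```python
from collections import Counter
from itertools import combinations

def hebbian_weights(focus_snapshots):
    """Pairwise weights from AttentionalFocus snapshots (sets of atom names)."""
    pair_counts, atom_counts = Counter(), Counter()
    for snapshot in focus_snapshots:
        atom_counts.update(snapshot)
        pair_counts.update(combinations(sorted(snapshot), 2))
    # weight = co-occurrence count relative to the rarer atom's frequency
    return {p: c / min(atom_counts[p[0]], atom_counts[p[1]])
            for p, c in pair_counts.items()}

def candidate_map(weights, seed, threshold=0.8):
    """Grow a map from a seed atom along links at or above the threshold."""
    members, frontier = {seed}, [seed]
    while frontier:
        a = frontier.pop()
        for (x, y), w in weights.items():
            if w >= threshold and a in (x, y):
                other = y if a == x else x
                if other not in members:
                    members.add(other)
                    frontier.append(other)
    return members
```

Map Encapsulation, described above, would then create a single Atom linked to all members of such a cluster.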
• Memory Types: In CogPrime this generally refers to the different types of memory that
are embodied in different data structures or processes in the CogPrime architecture, e.g.
declarative (semantic), procedural, attentional, intentional, episodic, and sensorimotor.
• Mind-World Correspondence Principle: The principle that, for a mind to display
efficient pragmatic general intelligence relative to a world, it should display many of the
same key structural properties as that world. This can be formalized by modeling the world
and mind as probabilistic state transition graphs, and saying that the categories implicit
in the state transition graphs of the mind and world should be inter-mappable via a high-
probability morphism.
• Mind OS: A synonym for the OpenCog Core.
• MindAgent: An OpenCog software object, residing in the CogServer, that carries out
some processes in interaction with the Atomspace. A given conceptual cognitive process
(e.g. PLN inference, Attention allocation, etc.) may be carried out by a number of different
MindAgents designed to work together.
• Mindspace: A model of the set of states of an intelligent system as a geometrical space,
imposed by assuming some metric on the set of mind-states. This may be used as a tool for
formulating general principles about the dynamics of generally intelligent systems.
• Modulators: Parameters in the Psi model of motivated, emotional cognition, that modu-
late the way a system perceives, reasons about and interacts with the world.
• MOSES (Meta-Optimizing Semantic Evolutionary Search): An algorithm for proce-
dure learning, which in the current implementation learns programs in the Combo language.
MOSES is an evolutionary learning system, which differs from typical genetic programming
systems in multiple aspects including: a subtler framework for managing multiple "demes"
or "islands" of candidate programs; a library, called Reduct, for placing programs in Elegant
Normal Form; and the use of probabilistic modeling in place of, or in addition to, mutation
and crossover as means of determining which new candidate programs to try.
• Motoric: Pertaining to the control of physical actuators, e.g. those connected to a robot.
May sometimes be used to refer to the control of movements of a virtual character as well.
• Moving Bubble of Attention: The Attentional Focus of a CogPrime system.
• Natural Language Comprehension: See Language Comprehension
• Natural Language Generation: See Language Generation
• Natural Language Processing (NLP): See Language Processing
• NLGen: Software for carrying out the surface realization phase of natural language gen-
eration, via translating collections of RelEx output relationships into English sentences.
Was made functional for simple sentences and some complex sentences; not currently under
active development, as work has shifted to the related Atom2Link approach to language
generation.
• Node: A type of Atom. Links and Nodes are the two basic kinds of Atoms. Nodes, math-
ematically, can be thought of as "0-ary" links. Some types of Nodes refer to external or
mathematical entities (e.g. WordNode, NumberNode); others are purely abstract, e.g. a
ConceptNode is characterized purely by the Links relating it to other Atoms. Grounded-
PredicateNodes and GroundedSchemaNodes connect to explicitly represented procedures
(sometimes in the Combo language); ungrounded PredicateNodes and SchemaNodes are
abstract and, like ConceptNodes, purely characterized by their relationships.
• Node Probability: Many PLN inference rules rely on probabilities associated with Nodes.
Node probabilities are often easiest to interpret in a specific context, e.g. the probability
P(cat) makes obvious sense in the context of a typical American house, or in the context
of the center of the sun. Without any contextual specification, P(A) is taken to mean
the probability that a randomly chosen occasion of the system's experience includes some
instance of A.
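This experience-based reading of node probability can be made concrete with a toy sketch (names and data are hypothetical): P(A) is estimated as the fraction of experienced "occasions" containing an instance of A, and a contextual probability simply conditions on occasions containing the context.

```python
def node_probability(occasions, atom, context=None):
    """Estimate P(atom), optionally within a context, from experience.

    Each occasion is modeled as a set of atoms the system experienced together.
    """
    pool = [o for o in occasions if context is None or context in o]
    if not pool:
        return 0.0
    return sum(atom in o for o in pool) / len(pool)

# Hypothetical experience stream: each set is one occasion of experience.
occasions = [{"house", "cat"}, {"house", "dog"}, {"sun"}, {"house", "cat"}]

node_probability(occasions, "cat")             # unconditional P(cat)
node_probability(occasions, "cat", "house")    # P(cat) in the context "house"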
• Novamente Cognition Engine (NCE): A proprietary proto-AGI software system, the
predecessor to OpenCog. Many parts of the NCE were open-sourced to form portions of
OpenCog, but some NCE code was not included in OpenCog; and now OpenCog includes
multiple aspects and plenty of code that was not in NCE.
• OpenCog: A software framework intended for development of AGI systems, and also for
narrow-AI application using tools that have AGI applications. Co-designed with the Cog-
Prime cognitive architecture, but not exclusively bound to it.
• OpenCog Prime (OCP): The implementation of the CogPrime cognitive architecture
within the OpenCog software framework.
• OpenPsi: CogPrime's architecture for motivation-driven action selection, which is based
on adapting Dörner's Psi model for use in the OpenCog framework.
• Operational Agent Controller (OAC): In some OpenCog configurations, this is a soft-
ware server containing a CogServer devoted to real-time control of an agent (e.g. a virtual
world agent, or a robot). Background, offline learning tasks may then be dispatched to other
software processes, e.g. to a Learning Server.
• Pattern: In a CogPrime context, the term "pattern" is generally used to refer to a process
that produces some entity, and is judged simpler than that entity.
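One concrete (and much simplified) reading of this definition: a compressed representation together with its decompressor is a "pattern" in a string when it is simpler, i.e. shorter, than the string itself. Here zlib stands in for the producing process; this is an illustration of the idea, not CogPrime's pattern machinery.

```python
import zlib

def is_pattern(data: bytes) -> bool:
    """True if a simpler process producing `data` was found."""
    program = zlib.compress(data)            # the "process" producing the entity
    assert zlib.decompress(program) == data  # it really reproduces the entity
    return len(program) < len(data)          # ...and is simpler than it

is_pattern(b"abab" * 100)   # highly regular data: compresses well
is_pattern(b"q7#xZ")        # short, irregular data: no simpler process found
```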
• Pattern Mining: Pattern mining is the process of extracting an (often large) number of
patterns from some body of information, subject to some criterion regarding which patterns
are of interest. Often (but not exclusively) it refers to algorithms that are rapid or "greedy",
finding a large number of simple patterns relatively inexpensively.
• Pattern Recognition: The process of identifying and representing a pattern in some
substrate (e.g. some collection of Atoms, or some raw perceptual data, etc.).
• Patternism: The philosophical principle holding that, from the perspective of engineering
intelligent systems, it is sufficient and useful to think about mental processes in terms of
(static and dynamical) patterns.
• Perception: The process of understanding data from sensors. When natural language is
ingested in textual format, this is generally not considered perceptual. Perception may be
taken to encompass both pre-processing that prepares sensory data for ingestion into the
Atomspace, processing via specialized perception processing systems like DeSTIN that are
connected to the Atomspace, and more cognitive-level processing within the Atomspace that
is oriented toward understanding what has been sensed.
• Piagetan Stages: A series of stages of cognitive development hypothesized by develop-
mental psychologist Jean Piaget, which are easy to interpret in the context of developing
CogPrime systems. The basic stages are: Infantile, Pre-operational, Concrete Operational
and Formal. Post-formal stages have been discussed by theorists since Piaget and seem
relevant to AGI, especially advanced AGI systems capable of strong self-modification.
• PLN: short for Probabilistic Logic Networks
• PLN, First-Order: See First-Order Inference
• PLN, Higher-Order: See Higher-Order Inference
• PLN Rules: A PLN Rule takes as input one or more Atoms (the "premises", usually Links),
and outputs an Atom that is a "logical conclusion" of those Atoms. The truth value of the
consequence is determined by a PLN Formula associated with the Rule.
• PLN Formulas: A PLN Formula, corresponding to a PLN Rule, takes the TruthValues
corresponding to the premises and produces the TruthValue corresponding to the conclusion.
A single Rule may correspond to multiple Formulas, where each Formula deals with a
different sort of TruthValue.
• Pragmatic General Intelligence: A formalization of the concept of general intelligence,
based on the concept that general intelligence is the capability to achieve goals in environ-
ments, calculated as a weighted average over some fuzzy set of goals and environments.
• Predicate Evaluation: The process of determining the Truth Value of a predicate, embodied
in a PredicateNode. This may be recursive, as the predicate referenced internally by a
GroundedPredicateNode (and represented via a Combo program tree) may itself internally
reference other PredicateNodes.
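The recursion involved can be sketched as follows (the expression format and predicate names here are hypothetical stand-ins, not OpenCog's Combo representation): a predicate's body is a small expression tree whose leaves may name other predicates, so evaluating one predicate may recursively evaluate others.

```python
# Hypothetical predicate repository: each body is a tiny expression tree.
predicates = {
    "is_pet": ("or", ("ref", "is_cat"), ("ref", "is_dog")),
    "is_cat": ("atom", "cat"),
    "is_dog": ("atom", "dog"),
}

def evaluate(expr, observation):
    """Recursively evaluate a predicate expression against an observation set."""
    kind = expr[0]
    if kind == "atom":   # ground test: is the item present in the observation?
        return 1.0 if expr[1] in observation else 0.0
    if kind == "ref":    # recurse into another named predicate
        return evaluate(predicates[expr[1]], observation)
    if kind == "or":     # fuzzy OR as max
        return max(evaluate(e, observation) for e in expr[1:])
    if kind == "and":    # fuzzy AND as min
        return min(evaluate(e, observation) for e in expr[1:])
    raise ValueError(kind)

evaluate(predicates["is_pet"], {"cat", "fur"})   # recurses through is_cat
```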
• Probabilistic Logic Networks (PLN): A mathematical and conceptual framework for
reasoning under uncertainty, integrating aspects of predicate and term logic with extensions
of imprecise probability theory. OpenCogPrime's central tool for symbolic reasoning.
• Procedural Knowledge: Knowledge regarding which series of actions (or action-combinations)
are useful for an agent to undertake in which circumstances. In CogPrime these may be
learned in a number of ways, e.g. via PLN or via Hebbian learning of Schema Maps, or via
explicit learning of Combo programs via MOSES or hillclimbing. Procedures are represented
as SchemaNodes or Schema Maps.
• Procedure Evaluation/Execution: A general term encompassing both Schema Execu-
tion and Predicate Evaluation, both of which are similar computational processes involving
manipulation of Combo trees associated with ProcedureNodes.
• Procedure Learning: Learning of procedural knowledge, based on any method, e.g. evo-
lutionary learning (e.g. MOSES), inference (e.g. PLN), reinforcement learning (e.g. Hebbian
learning).
• Procedure Node: A SchemaNode or PredicateNode
• Psi: A model of motivated action and emotion, originated by Dietrich Dörner and further
developed by Joscha Bach, who incorporated it in his proto-AGI system MicroPsi. OpenCog-
Prime's motivated-action component, OpenPsi, is roughly based on the Psi model.
• Psynese: A system enabling different OpenCog instances to communicate without using
natural language, via directly exchanging Atom subgraphs, using a special system to map
references in the speaker's mind into matching references in the listener's mind.
• Psynet Model: An early version of the theory of mind underlying CogPrime, referred to
in some early writings on the Webmind AI Engine and Novamente Cognition Engine. The
concepts underlying the psynet model are still part of the theory underlying CogPrime, but
the name has been deprecated as it never really caught on.
• Reasoning: See inference
• Reduct: A code library, used within MOSES, applying a collection of hand-coded rewrite
rules that transform Combo programs into Elegant Normal Form.
• Region Connection Calculus: A mathematical formalism describing a system of basic
operations among spatial regions. Used in CogPrime as part of spatial inference to provide
relations and rules to be referenced via PLN and potentially other subsystems.
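To make the flavor of the calculus concrete, here is a hedged sketch of the coarser RCC5 subset, with regions discretized to sets of grid cells. The real calculus is topological (RCC8 also distinguishes boundary contact), and CogPrime's use is via PLN relations rather than this set semantics, but the set version shows the idea.

```python
def rcc5(a: set, b: set) -> str:
    """RCC5 relation between two regions modeled as sets of grid cells."""
    if a == b:       return "EQ"   # identical regions
    if not (a & b):  return "DR"   # discrete: no shared cells
    if a < b:        return "PP"   # a is a proper part of b
    if b < a:        return "PPi"  # b is a proper part of a
    return "PO"                    # partial overlap
```

PLN spatial inference can then chain such relations, e.g. from PP(a, b) and PP(b, c) it follows that PP(a, c).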
• Reinforcement Learning: Learning procedures via experience, in a manner explicitly
guided to cause the learning of procedures that will maximize the system's expected future
reward. CogPrime does this implicitly whenever it tries to learn procedures that will maxi-
mize some Goal whose Truth Value is estimated via an expected reward calculation (where
"reward" may mean simply the Truth Value of some Atom defined as "reward"). Goal-driven
learning is more general than reinforcement learning as thus defined; and the learning that
CogPrime does, which is only partially goal-driven, is yet more general.
• RelEx: A software system used in OpenCog as part of natural language comprehension, to
map the output of the link parser into more abstract semantic relationships. These more
abstract relationships may then be entered directly into the Atomspace, or they may be
further abstracted before being entered into the Atomspace, e.g. by RelEx2Frame rules.
• RelEx2Frame: A system of rules for translating RelEx output into Atoms, based on the
FrameNet ontology. The output of the RelEx2Frame rules make use of the FrameNet library
of semantic relationships. The current (2012) RelEx2Frame rule-base is problematic, and
the RelEx2Frame system is deprecated as a result, in favor of Link2Atom. However, the
ideas embodied in these rules may be useful; if cleaned up the rules might profitably be
ported into the Atomspace as ImplicationLinks.
• Representation Building: A stage within MOSES, wherein a candidate Combo program
tree (within a deme) is modified by replacing one or more tree nodes with alternative tree
nodes, thus obtaining a new, different candidate program within that deme. This process
currently relies on hand-coded knowledge regarding which types of tree nodes a given tree
node should be experimentally replaced with (e.g. an AND node might sensibly be replaced
with an OR node, but not so sensibly replaced with a node representing a "kick" action).
• Request for Services (RFS): In CogPrime's goal-driven action system, an RFS is a
package sent from a Goal Atom to another Atom, offering it a certain amount of STI
currency if it is able to deliver what the goal wants (an increase in its Truth Value).
RFS's may be passed on, e.g. from goals to subgoals to sub-subgoals, but eventually an
RFS reaches a GroundedSchemaNode, and when the corresponding Schema is executed,
the payment implicit in the RFS is made.
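A minimal economic sketch of this mechanism (the data structures and amounts are hypothetical, not the OpenCog implementation): a goal issues an offer of STI currency, the offer is forwarded down the goal hierarchy to a grounded schema, and payment happens only when that schema actually executes.

```python
# Hypothetical STI ledger for a goal, a subgoal, and a grounded schema.
sti = {"goal": 100.0, "subgoal": 10.0, "schema": 0.0}

def issue_rfs(source, amount):
    """A goal offers `amount` of its STI currency for services rendered."""
    return {"payer": source, "amount": amount}

def forward_rfs(rfs, _intermediate):
    """Subgoals pass the offer along unchanged: same payer, same amount."""
    return rfs

def execute_schema(rfs, schema):
    """On execution, the implicit payment in the RFS is actually made."""
    sti[rfs["payer"]] -= rfs["amount"]   # the goal pays out...
    sti[schema] += rfs["amount"]         # ...to the schema that delivered
    return "executed"

rfs = issue_rfs("goal", 25.0)
rfs = forward_rfs(rfs, "subgoal")
execute_schema(rfs, "schema")
```

Note that the intermediate subgoal's STI is untouched: it merely relays the promise, which is what lets credit flow from goals to the schemata that ultimately serve them.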
• Robot Preschool: An AGI Preschool in our physical world, intended for robotically em-
bodied AGIs.
• Robotic Embodiment: Using an AGI to control a robot. The AGI may be running on
hardware physically contained in the robot, or may run elsewhere and control the robot via
networking methods such as wifi.
• Scheduler: Part of the CogServer that controls which processes (e.g. which MindAgents)
get processor time, at which point in time.
• Schema: A "script" describing a process to be carried out. This may be explicit, as in the
case of a GroundedSchemaNode, or implicit, as is the case with Schema Maps or ungrounded
SchemaNodes.
• Schema Encapsulation: The process of automatically recognizing a Schema Map in an
Atomspace, and creating a Combo (or other) program embodying the process carried out
by this Schema Map, and then storing this program in the Procedure Repository and
associating it with a particular SchemaNode. This translates distributed, global procedural
memory into localized procedural memory. It's a special case of Map Encapsulation.
• Schema Execution: The process of "running" a Grounded Schema, similar to running a
computer program. Or, phrased alternately: the process of executing the Schema referenced
by a GroundedSchemaNode. This may be recursive, as the schema referenced internally by
a GroundedSchemaNode (and represented via a Combo program tree) may itself internally
reference other GroundedSchemaNodes.
• Schema, Grounded: A Schema that is associated with a specific executable program
(either a Combo program or, say, C++ code).
• Schema Map: A collection of Atoms, including Schemallodes, that tend to be enacted
in a certain order (or set of orders), thus habitually enacting the same process. This is a
distributed, globalized way of storing and enacting procedures.
• Schema, Ungrounded: A Schema that represents an abstract procedure, not associated
with any particular executable program.
• Schematic Implication: A general, conceptual name for implications of the form ((Con-
text AND Procedure) IMPLIES Goal)
• SegSim: A name for the main algorithm underlying the NLGen language generation soft-
ware. The algorithm is based on segmenting a collection of Atoms into small parts, and
matching each part against memory, to find, for each part, cases where similar Atom-
collections already have known linguistic expression.
• Self-Modification: A term generally used for AI systems that can purposefully modify
their core algorithms and representations. Formally and crisply distinguishing this sort of
"strong self-modification" from "mere" learning is a tricky matter.
• Sensorimotor: Pertaining to sensory data, motoric actions, and their combination and
intersection.
• Sensory: Pertaining to data received by the AGI system from the outside world. In a
CogPrime system that perceives language directly as text, the textual input will generally
not be considered as "sensory" (on the other hand, speech audio data would be considered
as "sensory").
• Short Term Importance: A value associated with each Atom, indicating roughly the
expected utility to the system of devoting attention (e.g. processor time) to that Atom
in the near future; the Atoms with the highest STI values constitute the AttentionalFocus.
It's possible to have multiple STI values pertaining to different time scales, but so far
practical implementation and most theory have centered on the option of a single STI value.
• Similarity: a link type indicating the probabilistic similarity between two different Atoms.
Generically this is a combination of Intensional Similarity (similarity of properties) and
Extensional Similarity (similarity of members).
• Simple Truth Value: a TruthValue involving a pair (s, d) indicating strength s (e.g. proba-
bility or fuzzy set membership) and confidence d. The confidence d may be replaced by other
options such as a count n or a weight of evidence w.
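The interconversion between count and confidence can be sketched as below, using the PLN-style formula c = n / (n + k), where k is a "lookahead" personality parameter; the value k = 1 here is purely illustrative.

```python
from dataclasses import dataclass

K = 1.0  # personality ("lookahead") parameter; k = 1 is an assumed example value

@dataclass
class SimpleTruthValue:
    strength: float    # probability or fuzzy membership, in [0, 1]
    confidence: float  # in [0, 1)

    @classmethod
    def from_count(cls, strength, count):
        """Build a (strength, confidence) pair from an evidence count n."""
        return cls(strength, count / (count + K))

    def count(self):
        """Recover the evidence count n from the confidence c = n/(n+k)."""
        return K * self.confidence / (1.0 - self.confidence)

tv = SimpleTruthValue.from_count(0.8, 9.0)  # 9 observations, 80% positive
```

More evidence pushes confidence toward (but never to) 1, which is why confidence and count carry the same information in this scheme.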
• Simulation World: See Internal Simulation World
• SMEPH (Self-Modifying Evolving Probabilistic Hypergraphs): a style of modeling
systems, in which each system is associated with a derived hypergraph.
• SMEPH Edge: A link in a SMEPH derived hypergraph, indicating an empirically observed
relationship (e.g. inheritance or similarity) between two SMEPH vertices.
• SMEPH Vertex: A node in a SMEPH derived hypergraph representing a system, indicat-
ing a collection of system states empirically observed to arise in conjunction with the same
external stimuli.
• Spatial Inference: PLN reasoning including Atoms that explicitly reference spatial rela-
tionships.
• Spatiotemporal Inference: PLN reasoning including Atoms that explicitly reference spa-
tial and temporal relationships.
• STI: Shorthand for Short Term Importance
• Strength: The main component of a TruthValue object, lying in the interval [0,1], referring
either to a probability (in cases like InheritanceLink, SimilarityLink, EquivalenceLink,
ImplicationLink, etc.) or a fuzzy value (as in MemberLink, EvaluationLink).
• Strong Self-Modification: This is generally used as synonymous with Self-Modification,
in a CogPrime context.
• Subsymbolic: Involving processing of data using elements that have no correspondence to
natural language terms, nor abstract concepts; and that are not naturally interpreted as
symbolically "standing for" other things. Often used to refer to processes such as perception
processing or motor control, which are concerned with entities like pixels or commands like
"rotate servomotor 15 by 10 degrees theta and 55 degrees phi." The distinction between
"symbolic" and "subsymbolic" is conventional in the history of AI, but seems difficult to
formalize rigorously. Logic-based AI systems are typically considered "symbolic", yet
• Supercompilation: A technique for program optimization, which globally rewrites a pro-
gram into a usually very different looking program that does the same thing. A prototype
supercompiler was applied to Combo programs with successful results.
• Surface Realization: The process of taking a collection of Atoms and transforming them
into a series of words in a (usually natural) language. A stage in the overall process of
language generation.
• Symbol Grounding: The mapping of a symbolic term into perceptual or motoric entities
that help define the meaning of the symbolic term. For instance, the concept "Cat" may be
grounded by images of cats, experiences of interactions with cats, imaginations of being a
cat, etc.
• Symbolic: Pertaining to the formation or manipulation of symbols, i.e. mental entities that
are explicitly constructed to represent other entities. Often contrasted with subsymbolic.
• Syntax-Semantics Correlation: In the context of MOSES and program learning more
broadly, this refers to the property via which distance in syntactic space (distance between
the syntactic structure of programs, e.g. if they're represented as program trees) and se-
mantic space (distance between the behaviors of programs, e.g. if they're represented as
sets of input/output pairs) are reasonably well correlated. This can often happen among
sets of programs that are not too widely dispersed in program space. The Reduct library
is used to place Combo programs in Elegant Normal Form, which increases the level of
syntax-semantics correlation between them. The programs in a single MOSES deme are
often closely enough clustered together that they have reasonably high syntax-semantics
correlation.
• System Activity Table: An OpenCog component that records information regarding
what a system did in the past.
• Temporal Inference: Reasoning that heavily involves Atoms representing temporal in-
formation, e.g. information about the duration of events, or their temporal relationship
(before, after, during, beginning, ending). As implemented in CogPrime, makes use of an
uncertain version of Allen Interval Algebra.
• Truth Value: A package of information associated with an Atom, indicating its degree
of truth. SimpleTruthValue and IndefiniteTruthValue are two common, particular kinds.
Multiple truth values associated with the same Atom from different perspectives may be
grouped into CompositeTruthValue objects.
• Universal Intelligence: A technical term introduced by Shane Legg and Marcus Hutter,
describing (roughly speaking) the average capability of a system to carry out computable
goals in computable environments, where goal/environment pairs are weighted via the length
of the shortest program for computing them.
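In symbols, following Legg and Hutter's definition (their notation, not this book's): the universal intelligence of an agent π is its expected-reward performance summed over all computable environments μ, each weighted by the algorithmic prior 2^{-K(μ)}, where K is Kolmogorov complexity:

```latex
% Legg & Hutter's universal intelligence measure:
%   \pi      -- the agent being evaluated
%   E        -- the set of computable, reward-summable environments
%   K(\mu)   -- the Kolmogorov complexity of environment \mu
%   V^\pi_\mu -- the expected cumulative reward \pi achieves in \mu
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```

Simple environments (short programs) thus dominate the weighting, while arbitrarily complex environments contribute vanishingly little.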
• Urge: In OpenPsi, an Urge develops when a Demand deviates from its target range.
• Very Long Term Importance (VLTI): A bit associated with Atoms, which determines
whether, when an Atom is forgotten (removed from RAM), it is saved to disk (frozen) or
simply deleted.
• Virtual AGI Preschool: A virtual world intended for AGI teaching/training/learning,
bearing broad resemblance to the preschool environments used for young humans.
• Virtual Embodiment: Using an AGI to control an agent living in a virtual world or game
world, typically (but not necessarily) a 3D world with broad similarity to the everyday
human world.
• Webmind AI Engine: A predecessor to the Novamente Cognition Engine and OpenCog,
developed 1997-2001 - with many similar concepts (and also some different ones) but quite
different algorithms and software architecture.
References 529
References
ABS+11. Itamar Arel, $ Berant, T Slonint, A Nfoyal, B Li, and K Chai Sim. Acoustic spatiotemporal
modeling using deep machine learning for robust phoneme recognition. In Afeka-AVIOS Speech
Processing Conference, 2011.
A1183. James F. Allen. Maintaining knowledge about temporal. Intervals CACM, 26:198-3, 1983.
AMOI. J. S. Albus and A. M. Nfeystel. Engineering of Mind: An Introduction to the Science of Intelligent
Systems. Wiley and Sons, 2001.
Ama85. S. Amari. Differential-geometrical methods in statistics. Lecture notes in statistics, 1985.
Ama98. S. Amari. Natural gradient works efficiently in learning. Neural Computing, 10:251-276, 1998.
ANOO. Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry. ANIS, 2000.
ARC09a. I. Aral, D. Rose, and R. Coop. Destin: A scalable deep learning architecture with application
to high-dimensional robust pattern recognition. Proc. AAA! Workshop on Biologically Inspired
Cognitive Architectures, 2009.
ARC09b. Itamtu. Arel, Derek Rose, and Robert Coop. A biologically-inspired deep learning architecture with
application to high-dimensional pattern recognition. In Biologically Inspired Cognitive Architec-
tures, 2009. AAAI Press, 2009.
ARKO9. I. Arel, D. Rose, and T. Karnowski. A deep learning architecture comprising homogeneous cortical
circuits for scalable spatiotemporal pattern inference. NIPS 2009 Workshop on Deep Learning for
Speech Recognition and Related Applications, 2009.
Arn69. Rudolf Arnheim. Visual Thinking. University of California Press. Berkeley, 1969.
AS94. Rakesh Agrawal and Ra0takrishnan Srilaint. Fast algorithms for mining association rules. In Proc.
20th Int. Conf. Very Large Data Bases, 1994.
Ash65. Robert B. Ash. Information Theory. Dover Publications, 1965.
Bau04. E. B. Baum. What is Thought? MIT Press, 2004.
Bau06. E. Baum. A working hypothesis for general intelligence. In Advances in Artificial General
ligence: Concepts, Architectures and Algorithms, 2006.
BE07. Neven Boric and Pablo A. Estevez. Genetic programming-based clustering using an information
theoretic fitness measure. In Dipti Srinivasan and Lipo Wang, editors, 2007 IEEE Congress on
Evolutionary Computation, pages 31-38, Singapore, 25-28 September 2007. IEEE Computational
Intelligence Society, IEEE Press.
Bel03. Anthony J. Bell. The co-information lattice. Somewhere or other, 2003.
Ben94. Brandon Bennett. Spatial reasoning with propositional logics. In Principles of Knowledge Repre-
sentation and Reasoning: Proceedings of the 4th International Conference (KR94), pages 51-62.
Morgan Kaufmann, 1994.
BF97. A Blum and M Furst. Fast planning through planning graph analysis. Artificial intelligence, 1997.
BH10. Bundzel and Hashimoto. Object identification in dynamic images based on the memory-prediction
theory of brain function. Journal of Intelligent Learning Systems and Applications, 2-4, 2010.
Bic08. Derek Bickerton. Bastard Tongues. Hill and Wang, 2008.
BKLO6. Aline Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In Proc.
International Conference on Machine Learning, 2006.
BL99. Avrim Blum and John Langford. Probabilistic planning in the graphplan framework. In 5th
European Conference on Planning (ECP '99), 1999.
Bor05. Christian Borgelt. Keeping things simple: Finding frequent item sets by recursive elimination. In
Workshop on Open Source Data Mining Software (OSDM'05). Chicago IL, pages 66-70. 2005.
Car06. Pereira Francisco Cara. Creativity and Artificial Intelligence: A Conceptual Blending Approach,
Applications of Cognitive Linguistics. Amsterdam: Mouton de Gruyter, 2006.
Cas04. N. L. Cassimatis. Grammatical processing using the mechanisms of physical inferences. In Pro-
ceedings of the Twentieth-Sixth Annual Conference of the Cognitive Science Society. 2004.
CB00. W. H. Calvin and D. Bickerton. Lingua er Machin. MIT Press, 2000.
CFH97. Eliseo Clementini, Paolino Di Felice, and Daniel HernAindez. Qualitative representation of posi-
tional information. Artificial Intelligence, 95:317-356, 1997.
COPH09. Lucio Coelho, Ben Goertzel, Cassio Pennachin, and Chris Howard. Classifier ensemble based
analysis of a genome-wide snp dataset concerning late-onset alzheimer disease. In Proceedings of
8th IEEE International Conference on Cognitive Informatics., 2009.
Cha0s. Gregory Chaitin. Algorithmic Information Theory. Cambridge University Press, 2008.
EFTA00624676
530 A Glossary
Cha09. Mark Changizi. The Vision Revolution. l3enBella Books, 2009.
Che97. K. Chellapilla. Evolving computer programs without subtree crossover. IEEE Transactions on
Evolutionary Computation, 1997.
Coh95. A.C. Cohn. A hierarchical representation of qualitative shape based on connection and convexity.
In Proc COSIT95, LNCS, pages 311-326. Springer Verlag, 1995.
Cox61. Richard Cox. The Algebra of Probable Inference. Johns Hopkins University Press, 1961.
CS10. Shay B. Cohen and Noah A. Smith. Covariance in unsupervised learning of probabilistic grammars.
Journal of Machine Learning Research, 11:3117-3151, 2010.
CSZ06. Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press,
2006.
CXYM05. Yun Chi, Yi Xia, Yirong Yang, and Richard R. Muntz. Mining closed and maximal frequent subtrees
from databases of labeled rooted trees. IEEE Trans. Knowledge and Data Engineering, 2005.
Dab99. A.C. Dabak. A Geometry for Detection Theory. PhD Thesis, Rice U., 1999.
Dea98. Terrence Deacon. The Symbolic Species. Norton, 1998.
dF37. Bruno de Finetti. La prévision : ses lois logiques, ses sources subjectives. Annales de l'Institut
Henri Poincaré, 1937.
DP09. Yassine Djouadi and Henri Prade. Interval-valued fuzzy formal concept analysis. In ISMIS '09:
Proc. of the 18th International Symposium on Foundations of Intelligent Systems, pages 592-601,
Berlin, Heidelberg, 2009. Springer-Verlag.
dS77. Ferdinand de Saussure. Course in General Linguistics. Fontana/Collins, 1977. Orig. published
1916 as "Cours de linguistique générale".
EBJ+97. J. Elman, E. Bates, M. Johnson, A. Karmiloff-Smith, D. Parisi, and K. Plunkett. Rethinking
Innateness: A Connectionist Perspective on Development. MIT Press, 1997.
Ede93. Gerald Edelman. Neural darwinism: Selection and reentrant signaling in higher brain function.
Neuron, 10, 1993.
FM92. Christian Freksa and Robert Milton. Temporal reasoning based on semi-intervals. Artificial Intel-
ligence, 54(1-2):199-227, 1992.
FL12. Jeremy Fishel and Gerald Loeb. Bayesian exploration for intelligent identification of textures.
Frontiers in Neurorobotics, 6:4, 2012.
Fri98. Roy Frieden. Physics from Fisher Information. Cambridge U. Press, 1998.
FT02. G. Fauconnier and M. Turner. The Way We Think: Conceptual Blending and the Mind's Hidden
Complexities. Basic, 2002.
Gar00. Peter Gärdenfors. Conceptual Spaces: The Geometry of Thought. MIT Press, 2000.
GBK04. S. Gustafson, E. K. Burke, and G. Kendall. Sampling of unique structures and behaviours in
genetic programming. In European Conf. on Genetic Programming, 2004.
GCPM06. Ben Goertzel, Lucio Coelho, Cassio Pennachin, and Mauricio Mudada. Identifying Complex Bio-
logical Interactions based on Categorical Gene Expression Data. In Proceedings of Conference on
Evolutionary Computing, Vancouver CA, 2006.
GE01. Roop Goyal and Max Egenhofer. Similarity of cardinal directions. In Proc. of the Seventh
International Symposium on Spatial and Temporal Databases, pages 36-55. Springer-Verlag, 2001.
Gea05. Ben Goertzel et al. Combinations of single nucleotide polymorphisms in neuroendocrine
effector and receptor genes predict chronic fatigue syndrome. Pharmacogenomics, 2005.
GEA08. Ben Goertzel, Cassio Pennachin, et al. An integrative methodology for teaching embodied
non-linguistic agents, applied to virtual animals in Second Life. In Proc. of the First Conf. on AGI.
IOS Press, 2008.
Gea13. Ben Goertzel et al. The CogPrime architecture for embodied artificial general intelligence. In
Proceedings of IEEE Symposium on Human-Level AI, Singapore, 2013.
GGC+11. Ben Goertzel, Nil Geisweiller, Lucio Coelho, Predrag Janicic, and Cassio Pennachin. Real World
Reasoning. Atlantis, 2011.
GH11. N. Garg and J. Henderson. Temporal restricted Boltzmann machines for dependency parsing. In
Proc. ACL, 2011.
GI11. B. Goertzel and M. Iklé. Steps toward a geometry of mind. In J. Schmidhuber and K. Thorisson,
editors, Subm. to AGI-11. Springer, 2011.
GIGH08. B. Goertzel, M. Iklé, I. Goertzel, and A. Heljakka. Probabilistic Logic Networks. Springer, 2008.
GKD89. D. E. Goldberg, B. Korb, and K. Deb. Messy genetic algorithms: Motivation, analysis, and first
results. Complex Systems, 1989.
GL10. Ben Goertzel and Ruiting Lian. A probabilistic characterization of fuzzy semantics. Proc. of
ICAI-10, Beijing, 2010.
GLdG+10. Ben Goertzel, Ruiting Lian, Hugo de Garis, Shuo Chen, and Itamar Arel. World survey of artificial
brains, part II: Biologically inspired cognitive architectures. Neurocomputing, April 2010.
GN02. Alfonso Gerevini and Bernhard Nebel. Qualitative spatio-temporal reasoning with RCC-8 and Allen's
interval calculus: Computational complexity. In Frank van Harmelen, editor, ECAI, pages 312-316.
IOS Press, 2002.
Goe94. Ben Goertzel. Chaotic Logic. Plenum, 1994.
Goe06. Ben Goertzel. The Hidden Pattern. Brown Walker, 2006.
Goe08a. B. Goertzel. The pleasure algorithm. groups.google.com/group/opencog/files, 2008.
Goe08b. Ben Goertzel. A pragmatic path toward endowing virtually-embodied AIs with human-level lin-
guistic capability. IEEE World Congress on Computational Intelligence (WCCI), 2008.
Goe10a. Ben Goertzel. Infinite-order probabilities and their application to modeling self-referential seman-
tics. In Proceedings of Conference on Advanced Intelligence 2010, Beijing, 2010.
Goe10b. Ben Goertzel et al. A general intelligence oriented architecture for embodied natural language
processing. In Proc. of the Third Conf. on Artificial General Intelligence (AGI-10). Atlantis
Press, 2010.
Goe11a. B. Goertzel. Integrating a compositional spatiotemporal deep learning network with symbolic
representation/reasoning within an integrative cognitive architecture via an intermediary semantic
network. In Proceedings of AAAI Symposium on Cognitive Systems, 2011.
Goe11b. Ben Goertzel. Imprecise probability as a linking mechanism between deep learning, symbolic
cognition and local feature detection in vision processing. In Proc. of AGI-11, 2011.
GPPG06. Ben Goertzel, Hugo Pinto, Cassio Pennachin, and Izabela Freire Goertzel. Using dependency pars-
ing and probabilistic inference to extract relationships between genes, proteins and malignancies
implicit among multiple biomedical research abstracts. In Proc. of Bio-NLP 2006, 2006.
GR00. Alfonso Gerevini and Jochen Renz. Combining topological and size information for spatial reason-
ing. Artificial Intelligence, 137, 2002.
GSW05. Bernhard Ganter, Gerd Stumme, and Rudolf Wille. Formal Concept Analysis: Foundations and
Applications. Springer-Verlag, 2005.
HB06. Jeff Hawkins and Sandra Blakeslee. On Intelligence. Brown Walker, 2006.
HDY+12. Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly,
Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep
neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine,
2012.
HH07. Barbara Hammer and Pascal Hitzler, editors. Perspectives of Neural-Symbolic Integration. Studies
in Computational Intelligence, Vol. 77. Springer, 2007.
Hi189. Daniel Hillis. The Connection Machine. MIT Press, 1989.
HK02. David Harel and Yehuda Koren. Graph Drawing by High-Dimensional Embedding. 2002.
Hob78. J. Hobbs. Resolving pronoun references. Lingua, 44:311-338, 1978.
Hof79. Douglas Hofstadter. Gödel, Escher, Bach: An Eternal Golden Braid. Basic, 1979.
Hol75. J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.
Hud84. Richard Hudson. Word Grammar. Oxford: Blackwell, 1984.
Hud90. Richard Hudson. English Word Grammar. Blackwell Press, 1990.
Hud07a. Richard Hudson. Language Networks: The New Word Grammar. Oxford University Press, 2007.
Hud07b. Richard Hudson. Language Networks: The New Word Grammar. Oxford Linguistics, 2007.
Hut99. G. Hutton. A tutorial on the universality and expressiveness of fold. Journal of Functional
Programming, 1999.
Hut05a. Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Prob-
ability. Springer, 2005.
Hut05b. Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Prob-
ability. Springer, 2005.
HWP03. Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs in the presence of
isomorphism. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM),
pages 549-552. 2003.
Jac03. Ray Jackendoff. Foundations of Language: Brain, Meaning, Grammar, Evolution. Oxford Uni-
versity Press, 2003.
JL08. D. J. Jilk, C. Lebiere, R. C. O'Reilly, and J. R. Anderson. SAL: An explicitly
pluralistic cognitive architecture. Journal of Experimental and Theoretical Artificial Intelligence,
20:197-218, 2008.
Joh05. Mark Johnson. Developmental Cognitive Neuroscience. Wiley-Blackwell, 2005.
Jol10. I. T. Jolliffe. Principal Component Analysis. Springer, 2010.
KA95. J. R. Koza and D. Andre. Parallel genetic programming on a network of transputers. Technical
report, Stanford University, 1995.
KAR10. Tom Karnowski, Itamar Arel, and D. Rose. Deep spatiotemporal feature learning with application
to image classification. In The 9th International Conference on Machine Learning and Applications
(ICMLA'10), 2010.
KK01. Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In Proceedings of the
2001 IEEE International Conference on Data Mining, pages 313-320, 2001.
KM04. Dan Klein and Christopher D. Manning. Corpus-based induction of syntactic structure: Models of
dependency and constituency. In ACL '04 Proceedings of the 42nd Annual Meeting on Association
for Computational Linguistics, pages 479-486. Association for Computational Linguistics, 2004.
Koh01. Teuvo Kohonen. Self-Organizing Maps. Springer, 2001.
Koz92. J. R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural
Selection. MIT Press, 1992.
Koz94. J. R. Koza. Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press,
1994.
KSPC13. Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, Stephen Pulman, and Bob Coecke. Reasoning about
meaning in natural language with compact closed categories and Frobenius algebras. 2013.
Kur09. Yohei Kurata. 9-intersection calculi for spatial reasoning on the topological relations between
multi-domain objects. USA, June 2009.
Kur12. Ray Kurzweil. How to Create a Mind. Viking, 2012.
LA93. C Lebiere and J R Anderson. A connectionist implementation of the act-r production system. In
Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, 1993.
Lai12. John E Laird. The Soar Cognitive Architecture. MIT Press, 2012.
LBH10. Jens Lehmann, Sebastian Bader, and Pascal Hitzler. Extracting reduced logic programs from
artificial neural networks. Applied Intelligence, 2010.
LDA05. Ramón López-Cózar Delgado and Masahiro Araki. Spoken, Multilingual and Multimodal Dialogue
Systems: Development and Assessment. Wiley, 2005.
Lem10. B. Lemoine. NLGen2: a linguistically plausible, general purpose natural language generation system.
http://www.louisiana.edu/??ba12277/WLGen2, 2010.
Lev94. L. Levin. Randomness and nondeterminism. In The International Congress of Mathematicians,
1994.
LGE10. Ruiting Lian, Ben Goertzel, et al. Language generation via glocal similarity matching.
Neurocomputing, 2010.
LGK+12. Ruiting Lian, Ben Goertzel, Shujing Ke, Jade O'Neill, Keyvan Sadeghi, Simon Shin, Dingjie
Wang, Oliver Watkins, and Gino Yu. Syntax-semantic mapping for general intelligence: Language
comprehension as hypergraph homomorphism, language generation as constraint satisfaction. In
Artificial General Intelligence: Lecture Notes in Computer Science Volume 7716. Springer, 2012.
LKP+05. Sung Hee Lee, Junggon Kim, Frank Chongwoo Park, Munsang Kim, and James E. Bobrow.
Newton-type algorithms for dynamics-based robot movement optimization. IEEE Transactions
on Robotics, 21(4):657-667, 2005.
LLR09. Weiming Liu, Sanjiang Li, and Jochen Renz. Combining RCC-8 with qualitative direction calculi:
Algorithms and complexity. In IJCAI, 2009.
LMDK07. Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, and Walter Kintsch. Handbook of
Latent Semantic Analysis. Psychology Press, 2007.
LNOO. George Lakoff and Rafael Nunez. Where Mathematics Comes From. Basic Books, 2000.
Loo06. Moshe Looks. Competent Program Evolution. PhD Thesis, Computer Science Department, Wash-
ington University, 2006.
Loo07a. M. Looks. On the behavioral diversity of random programs. In Genetic and evolutionary compu-
tation conference, 2007.
Loo07b. M. Looks. Scalable estimation-of-distribution program evolution. In Genetic and evolutionary
computation conference, 2007.
Loo07c. Moshe Looks. Meta-optimizing semantic evolutionary search. In Hod Lipson, editor, Genetic and
Evolutionary Computation Conference, GECCO 2007, Proceedings, London, England, UK, July
7-11, 2007, page 626. ACM, 2007.
Low99. David Lowe. Object recognition from local scale-invariant features. In Proc. of the International
Conf. on Computer Vision, pages 1150-1157, 1999.
LP01. Dekang Lin and Patrick Pantel. DIRT: Discovery of inference rules from text. In Proceedings of
the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(KDD'01), pages 323-328. ACM Press, 2001.
LP02. W. B. Langdon and R. Poli. Foundations of Genetic Programming. Springer-Verlag, 2002.
Mai00. Monika Maidl. The common fragment of CTL and LTL. In IEEE Symposium on Foundations of
Computer Science, pages 643-652, 2000.
May04. M. T. Maybury. New Directions in Question Answering. MIT Press, 2004.
Mea07. E. M. Reiman et al. GAB2 alleles modify Alzheimer's risk in APOE e4 carriers. Neuron, 54(5),
2007.
Mih05. Rada Mihalcea. Unsupervised large-vocabulary word sense disambiguation with graph-based algo-
rithms for sequence data labeling. In HLT '05: Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language Processing, pages 411-418, Morristown,
NJ, USA, 2005. Association for Computational Linguistics.
Mih07. Rada Mihalcea. Word sense disambiguation. Encyclopedia of Machine Learning. Springer-Verlag,
2007.
Min88. Marvin Minsky. The Society of Mind. MIT Press, 1988.
Mit96. Steven Mithen. The Prehistory of the Mind. Thames and Hudson, 1996.
MS99. Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Pro-
cessing. MIT Press, 1999.
MTF04. Rada Mihalcea, Paul Tarau, and Elizabeth Figa. PageRank on semantic networks, with application
to word sense disambiguation. In COLING '04: Proceedings of the 20th international confer-
ence on Computational Linguistics, Morristown, NJ, USA, 2004. Association for Computational
Linguistics.
OCC90. Andrew Ortony, Gerald Clore, and Allan Collins. The Cognitive Structure of Emotions. Cambridge
University Press, 1990.
Ols95. J. R. Olsson. Inductive functional programming using incremental program transformation. Arti-
ficial Intelligence, 1995.
PAF00. H. Park, S. Amari, and K. Fukumizu. Adaptive natural gradient learning algorithms for various
stochastic models. Neural Networks, 13:755-764, 2000.
Pal04. Girish Keshav Palshikar. Fuzzy region connection calculus in finite discrete space domains. Appl.
Soft Comput., 4(1):13-23, 2004.
PP00. C. Papageorgiou and T. Poggio. A trainable system for object detection. Intl. J. Computer
Vision, 38(1), 2000.
PD09. Hoifung Poon and Pedro Domingos. Unsupervised semantic parsing. In Proceedings of the 2009
Conference on Empirical Methods in Natural Language Processing, pages 1-10, Singapore, August
2009. Association for Computational Linguistics.
Pei34. C. Peirce. Collected papers: Volume V. Pragmatism and pragmaticism. Harvard University Press,
Cambridge MA, 1934.
Pel05. Martin Pelikan. Hierarchical Bayesian Optimization Algorithm: Toward a New Generation of
Evolutionary Algorithms. Springer, 2005.
PJ88a. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann, 1988.
PJ88b. Steven Pinker and Jacques Mehler. Connections and Symbols. MIT Press, 1988.
Pro13. The Univalent Foundations Program. Homotopy Type Theory: Univalent Foundations of Mathe-
matics. Institute for Advanced Study, 2013.
RCC93. D. A. Randell, Z. Cui, and A. G. Cohn. A spatial logic based on regions and connection. 1993.
Rcs99. J. Reece. Genetic programming acquires solutions by combining top-down and bottom-up refine-
ment. In Foundations of Genetic Programming, 1999.
Row90. John Rowan. Subpersonalities: The People Inside Us. Routledge Press, 1990.
RVG05. Mike Ross, Linas Vepstas, and Ben Goertzel. RelEx semantic relationship extractor.
http://opencog.org/wiki/RelEx, 2005.
Sch06. J. Schmidhuber. Gödel machines: Fully self-referential optimal universal self-improvers. In B. Go-
ertzel and C. Pennachin, editors, Artificial General Intelligence, pages 119-226. 2006.
SDCCK08a. Steven Schockaert, Martine De Cock, Chris Cornelis, and Etienne E. Kerre. Fuzzy region con-
nection calculus: An interpretation based on closeness. Int. J. Approx. Reasoning, 48(1):332-347,
2008.
SDCCK08b. Steven Schockaert, Martine De Cock, Chris Cornelis, and Etienne E. Kerre. Fuzzy region connec-
tion calculus: Representing vague topological information. Int. J. Approx. Reasoning, 48(1):314-
331, 2008.
SM07. Ravi Sinha and Rada Mihalcea. Unsupervised graph-based word sense disambiguation using mea-
sures of word semantic similarity. In ICSC '07: Proceedings of the International Conference on
Semantic Computing, pages 363-369, Washington, DC, USA, 2007. IEEE Computer Society.
SM09. Ravi Sinha and Rada Mihalcea. Unsupervised graph-based word sense disambiguation. In Nicolas
Nicolov and Ruslan Mitkov, editors, Current Issues in Linguistic Theory: Recent Advances in
Natural Language Processing. John Benjamins, 2009.
SMI97. F.-R. Sinot, M. Fernández, and I. Mackie. Efficient reductions with director strings. Evolutionary
Computation, 1997.
SMK12. Jeremy Stober, Risto Miikkulainen, and Benjamin Kuipers. Learning geometry from sensorimo-
tor experience. In Proceedings of the First Joint Conference on Development and Learning and
Epigenetic Robotics, 2012.
Sol64a. Ray Solomonoff. A Formal Theory of Inductive Inference, Part I. Information and Control, 1964.
Sol64b. Ray Solomonoff. A Formal Theory of Inductive Inference, Part II. Information and Control, 1964.
Spe96. L. Spector. Simultaneous evolution of programs and their control structures. In Advances in
Genetic Programming 2. MIT Press, 1996.
SR04. Murray Shanahan and David A. Randell. A logic-based formulation of active visual perception. In
Knowledge Representation, 2004.
SS03. R. P. Salustowicz and J. Schmidhuber. Probabilistic incremental program evolution. Lecture Notes
in Computer Science vol. 2706, 2003.
ST91. Daniel Sleator and Davy Temperley. Parsing English with a link grammar. Technical report,
Carnegie Mellon University Computer Science technical report CMU-CS-91-196, 1991.
ST93. Daniel Sleator and Davy Temperley. Parsing English with a link grammar. Third International
Workshop on Parsing Technologies, 1993.
SV99. A. J. Storkey and R. Valabregue. The basins of attraction of a new hopfield learning rule. Neural
Networks, 12:869-876, 1999.
SW05. Reza Shadmehr and Steven P. Wise. The Computational Neurobiology of Reaching and Pointing
: A Foundation for Motor Learning. MIT Press, 2005.
SWM90. Timothy Starkweather, Darrell Whitley, and Keith Mathias. Optimization using distributed genetic
algorithms. In Parallel Problem Solving from Nature, edited by H. Schwefel and R. Manner, 1990.
SZ04. R. Sun and X. Zhang. Top-down versus bottom-up learning in cognitive skill acquisition. Cognitive
Systems Research, 5, 2004.
Tes59. Lucien Tesnière. Éléments de syntaxe structurale. Klincksieck, Paris, 1959.
Tom03. Michael Tomasello. Constructing a Language: A Usage-Based Theory of Language Acquisition.
2003.
TSH11. Mohamad Tarifi, Meera Sitharam, and Jeffery Ho. Learning hierarchical sparse representations
using iterative dictionary learning and dimension reduction. In Proc. of BICA 2011, 2011.
TVCC05. M. Tomassini, L. Vanneschi, P. Collard, and M. Clergue. A study of fitness distance correlation
as a difficulty measure in genetic programming. Evolutionary Computation, 2005.
VK94. T. Veale and M. T. Keane. Metaphor and Memory and Meaning in Sapper: A Hybrid Model
of Metaphor Interpretation. Proceedings of the workshop on Hybrid Connectionist Systems of
ECAI94, at the 11th European Conference on Artificial Intelligence, 1994.
VO07. Tony Veale and Diarmuid O'Donoghue. Computation and Blending. Cognitive Linguistics, 2007.
Wah06. Wolfgang Wahlster. SmartKom: Foundations of Multimodal Dialogue Systems. Springer, 2006.
Wan06. Pei Wang. Rigid Flexibility: The Logic of Intelligence. Springer, 2006.
WF05. Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, 2005.
Win95. Stephan Winter. Topological relations between discrete regions. In Advances in Spatial Databases,
4th International Symposium, SSD '95, pages 310-327. Springer, 1995.
Win00. Stephan Winter. Uncertain topological relations between imprecise regions. Journal of Geograph-
ical Information Science, 14(5):411-430, 2000.
WKB05. Nico Van de Weghe, Bart Kuijpers, and Peter Bogaert. A qualitative trajectory calculus and the
composition of its relations. In Proc. of GeoS, pages 60-76. Springer-Verlag, 2005.
Yan10. King-Yin Yan. A fuzzy-probabilistic calculus for vagueness. Unpublished manuscript, 2010.
Subm.
YKL+04. Sanghoon Yeo, Jinwook Kim, Sung Hee Lee, Frank Chongwoo Park, Wooram Park, Junggon Kim,
Changbeom Park, and Intaeck Yeo. A modular object-oriented framework for hierarchical multi-
resolution robot simulation. Robotica, 22(2):141-154, 2004.
Yur98. Denis Yuret. Discovery of Linguistic Relations Using Lexical Attraction. PhD thesis, MIT, 1998.
ZH10. Ruoyu Zou and Lawrence B. Holder. Frequent subgraph mining on a single large graph using
sampling techniques. In International Conference on Knowledge Discovery and Data Mining
archive. Proceedings of the Eighth Workshop on Mining and Learning with Graphs. Washington
DC, pages 171-178, 2010.
ZLLY08. Xiaotong Zhang, Weiming Liu, Sanjiang Li, and Mingsheng Ying. Reasoning with cardinal di-
rections: an efficient algorithm. In AAAI'08: Proc. of the 23rd national conference on Artificial
intelligence, pages 387-392. AAAI Press, 2008.
ZM06. Song-Chun Zhu and David Mumford. A stochastic grammar of images. Foundations and Trends
in Computer Graphics and Vision, 2006.