The Agent

Agent version 0.4


There are two main elements in the system: the environment and an agent which is in the environment.  This works similarly to a typical reinforcement learning system, where the agent receives observations from the environment and then performs actions which affect the environment in some way.  The difference is that there isn’t a reward channel.
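
A minimal sketch of that loop in Python may help make the missing reward channel concrete.  Everything here is hypothetical (the `ToyEnvironment` class, the action names, and the 1-D track are stand-ins, not the real environment or agent); the only point is that the policy sees observations and nothing else.

```python
from typing import Callable, Dict, List

Observation = Dict[str, float]
Action = str


class ToyEnvironment:
    """Hypothetical stand-in for the environment: a 1-D track."""

    def __init__(self) -> None:
        self.position = 0.0

    def observe(self) -> Observation:
        # The agent only ever receives sensor-style observations.
        return {"position": self.position}

    def apply(self, action: Action) -> None:
        # Actions change the environment; nothing is returned to the agent.
        if action == "forward":
            self.position += 1.0
        elif action == "backward":
            self.position -= 1.0


def run_loop(env: ToyEnvironment,
             policy: Callable[[Observation], Action],
             steps: int) -> List[Observation]:
    """Observe, act, repeat. Note there is no reward value anywhere."""
    history: List[Observation] = []
    for _ in range(steps):
        observation = env.observe()
        action = policy(observation)
        env.apply(action)
        history.append(env.observe())
    return history


if __name__ == "__main__":
    trace = run_loop(ToyEnvironment(), lambda obs: "forward", steps=3)
    print([o["position"] for o in trace])
```

Compared with a typical reinforcement learning step function, the `apply` call returns nothing, so any notion of progress has to come from the agent comparing observations against its own goal.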

For the purposes of this post, the term ‘agent’ will refer to the combination of the robot and the artificial intelligence (AI) core which controls the robot.


The robot is the component which exists physically (well, virtually) in the environment.  It has the sensors which collect observations about the environment and send them to the AI, and the actuators which do things in the environment.  The robot also has equipment which must be monitored and maintained, so in a sense it’s another piece of the environment to the AI.


The robot has several components and functions which provide a richer and more complex sensory environment to the AI.

  • E1 Battery:   Provides power to movement actuators.  Robot cannot move if depleted.  Can be recharged.
  • E2 Battery:   Provides power to the AI.  If this battery runs out, the AI will be put into a very undesirable quasi-standby mode.  Recharging hasn’t been worked out yet for this battery.
  • Internal Heat:   Equipment operation causes internal heat generation.  If the heat gets too high, the robot can’t use certain equipment.  Heat dissipates over time.
  • Data Port:   This is a 10-channel wireless communications port which can be connected to objects in the environment which have a data port interface, such as games and chargers.  It can be used to play data-based non-visual games such as Cart-pole or Tic-Tac-Toe, or to get information about objects, such as the status of a charger while charging.  The data port can only be connected when within 2 meters of a host data port, and has to be intentionally connected and disconnected by the agent.
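
The equipment rules above can be sketched roughly as follows.  This is only an illustration: the 2 meter data port range comes from the list, but the class name, the normalized 0 to 1 scales, and every rate and threshold constant are assumptions made up for the example, and I treat "within 2 meters" as inclusive.

```python
import math
from dataclasses import dataclass
from typing import Sequence

DATA_PORT_RANGE_M = 2.0  # stated above: must be within 2 meters of a host port
MOVE_DRAIN = 0.01        # assumed: E1 charge used per moving step
MOVE_HEAT = 0.05         # assumed: heat added per moving step
HEAT_DECAY = 0.02        # assumed: heat dissipated per step
HEAT_LIMIT = 1.0         # assumed: level at which some equipment locks out


@dataclass
class Equipment:
    """Hypothetical model of the robot's equipment state."""
    e1_charge: float = 1.0       # assumed normalized, 1.0 = full
    heat: float = 0.0
    port_connected: bool = False

    def can_move(self) -> bool:
        # The robot cannot move once the E1 battery is depleted.
        return self.e1_charge > 0.0

    def overheated(self) -> bool:
        return self.heat >= HEAT_LIMIT

    def step(self, moving: bool) -> None:
        """Advance one update: drain and heat if moving, then dissipate."""
        if moving and self.can_move():
            self.e1_charge = max(0.0, self.e1_charge - MOVE_DRAIN)
            self.heat += MOVE_HEAT
        self.heat = max(0.0, self.heat - HEAT_DECAY)

    def connect_port(self, robot_pos: Sequence[float],
                     host_pos: Sequence[float]) -> bool:
        """Connect only on explicit request, and only when in range."""
        if math.dist(robot_pos, host_pos) <= DATA_PORT_RANGE_M:
            self.port_connected = True
        return self.port_connected

    def disconnect_port(self) -> None:
        self.port_connected = False
```

Which equipment is disabled by high heat isn't specified above, so the sketch only exposes an `overheated()` check rather than guessing at the lockout behavior.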


The sensors send back data that the robot collects from the environment.   This data is the only form of observation the AI receives, so the agent only knows about things it can sense locally, and only in the modes listed below.  It does not have full information about the environment.  The sensors only provide information which could be reasonably collected by any modern real sensor.

  • 3-channel color camera (240 x 320 pixels)
  • 1-channel depth camera (data is similar to LIDAR)
  • Camera gimbal bearing and azimuth angle sensors (analog)
  • 3-axis GPS (analog)
  • Compass (analog)
  • Collision Sensors (not implemented yet)
  • Battery meters for the E1 and E2 batteries (analog)
  • Internal temperature sensor (analog)
  • Data port connected indicator (two-state)
  • 10-channel data port (analog)
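
One way to picture a single observation is as a plain record with one field per sensor.  This is a sketch, not the actual message format: the field names are invented, I'm assuming 240 x 320 means height x width, and the depth camera resolution isn't stated, so its length is left unchecked.  Collision sensors are omitted because they aren't implemented yet.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

RGB_HEIGHT, RGB_WIDTH, RGB_CHANNELS = 240, 320, 3  # assumed height x width
DATA_PORT_CHANNELS = 10


@dataclass
class Observation:
    """Hypothetical container for one update's worth of sensor data."""
    rgb: bytes                        # flattened color frame, 1 byte per value
    depth: List[float]                # depth frame; resolution not stated
    gimbal: Tuple[float, float]       # two camera gimbal angles
    gps: Tuple[float, float, float]   # 3-axis position
    compass: float
    e1_level: float
    e2_level: float
    temperature: float
    port_connected: bool              # the two-state indicator
    port_channels: List[float] = field(
        default_factory=lambda: [0.0] * DATA_PORT_CHANNELS)

    def validate(self) -> None:
        """Check the sizes that the sensor list pins down."""
        expected = RGB_HEIGHT * RGB_WIDTH * RGB_CHANNELS
        if len(self.rgb) != expected:
            raise ValueError(f"rgb must hold {expected} values")
        if len(self.port_channels) != DATA_PORT_CHANNELS:
            raise ValueError("data port must have 10 channels")
```

Nothing in this record describes the world directly (no object list, no map), which matches the point that the agent only knows what it can sense locally.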


Actuators cause something to occur in the environment, either to the robot itself or to another object, or to both.

  • Move forwards, backwards
  • Strafe left, right
  • Rotate clockwise, counter-clockwise
  • Look up/down/left/right
  • Fire Laser
  • Connect/disconnect the data port
  • Button 1, button 2, button 3
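
As a rough illustration, an actuator command could be a map from each actuator to a level.  The enum names and the `make_command` helper are hypothetical; the 0.0 to 1.0 range comes from the limitations section later in this post, which notes that commands are continuous in principle.

```python
from enum import Enum
from typing import Dict


class Actuator(Enum):
    """Hypothetical names for the actuators listed above."""
    MOVE_FORWARD = "move_forward"
    MOVE_BACKWARD = "move_backward"
    STRAFE_LEFT = "strafe_left"
    STRAFE_RIGHT = "strafe_right"
    ROTATE_CW = "rotate_cw"
    ROTATE_CCW = "rotate_ccw"
    LOOK_UP = "look_up"
    LOOK_DOWN = "look_down"
    LOOK_LEFT = "look_left"
    LOOK_RIGHT = "look_right"
    FIRE_LASER = "fire_laser"
    PORT_CONNECT = "port_connect"
    PORT_DISCONNECT = "port_disconnect"
    BUTTON_1 = "button_1"
    BUTTON_2 = "button_2"
    BUTTON_3 = "button_3"


def make_command(levels: Dict[Actuator, float]) -> Dict[Actuator, float]:
    """Build a full command: unspecified actuators are 0.0, values are
    clamped to the 0.0 to 1.0 range."""
    command = {actuator: 0.0 for actuator in Actuator}
    for actuator, level in levels.items():
        command[actuator] = min(1.0, max(0.0, level))
    return command


if __name__ == "__main__":
    cmd = make_command({Actuator.MOVE_FORWARD: 1.0})
    print(cmd[Actuator.MOVE_FORWARD], cmd[Actuator.FIRE_LASER])
```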

Artificial Intelligence Core

The AI core is the part that provides the intelligence to the agent.  It receives environmental sensor data from the robot, processes and learns from the information, decides how to best meet its goal, then returns actuator commands to the robot in order to interact with the environment.

At initialization, the AI knows nothing about itself or the environment.  It learns knowledge from what it observes as it tries different available actions to meet goals.  It then uses that knowledge to improve its ability to meet goals.  This is a constant cycle, as the agent is constantly learning new things as it interacts in the environment.  The agent’s ‘brain’ grows and develops over time through this process.

The very first goals are simple, and are given so that the agent will learn basic things about itself and the environment.  Once it understands how to ‘drive’ itself – what actions have what effect – it uses those skills to interact with objects in the world.  It uses new knowledge from these interactions to interact with other objects, and so on.


The AI architecture is unique, as far as I know.  It turned out to be vaguely similar to a classic production system, at least at a high level.

  • Hybrid cognitive architecture
    • Loosely ‘Symbol’-based.  Sort of.
    • Uses an associative analog memory and a discrete world model to represent the world and its knowledge about the world
    • An n-dimensional decision engine plans and decides what actions to take
    • Bottom-up approach
  • Goal-driven
  • Non-episodic:   From startup to shutdown is one episode.  This is unlike many learning algorithms which require multiple epochs in order to learn.  The environment steps continuously through time, so the agent has to learn quickly when things happen because there’s no guarantee of a repeat.
  • Network client:  Connects to the Environment Server

Since the world is very large, the agent can’t be reasonably expected to learn many useful things simply by letting it loose to do random actions until something sticks.  Therefore, training is accomplished with a variety of methods:

  • Curriculum Training:  The agent is given a list of goals to accomplish.  The list is designed to lead the agent to learning opportunities.  The goals don’t do the actual training.
  • Other types of training are still on the whiteboard.
  • Visual object detection:  The agent receives most of its information about the world visually.
  • Can work with partial information:  The agent can’t know everything about the environment.  Not only is the environment too big, but the agent isn’t given any prior information at startup and the environment gives it no information directly, so it has to learn about the environment as it experiences it.  All knowledge about the environment is learned from sensor data.
  • Can deal with a stochastic environment:  The action taken works as expected, unless it doesn’t.  The agent learns something when things go as planned as well as when they don’t.
  • Can deal with a continuous world:  The environment is continuous, so the agent can’t represent it like a chess board or a big matrix internally – there’s too much stuff there.
  • Efficient with data:  It doesn’t need thousands or tens of thousands of samples to learn something.  It often learns from a handful of samples.
  • Transparent:  It’s fairly straightforward to get information out of the agent in order to understand why it made any given decision.
  • Flexible:  It’s not built to deal with any particular thing.  For example, the first time it encounters a wall, it has to figure out how to deal with this new thing in the world that it didn’t know about.  So it learns to go around it to get to a destination on the other side.  When it needs to charge its battery, it has to learn that it has to get close to a charger, connect the data port, then hit the ‘charge’ button.

It’s early in development, so there are an uncountable number of limitations with respect to the final goal, but here are some of the current limits which are in scope for revision 0.5:

  • Robot actuator commands (actions) are set up to be continuous (from 0.0 to 1.0), but currently the agent is only allowed to use them as if they were discrete (0 or 1).  This can cause the agent to overshoot a goal state.
  • The environment is in 3 dimensions, but the agent only processes x and z and disregards y (height).  This is to reduce compute requirements, but the consequence is that it can’t detect, for example, an enclosed space with a ceiling – it will interpret the space as a solid surface.
  • The agent doesn’t look beyond the last update cycle for causation.  This works okay for simple tasks and interactions, but causes problems as things get more complex.
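
The overshoot from the first limitation is easy to reproduce in a toy setting.  The sketch below is not the agent's controller; it's a hypothetical 1-D approach where the robot moves `level * max_step` per update, with the target distance and step size picked arbitrarily for the example.

```python
def approach(target: float, max_step: float, continuous: bool,
             max_updates: int = 100) -> float:
    """Drive toward `target` from 0.0 and return the final position.

    With continuous=False the command level is always 1 (full step),
    mimicking the current discrete use of the actuators.
    """
    position = 0.0
    for _ in range(max_updates):
        remaining = target - position
        if remaining <= 0.0:
            break  # reached or passed the goal state
        if continuous:
            # Scale the command so the last step lands on the target.
            level = min(1.0, remaining / max_step)
        else:
            level = 1.0
        position += level * max_step
    return position


if __name__ == "__main__":
    print(approach(2.5, 1.0, continuous=False))  # 3.0, overshoots by 0.5
    print(approach(2.5, 1.0, continuous=True))   # 2.5, lands on the target
```

With all-or-nothing commands the agent can only stop on multiples of the step size, so any goal state between them is either undershot or overshot.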

Plans for Version 0.5:

  • Recognition of detected objects
  • Ability to learn object hierarchical relationships
  • Ability to learn and predict simple behaviors of other objects
  • Update actions to have continuous values, and allow multiple actions at once
  • Ability to derive its own higher-level actions using the base set
  • Update the whole architecture to handle n-dimensions
  • Ability to recognize groups of actions as a task


  • C# for the robot (CPU only)
  • Python for the AI (CPU only)