The Environment


The environment is a simplified ‘reality simulator’ running in the Unity 3D game engine.

  • Simulates physics and object collisions
  • Provides cameras for robot vision
  • Environments can easily be swapped to focus on specific aspects of the agent
  • Behaviors and functionality can be programmed into objects
  • Unlimited world size – the world can be as big as the agent can explore
  • Level of detail can be tailored to the agent’s needs
  • The Unity machine learning agents add-on provides a Python API
  • Partially (vs. Fully) Observable:   While this is also a function of the robot’s sensor limits, the world is a very large place, and the agent does not have access to all of the information in the environment as a matter of course (as opposed to a chess-playing agent, which knows the entire board state at all times).  In fact, making an observation is entirely up to the agent; the environment doesn’t care about any agents within it.
  • Stochastic (vs. Deterministic):   Many of the early test environments have been built as static (as in not-moving) single-agent environments, but they are not entirely deterministic.  The agent can’t predict the exact outcome of all actions all the time.  As a simple example, the first time the agent runs into a wall, the outcome of moving forward is completely unexpected.  The intent is to make the environment more dynamic (things moving around) and more stochastic as it evolves.
  • Sequential (vs. Episodic):   The general environment is sequential; once it starts, it keeps going.  There can be episodic elements, however: games that the agent can play through the data port (such as Cart-pole or Tic-Tac-Toe) are mostly episodic in nature.
  • Static (vs. Dynamic):   If the environment can change while an agent is deliberating, the environment is said to be dynamic for that agent; otherwise, it is static.  This environment is technically static, since the agent is allowed one full observation/action cycle per time step of the Unity physics engine.  The agent is tightly coupled to this cycle for the time being, though it will likely be decoupled in the future.  The tight coupling causes the environment frame rate to vary with how much the agent has to think each frame (annoying), but it allows development to move forward.
  • Continuous (vs. Discrete):   States in the environment sweep through a range of continuous values and do so smoothly over time.  For example, the agent or any object can be at location (0,0,0), or (1, 1, 1), or any location between those two points, up to the floating point limit of the machine.  States observed through the data port are allowed to be discrete with a limited number of values, such as two-state indicators.
  • Single Agent (vs. Multi-agent):   The environment is capable of having any number of agents, but most scenarios so far have only included a single agent.  Having multiple agents in the environment will certainly happen in the future, including one or more humans via virtual reality.
  • 3-Dimensional (vs. 2-Dimensional):   The environment is three-dimensional.  An example of a two-dimensional environment would be a board game such as chess or checkers.
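As a rough illustration of the coupling described above, the observe/act cycle can be sketched as a lock-step loop.  All class and method names here are hypothetical stand-ins, not the project’s actual code; the real connection goes through the Unity ML-Agents Python API:

```python
import random

class SimEnv:
    """Hypothetical stand-in for the Unity environment connection."""
    def __init__(self):
        self.t = 0  # physics time step counter

    def step(self, action):
        # Advance one physics time step and return a *partial* observation;
        # the environment itself doesn't track or care about the agent.
        self.t += 1
        return {
            "camera": [[0.0] * 4 for _ in range(4)],  # tiny placeholder image
            "gps": (random.random(), 0.0, random.random()),
        }

class Agent:
    def act(self, obs):
        # Choose from the discrete action set using only local observations.
        return "move_forward"

env, agent = SimEnv(), Agent()
obs = env.step(None)       # initial observation
for _ in range(10):        # one full observe/act cycle per physics step
    obs = env.step(agent.act(obs))
```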


  • Unity 3D game engine with the Machine Learning plugin (CPU and GPU)
  • C# for the robot, sensors, and game object controllers (CPU)


The Agent

Agent version 0.4


There are two main elements in the system: the environment and an agent within it.  This works similarly to a typical reinforcement learning system, where the agent receives observations from the environment and then performs actions which affect the environment in some way.  The difference is that there isn’t a reward channel.

For the purposes of this post, the term ‘agent’ will refer to the combination of the robot and the artificial intelligence (AI) core which controls the robot.


This is the component which exists physically (well, virtually) in the environment.  The robot has the sensors which collect and send observations to the AI about the environment, and actuators which do things in the environment.  The robot also has equipment which must be monitored and maintained, so in a sense it’s another piece of the environment to the AI.


The robot has several components and functions which provide a richer and more complex sensory environment to the AI.

  • E1 Battery:   Provides power to the movement actuators.  The robot cannot move if it is depleted.  Can be recharged.
  • E2 Battery:   Provides power to the AI.  If this battery runs out, the AI is put into a very undesirable quasi-standby mode.  Recharging hasn’t been worked out yet for this battery.
  • Internal Heat:   Equipment operation causes internal heat generation.  If the heat gets too high, the robot can’t use certain equipment.  Heat dissipates over time.
  • Data Port:   This is a 10-channel wireless communications port which can be connected to objects in the environment that have a data port interface, such as games and chargers.  It can be used for activities such as playing non-visual, data-based games (Cart-pole, Tic-Tac-Toe, etc.) or getting information about objects, such as the status of a charger while charging.  The data port can only be connected within 2 meters of a host data port, and it has to be intentionally connected and disconnected by the agent.
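The 2-meter range rule could be checked along these lines.  This is an illustrative sketch only; the constants and function names are assumptions, not the project’s actual code:

```python
import math

MAX_RANGE_M = 2.0  # data port must be within 2 meters of a host port

def can_connect(robot_xyz, host_xyz, max_range=MAX_RANGE_M):
    """Connection is allowed only inside the range limit; the agent
    must still issue an explicit connect action on top of this."""
    return math.dist(robot_xyz, host_xyz) <= max_range

print(can_connect((0, 0, 0), (1.5, 0, 0)))  # True: within 2 m
print(can_connect((0, 0, 0), (0, 0, 2.5)))  # False: out of range
```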


The sensors send back data that the robot collects from the environment.   This data is the only form of observation the AI receives, so the agent only knows about things it can sense locally, and only in the modes listed below.  It does not have full information about the environment.  The sensors only provide information which could be reasonably collected by any modern real sensor.

  • 3-channel color camera (240 x 320 pixels)
  • 1-channel depth camera (data is similar to LIDAR)
  • Camera gimbal bearing and azimuth angle sensors (analog)
  • 3-axis GPS (analog)
  • Compass (analog)
  • Collision Sensors (not implemented yet)
  • Battery meters for the E1 and E2 batteries (analog)
  • Internal temperature sensor (analog)
  • Data port connected indicator (two-state)
  • 10-channel data port (analog)
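The sensor suite above could be bundled into one record per time step.  This is a sketch only; the field names and shapes are assumptions, not the project’s actual data layout:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Observation:
    """One time step of sensor data sent from the robot to the AI."""
    camera_rgb: List[List[Tuple[float, float, float]]]  # 240 x 320, 3 channels
    camera_depth: List[List[float]]                     # LIDAR-like depth data
    gimbal: Tuple[float, float]                         # bearing, azimuth angles
    gps: Tuple[float, float, float]                     # x, y, z position
    compass: float
    e1_level: float                                     # E1 battery meter
    e2_level: float                                     # E2 battery meter
    internal_temp: float
    data_port_connected: bool                           # two-state indicator
    data_port: List[float] = field(default_factory=lambda: [0.0] * 10)
```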


Actuators cause something to occur in the environment, either to the robot itself or to another object, or to both.

  • Move forwards, backwards
  • Strafe left, right
  • Rotate clockwise, counter-clockwise
  • Look up/down/left/right
  • Fire Laser
  • Connect/disconnect the data port
  • Button 1, button 2, button 3
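The base action set could be enumerated along these lines (names are illustrative; the actual command interface is the C# robot controller):

```python
from enum import Enum, auto

class Action(Enum):
    """The robot's base actuator commands (names illustrative)."""
    MOVE_FORWARD = auto()
    MOVE_BACKWARD = auto()
    STRAFE_LEFT = auto()
    STRAFE_RIGHT = auto()
    ROTATE_CW = auto()
    ROTATE_CCW = auto()
    LOOK_UP = auto()
    LOOK_DOWN = auto()
    LOOK_LEFT = auto()
    LOOK_RIGHT = auto()
    FIRE_LASER = auto()
    DATA_PORT_CONNECT = auto()
    DATA_PORT_DISCONNECT = auto()
    BUTTON_1 = auto()
    BUTTON_2 = auto()
    BUTTON_3 = auto()
```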

Artificial Intelligence Core

The AI core is the part that provides the intelligence to the agent.  It receives environmental sensor data from the robot, processes and learns from the information, decides how to best meet its goal, then returns actuator commands to the robot in order to interact with the environment.

It starts life knowing nothing about itself or the environment, and builds knowledge from what it observes as it tries the available actions to meet goals.  It then uses that knowledge to improve its ability to meet goals.  This is a constant cycle: the agent is always learning new things as it interacts with the environment.  The agent’s ‘brain’ grows and develops over time through this process.

The very first goals are simple, and are given so that the agent will learn basic things about itself and the environment.  Once it understands how to ‘drive’ itself – what actions have what effect – it uses those skills to interact with objects in the world.  It uses new knowledge from these interactions to interact with other objects, and so on.

  • Hybrid cognitive architecture
    • Symbol-based, but uses many neural networks
    • Bottom-up approach
  • Goal-driven
  • Non-episodic:   From startup to shutdown is one episode.  This is unlike many learning algorithms which require multiple epochs in order to learn.  The environment steps continuously through time, so the agent has to learn quickly when things happen because there’s no guarantee of a repeat.
  • Time dilation:  Not time dilation exactly, but the agent experiences one fixed time step in the environment per internal cycle.  So when the agent has something complex to figure out, it’s as if it is thinking faster than when it has something simple to figure out.  This will change in a future revision so that the environment runs in real time and the agent has to keep up.  (VR is not fun at 3 frames per second.)
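The fixed-step coupling can be sketched as follows: simulated time advances the same amount per cycle no matter how long the agent deliberates, so only the wall-clock frame rate varies.  The constants and function here are hypothetical:

```python
FIXED_DT = 0.02  # one fixed physics time step (simulated seconds); value assumed

def run(cycles, think):
    """Advance simulated time one fixed step per agent cycle; the wall-clock
    cost depends on how long the agent 'thinks' each cycle."""
    sim_time, wall_time = 0.0, 0.0
    for _ in range(cycles):
        wall_time += think()   # deliberation cost varies per cycle
        sim_time += FIXED_DT   # the environment always advances one step
    return sim_time, wall_time

# A slow thinker and a fast thinker cover the same simulated time span,
# but the slow thinker drags the environment frame rate down:
slow = run(100, lambda: 0.3)
fast = run(100, lambda: 0.01)
```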

Since the world is very large, the agent can’t be reasonably expected to learn many useful things simply by letting it loose to do random actions until something sticks.  Therefore, training is accomplished with a variety of methods:

  • Curriculum Training:  The agent is given a list of goals to accomplish.  The list is designed to lead the agent to learning opportunities.  The goals don’t do the actual training.
  • Visual object detection:  The agent receives most of its information about the world visually.
  • Can work with partial information:  The agent can’t know everything about the environment.  Not only is the environment too big, but the agent also has to learn about it as it experiences it, because it isn’t given any prior information at startup and the environment gives it no information directly.  All knowledge about the environment is learned from sensor data.
  • Can deal with a stochastic environment:  The action taken works as expected, unless it doesn’t.  The agent learns something when things go as planned as well as when they don’t.
  • Can deal with a continuous world:  The environment is continuous, so the agent can’t represent it like a chess board or a big matrix internally – there’s too much stuff there.
  • Efficient with data:  It doesn’t need thousands or tens of thousands of samples to learn something.  It often learns from a handful of samples.
  • Transparent:  It’s fairly straightforward to get information out of the agent in order to understand why it made any given decision.
  • Flexible:  It’s not built to deal with any particular thing.  For example, the first time it encounters a wall, it has to figure out how to deal with this new thing in the world that it didn’t know about, so it learns to go around the wall to get to a destination on the other side.  When it needs to charge its battery, it has to learn to get close to a charger, connect the data port, and then hit the ‘charge’ button.

It’s early in development, so there are an uncountable number of limitations with respect to the final goal, but here are some of the current limits which are in scope for revision 0.4:

  • Robot actuator commands (actions) are set up to be continuous (from 0.0 to 1.0), but currently the agent is only allowed to use them as if they were discrete (0 or 1).  This can cause the agent to overshoot a goal state.
  • The environment is in 3 dimensions, but the agent only processes x and z and disregards y (height).  This is to reduce compute requirements, but the consequence is that it can’t detect, for example, an enclosed space with a ceiling – it will interpret the space as solid.
  • The agent doesn’t look beyond the last update cycle for causation.  This works okay for simple tasks and interactions, but causes problems as things get more complex, such as playing the cart-pole game.
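The first limitation above (continuous commands used as if discrete) can be illustrated with a toy one-dimensional example; all values here are hypothetical:

```python
def discretize(command):
    """v0.4 restriction: a continuous command in [0.0, 1.0] is treated
    as all-or-nothing (0 or 1)."""
    return 1.0 if command >= 0.5 else 0.0

# With full-throttle-only movement, a nearby goal state gets overshot:
pos, goal, step_size = 0.0, 0.3, 1.0
pos += discretize(0.9) * step_size  # a proportional 0.3 would have landed on it
```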

Plans for Version 0.5:
  • Recognition of detected objects
  • Ability to learn object hierarchical relationships
  • Ability to learn and predict simple object behaviors
  • Update actions to have continuous values, and allow multiple actions at once
  • Ability to derive its own higher-level actions using the base set
  • Update the whole architecture to handle n-dimensions
  • Ability to recognize groups of actions as a task


  • C# for the robot (CPU only)
  • Python for the AI (CPU only)

rlpAI Artificial Intelligence Project

Last Updated: 5/14/2018
Status: In Progress


Develop an Artificial General Intelligence (AGI) agent which can learn to function in a continuous real-time environment through various means of training using a general architecture which is not specific to any task, ability, or robot chassis implementation.


The agent should be able to make sense of itself and its environment quickly – similar to a human or animal.

The agent should be able to use relevant knowledge learned from prior experiences in order to solve new problems.

The agent should be able to learn how to accomplish tasks which are fundamentally different from each other, and retain the skills learned.

The agent should be able to learn how to communicate with a human or another agent in the environment.

General Description

The agent is general, and is designed to be able to learn to function in any arbitrary and unknown environment.   In other words, the agent code is independent of the environment.  It doesn’t have any prior knowledge of the world or itself at startup.

The agent learns from experiencing the world as it finds ways to meet objectives.   It determines the best actions to take based on what has been learned.  It must learn everything it needs to know to operate.

Since the world is open and continuous, the agent is free to take any one of an infinite number of paths to anywhere.  Therefore, curriculum training using objectives is used to guide the agent to learning opportunities where it can discover and learn new things.
