Real-Time Environment Server

The environment just had a major functional overhaul – it’s now an Environment Server which runs asynchronously from the agents, in real time at 60 frames per second. So, much better! This is the first major update toward v0.5.

The environment had used the Unity ML interface throughout the v0.3 -> v0.4 updates to connect the Python AI code to the robot in the environment, which worked great. The system outgrew the intent of that interface, though, so it needed to be completely replaced with custom server code.

Agent / Environment Architecture

The environment runs independently of any agents at 60 fps, instead of the ~3 fps that was starting to become common. When an agent is started, it connects to the environment server as a client and gracefully exits when it’s done, terminated, or trips over a bug in the code. The agent now communicates actions and observations asynchronously instead of synchronously, which is how a real-world system would operate anyway. The server can handle three or four agents simultaneously before the hardware is maxed out and frame rates start to drop significantly.

A flexible communication protocol allows vastly improved control over information flow between the environment and the agent, which means more useful information can be displayed graphically.

The consequence for the agent is that it must think quickly enough to function, since the world keeps going while it’s processing. The current version of the aiCore operates at about 10 Hz on average. A little faster would be nice, but 10 Hz is reasonable and functional for basic slow-moving tasks. Multiprocessing, the use of Cython in the aiCore, and other speed optimizations are likely in the near future to help improve performance.
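The shape of that asynchronous loop can be sketched as follows. This is a hypothetical illustration, not the project's actual protocol: the queue-fed observation thread stands in for the network connection to the environment server, and the agent always processes only the newest frame, dropping any it couldn't keep up with.

```python
# Hypothetical sketch of an asynchronous agent loop: the environment keeps
# running at 60 fps, so the agent drains stale frames and thinks only about
# the most recent observation. The observation format is invented.
import queue
import threading
import time

obs_queue = queue.Queue()

def feed_observations(n_frames=30):
    """Stand-in for the network thread receiving ~60 fps server updates."""
    for frame in range(n_frames):
        obs_queue.put({"frame": frame})
        time.sleep(1 / 60)

def latest_observation():
    """Drain the queue so the agent always reasons about the newest state."""
    obs = obs_queue.get()              # block until at least one frame arrives
    while not obs_queue.empty():
        obs = obs_queue.get_nowait()   # drop stale frames the agent missed
    return obs

feeder = threading.Thread(target=feed_observations)
feeder.start()

seen = []
for _ in range(3):                     # the agent 'thinks' at roughly 10 Hz
    obs = latest_observation()
    seen.append(obs["frame"])
    time.sleep(0.1)                    # simulated 100 ms of thinking

feeder.join()
```

Because the world does not pause, a slower agent simply sees fewer of the frames, which is exactly the real-world constraint described above.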

Other benefits include:

  • Nice frame rate for using virtual reality
  • Videos of the agents will have less stuttering
  • The code/test/code/test cycle is much faster
  • Allows agent to agent interaction

rlpAI v0.4

Version 0.4 is a significant advancement from the version 0.3 agent. It combines the cumulative updates from the v0.3 branch including visual object detection, n-dimensional planning, and learning multi-step actions, in addition to some major code cleanup and execution speed improvements.

The system architecture (see below) is completely different from v0.3, which was a simple 2D Python class that contained and managed all the world objects. The agent is now a robot combined with an AI core – analogous to a real-world robot containing a computer which runs the AI. The 3D world, created using the Unity editor, provides a rich learning environment for the agent.


  • The AI operates a robot in a 3D Unity environment
  • Ability to visually detect objects and model the environment
  • Ability to learn by interacting with the environment
  • Ability to plan a series of actions in multiple dimensions to meet an objective

System Architecture

The Unity3D environment is synchronously connected to the Python aiCore ‘brain’ through the Unity ML interface. At initialization, the environment sends an initial observation from the robot to the aiCore. The aiCore processes the observation and sends a set of action commands back to the robot to execute. The environment then processes five frames (about 80 ms of simulated time) while the robot executes the action commands, then sends an updated robot observation back to the aiCore. This cycle continues indefinitely.
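That observe/act cycle can be sketched in a few lines. The frame count and timestep mirror the text (5 frames at ~16 ms each, about 80 ms of simulated time); the observation, kinematics, and aiCore decision are invented placeholders, not the real interface.

```python
# Minimal sketch of the synchronous v0.4 cycle: act, advance 5 frames,
# observe, repeat. Constants follow the post; the physics is a placeholder.
FRAMES_PER_STEP = 5
DT = 0.016                     # one simulated frame at ~60 fps, in seconds

def env_step(state, action, frames=FRAMES_PER_STEP):
    """Advance the environment while the robot executes the last command."""
    x, elapsed = state
    for _ in range(frames):
        if action == "forward":
            x += 1.0 * DT      # placeholder kinematics: 1 m/s forward
        elapsed += DT
    return (x, elapsed)

def ai_core(observation):
    """Stand-in for the aiCore: always drives forward in this sketch."""
    return "forward"

state = (0.0, 0.0)             # (position, simulated time)
for _ in range(10):            # ten observe/act cycles
    action = ai_core(state)
    state = env_step(state, action)

position, sim_time = state     # 50 frames: 0.8 s simulated, 0.8 m traveled
```

Note that in this scheme the aiCore's own computation consumes no simulated time at all, which is precisely the limitation listed under future improvements below.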

The robot can interact with other objects in the environment which have a data port installed, such as the battery charger. This is not unlike R2D2 connecting to the Death Star to stop the trash compactor. Use of the data port was demonstrated in the Multi-step Planning post video.

Future System Improvements:

  • Synchronous operation with the environment (per time step) causes the environment to operate only as fast as the agent can process a state update – usually about 3 frames per second. It also means the agent takes zero time to think, with respect to the passage of time in the environment. This doesn’t represent how the real world works. Ideally, the environment and agent run asynchronously, and the agent has to keep up with the world.

Visual Object Detection

The agent now can visually detect objects and add them to its internal model of the world.  All the environmental hints pushed to the agent at startup have been removed, so it has to visually discover everything on its own.

In the video below, the agent demonstrates using its new eyes to visually navigate around obstacles and to find a waypoint which is hidden out of sight.  It starts knowing nothing as usual, and quickly learns it can’t go through walls.  It then searches for the hidden waypoint and learns that it’s at the end of a U-shaped hallway outside of the room.  It then goes back to the starting point and repeats, this time much more quickly since it knows about the walls now.

The visual subsystem uses image streams from the color camera and the depth camera on the agent robot to detect objects, though it doesn’t need both for basic detection.  The depth data is roughly similar to a LiDAR system.

When an object is detected, the visual subsystem calculates a bounding box (above) and passes it to the environmental model, which then tracks the object as a surface point cloud.   The images below show the agent’s environmental model compared to the actual environment.
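The post doesn't give the camera details, but the depth-image-to-point-cloud step can be sketched with a standard pinhole back-projection. The intrinsics (focal lengths, principal point) and the toy depth image here are invented for illustration.

```python
# Hedged sketch: back-project a depth image into a camera-frame point cloud
# using a pinhole model. All camera parameters are made up.
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project each depth pixel (meters) to a 3D camera-frame point."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx      # horizontal offset from the principal point
    y = (v - cy) * z / fy      # vertical offset from the principal point
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy 4x4 depth image: a flat wall 2 m in front of the camera.
depth = np.full((4, 4), 2.0)
points = depth_to_points(depth, fx=10.0, fy=10.0, cx=1.5, cy=1.5)
```

Points from surfaces like walls come out as planar slabs, which matches the surface point-cloud tracking described above.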

Object Detection vs. Recognition

Before being able to see, the agent knew that a charger, for example, was distinct from a wall.  It could navigate to the charger and learn to charge.  Now that the environmental hints are gone, it doesn’t know that a charger is a charger, so it has a harder time learning to charge because it can’t visually differentiate between a wall and a charger – it detects the charger but can’t recognize it (yet).

Detecting objects is a precursor to recognizing objects.  Because the design requirement is that the agent has to learn everything on its own, it can’t use a pre-trained convolutional neural network (CNN) to recognize objects (because it would be impossible to pre-train a CNN on every possible object the agent could ever see).  That means it will have to train itself, which means it needs to be able to detect an object before it can learn to recognize it, which is planned for version 0.5.  So for now, it knows an object is there and it knows how to navigate around it, but it doesn’t know what the object is exactly.

And so it goes: as new functionality is added to replace old duct tape, the AI is required to be smarter in order to deal with its new self.  Spiral development.



Why Lasers?

If you’ve watched any of the videos of the agent going around doing things, you’ve seen it shoot lasers randomly on occasion.  You might be wondering if we really want robots all around us with frickin’ lasers on their heads.  Fair enough.  Here are a couple of reasons:

From a technical perspective, it adds complexity to the agent’s experience in the environment.  Firing lasers rapidly adds heat to the bot, which then cools down slowly.   Lasers also drain the battery rapidly.  The agent now has to figure out whether any of those changing states were causal to anything it’s intending to accomplish.  It forces complexity into the scenario, which is good for learning.
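Those coupled dynamics can be sketched with a toy state update. All the constants here are invented; the post only says heating and drain are fast while cooling is slow.

```python
# Toy sketch of the laser side effects: firing adds heat and drains the
# battery quickly, while heat decays slowly between shots. Constants are
# illustrative only.
def step(heat, battery, fire):
    if fire:
        heat += 10.0       # rapid heating per shot
        battery -= 2.0     # rapid battery drain per shot
    heat *= 0.98           # slow exponential cool-down each frame
    battery -= 0.01        # baseline idle drain per frame
    return heat, battery

heat, battery = 0.0, 100.0
for frame in range(100):
    heat, battery = step(heat, battery, fire=(frame < 5))  # 5 shots, then idle
```

From the agent's point of view, heat and battery keep changing long after the shots stop, so attributing those trends to the right cause is a genuinely harder credit-assignment problem.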

From a safety perspective, it provides a concrete way for the agent to do something wrong.  What’s better: build an AI, put it in a robot in the real world, and see if it shoots somebody?  Or put the AI in a simulated bot in a simulated world and see if it shoots somebody?  When (not IF) it does, the simulated world is the perfect test bed for making the safety systems more robust before the AI has a chance to do something bad in the real world.

For example, what if the agent learns that to get past a certain type of obstacle, it can either go around it or just shoot it?  Then, what if it decides to try that on a human and see what happens?  There are many issues surrounding AI safety that need to be dealt with, many much more subtle than this scenario.  Giving it lasers is one way to approach the problem.  Or in general, setting it up to be able to fail in order to see how it creatively fails, so that more robust safety systems can be developed.

There are all sorts of ways to imagine an AI running around the pool with knives.  This type of approach doesn’t cover all bases, not even close.  But it adds to the pile of test methods that will be needed as it gets smarter.


n-Dimensional Planning Engine

The agent has been restricted to solving one-dimensional problems so far, but now the planning engine has been updated to allow n-dimensional planning.  One simple example of this is being able to intentionally navigate around obstacles, whereas before it could not.

There are significant changes on the whiteboard for the planning engine in the near future which will add more versatility and functionality to the agent’s reasoning ability.

The environment was also recently updated to be a boot-camp-style training course, to facilitate the initial curriculum training the agent will need for more complex tasks later on.


Multi-Step Planning

Curriculum training with progressive learning goals is used to help the agent learn specific actions, and to allow it to discover the importance of the order of operations in a multi-step task.   Note that curriculum training doesn’t teach the agent how to do a task; it just puts the agent in a situation to learn something it needs to know.  It’s up to the agent to learn how it all goes together.

After being trained on how to use the charger, this agent demonstrates (in the video below) the ability to charge its battery when needed.  To do this, the agent must:

  • Find a charger
  • Connect to the charger via the dataport
  • Push the ‘Charge’ button using the dataport until the battery is charged
  • Disconnect from the charger dataport when complete
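The steps above can be sketched as an ordered plan run by a simple executor. The step names follow the list; the executor, world model, and charge increments are hypothetical stand-ins for the agent's real planning machinery.

```python
# Sketch of the charging sequence as an ordered multi-step plan. Only the
# step ordering comes from the post; everything else is invented.
CHARGE_PLAN = [
    "find_charger",
    "connect_dataport",
    "press_charge_button",   # repeated until the battery is full
    "disconnect_dataport",
]

def execute(plan, world):
    """Run each step in order; the charge step loops until fully charged."""
    log = []
    for step in plan:
        if step == "press_charge_button":
            while world["battery"] < 100:
                world["battery"] = min(100, world["battery"] + 25)
                log.append(step)
        else:
            log.append(step)
    return log

world = {"battery": 40}
log = execute(CHARGE_PLAN, world)
```

The point of the curriculum is that the agent discovers this ordering itself: pressing the button before connecting the dataport accomplishes nothing.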

Moved back to Unity 3D

The project has moved back into the Unity 3D environment from the 2D Python environment it was using.  The agent AI is written in Python; the agent bot is in C#.

This video shows the agent learning how to move in the environment.  It can only move forward and turn clockwise and counterclockwise.  It can’t move backwards or strafe to the sides, so it has to learn how to point in the correct direction before moving forward to get there.  The agent learns quickly which actions are most likely to help it achieve its goals.
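The point-then-move behavior the agent has to discover can be sketched as a small controller. This is a hypothetical hand-coded version of what the agent learns on its own; the action names and tolerance are invented.

```python
# Hypothetical sketch of point-then-move control with only three actions:
# forward, clockwise (cw), counterclockwise (ccw).
import math

def choose_action(pos, heading, goal, tol=0.1):
    """Pick fwd/cw/ccw based on the signed heading error to the goal."""
    desired = math.atan2(goal[1] - pos[1], goal[0] - pos[0])
    # wrap the error into (-pi, pi] so the agent turns the short way
    error = (desired - heading + math.pi) % (2 * math.pi) - math.pi
    if abs(error) < tol:
        return "forward"
    return "ccw" if error > 0 else "cw"
```

With no strafe or reverse, every navigation problem reduces to rotating until the heading error is small, then driving forward, which is exactly the pattern the agent converges on.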

rlpAI v0.3

The AI can navigate around a cluttered room using fwd/cw/ccw movements to a series of waypoints in level0-6.csv, but it cannot intentionally navigate around large obstacles. OpenAI games CartPole_v0, MountainCar_v0, and LunarLander_v2 are implemented as world objects, and the agent can navigate to and play these games with varied success. The agent has been observed to make perfect landings in LunarLander_v2, and has scored above the threshold on the other games within a few tries (generally fewer than four), though consistency is lacking after the initial successes.

rlpAI v0.2

The AI was given 4 waypoint goals (state = ‘GPS’) in Level 0-0. It learned to navigate rapidly (within about 10 steps), then navigated to each of the 4 goal locations using a simple difference planning engine, which was kind of a hack but was added to the AI to demonstrate that the CE/CM works.

rlpAI v0.1

The AI was designed to play OpenAI Cartpole-v0.  It was not consistent, however: usually it performed moderately well, occasionally very well, and sometimes terribly.  Typically, though, it was able to reach an average of 195 steps in a 500-run test.

Version 0.1 was a move from C# in the Unity3D environment to Python; since much of the AI community uses Python, it seemed a good shift.  The agent was based on a very simplified rlpAI architecture with a scikit-learn MLP Classifier as the centerpiece.

The agent trained the classifier (state input, action output) with some of the results of prior episodes. In the best test, the agent reached 200 steps in less than the first 10 episodes, then hit 200 steps on every episode after that for 500 episodes (bottom right chart).
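The classifier-as-policy idea can be sketched as below. This is a hedged reconstruction, not the project's actual code: the single pole-angle-like feature, the replay data, and the network size are all invented, and only the (state in, action out) MLPClassifier pattern comes from the post.

```python
# Hedged sketch of v0.1's approach: train a scikit-learn MLPClassifier on
# (state, action) pairs from prior episodes, then use it as the policy.
# All data and hyperparameters here are illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier

# Pretend replay data: negative states map to action 0, positive to action 1.
states = np.linspace(-1.0, 1.0, 40).reshape(-1, 1)
actions = (states.ravel() > 0).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=3000, random_state=0)
clf.fit(states, actions)

# The trained classifier now acts as the policy: state in, action out.
preds = clf.predict([[-0.8], [0.8]])
```

Training only on the better prior episodes biases the classifier toward actions that previously kept the pole up, which is consistent with the rapid improvement shown in the charts.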

Vertical axis: Number of steps completed in the Cartpole-v0 game.
Horizontal axis: Episode number.