
[RLA] The RL Agent for a Survivors-like game [Part 2]

First part

Here's an update on what the project looked like when I burned out and stopped working on it for a while (a few months ago).

Here's an interim result (I hope gifs are working): vampire-39.gif

vampire-42.gif

The environment consists of an Agent, a Goal, and a small set of Obstacles. Reaching the Goal is the goal. Touching an Obstacle hurts the Agent until it moves away, and kills it if it stays in the Obstacle's vicinity for too long. The environment is a basic gymnasium one. Here's how the reward function looked after many iterations of trying to make it work:

# progress > 0 when the agent moved closer to the goal this step
progress = prev_distance - self._target_distance
# if curr_sensing > prev_sensing then danger is higher and we subtract more;
# else danger is lower, therefore we subtract less
enemy_danger = np.sum(self._enemy_sensing) - np.sum(self.prev_sensing)
reward = self.config.progress_w * progress - self.config.danger_w * enemy_danger

sensing_enemies works like this: we split the circle around the Agent into 12 sectors, and each sector may contain an enemy (Obstacle). To each enemy's distance I apply a tanh cutoff so that the Agent doesn't bother with faraway obstacles. Occasionally I noticed that this did help it navigate around the obstacles, but in the overwhelming majority of cases it did not - the Agent would just wander around somewhere in between.
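A minimal sketch of that kind of sector sensing - the function name, the cutoff constant, and the exact 1 - tanh shaping are placeholders for illustration, not the project's actual code:

import numpy as np

N_SECTORS = 12
CUTOFF = 50.0  # placeholder distance scale past which obstacles barely register

def sense_enemies(agent_pos, enemy_positions):
    """Per-sector proximity of the nearest enemy: ~1 up close, ~0 far away."""
    sensing = np.zeros(N_SECTORS)
    agent_pos = np.asarray(agent_pos, dtype=float)
    for enemy in np.atleast_2d(np.asarray(enemy_positions, dtype=float)):
        offset = enemy - agent_pos
        distance = np.linalg.norm(offset)
        angle = np.arctan2(offset[1], offset[0]) % (2 * np.pi)
        sector = int(angle // (2 * np.pi / N_SECTORS)) % N_SECTORS
        # tanh cutoff: faraway obstacles contribute almost nothing
        sensing[sector] = max(sensing[sector], 1.0 - np.tanh(distance / CUTOFF))
    return sensing

Summing this vector over all sectors is what the enemy_danger term in the reward compares between consecutive steps.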

Things that turned out harder than they seemed:

  • Creating an environment is only a part of the process (arguably the easiest part).
  • Optimizing said environment so that training does NOT bottleneck on the CPU is not obvious (one common mitigation is sketched right after this list).
  • Throwing a big GPU at the problem doesn't work (for the reason above).
  • Figuring out a reward function is probably the hardest part. No, simply weighting the distances relevant to the task does not work.
  • It's important not to penalize the agent for simply ending up in a tight spot. Figuring out how to reward it for making the correct decision to get out of there is much harder.
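For reference, one common mitigation for the CPU bottleneck is to step several copies of the environment in parallel via gymnasium's vector API. A minimal sketch, with CartPole standing in for the actual environment:

import gymnasium as gym

def make_env():
    # CartPole stands in for the custom Survivors-like environment here
    return gym.make("CartPole-v1")

if __name__ == "__main__":
    # Each copy steps in its own worker process, so the CPU-bound environment
    # no longer serializes the whole training loop.
    envs = gym.vector.AsyncVectorEnv([make_env for _ in range(8)])
    obs, infos = envs.reset(seed=0)
    for _ in range(1_000):
        actions = envs.action_space.sample()  # stand-in for the policy's batched actions
        obs, rewards, terminations, truncations, infos = envs.step(actions)
    envs.close()

Whether this actually removes the bottleneck depends on how expensive a single environment step is compared to the policy's forward pass.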

Things that surprised me somewhat:

  • Using JAX or MLX is actually somewhat niche in this field; there are JAX-based environments, but they are not as popular as you might think.
  • Many people in the field are working on solving a specific set of environments better or implementing papers rather than exploring the space of applicable problems.
  • Despite that, there are some sick examples of complex applied RL problems in the real world - for example, a robot or a drone doing something meaningful or navigating a very complex scene or environment.

Things that did help a lot in the process:

  • Writing a bug-free environment, with sparing but well-placed asserts. Asserting the shapes of numpy arrays gives you a good grip on what is happening in the environment during the runs (a small illustration follows this list).
  • Adding manual control.
  • Giving the Agent a sort of proximity vision - like a Roomba - for its observations.
  • Storing magic constants separately for reasons I don't remember.
  • NOT getting too distracted by JAX and MLX - or by accelerating and optimizing things before there are consistent signals that the agent is learning something.
  • Adding TensorBoard logging and looking at the graphs (also sketched after this list). It also helped to plot the same graph for a well-known environment with a working baseline, and then compare my results against that.
  • But graphs don't tell the whole story. What is even better is to record the result of the winning iteration and examine it with your own eyes.
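To make the assert point concrete, here's a small illustration of the kind of shape checks I mean - the function and names are made up for the example:

import numpy as np

def build_observation(agent_pos, enemy_sensing, n_sectors=12):
    """Build the observation vector; shape asserts catch silent mistakes early."""
    agent_pos = np.asarray(agent_pos, dtype=np.float32)
    enemy_sensing = np.asarray(enemy_sensing, dtype=np.float32)
    assert agent_pos.shape == (2,), f"unexpected agent_pos shape {agent_pos.shape}"
    assert enemy_sensing.shape == (n_sectors,), f"unexpected sensing shape {enemy_sensing.shape}"
    obs = np.concatenate([agent_pos, enemy_sensing])
    assert obs.shape == (2 + n_sectors,), f"unexpected obs shape {obs.shape}"
    return obs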
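And the logging part, roughly - this assumes PyTorch's SummaryWriter and arbitrary tag names; run_episode is a stand-in for the real training step:

import random
from torch.utils.tensorboard import SummaryWriter

def run_episode():
    """Stand-in for one training episode; returns (total reward, episode length)."""
    return random.uniform(-10.0, 10.0), random.randint(50, 500)

writer = SummaryWriter(log_dir="runs/survivors-agent")
for episode in range(1000):
    episode_reward, episode_length = run_episode()
    # Keeping the same tags across runs makes the baseline-vs-mine comparison easy.
    writer.add_scalar("episode/reward", episode_reward, episode)
    writer.add_scalar("episode/length", episode_length, episode)
writer.close()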

I will start from scratch. It doesn't make sense to think too much about optimization, even though it made me curious why this task is so CPU-bound. Speed is NOT the blocker here. Getting the hyperparameters right is not the problem either. Creating a meaningful interplay between the Observations and the Reward function is the key. By the way, I've just noticed that you start overthinking technical concerns whenever you don't know what to do.

Thoughts? Leave a comment