Reinforcement learning with the OpenAI Gym wrapper

The corresponding complete source code can be found here.

The goal of this example is to demonstrate how to use the OpenAI Gym interface provided by EnvPlayer, and to train a simple deep reinforcement learning agent comparable in performance to the MaxDamagePlayer we created in Creating a simple max damage player.


This example requires keras-rl (compatible with Tensorflow 1.X) or keras-rl2 (compatible with Tensorflow 2.X). Both implement numerous reinforcement learning algorithms and offer a simple API fully compatible with the OpenAI Gym API. You can install them by running pip install keras-rl or pip install keras-rl2. If you are unsure, pip install keras-rl2 is recommended.

Implementing rewards and observations

The OpenAI Gym API provides rewards and observations for each step of each episode. In our case, each step corresponds to one decision in a battle, and each battle corresponds to one episode.
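The reset/step contract can be sketched with a toy stand-in environment (illustrative only, not a poke-env class): reset starts a new episode, and each step call plays one decision until done is returned.

```python
class ToyBattleEnv:
    """Minimal gym-style environment: one 'battle' = one episode."""

    def __init__(self, max_turns=3):
        self.max_turns = max_turns

    def reset(self):
        # Start a new episode and return the initial observation
        self.turn = 0
        return [0.0]

    def step(self, action):
        # One step = one decision; the episode ends when the battle is over
        self.turn += 1
        done = self.turn >= self.max_turns
        reward = 1.0 if done else 0.0  # toy reward: 1 on "victory"
        return [float(self.turn)], reward, done, {}


env = ToyBattleEnv()
obs = env.reset()
total = 0.0
done = False
while not done:
    obs, reward, done, info = env.step(0)
    total += reward
```

This is exactly the loop that keras-rl drives internally during fit and test.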

Defining observations

Observations are embeddings of the current state of the battle. They can be an arbitrarily precise description of battle states, or a very simple representation. In this example, we will create embedding vectors containing:

  • the base power of each available move

  • the damage multiplier of each available move against the current active opponent pokemon

  • the number of non-fainted pokemons in our team

  • the number of non-fainted pokemons in the opponent’s team
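Stacked together, these four pieces form a 10-component vector. A minimal numpy sketch with made-up values (no poke-env objects involved; all numbers here are hypothetical):

```python
import numpy as np

# Hypothetical values for illustration only
moves_base_power = np.array([0.9, 0.55, -1.0, -1.0])   # rescaled by 1/100; -1 = unavailable
moves_dmg_multiplier = np.array([2.0, 1.0, 1.0, 1.0])  # type effectiveness of each move
fainted_mon_team = 1 / 6       # one of our pokemons has fainted
fainted_mon_opponent = 2 / 6   # two opponent pokemons have fainted

# Concatenate into the final 10-component embedding
embedding = np.concatenate(
    [moves_base_power, moves_dmg_multiplier, [fainted_mon_team, fainted_mon_opponent]]
)
```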

To define our observations, we will create a custom embed_battle method. It takes one argument, a Battle object, and returns our embedding.

In addition to this, we also need to describe the embedding to the gym interface. To achieve this, we need to implement the describe_embedding method where we specify the low bound and the high bound for each component of the embedding vector and return them as a gym.Space object.

Defining rewards

Rewards are signals that the agent will use in its optimization process (a common objective is maximizing a discounted total reward). EnvPlayer objects provide a helper method, reward_computing_helper, that can help define simple symmetric rewards taking into account fainted pokemons, remaining hp, status conditions and victory.

We will use this method to define the following reward:

  • Winning corresponds to a positive reward of 30

  • Making an opponent’s pokemon faint corresponds to a positive reward of 2

  • Making an opponent’s pokemon lose n% of its hp corresponds to a positive reward of n / 100

  • Other status conditions are ignored

Conversely, negative events lead to symmetrically negative rewards: losing is a reward of -30 points, having one of our pokemons faint is -2, etc.
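To illustrate how such a symmetric reward behaves, here is a toy sketch (a simplified stand-in, not poke-env's actual reward_computing_helper): a battle state is scored from our perspective, and the step reward is the difference between consecutive scores.

```python
def toy_battle_value(team_hp, opp_hp, won=None,
                     hp_value=1.0, fainted_value=2.0, victory_value=30.0):
    """Score a battle state from our perspective (toy version)."""

    def side_value(hp_fractions):
        # Remaining hp counts positively, fainted pokemons count negatively
        return sum(hp_fractions) * hp_value - sum(
            fainted_value for h in hp_fractions if h == 0
        )

    value = side_value(team_hp) - side_value(opp_hp)
    if won is True:
        value += victory_value
    elif won is False:
        value -= victory_value
    return value


# Step reward = value(current state) - value(previous state)
prev = toy_battle_value([1.0] * 6, [1.0] * 6)          # battle start
curr = toy_battle_value([1.0] * 6, [0.7] * 5 + [0.0])  # one opp pokemon fainted
reward = curr - prev  # 2.5 (hp lost) + 2.0 (faint) = 4.5
```

Because the score is a difference between the two sides, the same events happening to us would produce the mirrored negative reward.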

To define our rewards, we will create a custom calc_reward method. It takes two arguments, the previous and the current Battle objects, and returns the reward.

Defining our custom class

Our player will play the gen8randombattle format. We can therefore inherit from Gen8EnvSinglePlayer.

import numpy as np
from gym.spaces import Space, Box

from poke_env.environment.abstract_battle import AbstractBattle
from poke_env.player import Gen8EnvSinglePlayer, ObservationType


class SimpleRLPlayer(Gen8EnvSinglePlayer):
    def calc_reward(self, last_battle, current_battle) -> float:
        return self.reward_computing_helper(
            current_battle, fainted_value=2.0, hp_value=1.0, victory_value=30.0
        )

    def embed_battle(self, battle: AbstractBattle) -> ObservationType:
        # -1 indicates that the move does not have a base power
        # or is not available
        moves_base_power = -np.ones(4)
        moves_dmg_multiplier = np.ones(4)
        for i, move in enumerate(battle.available_moves):
            moves_base_power[i] = (
                move.base_power / 100
            )  # Simple rescaling to facilitate learning
            if move.type:
                moves_dmg_multiplier[i] = move.type.damage_multiplier(
                    battle.opponent_active_pokemon.type_1,
                    battle.opponent_active_pokemon.type_2,
                )

        # We count how many pokemons have fainted in each team
        fainted_mon_team = len([mon for mon in battle.team.values() if mon.fainted]) / 6
        fainted_mon_opponent = (
            len([mon for mon in battle.opponent_team.values() if mon.fainted]) / 6
        )

        # Final vector with 10 components
        final_vector = np.concatenate(
            [
                moves_base_power,
                moves_dmg_multiplier,
                [fainted_mon_team, fainted_mon_opponent],
            ]
        )
        return np.float32(final_vector)

    def describe_embedding(self) -> Space:
        low = [-1, -1, -1, -1, 0, 0, 0, 0, 0, 0]
        high = [3, 3, 3, 3, 4, 4, 4, 4, 1, 1]
        return Box(
            np.array(low, dtype=np.float32),
            np.array(high, dtype=np.float32),
            dtype=np.float32,
        )

Instantiating and testing a player

Now that our custom class is defined, we can instantiate our RL player and test whether it is compliant with the OpenAI Gym API.

from gym.utils.env_checker import check_env
from poke_env.player import RandomPlayer

opponent = RandomPlayer(battle_format="gen8randombattle")
test_env = SimpleRLPlayer(
    battle_format="gen8randombattle", opponent=opponent, start_challenging=True
)
check_env(test_env)
test_env.close()
The close method of test_env closes all underlying processes and clears from memory all objects related to the environment. After an environment is closed, no further actions should be taken on that environment.

Instantiating train environment and evaluation environment

Normally, to ensure isolation between training and testing, two different environments are created. The base class EnvPlayer allows you to choose the opponent when you instantiate the environment, or to replace it during training with the set_opponent method. If you don’t want the player to start challenging the opponent immediately, you can set start_challenging=False when creating it. In this case, we want them to start challenging right away:

from poke_env.player import RandomPlayer, wrap_for_old_gym_api

opponent = RandomPlayer(battle_format="gen8randombattle")
train_env = SimpleRLPlayer(
    battle_format="gen8randombattle", opponent=opponent, start_challenging=True
)
train_env = wrap_for_old_gym_api(train_env)

opponent = RandomPlayer(battle_format="gen8randombattle")
eval_env = SimpleRLPlayer(
    battle_format="gen8randombattle", opponent=opponent, start_challenging=True
)
eval_env = wrap_for_old_gym_api(eval_env)

The function wrap_for_old_gym_api wraps the environment to make it compatible with the old gym API, as the keras-rl2 library does not support the new one.

Creating a DQN with keras-rl

We have defined observations and rewards. We can now build a model that will control our player. In this section, we will implement the DQN algorithm using keras-rl.

Defining a base model

We build a simple keras sequential model. Our observation vectors have 10 components; our model will therefore accept inputs of dimension 10.

The output of the model must map to the environment’s action space. The action space can be accessed through the action_space property. Each action corresponds to one order: a switch or an attack, with additional options for dynamaxing, mega-evolving and using z-moves.

from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

# Compute dimensions
n_action = train_env.action_space.n
input_shape = (1,) + train_env.observation_space.shape  # (1,) is the window_length that the keras-rl memory prepends to each observation

# Create model
model = Sequential()
model.add(Dense(128, activation="elu", input_shape=input_shape))
model.add(Dense(64, activation="elu"))
model.add(Dense(n_action, activation="linear"))

Defining the DQN

Now that we have a model, we can build the DQN agent. This agent combines our model with a policy and a memory. The memory is an object that will store past actions and define samples used during learning. The policy describes how actions are chosen during learning.

We will use a simple memory containing 10000 steps, and an epsilon greedy policy.
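As a sketch of what the linear annealing wrapper does (illustrative, not keras-rl's actual implementation), epsilon decreases linearly from value_max to value_min over nb_steps and then stays at the floor:

```python
def annealed_eps(step, value_max=1.0, value_min=0.05, nb_steps=10000):
    """Linearly interpolate epsilon from value_max down to value_min."""
    frac = min(step / nb_steps, 1.0)  # clamp once the schedule is exhausted
    return value_max + frac * (value_min - value_max)


eps_start = annealed_eps(0)      # 1.0: purely random exploration at first
eps_mid = annealed_eps(5000)     # ~0.525 halfway through the schedule
eps_end = annealed_eps(20000)    # clamped near 0.05 once nb_steps is reached
```

With probability eps the policy picks a random action, otherwise the action with the highest predicted Q-value, so training moves gradually from exploration to exploitation.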

For more information regarding keras-rl, please refer to their documentation.

from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from tensorflow.keras.optimizers import Adam

# Defining the DQN
memory = SequentialMemory(limit=10000, window_length=1)

# Epsilon greedy policy, linearly annealed from 1.0 to 0.05 during training
policy = LinearAnnealedPolicy(
    EpsGreedyQPolicy(), attr="eps", value_max=1.0,
    value_min=0.05, value_test=0.0, nb_steps=10000,
)

dqn = DQNAgent(
    model=model, nb_actions=n_action, policy=policy, memory=memory,
    nb_steps_warmup=1000, gamma=0.5, target_model_update=1,
    delta_clip=0.01, enable_double_dqn=True,
)
dqn.compile(Adam(learning_rate=0.00025), metrics=["mae"])

Training the model

Training the model is as simple as:

dqn.fit(train_env, nb_steps=10000)
train_env.close()

Evaluating the model

We have trained our agent. Now we can use different strategies to evaluate the result.

Simple win rate evaluation

A first way to evaluate the agent is to have it play against different opponents and print the number of battles won. This can be done with the following code:

from poke_env.player import MaxBasePowerPlayer

print("Results against random player:")
dqn.test(eval_env, nb_episodes=100, verbose=False, visualize=False)
print(
    f"DQN Evaluation: {eval_env.n_won_battles} victories out of {eval_env.n_finished_battles} episodes"
)

second_opponent = MaxBasePowerPlayer(battle_format="gen8randombattle")
eval_env.reset_env(restart=True, opponent=second_opponent)
print("Results against max base power player:")
dqn.test(eval_env, nb_episodes=100, verbose=False, visualize=False)
print(
    f"DQN Evaluation: {eval_env.n_won_battles} victories out of {eval_env.n_finished_battles} episodes"
)

The reset_env method of the EnvPlayer class allows you to reset the environment to a clean state, including internal counters for victories, battles, etc.

It takes two optional parameters:

  • restart: a boolean that will tell the environment if the challenge loop is to be restarted after the reset;

  • opponent: the new opponent to use in the challenge loop after the reset. If omitted, the old opponent is kept.

Use provided evaluate_player method

In order to evaluate the player with the provided method, we need to use a background version. background_evaluate_player has the same interface as the foreground counterpart, but it will return a Future object.

from poke_env.player import background_evaluate_player

n_challenges = 250
placement_battles = 40
eval_task = background_evaluate_player(
    eval_env.agent, n_challenges, placement_battles
)
dqn.test(eval_env, nb_episodes=n_challenges, verbose=False, visualize=False)
print("Evaluation with included method:", eval_task.result())

The result method of the Future object will block until the task is done and will return the result.
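The same Future semantics can be illustrated with the standard library (this sketch uses concurrent.futures, not poke-env):

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_evaluation():
    time.sleep(0.1)  # stand-in for battles being played in the background
    return 0.75


with ThreadPoolExecutor() as executor:
    task = executor.submit(slow_evaluation)  # starts running in the background
    # ... the main thread is free to do other work here (e.g. dqn.test) ...
    result = task.result()  # blocks until the background task finishes
```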


background_evaluate_player requires the challenge loop to be stopped. To ensure this, call the reset_env(restart=False) method of EnvPlayer beforehand.


If you call result before the task is finished, the main thread will be blocked. Only do that if the agent is operating on a different thread than the one asking for the result.

Use provided cross_evaluate method

To use the cross_evaluate method, the strategy is the same as the one used for the evaluate_player method:

from poke_env.player import background_cross_evaluate

from tabulate import tabulate
from poke_env.player import MaxBasePowerPlayer, SimpleHeuristicsPlayer

n_challenges = 50
players = [
    eval_env.agent,
    RandomPlayer(battle_format="gen8randombattle"),
    MaxBasePowerPlayer(battle_format="gen8randombattle"),
    SimpleHeuristicsPlayer(battle_format="gen8randombattle"),
]
cross_eval_task = background_cross_evaluate(players, n_challenges)
dqn.test(
    eval_env, nb_episodes=n_challenges * (len(players) - 1),
    verbose=False, visualize=False,
)
cross_evaluation = cross_eval_task.result()
table = [["-"] + [p.username for p in players]]
for p_1, results in cross_evaluation.items():
    table.append([p_1] + [cross_evaluation[p_1][p_2] for p_2 in results])
print("Cross evaluation of DQN with baselines:")
print(tabulate(table))


background_cross_evaluate requires the challenge loop to be stopped. To ensure this, call the reset_env(restart=False) method of EnvPlayer beforehand.


If you call result before the task is finished, the main thread will be blocked. Only do that if the agent is operating on a different thread than the one asking for the result.

Final result

Running the whole file should take a couple of minutes and print something similar to this:

Training for 10000 steps ...
Interval 1 (0 steps performed)
10000/10000 [==============================] - 194s 19ms/step - reward: 0.6015
done, took 195.208 seconds
Results against random player:
DQN Evaluation: 94 victories out of 100 episodes
Results against max base power player:
DQN Evaluation: 65 victories out of 100 episodes
Evaluation with included method: (16.028896545454547, (11.79801006617441, 22.609978288238203))
Cross evaluation of DQN with baselines:
------------------  ----------------  --------------  ------------------  ------------------
-                   SimpleRLPlayer 3  RandomPlayer 5  MaxBasePowerPlay 3  SimpleHeuristics 2
SimpleRLPlayer 3                      0.96            0.76                0.16
RandomPlayer 5      0.04                              0.12                0.0
MaxBasePowerPlay 3  0.24              0.88                                0.1
SimpleHeuristics 2  0.84              1.0             0.9
------------------  ----------------  --------------  ------------------  ------------------


Remember to use reset_env between different evaluations on the same environment, or use different environments, to avoid interference between evaluations.