Reinforcement Learning

Learning optimal behavior through interaction

Categories: python, trading, simulation
Reinforcement learning enables agents to discover optimal strategies through trial-and-error interaction, mapping naturally to sequential trading decisions.
Author: Christos Galerakis

Published: January 19, 2026

1 Abstract

Reinforcement learning (RL) is a machine learning paradigm where an agent learns optimal behavior through trial-and-error interaction with an environment. Unlike supervised learning, RL does not require labeled examples—the agent discovers which actions yield the highest cumulative reward through experience. This framework naturally maps to trading, where sequential decisions produce delayed and uncertain outcomes.

2 What is Reinforcement Learning?

Mitchell (1997) defines machine learning as follows:

A computer program is said to learn from experience \(E\) with respect to some class of tasks \(T\) and performance measure \(P\), if its performance at tasks in \(T\), as measured by \(P\), improves with experience \(E\).

Reinforcement learning operationalizes this definition through interaction. An agent observes the current state of an environment, takes an action, receives a reward, and transitions to a new state. The goal is to learn a policy—a mapping from states to actions—that maximizes cumulative reward over time (Sutton & Barto, 2018).

This sequential decision framework closely mirrors trading: a portfolio manager observes market conditions (state), executes trades (action), realizes profits or losses (reward), and faces new market conditions (next state).
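As a toy illustration of what a policy is (the state labels and rules here are hypothetical, not part of the experiment below), a policy can be written as an ordinary function from an observed state to an action:

```python
# A policy is a mapping from states to actions. Here the "state" is a
# hypothetical, already-discretized market observation, and the actions
# mirror the demonstration below: long, short, or flat.

def naive_policy(state: str) -> str:
    """Toy deterministic policy: long in an up-trending state,
    short in a down-trending state, flat otherwise."""
    if state == "uptrend":
        return "long"
    if state == "downtrend":
        return "short"
    return "flat"

print(naive_policy("uptrend"))  # -> long
```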

3 The Agent-Environment Loop

Figure: The RL agent-environment interaction loop

At each discrete time step \(t\):

  1. Agent observes state \(S_t\)
  2. Agent selects action \(A_t\)
  3. Environment returns reward \(R_{t+1}\) and next state \(S_{t+1}\)
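
In code, one episode of this loop takes only a few lines. The sketch below assumes a generic environment exposing `reset()` and `step()` and an agent exposing `act()` and `learn()`; the interface names are illustrative, not a specific library's API.

```python
# One episode of the agent-environment loop (interface names are assumed):
# env.reset() returns the initial state; env.step(action) returns
# (next_state, reward, done).

def run_episode(agent, env):
    state = env.reset()                                 # observe S_0
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                       # agent selects A_t
        next_state, reward, done = env.step(action)     # environment returns R_{t+1}, S_{t+1}
        agent.learn(state, action, reward, next_state)  # update from experience
        total_reward += reward
        state = next_state
    return total_reward
```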

The agent’s objective is to maximize the expected cumulative discounted reward:

\[ G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]

where \(\gamma \in [0, 1]\) is the discount factor. A \(\gamma\) close to 1 makes the agent far-sighted (valuing future rewards), while a \(\gamma\) close to 0 makes it myopic (focusing on immediate rewards).
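The effect of \(\gamma\) is easy to see numerically. A minimal sketch with a made-up reward sequence in which a large reward arrives late:

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1} for an illustrative
# (made-up) reward sequence where the big payoff comes last.
rewards = [1.0, 0.0, 0.0, 0.0, 10.0]

def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, gamma=0.99))  # ~10.6: far-sighted, values the late reward
print(discounted_return(rewards, gamma=0.10))  # ~1.0:  myopic, the late reward barely counts
```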

4 Demonstration: Learning to Trade

We illustrate RL with a simple experiment: an agent starts with no knowledge of the market and must learn whether to go long, go short, or stay flat. Through repeated interaction, it discovers which actions lead to positive rewards.

Critically, we include transaction costs—a fixed penalty every time the agent changes position. Without friction, RL agents tend to over‑trade, chasing every small price wiggle. With costs, the agent learns patience: a small predicted gain may not be worth the guaranteed cost of trading.
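A minimal sketch of such an environment follows. The state encoding (sign of the previous return) and the cost level are illustrative assumptions, not necessarily the exact setup of the experiment; the interface matches the `run_episode` loop shown earlier.

```python
import numpy as np

class ToyTradingEnv:
    """Toy environment: the agent holds a position in {-1, 0, +1}
    (short, flat, long) and pays a fixed cost whenever it changes it."""

    def __init__(self, returns, cost=0.001):
        self.returns = np.asarray(returns, dtype=float)  # per-step asset returns
        self.cost = cost                                 # penalty per position change
        self.t = 0
        self.position = 0

    def reset(self):
        self.t = 0
        self.position = 0
        return self._state()

    def _state(self):
        # Crude state: sign of the most recent return (illustrative only).
        return 0 if self.t == 0 else int(np.sign(self.returns[self.t - 1]))

    def step(self, action):
        # action is the desired position: -1 (short), 0 (flat), +1 (long).
        trade_cost = self.cost if action != self.position else 0.0
        self.position = action
        reward = self.position * self.returns[self.t] - trade_cost
        self.t += 1
        done = self.t >= len(self.returns)
        return (self._state() if not done else 0), reward, done
```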

The RL agent learns which actions produce positive rewards net of trading costs, while the random baseline, which changes position frequently without regard to cost, is heavily penalized. This demonstrates a key insight: reward functions that include realistic friction lead to more disciplined, cost-aware trading behavior.
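
For concreteness, the comparison can be reproduced with a very small tabular agent. The sketch below is one plausible implementation of the experiment described, not the exact one: it assumes a running-average value estimate per (state, action) pair and the ε-greedy exploration rate of 0.3 mentioned in the next section.

```python
import random
from collections import defaultdict

ACTIONS = [-1, 0, 1]  # short, flat, long

class EpsilonGreedyAgent:
    """Tabular agent: running-average action values, epsilon-greedy selection."""

    def __init__(self, epsilon=0.3):
        self.epsilon = epsilon
        self.values = defaultdict(float)  # (state, action) -> estimated reward
        self.counts = defaultdict(int)

    def act(self, state):
        if random.random() < self.epsilon:  # explore: try a random action
            return random.choice(ACTIONS)
        # Exploit: pick the action with the highest estimated value in this state.
        return max(ACTIONS, key=lambda a: self.values[(state, a)])

    def learn(self, state, action, reward, next_state):
        key = (state, action)
        self.counts[key] += 1
        # Incremental running average of the rewards observed for this pair.
        self.values[key] += (reward - self.values[key]) / self.counts[key]

class RandomAgent:
    """Baseline: changes position at random, paying costs with every switch."""

    def act(self, state):
        return random.choice(ACTIONS)

    def learn(self, state, action, reward, next_state):
        pass
```

Running both agents through `run_episode` on the same return series and comparing cumulative rewards should reproduce the qualitative pattern described above: the random baseline bleeds transaction costs, while the learner trades far less often.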

5 Why RL for Finance?

Reinforcement learning is appealing for financial applications because:

  • Sequential decisions: Trading involves a series of dependent decisions over time
  • Delayed rewards: The profitability of a trade may not be known immediately
  • No labeled data: Unlike classification, there is no “correct” action—only outcomes
  • Adaptation: RL agents can potentially adapt to changing market conditions

5.1 The Exploration-Exploitation Trade-off

A fundamental challenge in RL is balancing exploration (trying new actions to discover their value) with exploitation (choosing the best-known action). In our example, the \(\epsilon\)-greedy strategy explores randomly 30% of the time.

In finance, this trade-off is particularly painful. Exploration means placing real trades with uncertain outcomes—losing money to learn. Unlike a video game where the agent can fail thousands of times for free, every exploratory trade in live markets costs real capital. This makes sample-efficient learning critical: the agent must learn from limited, expensive experience.

5.2 From Simulation to Reality

Applying RL to live trading remains challenging:

  • Transaction costs: As shown above, friction changes optimal behavior dramatically
  • Non-stationarity: Market dynamics shift over time, invalidating learned policies
  • Partial observability: The true market state is never fully known
  • Sim-to-real gap: Backtests don’t capture slippage, market impact, or liquidity constraints

6 Conclusion

Reinforcement learning provides a principled framework for sequential decision-making where actions have delayed, uncertain consequences. The agent-environment interaction loop—observe state, take action, receive reward—maps naturally to trading problems. While this introduction covers only the foundational concepts, more advanced topics like Q-learning, policy gradients, and deep RL build upon these core ideas.

References

Mitchell, T. M. (1997). Machine learning. McGraw-Hill.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.