Reinforcement Learning
Learning optimal behavior through interaction
1 Abstract
Reinforcement learning (RL) is a machine learning paradigm where an agent learns optimal behavior through trial-and-error interaction with an environment. Unlike supervised learning, RL does not require labeled examples—the agent discovers which actions yield the highest cumulative reward through experience. This framework naturally maps to trading, where sequential decisions produce delayed and uncertain outcomes.
2 What is Reinforcement Learning?
Mitchell (1997) defines machine learning as follows:
A computer program is said to learn from experience \(E\) with respect to some class of tasks \(T\) and performance measure \(P\), if its performance at tasks in \(T\), as measured by \(P\), improves with experience \(E\).
Reinforcement learning operationalizes this definition through interaction. An agent observes the current state of an environment, takes an action, receives a reward, and transitions to a new state. The goal is to learn a policy—a mapping from states to actions—that maximizes cumulative reward over time (Sutton & Barto, 2018).
This sequential decision framework closely mirrors trading: a portfolio manager observes market conditions (state), executes trades (action), realizes profits or losses (reward), and faces new market conditions (next state).
3 The Agent-Environment Loop
At each discrete time step \(t\), the interaction proceeds as follows (a code sketch of the loop appears after this list):
- Agent observes state \(S_t\)
- Agent selects action \(A_t\)
- Environment returns reward \(R_{t+1}\) and next state \(S_{t+1}\)
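This loop is easy to express in code. The sketch below is purely illustrative: the toy environment, the placeholder agent, and all class and method names are assumptions made for this example, not the API of any particular library.

```python
import random

class ToyEnvironment:
    """Illustrative environment with made-up dynamics."""
    def reset(self):
        return 0  # initial state S_0

    def step(self, action):
        next_state = random.randint(0, 4)   # S_{t+1}
        reward = random.gauss(0.0, 1.0)     # R_{t+1}
        done = random.random() < 0.05       # episode ends occasionally
        return next_state, reward, done

class RandomAgent:
    """Placeholder agent: acts uniformly at random and learns nothing."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

    def learn(self, state, action, reward, next_state):
        pass  # a real agent would update its policy here

# The agent-environment loop: observe S_t, select A_t, receive R_{t+1} and S_{t+1}.
env = ToyEnvironment()
agent = RandomAgent(actions=[-1, 0, 1])  # short, flat, long
state = env.reset()
done = False
while not done:
    action = agent.act(state)
    next_state, reward, done = env.step(action)
    agent.learn(state, action, reward, next_state)
    state = next_state
```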
The agent’s objective is to maximize the expected cumulative discounted reward:
\[ G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]
where \(\gamma \in [0, 1]\) is the discount factor. A \(\gamma\) close to 1 makes the agent far-sighted (valuing future rewards), while a \(\gamma\) close to 0 makes it myopic (focusing on immediate rewards).
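To make the effect of \(\gamma\) concrete, the short sketch below (the numbers are illustrative, not taken from any experiment in this text) computes \(G_t\) for a finite reward sequence under a far-sighted and a myopic discount factor.

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma^k * R_{t+k+1}, for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward of 10 arrives three steps in the future; earlier steps pay nothing.
rewards = [0.0, 0.0, 0.0, 10.0]

print(discounted_return(rewards, gamma=0.99))  # ~9.70: far-sighted, future reward barely discounted
print(discounted_return(rewards, gamma=0.10))  # 0.01: myopic, future reward almost ignored
```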
4 Demonstration: Learning to Trade
We illustrate RL with a simple experiment. An agent starts with no knowledge of the market and must learn when to go long, go short, or stay flat. Through repeated interaction, the agent discovers which actions lead to positive rewards.
Critically, we include transaction costs—a fixed penalty every time the agent changes position. Without friction, RL agents tend to over‑trade, chasing every small price wiggle. With costs, the agent learns patience: a small predicted gain may not be worth the guaranteed cost of trading.
The RL agent learns which actions produce positive rewards net of trading costs. The random baseline, which changes position frequently without regard to cost, is heavily penalized. This demonstrates a key insight: realistic reward functions that include friction lead to more disciplined trading behavior.
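The experiment is summarized rather than listed in full above, so the sketch below should be read as one plausible implementation, not the exact setup: the synthetic random-walk prices, the state discretization, the fixed cost of 0.05, and the tabular Q-learning update with an \(\epsilon\)-greedy policy (\(\epsilon = 0.3\), as in Section 5.1) are all assumptions made for illustration.

```python
import random
from collections import defaultdict

random.seed(0)

# --- Toy market: a synthetic random-walk price series (not real data) ---
prices = [100.0]
for _ in range(5000):
    prices.append(prices[-1] + random.gauss(0.0, 1.0))

ACTIONS = [-1, 0, 1]       # short, flat, long
COST = 0.05                # assumed fixed penalty per unit change in position
EPSILON = 0.3              # exploration rate, matching Section 5.1
ALPHA, GAMMA = 0.1, 0.9    # assumed learning rate and discount factor

def state_at(t):
    """Discretize the most recent price move into -1 (down), 0 (flat), or +1 (up)."""
    move = prices[t] - prices[t - 1]
    return (move > 0) - (move < 0)

Q = defaultdict(float)     # Q[(state, action)] -> estimated action value

def choose_action(state):
    """Epsilon-greedy: explore with probability EPSILON, otherwise exploit."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

position = 0
for t in range(1, len(prices) - 1):
    state = state_at(t)
    action = choose_action(state)                    # the new target position
    pnl = action * (prices[t + 1] - prices[t])       # profit/loss from holding it one step
    reward = pnl - COST * abs(action - position)     # net of the transaction cost
    next_state = state_at(t + 1)
    # One-step tabular Q-learning update (a standard rule, chosen here for illustration).
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    position = action
```

Because the toy prices here are a pure random walk, there is no direction to learn; the main effect of the cost term in this sketch is that the agent tends to value staying flat most highly, which mirrors the patience described above.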
5 Why RL for Finance?
Reinforcement learning is appealing for financial applications because:
- Sequential decisions: Trading involves a series of dependent decisions over time
- Delayed rewards: The profitability of a trade may not be known immediately
- No labeled data: Unlike classification, there is no “correct” action—only outcomes
- Adaptation: RL agents can potentially adapt to changing market conditions
5.1 The Exploration-Exploitation Trade-off
A fundamental challenge in RL is balancing exploration (trying new actions to discover their value) with exploitation (choosing the best-known action). In our example, the \(\epsilon\)-greedy strategy selects a random action with probability \(\epsilon = 0.3\) and the highest-valued known action otherwise.
In finance, this trade-off is particularly painful. Exploration means placing real trades with uncertain outcomes—losing money to learn. Unlike a video game where the agent can fail thousands of times for free, every exploratory trade in live markets costs real capital. This makes sample-efficient learning critical: the agent must learn from limited, expensive experience.
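One common response to this asymmetry, mentioned here as a generic technique rather than something prescribed above, is to anneal \(\epsilon\) over time so that costly random exploration is concentrated early and the agent increasingly exploits what it already knows. A minimal sketch of a linear schedule, with illustrative values:

```python
def epsilon_at(step, start=0.3, end=0.01, decay_steps=2000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

# Heavy exploration early, mostly exploitation later.
print(epsilon_at(0))      # 0.30
print(epsilon_at(1000))   # ~0.155
print(epsilon_at(5000))   # ~0.01 (fully annealed)
```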
5.2 From Simulation to Reality
Applying RL to live trading remains challenging:
- Transaction costs: As shown above, friction changes optimal behavior dramatically
- Non-stationarity: Market dynamics shift over time, invalidating learned policies
- Partial observability: The true market state is never fully known
- Sim-to-real gap: Backtests don’t capture slippage, market impact, or liquidity constraints
6 Conclusion
Reinforcement learning provides a principled framework for sequential decision-making where actions have delayed, uncertain consequences. The agent-environment interaction loop—observe state, take action, receive reward—maps naturally to trading problems. While this introduction covers only the foundational concepts, more advanced topics like Q-learning, policy gradients, and deep RL build upon these core ideas.