StockBuyerX
A reinforcement-learning portfolio simulator with a custom Gym environment, PPO baseline, and risk-aware diagnostics across large-cap equities and an S&P 500 ETF.
Overview
StockBuyerX is a full pipeline for training and evaluating reinforcement learning agents on daily equity data. I built a custom OpenAI Gym environment that models portfolio actions, cash, positions, and mark-to-market P&L, then trained a PPO policy per asset. The project tracks profit, Sharpe ratio, maximum drawdown, and an interpretable holding ratio so I can compare behavior across assets and splits.
- Symbols: AAPL, MSFT, GOOGL, AMZN, SPLG
- Period: 2018-01-01 to 2022-12-31
- Stack: Python, Gym, Stable-Baselines3 (PPO, A2C, DDPG), PyTorch, yfinance, NumPy, Matplotlib
Academic simulation on historical data. For research only, not financial advice.
System Design
- Data: Download daily OHLCV via `yfinance`, forward/backward fill gaps, and split into train 70%, validation 15%, test 15% (see the data sketch after this list).
- Environment: `StockTradingEnv` exposes a continuous action space with Buy, Sell, and Hold magnitudes and returns a compact observation that combines a 5-day OHLCV window with portfolio state (cash, net worth, shares held, cost basis, realized sales).
- Reward and Diagnostics: Reward is time-weighted to encourage sustained survivability, while profit, Sharpe, and max drawdown are logged at every step. A derived holding ratio provides a human-readable signal of the agent’s allocation behavior.
- Training and Evaluation: Train PPO per asset in `DummyVecEnv`, then replay the policy on the validation and test splits. I also ran A2C and DDPG sanity checks on standard control tasks to benchmark code paths.
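A minimal sketch of the data step under the split fractions above. It assumes `yfinance`'s flat single-ticker column layout and a chronological 70/15/15 split; the actual notebook may organize this differently.

```python
import yfinance as yf


def load_and_split(symbol, start="2018-01-01", end="2022-12-31"):
    """Download daily OHLCV, fill gaps, and split 70/15/15 chronologically."""
    df = yf.download(symbol, start=start, end=end, progress=False)
    df = df.ffill().bfill()  # forward/backward fill missing values
    n = len(df)
    train_end, val_end = int(n * 0.70), int(n * 0.85)
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]


train_df, val_df, test_df = load_and_split("AAPL")
```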
Key Implementation Details
Custom Gym Environment
- Observation: a `6×6` tensor composed of five rows of scaled OHLCV for the last five bars plus one row of portfolio state features (see the environment sketch after this list).
- Action space: `Box(low=[0,0], high=[3,1])`. The first value selects Buy, Sell, or Hold; the second is the fraction to apply.
- Execution model: the environment samples a price within the time step’s range and updates `balance`, `shares_held`, `cost_basis`, and `net_worth`.
- Metrics:
- Profit: running net worth relative to initial capital.
- Sharpe ratio: computed from incremental P&L with a fixed reference rate for comparison.
- Max drawdown: peak-to-trough on cumulative profit.
- Holding ratio: share of cumulative trades that remain invested, visualized over time.
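A minimal sketch of the environment interface described above, assuming the classic Gym API (`reset` returns the observation; `step` returns `(obs, reward, done, info)`). The sixth observation column, the feature scaling, the exact buy/sell bookkeeping, and the time-weighted reward are simplified placeholders rather than the project's actual implementation, and the metric helpers at the end use one plausible definition each.

```python
import gym
import numpy as np
from gym import spaces


class StockTradingEnv(gym.Env):
    """Single-asset daily trading environment over an OHLCV DataFrame."""

    def __init__(self, df, initial_balance=10_000):
        super().__init__()
        self.df = df.reset_index(drop=True)
        self.initial_balance = initial_balance
        # Action: [type, fraction]; type < 1 buys, type < 2 sells, otherwise hold.
        self.action_space = spaces.Box(low=np.array([0.0, 0.0]),
                                       high=np.array([3.0, 1.0]), dtype=np.float32)
        # Observation: five rows of OHLCV plus one row of portfolio state -> 6x6.
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(6, 6), dtype=np.float32)

    def reset(self):
        self.balance = float(self.initial_balance)
        self.shares_held = 0.0
        self.cost_basis = 0.0
        self.total_sales = 0.0
        self.net_worth = float(self.initial_balance)
        self.current_step = 5  # need a five-bar lookback window
        return self._get_obs()

    def _get_obs(self):
        window = self.df.loc[self.current_step - 5:self.current_step - 1,
                             ["Open", "High", "Low", "Close", "Volume"]].values
        # Pad OHLCV rows to six columns to keep the 6x6 shape; the original may
        # use a sixth market feature and per-feature scaling instead.
        window = np.hstack([window, np.zeros((5, 1))])
        state = np.array([[self.balance, self.net_worth, self.shares_held,
                           self.cost_basis, self.total_sales, 0.0]])
        return np.vstack([window, state]).astype(np.float32)

    def step(self, action):
        row = self.df.loc[self.current_step]
        price = np.random.uniform(row["Low"], row["High"])  # execute within the bar's range
        act_type, fraction = float(action[0]), float(action[1])
        if act_type < 1:  # buy: spend a fraction of available cash
            shares = (self.balance * fraction) / price
            total_cost = self.cost_basis * self.shares_held + shares * price
            self.shares_held += shares
            self.balance -= shares * price
            self.cost_basis = total_cost / max(self.shares_held, 1e-8)
        elif act_type < 2:  # sell: liquidate a fraction of the position
            shares = self.shares_held * fraction
            self.shares_held -= shares
            self.balance += shares * price
            self.total_sales += shares * price
        self.net_worth = self.balance + self.shares_held * price
        self.current_step += 1
        done = self.current_step >= len(self.df) - 1 or self.net_worth <= 0
        # One simple form of time weighting: later profit counts for more,
        # which rewards policies that survive the whole episode.
        profit = self.net_worth - self.initial_balance
        reward = profit * (self.current_step / len(self.df))
        return self._get_obs(), reward, done, {}


def sharpe_ratio(pnl_increments, reference_rate=0.0):
    """Sharpe over incremental P&L against a fixed reference rate."""
    excess = np.asarray(pnl_increments) - reference_rate
    return excess.mean() / (excess.std() + 1e-8)


def max_drawdown(cumulative_profit):
    """Peak-to-trough decline of the cumulative profit curve."""
    series = np.asarray(cumulative_profit)
    running_peak = np.maximum.accumulate(series)
    return (running_peak - series).max()


def holding_ratio(buy_count, sell_count):
    """Share of cumulative trades that stayed invested (illustrative definition)."""
    total = buy_count + sell_count
    return buy_count / total if total else 0.0
```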
Training Loop
- One PPO (`MlpPolicy`) model per asset with configurable timesteps (see the training sketch after this list).
- During training I log per-step metrics and then generate plots for:
- Profit curves by asset
- Holding ratio by asset and the implied cash ratio
- Sharpe ratios vs a reference line
- I repeat the same plots for validation and test to check generalization.
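A sketch of the per-asset training and replay loop, assuming the `load_and_split` and `StockTradingEnv` sketches above and an SB3 version compatible with the classic Gym API; the timestep budget is a placeholder.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

ASSETS = ["AAPL", "MSFT", "GOOGL", "AMZN", "SPLG"]
models = {}

for symbol in ASSETS:
    train_df, val_df, test_df = load_and_split(symbol)

    # One PPO policy per asset, trained in a single-worker DummyVecEnv.
    train_env = DummyVecEnv([lambda df=train_df: StockTradingEnv(df)])
    model = PPO("MlpPolicy", train_env, verbose=0)
    model.learn(total_timesteps=50_000)  # placeholder budget
    models[symbol] = model

    # Replay the trained policy on a held-out split and collect diagnostics.
    eval_env = StockTradingEnv(test_df)
    obs, done, net_worths = eval_env.reset(), False, []
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = eval_env.step(action)
        net_worths.append(eval_env.net_worth)
```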
Baseline Agents
To ensure my RL plumbing behaved as expected, I trained A2C and DDPG on CartPole-v1 and Pendulum-v1 respectively and compared average episodic rewards against PPO. This kept model/eval code paths honest before running longer equity experiments.
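A sanity-check sketch in the same spirit, assuming SB3's A2C and DDPG with default hyperparameters; the timestep budgets are placeholders.

```python
import gym
from stable_baselines3 import A2C, DDPG
from stable_baselines3.common.evaluation import evaluate_policy

# A2C on a discrete control task, DDPG on a continuous one.
a2c = A2C("MlpPolicy", gym.make("CartPole-v1"), verbose=0)
a2c.learn(total_timesteps=20_000)
print("A2C CartPole:", evaluate_policy(a2c, gym.make("CartPole-v1"), n_eval_episodes=10))

ddpg = DDPG("MlpPolicy", gym.make("Pendulum-v1"), verbose=0)
ddpg.learn(total_timesteps=20_000)
print("DDPG Pendulum:", evaluate_policy(ddpg, gym.make("Pendulum-v1"), n_eval_episodes=10))
```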
What Worked Well
- Asset-wise isolation made debugging easier. By training a separate PPO per ticker, I could attribute behavior to the policy and the series without cross-asset confounding.
- Interpretable diagnostics helped explain decisions. Profit, Sharpe, drawdown, and the holding ratio together paint a clear picture of stability vs aggression.
- Vectorized env wrapper (`DummyVecEnv`) removed friction when moving between training, validation, and testing.
Findings
- Policies tended to stabilize around consistent holding patterns for high-liquidity names like AAPL and MSFT.
- The holding ratio vs cash plot made it obvious when the agent preferred staying risk-off.
- Sharpe and drawdown moved in the expected directions when rewards favored survivability over short bursts of profit.
I did not optimize for absolute return because the reward emphasized stability and learning dynamics rather than alpha generation.
Limitations and Next Steps
Limitations
- No transaction costs or slippage.
- Actions are learned per asset instead of a joint portfolio vector, so cross-asset allocation is implicit rather than optimized globally.
- Price execution is simulated within the bar, which is convenient but idealized.
Next steps
- Move to a multi-asset joint environment with a simplex-constrained allocation vector.
- Add fees, slippage, and position limits.
- Replace the time-weighted reward with a risk-adjusted return objective that combines profit, drawdown, and volatility more rigorously.
- Use walk-forward evaluation and `EvalCallback` from SB3 for more principled early stopping (see the sketch after this list).
- Engineer features such as rolling returns, ATR, and cross-sectional signals to improve the state representation.
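A sketch of what `EvalCallback`-based early stopping could look like, assuming a validation split wrapped in the same `StockTradingEnv` sketch as above; the checkpoint path, evaluation frequency, and timestep budget are placeholders.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.vec_env import DummyVecEnv

train_env = DummyVecEnv([lambda: StockTradingEnv(train_df)])
val_env = DummyVecEnv([lambda: StockTradingEnv(val_df)])

eval_callback = EvalCallback(
    val_env,
    best_model_save_path="./checkpoints/",  # keep the best policy seen on validation
    eval_freq=5_000,                        # evaluate every 5k training steps
    n_eval_episodes=3,
    deterministic=True,
)
model = PPO("MlpPolicy", train_env, verbose=0)
model.learn(total_timesteps=100_000, callback=eval_callback)
```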
How to Reproduce
- Install: `pip install stable-baselines3 gym yfinance pandas numpy matplotlib torch`
- Set `ASSETS`, `START_DATE`, `END_DATE`, and `run_times` (an illustrative configuration block follows this list).
- Run the notebook or script to download data, train PPO per asset, and render plots for train, validation, and test.
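An illustrative configuration block; the variable names follow the ones mentioned above, but the defaults and the interpretation of `run_times` are assumptions.

```python
ASSETS = ["AAPL", "MSFT", "GOOGL", "AMZN", "SPLG"]
START_DATE = "2018-01-01"
END_DATE = "2022-12-31"
run_times = 50_000  # assumed to be the PPO training timesteps per asset
```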