StockBuyerX

A reinforcement-learning portfolio simulator with a custom Gym environment, PPO baseline, and risk-aware diagnostics across large-cap equities and an S&P 500 ETF.

Overview

StockBuyerX is a full pipeline for training and evaluating reinforcement learning agents on daily equity data. I built a custom OpenAI Gym environment that models portfolio actions, cash, positions, and mark-to-market P&L, then trained a PPO policy per asset. The project tracks profit, Sharpe ratio, maximum drawdown, and an interpretable holding ratio so I can compare behavior across assets and splits.

  • Symbols: AAPL, MSFT, GOOGL, AMZN, SPLG
  • Period: 2018-01-01 to 2022-12-31
  • Stack: Python, Gym, Stable-Baselines3 (PPO, A2C, DDPG), PyTorch, yfinance, NumPy, Matplotlib

Academic simulation on historical data. For research only, not financial advice.


System Design

  1. Data: Download daily OHLCV via yfinance, forward/backward fill gaps, and split into 70% train, 15% validation, and 15% test (a minimal sketch of this step follows the list).
  2. Environment: StockTradingEnv exposes a continuous action space that encodes a Buy/Sell/Hold choice plus a trade fraction, and returns a compact observation that combines a 5-day OHLCV window with portfolio state (cash, net worth, shares held, cost basis, realized sales).
  3. Reward and Diagnostics: The reward is time-weighted to encourage sustained survivability, and the environment logs profit, Sharpe, and max drawdown at every step. A derived holding ratio provides a human-readable signal of the agent's allocation behavior.
  4. Training and Evaluation: Train a PPO policy per asset inside a DummyVecEnv, then replay the policy on the validation and test splits. I also ran A2C and DDPG sanity checks on standard control tasks to benchmark the shared code paths.
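
A minimal sketch of the data step, assuming yfinance's default column layout and a sequential 70/15/15 split; the helper name load_and_split is illustrative, not necessarily the project's actual function.

    import pandas as pd
    import yfinance as yf

    def load_and_split(symbol, start="2018-01-01", end="2022-12-31"):
        """Download daily OHLCV, fill gaps, and return sequential 70/15/15 splits."""
        df = yf.download(symbol, start=start, end=end, auto_adjust=False)
        if isinstance(df.columns, pd.MultiIndex):        # newer yfinance may nest columns
            df.columns = df.columns.get_level_values(0)
        df = df.ffill().bfill()                          # forward fill, then backward fill
        n = len(df)
        train = df.iloc[:int(n * 0.70)]                  # 70% train
        val = df.iloc[int(n * 0.70):int(n * 0.85)]       # 15% validation
        test = df.iloc[int(n * 0.85):]                   # 15% test
        return train, val, test

    train_df, val_df, test_df = load_and_split("AAPL")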


Key Implementation Details

Custom Gym Environment

  • Observation: a 6×6 tensor composed of five rows of scaled OHLCV for the last five bars plus one row of portfolio state features.
  • Action space: Box(low=[0,0], high=[3,1]). The first value selects Buy, Sell, or Hold; the second is the fraction to apply.
  • Execution model: the environment samples an execution price within the time step’s range and updates balance, shares_held, cost_basis, and net_worth (a condensed sketch of the environment follows this list).
  • Metrics:

    • Profit: running net worth relative to initial capital.
    • Sharpe ratio: computed from incremental P&L with a fixed reference rate for comparison.
    • Max drawdown: peak-to-trough on cumulative profit.
    • Holding ratio: share of cumulative trades that remain invested, visualized over time.
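
The class name StockTradingEnv and the interfaces above come from the project; the body below is a condensed, assumption-laden sketch of how such an environment can be wired (classic Gym API, assumed OHLCV column names including Adj Close, crude per-window scaling, simplified time-weighted reward), not the exact implementation. Sharpe and holding-ratio logging are omitted for brevity.

    import gym
    import numpy as np
    from gym import spaces

    class StockTradingEnv(gym.Env):
        """Single-asset trading environment (condensed, illustrative sketch)."""

        def __init__(self, df, initial_balance=10_000):
            super().__init__()
            self.df = df.reset_index(drop=True)
            self.initial_balance = initial_balance
            # action[0] in [0, 3]: buy / sell / hold bucket; action[1]: fraction to apply
            self.action_space = spaces.Box(low=np.array([0.0, 0.0]),
                                           high=np.array([3.0, 1.0]), dtype=np.float32)
            # five rows of scaled price/volume history plus one row of portfolio state
            self.observation_space = spaces.Box(low=0.0, high=np.inf,
                                                shape=(6, 6), dtype=np.float32)

        def reset(self):
            self.balance = self.initial_balance
            self.shares_held = 0.0
            self.cost_basis = 0.0
            self.total_sales = 0.0
            self.net_worth = self.initial_balance
            self.max_net_worth = self.initial_balance
            self.max_drawdown = 0.0
            self.current_step = 5            # a 5-bar history window must already exist
            return self._observation()

        def step(self, action):
            action = np.clip(action, self.action_space.low, self.action_space.high)
            kind, fraction = int(action[0]), float(action[1])
            row = self.df.iloc[self.current_step]
            price = np.random.uniform(row["Low"], row["High"])   # execute within the bar's range
            if kind == 0 and self.balance > 0:                   # buy a fraction of cash
                spend = self.balance * fraction
                bought = spend / price
                total = self.shares_held + bought
                self.cost_basis = (self.cost_basis * self.shares_held + spend) / max(total, 1e-9)
                self.shares_held = total
                self.balance -= spend
            elif kind == 1 and self.shares_held > 0:             # sell a fraction of shares
                sold = self.shares_held * fraction
                self.shares_held -= sold
                self.balance += sold * price
                self.total_sales += sold * price
            # mark-to-market and running diagnostics (drawdown measured on net worth here)
            self.net_worth = self.balance + self.shares_held * price
            self.max_net_worth = max(self.max_net_worth, self.net_worth)
            profit = self.net_worth - self.initial_balance
            self.max_drawdown = max(self.max_drawdown,
                                    (self.max_net_worth - self.net_worth) / self.max_net_worth)
            # simplified time-weighted reward: profit earned late in the episode counts more
            reward = profit * (self.current_step / len(self.df))
            self.current_step += 1
            done = self.net_worth <= 0 or self.current_step >= len(self.df)
            info = {"profit": profit, "max_drawdown": self.max_drawdown}
            return self._observation(), reward, done, info

        def _observation(self):
            window = self.df.iloc[self.current_step - 5:self.current_step]
            cols = ["Open", "High", "Low", "Close", "Adj Close", "Volume"]   # assumed columns
            prices = window[cols].to_numpy(dtype=np.float64)
            prices = prices / (prices.max(axis=0) + 1e-9)        # crude per-window scaling
            state = np.array([[self.balance, self.net_worth, self.shares_held,
                               self.cost_basis, self.total_sales, self.max_net_worth]])
            return np.vstack([prices, state]).astype(np.float32)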

Training Loop

  • One PPO (MlpPolicy) model per asset with a configurable number of training timesteps (a sketch of the loop follows this list).
  • During training I log per-step metrics and then generate plots for:

    • Profit curves by asset
    • Holding ratio by asset and the implied cash ratio
    • Sharpe ratios vs a reference line
  • I repeat the same plots for validation and test to check generalization.
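
A compact sketch of the per-asset train/evaluate loop, reusing the load_and_split helper and the StockTradingEnv sketch above; the timestep budget and deterministic replay are placeholder choices, and plotting is omitted.

    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import DummyVecEnv

    models = {}
    for symbol in ["AAPL", "MSFT", "GOOGL", "AMZN", "SPLG"]:
        train_df, val_df, test_df = load_and_split(symbol)

        # one PPO model per asset, trained on the vectorized training environment
        vec_env = DummyVecEnv([lambda df=train_df: StockTradingEnv(df)])
        model = PPO("MlpPolicy", vec_env, verbose=0)
        model.learn(total_timesteps=50_000)              # configurable budget per asset
        models[symbol] = model

        # replay the frozen policy on a held-out split and collect diagnostics
        eval_env = StockTradingEnv(test_df)
        obs, done, info = eval_env.reset(), False, {}
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, info = eval_env.step(action)
        print(symbol, "profit:", info["profit"], "max drawdown:", info["max_drawdown"])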

Baseline Agents

To ensure my RL plumbing behaved as expected, I trained A2C on CartPole-v1 and DDPG on Pendulum-v1 and compared average episodic rewards against PPO. This kept the model and evaluation code paths honest before running longer equity experiments.
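
A minimal version of those sanity checks, assuming SB3's built-in evaluation helper and classic Gym environment IDs; the timestep counts are illustrative.

    import gym
    from stable_baselines3 import A2C, DDPG
    from stable_baselines3.common.evaluation import evaluate_policy

    # A2C on a discrete control task
    a2c = A2C("MlpPolicy", gym.make("CartPole-v1"), verbose=0)
    a2c.learn(total_timesteps=20_000)
    mean_a2c, std_a2c = evaluate_policy(a2c, gym.make("CartPole-v1"), n_eval_episodes=10)

    # DDPG on a continuous control task
    ddpg = DDPG("MlpPolicy", gym.make("Pendulum-v1"), verbose=0)
    ddpg.learn(total_timesteps=20_000)
    mean_ddpg, std_ddpg = evaluate_policy(ddpg, gym.make("Pendulum-v1"), n_eval_episodes=10)

    print(f"A2C/CartPole: {mean_a2c:.1f} ± {std_a2c:.1f}   DDPG/Pendulum: {mean_ddpg:.1f} ± {std_ddpg:.1f}")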


What Worked Well

  • Asset-wise isolation made debugging easier. By training a separate PPO per ticker, I could attribute behavior to the policy and the series without cross-asset confounding.
  • Interpretable diagnostics helped explain decisions. Profit, Sharpe, drawdown, and the holding ratio together paint a clear picture of stability vs aggression.
  • Vectorized env wrapper (DummyVecEnv) removed friction when moving between training, validation, and testing.


Findings

  • Policies tended to stabilize around consistent holding patterns for high-liquidity names like AAPL and MSFT.
  • The holding ratio vs cash plot made it obvious when the agent preferred staying risk-off.
  • Sharpe and drawdown moved in the expected directions when rewards favored survivability over short bursts of profit.

I did not optimize for absolute return because the reward emphasized stability and learning dynamics rather than alpha generation.


Limitations and Next Steps

Limitations

  • No transaction costs or slippage.
  • Actions are learned per asset instead of a joint portfolio vector, so cross-asset allocation is implicit rather than optimized globally.
  • Price execution is simulated within the bar, which is convenient but idealized.

Next steps

  • Move to a multi-asset joint environment with a simplex-constrained allocation vector (one possible weight mapping is sketched after this list).
  • Add fees, slippage, and position limits.
  • Replace the time-weighted reward with a risk-adjusted return objective that combines profit, drawdown, and volatility more rigorously.
  • Use walk-forward evaluation and EvalCallback from SB3 for more principled early stopping.
  • Engineer features such as rolling returns, ATR, and cross-sectional signals to improve state representation.
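
One possible shape for that joint action, shown purely as an assumption about the future design: a raw vector of length n_assets + 1 (assets plus cash) pushed through a softmax so the resulting weights are non-negative and sum to one.

    import numpy as np

    def to_portfolio_weights(raw_action: np.ndarray) -> np.ndarray:
        """Map an unconstrained action vector onto the probability simplex."""
        z = raw_action - raw_action.max()    # subtract the max for numerical stability
        w = np.exp(z)
        return w / w.sum()                   # non-negative weights summing to 1

    # e.g. five assets plus a cash slot
    weights = to_portfolio_weights(np.array([0.2, -1.0, 0.5, 0.0, 0.3, 0.1]))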


How to Reproduce

  1. Install: pip install stable-baselines3 gym yfinance pandas numpy matplotlib torch
  2. Set ASSETS, START_DATE, END_DATE, and run_times (example values below).
  3. Run the notebook or script to download data, train PPO per asset, and render plots for train, validation, and test.
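
An example configuration matching step 2; the names come from the project and the asset list and dates mirror this README, but the run_times value is a placeholder whose exact meaning (e.g. timestep budget per asset vs. number of repeated runs) should be taken from the project code.

    ASSETS = ["AAPL", "MSFT", "GOOGL", "AMZN", "SPLG"]
    START_DATE = "2018-01-01"
    END_DATE = "2022-12-31"
    run_times = 50_000   # placeholder; set according to the project's intended training budget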