Solving CartPole in 8 Weights

The entire brain

CartPole is the fruit fly of reinforcement learning. It is the environment everyone pokes first: four numbers come in, one of two actions goes out, and the pole either remains proudly upright or collapses into numerical embarrassment.

Most people arrive carrying a backpack full of machinery: replay buffers, target networks, value heads, entropy bonuses, annealing schedules, and enough acronyms to make a grant reviewer purr. But CartPole, if you look at it with the right kind of arrogance, is whispering something simpler.

It is almost linear. The state is already the feature vector. The action space is binary. The good controller is basically a sign test with taste.

state = [x, x_dot, theta, theta_dot]
scores = W · state
action = argmax(scores)

That is the whole policy. No hidden layer. No bias. No activation. No learned representation. Just one matrix staring directly into the physics and saying: push left, push right.
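For readers who want it in executable form, here is a minimal NumPy sketch of that policy. The function name linear_policy and the two sample states are illustrative, not taken from the demo's source; the matrix is the one printed further down, and the mapping 0 = push left, 1 = push right is assumed from the usual Gym CartPole convention.

```python
import numpy as np

# The 2 x 4 matrix from the article: row 0 scores action 0, row 1 scores action 1.
W = np.array([
    [-0.05, -0.25, -2.5, -0.5],
    [ 0.05,  0.25,  2.5,  0.5],
])

def linear_policy(W, state):
    """Return the index of the larger score (ties go to action 0)."""
    scores = W @ np.asarray(state, dtype=float)
    return int(np.argmax(scores))

# Illustrative states: pole leaning and rotating to the right, then to the left.
leaning_right = [0.0, 0.0, 0.05, 0.1]
leaning_left = [0.0, 0.0, -0.05, -0.1]

action_right = linear_policy(W, leaning_right)  # 1: push right, under the pole
action_left = linear_policy(W, leaning_left)    # 0: push left
print(action_right, action_left)
```

No bias and no activation means the decision boundary is a hyperplane through the origin of state space, which is all the sketch computes.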

The weights of a god

2 × 4 matrix

Here they are. Eight numbers. Retuned after the first matrix got bullied by the kick button. Etched into the page like a tiny mechanical scripture.

                  x          x_dot      theta      theta_dot
Action 0 score   −0.050000  −0.250000  −2.500000  −0.500000
Action 1 score    0.050000   0.250000   2.500000   0.500000

Read row zero as the score for action 0, row one as the score for action 1. Whichever score is larger wins. This is not deep learning. This is a bonsai lightning bolt.
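One way to read "whichever score is larger wins" for this particular matrix: row zero is the exact negative of row one, so the argmax should reduce to checking the sign of a single dot product. The sketch below checks that on random states. The random states, the seed, and the variable names are assumptions made for illustration only.

```python
import numpy as np

W = np.array([
    [-0.05, -0.25, -2.5, -0.5],
    [ 0.05,  0.25,  2.5,  0.5],
])
w = W[1]  # the action-1 row; the action-0 row is -w

rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 4))  # arbitrary test states, not real episodes

# Full policy: argmax over the two scores.
argmax_actions = np.argmax(states @ W.T, axis=1)

# Sign test: push right (1) when w . state > 0, otherwise push left (0).
sign_actions = (states @ w > 0).astype(int)

agree = bool(np.all(argmax_actions == sign_actions))
print("argmax and sign test agree:", agree)
```

This is the "sign test with taste" from earlier, made literal: the taste lives in the four numbers of w.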

The algorithm: evolution with a ruler

I trained the matrix with the Cross-Entropy Method, which is what happens when random search goes to grad school and learns statistics. Sample a population of matrices. Run episodes. Keep the elites. Move the Gaussian toward them. Repeat until the pole behaves.

for generation in range(num_generations):
    candidates = mean + std * randn(population, 2, 4)
    rewards = [evaluate(W) for W in candidates]
    elites = top_k(candidates, rewards)
    mean = average(elites)
    std = standard_deviation(elites)
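The loop above can be made runnable with a few lines of NumPy. This is a sketch under stated assumptions, not the training script behind the article: the population size, elite fraction, initial standard deviation, and the small floor added to std are all guesses, and toy_evaluate is a stand-in objective (distance to an arbitrary target matrix) so the example runs in a blink without a simulator. For CartPole, evaluate would instead return the episode reward earned by W.

```python
import numpy as np

def cross_entropy_method(evaluate, shape=(2, 4), population=50, elite_frac=0.2,
                         num_generations=30, init_std=1.0, seed=0):
    """Fit a diagonal Gaussian over weight matrices by refitting it to the elites."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(shape)
    std = np.full(shape, init_std)
    n_elite = max(2, int(population * elite_frac))
    for _ in range(num_generations):
        # Sample a population of candidate matrices around the current mean.
        candidates = mean + std * rng.standard_normal((population, *shape))
        rewards = np.array([evaluate(W) for W in candidates])
        # Keep the highest-reward candidates.
        elites = candidates[np.argsort(rewards)[-n_elite:]]
        # Move the Gaussian toward them.
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-3  # assumed floor so the search never fully collapses
    return mean

# Stand-in objective: higher is better, maximized at an arbitrary target matrix.
target = np.array([[-1.0, 0.0, 2.0, 0.5],
                   [ 1.0, 0.0, -2.0, -0.5]])

def toy_evaluate(W):
    return -float(np.sum((W - target) ** 2))

best = cross_entropy_method(toy_evaluate)
print(np.round(best, 2))
```

Swapping toy_evaluate for a function that runs one or more CartPole episodes and returns the average reward turns this into the search described here; averaging over several episodes is a common way to make the elite selection less noisy.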

This works because the search space is comically small. Eight dimensions is not a neural-network training problem. It is a hallway. You can stumble through it with the lights off and still find the kitchen.

The policy is so small you can print it on a sticker and put it on the robot.

Why it works

The pole angle and angular velocity dominate the decision. When the pole leans and rotates, the cart should accelerate under it. That is the ancient truth of CartPole. The learned matrix encodes that truth bluntly: the action-1 row gives large positive weight to theta and theta_dot, while the action-0 row pushes the other way.
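To see the dominance in numbers, the sketch below splits the action-1 score into per-feature contributions for one hand-picked state. The state is hypothetical, chosen so the cart terms and the pole terms disagree: the cart is left of center and drifting left, while the pole leans and rotates right.

```python
import numpy as np

W = np.array([
    [-0.05, -0.25, -2.5, -0.5],
    [ 0.05,  0.25,  2.5,  0.5],
])
names = ["x", "x_dot", "theta", "theta_dot"]

# Hypothetical state: cart at -1.0 moving left, pole at 0.1 rad rotating right.
state = np.array([-1.0, -0.5, 0.1, 0.5])

# Each feature's share of the action-1 score.
contributions = W[1] * state
cart_part = contributions[:2].sum()  # about -0.175, votes for pushing left
pole_part = contributions[2:].sum()  # about +0.5, votes for pushing right
action = int(np.argmax(W @ state))

for name, c in zip(names, contributions):
    print(f"{name:>9}: {c:+.3f}")
print("cart terms:", round(cart_part, 3), "pole terms:", round(pole_part, 3), "action:", action)
```

Even with the cart a full unit off center, the pole terms outvote the cart terms and the policy pushes right. The small x and x_dot weights only tip the decision when the pole is close to upright and nearly still.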

There is something aesthetically correct about this. A huge model can solve CartPole, sure. A transformer could probably write a tragic sonnet about the pole while solving it. But a two-row matrix solves it with no ceremony. It does not understand balance. It is balance, projected through eight coefficients.

Live demo: the matrix drives the cart

real policy, real dynamics

This is not an animation loop with vibes. The browser is stepping CartPole physics, feeding the state into the exact 2 × 4 matrix above, taking argmax(W · state), and drawing the result.

[Interactive demo: live readouts of episode reward, action, theta, and the two action scores.]

const W = [
  [ -0.050000, -0.250000, -2.500000, -0.500000 ],
  [  0.050000,  0.250000,  2.500000,  0.500000 ]
];

The state is [cart position, cart velocity, pole angle, pole angular velocity]. The simulator uses the classic CartPole constants: gravity 9.8, masscart 1.0, masspole 0.1, length 0.5 (in the classic implementation this is half the pole's length), force magnitude 10.0, timestep 0.02. “Kick it” applies a real perturbation of the size this eight-weight controller was retuned to recover from; “Chaos kick” deliberately hits it outside that regime.
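For anyone who wants to reproduce the demo offline, here is a sketch of the same loop in Python. The browser source is not shown here, so the details are assumptions: the update is the Euler step from the classic Gym CartPole implementation, termination uses the usual limits of 2.4 units and 12 degrees, episodes are capped at 500 steps, resets are uniform in [-0.05, 0.05], and the kick vector is an invented stand-in for whatever the "Kick it" button actually applies.

```python
import math
import numpy as np

GRAVITY, MASSCART, MASSPOLE = 9.8, 1.0, 0.1
LENGTH = 0.5                      # half the pole length, Gym convention (assumed)
FORCE_MAG, TAU = 10.0, 0.02
X_LIMIT, THETA_LIMIT = 2.4, 12 * math.pi / 180  # assumed termination limits
MAX_STEPS = 500                   # assumed CartPole-v1 episode cap

W = np.array([[-0.05, -0.25, -2.5, -0.5],
              [ 0.05,  0.25,  2.5,  0.5]])

def step(state, action):
    """One Euler step of the classic cart-pole equations."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = MASSCART + MASSPOLE
    polemass_length = MASSPOLE * LENGTH
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + polemass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - MASSPOLE * cos_t ** 2 / total_mass))
    x_acc = temp - polemass_length * theta_acc * cos_t / total_mass
    return np.array([x + TAU * x_dot, x_dot + TAU * x_acc,
                     theta + TAU * theta_dot, theta_dot + TAU * theta_acc])

def rollout(W, rng, kick_at=None, kick=None):
    """Run one episode with argmax(W @ state); return the number of steps survived."""
    state = rng.uniform(-0.05, 0.05, size=4)
    for t in range(MAX_STEPS):
        if kick_at is not None and t == kick_at:
            state = state + kick  # mid-episode perturbation
        action = int(np.argmax(W @ state))
        state = step(state, action)
        if abs(state[0]) > X_LIMIT or abs(state[2]) > THETA_LIMIT:
            return t + 1
    return MAX_STEPS

# Hypothetical kick to velocity, angle, and angular velocity at step 100.
example_kick = np.array([0.0, 0.5, 0.05, 0.5])
steps_plain = rollout(W, np.random.default_rng(0))
steps_kicked = rollout(W, np.random.default_rng(0), kick_at=100, kick=example_kick)
print(steps_plain, steps_kicked)
```

Running rollout over many seeds, with and without a kick, is one way to check the recovery claim yourself; how the numbers come out depends on the kick you choose.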

The result

On evaluation from standard CartPole-v1 reset states, this little matrix hits the environment ceiling of 500 steps per episode. I then made the browser test meaner by adding a real mid-episode kick to velocity, angular velocity, and angle. The retuned controller is still only eight weights, and the normal kick is now one it is designed to recover from.

At that point, further sophistication becomes decorative. You can add layers, losses, baselines, critics, and dashboards, but the task has already been reduced to its essence:

W = [[-0.050000, -0.250000, -2.500000, -0.500000],
     [ 0.050000,  0.250000,  2.500000,  0.500000]]

Conclusion

Solving CartPole in eight weights is a useful reminder that intelligence is not measured by parameter count. Sometimes the most beautiful solution is the one that has nowhere to hide.

Four state variables enter. Two action scores leave. Eight weights sit in the middle, glowing quietly.

The pole stands.