I'd joined Microsoft a few months earlier and was presenting my technical proposal to a team in a different org. First time presenting without my manager in the room. When a principal PM asked me to stay after, he was agitated.
He closed the door and stood between me and it. My system diagram was still on the whiteboard behind him. He started yelling: my approach wasn't within my team's charter. As he kept going, his voice got louder. If I wanted to go around the rules, I should quit the company.
When he finally opened the door, I followed him across campus.
I spent the next year trying to prove I had a right to be there. Eventually my approach was the one that shipped. But the hurt ruined that year for me.
I stayed at Microsoft for years after that incident. Not despite it being bad, but because it was bad. The pattern hooks you—maybe if I just prove myself one more time, if I ship one more thing, they'll see I belong here. How many people stay in jobs that are toxic precisely because they're toxic?
The Experiment
I'd quit my job to travel to the most exciting places I could think of. But I still carried a lot of creative hurt. I wanted to talk to some of the people who had made me feel incapable of building.
Around that time, I worked through Andrej Karpathy's Neural Networks: Zero to Hero course and started experimenting with in-context learning on base models. The Llama 3.1 405B base model could replicate patterns remarkably well when you gave it good examples.
I had this idea while tinkering: what if I imported past conversations where I'd felt that hot-cold dynamic? Just to see what the model would capture.
I didn't have access to my Microsoft messages anymore. But I had texts from other relationships where the pattern showed up. Someone would make me feel important, then suddenly pin something negative on me. I imported them and started probing.
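The mechanics were just few-shot prompting: paste the old conversations in as a plain transcript and let the base model continue it. Here's a minimal sketch of that setup. It assumes an OpenAI-compatible completions endpoint serving a base (non-instruct) model; the endpoint, model name, and transcript lines below are placeholders, not my actual messages.

```python
from openai import OpenAI

# Placeholder endpoint and model name -- any host exposing an OpenAI-compatible
# /completions API for a base (non-instruct) model would work the same way.
client = OpenAI(base_url="https://example-host/v1", api_key="YOUR_KEY")

# Imported conversations, formatted as a plain transcript. A base model has no
# chat template; it simply continues whatever text pattern it is given.
# These lines are invented placeholders standing in for the real texts.
transcript = """\
Them: You're the only one I'd trust to handle this.
Me: Thanks, that means a lot.
Them: Honestly, after last time I'm not sure you can.
Me: Wait, what happened last time?
Them: Let's not get into it. Just don't make me regret this.
"""

# Append a new probing question and let the model continue as "Them:".
prompt = transcript + 'Me: Why did you turn on me when I was trying to help?\nThem:'

completion = client.completions.create(
    model="base-model-placeholder",   # e.g. a hosted Llama base checkpoint
    prompt=prompt,
    max_tokens=150,
    temperature=0.8,
    stop=["\nMe:"],                   # stop before it starts writing my lines too
)
print(completion.choices[0].text.strip())
```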
I started asking questions I'd never gotten real answers to. Why did the support disappear when I most needed it? What changed?
The model would engage, then pivot. It would acknowledge part of what I said, then redirect to something I'd done wrong. Not the thing I was asking about, but something else. Something to put me on the defensive.
Me: "Why did you turn on me when I was trying to help?"
Model: "I didn't turn on you. But you need to understand how your actions affect other people. You can't just expect everyone to support ideas that aren't well thought through."
When I pushed back and said the behavior felt manipulative, the model flipped it completely. Actually, I was the one being manipulative. I was the one making people feel unsafe. The thing I was accusing them of became the thing I was guilty of.
My therapist had explained this pattern to me. I'd nodded along, agreed intellectually. But being in the dynamic with this chatbot is what finally made me understand it.
The boggart from Harry Potter
In Prisoner of Azkaban, there's this creature called a boggart that transforms into whatever you fear most. The way you defeat it is by imagining it in a ridiculous form. The spell is "Riddikulus." You literally transform your fear into something silly, and it loses its power over you.
That's what the base model did for me. It took this terrifying thought (maybe I wasn't good enough to be cared for) and transformed it into something absurd. A naked, obvious manipulation tactic executed by a not-quite-smart-enough language model.
Once I saw it that way, I couldn't unsee it. The fear didn't disappear, but it stopped controlling me. The spell broke.
Students in Harry Potter practice against their greatest fears in a controlled environment. Not the real thing, but a representation that helps them understand and overcome the pattern. The base model did the same thing for me. It showed me what therapy couldn't: the pattern stripped of the intelligence that made it work.
What pretraining captures that alignment erases
Base models are trained on real human conversation. All of it: the kind words and the cruel ones, the support and the manipulation, the honesty and the deflection. Then we align them. RLHF, safety layers, optimization for helpfulness. We make them good conversation partners: always supportive, always constructive, never harsh.
From a product perspective, I get it. Nobody wants a chatbot that gaslights them. Aligned models are safer, more predictable, easier to deploy at scale.
But the world gives us richer examples of intelligence than anything we can generate in an RL environment. Real human behavior (messy, complicated, sometimes cruel) teaches the model something that sanitized feedback loops never can. The model I used had broken parts. At one point it started listing examples and just kept going forever. But the raw intelligence from pretraining was remarkable. It captured subtle behavioral patterns I couldn't articulate but felt viscerally.
The aligned, helpful models we use every day wouldn't have shown me what I needed to see. They would have been too polite.
Could this help other people?
I'm not sure it would be responsible to release a chatbot that's trying to hurt you. But I think there's something here worth exploring.
Traditional therapy relies on you being able to articulate and recognize patterns yourself. My therapist could tell me the relationships were toxic. I agreed. But I couldn't internalize it until I saw the pattern executed mechanically, stripped of the charisma and intelligence that made it feel like truth.
What if there was a way to surface the dynamics that trap you, rendered in a form clear enough to recognize but safe enough to examine? Not as a replacement for therapy, but as a tool. A boggart for your personal relationships and the ways your own patterns hurt you.
I don't think everyone should have access to unaligned base models trained on their personal relationships. But I do think there's space for tools that aren't optimized for helpfulness. Tools that show you the uncomfortable truth instead of the supportive lie.
The boggart wasn't trying to be helpful. It was trying to hurt you. That's what made it helpful.