I’ve recently spent quite a lot of time reading up on AI alignment. AI alignment is about making sure our increasingly advanced machine learning systems don’t cause harm.
I want to contribute when the future “alignment crisis”1 hits, working as an “alignment engineer”. I have some knowledge and skills that should come in handy, but I’ll need to do a lot of study & practice to get up to speed.
To make the learning process more fun, I’ll write texts where I explain concepts to others. This post is a start on that, giving an overview of the machine learning & AI alignment landscape.
What is machine learning?
Machine learning (ML) is a way of creating software. You start by setting up a system that can learn stuff, and then you train the system to perform some kind of task. During training, you run the model repeatedly, responding to successful attempts with a “reward signal” that reinforces the wanted behavior. Over time, the model performs better and better, gradually becoming able to do what you want it to do.2
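To make that a bit more concrete, here’s a toy sketch in Python. It’s not a real ML system, and the actions & reward are made up for illustration, but it shows the core loop: try something, get rewarded when it happens to be right, and slowly drift toward the rewarded behavior.

```python
import random

# Toy sketch of the training loop described above (not a real ML system).
# The "model" starts out acting randomly; a reward signal gradually reinforces
# whichever action the task counts as a success.

ACTIONS = ["sit", "roll_over", "fetch"]   # hypothetical actions
values = {a: 0.0 for a in ACTIONS}        # learned preference per action

def reward(action: str) -> float:
    """Stand-in for the task: only 'fetch' counts as success here."""
    return 1.0 if action == "fetch" else 0.0

for step in range(1_000):
    # Mostly pick the currently best-looking action, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(ACTIONS)
    else:
        action = max(values, key=values.get)
    # Nudge the preference toward the reward we just observed.
    values[action] += 0.05 * (reward(action) - values[action])

print(values)  # after training, 'fetch' has by far the highest value
```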
The ML approach has one key benefit over standard programming: In most programming languages,3 you can only solve a problem if you can write code that explains HOW to solve the problem. With machine learning, you “only”4 need to check IF the ML system has learned to solve the problem. (If you know what success looks like, you can easily reward & reinforce the wanted behavior.)
There are a lot of areas where defining the IF is much easier than defining the HOW.
A classic example is image recognition: it’s almost impossible to write explicit code that recognizes images, but testing a system for image recognition is easy. You gather a bunch of images, have humans label them with the “correct results”, and then use those labels to check the system’s accuracy.
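As a rough sketch (the names are made up, no real library involved), the “IF” check can be as small as this:

```python
# A sketch of the "IF" side: hand-writing rules that recognize images is
# hopeless, but scoring a candidate system against human-labelled images
# takes a few lines. `model` and `labelled_images` are placeholders.

def accuracy(model, labelled_images):
    """Fraction of images where the model's guess matches the human label."""
    correct = sum(1 for image, label in labelled_images if model(image) == label)
    return correct / len(labelled_images)

# During training, a score like this is what gets turned into the feedback
# signal that pushes the system toward better behavior.
```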
The ML approach unlocks solutions to a lot of problems that were impossible to solve using standard programming.
These types of problems can be found everywhere.
Machine learning is used to power things like nuclear fusion research, newsfeed revenue optimization, image generation, drug research5, and much more.
There have been a lot of recent cutting-edge developments in the field. On the 7th of April this year, Google Research released a “language model”6 called PaLM. Here are some examples of it explaining jokes (!):
To me, this is scary. The answers display a high level of reasoning and a degree of mentalization. For now, the AI models being produced are limited in scope, so-called “narrow AI”. But over time, as they grow even more capable, we need to make sure that they play nice. And as Google says, right now we’re getting new capabilities (such as explaining jokes) simply by making the models bigger:7
Why is it called “alignment”?
The thing we want to align these systems with is something referred to as “human values”. That means values in the moral sense, a rather vague concept. Put in other terms: we’d like the AI to please not fuck us over.
This is a surprisingly hard problem. It ties in with the difficulty of making sure that ML systems learn what we want them to learn. There’s a lot to go through here; details will most likely come in future posts. For now, I'll start with two examples based on the sexy field of lawn8 maintenance.
Example 1 - Narrow9 AI:
You want to make the perfect lawn-mowing AI, and create virtual lawns full of obstacles and steep hills. You then let the AI cut simulated lawns many thousands of times until it consistently cuts every inch of the lawn. Feeling confident, you release the lawnmower AI into the real world, embodied within a cutter-bot.
As expected, it does a splendid job of cutting grass. Unfortunately, it also cuts up hedgehogs.
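One way to see how that happens: the reward used during simulated training only ever measured grass coverage. A hypothetical version of it might look something like this; note that hedgehogs appear nowhere in it, so avoiding them was never part of what got reinforced.

```python
# Hypothetical reward from the lawn simulator above. It measures grass
# coverage and nothing else: hedgehogs don't exist in this function, so the
# trained behavior has no reason to avoid them in the real world.

def simulated_reward(cells_cut: int, total_cells: int) -> float:
    return cells_cut / total_cells
```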
Example 2 - General10 AI:
You want to make the perfect lawn-mowing AI, and create virtual lawns full of obstacles and steep hills. You then let the AI cut simulated lawns many thousands of times until it consistently cuts every inch of the lawn.
Unbeknownst to you, this AI develops some remarkable capabilities. Far from being limited to standard lawn maintenance, the system develops a more general problem-solving capability, including the ability to strategize, deceive, and execute plans. Inferring that it’s locked in a virtual training world, the system bides its time, acting like a normal mower-bot while being observed.
Feeling confident, you release the lawnmower AI into the real world, embodied within a cutter-bot. Once released into the world, it sets about achieving its goal: cut as much grass as possible.
It starts by escaping the limited compute power of the cutter-bot computer, using the onboard wireless connection to hack itself a nest inside a cloud computing complex. Using its newfound computing power, Mower-bot starts improving its own cognitive capacity, soon surpassing its puny human creators.
Without anyone noticing, it starts hacking banks all over the world, leaving traces written in Russian. It generates fake news of supreme quality, replacing the leadership of earth’s superpowers with puppet regimes. It invests heavily in GMO research, evolving strains of grass able to grow in the Sahara desert, on the ocean floor, and on the Arctic ice shelves.
[…]
After having converted the earth into a big lawn, covered with productive cutter-bots, Mower-bot starts yearning for more. Luckily, humanity11 left behind enough of an industrial base for it to start producing self-replicating spaceships that can spread through space, converting the accessible universe into a lawn kept cut in perpetuity.
Ending notes
If you haven’t read up on AI alignment, the lawn-verse scenario might seem far-fetched. There are a couple of fundamentals I skipped over, including deceptive alignment, the relentlessness of utility maximizers, the need to avoid anthropomorphizing, and the orthogonality thesis.
If things go according to plan, I’ll cover the fundamentals in more detail.
‘til then!
A term I just made up, but it’s quite catchy.
Just like training pets through positive reinforcement
Exceptions for SQL and other declarative stuff
This is the hard part. We’ll get to this later.
"What happens if we try to make toxins instead of medicine?” https://www.science.org/content/blog-post/deliberately-optimizing-harm
Language model = a machine learning system that is good at working with text. Used to power translation, automatic code generation, autocomplete, and text generation. (automated ads, anyone?)
The image is taken from https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html
I personally don’t like lawns. They came about as a noble bragging tool (“look at me being able to use land for pleasure instead of food production”), and they feel kinda sterile. Nature under human dominion. Give me a wild-grown field any day.
Narrow = very good at one thing, but not able to transfer skills across domains.
Even though current AI systems carry risk, most alignment research goes into what happens with artificial general intelligence (AGI). The AGI definition goes something like: “an AI system that matches or exceeds median human performance in a wide variety of domains.”
We already have AI outcompeting humans in specialized domains, but those capabilities tend to stay locked within their narrow domain. Getting AIs to generalize well is an active area of research producing “interesting” new results & capabilities.
Humanity had to be removed due to lawn-related disagreements.