What is AI alignment?
“AI alignment” means different things to different people.
My preferred definition, following many others, is:
An AI system is aligned if it does what its operator intends.
Why not just say that an AI system works if it does what its operator intends?
The reason is that AI systems are more agentic than ordinary software. To understand their behaviour, we adopt the intentional stance; we explain their behaviour partly by ascribing values to them.
For an AI system to do what its operator intends, it must share values with its operator to a sufficient degree. Values do not need to be perfectly shared for the system to do what the operator intends.
The same holds for a manager-employee relationship. In such a relationship, values are not perfectly shared, but they’re shared enough that the employee does what the manager intends.
How closely do values need to be shared for an AI system to do what its operator intends? And how easily can we build such systems? That’s mostly an empirical question, and it depends on the use case.
The case of GPT-4 is somewhat encouraging: the system is remarkably good at understanding user intent; better, in fact, than many human colleagues.
Tyler Cowen’s core view on AI
My core view: if we humans can get more intelligence into our hands, if we can’t turn that into a positive… what is it we’re hoping for? This is our big chance! Let’s not blow it.
Digital Minds or Butlerian Accord? (Part 1)
Two scenarios for 2100 that might be choiceworthy:
- Digital Minds: digital minds rule the Earth. There was a peaceful transition of power; humans think the 21st century went well and are optimistic about the future.
- Butlerian Accord: biological humans rule the Earth. They have agreed not to develop AI systems that replace them, and to regulate other technologies that greatly increase extinction risk.
Which would be preferable? Would either be acceptable?
Let’s imagine each of these scenarios has come to pass (bracketing the question of how it actually happened).
If you like the Digital Minds scenario, you may think some or all of:
- Digital minds will live far better lives than biological humans; there will also be far more of them.
- Biological humans are overall better off in this scenario.
- Digital minds will have values somewhat (or very) different from our own, and that’s fine.
- We should think of digital minds as our descendants, not as an alien species.
- It would be good if our descendants become grabby and occupy a large local chunk of the Universe, rather than letting other civilisations take our place. Reaching “Digital Minds” soon increases the chance of this.
If you dislike the Digital Minds scenario, you may think some or all of:
- Digital Minds will not be conscious, or otherwise capable of wellbeing.
- Digital Minds will have values somewhat (or very) different from our own, and those values will be worse, objectively speaking.
- Digital Minds will have values somewhat (or very) different from our own, and I don’t like that.
- We should think of digital minds as an “other” that we dislike.
- Digital Minds will experience extraordinary suffering, or other kinds of disvalue.
If you like the Butlerian Accord scenario, you may think some or all of:
- The human condition is already pretty good at 21st century levels of technology. Same goes for non-human animals.
- A human civilisation based on a Butlerian Accord can be stable, and flourish for millennia.
- We should shift to “Digital Minds” eventually, but not before we’ve thought long and hard about how to maximise the chance that shift goes well.
If you dislike the Butlerian Accord scenario, you may think some or all of:
- We currently live in a Darwinian hellscape; superintelligent digital minds are the only way out of that.
- A Butlerian Accord would not last.
- …
What other big reasons are there to like or dislike each scenario?
Who will rule the Earth in 2100?
Scenarios for 2100, with my probability guesstimates:
The shift from (1) to (2) or (3) might happen—or at least get baked in—within 20 years. I guess 10-30% chance of that.
The short story for (2) goes: the capabilities of AI systems surpass those of humans; AI systems pursue their own values.
The main ways that (1) might continue:
(20%) A global agreement prevents the development of the most dangerous AI systems.
(10%) A global disaster sets back the development of AI systems (along with much else) by decades. E.g. a major nuclear war.
(<10%) A technological barrier prevents superhuman AI before 2100.
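As a quick sanity check (my own sketch, not part of the original list): if these three routes are treated as roughly mutually exclusive, the guesstimates sum to an upper bound of about 40% on the chance that scenario (1) continues to 2100. The route labels below paraphrase the list above.

```python
# Back-of-the-envelope tally of the routes by which scenario (1) might continue.
# Assumption (mine, not the post's): the routes are roughly mutually exclusive,
# so their guesstimated probabilities can simply be summed.

routes = {
    "global agreement prevents the most dangerous AI": 0.20,
    "global disaster sets AI development back by decades": 0.10,
    "technological barrier blocks superhuman AI before 2100": 0.10,  # stated as "<10%"; taken at its upper bound
}

p_scenario_1_continues = sum(routes.values())
print(f"Upper-bound chance that (1) continues to 2100: ~{p_scenario_1_continues:.0%}")
```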
How might we get (3)?
- Something causes human extinction before digital minds can sustain themselves without us.
- Aliens.
- Simulation ends.
- ???
As Holden says, we should not expect business as usual.
By “rule the Earth”, I mean: have the most influence over what happens on Earth, compared to any other kind of agent. The idea of “agency” is hard to pin down; the key thing is that insofar as humans and animals currently have agency, digital minds will have it too.↩︎
Humans and animals have minds which run on biological hardware. Digital minds run on silicon or other non-biological hardware.↩︎
This might be a desirable outcome. We could think of digital minds as our descendants, and wish them well just as we do our children and grandchildren. They might live flourishing lives and wish us well too, supporting us for a wonderful retirement. It is very unclear whether we should use a “raising children” or “alien invasion” frame when we think about the development of AI.↩︎
If we pull this off, and avoid (1b), then we’ll probably be at least 10x richer and biological humans will be far more “transhuman” than we are now (e.g. enjoy healthy lives for hundreds of years).↩︎
When you’re writing to learn, very short posts are fine
When I draft a post for this journal, I often write several good paragraphs and then run into a lot of questions. I start to think about how I can make my writing more understandable. An hour or two later, I’m out of steam, and it’s time to move on with my day.
Why? This is a study journal. I’m mostly writing for myself, and, on occasion, a small audience of high-context peers.
The journal format is partly about “getting thoughts out of my head” so that they are easier to interrogate. It’s also about providing a partial record of thoughts I’d like to reconnect with later.
I’m going to try going hard on this style for a few weeks.
I’ll write the journal I want to have.
@mfweof and @RokoMijic on naturalistic metaethics
Anything moral is only moral because it won, and it won for some reason other than being moral.
The past is not more ethical than the present. The past is (relative to current standards) less ethical, but so is the future!
The maintenance of present-day human values is bound to present-day technology in the same way that the maintenance of medieval hierarchies was bound to the function of knights on the battlefield. Technology is not value-neutral, so if you want to upgrade your technology without destroying your values, you need to do some kind of alignment work. You need to constrain that technology so that it doesn’t do the thing it wants to do in the most efficient way possible: you need to make a gun that only knights can use.
See also: Robin Hanson, Most AI fear is Future Fear.