“AI alignment” means different things to different people.
My preferred definition, following many others, is:
An AI system is aligned if it does what its operator intends.
Why not just say that an AI system works if it does what its operator intends?
The reason is that AI systems are more agentic than ordinary software. To understand their behaviour, we adopt the intentional stance: we explain their behaviour partly by ascribing values to them.
For an AI system to do what its operator intends, it must share values with its operator to a sufficient degree. Values do not need to be perfectly shared for the system to do what the operator intends.
The same holds for a manager-employee relationship. In such a relationship, values are not perfectly shared, but they’re shared enough that the employee does what the manager intends.
How closely do values need to be shared for an AI system to do what its operator intends? And how easily can we build such systems? These are mostly empirical questions, and the answers depend on the use case.
The case of GPT-4 is somewhat encouraging: the system is remarkably good at understanding user intent; better, in fact, than many human colleagues are at understanding a manager's intent.