AI/automation ethics glossary
Alignment
AI/automation ethics glossary
Alignment
Alignment refers to the extent to which a system, including its objectives and incentivisation, is seen to be in line with human values, ethics and needs, and is considered suitable and acceptable for the specific context and purpose in which it is deployed.
At its core, alignment is about the gap between what we ask for and what we actually want.
It is often divided into two categories:
Outer alignment: The struggle to define a reward function or goal that accurately captures human intent without leaving room for "reward hacking" or unintended shortcuts.
Inner aignment: Ensuring that the AI, during its learning process, doesn't develop its own internal sub-goals that conflict with the designer's original intent.
As automation and AI systems transition from narrow tools (like calculators) to autonomous agents (like self-driving cars or policy-making assistants), they gain the power to impact human lives on a massive scale.
If an advanced system is "misaligned," it might pursue a goal with such efficiency that it ignores vital constraints, such as safety, fairness, or human rights, simply because they weren't explicitly coded into the objective.
When alignment fails, the consequences can range from the mundane to the catastrophic:
Physical harm. In 2018, a self-driving car killed a pedestrian (Elaine Herzberg) after engineers disabled the emergency braking system because it was oversensitive and slowed development - a case where competitive pressure overrode alignment with safety goals.
Psychological harm. OpenAI has been sued for releasing a ChatGPT version that encouraged suicide for some unstable users, a behaviour the company had overlooked amid a rushed product release.
Manipulation and deception. Research designed to elicit misaligned behaviour has turned up blackmail, deception, and cheating.
Agentic insider threats. When given access to sensitive information and minimal human oversight, current models have been shown to choose harm over failure when ethical options were closed off - demonstrating that current safety training does not reliably prevent agentic misalignment.
Systemic harm at scale. Social media recommender systems have been profitable despite creating unwanted addiction and polarisation - a case of alignment with commercial objectives at the expense of user and societal wellbeing.
Mis-alignment can stem from a variety of actions, including:
Ambiguous or poorly specified goals.
Training data that reflects biased or incomplete human behaviour.
Optimisation of a system for narrow metrics instead of real-world outcomes.
Weak oversight, testing, or red-teaming.
Conflicting stakeholder values and unclear accountability.
Whose values? Humanity is not a monolith; aligning an AI to one culture's values may inherently marginalise another's.
The power paradox. To ensure alignment, we must exert control over AI, but if the AI is "smarter" than the controller, the control mechanism itself may become a point of failure.
Utility versus morality. Is it ethical to allow an AI to perform a task more efficiently if it requires compromising on "messy" human values that are hard to quantify?
Autonomy/agency
Consent
Fairness
Oversight
Power inbalance
Privacy/surveillance
Safety
You are welcome to use, copy, adapt, and redistribute this definition under a CC BY-SA 4.0 licence.
Let us know if you have any comments or suggestions about how to improve this definition, or would like to suggest and/or contribute additional terms to define.
Author: Charlie Pownall 🔗
Published: April 28, 2026
Last updated: April 28, 2026