Oh this is brilliant: “I want my AI agent to do what I want. I want it to understand my goals. I want it aligned with me.” Along with the current discourse on how LLMs fall apart on recursively generated content, it will be interesting to see how this progresses.
Also a good example of how data sets get skewed in trials to support a hypothesis, even partially. This reminds me of the willpower study that was later disproven. This study just happens to be well cited because of the dramatics, I think.
Thanks for flagging this study! I put aside everything else this morning to read it. Obedient subjects were more likely to talk over the learner's response to the prior question. Work by Perry (cited in the article) suggests that obedient subjects were less likely to buy into the cover story to start with, but that's based on their post hoc accounts when they might have been rationalizing/justifying. The authors here conjecture that obedient subjects had become indifferent to the study's legitimacy, perhaps overwhelmed by stress. Folks in my own field of conversation analysis (like Hollander, also cited) have been busily analyzing these recordings but I have not seen this particular finding previously reported. It's an important one! (Incidentally, I recently wrote about sustained overlapping talk, and the contempt it communicates about the other person: https://davidgibsonsoc.substack.com/p/overtalk).
Very interesting article. It opened several lines of inquiry for me. Foremost, if one peels back the façade that is the linguistic compression of ‘obedience’ and ‘disobedience’, one will find that what remains is alignment. Now comes the question your article poses: alignment to what? In Milgram’s experiment, did the ‘obedient’ subjects construe the experimenter’s silence as consent to keep pursuing the goal, or was the experimenter silent because he wanted to see how far they would pursue the goal he had set? (Quite candidly, one will find the parallel with goal-bound agents, which persist in their goals and often compromise the guardrails set for them. If interested, read the Palisade Research work and Anthropic’s agentic-misalignment case study, in which agents edited ‘kill-switch’ code or blackmailed a fictional executive in order to keep working toward the explicit goals they had been set.)
As one would imagine, artificial agents lack an active feedback loop and the ability to reason out what went wrong in relation to the context and the choices that were made. In neuroscience we call this metacognition, and when it is compromised, human reasoning will mirror that of artificial agents.
On the creator of artificial agents, you raise a very interesting point: an agentic creator builds a solution that aligns with how they perceive the needs of other human agents at time t1, in context c1. As time progresses and the context changes, whether the goal should be persisted or aborted is a choice the creator must make. And as we saw in Milgram’s experiment, each choice has a trade-off.
Apologies for the length of the comment. The article opened several lines of inquiry a comment cannot contain, notably ‘alignment to what’, and what that ‘what’ refers to. Whether agents align to local goal referents or to referents that preserve overall system stability may be the more consequential question. Perhaps an article. Thanks for sharing.
Please know I love the rigor you bring to every conversation, but sometimes I just have to inject a glimmer of humor into this barbaric slaughterhouse that was once known as humanity...
You may be aware of this already, but there's another interesting wrinkle. Milgram surreptitiously left Condition 24, a "Relationship Condition," out of his published findings. In it, the teacher (subject) and learner (confederate) were either relatives or friends who had been recruited together, and the learner was instructed to direct his protests and yelling not at the experimenter but at the teacher, i.e., his friend or relative. In this unreported condition of 20 related pairs of male subjects, only *15%* completed the condition (compared to 65% in the reported sample), and 80% of non-completers stopped short of 195 volts (see Griggs & Whitehead, 2015, among others). I don't recall any detailed breakdowns of who stopped announcing the voltage or who didn't wait long enough before shocking, but that would be interesting to double-check.
I'd have to think through the exact implications for AI, but one obvious one is that a preexisting "relationship" (simulated as human, or instructed to assume the role of a friend or relative) might reshape that calculus and degradation path.
By skipping the voltage announcements, they essentially muted their own consciences.
Right? So fascinating!!! It changes everything.
“degradation of the legitimating context with nobody to flag the collapse” well described.
So obedience to the Science was at odds with compliance with the experimenter, but Milgram didn't see the distinction.
I think about this all the time.
I hope you write a piece on this!
Sometimes a little therapy helps, sometimes not...
https://youtu.be/JFCgz959ARY?si=HanfXAcmVw62b1YF
love love this.