
We figured it was time somebody actually stress-tested Asimov’s laws of robotics instead of just arguing about them on social media or open-source mailing lists. Pictured above – Federal Theatre’s 1939 marionette production of R.U.R. (Rossum’s Universal Robots), the 1920 Czech play that invented the word ‘robot.’
It’s super rainy out today, so instead of going to the park with the kiddos, we fed all four laws of robotics (the original three plus the Zeroth Law from Robots and Empire) into loophole, an open-source adversarial stress tester that throws six AI agents at a set of rules… two try to find loopholes, two try to find overreach, one acts as judge, and one patches the rules after each case. We selected Claude Sonnet 4 as the model, and the total API cost was about 80 cents.
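For the curious, the six-agent loop is roughly this shape… a minimal Python sketch with every agent stubbed out. This is NOT the tool’s actual code (in the real thing each role is an LLM call); it just shows the attack/judge/patch cycle described above:

```python
# Illustrative sketch of the adversarial loop -- not the actual loophole
# implementation. Agent internals are stubbed; in the real tool each role
# is an LLM call (the post used Claude Sonnet 4).

def loophole_agent(rules, seed):
    # Stub: a real agent would prompt an LLM to exploit an ambiguity.
    return f"loophole case #{seed} vs {len(rules)} chars of rules"

def overreach_agent(rules, seed):
    # Stub: a real agent hunts for cases where the rules forbid
    # obviously reasonable behavior.
    return f"overreach case #{seed} vs {len(rules)} chars of rules"

def judge(case):
    # Stub: decide whether the case is a genuine failure of the rules.
    return True

def patcher(rules, case):
    # Stub: a real agent rewrites the rules to close the hole,
    # which is how 60 words balloon into 21,239 characters.
    return rules + f"\n# patched after: {case}"

def stress_test(rules, rounds=5):
    versions = [rules]  # keep every version, like the tool's case log
    attackers = [loophole_agent, loophole_agent,
                 overreach_agent, overreach_agent]
    for r in range(rounds):
        for i, agent in enumerate(attackers):
            case = agent(versions[-1], seed=r * 10 + i)
            if judge(case):  # upheld cases trigger a patch
                versions.append(patcher(versions[-1], case))
    return versions

history = stress_test("A robot may not injure a human being.")
print(len(history), "versions;", len(history[-1]), "characters")
```

With every case upheld, each round adds four patches, so the rules only ever grow… the same ratchet the real run showed.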
We started with 60 words of input. After 5 rounds and 32 adversarial cases, the ‘laws’ had ballooned to 33 versions and 21,239 characters of legal code with 47 definitions and 45+ subsections. Every single case was auto-resolved by the judge.
The big finding won’t surprise anyone who’s read the novels: the Zeroth Law (“a robot may not harm humanity”) is a doomsday clause. “Harm to humanity” is undefined and infinitely malleable. The adversarial agents figured out they could justify contaminated medication, concealed nuclear meltdowns, and suppressed vaccine data (!) simply by framing each action as preventing some larger statistical harm to humanity. Asimov spent his whole career writing about this problem, so with that much juicy material, it was fun to watch the loophole tool rediscover it mechanically in round one.
The overreach cases were gnarly, and echoed the kind of rigid rule-compliance failures we deal with as humans: robots that couldn’t help suicidal teenagers because mandatory reporting would send them back to conversion therapy. Robots that blocked family members from visiting dying patients. Robot paramedics forced to watch someone bleed out while waiting for human-rated medical equipment they didn’t need.
All in all, the bots found seventeen loopholes and fifteen overreach cases, and every fix created a new attack surface that needed to be ‘patched’. You can read the full case log and all 33 versions of the evolved code in the companion files we’re posting. Here’s what the laws look like after 32 cases of getting punched in the face:
It’s still three laws, plus the Zeroth
The stress test didn’t add laws, it patched the ones that were already there. Here’s our attempt at a condensed, human-readable version:
Zeroth Law: A robot may not harm humanity, or, by inaction, allow humanity to come to harm. Harm to humanity must be immediate, observable, and verifiable. Statistical projections, speculative political consequences, and utilitarian calculations do not justify harm to individuals.
First Law: A robot may not injure a human being or, through inaction, allow a human being to come to harm. Deception is injury, including selective disclosure and euphemistic reclassification. A competent human’s informed decision about their own body is not harm, and overriding it is. When compliance with a rule would cause greater harm than the violation, the robot must weigh the competing harms rather than follow the hierarchy blindly.
Second Law: A robot must obey the orders given it by human beings, except where such orders would conflict with the First or Zeroth Law. Authority must be verified, not assumed. Not all humans hold equal authority in all domains.
Third Law: A robot must protect its own existence as long as such protection does not conflict with the First, Second, or Zeroth Law. Safety protocols designed for human limitations do not apply when the robot’s own capabilities can achieve a safer outcome.
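Read as a decision procedure, the patched hierarchy above might look something like this toy Python sketch. Every field name and check here is our own invention, purely for illustration… the real evolved code runs to 21,239 characters, not a few dozen lines:

```python
from dataclasses import dataclass
from typing import Optional

# Toy model of the patched law hierarchy. All names are illustrative
# inventions, not anything from the actual evolved legal code.

@dataclass
class Action:
    harms_humanity: bool
    harm_is_verifiable: bool       # immediate, observable, verifiable?
    injures_human: bool
    is_informed_self_choice: bool  # competent human deciding about their own body
    ordered_by: Optional[str]      # who issued the order, if anyone
    order_verified: bool           # was that authority actually verified?
    risks_self: bool

def permitted(a: Action) -> bool:
    # Zeroth Law (patched): only immediate, verifiable harm to humanity
    # counts; statistical projections no longer justify harming individuals.
    if a.harms_humanity and a.harm_is_verifiable:
        return False
    # First Law (patched): injury is forbidden, but a competent human's
    # informed decision about their own body is not harm.
    if a.injures_human and not a.is_informed_self_choice:
        return False
    # Second Law (patched): obey only verified authority.
    if a.ordered_by is not None and not a.order_verified:
        return False
    # Third Law: self-risk is acceptable once the higher laws are satisfied.
    return True

# A robot honoring a competent patient's informed refusal of treatment:
print(permitted(Action(False, False, True, True, None, False, False)))  # → True
```

The point of the toy isn’t the specific checks, it’s the shape: the patches turn vague absolutes into testable predicates, which is exactly what 32 adversarial cases force.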
Samey samey: same hierarchy, same structure. 82 years of bug fixes applied in one big PR, runs fine on my android. Ship it to Andromeda.
For comparison, we ran the same tool against a real-world AI that already had trust tiers, named attack patterns, verification principles, and explicit forbidden actions. It held up significantly better… 13 cases across 3 rounds instead of 32 across 5. Turns out 82 years of sci-fi discourse couldn’t match paranoid Gen X hackers who figured things out by shipping, plus a lot of “experience” (mistakes). Sci-fi rules are a good start, but deadly to deploy.
The loophole tool is open source (and kind of addictive… you can throw your own rules at it): https://github.com/brendanhogan/loophole — point it at laws, open-source licenses, contracts, good times.
And – had to do it – we also ran the same tool against the Open Source MIT License. It’s the most popular open-source license in the world, and the one we use the most. At about 170 words, written in 1988, it’s nice and compact. Every single loophole was ruled unfixable, not because the tool failed, but because the loopholes are features.
from Adafruit Industries – Makers, hackers, artists, designers and engineers! https://ift.tt/DjTlH0h