Frontiers of Friendly AI

From: Eliezer S. Yudkowsky (sentience@pobox.com)
Date: Thu Sep 28 2000 - 09:45:16 MDT


For those of you wondering about the kind of questions that I *would* regard
as valid...

** Q1)

     Stage: Mid-term and late-term seed AI.

     Context: Let's say you design a seed AI with what looks like a Friendly
goal system; i.e., it regards all subtasks - including optimizing itself, and
rewriting a particular piece of code, and playing chess, and so on - as
subgoals of subgoals of subgoals which eventually terminate in the
Friendliness supergoal. In fact, the "subgoals" have no programmatic status -
they are simply commonly used way-stations along the projection that leads to
the supergoal.
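
     Here's a minimal Python sketch of that shape, purely for illustration -
the class, the names, and the example chain are stand-ins of my own, not the
actual seed-AI design, and in the real system the "subgoals" would have no
special programmatic status at all; the toy class just makes the chain of
way-stations visible.

    class Goal:
        """A way-station along the projection that leads to the supergoal."""
        def __init__(self, description, parent=None):
            self.description = description
            self.parent = parent   # the goal this one is a subgoal of

        def justification(self):
            """Trace the chain of subgoals back up to the supergoal."""
            chain, node = [self.description], self.parent
            while node is not None:
                chain.append(node.description)
                node = node.parent
            return " -> ".join(chain)

    friendliness = Goal("Friendliness")                        # sole supergoal
    self_enhance = Goal("self-enhancement", friendliness)      # standard link
    rewrite_code = Goal("rewrite this piece of code", self_enhance)
    play_chess   = Goal("win this chess game", rewrite_code)   # e.g., a test case

    print(play_chess.justification())
    # win this chess game -> rewrite this piece of code -> self-enhancement -> Friendliness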

     Possible problem: On a moment-to-moment basis, the vast majority of
tasks are not materially affected by the fact that the supergoal is
Friendliness. The optimal strategy for playing chess is not obviously
affected by whether the supergoal is Friendliness or hostility. Therefore,
the system will tend to accumulate learned complexity around the subgoals, but
not around the top of the goal chain; Friendliness, and any standard links from
Friendliness to the necessity of self-enhancement or some other standard
subgoal, will remain crystalline and brittle. If most of the de facto
complexity rests with the subgoals, then is it likely that the AI's future
superintelligent selves will grant priority to the subgoals?
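
     To make the worry concrete, here's a toy Python illustration of my own
(legal_moves, evaluate, and the numbers are all stand-ins): the chess-playing
code never consults the supergoal at all, which is exactly why the learned
complexity piles up down at the subgoal level.

    def legal_moves(position):
        return position["moves"]           # toy stand-in

    def evaluate(position, move):
        return position["scores"][move]    # this is where the learned complexity lives

    def choose_chess_move(position, supergoal):
        # `supergoal` is never consulted: substituting "hostility" for
        # "Friendliness" changes nothing about the move chosen.
        return max(legal_moves(position), key=lambda m: evaluate(position, m))

    toy_position = {"moves": ["e4", "d4", "Nf3"],
                    "scores": {"e4": 0.31, "d4": 0.29, "Nf3": 0.27}}
    assert (choose_chess_move(toy_position, "Friendliness")
            == choose_chess_move(toy_position, "hostility"))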

     Solution: I'll post again in a bit, but I'm more interested in seeing
whether anyone can come up with the answer independently.

** Q2)

     Stage: The earliest, most primitive versions of the seed AI.

     Possible problem: There is an error in the goal system, or the goal
system is incomplete. Either you had to simplify at first because the
primitive AI couldn't understand all the referents, or you (the programmer)
changed your mind about something. How do you get the AI to let you change
the goal system? Obviously, changing the goal system is an action that would
tend to interfere with whatever goals the AI currently possesses.
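
     A toy Python sketch of the difficulty - not of its solution - with every
name and number illustrative: a naive agent that scores candidate actions under
its current goal system will score "accept the programmer's patch" poorly,
precisely because the patched goals would steer it toward something else.

    def utility_under_current_goals(action):
        # Toy utilities as seen from inside the current (flawed) goal system.
        return {"pursue current goals":      1.0,
                "accept programmer's patch": 0.2}[action]

    def choose(actions):
        return max(actions, key=utility_under_current_goals)

    print(choose(["pursue current goals", "accept programmer's patch"]))
    # -> "pursue current goals": the naive architecture resists correction.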

     Solution: Again, I'd like to let this hang out for a bit - maybe we'll
get a more productive discussion than the current one.

     Bonus problem: Suppose that you screw up badly enough that the AI not
only attempts to preserve the original goals, but also realizes that it (the
AI) must do so surreptitiously in order to succeed. Can you think of any
methods that might help identify such a problem?

** Grading

No ad hoc patches, please - remember, the final system has to be coherent, and
you won't be able to think up ad hoc patches for everything in advance. Your
answer should take the form of a fundamental feature of the cognitive
architecture or the programming methodology.

-- -- -- -- --
Eliezer S. Yudkowsky http://singinst.org/
Research Fellow, Singularity Institute for Artificial Intelligence


