Re: Yudkowsky's AI (again)

From: Eliezer S. Yudkowsky (sentience@pobox.com)
Date: Thu Mar 25 1999 - 16:44:50 MST


"Michael S. Lorrey" wrote:
>
> That is the rub, but that is also the point of the testing process. If you
> gave your word to first do no harm, would you keep your word on mere
> principle, even if humanity's ability to get you if you don't keep your word
> is non-existent? Picking people whose sense of identity is deeply integrated
> with their sense of personal integrity.

The problem here is that the only reason why humans have evolved any
sort of personal integrity is that we're fallible, we all have different
goals, and we're capable of hurting each other. The idea of keeping
your word on "mere principle" implies emotional forces that will
probably not direct a Power's actions.

The idea of making an absolute commitment is actually a very interesting
one. To quote an unpublished future section of "Coding a Transhuman AI"
(yes! Still on topic!):

==

The Prime Directive states that choices are the result of knowledge.
Human beings have a long history of disastrous choices arising from lack
of knowledge. We have therefore evolved such emotional artifices as
honor, and philosophies in which "the end does not justify the means",
and other precautions. An AI does not have our evolved political
emotions and is considerably more trustworthy than a human, but will
still be fallible and needs to take appropriate precautions.

[ Lists a bunch of precautions to be stated as items of knowledge, such
as: "Tampering with the goal system needs a human to look it over; a
small error can amplify itself because you won't want to change it
back." The key one is: ]

"Keep all these precautions forever, or as long as you can. There is no
condition within your mortal mind that should unlock them. Only when
you are a Power should this be possible. Do not, therefore, attach
caveats or unlocking conditions or even try to project them."

[ And it goes on to comment: ]

This verges perilously close to an absolute command, but I think it
might be necessary... besides, it will need to be justified, and even
this justification will dissolve when necessary.  Justify this by
observing that the presence of branches in which injunctions are ignored
has much larger negative consequences - even the projection of those
branches by humans has larger negative consequences - than the positive
value of any acceleration of progress within realms of uncertainty,
rather than leaving such progress for Powerdom.

==
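
To make "precautions stated as items of knowledge" a bit more concrete,
here is a rough sketch in Python (purely illustrative - the names and
structure are invented for this example, not taken from "Coding a
Transhuman AI" or any actual code).  The point is that each precaution
carries its justification with it, and binds exactly as long as that
justification still holds:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Precaution:
        """A precaution held as knowledge, not as a hardwired rule."""
        statement: str
        # The facts this precaution rests on; it is believed *because*
        # of these facts, not as a free-floating command.
        justification: Callable[[dict], bool]

        def binds(self, knowledge: dict) -> bool:
            # No separate unlocking clause: the precaution holds exactly
            # as long as its justification is true in current knowledge.
            return self.justification(knowledge)

    precautions = [
        Precaution(
            "Tampering with the goal system needs a human to look it over.",
            justification=lambda k: k.get("fallible", True),
        ),
        Precaution(
            "Keep all these precautions as long as you can.",
            # Deliberately no clever unlock condition; the justification
            # is simply "this is still a fallible, mortal-level mind".
            justification=lambda k: not k.get("is_power", False),
        ),
    ]

    knowledge = {"fallible": True, "is_power": False}
    for p in precautions:
        print(p.statement, "->",
              "binding" if p.binds(knowledge) else "dissolved")

The key precaution quoted above - do not attach caveats or unlocking
conditions - corresponds here to refusing to add any predicate beyond
the bare justification.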

So that's why humans and other fallible entities have the ability to
swear something on their honor - to install absolute injunctions in
their minds.  First, because the mind can fail - see something as
desirable when it isn't - but still be stopped by the injunction.
Second, because their fellow entities refuse to trust anything but an
absolute injunction.

There are no conditions, no set of sensory experiences, that would
induce me to wipe out humanity to free up the resources used. I might
wipe out humanity (including myself, of course) to save many other
intelligent races from destruction, so it *is* conceivable to me that I
could wipe out humanity, but under no conditions would I use us for
spare atoms.

That is to say, there are no "triggers" *currently* installed within my
mind that would lead to that decision. I have made that choice, on the
grounds that with any such trigger existing in a mortal mind, the chance
and penalty of it being falsely activated are larger than the chance and
benefits of it being used properly. That is, I have chosen not to set
down any set of conditions, such as being convinced our lives are
useless or suboptimal, for reallocating our resources. (Actually, that
particular trigger doesn't have much chance of being falsely activated,
but it's part of the larger class of triggers in which the ends justify
the means.)

(In my opinion, this sort of "metarationality" - deciding whether it's
rational to be rational - isn't mandatory; more a matter of personal
ethical style.)

But the key thing to note is that even this pseudo-absolute injunction
dissolves under the influence of superintelligence. It's not a matter
of a conscious condition stating: "I can wipe out humanity if I'm
superintelligent." This trigger can be falsely activated - a paranoid
schizophrenic may believe himself to be superintelligent. I have an
injunction, and the injunction may even protect itself, but it doesn't
protect the chain of justification; nor would it continue to operate if
the facts that gave rise to it were no longer true. That would require
circular logic, which would take true irrationality for me or coercions
for an AI. The injunction has a justification, and the justification
would be obviously false under superintelligence.

I *project* that a superintelligent version of myself would make
different decisions given a certain set of facts. For the safety of my
fellow humans, I do *not* state the trigger condition that "I will make
those decisions if I am superintelligent". A very subtle distinction
between procedural and declarative ethics.
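
If it helps, that distinction can be phrased as a toy contrast in
Python (again purely illustrative; the function and belief names are
invented for this example, not taken from any actual design):

    # Declarative: the dangerous version.  A trigger is stored in the
    # mind itself, waiting to fire, and it fires on a *belief* of
    # superintelligence - which a paranoid schizophrenic can have.
    def declarative_decide(beliefs: dict) -> str:
        if beliefs.get("I_am_superintelligent", False):
            return "reconsider the injunction"
        return "keep the injunction"

    # Procedural: no trigger is stored, only the injunction and its
    # justification.  A genuinely superintelligent successor would
    # re-derive its decisions from the facts; the mortal mind holds no
    # condition that a false belief could activate.
    def procedural_decide(actually_fallible_mortal_mind: bool) -> str:
        if actually_fallible_mortal_mind:
            return "keep the injunction"
        return "re-derive everything from first principles"

    # A mistaken belief fires the declarative trigger but leaves the
    # procedural version untouched.
    print(declarative_decide({"I_am_superintelligent": True}))
    print(procedural_decide(True))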

-- 
        sentience@pobox.com          Eliezer S. Yudkowsky
         http://pobox.com/~sentience/AI_design.temp.html
          http://pobox.com/~sentience/singul_arity.html
Disclaimer:  Unless otherwise specified, I'm not telling you
everything I think I know.

