There was truly fantastic news early this morning: the arrival of AlphaGo Zero! It learns in a completely different way from its predecessors, so I am sharing the Nature paper here. Copying the text out of the PDF broke all of the equations, so if you want the original PDF, see the link at the very bottom.
Mastering the Game of Go without Human Knowledge
David Silver*, Julian Schrittwieser*, Karen Simonyan*, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis.
DeepMind, 5 New Street Square, London EC4A 3TW.
*These authors contributed equally to this work.
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here, we introduce an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.
Much progress towards artificial intelligence has been made using supervised
learning systems that are trained to replicate the decisions of human experts1–4.
However, expert data is often expensive, unreliable, or simply unavailable.
Even when reliable data is available it may impose a ceiling on the performance
of systems trained in this manner5. In contrast, reinforcement
learning systems are trained from their own experience, in principle allowing
them to exceed human capabilities, and to operate in domains where human
expertise is lacking. Recently, there has been rapid progress towards this
goal, using deep neural networks trained by reinforcement learning. These
systems have outperformed humans in computer games such as Atari 6, 7
and 3D virtual environments 8–10. However, the most challenging
domains in terms of human intellect – such as the game of Go, widely viewed as
a grand challenge for artificial intelligence 11 – require precise
and sophisticated lookahead in vast search spaces. Fully general methods have
not previously achieved human-level performance in these domains.
AlphaGo was the first program to achieve superhuman performance in Go. The
published version 12, which we refer to as AlphaGo Fan, defeated the
European champion Fan Hui in October 2015. AlphaGo Fan utilised two deep neural
networks: a policy network that outputs move probabilities, and a value network
that outputs a position evaluation. The policy network was trained initially by
supervised learning to accurately predict human expert moves, and was
subsequently refined by policy-gradient reinforcement learning. The value
network was trained to predict the winner of games played by the policy network
against itself. Once trained, these networks were combined with a Monte-Carlo
Tree Search (MCTS) 13–15 to provide a lookahead search, using the
policy network to narrow down the search to high-probability moves, and using
the value network (in conjunction with Monte-Carlo rollouts using a fast
rollout policy) to evaluate positions in the tree. A subsequent version, which
we refer to as AlphaGo Lee, used a similar approach (see Methods), and defeated
Lee Sedol, the winner of 18 international titles, in March 2016.
Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee 12
in several important aspects. First and foremost, it is trained solely by
self-play reinforcement learning, starting from random play, without any
supervision or use of human data. Second, it only uses the black and white stones
from the board as input features. Third, it uses a single neural network,
rather than separate policy and value networks. Finally, it uses a simpler tree
search that relies upon this single neural network to evaluate positions and
sample moves, without performing any Monte-Carlo rollouts. To achieve these
results, we introduce a new reinforcement learning algorithm that incorporates
lookahead search inside the training loop, resulting in rapid improvement and
precise and stable learning. Further technical differences in the search
algorithm, training procedure and network architecture are described in
Methods.
1. Reinforcement Learning in AlphaGo Zero
Our new method uses a deep neural network fθ with parameters θ. This neural network takes as an input the raw board representation s of the position and its history, and outputs both move probabilities and a value, (p, v) = fθ(s). The vector of move probabilities p represents the probability of selecting each move (including pass), p_a = Pr(a|s). The value v is a scalar evaluation, estimating the probability of the current player winning from position s. This neural network combines the roles of both policy network and value network 12 into a single architecture. The neural network consists of many residual blocks 4 of convolutional layers 16, 17 with batch normalisation 18 and rectifier non-linearities 19 (see Methods).
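To make this dual-headed design concrete, here is a minimal PyTorch sketch of a residual tower with a policy head and a value head. It is only an illustration under my own assumptions: the class and variable names are mine, and details such as the 17-plane input encoding, channel width and head sizes are taken as plausible defaults rather than the published configuration.

```python
# Hypothetical sketch of the dual-headed network (p, v) = f_theta(s).
# Layer sizes are illustrative assumptions, not DeepMind's published code.
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19
IN_PLANES = 17               # assumed: stacked history planes + colour-to-play
FILTERS = 256                # assumed channel width
N_MOVES = BOARD * BOARD + 1  # every intersection plus pass

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)          # skip connection, then rectifier

class DualHeadNet(nn.Module):
    def __init__(self, blocks=20):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, FILTERS, 3, padding=1, bias=False),
            nn.BatchNorm2d(FILTERS), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(FILTERS) for _ in range(blocks)])
        # policy head: one logit per move (including pass)
        self.p_conv = nn.Sequential(nn.Conv2d(FILTERS, 2, 1, bias=False),
                                    nn.BatchNorm2d(2), nn.ReLU())
        self.p_fc = nn.Linear(2 * BOARD * BOARD, N_MOVES)
        # value head: scalar evaluation in [-1, 1]
        self.v_conv = nn.Sequential(nn.Conv2d(FILTERS, 1, 1, bias=False),
                                    nn.BatchNorm2d(1), nn.ReLU())
        self.v_fc = nn.Sequential(nn.Linear(BOARD * BOARD, 256), nn.ReLU(),
                                  nn.Linear(256, 1), nn.Tanh())

    def forward(self, s):
        x = self.tower(self.stem(s))
        p_logits = self.p_fc(self.p_conv(x).flatten(1))
        v = self.v_fc(self.v_conv(x).flatten(1)).squeeze(-1)
        return p_logits, v
```

The single tower with two heads is what the paper means by combining the policy and value networks into one architecture; later code sketches in this post reuse this hypothetical `DualHeadNet`.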
The neural network in AlphaGo Zero is trained from games of self-play by a
novel reinforcement learning algorithm. In each position s, an MCTS search is
executed, guided by the neural network fθ. The MCTS search outputs
probabilities π of playing each move. These search probabilities usually select
much stronger moves than the raw move probabilities p of the neural network
fθ(s); MCTS may therefore be viewed as a powerful policy improvement operator20,
21. Self-play with search – using the improved MCTS-based policy to
select each move, then using the game winner z as a sample of the value – may
be viewed as a powerful policy evaluation operator. The main idea of our
reinforcement learning algorithm is to use these search operators repeatedly in
a policy iteration procedure22, 23: the neural network’s parameters
are updated to make the move probabilities and value (p, v) = fθ(s) more
closely match the improved search probabilities and self-play winner (π, z);
these new parameters are used in the next iteration of self-play to make the
search even stronger. Figure 1 illustrates the self-play training pipeline.
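To make the policy-iteration idea above concrete, here is a rough outline of the outer self-play loop: MCTS acts as the policy-improvement operator (producing π), the game outcome z acts as the evaluation signal, and the pair (π, z) becomes the training target. Everything here is a placeholder sketch: `run_mcts`, `train_network`, and the `game` interface are my own stand-ins, not the authors' code, and π is assumed to be a probability vector over all moves.

```python
# Hypothetical outline of one cycle of the self-play training pipeline (Figure 1).
import random

def self_play_game(net, new_game, run_mcts, temperature=1.0):
    """Play one game with MCTS guided by `net`; return (state, pi, z) triples."""
    game, history = new_game(), []
    while not game.is_over():
        pi = run_mcts(net, game, temperature)        # improved policy from search
        history.append((game.observation(), pi, game.to_play()))
        move = random.choices(range(len(pi)), weights=pi)[0]   # a_t ~ pi_t
        game.play(move)
    z_black = game.winner()                          # assumed +1 black win, -1 white win
    # z is expressed from the perspective of the player to move at each step t
    return [(s, pi, z_black * player) for (s, pi, player) in history]

def policy_iteration(net, new_game, run_mcts, train_network, iterations, games_per_iter):
    for _ in range(iterations):
        data = []
        for _ in range(games_per_iter):
            data.extend(self_play_game(net, new_game, run_mcts))
        # regress (p, v) = f_theta(s) towards the improved targets (pi, z)
        net = train_network(net, data)
    return net
```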
The Monte-Carlo tree search uses the neural network fθ to guide its simulations (see Figure 2). Each edge (s, a) in the search tree stores a prior probability P(s, a), a visit count N(s, a), and an action-value Q(s, a). Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a)/(1 + N(s, a)) 12, 24, until a leaf node s′ is encountered.
Figure 1: Self-play reinforcement learning in AlphaGo Zero.
a The program plays a game s1, ..., sT against itself. In each position st, a Monte-Carlo tree search (MCTS) αθ is executed (see Figure 2) using the latest neural network fθ. Moves are selected according to the search probabilities computed by the MCTS, a_t ∼ πt. The
terminal position sT is scored according to the rules of the game to compute
the game winner z.
b Neural network training in AlphaGo
Zero. The neural network takes the raw board position st as its input, passes it
through many convolutional layers with parameters θ, and outputs both a vector
pt, representing a probability distribution over moves, and a scalar value vt,
representing the probability of the current player winning in position st. The
neural network parameters θ are updated so as to maximise the similarity of the
policy vector pt to the search probabilities πt , and to minimise the error
between the predicted winner vt and the game winner z (see Equation 1). The new
parameters are used in the next iteration of self-play a.
Figure 2: Monte-Carlo tree search in AlphaGo Zero.
a Each simulation traverses the tree by selecting the edge with maximum action-value Q, plus an upper confidence bound U that depends on a stored prior probability P and visit count N for that edge (which is incremented once traversed). b The leaf node is expanded and the associated position s is evaluated by the neural network (P(s, ·), V(s)) = fθ(s); the vector of P values are stored in the outgoing edges from s. c Action-values Q are updated to track the mean of all evaluations V in the subtree below that action. d Once the search is complete, search probabilities π are returned, proportional to N^(1/τ), where N is the visit count of each move from the root state and τ is a parameter controlling temperature.
This leaf position is expanded and evaluated just once by the network to generate both prior probabilities and evaluation, (P(s′, ·), V(s′)) = fθ(s′). Each edge (s, a) traversed in the simulation is updated to increment its visit count N(s, a), and to update its action-value to the mean evaluation over these simulations, Q(s, a) = 1/N(s, a) · Σ_{s′|s,a→s′} V(s′), where s, a → s′ indicates that a simulation eventually reached s′ after taking move a from position s.
MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = αθ(s), proportional to the exponentiated visit count for each move, πa ∝ N(s, a)^(1/τ), where τ is a temperature parameter.
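The select/expand/backup cycle and the visit-count policy described above can be sketched roughly as follows. This is a simplified, single-threaded illustration under assumed interfaces (a `net` returning priors and a value for a state, a `state.play(move)` method); the c_puct constant, the +1 under the square root, the perspective flip in the backup, and the lack of terminal-position handling are my own simplifications, not the published search.

```python
# Rough single-threaded sketch of the MCTS used to produce pi = alpha_theta(s).
# `net(state)` is assumed to return (priors: dict move -> P, value v in [-1, 1])
# from the perspective of the player to move in `state`.
import math

class Node:
    def __init__(self, prior):
        self.P, self.N, self.W = prior, 0, 0.0
        self.children = {}                     # move -> Node

    @property
    def Q(self):
        return self.W / self.N if self.N else 0.0

def select_child(node, c_puct=1.0):
    # argmax over Q(s,a) + U(s,a), U ∝ P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a));
    # the +1 under the square root just keeps the very first pick prior-driven.
    total = sum(child.N for child in node.children.values())
    def puct(item):
        move, child = item
        return child.Q + c_puct * child.P * math.sqrt(total + 1) / (1 + child.N)
    return max(node.children.items(), key=puct)

def simulate(root_state, root, net):
    state, node, path = root_state, root, []
    while node.children:                       # select down to a leaf node s'
        move, node = select_child(node)
        path.append(node)
        state = state.play(move)
    priors, value = net(state)                 # expand and evaluate the leaf once
    for move, p in priors.items():
        node.children[move] = Node(p)
    # Backup: `value` is from the perspective of the player to move at the leaf,
    # so the edge leading into the leaf belongs to the opponent; flip the sign
    # at every step back up the tree.
    v = -value
    for edge in reversed(path):
        edge.N += 1
        edge.W += v
        v = -v

def search_probabilities(root, tau=1.0):
    # pi_a proportional to N(s,a)^(1/tau)
    counts = {m: child.N ** (1.0 / tau) for m, child in root.children.items()}
    total = sum(counts.values()) or 1.0
    return {m: c / total for m, c in counts.items()}

def run_mcts(net, root_state, n_sims=1600, tau=1.0):
    root = Node(prior=1.0)
    for _ in range(n_sims):
        simulate(root_state, root, net)
    return search_probabilities(root, tau)
```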
The neural network is trained by a self-play reinforcement learning algorithm that
uses MCTS to play each move. First, the neural network is initialised to random
weights θ0. At each subsequent iteration i ≥ 1, games of self-play are
generated (Figure 1a). At each time-step t, an MCTS search πt = αθi−1 (st) is executed using
the previous iteration of neural network fθi−1 , and a move is played by sampling the search
probabilities πt . A game terminates at step T when both players pass, when the
search value drops below a resignation threshold, or when the game exceeds a
maximum length; the game is then scored to give a final reward of rT ∈
{−1, +1} (see
Methods for details). The data for each time-step t is stored as (st , πt , zt)
where zt = ±rT is the game winner from the perspective of the current player at
step t. In parallel (Figure 1b), new network parameters θi are trained from
data (s, π, z) sampled uniformly among all time-steps of the last iteration(s)
of self-play. The neural network (p, v) = fθi (s) is adjusted to minimise the
error between the predicted value v and the self-play winner z, and to maximise
the similarity of the neural network move probabilities p to the search
probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over the mean-squared error and cross-entropy losses respectively,

l = (z − v)^2 − π^T log p + c ||θ||^2    (Equation 1)
where c is a parameter controlling the level of L2 weight regularisation (to prevent overfitting).
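A minimal sketch of one gradient step on this loss, assuming the hypothetical PyTorch network from the earlier sketch and a batch of (s, π, z) samples. The optimiser choice, learning rate and the value of c here are illustrative assumptions; the paper's actual training schedule is described in its Methods.

```python
# One training step on l = (z - v)^2 - pi^T log p + c * ||theta||^2,
# assuming `net` returns (policy_logits, v) as in the earlier DualHeadNet sketch.
import torch
import torch.nn.functional as F

def train_step(net, optimiser, states, target_pi, target_z, c=1e-4):
    """states: (B, 17, 19, 19) tensor; target_pi: (B, 362); target_z: (B,)."""
    policy_logits, v = net(states)
    value_loss = F.mse_loss(v, target_z)                         # (z - v)^2
    policy_loss = -(target_pi * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    l2 = sum((p ** 2).sum() for p in net.parameters())           # ||theta||^2
    loss = value_loss + policy_loss + c * l2
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

# Hypothetical usage:
# net = DualHeadNet(blocks=20)
# opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
# loss = train_step(net, opt, states, target_pi, target_z)
```

In practice the L2 term could equally be handled through the optimiser's weight decay; it is written out explicitly here only to mirror the loss as stated.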
2. Empirical Analysis of AlphaGo Zero Training
We applied our reinforcement learning pipeline to train our program AlphaGo Zero.
Training started from completely random behaviour and continued without human
intervention for approximately 3 days. Over the course of training, 4.9 million
games of self-play were generated, using 1,600 simulations for each MCTS, which
corresponds to approximately 0.4s thinking time per move. Parameters were
updated from 700,000 mini-batches of 2,048 positions. The neural network
contained 20 residual blocks (see Methods for further details). Figure 3a shows
the performance of AlphaGo Zero during self-play reinforcement learning, as a
function of training time, on an Elo scale 25. Learning progressed smoothly
throughout training, and did not suffer from the oscillations or catastrophic
forgetting suggested in prior literature 26–28.
Figure 3: Empirical evaluation of AlphaGo Zero.
a Performance of self-play reinforcement learning. The plot shows the performance
of each MCTS player αθi from each iteration i of reinforcement learning in
AlphaGo Zero. Elo ratings were computed from evaluation games between different
players, using 0.4 seconds of thinking time per move (see Methods). For
comparison, a similar player trained by supervised learning from human data,
using the KGS data-set, is also shown. b Prediction accuracy on human
professional moves. The plot shows the accuracy of the neural network fθi , at
each iteration of self-play i, in predicting human professional moves from the
GoKifu data-set. The accuracy measures the percentage of positions in which the
neural network assigns the highest probability to the human move. The accuracy
of a neural network trained by supervised learning is also shown. c
Mean-squared error (MSE) on human professional game outcomes. The plot shows
the MSE of the neural network fθi , at each iteration of self-play i, in
predicting the outcome of human professional games from the GoKifu data-set.
The MSE is between the actual outcome z ∈ {−1, +1} and the neural network value v, scaled by a factor of 1/4 to the range [0, 1]. The MSE of a neural network trained by
supervised learning is also shown.
Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; for comparison,
AlphaGo Lee was trained over several months. After 72 hours, we evaluated
AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol,
under the 2 hour time controls and match conditions as were used in the
man-machine match in Seoul (see Methods). AlphaGo Zero used a single machine
with 4 Tensor Processing Units (TPUs) 29, while AlphaGo Lee was distributed
over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100
games to 0 (see Extended Data Figure 5 and Supplementary Information). To
assess the merits of self-play reinforcement learning, compared to learning
from human data, we trained a second neural network (using the same
architecture) to predict expert moves in the KGS data-set; this achieved
state-of-the-art prediction accuracy compared to prior work 12, 30–33 (see
Extended Data Table 1 and 2 respectively). Supervised learning achieved better
initial performance, and was better at predicting the outcome of human
professional games (Figure 3). Notably, although supervised learning achieved
higher move prediction accuracy, the self-learned player performed much better
overall, defeating the human-trained player within the first 24 hours of
training. This suggests that AlphaGo Zero may be learning a strategy that is
qualitatively different to human play. To separate the contributions of
architecture and algorithm, we compared the performance of the neural network
architecture in AlphaGo Zero with the previous neural network architecture used
in AlphaGo Lee (see Figure 4). Four neural networks were created, using either
separate policy and value networks, as in AlphaGo Lee, or combined policy and
value networks, as in AlphaGo Zero; and using either the convolutional network
architecture from AlphaGo Lee or the residual network architecture from AlphaGo
Zero. Each network was trained to minimise the same loss function (Equation 1)
using a fixed data-set of self-play games generated by AlphaGo Zero after 72
hours of self-play training. Using a residual network was more accurate,
achieved lower error, and improved performance in AlphaGo by over 600 Elo.
Combining policy and value together into a single network slightly reduced the
move prediction accuracy, but reduced the value error and boosted playing
performance in AlphaGo by around another 600 Elo. This is partly due to improved
computational efficiency, but more importantly the dual objective regularises the network to a common
representation that supports multiple use cases.
Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee.
Comparison of neural network architectures using either separate (“sep”) or
combined policy and value networks (“dual”), and using either convolutional
(“conv”) or residual networks (“res”). The combinations “dual-res” and
“sep-conv” correspond to the neural network architectures used in AlphaGo Zero
and AlphaGo Lee respectively. Each network was trained on a fixed data-set
generated by a previous run of AlphaGo Zero. a Each trained network was
combined with AlphaGo Zero’s search to obtain a different player. Elo ratings
were computed from evaluation games between these different players, using 5
seconds of thinking time per move. b Prediction accuracy on human professional
moves (from the GoKifu data-set) for each network architecture. c Mean-squared
error on human professional game outcomes (from the GoKifu data-set) for each
network architecture.
3. Knowledge Learned by AlphaGo Zero
AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play
training process. This included fundamental elements of human Go knowledge, and
also non-standard strategies beyond the scope of traditional Go knowledge.
Figure 5 shows a timeline indicating when professional joseki (corner
sequences) were discovered (Figure 5a, Extended Data Figure 1); ultimately
AlphaGo Zero preferred new joseki variants that were previously unknown (Figure
5b, Extended Data Figure 2). Figure 5c and the Supplementary Information show
several fast self-play games played at different stages of training. Tournament
length games played at regular intervals throughout training are shown in
Extended Data Figure 3 and Supplementary Information. AlphaGo Zero rapidly
progressed from entirely random moves towards a sophisticated understanding of
Go concepts including fuseki (opening), tesuji (tactics), life-and-death, ko
(repeated board situations), yose (endgame), capturing races, sente
(initiative), shape, influence and territory, all discovered from first
principles. Surprisingly, shicho (“ladder” capture sequences that may span the
whole board) – one of the first elements of Go knowledge learned by humans –
were only understood by AlphaGo Zero much later in training.
4. Final Performance of AlphaGo Zero
We subsequently applied our reinforcement learning pipeline to a second instance
of AlphaGo Zero using a larger neural network and over a longer duration.
Training again started from completely random behaviour and continued for
approximately 40 days. Over the course of training, 29 million games of
self-play were generated. Parameters were updated from 3.1 million mini-batches
of 2,048 positions each. The neural network contained 40 residual blocks. The
learning curve is shown in Figure 6a. Games played at regular intervals throughout
training are shown in Extended Data Figure 4 and Supplementary Information.
Figure 5: Go knowledge learned by AlphaGo Zero.
a Five human joseki (common corner sequences) discovered during AlphaGo Zero
training. The associated timestamps indicate the first time each sequence
occurred (taking account of rotation and reflection) during self-play training. Extended Data Figure 1 provides the frequency of occurrence over training for
each sequence. b Five joseki favoured at different stages of self-play
training. Each displayed corner sequence was played with the greatest
frequency, among all corner sequences, during an iteration of self-play
training. The timestamp of that iteration is indicated on the timeline. At 10
hours a weak corner move was preferred. At 47 hours the 3-3 invasion was most
frequently played. This joseki is also common in human professional play;
however AlphaGo Zero later discovered and preferred a new variation. Extended
Data Figure 2 provides the frequency of occurrence over time for all five
sequences and the new variation. c The first 80 moves of three self-play games
that were played at different stages of training, using 1,600 simulations
(around 0.4s) per search. At 3 hours, the game focuses greedily on capturing
stones, much like a human beginner. At 19 hours, the game exhibits the
fundamentals of life-and-death, influence and territory. At 70 hours, the game
is beautifully balanced, involving multiple battles and a complicated ko fight,
eventually resolving into a half-point win for white. See Supplementary
Information for the full games.
We evaluated the fully trained AlphaGo Zero using an internal tournament against
AlphaGo Fan, AlphaGo Lee, and several previous Go programs. We also played
games against the strongest existing program, AlphaGo Master – a program based
on the algorithm and architecture presented in this paper but utilising human
data and features (see Methods) – which defeated the strongest human
professional players 60–0 in online games 34 in January 2017. In our
evaluation, all programs were allowed 5 seconds of thinking time per move;
AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs;
AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUs and 48 TPUs respectively.
We also included a player based solely on the raw neural network of AlphaGo
Zero; this player simply selected the move with maximum probability.
Figure 6b shows the performance of each program on an Elo scale. The raw neural
network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo
Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739
for AlphaGo Lee and 3,144 for AlphaGo Fan.
Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100 game match with 2 hour time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Figure 6 and Supplementary Information).
5. Conclusion
Our results comprehensively demonstrate that a pure reinforcement learning approach
is fully feasible, even in the most challenging of domains: it is possible to
train to superhuman level, without human examples or guidance, given no
knowledge of the domain beyond basic rules. Furthermore, a pure reinforcement
learning approach requires just a few more hours to train, and achieves much
better asymptotic performance, compared to training on human expert data. Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were
trained from human data using handcrafted features, by a large margin.
Figure 6: Performance of AlphaGo Zero.
a Learning curve for AlphaGo Zero using a larger 40-block residual network over 40
days. The plot shows the performance of each player αθi from each iteration i
of our reinforcement learning algorithm. Elo ratings were computed from
evaluation games between different players, using 0.4 seconds per search (see
Methods). b Final performance of AlphaGo Zero. AlphaGo Zero was trained for 40 days using a neural network with 40 residual blocks. The plot shows the results of a
tournament between: AlphaGo Zero, AlphaGo Master (defeated top human
professionals 60-0 in online games), AlphaGo Lee (defeated Lee Sedol), AlphaGo
Fan (defeated Fan Hui), as well as previous Go programs Crazy Stone, Pachi and
GnuGo. Each program was given 5 seconds of thinking time per move. AlphaGo Zero
and AlphaGo Master played on a single machine on the Google Cloud; AlphaGo Fan
and AlphaGo Lee were distributed over many machines. The raw neural network
from AlphaGo Zero is also included, which directly selects the move a with
maximum probability pa, without using MCTS. Programs were evaluated on an Elo
scale 25: a 200 point gap corresponds to a 75% probability of winning.
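As a quick check of that last statement, the standard base-10 Elo curve gives roughly a 76% expected score for a 200-point gap, in line with the ~75% quoted; the paper fits its own ratings in its Methods, so treat this tiny snippet only as an approximation.

```python
# Expected score for a 200-point Elo gap under the standard (base-10) Elo curve;
# prints ~0.76, consistent with the "200 points ≈ 75% win probability" rule of thumb.
def elo_expected_score(gap: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))

print(round(elo_expected_score(200), 3))   # 0.76
```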
Humankind has accumulated Go knowledge from millions of games played over thousands of
years, collectively distilled into patterns, proverbs and books. In the space
of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much
of this Go knowledge, as well as novel strategies that provide new insights
into the oldest of games.
Source: https://deepmind.com/docum.../119/agz_unformatted_nature.pdf