
Mastering the Game of Go without Human Knowledge

David.Cheon 2017.10.19 21:13

There was truly fantastic news early this morning: the arrival of AlphaGo Zero! AlphaGo Zero learns in a completely different way from its predecessors, and I am sharing the paper published in Nature. Copying the text out of the PDF broke all of the equations; if you would like the original PDF, see the link at the very bottom.


Mastering the Game of Go without Human Knowledge 

 

David Silver*, Julian Schrittwieser*, Karen Simonyan*, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis.
DeepMind, 5 New Street Square, London EC4A 3TW.
*These authors contributed equally to this work.

 

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here, we introduce an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.

Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts1–4. However, expert data is often expensive, unreliable, or simply unavailable. Even when reliable data is available it may impose a ceiling on the performance of systems trained in this manner5. In contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games such as Atari 6, 7 and 3D virtual environments 8–10. However, the most challenging domains in terms of human intellect – such as the game of Go, widely viewed as a grand challenge for artificial intelligence 11 – require precise and sophisticated lookahead in vast search spaces. Fully general methods have not previously achieved human-level performance in these domains.

AlphaGo was the first program to achieve superhuman performance in Go. The published version 12, which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan utilised two deep neural networks: a policy network that outputs move probabilities, and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte-Carlo Tree Search (MCTS) 13–15 to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte-Carlo rollouts using a fast rollout policy) to evaluate positions in the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach (see Methods), and defeated Lee Sedol, the winner of 18 international titles, in March 2016.

Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee 12 in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it only uses the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.

 

1. Reinforcement Learning in AlphaGo Zero
Our new method uses a deep neural network fθ with parameters θ. This neural network takes as an input the raw board representation s of the position and its history, and outputs both move probabilities and a value, (p, v) = fθ(s). The vector of move probabilities p represents the probability of selecting each move (including pass), pa = Pr(a|s). The value v is a scalar evaluation, estimating the probability of the current player winning from position s. This neural network combines the roles of both policy network and value network 12 into a single architecture. The neural network consists of many residual blocks 4 of convolutional layers 16, 17 with batch normalisation 18 and rectifier non-linearities 19 (see Methods).
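The (p, v) = fθ(s) interface above can be sketched as follows. This is a deliberately tiny stand-in, not the paper's network: the real fθ is a deep residual convolutional tower over 19×19 board planes, whereas here a single ReLU layer over a flat toy board vector, with random untrained weights playing the role of θ, merely illustrates the shared trunk with a softmax policy head and a tanh value head. The board size and layer widths are invented for illustration.

```python
import math
import random

random.seed(0)

BOARD = 9    # assumed toy board size (a flattened 3x3 board, not 19x19)
HIDDEN = 16

# Random weights stand in for theta.
W1 = [[random.gauss(0.0, 0.1) for _ in range(BOARD)] for _ in range(HIDDEN)]
Wp = [[random.gauss(0.0, 0.1) for _ in range(HIDDEN)] for _ in range(BOARD)]
Wv = [random.gauss(0.0, 0.1) for _ in range(HIDDEN)]

def f_theta(s):
    """Map a board vector s to (move probabilities p, value v)."""
    # Shared trunk (stands in for the residual tower).
    h = [max(0.0, sum(w * x for w, x in zip(row, s))) for row in W1]
    # Policy head: softmax over one logit per board point.
    logits = [sum(w * x for w, x in zip(row, h)) for row in Wp]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    p = [e / sum(exps) for e in exps]
    # Value head: scalar in (-1, 1), the predicted chance of winning.
    v = math.tanh(sum(w * x for w, x in zip(Wv, h)))
    return p, v

p, v = f_theta([1.0, -1.0, 0.0] * 3)  # +1 black stone, -1 white, 0 empty
```

The key design point survives even in this toy: both heads share one representation, which is what the paper later credits for regularising the network towards features useful for both prediction tasks.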

The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position s, an MCTS search is executed, guided by the neural network fθ. The MCTS search outputs probabilities π of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network fθ(s); MCTS may therefore be viewed as a powerful policy improvement operator20, 21. Self-play with search – using the improved MCTS-based policy to select each move, then using the game winner z as a sample of the value – may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure22, 23: the neural network’s parameters are updated to make the move probabilities and value (p, v) = fθ(s) more closely match the improved search probabilities and self-play winner (π, z); these new parameters are used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.
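The data-collection half of this policy iteration loop can be sketched on a stand-in game (take 1 or 2 stones from a pile; taking the last stone wins). The `search` argument plays the role of the MCTS policy improvement operator, mapping a state to search probabilities π; here a uniform stub is substituted so the sketch is self-contained. Every name is illustrative, not from the paper's code.

```python
import random

random.seed(1)

def legal_moves(pile):
    return [a for a in (1, 2) if a <= pile]

def uniform_search(state):
    """Stub for MCTS: uniform search probabilities over legal moves."""
    pile, _player = state
    moves = legal_moves(pile)
    return {a: 1.0 / len(moves) for a in moves}

def self_play_game(search, pile=5):
    state = (pile, +1)            # (stones left, player to move)
    history = []                  # one (state, pi, player) record per move
    while state[0] > 0:
        pi = search(state)
        history.append((state, pi, state[1]))
        moves = list(pi)
        a = random.choices(moves, weights=[pi[m] for m in moves])[0]
        state = (state[0] - a, -state[1])
    winner = -state[1]            # the player who took the last stone
    # z is the game outcome seen from each step's current-player perspective,
    # matching the (pi, z) training targets described in the text.
    return [(s, pi, +1 if pl == winner else -1) for s, pi, pl in history]

data = self_play_game(uniform_search)
```

In the full pipeline the network would then be fit to these (π, z) targets and the improved network handed back to the search, closing the loop of Figure 1.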

The Monte-Carlo tree search uses the neural network fθ to guide its simulations (see Figure 2). Each edge (s, a) in the search tree stores a prior probability P(s, a), a visit count N(s, a), and an action-value Q(s, a). Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a)/(1 + N(s, a)) 12, 24, until a leaf node s′ is encountered. This leaf position is expanded and evaluated just


Figure 1: Self-play reinforcement learning in AlphaGo Zero.
a The program plays a game s1, ..., sT against itself. In each position st, a Monte-Carlo tree search (MCTS) αθ is executed (see Figure 2) using the latest neural network fθ. Moves are selected according to the search probabilities computed by the MCTS, at ∼ πt. The terminal position sT is scored according to the rules of the game to compute the game winner z.
b Neural network training in AlphaGo Zero. The neural network takes the raw board position st as its input, passes it through many convolutional layers with parameters θ, and outputs both a vector pt, representing a probability distribution over moves, and a scalar value vt, representing the probability of the current player winning in position st. The neural network parameters θ are updated so as to maximise the similarity of the policy vector pt to the search probabilities πt , and to minimise the error between the predicted winner vt and the game winner z (see Equation 1). The new parameters are used in the next iteration of self-play a.



Figure 2: Monte-Carlo tree search in AlphaGo Zero.
a Each simulation traverses the tree by selecting the edge with maximum action-value Q, plus an upper confidence bound U that depends on a stored prior probability P and visit count N for that edge (which is incremented once traversed). b The leaf node is expanded and the associated position s is evaluated by the neural network (P(s, ·), V(s)) = fθ(s); the vector of P values is stored in the outgoing edges from s.
c Action-values Q are updated to track the mean of all evaluations V in the subtree below that action.
d Once the search is complete, search probabilities π are returned, proportional to N^(1/τ), where N is the visit count of each move from the root state and τ is a parameter controlling temperature.

once by the network to generate both prior probabilities and evaluation, (P(s′, ·), V(s′)) = fθ(s′). Each edge (s, a) traversed in the simulation is updated to increment its visit count N(s, a), and to update its action-value to the mean evaluation over these simulations, Q(s, a) = 1/N(s, a) · Σ_{s′ | s,a→s′} V(s′), where s, a → s′ indicates that a simulation eventually reached s′ after taking move a from position s.
MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = αθ(s), proportional to the exponentiated visit count for each move, πa ∝ N(s, a)^(1/τ), where τ is a temperature parameter.
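The search described above (select by Q + U, expand and evaluate a leaf with fθ, back up the mean value, then return π ∝ N^(1/τ)) can be condensed into a minimal sketch. It runs on the same stand-in game (take 1 or 2 stones; taking the last stone wins), with a uniform-prior, zero-value stub in place of the neural network; the class layout and the exploration constant `c_puct` are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import defaultdict

def legal_moves(pile):
    return [a for a in (1, 2) if a <= pile]

def f_theta(state):
    """Stub network: uniform priors p, neutral value v = 0."""
    moves = legal_moves(state[0])
    return {a: 1.0 / len(moves) for a in moves}, 0.0

class MCTS:
    def __init__(self, c_puct=1.0):
        self.N = defaultdict(int)    # visit counts N(s, a)
        self.W = defaultdict(float)  # accumulated values, so Q = W / N
        self.P = {}                  # stored priors P(s, a)
        self.c = c_puct

    def simulate(self, s):
        """One simulation; returns a value from the current player's view."""
        pile, player = s
        if pile == 0:
            return -1.0  # the previous player took the last stone and won
        if s not in self.P:          # leaf: expand and evaluate with f_theta
            self.P[s], v = f_theta(s)
            return v
        # Selection: maximise Q(s,a) + U(s,a) with U ~ P(s,a) / (1 + N(s,a)).
        total = sum(self.N[(s, a)] for a in self.P[s])
        def score(a):
            q = self.W[(s, a)] / self.N[(s, a)] if self.N[(s, a)] else 0.0
            u = self.c * self.P[s][a] * math.sqrt(total + 1) / (1 + self.N[(s, a)])
            return q + u
        a = max(self.P[s], key=score)
        v = -self.simulate((pile - a, -player))  # value flips between players
        self.N[(s, a)] += 1                      # backup
        self.W[(s, a)] += v
        return v

    def pi(self, s, n_sim=200, tau=1.0):
        """Search probabilities proportional to N(s, a)^(1/tau)."""
        for _ in range(n_sim):
            self.simulate(s)
        counts = {a: self.N[(s, a)] ** (1.0 / tau) for a in self.P[s]}
        z = sum(counts.values())
        return {a: n / z for a, n in counts.items()}

pi = MCTS().pi((2, +1))  # pile of 2: taking both stones wins immediately
```

Even with a know-nothing network, the search statistics alone concentrate π on the winning move, which is exactly why the paper calls MCTS a policy improvement operator.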

The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialised to random weights θ0. At each subsequent iteration i ≥ 1, games of self-play are generated (Figure 1a). At each time-step t, an MCTS search πt = αθi−1(st) is executed using the previous iteration of the neural network fθi−1, and a move is played by sampling the search probabilities πt. A game terminates at step T when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length; the game is then scored to give a final reward of rT ∈ {−1, +1} (see Methods for details). The data for each time-step t is stored as (st, πt, zt), where zt = ±rT is the game winner from the perspective of the current player at step t. In parallel (Figure 1b), new network parameters θi are trained from data (s, π, z) sampled uniformly among all time-steps of the last iteration(s) of self-play. The neural network (p, v) = fθi(s) is adjusted to minimise the error between the predicted value v and the self-play winner z, and to maximise the similarity of the neural network move probabilities p to the search probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over the mean-squared error and cross-entropy losses respectively,

l = (z − v)^2 − π^T log p + c||θ||^2    (Equation 1)

where c is a parameter controlling the level of L2 weight regularisation (to prevent overfitting).
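The loss is a direct sum of three terms and can be transcribed for toy numbers; the helper name and the value of c below are illustrative, not from the paper.

```python
import math

def alphago_zero_loss(z, v, pi, p, theta, c=1e-4):
    """l = (z - v)^2 - pi . log p + c * ||theta||^2."""
    mse = (z - v) ** 2                                   # value error
    ce = -sum(q * math.log(pr) for q, pr in zip(pi, p))  # cross-entropy
    l2 = c * sum(t * t for t in theta)                   # L2 regulariser
    return mse + ce + l2

# With a perfect fit (v == z, p == pi) and zero weights, only the
# entropy of pi remains: here log 2 for a uniform two-move target.
l = alphago_zero_loss(z=1.0, v=1.0, pi=[0.5, 0.5], p=[0.5, 0.5], theta=[0.0])
```

Note the cross-entropy term is minimised when p matches π, and the squared term when v matches z, which is exactly the "match the search output" objective described above.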

 

2. Empirical Analysis of AlphaGo Zero Training
We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behaviour and continued without human intervention for approximately 3 days. Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4s thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks (see Methods for further details). Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning, as a function of training time, on an Elo scale 25. Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting suggested in prior literature 26–28.


Figure 3: Empirical evaluation of AlphaGo Zero.
a Performance of self-play reinforcement learning. The plot shows the performance of each MCTS player αθi from each iteration i of reinforcement learning in AlphaGo Zero. Elo ratings were computed from evaluation games between different players, using 0.4 seconds of thinking time per move (see Methods). For comparison, a similar player trained by supervised learning from human data, using the KGS data-set, is also shown. b Prediction accuracy on human professional moves. The plot shows the accuracy of the neural network fθi, at each iteration of self-play i, in predicting human professional moves from the GoKifu data-set. The accuracy measures the percentage of positions in which the neural network assigns the highest probability to the human move. The accuracy of a neural network trained by supervised learning is also shown. c Mean-squared error (MSE) on human professional game outcomes. The plot shows the MSE of the neural network fθi, at each iteration of self-play i, in predicting the outcome of human professional games from the GoKifu data-set. The MSE is between the actual outcome z ∈ {−1, +1} and the neural network value v, scaled by a factor of 1/4 to the range [0, 1]. The MSE of a neural network trained by supervised learning is also shown.


Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; for comparison, AlphaGo Lee was trained over several months. After 72 hours, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the 2 hour time controls and match conditions as were used in the man-machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 Tensor Processing Units (TPUs) 29, while AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Figure 5 and Supplementary Information).

To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS data-set; this achieved state-of-the-art prediction accuracy compared to prior work 12, 30–33 (see Extended Data Table 1 and 2 respectively). Supervised learning achieved better initial performance, and was better at predicting the outcome of human professional games (Figure 3). Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 hours of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.

To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the previous neural network architecture used in AlphaGo Lee (see Figure 4). Four neural networks were created, using either separate policy and value networks, as in AlphaGo Lee, or combined policy and value networks, as in AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the residual network architecture from AlphaGo Zero.
Each network was trained to minimise the same loss function (Equation 1) using a fixed data-set of self-play games generated by AlphaGo Zero after 72 hours of self-play training. Using a residual network was more accurate, achieved lower error, and improved performance in AlphaGo by over 600 Elo. Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo. This is partly due to improved computational efficiency, but more importantly the dual objective  regularises the network to a common representation that supports multiple use cases.

Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee.
Comparison of neural network architectures using either separate (“sep”) or combined policy and value networks (“dual”), and using either convolutional (“conv”) or residual networks (“res”). The combinations “dual-res” and “sep-conv” correspond to the neural network architectures used in AlphaGo Zero and AlphaGo Lee respectively. Each network was trained on a fixed data-set generated by a previous run of AlphaGo Zero. a Each trained network was combined with AlphaGo Zero’s search to obtain a different player. Elo ratings were computed from evaluation games between these different players, using 5 seconds of thinking time per move. b Prediction accuracy on human professional moves (from the GoKifu data-set) for each network architecture. c Mean-squared error on human professional game outcomes (from the GoKifu data-set) for each network architecture.

 

3. Knowledge Learned by AlphaGo Zero
AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included fundamental elements of human Go knowledge, and also non-standard strategies beyond the scope of traditional Go knowledge. Figure 5 shows a timeline indicating when professional joseki (corner sequences) were discovered (Figure 5a, Extended Data Figure 1); ultimately AlphaGo Zero preferred new joseki variants that were previously unknown (Figure 5b, Extended Data Figure 2). Figure 5c and the Supplementary Information show several fast self-play games played at different stages of training. Tournament length games played at regular intervals throughout training are shown in Extended Data Figure 3 and Supplementary Information. AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho (“ladder” capture sequences that may span the whole board) – one of the first elements of Go knowledge learned by humans – were only understood by AlphaGo Zero much later in training.

 

4. Final Performance of AlphaGo Zero
We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days. Over the course of training, 29 million games of self-play were generated. Parameters were updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained 40 residual blocks. The learning curve is shown in Figure 6a. Games played at regular intervals throughout training are shown in Extended Data Figure 4 and Supplementary Information.

Figure 5: Go knowledge learned by AlphaGo Zero.
a Five human joseki (common corner sequences) discovered during AlphaGo Zero training. The associated timestamps indicate the first time each sequence occurred (taking account of rotation and reflection) during self-play training. Extended Data Figure 1 provides the frequency of occurrence over training for each sequence. b Five joseki favoured at different stages of self-play training. Each displayed corner sequence was played with the greatest frequency, among all corner sequences, during an iteration of self-play training. The timestamp of that iteration is indicated on the timeline. At 10 hours a weak corner move was preferred. At 47 hours the 3-3 invasion was most frequently played. This joseki is also common in human professional play; however AlphaGo Zero later discovered and preferred a new variation. Extended Data Figure 2 provides the frequency of occurrence over time for all five sequences and the new variation. c The first 80 moves of three self-play games that were played at different stages of training, using 1,600 simulations (around 0.4s) per search. At 3 hours, the game focuses greedily on capturing stones, much like a human beginner. At 19 hours, the game exhibits the fundamentals of life-and-death, influence and territory. At 70 hours, the game is beautifully balanced, involving multiple battles and a complicated ko fight, eventually resolving into a half-point win for white. See Supplementary Information for the full games.


We evaluated the fully trained AlphaGo Zero using an internal tournament against AlphaGo Fan, AlphaGo Lee, and several previous Go programs. We also played games against the strongest existing program, AlphaGo Master – a program based on the algorithm and architecture presented in this paper but utilising human data and features (see Methods) – which defeated the strongest human professional players 60–0 in online games 34 in January 2017. In our evaluation, all programs were allowed 5 seconds of thinking time per move; AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs; AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUs and 48 TPUs respectively. We also included a player based solely on the raw neural network of AlphaGo Zero; this player simply selected the move with maximum probability.
Figure 6b shows the performance of each program on an Elo scale. The raw neural network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan.
Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100 game match with 2 hour time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Figure 6 and Supplementary Information).

 

5. Conclusion

Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules. Furthermore, a pure reinforcement learning approach requires just a few more hours to train, and achieves much better asymptotic performance, compared to training on human expert data. Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained from human data using handcrafted features, by a large margin.

Figure 6: Performance of AlphaGo Zero.
a Learning curve for AlphaGo Zero using a larger 40-block residual network over 40 days. The plot shows the performance of each player αθi from each iteration i of our reinforcement learning algorithm. Elo ratings were computed from evaluation games between different players, using 0.4 seconds per search (see Methods). b Final performance of AlphaGo Zero. AlphaGo Zero was trained for 40 days using a 40-block residual neural network. The plot shows the results of a tournament between: AlphaGo Zero, AlphaGo Master (defeated top human professionals 60-0 in online games), AlphaGo Lee (defeated Lee Sedol), AlphaGo Fan (defeated Fan Hui), as well as previous Go programs Crazy Stone, Pachi and GnuGo. Each program was given 5 seconds of thinking time per move. AlphaGo Zero and AlphaGo Master played on a single machine on the Google Cloud; AlphaGo Fan and AlphaGo Lee were distributed over many machines. The raw neural network from AlphaGo Zero is also included, which directly selects the move a with maximum probability pa, without using MCTS. Programs were evaluated on an Elo scale 25: a 200 point gap corresponds to a 75% probability of winning.
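The caption's Elo claim can be sanity-checked with the standard logistic Elo curve (the conventional formula; the paper's exact calibration is described in its Methods): the expected win probability for a rating gap d is 1 / (1 + 10^(−d/400)).

```python
def elo_win_prob(gap):
    """Expected win probability under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))

p200 = elo_win_prob(200)  # roughly 0.76, i.e. about the 75% quoted
```

By the same curve, AlphaGo Zero's 5,185 versus AlphaGo Master's 4,858 (a 327-point gap) corresponds to an expected win rate in the mid-80s percent, broadly consistent with the reported 89-11 match result.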

Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.



Source: https://deepmind.com/docum.../119/agz_unformatted_nature.pdf




