Computation and Optimization in Turn-Based Zero-Sum Games

Explore the concepts of Monte-Carlo methods, N-Grams, and the Last-Good-Reply Policy in MCTS for game playing optimization. Learn about turn-based zero-sum games, worst-case scenarios, minimax trees, and the strategic approach applied in game theory.

Presentation Transcript


  1. Monte-Carlo Methods for Computation and Optimization, Spring 2015. N-Grams and the Last-Good-Reply Policy in MCTS. Based on "N-Grams and the Last-Good-Reply Policy Applied in General Game Playing" (Mandy J. W. Tak, Mark H. M. Winands, Yngvi Björnsson), IEEE Transactions on Computational Intelligence and AI in Games, Vol. 4, No. 2, June 2012. Presentation by Ayal Shwartz.

  2. Outline
     o Games:
       o Turn-based zero-sum games
       o Worst-case scenario
       o Minimax trees
       o αβ pruning
       o Problems with minimax trees and αβ pruning
     o Probabilistic approach:
       o Monte-Carlo Tree Search
       o Selection and Expansion
       o Simulation and Backpropagation
       o Upper confidence bounds
     o Simulation strategies:
       o Scoring
       o Move-Average Sampling
       o N-Grams
       o Last-Good-Reply policy
     o Some fine details
     o Results

  3. Games "A game is structured playing, usually undertaken for enjoyment, and sometimes used as an educational tool. Key components of games are goals, rules, challenge, and interaction." (Source: Wikipedia) In a game, we have players, each with their own reward function. Usually the only reward that interests us is the one at the end of the game, when no one can play anymore and the rewards are set. Each player wants to maximize their own reward, as defined by the game.

  4. Turn-Based Zero-Sum Games Many games are turn-based, meaning every player has his turn to play, and cannot play during someone else's turn. Many games of interest are zero-sum games, meaning the sum of the rewards is 0 (or some constant). This means that if one player has positive reward, someone else must have negative reward. In two-player games, this means that if your opponent wins, you lose!

  5. Worst-Case scenario We assume that our opponents will always choose the move that guarantees them maximal reward in the worst-case scenario, and we should do the same. In other words, we should choose the moves for which the minimal reward we can guarantee will be maximal.

  6. Minimax trees On our turn, we want to choose the move that gives us the highest worst-case reward. On his turn, our opponent wants to choose the move that gives him the highest worst-case reward. Since this is a zero-sum game, maximal reward for us is minimal reward for the opponent, and minimal reward for us is maximal reward for the opponent. (Figure: an example minimax tree with leaf rewards 3, 2, -2, and -4.)
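To make the worst-case reasoning concrete, here is a minimal minimax sketch in Python (not from the presentation). It is written in negamax form, so one function serves both players, and it assumes a hypothetical game-state API with is_terminal(), reward() (from the point of view of the player to move), legal_moves(), and apply(move).

```python
def minimax(state):
    """Best worst-case reward for the player to move (negamax form)."""
    if state.is_terminal():
        return state.reward()
    # In a zero-sum game the opponent's best reply is our worst case,
    # so we negate the value computed for each child position.
    return max(-minimax(state.apply(move)) for move in state.legal_moves())
```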

  7. αβ Pruning αβ pruning saves us some computation. Since our opponent will always pick the lowest reward, there is no point in looking at the sibling of the leaf with reward -4: our opponent won't pick any reward greater than -4, and we already have a way to guarantee at least a reward of -2. This is called alpha pruning. Similar reasoning gives us beta pruning. (Figure: the same minimax tree, with the sibling of the -4 leaf pruned.)
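The pruning itself is a small change to the sketch above: alpha and beta bound what each side can already guarantee, and siblings that cannot affect the final choice are skipped. Same assumed state API as before.

```python
def alphabeta(state, alpha=float("-inf"), beta=float("inf")):
    """Negamax minimax with alpha-beta pruning: same value as plain minimax,
    but provably irrelevant branches are never explored."""
    if state.is_terminal():
        return state.reward()
    value = float("-inf")
    for move in state.legal_moves():
        value = max(value, -alphabeta(state.apply(move), -beta, -alpha))
        alpha = max(alpha, value)
        if alpha >= beta:   # the opponent already has a better option elsewhere
            break           # prune the remaining sibling moves
    return value
```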

  8. Problems with minimax trees and αβ pruning In almost any game of interest, there are too many possible states. We cannot consider all possible ways to continue the game from a given state, unless there aren't too many ways to end the game from that state: there is not enough memory, and not enough time to perform the necessary computations. We are limited in how deep we can develop the tree.

  9. Problems with αβ pruning (2) Solution: use a heuristic reward function for the states. This enables us to look only at the next n steps, and choose the step that would give us the best worst-case-scenario reward in n steps. Problem: this requires prior knowledge of the game.
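As a sketch only, the depth limit and heuristic slot into the alpha-beta code like this; heuristic(state) is a hypothetical, game-specific evaluation of the state from the point of view of the player to move.

```python
def depth_limited(state, depth, heuristic, alpha=float("-inf"), beta=float("inf")):
    """Alpha-beta search cut off after `depth` plies, falling back on a heuristic."""
    if state.is_terminal():
        return state.reward()
    if depth == 0:
        return heuristic(state)     # game-specific knowledge is needed here
    value = float("-inf")
    for move in state.legal_moves():
        value = max(value, -depth_limited(state.apply(move), depth - 1,
                                          heuristic, -beta, -alpha))
        alpha = max(alpha, value)
        if alpha >= beta:
            break
    return value
```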

  10. Problems with αβ pruning (3) In General Game Playing (GGP), the objective is to be able to play any game. Given a description of the game (in Game Description Language), we would like to be able to play it as soon as we have the game's description (with some time for pre-processing allowed). We won't have time to come up with a heuristic reward function.

  11. Monte-Carlo Tree Search Rather than look at all the options, we can probabilistically sample moves, and simulate a game from the chosen moves until it ends. Based on the results of the simulations, we decide which moves are best. This (potentially) saves us some time and memory. Hopefully, we will get a good estimation of the (true) victory probability. Note that the victory probability also depends on how we select the next move we will play. This also allows us to interrupt the tree search at any time and come up with an action (though we cannot make the same guarantees made in minimax trees).

  12. Monte-Carlo Tree Search (2) If we bias ourselves towards good moves in the simulation, we might get a different probability distribution over the states in which the game ends. We are more likely to win if we bias our choice of moves towards ones that give us a better victory probability (assuming we also select the next move to play appropriately).

  13. Selection and Expansion Selection: starting at the root (representing the current state), traverse the tree until you reach a leaf. Expansion: if the leaf doesn't represent a state in which the game ends, select it and create one (or more) children for it. Each child represents a state reachable from its parent using an action that can be performed from the parent state. (Figure: a small tree annotated with win/visit counts such as 3/4, 1/2 and 2/2, with a newly expanded node at 0/0.)
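One possible node structure for the tree built here, keeping the visit and reward statistics shown as wins/visits on the slide; this is a sketch using the same hypothetical state API as the earlier snippets.

```python
import random

class Node:
    """One MCTS tree node: statistics plus the moves not yet expanded."""
    def __init__(self, state, parent=None, move=None):
        self.state = state                  # game state this node represents
        self.parent = parent
        self.move = move                    # move that led here from the parent
        self.children = []
        self.untried_moves = list(state.legal_moves())
        self.visits = 0
        self.total_reward = 0.0

    def expand(self):
        """Create a child for one move that has not been tried from this node."""
        move = self.untried_moves.pop(random.randrange(len(self.untried_moves)))
        child = Node(self.state.apply(move), parent=self, move=move)
        self.children.append(child)
        return child
```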

  14. Simulation and Backpropagation Simulation: from the newly created node, simulate a game. That is, sample moves for each player, create a child node representing the resulting state, and repeat from the new node until the end of the game is reached. Backpropagation: propagate the result of the simulation (victory/defeat, or reward) back to the node created in the expansion phase, and update its parents. (Figure: the same tree after a simulation ending 1:0, with the win/visit counts along the path updated, e.g. 0/0 becoming 1/1 and 3/4 becoming 4/5.)
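Putting the four phases together, a minimal MCTS loop over the Node sketch above might look as follows; select_child stands for a selection rule such as UCT (next slides), simulate for a simulation strategy returning a reward in [0, 1], and the bookkeeping of whose point of view each reward is credited from is omitted for brevity.

```python
def mcts(root, iterations, select_child, simulate):
    """Run rounds of selection, expansion, simulation and backpropagation,
    then return the most visited child of the root."""
    for _ in range(iterations):
        # Selection: descend while the node is fully expanded and has children.
        node = root
        while not node.untried_moves and node.children:
            node = select_child(node)
        # Expansion: add one child unless the game already ended at this node.
        if node.untried_moves:
            node = node.expand()
        # Simulation: play the game out from the new node's state.
        reward = simulate(node.state)
        # Backpropagation: update every node on the path back to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return max(root.children, key=lambda child: child.visits)
```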

  15. Upper confidence bound In the selection phase, we would like to balance two elements: exploration and exploitation. We should explore new moves: perhaps we will find a path with a higher reward, and shorter paths to victory are preferable, as they contain less uncertainty. We should exploit promising moves: we don't have enough time or memory to explore too many options, and there may not be any better options that we are likely to reach.

  16. Upper confidence bound (2) UCT (Upper Confidence bound for Trees) can be used to balance exploration and exploitation:

  $a^* = \arg\max_{a \in A(s)} \left[ Q(s,a) + C \sqrt{\frac{\ln N(s)}{N(s,a)}} \right]$

  where $A(s)$ is the collection of all actions we can take from state $s$, $Q(s,a)$ is the average reward when taking action $a$ from $s$, $N(s)$ is the number of visits to state $s$, and $N(s,a)$ is the number of times action $a$ has been taken from state $s$.
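A sketch of this rule over the Node structure from the earlier snippet; C is a tunable exploration constant (the default below is only illustrative), and unvisited children get an infinite value so they are tried before the formula applies, matching the convention described two slides down.

```python
import math

def uct_select(node, c=0.4):
    """Pick the child maximizing Q(s,a) + C * sqrt(ln N(s) / N(s,a))."""
    def uct_value(child):
        if child.visits == 0:
            return float("inf")                      # unexplored actions go first
        exploit = child.total_reward / child.visits  # average reward Q(s, a)
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=uct_value)
```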

  17. Upper confidence bound (3) In $a^* = \arg\max_{a \in A(s)} \left[ Q(s,a) + C \sqrt{\frac{\ln N(s)}{N(s,a)}} \right]$, the square-root term is the trade-off between exploration and exploitation. Every time an action is taken from a certain state, the chance of this action being taken again decreases, thus increasing the importance of its reward and decreasing the importance of exploring it again, while the chance to choose any other action increases slightly, to promote exploration of other actions.

  18. Upper confidence bound (4) The UCT value gives us a measure of how important it is to expand certain nodes. Note that if we haven't explored a certain action $a$ from a state $s$, then $N(s,a)$ is 0, and $Q(s,a)$ is not defined. In such a case, we will explore $a$ by default. In other words, if we've decided that it is important to explore a certain state $s$ which has unexplored actions, it is important to explore them. If there are many such actions, we use the same selection strategy we use in the simulation stage.

  19. Simulation strategies The simulation strategy is a key component of MCTS. A basic simulation strategy would be to select random moves. This, however, might not be such a good idea: we can generally assume that our opponent will favour moves which give him a good chance at victory. We would also like to choose good moves when we actually play the game.

  20. Scoring If we have a score for each move, we can use that score when considering which move to select next. We will consider a simple scoring function: the average reward returned when the move (or series of moves) we're considering was used. For convenience, let's assume our rewards are 1 (victory) or 0 (defeat). This can easily be generalized to games with different rewards and to non-zero-sum games.

  21. Scoring (2) How do we use the score? Gibbs measure: let $m_1, \ldots, m_n$ be all the legal moves from a certain state, and let $p_i$ be the probability of selecting $m_i$ as the next move. Then

  $p_i = \frac{e^{Q(m_i)/\tau}}{\sum_{j=1}^{n} e^{Q(m_j)/\tau}}$

  where $Q(m_i)$ is the score for move $m_i$, and $\tau > 0$ is a parameter which controls exploration. $Q(m_i)$ is taken to be the maximal reward if move $m_i$ hasn't been given a score yet. ε-Greedy: select the move with the highest score with probability $1 - \epsilon$; otherwise, sample a random move uniformly.
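Both selection rules are short to sketch; score stands for whatever scoring function is in use (for example MAST or N-Grams below), and the parameter defaults are illustrative, not values from the paper.

```python
import math
import random

def gibbs_choice(moves, score, tau=1.0):
    """Gibbs (softmax) selection: P(m_i) is proportional to exp(score(m_i) / tau)."""
    weights = [math.exp(score(move) / tau) for move in moves]
    return random.choices(moves, weights=weights, k=1)[0]

def epsilon_greedy_choice(moves, score, epsilon=0.1):
    """Pick the highest-scored move with probability 1 - epsilon,
    otherwise a uniformly random legal move."""
    if random.random() < epsilon:
        return random.choice(moves)
    return max(moves, key=score)
```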

  22. Move-Average Sampling The Move-Average Sampling Technique (MAST) uses the average reward from the play-outs (simulations) in which a move was used to calculate the move's score, in order to determine whether it is a good move or not. (Figure: example statistics for moves m1, m2, m3 before and after a play-out ending 0:1; every move used in the lost play-out gets a defeat added, e.g. m1 goes from 3 victories and 2 defeats to 3 victories and 3 defeats.)
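A minimal MAST bookkeeping sketch along these lines: after every play-out the reward is credited to each move that appeared in it, and a move's score is its running average, with an optimistic default for unseen moves as on the scoring slide.

```python
from collections import defaultdict

class MastTable:
    """Context-free average reward per move, in the spirit of MAST."""
    def __init__(self, optimistic_default=1.0):
        self.optimistic_default = optimistic_default
        self.reward_sum = defaultdict(float)
        self.plays = defaultdict(int)

    def update(self, moves_in_playout, reward):
        """Credit a finished simulation's reward to every move it contained."""
        for move in moves_in_playout:
            self.reward_sum[move] += reward
            self.plays[move] += 1

    def score(self, move):
        """Average reward of the move; unseen moves get the optimistic default."""
        if self.plays[move] == 0:
            return self.optimistic_default
        return self.reward_sum[move] / self.plays[move]
```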

  23. N-Grams N-Grams add context to the moves: we look at up to N-1 moves back when deciding which move to select next. We give a score to each legal move based on the N-Gram formed when adding it to the previous moves. If a certain series of moves hasn't been played enough times, we ignore it, as we do not have enough information on it yet. We average the scores of all the relevant n-Grams, n ≤ N, that we have in memory. (If a certain n-Gram hasn't been played at least k times, we ignore it. In the paper, k = 7 was chosen with manual tuning.)
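The sketch below follows the slide's description (with N = 3 and k = 7 as stated): every n-gram ending in the chosen move is updated after a play-out, and a move's score is the average over the stored n-grams that have been played at least k times. The paper's exact treatment of short histories and of 1-grams may differ.

```python
from collections import defaultdict

class NGramTable:
    """Average reward per move-in-context, for contexts of up to N-1 previous moves."""
    def __init__(self, n_max=3, min_plays=7):
        self.n_max = n_max                  # N: longest n-gram considered
        self.min_plays = min_plays          # k: ignore n-grams played fewer times
        self.reward_sum = defaultdict(float)
        self.plays = defaultdict(int)

    def _grams(self, history, move):
        """All n-grams (n = 1 .. N) that end in `move`, given the move history."""
        for n in range(1, self.n_max + 1):
            if n - 1 <= len(history):
                context = tuple(history[len(history) - (n - 1):])
                yield context + (move,)

    def update(self, history, move, reward):
        """Credit a play-out's reward to every n-gram ending in `move`."""
        for gram in self._grams(history, move):
            self.reward_sum[gram] += reward
            self.plays[gram] += 1

    def score(self, history, move, optimistic_default=1.0):
        """Average the scores of the relevant n-grams that have enough plays."""
        scores = [self.reward_sum[g] / self.plays[g]
                  for g in self._grams(history, move)
                  if self.plays[g] >= self.min_plays]
        return sum(scores) / len(scores) if scores else optimistic_default
```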

  24. N-Grams (2) Example reward lookup (average reward when the series of moves was used), for candidate moves m1, m2, m3 following previous moves a and b:

    (a, b, m1): 0.4    (b, m1): 0.8    (m1): 0.9    Average: 0.7
    (a, b, m2): 0.1    (b, m2): 0.2    (m2): 0.3    Average: 0.2
    (a, b, m3): not enough data    (b, m3): 0.6    (m3): 0.2    Average: 0.4

  25. Last-Good-Reply Policy During a game, we can consider moves to be replies to the previous move (or series of moves). If a reply to a series of moves was successful, we should try it again when we are able to. Like in N-Grams, we can look back a few steps and try to consider the context. If we lost in one of our simulations, all of the replies which appeared in the simulation are removed from memory, since they are no longer considered good replies. If we won in the simulation, all the replies that appeared in the simulation will be added, since they seem to be good replies.
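A sketch of that bookkeeping: fallback is a selection function such as the N-Gram or MAST based choices sketched earlier, and the exact context key (how many previous moves a reply answers) is left to the caller, since the slide allows looking back more than one step.

```python
class LastGoodReplyPolicy:
    """Remember replies from won simulations, forget replies from lost ones."""
    def __init__(self, fallback):
        self.replies = {}            # context (previous move or moves) -> stored reply
        self.fallback = fallback     # e.g. an N-Gram or MAST based move selector

    def choose(self, context, legal_moves):
        """Play the stored reply if it exists and is legal, otherwise fall back."""
        reply = self.replies.get(context)
        if reply is not None and reply in legal_moves:
            return reply
        return self.fallback(context, legal_moves)

    def after_simulation(self, reply_pairs, won):
        """`reply_pairs` are the (context, reply) pairs seen in the play-out."""
        for context, reply in reply_pairs:
            if won:
                self.replies[context] = reply       # remember as the last good reply
            else:
                self.replies.pop(context, None)     # forget replies from a lost game
```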

  26. Some fine details Expansion when a node has unvisited moves: the UCT formula cannot be applied when we have actions for which we have no score. In such a case, we use one of the simulation strategies to select an action from the set of unexplored actions that can be taken. Last-Good-Reply: if we cannot use a reply (because it isn't a legal move), we use a fall-back strategy, either N-Gram or MAST in this case. Basic MCTS algorithms might simply select moves with uniform probabilities.

  27. Some fine details (2) Non-zero-sum games: when the game is not a zero-sum game, we cannot assume that our opponent will choose the move which is worst for us. Instead, when we simulate the opponent's move, we assume he tries to maximize his own reward. Games with more than two players: when there are k players, an N-Gram for player i is made of a sequence of N-1 consecutive moves, made by players (i - N + 1) mod k, ..., (i - 1) mod k, with the move made by player i appended to the end of this sequence.

  28. Some fine details (3) Concurrent-move games: we can look at a turn-based game as a concurrent-move game with a special action called noop (no operation), which each player must play until a certain condition is met (signalling that it's his turn). Note that even in concurrent-move games, the N-Grams look back on previous turns, not on multiple moves in the same turn, though they are calculated from the perspective of each player.

  29. Some fine details (4) And finally, how do we select the move when we actually need to play? Option 1: select the move which leads to the highest reward we've seen so far. This is the strategy employed by CadiaPlayer, a player that won multiple General Game Playing competitions using MCTS with MAST and the Gibbs measure. Option 2: select the move which has the highest average reward. Both have their advantages and disadvantages.
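Both options are one-liners over the root's children of the Node sketch; option 1 assumes a hypothetical extra max_reward_seen field that would have to be maintained during backpropagation.

```python
def final_move_highest_seen(root):
    """Option 1: the move whose subtree produced the highest reward seen so far."""
    return max(root.children, key=lambda c: c.max_reward_seen).move

def final_move_highest_average(root):
    """Option 2: the move with the highest average reward over its simulations."""
    return max(root.children,
               key=lambda c: c.total_reward / c.visits if c.visits else 0.0).move
```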

  30. Some fine details (5) It would be preferable to win in as few moves as possible, since fewer moves mean less uncertainty. A discount factor γ can be introduced: the reward for an endgame that takes n steps is multiplied by γ^n. In CadiaPlayer, γ = 0.999.
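As a one-line sketch, the discount is applied to the reward before it is backpropagated; γ = 0.999 is the CadiaPlayer value quoted above.

```python
def discounted_reward(reward, steps, gamma=0.999):
    """Scale an end-of-game reward by gamma ** steps so shorter wins score higher."""
    return reward * (gamma ** steps)
```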

  31. Results (Table: win percentages of the ε-Greedy N-Gram player and the Last-Good-Reply players, with MAST and with N-Gram fall-back, when matched against MAST and ε-Greedy MAST opponents in Checkers and Connect5; best of all ε values tested, with Gibbs-measure results in parentheses. The win rates listed for the N-Gram and LGR players range from roughly 71% to 83%.)

  32. Results (2) ε-Greedy MAST against MAST using the Gibbs measure (win percentage of ε-Greedy MAST, for seven values of ε):

               ε = ?.?   ε = ?.?   ε = ?.?   ε = ?.?   ε = ?.?   ε = ?.?   ε = ?.?
    Connect5     28.1      33.6      49.7      57.9      59.6      71.7      72.7
    Checkers     74.4      73.7      69.8      71.1      65.6      68.7      62.7

  It seems that in the case of Connect5, exploration is more important than exploitation, as opposed to Checkers.
