CN111178541B - Game artificial intelligence system and performance improving system and method thereof - Google Patents

Game artificial intelligence system and performance improving system and method thereof

Info

Publication number
CN111178541B
CN111178541B (application CN201911389843.4A)
Authority
CN
China
Prior art keywords
data
layer
node
priority
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911389843.4A
Other languages
Chinese (zh)
Other versions
CN111178541A (en)
Inventor
王志伟
涂仕奎
徐雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201911389843.4A
Publication of CN111178541A
Application granted
Publication of CN111178541B
Active legal status
Anticipated expiration legal status

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a performance-improvement method for a game artificial intelligence system, in which a deep parallel computing framework computes, for multi-channel game data, the initial node values and an estimate of the return information; the node initial values are assembled into a tree structure and filled in as data-node information, the data-node information is back-propagated, the information stored at each data node is updated, and the final priority is output. For the input multi-channel game data, the deep parallel computing framework computes a weight ratio: a deep gate network model produces the weight ratio, and the weighted combined priority of the input multi-channel game data and the fixed data record is computed. A system for executing the method and a game artificial intelligence system whose performance is improved by the method are also provided. The invention effectively improves system performance under limited data, and its relatively low power consumption makes development feasible for ordinary enterprises or teams.

Description

Game artificial intelligence system and performance improving system and method thereof
Technical Field
The invention relates to the technical field of computer-built game artificial intelligence systems, and in particular to a game artificial intelligence system and a performance-improvement system and method therefor, used to improve the performance of a game-playing system; it is a performance-improvement technique based on a combined expert system and a deep parallel computing framework.
Background
Early work on building game artificial intelligence systems with computers focused mainly on designing heuristic functions: for example, the chess AI program Deep Blue was built by having computer programmers encode features designed by several chess masters and then write a minimax search program. Such approaches have the following limitations: (1) human understanding of the game is limited; for tasks with an extremely large state space such as Go, human domain knowledge is so limited that the designed heuristic functions are unreliable; (2) minimax search is brute-force, extremely time-consuming, and inefficient; it surpassed humans only at chess and achieved no breakthrough on Go-like games; (3) hand-crafted linear features are too simple to cope with the complexity of Go and cannot exploit modern high-performance computing resources.
With the development of deep parallel computing frameworks, especially deep convolutional neural networks, end-to-end techniques have become increasingly widely used. Some methods treat multi-channel game data (e.g., a board position) as an image and have achieved breakthroughs. Convolutional neural networks are a class of artificial neural networks well suited to processing images; the grid layout and local-pattern properties of multi-channel game data make it possible to extract features with them. For example, deep Q-learning replaces the tabular model with a convolutional neural network and achieves better results on Atari games. For Go-playing systems, replacing hand-crafted features with a convolutional neural network greatly improves the accuracy of predicting expert Go moves. However, such methods do not model the environment and have no forward-search process, and the instability caused by the absence of forward search is an important shortcoming.
The development of Go-playing (weiqi) systems has been a research focus of industry and academia for decades, and building such artificial intelligence programs is highly challenging. There are existing methods that model the priority and the return value with a neural network, but they still sample through self-play; although the results are good, they consume enormous hardware resources. Specifically, a priority-return model is built from residual modules: it receives a board representation as input and estimates the probability distribution of the next move and the winning rate, and it can be regarded as a fast and powerful heuristic function guiding Monte Carlo tree search. In the last two years, several enterprises and teams have continued to use this method to develop high-level Go-playing programs, but these are mere reproductions with no theoretical or methodological innovation, and the enormous hardware resources and time they consume are beyond the reach of ordinary researchers or individuals.
Machine learning can be used to train and strengthen a game artificial intelligence system. For Go, a cognitive system can be designed to initialize the priorities of data nodes, and both supervised learning and reinforcement learning can strengthen it. The supervised-learning approach trains the priority-return model on a data set containing over 20 million (multi-channel game data, priority information, return) triplets. In reinforcement learning, a large amount of data is generated through self-play to strengthen the system; higher-quality data are generated over the iterations, and the performance of the cognitive module is gradually enhanced.
Combining multiple classifiers or other learners is a classical technique that has achieved good results in many fields such as face recognition, house-price prediction, and tumor detection, and the prior art has been studied extensively in theory. A combined expert system can be viewed, in functional form, as a linear combination of several experts, where the combination weights are controlled by a gate network and depend on the input. The parameters of a combined expert system (both the experts' parameters and the gate network's parameters) can be learned by maximum likelihood or another loss function; the expert and gate-network parameters can thus be adjusted simultaneously, or one can be fixed while the other is optimized. At present, however, combined expert systems are mainly applied to simple models such as shallow neural networks or linear classifiers; there is no combined expert system for a game-playing system or for multiple deep neural network models.
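As a concrete illustration of this background, the gated linear combination can be sketched in a few lines of numpy. The expert functions, the gate, and the input below are toy stand-ins invented for illustration; only the combination rule (gate-weighted linear mixing of expert outputs) follows the description above.

```python
import numpy as np

def softmax(x):
    """Normalize a score vector into mixing weights that sum to 1."""
    e = np.exp(x - x.max())
    return e / e.sum()

def combined_prediction(x, experts, gate):
    """Gate-weighted linear combination of expert outputs, as in a
    combined (mixture-of-experts) system; the weights depend on the input."""
    weights = softmax(gate(x))                      # gate network output -> weights
    preds = np.array([expert(x) for expert in experts])
    return weights @ preds                          # linear combination

# Toy experts and a toy gate (purely illustrative):
experts = [lambda x: x.sum(), lambda x: x.mean()]
gate = lambda x: np.array([1.0, 0.0])               # gate favors the first expert
x = np.array([1.0, 2.0, 3.0])
y = combined_prediction(x, experts, gate)           # pulled toward x.sum() = 6.0
```

Fixing one side (here the experts) while adjusting the other, as the text notes, simply means optimizing only the gate's parameters (or only the experts') under the same combined loss.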
At present, no description or report of a technique similar to the present invention has been found, nor have similar domestic or foreign materials been collected.
Disclosure of Invention
In view of the prior art's dependence on massive data for machine learning and its large hardware consumption, the invention aims to provide a game artificial intelligence (AI) system based on a deep parallel computing framework and a forward simulation system, together with a performance-improvement system and method therefor. The trained AI system reaches the current highest level while using no more than a 4-core CPU and 15 ordinary graphics cards, resources affordable to ordinary enterprises or teams.
The invention is realized by the following technical scheme.
According to a first aspect of the present invention, there is provided a performance improvement method for a game artificial intelligence system, comprising:
s0: acquiring a data set (s, π, z) from the network (e.g., https://u-go.net/gamerecords/) as input data of the game artificial intelligence system, where s is the multi-channel game data, π is the final priority obtained in S2, and z is the binary information returned by the game artificial intelligence system according to the win/loss result when the game ends;
s1: computing, with a deep parallel computing framework, the node-priority initial values and the return-information estimates for the multi-channel game data recorded in the data set;
s2: forming a tree structure from the node-priority initial values computed in S1, generating new data nodes, filling in the node-priority initial values computed by the cognitive module as data-node information, filling the return-information estimates obtained in S1 into the data nodes, back-propagating the data-node information, updating the information stored at each data node, and outputting the posterior action prediction, i.e., the final priority;
s3: computing, with a deep parallel computing framework, the weight ratio of the multi-channel game data recorded in the data set;
s4: combining the priority recorded in the data set with the final priority obtained in S2, using the weight ratio obtained in S3, to obtain the weighted combination of the two priorities.
Preferably, in S1, the deep parallel computing framework comprises L residual-module layers and an (L+1)-th feature-adjustment layer, where the size of the multi-channel game data is unchanged as it passes through the L residual layers, which perform compression operations and batch normalization on the data; the (L+1)-th feature-adjustment layer comprises the following two parts:
- a first part, which resizes the feature map and then computes the output nodes' initial priorities through a softmax function;
- a second part, which resizes the multi-channel data and then computes the output return-information estimate through a tanh function;
wherein:
the node-priority initial value is the initial priority of the nodes output by the first part of the (L+1)-th feature-adjustment layer, a 362-dimensional array (the 19×19 board points plus a pass move), used as the initial value in S2;
the return-information estimate is the return information output by the second part of the (L+1)-th feature-adjustment layer, an approximate estimate of the binary result returned by the game artificial intelligence system.
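A minimal numpy sketch of the two heads of the feature-adjustment layer follows: a softmax head producing the 362-dimensional initial priorities and a tanh head producing the scalar return estimate. The feature vector and weight shapes are illustrative assumptions; the L residual layers that would produce the features are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.standard_normal(64)                 # stand-in for the residual tower's output

W_policy = rng.standard_normal((362, 64)) * 0.01   # first part: 362-dim priority head
W_value = rng.standard_normal(64) * 0.01           # second part: scalar return head

p = softmax(W_policy @ features)        # node-priority initial values over 362 actions
v = float(np.tanh(W_value @ features))  # return-information estimate, bounded in (-1, 1)
```

The softmax guarantees the 362 priorities form a probability distribution, and the tanh bounds the return estimate so that it can approximate the binary game result (assumed here to be encoded as ±1).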
Preferably, the S1 further includes the following steps: the deep parallel computing framework is trained based on a data set; wherein:
the deep parallel computing framework updates its parameters through a defined update mechanism, which is as follows:

L = -π^T log p + (z - v)^2 + c||θ||^2

where the first term, -π^T log p, is a cross-entropy term measuring the difference between the node-priority initial values output by the framework and the priorities recorded in the data set; π is the final priority obtained in S2, and p is the node-priority initial value given by the deep parallel computing framework. The second term, (z - v)^2, measures the difference between the return-information estimate output by the framework and the binary information returned according to the game result; z is the binary information returned by the system according to the win/loss result when the game ends, and v is the return-information estimate given by the deep parallel computing framework. The third term, c||θ||^2, is an L2 regularization term used to limit the scale of the framework; θ denotes all parameters of the deep parallel computing framework, and c is a coefficient controlling the L2 regularization.

The update mechanism is applied by gradient descent:

θ ← θ - η∇θL

where η is the update rate, controlling the magnitude of each framework update, and ∇θL is the gradient information fed back to the deep parallel computing framework after computing the update mechanism, indicating the direction in which the framework needs to be updated.
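The update mechanism above can be exercised on a toy model. The linear "policy" and "value" parameterizations below are invented purely to make the loss L = -π^T log p + (z - v)^2 + c||θ||^2 and the step θ ← θ - η∇θL concrete; the gradient is taken numerically for brevity rather than by backpropagation.

```python
import numpy as np

def loss(theta, s, pi, z, c=1e-4):
    """Toy instance of L = -pi^T log p + (z - v)^2 + c * ||theta||^2."""
    logits = theta[:3] * s                       # toy "policy" parameters
    p = np.exp(logits) / np.exp(logits).sum()    # softmax priorities
    v = np.tanh(theta[3:] @ s)                   # toy "value" output
    return -(pi * np.log(p)).sum() + (z - v) ** 2 + c * (theta ** 2).sum()

def numeric_grad(f, theta, eps=1e-6):
    """Central-difference gradient, standing in for backpropagation."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

s = np.array([0.5, -0.2, 0.1])     # toy multi-channel game data
pi = np.array([0.7, 0.2, 0.1])     # final priority from S2
z = 1.0                            # binary return information
theta = np.zeros(6)

f = lambda th: loss(th, s, pi, z)
eta = 0.1                                          # update rate
theta_new = theta - eta * numeric_grad(f, theta)   # theta <- theta - eta * grad L
```

A single step decreases the toy loss, illustrating that the gradient fed back to the framework is a descent direction.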
Preferably, in S2, connections are established between the data nodes of the tree structure, where each data node stores the following information:
- a node-priority initial value, computed in S1, representing the priority of selecting this data node;
- a visit count, representing the number of times this data node has been accessed;
- average return information, obtained as the running average of the return-information estimates computed in S1.
The following 4 steps are repeated:
- Selection: the tree simulation follows a best-first principle, i.e., at each layer the child data nodes are visited according to high node initial value, low visit count, and high action value; the leaf node finally reached is the selected data node.
- Expansion: all legal nodes under the leaf node are initialized according to the computation in S1; the node initial value is set to the node-priority initial value computed in S1, and the visit count and the average return estimate are initialized to 0.
- Evaluation: the return-information estimate v of the leaf node is obtained from S1.
- Back-propagation: the data-node information is updated layer by layer upward until the initial data node is reached; specifically, the visit count is incremented by 1, and the evaluation v is accumulated into the average return estimate, which is then re-averaged.
After these steps have been repeated a number of times, the final priority of each action is computed as the visit count of each child data node divided by the sum of the visit counts of all child data nodes, and the action is selected accordingly.
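The four repeated steps can be sketched on a toy one-level tree. The PUCT-style selection score below (prior scaled down by visit count, plus the running action value) is an assumption consistent with the stated principle of "high node initial value, low visit count, high action value"; the exact selection formula is not given in the text, and the action names and evaluations are invented.

```python
import math

class Node:
    """A data node storing the three pieces of information described above."""
    def __init__(self, prior):
        self.prior = prior       # node-priority initial value (from S1)
        self.visits = 0          # visit count
        self.value = 0.0         # running average of return estimates

def select(children, c_puct=1.0):
    """Best-first selection: favor high prior, few visits, high action value."""
    total = sum(ch.visits for ch in children.values())
    def score(ch):
        return ch.value + c_puct * ch.prior * math.sqrt(total + 1) / (1 + ch.visits)
    return max(children.items(), key=lambda kv: score(kv[1]))

def simulate(priors, evaluate, n_sims=100):
    children = {a: Node(p) for a, p in priors.items()}       # expansion
    for _ in range(n_sims):
        action, child = select(children)                     # selection
        v = evaluate(action)                                 # evaluation (S1's estimate v)
        child.visits += 1                                    # back-propagation:
        child.value += (v - child.value) / child.visits      #   re-average the return
    total = sum(ch.visits for ch in children.values())
    return {a: ch.visits / total for a, ch in children.items()}  # final priority

priors = {"a": 0.6, "b": 0.3, "c": 0.1}          # initial priorities from S1
evaluate = {"a": 0.2, "b": 0.9, "c": -0.5}.get   # toy return estimates
final_priority = simulate(priors, evaluate)
```

Even though action "b" starts with a lower prior than "a", its higher evaluated return attracts most of the visits, so it ends with the highest final priority, which is exactly the visit-count ratio described above.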
Preferably, in S3, the deep parallel computing framework comprises L residual-module layers and an (L+1)-th feature-adjustment layer, where the size of the multi-channel game data is unchanged as it passes through the L residual layers, which perform compression operations and batch normalization on the data; the (L+1)-th feature-adjustment layer resizes the multi-channel data and then computes the weight ratio through a sigmoid function.
Preferably, in S4, the weighted combined priority of the data-set record and the final priority obtained in S2 is computed as follows:
first, compute:
- the weight coefficient output by the deep parallel computing framework multiplied by the priority recorded in the data set;
- one minus the weight coefficient output by the deep parallel computing framework, multiplied by the final priority obtained in S2;
then add the two results to obtain the weighted combined priority.
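The S3/S4 combination step amounts to a sigmoid gate blending two probability vectors. The name gate_logit below is an invented stand-in for the gate network's pre-activation output; the combination rule w · π_dataset + (1 - w) · π_final follows the two-part computation just described.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combine_priorities(gate_logit, pi_dataset, pi_final):
    """Weighted combination of the data-set priority and the final priority:
    w * pi_dataset + (1 - w) * pi_final, with w from the gate's sigmoid."""
    w = sigmoid(gate_logit)          # weight ratio output by the gate network
    return w * pi_dataset + (1.0 - w) * pi_final

pi_dataset = np.array([0.5, 0.3, 0.2])   # priority recorded in the data set
pi_final = np.array([0.1, 0.8, 0.1])     # final priority from S2 / forward simulation
combined = combine_priorities(0.0, pi_dataset, pi_final)   # sigmoid(0) = 0.5
```

Because both inputs are probability distributions and the two weights sum to 1, the combined priority remains a valid distribution.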
According to a second aspect of the present invention, there is provided a performance-improvement system for a game artificial intelligence system, for executing any of the above methods, comprising:
- a cognitive module: for input multi-channel game data, computing the node-priority initial values and the return-information estimates using a deep parallel computing framework;
- a forward simulation module: forming a tree structure from the node-priority initial values computed by the cognitive module, generating new data nodes, filling in the node initial values computed by the cognitive module as data-node information, filling the return-information estimates into the data nodes, back-propagating the data-node information, updating the information stored at each data node, and outputting the final priority;
- a gating module: for input multi-channel game data, computing the weight ratio of the multi-channel game data using a deep parallel computing framework;
- a combination module: combining the priority recorded in the data set with the final priority obtained by the forward simulation module, using the weight ratio obtained by the gating module, to obtain the weighted combination of the two priorities.
Preferably, the cognitive module adopts a deep parallel computing framework comprising:
- L residual-module layers: each residual layer comprises a data-compression layer and a batch-normalization layer; the convolution kernel size of the data-compression layer is 3×3, and the size of the multi-channel data is unchanged through the L residual layers;
- an (L+1)-th feature-adjustment layer, comprising the following two parts:
- a first part, which resizes the feature map and then computes the nodes' initial priorities through a softmax function;
- a second part, which resizes the multi-channel data and then computes the return-information estimate through a tanh function.
Preferably, the gating module adopts a deep parallel computing framework comprising:
L residual-module layers and an (L+1)-th feature-adjustment layer, where the size of the multi-channel game data is unchanged as it passes through the L residual layers, which perform compression operations and batch normalization on the data; the (L+1)-th feature-adjustment layer resizes the multi-channel data and then computes the weight ratio through a sigmoid function.
According to a third aspect of the present invention, there is provided a game artificial intelligence system, wherein the performance is improved by using any one of the above methods; wherein:
when computing the node-priority initial values and the return-information estimate, the adopted deep parallel computing framework has the following structure:
the first layer is an input layer for receiving multi-channel game data; the input layer comprises:
N data channels, where each channel stores the binary stone information of each grid point: empty, black, or white;
a final channel storing the next-player information;
the 2nd layer is a data-compression layer, which convolutionally compresses the input multi-channel game data to obtain the initial multi-channel data;
layers 3 to L are information short-circuit (skip-connection) layers that receive the initial multi-channel data; each such layer produces its feature map through the steps of data compression, batch normalization, data compression, and batch normalization;
from layer L+1, the deep neural network splits into two branches: one branch flattens the two-dimensional feature map into a one-dimensional vector and, after a softmax function, outputs the action probability distribution, forming the priority-computation network; the other branch outputs through a tanh function, forming the return-value computation network;
when computing the weight ratio of the multi-channel game data, the adopted deep parallel computing framework has the following structure:
the first layer is an input layer for receiving multi-channel game data; the input layer comprises:
N data channels, where each channel stores the binary stone information of each grid point: empty, black, or white;
a final channel storing the next-player information;
the 2nd layer is a data-compression layer, which convolutionally compresses the input multi-channel game data to obtain the initial multi-channel data;
layers 3 to L are information short-circuit (skip-connection) layers that receive the initial multi-channel data; each such layer produces its feature map through the steps of data compression, batch normalization, data compression, and batch normalization;
from layer L+1, the deep neural network resizes the multi-channel data and outputs through a sigmoid function, forming the weight-ratio computation network.
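The input layer just described can be sketched as a stack of binary feature planes. The exact channel count N and channel ordering are not specified in the text, so the four planes below (black, white, empty, next player) are a minimal assumed layout for a 19×19 board.

```python
import numpy as np

def encode_board(board, next_player):
    """Encode a 19x19 board (0 = empty, 1 = black, 2 = white) into binary
    feature planes plus a final plane holding the next-player information."""
    planes = np.stack([
        (board == 1).astype(np.float32),                 # black-stone channel
        (board == 2).astype(np.float32),                 # white-stone channel
        (board == 0).astype(np.float32),                 # empty-point channel
        np.full((19, 19), 1.0 if next_player == "black" else 0.0,
                dtype=np.float32),                       # next-player channel
    ])
    return planes                                        # shape (N, 19, 19), N = 4 here

board = np.zeros((19, 19), dtype=np.int64)
board[3, 3] = 1                        # a single black stone
planes = encode_board(board, "black")
```

A real system would typically stack additional history planes, but the binary per-point encoding and the trailing player plane are exactly the structure the input layer describes.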
Compared with the prior art, the invention has the following beneficial effects:

The game artificial intelligence system and its performance-improvement system and method can improve the performance of a game system entirely from existing data, without machine-generated data. Compared with model-free reinforcement learning, although a deep parallel computing framework is introduced as the computation model in the final-priority step, this step greatly improves both performance and stability. In basic situations in particular, the error rate is greatly reduced; because crash situations are greatly reduced, a system improved by the proposed technique clearly outperforms other prior art under the same hardware conditions.

Compared with techniques that use thousands of GPUs on a distributed system, the proposed system and method require only about 10 GPUs and an ordinarily configured CPU; an ordinary researcher or team can run and reproduce the results within 10 days, without distributed cross-machine techniques or a big-data platform, and the adopted deep parallel computing framework is a deep learning framework relying only on the numpy scientific computing library and PyTorch.

The proposed system and method need only about 100,000 human game records as training data, most of which are amateur records, with professional records accounting for no more than 1%; all of the records can be obtained on the Internet.

The resulting Go-playing artificial intelligence system reaches mid-professional human level at Go, and reaches the highest level among existing programs at Hex.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a general architecture and workflow diagram according to one embodiment of the present invention;
FIG. 2 is a diagram of the gate network structure according to an embodiment of the present invention.

In the figure, A is the game-information input, B is a data-compression layer, C is a residual block, D is a data-compression layer, E is a fully connected layer, and F is the output structure.
Detailed Description
The following embodiment describes the invention in detail. The embodiment is implemented on the premise of the technical solution of the invention, and a detailed implementation and a specific operating process are given. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention.
The embodiment of the invention provides a performance improvement method of a game artificial intelligence system, which comprises the following steps:
s0: acquiring a data set (s, π, z) as input data of the game artificial intelligence system, where s is the multi-channel game data, π is the final priority obtained in S2, and z is the binary information returned by the game artificial intelligence system according to the win/loss result when the game ends;
s1: computing, with a deep parallel computing framework, the node-priority initial values and the return-information estimates for the multi-channel game data recorded in the data set;
s2: forming a tree structure from the node-priority initial values computed in S1, generating new data nodes, filling in the node-priority initial values computed by the cognitive module as data-node information, filling the return-information estimates obtained in S1 into the data nodes, back-propagating the data-node information, updating the information stored at each data node, and outputting the posterior action prediction, i.e., the final priority;
s3: computing, with a deep parallel computing framework, the weight ratio of the multi-channel game data recorded in the data set;
s4: combining the priority recorded in the data set with the final priority obtained in S2, using the weight ratio obtained in S3, to obtain the weighted combination of the two priorities.
Further:
in S1, the deep parallel computing framework comprises L residual-module layers and an (L+1)-th feature-adjustment layer, where the size of the multi-channel game data is unchanged as it passes through the L residual layers, which perform compression operations and batch normalization on the data; the (L+1)-th feature-adjustment layer comprises the following two parts:
- a first part, which resizes the feature map and then computes the output nodes' initial priorities through a softmax function;
- a second part, which resizes the multi-channel data and then computes the output return-information estimate through a tanh function;
wherein:
the node-priority initial value is the initial priority of the nodes output by the first part of the (L+1)-th feature-adjustment layer, a 362-dimensional array, used as the initial value in S2;
the return-information estimate is the return information output by the second part of the (L+1)-th feature-adjustment layer, an approximate estimate of the binary result returned by the game artificial intelligence system.
Here, L in the deep parallel computing framework may preferably be 19, i.e., the framework comprises 19 residual-module layers and a 20th feature-adjustment layer, where the size of the multi-channel game data is unchanged as it passes through the 19 residual layers, which perform compression operations and batch normalization on the data; the 20th feature-adjustment layer comprises the following two parts:
- a first part, which resizes the feature map and then computes the output nodes' initial priorities through a softmax function;
- a second part, which resizes the multi-channel data and then computes the output return-information estimate through a tanh function;
wherein:
the node-priority initial value is the initial priority of the nodes output by the first part of the 20th feature-adjustment layer, a 362-dimensional array, used as the initial value in S2;
the return-information estimate is the return information output by the second part of the 20th feature-adjustment layer, an approximate estimate of the binary result returned by the game artificial intelligence system.
The S1 further comprises the following steps: the depth-parallel computing framework is trained based on a dataset; wherein:
the depth parallel computing framework updates the parameters of the framework by defining an update mechanism, wherein the update mechanism is as follows:
L=-π T logp+(z-v) 2 +c||θ|| 2
in the formula, the first term-pi T logp is a cross entropy mechanism used for calculating the difference between the initial value of the priority of the node output by the framework and the priority of the record of the data set, pi is the final priority obtained in S2, and p is the node given by the depth parallel computing frameworkAn initial value of point priority; second item (z-v) 2 The method is used for calculating the difference between a return information approximate value output by a framework and binary information returned according to game win and loss, z is the binary information returned by a system according to win and loss when a game is ended, and v is the return information approximate value given by a depth parallel computing framework; the third term c | | theta | | non-woven 2 The method is characterized by comprising the following steps that (1) an L2 regular term is used for reducing the scale of a frame, theta is all parameters of a depth parallel computing frame, and c is a coefficient used for controlling the L2 regular term;
the method adopted by the updating mechanism is gradient descent:
θ ← θ - η · ∇_θ L
wherein η is the update rate, used to control the magnitude of each framework update, and ∇_θ L is the gradient information fed back to the depth parallel computing framework after the update mechanism is computed; it represents the direction in which the framework needs to be updated.
In S2, a connection is established between data nodes of the tree structure, where each data node is used to store the following information:
-a node priority initial value, representing the priority of selecting the data node, calculated by S1;
-a number of accesses, representing the number of times the data node has been accessed;
-average result information obtained from the running average of the approximate values of the return information calculated in S1;
the following 4 steps were repeated:
-selecting: the tree simulation adopts the best-first principle, i.e., at each layer the child data node with a high node initial value, a low access count and a high action value is accessed; the terminal node finally reached is the selected data node;
-unfolding: initializing all legal nodes under the end node according to the calculation in the S1, initializing the initial value of the node to the initial value of the priority of the node calculated in the S1, and initializing the access times and the approximate value of average return information to 0;
-evaluating: obtaining a return information approximate value v of the terminal node in the S1;
-backpropagation: updating data node information layer by layer upward until the initial data node; specifically, the access count is increased by 1, the evaluation v is added to the accumulated return information, and the average is re-taken;
After the above steps have been repeated a number of times, the final priority of each action is calculated by dividing the access count of each child data node by the sum of the access counts of all child data nodes.
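The four repeated steps and the final visit-count statistics can be sketched as follows. This is a minimal illustrative tree simulation: the class and function names, the exploration constant `c_puct` and the exact form of the selection score are assumptions, not the patent's exact procedure.

```python
import math

class Node:
    """A data node storing the three fields described above."""
    def __init__(self, prior):
        self.prior = prior      # node priority initial value (from S1)
        self.visits = 0         # number of accesses
        self.value_avg = 0.0    # running average of return approximations
        self.children = {}      # action -> Node

def select_child(parent, c_puct=1.0):
    # best-first: high initial value, low access count, high action value
    def score(child):
        explore = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
        return child.value_avg + explore
    return max(parent.children.items(), key=lambda kv: score(kv[1]))

def backup(path, v):
    # propagate v up to the root: visits += 1, value kept as a running mean
    for node in reversed(path):
        node.visits += 1
        node.value_avg += (v - node.value_avg) / node.visits

def simulate(root, priors, evaluate, n_sims):
    for _ in range(n_sims):
        node, path = root, [root]
        while node.children:                      # selection
            _, node = select_child(node)
            path.append(node)
        if node is root or node.visits > 0:       # expansion
            for action, p in priors(node).items():
                node.children[action] = Node(p)
        backup(path, evaluate(node))              # evaluation + backpropagation
    total = sum(c.visits for c in root.children.values())
    # final priority: each child's visit count over the sum of all visit counts
    return {a: c.visits / total for a, c in root.children.items()}
```

With more simulations, the visit counts (and hence the final priorities) concentrate on the actions the framework's priors and values favor.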
In S3, the depth parallel computing framework comprises L residual modules and a layer-(L+1) feature adjustment module. The residual modules perform compression and batch normalization on the multi-channel game data without changing its size; the layer-(L+1) feature adjustment module resizes the multi-channel data and then calculates the weight ratio through a sigmoid function.
Here, L may preferably be 19, i.e., the framework comprises 19 residual modules followed by a layer-20 feature adjustment module with the behavior just described.
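The size-preserving behavior of a residual module (a compression convolution, batch-norm-style normalization and a skip connection) can be sketched in NumPy. This is a toy single-channel sketch under assumed details (zero padding, per-map normalization), not the framework's actual layers:

```python
import numpy as np

def conv3x3_same(x, kernel):
    """3x3 convolution with zero padding, so the spatial size is unchanged."""
    padded = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def residual_module(x, kernel, eps=1e-5):
    """One residual module: compression (conv) + batch-norm-style
    normalization + skip connection; input and output sizes match."""
    y = conv3x3_same(x, kernel)
    y = (y - y.mean()) / np.sqrt(y.var() + eps)
    return x + y
```

Because the convolution pads to the same size and the skip connection adds the input back, stacking any number of such modules leaves the feature map size unchanged, as the text states.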
In S4, the method for calculating the weighted combination priority of the data set record and the final priority obtained in S2 includes:
first, calculate the following two terms:
the weight coefficient output by the depth parallel computing framework multiplied by the priority recorded in the dataset;
one minus the weight coefficient output by the depth parallel computing framework, multiplied by the final priority obtained in S2;
then, add the two results to obtain the weighted combination priority.
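The calculation above is a convex combination of the two priority vectors; a minimal sketch (function name and interface assumed for illustration):

```python
def weighted_combination(w, pi_dataset, pi_mcts):
    """w: gate weight in (0, 1) output by the framework.
    Returns w * dataset-record priority + (1 - w) * final priority from S2,
    element by element."""
    return [w * a + (1 - w) * b for a, b in zip(pi_dataset, pi_mcts)]
```

When w is near 1 the combined priority follows the dataset records; when w is near 0 it follows the tree-simulation result.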
Based on the method provided by the above embodiment of the present invention, an embodiment of the present invention also provides a performance enhancing system for a game artificial intelligence system for executing the above method, including:
a cognitive module: for input multi-channel game data, calculating a node priority initial value and a return information approximate value by using a depth parallel computing frame;
a forward simulation module: forming a tree structure by using the initial value of the node priority calculated by the cognitive module, generating a new data node, filling the initial value of the node calculated by the cognitive module as data node information, filling the approximate value of the returned information in the step S1 into the data node, returning the data node information, updating the information stored by each data node, and outputting the final priority;
a gate control module: calculating the weight proportion of the multi-channel game data by using a depth parallel computing framework for the input multi-channel game data;
a combined calculation module: and calculating the priority recorded by the data set and the final priority obtained in the forward simulation module by combining the weight proportion obtained by the gating module to obtain a weighted combination of the two priorities.
Further:
in the cognitive module, a depth parallel computing framework comprises:
l layer residual module: each residual module comprises a data compression layer and a batch normalization layer; the convolution kernel size of the data compression layer is 3 × 3, and the size of the multi-channel data is not changed in passing through the L residual modules;
the L +1 layer characteristic adjusting module: the method comprises the following two parts:
a first part, which calculates the initial priority of the node through a softmax function after the characteristic diagram is adjusted in size;
-a second part for calculating an estimate of the returned information via a tanh function after resizing the multi-channel data.
Here, L in the depth parallel computing framework may be preferably 19, that is, the depth parallel computing framework includes a 19-layer residual error module and a 20 th-layer feature adjusting module, wherein the size of process data of the multi-channel game data passing through the 1-19-layer residual error module is not changed, and the process data is used for performing compression operation and batch normalization processing on the data; the layer 20 characteristic adjusting module comprises the following two parts:
a first part, calculating the initial priority of the output node through a softmax function after the characteristic diagram is adjusted in size;
a second part, calculating and outputting the estimation of the return information through a tanh function after adjusting the size of the multichannel data.
In the gate control module, the adopted depth parallel computing framework comprises:
the system comprises L residual modules and a layer-(L+1) feature adjustment module; the residual modules perform compression and batch normalization on the multi-channel game data without changing its size, and the layer-(L+1) feature adjustment module resizes the multi-channel data and then calculates the weight proportion through a sigmoid function.
Here, L in the depth parallel computing framework may be preferably 19, that is, the depth parallel computing framework includes a 19-layer residual error module and a 20 th-layer feature adjusting module, wherein the size of process data of the multi-channel game data passing through the 1-19-layer residual error module is not changed, and the process data is used for performing compression operation and batch normalization processing on the data; and the 20 th layer feature adjusting module calculates the weight proportion through a sigmoid function after adjusting the size of the multichannel data.
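The three output heads described above differ only in their final activation: softmax for the initial priorities, tanh for the return estimate, and sigmoid for the weight ratio. A minimal plain-Python sketch on flattened features (weights, shapes and names are illustrative assumptions):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def policy_head(flat_features, weights, biases):
    """First part: resized (flattened) features -> softmax -> initial priorities."""
    logits = [sum(w * f for w, f in zip(row, flat_features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

def value_head(flat_features, weights, bias):
    """Second part: scalar return estimate squashed into (-1, 1) by tanh."""
    return math.tanh(sum(w * f for w, f in zip(weights, flat_features)) + bias)

def gate_head(flat_features, weights, bias):
    """Gating variant: sigmoid turns the same kind of projection into a
    weight ratio in (0, 1)."""
    s = sum(w * f for w, f in zip(weights, flat_features)) + bias
    return 1.0 / (1.0 + math.exp(-s))
```

The softmax output sums to 1 (a distribution over actions), the tanh output lies strictly between -1 and 1 (matching the binary return), and the sigmoid output lies strictly between 0 and 1 (a usable combination weight).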
As a specific application of the method provided by the embodiment of the present invention, the embodiment of the present invention further provides an artificial intelligence system for a game, wherein the system adopts any one of the above methods to improve performance; wherein:
when the initial value of the priority of the node and the approximate value of the return information are calculated, the adopted depth parallel calculation framework comprises the following structures:
the first layer is an input layer for receiving multi-channel game data; the input layer comprises:
n data channels, wherein each channel stores the binary piece information of each board square: empty, black, white;
a terminal channel for storing the information of the player to move next;
the 2 nd layer is a data compression layer, and convolution compression is carried out on the input multi-channel game data to obtain initial multi-channel data;
layers 3 to L are information shortcut layers that receive the initial multi-channel data; each shortcut layer obtains its feature map through the steps of data compression, batch normalization, data compression and batch normalization;
from layer L+1, the deep neural network splits into two branches: one branch first flattens the two-dimensional feature map into a one-dimensional vector and, after a softmax function, outputs an action probability distribution, forming the priority calculation network; the other branch outputs through a tanh function, forming the return value calculation network;
when the weight proportion of the multi-channel game data is calculated, the adopted depth parallel computing framework comprises the following structures:
the first layer is an input layer and is used for receiving multi-channel game data; the input layer includes:
n data channels, wherein each channel stores the binary piece information of each board square: empty, black, white;
a terminal channel for storing the information of the player to move next;
the 2 nd layer is a data compression layer, and convolution compression is carried out on the input multi-channel game data to obtain initial multi-channel data;
layers 3 to L are information shortcut layers that receive the initial multi-channel data; each shortcut layer obtains its feature map through the steps of data compression, batch normalization, data compression and batch normalization;
and from layer L+1, the deep neural network resizes the multi-channel data and outputs it through a sigmoid function, forming the weight ratio calculation network.
The performance improving method and the performance improving system provided by the embodiment of the invention do not need to add any human domain knowledge, and the used parallel computing framework is end-to-end.
To improve the final effect, new data is continually input and replaces the old data. At each training step, a small random minibatch is drawn and fed into the depth parallel computing framework to update its parameters. This has the following advantages: the correlation among samples is reduced, so updates of the depth parallel computing framework are more efficient; in addition, the model can learn both from current experience and from past experience.
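This data-handling scheme is essentially a fixed-capacity replay buffer; a minimal sketch (the class name and interface are assumptions for illustration):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store: newly generated records push out the oldest,
    and each training step draws a small random minibatch, reducing the
    correlation between consecutive samples."""
    def __init__(self, capacity):
        self.records = deque(maxlen=capacity)

    def add(self, record):
        # appending beyond capacity silently evicts the oldest record
        self.records.append(record)

    def sample(self, batch_size):
        k = min(batch_size, len(self.records))
        return random.sample(list(self.records), k)
```

Because sampling is uniform over the whole buffer, a minibatch mixes fresh and older experience, which is exactly the property the paragraph above attributes to the method.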
In the embodiment of the invention:
in the tree structure, a most urgent node is selected first in the tree, and an "optimal priority" principle is adopted, that is, a node with an optimal evaluation value is selected in each layer. Here, the estimation is composed of two terms, one is the average motion value estimation, one is the number of visits preceded by an exploratory term coefficient. When the leaf node is reached, the leaf is the most urgent node selected. This child node needs to be expanded at this time, and in order to be more efficient, the priority provided by the deep parallel computing framework needs to be used.
In the weight ratio calculation process, a depth parallel computing framework is adopted: the multi-channel game data and the final game result are input; the data passes through 19 residual modules without changing the feature map size, each residual module comprising a data compression operation with 3 × 3 convolution kernels and batch normalization, and at layer 20 the feature map is resized and the weight ratio is calculated through a sigmoid function.
In the weighted combination calculation process: the weight output by the depth parallel computing framework is multiplied by the human move policy, one minus the depth gate network output is multiplied by the Monte Carlo search tree move policy, and the two are added to obtain the combined move policy.
The following describes the related technical solutions in further detail by taking an artificial intelligence system of a go game as an example, in conjunction with the accompanying drawings and the performance improvement method and system provided by the embodiments of the present invention.
As shown in fig. 1, which gives the overall architecture and workflow of the game AI system, the performance improvement method performs stepwise iterative enhancement: the data records and the final priority information are combined by a weighted linear combination, and the combined result is used to update the parameters of the depth parallel computing framework. In this process the priority information becomes more accurate, and the performance of the program improves accordingly.
The technical scheme provided by the embodiment of the invention can use the supervision information of game wins and losses to distinguish good moves from bad ones and realize a complementary combination of the human move policy and the machine move policy. In addition, the scheme only requires a certain number of multi-channel data records, from which both the priority of the dataset records and the final priority computed by the depth parallel computing framework can be learned.
Specifically, as shown in fig. 1, the method includes the steps of:
s1: calculating the initial value of the priority of the nodes and the approximate value of the returned information of the multi-channel game data recorded by the data set by using a depth parallel computing frame;
s2: forming a tree structure by using the initial value of the node priority obtained by the calculation of the S1, generating a new data node, filling the initial value of the node priority obtained by the calculation of the cognitive module as data node information, filling the approximate value of the return information obtained in the S1 into the data node, returning the data node information, updating the information stored by each data node, and outputting an action posterior prediction result, namely the final priority;
s3: calculating the weight proportion of the multi-channel game data recorded in the dataset using a depth parallel computing framework; this step relies on the expanded priority information (a multidimensional vector) produced by the cognitive system's calculation;
s4: and (4) calculating the priority of the data set record and the final priority obtained in the S2 by combining the weight proportion obtained in the S3 to obtain a linear weighted combination of the two priorities.
Specific implementations of the above steps and modules are described in detail below to facilitate understanding of the technical solutions of the present invention.
In some embodiments of the present invention, in order to obtain the final priority from the human data, in step S2, each calculation in the embodiments of the present invention includes the following 4 steps:
(1) Selection. The forward simulation adopts the best-first principle: at each layer the subsequent node with high priority, low access count and high action value is selected, until the selected terminal node is reached;
(2) Expansion. All legal nodes under the terminal node are initialized according to the depth parallel computing framework's calculation: the priority initial value is set to the priority given by the framework, and the access count and average return result are initialized to 0;
(3) Evaluation. The return value estimate v of the leaf node is obtained from the depth parallel computing framework;
(4) Backup. Node information is updated upward until the root node; specifically, the access count is increased by 1 and the average action value is updated as a running mean.
Each node in the tree maintains the following information:
the initial priority, which represents the priority degree of the node to be selected, is calculated by the step S1;
the number of visits represents the number of times the node is traversed;
and the average result information is obtained by accumulating and averaging the return values output by the depth parallel computing framework.
After the plurality of calculations (e.g., 800) are completed, the final priority of the next step is calculated based on the statistics of the data nodes.
To achieve the above effect, the combination weight coefficient is output by a depth parallel computing framework whose structure is shown in fig. 2. The input features comprise 18 binary channels. The first 16 channels are board information, representing the piece (black, white or empty) on each square. The 17th channel represents the player to move next: all 1s if the next move is black's, otherwise all 0s. The last channel is the final result of the board game, which also implies the accuracy of the current priority information: 1 represents a final win and -1 a final loss.
The training sample format received by the depth parallel computing frame used by the system is (s, pi, z), namely (multichannel game data, priority information and return value) triple, wherein s is the multichannel game data, pi is the combined priority and z is the return value. The multi-channel game data tensor shape is 19 × 19 × 17, the priority vector shape is 362 × 1, and the game result is scalar-1 or 1.
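Assembling one (s, π, z) triple with the shapes stated above can be sketched in NumPy; the exact layout of the sixteen board planes is an assumption here, and the helper name is illustrative:

```python
import numpy as np

def make_training_example(board_planes, next_is_black, pi, z):
    """Assemble one (s, pi, z) triple with the stated shapes.
    board_planes: sixteen 19x19 binary planes of board information
    (their exact layout is assumed); the 17th channel is all ones if
    black moves next, else all zeros. pi is the 362-entry priority
    vector and z the scalar game result (-1 or 1)."""
    player = np.full((19, 19), 1.0 if next_is_black else 0.0)
    s = np.stack(list(board_planes) + [player], axis=-1)  # 19 x 19 x 17
    assert s.shape == (19, 19, 17)
    assert pi.shape == (362,) and z in (-1, 1)
    return s, pi, z
```

The 362-entry priority vector corresponds to the 361 board intersections plus one extra action (matching the 362-dimensional array mentioned earlier).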
The performance improvement method and system provided by the above embodiments of the present invention improve the performance of a game artificial intelligence system on the basis of a combined system and a depth parallel computing framework, and can be used for training a game AI model. The method includes: for input multi-channel game data, calculating node initial values and return information estimates using the depth parallel computing framework; forming a tree structure from the node initial values, generating new data nodes, filling the node initial values in as data node information, filling the return information approximate values into the data nodes, passing the data node information back, updating the information stored by each data node, and outputting the action posterior prediction result, i.e., the final priority; calculating the weight proportion of the input multi-channel game data using the depth parallel computing framework; and, using the weight proportion obtained from the depth gate network model, calculating the weighted combination priority of the machine result and the fixed data records. To address the existing technology's dependence on large amounts of data and heavy hardware consumption, the method introduces a combined calculation system on top of general supervised learning, so that the relative strengths of the machine calculation result and the dataset recording result can be extracted automatically from the data. The method effectively improves system performance under limited data and, owing to its relatively low power consumption, is accessible to ordinary enterprises and teams.
The technical features of the above examples can be arbitrarily combined, and for the sake of simplicity of description, all possible combinations of the technical features of the above embodiments are not described, however, the combination of the technical features should be considered as the scope of the present description as long as there is no contradiction.
The foregoing description has described specific embodiments of the present invention. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by those skilled in the art within the scope of the appended claims without departing from the spirit of the invention; such changes and modifications all fall within the scope of the invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (8)

1. A performance improvement method of a game artificial intelligence system is characterized by comprising the following steps:
s0: acquiring a dataset D = {(s, π, z)} as input data for the game artificial intelligence system, wherein s is the multi-channel game data, π is the final priority obtained in S2, and z is the binary information returned by the game artificial intelligence system according to the win/loss when the game ends;
s1: calculating the initial value of the priority of the nodes and the approximate value of the returned information of the multi-channel game data recorded by the data set by using a depth parallel computing frame;
s2: forming a tree structure by using the initial value of the node priority obtained by the calculation of the S1, generating a new data node, filling the initial value of the node priority obtained by the calculation of the cognitive module as data node information, filling the approximate value of the return information obtained in the S1 into the data node, returning the data node information, updating the information stored by each data node, and outputting an action posterior prediction result, namely the final priority;
s3: calculating the weight proportion of the multi-channel game data recorded by the data set by using a depth parallel computing framework;
s4: calculating the priority recorded by the data set and the final priority obtained in the S2 to obtain a weighted combination of the two priorities by combining the weight proportion obtained in the S3;
in S1, the depth parallel computing framework comprises L residual modules and a layer-(L+1) feature adjustment module; the residual modules perform compression and batch normalization on the multi-channel game data without changing its size; the layer-(L+1) feature adjustment module comprises the following two parts:
a first part, calculating the initial priority of the output node through a softmax function after the characteristic diagram is adjusted in size;
-a second part for computing an estimate of the output return information via a tanh function after resizing the multichannel data;
wherein:
the node priority initial value is the initial node priority output, as computed above, by the first part of the layer-(L+1) feature adjustment module; it is a 362-dimensional array and serves as the initial value in S2;
the return information approximate value is the estimate of the return information output by the second part of the layer-(L+1) feature adjustment module; it is an approximate estimate of the binary result returned by the game artificial intelligence system;
in S3, the depth parallel computing framework comprises L residual modules and a layer-(L+1) feature adjustment module; the residual modules perform compression and batch normalization on the multi-channel game data without changing its size, and the layer-(L+1) feature adjustment module resizes the multi-channel data and then calculates the weight proportion through a sigmoid function.
2. The method for improving performance of a game artificial intelligence system of claim 1, wherein the step S1 further comprises the steps of: the depth-parallel computing framework is trained based on a dataset; wherein:
the depth parallel computing framework updates its parameters through a defined update mechanism, wherein the update mechanism is:
L = -π^T · log p + (z - v)^2 + c · ||θ||^2
in the formula, the first term -π^T · log p is a cross-entropy term used to calculate the difference between the node priority initial value output by the framework and the priority recorded in the dataset; π is the final priority obtained in S2 and p is the node priority initial value given by the depth parallel computing framework; the second term (z - v)^2 is used to calculate the difference between the return information approximate value output by the framework and the binary information returned according to the game result; z is the binary information returned by the system according to the win/loss when a game ends and v is the return information approximate value given by the depth parallel computing framework; the third term c · ||θ||^2 is an L2 regularization term for reducing the scale of the framework; θ denotes all parameters of the depth parallel computing framework and c is a coefficient controlling the L2 regular term;
the method adopted by the update mechanism is:
θ ← θ - η · ∇_θ L
wherein η is the update rate, which controls the magnitude of the framework update, and ∇_θ L is the gradient information fed back to the depth parallel computing framework after the update mechanism is computed, representing the direction in which the framework needs to be updated.
3. The method for improving performance of a game artificial intelligence system of claim 1, wherein in S2, a connection is established between data nodes in a tree structure, wherein each data node is used for storing the following information:
-a node priority initial value, representing the priority of selecting the data node, calculated by S1;
-a number of accesses, representing the number of times the data node has been accessed;
-average result information obtained from the running average of the approximate values of the return information calculated in S1;
the following 4 steps were repeated:
-selecting: the tree simulation adopts the best-first principle, i.e., at each layer the child data node with a high node initial value, a low access count and a high action value is accessed; the terminal node finally reached is the selected data node;
-unfolding: initializing all legal nodes under the end node according to the calculation in the S1, initializing the initial value of the node to the initial value of the priority of the node calculated in the S1, and initializing the access times and the approximate value of average return information to 0;
-evaluating: obtaining a return information approximate value v of the terminal node in the S1;
-backpropagation: updating data node information layer by layer upward until the initial data node; specifically, the access count is increased by 1, the evaluation v is added to the accumulated return information, and the average is re-taken; v is the return information approximate value given by the depth parallel computing framework;
after the above steps have been repeated a number of times, the final priority of each action is calculated by dividing the access count of each child data node by the sum of the access counts of all child data nodes.
4. The method for improving performance of a game artificial intelligence system of claim 1, wherein in S4, the method for calculating the weighted combination priority of the data set record and the final priority obtained in S2 comprises:
first, calculate the following two terms:
the weight coefficient output by the depth parallel computing framework multiplied by the priority recorded in the dataset;
one minus the weight coefficient output by the depth parallel computing framework, multiplied by the final priority obtained in S2;
then, add the two results to obtain the weighted combination priority.
5. A performance enhancement system for a gaming artificial intelligence system that performs the method of any of claims 1-4, comprising:
a cognitive module: for input multi-channel game data, calculating a node priority initial value and a return information approximate value by using a depth parallel computing frame;
a forward simulation module: forming a tree structure by using the initial value of the node priority calculated by the cognitive module, generating a new data node, filling the initial value of the node calculated by the cognitive module as data node information, filling the approximate value of the returned information in the step S1 into the data node, returning the data node information, updating the information stored by each data node, and outputting the final priority;
a gate control module: for input multi-channel game data, calculating the weight proportion of the multi-channel game data by using a depth parallel computing framework;
a combined calculation module: and calculating the priority recorded by the data set and the final priority obtained in the forward simulation module by combining the weight proportion obtained by the gating module to obtain the weighted combination of the two priorities.
6. The system of claim 5, wherein the cognitive module employs a depth-parallel computing framework comprising:
l layer residual module: each residual module comprises a data compression layer and a batch normalization layer; the convolution kernel size of the data compression layer is 3 × 3, and the size of the multi-channel data is not changed in passing through the L residual modules;
the L +1 layer characteristic adjusting module: the method comprises the following two parts:
a first part, which calculates the initial priority of the node through a softmax function after the characteristic diagram is adjusted in size;
-a second part for calculating an estimate of the returned information via a tanh function after resizing the multi-channel data.
7. The system of claim 5, wherein the gating module employs a depth-parallel computing framework comprising:
the system comprises L residual modules and a layer-(L+1) feature adjustment module; the residual modules perform compression and batch normalization on the multi-channel game data without changing its size, and the layer-(L+1) feature adjustment module resizes the multi-channel data and then calculates the weight proportion through a sigmoid function.
8. A game artificial intelligence system, wherein performance improvement is carried out using the method of any one of claims 1 to 4; wherein:
when the initial node-priority values and the approximate return information are computed, the deep parallel computing framework adopted comprises the following structure:
the first layer is an input layer for receiving multi-channel game data; the input layer comprises:
N data channels, each storing the piece information of every board square: empty, black, or white;
a last channel storing the information of the player to move next;
the second layer is a data compression layer that convolutionally compresses the input multi-channel game data into initial multi-channel data;
layers 3 through L are information short-circuit layers that receive the initial multi-channel data; each such layer obtains its feature map through the steps of data compression, batch normalization, data compression, and batch normalization;
from layer L+1 the deep neural network splits into two branches: one branch first flattens the two-dimensional feature map into a one-dimensional vector and outputs an action probability distribution through a softmax function, forming the priority calculation network; the other branch produces its output through a tanh function, forming the return-value calculation network;
when the weight proportion of the multi-channel game data is computed, the deep parallel computing framework adopted comprises the following structure:
the first layer is an input layer for receiving multi-channel game data; the input layer comprises:
N data channels, each storing the piece information of every board square: empty, black, or white;
a last channel storing the information of the player to move next;
the second layer is a data compression layer that convolutionally compresses the input multi-channel game data into initial multi-channel data;
layers 3 through L are information short-circuit layers that receive the initial multi-channel data; each such layer obtains its feature map through the steps of data compression, batch normalization, data compression, and batch normalization;
from layer L+1, the deep neural network resizes the multi-channel data and produces its output through a sigmoid function, forming the weight-proportion calculation network.
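The size-preserving behaviour claimed for the information short-circuit layers follows from 3×3 "same"-padded convolutions combined with an identity shortcut. A naive single-channel sketch (batch normalization omitted for brevity; the ReLU placement is an assumption, since the claims do not name an activation):

```python
import numpy as np

def conv3x3_same(x, kernel):
    """Naive 3x3 convolution with zero 'same' padding on a 2-D array,
    so the output has the same height and width as the input."""
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def skip_layer(x, k1, k2):
    """Information short-circuit layer: two data-compression steps (batch
    normalization omitted) plus the identity shortcut x + f(x); the
    board-sized feature map passes through with its shape unchanged."""
    h = np.maximum(conv3x3_same(x, k1), 0.0)
    h = conv3x3_same(h, k2)
    return np.maximum(x + h, 0.0)
```

Because every convolution is "same"-padded, an N×N board stays N×N through all L residual layers, which is what lets the shortcut addition `x + f(x)` be well defined.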
CN201911389843.4A 2019-12-30 2019-12-30 Game artificial intelligence system and performance improving system and method thereof Active CN111178541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911389843.4A CN111178541B (en) 2019-12-30 2019-12-30 Game artificial intelligence system and performance improving system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911389843.4A CN111178541B (en) 2019-12-30 2019-12-30 Game artificial intelligence system and performance improving system and method thereof

Publications (2)

Publication Number Publication Date
CN111178541A CN111178541A (en) 2020-05-19
CN111178541B true CN111178541B (en) 2023-04-18

Family

ID=70657554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911389843.4A Active CN111178541B (en) 2019-12-30 2019-12-30 Game artificial intelligence system and performance improving system and method thereof

Country Status (1)

Country Link
CN (1) CN111178541B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677923A (en) * 2016-03-24 2016-06-15 安徽大学 Game searching method of Einstein chess based on attack and defense evaluation function
CN109657802A (en) * 2019-01-28 2019-04-19 清华大学深圳研究生院 A kind of Mixture of expert intensified learning method and system
CN109818786A (en) * 2019-01-20 2019-05-28 北京工业大学 A kind of cloud data center applies the more optimal choosing methods in combination of resources path of appreciable distribution

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677923A (en) * 2016-03-24 2016-06-15 安徽大学 Game searching method of Einstein chess based on attack and defense evaluation function
CN109818786A (en) * 2019-01-20 2019-05-28 北京工业大学 A kind of cloud data center applies the more optimal choosing methods in combination of resources path of appreciable distribution
CN109657802A (en) * 2019-01-28 2019-04-19 清华大学深圳研究生院 A kind of Mixture of expert intensified learning method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Ahmad Aljaafreh. Development of a Computer Player for Seejeh (A.K.A Seega, Siga, Kharbga) Board Game with Deep Reinforcement Learning. Procedia Computer Science. 2019, vol. 160, pp. 241-247. *
David Silver et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. Machine Learning. 2017, full text. *
Yuze Guo et al. Regularize Network Skip Connections by Gating Mechanisms for Electron Microscopy Image Segmentation. 2019 IEEE International Conference on Multimedia and Expo (ICME). 2019, full text. *
Zhai Jianwei. Research on Algorithms and Models Based on Deep Q-Networks. China Masters' Theses Full-text Database (Information Science and Technology). 2018, no. 4, full text. *

Also Published As

Publication number Publication date
CN111178541A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
Lanctot et al. A unified game-theoretic approach to multiagent reinforcement learning
CN111282267B (en) Information processing method, information processing apparatus, information processing medium, and electronic device
CN112052948B (en) Network model compression method and device, storage medium and electronic equipment
CN109063823B (en) Batch A3C reinforcement learning method for exploring 3D maze by intelligent agent
Graesser et al. The state of sparse training in deep reinforcement learning
Fu et al. Actor-critic policy optimization in a large-scale imperfect-information game
JP6908302B2 (en) Learning device, identification device and program
CN115659281A (en) Method and device for fusing self-adaptive acceleration operators
CN113435606A (en) Method and device for optimizing reinforcement learning model, storage medium and electronic equipment
CN111282272B (en) Information processing method, computer readable medium and electronic device
Sarigül et al. Performance comparison of different momentum techniques on deep reinforcement learning
CN111667043B (en) Chess game playing method, system, terminal and storage medium
Rao et al. Distributed deep reinforcement learning using tensorflow
Zhao et al. Handling large-scale action space in deep Q network
Wang et al. Balanced training for sparse gans
CN110772794B (en) Intelligent game processing method, device, equipment and storage medium
CN115033878A (en) Rapid self-game reinforcement learning method and device, computer equipment and storage medium
Li et al. Global Opposition Learning and Diversity ENhancement based Differential Evolution with exponential crossover for numerical optimization
CN117709393A (en) Short-term power load prediction method
CN111178541B (en) Game artificial intelligence system and performance improving system and method thereof
Khamesian et al. Hybrid self-attention NEAT: a novel evolutionary self-attention approach to improve the NEAT algorithm in high dimensional inputs
CN111617479B (en) Acceleration method and system of game artificial intelligence system
CN113779870B (en) Parallelization imperfect information game strategy generation method, parallelization imperfect information game strategy generation device, electronic equipment and storage medium
WO2022127603A1 (en) Model processing method and related device
CN112836805B (en) KRFPV algorithm, execution device, electronic device, storage medium, and neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant