CN112016704B - AI model training method, model using method, computer device and storage medium - Google Patents

AI model training method, model using method, computer device and storage medium

Info

Publication number
CN112016704B
CN112016704B (application CN202011188747.6A)
Authority
CN
China
Prior art keywords
model
playing
loss value
trained
playing data
Legal status
Active
Application number
CN202011188747.6A
Other languages
Chinese (zh)
Other versions
CN112016704A (en)
Inventor
周正
季兴
李宏亮
张正生
刘永升
Current Assignee
Super Parameter Technology Shenzhen Co ltd
Original Assignee
Super Parameter Technology Shenzhen Co ltd
Application filed by Super Parameter Technology Shenzhen Co ltd filed Critical Super Parameter Technology Shenzhen Co ltd
Priority to CN202011188747.6A
Publication of CN112016704A
Application granted
Publication of CN112016704B
Legal status: Active
Anticipated expiration

Classifications

    • G06N20/00 Machine learning
    • G06F18/2415 Classification techniques based on parametric or probabilistic models
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods

Abstract

The application relates to the field of artificial intelligence, and particularly discloses an AI model training method, a model using method, a computer device and a storage medium, wherein the method comprises the following steps: obtaining a plurality of sample generation models, and playing games among the sample generation models to obtain first playing data; acquiring second playing data, and training a model to be trained according to the second playing data and the first playing data, wherein the second playing data are real playing data; when the model to be trained converges, taking the model to be trained as a model to be evaluated and playing it against a comparison model multiple times to obtain a playing result; and when the playing result reaches a preset index, determining the model to be evaluated as the AI model, and finishing AI model training. The human-likeness of the trained reinforcement learning model is thereby improved.

Description

AI model training method, model using method, computer device and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an AI model training method, a model using method, a computer device, and a storage medium.
Background
With the rapid development of artificial intelligence (AI) technology, in the field of game entertainment, AI can be used to play against real players and can even defeat top professional players. For the sake of prediction accuracy and competitive level, AI models are currently trained mainly by deep reinforcement learning. However, because a reinforcement learning model only considers the final win or loss, the trained model plays rather stiffly, and the user experience of the trained AI model is poor.
Disclosure of Invention
The application provides an AI model training method, a model using method, a computer device and a storage medium, so as to improve the human-likeness of the trained reinforcement learning model.
In a first aspect, the present application provides an AI model training method, including:
obtaining a plurality of sample generation models, and playing games among the sample generation models to obtain first playing data; acquiring second playing data, and training a model to be trained according to the second playing data and the first playing data, wherein the second playing data are real playing data; when the model to be trained converges, taking the model to be trained as a model to be evaluated, playing it against a comparison model multiple times, and obtaining a playing result; and when the playing result reaches a preset index, determining the model to be evaluated as the AI model, and finishing AI model training.
In a second aspect, the present application also provides a model using method, including:
acquiring current playing data, and performing feature extraction on the current playing data to obtain current class image features and current vector features; inputting the current class image features and the current vector features into an AI model to obtain a predicted master strategy label and a predicted slave strategy label, wherein the AI model is obtained by training according to the AI model training method of any one of claims 1-8; and determining a corresponding predicted action according to the predicted master strategy label and the predicted slave strategy label, and outputting the predicted action to play against the real user.
In a third aspect, the present application further provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the AI model training method and the model using method as described above when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the AI model training method and the model using method as described above.
The application discloses an AI model training method, a model using method, a computer device and a storage medium, wherein a plurality of sample generation models are obtained and then play games among themselves to obtain first playing data; second playing data are acquired, and the model to be trained is trained according to the first playing data and the second playing data; when the model to be trained converges, it is taken as a model to be evaluated and plays against a comparison model multiple times to obtain a playing result; and finally, when the playing result reaches a preset index, the model to be evaluated is determined to be the AI model, and AI model training is finished. Real second playing data are added when the model to be trained is trained, so that the human-likeness of the obtained model is improved. The model is evaluated after training, and the final AI model is determined accordingly, thereby ensuring the prediction accuracy and competitive level of the obtained AI model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of a training and using scenario of an AI model provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of an architecture of a training submodule provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of an AI model training method provided by the embodiments of the present application;
FIG. 4 is a schematic flow chart diagram of determining a sample generation model provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a hierarchical structure of a model to be trained according to an embodiment of the present application;
FIG. 6 is a schematic diagram of class image features in an embodiment of the present application;
FIG. 7 is a schematic flow chart illustrating a method for using a model according to an embodiment of the present disclosure;
fig. 8 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
With the rapid development of artificial intelligence (AI) in various fields, AI technology for chess and card game programs has also made breakthroughs. Both board-game AIs playing perfect-information games and card-game AIs playing imperfect-information games have demonstrated the ability to defeat top professional opponents (e.g., AlphaGo, AlphaZero, Pluribus, and Suphx). Judged by results alone, these AIs play at a top level; however, AIs for perfect-information and imperfect-information games share a common problem: they are not sufficiently human-like.
This is because, for the sake of prediction accuracy and competitive level, current AI models are mainly trained by deep reinforcement learning. However, the reinforcement learning model is detached from the actual sample data generated by humans during training and learns only from environmental feedback, so the trained model plays rather stiffly and is not sufficiently human-like; it may take actions that do not affect the final result but that a real player would never take, for example splitting a combination that could be played at once into several single cards.
From a competitive standpoint, this problem may be secondary when the goal is simply to demonstrate super-human play, but for AIs that are deployed at scale to play with human players, human-likeness is undoubtedly a very important part of assessing an AI's overall ability. In practical application scenarios (e.g., cold start, accompanying play), an ideal AI not only needs a reasonable and high level of play, but should also keep human players from noticing that their opponent is a program. Only then does the AI technology have practical value in game applications. It is therefore necessary to improve the human-likeness of the trained AI model.
To this end, embodiments of the present application provide a training method for an AI model, a model using method, a computer device, and a storage medium. The AI model training method can be applied to a server, wherein the server can be a single server or a server cluster consisting of a plurality of servers.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating a training scenario and a usage scenario of an AI model according to an embodiment of the present application. As shown in fig. 1, the model training server includes an AI training module and a model deployment module. The AI training module also comprises a training submodule and an evaluation submodule.
The training submodule is used for calling the sample generation model to generate first playing data, then the first playing data is combined with the cached second playing data to train the model to be trained, and when the model to be trained converges, the model to be trained is used as the model to be evaluated.
The evaluation submodule is used to evaluate the model to be evaluated, measuring its performance such as its win rate, selecting an excellent model from the multiple models to be evaluated as the final AI model according to its evaluation score, and deploying the AI model on the AI server through the model deployment module.
When the AI model is used, the player establishes a connection with the front-end server, and the AI server also connects to the front-end server. The front-end server sends the current playing data (such as the current player's cards and the number of cards held by opponents) to the AI server; the AI server calls the deployed AI model to predict which cards should be played in the current situation, converts the predicted label into specific cards, and sends the prediction result to the front-end server for execution, so that the player and the AI model play against each other.
Taking the fight-the-landlord (Dou Dizhu) card game as an example, the usage scenarios of the trained AI model may include novice teaching, offline hosting, man-machine challenge, and other scenarios.

In the novice teaching scenario, the AI model can be used to guide a novice player to extract the maximum value from his or her own hand and thereby win.

In the offline hosting scenario, when a player goes offline, the AI model can be used to play reasonable cards on the player's behalf, maximizing the profit or minimizing the loss of that game and avoiding harm to the experience of other players.

In the man-machine challenge scenario, a high-level player can be matched against an AI model with a high level of play to challenge for score, increasing player activity.
Referring to fig. 2, fig. 2 is a schematic diagram of an architecture of a training submodule according to an embodiment of the present application. As shown in fig. 2, the training submodule includes a sample generation part, a storage part, and a learning part.
The sample generation section selects the sample generation models, and performs self-play between the sample generation models to generate first play data. When the sample generation model is selected, the sample generation model may be selected from the historical model library and played. All models generated by the AI training module and models additionally added to the historical model library by the user are stored in the historical model library.
Taking the fight-the-landlord game as an example, the sample generation part acquires three sample generation models from the historical model library, each sample generation model corresponding to one participant. Because the roles of the three participants differ, two participants taking the role of farmer and one the role of landlord, a bidding module can be used to simulate the bidding phase before play, so as to determine the role of each participant. The bidding module can be a bidding model trained in advance by a supervised learning method.
After the role of each participant is determined through bidding, the participants can play the game based on the game logic. The game logic includes play logic, for example: the next player must play cards of the same type as, and larger than, the previous player's cards; otherwise the next player may play a different type only if it is four identical cards (a bomb) or the two jokers (a rocket).
During play, each participant takes the cards played by the previous player as the input features of the current play, uses the corresponding sample generation model to output a playing label based on those input features, converts the label into specific cards and plays them, and this is repeated until a win or loss is decided, completing one game. The data generated in this game are taken as first playing data.
In one embodiment, the sample generation section runs a large number of mirrored game instances to produce samples, playing self-play games in parallel with multiple threads, thereby increasing the amount of first playing data generated.
The first playing data are cached in the storage part for the learning part to consume during model training. Meanwhile, the storage part also caches second playing data, namely real playing data, for human-likeness training. The storage part may be, for example, a Redis server.
The learning part consumes the first playing data and the second playing data stored in the storage part and performs reinforcement training on the model to be trained. During training, the PPO (Proximal Policy Optimization) algorithm can be used, iteratively optimizing the value function and policy of the neural network of the model to be trained through environmental feedback.
The model generated by each training of the learning part can be stored in the historical model library, so that the sample generation part can select the sample generation model to play the game. During the training process, multi-thread asynchronous training can be adopted to accelerate the training speed and the convergence speed.
Referring to fig. 3, fig. 3 is a schematic flow chart of an AI model training method according to an embodiment of the present disclosure. According to the AI model training method, the second playing data, namely the real playing data, is added in the model training process, so that the anthropomorphic degree of the AI model obtained by training is improved.
As shown in fig. 3, the AI model training method specifically includes: step S101 to step S104.
S101, obtaining a plurality of sample generating models, and playing according to the sample generating models to obtain first playing data.
The sample generation model is a model for generating first play data by performing self-play. The sample generation model may be a model trained using the AI model training method.
In one embodiment, a plurality of selectable models are stored in the historical model library, and the models can be randomly called from the historical model library to serve as sample generation models.
The plurality of sample generation models play themselves according to the playing rules, and first playing data can be generated. The first game playing data comprise playing characteristics, playing strategy labels, environment feedback and predicted scores. The card-playing strategy labels comprise a master strategy label and a slave strategy label. The playing rules may be set according to the type of the card game, and this is not specifically limited in this application.
Taking the fight-the-landlord game as an example, the master strategy label is a main-card label, which includes, but is not limited to, labels corresponding to single cards, single straights, pairs, double straights, triple straights, quadruple straights, rockets, and pass; the slave strategy label is a kicker (attached-card) label, which includes, but is not limited to, labels corresponding to one single card, two single cards, one pair, two pairs, and no kicker.
The playing rules include: the next player must play cards of the same type as, and larger than, the previous player's cards; otherwise the next player may play a different type only if it is four identical cards (a bomb) or the two jokers (a rocket).
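To make the rule concrete, the following is a minimal Python sketch of this legality check; the `(type, rank, length)` play summary and the hypothetical `can_beat` helper are illustrative assumptions, not part of the patent.

```python
# Minimal sketch of the play-legality rule described above (illustrative only).
# Assumption: each play is summarized by a (type, rank, length) triple; "BOMB"
# is four identical cards and "ROCKET" is the two jokers.

def can_beat(candidate, previous):
    """Return True if `candidate` may legally be played after `previous`."""
    if previous is None:               # leading the trick: any legal combination is fine
        return True
    c_type, c_rank, c_len = candidate
    p_type, p_rank, p_len = previous
    if c_type == "ROCKET":             # a rocket beats everything
        return True
    if c_type == "BOMB":               # a bomb beats any non-bomb, or a smaller bomb
        return p_type != "BOMB" or c_rank > p_rank
    # otherwise the type and length must match and the rank must be strictly higher
    return c_type == p_type and c_len == p_len and c_rank > p_rank

# Example: a pair of kings beats a pair of tens
print(can_beat(("PAIR", 13, 2), ("PAIR", 10, 2)))  # True
```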
In an embodiment, the number of the obtained sample generation models may be determined according to a game logic of the game, and the game logic may be set according to a type of the card game, which is not particularly limited in this application.
For example, the game logic of the fight-the-landlord game includes: the number of participants is three, i.e., three sample generation models need to be selected; one of the three participants takes the role of landlord, and the other two take the role of farmer; the landlord participant holds 20 cards and each farmer participant holds 17 cards; if the landlord plays out all 20 cards first, the landlord wins, and if either farmer plays out all 17 cards first, both farmer participants win.
Specifically, the model training server acquires the number of participants from preset game logic, and creates a corresponding number of participants according to the number of the participants, wherein each created participant corresponds to one sample generation model; a role is assigned to each of the players and a respective hand is assigned to each of the players based on the total number of cards.
Taking fight-the-landlord as an example, the model training server creates three participants, assigns the landlord role to one participant and the farmer role to the other two, and deals 20 cards to the landlord player, the 20 cards being R222AAAKKK101099874433, where the landlord's 20 cards include the three bottom cards R23; the hands dealt to the two farmer players are 17 cards each, namely B2AKQJJ101099874433 and qqjj 8877665555, respectively.
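As a rough illustration of this dealing step, the sketch below deals a 54-card fight-the-landlord deck into three 17-card hands plus three bottom cards that go to the landlord; the card encoding and the random role choice (which the patent instead performs with a bidding model) are illustrative assumptions.

```python
import random

# Illustrative 54-card deck: 13 ranks x 4 suits plus black ("B") and red ("R") jokers.
RANKS = ["3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A", "2"]
DECK = [r for r in RANKS for _ in range(4)] + ["B", "R"]

def deal():
    """Deal three 17-card hands plus 3 bottom cards; the landlord receives the bottom cards."""
    deck = DECK[:]
    random.shuffle(deck)
    hands = [deck[0:17], deck[17:34], deck[34:51]]
    bottom = deck[51:54]
    landlord = random.randrange(3)   # role assignment; in the patent this is decided by a bidding model
    hands[landlord] += bottom        # the landlord ends up with 20 cards
    return hands, landlord, bottom

hands, landlord, bottom = deal()
print(landlord, [len(h) for h in hands], bottom)   # e.g. 1 [17, 20, 17] ['K', '5', 'R']
```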
In one embodiment, when assigning roles to the participants, the model training server may simulate the bidding process according to a pre-trained bidding model, thereby determining the role of each participant. The bidding model can be obtained by training in a supervised learning manner.
In an embodiment, the sample generation model comprises a supervised learning model. Referring to fig. 4, fig. 4 is a schematic flow chart of determining a sample generation model according to an embodiment of the present application. As shown in fig. 4, the method includes step S201 and step S202.
S201, playing a round-robin among the plurality of supervised learning models, and determining the selection probability of each supervised learning model.
The supervised learning models are obtained by training with a supervised learning method on the second playing data. In one embodiment, the obtained supervised learning models are also put into the historical model library to facilitate invoking them in the first playing data generation phase.

After the plurality of supervised learning models are obtained using the second playing data, they play a round-robin against one another, so that the selection probability of each supervised learning model is determined.

In one embodiment, the selection probability of each supervised learning model can be determined according to its average win rate after the round-robin. Determining the selection probability of each supervised learning model includes: acquiring the average win rate of each supervised learning model in the round-robin; and calculating the selection probability of each supervised learning model based on its average win rate.

Each supervised learning model is matched against all the other supervised learning models in the round-robin, and the average win rate of each supervised learning model is then calculated from the results. The selection probability of each supervised learning model is then calculated from its average win rate according to a calculation formula.
For example, suppose there are N supervised learning models in total, and M_i denotes the i-th supervised learning model. All N supervised learning models take part in an offline round-robin, i.e., each model plays 5000 rounds against each of the other N-1 models, and the average win rate of each model is calculated from these games. Let w_i denote the average win rate of the i-th supervised learning model M_i in the round-robin; the selection probability of M_i is then calculated from the average win rates using a calculation formula.
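The calculation formula itself appears only as an image in the published text. Purely as an assumption, one plausible form, in which the selection probability is the average win rate normalized over all N models, is:

```latex
p_i = \frac{w_i}{\sum_{j=1}^{N} w_j}, \qquad i = 1, \dots, N
```

Under this assumed form, models with higher average win rates are sampled more often as sample generation models; a softmax over the win rates would serve the same purpose.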
and S202, determining a target model from the plurality of supervised learning models according to the selection probability and generating the model as a sample.
A target model is determined from the plurality of supervised learning models on the basis of the selection probability of the supervised learning models, and the determined target model is used as a sample generation model to participate in self-play to generate first play data.
In the self-play process, the selected sample generation model does not necessarily include the supervised learning model.
Taking the fight-the-landlord game as an example, when sample generation models are selected from the historical model library, three models are selected, so a supervised learning model is selected as a sample generation model with a probability of 33%; when the selected sample generation models include supervised learning models, the final target model is chosen according to the selection probability of each supervised learning model, and the target model takes part in self-play as a sample generation model to generate first playing data.

Adding the supervised learning models to the historical model library lets the reinforcement learning model face opponents whose play resembles that of humans, generating diverse data and effectively learning how to play against humans. Using several supervised learning models also prevents the reinforcement learning model from exploiting the weakness of a single supervised learning model and falling into a local optimum.
S102, obtaining second playing data, and training the model to be trained according to the second playing data and the first playing data.
The second playing data are real playing data, namely playing data generated by real users during play, and include real playing features and real strategy labels. After the second playing data are acquired, the model to be trained can be trained according to the second playing data and the first playing data. During training, data generated by the models and real data are used together, which improves the human-likeness of the trained model.
In one embodiment, when the model to be trained is trained based on the second playing data and the first playing data, the first playing data and the second playing data may be mixed at a certain ratio, and the mixed data used together to train the model to be trained. For example, the first playing data and the second playing data may be mixed at a ratio of 8:2, i.e., 80% of the training data for the model to be trained comes from the first playing data and 20% from the second playing data.
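A minimal sketch of this mixing step, assuming the self-play and real (human) samples are kept in two simple Python lists; the batch size is an arbitrary placeholder and the 8:2 ratio is the example value from the text.

```python
import random

def sample_mixed_batch(self_play_buffer, human_buffer, batch_size=1024, human_ratio=0.2):
    """Draw a training batch in which roughly 20% of the samples are real (human) games
    and 80% are self-play games, as in the 8:2 mixing example above."""
    n_human = int(batch_size * human_ratio)
    n_self = batch_size - n_human
    batch = random.choices(self_play_buffer, k=n_self) + random.choices(human_buffer, k=n_human)
    random.shuffle(batch)
    return batch
```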
In an embodiment, please refer to fig. 5, wherein fig. 5 is a schematic diagram of a hierarchical structure of a model to be trained according to an embodiment of the present application. As shown in fig. 5, the model to be trained includes a first fully-connected layer, a residual network layer, a splicing layer, and a second fully-connected layer. The first full connection layer and the residual error network layer are respectively connected with the splicing layer, and the splicing layer is connected with the second full connection layer. The first fully-connected layer includes a plurality of fully-connected layers, for example, three fully-connected layers, and the second fully-connected layer includes a plurality of fully-connected layers, for example, two fully-connected layers.
The training of the model to be trained according to the second playing data and the first playing data includes: constructing sample data according to the second playing data and the first playing data, and performing feature extraction on the sample data to obtain sample vector features and sample class image features, wherein the sample data include environmental feedback; processing the sample vector features through the first fully-connected layer to obtain a first target vector; processing the sample class image features through the residual network layer to obtain a second target vector; splicing the first target vector and the second target vector through the splicing layer to obtain a spliced vector; determining, through the second fully-connected layer, the probability distribution of master strategy labels, the probability distribution of slave strategy labels, and a prediction score based on the spliced vector; and training the neural network parameters of the model to be trained according to the probability distribution of the master strategy labels, the probability distribution of the slave strategy labels, the prediction score, and the environmental feedback.
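The PyTorch sketch below mirrors the hierarchy just described (vector branch through fully-connected layers, class-image branch through a residual network, a concatenation layer, and output heads for the master policy, slave policy, and prediction score); the layer sizes, the 14x4x15 image shape, and the label counts are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """A small residual block for the class-image branch (illustrative sizes)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class DouDizhuNet(nn.Module):
    def __init__(self, vec_dim=32, n_master=309, n_slave=30):
        super().__init__()
        # first fully-connected layer group: processes the sample vector features
        self.fc1 = nn.Sequential(nn.Linear(vec_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU())
        # residual network layer: processes the 14-channel class-image features
        self.conv_in = nn.Conv2d(14, 64, 3, padding=1)
        self.res = nn.Sequential(ResBlock(64), ResBlock(64))
        # second fully-connected layer group: operates on the spliced vector
        self.fc2 = nn.Sequential(nn.Linear(256 + 64 * 4 * 15, 512), nn.ReLU(),
                                 nn.Linear(512, 512), nn.ReLU())
        self.master_head = nn.Linear(512, n_master)   # probability distribution of master strategy labels
        self.slave_head = nn.Linear(512, n_slave)     # probability distribution of slave strategy labels
        self.value_head = nn.Linear(512, 1)           # prediction score

    def forward(self, vec_feat, img_feat):
        v = self.fc1(vec_feat)
        x = self.res(self.conv_in(img_feat)).flatten(1)
        joint = torch.cat([v, x], dim=1)               # splicing (concatenation) layer
        h = self.fc2(joint)
        return self.master_head(h), self.slave_head(h), self.value_head(h)

# Example forward pass with dummy inputs of the assumed shapes
net = DouDizhuNet()
m, s, v = net(torch.zeros(2, 32), torch.zeros(2, 14, 4, 15))
print(m.shape, s.shape, v.shape)   # torch.Size([2, 309]) torch.Size([2, 30]) torch.Size([2, 1])
```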
And constructing sample data for reinforcement learning according to the cached first playing data and the cached second playing data, and performing feature extraction on the sample data to obtain sample class image features and sample vector features. The class image features are used to model the distribution of cards, representing the distribution of card types and the distribution of quantities.
Taking the fight-the-landlord game as an example, the horizontal axis of a class image feature lists the faces of all cards from largest to smallest, and the vertical axis encodes the count of each face: a count of 1 is encoded as [1000], 2 as [1100], 3 as [1110], and 4 as [1111]. The class image features comprise 14 channels: the current participant's own hand (1 channel); the cards played by the three participants in the most recent round (3 channels); the cards played by the three participants in the round before that (3 channels); the cards played in the third most recent round (3 channels); all cards played before those three rounds (1 channel); all cards that have not yet appeared (1 channel); and the hand information of the two participants other than the current participant (2 channels, one channel each).
Fig. 6 is a schematic diagram of class image features in an embodiment of the present application; the class image features include 14 channels. Diagram A in Fig. 6 is the feature representation of the current participant's own hand EBAKKQQ73, and diagram B in Fig. 6 is the feature representation of the cards 22 played by the current participant in this round.
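A minimal sketch of this 4x15 card-count encoding for a single channel; the column order (red joker down to 3) is taken from the description, while the array layout and face strings are illustrative assumptions.

```python
import numpy as np

# Columns ordered from largest to smallest face: R (red joker), B (black joker), 2, A, K, ..., 3.
FACES = ["R", "B", "2", "A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3"]

def encode_plane(cards):
    """Encode a set of cards as a 4 x 15 binary plane: a count of k for a face
    sets the top k rows of that face's column to 1 (e.g. two kings -> [1,1,0,0])."""
    plane = np.zeros((4, 15), dtype=np.float32)
    for face in FACES:
        count = cards.count(face)
        plane[:count, FACES.index(face)] = 1.0
    return plane

hand = ["2", "2", "A", "K", "K", "K", "7"]
plane = encode_plane(hand)
print(plane.shape, plane[:, FACES.index("K")])   # (4, 15) [1. 1. 1. 0.]
```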
The vector features include the participant's role, the number of cards in hand, the score multiplier, the number of cards in the previous player's play, whether the current participant's hand contains cards that can beat the previous play, the number of bombs (four identical cards) already played, and the number of bombs that may still appear among the cards that have not been seen.

For example, the role is encoded as 1 for the landlord and 0 for a farmer; the number of cards in hand is encoded between 00000 (holding 0 cards) and 10100 (holding 20 cards); the number of cards in the previous play is encoded between 00000 (0 cards played) and 10100 (20 cards played); and if the current participant's hand contains cards larger than the previous play, the corresponding code is 1, otherwise it is 0.

For example, if the roles of the three participants are landlord, farmer, and farmer, the numbers of cards they hold are 15, 12, and 8, the previous player played 5 cards, and the current participant's hand contains cards that can beat the previous play, the corresponding vector features are: [1,0,0, 01111, 01100, 01000, 00101, 1].

When encoding, multi-hot encoding may be used for the number of bombs that may still appear among the unseen cards, and one-hot encoding may be used for the number of bombs already played.
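A rough sketch of assembling these vector features in Python; the field order follows the example vector [1,0,0, 01111, 01100, 01000, 00101, 1], the score multiplier is omitted for brevity, and the widths of the one-hot/multi-hot bomb encodings are assumptions.

```python
def encode_count(n, width=5):
    """Encode a card count as `width` binary digits, e.g. 15 -> [0,1,1,1,1]."""
    return [int(b) for b in format(n, f"0{width}b")]

def encode_vector_features(roles, hand_counts, last_play_count, can_beat_last,
                           bombs_played, bombs_possible, max_bombs=4):
    """roles: 1 for landlord / 0 for farmer, current participant first (assumed ordering)."""
    feat = list(roles)                                   # e.g. [1, 0, 0]
    for c in hand_counts:                                # e.g. 15, 12, 8
        feat += encode_count(c)
    feat += encode_count(last_play_count)                # cards in the previous play
    feat.append(1 if can_beat_last else 0)               # hand can beat the previous play
    one_hot = [0] * (max_bombs + 1)                      # bombs already played (one-hot)
    one_hot[bombs_played] = 1
    multi_hot = [1 if i < bombs_possible else 0 for i in range(max_bombs)]  # possible bombs (multi-hot)
    return feat + one_hot + multi_hot

vec = encode_vector_features([1, 0, 0], [15, 12, 8], 5, True, 0, 2)
print(vec[:24])   # begins [1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, ...], matching the text's example
```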
The model training server processes the sample vector characteristics by utilizing the first full-connection layer to obtain a first target vector; processing the sample image characteristics through a residual error network layer to obtain a second target vector; splicing the first target vector and the second target vector through a splicing layer to obtain a spliced vector; and finally, outputting the probability distribution of the master strategy label, the probability distribution of the slave strategy label and the prediction score based on the splicing vector through a second full-connection layer, and updating the neural network parameters of the model to be trained according to the obtained probability distribution of the master strategy label, the probability distribution of the slave strategy label, the prediction score and the environment feedback. It should be noted that the neural network parameter updating algorithm may be set based on actual conditions, which is not specifically limited in this application, and optionally, the neural network parameter of the artificial intelligence model is updated based on a back propagation algorithm.
The prediction score is the score the model to be trained predicts for the current game: if the predicted score is positive, the model to be trained expects to win the current game; if the predicted score is negative, it expects to lose the current game.

During training, a cascading approach can be used: the master strategy label is predicted first, and the slave strategy label is then predicted according to the master strategy label; that is, the main-card action is predicted first, and the kicker action is then predicted based on the main-card action.
In an embodiment, the training the neural network parameters of the model to be trained according to the probability distribution of the master policy label, the probability distribution of the slave policy labels, the prediction score and the environmental feedback includes: calculating a corresponding first loss value according to the probability distribution of the main strategy label; calculating a corresponding second loss value according to the probability distribution of the slave strategy label; calculating a corresponding third loss value according to the prediction score and the environment feedback; calculating a fourth loss value according to the second playing data, the probability distribution of the main strategy label and the probability distribution of the auxiliary strategy label output by the model to be trained; determining whether the model to be trained is converged according to the first loss value, the second loss value, the third loss value and the fourth loss value; and if the model to be trained is converged, executing the step of playing the model to be trained as the model to be evaluated and the comparison model for multiple times when the model to be trained is converged, and obtaining a playing result.
And in the process of updating the neural network parameters of the model to be trained, determining the model loss value of the artificial intelligent model according to the probability distribution of the master strategy label, the probability distribution of the slave strategy label and the prediction score.
The loss function is Loss = Loss1 + Loss2 + Loss3 + Loss4, where Loss1 is determined from the output master strategy label predictions and Loss2 from the output slave strategy label predictions, Loss1 and Loss2 being policy losses. Loss3 is determined from the output prediction score and the environmental feedback and is an L2 (mean squared error) loss. Loss4 is a supervised loss function, for example a cross-entropy loss function, computed between the probability distributions of the master and slave strategy labels output by the model to be trained and the real master and slave strategies in the second playing data, so that the model approaches the playing strategy of real users while still pursuing maximum profit.
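A sketch of the combined loss under this description, using PyTorch; the cross-entropy supervised term and the L2 score term follow the text, while the simple policy-gradient form of Loss1/Loss2 (a PPO-style clipped surrogate would be typical), the equal weighting, and the use of a single batch for both the RL and supervised terms are assumptions.

```python
import torch.nn.functional as F

def total_loss(master_logits, slave_logits, pred_score, env_feedback,
               sampled_master, sampled_slave, advantages,
               human_master, human_slave):
    # Loss1 / Loss2: policy losses from the master and slave strategy predictions
    # (plain policy-gradient form here; PPO clipping omitted for brevity).
    logp_master = F.log_softmax(master_logits, dim=-1).gather(1, sampled_master.unsqueeze(1)).squeeze(1)
    logp_slave = F.log_softmax(slave_logits, dim=-1).gather(1, sampled_slave.unsqueeze(1)).squeeze(1)
    loss1 = -(advantages * logp_master).mean()
    loss2 = -(advantages * logp_slave).mean()
    # Loss3: L2 loss between the prediction score and the environmental feedback / target score
    loss3 = F.mse_loss(pred_score.squeeze(-1), env_feedback)
    # Loss4: supervised cross-entropy against the real-user master/slave labels
    # (in practice these come from the second playing data portion of the batch)
    loss4 = F.cross_entropy(master_logits, human_master) + F.cross_entropy(slave_logits, human_slave)
    return loss1 + loss2 + loss3 + loss4
```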
In one embodiment, the second playing data includes an average number of actions; calculating a corresponding third loss value according to the prediction score and the environmental feedback, including: and obtaining a target score according to the average action times and the environment feedback, and calculating a corresponding third loss value based on the target score and the prediction score.
The average number of actions is the average number of card plays per game. This average is computed from real games and, taken as a reference and combined with the environmental feedback, serves as the target score for optimizing the neural network loss function of the model to be trained, so that the number of plays of the trained model is closer to that of real users. The target score may be obtained from the average number of actions in the second playing data and the environmental feedback, and the corresponding third loss value is calculated based on the target score and the prediction score.
For example, the target score may be calculated using a calculation formula in which \bar{a}_c denotes the true average number of plays counted from real games for role c (the landlord, the player before the landlord, or the player after the landlord), a_c denotes the number of plays made by role c in the current game, s_c denotes the target score of role c, and r_c denotes the environmental feedback of role c.
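The formula itself appears only as an image in the published text. Purely as an assumption consistent with the stated goal (keeping the number of plays close to the real-player average while preserving the win/loss signal), one plausible form is:

```latex
s_c \;=\; r_c \;-\; \lambda \,\bigl| a_c - \bar{a}_c \bigr|
```

where \lambda is a hypothetical weighting coefficient; it penalizes deviating from the real average number of plays \bar{a}_c on top of the environmental feedback r_c.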
A target score is calculated based on this formula, and the corresponding third loss value is calculated based on the target score and the prediction score, so that the model to be trained keeps winning games while using a number of plays close to that of real users. Since fight-the-landlord is a zero-sum game, the target score also helps the model to be trained learn its strategy more effectively.
It should be noted that, since fight-the-landlord is a zero-sum game, the target scores of the three roles are coupled: when the target score indicates that role c, being the landlord, loses the current game, the target scores of the players before and after the landlord (the two farmers) are set accordingly in the opposite direction; and when role c, being the landlord, wins the current game, the landlord receives the corresponding positive target score and the farmers' target scores take the opposite sign.
In an embodiment, determining whether the model to be trained converges comprises: calculating the sum of the first loss value, the second loss value, the third loss value and the fourth loss value, and taking the sum of the first loss value, the second loss value, the third loss value and the fourth loss value as a total loss value; and if the total loss value is less than or equal to a preset loss value threshold value, determining that the model to be trained is converged. It should be noted that the loss value threshold may be set based on actual conditions, and this application is not limited to this.
S103, when the model to be trained converges, the model to be trained is used as a model to be evaluated, and the model to be trained and the comparison model are subjected to playing for multiple times, so that a playing result is obtained.
The comparison model is a base model that has not undergone the iterative training described above and is obtained by training in a supervised learning manner. When the model to be trained converges, it is taken as the model to be evaluated and played against the comparison model multiple times to obtain a playing result, and whether the model to be evaluated is a deployable model is determined according to the playing result.
In one embodiment, when the model to be trained converges, the model to be trained is saved in the historical model library for sample generation.
And S104, when the playing result reaches a preset index, determining the model to be evaluated as the AI model, and finishing AI model training.
When the playing result reaches a preset index, the model to be evaluated is considered to be a deployable model, and the model to be evaluated is taken as the AI model, completing AI model training. The preset index may be a threshold value set in advance.
In one embodiment, the playing result includes the average score and/or the win rate of the model to be evaluated over multiple games against the comparison model; when the playing result reaches the preset index, determining that the model to be evaluated is the AI model includes: determining the model to be evaluated as the AI model when the average score and/or the win rate reaches a preset threshold.

When either the average score or the win rate reaches the preset threshold, the model to be evaluated is determined to be the AI model, and AI model training is finished.
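A compact sketch of this evaluation gate, assuming the playing result has already been summarized into an average score and a win rate over the evaluation games; the threshold values are placeholders, not values from the patent.

```python
def is_deployable(avg_score, win_rate, score_threshold=0.1, win_rate_threshold=0.55):
    """Return True if the model to be evaluated reaches the preset index and can
    therefore be taken as the AI model (threshold values are illustrative)."""
    return avg_score >= score_threshold or win_rate >= win_rate_threshold
```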
It should be noted that the process of model evaluation may be performed simultaneously with the training process of the model to be trained, that is, after the first model to be trained is trained, the second model to be trained begins to be trained, and meanwhile, the first model to be trained is played as the first model to be evaluated and the comparison model to perform model evaluation.
In the AI model training method provided by the above embodiment, a plurality of sample generation models are obtained and play games among themselves to obtain first playing data; second playing data are acquired, and the model to be trained is trained according to the first playing data and the second playing data; when the model to be trained converges, it is taken as the model to be evaluated and played against the comparison model multiple times to obtain a playing result; and finally, when the playing result reaches the preset index, the model to be evaluated is determined to be the AI model, completing AI model training. Real second playing data are added when training the model to be trained, which improves the human-likeness of the obtained model. The model is evaluated after training, and the final AI model is determined accordingly, thereby ensuring the prediction accuracy and competitive level of the obtained AI model.
Referring to fig. 7, fig. 7 is a flowchart illustrating a method for using a model according to an embodiment of the present disclosure.
As shown in fig. 7, the model using method includes steps S301 to S303.
S301, acquiring current playing data, and performing feature extraction on the current playing data to obtain current class image features and current vector features.
In the actual use process, a player is connected with a front-end server, an AI server is also connected with the front-end server, the front-end server acquires current playing data and sends the current playing data to the AI server, and the AI server extracts the characteristics of the current playing data after receiving the current playing data, so that current image characteristics and current vector characteristics corresponding to the current playing data are obtained.
The current playing data can include information such as current hands, the number of opponent hands and current roles.
For example, in a fight-the-landlord game, when a real player unexpectedly goes offline during a game, the AI server connects to the front-end server and acquires from it information such as the role (landlord or farmer) of the player who went offline, that player's hand, and the number of cards held by the opposing players.
S302, inputting an AI model according to the current image feature and the current vector feature to obtain a predicted main strategy label and a predicted auxiliary strategy label.
The acquired current class image features and current vector features are input into the AI model, and the AI model outputs the predicted master strategy label and the predicted slave strategy label, namely the predicted main-card label and the predicted kicker label. The AI model is obtained by training with the AI model training method described above.
And S303, determining corresponding predicted actions according to the predicted main strategy label and the predicted auxiliary strategy label, and outputting the predicted actions to play chess with a real user.
The predicted action is the cards that need to be played. The master strategy label output by the AI model is converted into the specific main cards to be played, the slave strategy label is likewise converted into the specific kicker cards to be played, and the predicted action is obtained from the main cards and the kicker cards to be played.
The AI server outputs the predicted action, namely, sends a card playing command to the front-end server, so that the front-end server plays the corresponding card according to the predicted action output by the AI server, and plays the game with the real player.
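Putting steps S301-S303 together, a minimal sketch of the AI server's per-request inference flow is shown below; `extract_features`, `label_to_cards`, and the model object are hypothetical stand-ins for the feature-extraction and label-to-card conversion components described above.

```python
import torch

def predict_play(ai_model, current_play_data, extract_features, label_to_cards):
    """One inference step: current playing data -> predicted action (cards to play)."""
    img_feat, vec_feat = extract_features(current_play_data)           # S301: feature extraction
    with torch.no_grad():
        master_logits, slave_logits, _ = ai_model(vec_feat, img_feat)  # S302: run the AI model
        master_label = master_logits.argmax(dim=-1)
        slave_label = slave_logits.argmax(dim=-1)
    main_cards = label_to_cards(master_label, current_play_data)       # S303: labels -> specific cards
    kicker_cards = label_to_cards(slave_label, current_play_data)
    return main_cards + kicker_cards   # predicted action sent back to the front-end server
```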
In the model using method provided by the above embodiment, current playing data are acquired and feature extraction is performed on them to obtain current class image features and current vector features; the current class image features and current vector features are input into the AI model to obtain the predicted master and slave strategy labels; and the predicted action is determined according to the predicted master and slave strategy labels and output, so as to play against the real user. When the AI model needs to be called to play with a real user, the trained AI model can take a corresponding action according to the current playing data, so that the AI model can be invoked quickly and the real user's experience is effectively improved.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 8, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any one of an AI model training method and/or a model using method.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by the processor, causes the processor to perform any one of an AI model training method and/or a model using method.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
obtaining a plurality of sample generation models, and playing games among the sample generation models to obtain first playing data; acquiring second playing data, and training a model to be trained according to the second playing data and the first playing data, wherein the second playing data are real playing data; when the model to be trained converges, taking the model to be trained as a model to be evaluated, playing it against a comparison model multiple times, and obtaining a playing result; and when the playing result reaches a preset index, determining the model to be evaluated as the AI model, and finishing AI model training.
In one embodiment, the model to be trained comprises a first fully-connected layer, a residual network layer, a splicing layer and a second fully-connected layer; when the processor realizes that the model to be trained is trained according to the second playing data and the first playing data, the processor is used for realizing that:
constructing sample data according to the second playing data and the first playing data, and performing feature extraction on the sample data to obtain sample vector features and sample class image features, wherein the sample data include environmental feedback; processing the sample vector features through the first fully-connected layer to obtain a first target vector; processing the sample class image features through the residual network layer to obtain a second target vector; splicing the first target vector and the second target vector through the splicing layer to obtain a spliced vector; determining, through the second fully-connected layer, the probability distribution of master strategy labels, the probability distribution of slave strategy labels, and a prediction score based on the spliced vector; and training the neural network parameters of the model to be trained according to the probability distribution of the master strategy labels, the probability distribution of the slave strategy labels, the prediction score, and the environmental feedback.
In one embodiment, the processor, when implementing the training of the neural network parameters of the model to be trained according to the probability distribution of the master policy label, the probability distribution of the slave policy labels, the prediction score and the environmental feedback, is configured to implement:
calculating a corresponding first loss value according to the probability distribution of the main strategy label; calculating a corresponding second loss value according to the probability distribution of the slave strategy label; calculating a corresponding third loss value according to the prediction score and the environment feedback; calculating a fourth loss value according to the second playing data, the probability distribution of the main strategy label and the probability distribution of the auxiliary strategy label output by the model to be trained; determining whether the model to be trained is converged according to the first loss value, the second loss value, the third loss value and the fourth loss value; and if the model to be trained is converged, executing the step of playing the model to be trained as the model to be evaluated and the comparison model for multiple times when the model to be trained is converged, and obtaining a playing result.
In one embodiment, the second playing data includes an average number of actions; when implementing the calculation of the corresponding third loss value according to the prediction score and the environment feedback, the processor is configured to implement:
obtaining a target score according to the average number of actions and the environment feedback, and calculating the corresponding third loss value based on the target score and the prediction score.
In one embodiment, when implementing the determination of whether the model to be trained has converged according to the first loss value, the second loss value, the third loss value and the fourth loss value, the processor is configured to implement:
calculating the sum of the first loss value, the second loss value, the third loss value and the fourth loss value, and taking the sum as a total loss value; and if the total loss value is less than or equal to a preset loss value threshold, determining that the model to be trained has converged.
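A minimal sketch of how the four loss values and the convergence test might be combined is given below. The specific loss forms (cross-entropy for the two strategy heads, squared error for the score) and the exact way the target score is derived from the environment feedback and the average number of actions are assumptions of this illustration; the patent only states which quantities each loss value depends on.

```python
import torch
import torch.nn.functional as F

def compute_losses(main_logits, aux_logits, pred_score,
                   main_target, aux_target, env_feedback, avg_actions,
                   real_main_label, real_aux_label, loss_threshold=0.05):
    # First loss value: from the probability distribution of the main strategy labels.
    loss1 = F.cross_entropy(main_logits, main_target)
    # Second loss value: from the probability distribution of the auxiliary strategy labels.
    loss2 = F.cross_entropy(aux_logits, aux_target)
    # Third loss value: a target score is obtained from the environment feedback and the
    # average number of actions (the scaling below is an assumed choice), then compared
    # with the prediction score output by the model.
    target_score = env_feedback / avg_actions.clamp(min=1.0)
    loss3 = F.mse_loss(pred_score, target_score)
    # Fourth loss value: ties the two predicted distributions to the labels observed in
    # the second (real) playing data.
    loss4 = (F.cross_entropy(main_logits, real_main_label)
             + F.cross_entropy(aux_logits, real_aux_label))
    total = loss1 + loss2 + loss3 + loss4        # total loss value
    converged = total.item() <= loss_threshold   # preset loss value threshold
    return total, converged
```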
In one embodiment, the sample generation model comprises a supervised learning model, and the supervised learning model is trained from the second playing data; the processor is further configured to implement:
performing cyclic playing among a plurality of supervised learning models, and determining a selection probability of each supervised learning model; and determining a target model from the plurality of supervised learning models according to the selection probabilities, and taking the target model as a sample generation model.
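The sketch below illustrates one way to turn the results of cyclic (round-robin) playing into selection probabilities and to sample a target model. Normalising the average winning rates directly into probabilities is an assumption of this sketch; the method only states that the selection probability is calculated based on the average winning rate obtained from the cyclic playing.

```python
import random

def pick_sample_generation_model(models, avg_win_rates):
    """Select a target model from a set of supervised learning models.

    avg_win_rates[i] is the average winning rate of models[i] in the cyclic
    playing; turning the rates into probabilities by simple normalisation is
    an assumed choice of this sketch.
    """
    total = sum(avg_win_rates)
    if total > 0:
        probs = [w / total for w in avg_win_rates]
    else:
        probs = [1.0 / len(models)] * len(models)   # fall back to uniform selection
    return random.choices(models, weights=probs, k=1)[0]

# Example: three supervised learning models and their average winning rates.
models = ["sl_model_a", "sl_model_b", "sl_model_c"]
target_model = pick_sample_generation_model(models, [0.55, 0.30, 0.15])
```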
In one embodiment, the playing result comprises an average score and/or a winning rate of the model to be evaluated playing against the comparison model multiple times; when implementing the determination of the model to be evaluated as the AI model when the playing result reaches the preset index, the processor is configured to implement:
when the average score and/or the winning rate reaches a preset threshold, determining the model to be evaluated as the AI model.
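For completeness, the evaluation check might be as simple as the following; the concrete threshold values are placeholders, and whether the average score, the winning rate, or both must reach the preset threshold depends on the chosen embodiment.

```python
def passes_evaluation(avg_score, win_rate,
                      score_threshold=1.5, win_rate_threshold=0.55):
    """Return True when the model to be evaluated qualifies as the AI model.

    Threshold values are illustrative placeholders only.
    """
    return avg_score >= score_threshold or win_rate >= win_rate_threshold

# Example: playing result collected from many games against the comparison model.
print(passes_evaluation(avg_score=1.8, win_rate=0.62))  # True
```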
In one embodiment, the processor is further configured to execute a computer program stored in the memory to implement the following steps:
acquiring current playing data, and performing feature extraction on the current playing data to obtain current image features and current vector features; inputting the current image features and the current vector features into an AI model to obtain a predicted main strategy label and a predicted auxiliary strategy label, wherein the AI model is trained according to the AI model training method described above; and determining a corresponding predicted action according to the predicted main strategy label and the predicted auxiliary strategy label, and outputting the predicted action to play against a real user.
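Building on the PolicyValueNet sketch above, prediction during play might look as follows. The feature-extraction step is omitted, a batch of one is assumed, and the action_table mapping label indices to concrete in-game actions is a hypothetical structure used only for this illustration.

```python
import torch

def predict_action(model, current_vec_feat, current_img_feat, action_table):
    """Run the trained AI model on the current playing data and pick an action."""
    model.eval()
    with torch.no_grad():
        main_logits, aux_logits, _ = model(current_vec_feat, current_img_feat)
        main_prob = torch.softmax(main_logits, dim=-1)  # probability distribution of main strategy labels
        aux_prob = torch.softmax(aux_logits, dim=-1)    # probability distribution of auxiliary strategy labels
        main_label = torch.argmax(main_prob, dim=-1).item()  # predicted main strategy label
        aux_label = torch.argmax(aux_prob, dim=-1).item()    # predicted auxiliary strategy label
    return action_table[(main_label, aux_label)]  # predicted action, output to play against the real user
```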
Embodiments of the present application further provide a computer-readable storage medium. The computer-readable storage medium stores a computer program, the computer program includes program instructions, and a processor executes the program instructions to implement any of the AI model training methods and/or the model using methods provided in the embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, for example, a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device.
While the present application has been described with reference to specific embodiments, the scope of protection is not limited thereto; those skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed herein, and such modifications or substitutions shall fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An AI model training method, comprising:
obtaining a plurality of sample generation models, and playing according to the sample generation models to obtain first playing data;
acquiring second playing data, training a model to be trained according to the second playing data and the first playing data, and calculating a first loss value, a second loss value, a third loss value and a fourth loss value of the model to be trained;
determining whether the model to be trained has converged according to the first loss value, the second loss value, the third loss value and the fourth loss value;
when the model to be trained converges, taking the model to be trained as a model to be evaluated, playing the model to be evaluated against a comparison model multiple times, and obtaining a playing result;
when the playing result reaches a preset index, determining the model to be evaluated as an AI model, and finishing AI model training;
the second playing data is real playing data, the second playing data comprises average action times, the third loss value is obtained by obtaining a target score according to the average action times and environment feedback and calculating based on the target score and a predicted score, and the predicted score is a score output by the model to be trained.
2. The AI model training method of claim 1, wherein the model to be trained comprises a first fully-connected layer, a residual network layer, a splicing layer and a second fully-connected layer; and the training of the model to be trained according to the second playing data and the first playing data comprises:
constructing sample data according to the second playing data and the first playing data, and performing feature extraction on the sample data to obtain sample vector features and sample image features, wherein the sample data comprises environment feedback;
processing the sample vector features through the first fully-connected layer to obtain a first target vector;
processing the sample image features through the residual network layer to obtain a second target vector;
splicing the first target vector and the second target vector through the splicing layer to obtain a spliced vector;
determining, through the second fully-connected layer, a probability distribution of main strategy labels, a probability distribution of auxiliary strategy labels, and a prediction score based on the spliced vector; and
training the neural network parameters of the model to be trained according to the probability distribution of the main strategy labels, the probability distribution of the auxiliary strategy labels, the prediction score and the environment feedback.
3. The AI model training method of claim 2, wherein the calculating of the first loss value, the second loss value, the third loss value and the fourth loss value of the model to be trained comprises:
calculating the corresponding first loss value according to the probability distribution of the main strategy labels;
calculating the corresponding second loss value according to the probability distribution of the auxiliary strategy labels;
calculating the corresponding third loss value according to the prediction score and the environment feedback; and
calculating the fourth loss value according to the second playing data, the probability distribution of the main strategy labels and the probability distribution of the auxiliary strategy labels output by the model to be trained.
4. The AI model training method of claim 3, wherein the determining whether the model to be trained has converged according to the first loss value, the second loss value, the third loss value and the fourth loss value comprises:
calculating the sum of the first loss value, the second loss value, the third loss value and the fourth loss value, and taking the sum as a total loss value; and
if the total loss value is less than or equal to a preset loss value threshold, determining that the model to be trained has converged.
5. The AI model training method of claim 1, wherein the sample generation model comprises a supervised learning model trained from the second playing data; and the method further comprises:
performing cyclic playing among a plurality of supervised learning models, and determining a selection probability of each supervised learning model; and
determining a target model from the plurality of supervised learning models according to the selection probabilities, and taking the target model as a sample generation model.
6. The AI model training method of claim 5, wherein the determining a selection probability for each of the supervised learning models comprises:
acquiring an average winning rate of each supervised learning model in the cyclic playing; and
calculating the selection probability of each supervised learning model based on the average winning rate of the supervised learning model.
7. The AI model training method of claim 1, wherein the playing result comprises an average score and/or a winning rate of the model to be evaluated playing against the comparison model multiple times; and the determining, when the playing result reaches the preset index, that the model to be evaluated is an AI model comprises:
when the average score and/or the winning rate reaches a preset threshold, determining the model to be evaluated as the AI model.
8. A model using method, comprising:
acquiring current playing data, and performing feature extraction on the current playing data to obtain current image features and current vector features;
inputting the current image features and the current vector features into an AI model to obtain a predicted main strategy label and a predicted auxiliary strategy label, wherein the AI model is trained according to the AI model training method of any one of claims 1-7; and
determining a corresponding predicted action according to the predicted main strategy label and the predicted auxiliary strategy label, and outputting the predicted action to play against a real user.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and, when executing the computer program, implement the AI model training method according to any one of claims 1 to 7 and the model using method according to claim 8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the AI model training method according to any one of claims 1 to 7 and the model using method according to claim 8.
CN202011188747.6A 2020-10-30 2020-10-30 AI model training method, model using method, computer device and storage medium Active CN112016704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011188747.6A CN112016704B (en) 2020-10-30 2020-10-30 AI model training method, model using method, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011188747.6A CN112016704B (en) 2020-10-30 2020-10-30 AI model training method, model using method, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN112016704A CN112016704A (en) 2020-12-01
CN112016704B (en) 2021-02-26

Family

ID=73527712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011188747.6A Active CN112016704B (en) 2020-10-30 2020-10-30 AI model training method, model using method, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN112016704B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112704882B (en) * 2021-01-14 2021-09-14 广州云从鼎望科技有限公司 Method, system, medium, and apparatus for model-based chess and card game strategy update
CN113426094A (en) * 2021-06-30 2021-09-24 北京市商汤科技开发有限公司 Chess force adjusting method, device, equipment and storage medium
CN113555141B (en) * 2021-07-19 2024-04-19 中国核电工程有限公司 Intelligent monitoring method and system for nuclear power station and intelligent monitoring server
JP2023165309A (en) * 2022-05-02 2023-11-15 三菱重工業株式会社 Learning device, learning method, and learning program
JP2023165310A (en) * 2022-05-02 2023-11-15 三菱重工業株式会社 Learning device, learning method, and learning program
CN115080445B (en) * 2022-07-21 2022-12-30 欢喜时代(深圳)科技有限公司 Game test management method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109893857A (en) * 2019-03-14 2019-06-18 腾讯科技(深圳)有限公司 A kind of method, the method for model training and the relevant apparatus of operation information prediction
US10740601B2 (en) * 2017-04-10 2020-08-11 Pearson Education, Inc. Electronic handwriting analysis through adaptive machine-learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109032935A (en) * 2018-07-13 2018-12-18 东北大学 The prediction technique of non-perfect information game perfection software model based on phantom go
US20200324206A1 (en) * 2019-04-09 2020-10-15 FalconAI Technologies, Inc. Method and system for assisting game-play of a user using artificial intelligence (ai)
CN110443284B (en) * 2019-07-15 2022-04-05 超参数科技(深圳)有限公司 Artificial intelligence AI model training method, calling method, server and readable storage medium
CN110555517A (en) * 2019-09-05 2019-12-10 中国石油大学(华东) Improved chess game method based on Alphago Zero
CN111569429B (en) * 2020-05-11 2024-02-27 超参数科技(深圳)有限公司 Model training method, model using method, computer device, and storage medium
CN111738294A (en) * 2020-05-21 2020-10-02 深圳海普参数科技有限公司 AI model training method, use method, computer device and storage medium


Also Published As

Publication number Publication date
CN112016704A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN112016704B (en) AI model training method, model using method, computer device and storage medium
CN110443284B (en) Artificial intelligence AI model training method, calling method, server and readable storage medium
CN109513215B (en) Object matching method, model training method and server
CN111282267B (en) Information processing method, information processing apparatus, information processing medium, and electronic device
CN107970608A Setting method and device for level-based game, storage medium, and electronic device
CN111569429B (en) Model training method, model using method, computer device, and storage medium
CN110170171A (en) A kind of control method and device of target object
CN114048834B (en) Continuous reinforcement learning non-complete information game method and device based on after-the-fact review and progressive expansion
CN110782004A (en) Model training method, model calling equipment and readable storage medium
CN111841018A (en) Model training method, model using method, computer device and storage medium
CN111738294A (en) AI model training method, use method, computer device and storage medium
CN111589120A (en) Object control method, computer device, and computer-readable storage medium
CN113230650B (en) Data processing method and device and computer readable storage medium
CN113509726B (en) Interaction model training method, device, computer equipment and storage medium
CN114404975A (en) Method, device, equipment, storage medium and program product for training decision model
CN111598234B (en) AI model training method, AI model using method, computer device, and storage medium
CN111882072B (en) Intelligent model automatic course training method for playing chess with rules
CN112274935A (en) AI model training method, use method, computer device and storage medium
Wardaszko et al. Analysis of matchmaking optimization systems potential in mobile eSports
CN108874377B (en) Data processing method, device and storage medium
CN116570929A (en) Game agent determination, game running method, device, medium and electronic equipment
CN112870722B (en) Method, device, equipment and medium for generating fighting AI (AI) game model
CN114404977A (en) Training method of behavior model and training method of structure expansion model
Zhao et al. Full douzero+: Improving doudizhu ai by opponent modeling, coach-guided training and bidding learning
Valdez et al. The effectiveness of using a historical sequence-based predictor algorithm in the first international roshambo tournament

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant