CN112712161B - Data generation method and system - Google Patents
Data generation method and system
- Publication number
- CN112712161B CN201911023194.6A
- Authority
- CN
- China
- Prior art keywords
- state
- preset time
- terminal device
- connection layer
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/55—Controlling game characters or game objects based on the game progress
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/70—Game security or game management aspects
- A63F13/79—Game security or game management aspects involving player-related data, e.g. identities, accounts, preferences or play histories
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F2300/00—Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
- A63F2300/60—Methods for processing data by generating or executing the game program
- A63F2300/66—Methods for processing data by generating or executing the game program for rendering three dimensional images
Abstract
The invention discloses a data generation method and system. The method comprises the following steps: acquiring game data sent by a terminal device; extracting a plurality of state features of the terminal device from the game data, wherein the state features include state features at a plurality of times; generating a corresponding feature sequence from preset times and the state features corresponding to those times; and inputting the feature sequence into a preset model, so that the model generates a Q value for each action executed by a virtual object in the terminal device according to the feature sequence and the current state feature of the terminal device. The invention can infer future states and rewards from the current partial observation during the course of a game, enabling correct macro-action selection.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data generation method and system.
Background
With the rapid development of computer technology, many applications with a two-dimensional or three-dimensional virtual environment, namely real-time strategy games, run on terminals such as smartphones and tablet computers. StarCraft (SC) is one of the most popular and successful real-time strategy (RTS) games. In recent years, SC has been regarded as a test platform for artificial-intelligence research owing to its huge state space, hidden information, multi-agent collaboration, and the like. Thanks to the annual AIIDE and CIG competitions, more and more artificial-intelligence bots have been proposed and continuously improved. However, a large gap remains between the top bots and professional human players.
With the development of Deep Reinforcement Learning (DRL), macro actions have begun to be performed in SC using a DRL framework. Macro actions are strategic actions such as production, construction, and upgrading taken when competing with an opponent in the game. However, during a game, the fog of war on the map means that the current partial observation is insufficient to infer future states and rewards, so correct macro-action selection is not possible.
Disclosure of Invention
The invention aims to provide a data generation method, system, computer device, and readable storage medium, which overcome the defect in the prior art that, because of the fog of war on the map, the current partial observation is insufficient to infer future states and rewards, making correct macro-action selection impossible.
According to an aspect of the present invention, there is provided a data generating method including the steps of:
acquiring game data sent by terminal equipment;
extracting a plurality of state features of the terminal device from the game data, wherein the state features comprise state features of a plurality of times;
generating a corresponding characteristic sequence according to preset time and state characteristics corresponding to the preset time;
and inputting the characteristic sequence into a preset model so that the model generates a Q value corresponding to each action executed by a virtual object in the terminal equipment according to the characteristic sequence and the current state characteristic of the terminal equipment, wherein the model at least comprises a first full connection layer, a second full connection layer, an LSTM network and a third full connection layer.
Optionally, the extracting a plurality of status features of the terminal device from the game data includes:
storing the game data in a segment playback buffer corresponding to the terminal device;
a plurality of status features of the terminal device are extracted from the segment playback buffer.
Optionally, the generating a corresponding feature sequence according to a preset time and a state feature corresponding to the preset time includes:
sequentially passing the state features through the first full connection layer and the second full connection layer, and acquiring target state features corresponding to the preset time;
and generating the characteristic sequence according to the preset time and the target state characteristic corresponding to the preset time.
Optionally, the inputting the feature sequence into a preset model to enable the model to generate a Q value corresponding to each action executed by a virtual object in the terminal device according to the feature sequence and the current state feature of the terminal device includes:
inputting each state feature into the LSTM network according to the time sequence relation among the feature sequences, and acquiring competition state information of other terminal equipment having competition relation with the terminal equipment through the LSTM network;
and inputting the competition state information and the current state feature into the third full connection layer, and acquiring the Q value from the third full connection layer.
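The final step above, combining the competition state information produced by the LSTM with the current state feature in the third fully connected layer to obtain one Q value per action, can be sketched as follows. This is an illustrative sketch; the function and parameter names are not taken from the patent.

```python
def q_head(competition_state, current_state, weights, biases):
    """Third fully connected layer (illustrative sketch).

    Concatenates the LSTM's competition-state information with the current
    state feature and maps the result to one Q value per action by weighted
    summation; `weights` holds one weight row per action.
    """
    x = list(competition_state) + list(current_state)
    return [
        sum(w * v for w, v in zip(w_row, x)) + b
        for w_row, b in zip(weights, biases)
    ]
```

For example, with a one-dimensional competition state and current state, two weight rows yield two action Q values.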
In order to achieve the above object, the present invention further provides a data generating system, which specifically includes the following components:
the acquisition module is used for acquiring game data sent by the terminal equipment;
the extracting module is used for extracting a plurality of state features of the terminal equipment from the game data, wherein the state features comprise state features of a plurality of times;
the generating module is used for generating a corresponding characteristic sequence according to preset time and state characteristics corresponding to the preset time;
the generating module is further configured to input the feature sequence into a preset model, so that the model generates a Q value corresponding to each action executed by a virtual object in the terminal device according to the feature sequence and the current state feature of the terminal device, where the model at least includes a first full connection layer, a second full connection layer, an LSTM network, and a third full connection layer.
Optionally, the extracting module is further configured to:
storing the game data in a segment playback buffer corresponding to the terminal equipment;
and extracting a plurality of state characteristics of the terminal equipment from the segment playback buffer.
Optionally, the generating module is further configured to:
sequentially passing the state features through the first full connection layer and the second full connection layer, and acquiring target state features corresponding to the preset time;
and generating the characteristic sequence according to the preset time and the target state characteristic corresponding to the preset time.
Optionally, the generating module is further configured to:
inputting each state characteristic into the LSTM network according to the time sequence relation among the characteristic sequences, and acquiring competition state information of other terminal equipment having competition relation with the terminal equipment through the LSTM network;
and inputting the competition state information and the current state feature into the third full connection layer, and acquiring the Q value from the third full connection layer.
In order to achieve the above object, the present invention further provides a computer device, which specifically includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the data generating method introduced above when executing the computer program.
In order to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data generation method introduced above.
According to the data generation method, system, computer device, and readable storage medium, state features at a plurality of times are extracted for the terminal device, a feature sequence is generated from the state features at the preset times, and the feature sequence is input into a preset model, so that the model generates a Q value for each action executed by the virtual object in the terminal device according to the feature sequence and the current state feature. Future states and rewards can thus be inferred from the current partial observation during a game, enabling correct macro-action selection.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is an alternative application environment diagram of a data generation method provided by the embodiment of the present disclosure;
FIG. 2 is an alternative model architecture diagram of the data generation method provided by the embodiments of the present disclosure;
fig. 3 is an alternative flow chart of a data generation method provided by the embodiment of the present disclosure;
fig. 4 is a schematic diagram illustrating an alternative specific flowchart of step S102 in fig. 3;
fig. 5 is a schematic diagram illustrating an alternative specific flowchart of step S106 in fig. 3;
fig. 6 is a schematic diagram of an alternative specific flowchart of step S104 in fig. 3;
FIG. 7 is a schematic diagram of an alternative program module of the data generation system provided by the embodiments of the present disclosure;
fig. 8 is a schematic diagram of an alternative hardware architecture of a computer device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Partial noun interpretation
Virtual environment: the virtual environment displayed (or provided) when an application runs on the terminal. The virtual environment may be a simulation of the real world, a semi-simulated and semi-fictional three-dimensional environment, or a purely fictional three-dimensional environment. It may be any of a two-dimensional, 2.5-dimensional, or three-dimensional virtual environment. Optionally, the virtual environment is also used for battles between at least two virtual characters, with virtual resources available for their use. Optionally, the map of the virtual environment is a square or rectangle containing a lower-left diagonal region and an upper-right diagonal region that are symmetrical; the winning conditions of a battle in the virtual environment include occupying or destroying target sites of the enemy camp, which may be all of the enemy camp's sites or only some of them (such as a main base and a guard tower).
Virtual object: a movable object in the virtual environment. The movable object may be at least one of a virtual character, a virtual animal, and an animation character. Alternatively, when the virtual environment is three-dimensional, the virtual objects are three-dimensional models, each with its own shape and volume, occupying part of the space in the three-dimensional virtual environment. Taking the SC game as an example, the virtual object may belong to any of the Zerg, Terran, and Protoss races in SC; the embodiment of the present invention is described taking a Terran target virtual object as an example.
Reward value: the overall contribution of the scheduling policy and/or behavior of the virtual object to the winning condition. The contribution of the virtual object's behavior to the winning condition is the immediate benefit, and the contribution of its scheduling strategy is the long-term return. For example, when the virtual object defends in area A and its behavior is to attack a virtual animal, the contribution of the experience value gained from that attack to the winning condition is the immediate benefit of virtual object A; when the user controls the virtual object to move from area A to area B to engage a virtual object of the enemy camp in a local battle, the contribution of killing the enemy virtual object to the winning condition is the long-term return.
Fig. 1 is an alternative application environment diagram of the data generation method according to the embodiment of the present invention. In fig. 1, the server S communicates with a plurality of terminal devices P1 and P2 … Pn. The plurality of terminal devices P1 and P2 … Pn may be various electronic display devices such as a notebook computer, a desktop computer, a mobile phone, and a tablet computer. The server S receives the game data uploaded by the plurality of terminal devices P1 and P2 … Pn, trains and updates a model according to the game data, and then synchronizes the trained and updated model to the plurality of terminal devices P1 and P2 … Pn. RTS games are installed on the plurality of terminal devices P1 and P2 … Pn, the models synchronized by the server S are received, and new game data are generated by using the models, so that the server S can further train and update the models. It should be noted that the server S is provided with a playback buffer B, and the playback buffer B is provided with a plurality of segment playback buffers B1, B2, B3, B4 … bm, and is configured to store game data uploaded by the plurality of terminal devices P1, P2 … Pn. In the embodiment of the present invention, an SC game is exemplified.
The following describes an exemplary data generation method provided by the present invention, taking the game data of the terminal device P1 as an example, with reference to the accompanying drawings.
Fig. 3 is a schematic flow chart of an optional step of the data generating method of the present invention, and it should be understood that the flow chart in the embodiment of the method is not used to limit the order of executing the steps. The following description is exemplarily made with the server S as the execution subject.
As shown in fig. 3, the data generating method specifically includes steps S100 to S106.
Step S100: and acquiring game data sent by the terminal equipment.
Illustratively, in conjunction with fig. 1, the server S acquires game data transmitted by the plurality of terminal devices P1, P2 … Pn, the game data including state data of the terminal devices P1, P2 … Pn in game and non-game states, the actions performed, and the reward value obtained by performing each action.
Step S102: extracting a plurality of state features of the terminal device from the game data, the plurality of state features including state features of a plurality of times.
In an exemplary embodiment, as shown in FIG. 4, the step S102 includes steps S1020-S1022.
Step S1020: and storing the game data in a segment playback buffer corresponding to the terminal equipment.
Illustratively, the game data is stored in the segment playback buffer corresponding to its terminal device; the playback buffer includes a plurality of segment playback buffers. With reference to fig. 1, after game data are obtained from the plurality of terminal devices P1, P2 … Pn, each item is stored in the segment playback buffer assigned to its device, so that data segments are kept in first-in first-out order even though the devices generate data at different speeds, which facilitates data maintenance. For example, game data acquired from the terminal device P1 are stored in the segment playback buffer b1 corresponding to P1.
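A minimal sketch of such a per-terminal segment playback buffer is shown below, assuming a fixed per-segment capacity with first-in first-out eviction; the class and method names are illustrative, not from the patent.

```python
from collections import defaultdict, deque


class PlaybackBuffer:
    """Playback buffer B holding one FIFO segment buffer per terminal device.

    Each segment buffer keeps at most `capacity` game-data records; a
    terminal that produces data faster only evicts records from its own
    segment, keeping maintenance per-device.
    """

    def __init__(self, capacity=4):
        # deque(maxlen=...) silently discards the oldest record on overflow
        self.segments = defaultdict(lambda: deque(maxlen=capacity))

    def store(self, terminal_id, game_data):
        self.segments[terminal_id].append(game_data)

    def extract(self, terminal_id):
        # Return the buffered records for one terminal, oldest first.
        return list(self.segments[terminal_id])
```

For example, storing five records in a capacity-3 segment leaves only the three newest.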
Step S1022: and extracting a plurality of state characteristics of the terminal equipment from the segment playback buffer.
Illustratively, in conjunction with fig. 1, a plurality of status features of the terminal device P1 are extracted from the segment playback buffer b1 according to a preset rule. It should be noted that the plurality of state features are state features of the terminal device P1 after the virtual object in the terminal device P1 executes each action within a preset period of time. Each state feature corresponds to a time.
Step S104: and generating a corresponding characteristic sequence according to preset time and the state characteristic corresponding to the preset time.
Since the state features correspond to the preset time one by one, the preset time and the state features corresponding to the preset time form a feature sequence.
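The one-to-one pairing described above can be sketched as follows, under the assumption that the preset times and their extracted state features are available as a list and a mapping; the names are illustrative.

```python
def build_feature_sequence(times, features_by_time):
    """Pair each preset time with its state feature, ordered by time.

    `times` is the list of preset sampling times and `features_by_time`
    maps each time to the state feature extracted for it; the result is
    the time-ordered feature sequence fed to the model.
    """
    return [(t, features_by_time[t]) for t in sorted(times)]
```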
Step S106: and inputting the characteristic sequence into a preset model so that the model generates a Q value corresponding to each action executed by a virtual object in the terminal equipment according to the characteristic sequence and the current state characteristic of the terminal equipment. The model includes at least a first fully-connected layer, a second fully-connected layer, an LSTM network, and a third fully-connected layer.
In an exemplary embodiment, as shown in fig. 5, since each target state feature in the feature sequence has a certain time sequence relationship, the step S106 specifically includes steps S1060 to S1062, where:
step S1060: and inputting each state characteristic into the LSTM network according to the time sequence relation among the characteristic sequences, and acquiring competitive state information of other terminal equipment having competitive relation with the terminal equipment through the LSTM network.
It should be noted that the time step of the LSTM network is set to 30 seconds in the game to capture most of the macroscopic change information during the game.
Illustratively, the LSTM network operating principle may be as follows:
A forgetting gate receives the memory information and decides which part of the memory is retained and which is forgotten:
f_t = σ(W_xf·x_t + W_hf·H_{t-1} + b_f)
wherein the forgetting factor f_t ∈ [0,1] weights the cell state information C_{t-1} output at time t-1, and is used for determining whether the memory information learned at time t-1 (namely the converted cell state information output at time t-1) passes fully or partially.
An input gate selects the information to be memorized:
i_t = σ(W_xi·x_t + W_hi·H_{t-1} + b_i)
g_t = tanh(W_xg·x_t + W_hg·H_{t-1} + b_g)
wherein i_t ∈ [0,1] is the selection weight of the temporary cell state information g_t at time t. The term f_t ⊙ C_{t-1} represents the retained old information and i_t ⊙ g_t represents the new information; the cell state information C_t at time t is obtained from these two parts:
C_t = f_t ⊙ C_{t-1} + i_t ⊙ g_t
An output gate outputs the hidden state information H_t at time t:
o_t = σ(W_xo·x_t + W_ho·H_{t-1} + b_o)
H_t = o_t ⊙ tanh(C_t)
wherein o_t ∈ [0,1] is the selection weight of the cell state information at time t.
In addition, W_xf, W_hf, W_xg, W_hg, W_xi, W_hi, W_xo and W_ho are all weight parameters in the LSTM network; b_f, b_g, b_i and b_o are all bias terms in the LSTM network; these parameters are obtained by model training.
It should be noted that the above exemplary structure of the LSTM network is not used to limit the scope of the present invention.
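A single LSTM time step following the gate computations described above can be sketched in scalar form as follows; the dictionary keys and function names are illustrative, and the weights would in practice come from model training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w, b):
    """One LSTM time step for scalar features (illustrative sketch).

    f: forgetting gate, i: input gate, g: temporary cell state,
    o: output gate; returns the hidden state H_t and cell state C_t.
    `w` and `b` are dicts of the trained weights and biases.
    """
    f = sigmoid(w["xf"] * x_t + w["hf"] * h_prev + b["f"])   # forgetting gate
    i = sigmoid(w["xi"] * x_t + w["hi"] * h_prev + b["i"])   # input gate
    g = math.tanh(w["xg"] * x_t + w["hg"] * h_prev + b["g"])  # temporary state
    c = f * c_prev + i * g                                    # cell state C_t
    o = sigmoid(w["xo"] * x_t + w["ho"] * h_prev + b["o"])   # output gate
    h = o * math.tanh(c)                                      # hidden state H_t
    return h, c
```

With all weights and biases zero, every gate evaluates to 0.5 and the temporary state to 0, so the cell state is simply halved at each step.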
Step S1062: inputting the competition state information and the current state feature into the third full connection layer, and acquiring the Q value from the third full connection layer.
By the embodiment of the invention, the future state and the reward can be effectively inferred according to the current partial observation in the game process so as to select the correct macro action.
In an exemplary embodiment, as shown in fig. 6, the step S104 further includes steps S1040 to S1042, where:
step S1040: and sequentially passing the state features through the first full-connection layer and the second full-connection layer, and acquiring target state features corresponding to the preset time.
With reference to fig. 1 and fig. 2, fig. 2 is an alternative model architecture diagram of the data generation method according to the embodiment of the present invention. The model is an improved APE-X DQN model comprising an input layer, a first fully connected layer, a second fully connected layer, an LSTM network layer, a third fully connected layer, and a loss-function calculation layer. After the plurality of state features of the terminal device P1 are extracted from the game data according to the preset times, they are passed sequentially through the first and second fully connected layers, and the target state features corresponding to the preset times are obtained from the second fully connected layer.
It should be noted that the fully-connected layer is configured to perform weighted summation on the input features, and the loss function calculation layer is configured to train the model according to the Q value and the game data.
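The weighted summation performed by a fully connected layer can be sketched as follows; this is a minimal illustration with no activation function, and the parameter names are not from the patent.

```python
def fully_connected(inputs, weights, biases):
    """Weighted summation of the input features.

    Produces one output per weight row: each output is the dot product of
    its weight row with the inputs plus a bias term (illustrative sketch;
    no activation applied).
    """
    return [
        sum(w_i * x_i for w_i, x_i in zip(w_row, inputs)) + b
        for w_row, b in zip(weights, biases)
    ]
```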
Step S1042: and generating the feature sequence according to the preset times and the target state features corresponding to the preset times.
Illustratively, in conjunction with fig. 2, the target status feature obtained from the second fully-connected layer and the time corresponding to the target status feature constitute the feature sequence.
In an exemplary embodiment, referring to fig. 2, the loss function calculation layer performs calculation by inputting the Q value generated by the model and the actual reward value of the terminal device after performing the corresponding action in the state into the loss function, and trains the model according to the calculation result. It should be noted that the Q value reflects the reward of the virtual environment after a certain state and execution of the corresponding action.
Specifically, with reference to fig. 1, after the model generates a Q value of the terminal device P1 in each state, an action executed by a virtual object in the terminal device P1 is determined according to a preset search degree and the Q value, and after the virtual object executes the action, a corresponding bonus estimate value is obtained, and a loss value of the model is calculated according to the bonus estimate value and an actual bonus value of the terminal device after the terminal device executes the corresponding action in the state, so as to adjust parameter information of the model according to the loss value.
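The patent does not give the loss formula; a common DQN-style choice consistent with the description, a squared temporal-difference loss between the model's Q value and a target built from the actual reward, is sketched below as an assumption.

```python
def td_loss(q_value, reward, next_max_q, gamma=0.99):
    """Squared temporal-difference loss (assumed DQN-style form).

    `q_value` is the model's estimate for the executed action, `reward`
    the actual reward value obtained, `next_max_q` the maximum Q value in
    the next state, and `gamma` a discount factor.
    """
    target = reward + gamma * next_max_q
    return (q_value - target) ** 2
```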
The search degree indicates the reliability placed in executing the action corresponding to the maximum Q value. The higher the search degree, the lower the probability that the terminal device P1 executes the action with the maximum Q value, i.e. the less that action is trusted and the more the model explores alternative actions.
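Treating the search degree as an epsilon-greedy exploration rate is one way to realize this behavior; the sketch below makes that assumption, and the names are illustrative.

```python
import random


def select_action(q_values, search_degree, rng=random):
    """Pick an action index from Q values under a given search degree.

    With probability `search_degree` a random action is explored;
    otherwise the action with the maximum Q value is exploited. A higher
    search degree thus lowers the probability of the greedy action,
    matching the description above (epsilon-greedy is an assumption).
    """
    if rng.random() < search_degree:
        return rng.randrange(len(q_values))                          # explore
    return max(range(len(q_values)), key=q_values.__getitem__)       # exploit
```

With a search degree of 0 the greedy action is always chosen; with 1 the choice is uniformly random.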
Based on the data generation method provided in the foregoing embodiment, the present embodiment provides a data generation system, and the data generation system may be applied to the server S. In particular, FIG. 7 illustrates an alternative block diagram of the data generation system, which is divided into one or more program modules, stored in a storage medium, and executed by one or more processors to implement the present invention. The program module referred to in the present invention refers to a series of computer program instruction segments capable of performing specific functions, and is more suitable for describing the execution process of the data generation system in the storage medium than the program itself.
As shown in fig. 7, the data generation system specifically includes the following components:
the obtaining module 201 is configured to obtain game data sent by a terminal device.
Illustratively, in conjunction with fig. 1, the obtaining module 201 obtains game data sent by the plurality of terminal devices P1, P2 … Pn, the game data including state data of the terminal devices P1, P2 … Pn in game and non-game states, the actions performed, and the reward value obtained by performing each action.
An extracting module 202, configured to extract a plurality of status features of the terminal device from the game data, where the plurality of status features include status features of a plurality of times.
In an exemplary embodiment, the extracting module 202 is further configured to store the game data in a segment playback buffer corresponding to the terminal device.
Illustratively, the game data is stored in the segment playback buffer corresponding to its terminal device; the playback buffer includes a plurality of segment playback buffers. With reference to fig. 1, after game data are obtained from the plurality of terminal devices P1, P2 … Pn, each item is stored in the segment playback buffer assigned to its device, so that data segments are kept in first-in first-out order even though the devices generate data at different speeds, which facilitates data maintenance. For example, game data acquired from the terminal device P1 are stored in the segment playback buffer b1 corresponding to P1.
The extracting module 202 is further configured to extract a plurality of status features of the terminal device from the segment playback buffer.
Illustratively, in conjunction with fig. 1, the extracting module 202 extracts a plurality of status features of the terminal device P1 from the segment playback buffer b1 according to a preset rule. It should be noted that the plurality of state features are state features of the terminal device P1 after the virtual object in the terminal device P1 executes each action within a preset period of time. Each state feature corresponds to a time.
The generating module 203 is configured to generate a corresponding feature sequence according to preset time and a state feature corresponding to the preset time.
Since the state features correspond to the preset time one by one, the preset time and the state features corresponding to the preset time form a feature sequence.
The generating module 203 is further configured to input the feature sequence into a preset model, so that the model generates a Q value corresponding to each action executed by a virtual object in the terminal device according to the feature sequence and the current state feature of the terminal device. The model includes at least a first fully-connected layer, a second fully-connected layer, an LSTM network, and a third fully-connected layer.
In an exemplary embodiment, since each target status feature in the feature sequence has a certain time sequence relationship, the generating module 203 is further configured to input each status feature into the LSTM network according to the time sequence relationship between the feature sequences, and acquire, through the LSTM network, contention status information of other terminal devices having a contention relationship with the terminal device.
It should be noted that the time step of the LSTM network is set to 30 seconds in the game to capture most of the macroscopic change information during the game.
Illustratively, the LSTM network operating principle may be as follows:
A forget gate receives the memory information and decides which part of the memory is to be retained and which part forgotten:

f_t = σ(W_xf·x_t + W_hf·H_{t-1} + b_f), with f_t ∈ [0, 1]

where the forgetting factor f_t weights the target unit state information C_{t-1} output at time t-1, determining whether the memory information learned at time t-1 passes fully or only partially.

An input gate selects the information to be memorized:

i_t = σ(W_xi·x_t + W_hi·H_{t-1} + b_i), with i_t ∈ [0, 1]
g_t = tanh(W_xg·x_t + W_hg·H_{t-1} + b_g)

where i_t is the selection weight of the temporary cell state information g_t at time t. The term f_t ⊙ C_{t-1} represents the old memory after the information to be deleted has been removed, and i_t ⊙ g_t represents the new information; the cell state information at time t is obtained from these two parts:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ g_t

An output gate outputs the hidden state information H_t at time t:

o_t = σ(W_xo·x_t + W_ho·H_{t-1} + b_o), with o_t ∈ [0, 1]
H_t = o_t ⊙ tanh(C_t)

where o_t is the selection weight of the cell state information at time t.

In addition, W_xf, W_hf, W_xg, W_hg, W_xi, W_hi, W_xo, and W_ho are weight parameters in the LSTM network, and b_f, b_g, b_i, and b_o are bias terms in the LSTM network; these parameters are obtained by model training.
It should be noted that the above exemplary structure of the LSTM network is not used to limit the scope of the present invention.
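As an illustration only, the forget, input, and output gates described above can be sketched for scalar inputs; a real LSTM layer operates on vectors with weight matrices, and the numeric weights below are hypothetical:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, w):
    """One LSTM step mirroring the gate equations above (scalar form)."""
    f = sigmoid(w["xf"] * x + w["hf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["xi"] * x + w["hi"] * h_prev + w["bi"])    # input gate
    g = math.tanh(w["xg"] * x + w["hg"] * h_prev + w["bg"])  # temporary cell state
    c = f * c_prev + i * g                                   # new cell state C_t
    o = sigmoid(w["xo"] * x + w["ho"] * h_prev + w["bo"])    # output gate
    h = o * math.tanh(c)                                     # hidden state H_t
    return h, c

# Hypothetical weights, all set to 0.5 for illustration.
w = {k: 0.5 for k in ("xf", "hf", "bf", "xi", "hi", "bi",
                      "xg", "hg", "bg", "xo", "ho", "bo")}
h, c = 0.0, 0.0
for x in (1.0, 0.5, -0.2):   # a short feature sequence, fed in time order
    h, c = lstm_cell_step(x, h, c, w)
```

The hidden state h carries forward a summary of everything seen so far, which is what lets the network relate state features across the sequence.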
The generating module 203 is further configured to input the competition state information and the current state feature into the third fully-connected layer, and to obtain the Q value from the third fully-connected layer.
By the embodiment of the invention, the future state and the reward can be effectively inferred according to the current partial observation in the game process so as to select the correct macro action.
In an exemplary embodiment, the generating module 203 is further configured to pass the state feature through the first full connection layer and the second full connection layer in sequence, and obtain a target state feature corresponding to the preset time.
With reference to fig. 1 and fig. 2, fig. 2 is an alternative model architecture diagram of the data generation method according to the embodiment of the present invention. The model is an improved APE-X DQN model comprising an input end, a first fully-connected layer, a second fully-connected layer, an LSTM network layer, a third fully-connected layer, and a loss function calculation layer. After the plurality of state features of the terminal device P1 are extracted from the game data, the state features are passed, in the order of their preset times, through the first fully-connected layer and the second fully-connected layer, and a target state feature corresponding to each preset time is obtained from the second fully-connected layer.
It should be noted that the fully-connected layer is configured to perform weighted summation on the input features, and the loss function calculation layer is configured to train the model according to the Q value and the game data.
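The weighted summation performed by a fully-connected layer can be illustrated for a single output unit; the feature values and weights below are made up for the example:

```python
def fully_connected(features, weights, bias):
    """One output unit of a fully-connected layer: a weighted summation
    of the input features plus a bias term, as described above."""
    return sum(f * w for f, w in zip(features, weights)) + bias

out = fully_connected([1.0, 2.0], [0.5, 0.25], 0.1)
print(out)  # 1.0*0.5 + 2.0*0.25 + 0.1 = 1.1
```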
The generating module 203 is further configured to generate the feature sequence according to the preset time and the target state feature corresponding to the preset time.
Illustratively, in conjunction with fig. 2, the target status feature obtained from the second fully-connected layer and the time corresponding to the target status feature constitute the feature sequence.
In an exemplary embodiment, the model further includes a loss function calculation layer; referring to fig. 2, the Q value generated by the model and the actual reward value obtained after the terminal device performs the corresponding action in that state are input into the loss function for calculation, and the model is trained according to the calculation result. It should be noted that the Q value reflects the reward from the virtual environment after a certain state is reached and the corresponding action is executed.
Specifically, with reference to fig. 1, after the model generates the Q value of the terminal device P1 in each state, the action to be executed by the virtual object in the terminal device P1 is determined according to a preset exploration degree and the Q values. After the virtual object executes the action, a corresponding reward estimate is obtained, and the loss value of the model is calculated from the reward estimate and the actual reward obtained after the terminal device executes the corresponding action in that state, so that the parameter information of the model can be adjusted according to the loss value.
The exploration degree indicates how much the action corresponding to the maximum Q value is trusted: the higher the exploration degree, the lower the probability that the terminal device P1 executes the action corresponding to the maximum Q value.
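A minimal sketch of action selection under an exploration degree, together with a loss between the reward estimate and the actual reward. The epsilon-greedy interpretation of the exploration degree and the squared-loss form are assumptions consistent with the description above, not details taken from the patent:

```python
import random

def select_action(q_values, exploration_degree, rng):
    """Pick an action index from Q values under an exploration degree:
    with probability `exploration_degree`, explore at random; otherwise
    exploit the action with the maximum Q value."""
    if rng.random() < exploration_degree:
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

def squared_loss(reward_estimate, actual_reward):
    # Loss between the model's reward estimate and the actual reward,
    # used to adjust the model parameters.
    return (reward_estimate - actual_reward) ** 2

rng = random.Random(0)
action = select_action([0.1, 0.9, 0.3], exploration_degree=0.0, rng=rng)
print(action)  # zero exploration -> the argmax action, index 1
```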
The present embodiment also provides a computer device, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, or a tower server (including an independent server or a server cluster composed of multiple servers) capable of executing a program. As shown in fig. 8, the computer device 30 of this embodiment includes at least, but is not limited to, a memory 301 and a processor 302 communicatively coupled to each other via a system bus. It is noted that fig. 8 only shows the computer device 30 with components 301-302, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
In this embodiment, the memory 301 (i.e., the readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 301 may be an internal storage unit of the computer device 30, such as a hard disk or a memory of the computer device 30. In other embodiments, the memory 301 may also be an external storage device of the computer device 30, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 30. Of course, the memory 301 may also include both internal and external storage devices of the computer device 30. In this embodiment, the memory 301 is generally used for storing the operating system installed in the computer device 30 and various types of application software, such as the program code of a data generation system. In addition, the memory 301 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 302 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 302 generally serves to control the overall operation of the computer device 30.
Specifically, in this embodiment, the processor 302 is configured to execute a program of a data generation method stored in the memory 301, and the program of the data generation method, when executed, implements the following steps:
acquiring game data sent by terminal equipment;
extracting a plurality of state features of the terminal device from the game data, wherein the state features comprise state features of a plurality of times;
generating a corresponding characteristic sequence according to preset time and state characteristics corresponding to the preset time;
and inputting the characteristic sequence into a preset model so that the model generates a Q value corresponding to each action executed by a virtual object in the terminal equipment according to the characteristic sequence and the current state characteristic of the terminal equipment, wherein the model at least comprises a first full connection layer, a second full connection layer, an LSTM network and a third full connection layer.
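The four steps above can be sketched end-to-end; all weights are hypothetical placeholders, and the tiny scalar-state LSTM stands in for a real LSTM layer:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fc(vec, weights, biases):
    # Fully-connected layer: weighted summation of input features plus bias.
    return [sum(v * w for v, w in zip(vec, row)) + b
            for row, b in zip(weights, biases)]

def lstm(seq, wf, wi, wg, wo):
    # Tiny scalar-state LSTM summarising the sequence (illustrative only).
    h = c = 0.0
    for x in seq:
        f = sigmoid(wf[0] * x + wf[1] * h + wf[2])
        i = sigmoid(wi[0] * x + wi[1] * h + wi[2])
        g = math.tanh(wg[0] * x + wg[1] * h + wg[2])
        c = f * c + i * g
        o = sigmoid(wo[0] * x + wo[1] * h + wo[2])
        h = o * math.tanh(c)
    return h

def q_values(state_seq, current_state, params):
    """First and second fully-connected layers produce one target state
    feature per preset time; the LSTM consumes them in time order; the
    third fully-connected layer maps the LSTM output together with the
    current state feature to one Q value per action."""
    targets = [fc(fc(s, *params["fc1"]), *params["fc2"])[0] for s in state_seq]
    context = lstm(targets, *params["lstm"])
    return fc([context] + current_state, *params["fc3"])

# Hypothetical placeholder weights (not trained values).
params = {
    "fc1": ([[0.5, 0.5]], [0.0]),                              # 2 features -> 1
    "fc2": ([[1.0]], [0.0]),                                   # 1 -> 1
    "lstm": ((0.5, 0.5, 0.0),) * 4,                            # four gates
    "fc3": ([[1.0, 0.1, 0.1], [0.5, 0.2, 0.2]], [0.0, 0.0]),   # -> 2 actions
}
qs = q_values([[1.0, 0.0], [0.5, 0.5]], [0.2, 0.8], params)
print(len(qs))  # one Q value per action -> 2
```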
For the specific embodiment of the process of the above method steps, reference may be made to the above embodiments, and details of this embodiment are not repeated herein.
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App, or the like, having stored thereon a computer program that, when executed by a processor, implements the following method steps:
acquiring game data sent by terminal equipment;
extracting a plurality of state features of the terminal device from the game data, wherein the state features comprise state features of a plurality of times;
generating a corresponding characteristic sequence according to preset time and state characteristics corresponding to the preset time;
and inputting the characteristic sequence into a preset model so that the model generates a Q value corresponding to each action executed by a virtual object in the terminal equipment according to the characteristic sequence and the current state characteristic of the terminal equipment, wherein the model at least comprises a first full connection layer, a second full connection layer, an LSTM network and a third full connection layer.
For the specific embodiment of the process of the above method steps, reference may be made to the above embodiments, and details of this embodiment are not repeated herein.
By extracting state features of a plurality of times from the terminal device, generating the feature sequence from the state features at the preset times, and inputting the feature sequence into the preset model, the computer device and the readable storage medium provided by this embodiment enable the model to generate the Q value corresponding to each action executed by the virtual object in the terminal device according to the feature sequence and the current state features. In this way, the future state and reward can be inferred from the current partial observation during the game, and a correct macro action can be selected.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (8)
1. A method of data generation, the method comprising:
acquiring game data sent by terminal equipment;
extracting a plurality of state features of the terminal device from the game data, wherein the state features comprise state features of a plurality of times;
generating a corresponding characteristic sequence according to preset time and state characteristics corresponding to the preset time;
inputting the feature sequence into a preset model, so that the model generates a Q value corresponding to each action executed by a virtual object in the terminal device according to the feature sequence and the current state feature of the terminal device, wherein the Q value reflects a reward of a virtual environment after a certain state and the corresponding action are executed, and the model at least comprises a first full connection layer, a second full connection layer, an LSTM network and a third full connection layer;
inputting the feature sequence into a preset model so that the model generates a Q value corresponding to each action executed by a virtual object in the terminal device according to the feature sequence and the current state feature of the terminal device, including:
inputting each state characteristic into the LSTM network according to the time sequence relation among the characteristic sequences, and acquiring competition state information of other terminal equipment having competition relation with the terminal equipment through the LSTM network;
and inputting the competition state information and the current state characteristic into the third full connection layer, and acquiring the Q value from the full connection layer.
2. The data generation method of claim 1, wherein said extracting a plurality of state features of the terminal device from the game data comprises:
storing the game data in a segment playback buffer corresponding to the terminal device;
and extracting a plurality of state characteristics of the terminal equipment from the segment playback buffer.
3. The data generation method of claim 1, wherein the generating a corresponding feature sequence according to a preset time and a state feature corresponding to the preset time comprises:
sequentially passing the state features through the first full connection layer and the second full connection layer, and acquiring target state features corresponding to the preset time;
and generating the characteristic sequence according to the preset time and the target state characteristic corresponding to the preset time.
4. A data generation system, the system comprising:
the acquisition module is used for acquiring game data sent by the terminal equipment;
an extraction module, configured to extract a plurality of status features of the terminal device from the game data, where the plurality of status features include status features of a plurality of times;
the generating module is used for generating a corresponding characteristic sequence according to preset time and state characteristics corresponding to the preset time;
the generating module is further configured to input the feature sequence into a preset model, so that the model generates a Q value corresponding to each action executed by a virtual object in the terminal device according to the feature sequence and current state features of the terminal device, where the Q value reflects a reward of a virtual environment after a certain state and the corresponding action are executed, and the model at least includes a first fully-connected layer, a second fully-connected layer, an LSTM network, and a third fully-connected layer;
the generation module is further configured to:
inputting each state feature into the LSTM network according to the time sequence relation among the feature sequences, and acquiring competition state information of other terminal equipment having competition relation with the terminal equipment through the LSTM network;
and inputting the competition state information and the current state characteristic into the third full connection layer, and acquiring the Q value from the full connection layer.
5. The data generation system of claim 4, wherein the extraction module is further to:
storing the game data in a segment playback buffer corresponding to the terminal device;
and extracting a plurality of state characteristics of the terminal equipment from the segment playback buffer.
6. The data generation system of claim 4, wherein the generation module is further to:
sequentially enabling the state features to pass through the first full connection layer and the second full connection layer, and acquiring target state features corresponding to the preset time;
and generating the characteristic sequence according to the preset time and the target state characteristic corresponding to the preset time.
7. A computer device, the computer device comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the data generation method of any of claims 1 to 3 when executing the computer program.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the data generation method of any one of claims 1 to 3.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911023194.6A CN112712161B (en) | 2019-10-25 | 2019-10-25 | Data generation method and system |
US17/078,973 US11498002B2 (en) | 2019-10-25 | 2020-10-23 | Method and system of generating data and training a model, server, and terminal device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911023194.6A CN112712161B (en) | 2019-10-25 | 2019-10-25 | Data generation method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112712161A CN112712161A (en) | 2021-04-27 |
CN112712161B true CN112712161B (en) | 2023-02-24 |
Family
ID=75540680
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911023194.6A Active CN112712161B (en) | 2019-10-25 | 2019-10-25 | Data generation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112712161B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108970119A (en) * | 2018-07-16 | 2018-12-11 | 苏州大学 | The adaptive game system strategic planning method of difficulty |
CN109471712A (en) * | 2018-11-21 | 2019-03-15 | 腾讯科技(深圳)有限公司 | Dispatching method, device and the equipment of virtual objects in virtual environment |
CN109529352A (en) * | 2018-11-27 | 2019-03-29 | 腾讯科技(深圳)有限公司 | The appraisal procedure of scheduling strategy, device and equipment in virtual environment |
CN110135951A (en) * | 2019-05-15 | 2019-08-16 | 网易(杭州)网络有限公司 | Recommended method, device and the readable storage medium storing program for executing of game commodity |
CN110327624A (en) * | 2019-07-03 | 2019-10-15 | 广州多益网络股份有限公司 | A kind of game follower method and system based on course intensified learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060121982A1 (en) * | 1992-03-06 | 2006-06-08 | Arachnid, Inc. | Parlor game |
US20120200035A1 (en) * | 2011-02-07 | 2012-08-09 | Benedict Iii Milner | Foldable-type game board for strategic word pattern engagement |
Also Published As
Publication number | Publication date |
---|---|
CN112712161A (en) | 2021-04-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112169339B (en) | Customized model for simulating player play in video games | |
CN109471712B (en) | Scheduling method, device and equipment of virtual object in virtual environment | |
CN108888958B (en) | Virtual object control method, device, equipment and storage medium in virtual scene | |
Treanor et al. | Game-o-matic: Generating videogames that represent ideas | |
KR20180044191A (en) | Multiplayer video game matchmaking system and methods | |
CN109529352B (en) | Method, device and equipment for evaluating scheduling policy in virtual environment | |
CN110170171B (en) | Target object control method and device | |
CN111738294B (en) | AI model training method, AI model using method, computer device, and storage medium | |
CN112927332B (en) | Bone animation updating method, device, equipment and storage medium | |
CN103577704A (en) | Event handling method and device through NPC in game system | |
WO2023024762A1 (en) | Artificial intelligence object control method and apparatus, device, and storage medium | |
CN116747521B (en) | Method, device, equipment and storage medium for controlling intelligent agent to conduct office | |
US11498002B2 (en) | Method and system of generating data and training a model, server, and terminal device | |
Salge et al. | Generative design in Minecraft: Chronicle challenge | |
CN114307152A (en) | Virtual scene display method and device, electronic equipment and storage medium | |
CN112712179A (en) | Model training method, server and terminal equipment | |
CN112712161B (en) | Data generation method and system | |
CN113230650A (en) | Data processing method and device and computer readable storage medium | |
CN111330282A (en) | Method and device for determining card-playing candidate items | |
CN113034651B (en) | Playing method, device, equipment and storage medium of interactive animation | |
CN114359469A (en) | Method, apparatus, device and medium for generating main control object projection | |
CN113946604A (en) | Staged go teaching method and device, electronic equipment and storage medium | |
CN111939565A (en) | Virtual scene display method, system, device, equipment and storage medium | |
CN113457145B (en) | Weather control method, device, equipment and storage medium for game scene | |
CN109999503B (en) | Game interaction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||