CN113663335A - AI model training method, device, equipment and storage medium for FPS game - Google Patents

AI model training method, device, equipment and storage medium for FPS game

Info

Publication number
CN113663335A
Authority
CN
China
Prior art keywords
virtual character
reward
moving direction
road condition
time step
Prior art date
Legal status
Granted
Application number
CN202110800433.5A
Other languages
Chinese (zh)
Other versions
CN113663335B (en)
Inventor
刘舟
徐键滨
吴梓辉
徐雅
王理平
Current Assignee
Guangzhou Sanqi Jiyao Network Technology Co ltd
Original Assignee
Guangzhou Sanqi Jiyao Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Sanqi Jiyao Network Technology Co ltd filed Critical Guangzhou Sanqi Jiyao Network Technology Co ltd
Priority to CN202110800433.5A
Publication of CN113663335A
Application granted
Publication of CN113663335B
Active legal status
Anticipated expiration

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55 Controlling game characters or game objects based on the game progress
    • A63F13/57 Simulating properties, behaviour or motion of objects in the game world, e.g. computing tyre load in a car race game
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/825 Fostering virtual characters
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/837 Shooting of targets

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)
  • Instructional Devices (AREA)

Abstract

The invention discloses an AI model training method for an FPS game, which comprises the following steps: acquiring a first state action value of a virtual character at the current time step and a second state action value of the virtual character at the next time step based on an AI model of the FPS game; calculating the value difference between the first state action value and the second state action value as a predicted reward; calculating a loss function based on the predicted reward and an actual reward, where the actual reward is calculated through a preset reward mechanism that determines the road condition type of the virtual character in each preset moving direction and calculates the actual reward according to the road condition type and the actual moving direction of the virtual character; and optimizing the model according to the loss function until the loss function converges. With the embodiments of the invention, effective situation analysis can be performed for complex game scenes, so that the virtual character behavior output by the trained AI model more closely simulates real human behavior, improving the user's game experience.

Description

AI model training method, device, equipment and storage medium for FPS game
Technical Field
The present invention relates to a model training method, and more particularly, to an AI model training method, device, apparatus, and storage medium for an FPS game.
Background
With the gradual development of the e-sports industry, first-person shooter (FPS) games are receiving more and more attention. For an FPS game, in order to ensure a good user experience during play, an AI model often needs to be constructed and trained so that it can be applied effectively to the game. Existing game models are generally only able to handle relatively simple game scenarios and provide game strategies that follow basic game rules. The actions a virtual character takes in a virtual scene according to a game strategy give the user a sense of immersion and realism that depends on how reasonable those actions are; for a virtual character portraying a human, one would like its behavior to resemble that of a real human as closely as possible. When faced with a complex game scene or complex game rules, such as obstacle judgment, a game model has difficulty performing effective situation analysis and, naturally, has difficulty producing an accurate obstacle judgment result. The virtual character may therefore be misled into moving in a direction blocked by an obstacle, so that the behavior output by the AI model departs far from that of a real human, the user cannot become well immersed in the game environment, and the user's game experience suffers.
Disclosure of Invention
The embodiments of the invention aim to provide an AI model training method, device, equipment and storage medium for an FPS game that can perform effective situation analysis for complex game scenes, so that the virtual character behavior output by the trained AI model more closely simulates real human behavior and the user's game experience is improved.
In order to achieve the above object, an embodiment of the present invention provides an AI model training method for an FPS game, including:
acquiring a first state action value of a virtual character at a current time step and a second state action value of the virtual character at a next time step based on an AI model of the FPS game;
calculating a value difference between the first state action value and the second state action value as a prediction reward;
calculating a loss function of an AI model of the FPS game according to the predicted reward and the actual reward; the actual reward is calculated through a preset reward mechanism; the reward mechanism includes: determining the road condition type of the virtual character in a preset moving direction, and calculating an actual reward according to the road condition type and the actual moving direction of the virtual character;
and optimizing the AI model of the FPS game according to the loss function until the loss function is converged.
As an improvement of the above scheme, determining the road condition type of the virtual character in the preset moving direction includes:
emitting two parallel rays to the moving direction according to the current position of the virtual character; the height difference of the two parallel rays meets a preset height difference threshold value;
detecting the ray detection condition of the two parallel rays at the current frame of the current time step;
determining the detection condition of the obstacle in the moving direction according to the ray detection condition;
and determining the road condition type of the virtual character in the moving direction according to the obstacle detection condition.
As an improvement of the above, determining the obstacle detection situation in the moving direction based on the ray detection situation includes:
when the ray detection condition is that no ray return distance value is detected in the current frame of the current time step, determining that no obstacle is detected in the moving direction by the ray;
and when the ray detection condition is that a ray return distance value is detected in the current frame of the current time step, determining that the ray detects the obstacle in the moving direction.
As an improvement of the above scheme, the road condition type includes at least one of an unobstructed passable road condition, a slope passable road condition, an obstructed passable road condition and an obstructed nonpassable road condition.
As an improvement of the above scheme, calculating an actual reward according to the road condition type and the actual moving direction of the virtual character includes:
when the actual moving direction of the virtual character moves towards the accessible road condition without obstacles or the accessible road condition on a slope, giving positive reward;
when the actual moving direction of the virtual character moves towards the road condition with obstacles and without passing, giving negative reward;
when the actual moving direction of the virtual character moves towards the road condition with obstacles and the virtual character is detected to take jumping action, positive reward is given;
and when the actual moving direction of the virtual character is towards the road condition that the virtual character has obstacles and can pass, and the virtual character is detected not to take the jumping action, giving a negative reward.
As an improvement of the above solution, the AI model of the FPS game includes a first network and a second network; then, obtaining a first state action value of the virtual character at the current time step and a second state action value at the next time step based on the AI model of the FPS game includes:
acquiring first state information and first action information of the virtual character at the current time step;
inputting the first state information and the first action information into the first network to obtain a first state action value of the virtual character at the current time step;
acquiring second state information of the virtual character at the next time step;
inputting the second state information into the second network to obtain second action information;
and inputting the second state information and the second action information into the first network to obtain a second state action value of the virtual character at the next time step.
In order to achieve the above object, an embodiment of the present invention further provides an AI model training apparatus for an FPS game, including:
the data acquisition unit is used for acquiring a first state action value of the virtual character at the current time step and a second state action value at the next time step based on an AI model of the FPS game;
a training unit to:
calculating a value difference between the first state action value and the second state action value as a prediction reward;
calculating a loss function of an AI model of the FPS game according to the predicted reward and the actual reward; the actual reward is calculated through a preset reward mechanism; the reward mechanism includes: determining the road condition type of the virtual character in a preset moving direction, and calculating an actual reward according to the road condition type and the actual moving direction of the virtual character;
and optimizing the AI model of the FPS game according to the loss function until the loss function is converged.
As an improvement of the above scheme, determining the road condition type of the virtual character in the preset moving direction includes:
emitting two parallel rays to the moving direction according to the current position of the virtual character; the height difference of the two parallel rays meets a preset height difference threshold value;
detecting the ray detection condition of the two parallel rays at the current frame of the current time step;
determining the detection condition of the obstacle in the moving direction according to the ray detection condition;
and determining the road condition type of the virtual character in the moving direction according to the obstacle detection condition.
To achieve the above object, an embodiment of the present invention further provides an AI model training apparatus for an FPS game, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements the AI model training method for the FPS game according to any one of the above embodiments when executing the computer program.
In order to achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where when the computer program runs, the apparatus where the computer-readable storage medium is located is controlled to execute the AI model training method for an FPS game according to any one of the above embodiments.
Compared with the prior art, the method, the device, the equipment and the storage medium for training the AI model of the FPS game provided by the embodiment of the invention have the advantages that in the process of training the AI model of the FPS game, the road condition type of the virtual character in the preset moving direction is determined, and the actual reward is calculated according to the road condition type and the actual moving direction of the virtual character. By calculating the road condition type of the virtual character in the moving direction, calculating the actual reward according to the road condition type, and setting the reward mechanism of positive reward and negative reward, effective situation analysis can be carried out aiming at complex game scenes, so that the virtual character behavior output by the trained AI model can better simulate real human behavior, and the game experience of a user is improved.
Drawings
FIG. 1 is a flowchart of an AI model training method for FPS game according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a barrier-free passable road condition according to an embodiment of the present invention;
fig. 3 is a schematic view of a passable road condition on a slope according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a trafficable road condition with obstacles according to an embodiment of the invention;
fig. 5 is a schematic diagram illustrating a road condition with obstacles and without passing through according to an embodiment of the present invention;
fig. 6 is a block diagram illustrating an AI model training apparatus for an FPS game according to an embodiment of the present invention;
fig. 7 is a block diagram illustrating an AI model training apparatus for an FPS game according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of an AI model training method for an FPS game according to an embodiment of the present invention, where the AI model training method for the FPS game includes:
s1, acquiring a first state action value of the virtual character at the current time step and a second state action value at the next time step based on the AI model of the FPS game;
S2, calculating the value difference between the first state action value and the second state action value as a prediction reward;
S3, calculating a loss function of the AI model of the FPS game according to the predicted reward and the actual reward, the actual reward being calculated through a preset reward mechanism;
and S4, optimizing the AI model of the FPS game according to the loss function until the loss function is converged.
Specifically, step S1 of obtaining a first state action value of the virtual character at the current time step and a second state action value of the virtual character at the next time step based on the AI model of the FPS game includes steps S11 to S15:
S11, acquiring first state information and first action information of the virtual character at the current time step;
S12, inputting the first state information and the first action information into the first network to obtain a first state action value of the virtual character at the current time step;
S13, acquiring second state information of the virtual character at the next time step;
S14, inputting the second state information into the second network to obtain second action information;
S15, inputting the second state information and the second action information into the first network to obtain a second state action value of the virtual character at the next time step.
It is worth noting that the action information refers to actions taken by the virtual character in the environment, while the state information is the result of taking that action as reflected in the state of the game. For example, if the action information is shooting, the corresponding state information may be that the enemy loses health or dies; as another example, if the action information is jumping and the character jumps onto a box, the corresponding state information is that the character's height has increased by the height of the box. In the embodiments of the invention, reading a state from the game environment every n frames is referred to as a time step; it can be understood that the current time step and the next time step are two consecutive time steps.
The AI model of the FPS game is a DDPG model, which has two networks: a critic network and an actor network. Illustratively, the first network is the critic network and the second network is the actor network. The detailed structures of the critic network and the actor network can be found in the prior art and are not described again here.
Further, the first state information, the first action information and the second state information are obtained in advance through interaction with the game environment. First, the first state information of the virtual character at the current time step is obtained; the first state information is then input into the actor network to obtain the first action information, and the action actually sent into the game environment is selected according to an epsilon-greedy algorithm (choosing between a random action and the predicted action: when a drawn random number is greater than epsilon the random action is selected, otherwise the predicted action is selected). The action actually taken by the virtual character is sent into the game environment to obtain the second state information of the virtual character at the next time step, and the process repeats. The first state information, the first action information and the second state information obtained from the actual game environment in this way serve as sample data input to the AI model of the FPS game. During this process, the actual reward is calculated from the first state information, the first action information and the second state information according to the reward mechanism.
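For illustration only, the following Python sketch shows how one such transition could be collected. The `env` and `actor` objects, their method names and the coded epsilon-greedy rule are assumptions made for the example, not an implementation disclosed by the patent.

```python
import random


def collect_transition(env, actor, epsilon):
    """Advance the game by one time step and return (s1, a1, r, s2).

    `env` is an assumed wrapper around the game: `env.get_state()` reads the
    current state, and `env.step(action)` advances n frames and returns the
    next state together with the actual reward from the reward mechanism.
    """
    s1 = env.get_state()                  # first state information
    predicted_action = actor.predict(s1)  # prediction action from the actor network
    # Epsilon-greedy selection as stated above: a drawn random number greater
    # than epsilon picks a random action, otherwise the predicted action.
    if random.random() > epsilon:
        a1 = env.sample_random_action()
    else:
        a1 = predicted_action
    s2, r = env.step(a1)                  # second state information and actual reward
    return s1, a1, r, s2
```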
Exemplarily, the first state information and the first action information are input into the critic network to obtain the first state action value at the current time step; then the second state information at the next time step is input into the actor network to obtain the predicted action at the next time step as the second action information, and the second state information and the second action information are input into the critic network to obtain the second state action value at the next time step.
Specifically, in step S2, the difference between the first state action value and the second state action value is computed to obtain the value difference.
It should be noted that a state action value refers to the expected return for a given state and action; for example, the return is the accumulated game reward, and the expected return is the average return over many plays of the game. The first network is a neural network whose inputs are a state and an action and whose output is the value of that state-action pair; its parameters are adjusted through backpropagation of the loss function so that its output approaches the true value.
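A minimal PyTorch sketch of steps S1 and S2 follows. The layer sizes, activations and tensor dimensions are illustrative assumptions, not the architecture disclosed by the patent; only the flow of inputs and outputs mirrors the description above.

```python
import torch
import torch.nn as nn


class Critic(nn.Module):
    """Toy critic: maps a (state, action) pair to a scalar state-action value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


class Actor(nn.Module):
    """Toy actor: maps a state to an action vector."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, action_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)


# s1, a1: state and action at the current time step; s2: state at the next time step.
critic, actor = Critic(32, 8), Actor(32, 8)
s1, a1, s2 = torch.randn(1, 32), torch.randn(1, 8), torch.randn(1, 32)

q1 = critic(s1, a1)          # first state action value (current time step)
a2 = actor(s2)               # second action information (predicted next action)
q2 = critic(s2, a2)          # second state action value (next time step)
predicted_reward = q1 - q2   # value difference used as the prediction reward (step S2)
```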
Specifically, in step S3, the actual reward is calculated through the preset reward mechanism. Calculating the loss function of the AI model of the FPS game according to the predicted reward and the actual reward comprises steps S31-S32:
S31, calculating a loss function of the first network according to the predicted reward and the actual reward;
S32, obtaining the predicted value of the first network to calculate the loss function of the second network.
Illustratively, the sum of squared differences between the predicted reward and the actual reward is the loss function of the critic network, and the loss function of the actor network is the predicted value of the critic network. Specifically, in the embodiments of the invention, the actor network and the critic network are each divided into an evaluation (eval) network and a target network; the target network does not participate in training, and its parameters are periodically copied from the eval network.
Specifically, the critic network loss function satisfies the following formula:

$$L = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^{2}$$

where N is the total amount of sample data; $y_i = r_i + \gamma\, Q'\bigl(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\bigr)$; $r_i$ is the reward at the i-th time step; $\gamma$ is a hyperparameter, the discount factor; $Q(s, a \mid \theta^{Q})$ is the critic network with parameters $\theta^{Q}$; $\mu(s \mid \theta^{\mu})$ is the actor network with parameters $\theta^{\mu}$; $Q'$ is the target network of the critic network; and $\mu'$ is the target network of the actor network. The target network parameters are periodically updated from the Q network and the $\mu$ network and do not participate in training, and the actor network updates its parameters by maximizing the predicted value of the critic network.
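As a concrete reading of this formula, the sketch below computes y_i and the critic loss in PyTorch, reusing the toy critic/actor modules and tensors from the earlier sketch. The target networks are created here as plain deep copies, and the reward tensor is a placeholder value; both are assumptions for the example.

```python
import copy

import torch
import torch.nn.functional as F

# Target networks Q' and mu': copies of the eval networks, excluded from training.
critic_target = copy.deepcopy(critic)
actor_target = copy.deepcopy(actor)

gamma = 0.99            # discount factor (hyperparameter; value assumed)
r = torch.zeros(1, 1)   # actual reward r_i from the reward mechanism (placeholder)

with torch.no_grad():
    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1} | theta_mu') | theta_Q')
    y = r + gamma * critic_target(s2, actor_target(s2))

# L = (1/N) * sum_i (y_i - Q(s_i, a_i | theta_Q))^2, i.e. mean squared error over the batch.
critic_loss = F.mse_loss(critic(s1, a1), y)
```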
Optionally, the reward mechanism comprises: and determining the road condition type of the virtual character in a preset moving direction, and calculating the actual reward according to the road condition type and the actual moving direction of the virtual character.
Specifically, determining the road condition type of the virtual character in the preset moving direction includes steps S311 to S314:
s311, emitting two parallel rays to the moving direction according to the current position of the virtual character; the height difference of the two parallel rays meets a preset height difference threshold value;
S312, detecting the ray detection condition of the two parallel rays at the current frame of the current time step;
S313, determining the detection condition of the obstacle in the moving direction according to the ray detection condition;
S314, determining the road condition type of the virtual character in the moving direction according to the obstacle detection condition.
Illustratively, the moving direction is one of the 8 directions in which the virtual character can move. The two parallel rays are both linear rays or both spherical rays; a spherical ray casts a sphere toward the target direction and uses the sphere for detection. The height difference threshold is the height of the virtual character's soles above the ground after the character performs a jump. The ray detection condition of the two parallel rays is checked at the current frame of the current time step (i.e., at every time step), and the length the rays extend is determined by the number of frames between two time steps: for example, the more frames between time steps, the further the rays extend and the longer the detectable distance, so the range over which the virtual character's road condition is determined can be set via the number of frames between two time steps. When the ray detection condition is that no ray return distance value is detected in the current frame of the current time step, it can be determined that the ray has not detected an obstacle in the moving direction; when a ray return distance value is detected in the current frame of the current time step, it can be determined that the ray has detected an obstacle in the moving direction.
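The two-parallel-ray probe can be sketched as follows. The `raycast` callable stands in for whichever linear or spherical ray query the game engine actually provides, and the coordinate convention (y is up) and all names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class RayHit:
    distance: float  # distance value returned when the ray detects something


def probe_direction(
    raycast: Callable[[Vec3, Vec3, float], Optional[RayHit]],
    char_pos: Vec3,
    direction: Vec3,
    jump_height: float,
    ray_length: float,
) -> Tuple[Optional[RayHit], Optional[RayHit]]:
    """Fire two parallel rays toward one of the 8 movement directions.

    `raycast` is assumed to return None when no distance value comes back,
    meaning that ray detected no obstacle at the current frame of the current
    time step. The vertical gap between the two origins is the height-difference
    threshold: the height of the character's soles above the ground after a jump.
    """
    low_origin = char_pos                                                # second ray L2
    high_origin = (char_pos[0], char_pos[1] + jump_height, char_pos[2])  # first ray L1
    low_hit = raycast(low_origin, direction, ray_length)
    high_hit = raycast(high_origin, direction, ray_length)
    return high_hit, low_hit
```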
Further, the road condition type includes at least one of an unobstructed passable road condition, a slope passable road condition, an obstructed passable road condition and an obstructed nonpassable road condition.
Referring to fig. 2, fig. 2 is a schematic diagram of the unobstructed passable road condition according to an embodiment of the present invention. The virtual character A emits a group of parallel linear rays, a first ray L1 and a second ray L2, where the first ray L1 is higher than the second ray L2. When neither the first ray L1 nor the second ray L2 returns a distance value at the current frame of the current time step, it can be determined that there is no obstacle in the moving direction, and the road condition type of the virtual character in the moving direction is the unobstructed passable road condition.
Referring to fig. 3, fig. 3 is a schematic diagram of the slope passable road condition according to an embodiment of the present invention. The virtual character A emits the parallel first ray L1 and second ray L2 from its position. If an obstacle is detected in the current frame of the current time step, it must be further determined whether the object S detected by the first ray L1 and the second ray L2 is a slope or an obstacle, which can be done as follows: obtain the first distance from the target object to the virtual character returned by the first ray L1 upon detecting the target object, and the second distance from the target object to the virtual character returned by the second ray L2 upon detecting the target object; calculate the horizontal distance between the first distance and the second distance in the horizontal direction; and compare the horizontal distance with a preset distance threshold. When the horizontal distance is greater than the distance threshold, the target object is determined to be a slope; when the horizontal distance is less than or equal to the distance threshold, the target object is determined to be an obstacle. The horizontal distance La shown in fig. 3 is greater than the distance threshold, which indicates that there is a slope in the moving direction, so the road condition type of the virtual character in the moving direction can be determined to be the slope passable road condition.
Referring to fig. 4, fig. 4 is a schematic diagram of a passable road condition with an obstacle according to an embodiment of the present invention, where a virtual character a emits a first ray L1 and a second ray L2 that are parallel at a position where the virtual character a is located, and if only a distance value returned by the first ray L1 is detected in a current frame of a current time step, it may be determined that an obstacle exists in the moving direction, but since a distance value returned by the second ray L2 is not detected, it may be indicated that the height of the obstacle is relatively low, the road condition type is a passable road condition with an obstacle, and the virtual character may pass through the obstacle by jumping.
Referring to fig. 5, fig. 5 is a schematic diagram of the road condition with obstacles that cannot be passed. The virtual character A emits the parallel first ray L1 and second ray L2 from its position. If an obstacle is detected in the current frame of the current time step, it must be further determined whether the object detected by the first ray L1 and the second ray L2 is a slope or an obstacle, in the same manner as for the slope road condition, which is not repeated here. The horizontal distance Lb shown in fig. 5 is smaller than the distance threshold, which indicates that there is an obstacle in the moving direction, so the road condition type of the virtual character in the moving direction can be determined to be the road condition with obstacles that cannot be passed.
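Continuing the ray sketch above, the slope/obstacle discrimination and the mapping onto the four road condition types of figs. 2-5 might be condensed as below. The string labels, the handling of the single-ray-hit case and the threshold comparison are interpretations for illustration rather than literal patent text.

```python
def classify_surface(high_hit: RayHit, low_hit: RayHit, distance_threshold: float) -> str:
    """Decide whether an object hit by both rays is a slope or an obstacle.

    Both rays travel horizontally, so the gap between the distances they return
    is the horizontal offset between their hit points on the object.
    """
    horizontal_gap = abs(high_hit.distance - low_hit.distance)
    return "slope" if horizontal_gap > distance_threshold else "obstacle"


def road_condition(high_hit, low_hit, distance_threshold: float) -> str:
    """Map the two-ray result onto the four road condition types."""
    if high_hit is None and low_hit is None:
        return "unobstructed_passable"    # fig. 2: neither ray returns a distance
    if (high_hit is None) != (low_hit is None):
        return "obstructed_passable"      # fig. 4: only one ray hits a low, jumpable obstacle
    if classify_surface(high_hit, low_hit, distance_threshold) == "slope":
        return "slope_passable"           # fig. 3: horizontal gap above the threshold
    return "obstructed_impassable"        # fig. 5: horizontal gap at or below the threshold
```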
Optionally, calculating an actual reward according to the road condition type and the actual moving direction of the virtual character, including steps S321 to S324:
S321, when the actual moving direction of the virtual character is towards the accessible road condition without obstacles or the accessible road condition on a slope, giving a positive reward;
S322, when the actual moving direction of the virtual character is towards the road condition with obstacles that cannot be passed, giving a negative reward;
S323, when the actual moving direction of the virtual character is towards the road condition with obstacles that can be passed, and the virtual character is detected to take a jumping action, giving a positive reward;
and S324, when the actual moving direction of the virtual character is towards the road condition with obstacles and the virtual character does not take the jumping action, giving a negative reward.
Illustratively, in order for the virtual character behavior output by the trained AI model to better simulate real human behavior, positive and negative rewards are set so that the virtual character is driven to imitate real human behavior in pursuit of more positive reward, allowing the user to become better immersed in the game environment and improving the user's game experience. When the actual moving direction of the virtual character is toward an unobstructed passable road condition or a slope passable road condition, a positive reward is given. When the actual moving direction of the virtual character is toward a road condition that has obstacles and cannot be passed, a negative reward is given to keep the virtual character from heading toward places it cannot pass. When the actual moving direction of the virtual character is toward a road condition that has obstacles but can be passed, and the virtual character is detected to have taken a jump action, the virtual character is attempting to cross the obstacle by jumping, and a positive reward is given. When the actual moving direction of the virtual character is toward a road condition that has obstacles but can be passed, and the virtual character is detected not to have taken a jump action, the virtual character has failed to jump over an obstacle that can be cleared by jumping, and a negative reward is given to discourage this behavior from recurring.
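The four rules can be condensed into a single reward function; the sketch below uses +1 and -1 as placeholder magnitudes, since the patent does not specify the reward values.

```python
def movement_reward(road_condition_type: str, jumped: bool,
                    positive: float = 1.0, negative: float = -1.0) -> float:
    """Actual reward for one time step under the four rules above.

    `road_condition_type` is the road condition detected in the character's
    actual moving direction; `jumped` indicates whether a jump action was taken.
    """
    if road_condition_type in ("unobstructed_passable", "slope_passable"):
        return positive
    if road_condition_type == "obstructed_impassable":
        return negative
    # Obstructed but passable: reward jumping over the obstacle, penalize not jumping.
    return positive if jumped else negative
```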
Specifically, in step S4, each model training iteration performs two optimizations: the critic network is first optimized according to the first state information, the first action information and the second state information in combination with the critic network's loss function, and the actor network is then optimized according to the first state information in combination with the actor network's loss function (that is, the predicted value of the critic network, where the critic network's inputs are the state information of the current time step and the predicted action for the current time step).
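A sketch of these two optimization passes, continuing the earlier PyTorch sketches (critic, actor, the target copies and critic_loss), is given below; the optimizers, learning rates and copy interval are assumptions, not values from the patent.

```python
import torch

critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # assumed learning rate
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # assumed learning rate
step, copy_interval = 0, 100                                  # assumed copy schedule

# First optimization: update the critic with the MSE loss computed above.
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# Second optimization: update the actor by maximizing the critic's predicted value
# for the current state and the actor's own predicted action (minimize its negative).
actor_loss = -critic(s1, actor(s1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# The target networks do not participate in training; their parameters are
# periodically copied from the eval networks.
if step % copy_interval == 0:
    critic_target.load_state_dict(critic.state_dict())
    actor_target.load_state_dict(actor.state_dict())
```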
Compared with the prior art, in the training method for the AI model of the FPS game, the actual reward is calculated according to the road condition type and the actual moving direction of the virtual character by determining the road condition type of the virtual character in the preset moving direction in the training process of the AI model of the FPS game. By calculating the road condition type of the virtual character in the moving direction, calculating the actual reward according to the road condition type, and setting the reward mechanism of positive reward and negative reward, effective situation analysis can be carried out aiming at complex game scenes, so that the virtual character behavior output by the trained AI model can better simulate real human behavior, and the game experience of a user is improved.
Referring to fig. 6, fig. 6 is a block diagram illustrating a structure of an AI model training apparatus 10 for an FPS game according to an embodiment of the present invention, where the AI model training apparatus 10 for an FPS game includes:
a data acquisition unit 11 for acquiring a first state action value of the virtual character at a current time step and a second state action value at a next time step based on an AI model of the FPS game;
a training unit 12 for:
calculating a value difference between the first state action value and the second state action value as a prediction reward;
calculating a loss function of an AI model of the FPS game according to the predicted reward and the actual reward; the actual reward is calculated through a preset reward mechanism; the reward mechanism includes: determining the road condition type of the virtual character in a preset moving direction, and calculating an actual reward according to the road condition type and the actual moving direction of the virtual character;
and optimizing the AI model of the FPS game according to the loss function until the loss function is converged.
Specifically, the AI model of the FPS game includes a first network and a second network, and the data acquisition unit 11 is configured to:
acquire first state information and first action information of the virtual character at the current time step;
input the first state information and the first action information into the first network to obtain a first state action value of the virtual character at the current time step;
acquire second state information of the virtual character at the next time step;
input the second state information into the second network to obtain second action information;
and input the second state information and the second action information into the first network to obtain a second state action value of the virtual character at the next time step.
It is worth noting that the action information refers to actions taken by the virtual character in the environment, while the state information is the result of taking that action as reflected in the state of the game. For example, if the action information is shooting, the corresponding state information may be that the enemy loses health or dies; as another example, if the action information is jumping and the character jumps onto a box, the corresponding state information is that the character's height has increased by the height of the box. In the embodiments of the invention, reading a state from the game environment every n frames is referred to as a time step; it can be understood that the current time step and the next time step are two consecutive time steps.
The AI model of the FPS game is a DDPG model, which has two networks: a critic network and an actor network. Illustratively, the first network is the critic network and the second network is the actor network. The detailed structures of the critic network and the actor network can be found in the prior art and are not described again here.
Further, the first state information, the first action information and the second state information are obtained in advance through interaction with the game environment. First, the first state information of the virtual character at the current time step is obtained; the first state information is then input into the actor network to obtain the first action information, and the action actually sent into the game environment is selected according to an epsilon-greedy algorithm (choosing between a random action and the predicted action: when a drawn random number is greater than epsilon the random action is selected, otherwise the predicted action is selected). The action actually taken by the virtual character is sent into the game environment to obtain the second state information of the virtual character at the next time step, and the process repeats. The first state information, the first action information and the second state information obtained from the actual game environment in this way serve as sample data input to the AI model of the FPS game. During this process, the actual reward is calculated from the first state information, the first action information and the second state information according to the reward mechanism.
Exemplarily, the first state information and the first action information are input into the critic network to obtain the first state action value at the current time step; then the second state information at the next time step is input into the actor network to obtain the predicted action at the next time step as the second action information, and the second state information and the second action information are input into the critic network to obtain the second state action value at the next time step.
Specifically, the training unit 12 computes the difference between the first state action value and the second state action value to obtain the value difference.
It should be noted that a state action value refers to the expected return for a given state and action; for example, the return is the accumulated game reward, and the expected return is the average return over many plays of the game. The first network is a neural network whose inputs are a state and an action and whose output is the value of that state-action pair; its parameters are adjusted through backpropagation of the loss function so that its output approaches the true value.
Specifically, the actual reward is calculated through the preset reward mechanism. Calculating the loss function of the AI model of the FPS game based on the predicted reward and the actual reward comprises:
calculating a loss function for the first network based on the predicted reward and the actual reward;
and obtaining the predicted value of the first network to calculate the loss function of the second network.
Illustratively, the sum of squared differences between the predicted reward and the actual reward is the loss function of the critic network, and the loss function of the actor network is the predicted value of the critic network. Specifically, in the embodiments of the invention, the actor network and the critic network are each divided into an evaluation (eval) network and a target network; the target network does not participate in training, and its parameters are periodically copied from the eval network.
The critic network loss function satisfies the following equation:

$$L = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^{2}$$

where N is the total amount of sample data; $y_i = r_i + \gamma\, Q'\bigl(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\bigr)$; $r_i$ is the reward at the i-th time step; $\gamma$ is a hyperparameter, the discount factor; $Q(s, a \mid \theta^{Q})$ is the critic network with parameters $\theta^{Q}$; $\mu(s \mid \theta^{\mu})$ is the actor network with parameters $\theta^{\mu}$; $Q'$ is the target network of the critic network; and $\mu'$ is the target network of the actor network. The target network parameters are periodically updated from the Q network and the $\mu$ network and do not participate in training, and the actor network updates its parameters by maximizing the predicted value of the critic network.
Optionally, the reward mechanism comprises: and determining the road condition type of the virtual character in a preset moving direction, and calculating the actual reward according to the road condition type and the actual moving direction of the virtual character.
Specifically, determining the road condition type of the virtual character in the preset moving direction includes:
emitting two parallel rays to the moving direction according to the current position of the virtual character; the height difference of the two parallel rays meets a preset height difference threshold value;
detecting the ray detection condition of the two parallel rays at the current frame of the current time step;
determining the detection condition of the obstacle in the moving direction according to the ray detection condition;
and determining the road condition type of the virtual character in the moving direction according to the obstacle detection condition.
Illustratively, the moving direction is one of the 8 directions in which the virtual character can move. The two parallel rays are both linear rays or both spherical rays; a spherical ray casts a sphere toward the target direction and uses the sphere for detection. The height difference threshold is the height of the virtual character's soles above the ground after the character performs a jump. The ray detection condition of the two parallel rays is checked at the current frame of the current time step (i.e., at every time step), and the length the rays extend is determined by the number of frames between two time steps: for example, the more frames between time steps, the further the rays extend and the longer the detectable distance, so the range over which the virtual character's road condition is determined can be set via the number of frames between two time steps. When the ray detection condition is that no ray return distance value is detected in the current frame of the current time step, it can be determined that the ray has not detected an obstacle in the moving direction; when a ray return distance value is detected in the current frame of the current time step, it can be determined that the ray has detected an obstacle in the moving direction.
Further, the road condition type includes at least one of an unobstructed passable road condition, a slope passable road condition, an obstructed passable road condition and an obstructed nonpassable road condition.
Optionally, calculating an actual reward according to the road condition type and the actual moving direction of the virtual character includes:
when the actual moving direction of the virtual character moves towards the accessible road condition without obstacles or the accessible road condition on a slope, giving positive reward;
when the actual moving direction of the virtual character moves towards the road condition with obstacles and without passing, giving negative reward;
when the actual moving direction of the virtual character moves towards the road condition with obstacles and the virtual character is detected to take jumping action, positive reward is given;
and when the actual moving direction of the virtual character is towards the road condition that the virtual character has obstacles and can pass, and the virtual character is detected not to take the jumping action, giving a negative reward.
Illustratively, in order for the virtual character behavior output by the trained AI model to better simulate real human behavior, positive and negative rewards are set so that the virtual character is driven to imitate real human behavior in pursuit of more positive reward, allowing the user to become better immersed in the game environment and improving the user's game experience. When the actual moving direction of the virtual character is toward an unobstructed passable road condition or a slope passable road condition, a positive reward is given. When the actual moving direction of the virtual character is toward a road condition that has obstacles and cannot be passed, a negative reward is given to keep the virtual character from heading toward places it cannot pass. When the actual moving direction of the virtual character is toward a road condition that has obstacles but can be passed, and the virtual character is detected to have taken a jump action, the virtual character is attempting to cross the obstacle by jumping, and a positive reward is given. When the actual moving direction of the virtual character is toward a road condition that has obstacles but can be passed, and the virtual character is detected not to have taken a jump action, the virtual character has failed to jump over an obstacle that can be cleared by jumping, and a negative reward is given to discourage this behavior from recurring.
Specifically, the training unit 12 performs two optimizations during each model training iteration: it first optimizes the critic network according to the first state information, the first action information and the second state information in combination with the critic network's loss function, and then optimizes the actor network according to the first state information in combination with the actor network's loss function (that is, the predicted value of the critic network, where the critic network's inputs are the state information of the current time step and the predicted action for the current time step).
Compared with the prior art, the AI model training device 10 of the FPS game according to the embodiment of the present invention determines the road condition type of the virtual character in the preset moving direction during the training of the AI model of the FPS game, and calculates the actual reward according to the road condition type and the actual moving direction of the virtual character. By calculating the road condition type of the virtual character in the moving direction, calculating the actual reward according to the road condition type, and setting the reward mechanism of positive reward and negative reward, effective situation analysis can be carried out aiming at complex game scenes, so that the virtual character behavior output by the trained AI model can better simulate real human behavior, and the game experience of a user is improved.
Referring to fig. 7, fig. 7 is a block diagram illustrating the structure of an AI model training device 20 of an FPS game according to an embodiment of the present invention. The AI model training device 20 of the FPS game includes a processor 21, a memory 22 and a computer program stored in the memory 22 and executable on the processor 21. When executing the computer program, the processor 21 implements the steps in each of the above-described embodiments of the AI model training method for an FPS game; alternatively, the processor 21 implements the functions of the modules/units in the above-described device embodiments.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 22 and executed by the processor 21 to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions for describing the execution process of the computer program in the AI model training apparatus 20 of the FPS game.
The AI model training device 20 of the FPS game may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The AI model training device 20 of the FPS game may include, but is not limited to, a processor 21 and a memory 22. It will be understood by those skilled in the art that the schematic diagram is merely an example of the AI model training device 20 of the FPS game and does not constitute a limitation of the AI model training device 20 of the FPS game and may include more or fewer components than those shown, or combine certain components, or different components, for example, the AI model training device 20 of the FPS game may further include an input-output device, a network access device, a bus, etc.
The Processor 21 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor 21 is the control center of the AI model training device 20 of the FPS game and uses various interfaces and lines to connect the parts of the entire AI model training device 20 of the FPS game.
The memory 22 may be used to store the computer programs and/or modules, and the processor 21 implements various functions of the AI model training device 20 of the FPS game by running or executing the computer programs and/or modules stored in the memory 22 and calling data stored in the memory 22. The memory 22 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory 22 may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
Wherein, the modules/units integrated by the AI model training device 20 of the FPS game may be stored in a computer readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on such understanding, all or part of the flow of the method according to the above embodiments may be implemented by a computer program, which may be stored in a computer readable storage medium and used by the processor 21 to implement the steps of the above embodiments of the method. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. An AI model training method of an FPS game is characterized by comprising the following steps:
acquiring a first state action value of a virtual character at a current time step and a second state action value of the virtual character at a next time step based on an AI model of the FPS game;
calculating a value difference between the first state action value and the second state action value as a prediction reward;
calculating a loss function of an AI model of the FPS game according to the predicted reward and the actual reward; the actual reward is calculated through a preset reward mechanism; the reward mechanism includes: determining the road condition type of the virtual character in a preset moving direction, and calculating an actual reward according to the road condition type and the actual moving direction of the virtual character;
and optimizing the AI model of the FPS game according to the loss function until the loss function is converged.
2. The AI model training method of an FPS game according to claim 1, wherein determining the road condition type of the virtual character in a preset moving direction comprises:
emitting two parallel rays to the moving direction according to the current position of the virtual character; the height difference of the two parallel rays meets a preset height difference threshold value;
detecting the ray detection condition of the two parallel rays at the current frame of the current time step;
determining the detection condition of the obstacle in the moving direction according to the ray detection condition;
and determining the road condition type of the virtual character in the moving direction according to the obstacle detection condition.
3. The AI model training method of an FPS game according to claim 2, wherein determining the obstacle detection situation in the moving direction based on the ray detection situation comprises:
when the ray detection condition is that no ray return distance value is detected in the current frame of the current time step, determining that no obstacle is detected in the moving direction by the ray;
and when the ray detection condition is that a ray return distance value is detected in the current frame of the current time step, determining that the ray detects the obstacle in the moving direction.
4. The AI model training method of an FPS game according to claim 1, wherein the road condition type includes at least one of an unobstructed passable road condition, a slope passable road condition, an obstructed passable road condition and an obstructed nonpassable road condition.
5. The AI model training method of an FPS game according to claim 4, wherein calculating the actual reward according to the road condition type and the actual moving direction of the virtual character comprises:
when the actual moving direction of the virtual character is towards the unobstructed passable road condition or the slope passable road condition, giving a positive reward;
when the actual moving direction of the virtual character is towards the obstructed non-passable road condition, giving a negative reward;
when the actual moving direction of the virtual character is towards the obstructed passable road condition and the virtual character is detected to take a jumping action, giving a positive reward;
and when the actual moving direction of the virtual character is towards the obstructed passable road condition and the virtual character is detected not to take a jumping action, giving a negative reward.
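The sketch below ties the two-ray result to the road-condition types of claim 4 and the reward rules of claim 5. The mapping from ray hits to road-condition types and the reward magnitudes `pos_r`/`neg_r` are illustrative assumptions, not values given in the claims.

```python
UNOBSTRUCTED_PASSABLE = "unobstructed_passable"
SLOPE_PASSABLE = "slope_passable"
OBSTRUCTED_PASSABLE = "obstructed_passable"            # can be cleared by jumping
OBSTRUCTED_NON_PASSABLE = "obstructed_non_passable"


def classify_road(low_blocked: bool, high_blocked: bool) -> str:
    """Assumed mapping: no hits means clear ground, only the low ray hitting
    means a low obstacle (or a slope, depending on the returned distance,
    which is not modelled here), and both rays hitting means a wall."""
    if not low_blocked and not high_blocked:
        return UNOBSTRUCTED_PASSABLE
    if low_blocked and not high_blocked:
        return OBSTRUCTED_PASSABLE
    return OBSTRUCTED_NON_PASSABLE


def actual_reward(road_type: str, moving_towards: bool, jumped: bool,
                  pos_r: float = 0.1, neg_r: float = -0.1) -> float:
    """Reward mechanism of claim 5; applies only when the actual moving
    direction is towards the detected road condition."""
    if not moving_towards:
        return 0.0
    if road_type in (UNOBSTRUCTED_PASSABLE, SLOPE_PASSABLE):
        return pos_r
    if road_type == OBSTRUCTED_NON_PASSABLE:
        return neg_r
    # Obstructed but passable: reward jumping over it, penalise walking into it.
    return pos_r if jumped else neg_r
```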
6. The AI model training method of an FPS game according to claim 1, wherein the AI model of the FPS game comprises a first network and a second network, and acquiring the first state action value of the virtual character at the current time step and the second state action value at the next time step based on the AI model of the FPS game comprises:
acquiring first state information and first action information of the virtual character at the current time step;
inputting the first state information and the first action information into the first network to obtain the first state action value of the virtual character at the current time step;
acquiring second state information of the virtual character at the next time step;
inputting the second state information into the second network to obtain second action information;
and inputting the second state information and the second action information into the first network to obtain the second state action value of the virtual character at the next time step.
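A minimal sketch of the first-network/second-network structure described in claim 6, assuming a critic-style first network that maps (state information, action information) to a state-action value and an actor-style second network that maps state information to action information; layer sizes and activations are illustrative assumptions.

```python
import torch
import torch.nn as nn


class FirstNetwork(nn.Module):
    """Maps (state information, action information) to a state-action value."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


class SecondNetwork(nn.Module):
    """Maps state information to action information."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, state):
        return self.net(state)


def state_action_values(first, second, state_t, action_t, state_next):
    # First value from the current state and action; second value from the
    # next state and the action the second network proposes for it (claim 6).
    q_first = first(state_t, action_t)
    action_next = second(state_next)
    q_second = first(state_next, action_next)
    return q_first, q_second
```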
7. An AI model training device for an FPS game, characterized by comprising:
a data acquisition unit, configured to acquire a first state action value of a virtual character at a current time step and a second state action value at a next time step based on an AI model of the FPS game; and
a training unit, configured to:
calculate a value difference between the first state action value and the second state action value as a predicted reward;
calculate a loss function of the AI model of the FPS game according to the predicted reward and an actual reward, wherein the actual reward is calculated through a preset reward mechanism, and the reward mechanism comprises: determining a road condition type of the virtual character in a preset moving direction, and calculating the actual reward according to the road condition type and an actual moving direction of the virtual character; and
optimize the AI model of the FPS game according to the loss function until the loss function converges.
8. The AI model training device of claim 7, wherein determining the road condition type of the virtual character in the preset moving direction comprises:
emitting two parallel rays in the moving direction from the current position of the virtual character, wherein the height difference between the two parallel rays meets a preset height difference threshold;
acquiring a ray detection condition of the two parallel rays in the current frame of the current time step;
determining an obstacle detection condition in the moving direction according to the ray detection condition;
and determining the road condition type of the virtual character in the moving direction according to the obstacle detection condition.
9. An AI model training apparatus for an FPS game, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the AI model training method of an FPS game according to any one of claims 1 to 6.
10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus on which the computer-readable storage medium is located to perform the AI model training method of an FPS game according to any one of claims 1 to 6.
CN202110800433.5A 2021-07-15 2021-07-15 AI model training method, device, equipment and storage medium for FPS game Active CN113663335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110800433.5A CN113663335B (en) 2021-07-15 2021-07-15 AI model training method, device, equipment and storage medium for FPS game

Publications (2)

Publication Number Publication Date
CN113663335A true CN113663335A (en) 2021-11-19
CN113663335B CN113663335B (en) 2024-08-16

Family

ID=78539223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110800433.5A Active CN113663335B (en) 2021-07-15 2021-07-15 AI model training method, device, equipment and storage medium for FPS game

Country Status (1)

Country Link
CN (1) CN113663335B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114288663A (en) * 2022-01-05 2022-04-08 腾讯科技(深圳)有限公司 Game data processing method, device, equipment and computer readable storage medium
CN116459520A (en) * 2022-01-11 2023-07-21 腾讯科技(深圳)有限公司 Intelligent virtual role control method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110975283A (en) * 2019-11-28 2020-04-10 腾讯科技(深圳)有限公司 Processing method and device of virtual shooting prop, storage medium and electronic device
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN111185008A (en) * 2020-01-20 2020-05-22 腾讯科技(深圳)有限公司 Method and apparatus for controlling virtual character in game
CN111773724A (en) * 2020-07-31 2020-10-16 网易(杭州)网络有限公司 Method and device for crossing virtual obstacle
CN111888762A (en) * 2020-08-13 2020-11-06 网易(杭州)网络有限公司 Method for adjusting visual angle of lens in game and electronic equipment
CN112221140A (en) * 2020-11-04 2021-01-15 腾讯科技(深圳)有限公司 Motion determination model training method, device, equipment and medium for virtual object

Also Published As

Publication number Publication date
CN113663335B (en) 2024-08-16

Similar Documents

Publication Publication Date Title
JP7159458B2 (en) Method, apparatus, device and computer program for scheduling virtual objects in a virtual environment
CN110882542B (en) Training method, training device, training equipment and training storage medium for game intelligent agent
CN113663335B (en) AI model training method, device, equipment and storage medium for FPS game
CN111632379B (en) Game role behavior control method and device, storage medium and electronic equipment
CN109529352B (en) Method, device and equipment for evaluating scheduling policy in virtual environment
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN110639208B (en) Control method and device for interactive task, storage medium and computer equipment
CN114392560B (en) Method, device, equipment and storage medium for processing running data of virtual scene
CA3168052C (en) System and method for anti-blinding target game
CN113577769B (en) Game character action control method, apparatus, device and storage medium
CN112245934B (en) Data analysis method, device and equipment for virtual resources in virtual scene application
CN116036601B (en) Game processing method and device, computer equipment and storage medium
CN111265871A (en) Virtual object control method and device, equipment and storage medium
CN114935893B (en) Motion control method and device for aircraft in combat scene based on double-layer model
CN114404976B (en) Training method and device for decision model, computer equipment and storage medium
CN116047902A (en) Method, device, equipment and storage medium for navigating robots in crowd
Burelli Interactive virtual cinematography
CN113797543A (en) Game processing method, game processing device, computer device, storage medium, and program product
Joo et al. Learning to automatically spectate games for Esports using object detection mechanism
Dung et al. Building Machine Learning Bot with ML-Agents in Tank Battle
CN113521746A (en) AI model training method, device, system and equipment for FPS game
Chang et al. Investigating and modeling the emergent flocking behaviour of sheep under threat with fear contagion
CN112933600B (en) Virtual object control method, device, computer equipment and storage medium
KR102479931B1 (en) Game providing device, game providing method and computer program for providing reward corresponding to a predicted probability index
CN114247132B (en) Control processing method, device, equipment, medium and program product for virtual object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant