CN116920405A - Training method for intelligent agent control model, computer equipment and storage medium - Google Patents
Info
- Publication number
- CN116920405A (application CN202310731361.2A)
- Authority
- CN
- China
- Prior art keywords
- key
- different
- virtual environment
- determining
- agent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/55—Controlling game characters or game objects based on the game progress
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/80—Special adaptations for executing a specific game genre or game mode
- A63F13/822—Strategy games; Role-playing games
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/02—Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]
Abstract
The application relates to the field of artificial intelligence, and provides an agent control model training method, computer equipment and a storage medium, wherein the method comprises the following steps: acquiring key position information in a virtual environment and position preference information of different countermeasure props; determining key position databases corresponding to the different countermeasure props according to the key position information and the position preference information; determining a target action position of the agent in the virtual environment according to the current position, the current countermeasure prop and the key position database; controlling the agent to move to the target action position based on a preset model, and interacting with the virtual environment at the target action position to obtain interaction feedback information; and adjusting model parameters of the preset model according to the interaction feedback information until a target model is obtained. Different key position databases can be determined according to the countermeasure prop held, and the agent interacts with the virtual environment in different styles based on the different key position databases, improving the flexibility of the agent.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an agent control model training method, a computer device, and a storage medium.
Background
With the development of artificial intelligence, scenarios in which artificial intelligence performs character hosting, game guidance, game testing, non-player character (NPC) control and the like are becoming increasingly common in computer games. However, an agent control model trained by conventional methods has a single combat strategy and poor anthropomorphism and flexibility, and can hardly cooperate or compete with real players according to the actual situation. Therefore, how to train the agent control model to improve the anthropomorphism and flexibility of the agent is a problem to be solved.
Disclosure of Invention
The application mainly aims to provide an agent control model training method, computer equipment and a storage medium, and aims to improve the anthropomorphism and flexibility of agent control.
In a first aspect, the present application provides an agent control model training method, including the steps of:
acquiring key position information in a virtual environment and position preference information of different countermeasure props, wherein different countermeasure props have different prop attributes, and props with different prop attributes are adapted to different key positions in the virtual environment;
determining a key position database corresponding to different countermeasure props according to the key position information and the position preference information;
acquiring the current position and the current countermeasure prop of the agent in the virtual environment, and determining the target action position of the agent at the next moment in the virtual environment according to the current position, the current countermeasure prop and the key position database;
controlling the agent to move from the current position to the target action position based on a preset model, and interacting with the virtual environment at the target action position to obtain interaction feedback information;
and adjusting the model parameters of the preset model according to the interaction feedback information until a target model is obtained.
In a second aspect, the present application also provides a computer device, where the computer device includes a processor, a memory, and a computer program stored on the memory and executable by the processor, where the computer program when executed by the processor implements an agent control model training method as described above.
In a third aspect, the present application further provides a computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements an agent control model training method as described above.
The application provides an agent control model training method, computer equipment and a computer readable storage medium. Key position information in a virtual environment and position preference information of different countermeasure props are acquired, wherein different countermeasure props have different prop attributes, and props with different prop attributes are adapted to different key positions in the virtual environment; a key position database corresponding to different countermeasure props is determined according to the key position information and the position preference information; the current position and the current countermeasure prop of the agent in the virtual environment are acquired, and the target action position of the agent at the next moment in the virtual environment is determined according to the current position, the current countermeasure prop and the key position database; the agent is controlled based on a preset model to move from the current position to the target action position, and to interact with the virtual environment at the target action position to obtain interaction feedback information; and the model parameters of the preset model are adjusted according to the interaction feedback information until a target model is obtained. Because the agent can determine different key position databases according to the countermeasure prop it holds, and interact with the virtual environment in different styles based on the different key position databases, the anthropomorphism and flexibility of the agent are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an agent control model training method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a sub-step of an agent control model training method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating sub-steps of an agent control model training method according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating sub-steps of an agent control model training method according to an embodiment of the present application;
fig. 5 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
The embodiment of the application provides an agent control model training method, computer equipment and a computer readable storage medium.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a flow chart of an intelligent agent control model training method according to an embodiment of the application. The intelligent agent control model training method can be used in a terminal or a server to train the intelligent agent control model. The terminal can be electronic equipment such as a mobile phone, a tablet personal computer, a notebook computer, a desktop computer, a personal digital assistant, wearable equipment and the like; the server may be an independent server, a server cluster, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), basic cloud computing services such as big data and artificial intelligence platforms, and the like.
In the related art, the agent control model is connected to the game server using the same network protocol as the game client, so that the agent can be effectively applied to various in-game scenarios such as offline hosting, human-machine combat and human-machine cooperation.
For example, an agent in a game may hold countermeasure props of widely different styles; in a gunfight game, for instance, an agent may hold one or more of a sniper rifle, an automatic rifle, a submachine gun, and a shotgun. Different countermeasure props have different characteristics: the effective ranges of the sniper rifle, automatic rifle, submachine gun and shotgun decrease in that order. In general, for a countermeasure prop with a shorter range, a combat position with denser shelter needs to be selected, which facilitates approaching the attack target by moving between shelters; for a countermeasure prop with a longer range, a combat position with a wider field of view needs to be selected, which facilitates spotting enemy characters and launching long-distance attacks. Different countermeasure props therefore call for different combat strategies in the game, yet existing agent control models cannot control the agent to fight differently according to the countermeasure prop held, so the agent has poor anthropomorphism and flexibility.
As shown in fig. 1, the agent control model training method includes steps S101 to S105.
Step S101, acquiring key position information in a virtual environment and position preference information of different countermeasure props, wherein different countermeasure props have different prop attributes, and props with different prop attributes are adapted to different key positions in the virtual environment.
Illustratively, the behavior of the agent in the virtual environment is essentially a cycle in which the agent moves from its current position to the next target action position, performs actions such as information gathering and prop use at that target action position, and then moves on to the next target action position. The agent therefore fights based on the key positions in the virtual environment, and these key positions are critical to the agent's behavior.
Illustratively, a key position is a position in the virtual environment where an agent can stay and perform actions. In step S101, the key position information of all key positions in the virtual environment is acquired, so that it can subsequently be combined with the position preference information of different countermeasure props to establish the key position database corresponding to each countermeasure prop.
The key position information may be acquired through a deep reinforcement learning algorithm, or based on a statistical machine learning algorithm, without limitation; the key position information may also first be acquired through the deep reinforcement learning algorithm and then checked and supplemented a second time through the statistical machine learning algorithm. Because the virtual environment may introduce interference during training, key positions may additionally be supplemented by manual annotation after being acquired automatically; no limitation is imposed here.
Illustratively, the prop attributes include the effective distance, i.e. the range, of the countermeasure prop. Of course, the present application is not limited thereto.
And step S102, determining a key position database corresponding to different countermeasure props according to the key position information and the position preference information.
By way of example, different position preference information is preset according to the prop attributes of different countermeasure props, so that the key position information can be screened based on the position preference information to obtain the key position database corresponding to each countermeasure prop. These key position databases serve as prior knowledge for subsequently controlling the agent to explore and fight in the virtual environment.
By way of example, the position preference information may be embodied in the form of weights; for instance, the weights may be used in a reward function when determining the key positions adapted to different countermeasure props, although the application is not limited thereto.
Referring to fig. 2, fig. 2 is a flow chart illustrating sub-steps of an intelligent agent control model training method according to an embodiment of the application.
As shown in fig. 2, in some embodiments, the key position information includes the key position coordinates and at least one position evaluation value, and the position preference information includes an evaluation preference weight corresponding to each position evaluation value. Step S102, determining a key position database corresponding to different countermeasure props according to the key position information and the position preference information, includes: step S1021, determining standardized position scores of the key position coordinates for different countermeasure props according to at least one position evaluation value and the evaluation preference weight corresponding to each position evaluation value; and step S1022, determining the key position database corresponding to different countermeasure props according to the key position coordinates and the standardized position scores for the different countermeasure props. The position evaluation values at least include a shelter density evaluation value and a view openness evaluation value.
Illustratively, a key position may be quantified from at least one dimension by its position evaluation values. For example, without limitation, the shelter density and view openness of a key position may each be quantified by a position evaluation value, and the altitude and terrain specificity of the key position (whether it belongs to special terrain such as a room, a slope, or either side of a bridge) may likewise be quantified by position evaluation values.
For example, the position preference information may be evaluation preference weights set for the position evaluation values of different dimensions. The sniper rifle has a longer range and needs to exert its advantage at key positions with higher view openness, while its requirement on shelter density at a key position is lower; the sniper rifle may therefore be given a higher evaluation preference weight for view openness and a lower evaluation preference weight for shelter density. In contrast, the shotgun has a shorter range and needs to take cover at key positions with higher shelter density to ensure combat safety, while its requirement on the view openness of a key position is lower; the shotgun may therefore be given a higher evaluation preference weight for shelter density and a lower evaluation preference weight for view openness.
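As a minimal illustrative sketch of such preference settings, the weights below encode the sniper-rifle and shotgun tendencies described above; the dimension names and numeric values are assumptions for illustration, not values disclosed by this application:

```python
# Hypothetical evaluation-preference weights per countermeasure prop.
# Dimension names and weight values are illustrative assumptions.
PREFERENCE_WEIGHTS = {
    # Long range: favor view openness, low weight on shelter density.
    "sniper_rifle": {"view_openness": 0.8, "shelter_density": 0.2},
    # Short range: favor dense shelter, low weight on view openness.
    "shotgun": {"view_openness": 0.2, "shelter_density": 0.8},
}
```

Any number of further dimensions (altitude, terrain specificity, and so on) could be added as extra keys with their own weights.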
For example, different target models may be trained separately for countermeasure props with large differences in combat style, such as the sniper rifle with the longest range and the shotgun with the shortest range; of course, the application is not limited thereto, and different types of countermeasure props may also be accommodated in the same target model.
Referring to fig. 3, fig. 3 is a flowchart illustrating sub-steps of an intelligent agent control model training method according to an embodiment of the application.
As shown in fig. 3, in some embodiments, step S1021, determining standardized position scores of the key position coordinates for different countermeasure props according to at least one position evaluation value and the evaluation preference weight corresponding to each position evaluation value, includes: step S1121, determining at least one standardized position evaluation value according to each position evaluation value and the evaluation preference weight corresponding to that position evaluation value; and step S1221, determining the standardized position score of the key position coordinates for the different countermeasure props according to the sum of the standardized position evaluation values.
Illustratively, each position evaluation value is multiplied by the evaluation preference weight corresponding to that position evaluation value to obtain a standardized position evaluation value; for example, the shelter density evaluation value is multiplied by the evaluation preference weight corresponding to the shelter density evaluation value to obtain the standardized position evaluation value for shelter density.
Illustratively, the standardized position evaluation values are then added; for example, the standardized position evaluation value for shelter density and the standardized position evaluation value for view openness are added to obtain the standardized position score of the key position coordinates.
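The weighted-sum computation of steps S1121 and S1221 can be sketched as follows; the dictionary-based data layout and dimension names are assumptions for illustration:

```python
def standardized_position_score(evaluation_values, preference_weights):
    """Standardized position score: the sum, over evaluation dimensions, of
    (position evaluation value x evaluation preference weight).

    Both arguments map a dimension name (e.g. "shelter_density",
    "view_openness") to a float; the dimension names are assumptions.
    """
    return sum(value * preference_weights.get(dim, 0.0)
               for dim, value in evaluation_values.items())
```

For a shelter-dense position scored with shotgun-style weights, the shelter term dominates the total, matching the preference behaviour described above.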
In some embodiments, step S1022, determining a key position database corresponding to different countermeasure props according to the standardized position scores of the key position coordinates for the different countermeasure props, includes: if the standardized position score of a key position coordinate for a target countermeasure prop is greater than a preset score, storing the key position coordinate and the standardized position score in the key position database corresponding to the target countermeasure prop.
For example, the preset score may be set according to actual requirements, and specifically, different preset scores may be set for different countermeasures, which is not limited herein.
By comparing the standardized position scores with the preset score, the key position coordinates are screened, so that the key position coordinates at which each countermeasure prop can exert its advantage are obtained, and the key position database corresponding to each countermeasure prop is determined. Agents holding different countermeasure props can thus fight differently according to local conditions, which improves the anthropomorphism and flexibility of the agent.
By way of example, the key position coordinates and the standardized position scores are stored together in a key position database corresponding to the target countermeasures, so that the key position coordinates of the target action position can be determined according to the standardized position scores, and the rationality of determining the target action position is improved.
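The screening of step S1022 can be sketched as a threshold filter that keeps each qualifying coordinate together with its score; the data layout is an assumption for illustration:

```python
def build_key_position_database(key_positions, preference_weights, preset_score):
    """Screen key positions (step S1022): keep a coordinate and its
    standardized position score when the score exceeds the preset score.

    `key_positions` maps a coordinate tuple to its per-dimension position
    evaluation values; this layout is an illustrative assumption.
    """
    database = {}
    for coord, values in key_positions.items():
        score = sum(v * preference_weights.get(dim, 0.0)
                    for dim, v in values.items())
        if score > preset_score:
            database[coord] = score  # store coordinate together with its score
    return database
```

Running this once per countermeasure prop, each with its own weights and preset score, yields the per-prop databases described above.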
Step S103, acquiring the current position and the current countermeasure prop of the agent in the virtual environment, and determining the target action position of the agent at the next moment in the virtual environment according to the current position, the current countermeasure prop and the key position database.
By way of example, since different countermeasure props have different key position databases, even at the same current position, agents holding different countermeasure props may determine different target action positions, so that the agent acts in the environment according to the countermeasure prop it holds, improving the variability of agent control.
Referring to fig. 4, fig. 4 is a flowchart illustrating sub-steps of an intelligent agent control model training method according to an embodiment of the application.
As shown in fig. 4, in some embodiments, the determining the target action position of the agent at the next moment in the virtual environment according to the current position, the current countermeasure prop and the key position database includes: determining candidate key position coordinates whose distance from the agent is smaller than a preset distance according to the current position and the key position database corresponding to the current countermeasure prop; and determining the target action position of the agent according to the standardized position scores corresponding to the candidate key position coordinates.
For example, the key position coordinates near the current position are obtained as candidate key position coordinates, wherein the preset distance can be set according to actual requirements.
Specifically, the preset distance within which the agent can move safely may be determined according to the number and distance of enemy characters within a certain range, the number and distance of friendly characters, and the firepower coverage of the enemy characters, thereby improving the flexibility of the agent's actions.
In some embodiments, the determining the target action position of the agent according to the standardized position scores corresponding to the candidate key position coordinates includes: determining the candidate key position coordinate with the highest standardized position score as the target action position of the agent.
By way of example, the key position with the highest standardized position score within the preset distance is taken as the target action position of the agent at the next moment, which improves the safety and rationality of the agent's actions.
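The two sub-steps above can be sketched together; the assumption that the database maps coordinate tuples to standardized position scores is illustrative:

```python
import math

def select_target_action_position(current_position, key_position_database,
                                  preset_distance):
    """Step S103 sketch: among key positions within the preset distance of the
    agent's current position, return the coordinate with the highest
    standardized position score, or None when no candidate qualifies."""
    candidates = {
        coord: score
        for coord, score in key_position_database.items()
        if math.dist(current_position, coord) < preset_distance  # Euclidean
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

Because each countermeasure prop has its own database, the same call with a different database naturally yields a different target action position.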
Step S104, controlling the agent to move from the current position to the target action position based on a preset model, and interacting with the virtual environment at the target action position to obtain interaction feedback information.
After the key position databases corresponding to different countermeasure props are determined, the agent is controlled through a preset model to fight in the virtual environment holding at least one countermeasure prop, so as to obtain a target model that can be connected to a server to control the agent.
In some embodiments, the controlling the agent to move from the current position to the target action position based on the preset model, and interacting with the virtual environment at the target action position to obtain interaction feedback information, includes: based on the preset model, controlling the agent to interact at the target action position with the virtual environment and with the virtual characters in the virtual environment, and acquiring the interaction feedback information.
Exemplary interactions with the virtual environment include interactions with virtual characters in the virtual environment, including but not limited to: attacking enemy characters, rescuing friendly characters, taking cover behind shelters, using throwables, and the like.
The evaluation of the actions executed by the agent is embodied in the form of the interaction feedback information, and the preset model can adjust its parameters according to the interaction feedback information, so that the agent executes reasonable actions in different scenarios.
In some embodiments, the method further comprises: setting, at random positions in the virtual environment, at least one virtual character for interacting with the agent, each virtual character holding at least one countermeasure prop.
By way of example, because the positions of the virtual characters are random, the agent's combat across the virtual environment can be trained comprehensively, improving the all-around quality of its combat performance: the agent can exploit favorable terrain and still perform well on unfavorable terrain.
For example, the virtual character may hold the same countermeasure props as the agent, or may hold different types of countermeasure props, respectively. For example, the countermeasures held by the virtual characters may also be randomly determined, without limitation.
For example, in the case where the virtual characters respectively hold different types of countermeasure props, the intelligent agents holding the different countermeasure props may be subjected to mixed training.
By way of example, through enemy or friendly characters holding randomly assigned countermeasure props, the agent can engage in self-play in the virtual environment, so that it learns appropriate action strategies when facing potential attacks from different countermeasure props, improving the degree of intelligence of the agent.
In some embodiments, the controlling the agent to move from the current position to the target action position based on the preset model, and interacting with the virtual environment at the target action position to obtain interaction feedback information, includes: acquiring the character type of a virtual character and the countermeasure prop held by the virtual character, and controlling the agent based on the preset model according to the character type and the countermeasure prop, wherein the character type at least comprises a friendly character and an enemy character.
For example, different action strategies are determined according to the counterprops held by the virtual characters, for example, according to prop attributes of the counterprops held by the enemy characters, in combination with the distance between the current position and the enemy characters.
Step S105, adjusting the model parameters of the preset model according to the interaction feedback information until a target model is obtained.
The interaction feedback information evaluates the behavior of the agent, and the parameters of the preset model are adjusted according to the interaction feedback information, which improves the rationality of the agent's behavior so that the agent can fight better in the virtual environment or an online environment.
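The act-feedback-adjust loop of steps S104 and S105 can be sketched with a toy one-parameter model; the hill-climbing update rule and the stand-in environment below are illustrative assumptions, not the training procedure disclosed by this application:

```python
class ToyEnvironment:
    """Stand-in virtual environment: interaction feedback is higher the closer
    the model's action parameter is to a fixed ideal position (an assumption)."""
    def __init__(self, ideal_position=5.0):
        self.ideal_position = ideal_position

    def interaction_feedback(self, position):
        # Feedback is the negative distance to the ideal position.
        return -abs(position - self.ideal_position)


def train_preset_model(env, parameter=0.0, steps=100, step_size=0.5):
    """Repeatedly act, collect interaction feedback, and adjust the single
    model parameter in whichever direction improves feedback."""
    for _ in range(steps):
        current = env.interaction_feedback(parameter)
        probe = env.interaction_feedback(parameter + step_size)
        parameter += step_size if probe > current else -step_size
    return parameter
```

A real preset model would adjust many parameters (e.g. via reinforcement learning), but the loop shape (interact, receive feedback, adjust, repeat until a target model is obtained) is the same.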
In some embodiments, the method further comprises: connecting the target model to an online environment, so that the target model outputs control instructions to a server to control the agent to interact with at least the online environment.
Illustratively, the object model is accessed to the game server in the same network protocol as used by the game client so that the object model controls the agent to interact in the online environment.
For example, in an online environment of high complexity, the behavior of the agent can be adjusted through a heuristic algorithm to prevent unreasonable behavior by the target model. For example, based on a heuristic algorithm, the agent's behavior is adjusted according to its distance from enemy characters, its distance from friend characters, the safety of the target action position, and the like in the actual online environment.
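One simple form of such a heuristic safeguard is a veto rule layered on top of the model's output. This is a sketch under assumptions: the function name, the "hold position" fallback, and the fixed safety threshold are all illustrative choices, not taken from the patent.

```python
import math

def heuristic_adjust(agent_pos, proposed_pos, enemy_positions, safety_threshold=5.0):
    """Veto the model's proposed target position if it lies too close to any enemy.

    Falls back to holding the current position; a real system might instead
    pick the nearest safe key position.
    """
    if any(math.dist(proposed_pos, e) < safety_threshold for e in enemy_positions):
        return agent_pos      # unsafe: override the model and stay put
    return proposed_pos       # safe: accept the model's choice

# Proposed move lands 1 unit from an enemy -> vetoed, agent holds at (0, 0).
print(heuristic_adjust((0, 0), (10, 0), [(11, 0)]))
# No enemy within the threshold -> the model's target is accepted.
print(heuristic_adjust((0, 0), (10, 0), [(30, 0)]))
```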
The inventive methods are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The above-described method may be implemented, for example, in the form of a computer program that is executable on a computer device as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a server or a terminal.
As shown in fig. 5, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a storage medium and an internal memory.
The storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause the processor to perform any of the agent control model training methods.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for the execution of the computer program in the storage medium; when executed by the processor, the computer program causes the processor to perform any of the agent control model training methods.
The network interface is used for network communication such as transmitting assigned tasks and the like. It will be appreciated by those skilled in the art that the structure shown in FIG. 5 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein in one embodiment the processor is configured to run a computer program stored in the memory to implement the steps of:
acquiring key position information in a virtual environment and position preference information of different countermeasure props, wherein different countermeasure props correspond to different position preference information, and the key positions of different countermeasure props in the virtual environment are different;
determining a key position database corresponding to the different countermeasure props according to the key position information and the position preference information;
acquiring the current position and the current countermeasure prop of the agent in the virtual environment, and determining the target action position of the agent at the next moment in the virtual environment according to the current position, the current countermeasure prop, and the key position database;
controlling the agent to move from the current position to the target action position based on a preset model, and interacting with the virtual environment at the target action position to obtain interaction feedback information;
and adjusting model parameters of the preset model according to the interaction feedback information until a target model is obtained.
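The five steps above can be sketched as a minimal training loop. Everything here — the stub environment, the model interface, the layout of the key position database (prop name mapped to coordinate-score pairs), and the placeholder parameter update — is an invented illustration of the flow, not the patent's actual implementation.

```python
import random

class StubEnv:
    """Toy stand-in for the virtual environment."""
    def __init__(self):
        self.agent_pos = (0.0, 0.0)
    def state(self):
        return self.agent_pos, "rifle"            # current position and prop
    def move_and_interact(self, target):
        self.agent_pos = target                   # move to the target position
        return {"reward": random.random()}        # interaction feedback

class StubModel:
    """Toy stand-in for the preset model being trained."""
    def __init__(self):
        self.param = 0.0
    def select_target(self, pos, prop, key_db):
        # Pick the highest-scoring key position from this prop's database.
        return max(key_db[prop], key=key_db[prop].get)
    def update(self, feedback):
        self.param += 0.1 * feedback["reward"]    # placeholder parameter update

def train(env, model, key_db, episodes=5):
    for _ in range(episodes):
        pos, prop = env.state()                          # current position/prop
        target = model.select_target(pos, prop, key_db)  # pick target position
        feedback = env.move_and_interact(target)         # interact, get feedback
        model.update(feedback)                           # adjust model parameters
    return model

key_db = {"rifle": {(10.0, 5.0): 0.8, (3.0, 2.0): 0.6}}
trained = train(StubEnv(), StubModel(), key_db)
```

The loop terminates after a fixed episode count here; the patent instead iterates "until a target model is obtained", i.e. until some convergence or quality criterion is met.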
It should be noted that, for convenience and brevity of description, the specific working process of training the agent control model may refer to the corresponding process in the foregoing embodiments of the agent control model training method, and is not repeated herein.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored thereon, where the computer program includes program instructions, where the method implemented when the program instructions are executed may refer to various embodiments of the method for training an agent control model of the present application.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, which are provided on the computer device.
It is to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (12)
1. An agent control model training method, comprising:
acquiring key position information in a virtual environment and position preference information of different countermeasure props, wherein different countermeasure props correspond to different position preference information, and the key positions of different countermeasure props in the virtual environment are different;
determining a key position database corresponding to the different countermeasure props according to the key position information and the position preference information;
acquiring the current position and the current countermeasure prop of the agent in the virtual environment, and determining the target action position of the agent at the next moment in the virtual environment according to the current position, the current countermeasure prop, and the key position database;
controlling the agent to move from the current position to the target action position based on a preset model, and interacting with the virtual environment at the target action position to obtain interaction feedback information;
and adjusting model parameters of the preset model according to the interaction feedback information until a target model is obtained.
2. The method for training an agent control model according to claim 1, wherein the controlling the agent to move from the current position to the target action position based on a preset model and interact with the virtual environment at the target action position to obtain interaction feedback information comprises:
and controlling, based on the preset model, the agent to interact with the virtual environment and the virtual characters in the virtual environment at the target action position, and acquiring the interaction feedback information.
3. The agent control model training method of claim 2, further comprising:
setting, at random locations in the virtual environment, at least one virtual character for interacting with the agent, each virtual character holding at least one countermeasure prop.
4. The method for training an agent control model according to claim 2, wherein the controlling the agent to move from the current position to the target action position based on a preset model and interact with the virtual environment at the target action position to obtain interaction feedback information includes:
acquiring the character type of the virtual character and the countermeasure prop held by the virtual character, and controlling the agent based on the preset model according to the character type and the countermeasure prop;
wherein the character type includes at least: friend characters and enemy characters.
5. The agent control model training method of claim 1, wherein the key position information comprises: key position coordinates and at least one position evaluation value; the position preference information comprises: at least one evaluation preference weight corresponding to the at least one position evaluation value; and the determining a key position database corresponding to different countermeasure props according to the key position information and the position preference information comprises:
determining standardized position scores of the key position coordinates for different countermeasure props according to the at least one position evaluation value and the evaluation preference weight corresponding to each position evaluation value;
determining the key position database corresponding to different countermeasure props according to the standardized position scores of the key position coordinates for the different countermeasure props;
wherein the position evaluation value includes at least: a shelter density evaluation value and a field-of-view openness evaluation value.
6. The method of claim 5, wherein the determining standardized position scores of the key position coordinates for different countermeasure props according to the at least one position evaluation value and the evaluation preference weight corresponding to each position evaluation value comprises:
determining at least one standardized position evaluation value according to the position evaluation value and the evaluation preference weight corresponding to the position evaluation value;
and determining the standardized position scores of the key position coordinates for the different countermeasure props according to the sum of the standardized position evaluation values.
7. The method of claim 5, wherein the determining the key position database corresponding to different countermeasure props according to the standardized position scores of the key position coordinates for the different countermeasure props comprises:
and if the standardized position score of a key position coordinate for a target countermeasure prop is greater than a preset score, storing the key position coordinate and the standardized position score into the key position database corresponding to the target countermeasure prop.
8. The method of claim 5, wherein the determining the target action position of the agent at the next moment in the virtual environment according to the current position, the current countermeasure prop, and the key position database comprises:
determining candidate key position coordinates whose distance from the agent is smaller than a preset distance, according to the current position and the key position database corresponding to the current countermeasure prop;
and determining the target action position of the agent according to the standardized position scores corresponding to the candidate key position coordinates.
9. The method of claim 8, wherein determining the target action location of the agent based on the normalized location scores corresponding to the candidate key location coordinates comprises:
and determining the candidate key position coordinate with the highest standardized position score as the target action position of the agent.
10. The agent control model training method of any one of claims 1-9, further comprising:
and connecting the target model to an online environment, so that the target model outputs control instructions to a server to control the agent to interact with at least the online environment.
11. A computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program when executed by the processor implements the steps of the agent control model training method of any of claims 1 to 10.
12. A computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, wherein the computer program, when executed by a processor, implements the agent control model training method of any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310731361.2A CN116920405A (en) | 2023-06-19 | 2023-06-19 | Training method for intelligent agent control model, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116920405A true CN116920405A (en) | 2023-10-24 |
Family
ID=88379704
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310731361.2A Pending CN116920405A (en) | 2023-06-19 | 2023-06-19 | Training method for intelligent agent control model, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116920405A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117899487A (en) * | 2024-03-15 | 2024-04-19 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment, storage medium and program product |
CN117899487B (en) * | 2024-03-15 | 2024-05-31 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment, storage medium and program product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9248372B2 (en) | Using and exporting experience gained in a video game | |
CN111481932B (en) | Virtual object control method, device, equipment and storage medium | |
KR101398086B1 (en) | Method for processing user gesture input in online game | |
US20220072428A1 (en) | Virtual object control method and apparatus, device, and storage medium | |
EP2862112B1 (en) | Anti-cheating method and system for online games | |
CN111249735A (en) | Path planning method and device of control object, processor and electronic device | |
KR102641337B1 (en) | Virtual object selection methods and devices, devices and storage media | |
CN103577704A (en) | Event handling method and device through NPC in game system | |
WO2023024762A1 (en) | Artificial intelligence object control method and apparatus, device, and storage medium | |
CN109925712B (en) | Virtual object control system | |
US11511191B2 (en) | Interactive control system and method for game objects, server and computer-readable storage medium | |
CN112370790A (en) | Game map drawing method and device, electronic equipment and storage medium | |
CN117180750A (en) | Non-user role control method, device, equipment and medium based on behavior tree | |
CN116920405A (en) | Training method for intelligent agent control model, computer equipment and storage medium | |
CN114307150B (en) | Method, device, equipment, medium and program product for interaction between virtual objects | |
CN111298432B (en) | Virtual object information acquisition method and device, server and readable storage medium | |
CN111151007B (en) | Object selection method, device, terminal and storage medium | |
CN113797544A (en) | Attack control method and device for virtual object, computer equipment and storage medium | |
CN113181635B (en) | Virtual prop assembling method, device, terminal and storage medium | |
US11878248B2 (en) | Non-transitory computer-readable medium and video game processing system | |
US20240123341A1 (en) | Method, apparatus, electronic device and storage medium for combat control | |
CN116747524A (en) | Global deployment decision method, device, equipment and medium for virtual object | |
CN116688525A (en) | Team matching method, device, equipment and medium | |
CN117298580A (en) | Virtual object interaction method, device, equipment, medium and program product | |
CN116617666A (en) | Score type behavior decision method, device and equipment for virtual object |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||