WO2019075632A1 - AI object behavior model optimization method and apparatus - Google Patents

AI object behavior model optimization method and apparatus

Info

Publication number
WO2019075632A1
Authority
WO
WIPO (PCT)
Prior art keywords
game
real
environment
value
status information
Prior art date
Application number
PCT/CN2017/106507
Other languages
English (en)
French (fr)
Inventor
姜润知
李源纯
黄柳优
李德元
王鹏
魏学峰
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority to CN201780048483.4A (CN109843401B)
Priority to PCT/CN2017/106507
Publication of WO2019075632A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence, and in particular, to an artificial intelligence (AI) object behavior model optimization method and apparatus.
  • In a game, game AI is an important component. In complex game scenes, game AI can enhance the entertainment value of the game.
  • For an AI system constructed with a behavior tree model, each time the AI is executed the system traverses the entire tree from the root node: the parent node executes its child nodes, each child node returns its result to the parent node after executing, and the parent node then determines what to do next according to those results. As a consequence, the behavior of the game AI is predictable and its behavior pattern is fixed.
  • the embodiment of the present application provides a behavior model optimization method and device for an AI object, so that the AI makes corresponding decisions according to real-time changes of the environment, and improves the flexibility of the AI.
  • a first aspect of the embodiments of the present application provides a method for establishing a behavior model of an AI object, including:
  • acquiring first real-time status information of a first environment where the AI object is located; extracting feature information of the first real-time status information; obtaining an action strategy of the AI object according to the feature information and a weight value of a learning network; feeding back the action strategy to the AI object so that the AI object executes the action strategy; acquiring second real-time status information of a second environment where the AI object is located, the second environment being an environment after the AI object executes the action strategy; obtaining a return value of the action strategy according to the second real-time status information; if the return value meets a preset condition, determining that the weight value of the learning network is a target weight value of the learning network; and establishing a behavior model of the AI object according to the target weight value.
  • a second aspect of the embodiments of the present application provides a method for controlling an AI object, including:
  • acquiring real-time status information of an environment where the AI object is located; extracting feature information of the real-time status information; obtaining an action strategy of the AI object according to the feature information and a weight value of a learning network, the weight value of the learning network being a preset value; and feeding back the action strategy to the AI object so that the AI object executes the action strategy.
  • an embodiment of the present application provides a behavior model establishing apparatus for an AI object, where the apparatus may be a server, and the server has a function of implementing a server in the foregoing method.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the functions described above.
  • the behavior model establishing apparatus of the AI object includes:
  • An acquiring module configured to acquire first real-time status information of the first environment where the AI object is located;
  • a processing module configured to extract feature information of the first real-time state information acquired by the acquiring module; and obtain an action strategy of the game AI object according to the feature information and a weight value of the learning network;
  • a feedback module configured to feed back the action policy obtained by the processing module to the AI object, so that the AI object executes the action policy
  • the acquiring module is configured to acquire second real-time status information of the second environment where the AI object is located, where the second environment is an environment after the AI object executes the action policy;
  • the processing module is configured to obtain a return value of the action policy according to the second real-time status information acquired by the acquiring module; if the return value meets a preset condition, determine that the weight value of the learning network is the target weight value of the learning network; and establish the behavior model of the AI object according to the target weight value.
  • the behavior model building device of the AI object includes:
  • One or more processors and a memory storing program instructions that, when executed by the one or more processors, configure the apparatus to perform a behavior model building method of the AI object of the present application.
  • an embodiment of the present application provides an AI object control device, where the AI object control device has the functions of implementing the foregoing method.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the functions described above.
  • the AI object control device includes:
  • An obtaining module configured to acquire real-time status information of an environment in which the AI object is located
  • a processing module configured to extract feature information of the real-time status information; obtain an action policy of the AI object according to the feature information and a weight value of the learning network, where the weight value of the learning network is a preset value;
  • a feedback module configured to feed back the action policy obtained by the processing module to the AI object, so that the AI object executes the action policy.
  • the AI object control device includes:
  • One or more processors; and
  • a memory storing program instructions that, when executed by the one or more processors, configure the apparatus to perform the AI object control method of the present application.
  • an embodiment of the present application provides a computer readable storage medium, including instructions, when the instructions are executed on a processor of a computing device, the apparatus performs the foregoing methods.
  • an embodiment of the present application provides a computer program product comprising instructions, when the computer program product runs on a computer, the computer executes the foregoing methods.
  • In the embodiment of the present application, after acquiring the first real-time status information of the first game environment, the server extracts multi-dimensional feature information of the first real-time status information and then obtains an action strategy of the game AI object according to the multi-dimensional feature information and the weight value of the learning network; the server feeds back the action strategy to the game AI object so that the game AI object executes the action strategy; the server then acquires second real-time status information of the second game environment after the game AI object executes the action strategy, calculates a return value of the action strategy according to the second real-time status information, determines that the weight value of the learning network is the target weight value when the return value meets a preset condition, and establishes a behavior model of the game AI object according to the target weight value.
  • This application makes corresponding decisions based on real-time changes in the environment, which can increase flexibility.
  • In addition, because the extracted feature information is multi-dimensional, its dimension is higher than that of the feature information extracted by a behavior tree, and the action strategy obtained after learning through the learning network is more refined, which further improves the flexibility of the game AI.
  • Figure 1 is a schematic diagram of an example behavior tree
  • FIG. 2 is a schematic diagram of a method for optimizing a behavior model of an AI object according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of a Snake game provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a mode for establishing a behavior model of an AI object according to an embodiment of the present application
  • FIG. 5 is another schematic diagram of a method for establishing a behavior model of an AI object according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of extracting feature information of real-time state information by a convolutional neural network according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the output content of the Snake game provided by the embodiment of the present application.
  • FIG. 8 is a schematic diagram of a method for controlling an AI object according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a Snake game provided by an embodiment of the present application.
  • FIG. 10 is another schematic diagram of a Snake game provided by an embodiment of the present application.
  • FIG. 11 is a flowchart of a method for establishing a behavior model of an AI object according to an embodiment of the present application
  • FIG. 12 is a flowchart of a method for controlling an AI object according to an embodiment of the present application.
  • FIG. 13 is a schematic diagram of an embodiment of a server in an embodiment of the present application.
  • FIG. 14 is a schematic diagram of another embodiment of a server in an embodiment of the present application.
  • FIG. 15 is a schematic diagram of an embodiment of an AI object control device according to an embodiment of the present application.
  • FIG. 16 is a schematic diagram of another embodiment of an AI object control device according to an embodiment of the present application.
  • the embodiment of the present application provides a method and a device for establishing a behavior model of an AI object, so that the AI makes corresponding decisions according to real-time changes of the environment, and improves the flexibility of the AI.
  • In a game, game AI is a very important part. In complex game scenes, game AI can enhance the entertainment value of the game.
  • Current game AI training methods usually use a state machine or a behavior tree. For an AI system constructed with a behavior tree model, each time the AI is executed, the system traverses the entire tree from the root node: the parent node executes its child nodes; after a child node executes, the result is returned to the parent node, and the parent node then determines what to do next according to the result of the child node.
  • Take a behavior tree model as shown in FIG. 1 as an example.
  • In FIG. 1, the parent node is a behavior selection node; the leaf node of the parent node is action 1; its child node is a sequential execution child node; and the leaf nodes of that child node include a node condition and action 2.
  • The behavior tree is entered from the root node, and the parent node runs its sequential execution child node. When the leaf nodes of the sequential execution child node (the node condition and action 2) execute successfully, the sequential execution child node returns success; otherwise, the sequential execution child node returns a failure flag to the parent node, and the parent node then executes its own leaf node, which is action 1. Suppose that action 1 is sleeping, action 2 is greeting, and the node condition is that a game player is encountered. In practical applications, according to this behavior tree, if the game AI encounters a game player, it greets; if the game AI does not encounter a game player, it sleeps. In this model, the behavior tree makes it easy to organize complex AI knowledge items in a very intuitive way.
  • the default combination node handles the iteration of the child nodes as if it were a preset priority policy queue, which is also in line with the normal thinking mode of human beings.
  • In addition, the various nodes of the behavior tree, including the leaf nodes, are highly reusable. However, each time an AI system constructed with this model executes, the system traverses the entire tree from the root node: the parent node executes its child nodes, each child node returns its result to the parent node, and the parent node then determines what to do next according to those results. As a consequence, the behavior of the game AI is predictable and its behavior pattern is fixed.
  • To this end, the embodiment of the present application provides a solution: acquiring first real-time status information of a first environment in which the AI object is located; then extracting feature information of the first real-time status information; next, obtaining an action strategy of the AI object according to the feature information and the weight value of the learning network; next, feeding back the action strategy to the AI object so that the AI object executes the action strategy; next, acquiring second real-time status information of a second environment in which the AI object is located, the second environment being an environment after the AI object executes the action strategy; then, obtaining a return value of the action strategy according to the second real-time status information; if the return value meets a preset condition, determining that the weight value of the learning network is the target weight value of the learning network; and finally, establishing a behavior model of the AI object according to the target weight value.
  • In this way, the target weight value of the behavior model of the AI object is obtained according to the real-time status information of the environment, with the operation repeated over the samples.
  • the specific situation is as follows, please refer to the following description.
  • FIG. 2 illustrates a behavior model optimization method of an AI object in one embodiment of the present application.
  • the method includes:
  • the first real-time status information of the first environment is obtained by the server, and the first environment may be a game environment sample set.
  • the server is a computer having a deep learning network, and the computer may have a display function.
  • the first environment may be a first game environment, the first game environment including at least one of the game AI object, the game player, and the game scene.
  • the first real-time status information is a picture, and the picture includes the first game environment.
  • For example, the first real-time status information is as shown in FIG. 3. If the game is a Snake game, the game AI object in the first real-time status information is the snake represented by "visitor 408247928" in the game; the game player in the first real-time status information is the snake represented by "biubiubiu" in the game; and each scattered small dot in the first real-time status information is food in the game scene.
  • the server may adopt the following manner when acquiring the first real-time status information of the first game environment:
  • In one embodiment, the server may obtain valid data in the first game environment, where the valid data includes at least one of a role parameter of the game AI object, a position parameter of the game AI object, a role parameter of the game player character, a position parameter of the game player character, and a game scene parameter; the valid data is extracted from the game environment within an area of a preset size centered on a preset part of the game AI. The server then draws a two-dimensional data matrix as the first real-time status information according to the valid data, and the two-dimensional data matrix is represented as an image, that is, the first real-time status information exists as a picture.
  • For example, as shown in FIG. 3, the valid data acquired by the server at this time includes: the length of the snake represented by "visitor 408247928" is 33; the number of kills of the snake represented by "visitor 408247928" is 0; the length of the snake represented by "biubiubiu" is 89; the position parameter of the snake represented by "biubiubiu" indicates that it is located at the lower right of the screen; the position parameter of the snake represented by "visitor 408247928" indicates that it is located in the middle of the screen; and the position parameters of each piece of food.
  • The server then redraws a two-dimensional data matrix (i.e., the first real-time status information) similar to FIG. 3 according to the valid data: the server may assign a color value to each piece of valid data and then draw the two-dimensional data matrix according to those color values. In practical applications, for simplicity, the server usually uses grayscale images, with different objects using different values. For example, in the Snake game shown in FIG. 3, the open space is neutral and gray; the game AI object (the snake represented by "visitor 408247928") is also neutral and gray; the border and the game player characters (i.e., the snake represented by "biubiubiu") are "not good" and are black; and the food (i.e., the scattered dots in FIG. 3) is "good" and white.
  • the specific code can be drawn as follows:
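  • A minimal sketch of this drawing step in Python follows the gray/black/white scheme described above; the helper name draw_state, the dictionary layout of valid_data, and the exact gray values are assumptions for illustration rather than the patent's actual code:

```python
import numpy as np

NEUTRAL, BAD, GOOD = 128, 0, 255   # gray = neutral, black = "not good", white = "good"

def draw_state(valid_data, size=80):
    """Render valid data of the game environment into a 2-D grayscale matrix (a picture)."""
    state = np.full((size, size), NEUTRAL, dtype=np.uint8)       # open space: neutral gray
    for x, y in valid_data.get("ai_body", []):                   # game AI object: also neutral gray
        state[y, x] = NEUTRAL
    for x, y in valid_data.get("border", []) + valid_data.get("players", []):
        state[y, x] = BAD                                         # border / player snakes: black
    for x, y in valid_data.get("food", []):                       # food: white
        state[y, x] = GOOD
    return state
```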
  • the server may directly obtain a screenshot image of the first game environment, and directly use the screenshot image as the first real-time status information.
  • the server can directly obtain FIG. 3, and then use FIG. 3 as the first real-time status information of the first game environment.
  • It should be noted that the server is a device having computing resources; as long as it can establish the behavior model of the game AI object, its specific form is not limited herein. The server may run the game and also establish the behavior model of the game AI object, or the server may be used only to establish the behavior model of the game AI object.
  • The game environment can run on a terminal device (such as a mobile phone or a tablet) or directly on the server. When the game runs on the terminal device, the server receives the first real-time status information of the first game environment sent by the terminal device during the running of the game; when the game runs on the server, the server directly collects the first real-time status information of the first game environment during the running of the game.
  • In the process of establishing the behavior model, the first game environment may be a pre-set game environment sample set, or may be a game environment sample set in real-time operation; the specific manner is not limited herein.
  • When the game runs on a terminal device, there may be multiple terminal devices or a single terminal device, which is not limited herein.
  • After obtaining the first real-time status information, the server extracts feature information of the first real-time status information.
  • the feature information is multi-dimensional information.
  • the server may extract the feature information of the first real-time state information by using a Convolutional Neural Network (CNN).
  • A CNN is a feedforward neural network whose artificial neurons respond to surrounding units within part of their coverage area; it performs excellently on large-scale image processing.
  • the CNN consists of one or more convolutional layers and a fully connected layer at the top (corresponding to a classical neural network), and also includes associated weights and a pooling layer. This structure enables CNN to take advantage of the two-dimensional structure of the input data.
  • the convolution kernel of the convolutional layer in the CNN convolves the image. The convolution is to use a filter with a specific parameter to scan the image and extract the feature values of the image.
  • the specific process of the server using the CNN to extract the feature information of the first real-time status information may be as follows:
  • the server delivers the first real-time status information to a preset number of convolution layers in the CNN in a preset format; then the server extracts the feature value of the first real-time status information according to the preset number of convolution layers, where The feature value is local feature information of the first real-time state information; then, the server performs dimension reduction on the feature value through the pooling layer in the CNN to obtain a dimension-reducing feature value, and the dimension-reducing feature value is two Dimension data; finally, the server modifies the dimensionality reduction feature value into one-dimensional data through the matrix variable-dimensional Reshape function, and then uses the one-dimensional data as the feature information.
  • the server extracts the feature information of the first real-time status information through the CNN as shown in FIG. 6:
  • the server extracts the feature values of the real-time status information in multiple manners.
  • In one manner, the server extracts in real time: each time it obtains one piece of real-time status information (which, as described above, may be in the form of an image), it extracts the feature value of that real-time status information. That is, the server uses a single piece of real-time status information as the first real-time status information.
  • In another manner, the server first obtains a real-time status information set, where the real-time status information set includes a preset number of pieces of real-time status information. That is, the server uses the real-time status information set as the first real-time status information; the server then extracts the feature value of the real-time status information set. Whenever the server obtains a new piece of real-time status information, the server discards the earliest acquired piece from the real-time status information set and adds the newly acquired piece to the set; the server then extracts the feature value of the modified real-time status set.
  • This embodiment will be described by way of example in this manner.
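  • The real-time status information set described above can be kept as a sliding window of the most recent frames; a minimal sketch, assuming the preset number is 4 and using illustrative names, is:

```python
from collections import deque
import numpy as np

frame_set = deque(maxlen=4)               # the oldest frame is discarded automatically

def update_state_set(new_frame):
    """Add the newly acquired frame; return the stacked 80*80*4 input once the set is full."""
    frame_set.append(new_frame)
    if len(frame_set) < frame_set.maxlen:
        return None                        # not enough history collected yet
    return np.stack(frame_set, axis=-1)    # shape (80, 80, 4)
```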
  • The server delivers the acquired first real-time status information to the convolutional layers in the CNN in an 80*80 pixel format, where the first real-time status information includes four grayscale images generated by extracting valid data four consecutive times.
  • The 80*80 pixels are used to represent a small area around the head of the game AI, in order to reduce input complexity. It can be understood that the pixel size can be preset; the specific value is not limited herein.
  • The server extracts the first feature value of the first real-time status information by using the first convolutional layer in the CNN, with a convolution kernel of 3*3 pixels, a depth of 4, 32 convolution kernels, and a convolution stride of 1. The server extracts the second feature value from the first feature value by using the second convolutional layer in the CNN, with a convolution kernel of 3*3 pixels, a depth of 32, 32 convolution kernels, and a convolution stride of 1. The server extracts the third feature value from the second feature value by using the third convolutional layer in the CNN, with a convolution kernel of 3*3 pixels, a depth of 32, 32 convolution kernels, and a convolution stride of 1. In each of the first three convolutional layers, the server also reduces the data by 2*2 pixels through the pooling layer. Then, the server extracts the fourth feature value from the third feature value by using the fourth convolutional layer in the CNN, with a convolution kernel of 3*3 pixels, a depth of 32, 64 convolution kernels, and a convolution stride of 1. Finally, the server extracts the fifth feature value from the fourth feature value by using the fifth convolutional layer in the CNN, with a convolution kernel of 3*3 pixels, a depth of 64, 64 convolution kernels, and a convolution stride of 1; the fifth feature value is used as the dimension-reduction feature value of the first real-time status information.
  • Specifically, when the first real-time status information passes through the first convolutional layer and the pooling layer, the server obtains two-dimensional data with a feature value of 40*40*32; passing the 40*40*32 feature value through the second convolutional layer and the pooling layer yields a feature value of 20*20*32; passing the 20*20*32 feature value through the third convolutional layer and the pooling layer yields a feature value of 10*10*32; passing the 10*10*32 feature value through the fourth convolutional layer yields a feature value of 10*10*64; and passing the 10*10*64 feature value through the fifth convolutional layer yields a feature value of 10*10*64.
  • The server then transforms the 10*10*64 feature value by Reshape (i.e., from two-dimensional data to one-dimensional data) to obtain 6400*1 one-dimensional data.
  • The one-dimensional data is used as the feature information of the first real-time status information.
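  • The five convolutional layers, pooling and Reshape described above can be sketched as follows; PyTorch is an assumption (the patent does not name a framework), and padding of 1 is assumed so that the 3*3 convolutions preserve the spatial size and the dimensions match the text:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """CNN that maps a stack of four 80*80 grayscale frames to 6400 one-dimensional features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 40*40*32
            nn.Conv2d(32, 32, 3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 20*20*32
            nn.Conv2d(32, 32, 3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 10*10*32
            nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.ReLU(),                   # -> 10*10*64
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(),                   # -> 10*10*64
        )

    def forward(self, x):                  # x: (batch, 4, 80, 80)
        x = self.net(x)
        return x.reshape(x.size(0), -1)    # Reshape to one-dimensional data: 6400 values
```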
  • The server obtains the action strategy of the AI object by using the feature information and the weight value of the learning network, where the weight value is the weight value of each connection layer in the fully connected network of the learning network.
  • Specifically, the one-dimensional data obtained by the CNN through Reshape is input to the fully connected network of the learning network; the one-dimensional data is then evaluated through the connection layers of the fully connected network, and finally the action strategy of the game AI object is output.
  • the learning network may be a deep reinforcement learning network.
  • Deep learning is a branch of machine learning that attempts to perform high-level abstraction of data using multiple processing layers consisting of complex structures or multiple nonlinear transforms. Deep learning is a method based on the representation of data in machine learning. Observations (e.g., an image) can be represented in a variety of ways, such as a vector of each pixel intensity value, or more abstractly represented as a series of edges, regions of a particular shape, and the like. It is easier to learn tasks from instances (eg, face recognition or facial expression recognition) using some specific representation methods.
  • Reinforcement learning is a branch of machine learning that emphasizes how to act based on the environment to maximize the expected benefits. Under the stimulation of the reward or punishment given by the environment, the organism gradually forms the expectation of the stimulus and produces the habitual behavior that can obtain the maximum benefit.
  • This method is universal, so it is studied in many other fields, such as game theory, cybernetics, operations research, information theory, simulation optimization, multi-agent system learning, swarm intelligence, statistics, and genetic algorithms.
  • In the operations research and control literature, reinforcement learning is also called "approximate dynamic programming" (ADP). This problem has also been studied in optimal control theory, although most of that research concerns the existence and characteristics of optimal solutions, rather than learning or approximation.
  • In economics and game theory, reinforcement learning is used to explain how equilibrium arises under conditions of bounded rationality.
  • In the embodiment of the present application, the action strategy of the game AI object may involve only direction control, giving the 8 quantized directions shown in FIG. 7; further considering whether to accelerate, the fully connected layer may have 16 output nodes.
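  • The fully connected part of the learning network can then be sketched as below; the hidden size of 512 is an assumption, while the 16 output nodes follow the text (8 quantized directions, each with or without acceleration):

```python
import torch.nn as nn

policy_head = nn.Sequential(
    nn.Linear(6400, 512),   # one-dimensional feature information from the CNN
    nn.ReLU(),
    nn.Linear(512, 16),     # one score (Q-value) per action strategy
)
```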
  • the action strategy is fed back to the AI object, so that the AI object executes the action policy.
  • the server feeds back the action policy to the AI object such that the AI object executes the action policy.
  • If the game is running on a terminal device, the server needs to feed back the action strategy directly to the terminal device, and the terminal device then controls the behavior of the game AI object. If the game is running on the server and the server is a computer, as shown in FIG. 5, the computer can directly acquire the action strategy of the game AI object and then control the behavior of the game AI object.
  • S205 Acquire second real-time status information of the second environment, where the second environment is an environment after the AI object executes the action policy.
  • the server acquires second real-time status information of the second environment in real time.
  • The manner in which the server acquires the second real-time status information of the second game environment is the same as the manner in which the server acquires the first real-time status information of the first game environment in step S201, and is not repeated here.
  • the server obtains a return value of the action policy according to the second real-time status information.
  • the server After obtaining the second real-time status information, the server obtains a reward value after the game AI object executes the action policy according to the second real-time status information.
  • The return value is used to indicate how the environment has changed after the game AI object executes the action strategy.
  • The greater the return value, the more the environment has changed in the expected direction.
  • The server may calculate the reward value after the game AI object executes the action strategy as Q(S_t, a) = R_{t+1} + γ·max_a Q(S_{t+1}, a), where Q is the reward value after the game AI object executes the action strategy, R_{t+1} is the return value after the number of iterations is increased once, γ is a preset coefficient, S_{t+1} is the state after the number of iterations is increased once, and a is the action strategy.
  • The reward value can be adjusted by calculating its loss function L(w) = E[(r + γ·max_{a'} Q(s', a', w) - Q(s, a, w))²], where L is the loss in the calculation process, E denotes the expectation, r is the feedback value, γ is the attenuation coefficient, s' is the real-time status information at the next moment, a' is the action strategy at the next moment, w is the weight of the current network, s is the current real-time status information, and a is the current action strategy.
  • the server can also use the DQN algorithm to train the learning network.
  • the specific pseudo code is as follows:
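  • A generic DQN training step consistent with the loss described above is sketched below; the replay memory size, learning rate, batch size and storage format are assumptions, and the sketch reuses the FeatureExtractor and policy_head from the earlier sketches rather than reproducing the patent's original pseudo code:

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

memory = deque(maxlen=50000)     # replay memory of (s, a, r, s', done), stored as tensors
q_net = torch.nn.Sequential(FeatureExtractor(), policy_head)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
gamma = 0.99                     # attenuation coefficient

def train_step(batch_size=32):
    """One update minimising L(w) = E[(r + gamma*max_a' Q(s',a',w) - Q(s,a,w))^2]."""
    s, a, r, s2, done = map(torch.stack, zip(*random.sample(memory, batch_size)))
    with torch.no_grad():
        target = r + gamma * q_net(s2).max(dim=1).values * (1 - done)
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a; w) for taken actions
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```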
  • r is a feedback value
  • the game AI object changes the environment at the next moment after executing the action strategy, and the change is reflected as a feedback value.
  • For the Snake game, r can be set by quantifying the snake's own length increase, its number of kills, and whether it has died into a single value.
  • The effect of the feedback value can be set such that the larger the feedback value, the better the change; it can also be set such that the smaller the feedback value, the better the change.
  • the specific setting method here is not limited.
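  • One possible quantification of the feedback value r for the Snake game is sketched below; the weights and the sign convention are assumptions, since the text only states that length growth, kills and death are combined into a value and that the direction of "better" is not limited:

```python
def feedback_value(length_delta, kills_delta, died, w_len=1.0, w_kill=10.0, w_death=100.0):
    """Combine length increase, kills and death into a single feedback value r."""
    r = w_len * length_delta + w_kill * kills_delta
    if died:
        r -= w_death    # dying makes the change "worse" under this (assumed) convention
    return r
```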
  • When outputting actions, a random preset value is generally set before referring to the action strategy output by the learning network: when a randomly generated value is higher than the random preset value, the server adopts the action strategy output by the learning network; when the randomly generated value is lower than the random preset value, a random action is used.
  • The random preset value may be designed as a dynamically changing value. In the process of establishing the behavior model of the game AI object, the random preset value decreases exponentially as training progresses; finally, the random preset value converges to a very small value, at which point the action strategy of the game AI object is essentially equivalent to the output of the learning network.
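  • The exploration scheme described above, with an exponentially decreasing random preset value, can be sketched as follows; the initial value, floor and decay rate are assumptions:

```python
import random

epsilon, EPS_END, EPS_DECAY = 1.0, 0.01, 0.999   # initial value, floor, per-step decay

def select_action(q_values, eps):
    """Take a random action with probability eps, otherwise use the learning network's output."""
    if random.random() < eps:
        return random.randrange(len(q_values))                    # random exploratory action
    return max(range(len(q_values)), key=lambda i: q_values[i])   # action strategy from the network

def decay(eps):
    """Exponentially reduce the random preset value toward a very small floor."""
    return max(EPS_END, eps * EPS_DECAY)
```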
  • Step S207: Determine whether the return value meets a preset condition. If yes, go to step S208; if not, go to step S210.
  • After obtaining the return value of the action strategy, the server determines whether the return value no longer changes. If so, the server determines that this return value is the return value of the optimal action strategy and performs step S208. If the return value is still changing, step S210 is performed.
  • the server determines that the weight of the learning network is a target weight value of the learning network.
  • the AI object behavior model is established according to the target weight value.
  • Specifically, the server sets the weight value of the fully connected network of the learning network to the target weight value, thereby completing the establishment of the behavior model of the AI object.
  • the server modifies the weight value of the fully connected network of the learning network according to the rule for obtaining the maximum Q value, and repeats the steps S201 to S207 described above until the condition for performing step S208 to step S209 is satisfied.
  • Modifying the weight value changes the way in which the connection layers of the learning network evaluate the one-dimensional data.
  • the server makes a corresponding decision on the action strategy of the AI object according to the real-time change of the environment, which can improve the flexibility of the AI.
  • Moreover, because the extracted feature information is multi-dimensional, its dimension is higher than that of the feature information extracted by a behavior tree, and the action strategy obtained after learning through the learning network is more refined, which further improves the flexibility of the game AI.
  • FIG. 8 illustrates a method for controlling an AI object in one embodiment of the present application. It should be noted that although the control method of the AI object of the present application is described by taking a game as an example in the embodiment, those skilled in the art can understand that the control method of the AI object of the present application is not limited to the game.
  • the method includes:
  • the AI control device obtains real-time status information of the operating environment.
  • the running environment is a game environment, and the game environment includes at least one of the game AI object, the game player, and the game scene.
  • the real-time status information is a picture containing the game environment.
  • For example, the real-time status information is as shown in FIG. 9. If the game is a Snake game, the game AI object in the real-time status information is the snake represented by "visitor 408247928" in the game; the game player in the real-time status information is the snake represented by "Zuo Ruo Ting Lan" in the game; and the scattered dots in the real-time status information are the food in the game scene.
  • the AI control device may be a game AI control device.
  • the game AI control device may be a terminal device running the game, or may be a server independent of the terminal device running the game, as long as it stores the behavior model of the game AI object, the specific manner is not here. Make a limit.
  • the game AI control device may adopt the following manner when acquiring real-time status information of the game environment:
  • In one embodiment, the game AI control device may acquire valid data in the game environment, where the valid data includes at least one of a role parameter of the game AI object, a position parameter of the game AI object, a role parameter of the game player character, a position parameter of the game player character, and a game scene parameter, wherein the valid data is extracted from the game environment within an area of a preset size centered on a preset part of the game AI.
  • The game AI control device then draws a two-dimensional data matrix as the real-time status information according to the valid data, and the two-dimensional data matrix is represented as an image, that is, the real-time status information exists as a picture. For example, as shown in FIG. 9, the valid data acquired by the game AI control device at this time includes: the length of the snake represented by "visitor 408247928" is 43; the number of kills of the snake represented by "visitor 408247928" is 0; the length of the snake represented by "Zuo Ruo Ting Lan" is 195; the position parameter of the snake represented by "Zuo Ruo Ting Lan" indicates the lower left of the screen; the position parameter of the snake represented by "visitor 408247928" indicates the middle of the screen; and the position parameters of each piece of food.
  • The game AI control device then redraws a two-dimensional data matrix (i.e., the real-time status information) similar to FIG. 9 according to the valid data: the game AI control device can assign a color value to each piece of valid data and then draw the two-dimensional data matrix according to those color values.
  • In practical applications, for simplicity, the game AI control device can usually use a grayscale image, with different objects using different values: the open space is neutral and gray; the game AI object (the snake represented by "visitor 408247928") is also neutral and gray; the border and the game player characters (i.e., the snake represented by "Zuo Ruo Ting Lan") are "not good" and are black; and the food (i.e., the small dots scattered in FIG. 9) is "good" and white.
  • The specific drawing code is the same as that described above for the first game environment.
  • the game AI control device may directly acquire a screenshot picture of the game environment, and directly use the screenshot picture as the real-time status information.
  • the game AI control device can directly obtain FIG. 9 and then use FIG. 9 as real-time status information of the game environment.
  • It should be noted that the game AI control device is a device having computing resources; as long as it stores the behavior model of the game AI object, its specific form is not limited herein. The game AI control device may run the game and also hold the behavior model of the game AI object, or the game AI control device may be used only to store the behavior model of the game AI object.
  • the game environment can occur on a terminal device (such as a mobile phone, a tablet, etc.) or directly on the game AI control device.
  • the game AI control device receives real-time status information of the game environment during the running of the game sent by the terminal device; when the game is run on the game AI control device, The game AI control device directly collects real-time status information of the game environment during the running of the game.
  • It should be noted that, in the process of establishing the behavior model, the game environment may be a pre-set game environment sample set, or may be a game environment sample set in real-time operation; the specific manner is not limited herein.
  • When the game runs on a terminal device, there may be multiple terminal devices or a single terminal device, which is not limited herein.
  • the AI control device After acquiring the real-time status information, the AI control device extracts feature information of the real-time status information.
  • the feature information is multi-dimensional information.
  • the AI control device may extract the feature information of the real-time state information by using the CNN.
  • the specific process of the AI control device adopting the CNN to extract the feature information of the real-time status information may be as follows:
  • The AI control device transmits the real-time status information to a preset number of convolutional layers in the CNN in a preset format; the AI control device then extracts the feature value of the real-time status information through the preset number of convolutional layers, where the feature value is local feature information of the real-time status information; next, the AI control device reduces the dimension of the feature value through the pooling layer in the CNN to obtain a dimension-reduction feature value, which is two-dimensional data; finally, the AI control device changes the dimension-reduction feature value into one-dimensional data through the matrix dimension-changing Reshape function and uses the one-dimensional data as the feature information.
  • the game AI control device extracts the feature information of the real-time status information through the CNN as shown in FIG. 4:
  • The game AI control device can extract the feature value of the real-time status information in a plurality of manners.
  • In one manner, the game AI control device extracts in real time, that is, extracts the feature value of one piece of real-time status information whenever it acquires that piece. In other words, the game AI control device uses a single piece of real-time status information as the real-time status information.
  • In another manner, the game AI control device first acquires a real-time status information set, where the real-time status information set includes a preset number of pieces of real-time status information. That is, the game AI control device uses the real-time status information set as the real-time status information; the game AI control device then extracts the feature value of the real-time status information set. Whenever the game AI control device acquires a new piece of real-time status information, it discards the earliest acquired piece from the real-time status information set and adds the newly acquired piece to the set; the game AI control device then extracts the feature value of the modified real-time status set. This embodiment will be described by way of example in this manner, specifically as shown in FIG. 6:
  • The game AI control device transmits the acquired real-time status information to the convolutional layers in the CNN in an 80*80 pixel format, where the real-time status information includes four grayscale images generated by extracting valid data four consecutive times.
  • The 80*80 pixels are used to represent a small area around the head of the game AI, in order to reduce input complexity. It can be understood that the pixel size can be preset; the specific value is not limited herein.
  • The game AI control device extracts the first feature value of the real-time status information by using the first convolutional layer in the CNN, with a convolution kernel of 3*3 pixels, a depth of 4, 32 convolution kernels, and a convolution stride of 1. The game AI control device extracts the second feature value from the first feature value by using the second convolutional layer in the CNN, with a convolution kernel of 3*3 pixels, a depth of 32, 32 convolution kernels, and a convolution stride of 1. The game AI control device extracts the third feature value from the second feature value by using the third convolutional layer in the CNN, with a convolution kernel of 3*3 pixels, a depth of 32, 32 convolution kernels, and a convolution stride of 1. In each of the first three convolutional layers, the game AI control device also reduces the real-time status information by 2*2 pixels through the pooling layer. Then, the game AI control device extracts the fourth feature value from the third feature value by using the fourth convolutional layer in the CNN, with a convolution kernel of 3*3 pixels, a depth of 32, 64 convolution kernels, and a convolution stride of 1. Finally, the game AI control device extracts the fifth feature value from the fourth feature value by using the fifth convolutional layer in the CNN, with a convolution kernel of 3*3 pixels, a depth of 64, 64 convolution kernels, and a convolution stride of 1; the fifth feature value is used as the dimension-reduction feature value of the real-time status information.
  • Specifically, when the real-time status information passes through the first convolutional layer and the pooling layer, the game AI control device obtains two-dimensional data with a feature value of 40*40*32; passing the 40*40*32 feature value through the second convolutional layer and the pooling layer yields a feature value of 20*20*32; passing the 20*20*32 feature value through the third convolutional layer and the pooling layer yields a feature value of 10*10*32; passing the 10*10*32 feature value through the fourth convolutional layer yields a feature value of 10*10*64; and passing the 10*10*64 feature value through the fifth convolutional layer yields a feature value of 10*10*64.
  • The game AI control device then transforms the 10*10*64 feature value by Reshape (i.e., from two-dimensional data to one-dimensional data) to obtain 6400*1 one-dimensional data.
  • the one-dimensional data is used as the feature information of the real-time status information.
  • The AI control device obtains the action strategy of the game AI object by using the feature information and the weight value of the learning network, where the weight value is the weight value of each connection layer in the fully connected network of the learning network.
  • Specifically, the one-dimensional data obtained by the CNN through Reshape is input to the fully connected layer of the learning network; the one-dimensional data is then evaluated through the connection layers of the fully connected layer, and finally the action strategy of the game AI object is output.
  • In the embodiment of the present application, the action strategy of the game AI object may involve only direction control, giving the 8 quantized directions shown in FIG. 7; further considering whether to accelerate, the fully connected layer may have 16 output nodes.
  • the action strategy is fed back to the AI object, so that the AI object executes the action policy.
  • the AI control device feeds back the action policy to the AI object to cause the AI object to execute the action policy.
  • For example, the snake represented by "visitor 408247928" turns toward the area in FIG. 10 where the food is densely distributed and starts to swallow the food.
  • If the game AI control device is a server independent of the terminal device running the game, the game AI control device needs to feed back the action strategy directly to the terminal device, and the terminal device then controls the behavior of the game AI object. If the game AI control device is the terminal device running the game, the game AI control device may directly acquire the action strategy of the game AI object and then control the behavior of the game AI object.
  • In the embodiment of the present application, after acquiring the real-time status information of the game environment, the game AI control device extracts the multi-dimensional feature information of the real-time status information, then obtains the action strategy of the game AI object according to the multi-dimensional feature information and the weight value of the learning network, and finally feeds back the action strategy to the game AI object so that the game AI object executes the action strategy.
  • This application makes corresponding decisions based on real-time changes in the environment, which can improve the flexibility of the AI.
  • In addition, because the extracted feature information is multi-dimensional, its dimension is higher than that of the feature information extracted by a behavior tree, and the action strategy obtained after learning through the learning network is more refined, which further improves the flexibility of the game AI.
  • In one scenario, the game runs on a terminal device or on the server, and the terminal device or the server extracts real-time status information of the game environment; the server then pre-processes the real-time status information, that is, extracts the multi-dimensional feature information; the server obtains the action strategy of the game AI object by using the multi-dimensional feature information and the learning network; the server acquires the real-time status information of the game environment after the game AI object executes the action strategy; the server calculates the return value of the action strategy according to the real-time status information of the game environment after the game AI object executes the action strategy; and the server adjusts the weight value of the learning network according to the return value.
  • In another scenario, the game runs on a terminal device or on the server, and the terminal device or the server extracts real-time status information of the game environment; the terminal device or the server then pre-processes the real-time status information, that is, extracts the multi-dimensional feature information; the terminal device or the server obtains the action strategy of the game AI object by using the multi-dimensional feature information and the learning network; the terminal device or the server feeds back the action strategy to the game AI object; and the game AI object executes the action strategy (a single control step along these lines is sketched below).
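  • A single control step in the second scenario could look like the sketch below, reusing the draw_state, update_state_set and network sketches above; the game_env interface and all other names are illustrative assumptions, not part of the patent:

```python
import torch

def control_step(game_env, behavior_model):
    """Map real-time status information to an action strategy and feed it back to the AI object."""
    raw = game_env.get_valid_data()                    # from the terminal device or the local game
    frame = draw_state(raw)                            # 80*80 grayscale picture
    state = update_state_set(frame)                    # stack of the latest 4 frames
    if state is None:
        return None                                    # wait until enough frames are collected
    x = torch.from_numpy(state).float().permute(2, 0, 1).unsqueeze(0)   # (1, 4, 80, 80)
    with torch.no_grad():
        action = int(behavior_model(x).argmax(dim=1))  # one of the 16 action strategies
    game_env.execute(action)                           # feed the action strategy back to the AI object
    return action
```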
  • the method embodiment in the embodiment of the present application has been described above.
  • the following describes the behavior model establishing device and the AI control device of the AI object in the embodiment of the present application.
  • the behavior model building device of the AI object may be a server.
  • an embodiment of the behavior model building apparatus for an AI object in the embodiment of the present application includes:
  • the obtaining module 1301 is configured to acquire first real-time status information of the first environment where the AI object is located;
  • the processing module 1302 is configured to extract feature information of the first real-time state information acquired by the acquiring module 1301, and obtain an action policy of the AI object according to the feature information and the weight value of the learning network;
  • the feedback module 1303 is configured to feed back the action policy obtained by the processing module 1302 to the AI object, so that the AI object executes the action policy;
  • the obtaining module 1301 is configured to acquire second real-time status information of the second environment where the AI object is located, where the second environment is an environment after the AI object executes the action policy;
  • The processing module 1302 is configured to obtain a return value of the action strategy according to the second real-time status information acquired by the acquiring module 1301; if the return value meets a preset condition, determine that the weight value of the learning network is the target weight value of the learning network; and establish the behavior model of the AI object according to the target weight value.
  • The processing module 1302 is specifically configured to: deliver the first real-time status information to a preset number of convolutional layers in a preset format; extract a dimension-reduction feature value of the first real-time status information through the pooling layer and the preset number of convolutional layers, the dimension-reduction feature value being two-dimensional data; and change the dimension-reduction feature value into one-dimensional data, the one-dimensional data being used as the feature information.
  • The preset format is a picture with a length and a width of 80 pixels, the preset number is 5, the convolution kernel of each convolutional layer has a length and a width of 3 pixels and a convolution stride of 1, and the dimensionality reduction of the pooling layer is set such that the maximum value within an area whose length and width are both 2 pixels is selected as the dimension-reduction feature value.
  • the processing module 1302 is further configured to modify a weight value of the learning network if the reward value does not meet the preset condition.
  • the AI object is a game AI object
  • the environment is a game environment
  • The acquiring module 1301 is configured to acquire valid data of the first game environment, where the valid data includes at least one of a role parameter of the game AI object, a position parameter of the game AI object, a role parameter of the game player character, a position parameter of the game player character, and a game scene parameter; and to draw a two-dimensional data matrix as the first real-time status information according to the valid data, the two-dimensional data matrix representing an image.
  • The acquiring module 1301 is configured to obtain a color value corresponding to each game object in the valid data, where the color value is used to represent the color of each game object in the game environment, and the game objects include the game AI object, the player game character and the game scene; the two-dimensional data matrix is drawn as the first real-time status information according to the color value corresponding to each game object.
  • the obtaining module 1301 is configured to acquire a screenshot image of the first game environment as the first real-time status information.
  • the learning network is a deep reinforcement learning network
  • the algorithm for the deep reinforcement learning network includes a Q-learning algorithm or a DQN algorithm.
  • the feature information is multi-dimensional information
  • the first environment is a game environment sample set, where the game environment sample set includes a player real-time operating game environment and a preset game environment.
  • In the embodiment of the present application, after the acquiring module 1301 acquires the first real-time status information of the first game environment, the processing module 1302 extracts the multi-dimensional feature information of the first real-time status information and obtains the action strategy of the game AI object according to the multi-dimensional feature information and the weight value of the learning network; the feedback module 1303 then feeds back the action strategy to the game AI object so that the game AI object executes the action strategy; the acquiring module 1301 acquires the second real-time status information of the second game environment after the game AI object executes the action strategy; and the processing module 1302 calculates the return value of the action strategy according to the second real-time status information, determines that the weight value of the learning network is the target weight value when the return value meets a preset condition, and establishes the behavior model of the game AI object according to the target weight value.
  • This application makes corresponding decisions based on real-time changes in the environment, which can improve the flexibility of the AI.
  • In addition, because the extracted feature information is multi-dimensional, its dimension is higher than that of the feature information extracted by a behavior tree, and the action strategy obtained after learning through the learning network is more refined, which further improves the flexibility of the game AI.
  • FIG. 14 another embodiment of the server in this embodiment of the present application includes:
  • the transceiver 1401 is connected to the processor 1402 via the bus 1403;
  • the bus 1403 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus.
  • the bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in Figure 14, but it does not mean that there is only one bus or one type of bus.
  • the processor 1402 may be a central processing unit (CPU), a network processor (NP) or a combination of a CPU and an NP.
  • the processor 1402 may further include a hardware chip.
  • the hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the server may also include a memory 1404.
  • The memory 1404 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 1404 may also include a combination of the above types of memories.
  • The memory 1404 may also be used to store program instructions, and the processor 1402 may call the program instructions stored in the memory 1404 to perform one or more steps in the embodiments shown in FIG. 2 to FIG. 7 and FIG. 11, or optional implementations thereof, to implement the functions of the server in the foregoing methods.
  • The transceiver 1401 performs the following step: acquiring first real-time status information of the first environment in which the AI object is located.
  • The processor 1402 performs the following steps: extracting feature information of the first real-time status information; and obtaining an action policy of the AI object according to the feature information and the weight values of the learning network.
  • The transceiver 1401 also performs the following steps: feeding the action policy back to the AI object so that the AI object executes the action policy; and acquiring second real-time status information of the second environment in which the AI object is located, the second environment being generated after the AI object executes the action policy.
  • The processor 1402 also performs the following steps: obtaining a return value of the action policy according to the second real-time status information; if the return value meets a preset condition, determining the weight values of the learning network as the target weight values of the learning network; and establishing a behavior model of the AI object according to the target weight values.
  • In this embodiment, the transceiver 1401 also performs all data transceiving steps, and the processor 1402 also performs all data processing steps in the foregoing embodiments.
  • In this embodiment, after the transceiver 1401 acquires the first real-time status information of the first game environment, the processor 1402 extracts the multi-dimensional feature information of the first real-time status information and obtains an action policy of the game AI object according to the multi-dimensional feature information and the weight values of the learning network; the transceiver 1401 then feeds the action policy back to the game AI object so that the game AI object executes the action policy; the transceiver 1401 then acquires the second real-time status information of the second game environment after the game AI object has executed the action policy, and the processor 1402 calculates a return value of the action policy according to the second real-time status information, determines the weight values of the learning network as the target weight values when the return value meets a preset condition, and establishes the behavior model of the game AI object according to the target weight values.
  • This application makes corresponding decisions based on real-time changes in the environment, which can improve the flexibility of the AI.
  • Moreover, because the extracted feature information is multi-dimensional and has a higher dimensionality than the feature information extracted by a behavior tree, the action policy obtained after learning through the learning network is more specific, which further improves the flexibility of the game AI.
  • As shown in FIG. 15, an embodiment of the AI control device (i.e., the AI object control device) in the embodiments of the present application includes:
  • the obtaining module 1501 is configured to acquire real-time status information of an environment in which the AI object is located;
  • the processing module 1502 is configured to extract feature information of the real-time state information, and obtain an action policy of the AI object according to the feature information and the weight value of the learning network, where the weight value of the learning network is a preset value;
  • the feedback module 1503 is configured to feed back the action policy obtained by the processing module to the AI object, so that the AI object executes the action policy.
  • Optionally, the processing module 1502 is specifically configured to: pass the real-time status information to a preset number of convolution layers in a preset format; extract the real-time status information through a pooling layer and the preset number of convolution layers to obtain a dimensionality-reduction feature value, where the dimensionality-reduction feature value is two-dimensional data; and convert the dimensionality-reduction feature value into one-dimensional data, where the one-dimensional data is used as the feature information.
  • Optionally, the AI object is a game AI object and the environment is a game environment. The processing module 1502 is specifically configured to extract valid data of the game environment, where the valid data includes at least one of: a role parameter of the game AI object, a position parameter of the game AI object, a role parameter of the player game character, a position parameter of the player game character, and a game scene parameter; and to draw a two-dimensional data matrix as the real-time status information according to the valid data, where the two-dimensional data matrix represents an image.
  • In this embodiment, after the obtaining module 1501 acquires the real-time status information of the game environment, the processing module 1502 extracts the multi-dimensional feature information of the real-time status information and obtains the action policy of the game AI object according to the multi-dimensional feature information and the weight values of the learning network; finally, the feedback module 1503 feeds the action policy back to the game AI object so that the game AI object executes the action policy.
  • This application makes corresponding decisions based on real-time changes in the environment, which can improve the flexibility of the AI.
  • Moreover, because the extracted feature information is multi-dimensional and has a higher dimensionality than the feature information extracted by a behavior tree, the action policy obtained after learning through the learning network is more specific, which further improves the flexibility of the game AI.
  • As shown in FIG. 16, another embodiment of the AI control device in the embodiments of the present application includes: a transceiver 1601, a processor 1602, and a bus 1603;
  • the transceiver 1601 is connected to the processor 1602 via the bus 1603;
  • The processor 1602 performs the following steps: acquiring real-time status information of the environment in which the AI object is located; extracting feature information of the real-time status information; obtaining an action policy of the AI object according to the feature information and the weight values of the learning network, where the weight values of the learning network are preset values; and feeding the action policy back to the AI object so that the AI object executes the action policy.
  • the bus 1603 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus.
  • the bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in Figure 16, but it does not mean that there is only one bus or one type of bus.
  • the processor 1602 can be a central processing unit (CPU), a network processor (NP) or a combination of a CPU and an NP.
  • the processor 1602 can also further include a hardware chip.
  • the hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the AI control device may further include a memory 1604.
  • The memory 1604 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 1604 may also include a combination of the above types of memories.
  • The memory 1604 may also be used to store program instructions, and the processor 1602 may call the program instructions stored in the memory 1604 to perform one or more steps in the embodiments shown in FIG. 8 to FIG. 10 and FIG. 12, or optional implementations thereof, to implement the functions of the AI control device in the foregoing methods.
  • The processor 1602 performs the following steps: acquiring real-time status information of the environment in which the AI object is located; extracting feature information of the real-time status information; obtaining an action policy of the AI object according to the feature information and the weight values of the learning network, where the weight values of the learning network are preset values; and feeding the action policy back to the AI object so that the AI object executes the action policy.
  • In this embodiment, the transceiver 1601 also performs all data transceiving steps, and the processor 1602 also performs all data processing steps in the foregoing embodiment.
  • In this embodiment, after the processor 1602 acquires the real-time status information of the game environment, the processor 1602 extracts the multi-dimensional feature information of the real-time status information and obtains the action policy of the game AI object according to the multi-dimensional feature information and the weight values of the learning network; finally, the processor 1602 feeds the action policy back to the game AI object so that the game AI object executes the action policy.
  • This application makes corresponding decisions based on real-time changes in the environment, which can improve the flexibility of the AI.
  • Moreover, because the extracted feature information is multi-dimensional and has a higher dimensionality than the feature information extracted by a behavior tree, the action policy obtained after learning through the learning network is more specific, which further improves the flexibility of the game AI.
  • In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative: the division of the units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • Based on such an understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application.
  • The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

A method and apparatus for establishing a behavior model of an AI object, capable of making corresponding decisions according to real-time changes in the environment and improving the flexibility of the game. The following technical solution is provided: acquiring first real-time status information of a first environment in which the AI object is located (S201); extracting feature information of the first real-time status information (S202); obtaining an action policy of the AI object according to the feature information and weight values of a learning network (S203); feeding the action policy back to the AI object so that the AI object executes the action policy (S204); acquiring second real-time status information of a second environment in which the AI object is located, the second environment being generated after the AI object executes the action policy (S205); obtaining a return value of the action policy according to the second real-time status information (S206); if the return value meets a preset condition, determining the weight values of the learning network as the target weight values of the learning network (S208); and establishing a behavior model of the AI object according to the target weight values (S209).

Description

一种AI对象行为模型优化方法以及装置 技术领域
本申请实施例涉及人工智能领域,尤其涉及一种人工智能(artificial intelligence,AI)对象行为模型优化方法以及装置。
背景技术
随着科技的发展,人们对于娱乐项目的要求越来越高,游戏行业也进入了高速发展的时代。在游戏场景中,游戏AI是很重要的一环,在复杂的游戏场景中,游戏AI可以加强游戏的娱乐性。
目前的游戏AI的训练方式通常是使用状态机或者是行为树。对于用行为树定模型构造的AI系统来说,每次执行AI时,系统都会从根节点遍历整个树,父节点执行子节点,子节点执行完后将结果返回父节点,然后父节点根据子节点的结果来决定接下来怎么做。在这种模型下,行为树可以方便地把复杂的AI知识条目组织得非常直观。默认的组合节点处理子节点的迭代方式就像是处理一个预设优先策略队列,也非常符合人类的正常思考模式即先最优再次优。同时该行为树的各种节点,包括叶子节点,可复用性都极高。
但是每次执行行为树定模型构造的AI系统时,系统都会从根节点遍历整个树,父节点执行子节点,子节点执行完后将结果返回父节点,然后父节点根据子节点的结果来决定接下来怎么做。这样导致游戏AI的行为有迹可循,行为模式固定。
发明内容
本申请实施例提供了一种AI对象的行为模型优化方法以及装置,使得AI根据环境的实时改变做出相应的决策,提高AI的灵活性。
本申请实施例的第一方面提供一种AI对象的行为模型建立方法,包括:
获取AI对象所处的第一环境的第一实时状态信息;
提取该第一实时状态信息的特征信息;
根据该特征信息和学习网络的权重值得到AI对象的动作策略,该学习网络的权重值为随机设置的权重值;
将该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略;
获取AI对象所处的第二环境的第二实时状态信息,该第二环境为该AI对象执行该动作策略之后的环境;
根据该第二实时状态信息得到该动作策略的回报值;
若该回报值符合预设条件,则确定该学习网络的权重值为该学习网络的目标权重值;
根据该目标权重值建立该AI对象的行为模型。
本申请实施例第二方面提供了一种AI对象的控制方法,包括:
获取AI对象所处的环境的实时状态信息;
提取该实时状态信息的特征信息;
根据该特征信息和学习网络的权重值得到AI对象的动作策略,该学习网络的权重值为预置数值;
将该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略。
第三方面,本申请实施例提供一种AI对象的行为模型建立装置,该装置可以是服务器,该服务器具有实现上述方法中服务器的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
一种可能实现方式中,该AI对象的行为模型建立装置包括:
获取模块,用于获取AI对象所处的第一环境的第一实时状态信息;
处理模块,用于提取该获取模块获取到的该第一实时状态信息的特征信息;根据该特征信息和学习网络的权重值得到游戏AI对象的动作策略;
反馈模块,用于将该处理模块得到的该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略;
该获取模块,用于获取AI对象所处的第二环境的第二实时状态信息,该第二环境为该AI对象执行该动作策略之后的环境;
该处理模块,用于根据该获取模块获取到的该第二实时状态信息得到该动作策略的回报值;若该回报值符合预设条件,则确定该学习网络的权重值为该学习网络的目标权重值;根据该目标权重值建立该AI对象的行为模型。
另一种可能实现方式中,该AI对象的行为模型建立装置包括:
一个或多个处理器;和存储器,所述存储器存储程序指令,所述指令当由所述一个或多个处理器执行时,配置所述装置执行本申请的AI对象的行为模型建立方法。
第四方面,本申请实施例提供一种AI对象控制设备,该AI对象控制设备具有实现上述方法中的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
一种可能实现方式中,该AI对象控制设备包括:
获取模块,用于获取AI对象所处的环境的实时状态信息;
处理模块,用于提取该实时状态信息的特征信息;根据该特征信息和学习网络的权重值得到AI对象的动作策略,该学习网络的权重值为预置数值;
反馈模块,用于将该处理模块得到的该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略。
另一种可能实现方式中,该AI对象控制设备包括:
一个或多个处理器;和
存储器,所述存储器存储程序指令,所述指令当由所述一个或多个处理器执行时,配置所述装置执行本申请的AI对象控制方法。
第五方面,本申请实施例提供一种计算机可读存储介质,包括指令,当该指令在计算装置的处理器上运行时,该装置执行上述各项方法。
第六方面,本申请实施例提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,该计算机执行上述各项方法。
本申请实施例提供的技术方案中,服务器在获取到第一游戏环境的第一实时状态信息之后,该服务器提取该第一实时状态信息的多维度特征信息,然后根据该多维度特征信息 和学习网络的权重值得到该游戏AI对象的动作策略;最后服务器将该动作策略反馈给该游戏AI对象,以使得该游戏AI对象执行该动作策略;同时该服务器获取该游戏AI对象执行了该动作策略之后的第二游戏环境的第二实时状态信息,并根据该第二实时状态信息计算该动作策略的回报值,并在该回报值符合预设条件时确定该学习网络的权重值为目标权重值,并根据该目标权重值建立该游戏AI对象的行为模型。本申请根据环境的实时改变做出相应的决策,可以提高灵活性。而且,由于该提取到的该特征信息的维度为多维度特征信息,比行为树提取的特征信息的维度高,在通过学习网络学习之后获得的动作策略就更具体,从而进一步提高了游戏AI的灵活性。
附图说明
图1为一个示例的行为树的示意图;
图2为本申请实施例中AI对象的行为模型优化方法的示意图;
图3为本申请实施例提供的贪吃蛇游戏的一个示意图;
图4为本申请实施例中AI对象的行为模型建立方法的一种模式示意图;
图5为本申请实施例中AI对象的行为模型建立方法的另一种模式示意图;
图6为本申请实施例中卷积神经网络提取实时状态信息的特征信息的一种示意图;
图7为本申请实施例提供的贪吃蛇游戏输出内容的示意图;
图8为本申请实施例中AI对象控制方法示意图;
图9为本申请实施例提供的贪吃蛇游戏的一个示意图;
图10为本申请实施例提供的贪吃蛇游戏的另一个示意图;
图11为本申请实施例中AI对象的行为模型建立方法流程图;
图12为本申请实施例中AI对象控制方法流程图;
图13为本申请实施例中服务器的一个实施例示意图;
图14为本申请实施例中服务器的另一个实施例示意图;
图15为本申请实施例中AI对象控制设备的一个实施例示意图;
图16为本申请实施例中AI对象控制设备的另一个实施例示意图。
具体实施方式
本申请实施例提供了一种AI对象的行为模型建立方法以及装置,使得AI根据环境的实时改变做出相应的决策,提高AI的灵活性。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
随着科技的发展,人们对于娱乐项目的要求越来越高,游戏行业也进入了高速发展的时代。在游戏场景中,游戏AI是很重要的一环,在复杂的游戏场景中,游戏AI可以加强 游戏的娱乐性。目前的游戏AI的训练方式通常是使用状态机或者是行为树。对于用行为树定模型构造的AI系统来说,每次执行AI时,系统都会从根节点遍历整个树,父节点执行子节点,子节点执行完后将结果返回父节点,然后父节点根据子节点的结果来决定接下来怎么做。如图1所示的一个行为树模型,其中,父节点为一个行为选择节点;该父节点的叶子节点为动作1;该子节点为一个顺序执行子节点;该子节点的叶子节点包括节点条件和动作2。在实际应用中,该行为树从根节点输入,然后通过父节点执行其顺序执行子节点,当该顺序执行子节点的各叶子节点(节点条件和动作2)执行成功时,该顺序执行子节点返回该父节点成功标志;当该顺序执行子节点中任一项叶子节点(节点条件和动作2)执行失败时,该顺序执行子节点返回该父节点失败标志,这时该父节点执行其叶子节点即动作1。若假设该动作1为睡觉,该动作2为打招呼,该节点条件为碰到游戏玩家。在实际应用中,根据该行为树的描述可知,若该游戏AI碰到了游戏玩家,则打招呼;若该游戏AI未碰到游戏玩家,则睡觉。在这种模型下,行为树可以方便地把复杂的AI知识条目组织得非常直观。默认的组合节点处理子节点的迭代方式就像是处理一个预设优先策略队列,也非常符合人类的正常思考模式即先最优再次优。同时该行为树的各种节点,包括叶子节点,可复用性都极高。但是每次执行行为树定模型构造的AI系统时,系统都会从根节点遍历整个树,父节点执行子节点,子节点执行完后将结果返回父节点,然后父节点根据子节点的结果来决定接下来怎么做。这样导致游戏AI的行为有迹可循,行为模式固定。
为了解决此问题,本申请实施例提供如下方案:获取AI对象所处的第一环境的第一实时状态信息;然后提取该第一实时状态信息的特征信息;再其次,根据该特征信息和学习网络的权重值得到AI对象的动作策略;再其次,将该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略;再其次,获取AI对象所处的第二环境的第二实时状态信息,该第二环境为该AI对象执行该动作策略之后的环境;然后,根据该第二实时状态信息得到该动作策略的回报值;若该回报值符合预设条件,则确定该学习网络的权重值为该学习网络的目标权重值;最后根据该目标权重值建立该AI对象的行为模型。
本申请实施中,根据该环境的实时状态信息获取该AI对象的行为模型的目标权重值是根据样本进行重复的操作的,具体的情况如下,请参阅如下说明。
请参阅图2所示,图2示出了本申请一个实施例中AI对象的行为模型优化方法。需要注意的是,虽然实施例中以游戏为例描述本申请的AI对象行为模型优化方法,但是本领域的技术人员可以理解,本申请的AI对象行为模型优化方法不限于游戏。该方法包括:
S201、获取第一环境的第一实时状态信息。
本实施例中,由服务器获取该第一环境的第一实时状态信息,该第一环境可以为游戏环境样本集合。
本实施例中,该服务器为具有深度学习网络的计算机,该计算机可以具有显示功能。该第一环境可以为第一游戏环境,该第一游戏环境包括该游戏AI对象、游戏玩家,游戏场景至少其中一项。该第一实时状态信息为图片,该图片里包括该第一游戏环境。比如,该第一实时状态信息为图3所示。若该游戏为贪吃蛇游戏,该第一实时状态信息中的游戏 AI对象为该游戏中的“游客408247928”代表的蛇;该第一实时状态信息中的游戏玩家为该游戏中的“biubiubiu”代表的蛇;该第一实时状态信息中的各分散的小点为该游戏场景中的食物。
可选的,当该第一环境为该第一游戏环境时,该服务器在获取该第一游戏环境的第一实时状态信息时可以采用如下方式:
一种可能实现方式中,该服务器可以获取该第一游戏环境中的有效数据,这时该有效数据包括该游戏AI对象的角色参数、该游戏AI对象的位置参数,游戏玩家角色的角色参数,该游戏玩家角色的位置参数,游戏场景参数中的至少一项,该有效数据在以该游戏AI的预设部位为中心,以预设数值为半径的区域内的游戏环境内提取;然后该服务器根据该有效数据绘制二维数据矩阵作为该第一实时状态信息,该二维数据矩阵表示为图像,即该第一实时状态信息以图片存在。比如,如图3所示,这时该服务器获取到的有效数据包括:该“游客408247928”代表的蛇的长度为33;该“游客408247928”代表的蛇的击杀数为0;该“biubiubiu”代表的蛇的长度为89;该“biubiubiu”代表的蛇的位置参数指示该“biubiubiu”代表的蛇位于屏幕的右下方;该“游客408247928”代表的蛇的位置参数指示该“游客408247928”代表的蛇位于屏幕的中间;各食物的位置参数。然后该服务器根据该有效数据重新绘制类似于该图3的二维数据矩阵(即该第一实时状态信息),此时该服务器可以为各个有效数据赋予颜色数值,然后根据该颜色数据绘制该二维数据矩阵。在实际应用中,为了简便,该服务器通常可用灰度图,不同的对象使用不同的数值。比如以图3所示的贪吃蛇游戏为例,空地为中性,灰色;该游戏AI对象(即“游客408247928”代表的蛇)的身体也比较中性,为灰色;边界和游戏玩家角色(即该“biubiubiu”代表的蛇)是“不好”的,为黑色;食物(即该图3中分散的小点)是“好”的,为白色。具体绘制的代码可以如下:
BORDER_COLOR=0
FOOD_COLOR=255
MAP_COLOR=120
PLAYER_HEAD=180
PLAYER_BODY=150
ENEMY_HEAD=30
ENEMY_BODY=0
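As an illustration of how the grayscale two-dimensional data matrix described above might be rasterized from the valid data, a minimal sketch follows. Only the color constants come from the text; the function name draw_state, the 80x80 window size borrowed from the later description, and the assumed input format (lists of grid cells already translated into the window around the game AI's head) are illustrative assumptions, not part of the original disclosure.

    import numpy as np

    # Color constants as listed above.
    BORDER_COLOR = 0
    FOOD_COLOR = 255
    MAP_COLOR = 120
    PLAYER_HEAD = 180
    PLAYER_BODY = 150
    ENEMY_HEAD = 30
    ENEMY_BODY = 0

    GRID_SIZE = 80  # assumed: an 80x80 patch centered on the game AI's head

    def draw_state(ai_head, ai_body, enemy_heads, enemy_bodies, food_cells, border_cells):
        # Rasterize the valid data into one grayscale image (the real-time status information).
        # Each argument is a list of (row, col) cells inside the window; this layout is assumed.
        state = np.full((GRID_SIZE, GRID_SIZE), MAP_COLOR, dtype=np.uint8)  # empty ground
        for r, c in border_cells:
            state[r, c] = BORDER_COLOR
        for r, c in food_cells:
            state[r, c] = FOOD_COLOR
        for r, c in enemy_bodies:
            state[r, c] = ENEMY_BODY
        for r, c in enemy_heads:
            state[r, c] = ENEMY_HEAD
        for r, c in ai_body:
            state[r, c] = PLAYER_BODY
        r, c = ai_head
        state[r, c] = PLAYER_HEAD
        return state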
另一种可能实现方式中,该服务器可以直接获取该第一游戏环境的截图图片,并直接将该截图图片作为该第一实时状态信息。比如,该服务器可以直接获取到图3,然后将该图3作为该第一游戏环境的第一实时状态信息。
本实施例中,该服务器为具有计算资源的设备,即只要其可以建立该游戏AI对象的行为模型即可,具体情况此处不做限定,比如该服务器可以既运行该游戏同时也建立该游戏AI对象的行为模型;该服务器也可以仅用于建立该游戏AI对象的行为模型。根据上述说明,该游戏环境可以发生在终端设备(比如手机、平板等终端)也可以直接发生在该服务器上。如图4所示,即当该游戏在该终端设备运行时,这时该服务器接收该终端设备发 送的该游戏运行过程中的第一游戏环境的有效数据。当该游戏在该服务器上运行时,如图5所示该服务器直接采集该游戏运行过程中的第一游戏环境的第一实时状态信息。
本实施例中,该游戏AI对象在进行行为模型建立的过程中,该第一游戏环境可以是预先设置好的游戏环境样本集合,也可以是在实时操作中的游戏环境样本集合,具体方式此处不做限定。
本实施例中,当该游戏在该终端设备上运行时,该终端设备可以为多个,也可以为单独一个,具体情况此处不做限定。
S202、提取该第一实时状态信息的特征信息。
该服务器在获取到该第一实时状态信息之后,提取该第一实时状态信息的特征信息。可选的是,该特征信息为多维度信息。
在实际应用中,为了提取该第一实时状态信息的多维度信息,该服务器可以采用卷积神经网络(Convolutional Neural Network,CNN)提取该第一实时状态信息的特征信息。
CNN是一种前馈神经网络,它的人工神经元可以响应一部分覆盖范围内的周围单元,对于大型图像处理有出色表现。该CNN由一个或多个卷积层和顶端的全连通层(对应经典的神经网络)组成,同时也包括关联权重和池化层(pooling layer)。这一结构使得CNN能够利用输入数据的二维结构。该CNN中的卷积层的卷积核会对图像进行卷积,卷积就是用一个特定参数的滤波器去扫描图像,提取图像的特征值。
而该服务器采用该CNN提取该第一实时状态信息的特征信息的具体过程可以如下:
该服务器将该第一实时状态信息以预设格式传递给该CNN中预设数目的卷积层;然后该服务器根据该预设数目的卷积层提取该第一实时状态信息的特征值,此时该特征值为该第一实时状态信息的局部特征信息;再然后,该服务器通过该CNN中的池化层对该特征值进行降维,得到降维特征值,该降维特征值为二维数据;最后该服务器通过该矩阵变维Reshape函数将该降维特征值修改为一维数据,然后将该一维数据作为该特征信息。
在实际应用中,该服务器通过CNN提取该第一实时状态信息的特征信息可以如图6所示:
本实施例中,该服务器提取实时状态信息的特征值可以采用多种方式。
一种可能实现方式中,该服务器为实时提取,即获取到一张实时状态信息(如上文所述,实时状态信息可以是图像的形式)时就提取该实时状态信息的特征值。即该服务器将单张的实时状态信息作为该第一实时状态信息。
另一种可能实现方式中,该服务器需要先获取到一个实时状态信息集合,该实时状态信息集合中包含有预设数目的实时状态信息。即该服务器将该实时状态信息集合作为该第一实时状态信息;然后该服务器再提取该实时状态信息集合的特征值;若该服务器再获取到一张实时状态信息,则该服务器就将该实时状态信息集合中最早获取到的实时状态信息丢弃,将该最新获取到的实时状态信息添加进行该实时状态信息集合中;然后该服务器提取该修改后的实时状态集合的特征值。本实施例中以此种方式为例进行说明。
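The sliding window of recent status images described in this paragraph can be sketched as follows; the window length of 4 matches the four consecutive grayscale images mentioned below, while the class name and interface are illustrative assumptions.

    from collections import deque
    import numpy as np

    class FrameWindow:
        # Keeps the most recent `size` status images; pushing a new image automatically
        # discards the oldest one, as described above.
        def __init__(self, size=4):
            self.frames = deque(maxlen=size)

        def push(self, frame):
            # Add the newest image; the deque drops the oldest once `size` images are held.
            self.frames.append(frame)

        def stacked(self):
            # Return the current window as a (size, H, W) array once it is full.
            return np.stack(list(self.frames))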
首先,该服务器将获取到的第一实时状态信息以80*80像素的格式传递给该CNN中的卷积层,其中该第一实时状态信息包括连续4次提取有效数据生成的四张灰度图。该 80*80像素用于表示该游戏AI的头部的一小块区域,用于降低输入复杂度。可以理解的是,该像素可以预先设置,具体数据此处不做限定。
然后,该服务器通过该CNN中的第1层卷积层以卷积核为3*3像素,深度为4,提取次数为32次,卷积步长为1的方式进行提取该第一实时状态信息的第一特征值。该服务器通过该CNN中的第2层卷积层以卷积核为3*3像素,深度为32,提取次数为32次,卷积步长为1的方式进行提取该第一特征值的第二特征值。该服务器通过该CNN中的第3层卷积层以卷积核为3*3像素,深度为32,提取次数为32次,卷积步长为1的方式进行提取该第二特征值的第三特征值。同时在前三层卷积层时,该服务器还通过池化层以2*2像素对该第一实时状态信息进行降维。然后该服务器通过该CNN中的第4层卷积层以卷积核为3*3像素,深度为32,提取次数为64次,卷积步长为1的方式进行提取该第三特征值的第四特征值。最后然后该服务器通过该CNN中的第4层卷积层以卷积核为3*3像素,深度为64,提取次数为64次,卷积步长为1的方式进行提取该第四特征值的第五特征值,该第五特征值作为该第一实时状态信息的降维特征值。因此,该服务器将该第一实时状态信息通过第一层卷积层和池化层时得到的特征值为40*40*32的二维数据;该服务器将40*40*32的特征值通过第二层卷积层和池化层时得到的特征值为20*20*32的二维数据;该服务器将20*20*32的特征值通过第三层卷积层和池化层时得到的特征值为10*10*32的特征值;该服务器将10*10*32的特征值通过第四层卷积层时得到的特征值为10*10*64的特征值;该服务器将10*10*64的特征值通过第五层卷积层和池化层时得到的特征值为10*10*64的特征值。
最后,该服务器通过Reshape将该10*10*64的特征值进行变维(即由二维数据变为一维数据)得到6400*1的一维数据。此处将该一维数据作为该第一实时状态信息的特征信息。
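The five convolution layers, the 2x2 max pooling used with the first three of them, and the final reshape to a 6400-dimensional vector described above can be sketched as follows. The library choice (PyTorch), the 'same' padding needed to keep the 80x80, 40x40, 20x20 and 10x10 spatial sizes, and the ReLU activations are assumptions; the kernel size, stride, channel counts, and output shapes come from the text.

    import torch
    import torch.nn as nn

    class FeatureExtractor(nn.Module):
        # Input: 4 stacked 80x80 grayscale frames. Output: a 6400-dimensional feature vector.
        def __init__(self):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                                                # 32 x 40 x 40
                nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                                                # 32 x 20 x 20
                nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                                                # 32 x 10 x 10
                nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # 64 x 10 x 10
                nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # 64 x 10 x 10
            )

        def forward(self, x):                       # x: (batch, 4, 80, 80)
            h = self.conv(x)
            return h.reshape(h.size(0), -1)         # (batch, 6400), the "Reshape" step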
S203、根据该特征信息和学习网络的权重值得到AI对象的动作策略。
该服务器将该特征信息通过该学习网络得到该AI对象的动作策略,该权重值为该学习网络的全连接网络中的各连接层的权重值。
本实施例中,该CNN通过Reshape得到的一维数据输入给该学习网络的全连接网络,然后通过该全连接网络的各连接层对该一维数据进行取值,最后输出该游戏AI对象的动作策略。
本实施例中,该学习网络可以为深度强化学习网络。深度学习(deep learning,DP)是机器学习的分支,它试图使用包含复杂结构或由多重非线性变换构成的多个处理层对数据进行高层抽象的算法。深度学习是机器学习中一种基于对数据进行表征学习的方法。观测值(例如一幅图像)可以使用多种方式来表示,如每个像素强度值的向量,或者更抽象地表示成一系列边、特定形状的区域等。而使用某些特定的表示方法更容易从实例中学习任务(例如,人脸识别或面部表情识别)。
强化学习是机器学习中的一个分支,强调如何基于环境而行动,以取得最大化的预期利益。令有机体在环境给予的奖励或惩罚的刺激下,逐步形成对刺激的预期,产生能获得最大利益的习惯性行为。这个方法具有普适性,因此在其他许多领域都有研究,例如博弈 论、控制论、运筹学、信息论、仿真优化、多主体系统学习、群体智能、统计学以及遗传算法。在运筹学和控制理论研究的语境下,强化学习被称作“近似动态规划”(approximate dynamic programming,ADP)。在最优控制理论中也有研究这个问题,虽然大部分的研究是关于最优解的存在和特性,并非是学习或者近似方面。在经济学和博弈论中,强化学习被用来解释在有限理性的条件下如何出现平衡。
在实际应用中,可以如图3所示,当该游戏为贪吃蛇游戏时,该游戏AI对象的动作策略可以仅为方向上的控制,这时可以得到如图7所示的8个量化的方向,然后考虑到是否加速的因素,该全连接层输出的节点可以为16个。
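A matching sketch of the fully connected part that maps the 6400-dimensional feature vector to the 16 output nodes (8 quantized directions, each with and without acceleration) is given below; the single hidden layer and its width are assumptions, since the text only states that the fully connected network has several layers and 16 outputs.

    import torch.nn as nn

    class PolicyHead(nn.Module):
        # Maps the 6400-dimensional feature vector to 16 outputs, one per candidate action.
        # The 512-unit hidden layer is an illustrative assumption.
        def __init__(self, feature_dim=6400, num_actions=16, hidden=512):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(feature_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, num_actions),
            )

        def forward(self, features):
            return self.fc(features)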
S204、将该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略。
该服务器将该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略。
在实际应用中,若该AI对象为游戏AI对象,且该游戏运行在终端设备上,如图4所示,则该服务器需要将该动作策略直接反馈给该终端设备,然后由该终端设备控制该游戏AI对象的行为。若该游戏运行在该服务器上且该服务器为电脑时,如图5所示,则该电脑可以直接获取到该游戏AI对象的动作策略,然后控制该游戏AI对象的行为。
S205、获取第二环境的第二实时状态信息,该第二环境为该AI对象执行该动作策略之后的环境。
该AI对象执行该动作策略之后,该环境发生了变化,这时可以将此时的环境称为第二环境。该服务器实时获取该第二环境的第二实时状态信息。
本实施例中,以游戏环境为例,则该服务器获取该第二游戏环境的第二实时状态信息的方式与步骤S201中该服务器获取该第一游戏环境的第一实时状态信息的方式相同,此处不再赘述。
S206、该服务器根据该第二实时状态信息得到该动作策略的回报值。
该服务器在得到该第二实时状态信息之后根据该第二实时状态信息得到该游戏AI对象执行该动作策略之后的回报值。
本实施例中,该回报值用于指示该游戏AI对象执行该动作策略后环境发生变化后的状态情况值,通常可以理解为回报值越大,环境发生的变化越符合预期情况。
在实际应用中,该服务器可以采用如下方式计算该游戏AI对象执行该动作策略之后的回报值:
Q = R_{t+1} + λ · max_a Q(S_{t+1}, a)
其中,该Q为该游戏AI对象执行该动作策略之后的回报值,该Rt+1为迭代次数增加一次之后的回报值,该λ为预设系数,该St+1为迭代次数增加一次之后的实时状态信息,该a为动作策略。
在计算该游戏AI对象执行该动作策略的回报值的过程中可以通过计算其损失函数来调整该回报值:
L = E[ (r + γ · max_{a'} Q(s', a'; w) − Q(s, a; w))² ]
其中,该L为计算过程中的损失,该Ε为期望,该r为预设系数,该γ为衰减系数,该s'为下一时刻实时状态信息,该a'为下一时刻的动作策略,该w为当前网络系统,该s为 当前实时状态信息,该a为当前动作策略。
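Under the loss written above, one way the gradient-descent update could be computed for a sampled minibatch is sketched below (PyTorch). The batch layout, the function name, and the use of a single set of weights w for both terms follow the formula as written here and are illustrative assumptions rather than a definitive implementation.

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, batch, gamma=0.99):
        # q_net maps a state tensor to per-action Q values.
        # batch: states (B,4,80,80), actions (B,) int64, rewards (B,), next states, done flags (B,) float.
        s, a, r, s_next, done = batch
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; w)
        with torch.no_grad():
            q_next = q_net(s_next).max(dim=1).values                # max_a' Q(s', a'; w)
            target = r + gamma * q_next * (1.0 - done)              # zero bootstrap at episode end
        return F.mse_loss(q_sa, target)                             # (target - Q(s, a; w))^2 averaged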
在实际应用中,该服务器还可以采用DQN算法对该学习网络进行训练,具体伪代码如下:
Initialize a replay memory D with capacity N
Initialize the network weights to random values
for episode = 1, M do
    Initialize the input sequence s_1 = {x_1} with the current image, and preprocess it as φ_1 = φ(s_1)
    for t = 1, T do
        With probability ε select a random action, otherwise select the action a_t with the maximum Q value output by the network
        Execute action a_t on the device and observe the immediate reward r_t and the image x_{t+1}
        Let s_{t+1} = s_t, a_t, x_{t+1}, and preprocess the sequence as φ_{t+1} = φ(s_{t+1})
        Store the transition (φ_t, a_t, r_t, φ_{t+1}) in the replay memory D
        Randomly sample a minibatch of transitions from D
        Set the target y_j = r_j if the episode terminates at step j+1, and y_j = r_j + γ · max_{a'} Q(φ_{j+1}, a'; w) otherwise
        Perform a gradient descent step on the loss (y_j − Q(φ_j, a_j; w))²
    end for
end for
在这段伪代码中的r为反馈值,该游戏AI对象执行该动作策略后下一时刻环境发生变化,这个变化的好坏就体现为反馈值,本实施例中为了使游戏AI对象具有强大的生存能力和一定的击杀能力,设置r为蛇自身长度的增加、击杀数、以及是否死亡的状态量化求和为一个数值。此处该反馈值的效果可以设置为反馈值越大,变化越好;该反馈值的效果也可以设置为反馈值越小,变化越好。此处具体的设置方式不做限定。
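As one possible reading of the reward just described (a quantized sum over the snake's length increase, its kill count, and whether it died), a minimal sketch follows; the specific weights are illustrative assumptions, since the text does not give them.

    def compute_reward(length_gain, kills, died):
        # Quantized sum of the three quantities named in the text; the weights 1.0, 10.0 and
        # -100.0 are assumptions chosen to reward survival ability and killing ability.
        return 1.0 * length_gain + 10.0 * kills + (-100.0 if died else 0.0)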
本实施例中,动作输出策略一般会在参考学习网络输出的动作策略之前设置一个随机预设值,当该回报值高于随机预设值时该服务器采用学习网络输出的动作策略;当该回报值低于随机预设值时采用随机动作。该随机预设值可以设计成动态变化的值,该服务器建立该游戏AI对象的行为模型的过程中,该随机预设值呈指数型降低符合训练预期;最后该随机预设值收敛为一个很小的随机值,此时该游戏AI对象的动作策略基本等同为学习网络的输出值。
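The exponentially decaying random threshold described above is commonly implemented as ε-greedy action selection; a sketch under that reading follows, with the start value, end value, and decay rate as assumptions.

    import math
    import random

    def select_action(q_values, step, eps_start=1.0, eps_end=0.05, decay=1e-4):
        # The threshold decays exponentially toward a small residual value as training proceeds,
        # so the AI object's actions converge to the learning network's output.
        eps = eps_end + (eps_start - eps_end) * math.exp(-decay * step)
        if random.random() < eps:
            return random.randrange(len(q_values))                   # random exploratory action
        return max(range(len(q_values)), key=lambda i: q_values[i])  # action with the largest Q value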
S207、判断该回报值是否符合预设条件,若是,执行步骤S208,若否,则执行步骤S209。
该服务器在得到该动作策略的回报值之后,判断该回报值是否不再发生变化,若是,则确定该回报值为最优动作策略的回报值,并执行步骤S208。若该回报值仍然在发生变化,则执行步骤S210。
S208、确定该学习网络的权重值为该学习网络的目标权重值。
该服务器确定该学习网络的权重值为该学习网络的目标权重值。
S209、根据该目标权重值建立该AI对象行为模型。
该服务器将该学习网络的全连接网络的权重值设置为该目标权重值完成该AI对象行 为模型的建立。
S210、修改该学习网络的权重值。
该服务器按照得到最大Q值的规则修改该学习网络的全连接网络的权重值,并重复上述S201至步骤S207的步骤,直至满足执行步骤S208至步骤S209的条件。
该服务器在修改该学习网络的全连接网络的权重值之后,该学习网络的各连接层对于一维数据的取值侧重方向会发生改变。
本实施例中,该服务器根据环境的实时改变对AI对像的动作策略做出相应的决策,可以提高AI的灵活性。而且,由于该提取到的该特征信息的维度为多维度特征信息,比行为树提取的特征信息的维度高,在通过学习网络学习之后获得的动作策略就更具体,从而进一步提高了游戏AI的灵活性。
请参阅图8所示,图8示出了本申请一个实施例中AI对象的控制方法。需要注意的是,虽然实施例中以游戏为例描述本申请的AI对象的控制方法,但是本领域的技术人员可以理解,本申请的AI对象的控制方法不限于游戏。该方法包括:
S801、获取运行环境的实时状态信息。
AI控制设备获取该运行环境的实时状态信息。
本实施例中,以游戏为例,则该运行环境为游戏环境,该游戏环境包括该游戏AI对象、游戏玩家,游戏场景至少其中一项。该实时状态信息为包含有该游戏环境的图片。比如,该实时状态信息为图9所示。若该游戏为贪吃蛇游戏,该实时状态信息中的游戏AI对象为该游戏中的“游客408247928”代表的蛇;该实时状态信息中的游戏玩家为该游戏中的“芷若汀兰”代表的蛇,该实时状态信息中的各分散的小点为该游戏场景中的食物。
本实施例中,在以游戏为例时,该AI控制设备可以为游戏AI控制设备。则该游戏AI控制设备可以为运行该游戏的终端设备,也可以是独立在运行该游戏的终端设备之外的服务器,只要其保存有该游戏AI对象的行为模型即可,具体方式此处不做限定。
可选的,该游戏AI控制设备在获取该游戏环境的实时状态信息时可以采用如下方式:
一种可能实现方式中,该游戏AI控制设备可以获取该游戏环境中的有效数据,这时该有效数据包括该游戏AI对象的角色参数、该游戏AI对象的位置参数,游戏玩家角色的角色参数,该游戏玩家角色的位置参数,游戏场景参数中的至少一项,该有效数据在以该游戏AI的预设部位为中心,以预设数值为半径的区域内的游戏环境内提取;然后该游戏AI控制设备根据该有效数据绘制二维数据矩阵作为该第一实时状态信息,该二维数据矩阵表示为图像,即该实时状态信息以图片存在。比如,如图9所示,这时该游戏AI控制设备获取到的有效数据包括:该“游客408247928”代表的蛇的长度为43;该“游客408247928”代表的蛇的击杀数为0;该“芷若汀兰”代表的蛇的长度为195;该“芷若汀兰”代表的蛇的位置参数为屏幕的左下方;该“游客408247928”代表的蛇的位置参数为屏幕的中间;各食物的位置参数。然后该游戏AI控制设备根据该有效数据重新绘制类似于该图9的二维数据矩阵(即该实时状态信息),此时该游戏AI控制设备可以为各个有效数据赋予颜色数值,然后根据该颜色数据绘制该二维数据矩阵。在实际应用中,为了简 便,该游戏AI控制设备通常可用灰度图,不同的对象使用不同的数值。比如以图9所示的贪吃蛇游戏为例,空地为中性,灰色;该游戏AI对象(即“游客408247928”代表的蛇)的身体也比较中性,为灰色;边界和游戏玩家角色(即该“芷若汀兰”代表的蛇)是“不好”的,为黑色;食物(即该图9中分散的小点)是“好”的,为白色。具体绘制的代码可以如下:
BORDER_COLOR=0
FOOD_COLOR=255
MAP_COLOR=120
PLAYER_HEAD=180
PLAYER_BODY=150
ENEMY_HEAD=30
ENEMY_BODY=0
另一种可能实现方式中,该游戏AI控制设备可以直接获取该游戏环境的截图图片,并直接将该截图图片作为该实时状态信息。比如,该游戏AI控制设备可以直接获取到图9,然后将该图9作为该游戏环境的实时状态信息。
本实施例中,该游戏AI控制设备为具有计算资源的设备,即只要其可以建立该游戏AI对象的行为模型即可,具体情况此处不做限定,比如该游戏AI控制设备可以既运行该游戏同时也具备建立该游戏AI对象的行为模型;该游戏AI控制设备也可以仅用于建立该游戏AI对象的行为模型。根据上述说明,该游戏环境可以发生在终端设备(比如手机、平板等终端)也可以直接发生在该游戏AI控制设备上。即当该游戏在该终端设备运行时,这时该游戏AI控制设备接收该终端设备发送的该游戏运行过程中的游戏环境的实时状态信息;当该游戏在该游戏AI控制设备上运行时,该游戏AI控制设备直接采集该游戏运行过程中的游戏环境的实时状态信息。
本实施例中,该游戏AI对象在进行行为模型建立的过程中,该游戏环境可以是预先设置好的游戏环境样本集合,也可以是在实时操作中的游戏环境样本集合,具体方式此处不做限定。
本实施例中,当该游戏在该终端设备上运行时,该终端设备可以为多个,也可以为单独一个,具体情况此处不做限定。
S802、提取该实时状态信息的特征信息。
该AI控制设备在获取到该实时状态信息之后,提取该实时状态信息的特征信息。可选的是,该特征信息为多维度信息。
在实际应用中,为了提取该实时状态信息的多维度信息,该AI控制设备可以采用CNN提取该实时状态信息的特征信息。
而该AI控制设备采用该CNN提取该实时状态信息的特征信息的具体过程可以如下:
该AI控制设备将该实时状态信息以预设格式传递给该CNN中预设数目的卷积层;然后该AI控制设备根据该预设数目的卷积层提取该实时状态信息的特征值,此时该特征值为该实时状态信息的局部特征信息;再然后,该AI控制设备通过该CNN中的池化层对该特征 值进行降维得到降维特征值,该降维特征值为二维数据;最后该AI控制设备通过该矩阵变维Reshape函数将该降维特征值修改为一维数据,然后将该一维数据作为该特征信息。
在实际应用中,该游戏AI控制设备通过CNN提取该实时状态信息的特征信息可以如图4所示:
本实施例中,该游戏AI控制设备提取实时状态信息的特征值可以采用多种方式。
一种可能实现方式中,该游戏AI控制设备为实时提取,即获取到一张实时状态信息时就提取该实时状态信息的特征值。即该游戏AI控制设备将单张的实时状态信息作为该实时状态信息。
另一种可能实现方式中,该游戏AI控制设备需要先获取到一个实时状态信息集合,该实时状态信息集合中包含有预设数目的实时状态信息。即该游戏AI控制设备将该实时状态信息集合作为该实时状态信息;然后该游戏AI控制设备再提取该实时状态信息集合的特征值;若该游戏AI控制设备再获取到一张实时状态信息,则该游戏AI控制设备就将该实时状态信息集合中最早获取到的实时状态信息丢弃,将该最新获取到的实时状态信息添加进行该实时状态信息集合中;然后该游戏AI控制设备提取该修改后的实时状态集合的特征值。本实施例中以此种方式为例进行说明。具体如图6所示:
首先,该游戏AI控制设备将获取到的实时状态信息以80*80像素的格式传递给该CNN中的卷积层,其中该实时状态信息包括连续4次提取有效数据生成的四张灰度图。该80*80像素用于表示该游戏AI的头部的一小块区域,用于降低输入复杂度。可以理解的是,该像素可以预先设置,具体数据此处不做限定。然后,该游戏AI控制设备通过该CNN中的第1层卷积层以卷积核为3*3像素,深度为4,提取次数为32次,卷积步长为1的方式进行提取该实时状态信息的特征值。该游戏AI控制设备通过该CNN中的第2层卷积层以卷积核为3*3像素,深度为32,提取次数为32次,卷积步长为1的方式进行提取该特征值的第二特征值。该游戏AI控制设备通过该CNN中的第3层卷积层以卷积核为3*3像素,深度为32,提取次数为32次,卷积步长为1的方式进行提取该第二特征值的第三特征值。同时在前三层卷积层时,该游戏AI控制设备还通过池化层以2*2像素对该实时状态信息进行降维。然后该游戏AI控制设备通过该CNN中的第4层卷积层以卷积核为3*3像素,深度为32,提取次数为64次,卷积步长为1的方式进行提取该第三特征值的第四特征值。最后然后该游戏AI控制设备通过该CNN中的第4层卷积层以卷积核为3*3像素,深度为64,提取次数为64次,卷积步长为1的方式进行提取该第四特征值的第五特征值,该第五特征值作为该实时状态信息的降维特征值。因此,该游戏AI控制设备将该实时状态信息通过层卷积层和池化层时得到的特征值为40*40*32的二维数据;该游戏AI控制设备将40*40*32的特征值通过第二层卷积层和池化层时得到的特征值为20*20*32的二维数据;该游戏AI控制设备将20*20*32的特征值通过第三层卷积层和池化层时得到的特征值为10*10*32的特征值;该游戏AI控制设备将10*10*32的特征值通过第四层卷积层时得到的特征值为10*10*64的特征值;该游戏AI控制设备将10*10*64的特征值通过第五层卷积层和池化层时得到的特征值为10*10*64的特征值。
最后,该游戏AI控制设备通过Reshape将该10*10*64的特征值进行变维(即由二维 数据变为一维数据)得到6400*1的一维数据。此处将该一维数据作为该实时状态信息的特征信息。
S803、根据该特征信息和学习网络的权重值得到AI对象的动作策略,该学习网络的权重值为预置数值。
该AI控制设备将该特征信息通过该学习网络得到该游戏AI对象的动作策略,该权重值为该学习网络的全连接网络中的各连接层的权重值。
本实施例中,该CNN通过Reshape得到的一维数据输入给该学习网络的全连接层,然后通过该全连接层的各连接层对该一维数据进行取值,最后输出该游戏AI对象的动作策略。
在实际应用中,可以如图9所示,当该游戏为贪吃蛇游戏时,该游戏AI对象的动作策略可以仅为方向上的控制,这时可以得到如图7所示的8个量化的方向,然后考虑到是否加速的因素,该全连接层输出的节点可以为16个。
S804、将该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略。
该AI控制设备将该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略。
在实际应用中,该AI对象执行该动作策略时,可以如图10所示,该“游客408247928”代表的蛇转向图10中食物分布密集的地方开始吞噬该食物。
在实际应用中,若该游戏AI控制设备为独立在运行该游戏的终端设备之外的服务器,则该游戏AI控制设备需要将该动作策略直接反馈给该终端设备,然后由该终端设备控制该游戏AI对象的行为。若该游戏AI控制设备为运行该游戏的终端设备,则该游戏AI控制设备可以直接获取到该游戏AI对象的动作策略,然后控制该游戏AI对象的行为。
本实施例中,游戏AI控制设备在获取到游戏环境的实时状态信息之后,该游戏AI控制设备提取该实时状态信息的多维度特征信息,然后根据该多维度特征信息和学习网络的权重值得到该游戏AI对象的动作策略;最后游戏AI对象控制设备将该动作策略反馈给该游戏AI对象,以使得该游戏AI对象执行该动作策略。本申请根据环境的实时改变做出相应的决策,可以提高AI的灵活性。而且,由于该提取到的该特征信息的维度为多维度特征信息,比行为树提取的特征信息的维度高,在通过学习网络之后获得的动作策略就更具体,从而进一步提高了游戏AI的灵活性。
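Putting the two modules sketched earlier together, control-time inference could look like the following; the module names are the illustrative ones introduced in this document's earlier sketches, not identifiers from the original disclosure.

    import torch

    def act(extractor, head, frames):
        # frames: a (4, 80, 80) stack of grayscale status images; returns the index of the
        # action (one of the 16 outputs) with the highest predicted Q value.
        x = torch.as_tensor(frames, dtype=torch.float32).unsqueeze(0)  # (1, 4, 80, 80)
        with torch.no_grad():
            q = head(extractor(x))                                     # (1, 16)
        return int(q.argmax(dim=1).item())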
下面将该游戏AI对象的行为模型优化方法以如图11所示的流程进行一次描述:
该终端设备或者是该服务器上运行该游戏,该终端设备或该服务器提取该游戏环境的实时状态信息;然后,该服务器对该实时状态信息进行预处理,即提取多维度的特征信息;该服务器通过该多维度的特征信息和学习网络得到该游戏AI对象的动作策略;该服务器获取该游戏AI对象执行该动作策略之后的游戏环境的实时状态信息;该服务器根据该游戏AI对象执行该动作策略之后的游戏环境的实时状态信息计算该动作策略的回报值;该服务器根据该回报值调整该学习网络的权重值。
下面将该游戏AI对象的控制方法以如图12所示的流程进行一次描述:
该终端设备或者是该服务器上运行该游戏,该终端设备或该服务器提取该游戏环境的实时状态信息;然后,该终端设备或者是该服务器对该实时状态信息进行预处理,即提取 多维度的特征信息;该终端设备或者是该服务器通过该多维度的特征信息和该学习网络得到该游戏AI对象的动作策略;该终端设备或者是该服务器将该动作策略反馈给游戏AI对象;该游戏AI对象执行该动作策略。
上面对本申请实施例中的方法实施例进行了描述,下面对本申请实施例中AI对象的行为模型建立装置和AI控制设备进行描述。其中AI对象的行为模型建立装置可以是服务器。
具体请参阅图13所示,本申请实施例中AI对象的行为模型建立装置一个实施例包括:
获取模块1301,用于获取AI对象所处的第一环境的第一实时状态信息;
处理模块1302,用于提取该获取模块1301获取到的该第一实时状态信息的特征信息;根据该特征信息和学习网络的权重值得到AI对象的动作策略;
反馈模块1303,用于将该处理模块1302得到的该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略;
该获取模块1301,用于获取AI对象所处的第二环境的第二实时状态信息,该第二环境为该AI对象执行该动作策略之后的环境;
该处理模块1302,用于根据该获取模块1301获取到的该第二实时状态信息得到该动作策略的回报值;若该回报值符合预设条件,则确定该学习网络的权重值为该学习网络的目标权重值;根据该目标权重值建立该AI对象的行为模型。
可选的,该处理模块1302,具体用于将该第一实时状态信息以预设格式传递给预设数目的卷积层;通过池化层和该预设数目的卷积层提取该第一实时状态信息得到降维特征值,该降维特征值为二维数据;将该降维特征值修改为一维数据,该一维数据作为该特征信息。
可选的,该预设格式为长宽均为80像素的图片,该预设数目为5,该卷积层的卷积核长宽均为3像素,该卷积步长为1,该池化层的降维设置为在长宽均为2像素的区域内选择最大值作为该降维特征值。
可选的,该处理模块1302,还用于若该回报值不符合该预设条件,则修改该学习网络的权重值。
可选的,所述AI对象为游戏AI对象,所述环境为游戏环境,该获取模块1301,具体用于获取该第一游戏环境的有效数据,该有效数据包括该游戏AI对象的角色参数、该游戏AI对象的位置参数、游戏玩家角色的角色参数、该游戏玩家角色的位置参数、游戏场景参数中的至少一项;根据该有效数据绘制二维数据矩阵作为该第一实时状态信息,该二维数据矩阵表示图像。
可选的,该获取模块1301,具体用于获取该有效数据中各游戏对象对应的颜色数值,该颜色数值用于表示该游戏环境中的各游戏对象的颜色,该游戏对象包括该游戏AI对象,该玩家游戏角色以及该游戏场景;根据该各游戏对象对应的颜色数值绘制该二维数据矩阵作为该第一实时状态信息。
可选的,该获取模块1301,具体用于获取该第一游戏环境的截图图片作为该第一实时 状态信息。
可选的,该学习网络为深度强化学习网络,该深度强化学习网络的算法包括Q-learning算法或DQN算法。
可选的,该特征信息为多维度信息,该第一环境为游戏环境样本集合,该游戏环境样本集合包括玩家实时操作游戏环境和预设游戏环境。
本实施例中,该获取模块1301在获取到第一游戏环境的第一实时状态信息之后,该处理模块1302提取该第一实时状态信息的多维度特征信息,然后根据该多维度特征信息和学习网络的权重值得到该游戏AI对象的动作策略;最后反馈模块1303将该动作策略反馈给该游戏AI对象,以使得该游戏AI对象执行该动作策略;然后该获取模块1301获取该游戏AI对象执行了该动作策略之后的第二游戏环境的第二实时状态信息,该处理模块1302根据该第二实时状态信息计算该动作策略的回报值,并在该回报值符合预设条件时确定该学习网络的权重值为目标权重值,并根据该目标权重值建立该游戏AI对象的行为模型。本申请根据环境的实时改变做出相应的决策,可以提高AI的灵活性。而且,由于该提取到的该特征信息的维度为多维度特征信息,比行为树提取的特征信息的维度高,在通过学习网络学习之后获得的动作策略就更具体,从而进一步提高了游戏AI的灵活性。
具体请参阅图14所示,本申请实施例中服务器的另一个实施例包括:
收发器1401,处理器1402,总线1403;
该收发器1401与该处理器1402通过该总线1403相连;
该总线1403可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。为便于表示,图14中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
处理器1402可以是中央处理器(central processing unit,简称CPU),网络处理器(network processor,简称NP)或者CPU和NP的组合。
处理器1402还可以进一步包括硬件芯片。上述硬件芯片可以是专用集成电路(application-specific integrated circuit,简称ASIC),可编程逻辑器件(programmable logic device,简称PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,简称CPLD),现场可编程逻辑门阵列(field-programmable gate array,简称FPGA),通用阵列逻辑(generic array logic,简称GAL)或其任意组合。
参见图14所示,该服务器还可以包括存储器1404。该存储器1404可以包括易失性存储器(volatile memory),例如随机存取存储器(random-access memory,简称RAM);存储器也可以包括非易失性存储器(non-volatile memory),例如快闪存储器(flash memory),硬盘(hard disk drive,简称HDD)或固态硬盘(solid-state drive,简称SSD);存储器1404还可以包括上述种类的存储器的组合。
可选地,存储器1404还可以用于存储程序指令,处理器1402调用该存储器1404中存储的程序指令,可以执行图2至图7以及图11中所示实施例中的一个或多个步骤,或其中 可选的实施方式,实现上述方法中服务器行为的功能。
该收发器1401,执行如下步骤:
获取AI对象所处的第一环境的第一实时状态信息;
该处理器1402,执行如下步骤:
提取该第一实时状态信息的特征信息;
根据该特征信息和学习网络的权重值得到AI对象的动作策略;
该收发器1401,还执行如下步骤:
将该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略;获取AI对象所处的第二环境的第二实时状态信息,该第二环境为该AI对象执行该动作策略之后生成;
该处理器1402,还执行如下步骤:
根据该第二实时状态信息得到该动作策略的回报值;
若该回报值符合预设条件,则确定该学习网络的权重值为该学习网络的目标权重值;
根据该目标权重值建立该AI对象的行为模型。
本实施例中,该收发器1401还执行所有数据收发的步骤,该处理器1402还执行上述实施例中所有数据的处理步骤。
本实施例中,该收发器1401在获取到第一游戏环境的第一实时状态信息之后,该处理器1402提取该第一实时状态信息的多维度特征信息,然后根据该多维度特征信息和学习网络的权重值得到该游戏AI对象的动作策略;最后收发器1401将该动作策略反馈给该游戏AI对象,以使得该游戏AI对象执行该动作策略;然后该收发器1401获取该游戏AI对象执行了该动作策略之后的第二游戏环境的第二实时状态信息,该处理器1402根据该第二实时状态信息计算该动作策略的回报值,并在该回报值符合预设条件时确定该学习网络的权重值为目标权重值,并根据该目标权重值建立该游戏AI对象的行为模型。本申请根据环境的实时改变做出相应的决策,可以提高AI的灵活性。而且,由于该提取到的该特征信息的维度为多维度特征信息,比行为树提取的特征信息的维度高,在通过学习网络学习之后获得的动作策略就更具体,从而进一步提高了游戏AI的灵活性。
具体请参阅图15所示,本申请实施例中AI控制设备(即AI对象控制设备)包括:
获取模块1501,用于获取AI对象所处的环境的实时状态信息;
处理模块1502,用于提取该实时状态信息的特征信息;根据该特征信息和学习网络的权重值得到AI对象的动作策略,该学习网络的权重值为预置数值;
反馈模块1503,用于将该处理模块得到的该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略。
可选的,该处理模块1502,具体用于将该实时状态信息以预设格式传递给预设数目的卷积层;通过池化层和该预设数目的卷积层提取该实时状态信息得到降维特征值,该降维特征值为二维数据;将该降维特征值修改为一维数据,该一维数据作为该特征信息。
可选的,其中,该AI对象为游戏AI对象,该环境为游戏环境,该处理模块1502,具体用于提取该游戏环境的有效数据,该有效数据包括该游戏AI对象的角色参数、该游戏AI对象的位置参数、玩家游戏角色的角色参数、该玩家游戏角色的位置参数、游戏场景参 数中的至少一项;根据该有效数据绘制二维数据矩阵作为该实时状态信息,该二维数据矩阵表示图像。
本实施例中,获取模块1501在获取到游戏环境的实时状态信息之后,处理模块1502提取该实时状态信息的多维度特征信息,然后处理模块1502根据该多维度特征信息和学习网络的权重值得到该游戏AI对象的动作策略;最后反馈模块1503将该动作策略反馈给该游戏AI对象,以使得该游戏AI对象执行该动作策略。本申请根据环境的实时改变做出相应的决策,可以提高AI的灵活性。而且,由于该提取到的该特征信息的维度为多维度特征信息,比行为树提取的特征信息的维度高,在通过学习网络之后获得的动作策略就更具体,从而进一步提高了游戏AI的灵活性。
具体请参阅图16所示,本申请实施例中AI控制设备的另一个实施例包括:
收发器1601,处理器1602以及总线1603;
该收发器1601与该处理器1602通过该总线1603相连;
该处理器,执行如下步骤:
获取AI对象所处的环境的实时状态信息;
提取该实时状态信息的特征信息;
根据该特征信息和学习网络的权重值得到AI对象的动作策略,该学习网络的权重值为预置数值;
将该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略。
该总线1603可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。为便于表示,图16中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
处理器1602可以是中央处理器(central processing unit,简称CPU),网络处理器(network processor,简称NP)或者CPU和NP的组合。
处理器1602还可以进一步包括硬件芯片。上述硬件芯片可以是专用集成电路(application-specific integrated circuit,简称ASIC),可编程逻辑器件(programmable logic device,简称PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,简称CPLD),现场可编程逻辑门阵列(field-programmable gate array,简称FPGA),通用阵列逻辑(generic array logic,简称GAL)或其任意组合。
参见图16所示,该AI控制设备还可以包括存储器1604。该存储器1604可以包括易失性存储器(volatile memory),例如随机存取存储器(random-access memory,简称RAM);存储器也可以包括非易失性存储器(non-volatile memory),例如快闪存储器(flash memory),硬盘(hard disk drive,简称HDD)或固态硬盘(solid-state drive,简称SSD);存储器1604还可以包括上述种类的存储器的组合。
可选地,存储器1604还可以用于存储程序指令,处理器1602调用该存储器1604中存储的程序指令,可以执行图8至图10以及图12中所示实施例中的一个或多个步骤,或其 中可选的实施方式,实现上述方法中AI控制设备行为的功能。
该处理器1602,执行如下步骤:
获取AI对象所处的环境的实时状态信息;
提取该实时状态信息的特征信息;
根据该特征信息和学习网络的权重值得到AI对象的动作策略,该学习网络的权重值为预置数值;
将该动作策略反馈给该AI对象,以使得该AI对象执行该动作策略。
本实施例中,该收发器1601还执行所有数据收发的步骤,该处理器1602还执行上述实施例中所有数据的处理步骤。
本实施例中,处理器1602在获取到游戏环境的实时状态信息之后,处理器1602提取该实时状态信息的多维度特征信息,然后处理器1602根据该多维度特征信息和学习网络的权重值得到该游戏AI对象的动作策略;最后处理器1602将该动作策略反馈给该游戏AI对象,以使得该游戏AI对象执行该动作策略。本申请根据环境的实时改变做出相应的决策,可以提高AI的灵活性。而且,由于该提取到的该特征信息的维度为多维度特征信息,比行为树提取的特征信息的维度高,在通过学习网络之后获得的动作策略就更具体,从而进一步提高了游戏AI的灵活性。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存 储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (21)

  1. 一种AI对象的行为模型优化方法,其特征在于,包括:
    获取AI对象所处的第一环境的第一实时状态信息;
    提取所述第一实时状态信息的特征信息;
    根据所述特征信息和学习网络的权重值得到AI对象的动作策略;
    将所述动作策略反馈给所述AI对象,以使得所述AI对象执行所述动作策略;
    获取AI对象所处的第二环境的第二实时状态信息,所述第二环境为所述AI对象执行所述动作策略之后的环境;
    根据所述第二实时状态信息得到所述动作策略的回报值;
    若所述回报值符合预设条件,则确定所述学习网络的权重值为所述学习网络的目标权重值;根据所述目标权重值建立所述AI对象的行为模型。
  2. 根据权利要求1所述的方法,其特征在于,所述提取所述第一实时状态信息的特征信息包括:
    将所述第一实时状态信息以预设格式传递给预设数目的卷积层;
    通过池化层和所述预设数目的卷积层提取所述第一实时状态信息得到降维特征值,所述降维特征值为二维数据;
    将所述降维特征值修改为一维数据作为所述特征信息。
  3. 根据权利要求2所述的方法,其中,所述预设格式为长宽均为80像素的图片,所述预设数目为5,所述卷积层的卷积核长宽均为3像素,所述卷积步长为1,所述池化层的降维设置为在长宽均为2像素的区域内选择最大值作为所述降维特征值。
  4. 根据权利要求1所述的方法,其特征在于,根据所述第二实时状态信息得到所述动作策略的回报值之后,所述方法还包括:
    若所述回报值不符合所述预设条件,则修改所述学习网络的权重值。
  5. 根据权利要求1至4中任一项所述的方法,其特征在于,所述AI对象为游戏AI对象,所述第一环境为第一游戏环境,
    其中,获取AI对象所处的第一环境的第一实时状态信息包括:
    获取所述第一游戏环境的有效数据,所述有效数据包括所述游戏AI对象的角色参数、所述游戏AI对象的位置参数、游戏玩家角色的角色参数、所述游戏玩家角色的位置参数、游戏场景参数中的至少一项,所述有效数据在以所述游戏AI的预设部位为中心,以预设数值为半径的区域内的游戏环境内提取;
    根据所述有效数据绘制二维数据矩阵作为所述第一实时状态信息。
  6. 根据权利要求5所述的方法,其特征在于,根据所述有效数据绘制二维数据矩阵作为所述第一实时状态信息包括:
    获取所述有效数据中各游戏对象对应的颜色数值,所述颜色数值用于表示所述游戏环境中的各游戏对象的颜色,所述游戏对象包括所述游戏AI对象、所述玩家游戏角色以及所述游戏场景;
    根据所述各游戏对象对应的颜色数值绘制所述二维数据矩阵作为所述第一实时状态信 息。
  7. 根据权利要求1至4中任一项所述的方法,其特征在于,获取AI对象所处的第一环境的第一实时状态信息包括:
    获取所述第一环境的截图图片作为所述第一实时状态信息。
  8. 根据权利要求1至4中任一项所述的方法,其特征在于,所述特征信息为多维度信息,所述第一环境为游戏环境样本集合,所述游戏环境样本集合包括玩家实时操作游戏环境和预设游戏环境。
  9. 一种AI对象的控制方法,其特征在于,包括:
    获取AI对象所处的环境的实时状态信息;
    提取所述实时状态信息的特征信息;
    根据所述特征信息和学习网络的权重值得到AI对象的动作策略,所述学习网络的权重值为预置数值;
    将所述动作策略反馈给所述AI对象,以使得所述AI对象执行所述动作策略。
  10. 根据权利要求9所述的方法,其特征在于,所述提取所述实时状态信息的特征信息包括:
    将所述实时状态信息以预设格式传递给预设数目的卷积层;
    通过池化层和所述预设数目的卷积层提取所述实时状态信息得到降维特征值,所述降维特征值为二维数据;
    将所述降维特征值修改为一维数据作为所述特征信息。
  11. 根据权利要求10所述的方法,其特征在于,所述预设格式为长宽均为80像素的图片,所述预设数目为5,所述卷积层的卷积核长宽均为3像素,所述卷积步长为1,所述池化层的降维设置为在长宽均为2像素的区域内选择最大值作为所述降维特征值。
  12. 根据权利要求9所述的方法,其特征在于,所述AI对象为游戏AI对象,所述环境为游戏环境,获取AI对象所处的环境的实时状态信息包括:
    提取所述游戏环境的有效数据,所述有效数据包括所述游戏AI对象的角色参数、所述游戏AI对象的位置参数、玩家游戏角色的角色参数、所述玩家游戏角色的位置参数、游戏场景参数中的至少一项,所述有效数据在以所述游戏AI的预设部位为中心,以预设数值为半径的区域内的游戏环境内提取;
    根据所述有效数据绘制二维数据矩阵作为所述实时状态信息,所述二维数据矩阵表示图像。
  13. 一种AI对象的行为模型优化装置,其特征在于,包括:
    获取模块,用于获取AI对象所处的第一环境的第一实时状态信息;
    处理模块,用于提取所述获取模块获取到的所述第一实时状态信息的特征信息;根据所述特征信息和学习网络的权重值得到AI对象的动作策略;
    反馈模块,用于将所述处理模块得到的所述动作策略反馈给所述AI对象,以使得所述AI对象执行所述动作策略;
    所述获取模块,用于获取AI对象所处的第二环境的第二实时状态信息,所述第二环 境为所述AI对象执行所述动作策略之后的环境;
    所述处理模块,用于根据所述获取模块获取到的所述第二实时状态信息得到所述动作策略的回报值;若所述回报值符合预设条件,则确定所述学习网络的权重值为所述学习网络的目标权重值;根据所述目标权重值建立所述AI对象的行为模型。
  14. 根据权利要求13所述的装置,其特征在于,所述处理模块,具体用于将所述第一实时状态信息以预设格式传递给预设数目的卷积层;通过池化层和所述预设数目的卷积层提取所述第一实时状态信息得到降维特征值,所述降维特征值为二维数据;将所述降维特征值修改为一维数据作为所述特征信息。
  15. 根据权利要求13所述的装置,其中,根据所述第二实时状态信息得到所述动作策略的回报值之后,所述处理模块,还用于若所述回报值不符合所述预设条件,则修改所述学习网络的权重值。
  16. 一种AI对象控制设备,其特征在于,包括:
    获取模块,用于获取AI对象所处的环境的实时状态信息;
    处理模块,用于提取所述实时状态信息的特征信息;根据所述特征信息和学习网络的权重值得到AI对象的动作策略,所述学习网络的权重值为预置数值;
    反馈模块,用于将所述处理模块得到的所述动作策略反馈给所述AI对象,以使得所述AI对象执行所述动作策略。
  17. 根据权利要求16所述的设备,其特征在于,所述处理模块,还用于将所述实时状态信息以预设格式传递给预设数目的卷积层;通过池化层和所述预设数目的卷积层提取所述实时状态信息得到降维特征值,所述降维特征值为二维数据;将所述降维特征值修改为一维数据作为所述特征信息。
  18. 根据权利要求16所述的设备,其特征在于,所述AI对象为游戏AI对象,所述环境为游戏环境,所述处理模块,还用于提取所述游戏环境的有效数据,所述有效数据包括所述游戏AI对象的角色参数、所述游戏AI对象的位置参数、玩家游戏角色的角色参数、所述玩家游戏角色的位置参数、游戏场景参数中的至少一项;
    根据所述有效数据绘制二维数据矩阵作为所述实时状态信息,所述二维数据矩阵表示图像。
  19. 一种AI对象的行为模型建立装置,包括:
    一个或多个处理器;
    和,
    存储器,所述存储器存储程序指令,所述指令当由所述一个或多个处理器执行时,配置所述装置执行根据权利要求1至权利要求8中任一项所述的方法。
  20. 一种AI对象控制设备,包括:
    一个或多个处理器;
    和,
    存储器,所述存储器存储程序指令,所述指令当由所述一个或多个处理器执行时,配置所述装置执行根据权利要求9至权利要求12中任一项所述的方法。
  21. 一种计算机可读存储介质,包括指令,当该指令在计算装置的处理器上运行时,该装置执行权利要求1至权利要求12中任一项所述的方法。
PCT/CN2017/106507 2017-10-17 2017-10-17 一种ai对象行为模型优化方法以及装置 WO2019075632A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201780048483.4A CN109843401B (zh) 2017-10-17 2017-10-17 一种ai对象行为模型优化方法以及装置
PCT/CN2017/106507 WO2019075632A1 (zh) 2017-10-17 2017-10-17 一种ai对象行为模型优化方法以及装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/106507 WO2019075632A1 (zh) 2017-10-17 2017-10-17 一种ai对象行为模型优化方法以及装置

Publications (1)

Publication Number Publication Date
WO2019075632A1 true WO2019075632A1 (zh) 2019-04-25

Family

ID=66173024

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/106507 WO2019075632A1 (zh) 2017-10-17 2017-10-17 一种ai对象行为模型优化方法以及装置

Country Status (2)

Country Link
CN (1) CN109843401B (zh)
WO (1) WO2019075632A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110327624A (zh) * 2019-07-03 2019-10-15 广州多益网络股份有限公司 一种基于课程强化学习的游戏跟随方法和系统
CN111901146A (zh) * 2020-06-28 2020-11-06 北京可信华泰信息技术有限公司 一种对象访问的控制方法和装置
CN112382151A (zh) * 2020-11-16 2021-02-19 深圳市商汤科技有限公司 一种线上学习方法及装置、电子设备及存储介质

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110996158B (zh) * 2019-12-26 2021-10-29 广州市百果园信息技术有限公司 一种虚拟物品的显示方法、装置、计算机设备和存储介质
CN111359212A (zh) * 2020-02-20 2020-07-03 网易(杭州)网络有限公司 游戏对象控制、模型训练方法及装置
CN112437690A (zh) * 2020-04-02 2021-03-02 支付宝(杭州)信息技术有限公司 确定执行设备的动作选择方针
CN111494959B (zh) * 2020-04-22 2021-11-09 腾讯科技(深圳)有限公司 游戏操控方法、装置、电子设备及计算机可读存储介质
CN111729300A (zh) * 2020-06-24 2020-10-02 贵州大学 基于蒙特卡洛树搜索和卷积神经网络斗地主策略研究方法
CN112044063B (zh) * 2020-09-02 2022-05-17 腾讯科技(深圳)有限公司 游戏对象动态变化方法、装置、设备及存储介质
CN112619125B (zh) * 2020-12-30 2023-10-13 深圳市创梦天地科技有限公司 游戏人工智能模块的使用方法和电子设备
CN112783781A (zh) * 2021-01-28 2021-05-11 网易(杭州)网络有限公司 游戏测试方法、装置、电子设备及存储介质
CN113209622A (zh) * 2021-05-28 2021-08-06 北京字节跳动网络技术有限公司 动作的确定方法、装置、可读介质和电子设备

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599198A (zh) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 一种多级联结循环神经网络的图像描述方法
CN106777125A (zh) * 2016-12-16 2017-05-31 广东顺德中山大学卡内基梅隆大学国际联合研究院 一种基于神经网络及图像关注点的图像描述生成方法

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10142909B2 (en) * 2015-10-13 2018-11-27 The Board Of Trustees Of The University Of Alabama Artificial intelligence-augmented, ripple-diamond-chain shaped rateless routing in wireless mesh networks with multi-beam directional antennas
CN106422332B (zh) * 2016-09-08 2019-02-26 腾讯科技(深圳)有限公司 应用于游戏的人工智能操作方法和装置
CN106970615B (zh) * 2017-03-21 2019-10-22 西北工业大学 一种深度强化学习的实时在线路径规划方法
CN107066553B (zh) * 2017-03-24 2021-01-01 北京工业大学 一种基于卷积神经网络与随机森林的短文本分类方法

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599198A (zh) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 一种多级联结循环神经网络的图像描述方法
CN106777125A (zh) * 2016-12-16 2017-05-31 广东顺德中山大学卡内基梅隆大学国际联合研究院 一种基于神经网络及图像关注点的图像描述生成方法

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SODHI, S.: "AI for Classic Video Games Using Reinforcement Learning", 25 May 2017, page 15, retrieved from the Internet <URL:http://scholarworks.sjsu.edi/etd_project/538> *
STANESCU, M. ET AL.: "Evaluating Real-Time Strategy Game States Using Convolutional Neural Networks", 2016 IEEE Conference on Computational Intelligence and Games (CIG), 23 February 2017, pages 1-7, XP033067659 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110327624A (zh) * 2019-07-03 2019-10-15 广州多益网络股份有限公司 一种基于课程强化学习的游戏跟随方法和系统
CN111901146A (zh) * 2020-06-28 2020-11-06 北京可信华泰信息技术有限公司 一种对象访问的控制方法和装置
CN112382151A (zh) * 2020-11-16 2021-02-19 深圳市商汤科技有限公司 一种线上学习方法及装置、电子设备及存储介质
CN112382151B (zh) * 2020-11-16 2022-11-18 深圳市商汤科技有限公司 一种线上学习方法及装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN109843401B (zh) 2020-11-24
CN109843401A (zh) 2019-06-04


Legal Events

NENP: Non-entry into the national phase (Ref country code: DE)
122: Ep: PCT application non-entry in European phase (Ref document number: 17929298; Country of ref document: EP; Kind code of ref document: A1)