CN117933353A - Reinforcement learning model training method and device, electronic equipment and storage medium - Google Patents

Reinforcement learning model training method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN117933353A
Authority
CN
China
Prior art keywords
training
model
reinforcement learning
scene
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410113296.1A
Other languages
Chinese (zh)
Inventor
徐亮 (Xu Liang)
单彬 (Shan Bin)
赵鉴 (Zhao Jian)
秦熔均 (Qin Rongjun)
俞扬 (Yu Yang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanqi Xiance Nanjing Technology Co ltd
Original Assignee
Nanqi Xiance Nanjing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanqi Xiance Nanjing Technology Co ltd filed Critical Nanqi Xiance Nanjing Technology Co ltd
Priority to CN202410113296.1A priority Critical patent/CN117933353A/en
Publication of CN117933353A publication Critical patent/CN117933353A/en
Pending legal-status Critical Current

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a reinforcement learning model training method, a reinforcement learning model training device, electronic equipment and a storage medium. The method comprises the following steps: obtaining a pre-training model obtained by reinforcement learning training according to first scene sample data, wherein the pre-training model comprises a state sensing network and an action decision network; multiplexing (i.e., reusing) the action decision network in the pre-training model; and acquiring second scene sample data, and training the state sensing network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model. According to the technical scheme, transfer learning of the reinforcement learning model is realized, and the prediction accuracy of the reinforcement learning model in a new scene is effectively improved.

Description

Reinforcement learning model training method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of reinforcement learning technologies, and in particular, to a reinforcement learning model training method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of reinforcement learning, the technology has come to be widely applied in a variety of scenarios.
In the course of making the invention, the inventors found that the prior art has at least the following technical problem: in existing reinforcement learning schemes, when a reinforcement learning model is transferred to a new scene, its prediction accuracy is low.
Disclosure of Invention
The invention provides a reinforcement learning model training method, a reinforcement learning model training device, electronic equipment and a storage medium, so as to improve the prediction accuracy of a reinforcement learning model.
According to an aspect of the present invention, there is provided a reinforcement learning model training method including:
obtaining a pre-training model obtained by reinforcement learning training according to first scene sample data, wherein the pre-training model comprises a state sensing network and an action decision network;
Multiplexing an action decision network in the pre-training model;
And acquiring second scene sample data, and training a state sensing network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model.
According to another aspect of the present invention, there is provided a reinforcement learning model training apparatus including:
the pre-training model acquisition module is used for acquiring a pre-training model obtained by performing reinforcement learning training according to the first scene sample data, wherein the pre-training model comprises a state sensing network and an action decision network;
the action decision network multiplexing module is used for multiplexing the action decision network in the pre-training model;
The target reinforcement learning model determining module is used for acquiring second scene sample data, and training the state sensing network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model.
According to another aspect of the present invention, there is provided an electronic apparatus including:
At least one processor;
And a memory communicatively coupled to the at least one processor;
Wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the reinforcement learning model training method of any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the reinforcement learning model training method according to any one of the embodiments of the present invention when executed.
According to the technical scheme, a pre-training model is obtained through reinforcement learning training according to first scene sample data, wherein the pre-training model comprises a state sensing network and an action decision network; multiplexing an action decision network in the pre-training model; and acquiring second scene sample data, training a state sensing network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model, so that the transfer learning of the reinforcement learning model is realized, and the prediction accuracy of the reinforcement learning model in a new scene is effectively improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a reinforcement learning model training method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a reinforcement learning model training method according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of reinforcement learning model training provided in accordance with an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a training device for reinforcement learning model according to a third embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device implementing the reinforcement learning model training method according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description, the claims, and the above figures of the present application are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The acquisition, storage, use, and processing of data in the technical solution of the present application all comply with the relevant provisions of national laws and regulations.
Example 1
Fig. 1 is a flowchart of a reinforcement learning model training method according to an embodiment of the present invention. The method is applicable to migration (transfer) training of a reinforcement learning model and may be performed by a reinforcement learning model training apparatus, which may be implemented in hardware and/or software and may be configured in a terminal and/or a server. As shown in fig. 1, the method includes:
s110, obtaining a pre-training model obtained by reinforcement learning training according to the first scene sample data, wherein the pre-training model comprises a state sensing network and an action decision network.
S120, multiplexing the action decision network in the pre-training model.
S130, acquiring second scene sample data, and training a state sensing network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model.
In an embodiment of the disclosure, the first scene sample data and the second scene sample data refer to the model training data used by the reinforcement learning model in specific application scenes, where the first scene sample data and the second scene sample data come from different application scenes. The application scene may be, for example, adversarial gameplay, an industrial setting, or medical treatment. For instance, the first scene sample data may be data available for model training in map scenes A, B, and C of a game, and the second scene sample data may be data available for model training in map scene D of the game; the model training data may include information such as the blood volume of a character, the attack power of the character, and the position of the character.
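The following is a minimal sketch of how one such training sample might be organized in code. The field names (hp, attack, position, action) and the flattening into a state vector are illustrative assumptions, not the patent's actual data format.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SceneSample:
    scene_id: str                   # e.g. "map_A" for first-scene data, "map_D" for second-scene data
    hp: float                       # blood volume of the character
    attack: float                   # attack power of the character
    position: Tuple[float, float]   # position of the character on the map
    action: int                     # index of the action taken (displacement, attack, ...)

    def state_vector(self) -> Tuple[float, ...]:
        """Flatten the state information into the feature vector fed to the model."""
        return (self.hp, self.attack, *self.position)

# One hypothetical sample collected in map scene A of the first scene sample data.
sample = SceneSample(scene_id="map_A", hp=87.5, attack=12.0, position=(3.4, 7.9), action=2)
print(sample.state_vector())        # (87.5, 12.0, 3.4, 7.9)
```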
For example, the first scene sample data or the second scene sample data may be read from a preset storage path of the electronic device, or may be retrieved from another device or a cloud connected to the electronic device, which is not limited herein.
In an embodiment of the present disclosure, the pre-training model may be obtained by performing reinforcement learning training according to the first scene sample data. The pre-training model includes, but is not limited to, a state-aware network and an action decision network, where the state-aware network may be used to learn environmental state information and the action decision network is used to output a state value function.
It should be noted that when training in a new scene, the pre-trained action decision network is reused with its network parameters fixed, so that model training in the new scene only needs to focus on training the state-aware network. This reduces model training time and improves the training efficiency of reinforcement learning.
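A hedged sketch of this setup in PyTorch is shown below: the action decision network's parameters are frozen, and only the state-aware network's parameters are handed to the optimizer. The attribute names state_net and decision_net, the optimizer choice, and the learning rate are assumptions for illustration.

```python
import torch

def build_transfer_optimizer(pretrained_model, lr=1e-3):
    # Freeze the pre-trained action decision network so new-scene training cannot change it.
    for p in pretrained_model.decision_net.parameters():
        p.requires_grad = False
    # Only the state-aware network's parameters are optimized in the new scene.
    trainable = [p for p in pretrained_model.state_net.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```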
Optionally, the first scene sample data includes first state information corresponding to the first agent in the first map scene and first action information corresponding to the first agent in the first map scene; correspondingly, the training step of the pre-training model comprises the following steps: inputting first state information corresponding to a first agent in the first map scene and first action information corresponding to the first agent in the first map scene into an initial reinforcement learning model to obtain a first state value function value; and updating model parameters of the initial reinforcement learning model based on the first state value function value until the training stopping condition of the initial reinforcement learning model is met, so as to obtain a pre-training model.
In the embodiment of the present disclosure, the first map scene may be a scene corresponding to an in-game map, the first agent may be a character in the game, the number of first agents may be one or more, the first state information may be information such as the character's blood volume and attack power, and the first action information may be information such as the character's displacement or attack actions. The first state value function value is the action value that the reinforcement learning model outputs for the current state information.
The state information and the action information are jointly used as the model input during training, so that the action decision network in the reinforcement learning model can be reused.
Optionally, the first state value function value comprises state value function values corresponding to a plurality of tasks; correspondingly, updating the model parameters of the initial reinforcement learning model based on the first state value function value comprises: aggregating the state value function values corresponding to the plurality of tasks to obtain a model predicted value; inputting the model predicted value and a model target value into a preset loss function to obtain a model loss; and updating the network parameters corresponding to the state-aware network in the pre-training model based on the model loss.
In the embodiment of the disclosure, each task has corresponding state information and action information, and a state value function value corresponding to the state information and the action information.
Specifically, the aggregation of the state value function values corresponding to the plurality of tasks can be performed by a Mixer module to obtain a model predicted value; the model loss is then determined from the model predicted value and the model target value, and the network parameters corresponding to the state-aware network in the pre-training model are updated according to the model loss until model training ends. Here, the model target value refers to the label used for model training.
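The sketch below illustrates this aggregation step under stated assumptions: a multi-task model is assumed to return one state value function value per task, the Mixer is shown as a simple learned linear combination (the patent only states that a Mixer module performs the aggregation), and mean squared error against the model target value is an assumed choice of preset loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMixer(nn.Module):
    """Aggregates per-task state value function values into one model predicted value."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.mix = nn.Linear(num_tasks, 1, bias=False)    # learned mixing weights

    def forward(self, task_values: torch.Tensor) -> torch.Tensor:
        # task_values: (batch, num_tasks) -> (batch,) aggregated prediction
        return self.mix(task_values).squeeze(-1)

def pretraining_step(multitask_model, mixer, optimizer, states, actions, targets):
    task_values = multitask_model(states, actions)        # (batch, num_tasks) per-task values
    prediction = mixer(task_values)                       # model predicted value
    loss = F.mse_loss(prediction, targets)                # compare with the model target value (label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```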
Optionally, the preset loss function includes:
Loss = Σ_{i=1}^{N} ω_i · loss_i, with loss_i = (1/M) · Σ_{j=1}^{M} (ŷ_j − y_j)²;
where ω_i represents the weight of the i-th task and loss_i represents the loss function of the i-th task; p_i represents an evaluation index of the convergence degree of the i-th task, according to which ω_i is dynamically adjusted; ŷ_j represents the model predicted value corresponding to the j-th training sample; y_j represents the model target value corresponding to the j-th training sample; N represents the number of tasks; and M represents the number of training samples.
In the embodiment of the present disclosure, the evaluation index of the task convergence degree may be, for example, the probability of task failure under the current state information. It should be noted that, in this embodiment, dynamically adjusting the weight of each task ensures the training accuracy of the model in the multi-task setting.
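A short sketch of such a dynamically weighted multi-task loss follows. Normalizing the convergence indices into weights and using a squared-error per-task loss are assumptions consistent with the description above, not the patent's exact formula.

```python
import torch

def weighted_multitask_loss(per_task_losses: torch.Tensor,
                            convergence_index: torch.Tensor) -> torch.Tensor:
    # per_task_losses: (N,) loss_i for each task, e.g. mean squared error over M samples
    # convergence_index: (N,) p_i, larger when the task is further from convergence
    weights = convergence_index / convergence_index.sum()   # dynamic weights ω_i, summing to 1
    return (weights.detach() * per_task_losses).sum()       # the weights themselves are not trained
```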
Optionally, the state-aware network comprises a fully connected layer and a gated recurrent unit, with the fully connected layer connected to the gated recurrent unit; the action decision network comprises a fully connected layer.
The state information and the action information can be input together to the fully connected layer of the state-aware network, which converts the input dimension into the hidden-layer dimension. The dimension-converted features are then input to the gated recurrent unit, which outputs state-aware features, and the state-aware features are input to the fully connected layer of the action decision network to obtain the one-dimensional state value function value.
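A minimal PyTorch sketch of this structure is given below: a fully connected layer followed by a GRU forms the state-aware network, and a fully connected layer forms the action decision network that outputs a one-dimensional state value function value. The layer sizes and the attribute names state_net and decision_net (matching the freezing sketch above) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StateAwareNetwork(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.fc = nn.Linear(input_dim, hidden_dim)      # map the input dimension to the hidden dimension
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)   # gated recurrent unit

    def forward(self, state_action: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc(state_action))
        return self.gru(x, h)                           # state-aware feature (also the next hidden state)

class ActionDecisionNetwork(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 1)              # fully connected layer -> scalar value

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        return self.fc(feature).squeeze(-1)             # one-dimensional state value function value

class PretrainModel(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.state_net = StateAwareNetwork(input_dim, hidden_dim)
        self.decision_net = ActionDecisionNetwork(hidden_dim)

    def forward(self, state_action: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return self.decision_net(self.state_net(state_action, h))
```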
According to the technical scheme, a pre-training model is obtained through reinforcement learning training according to first scene sample data, wherein the pre-training model comprises a state sensing network and an action decision network; multiplexing an action decision network in the pre-training model; and acquiring second scene sample data, training a state sensing network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model, so that the transfer learning of the reinforcement learning model is realized, and the prediction accuracy of the reinforcement learning model in a new scene is effectively improved.
Example two
Fig. 2 is a flowchart of a reinforcement learning model training method according to a second embodiment of the present invention. The method of this embodiment may be combined with each of the alternatives in the reinforcement learning model training method provided in the foregoing embodiment, and further optimizes that method. Optionally, the second scene sample data includes second state information corresponding to a second agent in a second map scene and second action information corresponding to the second agent in the second map scene; correspondingly, training the state-aware network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model includes: inputting the second state information corresponding to the second agent in the second map scene and the second action information corresponding to the second agent in the second map scene into the pre-training model to obtain a second state value function value; and updating the network parameters corresponding to the state-aware network in the pre-training model based on the second state value function value until a model training stopping condition is met, so as to obtain the target reinforcement learning model.
As shown in fig. 2, the method includes:
S210, obtaining a pre-training model obtained by reinforcement learning training according to first scene sample data, wherein the pre-training model comprises a state sensing network and an action decision network.
S220, multiplexing the action decision network in the pre-training model.
S230, second scene sample data are acquired, wherein the second scene sample data comprise second state information corresponding to a second agent in a second map scene and second action information corresponding to the second agent in the second map scene.
S240, inputting the second state information corresponding to the second agent in the second map scene and the second action information corresponding to the second agent in the second map scene into the pre-training model to obtain a second state value function value.
S250, updating network parameters corresponding to the state-aware network in the pre-training model based on the second state value function value until a model training stopping condition is met, and obtaining a target reinforcement learning model.
In the embodiment of the disclosure, the second map scene may be a scene corresponding to an in-game map, the second agent may be a character in the game, the number of second agents may be one or more, the state information may be information such as the character's blood volume and attack power, and the action information may be information such as the character's displacement or attack actions. The second state value function value is the action value that the reinforcement learning model outputs for the current state information.
Illustratively, FIG. 3 is a schematic diagram of reinforcement learning model training provided in accordance with an embodiment of the present invention. Specifically, in the model pre-training stage, the initial reinforcement learning model can be pre-trained with the state information (s) and action information (a) of N tasks to obtain the pre-training model, where Q_N(s, a) represents the state value function value corresponding to the N-th task. In the model migration stage, the action decision network in the pre-training model is reused, and the state-aware network is trained based on the state information and action information corresponding to the new task (task new) to obtain the target reinforcement learning model. Because training on the new task only needs to focus on the state-aware network, model training time is reduced and the training efficiency of reinforcement learning is improved.
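Below is a hedged sketch of one update in this migration stage, continuing the assumed interfaces from the earlier sketches: second-scene state and action information pass through the pre-trained model, and the resulting loss updates only the state-aware network, because the optimizer built in the freezing sketch excludes the action decision network. The target_values tensor stands in for whatever training label the second scene provides; its construction is not specified here.

```python
import torch
import torch.nn.functional as F

def transfer_step(pretrained_model, optimizer, states, actions, hidden, target_values):
    state_action = torch.cat([states, actions], dim=-1)   # joint state/action input, as in pre-training
    q_values = pretrained_model(state_action, hidden)      # second state value function values
    loss = F.mse_loss(q_values, target_values)             # assumed squared-error training objective
    optimizer.zero_grad()
    loss.backward()                                        # gradients reach only the state-aware network
    optimizer.step()
    return loss.item()
```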
According to the technical scheme, the second state information corresponding to the second intelligent agent in the second map scene and the second action information corresponding to the second intelligent agent in the second map scene are input into the pre-training model to obtain the second state value function value, and further the network parameters corresponding to the state sensing network in the pre-training model are updated based on the second state value function value until the model training stopping condition is met, so that the target reinforcement learning model is obtained, the transfer learning of the reinforcement learning model is realized, and the prediction accuracy of the reinforcement learning model in a new scene is effectively improved.
Example III
Fig. 4 is a schematic structural diagram of a training device for reinforcement learning model according to a third embodiment of the present invention. As shown in fig. 4, the apparatus includes:
a pre-training model obtaining module 310, configured to obtain a pre-training model obtained by performing reinforcement learning training according to first scene sample data, where the pre-training model includes a state sensing network and an action decision network;
An action decision network multiplexing module 320, configured to multiplex the action decision network in the pre-training model;
the target reinforcement learning model determining module 330 is configured to obtain second scene sample data, and train the state-aware network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model.
According to the technical scheme, a pre-training model is obtained through reinforcement learning training according to first scene sample data, wherein the pre-training model comprises a state sensing network and an action decision network; multiplexing an action decision network in the pre-training model; and acquiring second scene sample data, training a state sensing network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model, so that the transfer learning of the reinforcement learning model is realized, and the prediction accuracy of the reinforcement learning model in a new scene is effectively improved.
In some optional embodiments, the second scene sample data includes second state information corresponding to a second agent in the second map scene and second action information corresponding to the second agent in the second map scene;
accordingly, the target reinforcement learning model determination module 330 includes:
A second state value function value determining unit, configured to input the second state information corresponding to the second agent in the second map scene and the second action information corresponding to the second agent in the second map scene into the pre-training model, to obtain a second state value function value;
And a state-aware network updating unit, configured to update the network parameters corresponding to the state-aware network in the pre-training model based on the second state value function value until the model training stopping condition is met, so as to obtain the target reinforcement learning model.
In some optional embodiments, the first scene sample data includes first status information corresponding to the first agent in the first map scene and first action information corresponding to the first agent in the first map scene;
Correspondingly, the training process of the pre-training model comprises the following steps:
The first state value function value determining module is used for inputting first state information corresponding to a first intelligent agent in the first map scene and first action information corresponding to the first intelligent agent in the first map scene into the initial reinforcement learning model to obtain a first state value function value;
and the pre-training model parameter updating module is used for updating the model parameters of the initial reinforcement learning model based on the first state value function value until the training stopping condition of the initial reinforcement learning model is met, so as to obtain a pre-training model.
In some optional embodiments, the first state value function value comprises state value function values corresponding to a plurality of tasks; correspondingly, the pre-training model parameter updating module is further used for:
Aggregating state value function values corresponding to a plurality of tasks to obtain a model predictive value;
inputting the model predicted value and the model target value into a preset loss function to obtain model loss;
and updating network parameters corresponding to the state-aware network in the pre-training model based on the model loss.
In some alternative embodiments, the preset loss function includes:
Loss = Σ_{i=1}^{N} ω_i · loss_i, with loss_i = (1/M) · Σ_{j=1}^{M} (ŷ_j − y_j)²;
where ω_i represents the weight of the i-th task and loss_i represents the loss function of the i-th task; p_i represents an evaluation index of the convergence degree of the i-th task, according to which ω_i is dynamically adjusted; ŷ_j represents the model predicted value corresponding to the j-th training sample; y_j represents the model target value corresponding to the j-th training sample; N represents the number of tasks; and M represents the number of training samples.
In some optional embodiments, the state-aware network includes a fully connected layer and a gated recurrent unit, the fully connected layer being connected to the gated recurrent unit; the action decision network includes a fully connected layer.
The reinforcement learning model training device provided by the embodiment of the present invention can perform the reinforcement learning model training method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method performed.
Example IV
Fig. 5 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, eyeglasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An I/O interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the reinforcement learning model training method, which includes:
obtaining a pre-training model obtained by reinforcement learning training according to first scene sample data, wherein the pre-training model comprises a state sensing network and an action decision network;
Multiplexing an action decision network in the pre-training model;
And acquiring second scene sample data, and training a state sensing network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model.
In some embodiments, the reinforcement learning model training method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. One or more of the steps of the reinforcement learning model training method described above may be performed when the computer program is loaded into RAM 13 and executed by processor 11. Alternatively, in other embodiments, processor 11 may be configured to perform the reinforcement learning model training method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), system-on-chip (SOCs), complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of training a reinforcement learning model, comprising:
obtaining a pre-training model obtained by reinforcement learning training according to first scene sample data, wherein the pre-training model comprises a state sensing network and an action decision network;
Multiplexing an action decision network in the pre-training model;
And acquiring second scene sample data, and training a state sensing network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model.
2. The method of claim 1, wherein the first scene sample data includes first status information corresponding to a first agent in a first map scene and first action information corresponding to the first agent in the first map scene;
correspondingly, the training step of the pre-training model comprises the following steps:
inputting first state information corresponding to a first agent in the first map scene and first action information corresponding to the first agent in the first map scene into an initial reinforcement learning model to obtain a first state value function value;
And updating model parameters of the initial reinforcement learning model based on the first state value function value until the training stopping condition of the initial reinforcement learning model is met, so as to obtain a pre-training model.
3. The method of claim 2, wherein the first state value function value comprises state value function values corresponding to a plurality of tasks;
correspondingly, the updating the model parameters of the initial reinforcement learning model based on the first state value function comprises the following steps:
Aggregating state value function values corresponding to a plurality of tasks to obtain a model predictive value;
inputting the model predicted value and the model target value into a preset loss function to obtain model loss;
and updating network parameters corresponding to the state-aware network in the pre-training model based on the model loss.
4. A method according to claim 3, wherein the predetermined loss function comprises:
Loss = Σ_{i=1}^{N} ω_i · loss_i, with loss_i = (1/M) · Σ_{j=1}^{M} (ŷ_j − y_j)²;
where ω_i represents the weight of the i-th task and loss_i represents the loss function of the i-th task; p_i represents an evaluation index of the convergence degree of the i-th task, according to which ω_i is dynamically adjusted; ŷ_j represents the model predicted value corresponding to the j-th training sample; y_j represents the model target value corresponding to the j-th training sample; N represents the number of tasks; and M represents the number of training samples.
5. The method of claim 1, wherein the second scene sample data includes second status information corresponding to a second agent in a second map scene and second action information corresponding to a second agent in the second map scene;
Correspondingly, the training the state sensing network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model includes:
inputting second state information corresponding to a second agent in the second map scene and second action information corresponding to the second agent in the second map scene into the pre-training model to obtain a second state value function value;
And updating network parameters corresponding to the state-aware network in the pre-training model based on the second state value function value until a model training stopping condition is met, so as to obtain a target reinforcement learning model.
6. The method of claim 1, wherein the state-aware network comprises a fully connected layer and a gated recurrent unit, the fully connected layer being connected to the gated recurrent unit; the action decision network comprises a fully connected layer.
7. A reinforcement learning model training device, comprising:
the pre-training model acquisition module is used for acquiring a pre-training model obtained by performing reinforcement learning training according to the first scene sample data, wherein the pre-training model comprises a state sensing network and an action decision network;
the action decision network multiplexing module is used for multiplexing the action decision network in the pre-training model;
The target reinforcement learning model determining module is used for acquiring second scene sample data, and training the state sensing network in the pre-training model based on the second scene sample data to obtain a target reinforcement learning model.
8. The apparatus of claim 7, wherein the second scene sample data includes second status information corresponding to a second agent in a second map scene and second action information corresponding to a second agent in the second map scene;
correspondingly, the target reinforcement learning model determining module comprises:
A second state value function value determining unit, configured to input second state information corresponding to the second agent in the second map scene and second action information corresponding to the second agent in the second map scene into the pre-training model, to obtain a second state value function value;
And a state-aware network updating unit, configured to update the network parameters corresponding to the state-aware network in the pre-training model based on the second state value function value until the model training stopping condition is met, so as to obtain the target reinforcement learning model.
9. An electronic device, the electronic device comprising:
At least one processor;
And a memory communicatively coupled to the at least one processor;
Wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the reinforcement learning model training method of any of claims 1-6.
10. A computer readable storage medium storing computer instructions for causing a processor to implement the reinforcement learning model training method of any one of claims 1-6 when executed.
CN202410113296.1A 2024-01-25 2024-01-25 Reinforcement learning model training method and device, electronic equipment and storage medium Pending CN117933353A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410113296.1A CN117933353A (en) 2024-01-25 2024-01-25 Reinforcement learning model training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410113296.1A CN117933353A (en) 2024-01-25 2024-01-25 Reinforcement learning model training method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117933353A true CN117933353A (en) 2024-04-26

Family

ID=90760519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410113296.1A Pending CN117933353A (en) 2024-01-25 2024-01-25 Reinforcement learning model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117933353A (en)

Similar Documents

Publication Publication Date Title
CN113705628B (en) Determination method and device of pre-training model, electronic equipment and storage medium
CN114187459A (en) Training method and device of target detection model, electronic equipment and storage medium
CN114462577A (en) Federated learning system, method, computer equipment and storage medium
CN113627536A (en) Model training method, video classification method, device, equipment and storage medium
CN117474091A (en) Knowledge graph construction method, device, equipment and storage medium
CN112784102B (en) Video retrieval method and device and electronic equipment
CN115273148B (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN116594563A (en) Distributed storage capacity expansion method and device, electronic equipment and storage medium
CN113361575B (en) Model training method and device and electronic equipment
CN115470798A (en) Training method of intention recognition model, intention recognition method, device and equipment
CN117933353A (en) Reinforcement learning model training method and device, electronic equipment and storage medium
CN114998649A (en) Training method of image classification model, and image classification method and device
CN114926447B (en) Method for training a model, method and device for detecting a target
CN114331379B (en) Method for outputting task to be handled, model training method and device
CN116244413B (en) New intention determining method, apparatus and storage medium
CN114816758B (en) Resource allocation method and device
CN117539602A (en) Method and device for task speculation behind, electronic equipment and storage medium
CN117993478A (en) Model training method and device based on bidirectional knowledge distillation and federal learning
CN116523249A (en) Production line determining method, device, equipment and storage medium
CN118333052A (en) Text generation method, device, equipment and storage medium
CN117709368A (en) Sentence translation method, sentence translation device, sentence translation equipment and sentence translation storage medium
CN116703963A (en) AR tracking method and device, AR equipment and storage medium
CN117632431A (en) Scheduling method, device, equipment and storage medium for cloud computing task
CN117993275A (en) Training method and device of data generation model, electronic equipment and storage medium
CN116933896A (en) Super-parameter determination and semantic conversion method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination