CN109726808B - Neural network training method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN109726808B
Authority
CN
China
Prior art keywords
training
neural network
interaction
human
interactive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711037964.3A
Other languages
Chinese (zh)
Other versions
CN109726808A (en)
Inventor
杨夏
张力柯
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201711037964.3A priority Critical patent/CN109726808B/en
Priority to PCT/CN2018/111914 priority patent/WO2019080900A1/en
Publication of CN109726808A publication Critical patent/CN109726808A/en
Application granted
Publication of CN109726808B publication Critical patent/CN109726808B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods

Abstract

The invention discloses a neural network training method and apparatus, a storage medium, and an electronic device. The method includes the following steps: acquiring an offline sample set for training a neural network in a human-computer interaction application, where the offline sample set contains offline samples that meet a predetermined configuration condition; training an initial neural network offline with the offline sample set to obtain an object neural network, whose processing capability in the human-computer interaction application is higher than that of the initial neural network; and connecting the object neural network to the online running environment of the human-computer interaction application for online training to obtain a target neural network. The invention solves the technical problem of low training efficiency in the neural network training methods provided by the related art.

Description

Neural network training method and device, storage medium and electronic device
Technical Field
The invention relates to the field of computers, in particular to a neural network training method and device, a storage medium and an electronic device.
Background
The Deep Q-Network (DQN) algorithm fuses a convolutional neural network with Q-Learning and is applied in Deep Reinforcement Learning (DRL). DRL combines deep learning and reinforcement learning to achieve end-to-end learning from perception to action: after perception information is input, an action is output directly by the deep neural network, giving the robot the potential to learn fully autonomously and even acquire multiple skills, thereby realizing Artificial Intelligence (AI) operation. Enabling the robot to complete autonomous learning well, to be applied skillfully in different scenes, and to acquire a neural network quickly and accurately through training has therefore become an urgent problem.
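As background, the Bellman target at the heart of the Q-Learning update that DQN performs can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name and the NumPy array standing in for the Q-network's output are assumptions for the example.

```python
import numpy as np

def dqn_target(reward, next_q_values, gamma=0.99, done=False):
    """Bellman target used to train the Q-network: r + gamma * max_a' Q(s', a')."""
    if done:  # terminal transition: no bootstrapped future value
        return reward
    return reward + gamma * float(np.max(next_q_values))

# Toy example: Q-values for three actions in the next state s'
next_q = np.array([0.5, 1.0, 0.25])
target = dqn_target(reward=1.0, next_q_values=next_q, gamma=0.9)
print(target)  # 1.0 + 0.9 * max(next_q) = 1.9
```

The network is then regressed toward this target for the taken action, which is what "end-to-end learning from perception to action" amounts to in practice.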
At present, the sample object connected to an online training environment to train a neural network is generally of a low level and, in the initial training period, takes random actions with high probability. Although this explores the state space of the training environment well, it prolongs the training time; moreover, because of the low level, constant exploration and learning in the training environment are usually required before a given training goal can be reached.
That is, the neural network training method provided in the related art requires a long training time, resulting in low neural network training efficiency.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a neural network training method and device, a storage medium and an electronic device, which at least solve the technical problem of low training efficiency in the neural network training method provided by the related technology.
According to an aspect of an embodiment of the present invention, there is provided a neural network training method, including: acquiring an offline sample set used for training a neural network in human-computer interaction application, wherein the offline sample set comprises offline samples meeting preset configuration conditions; training an initial neural network off-line by using the off-line sample set to obtain an object neural network, wherein in the human-computer interaction application, the processing capacity of the object neural network is higher than that of the initial neural network; and accessing the object neural network to the online operation environment of the human-computer interaction application for online training to obtain a target neural network.
According to another aspect of the embodiments of the present invention, there is also provided a neural network training apparatus, including: the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring an offline sample set used for training a neural network in human-computer interaction application, and the offline sample set comprises offline samples meeting preset configuration conditions; an offline training unit, configured to use the offline sample set to offline train an initial neural network to obtain an object neural network, where in the human-computer interaction application, a processing capability of the object neural network is higher than a processing capability of the initial neural network; and the online training unit is used for accessing the object neural network into the online operating environment of the human-computer interaction application for online training to obtain a target neural network.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program, wherein the program, when run, performs the above method.
According to another aspect of the embodiments of the present invention, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor performs the above method by running the computer program.
In the embodiment of the invention, an initial neural network is trained offline using an acquired offline sample set for training a neural network in a human-computer interaction application, so as to obtain an object neural network whose processing capability is higher than that of the initial neural network. The object neural network is then connected to the online running environment of the human-computer interaction application for online training, so as to obtain a target neural network matched with the human-computer interaction application. That is, by obtaining in advance an offline sample set meeting a predetermined configuration condition and training the initial neural network offline, an object neural network with higher processing capability is obtained, instead of connecting the initial neural network to the online running environment for direct online training. This overcomes the problems of long training time and low training efficiency in the prior art, where the target neural network could only be obtained through online training. In addition, obtaining the object neural network through offline training on the offline sample set enlarges the range of samples available for neural network training, so that offline samples of higher quality or of different grades can be obtained, further ensuring the efficiency of neural network training. The technical problem of low training efficiency in the neural network training method provided by the related art is thereby solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention. In the drawings:
FIG. 1 is a schematic diagram of a hardware environment for an alternative neural network training method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of an alternative neural network training method in accordance with embodiments of the present invention;
FIG. 3 is a schematic diagram illustrating an application of an alternative neural network training method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative neural network training method in accordance with embodiments of the present invention;
FIG. 5 is a schematic diagram of an alternative neural network training method in accordance with an embodiment of the present invention;
FIG. 6 is a flow diagram of an alternative neural network training method in accordance with an embodiment of the present invention;
FIG. 7 is a flow chart of yet another alternative neural network training method in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of an alternative neural network training device in accordance with embodiments of the present invention;
FIG. 9 is a schematic diagram of an alternative neural network training method in accordance with an embodiment of the present invention;
FIG. 10 is a schematic diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In an embodiment of the present invention, an embodiment of the above neural network training method is provided. As an alternative implementation, the neural network training method may be applied, but is not limited, to the application environment shown in fig. 1. A client of a human-computer interaction application is installed in the terminal 102; taking a game application as an example, object A is a user-controlled object and object B is a machine-controlled object. Offline samples are obtained by running the human-computer interaction application and stored in the database 104, which may be, but is not limited to being, located in a training control server or in an independent third-party server. From these, an offline sample set for training the neural network is obtained, formed from the offline samples that meet the predetermined configuration condition. The initial neural network is trained offline in the terminal 106 using the offline sample set to obtain an object neural network, whose processing capability is higher than that of the initial neural network. The object neural network obtained by offline training in the terminal 106 is then connected through the network 108 to the online running environment of the human-computer interaction application for online training, so as to obtain a target neural network matched with the human-computer interaction application.
In this embodiment, an initial neural network is trained offline using an acquired offline sample set for training a neural network in a human-computer interaction application, so as to obtain an object neural network whose processing capability is higher than that of the initial neural network. The object neural network is then connected to the online running environment of the human-computer interaction application for online training, so as to obtain a target neural network matched with the human-computer interaction application. That is, by obtaining in advance an offline sample set meeting a predetermined configuration condition and training the initial neural network offline, an object neural network with higher processing capability is obtained, instead of connecting the initial neural network to the online running environment for direct online training; this overcomes the long training time and low training efficiency of the prior art, in which the target neural network could only be obtained through online training. In addition, obtaining the object neural network through offline training on the offline sample set enlarges the range of samples available for neural network training, so that offline samples of higher quality or of different grades can be obtained, further ensuring the efficiency of neural network training.
Optionally, in this embodiment, the terminal may include, but is not limited to, at least one of the following: the system comprises a mobile phone, a tablet computer, a notebook computer, a desktop PC (personal computer), a digital television and other hardware equipment capable of running human-computer interaction application. The network may include, but is not limited to, at least one of: wide area networks, metropolitan area networks, local area networks. The above is only an example, and the present embodiment does not limit this.
According to an embodiment of the present invention, there is provided a neural network training method, as shown in fig. 2, the method including:
s202, acquiring an offline sample set used for training a neural network in human-computer interaction application, wherein the offline sample set comprises offline samples meeting preset configuration conditions;
s204, training the initial neural network in an off-line mode by using the off-line sample set to obtain an object neural network, wherein in the man-machine interaction application, the processing capacity of the object neural network is higher than that of the initial neural network;
and S206, connecting the object neural network to the online running environment of the human-computer interaction application for online training to obtain the target neural network.
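Steps S202 to S206 amount to a two-phase pipeline: filter the stored samples, pretrain offline, then fine-tune in the live environment. The following schematic sketch uses illustrative function names and a numeric stand-in for "processing capability"; none of these names come from the patent.

```python
def train_pipeline(offline_samples, online_env_step, predicate,
                   offline_steps=2, online_steps=2):
    """Two-phase training: offline pretraining (S204) then online fine-tuning (S206)."""
    # S202: keep only offline samples that meet the predetermined configuration condition
    sample_set = [s for s in offline_samples if predicate(s)]
    level = 0                       # stand-in for the network's processing capability
    for _ in range(offline_steps):  # S204: offline training on the filtered set
        level += len(sample_set)
    for _ in range(online_steps):   # S206: online training in the running environment
        level += online_env_step()
    return level

# Toy run: samples are run scores; the condition keeps scores above 50
result = train_pipeline([10, 60, 80], lambda: 1, lambda s: s > 50)
print(result)  # 2 kept samples * 2 offline steps + 1 * 2 online steps = 6
```

The point of the ordering is that the network entering S206 is already above the initial network's level, which is what shortens the online phase.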
Optionally, in this embodiment, the neural network training method may be applied, but is not limited, to the following human-computer interaction scenarios: 1) in a human-machine confrontation application, the trained target neural network carries out the human-machine confrontation process against an online account; 2) in an on-hook (idle) confrontation application, the trained target neural network can replace an online account and continue the subsequent human-machine confrontation process. That is, the intelligent operations in the human-computer interaction application are completed by the multi-skilled target neural network provided in this embodiment, obtained through offline training on the offline sample set followed by online training.
It should be noted that, in this embodiment, the initial neural network is trained offline on an offline sample set, obtained in advance and meeting a predetermined configuration condition, so as to obtain an object neural network with higher processing capability, instead of connecting the initial neural network to the online running environment for direct online training; this overcomes the long training time and low training efficiency of the prior art, in which the target neural network could only be obtained through online training. In addition, obtaining the object neural network through offline training on the offline sample set expands the range of samples available for neural network training, so that offline samples of higher quality or of different grades can be obtained, further ensuring the efficiency of neural network training.
Optionally, in this embodiment, the object neural network may be trained online in different application scenarios in, but not limited to, the following modes:
1) Connecting the object neural network to the online running environment of the human-computer interaction application and performing online confrontation training against an online account in the human-computer interaction application; or
2) Connecting the object neural network to the online running environment of the human-computer interaction application to replace a first online account in the human-computer interaction application, and continuing online confrontation training against a second online account.
It should be noted that the online account may be, but is not limited to, a user-controlled account in the human-computer interaction application. In the example shown in fig. 3, object A may be a user-controlled object and object B a machine-controlled object; the object neural network used to obtain the target neural network may, but need not, control object B, and the weight values in the object neural network are further refined through online confrontation training to obtain the corresponding target neural network. Still referring to fig. 3, object A and object B may both be user-controlled objects; after object A has run for a period of time and selects the on-hook operation, the object neural network may, but need not, replace object A and carry on the human-machine confrontation process against object B, further refining the weight values in the object neural network so as to obtain the corresponding target neural network.
Optionally, in this embodiment, training the initial neural network offline using the offline sample set to obtain the object neural network includes:
1) When the predetermined configuration condition indicates that a high-grade object neural network is to be obtained, training with a high-grade offline sample set to obtain the high-grade object neural network, wherein the run results of the offline samples in the high-grade offline sample set in the human-computer interaction application are higher than a predetermined threshold; or
2) When the predetermined configuration condition indicates that object neural networks of a plurality of grades are to be obtained, training the object neural network of each corresponding grade with the offline sample set of that grade, wherein the run results of the offline samples in the offline sample sets of the plurality of grades lie in different target threshold ranges, and wherein the object neural networks of the plurality of grades at least include a first-grade object neural network and a second-grade object neural network, the processing capability of the first-grade object neural network being higher than that of the second-grade object neural network.
It should be noted that, in this embodiment, object neural networks with different interaction levels may be, but are not limited to being, trained according to the interaction levels of the offline samples in the different offline sample sets. For example, in mode 1), high-quality offline samples whose run results are higher than a predetermined threshold are selected from the offline samples, and a high-grade object neural network is obtained through offline training, which improves the machine's win rate in human-machine confrontation and attracts more user accounts to the human-computer interaction application. In mode 2), offline sample sets of multiple grades, whose run results lie in different target threshold ranges, are selected from the offline samples, and object neural networks of multiple grades are obtained through offline training, enriching the confrontation levels in human-computer interaction.
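Both configuration modes reduce to threshold filtering on each sample's run result. A minimal sketch, in which the grade names and the numeric threshold ranges are illustrative assumptions, not values from the patent:

```python
def split_by_grade(samples, grade_ranges):
    """Partition offline samples into per-grade sample sets by their run result.

    samples:      list of (sample, run_result) pairs
    grade_ranges: {grade_name: (low, high)} target threshold ranges,
                  low inclusive, high exclusive
    """
    sets = {grade: [] for grade in grade_ranges}
    for sample, result in samples:
        for grade, (low, high) in grade_ranges.items():
            if low <= result < high:
                sets[grade].append(sample)
                break  # each sample belongs to at most one grade
    return sets

# Three recorded runs with their run results
recorded = [("run_a", 95), ("run_b", 40), ("run_c", 70)]
tiers = split_by_grade(recorded, {"high": (80, 101), "mid": (50, 80), "low": (0, 50)})
print(tiers["high"])  # ['run_a']
```

Mode 1) corresponds to keeping only the "high" set; mode 2) trains one object neural network per set.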
Optionally, in this embodiment, the offline samples may be obtained in, but not limited to, the following manner: while the human-computer interaction application is run using a training account, the parameter values of the interaction parameters of the training account are collected in each state frame, where the interaction parameters include the interaction state, the interaction action, and the interaction feedback excitation; the offline samples are then obtained from the parameter values of the interaction parameters.
It should be noted that each state frame may be, but is not limited to being, displayed frame by frame in frame-number order while the human-computer interaction application runs, and the parameter value of each interaction parameter is collected in each state frame to obtain a frame sequence per interaction parameter, from which the offline samples are built. The interaction state may be, but is not limited to being, determined from the interaction picture of the human-computer interaction application; the interaction action from the interaction operation received in the application; and the interaction feedback excitation from the parameter value of the interaction feedback excitation parameter matched with the application type of the application.
According to the embodiment provided in this application, the initial neural network is trained offline on an offline sample set, acquired in advance and meeting the predetermined configuration condition, to obtain an object neural network with higher processing capability, instead of connecting the initial neural network to the online running environment for direct online training; this solves the problems of long training time and low training efficiency in the prior art, where the target neural network could only be obtained through online training. In addition, obtaining the object neural network through offline training on the offline sample set enlarges the range of samples for neural network training, so that offline samples of higher quality or of different grades can be obtained, further ensuring the efficiency of neural network training.
As an alternative, obtaining an offline sample set for training a neural network in a human-computer interaction application comprises:
s1, obtaining an offline sample obtained after a training account is used for running a human-computer interaction application;
and S2, screening the obtained offline samples according to a preset configuration condition to obtain an offline sample set.
Optionally, in this embodiment, obtaining the offline samples produced by running the human-computer interaction application with the training account includes:
s11, collecting parameter values of interaction parameters of the training account in each state frame in the process of using the training account to run the man-machine interaction application, wherein the interaction parameters comprise: interaction state, interaction action, interaction feedback excitation;
and S12, obtaining an off-line sample according to the parameter value of the interactive parameter.
It should be noted that, in this embodiment, for the interaction feedback excitation, the DQN algorithm calculates a feedback excitation value of the current state for the action according to the change of the interaction state in the human-computer interaction application, thereby obtaining the parameter value of the interaction feedback excitation. The specific calculation formula may be, but is not limited to being, set differently for different types of human-computer interaction applications. For example, in a multiplayer interactive game application, the interaction feedback excitation parameter may be, but is not limited to, the blood volume of each character object: when a higher blood volume of the training account is collected during training, a positive incentive feedback value may be configured, and otherwise a negative incentive feedback value is configured. As another example, in a mileage-based competitive application, the interaction feedback excitation parameter may be, but is not limited to, the completed mileage: the farther the mileage completed by the training account during training, the larger the configured incentive feedback value, and vice versa. The above are only examples, which this embodiment does not limit. In addition, in this embodiment, the parameter values of the interaction feedback excitation may be, but are not limited to being, recorded sequentially by the frame numbers of the state frames.
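The two examples above (blood volume in a battle-style game, completed mileage in a mileage-based game) can be expressed as simple reward rules. This is a hedged sketch: the concrete feedback values and the scale factor are illustrative assumptions, since the patent does not fix a formula.

```python
def blood_reward(prev_blood, curr_blood):
    """Positive feedback when the training account's blood volume rises, negative when it falls."""
    delta = curr_blood - prev_blood
    if delta > 0:
        return 1.0    # positive incentive feedback value (assumed magnitude)
    if delta < 0:
        return -1.0   # negative incentive feedback value (assumed magnitude)
    return 0.0        # no change, no feedback

def mileage_reward(completed_mileage, scale=0.01):
    """The farther the completed mileage, the larger the feedback value (assumed linear scale)."""
    return scale * completed_mileage

print(blood_reward(80, 95))   # 1.0
print(mileage_reward(500))    # 5.0
```

Either function would produce the rt values recorded into the feedback excitation frame sequence described next.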
Specifically, as illustrated in fig. 4, during the running of the human-computer interaction application, the interaction state st is collected and recorded to obtain the state frame sequence (s0, s1, …, st); the action output is obtained so that the interaction action at is collected and recorded as the action frame sequence (a0, a1, …, at); and the parameter value rt of the interaction feedback excitation is then calculated from the interaction feedback excitation parameter and recorded as the feedback excitation frame sequence (r0, r1, …, rt). The collected intermediate samples are combined into offline samples, and the offline samples so determined are stored in the offline sample library.
In this embodiment, the collected interaction state, interaction action, and interaction feedback excitation data are synchronously combined according to the frame numbers of the state frames to generate offline samples, i.e. DQN samples, which are then stored in the offline sample library.
As an optional scheme, obtaining the offline sample according to the parameter value of the interaction parameter includes:
s1, determining an offline sample according to the parameter value of the interactive parameter in the ith state frame and the parameter value of the interactive parameter in the (i + 1) th state frame in a combined mode, wherein i is larger than or equal to 1 and smaller than or equal to N, and N is the total frame number of the human-computer interactive application running for one time.
Specifically, as illustrated in fig. 5, the offline sample may be, but is not limited to, a quadruple (s, a, r, s'), where:
s: the interaction state (state) in the i-th state frame;
a: the interaction action (action) in the i-th state frame;
r: the interaction feedback excitation (reward) obtained after action a is taken in interaction state s of the i-th state frame;
s': the interaction state (next state) in the (i+1)-th state frame.
As shown in fig. 5, the parameter values of the interaction parameters in the i-th state frame (the current time) are combined with those in the (i+1)-th state frame (the next time) to obtain the group of offline samples on the right.
In this embodiment, combining the parameter values of the interaction parameters in the i-th state frame with those in the (i+1)-th state frame to determine the offline sample generates accurate offline sample data, thereby accelerating the convergence of the neural network.
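Pairing frame i with frame i+1 as described yields the (s, a, r, s') quadruples directly from the three frame sequences. A minimal sketch, with hypothetical names; the string and integer frame values are placeholders for real state pictures, action identifiers, and feedback values:

```python
def build_offline_samples(states, actions, rewards):
    """Combine the state/action/feedback frame sequences into DQN quadruples.

    For each state frame i, the sample is (s_i, a_i, r_i, s_{i+1}),
    pairing the current frame with the next one.
    """
    assert len(states) == len(actions) == len(rewards)
    return [
        (states[i], actions[i], rewards[i], states[i + 1])
        for i in range(len(states) - 1)
    ]

samples = build_offline_samples(["s0", "s1", "s2"], ["a0", "a1", "a2"], [0, 1, 2])
print(samples[0])  # ('s0', 'a0', 0, 's1')
```

The last frame produces no quadruple on its own, since it has no successor state to fill the s' slot.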
As an optional scheme, acquiring a parameter value of an interaction parameter of the training account in each status frame includes at least one of:
1) Acquiring a state identifier of an interactive state in each state frame to obtain a state frame sequence in the process of operating the human-computer interactive application by using the training account;
2) Acquiring action identifiers of interactive actions in each state frame to obtain an action frame sequence in the process of running the human-computer interactive application by using the training account;
3) Acquiring the interaction feedback excitation parameter matched with the application type of the human-computer interaction application, and calculating its parameter values to obtain the feedback excitation frame sequence of the process of running the human-computer interaction application with the training account.
Explaining by taking the example shown in FIG. 4, in the running process of the man-machine interaction application, the interaction state st is collected, and the state frame sequence (s0, s1, …, st) is recorded; the action output is obtained to collect the interaction action at, and the action frame sequence (a0, a1, …, at) is recorded; the parameter value rt of the interactive feedback excitation is further calculated from the interactive feedback excitation parameters, and the feedback excitation frame sequence (r0, r1, …, rt) is recorded.
In this embodiment, the interactive state, the interactive action, and the parameter values of the interactive feedback excitation parameters in each state frame are acquired, so as to obtain the corresponding state frame sequence, action frame sequence and feedback excitation frame sequence in the man-machine interaction application process, which are combined to obtain DQN (Deep Q-Network) offline samples.
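A minimal sketch of how the three frame sequences above might be recorded during a run (the class and method names are my own illustrative choices, not named in the patent):

```python
class FrameRecorder:
    """Records the state, action and feedback excitation of each state frame,
    in frame-number order."""

    def __init__(self):
        self.states = []    # state frame sequence   (s0, s1, ..., st)
        self.actions = []   # action frame sequence  (a0, a1, ..., at)
        self.rewards = []   # feedback excitation frame sequence (r0, r1, ..., rt)

    def record(self, state, action, reward):
        # one entry per state frame, appended synchronously so the three
        # sequences stay aligned by frame number
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

    def frame_count(self):
        return len(self.states)
```

The aligned sequences can then be handed to the quadruple-combination step to produce offline samples.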
As an optional solution, collecting the state identifier of the interaction state in each state frame includes:
s1, capturing a state picture of an interactive state in each state frame;
and S2, determining the state identifier of the interactive state according to the state picture.
Specifically, the description is given with reference to fig. 6, and the acquiring of the state identifier of the interaction state in each state frame specifically includes the following steps:
s602, starting a real-time screen capture module in the terminal;
s604, operating the human-computer interaction application;
s606, in the process of running the human-computer interaction application, capturing a state picture in a state frame in real time;
and S608, obtaining a plurality of state pictures, and storing according to the frame numbers to obtain a state frame sequence.
In this embodiment, a state picture of the interaction state of each state frame is captured, and then a state identifier of the interaction state is determined according to the state picture, so that the state identifier of the interaction state in each state frame is acquired in real time in the running process of the human-computer interaction application.
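Steps S602-S608 can be sketched as a capture loop. The screen-grabbing callable is injected, since the patent does not name a concrete screen-capture API (that part is an assumption of this example):

```python
def capture_state_frames(grab_screen, num_frames):
    """Sketch of steps S602-S608: capture one state picture per state frame.

    grab_screen is the platform-specific real-time screen-capture callable
    (assumed here, not specified by the patent). The captured pictures are
    stored indexed by frame number, giving the state frame sequence.
    """
    state_frames = {}
    for frame_no in range(num_frames):
        picture = grab_screen()             # S606: real-time screen capture
        state_frames[frame_no] = picture    # S608: store by frame number
    return state_frames
```

The state identifier of each interaction state would then be determined from the stored picture (step S2 above).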
As an optional scheme, collecting the action identifier of the interaction action in each state frame includes:
1) Collecting touch screen operation; acquiring an action identifier of an interactive action corresponding to touch screen operation in a man-machine interaction application; or
2) Acquiring an input event of an external device, wherein the input event comprises at least one of the following: a keyboard input event, a motion-sensing device input event, and a sensor device input event; and acquiring an action identifier of an interactive action corresponding to the input event in the man-machine interaction application.
The following is a detailed description of collecting touch screen operations and collecting input events of external devices:
(1) First, collecting a touch screen operation is described as an example. Touch screen collection is usually performed on a mobile terminal. A human-computer interaction application on a mobile terminal usually provides the following operation modes: touch key operation, universal wheel operation on the touch screen, gyroscope operation in the terminal, electronic screen touch operation, and the like. The interactive actions are mainly mapped to the touch keys on the mobile terminal, the universal wheel on the touch screen, the touch screen itself, and so on. An action acquisition module in the mobile terminal or in the interactive application monitors these operation events, and after the corresponding events are collected, the actions corresponding to the events are recorded so as to store an action frame sequence.
(2) Generally, the external device includes a keyboard, an infrared sensor, a temperature sensor, etc., and the external device may perform event input for the interactive application according to a corresponding operation. Taking a keyboard as an example of the external device, as shown in fig. 7, collecting the input event of the external device includes the following steps:
s702, mapping interaction actions required in the human-computer interaction application to a keyboard, and establishing a keyboard event;
s704, monitoring a keyboard event through an action acquisition module;
s706, acquiring a keyboard event;
s708, recording the action corresponding to the keyboard event to save an action frame sequence.
In this embodiment, acquiring the action identifier of the interactive action in each state frame includes collecting a touch screen operation applied to the terminal and acquiring an input event of an external device, so that multiple ways of acquiring the action identifier of the interactive action are provided, expanding the range of inputs from which the interactive application can acquire action identifiers.
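Steps S702-S708 can be sketched as a key-to-action mapping plus an event loop. The key codes and action names below are invented for illustration; the patent does not fix a concrete mapping:

```python
# S702: map the interactive actions required in the human-computer
# interaction application to keyboard keys (illustrative mapping).
KEY_TO_ACTION = {
    'w': 'move_forward',
    's': 'move_backward',
    'j': 'attack',
}

def collect_action_frames(keyboard_events):
    """S704-S708: monitor keyboard events and record the matching actions
    into an action frame sequence."""
    action_frames = []
    for key in keyboard_events:            # S704/S706: events as they arrive
        action = KEY_TO_ACTION.get(key)    # look up the mapped interactive action
        if action is not None:
            action_frames.append(action)   # S708: save to the action frame sequence
    return action_frames
```

Unmapped keys are simply ignored, so only events that correspond to interactive actions enter the action frame sequence.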
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
According to an embodiment of the present invention, there is also provided a neural network training apparatus for implementing the neural network training method, as shown in fig. 8, the apparatus includes:
1) An obtaining unit 802, configured to obtain an offline sample set used for training a neural network in a human-computer interaction application, where the offline sample set includes offline samples that meet a predetermined configuration condition;
2) An offline training unit 804, configured to use an offline sample set to offline train an initial neural network to obtain an object neural network, where in human-computer interaction application, a processing capability of the object neural network is higher than a processing capability of the initial neural network;
3) And the online training unit 806 is configured to perform online training on the object neural network accessed to an online operating environment of the human-computer interaction application, so as to obtain a target neural network.
Optionally, in this embodiment, the neural network training method may be applied to, but not limited to, the following scenarios of human-computer interaction application: 1) In the man-machine confrontation application, the trained target neural network is used for realizing a man-machine confrontation process with the online account; 2) In the on-hook confrontation application, the trained target neural network can replace an online account number and continue the subsequent man-machine confrontation process. That is to say, the intelligent operation in the human-computer interaction application is completed through the target neural network with multiple skills, which is provided in the embodiment and is obtained through offline training and online training by using the offline sample set.
It should be noted that, in this embodiment, the initial neural network is trained offline by obtaining an offline sample set meeting a predetermined configuration condition in advance, so as to obtain a target neural network with higher processing capability, instead of accessing the initial neural network to an online operating environment to perform online training directly, thereby overcoming the problems of longer training time and lower training efficiency caused by obtaining a target neural network only through online training in the prior art. In addition, the object neural network is obtained by utilizing the offline sample set for offline training, the sample range for carrying out neural network training is expanded, so that offline samples with higher quality or different grades can be obtained, and the training efficiency of the neural network training is further ensured.
Optionally, in this embodiment, the object neural network may be trained online in different application scenarios in, but not limited to, the following manners:
1) Accessing the object neural network into an online running environment of the human-computer interaction application, and performing online confrontation training with an online account number in the human-computer interaction application; or
2) And accessing the object neural network into an online running environment of the human-computer interaction application to replace a first online account in the human-computer interaction application, and continuing to perform online confrontation training with a second online account.
It should be noted that the online account may be, but is not limited to, a user-controlled account in the human-computer interaction application. Taking fig. 3 as an example, the object A may be a user-controlled object and the object B a machine-controlled object; the object neural network used to obtain the target neural network may, but is not limited to, control the object B, and the weight values in the object neural network are further refined through online confrontation training to obtain the corresponding target neural network. In addition, still taking fig. 3 as an example, the object A and the object B may both be user-controlled objects; after the object A has run for a period of time and selects an on-hook operation, the object A may, but is not limited to, be replaced with the object neural network, which continues the human-computer confrontation process with the object B to further refine the weight values in the object neural network, so as to obtain the corresponding target neural network.
Optionally, in this embodiment, training the initial neural network offline by using the offline sample set to obtain the object neural network includes:
1) Under the condition that a preset configuration condition indicates that a high-grade object neural network is obtained, training by using a high-grade off-line sample set to obtain the high-grade object neural network, wherein the operation result of off-line samples in the high-grade off-line sample set in human-computer interaction application is higher than a preset threshold value; or
2) Under the condition that a preset configuration condition indicates that a plurality of grades of object neural networks are obtained, respectively training by using each grade of offline sample set to obtain the corresponding grade of object neural network, wherein the operation results of the offline samples in the plurality of grades of offline sample sets in the human-computer interaction application are respectively in different target threshold ranges, wherein the plurality of grades of object neural networks at least comprise a first grade object network and a second grade object network, and the processing capacity of the first grade object network is higher than that of the second grade object network.
It should be noted that, in this embodiment, the target neural network may be, but is not limited to, trained to obtain neural networks with different levels of interaction levels according to the interaction levels of the offline samples in different sets of offline samples. For example, in the mode 1), high-quality offline samples with an operation result higher than a preset threshold are obtained from the offline samples, and a high-grade object neural network is obtained through offline training, so that the success rate of a machine in human-computer confrontation is improved, and more user accounts are attracted to participate in human-computer interaction application; in the mode 2), the offline sample sets with multiple levels, in which the operation results are respectively in different target threshold ranges, are obtained from the offline samples, and the object neural networks with multiple levels are obtained through offline training, so that the confrontation levels in human-computer interaction are enriched.
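The grading in mode 2) can be sketched as bucketing recorded runs by their operation result. The data layout (a list of (score, samples) pairs) and the descending threshold list are illustrative assumptions, not from the patent:

```python
def grade_offline_samples(runs, thresholds):
    """Split recorded runs into graded offline sample sets by run result.

    runs: list of (score, samples) pairs, one per run of the application.
    thresholds: descending score cut-offs, one per grade (grade 0 highest).
    Each run's samples go to the first grade whose cut-off its score meets.
    """
    graded = [[] for _ in thresholds]
    for score, samples in runs:
        for grade, cutoff in enumerate(thresholds):
            if score >= cutoff:
                graded[grade].extend(samples)
                break
    return graded
```

Training one object neural network per bucket then yields the multiple grades of processing capability described above; keeping only the top bucket corresponds to mode 1).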
Optionally, in this embodiment, the offline sample may be obtained by, but is not limited to, the following manner: in the process of running the human-computer interaction application by using the training account, collecting parameter values of interaction parameters of the training account in each state frame, wherein the interaction parameters comprise: interactive state, interactive action, interactive feedback excitation; and obtaining an offline sample according to the parameter value of the interactive parameter.
It should be noted that, the method may be, but is not limited to, sequentially displaying each state frame by frame according to the frame number in the running process of the human-computer interaction application, and acquiring the parameter value of the interaction parameter in each state frame to obtain a frame sequence of the parameter value of each interaction parameter, so as to obtain an offline sample by using the frame sequence. The interactive state may be, but is not limited to, determined according to an interactive screen of the human-computer interactive application, the interactive action may be, but is not limited to, determined according to an interactive operation received in the human-computer interactive application, and the interactive feedback stimulus may be, but is not limited to, determined according to a parameter value of an interactive feedback stimulus parameter matched with an application type of the human-computer interactive application.
According to the embodiment provided by the application, the initial neural network is trained offline by acquiring the offline sample set meeting the preset configuration condition in advance to obtain the target neural network with higher processing capacity, and the initial neural network is not accessed into the online operating environment to be trained online directly, so that the problems that the training time is longer and the training efficiency is lower due to the fact that the target neural network can be obtained only through online training in the prior art are solved. In addition, the object neural network is obtained by utilizing the offline training of the offline sample set, and the sample range for carrying out the neural network training is enlarged, so that the offline samples with higher quality or different grades can be obtained, and the training efficiency of the neural network training is further ensured.
As an alternative, as shown in fig. 9, the obtaining unit 802 includes:
1) An obtaining module 902, configured to obtain an offline sample obtained after a human-computer interaction application is run using a training account;
2) And a screening module 904, configured to screen an offline sample set from the obtained offline samples according to a predetermined configuration condition.
As an optional scheme, the obtaining module includes:
1) The acquisition sub-module is used for acquiring the parameter values of the interaction parameters of the training account in each state frame in the process of operating the human-computer interaction application by using the training account, wherein the interaction parameters comprise: interaction state, interaction action, interaction feedback excitation;
2) And the obtaining submodule is used for obtaining the offline sample according to the parameter value of the interactive parameter.
It should be noted that, in this embodiment, for the interactive feedback excitation, a feedback excitation value of the current state for the action is calculated according to the change of the interactive state in the human-computer interaction application by using the DQN algorithm, so as to obtain the parameter value of the interactive feedback excitation. The specific calculation formula may be, but is not limited to being, set to different formulas according to different types of man-machine interaction applications. For example, taking a multiplayer interactive game application as an example, the parameter of the interactive feedback excitation may be, but is not limited to, the blood volume of each character object: when the blood volume of the training account is found to rise during training, a positive excitation feedback value may be configured; otherwise, a negative excitation feedback value is configured. For another example, taking a mileage-based competitive application as an example, the parameter of the interactive feedback excitation may be, but is not limited to, the completed mileage: the farther the completed mileage of the training account during training, the larger the configured excitation feedback value, and vice versa. The above is only an example, and is not limited in this embodiment. In addition, in this embodiment, the parameters of the interactive feedback excitation may be, but are not limited to being, recorded sequentially according to the frame numbers of the status frames.
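The two worked examples above can be sketched as reward functions. The concrete values and coefficients are assumptions for illustration; the patent only fixes the sign convention (positive for rising blood volume, larger for longer mileage):

```python
def blood_volume_reward(prev_hp, curr_hp):
    """Multiplayer-game example: positive excitation feedback when the
    training account's blood volume rises, negative when it falls."""
    delta = curr_hp - prev_hp
    return 1.0 if delta > 0 else (-1.0 if delta < 0 else 0.0)

def mileage_reward(completed_mileage, scale=0.01):
    """Mileage-based example: the farther the completed mileage, the larger
    the configured feedback value (scale is an assumed coefficient)."""
    return scale * completed_mileage
```

Either function would be evaluated once per state frame and recorded, by frame number, into the feedback excitation frame sequence.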
Specifically, described with reference to the example shown in fig. 4, in the process of running the human-computer interaction application, the interaction state st is collected, and the state frame sequence (s0, s1, …, st) is recorded; the action output is obtained to collect the interactive action at, and the action frame sequence (a0, a1, …, at) is recorded; the parameter value rt of the interactive feedback excitation is further calculated from the interactive feedback excitation parameters, and the feedback excitation frame sequence (r0, r1, …, rt) is recorded. The collected data are combined to obtain offline samples, and the offline samples determined by combination are stored into an offline sample library.
In this embodiment, the collected data of the three parts of the interaction state, the interaction action, and the interaction feedback excitation are synchronously combined according to the frame number of the state frame to generate an offline sample, such as a DQN sample, and the generated DQN sample is further stored in an offline sample library.
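The offline sample library can be sketched as a bounded buffer of DQN samples. The capacity limit and the random minibatch draw are conventional DQN-style assumptions, not details fixed by the patent:

```python
import random
from collections import deque

class OfflineSampleLibrary:
    """Stores generated DQN samples (s, a, r, s') for later offline training."""

    def __init__(self, capacity=100_000):
        # bounded: once full, the oldest samples are evicted (assumed policy)
        self.samples = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.samples.append((s, a, r, s_next))

    def draw(self, batch_size):
        # random minibatch for offline training of the neural network
        return random.sample(self.samples, batch_size)
```

Samples combined from the synchronized frame sequences are pushed with `store`, and the offline training unit later pulls minibatches with `draw`.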
As an optional scheme, the obtaining sub-module obtains the offline sample according to the parameter value of the interactive parameter by the following steps:
1) And combining and determining an offline sample according to the parameter value of the interactive parameter in the i-th state frame and the parameter value of the interactive parameter in the (i+1)-th state frame, wherein 1 ≤ i ≤ N, and N is the total number of state frames in one run of the human-computer interactive application.
Specifically, as illustrated in fig. 5, the offline sample may be, but is not limited to, a quadruple (s, a, r, s'), which means:
s: interactive state (state, s) in ith status frame
a: interaction in the ith State frame (action, a)
r: in the interactive state s of the i-th state frame, after action a is made, the obtained interactive feedback excitation (reward, r for short)
s '. Interactive state (next state, s' for short) in the (i + 1) th state frame
As shown in fig. 5, the parameter values of the interaction parameters in the i-th state frame at the current time are combined with the parameter values of the interaction parameters in the (i+1)-th state frame at the next time, so as to obtain the group of offline samples shown on the right: the parameter values of the interaction parameters of the actual current state frame are combined with those of the next state frame.
In this embodiment, the parameter values of the interactive parameters in the ith state frame and the parameter values of the interactive parameters in the (i + 1) th state frame are combined to determine the offline sample, so as to generate accurate offline sample data, thereby accelerating the convergence process of the neural network.
As an optional scheme, the acquisition sub-module acquires the parameter value of the interaction parameter of the training account in each status frame by at least one of the following methods:
1) Acquiring a state identifier of an interactive state in each state frame to obtain a state frame sequence in the process of operating the human-computer interactive application by using the training account;
2) Acquiring action identifiers of interactive actions in each state frame to obtain an action frame sequence in the process of running the human-computer interactive application by using the training account;
3) Acquiring an interactive feedback excitation parameter matched with the application type of the man-machine interactive application; and calculating parameter values of the interactive feedback excitation parameters to obtain a feedback excitation frame sequence in the process of operating the human-computer interaction application by using the training account.
Explaining by taking the example shown in fig. 4, in the process of running the man-machine interaction application, the interaction state st is collected, and the state frame sequence (s0, s1, …, st) is recorded; the action output is obtained to collect the interaction action at, and the action frame sequence (a0, a1, …, at) is recorded; the parameter value rt of the interactive feedback excitation is further calculated from the interactive feedback excitation parameters, and the feedback excitation frame sequence (r0, r1, …, rt) is recorded.
In this embodiment, the interactive state, the interactive action, and the parameter values of the interactive feedback excitation parameters in each state frame are obtained, so as to obtain the corresponding state frame sequence, action frame sequence and feedback excitation frame sequence in the human-computer interaction application process, which are combined to obtain DQN (Deep Q-Network) offline samples.
As an optional scheme, the collecting sub-module collects the status identifier of the interaction status in each status frame by the following steps:
s1, capturing a state picture of an interactive state in each state frame;
and S2, determining the state identifier of the interactive state according to the state picture.
Specifically, as described with reference to fig. 6, acquiring the state identifier of the interaction state in each state frame specifically includes the following steps:
s602, starting a real-time screen capture module in the terminal;
s604, operating the human-computer interaction application;
s606, in the process of running the human-computer interaction application, capturing a state picture in a state frame in real time;
and S608, obtaining a plurality of state pictures, and storing according to the frame numbers to obtain a state frame sequence.
In this embodiment, a state picture of the interaction state of each state frame is captured, and then a state identifier of the interaction state is determined according to the state picture, so that the state identifier of the interaction state in each state frame is acquired in real time in the running process of the human-computer interaction application.
As an alternative, the collecting sub-module collects the action identifier of the interaction action in each status frame by the following steps:
1) Collecting touch screen operation; acquiring an action identifier of an interactive action corresponding to touch screen operation in a man-machine interaction application; or
2) Acquiring an input event of an external device, wherein the input event comprises at least one of the following: a keyboard input event, a motion-sensing device input event, and a sensor device input event; and acquiring an action identifier of an interactive action corresponding to the input event in the man-machine interaction application.
The following is a detailed description of collecting touch screen operations and collecting input events of external devices:
(1) First, collecting a touch screen operation is described as an example. Touch screen collection is usually performed on a mobile terminal. A human-computer interaction application on a mobile terminal usually provides the following operation modes: touch key operation, universal wheel operation on the touch screen, gyroscope operation in the terminal, electronic screen touch operation, and the like. The interactive actions are mainly mapped to the touch keys on the mobile terminal, the universal wheel on the touch screen, the touch screen itself, and so on. An action acquisition module in the mobile terminal or in the interactive application monitors these operation events, and after the corresponding events are collected, the actions corresponding to the events are recorded so as to store an action frame sequence.
(2) Generally, the external device includes a keyboard, an infrared sensor, a temperature sensor, etc., and the external device may perform event input for the interactive application according to a corresponding operation. Taking a keyboard as an example of the external device, as shown in fig. 7, collecting the input event of the external device includes the following steps:
s702, mapping interaction actions required in the human-computer interaction application to a keyboard, and establishing a keyboard event;
s704, monitoring a keyboard event through an action acquisition module;
s706, acquiring a keyboard event;
s708, recording the action corresponding to the keyboard event to save an action frame sequence.
In this embodiment, acquiring the action identifier of the interactive action in each state frame includes collecting a touch screen operation applied to the terminal and acquiring an input event of an external device, so that multiple ways of acquiring the action identifier of the interactive action are provided, expanding the range of inputs from which the interactive application can acquire action identifiers.
Example 3
According to an embodiment of the present invention, there is also provided an electronic device for implementing the neural network training method, as shown in fig. 10, the electronic device includes: one or more processors 1002 (only one of which is shown), a memory 1004, a display 1006, a user interface 1008, and a transmission device 1010. The memory 1004 may be used to store software programs and modules, such as the program instructions/modules corresponding to the neural network training method and apparatus in the embodiments of the present invention, and the processor 1002 executes various functional applications and data processing by running the software programs and modules stored in the memory 1004, that is, implements the above-mentioned neural network training method. The memory 1004 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memories, or other non-volatile solid-state memories. In some examples, the memory 1004 may further include memory located remotely from the processor 1002, which may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 1010 is used for receiving or transmitting data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 1010 includes a network adapter (NIC) that can be connected to a router via a network cable and other network devices so as to communicate with the internet or a local area network. In one example, the transmission device 1010 is a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
Specifically, the memory 1004 is used for storing preset action conditions, information of preset authorized users and application programs.
Optionally, the specific examples in this embodiment may refer to the examples described in embodiment 1 and embodiment 2, and this embodiment is not described herein again.
It can be understood by those skilled in the art that the structure shown in fig. 10 is only an illustration, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 10 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 10, or have a different configuration from that shown in fig. 10.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Example 4
The embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be located in at least one of a plurality of network devices in a network.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
s1, obtaining an offline sample set used for training a neural network in human-computer interaction application, wherein the offline sample set comprises offline samples meeting preset configuration conditions;
s2, training an initial neural network off line by using an off-line sample set to obtain an object neural network, wherein the processing capacity of the object neural network is higher than that of the initial neural network in the man-machine interaction application;
and S3, accessing the target neural network into an online operation environment of the human-computer interaction application for online training to obtain the target neural network.
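Steps S1-S3 form a two-stage pipeline: offline pretraining on the sample set, then continued online updates. The sketch below uses a tabular Q-function as a self-contained stand-in for the neural network (an assumption of this example; the patent trains a DQN-style network), with the standard Q-learning update applied first to recorded samples and then to live transitions:

```python
def offline_pretrain(q, offline_samples, alpha=0.5, gamma=0.9):
    """S2: train the initial model offline by replaying the offline sample set.

    q is a dict-of-dicts table q[state][action]; offline_samples is a list
    of recorded (s, a, r, s') quadruples.
    """
    for s, a, r, s_next in offline_samples:
        best_next = max(q[s_next].values())
        q[s][a] += alpha * (r + gamma * best_next - q[s][a])
    return q

def online_step(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """S3: continue refining the pretrained model from a live transition
    observed in the online operating environment."""
    best_next = max(q[s_next].values())
    q[s][a] += alpha * (r + gamma * best_next - q[s][a])
```

The point of the two stages, as argued above, is that the model entering the online environment already has nontrivial processing capability, shortening online training.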
Optionally, the storage medium is further arranged to store program code for performing the steps of:
s1, obtaining an offline sample obtained after a training account is used for running a human-computer interaction application;
and S2, screening the obtained offline samples according to a preset configuration condition to obtain an offline sample set.
Optionally, the storage medium is further arranged to store program code for performing the steps of:
s1, collecting parameter values of interaction parameters of the training account in each state frame in the process of running the human-computer interaction application by using the training account, wherein the interaction parameters comprise: an interaction state, an interaction action, and an interaction feedback excitation;
and S2, obtaining an offline sample according to the parameter value of the interactive parameter.
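One way step S2 can be realized — consistent with claim 3's combination of the ith and (i+1)th state frames — is to pair consecutive frames into transition tuples of the kind standard in reinforcement learning: (state, action, reward, next state). A minimal sketch, with illustrative field names:

```python
# Hedged sketch of S2: forming one offline sample per pair of consecutive
# state frames, combining the interaction parameters of frame i with the
# state of frame i+1. Field names ('state', 'action', 'reward',
# 'next_state') are illustrative, not from the patent text.

def frames_to_samples(frames):
    """frames: list of dicts with 'state', 'action', 'reward' per state frame.

    Returns N-1 transition samples for a run of N frames, each combining
    frame i's parameters with frame i+1's state.
    """
    samples = []
    for i in range(len(frames) - 1):
        cur, nxt = frames[i], frames[i + 1]
        samples.append({
            "state": cur["state"],
            "action": cur["action"],
            "reward": cur["reward"],
            "next_state": nxt["state"],
        })
    return samples
```

A run of N state frames thus yields N-1 offline samples, each usable as one training example for the initial neural network.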
Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
Optionally, the specific examples in this embodiment may refer to the examples described in embodiment 1 and embodiment 2, and this embodiment is not described herein again.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function; other divisions are possible in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements shall also fall within the protection scope of the present invention.

Claims (16)

1. A neural network training method, comprising:
obtaining an offline sample obtained after a human-computer interaction application is run using a training account, and obtaining at least two offline sample sets from the offline sample according to a running result of the offline sample, wherein the at least two offline sample sets have different set grades, and a set grade indicates a target threshold range within which the running results of the offline samples in the corresponding offline sample set fall;
respectively training an initial neural network offline by using the at least two offline sample sets to obtain at least two object neural networks, one corresponding to each offline sample set, wherein the at least two object neural networks have interaction levels of different grades, and in the human-computer interaction application, the processing capacity of each object neural network is higher than that of the corresponding initial neural network;
and accessing the at least two object neural networks into an online operating environment of the human-computer interaction application for online training to obtain at least two target neural networks, wherein the at least two target neural networks have interaction levels of different grades, and the at least two target neural networks are used for controlling an operation object to confront an operation object controlled by a user account, thereby implementing a human-machine confrontation process.
2. The method of claim 1, wherein obtaining an offline sample obtained after running the human-computer interaction application using a training account comprises:
in the process of running the human-computer interaction application by using the training account, collecting parameter values of interaction parameters of the training account in each state frame, wherein the interaction parameters comprise: interaction state, interaction action, interaction feedback excitation;
and obtaining the offline sample according to the parameter value of the interactive parameter.
3. The method of claim 2, wherein obtaining the offline sample according to the parameter value of the interaction parameter comprises:
determining the offline sample by combining the parameter value of the interaction parameter in the ith state frame with the parameter value of the interaction parameter in the (i+1)th state frame, wherein i is greater than or equal to 1 and less than or equal to N, and N is the total number of state frames in one run of the human-computer interaction application.
4. The method of claim 2, wherein the collecting parameter values for the interaction parameters of the training account within each status frame comprises at least one of:
acquiring a state identifier of the interaction state in each state frame to obtain a state frame sequence in the process of operating the human-computer interaction application by using the training account;
acquiring action identifiers of the interactive actions in each state frame to obtain an action frame sequence in the process of running the human-computer interaction application by using the training account;
acquiring an interactive feedback excitation parameter matched with the application type of the human-computer interaction application; and calculating the parameter values of the interactive feedback excitation parameters to obtain a feedback excitation frame sequence in the process of operating the human-computer interactive application by using the training account.
5. The method of claim 4, wherein said collecting the state identification of the interaction state in each of the state frames comprises:
capturing a status picture of the interaction status in each status frame;
and determining the state identifier of the interactive state according to the state picture.
6. The method of claim 4, wherein said collecting the action identification of the interaction within each of the state frames comprises:
collecting a touch screen operation, and acquiring the action identifier of the interaction action corresponding to the touch screen operation in the human-computer interaction application; or
acquiring an input event of an external device, wherein the input event comprises at least one of the following: a keyboard input event, a somatosensory input event, and a sensing device input event; and acquiring the action identifier of the interaction action corresponding to the input event in the human-computer interaction application.
7. The method of claim 1, wherein the respectively training an initial neural network offline by using the at least two offline sample sets to obtain at least two object neural networks, one corresponding to each offline sample set, comprises:
when a preset configuration condition indicates that object neural networks of a plurality of grades are to be obtained, respectively training with the offline sample set of each grade to obtain an object neural network of the corresponding grade, wherein the running results, in the human-computer interaction application, of the offline samples in the offline sample sets of the plurality of grades fall within different target threshold ranges, and wherein the object neural networks of the plurality of grades at least comprise a first-grade object neural network and a second-grade object neural network, the processing capacity of the first-grade object neural network being higher than that of the second-grade object neural network.
8. The method of claim 1, wherein the accessing the at least two object neural networks into the online runtime environment of the human-computer interaction application for online training to obtain at least two target neural networks comprises:
respectively accessing the at least two object neural networks into the online operating environment of the human-computer interaction application, and performing online confrontation training with an online account in the human-computer interaction application; or
accessing an object neural network into the online operating environment of the human-computer interaction application to replace a first online account in the human-computer interaction application, and continuing online confrontation training with a second online account.
9. A neural network training device, comprising:
an acquisition unit, configured to obtain an offline sample obtained after a human-computer interaction application is run using a training account, and obtain at least two offline sample sets from the offline sample according to a running result of the offline sample, wherein the at least two offline sample sets have different set grades, and a set grade indicates a target threshold range within which the running results of the offline samples in the corresponding offline sample set fall;
the off-line training unit is used for respectively off-line training an initial neural network by using the at least two off-line sample sets to obtain at least two object neural networks corresponding to each off-line sample set, wherein the at least two object neural networks respectively have different levels of interaction levels, and in the man-machine interaction application, the processing capacity of the object neural network is higher than that of the corresponding initial neural network;
and an online training unit, configured to access the at least two object neural networks into an online operating environment of the human-computer interaction application for online training to obtain at least two target neural networks, wherein the at least two target neural networks have interaction levels of different grades, and the at least two target neural networks are used for controlling an operation object to confront an operation object controlled by a user account, thereby implementing a human-machine confrontation process.
10. The apparatus of claim 9, wherein the obtaining unit comprises:
a collection sub-module, configured to collect a parameter value of an interaction parameter of the training account in each state frame in the process of running the human-computer interaction application by using the training account, wherein the interaction parameters comprise: an interaction state, an interaction action, and an interaction feedback excitation;
and the obtaining submodule is used for obtaining the offline sample according to the parameter value of the interactive parameter.
11. The apparatus of claim 10, wherein the obtaining sub-module obtains the offline sample according to the parameter value of the interaction parameter by:
determining the offline sample by combining the parameter value of the interaction parameter in the ith state frame with the parameter value of the interaction parameter in the (i+1)th state frame, wherein i is greater than or equal to 1 and less than or equal to N, and N is the total number of state frames in one run of the human-computer interaction application.
12. The apparatus of claim 10, wherein the collecting sub-module collects the parameter values of the interaction parameters of the training account in each status frame by at least one of:
acquiring a state identifier of the interaction state in each state frame to obtain a state frame sequence in the process of running the human-computer interaction application by using the training account;
acquiring action identifiers of the interactive actions in each state frame to obtain an action frame sequence in the process of running the human-computer interactive application by using the training account;
acquiring an interactive feedback excitation parameter matched with the application type of the human-computer interaction application; and calculating parameter values of the interactive feedback excitation parameters to obtain a feedback excitation frame sequence in the process of operating the human-computer interaction application by using the training account.
13. The apparatus of claim 12, wherein the collecting sub-module collects the status identifications of the interaction statuses within each of the status frames by:
capturing a status picture of the interaction status within each of the status frames;
and determining the state identifier of the interactive state according to the state picture.
14. The apparatus of claim 13, wherein the collection sub-module collects the action identifiers of the interaction actions within each of the status frames by:
collecting a touch screen operation, and acquiring the action identifier of the interaction action corresponding to the touch screen operation in the human-computer interaction application; or
acquiring an input event of an external device, wherein the input event comprises at least one of the following: a keyboard input event, a somatosensory input event, and a sensing device input event; and acquiring the action identifier of the interaction action corresponding to the input event in the human-computer interaction application.
15. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program when executed performs the method of any one of claims 1 to 8.
16. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the method of any one of claims 1 to 8 by means of the computer program.
CN201711037964.3A 2017-10-27 2017-10-27 Neural network training method and device, storage medium and electronic device Active CN109726808B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201711037964.3A CN109726808B (en) 2017-10-27 2017-10-27 Neural network training method and device, storage medium and electronic device
PCT/CN2018/111914 WO2019080900A1 (en) 2017-10-27 2018-10-25 Neural network training method and device, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711037964.3A CN109726808B (en) 2017-10-27 2017-10-27 Neural network training method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN109726808A CN109726808A (en) 2019-05-07
CN109726808B true CN109726808B (en) 2022-12-09

Family

ID=66246220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711037964.3A Active CN109726808B (en) 2017-10-27 2017-10-27 Neural network training method and device, storage medium and electronic device

Country Status (2)

Country Link
CN (1) CN109726808B (en)
WO (1) WO2019080900A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796248A (en) * 2019-08-27 2020-02-14 腾讯科技(深圳)有限公司 Data enhancement method, device, equipment and storage medium
CN110610169B (en) * 2019-09-20 2023-12-15 腾讯科技(深圳)有限公司 Picture marking method and device, storage medium and electronic device
CN111104925B (en) * 2019-12-30 2022-03-11 上海商汤临港智能科技有限公司 Image processing method, image processing apparatus, storage medium, and electronic device

Citations (4)

Publication number Priority date Publication date Assignee Title
CN105637540A (en) * 2013-10-08 2016-06-01 谷歌公司 Methods and apparatus for reinforcement learning
CN106940801A (en) * 2016-01-04 2017-07-11 中国科学院声学研究所 A kind of deeply for Wide Area Network learns commending system and method
CN107209872A (en) * 2015-02-06 2017-09-26 谷歌公司 The distributed training of reinforcement learning system
CN107291232A (en) * 2017-06-20 2017-10-24 深圳市泽科科技有限公司 A kind of somatic sensation television game exchange method and system based on deep learning and big data

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN101630144B (en) * 2009-08-18 2010-12-08 湖南大学 Self-learning inverse model control method of electronic throttle
JP6471934B2 (en) * 2014-06-12 2019-02-20 パナソニックIpマネジメント株式会社 Image recognition method, camera system
EP3204896A1 (en) * 2014-10-07 2017-08-16 Google, Inc. Training neural networks on partitioned training data
RU2686030C1 (en) * 2015-07-24 2019-04-23 Дипмайнд Текнолоджиз Лимитед Continuous control by deep learning and reinforcement
CN106650721B (en) * 2016-12-28 2019-08-13 吴晓军 A kind of industrial character identifying method based on convolutional neural networks

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN105637540A (en) * 2013-10-08 2016-06-01 谷歌公司 Methods and apparatus for reinforcement learning
CN107209872A (en) * 2015-02-06 2017-09-26 谷歌公司 The distributed training of reinforcement learning system
CN106940801A (en) * 2016-01-04 2017-07-11 中国科学院声学研究所 A kind of deeply for Wide Area Network learns commending system and method
CN107291232A (en) * 2017-06-20 2017-10-24 深圳市泽科科技有限公司 A kind of somatic sensation television game exchange method and system based on deep learning and big data

Also Published As

Publication number Publication date
WO2019080900A1 (en) 2019-05-02
CN109726808A (en) 2019-05-07

Similar Documents

Publication Publication Date Title
CN109726808B (en) Neural network training method and device, storage medium and electronic device
CN107798027B (en) Information popularity prediction method, information recommendation method and device
CN104618806A (en) Method, device and system for acquiring comment information of video
CN108721898B (en) Frame rate determination method and apparatus, storage medium, and electronic apparatus
CN105184246A (en) Living body detection method and living body detection system
CN109600336A (en) Store equipment, identifying code application method and device
CN105468653A (en) Data recommendation method and apparatus based on social application software
CN108628721B (en) User data value abnormality detection method, device, storage medium, and electronic device
CN109543633A (en) A kind of face identification method, device, robot and storage medium
CN111428660B (en) Video editing method and device, storage medium and electronic device
CN103927452B (en) A kind of remote health monitoring system, method and apparatus
CN111124902A (en) Object operating method and device, computer-readable storage medium and electronic device
CN109743286A (en) A kind of IP type mark method and apparatus based on figure convolutional neural networks
KR20210014570A (en) Remote scalp and hair management system based on mobile, and Scalp and hair care server for the same
CN109529358B (en) Feature integration method and device and electronic device
CN112437034A (en) False terminal detection method and device, storage medium and electronic device
CN104541304A (en) Target object angle determination using multiple cameras
CN110472537B (en) Self-adaptive identification method, device, equipment and medium
CN111488887A (en) Image processing method and device based on artificial intelligence
CN110874639A (en) Method and device for acquiring operation information
CN114758386A (en) Heart rate detection method and device, equipment and storage medium
CN111522722A (en) Data analysis method, electronic equipment and storage medium
CN114764930A (en) Image processing method, image processing device, storage medium and computer equipment
CN109068180B (en) Method for determining video fine selection set and related equipment
CN109145942B (en) Image processing method and device for intelligent recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant