WO2019080900A1 - Neural network training method and apparatus, storage medium, and electronic device - Google Patents

Neural network training method and apparatus, storage medium, and electronic device

Info

Publication number
WO2019080900A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
interaction
human
training
status
Prior art date
Application number
PCT/CN2018/111914
Other languages
English (en)
French (fr)
Inventor
杨夏
张力柯
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2019080900A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/08: Learning methods

Definitions

  • The present application relates to the field of computers, and in particular to a neural network training method and apparatus, a storage medium, and an electronic device.
  • The Deep Q Network (DQN) algorithm is a method that fuses a convolutional neural network with Q-Learning and is applied in Deep Reinforcement Learning (DRL). DRL combines deep learning with reinforcement learning to form a new class of algorithms that learn end to end, from perception to action: after perceptual information is input, the action is output directly by a deep neural network, which gives a machine the potential to learn fully autonomously and even to acquire multiple skills, thereby realizing artificial intelligence (AI) operation.
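  • The patent does not prescribe any concrete architecture. Purely as a minimal sketch, and assuming PyTorch, the following shows the kind of convolutional Q-network that DQN couples with Q-Learning: the network maps the perceived state (a stack of screen frames) directly to one Q-value per action, and the greedy action is the arg-max over those values.

```python
# Minimal DQN-style convolutional Q-network (illustrative; architecture and
# sizes are assumptions, not taken from the patent).
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, in_channels: int, num_actions: int):
        super().__init__()
        # Convolutional layers perceive the raw screen (the interaction state).
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Fully connected head outputs one Q-value per possible interaction action.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(state))

# "From perception to action": the greedy action is the arg-max over the Q-values.
q_net = QNetwork(in_channels=4, num_actions=6)
state = torch.rand(1, 4, 84, 84)        # a stack of 4 pre-processed screen frames
action = q_net(state).argmax(dim=1)     # index of the chosen interaction action
```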
  • At present, the sample objects used to access an online training environment to train a neural network are usually of a low level, so in the early stage of training they have a high probability of taking random actions. Although this explores the state space of the training environment well, it prolongs the training time; moreover, because the level is low, continuous exploratory learning in the training environment is usually needed before a given training goal is reached. In other words, the neural network training methods provided in the related art require a long training time, which results in low neural network training efficiency.
  • The embodiments of the present application provide a neural network training method and apparatus, a storage medium, and an electronic device, so as to at least solve the technical problem of low training efficiency in the neural network training methods provided by the related art.
  • According to one aspect of the embodiments of the present application, a neural network training method is provided, including: a terminal acquires an offline sample set for training a neural network in a human-computer interaction application, where the offline sample set includes offline samples that satisfy a predetermined configuration condition; the terminal trains an initial neural network offline using the offline sample set to obtain an object neural network, where, in the human-computer interaction application, the processing capability of the object neural network is higher than that of the initial neural network; and the terminal connects the object neural network to the online running environment of the human-computer interaction application for online training, to obtain a target neural network.
  • According to another aspect of the embodiments of the present application, a neural network training apparatus applied to a terminal is further provided, including: an acquiring unit configured to acquire an offline sample set for training a neural network in a human-computer interaction application, where the offline sample set includes offline samples that satisfy a predetermined configuration condition; an offline training unit configured to train an initial neural network offline using the offline sample set to obtain an object neural network, where, in the human-computer interaction application, the processing capability of the object neural network is higher than that of the initial neural network; and an online training unit configured to connect the object neural network to the online running environment of the human-computer interaction application for online training, to obtain a target neural network.
  • According to yet another aspect of the embodiments of the present application, a storage medium is further provided, which includes a stored program, where the program, when run, executes the method described above.
  • According to yet another aspect of the embodiments of the present application, an electronic device is further provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor executes the method described above by means of the computer program.
  • In the embodiments of the present application, the terminal uses the acquired offline sample set for training a neural network in a human-computer interaction application to train an initial neural network offline and obtain an object neural network, where the processing capability of the object neural network is higher than that of the initial neural network. The terminal then connects the object neural network to the online running environment of the human-computer interaction application for online training, thereby obtaining a target neural network matched with the human-computer interaction application. That is, by acquiring in advance an offline sample set that satisfies a predetermined configuration condition, the terminal trains the initial neural network offline and obtains an object neural network with higher processing capability, instead of connecting the initial neural network directly to the online running environment for online training. This overcomes the long training time and low training efficiency of the related-art approach in which the target neural network can only be obtained through online training. In addition, training the object neural network offline with the offline sample set broadens the range of samples available for neural network training, so that better or differently ranked offline samples can be obtained and training efficiency is maintained, thereby solving the technical problem of low training efficiency in the neural network training methods provided by the related art.
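  • As an illustration of the two phases described above, the sketch below (function and parameter names are hypothetical, not from the patent) first replays the pre-collected offline sample set to turn the initial network into the object network, and then connects that network to the live environment for online training to obtain the target network.

```python
# Two-phase training pipeline sketch (names are illustrative assumptions).
from typing import Callable, Iterable, Tuple

Sample = Tuple[object, int, float, object]      # one (s, a, r, s') quadruple

def train_offline(network, offline_samples: Iterable[Sample], update: Callable):
    """Phase 1: replay the offline sample set to update the initial network,
    producing the higher-capability object neural network."""
    for sample in offline_samples:
        update(network, sample)
    return network

def train_online(network, environment, update: Callable, episodes: int):
    """Phase 2: connect the object network to the online running environment
    and keep refining its weights through live interaction."""
    for _ in range(episodes):
        state, done = environment.reset(), False
        while not done:
            action = network.act(state)
            next_state, reward, done = environment.step(action)
            update(network, (state, action, reward, next_state))
            state = next_state
    return network                               # the target neural network
```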
  • FIG. 1 is a schematic diagram of a hardware environment of an optional neural network training method according to an embodiment of the present application
  • FIG. 2 is a flow chart of an alternative neural network training method in accordance with an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an application of an optional neural network training method according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an optional neural network training method according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of another optional neural network training method according to an embodiment of the present application.
  • FIG. 6 is a flow chart of another alternative neural network training method in accordance with an embodiment of the present application.
  • FIG. 7 is a flow chart of still another alternative neural network training method in accordance with an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an optional neural network training device in accordance with an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another alternative neural network training method according to an embodiment of the present application.
  • FIG. 10 is a schematic diagram of an alternative electronic device in accordance with an embodiment of the present application.
  • As an optional implementation, the neural network training method may be, but is not limited to being, applied to the application environment shown in FIG. 1, in which a client of a human-computer interaction application is installed on the terminal 102. Taking a game application as an example of the human-computer interaction application, object A is a user-controlled object and object B is a machine-controlled object.
  • Offline samples are obtained by running the human-computer interaction application and are stored in the database 104, where the database 104 may be, but is not limited to being, located in a training control server or in an independent third-party server. An offline sample set for training the neural network, consisting of offline samples that satisfy a predetermined configuration condition, is acquired and used in the terminal 106 to train the initial neural network offline, obtaining an object neural network whose processing capability is higher than that of the initial neural network. The object neural network obtained by offline training in the terminal 106 is then connected, via the network 108, to the online running environment of the human-computer interaction application for online training, thereby obtaining a target neural network matched with the human-computer interaction application.
  • In this embodiment, the terminal 102 uses the acquired offline sample set for training a neural network in the human-computer interaction application to train the initial neural network offline and obtain the object neural network, whose processing capability is higher than that of the initial neural network. The terminal 102 then connects the object neural network to the online running environment of the human-computer interaction application for online training, thereby obtaining a target neural network matched with the human-computer interaction application. That is, by acquiring in advance an offline sample set that satisfies a predetermined configuration condition, the terminal 102 trains the initial neural network offline and obtains an object neural network with higher processing capability, instead of connecting the initial neural network directly to the online running environment for online training, which overcomes the long training time and low training efficiency of the related-art approach that can only obtain the target neural network through online training. In addition, training offline with the offline sample set broadens the range of samples available for neural network training, so that better or differently ranked offline samples can be obtained and training efficiency is maintained.
  • Optionally, in this embodiment, the terminal 102 may include, but is not limited to, at least one of the following: a mobile phone, a tablet computer, a notebook computer, a desktop PC, a digital television, and other hardware devices that can run a human-computer interaction application.
  • The above network may include, but is not limited to, at least one of the following: a wide area network, a metropolitan area network, and a local area network. The above is only an example, and this embodiment does not impose any limitation on it.
  • According to an embodiment of the present application, a neural network training method is provided. As shown in FIG. 2, the method includes:
  • S202: the terminal acquires an offline sample set for training a neural network in the human-computer interaction application, where the offline sample set includes offline samples that satisfy a predetermined configuration condition;
  • S204: the terminal trains an initial neural network offline using the offline sample set to obtain an object neural network, where, in the human-computer interaction application, the processing capability of the object neural network is higher than that of the initial neural network;
  • S206: the terminal connects the object neural network to the online running environment of the human-computer interaction application for online training, to obtain a target neural network.
  • Optionally, in this embodiment, the neural network training method may be, but is not limited to being, applied to the following human-computer interaction scenarios: 1) in a human-machine confrontation application, the trained target neural network is used to carry out the human-machine confrontation process against an online account; 2) in an on-hook (idle) confrontation application, the trained target neural network can replace an online account and continue the subsequent human-machine confrontation process. That is, the terminal completes intelligent operation in the human-computer interaction application by means of the multi-skilled target neural network obtained, as provided in this embodiment, through offline training on the offline sample set followed by online training.
  • It should be noted that, in this embodiment, the terminal trains the initial neural network offline by acquiring in advance an offline sample set that satisfies a predetermined configuration condition, obtaining an object neural network with higher processing capability, instead of connecting the initial neural network to the online running environment and training online directly. This overcomes the long training time and low training efficiency of the related-art approach in which the target neural network can only be obtained through online training. In addition, training the object neural network offline with the offline sample set broadens the range of samples available for neural network training, so that better or differently ranked offline samples can be obtained and training efficiency is maintained.
  • Optionally, in this embodiment, the target neural network in the different application scenarios above may be, but is not limited to being, obtained by one of the following online training modes:
  • 1) the terminal connects the object neural network to the online running environment of the human-computer interaction application and performs online confrontation training against an online account in the human-computer interaction application; or
  • 2) the terminal connects the object neural network to the online running environment of the human-computer interaction application, replaces a first online account in the human-computer interaction application, and continues online confrontation training against a second online account.
  • It should be noted that the online account may be, but is not limited to being, a user-controlled account in the human-computer interaction application. Taking FIG. 3 as an example, object A may be a user-controlled object and object B a machine-controlled object; the object neural network used to obtain the target neural network may be, but is not limited to being, object B, and the weight values in the object neural network are refined through online confrontation training to obtain the corresponding target neural network. Still taking FIG. 3 as an example, object A and object B may both be user-controlled objects; after object A has run for a period of time and the on-hook operation is selected, object A may be, but is not limited to being, replaced by the object neural network, which continues the human-machine confrontation process against object B, so that the weight values in the object neural network are refined and the corresponding target neural network is obtained. A sketch of this mode is given below.
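  • The following is a hypothetical sketch of online training mode 2): once the user behind object A selects the on-hook operation, the object neural network takes over object A's actions and keeps updating its weights while playing against object B. All environment methods and names here are assumptions made purely for illustration.

```python
# Online confrontation training after an on-hook handover (illustrative sketch).
def run_online_confrontation(env, object_network, update):
    state, done = env.reset(), False
    while not done:
        if env.user_is_active("object_A"):
            action = env.read_user_action("object_A")   # user still controls object A
        else:
            action = object_network.act(state)          # network replaces the on-hook account
        next_state, reward, done = env.step(action)
        if not env.user_is_active("object_A"):
            # refine the object network's weights only while it is in control
            update(object_network, (state, action, reward, next_state))
        state = next_state
    return object_network
```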
  • Optionally, in this embodiment, the terminal trains the initial neural network offline using the offline sample set to obtain the object neural network in one of the following ways:
  • 1) when the predetermined configuration condition indicates that a high-level object neural network is to be obtained, the terminal uses a high-level offline sample set to train a high-level object neural network, where the running results of the offline samples in the high-level offline sample set in the human-computer interaction application are above a predetermined threshold; or
  • 2) when the predetermined configuration condition indicates that object neural networks of multiple levels are to be obtained, the terminal uses the offline sample set of each level to train an object neural network of the corresponding level, where the running results of the offline samples in the offline sample sets of the multiple levels in the human-computer interaction application fall within different target threshold ranges, and the object neural networks of the multiple levels include at least a first-level object network and a second-level object network, the processing capability of the first-level object network being higher than that of the second-level object network.
  • It should be noted that, in this embodiment, neural networks with different levels of interaction skill may be, but are not limited to being, trained according to the interaction level of the offline samples in the different offline sample sets. For example, in mode 1) above, the terminal obtains from the offline samples high-quality samples whose running results are above the predetermined threshold and obtains a high-level object neural network through offline training, so as to raise the machine's win rate in human-machine confrontation and attract more user accounts to the human-computer interaction application; in mode 2) above, the terminal obtains from the offline samples sample sets of multiple levels whose running results fall within different target threshold ranges, and obtains object neural networks of multiple levels through offline training, so as to enrich the confrontation tiers in human-computer interaction. A filtering sketch is given below.
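  • The patent does not specify how the running result is encoded or which thresholds are used, so the sketch below treats each offline sample as a record carrying an assumed result field and uses made-up threshold values purely for illustration.

```python
# Illustrative filtering of offline samples by their recorded running result.
def select_high_level_samples(samples, threshold):
    """Mode 1): keep only samples whose running result is above the threshold."""
    return [s for s in samples if s["result"] > threshold]

def split_samples_by_level(samples, level_ranges):
    """Mode 2): group samples into level-specific sets by result range."""
    levels = {name: [] for name in level_ranges}
    for s in samples:
        for name, (low, high) in level_ranges.items():
            if low <= s["result"] < high:
                levels[name].append(s)
                break
    return levels

# Example tiers for a first-level and a second-level object network (made-up values).
tiers = {"second_level": (0.0, 0.6), "first_level": (0.6, 1.0)}
```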
  • Optionally, in this embodiment, the offline samples may be obtained by, but not limited to, the following: while the human-computer interaction application is run using a training account, the terminal collects the parameter values of the training account's interaction parameters in each status frame, where the interaction parameters include an interaction state, an interaction action, and an interaction feedback reward; the offline samples are then acquired from the parameter values of the interaction parameters.
  • It should be noted that, while the human-computer interaction application is running, the terminal may, but is not limited to, display each status frame in sequence by frame number and collect the parameter values of the interaction parameters in each status frame, obtaining a frame sequence of parameter values for each interaction parameter; the terminal then uses these frame sequences to acquire the offline samples. The interaction state may be, but is not limited to being, determined from the interaction picture of the human-computer interaction application; the interaction action may be, but is not limited to being, determined from the interaction operation received in the human-computer interaction application; and the interaction feedback reward may be, but is not limited to being, determined from the parameter value of a feedback reward parameter matched to the application type of the human-computer interaction application.
  • Through the embodiments provided in the present application, the terminal trains the initial neural network offline by acquiring in advance an offline sample set that satisfies a predetermined configuration condition, obtaining an object neural network with higher processing capability, instead of connecting the initial neural network to the online running environment and training online directly. This overcomes the long training time and low training efficiency of the related-art approach in which the target neural network can only be obtained through online training. In addition, training the object neural network offline with the offline sample set broadens the range of samples available for neural network training, so that better or differently ranked offline samples can be obtained and training efficiency is maintained.
  • As an optional solution, the terminal acquires the offline sample set for training a neural network in the human-computer interaction application as follows:
  • S1: the terminal acquires offline samples obtained by running the human-computer interaction application with a training account;
  • S2: the terminal filters the offline sample set out of the acquired offline samples according to the predetermined configuration condition.
  • Optionally, in this embodiment, the terminal acquires the offline samples obtained by running the human-computer interaction application with the training account as follows:
  • S11: while running the human-computer interaction application with the training account, the terminal collects the parameter values of the training account's interaction parameters in each status frame, where the interaction parameters include an interaction state, an interaction action, and an interaction feedback reward;
  • S12: the terminal acquires the offline samples from the parameter values of the interaction parameters.
  • It should be noted that, in this embodiment, the interaction feedback reward is computed by the DQN algorithm in the human-computer interaction application: the feedback reward of the current state for the action taken is calculated from the change in the interaction state, yielding the parameter value of the interaction feedback reward. The specific formula may be, but is not limited to being, set differently for different types of human-computer interaction applications. For example, in a multiplayer interactive game, the feedback reward parameter may be, but is not limited to being, each character object's health (blood volume): when the training account's health is high during training, a positive feedback reward may be configured, otherwise a negative one. As another example, in a distance-racing application, the feedback reward parameter may be, but is not limited to being, the completed mileage: the farther the training account has travelled during training, the larger the configured feedback reward, and the smaller otherwise. In addition, the parameter values of the interaction feedback reward may be, but are not limited to being, recorded in sequence by the frame number of the status frames.
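  • The two examples above can be written as simple per-frame reward functions. The field names and magnitudes in the sketch below are assumptions made for illustration; the patent only states that the sign or size of the reward follows the health change or the completed mileage.

```python
# Illustrative feedback-reward functions for the two example application types.
def fighting_game_reward(prev_frame: dict, cur_frame: dict) -> float:
    """Positive reward while the training account's health (blood volume) does
    not drop, negative reward when it does."""
    delta = cur_frame["own_health"] - prev_frame["own_health"]
    return 1.0 if delta >= 0 else -1.0

def racing_game_reward(prev_frame: dict, cur_frame: dict) -> float:
    """Reward grows with the mileage completed since the previous status frame."""
    return cur_frame["mileage"] - prev_frame["mileage"]
```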
  • Specifically, as illustrated by the example shown in FIG. 4: while the human-computer interaction application is running, the terminal collects the interaction state st and records the state frame sequence (s0, s1, ..., st); the terminal captures the action output to collect the interaction action at and records the action frame sequence (a0, a1, ..., at); it further calculates the parameter value of the feedback reward parameter to determine the interaction feedback reward rt and records the feedback reward frame sequence (r0, r1, ..., rt). The intermediate samples collected in this way are combined to obtain offline samples, and the offline samples determined by the combination are stored in an offline sample library.
  • In this embodiment, the terminal synchronously combines the collected interaction-state, interaction-action, and feedback-reward data by the frame number of the status frames to generate offline samples, such as DQN samples, and then saves the generated DQN samples to the offline sample library.
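  • The sketch below illustrates this per-frame collection: three frame-number-aligned sequences are recorded during one run. The helper functions capture_state, choose_action, and compute_reward, and the environment object, are placeholders introduced for this sketch only.

```python
# Illustrative frame-by-frame collection of the three aligned sequences.
def record_run(env, capture_state, choose_action, compute_reward, num_frames):
    state_frames, action_frames, reward_frames = [], [], []
    prev_state = None
    for t in range(num_frames):
        s_t = capture_state(env)                 # interaction state of frame t
        a_t = choose_action(s_t)                 # interaction action of frame t
        r_t = compute_reward(prev_state, s_t) if prev_state is not None else 0.0
        state_frames.append(s_t)
        action_frames.append(a_t)
        reward_frames.append(r_t)                # feedback reward of frame t
        env.step(a_t)
        prev_state = s_t
    return state_frames, action_frames, reward_frames
```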
  • Optionally, in this embodiment, the terminal acquires the offline samples from the parameter values of the interaction parameters as follows: the terminal determines an offline sample by combining the parameter values of the interaction parameters in the i-th status frame with the parameter values of the interaction parameters in the (i+1)-th status frame, where i is greater than or equal to 1 and less than or equal to N, and N is the number of status frames.
  • It should be noted that the offline sample may be, but is not limited to being, a quadruple (s, a, r, s'), whose elements have the following meanings: s is the interaction state of the i-th status frame, a is the interaction action of that frame, r is the interaction feedback reward of that frame, and s' is the interaction state of the next, (i+1)-th, status frame (the next state, denoted s').
  • That is, the terminal combines the parameter values of the interaction parameters in the i-th status frame at the current moment with the parameter values of the interaction parameters in the (i+1)-th status frame at the next moment, thereby obtaining one group of offline samples; in effect, the parameter values of the interaction parameters of the current status frame are combined with the interaction state of the next status frame.
  • Through the embodiments provided in the present application, the terminal determines an offline sample by combining the parameter values of the interaction parameters in the i-th status frame with those in the (i+1)-th status frame, which produces accurate offline sample data and accelerates the convergence of the neural network.
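  • As a small illustration of the pairing described above, the sketch below combines each status frame with the one that follows it to form (s, a, r, s') quadruples; the function name and list-based representation are assumptions made for this sketch.

```python
# Illustrative pairing of consecutive status frames into (s, a, r, s') quadruples.
def frames_to_quads(state_frames, action_frames, reward_frames):
    quads = []
    for i in range(len(state_frames) - 1):
        s, a, r = state_frames[i], action_frames[i], reward_frames[i]
        s_next = state_frames[i + 1]        # interaction state of the (i+1)-th frame
        quads.append((s, a, r, s_next))     # one offline sample
    return quads
```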
  • Optionally, in this embodiment, the terminal collects the parameter values of the training account's interaction parameters in each status frame in at least one of the following ways:
  • 1) the terminal collects the state identifier of the interaction state in each status frame, obtaining a state frame sequence over the course of running the human-computer interaction application with the training account;
  • 2) the terminal collects the action identifier of the interaction action in each status frame, obtaining an action frame sequence over the course of running the human-computer interaction application with the training account;
  • 3) the terminal acquires a feedback reward parameter matched to the application type of the human-computer interaction application and calculates its parameter value, obtaining a feedback reward frame sequence over the course of running the human-computer interaction application with the training account.
  • Specifically, as illustrated by the example shown in FIG. 4: the terminal collects the interaction state st and records the state frame sequence (s0, s1, ..., st); the terminal captures the action output to collect the interaction action at and records the action frame sequence (a0, a1, ..., at); it further calculates the parameter value of the feedback reward parameter to determine the interaction feedback reward rt and records the feedback reward frame sequence (r0, r1, ..., rt).
  • Through the embodiments provided in the present application, the terminal acquires the interaction state and interaction action in each status frame and obtains the parameter value of the feedback reward according to the feedback reward parameter, thereby obtaining the corresponding state frame sequence, action frame sequence, and feedback reward frame sequence during the run of the human-computer interaction application, so as to obtain DQN offline samples.
  • Optionally, in this embodiment, the terminal collects the state identifier of the interaction state in each status frame as follows:
  • S1: the terminal captures a screenshot of the status picture of the interaction state in each status frame;
  • S2: the terminal determines the state identifier of the interaction state according to the status picture.
  • Specifically, the terminal collects the state identifier of the interaction state in each status frame through the following steps: the terminal runs the human-computer interaction application; a real-time screenshot module in the terminal captures the status picture of each status frame in real time; the terminal thereby obtains a plurality of status pictures and stores the state frame sequence by frame number.
  • Through the embodiments provided in the present application, the terminal captures the status picture of the interaction state of each status frame and then determines the state identifier of the interaction state from that picture, so that the state identifier of the interaction state in each status frame is collected in real time while the human-computer interaction application runs.
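  • The patent leaves open how a captured picture is turned into a state identifier. Purely as an assumption for illustration, the sketch below hashes the raw picture bytes into a compact identifier that can be stored in the state frame sequence.

```python
# Illustrative status-picture capture and state-identifier derivation.
import hashlib

def capture_status_picture(screen_grabber) -> bytes:
    """Placeholder for the real-time screenshot module: returns raw frame bytes."""
    return screen_grabber()

def state_identifier(picture_bytes: bytes) -> str:
    # Collapse the raw status picture into a short identifier for the frame sequence.
    return hashlib.md5(picture_bytes).hexdigest()

state_frames = [state_identifier(b"fake-frame-0"), state_identifier(b"fake-frame-1")]
```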
  • Optionally, in this embodiment, the terminal collects the action identifier of the interaction action in each status frame as follows:
  • 1) the terminal collects a touch-screen operation and acquires the action identifier of the interaction action corresponding to that touch-screen operation in the human-computer interaction application; or
  • 2) the terminal collects an input event of an external device, where the input event includes at least one of the following: a keyboard input event, a somatosensory input event, and a sensing-device input event, and acquires the action identifier of the interaction action corresponding to that input event in the human-computer interaction application.
  • It should be noted that touch-screen operations are usually performed on a mobile terminal, typically in the following ways: touch buttons, a virtual wheel on the touch screen, gyroscope operation of the terminal, touch operation of the electronic screen, and so on. The interaction actions are mapped to the touch buttons, the wheel on the touch screen, the touch screen itself, and so on; an action collection module in the mobile terminal or in the interactive application listens for these events and, after acquiring the corresponding event, records the action corresponding to the event so as to save the action frame sequence.
  • The external device includes a keyboard, an infrared sensor, a temperature sensor, and the like, and the external device can input events to the interactive application according to the corresponding operation. For example, the step in which the terminal collects an input event of the external device includes: the terminal listens for keyboard events and records the action corresponding to each keyboard event so as to save the action frame sequence.
  • Through the embodiments provided in the present application, the action identifiers of the interaction actions collected by the terminal in each status frame cover both touch-screen operations applied to the terminal and input events of external devices, which provides multiple ways of collecting the action identifiers and widens the range of actions the interactive application can collect.
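  • The sketch below illustrates one possible way to map incoming touch or keyboard events onto interaction-action identifiers and append them to the action frame sequence. The event names and the mapping table are assumptions made for the sketch, not values given by the patent.

```python
# Illustrative input-event listener mapping events to interaction-action identifiers.
ACTION_MAP = {
    ("touch", "button_attack"): 0,
    ("touch", "wheel_left"):    1,
    ("keyboard", "KEY_SPACE"):  2,
    ("keyboard", "KEY_A"):      3,
}

def on_input_event(source: str, event: str, action_frames: list):
    """Look up the interaction action for an incoming event and record its
    identifier in the action frame sequence."""
    action_id = ACTION_MAP.get((source, event))
    if action_id is not None:
        action_frames.append(action_id)
    return action_id

frames = []
on_input_event("keyboard", "KEY_SPACE", frames)   # appends action identifier 2
```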
  • According to another aspect of the embodiments of the present application, a neural network training apparatus for implementing the above method is further provided. As shown in FIG. 8, the apparatus includes:
  • an acquiring unit 802, configured to acquire an offline sample set for training a neural network in a human-computer interaction application, where the offline sample set includes offline samples that satisfy a predetermined configuration condition;
  • an offline training unit 804, configured to train an initial neural network offline using the offline sample set to obtain an object neural network, where, in the human-computer interaction application, the processing capability of the object neural network is higher than that of the initial neural network;
  • an online training unit 806, configured to connect the object neural network to the online running environment of the human-computer interaction application for online training, to obtain a target neural network.
  • Optionally, in this embodiment, the neural network training method may be, but is not limited to being, applied to the following human-computer interaction scenarios: 1) in a human-machine confrontation application, the trained target neural network is used to carry out the human-machine confrontation process against an online account; 2) in an on-hook (idle) confrontation application, the trained target neural network can replace an online account and continue the subsequent human-machine confrontation process. That is, intelligent operation in the human-computer interaction application is completed by means of the multi-skilled target neural network obtained, as provided in this embodiment, through offline training on the offline sample set followed by online training.
  • It should be noted that, in this embodiment, the initial neural network is trained offline by acquiring in advance an offline sample set that satisfies a predetermined configuration condition, which yields an object neural network with higher processing capability, instead of connecting the initial neural network to the online running environment and training online directly; this overcomes the long training time and low training efficiency of the related-art approach in which the target neural network can only be obtained through online training. In addition, training the object neural network offline with the offline sample set broadens the range of samples available for neural network training, so that better or differently ranked offline samples can be obtained and training efficiency is maintained.
  • Optionally, in this embodiment, the target neural network in the different application scenarios above may be, but is not limited to being, obtained by one of the following online training modes: 1) the object neural network is connected to the online running environment of the human-computer interaction application and performs online confrontation training against an online account in the human-computer interaction application; or 2) the object neural network is connected to the online running environment of the human-computer interaction application, replaces a first online account in the human-computer interaction application, and continues online confrontation training against a second online account.
  • It should be noted that the online account may be, but is not limited to being, a user-controlled account in the human-computer interaction application. Taking FIG. 3 as an example, object A may be a user-controlled object and object B a machine-controlled object; the object neural network used to obtain the target neural network may be, but is not limited to being, object B, and the weight values in the object neural network are refined through online confrontation training to obtain the corresponding target neural network. Still taking FIG. 3 as an example, object A and object B may both be user-controlled objects; after object A has run for a period of time and the on-hook operation is selected, object A may be replaced by the object neural network, which continues the human-machine confrontation process against object B, so that the weight values in the object neural network are refined and the corresponding target neural network is obtained.
  • Optionally, in this embodiment, the initial neural network is trained offline using the offline sample set to obtain the object neural network in one of the following ways: 1) when the predetermined configuration condition indicates that a high-level object neural network is to be obtained, a high-level offline sample set is used to train a high-level object neural network, where the running results of the offline samples in the high-level offline sample set in the human-computer interaction application are above a predetermined threshold; or 2) when the predetermined configuration condition indicates that object neural networks of multiple levels are to be obtained, the offline sample set of each level is used to train an object neural network of the corresponding level, where the running results of the offline samples in the offline sample sets of the multiple levels in the human-computer interaction application fall within different target threshold ranges, and the object neural networks of the multiple levels include at least a first-level object network and a second-level object network, the processing capability of the first-level object network being higher than that of the second-level object network.
  • It should be noted that, in this embodiment, neural networks with different levels of interaction skill may be, but are not limited to being, trained according to the interaction level of the offline samples in the different offline sample sets. For example, in mode 1) above, high-quality offline samples whose running results are above the predetermined threshold are obtained from the offline samples, and a high-level object neural network is obtained through offline training, so as to raise the machine's win rate in human-machine confrontation and attract more user accounts to the human-computer interaction application; in mode 2) above, offline sample sets of multiple levels whose running results fall within different target threshold ranges are obtained from the offline samples, and object neural networks of multiple levels are obtained through offline training, so as to enrich the confrontation tiers in human-computer interaction.
  • Optionally, in this embodiment, the offline samples may be obtained by, but not limited to, the following: while the human-computer interaction application is run using a training account, the parameter values of the training account's interaction parameters in each status frame are collected, where the interaction parameters include an interaction state, an interaction action, and an interaction feedback reward; the offline samples are then acquired from the parameter values of the interaction parameters.
  • It should be noted that each status frame may be, but is not limited to being, displayed in sequence by frame number, and the parameter values of the interaction parameters in each status frame are collected to obtain a frame sequence of parameter values for each interaction parameter, which is then used to acquire the offline samples. The interaction state may be, but is not limited to being, determined from the interaction picture of the human-computer interaction application; the interaction action may be, but is not limited to being, determined from the interaction operation received in the human-computer interaction application; and the interaction feedback reward may be, but is not limited to being, determined from the parameter value of a feedback reward parameter matched to the application type of the human-computer interaction application.
  • Through the embodiments provided in the present application, the initial neural network is trained offline by acquiring in advance an offline sample set that satisfies a predetermined configuration condition, which yields an object neural network with higher processing capability, instead of connecting the initial neural network to the online running environment and training online directly; this overcomes the long training time and low training efficiency of the related-art approach in which the target neural network can only be obtained through online training. In addition, training the object neural network offline with the offline sample set broadens the range of samples available for neural network training, so that better or differently ranked offline samples can be obtained and training efficiency is maintained.
  • Optionally, the acquiring unit 802 includes:
  • an acquiring module 902, configured to acquire offline samples obtained by running the human-computer interaction application with a training account;
  • a screening module 904, configured to filter the offline sample set out of the acquired offline samples according to the predetermined configuration condition.
  • Optionally, in this embodiment, the acquiring module includes: a collection sub-module, configured to collect the parameter values of the training account's interaction parameters in each status frame while the human-computer interaction application is run with the training account, where the interaction parameters include an interaction state, an interaction action, and an interaction feedback reward; and an acquiring sub-module, configured to acquire the offline samples from the parameter values of the interaction parameters.
  • It should be noted that, in this embodiment, the interaction feedback reward is computed by the DQN algorithm in the human-computer interaction application: the feedback reward of the current state for the action taken is calculated from the change in the interaction state, yielding the parameter value of the interaction feedback reward. The specific formula may be, but is not limited to being, set differently for different types of human-computer interaction applications. For example, in a multiplayer interactive game, the feedback reward parameter may be, but is not limited to being, each character object's health (blood volume): when the training account's health is high during training, a positive feedback reward may be configured, otherwise a negative one. As another example, in a distance-racing application, the feedback reward parameter may be, but is not limited to being, the completed mileage: the farther the training account has travelled during training, the larger the configured feedback reward, and the smaller otherwise. In addition, the parameter values of the interaction feedback reward may be, but are not limited to being, recorded in sequence by the frame number of the status frames.
  • Specifically, as illustrated by the example shown in FIG. 4: the interaction state st is collected and the state frame sequence (s0, s1, ..., st) is recorded; the action output is captured to collect the interaction action at and the action frame sequence (a0, a1, ..., at) is recorded; the parameter value of the feedback reward parameter is further calculated to determine the interaction feedback reward rt, and the feedback reward frame sequence (r0, r1, ..., rt) is recorded. The intermediate samples collected in this way are combined to obtain offline samples, and the offline samples determined by the combination are stored in the offline sample library.
  • In this embodiment, the collected interaction-state, interaction-action, and feedback-reward data are synchronously combined by the frame number of the status frames to generate offline samples, such as DQN samples, and the generated DQN samples are further saved to the offline sample library.
  • Optionally, in this embodiment, the acquiring sub-module acquires an offline sample from the parameter values of the interaction parameters through the following steps: the offline sample is determined by combining the parameter values of the interaction parameters in the i-th status frame with those in the (i+1)-th status frame. The offline sample may be, but is not limited to being, a quadruple (s, a, r, s'), where s is the interaction state of the i-th status frame, a is the interaction action of that frame, r is the interaction feedback reward of that frame, and s' is the interaction state of the next, (i+1)-th, status frame (the next state). That is, the parameter values of the interaction parameters in the i-th status frame at the current moment are combined with the parameter values of the interaction parameters in the (i+1)-th status frame at the next moment, thereby obtaining one group of offline samples; in effect, the parameter values of the interaction parameters of the current status frame are combined with the interaction state of the next status frame.
  • Optionally, in this embodiment, the collection sub-module collects the parameter values of the training account's interaction parameters in each status frame in at least one of the following ways: the state identifier of the interaction state in each status frame is collected to obtain a state frame sequence; the action identifier of the interaction action in each status frame is collected to obtain an action frame sequence; and a feedback reward parameter matched to the application type is acquired and its parameter value calculated to obtain a feedback reward frame sequence. Specifically, as illustrated by the example shown in FIG. 4: the interaction state st is collected and the state frame sequence (s0, s1, ..., st) is recorded; the action output is captured to collect the interaction action at and the action frame sequence (a0, a1, ..., at) is recorded; the parameter value of the feedback reward parameter is further calculated to determine the interaction feedback reward rt, and the feedback reward frame sequence (r0, r1, ..., rt) is recorded.
  • Through the embodiments provided in the present application, the interaction state and interaction action in each status frame are acquired, and the parameter value of the feedback reward is obtained according to the feedback reward parameter, so that the corresponding state frame sequence, action frame sequence, and feedback reward frame sequence are obtained during the run of the human-computer interaction application, yielding DQN offline samples.
  • Optionally, in this embodiment, the collection sub-module collects the state identifier of the interaction state in each status frame through the following steps: S1, a screenshot of the status picture of the interaction state in each status frame is captured; S2, the state identifier of the interaction state is determined according to the status picture.
  • Specifically, the state identifier of the interaction state in each status frame is collected as follows: the human-computer interaction application runs; a real-time screenshot module captures the status picture of each status frame in real time; a plurality of status pictures is obtained and the state frame sequence is stored by frame number.
  • Through the embodiments provided in the present application, the status picture of the interaction state of each status frame is captured and the state identifier of the interaction state is then determined from that picture, so that the state identifier of the interaction state in each status frame is collected in real time while the human-computer interaction application runs.
  • Optionally, in this embodiment, the collection sub-module collects the action identifier of the interaction action in each status frame through the following steps: a touch-screen operation is collected and the action identifier of the corresponding interaction action in the human-computer interaction application is acquired; or an input event of an external device is collected, where the input event includes at least one of the following: a keyboard input event, a somatosensory input event, and a sensing-device input event, and the action identifier of the interaction action corresponding to that input event in the human-computer interaction application is acquired.
  • It should be noted that touch-screen operations are usually performed on a mobile terminal, typically in the following ways: touch buttons, a virtual wheel on the touch screen, gyroscope operation of the terminal, touch operation of the electronic screen, and so on. The interaction actions are mapped to the touch buttons, the wheel on the touch screen, the touch screen itself, and so on; an action collection module in the mobile terminal or in the interactive application listens for these events and, after acquiring the corresponding event, records the action corresponding to the event so as to save the action frame sequence.
  • The external device includes a keyboard, an infrared sensor, a temperature sensor, and the like, and the external device can input events to the interactive application according to the corresponding operation. For example, the step of collecting an input event of the external device includes: keyboard events are listened for, and the action corresponding to each keyboard event is recorded so as to save the action frame sequence.
  • Through the embodiments provided in the present application, the action identifiers of the interaction actions collected in each status frame cover both touch-screen operations applied to the terminal and input events of external devices, which provides multiple ways of collecting the action identifiers and widens the range of actions the interactive application can collect.
  • According to yet another aspect of the embodiments of the present application, an electronic device for implementing the above neural network training method is further provided. As shown in FIG. 10, the electronic device includes: one or more processors 1002 (only one is shown in the figure), a memory 1004, a display 1006, a user interface 1008, and a transmission device 1010.
  • The memory 1004 can be used to store software programs and modules, such as the program instructions/modules corresponding to the method and apparatus in the embodiments of the present application; by running the software programs and modules stored in the memory 1004, the processor 1002 executes various functional applications and data processing, that is, implements the above neural network training method. The memory 1004 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1004 may further include memory remotely located relative to the processor 1002, which can be connected to the terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The transmission device 1010 is used to receive or send data via a network. Specific examples of such a network may include wired and wireless networks. In one example, the transmission device 1010 includes a Network Interface Controller (NIC), which can be connected to other network devices and routers via a network cable so as to communicate with the Internet or a local area network. In another example, the transmission device 1010 is a Radio Frequency (RF) module, which is used to communicate with the Internet wirelessly.
  • Specifically, the memory 1004 is configured to store preset action conditions, information about preset authorized users, and the application.
  • A person of ordinary skill in the art can understand that the structure shown in FIG. 10 is merely illustrative; the electronic device may also be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. FIG. 10 does not limit the structure of the above electronic device. For example, the electronic device may include more or fewer components (such as a network interface or a display device) than shown in FIG. 10, or have a configuration different from that shown in FIG. 10.
  • An embodiment of the present application also provides a storage medium. Optionally, in this embodiment, the storage medium may be located in at least one of a plurality of network devices in a network.
  • Optionally, in this embodiment, the storage medium is arranged to store program code for performing the following steps: S1, acquiring an offline sample set for training a neural network in a human-computer interaction application, where the offline sample set includes offline samples that satisfy a predetermined configuration condition; S2, training an initial neural network offline using the offline sample set to obtain an object neural network, where the processing capability of the object neural network is higher than that of the initial neural network; S3, connecting the object neural network to the online running environment of the human-computer interaction application for online training, to obtain a target neural network.
  • Optionally, the storage medium is further arranged to store program code for performing the following steps: S1, acquiring offline samples obtained by running the human-computer interaction application with a training account; S2, filtering the offline sample set out of the acquired offline samples according to the predetermined configuration condition.
  • Optionally, in this embodiment, the foregoing storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
  • The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing one or more computer devices (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present application.
  • In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The device embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
  • The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • In the embodiments of the present application, the acquired offline sample set for training a neural network in a human-computer interaction application is used to train the initial neural network offline and obtain the object neural network, where the processing capability of the object neural network is higher than that of the initial neural network. The object neural network is then connected to the online running environment of the human-computer interaction application for online training, thereby obtaining a target neural network matched with the human-computer interaction application. That is, by acquiring in advance an offline sample set that satisfies a predetermined configuration condition, the initial neural network is trained offline and an object neural network with higher processing capability is obtained, instead of connecting the initial neural network directly to the online running environment for online training.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present application discloses a neural network training method and apparatus, a storage medium, and an electronic device. The method includes: a terminal acquires an offline sample set for training a neural network in a human-computer interaction application, where the offline sample set includes offline samples that satisfy a predetermined configuration condition; the terminal trains an initial neural network offline using the offline sample set to obtain an object neural network, where, in the human-computer interaction application, the processing capability of the object neural network is higher than that of the initial neural network; and the terminal connects the object neural network to the online running environment of the human-computer interaction application for online training, to obtain a target neural network. The present application solves the technical problem of low training efficiency in the neural network training methods provided by the related art.

Description

Neural network training method and apparatus, storage medium, and electronic device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 27, 2017, with priority number 2017110379643 and the title "Neural network training method and apparatus, storage medium and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computers, and in particular to a neural network training method and apparatus, a storage medium, and an electronic device.
Background Art
The Deep Q Network (DQN) algorithm is a method that fuses a convolutional neural network with Q-Learning and is applied in Deep Reinforcement Learning (DRL). DRL combines deep learning with reinforcement learning to form a new class of algorithms that learn end to end, from perception to action: after perceptual information is input, the action is output directly by a deep neural network, which gives a machine the potential to learn fully autonomously and even to acquire multiple skills, thereby realizing artificial intelligence (AI) operation. To allow a machine to complete autonomous learning well and apply it skilfully in different scenarios, obtaining a neural network quickly and accurately through training has become an urgent problem.
At present, the sample objects used to access an online training environment to train a neural network are usually of a low level, so in the early stage of training they have a high probability of taking random actions. Although this explores the state space of the training environment well, it prolongs the training time; moreover, because the level is low, continuous exploratory learning in the training environment is usually needed before a given training goal is reached.
In other words, the neural network training methods provided in the related art require a long training time, which results in low neural network training efficiency.
No effective solution has yet been proposed for the above problem.
Summary of the Invention
The embodiments of the present application provide a neural network training method and apparatus, a storage medium, and an electronic device, so as to at least solve the technical problem of low training efficiency in the neural network training methods provided by the related art.
According to one aspect of the embodiments of the present application, a neural network training method is provided, including: a terminal acquires an offline sample set for training a neural network in a human-computer interaction application, where the offline sample set includes offline samples that satisfy a predetermined configuration condition; the terminal trains an initial neural network offline using the offline sample set to obtain an object neural network, where, in the human-computer interaction application, the processing capability of the object neural network is higher than that of the initial neural network; and the terminal connects the object neural network to the online running environment of the human-computer interaction application for online training, to obtain a target neural network.
According to another aspect of the embodiments of the present application, a neural network training apparatus applied to a terminal is further provided, including: an acquiring unit configured to acquire an offline sample set for training a neural network in a human-computer interaction application, where the offline sample set includes offline samples that satisfy a predetermined configuration condition; an offline training unit configured to train an initial neural network offline using the offline sample set to obtain an object neural network, where, in the human-computer interaction application, the processing capability of the object neural network is higher than that of the initial neural network; and an online training unit configured to connect the object neural network to the online running environment of the human-computer interaction application for online training, to obtain a target neural network.
According to yet another aspect of the embodiments of the present application, a storage medium is further provided, which includes a stored program, where the program, when run, executes the above method.
According to yet another aspect of the embodiments of the present application, an electronic device is further provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor executes the above method by means of the computer program.
In the embodiments of the present application, the terminal uses the acquired offline sample set for training a neural network in a human-computer interaction application to train an initial neural network offline and obtain an object neural network, where the processing capability of the object neural network is higher than that of the initial neural network. The terminal then connects the object neural network to the online running environment of the human-computer interaction application for online training, thereby obtaining a target neural network matched with the human-computer interaction application. That is, by acquiring in advance an offline sample set that satisfies a predetermined configuration condition, the terminal trains the initial neural network offline and obtains an object neural network with higher processing capability, instead of connecting the initial neural network directly to the online running environment for online training, thereby overcoming the long training time and low training efficiency of the related-art approach in which the target neural network can only be obtained through online training. In addition, training the object neural network offline with the offline sample set broadens the range of samples available for neural network training, so that better or differently ranked offline samples can be obtained and training efficiency is maintained, thereby solving the technical problem of low training efficiency in the neural network training methods provided by the related art.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present application and constitute a part of the present application; the exemplary embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of the present application. In the drawings:
FIG. 1 is a schematic diagram of a hardware environment of an optional neural network training method according to an embodiment of the present application;
FIG. 2 is a flowchart of an optional neural network training method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an application of an optional neural network training method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an optional neural network training method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of another optional neural network training method according to an embodiment of the present application;
FIG. 6 is a flowchart of another optional neural network training method according to an embodiment of the present application;
FIG. 7 is a flowchart of yet another optional neural network training method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an optional neural network training apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of another optional neural network training method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an optional electronic device according to an embodiment of the present application.
Detailed Description
To enable persons skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way may be interchanged where appropriate, so that the embodiments of the present application described here can be implemented in an order other than that illustrated or described here. In addition, the terms "include" and "have", and any variants of them, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
According to one aspect of the embodiments of the present application, an embodiment of the above neural network training method is provided. As an optional implementation, the neural network training method may be, but is not limited to being, applied to the application environment shown in FIG. 1, in which a client of a human-computer interaction application is installed on the terminal 102. Taking a game application as an example of the human-computer interaction application, object A is a user-controlled object and object B is a machine-controlled object. Offline samples are obtained by running the human-computer interaction application and are stored in the database 104, where the database 104 may be, but is not limited to being, located in a training control server or in an independent third-party server. An offline sample set for training the neural network, consisting of offline samples that satisfy a predetermined configuration condition, is acquired and used in the terminal 106 to train the initial neural network offline, obtaining an object neural network whose processing capability is higher than that of the initial neural network. The object neural network obtained by offline training in the terminal 106 is then connected, via the network 108, to the online running environment of the human-computer interaction application for online training, thereby obtaining a target neural network matched with the human-computer interaction application.
In this embodiment, the terminal 102 uses the acquired offline sample set for training a neural network in the human-computer interaction application to train the initial neural network offline and obtain the object neural network, whose processing capability is higher than that of the initial neural network. The terminal 102 then connects the object neural network to the online running environment of the human-computer interaction application for online training, thereby obtaining a target neural network matched with the human-computer interaction application. That is, by acquiring in advance an offline sample set that satisfies a predetermined configuration condition, the terminal 102 trains the initial neural network offline and obtains an object neural network with higher processing capability, instead of connecting the initial neural network directly to the online running environment for online training, which overcomes the long training time and low training efficiency of the related-art approach that can only obtain the target neural network through online training. In addition, training offline with the offline sample set broadens the range of samples available for neural network training, so that better or differently ranked offline samples can be obtained and training efficiency is maintained.
Optionally, in this embodiment, the terminal 102 may include, but is not limited to, at least one of the following: a mobile phone, a tablet computer, a notebook computer, a desktop PC, a digital television, and other hardware devices that can run a human-computer interaction application. The network may include, but is not limited to, at least one of the following: a wide area network, a metropolitan area network, and a local area network. The above is only an example, and this embodiment does not impose any limitation on it.
根据本申请实施例,提供了一种神经网络训练方法,如图2所示,该方法包括:
S202,终端获取用于训练人机交互应用中的神经网络的离线样本集合,其中,离线样本集合中包括满足预定配置条件的离线样本;
S204,终端使用离线样本集合离线训练初始神经网络,得到对象神经网络,其中,在人机交互应用中,对象神经网络的处理能力高于初始神经网络的处理能力;
S206,终端将对象神经网络接入人机交互应用的在线运行环境进行在线训练,得到目标神经网络。
可选地,在本实施例中,上述神经网络训练方法可以但不限于应用于以下人机交互应用的场景中:1)人机对抗类应用中,训练得到的目标神经网络用于与在线账号实现人机对抗过程;2)挂机对抗应用中,训练得到的目标神经网络可以代替在线账号,继续后续的人机对抗过程。也就是说,终端通过本实施例中提供的利用离线样本集合经过离线训练和在线训练得到的具备多项技能的目标神经网络,来完成在人机交互应用中的智能操作。
需要说明的是,在本实施例中,终端通过预先获取满足预定配置条件的离线样本集合,来对初始神经网络进行离线训练,得到处理能力较高的对象神经网络,而不再是将初始神经网络接入在线运行环境直接进行在线训练,从而克服目前相关技术中提供的仅能通过在线训练得到目标神经网络所导致的训练时长较长,训练效率较低的问题。此外,终端利用离线样本集合离线训练得到对象神经网络,还扩大了用于进行神经网络训练的样本范围,以便于得到更优质或不同等级的离线样本,保证了神经网络训练的训练效率。
可选地,在本实施例中,上述不同应用场景中的目标神经网络可以包括但不限于通过以下在线训练方式得到:
1)终端将对象神经网络接入人机交互应用的在线运行环境,与人机交互应用中的在线账号进行在线对抗训练;或者
2)终端将对象神经网络接入人机交互应用的在线运行环境,替代人机交互应用中的第一在线账号,继续与第二在线账号进行在线对抗训练。
需要说明的是,在线账号可以但不限于为人机交互应用中的用户控制账号,如以图3所示为例进行说明,对象A可以为用户操控对象,对象B为机器操控对象,用于得到上述目标神经网络的对象神经网络可以但不限于为对象B,通过在线对抗训练,来完善对象神经网络中的权重值,得到对应的目标神经网络;此外,仍以图3所示为例进行说明,对象A可以为用户操控对象,对象B也可以为用户操控对象,在对象A运行一段时间且选择挂机操作后,可以但不限于将对象A替换为对象神经网络,通过与对象B继续进行人机对抗过程,来完善对象神经网络中的权重值,得到对应的目标神经网络。
可选地,在本实施例中,终端使用离线样本集合离线训练初始神经网络,得到对象神经网络包括:
1)在预定配置条件指示获取高等级对象神经网络的情况下,终端使用高等级离线样本集合训练得到高等级对象神经网络,其中,高等级离线样本集合中的离线样本在人机交互应用中的运行结果高于预定阈值;或者
2)在预定配置条件指示获取多个等级的对象神经网络的情况下,终端分别使用每个等级的离线样本集合训练得到对应等级的对象神经网络,其中,多个等级的离线样本集合中的离线样本在人机交互应用中的运行结果分别处在不同的目标阈值范围内,其中,多个等级的对象神经网络至少包括第一等级对象网络,第二等级对象网络,其中,第一等级对象网络的处理能力高于第二等级对象网络的处理能力。
需要说明的是,在本实施例中,上述目标神经网络可以但不限于根据不同离线样本集合中的离线样本的交互水平,而训练得到具有不同等级的交互水平的神经网络。例如,上述方式1),终端从离线样本中获取运行结果高于预定阈值的优质离线样本,通过离线训练得到高等级对象神经网络,以提升人机对抗中机器的胜率,从而吸引更多用户账号参与人机交互应用;上述方式2),终端从离线样本中获取运行结果分别处在不同的目标阈值范围内的多个等级的离线样本集合,通过离线训练得到多个等级的对象神经网络,以丰富人机交互中的对抗层级。
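为便于理解上述按运行结果筛选离线样本以训练不同等级对象神经网络的过程,下面给出一段示意性的Python代码草图,仅作理解辅助,并非本申请限定的实现;其中样本的score字段、预定阈值以及分级范围均为示例性假设:

```python
# 示意性草图:按离线样本在人机交互应用中的运行结果(此处假设记录在 score 字段中)进行筛选。
def select_high_level_samples(samples, threshold):
    """筛选运行结果高于预定阈值的高等级离线样本集合。"""
    return [s for s in samples if s["score"] > threshold]

def split_samples_by_level(samples, level_ranges):
    """按不同的目标阈值范围,将离线样本划分为多个等级的离线样本集合。"""
    leveled = {level: [] for level in level_ranges}
    for s in samples:
        for level, (low, high) in level_ranges.items():
            if low <= s["score"] < high:
                leveled[level].append(s)
                break
    return leveled

# 用法示例(阈值与分级范围均为假设值):
# high_set = select_high_level_samples(offline_samples, threshold=80)
# leveled_sets = split_samples_by_level(offline_samples, {"第一等级": (80, float("inf")), "第二等级": (50, 80)})
```

筛选得到的各等级样本集合即可分别用于离线训练对应等级的对象神经网络。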
可选地,在本实施例中,上述离线样本可以但不限于通过以下方式获取:在使用训练账号运行人机交互应用的过程中,终端采集训练账号在每个状态帧内的交互参数的参数值,其中,交互参数包括:交互状态、交互动作、交互反馈激励;根据交互参数的参数值获取离线样本。
需要说明的是,可以但不限于终端在人机交互应用运行的过程中按照帧序号依次逐帧显示每一个状态帧,并采集每一个状态帧内的交互参数的参数值,以得到每一个交互参数的参数值的帧序列,进而终端利用该帧序列获取离线样本。其中,交互状态可以但不限于根据人机交互应用的交互画面确定,交互动作可以但不限于根据人机交互应用中收到的交互操作确定,交互反馈激励可以但不限于根据与人机交互应用的应用类型匹配的交互反馈激励参数的参数值确定。
通过本申请提供的实施例,通过终端预先获取满足预定配置条件的离线样本集合,来对初始神经网络进行离线训练,得到处理能力较高的对象神经网络,而不再是将初始神经网络接入在线运行环境直接进行在线训练,从而克服目前相关技术中提供的仅能通过在线训练得到目标神经网络所导致的训练时长较长,训练效率较低的问题。此外,终端利用离线样本集合离线训练得到对象神经网络,还扩大了用于进行神经网络训练的样本范围,以便于得到更优质或不同等级的离线样本,保证了神经网络训练的训练效率。
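结合上述离线训练过程,下面给出一个使用离线样本集合训练初始神经网络的最小化DQN更新流程的示意代码。该代码仅为一种可能的草图,此处假设采用PyTorch实现、离线样本为(s,a,r,s')四元组且状态已向量化,网络结构与超参数均为示例性假设,并非本申请限定的实现:

```python
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    """示例性Q网络结构,仅用于说明,并非本申请限定的网络。"""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim))

    def forward(self, x):
        return self.net(x)

def offline_train(q_net, target_net, offline_samples, steps=10000,
                  batch_size=32, gamma=0.99, lr=1e-3, sync_every=500):
    """使用离线样本集合对初始神经网络进行离线训练,得到对象神经网络。"""
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    for step in range(steps):
        batch = random.sample(offline_samples, batch_size)        # 随机抽取一批(s, a, r, s')四元组
        s  = torch.tensor([b[0] for b in batch], dtype=torch.float32)
        a  = torch.tensor([b[1] for b in batch], dtype=torch.int64).unsqueeze(1)
        r  = torch.tensor([b[2] for b in batch], dtype=torch.float32)
        s2 = torch.tensor([b[3] for b in batch], dtype=torch.float32)

        q = q_net(s).gather(1, a).squeeze(1)                      # 当前状态下所选动作的Q值
        with torch.no_grad():
            target = r + gamma * target_net(s2).max(1).values     # TD目标
        loss = nn.functional.mse_loss(q, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % sync_every == 0:                                 # 周期性同步目标网络参数
            target_net.load_state_dict(q_net.state_dict())
    return q_net
```

离线训练收敛后得到的q_net即可作为对象神经网络,再接入在线运行环境继续在线训练。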
作为一种可选的方案,终端获取用于训练人机交互应用中的神经网络的离线样本集合包括:
S1,终端获取使用训练账号运行人机交互应用后得到的离线样本;
S2,终端根据预定配置条件从获取到的离线样本中筛选得到离线样本集合。
可选地,在本实施例中,终端获取使用训练账号运行人机交互应用后得到的离线样本包括:
S11,终端在使用训练账号运行人机交互应用的过程中,采集训练账号在每个状态帧内的交互参数的参数值,其中,交互参数包括:交互状态、交互动作、交互反馈激励;
S12,终端根据交互参数的参数值获取离线样本。
需要说明的是,在本实施例中,交互反馈激励是由DQN算法在人机交互应用中,根据交互状态的变化计算得到当前状态对动作的反馈激励值,以得到上述交互反馈激励的参数值。具体的计算公式可以但不限于根据不同类型的人机交互应用而设置为不同的公式。例如,以多人互动游戏应用为例,上述交互反馈激励的参数可以但不限于为每个角色对象的血量,在训练过程中获取到训练账号血量较高时,可以配置正激励反馈值,否则,配置负激励反馈值。又例如,以距离竞技类应用为例,上述交互反馈激励的参数可以但不限于为已完成的里程,在训练过程中,获取到的训练账号已完成的里程越远,可以配置越大的激励反馈值;否则,配置越小的激励反馈值。上述仅是一种示例,本实施例中对此不做任何限定。此外,在本实施例中,上述交互反馈激励的参数可以但不限于按照状态帧的帧序号依次记录。
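下面以一段示意性的Python草图说明按应用类型配置交互反馈激励计算方式的思路,其中hp、mileage等字段名与具体公式均为示例性假设,并非本申请限定的计算公式:

```python
def hp_based_reward(hp, last_hp):
    """多人互动游戏示例:血量不低于上一帧时配置正激励反馈值,否则配置负激励反馈值。"""
    return 1.0 if hp >= last_hp else -1.0

def mileage_based_reward(mileage, max_mileage):
    """距离竞技类示例:已完成的里程越远,配置的激励反馈值越大。"""
    return mileage / max_mileage

# 按人机交互应用的应用类型选择匹配的交互反馈激励参数计算函数(应用类型名称为假设值)
REWARD_FN_BY_APP_TYPE = {
    "multiplayer_game": hp_based_reward,
    "racing_game": mileage_based_reward,
}
```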
具体结合如图4所示的示例进行说明,终端在人机交互应用运行的过程中,采集交互状态st,记录得到状态帧序列(s0,s1…st);终端获取动作输出以采集交互动作at,记录得到动作帧序列(a0,a1…at);进一步计算交互反馈激励参数的参数值以确定交互反馈激励的参数值rt,记录得到反馈激励帧序列(r0,r1…rt)。然后,通过组合上述采集得到的中间样本以得到离线样本,并将组合确定的离线样本存储到离线样本库中。
在本实施例中,终端将上述交互状态,交互动作,交互反馈激励三部分的采集数据按状态帧的帧序号进行同步组合,以生成离线样本,如DQN样本,进一步将生成的DQN样本保存到离线样本库中。
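下面给出按状态帧的帧序号同步采集三类交互参数的一个示意性Python草图,其中capture_state、capture_action、compute_reward等采集接口均为假设的示例接口,并非真实存在的库函数:

```python
def collect_frame_sequences(env, num_frames):
    """按帧序号逐帧采集交互状态、交互动作与交互反馈激励,得到三个同步的帧序列。"""
    states, actions, rewards = [], [], []
    for t in range(num_frames):
        states.append(env.capture_state(t))     # 状态帧序列 (s0, s1, ..., st)
        actions.append(env.capture_action(t))   # 动作帧序列 (a0, a1, ..., at)
        rewards.append(env.compute_reward(t))   # 反馈激励帧序列 (r0, r1, ..., rt)
    return states, actions, rewards
```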
作为一种可选的方案,终端根据交互参数的参数值获取离线样本包括:
S1,终端根据第i个状态帧内的交互参数的参数值,及第i+1个状态帧内的交互参数的参数值,组合确定离线样本,其中,i大于等于1,小于等于N,N为运行一次人机交互应用的总帧数量。
具体结合图5所示进行说明,上述离线样本可以但不限于为一个四元组(s,a,r,s’),其含义分别为:
s:第i个状态帧内的交互状态(state,简称s)
a:第i个状态帧内的交互动作(action,简称a)
r:在第i个状态帧内的交互状态s下,做出动作a后,获得的交互反馈激励(reward,简称r)
s’:第i+1个状态帧内的交互状态(next state,简称s’)
如图5所示,终端将当前时刻第i个状态帧内的交互参数的参数值,与下一时刻第i+1个状态帧内的交互参数的参数值进行组合,从而得到右侧的一组离线样本。实际上,即为当前状态帧的交互参数的参数值与下一状态帧的交互参数的参数值相组合。
在本实施例中,通过终端将第i个状态帧内的交互参数的参数值,及第i+1个状态帧内的交互参数的参数值,组合确定离线样本,可以生成准确的离线样本数据,以加速神经网络的收敛过程。
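下面给出将相邻状态帧的交互参数组合为四元组(s,a,r,s')并保存为离线样本的一个示意性Python草图,仅用于说明组合方式:

```python
def build_offline_samples(states, actions, rewards):
    """将第i个状态帧与第i+1个状态帧内的交互参数组合确定离线样本。"""
    samples = []
    total_frames = len(states)                  # N:运行一次人机交互应用的总帧数量
    for i in range(total_frames - 1):           # 最后一帧没有下一状态帧,不再组合
        samples.append((states[i], actions[i], rewards[i], states[i + 1]))
    return samples

# 用法示例:将组合确定的离线样本追加到离线样本库(此处假设样本库为一个列表)
# offline_sample_db.extend(build_offline_samples(states, actions, rewards))
```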
作为一种可选的方案,终端采集训练账号在每个状态帧内的交互参数的参数值包括以下至少之一:
1)终端采集每个状态帧内的交互状态的状态标识,得到使用训练账号运行人机交互应用的过程中的状态帧序列;
2)终端采集每个状态帧内的交互动作的动作标识,得到使用训练账号运行人机交互应用的过程中的动作帧序列;
3)终端获取与人机交互应用的应用类型匹配的交互反馈激励参数;计算交互反馈激励参数的参数值,得到使用训练账号运行人机交互应用的过程中的反馈激励帧序列。
以如图4所示的示例进行说明,在人机交互应用运行的过程中,终端采集交互状态st,记录得到状态帧序列(s0,s1…st);终端获取动作输出以采集交互动作at,记录得到动作帧序列(a0,a1…at);进一步计算交互反馈激励参数的参数值以确定交互反馈激励的参数值rt,记录得到反馈激励帧序列(r0,r1…rt)。
在本实施例中,终端获取各个状态帧内的交互状态、交互动作,并根据交互反馈激励参数计算得到其参数值,从而得到在人机交互应用过程中对应的状态帧序列、动作帧序列以及反馈激励帧序列,以便于组合得到DQN(神经网络)离线样本。
作为一种可选的方案,终端采集每个状态帧内的交互状态的状态标识包括:
S1,终端截屏每个状态帧内的交互状态的状态画面;
S2,终端根据状态画面确定交互状态的状态标识。
具体结合图6所示进行说明,终端采集每个状态帧内的交互状态的状态标识,具体包括以下步骤:
S602,启动终端内的实时截屏模块;
S604,终端运行人机交互应用;
S606,在运行人机交互应用的过程中,终端内的实时截屏模块实时截屏状态帧内的状态画面;
S608,终端得到多个状态画面,按照帧序号存储得到状态帧序列。
在本实施例中,终端截屏每个状态帧的交互状态的状态画面,然后根据状态画面确定交互状态的状态标识,以实现在人机交互应用运行的过程中,实时采集每个状态帧内的交互状态的状态标识。
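下面给出实时截屏并按帧序号保存状态帧序列的一个示意性Python草图,此处假设在支持PIL.ImageGrab的桌面环境中运行,determine_state_id为假设的状态识别函数:

```python
import time
from PIL import ImageGrab

def capture_state_frames(num_frames, interval=0.1):
    """实时截屏每个状态帧内的状态画面,并按帧序号存储得到状态帧序列。"""
    frames = []
    for idx in range(num_frames):
        screen = ImageGrab.grab()          # 截屏当前状态帧内的状态画面
        frames.append((idx, screen))       # 按帧序号存储
        time.sleep(interval)
    return frames

# 状态标识可进一步根据状态画面确定,例如:
# state_id = determine_state_id(screen)   # 假设的状态识别函数,可由图像分类模型等实现
```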
作为一种可选的方案,终端采集每个状态帧内的交互动作的动作标识包括:
1)终端采集触屏操作;获取在人机交互应用中与触屏操作对应的交互动作的动作标识;或者
2)终端采集外部设备的输入事件,其中,输入事件包括以下至少之一:键盘输入事件、体感输入事件、传感设备输入事件;获取在人机交互应用中与输入事件对应的交互动作的动作标识。
以下对采集触屏操作以及采集外部设备的输入事件进行具体说明:
(1)首先以采集触屏操作为例进行说明。触屏操作通常在移动终端上进行采集,移动终端上的人机交互应用中,通常有以下几种操作模式:触摸按键、触摸屏上万向轮操作、终端内的陀螺仪操作、电子屏幕触摸操作等。主要通过将交互动作映射到移动终端上的触摸按键、触摸屏上的万向轮、触摸屏等,再通过移动终端或交互应用内的动作采集模块监听相应的触屏事件,在获取到相应的事件后,记录该事件对应的动作,以保存动作帧序列。
(2)通常外部设备包括键盘、红外线传感器、温度传感器等,该外部设备可以根据相应的操作对交互应用进行事件输入。以外部设备为键盘为例进行说明,如图7所示,终端采集外部设备的输入事件包括以下步骤:
S702,先将人机交互应用中所需的交互动作映射到键盘中,建立键盘事件;
S704,然后通过动作采集模块监听键盘事件;
S706,获取到键盘事件;
S708,终端记录该键盘事件对应的动作,以保存动作帧序列。
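结合上述步骤,下面给出将交互动作映射到键盘按键并通过监听键盘事件保存动作帧序列的一个示意性Python草图,此处假设使用pynput库作为动作采集模块,按键到交互动作的映射表为示例性假设:

```python
from pynput import keyboard

# 将人机交互应用中所需的交互动作映射到键盘按键(映射关系为假设值)
KEY_TO_ACTION = {"w": "move_up", "s": "move_down", "a": "move_left", "d": "move_right"}
action_frames = []

def on_press(key):
    try:
        action = KEY_TO_ACTION.get(key.char)
    except AttributeError:                   # 功能键等特殊按键没有char属性
        action = None
    if action is not None:
        action_frames.append(action)         # 记录该键盘事件对应的动作,保存到动作帧序列

# listener = keyboard.Listener(on_press=on_press)
# listener.start()                           # 在人机交互应用运行期间持续监听键盘事件
```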
在本实施例中,终端采集每个状态帧内的交互动作的动作标识包括应用于终端上的采集触屏操作以及采集外部设备的输入事件,提供了采集交互动作的动作标识的多种方式,提高了交互应用采集动作标识的范围。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本申请所必须的。
根据本申请实施例的另一方面,还提供了一种用于实施上述神经网络训练方法的神经网络训练装置,应用于终端中。如图8所示,该装置包括:
1)获取单元802,设置为获取用于训练人机交互应用中的神经网络的离线样本集合,其中,离线样本集合中包括满足预定配置条件的离线样本;
2)离线训练单元804,设置为使用离线样本集合离线训练初始神经网络,得到对象神经网络,其中,在人机交互应用中,对象神经网络的处理能力高于初始神经网络的处理能力;
3)在线训练单元806,设置为将对象神经网络接入人机交互应用的在线运行环境进行在线训练,得到目标神经网络。
可选地,在本实施例中,上述神经网络训练方法可以但不限于应用于以下人机交互应用的场景中:1)人机对抗类应用中,训练得到的目标神经网络用于与在线账号实现人机对抗过程;2)挂机对抗应用中,训练得到的目标神经网络可以代替在线账号,继续后续的人机对抗过程。也就是说,通过本实施例中提供的利用离线样本集合经过离线训练和在线训练得到的具备多项技能的目标神经网络,来完成在人机交互应用中的智能操作。
需要说明的是,在本实施例中,通过预先获取满足预定配置条件的离线样本集合,来对初始神经网络进行离线训练,得到处理能力较高的对象神经网络,而不再是将初始神经网络接入在线运行环境直接进行在线训练,从而克服目前相关技术中提供的仅能通过在线训练得到目标神经网络所导致的训练时长较长,训练效率较低的问题。此外,利用离线样本集合离线训练得到对象神经网络,还扩大了用于进行神经网络训练的样本范围,以便于得到更优质或不同等级的离线样本,保证了神经网络训练的训练效 率。
可选地,在本实施例中,上述不同应用场景中的目标神经网络可以包括但不限于通过以下在线训练方式得到:
1)将对象神经网络接入人机交互应用的在线运行环境,与人机交互应用中的在线账号进行在线对抗训练;或者
2)将对象神经网络接入人机交互应用的在线运行环境,替代人机交互应用中的第一在线账号,继续与第二在线账号进行在线对抗训练。
需要说明的是,在线账号可以但不限于为人机交互应用中的用户控制账号,如以图3所示为例进行说明,对象A可以为用户操控对象,对象B为机器操控对象,用于得到上述目标神经网络的对象神经网络可以但不限于为对象B,通过在线对抗训练,来完善对象神经网络中的权重值,得到对应的目标神经网络;此外,仍以图3所示为例进行说明,对象A可以为用户操控对象,对象B也可以为用户操控对象,在对象A运行一段时间且选择挂机操作后,可以但不限于将对象A替换为对象神经网络,通过与对象B继续进行人机对抗过程,来完善对象神经网络中的权重值,得到对应的目标神经网络。
可选地,在本实施例中,使用离线样本集合离线训练初始神经网络,得到对象神经网络包括:
1)在预定配置条件指示获取高等级对象神经网络的情况下,使用高等级离线样本集合训练得到高等级对象神经网络,其中,高等级离线样本集合中的离线样本在人机交互应用中的运行结果高于预定阈值;或者
2)在预定配置条件指示获取多个等级的对象神经网络的情况下,分别使用每个等级的离线样本集合训练得到对应等级的对象神经网络,其中,多个等级的离线样本集合中的离线样本在人机交互应用中的运行结果分别处在不同的目标阈值范围内,其中,多个等级的对象神经网络至少包括第一等级对象网络,第二等级对象网络,其中,第一等级对象网络的处理 能力高于第二等级对象网络的处理能力。
需要说明的是,在本实施例中,上述目标神经网络可以但不限于根据不同离线样本集合中的离线样本的交互水平,而训练得到具有不同等级的交互水平的神经网络。例如,上述方式1),从离线样本中获取运行结果高于预定阈值的优质离线样本,通过离线训练得到高等级对象神经网络,以提升人机对抗中机器的胜率,从而吸引更多用户账号参与人机交互应用;上述方式2),从离线样本中获取运行结果分别处在不同的目标阈值范围内的多个等级的离线样本集合,通过离线训练得到多个等级的对象神经网络,以丰富人机交互中的对抗层级。
可选地,在本实施例中,上述离线样本可以但不限于通过以下方式获取:在使用训练账号运行人机交互应用的过程中,采集训练账号在每个状态帧内的交互参数的参数值,其中,交互参数包括:交互状态、交互动作、交互反馈激励;根据交互参数的参数值获取离线样本。
需要说明的是,可以但不限于指在人机交互应用运行的过程中按照帧序号依次逐帧显示每一个状态帧,并采集每一个状态帧内的交互参数的参数值,以得到每一个交互参数的参数值的帧序列,进而利用该帧序列获取离线样本。其中,交互状态可以但不限于根据人机交互应用的交互画面确定,交互动作可以但不限于根据人机交互应用中收到的交互操作确定,交互反馈激励可以但不限于根据与人机交互应用的应用类型匹配的交互反馈激励参数的参数值确定。
通过本申请提供的实施例,通过预先获取满足预定配置条件的离线样本集合,来对初始神经网络进行离线训练,得到处理能力较高的对象神经网络,而不再是将初始神经网络接入在线运行环境直接进行在线训练,从而克服目前相关技术中提供的仅能通过在线训练得到目标神经网络所导致的训练时长较长,训练效率较低的问题。此外,利用离线样本集合离线训练得到对象神经网络,还扩大了用于进行神经网络训练的样本范围,以便于得到更优质或不同等级的离线样本,保证了神经网络训练的训练效率。
作为一种可选的方案,如图9所示,获取单元802包括:
1)获取模块902,设置为获取使用训练账号运行人机交互应用后得到的离线样本;
2)筛选模块904,设置为根据预定配置条件从获取到的离线样本中筛选得到离线样本集合。
作为一种可选的方案,获取模块包括:
1)采集子模块,设置为在使用训练账号运行人机交互应用的过程中,采集训练账号在每个状态帧内的交互参数的参数值,其中,交互参数包括:交互状态、交互动作、交互反馈激励;
2)获取子模块,设置为根据交互参数的参数值获取离线样本。
需要说明的是,在本实施例中,交互反馈激励是由DQN算法在人机交互应用中,根据交互状态的变化计算得到当前状态对动作的反馈激励值,以得到上述交互反馈激励的参数值。具体的计算公式可以但不限于根据不同类型的人机交互应用而设置为不同的公式。例如,以多人互动游戏应用为例,上述交互反馈激励的参数可以但不限于为每个角色对象的血量,在训练过程中获取到训练账号血量较高时,可以配置正激励反馈值,否则,配置负激励反馈值。又例如,以距离竞技类应用为例,上述交互反馈激励的参数可以但不限于为已完成的里程,在训练过程中,获取到的训练账号已完成的里程越远,可以配置越大的激励反馈值;否则,配置越小的激励反馈值。上述仅是一种示例,本实施例中对此不做任何限定。此外,在本实施例中,上述交互反馈激励的参数可以但不限于按照状态帧的帧序号依次记录。
具体结合如图4所示的示例进行说明,在人机交互应用运行的过程中,采集交互状态st,记录得到状态帧序列(s0,s1…st);获取动作输出以采集交互动作at,记录得到动作帧序列(a0,a1…at);进一步计算交互反馈激励参数的参数值以确定交互反馈激励的参数值rt,记录得到反馈激励帧序列(r0,r1…rt)。然后,通过组合上述采集得到的中间样本以得到离线样本,并将组合确定的离线样本存储到离线样本库中。
在本实施例中,将上述交互状态,交互动作,交互反馈激励三部分的采集数据按状态帧的帧序号进行同步组合,以生成离线样本,如DQN样本,进一步将生成的DQN样本保存到离线样本库中。
作为一种可选的方案,获取子模块通过以下步骤实现根据交互参数的参数值获取离线样本:
1)根据第i个状态帧内的交互参数的参数值,及第i+1个状态帧内的交互参数的参数值,组合确定离线样本,其中,i大于等于1,小于等于N,N为运行一次人机交互应用的总帧数量。
具体结合图5所示进行说明,上述离线样本可以但不限于为一个四元组(s,a,r,s’),其含义分别为:
s:第i个状态帧内的交互状态(state,简称s)
a:第i个状态帧内的交互动作(action,简称a)
r:在第i个状态帧内的交互状态s下,做出动作a后,获得的交互反馈激励(reward,简称r)
s’:第i+1个状态帧内的交互状态(next state,简称s’)
如图5所示,将当前时刻第i个状态帧内的交互参数的参数值,与下一时刻第i+1个状态帧内的交互参数的参数值进行组合,从而得到右侧的一组离线样本。实际上,即为当前状态帧的交互参数的参数值与下一状态帧的交互参数的参数值相组合。
在本实施例中,通过将第i个状态帧内的交互参数的参数值,及第i+1个状态帧内的交互参数的参数值,组合确定离线样本,可以生成准确的离线样本数据,以加速神经网络的收敛过程。
作为一种可选的方案,采集子模块通过以下至少一种方式采集训练账号在每个状态帧内的交互参数的参数值:
1)采集每个状态帧内的交互状态的状态标识,得到使用训练账号运行人机交互应用的过程中的状态帧序列;
2)采集每个状态帧内的交互动作的动作标识,得到使用训练账号运行人机交互应用的过程中的动作帧序列;
3)获取与人机交互应用的应用类型匹配的交互反馈激励参数;计算交互反馈激励参数的参数值,得到使用训练账号运行人机交互应用的过程中的反馈激励帧序列。
以如图4所示的示例进行说明,在人机交互应用运行的过程中,采集交互状态st,记录得到状态帧序列(s0,s1…st);获取动作输出以采集交互动作at,记录得到动作帧序列(a0,a1…at);进一步计算交互反馈激励参数的参数值以确定交互反馈激励的参数值rt,记录得到反馈激励帧序列(r0,r1…rt)。
在本实施例中,获取各个状态帧内的交互状态、交互动作,并根据交互反馈激励参数计算得到其参数值,从而得到在人机交互应用过程中对应的状态帧序列、动作帧序列以及反馈激励帧序列,以便于组合得到DQN(神经网络)离线样本。
作为一种可选的方案,采集子模块通过以下步骤采集每个状态帧内的交互状态的状态标识:
S1,截屏每个状态帧内的交互状态的状态画面;
S2,根据状态画面确定交互状态的状态标识。
具体结合图6所示进行说明,采集每个状态帧内的交互状态的状态标识,具体包括以下步骤:
S602,启动终端内的实时截屏模块;
S604,运行人机交互应用;
S606,在运行人机交互应用的过程中,实时截屏状态帧内的状态画面;
S608,得到多个状态画面,按照帧序号存储得到状态帧序列。
在本实施例中,截屏每个状态帧的交互状态的状态画面,然后根据状态画面确定交互状态的状态标识,以实现在人机交互应用运行的过程中,实时采集每个状态帧内的交互状态的状态标识。
作为一种可选的方案,采集子模块通过以下步骤采集每个状态帧内的交互动作的动作标识:
1)采集触屏操作;获取在人机交互应用中与触屏操作对应的交互动作的动作标识;或者
2)采集外部设备的输入事件,其中,输入事件包括以下至少之一:键盘输入事件、体感输入事件、传感设备输入事件;获取在人机交互应用中与输入事件对应的交互动作的动作标识。
以下对采集触屏操作以及采集外部设备的输入事件进行具体说明:
(1)首先以采集触屏操作为例进行说明。触屏操作通常在移动终端上进行采集,移动终端上的人机交互应用中,通常有以下几种操作模式:触摸按键、触摸屏上万向轮操作、终端内的陀螺仪操作、电子屏幕触摸操作等。主要通过将交互动作映射到移动终端上的触摸按键、触摸屏上的万向轮、触摸屏等,再通过移动终端或交互应用内的动作采集模块监听相应的触屏事件,在获取到相应的事件后,记录该事件对应的动作,以保存动作帧序列。
(2)通常外部设备包括键盘、红外线传感器、温度传感器等,该外部设备可以根据相应的操作对交互应用进行事件输入。以外部设备为键盘为例进行说明,如图7所示,采集外部设备的输入事件包括以下步骤:
S702,先将人机交互应用中所需的交互动作映射到键盘中,建立键盘事件;
S704,然后通过动作采集模块监听键盘事件;
S706,获取到键盘事件;
S708,记录该键盘事件对应的动作,以保存动作帧序列。
在本实施例中,采集每个状态帧内的交互动作的动作标识包括应用于终端上的采集触屏操作以及采集外部设备的输入事件,提供了采集交互动作的动作标识的多种方式,提高了交互应用采集动作标识的范围。
根据本申请实施例的又一方面,还提供了一种用于实施上述神经网络训练方法的电子装置,如图10所示,该电子装置包括:一个或多个(图中仅示出一个)处理器1002、存储器1004、显示器1006、用户接口1008、传输装置1010。其中,存储器1004可用于存储软件程序以及模块,如本申请实施例中的神经网络训练方法和装置对应的程序指令/模块,处理器1002通过运行存储在存储器1004内的软件程序以及模块,从而执行各种功能应用以及数据处理,即实现上述的神经网络训练方法。存储器1004可包括高速随机存储器,还可以包括非易失性存储器,如一个或者多个磁性存储装置、闪存、或者其他非易失性固态存储器。在一些实例中,存储器1004可进一步包括相对于处理器1002远程设置的存储器,这些远程存储器可以通过网络连接至上述终端。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
上述的传输装置1010用于经由一个网络接收或者发送数据。上述的网络具体实例可包括有线网络及无线网络。在一个实例中,传输装置1010包括一个网络适配器(Network Interface Controller,NIC),其可通过网线与其他网络设备与路由器相连从而可与互联网或局域网进行通讯。在一个实例中,传输装置1010为射频(Radio Frequency,RF)模块,其用于通过无线方式与互联网进行通讯。
其中,存储器1004用于存储预设动作条件和预设权限用户的信息、以及应用程序。
可选地,本实施例中的具体示例可以参考上述实施例1和实施例2中所描述的示例,本实施例在此不再赘述。
本领域普通技术人员可以理解,图10所示的结构仅为示意,电子装置也可以是智能手机(如Android手机、iOS手机等)、平板电脑、掌上电脑以及移动互联网设备(Mobile Internet Devices,MID)、PAD等终端设备。图10并不对上述电子装置的结构造成限定。例如,电子装置还可包括比图10中所示更多或者更少的组件(如网络接口、显示装置等),或者具有与图10所示不同的配置。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令终端设备相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,存储介质可以包括:闪存盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁盘或光盘等。
根据本申请的实施例的又一方面,还提供了一种存储介质。可选地,在本实施例中,上述存储介质可以位于网络中的多个网络设备中的至少一个网络设备上。
可选地,在本实施例中,存储介质被设置为存储用于执行以下步骤的程序代码:
S1,获取用于训练人机交互应用中的神经网络的离线样本集合,其中,离线样本集合中包括满足预定配置条件的离线样本;
S2,使用离线样本集合离线训练初始神经网络,得到对象神经网络,其中,在人机交互应用中,对象神经网络的处理能力高于初始神经网络的处理能力;
S3,将对象神经网络接入人机交互应用的在线运行环境进行在线训练,得到目标神经网络。
可选地,存储介质还被设置为存储用于执行以下步骤的程序代码:
S1,获取使用训练账号运行人机交互应用后得到的离线样本;
S2,根据预定配置条件从获取到的离线样本中筛选得到离线样本集合。
可选地,存储介质还被设置为存储用于执行以下步骤的程序代码:
S1,在使用训练账号运行人机交互应用的过程中,采集训练账号在每个状态帧内的交互参数的参数值,其中,交互参数包括:交互状态、交互动作、交互反馈激励;
S2,根据交互参数的参数值获取离线样本。
可选地,在本实施例中,上述存储介质可以包括但不限于:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
可选地,本实施例中的具体示例可以参考上述实施例1和实施例2中所描述的示例,本实施例在此不再赘述。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
上述实施例中的集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在上述计算机可读取的存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在存储介质中,包括若干指令用以使得一台或多台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。
在本申请的上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本申请所提供的几个实施例中,应该理解到,所揭露的客户端,可 通过其它的方式实现。其中,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,单元或模块的间接耦合或通信连接,可以是电性或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
以上所述仅是本申请的可选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本申请的保护范围。
工业实用性
在本申请实施例中,利用获取到的用于训练人机交互应用中的神经网络的离线样本集合,离线训练初始神经网络,以得到对象神经网络,其中,该对象神经网络的处理能力高于初始神经网络的处理能力。然后,将上述对象神经网络接入人机交互应用的在线运行环境,以实现在线训练,从而得到与人机交互应用匹配的目标神经网络。也就是说,通过预先获取满足预定配置条件的离线样本集合,来对初始神经网络进行离线训练,得到处理能力较高的对象神经网络,而不再是将初始神经网络接入在线运行环境直接进行在线训练,从而克服目前相关技术中提供的仅能通过在线训练得到目标神经网络所导致的训练时长较长,训练效率较低的问题。此外,利 用离线样本集合离线训练得到对象神经网络,还扩大了用于进行神经网络训练的样本范围,以便于得到更优质或不同等级的离线样本,保证了神经网络训练的训练效率。

Claims (18)

  1. 一种神经网络训练方法,包括:
    终端获取用于训练人机交互应用中的神经网络的离线样本集合,其中,所述离线样本集合中包括满足预定配置条件的离线样本;
    所述终端使用所述离线样本集合离线训练初始神经网络,得到对象神经网络,其中,在所述人机交互应用中,所述对象神经网络的处理能力高于所述初始神经网络的处理能力;
    所述终端将所述对象神经网络接入所述人机交互应用的在线运行环境进行在线训练,得到目标神经网络。
  2. 根据权利要求1所述的方法,其中,所述终端获取用于训练人机交互应用中的神经网络的离线样本集合包括:
    所述终端获取使用训练账号运行所述人机交互应用后得到的离线样本;
    所述终端根据所述预定配置条件从获取到的所述离线样本中筛选得到所述离线样本集合。
  3. 根据权利要求2所述的方法,其中,所述终端获取使用训练账号运行所述人机交互应用后得到的离线样本包括:
    在使用所述训练账号运行所述人机交互应用的过程中,所述终端采集所述训练账号在每个状态帧内的交互参数的参数值,其中,所述交互参数包括:交互状态、交互动作、交互反馈激励;
    所述终端根据所述交互参数的参数值获取所述离线样本。
  4. 根据权利要求3所述的方法,其中,所述终端根据所述交互参数的参数值获取所述离线样本包括:
    所述终端根据第i个状态帧内的所述交互参数的参数值,及第i+1个状态帧内的所述交互参数的参数值,组合确定所述离线样本,其中,i大于等于1,小于等于N,N为运行一次所述人机交互应用的总帧数量。
  5. 根据权利要求3所述的方法,其中,所述终端采集所述训练账号在每个状态帧内的交互参数的参数值包括以下至少之一:
    所述终端采集每个所述状态帧内的所述交互状态的状态标识,得到使用所述训练账号运行所述人机交互应用的过程中的状态帧序列;
    所述终端采集每个所述状态帧内的所述交互动作的动作标识,得到使用所述训练账号运行所述人机交互应用的过程中的动作帧序列;
    所述终端获取与所述人机交互应用的应用类型匹配的交互反馈激励参数;计算所述交互反馈激励参数的参数值,得到使用所述训练账号运行所述人机交互应用的过程中的反馈激励帧序列。
  6. 根据权利要求5所述的方法,其中,所述终端采集每个所述状态帧内的所述交互状态的状态标识包括:
    所述终端截屏每个所述状态帧内的所述交互状态的状态画面;
    所述终端根据所述状态画面确定所述交互状态的状态标识。
  7. 根据权利要求5所述的方法,其中,所述终端采集每个所述状态帧内的所述交互动作的动作标识包括:
    所述终端采集触屏操作;获取在所述人机交互应用中与所述触屏操作对应的所述交互动作的所述动作标识;或者
    所述终端采集外部设备的输入事件,其中,所述输入事件包括以下至少之一:键盘输入事件、体感输入事件、传感设备输入事件;获取在所述人机交互应用中与所述输入事件对应的所述交互动作的所述动作标识。
  8. 根据权利要求1所述的方法,其中,所述终端使用所述离线样本集合离线训练初始神经网络,得到对象神经网络包括:
    在所述预定配置条件指示获取高等级对象神经网络的情况下,所述终端使用高等级离线样本集合训练得到所述高等级对象神经网络,其中,所述高等级离线样本集合中的所述离线样本在所述人机交互应用中的运行结果高于预定阈值;或者
    在所述预定配置条件指示获取多个等级的对象神经网络的情况下,所述终端分别使用每个等级的离线样本集合训练得到对应等级的对象神经网络,其中,多个等级的离线样本集合中的离线样本在所述人机交互应用中的运行结果分别处在不同的目标阈值范围内,其中,所述多个等级的对象神经网络至少包括第一等级对象网络,第二等级对象网络,其中,所述第一等级对象网络的处理能力高于所述第二等级对象网络的处理能力。
  9. 根据权利要求1所述的方法,其中,所述终端将所述对象神经网络接入所述人机交互应用的在线运行环境进行在线训练,得到目标神经网络包括:
    所述终端将所述对象神经网络接入所述人机交互应用的所述在线运行环境,与所述人机交互应用中的在线账号进行在线对抗训练;或者
    所述终端将所述对象神经网络接入所述人机交互应用的所述在线运行环境,替代所述人机交互应用中的第一在线账号,继续与第二在线账号进行在线对抗训练。
  10. 一种神经网络训练装置,应用于终端,包括:
    获取单元,设置为获取用于训练人机交互应用中的神经网络的离线样本集合,其中,所述离线样本集合中包括满足预定配置条件的离线样本;
    离线训练单元,设置为使用所述离线样本集合离线训练初始神经网络,得到对象神经网络,其中,在所述人机交互应用中,所述对象神经网络的处理能力高于所述初始神经网络的处理能力;
    在线训练单元,设置为将所述对象神经网络接入所述人机交互应用的在线运行环境进行在线训练,得到目标神经网络。
  11. 根据权利要求10所述的装置,其中,所述获取单元包括:
    获取模块,设置为获取使用训练账号运行所述人机交互应用后得到的离线样本;
    筛选模块,设置为根据所述预定配置条件从获取到的所述离线样本中筛选得到所述离线样本集合。
  12. 根据权利要求11所述的装置,其中,所述获取模块包括:
    采集子模块,设置为在使用所述训练账号运行所述人机交互应用的过程中,采集所述训练账号在每个状态帧内的交互参数的参数值,其中,所述交互参数包括:交互状态、交互动作、交互反馈激励;
    获取子模块,设置为根据所述交互参数的参数值获取所述离线样本。
  13. 根据权利要求12所述的装置,其中,所述获取子模块通过以下步骤实现根据所述交互参数的参数值获取所述离线样本:
    根据第i个状态帧内的所述交互参数的参数值,及第i+1个状态帧内的所述交互参数的参数值,组合确定所述离线样本,其中,i大于等于1,小于等于N,N为运行一次所述人机交互应用的总帧数量。
  14. 根据权利要求12所述的装置,其中,所述采集子模块通过以下至少一种方式采集所述训练账号在每个状态帧内的交互参数的参数值:
    采集每个所述状态帧内的所述交互状态的状态标识,得到使用所述训练账号运行所述人机交互应用的过程中的状态帧序列;
    采集每个所述状态帧内的所述交互动作的动作标识,得到使用所述训练账号运行所述人机交互应用的过程中的动作帧序列;
    获取与所述人机交互应用的应用类型匹配的交互反馈激励参数;计算所述交互反馈激励参数的参数值,得到使用所述训练账号运行所述人机交互应用的过程中的反馈激励帧序列。
  15. 根据权利要求14所述的装置,其中,所述采集子模块通过以下步骤采集每个所述状态帧内的所述交互状态的状态标识:
    截屏每个所述状态帧内的所述交互状态的状态画面;
    根据所述状态画面确定所述交互状态的状态标识。
  16. 根据权利要求15所述的装置,其中,所述采集子模块通过以下步骤采集每个所述状态帧内的所述交互动作的动作标识:
    采集触屏操作;获取在所述人机交互应用中与所述触屏操作对应的所述交互动作的所述动作标识;或者
    采集外部设备的输入事件,其中,所述输入事件包括以下至少之一:键盘输入事件、体感输入事件、传感设备输入事件;获取在所述人机交互应用中与所述输入事件对应的所述交互动作的所述动作标识。
  17. 一种存储介质,所述存储介质包括存储的程序,其中,所述程序运行时执行所述权利要求1至9任一项中所述的方法。
  18. 一种电子装置,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器通过所述计算机程序执行所述权利要求1至9任一项中所述的方法。
PCT/CN2018/111914 2017-10-27 2018-10-25 神经网络训练方法和装置、存储介质及电子装置 WO2019080900A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711037964.3 2017-10-27
CN201711037964.3A CN109726808B (zh) 2017-10-27 2017-10-27 神经网络训练方法和装置、存储介质及电子装置

Publications (1)

Publication Number Publication Date
WO2019080900A1 true WO2019080900A1 (zh) 2019-05-02

Family

ID=66246220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/111914 WO2019080900A1 (zh) 2017-10-27 2018-10-25 神经网络训练方法和装置、存储介质及电子装置

Country Status (2)

Country Link
CN (1) CN109726808B (zh)
WO (1) WO2019080900A1 (zh)

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN110610169A (zh) * 2019-09-20 2019-12-24 腾讯科技(深圳)有限公司 图片标注方法和装置、存储介质及电子装置
CN110796248A (zh) * 2019-08-27 2020-02-14 腾讯科技(深圳)有限公司 数据增强的方法、装置、设备及存储介质
CN114637209A (zh) * 2022-03-22 2022-06-17 华北电力大学 一种基于强化学习的神经网络逆控制器进行控制的方法

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN111104925B (zh) * 2019-12-30 2022-03-11 上海商汤临港智能科技有限公司 图像处理方法、装置、存储介质和电子设备

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101630144A (zh) * 2009-08-18 2010-01-20 湖南大学 电子节气门的自学习逆模型控制方法
CN105184213A (zh) * 2014-06-12 2015-12-23 松下知识产权经营株式会社 图像识别方法、相机系统
CN106650721A (zh) * 2016-12-28 2017-05-10 吴晓军 一种基于卷积神经网络的工业字符识别方法

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US9679258B2 (en) * 2013-10-08 2017-06-13 Google Inc. Methods and apparatus for reinforcement learning
EP3204896A1 (en) * 2014-10-07 2017-08-16 Google, Inc. Training neural networks on partitioned training data
US10445641B2 (en) * 2015-02-06 2019-10-15 Deepmind Technologies Limited Distributed training of reinforcement learning systems
CN108027897B (zh) * 2015-07-24 2022-04-12 渊慧科技有限公司 利用深度强化学习的连续控制
CN106940801B (zh) * 2016-01-04 2019-10-22 中国科学院声学研究所 一种用于广域网络的深度强化学习推荐系统及方法
CN107291232A (zh) * 2017-06-20 2017-10-24 深圳市泽科科技有限公司 一种基于深度学习与大数据的体感游戏交互方法及系统

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN101630144A (zh) * 2009-08-18 2010-01-20 湖南大学 电子节气门的自学习逆模型控制方法
CN105184213A (zh) * 2014-06-12 2015-12-23 松下知识产权经营株式会社 图像识别方法、相机系统
CN106650721A (zh) * 2016-12-28 2017-05-10 吴晓军 一种基于卷积神经网络的工业字符识别方法

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN110796248A (zh) * 2019-08-27 2020-02-14 腾讯科技(深圳)有限公司 数据增强的方法、装置、设备及存储介质
CN110610169A (zh) * 2019-09-20 2019-12-24 腾讯科技(深圳)有限公司 图片标注方法和装置、存储介质及电子装置
CN110610169B (zh) * 2019-09-20 2023-12-15 腾讯科技(深圳)有限公司 图片标注方法和装置、存储介质及电子装置
CN114637209A (zh) * 2022-03-22 2022-06-17 华北电力大学 一种基于强化学习的神经网络逆控制器进行控制的方法

Also Published As

Publication number Publication date
CN109726808A (zh) 2019-05-07
CN109726808B (zh) 2022-12-09

Similar Documents

Publication Publication Date Title
WO2019080900A1 (zh) 神经网络训练方法和装置、存储介质及电子装置
JP6920771B2 (ja) 3d畳み込みニューラルネットワークに基づく動作識別方法及び装置
US11323659B2 (en) Video communication device, video communication method, and video communication mediating method
Brown et al. Finding waldo: Learning about users from their interactions
CN110339569B (zh) 控制游戏场景中虚拟角色的方法及装置
CN107798027B (zh) 一种信息热度预测方法、信息推荐方法及装置
CN109176535B (zh) 基于智能机器人的交互方法及系统
TW201814572A (zh) 終端設備、智慧型手機、基於臉部識別的認證方法和系統
KR102033050B1 (ko) 시간차 모델을 위한 비지도 학습 기법
CN108229262B (zh) 一种色情视频检测方法及装置
JP2013176590A5 (zh)
CN111479129B (zh) 直播封面的确定方法、装置、服务器、介质及系统
US10559215B2 (en) Education reward system and method
CN112257645B (zh) 人脸的关键点定位方法和装置、存储介质及电子装置
CN113240778A (zh) 虚拟形象的生成方法、装置、电子设备和存储介质
WO2018171196A1 (zh) 一种控制方法、终端及系统
CN103927452B (zh) 一种远程健康监护系统、方法和装置
CN114513694B (zh) 评分确定方法、装置、电子设备和存储介质
US20220068158A1 (en) Systems and methods to provide mental distress therapy through subject interaction with an interactive space
US20160271498A1 (en) System and method for modifying human behavior through use of gaming applications
WO2019206043A1 (zh) 信息处理方法及系统
CN110443852A (zh) 一种图像定位的方法及相关装置
CN115311723A (zh) 活体检测方法、装置及计算机可读存储介质
CN104796786B (zh) 信息处理设备和信息处理方法
CN113573091A (zh) 家庭康复软件系统以及应用于家庭康复的人机交互方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18871367

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18871367

Country of ref document: EP

Kind code of ref document: A1