WO2021189952A1 - Model training method and apparatus, action recognition method and apparatus, and device and storage medium - Google Patents

Model training method and apparatus, action recognition method and apparatus, and device and storage medium Download PDF

Info

Publication number
WO2021189952A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
result
action
trained
classification
Prior art date
Application number
PCT/CN2020/135245
Other languages
French (fr)
Chinese (zh)
Inventor
李泽远
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021189952A1 publication Critical patent/WO2021189952A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an action recognition model training method, an action recognition method, a device, equipment, and a storage medium.
  • This application provides a method for training an action recognition model, and the method includes:
  • This application also provides an action recognition method, which includes:
  • wherein the model is obtained by training using the above-mentioned action recognition model training method.
  • This application also provides a device for training an action recognition model, the device including:
  • a sample acquisition module, used to acquire video images, action data, and the action labels corresponding to the video images and action data;
  • a network training module, used to perform network training on a dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and prediction results;
  • a classification training module, used to train a pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and classification results;
  • a model merging module, used to merge the trained network model and the trained classification model into a local recognition model, and to obtain a local recognition result according to the prediction result and the classification result;
  • a joint learning module, used to upload the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters;
  • a model update module, used to receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as the trained action recognition model.
  • This application also provides an action recognition device, which includes:
  • a data acquisition module, used to acquire the image to be recognized and the motion data corresponding to the image to be recognized; and an action recognition module, used to input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
  • This application also provides a computer device, the computer device including a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing the computer program, to implement the following steps:
  • acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model; and
  • the following steps: acquiring the image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
  • This application also provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the processor implements the following steps: acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model; and
  • the following steps: acquiring the image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
  • This application discloses an action recognition model training method, an action recognition method, a device, equipment, and a storage medium.
  • Video images, action data, and the action labels corresponding to the video images and action data are acquired; then, based on the video images and the corresponding action labels, network training is performed on a dual-stream convolutional neural network to obtain a trained network model and prediction results.
  • At the same time, a pre-configured classifier is trained based on the action data and the corresponding action labels to obtain a trained classification model and classification results.
  • The trained network model and the trained classification model are then merged to obtain a local recognition model, the local recognition result is obtained according to the prediction result and the classification result, and the model parameters of the local recognition model and the local recognition result are uploaded to a cloud server for joint learning to obtain the learning parameters.
  • Finally, each participant receives the learning parameters sent by the cloud server and updates its local recognition model according to the learning parameters to complete the model training.
  • Each participant trains a model locally to obtain its own local recognition model and then uploads it to the cloud server for joint learning, which expands the number of samples available when training the model and improves the recognition accuracy of the trained action recognition model; and since each participant trains its model locally, the training data is never exchanged, which also ensures the security and privacy of the data.
  • FIG. 1 is a schematic flowchart of an action recognition model training method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a network structure of a network model provided by an embodiment of the present application
  • FIG. 3 is a schematic flowchart of sub-steps of the action recognition model training method provided in FIG. 1;
  • FIG. 4 is a schematic flowchart of an action recognition method provided by an embodiment of the present application;
  • FIG. 5 is a schematic block diagram of an apparatus for training an action recognition model according to an embodiment of the present application
  • FIG. 6 is a schematic block diagram of a device for action recognition according to an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application.
  • the embodiments of the present application provide an action recognition model training method, device, computer equipment, and storage medium.
  • The action recognition model training method can be used to train an action recognition model that recognizes human actions, improving the recognition accuracy of the trained model and, in turn, the accuracy of action recognition.
  • FIG. 1 is a schematic flowchart of an action recognition model training method provided by an embodiment of the present application.
  • the action recognition model training method can be applied to each participant, that is, each local client.
  • By jointly training on the sample data of multiple participants, the action recognition model training method enriches the number of samples and improves the recognition accuracy of the trained action recognition model.
  • the method for training an action recognition model specifically includes: step S101 to step S106.
  • Since the action recognition model includes two parts, namely a network model and a classification model, video images, action data, and the action labels corresponding to the video images and action data are acquired to train the network model and the classification model respectively.
  • Specifically, the user wears a smart wearable device while performing an action, and the user's action is filmed to obtain a video image.
  • At the same time, the user's movement data during the action is collected.
  • The action performed by the user serves as the action label corresponding to the video image and the action data.
  • The video images, action data, and corresponding action labels are the local data of each participant; that is, each participant performs model training based on its own local data and does not need to share data with other participants, thereby improving data security and reliability.
  • S102 Perform network training on the dual-stream convolutional neural network based on the video image and the corresponding action label, and obtain a trained network model and a prediction result.
  • the dual-stream convolutional neural network is trained to obtain the network model.
  • The network model includes a spatial stream convolutional network and a temporal stream convolutional network.
  • The spatial stream convolutional network and the temporal stream convolutional network each include several convolutional layers, fully connected layers, and a softmax layer.
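  • As a minimal illustrative sketch (not part of the application), the two-stream structure described above can be expressed in PyTorch as follows; the layer sizes, the 10 action classes, and the 10 stacked optical-flow fields are assumptions:

```python
import torch.nn as nn

def make_stream(in_channels: int, num_actions: int) -> nn.Sequential:
    # Several convolutional layers, then a fully connected layer and a
    # softmax layer, as the text describes for both streams.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, num_actions),
        nn.Softmax(dim=1),
    )

spatial_stream = make_stream(in_channels=3, num_actions=10)        # one RGB video frame
temporal_stream = make_stream(in_channels=2 * 10, num_actions=10)  # 10 stacked flow fields (x and y)
```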
  • Step S102 specifically includes step S1021 to step S1024.
  • S1021 Extract an optical flow image corresponding to the video image according to the video image.
  • S1022 Use the video image and the corresponding action label to train the spatial stream convolutional network in the dual-stream convolutional neural network, and obtain a spatial prediction result.
  • Specifically, each frame of the video image is input into the spatial stream convolutional network for training. The loss between the spatial prediction result and the corresponding action label is then calculated. When the loss value reaches the preset condition, training of the spatial stream convolutional network is considered complete, and the spatial prediction result is obtained.
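  • A minimal sketch of such a training loop (the toy data, the 5 action classes, and the loss threshold of 0.05 used as the preset condition are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 5))   # stand-in spatial stream
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

frames = torch.randn(16, 3, 64, 64)        # one batch of video frames
labels = torch.randint(0, 5, (16,))        # corresponding action labels
for step in range(1000):
    loss = loss_fn(model(frames), labels)
    if loss.item() < 0.05:                 # preset condition: training is complete
        break
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```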
  • During training, the spatial stream convolutional network can use L2 regularization to monitor the loss and prevent overfitting.
  • The objective function expression using L2 regularization is:
  • L = J(θ) + λ·Σ_{i=1}^{k} θ_i²
  • where L represents the loss function with regularization; J(θ) represents the original loss function; θ represents all the parameters in the convolutional neural network; λ represents the regularization coefficient; Σθ_i² refers to the sum of the squares of the weights; i represents the action number of each recognized action; and k represents the total number of recognized actions.
  • The regularization coefficient λ can be set according to the actual situation.
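  • A minimal sketch of the regularized objective L = J(θ) + λ·Σθ_i² (the value of λ is an assumption; in practice an optimizer's weight_decay option has the same effect):

```python
import torch
import torch.nn.functional as F

def l2_regularized_loss(model, frames, labels, lam=1e-4):
    j_theta = F.cross_entropy(model(frames), labels)      # J(theta)
    l2 = sum((p ** 2).sum() for p in model.parameters())  # sum of squared weights
    return j_theta + lam * l2                             # L = J(theta) + lambda * sum
```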
  • After obtaining the output of the fully connected layer, the softmax layer performs data conversion on the output values of the fully connected layer, so that the final spatial prediction result that is output is the probability that the video image is predicted to be a certain action.
  • The conversion formula of the softmax layer can be:
  • S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}
  • where V_i is the output value of the fully connected layer for the i-th action type; i denotes any one of the action types; N denotes the total number of action types; and S_i is the ratio of the exponential of the current output value to the sum of the exponentials of all output values of the fully connected layer, that is, the probability output by the spatial stream convolutional network.
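  • A minimal sketch of this conversion applied to the fully connected layer's output vector V (subtracting the maximum first is a standard numerical-stability refinement not stated in the text):

```python
import torch

def softmax_probs(v: torch.Tensor) -> torch.Tensor:
    e = torch.exp(v - v.max())   # e^{V_i}, shifted for numerical stability
    return e / e.sum()           # S_i = e^{V_i} / sum_j e^{V_j}
```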
  • S1023 Use the optical flow image and the corresponding action label to train the temporal stream convolutional network in the dual-stream convolutional neural network, and obtain a temporal prediction result.
  • Specifically, the optical flow image includes the motion state information between frames.
  • The optical flow image and the corresponding action label are input into the temporal stream convolutional network in the dual-stream convolutional neural network for training, and the loss between the temporal prediction result and the corresponding action label is calculated.
  • When the loss value reaches the preset condition, training of the temporal stream convolutional network is considered complete, and the temporal prediction result is obtained.
  • During training, the temporal stream convolutional network can likewise use L2 regularization to monitor the loss and prevent overfitting.
  • The objective function expression using L2 regularization is the same as for the spatial stream:
  • L = J(θ) + λ·Σ_{i=1}^{k} θ_i²
  • where L represents the loss function with regularization; J(θ) represents the original loss function; θ represents all the parameters in the convolutional neural network; λ represents the regularization coefficient; Σθ_i² refers to the sum of the squares of the weights; i represents the action number of each recognized action; and k represents the total number of recognized actions.
  • The regularization coefficient λ can be set according to the actual situation.
  • After obtaining the output of the fully connected layer, the softmax layer performs data conversion on the output values of the fully connected layer, so that the final temporal prediction result that is output is the probability that the optical flow image is predicted to be a certain action.
  • The conversion formula of the softmax layer can be:
  • S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}
  • where V_i is the output value of the fully connected layer for the i-th action type; i denotes any one of the action types; N denotes the total number of action types; and S_i is the ratio of the exponential of the current output value to the sum of the exponentials of all output values of the fully connected layer, that is, the probability output by the temporal stream convolutional network.
  • S1024 Aggregate the spatial prediction result and the temporal prediction result to obtain the prediction result. For aggregation, the direct average method can be used (taking the average of the two results), or an SVM can be used.
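  • A minimal sketch of the direct average option (the SVM-based aggregation would instead train a classifier on the two streams' outputs):

```python
import torch

def aggregate(spatial_probs: torch.Tensor, temporal_probs: torch.Tensor) -> torch.Tensor:
    # Direct average of the two streams' class probabilities.
    return (spatial_probs + temporal_probs) / 2
```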
  • S103 Train a pre-configured classifier based on the action data and the corresponding action label, to obtain a trained classification model and a classification result.
  • the motion data includes the three-axis angular velocity data collected by the gyroscope sensor mounted in the smart wearable device and the three-axis acceleration data collected by the acceleration sensor when the user performs the corresponding action.
  • the pre-configured classifier may be a support vector machine.
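  • A minimal sketch using scikit-learn's SVC as the pre-configured classifier (the feature layout, class count, and random training data are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# Each row: [gyro_x, gyro_y, gyro_z, acc_x, acc_y, acc_z] for one sample.
X_train = np.random.randn(200, 6)
y_train = np.random.randint(0, 5, size=200)   # 5 hypothetical action classes

classifier = SVC(kernel="rbf", probability=True)  # probabilities feed the fusion step
classifier.fit(X_train, y_train)
classification_result = classifier.predict_proba(X_train[:1])
```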
  • each participant merges the trained network model and the trained classification model to obtain a local recognition model. After the local recognition model is obtained, the local recognition result is obtained according to the prediction result and the classification result.
  • the obtaining a local recognition result according to the prediction result and the classification result includes: obtaining a local recognition result according to the prediction result and the classification result based on a weight calculation formula.
  • Specifically, the local recognition model includes two parts: the network model and the classification model.
  • The local recognition model weights the prediction results of the network model and the classification results of the classification model according to preset weight coefficients, so as to obtain the final local recognition result.
  • The weight calculation formula includes:
  • R = γ₁·P_a + γ₂·P_b
  • where R represents the local recognition result; P_a represents the most probable result among the prediction results; γ₁ represents the weight coefficient of the most probable result P_a; P_b represents the most probable result among the classification results; and γ₂ represents the weight coefficient of the most probable result P_b.
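  • A minimal sketch of this weighting; the γ values are assumptions, and the weights are applied per class before taking the most probable action, which is one common way to realize R = γ₁·P_a + γ₂·P_b when both models output class probabilities:

```python
import numpy as np

def local_recognition(pred_probs: np.ndarray, cls_probs: np.ndarray,
                      gamma1: float = 0.6, gamma2: float = 0.4):
    fused = gamma1 * pred_probs + gamma2 * cls_probs   # weight the two models
    return int(fused.argmax()), float(fused.max())     # recognized action and score R
```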
  • The model parameters of the local recognition model and its local recognition result are uploaded to the cloud server, and the cloud server performs joint learning based on the received information to obtain the learning parameters.
  • Specifically, the cloud server can use a global averaging method for the joint learning: it calculates the average value of the model parameters across the uploaded local recognition models, and then lowers the weight of parameters that deviate too much from the average, to obtain the learning parameters.
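  • A minimal sketch of the global averaging step on the cloud side (the outlier down-weighting the text mentions is simplified to a plain mean here):

```python
import numpy as np

def global_average(param_sets):
    # param_sets: one {parameter_name: ndarray} dict per participant.
    names = param_sets[0].keys()
    return {name: np.mean([p[name] for p in param_sets], axis=0) for name in names}
```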
  • In some embodiments, uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning includes: encrypting the model parameters of the local recognition model and the local recognition result to obtain encrypted data; and uploading the encrypted data to the cloud server for joint learning.
  • Each participant encrypts the data that needs to be uploaded to obtain the encrypted data, and then uploads the encrypted data to the cloud server.
  • The cloud server decrypts the encrypted data and then conducts joint learning based on the decrypted data, which reduces data leakage during transmission and improves data security.
  • For data encryption, privacy-preserving computation methods such as homomorphic encryption, differential privacy, or secure multi-party computation can be used. It should be noted that when homomorphic encryption is used, the cloud server need not decrypt the encrypted data and can conduct joint learning directly on the encrypted data.
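  • A minimal sketch of the homomorphic option using the python-paillier library (`phe`), which is one possible choice rather than anything the application prescribes; the cloud averages ciphertexts without ever decrypting them:

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Participants encrypt a shared model parameter before upload.
uploads = [public_key.encrypt(w) for w in (0.52, 0.47, 0.55)]

# Cloud side: additive homomorphism allows summing ciphertexts directly.
encrypted_mean = sum(uploads[1:], uploads[0]) * (1.0 / len(uploads))

# Only a holder of the private key can read the aggregated value.
print(private_key.decrypt(encrypted_mean))
```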
  • In some embodiments, before step S104, the method includes: uploading the trained network model and the prediction result to a cloud server for joint learning to obtain a joint network model; and receiving the joint network model sent by the cloud server and using the joint network model as the trained network model; and/or uploading the trained classification model and the classification result to a cloud server for joint learning to obtain a joint classification model; and receiving the joint classification model sent by the cloud server and using the joint classification model as the trained classification model.
  • Specifically, the model parameters and prediction results of the locally trained network model can be uploaded to the cloud server, so that the cloud server performs joint learning on the model parameters and prediction results of the trained network models uploaded by each participant to obtain a joint network model.
  • The joint network model is used as the trained network model; that is, after the cloud server obtains the joint network model, it sends the parameters of the joint network model to each participant, and each participant receives these model parameters, updates its locally trained network model according to them, and then uses the updated network model as the trained network model.
  • Likewise, the model parameters and classification results of the locally trained classification model can be uploaded to the cloud server, so that the cloud server performs joint learning on the model parameters and classification results of the trained classification models uploaded by each participant to obtain a joint classification model.
  • The joint classification model is used as the trained classification model; that is, after the cloud server obtains the joint classification model, it delivers the parameters of the joint classification model to each participant, and each participant receives these model parameters, updates its locally trained classification model according to them, and then uses the updated classification model as the trained classification model.
  • In the action recognition model training method, at most three different rounds of joint learning can be performed: joint learning of the locally trained network model, joint learning of the locally trained classification model, and joint learning of the local recognition model.
  • S106 Receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as a trained action recognition model.
  • Each participant receives the learning parameters sent by the cloud server, and updates the local recognition model according to the learning parameters, and uses the updated local recognition model as the trained action recognition model to complete the training of the action recognition model.
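  • A minimal sketch of this update on the participant side (a PyTorch state_dict is assumed as the parameter format):

```python
import torch

def update_local_model(model: torch.nn.Module, learning_params: dict) -> torch.nn.Module:
    # Overwrite the local recognition model's weights with the joint parameters.
    model.load_state_dict(learning_params)
    return model
```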
  • The action recognition model training method described above acquires video images, motion data, and the action labels corresponding to the video images and motion data, and then performs network training on the dual-stream convolutional neural network based on the video images and the corresponding action labels to obtain the trained network model and prediction results.
  • A pre-configured classifier is trained based on the action data and corresponding action labels to obtain the trained classification model and classification results; the trained network model and the trained classification model are then merged to obtain a local recognition model, the local recognition result is obtained according to the prediction result and the classification result, and the model parameters of the local recognition model and the local recognition result are uploaded to the cloud server for joint learning to obtain the learning parameters. Finally, each participant receives the learning parameters sent by the cloud server and updates its local recognition model according to them, completing the model training.
  • Each participant trains a model locally to obtain its own local recognition model and then uploads it to the cloud server for joint learning, which expands the number of samples available when training the model and improves the recognition accuracy of the trained action recognition model; and since each participant trains its model locally, the training data is never exchanged, which also ensures the security and privacy of the data.
  • FIG. 4 is a schematic flowchart of an action recognition method provided by an embodiment of the present application.
  • The action recognition method specifically includes: step S201 and step S202.
  • S201 Acquire an image to be recognized and motion data corresponding to the image to be recognized.
  • Specifically, the image to be recognized, captured while the user performs an action, and the motion data corresponding to the image to be recognized can be acquired.
  • the motion data includes the three-axis angular velocity data collected by the gyroscope sensor mounted in the smart wearable device and the three-axis acceleration data collected by the acceleration sensor when the user performs a corresponding action.
  • the pre-trained action recognition model refers to a model trained according to the aforementioned action recognition model training method.
  • The pre-trained action recognition model includes a network model and a classification model.
  • the image to be recognized is input into the network model
  • the network model performs action prediction based on the image to be recognized to obtain a prediction result.
  • the motion data is input into the classification model, and the classification model performs action classification according to the motion data to obtain the classification result.
  • The prediction result obtained by the network model and the classification result obtained by the classification model are each weighted according to the corresponding weight coefficients, and finally a definite recognition result is obtained; the action recognition is completed and the recognition result is output.
  • In other embodiments, the image to be recognized alone, or the motion data alone, can also be input into the pre-trained action recognition model for action recognition.
  • The above embodiment provides an action recognition method that acquires the image to be recognized and the motion data corresponding to the image to be recognized, and then inputs both into a pre-trained action recognition model for action recognition to obtain the recognition result, completing the action recognition. Performing action recognition based on both the image to be recognized and the motion data, and combining the recognition results of the two, improves the accuracy of action recognition.
  • FIG. 5 is a schematic block diagram of an apparatus for training an action recognition model according to an embodiment of the present application.
  • the apparatus for training an action recognition model is used to execute the aforementioned method for training an action recognition model.
  • The action recognition model training device can be configured in a server or a terminal.
  • the server can be an independent server or a server cluster.
  • the terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device.
  • the action recognition model training device 300 includes: a sample acquisition module 301, a network training module 302, a classification training module 303, a model merging module 304, a joint learning module 305, and a model update module 306.
  • the sample acquisition module 301 is used to acquire video images, motion data, and motion tags corresponding to the video images and motion data.
  • the network training module 302 is configured to perform network training on the dual-stream convolutional neural network based on the video image and the corresponding action label, and obtain the trained network model and the prediction result.
  • The network training module 302 includes an optical flow extraction sub-module 3021, a spatial training sub-module 3022, a temporal training sub-module 3023, and a result aggregation sub-module 3024.
  • the optical flow extraction sub-module 3021 is configured to extract an optical flow image corresponding to the video image according to the video image.
  • the spatial training sub-module 3022 is used to train the spatial stream convolutional network in the dual-stream convolutional neural network by using the video image and the corresponding action label, and obtain the spatial prediction result.
  • The temporal training sub-module 3023 is used to train the temporal stream convolutional network in the dual-stream convolutional neural network by using the optical flow image and the corresponding action label, and obtain the temporal prediction result.
  • the result aggregation sub-module 3024 is configured to aggregate the spatial prediction result and the temporal prediction result to obtain a prediction result.
  • the classification training module 303 is configured to train a pre-configured classifier based on the action data and the corresponding action label to obtain the trained classification model and the classification result.
  • the model merging module 304 is configured to merge the trained network model and the trained classification model to obtain a local recognition model, and obtain a local recognition result according to the prediction result and the classification result.
  • the joint learning module 305 is configured to upload the model parameters of the local recognition model and the local recognition results to a cloud server for joint learning to obtain learning parameters.
  • the model update module 306 is configured to receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as a trained action recognition model.
  • FIG. 6 is a schematic block diagram of an action recognition device provided in an embodiment of the present application; the action recognition device is used to execute the aforementioned action recognition method.
  • the action recognition device can be configured in a server or a terminal.
  • the server can be an independent server or a server cluster.
  • the terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device.
  • the action recognition device 400 includes: a data acquisition module 401 and an action recognition module 402.
  • the data acquisition module 401 is configured to acquire the image to be recognized and the motion data corresponding to the image to be recognized.
  • The action recognition module 402 is configured to input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition and obtain a recognition result; wherein the pre-trained action recognition model is trained according to the aforementioned action recognition model training method.
  • the aforementioned action recognition model training device and action recognition device can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 7.
  • FIG. 7 is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application.
  • the computer equipment can be a server or a terminal.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium can store an operating system and a computer program.
  • the computer program includes program instructions, and when the program instructions are executed, the processor can execute any action recognition model training method or action recognition method.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • the internal memory provides an environment for the operation of the computer program in the non-volatile storage medium.
  • the processor can execute any action recognition model training method or action recognition method.
  • the network interface is used for network communication, such as sending assigned tasks.
  • FIG. 7 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • The specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • The processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • In one embodiment, the processor is used to run a computer program stored in the memory to implement the action recognition model training method described above.
  • When performing the network training of the dual-stream convolutional neural network based on the video image and the corresponding action label to obtain the trained network model and the prediction result, the processor is configured to implement:
  • extracting, according to the video image, an optical flow image corresponding to the video image; using the video image and the corresponding action label to train the spatial stream convolutional network in the dual-stream convolutional neural network and obtain a spatial prediction result;
  • using the optical flow image and the corresponding action label to train the temporal stream convolutional network in the dual-stream convolutional neural network and obtain a temporal prediction result; and aggregating the spatial prediction result and the temporal prediction result to obtain the prediction result.
  • When obtaining the local recognition result according to the prediction result and the classification result, the processor is configured to implement: obtaining the local recognition result according to the prediction result and the classification result based on a weight calculation formula, where the weight calculation formula includes:
  • R = γ₁·P_a + γ₂·P_b
  • where R represents the local recognition result; P_a represents the most probable result among the prediction results; γ₁ represents the weight coefficient of the most probable result P_a; P_b represents the most probable result among the classification results; and γ₂ represents the weight coefficient of the most probable result P_b.
  • When uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, the processor is configured to implement: encrypting the model parameters of the local recognition model and the local recognition result to obtain encrypted data; and uploading the encrypted data to the cloud server for joint learning.
  • Before uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, the processor is configured to implement: uploading the trained network model and the prediction result to the cloud server for joint learning to obtain a joint network model; receiving the joint network model sent by the cloud server, and using the joint network model as the trained network model; and/or uploading the trained classification model and the classification result to the cloud server for joint learning to obtain a joint classification model; and receiving the joint classification model sent by the cloud server, and using the joint classification model as the trained classification model.
  • the processor is used to run a computer program stored in a memory, and when implementing the action recognition method, it is used to implement the following steps:
  • the model is obtained by training according to the above-mentioned action recognition model training method.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores a computer program.
  • the computer program includes program instructions, and the processor executes the program instructions to implement any of the methods provided in the embodiments of the present application.
  • the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiment, for example, the hard disk or memory of the computer device.
  • The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (SD) card, or a flash card equipped on the computer device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

An action recognition model training method and apparatus, an action recognition method and apparatus, and a device and a storage medium. The action recognition model training method comprises: acquiring a video image, action data, and action labels corresponding to the video image and the action data (S101); performing network training on a two-stream convolutional neural network on the basis of the video image and the corresponding action label, so as to obtain a network model and a prediction result (S102); training a classifier on the basis of the action data and the corresponding action label, so as to obtain a classification model and a classification result (S103); merging the network model and the classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result (S104); uploading the local recognition model and the local recognition result to perform joint learning, so as to obtain a learning parameter (S105); and receiving the learning parameter, and updating the local recognition model according to the learning parameter (S106). The present application improves the recognition accuracy of an action recognition model obtained by means of training.

Description

Model training method, action recognition method, device, equipment and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 21, 2020, with application number 2020111339503 and the invention title "Model Training Method, Action Recognition Method, Device, Equipment, and Storage Medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence, and in particular to an action recognition model training method, an action recognition method, a device, equipment, and a storage medium.
Background
In the fields of human interaction and collaboration, intelligent nursing, intelligent monitoring, and motion analysis, it is necessary to recognize human actions in order to determine the type of human behavior. The inventor realized that traditional action recognition methods mostly use computer image processing to extract motion trajectories and person features from video frames and then train a classifier to recognize human behavior, which yields low accuracy and slow recognition speed. Action recognition models built with methods such as convolutional neural networks also suffer from small sample sizes, which leads to unsatisfactory training results and, in turn, low recognition accuracy.
Therefore, how to improve the recognition accuracy of the trained action recognition model has become an urgent problem to be solved.
Summary of the invention
This application provides a method for training an action recognition model, and the method includes:
acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model.
This application also provides an action recognition method, which includes:
acquiring the image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
This application also provides a device for training an action recognition model, the device including:
a sample acquisition module, used to acquire video images, action data, and the action labels corresponding to the video images and action data; a network training module, used to perform network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; a classification training module, used to train a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; a model merging module, used to merge the trained network model and the trained classification model into a local recognition model and to obtain a local recognition result according to the prediction result and the classification result; a joint learning module, used to upload the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and a model update module, used to receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as the trained action recognition model.
This application also provides an action recognition device, which includes:
a data acquisition module, used to acquire the image to be recognized and the motion data corresponding to the image to be recognized; and an action recognition module, used to input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
This application also provides a computer device, the computer device including a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing the computer program, to implement the following steps: acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model; and
the following steps: acquiring the image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
This application also provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the processor implements the following steps: acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model; and
the following steps: acquiring the image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
This application discloses an action recognition model training method, an action recognition method, a device, equipment, and a storage medium. Video images, action data, and the action labels corresponding to the video images and action data are acquired; then, based on the video images and the corresponding action labels, network training is performed on a dual-stream convolutional neural network to obtain a trained network model and prediction results. At the same time, a pre-configured classifier is trained based on the action data and the corresponding action labels to obtain a trained classification model and classification results. The trained network model and the trained classification model are then merged to obtain a local recognition model, the local recognition result is obtained according to the prediction result and the classification result, and the model parameters of the local recognition model and the local recognition result are uploaded to a cloud server for joint learning to obtain the learning parameters. Finally, each participant receives the learning parameters sent by the cloud server and updates its local recognition model according to the learning parameters to complete the model training. Each participant trains a model locally to obtain its own local recognition model and then uploads it to the cloud server for joint learning, which expands the number of samples available when training the model and improves the recognition accuracy of the trained action recognition model; and since each participant trains its model locally, the training data is never exchanged, which also ensures the security and privacy of the data.
Description of the drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the following briefly introduces the drawings used in the description of the embodiments. Obviously, the drawings described below illustrate some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
FIG. 1 is a schematic flowchart of an action recognition model training method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the network structure of a network model provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of sub-steps of the action recognition model training method provided in FIG. 1;
FIG. 4 is a schematic flowchart of an action recognition method provided by an embodiment of the present application;
FIG. 5 is a schematic block diagram of an action recognition model training device provided by an embodiment of the present application;
FIG. 6 is a schematic block diagram of an action recognition device provided by an embodiment of the present application;
FIG. 7 is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.

The flowcharts shown in the drawings are only examples; they need not include all of the content and operations/steps, nor must the steps be executed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may vary with the actual situation.

It should be understood that the terms used in this specification are only for describing specific embodiments and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should also be understood that the term "and/or" used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

The embodiments of this application provide an action recognition model training method, apparatus, computer device, and storage medium. The training method can be used to train an action recognition model for recognizing human actions, improving the recognition accuracy of the trained model and, in turn, the accuracy of action recognition.

Some implementations of this application are described in detail below with reference to the drawings. Where no conflict arises, the following embodiments and the features within them may be combined with one another.
Please refer to FIG. 1, a schematic flowchart of an action recognition model training method provided by an embodiment of this application. The method can be applied at each participant, that is, at each local client. By jointly training on the sample data of multiple participants, the method enriches the number of training samples and improves the recognition accuracy of the resulting action recognition model.

As shown in FIG. 1, the action recognition model training method specifically includes steps S101 to S106.
S101. Acquire video images, action data, and the action labels corresponding to the video images and action data.

Since the action recognition model consists of two parts, a network model and a classification model, video images, action data, and the corresponding action labels can be acquired to train the network model and the classification model separately.

A user wearing a smart wearable device performs an action; the action is filmed to obtain video images, while the gyroscope and acceleration sensors in the wearable device record the user's action data during the movement. The action the user performs serves as the action label for both the video images and the action data.

The video images, action data, and corresponding action labels are all local data of each individual participant; that is, each participant trains its model on local data and need not share data with other participants, which improves data security and reliability.
S102. Perform network training on a dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and a prediction result.

The dual-stream convolutional neural network is trained on the video images and corresponding action labels, yielding the network model.

FIG. 2 shows the network structure of this model. It comprises a spatial-stream convolutional network and a temporal-stream convolutional network, each consisting of several convolutional layers, fully connected layers, and a softmax layer.
In some embodiments, as shown in FIG. 3, step S102 specifically includes steps S1021 to S1024.

S1021. Extract, from the video images, the optical flow images corresponding to the video images.

When extracting the optical flow image from a video, OpenCV can be used to process a frame of the video to obtain key points; gradients are then computed across adjacent frames to obtain the pixel displacement of those key points, i.e. the optical flow. That frame and several subsequent frames are superimposed into an optical flow stack, i.e. the optical flow image, as sketched below.
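The following is a minimal, illustrative sketch of this step, assuming OpenCV's dense Farneback flow (the source does not name a specific OpenCV routine); the stack length and flow parameters are placeholder choices:

```python
# Hypothetical sketch: build an optical-flow stack from consecutive frames.
import cv2
import numpy as np

def optical_flow_stack(video_path, start_frame=0, stack_len=10):
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    ok, first = cap.read()
    if not ok:
        raise ValueError("could not read the starting frame")
    prev_gray = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)
    flows = []
    for _ in range(stack_len):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Gradient-based dense flow between adjacent frames; result is H x W x 2.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        prev_gray = gray
    cap.release()
    # Superimpose the per-frame flows into one optical-flow stack (H x W x 2L).
    return np.concatenate(flows, axis=2)
```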
S1022. Train the spatial-stream convolutional network of the dual-stream convolutional neural network with the video images and corresponding action labels, and obtain a spatial prediction result.

The video images and corresponding action labels are fed into the spatial-stream convolutional network for training; in a specific implementation, each frame of the video is input to the spatial-stream network separately. The loss between the spatial prediction result and the corresponding action label is then computed; when the loss value reaches a preset condition, training of the spatial-stream convolutional network is considered complete and the spatial prediction result is obtained.
In an embodiment, the spatial-stream convolutional network can use L2 regularization to supervise the loss and prevent overfitting. The objective function with L2 regularization is:
L = J(θ) + λ·Σ_{i=1}^{k} θ_i²

where L is the regularized loss function, J(θ) is the loss function, θ denotes all parameters of the convolutional neural network, λ is the regularization coefficient, the summation term is the sum of the squared weights, i is the index of each recognized action, and k is the total number of recognized actions. The regularization coefficient can be chosen according to the actual situation.
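As a hedged illustration only (the source does not specify a framework), the regularized objective could be computed in PyTorch as follows; the coefficient value is an arbitrary assumption:

```python
# Sketch of L = J(theta) + lambda * sum_i theta_i^2 for a PyTorch model.
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # plays the role of J(theta)
lam = 1e-4                          # regularization coefficient (assumed value)

def regularized_loss(model, logits, labels):
    data_loss = criterion(logits, labels)
    # Sum of squared weights over all network parameters.
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return data_loss + lam * l2
```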
After the output of the fully connected layer is obtained, the softmax layer converts that output so that the final spatial prediction result is the probability that the video image depicts a given action.
The conversion formula of the softmax layer can be:

S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}

where V_i is the output value of the fully connected layer for action class i, N is the total number of action classes, and S_i is the ratio of the exponential of the current output value to the sum of the exponentials of all output values of the fully connected layer, i.e. the probability output by the spatial-stream convolutional network.
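For illustration, this conversion can be reproduced in a few lines of NumPy (the max-subtraction is a common numerical-stability detail not mentioned by the source):

```python
# S_i = exp(V_i) / sum_j exp(V_j) over the fully connected outputs V.
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # subtract max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))  # per-action probabilities, sum to 1
```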
S1023. Train the temporal-stream convolutional network of the dual-stream convolutional neural network with the optical flow images and corresponding action labels, and obtain a temporal prediction result.

Since the optical flow images encode the motion state between frames, the optical flow images and corresponding action labels are fed into the temporal-stream convolutional network for training. The loss between the temporal prediction result and the corresponding action label is then computed; when the loss value reaches a preset condition, training of the temporal-stream convolutional network is considered complete and the temporal prediction result is obtained.
In an embodiment, the temporal-stream convolutional network can likewise use L2 regularization to supervise the loss and prevent overfitting, with the same objective:

L = J(θ) + λ·Σ_{i=1}^{k} θ_i²

where L is the regularized loss function, J(θ) is the loss function, θ denotes all parameters of the convolutional neural network, λ is the regularization coefficient, the summation term is the sum of the squared weights, i is the index of each recognized action, and k is the total number of recognized actions; the regularization coefficient can be chosen according to the actual situation.
After the output of the fully connected layer is obtained, the softmax layer converts that output so that the final temporal prediction result is the probability that the optical flow image depicts a given action.
The conversion formula of the softmax layer can be:

S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}

where V_i is the output value of the fully connected layer for action class i, N is the total number of action classes, and S_i is the ratio of the exponential of the current output value to the sum of the exponentials of all output values of the fully connected layer, i.e. the probability output by the temporal-stream convolutional network.
S1024. Aggregate the spatial prediction result and the temporal prediction result to obtain the prediction result.

Once the spatial-stream network has output the spatial prediction result and the temporal-stream network has output the temporal prediction result, the two can be aggregated into the prediction result P_A = {a_1: p_a1; a_2: p_a2; …; a_n: p_an}, where a_1, a_2, …, a_n are action labels and p_a1, p_a2, …, p_an are the predicted probabilities of the corresponding human actions. Aggregation may use direct averaging (taking the mean of the two results) or an SVM-based method; the direct-averaging variant is sketched below.
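A minimal sketch of the direct-averaging variant; the action labels and probabilities are made-up placeholders:

```python
# Direct average of spatial and temporal per-action probabilities.
def aggregate(spatial, temporal):
    # spatial / temporal: dict mapping action label -> predicted probability
    return {a: (spatial[a] + temporal[a]) / 2.0 for a in spatial}

p_a = aggregate({"wave": 0.7, "run": 0.3}, {"wave": 0.9, "run": 0.1})
# p_a == {"wave": 0.8, "run": 0.2}
```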
S103. Train a pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and a classification result.

The motion data comprise the three-axis angular velocity data collected by the gyroscope sensor and the three-axis acceleration data collected by the acceleration sensor of the smart wearable device while the user performs the corresponding action.

The mean, variance, and root mean square of the three-axis angular velocity and three-axis acceleration data are computed and assembled into a feature matrix; the feature matrix and corresponding action labels are then fed into the pre-configured classifier for action classification, yielding the trained classification model and the classification result P_B = {b_1: p_b1; b_2: p_b2; …; b_n: p_bn}, where b_1, b_2, …, b_n are human action labels and p_b1, p_b2, …, p_bn are the predicted probabilities of the corresponding human actions. The pre-configured classifier may be a support vector machine; a sketch follows.
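The sketch below illustrates one way this feature matrix and classifier could look, assuming scikit-learn's SVC as the support vector machine and random stand-in data (both are assumptions, not specified by the source):

```python
# Mean, variance and RMS per axis -> 18-dim feature vector; SVM classifier.
import numpy as np
from sklearn.svm import SVC

def features(sample):
    # sample: (T, 6) array of 3-axis angular velocity + 3-axis acceleration
    mean = sample.mean(axis=0)
    var = sample.var(axis=0)
    rms = np.sqrt((sample ** 2).mean(axis=0))
    return np.concatenate([mean, var, rms])

samples = [np.random.randn(100, 6) for _ in range(20)]  # placeholder recordings
labels = np.random.randint(0, 3, size=20)               # placeholder action tags

X = np.stack([features(s) for s in samples])
clf = SVC(probability=True).fit(X, labels)
p_b = clf.predict_proba(X[:1])  # classification probabilities for one sample
```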
S104. Merge the trained network model and the trained classification model into a local recognition model, and obtain a local recognition result according to the prediction result and the classification result.

Since the local recognition model consists of the network model and the classification model, each participant merges its trained network model with its trained classification model to obtain the local recognition model. The local recognition result is then derived from the prediction result and the classification result.

In an embodiment, obtaining the local recognition result according to the prediction result and the classification result includes: obtaining the local recognition result from the prediction result and the classification result based on a weight calculation formula.

Because the local recognition model contains both a network model and a classification model, it can weight the network model's prediction result and the classification model's classification result by preset weight coefficients to produce the final local recognition result.
The weight calculation formula is:

R = λ₁·P_a + λ₂·P_b

where R is the local recognition result, P_a is the highest-probability result among the prediction results, λ₁ is the weight coefficient of P_a, P_b is the highest-probability result among the classification results, and λ₂ is the weight coefficient of P_b. The formula is illustrated below.
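A one-line illustration, with assumed weight values:

```python
# R = lambda1 * P_a + lambda2 * P_b, using each model's highest-probability result.
lam1, lam2 = 0.6, 0.4  # weight coefficients (assumed values)

def local_result(pred, cls):
    # pred / cls: dict of action -> probability from the network / classifier
    return lam1 * max(pred.values()) + lam2 * max(cls.values())
```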
S105. Upload the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters.

After each participant has obtained its local recognition model, it uploads the model parameters and the local recognition result to the cloud server, which performs joint learning on the received information to obtain the learning parameters.

In a specific implementation, the cloud server may use a global averaging method for joint learning: it computes the average of each model parameter across the local recognition models and then lowers the weight of parameters that deviate too far from the average, thereby obtaining the learning parameters. A minimal sketch of the averaging step is given below.
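This sketch assumes the uploaded parameters arrive as PyTorch state dictionaries of floating-point tensors (an assumption on our part; the outlier down-weighting described above is omitted):

```python
# Global average of each parameter across participants' uploaded models.
import torch

def global_average(param_dicts):
    avg = {}
    for name in param_dicts[0]:
        avg[name] = torch.stack([d[name] for d in param_dicts]).mean(dim=0)
    return avg  # the learning parameters sent back to every participant
```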
In an embodiment, uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning includes: encrypting the model parameters and the local recognition result to obtain encrypted data; and uploading the encrypted data to the cloud server for joint learning.

Each participant encrypts the data to be uploaded and sends the resulting ciphertext to the cloud server; after receiving it, the server decrypts the data and performs joint learning on it, which reduces leakage during transmission and improves data security.

Privacy-preserving techniques such as homomorphic encryption, differential privacy, or secure multi-party computation may be used for the encryption. Note that with homomorphic encryption, the cloud server need not decrypt the data and can perform joint learning directly on the ciphertext.
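As one hedged example of additively homomorphic encryption, the python-paillier ("phe") package could be used; the package choice and the parameter values are assumptions, not named by the source:

```python
# Paillier encryption: the server can add ciphertexts without decrypting them.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()
enc = [public_key.encrypt(w) for w in [0.12, -0.53, 0.07]]  # model parameters
enc_sum = enc[0] + enc[1] + enc[2]       # homomorphic addition on the server
print(private_key.decrypt(enc_sum))      # decryption stays with the key holder
```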
In an embodiment, before step S104, the method includes: uploading the trained network model and the prediction result to the cloud server for joint learning to obtain a joint network model, receiving the joint network model sent by the cloud server, and using the joint network model as the trained network model; and/or uploading the trained classification model and the classification result to the cloud server for joint learning to obtain a joint classification model, receiving the joint classification model sent by the cloud server, and using the joint classification model as the trained classification model.

After each participant finishes training its local network model, it uploads the model parameters and prediction results of that model to the cloud server, which jointly learns from the uploads of all participants to obtain the joint network model.

Using the joint network model as the trained network model means that, after obtaining the joint network model, the cloud server distributes its parameters to every participant; each participant receives these parameters, updates its locally trained network model with them, and takes the updated network model as the trained network model.

Likewise, after each participant finishes training its local classification model, it uploads the model parameters and classification results of that model to the cloud server, which jointly learns from the uploads of all participants to obtain the joint classification model.

Using the joint classification model as the trained classification model means that the cloud server distributes the parameters of the joint classification model to every participant; each participant updates its locally trained classification model with them and takes the updated model as the trained classification model.

That is, at most three distinct rounds of joint learning can occur in this training method: joint learning of the locally trained network model, joint learning of the locally trained classification model, and joint learning of the local recognition model.
S106. Receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as the trained action recognition model.

Each participant receives the learning parameters sent by the cloud server, updates its local recognition model accordingly, and takes the updated local recognition model as the trained action recognition model, completing the training.
In the action recognition model training method provided by the above embodiment, video images, action data, and their corresponding action labels are acquired; a dual-stream convolutional neural network is trained on the video images and labels to obtain a trained network model and a prediction result, while a pre-configured classifier is trained on the action data and labels to obtain a trained classification model and a classification result; the two trained models are merged into a local recognition model, and a local recognition result is derived from the prediction and classification results; the model parameters of the local recognition model and the local recognition result are uploaded to a cloud server for joint learning to obtain learning parameters; finally, each participant receives the learning parameters sent by the cloud server and updates its local recognition model, completing model training. Because each participant trains locally and only uploads its local recognition model for joint learning, the effective number of training samples is enlarged and the recognition accuracy of the resulting action recognition model is improved, while the training data are never exchanged, preserving data security and privacy.
Please refer to FIG. 4, which shows an action recognition method provided by an embodiment of this application.

As shown in FIG. 4, the action recognition method specifically includes steps S201 and S202.
S201. Acquire an image to be recognized and the motion data corresponding to the image to be recognized.

When recognizing the action of a user wearing a wearable device, the image to be recognized and the action data corresponding to that image, captured while the user performs the action, can be acquired.

The motion data comprise the three-axis angular velocity data collected by the gyroscope sensor and the three-axis acceleration data collected by the acceleration sensor of the smart wearable device while the user performs the corresponding action.

S202. Input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition, to obtain a recognition result.

Here, the pre-trained action recognition model is a model trained by the aforementioned action recognition model training method.
Since the pre-trained action recognition model contains a network model and a classification model, the image to be recognized is fed into the network model, which predicts the action from the image to produce a prediction result, while the motion data are fed into the classification model, which classifies the action from the motion data to produce a classification result.

Then, according to the weight coefficients configured in the action recognition model, the network model's prediction result and the classification model's classification result are each weighted by their corresponding coefficients and combined to yield a single, definite recognition result, completing the action recognition; the recognition result is then output. A sketch of this fusion at inference time follows.
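A hedged sketch of the inference-time fusion, with placeholder probabilities standing in for the two sub-models' outputs:

```python
# Weighted fusion of network prediction and classifier output at inference.
lam1, lam2 = 0.6, 0.4  # configured weight coefficients (assumed values)

def recognize(pred, cls):
    # pred: network-model probabilities for the image to be recognized
    # cls: classifier probabilities for the corresponding motion data
    fused = {a: lam1 * pred[a] + lam2 * cls.get(a, 0.0) for a in pred}
    return max(fused, key=fused.get)  # the recognized action label

action = recognize({"wave": 0.8, "run": 0.2}, {"wave": 0.6, "run": 0.4})
```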
Note that if only an image to be recognized or only motion data is available, that input alone can still be fed into the pre-trained action recognition model for action recognition.

In the action recognition method provided by the above embodiment, an image to be recognized and its corresponding motion data are acquired and fed into a pre-trained action recognition model, yielding a recognition result and completing the recognition. Performing recognition on both the image and the motion data and combining the two results improves the accuracy of action recognition.
Please refer to FIG. 5, a schematic block diagram of an action recognition model training apparatus provided by an embodiment of this application; the apparatus is configured to execute the aforementioned action recognition model training method and may be deployed in a server or a terminal.

The server may be a standalone server or a server cluster. The terminal may be an electronic device such as a mobile phone, tablet computer, notebook computer, desktop computer, personal digital assistant, or wearable device.
As shown in FIG. 5, the action recognition model training apparatus 300 includes: a sample acquisition module 301, a network training module 302, a classification training module 303, a model merging module 304, a joint learning module 305, and a model update module 306.

The sample acquisition module 301 is configured to acquire video images, action data, and the action labels corresponding to the video images and action data.

The network training module 302 is configured to perform network training on the dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and a prediction result.

The network training module 302 includes an optical flow extraction sub-module 3021, a spatial training sub-module 3022, a temporal training sub-module 3023, and a result aggregation sub-module 3024.

Specifically, the optical flow extraction sub-module 3021 is configured to extract, from the video images, the optical flow images corresponding to the video images. The spatial training sub-module 3022 is configured to train the spatial-stream convolutional network of the dual-stream convolutional neural network with the video images and corresponding action labels and obtain a spatial prediction result. The temporal training sub-module 3023 is configured to train the temporal-stream convolutional network with the optical flow images and corresponding action labels and obtain a temporal prediction result. The result aggregation sub-module 3024 is configured to aggregate the spatial and temporal prediction results into the prediction result.

The classification training module 303 is configured to train the pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and a classification result.

The model merging module 304 is configured to merge the trained network model and the trained classification model into a local recognition model, and to obtain a local recognition result according to the prediction result and the classification result.

The joint learning module 305 is configured to upload the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters.

The model update module 306 is configured to receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as the trained action recognition model.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the action recognition model training apparatus and its modules described above may be found in the corresponding processes of the foregoing action recognition model training method embodiment and are not repeated here.
Please refer to FIG. 6, a schematic block diagram of an action recognition apparatus provided by an embodiment of this application; the apparatus is configured to execute the aforementioned action recognition method and may be deployed in a server or a terminal.

The server may be a standalone server or a server cluster. The terminal may be an electronic device such as a mobile phone, tablet computer, notebook computer, desktop computer, personal digital assistant, or wearable device.

As shown in FIG. 6, the action recognition apparatus 400 includes: a data acquisition module 401 and an action recognition module 402.

The data acquisition module 401 is configured to acquire the image to be recognized and the motion data corresponding to the image to be recognized.

The action recognition module 402 is configured to input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition and obtain a recognition result, where the pre-trained action recognition model is trained by the above action recognition model training method.

It should be noted that, for convenience and brevity of description, the specific working processes of the action recognition apparatus and its modules described above may be found in the corresponding processes of the foregoing action recognition method embodiment and are not repeated here.
The action recognition model training apparatus and the action recognition apparatus described above may be implemented in the form of a computer program that can run on a computer device as shown in FIG. 7.

Please refer to FIG. 7, a schematic block diagram of the structure of a computer device provided by an embodiment of this application. The computer device may be a server or a terminal.

As shown in FIG. 7, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.

The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause the processor to perform any of the action recognition model training methods or action recognition methods.

The processor provides computing and control capabilities and supports the operation of the entire computer device.

The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when executed by the processor, the program causes the processor to perform any of the action recognition model training methods or action recognition methods.

The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in FIG. 7 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.

It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
In one embodiment, the processor is configured to run a computer program stored in the memory and, when implementing the action recognition model training method, to perform the following steps:

acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and a prediction result; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and a classification result; merging the trained network model and the trained classification model into a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model.

In one embodiment, when performing the network training of the dual-stream convolutional neural network based on the video images and corresponding action labels to obtain the trained network model and the prediction result, the processor is configured to: extract, from the video images, the optical flow images corresponding to the video images; train the spatial-stream convolutional network of the dual-stream convolutional neural network with the video images and corresponding action labels and obtain a spatial prediction result; train the temporal-stream convolutional network with the optical flow images and corresponding action labels and obtain a temporal prediction result; and aggregate the spatial and temporal prediction results into the prediction result.

In one embodiment, when obtaining the local recognition result according to the prediction result and the classification result, the processor is configured to: obtain the local recognition result from the prediction result and the classification result based on a weight calculation formula; the weight calculation formula is:
R = λ₁·P_a + λ₂·P_b

where R is the local recognition result, P_a is the highest-probability result among the prediction results, λ₁ is the weight coefficient of P_a, P_b is the highest-probability result among the classification results, and λ₂ is the weight coefficient of P_b.
In one embodiment, when uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning, the processor is configured to: encrypt the model parameters and the local recognition result to obtain encrypted data; and upload the encrypted data to the cloud server for joint learning.

In one embodiment, before the model parameters of the local recognition model and the local recognition result are uploaded to the cloud server for joint learning, the processor is configured to: upload the trained network model and the prediction result to the cloud server for joint learning to obtain a joint network model, receive the joint network model sent by the cloud server, and use the joint network model as the trained network model; and/or upload the trained classification model and the classification result to the cloud server for joint learning to obtain a joint classification model, receive the joint classification model sent by the cloud server, and use the joint classification model as the trained classification model.

In one embodiment, the processor is configured to run a computer program stored in the memory and, when implementing the action recognition method, to perform the following steps:

acquiring an image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result, where the pre-trained action recognition model is trained by the above action recognition model training method.
An embodiment of this application further provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program that includes program instructions; a processor executes the program instructions to implement any of the … methods provided by the embodiments of this application.

The computer-readable storage medium may be an internal storage unit of the computer device of the foregoing embodiments, such as the hard disk or memory of the computer device. It may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device.

The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any equivalent modification or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. An action recognition model training method, wherein the method comprises:
    acquiring video images, action data, and the action labels corresponding to the video images and action data;
    performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and a prediction result;
    training a pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and a classification result;
    merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result;
    uploading model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters; and
    receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model.
  2. The action recognition model training method according to claim 1, wherein the performing network training on the dual-stream convolutional neural network based on the video images and corresponding action labels to obtain the trained network model and the prediction result comprises:
    extracting, from the video images, the optical flow images corresponding to the video images;
    training a spatial-stream convolutional network of the dual-stream convolutional neural network with the video images and corresponding action labels, and obtaining a spatial prediction result;
    training a temporal-stream convolutional network of the dual-stream convolutional neural network with the optical flow images and corresponding action labels, and obtaining a temporal prediction result; and
    aggregating the spatial prediction result and the temporal prediction result to obtain the prediction result.
  3. The action recognition model training method according to claim 1, wherein the obtaining a local recognition result according to the prediction result and the classification result comprises:
    obtaining the local recognition result according to the prediction result and the classification result based on a weight calculation formula;
    the weight calculation formula being:
    R = λ₁·P_a + λ₂·P_b
    where R is the local recognition result, P_a is the highest-probability result among the prediction results, λ₁ is the weight coefficient of P_a, P_b is the highest-probability result among the classification results, and λ₂ is the weight coefficient of P_b.
  4. The action recognition model training method according to claim 1, wherein the uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning comprises:
    encrypting the model parameters of the local recognition model and the local recognition result to obtain encrypted data; and
    uploading the encrypted data to the cloud server for joint learning.
  5. The action recognition model training method according to claim 1, wherein, before the uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, the method comprises:
    uploading the trained network model and the prediction result to the cloud server for joint learning to obtain a joint network model; receiving the joint network model sent by the cloud server, and using the joint network model as the trained network model; and/or
    uploading the trained classification model and the classification result to the cloud server for joint learning to obtain a joint classification model; receiving the joint classification model sent by the cloud server, and using the joint classification model as the trained classification model.
  6. The action recognition model training method according to claim 2, wherein the spatial prediction result and the temporal prediction result are aggregated by a direct averaging method or an SVM method.
  7. The action recognition model training method according to claim 4, wherein the model parameters of the local recognition model and the local recognition result are encrypted using any one of a homomorphic encryption algorithm, a differential privacy algorithm, and a secure multi-party computation algorithm.
  8. An action recognition method, comprising:
    acquiring an image to be recognized and motion data corresponding to the image to be recognized; and
    inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition, to obtain a recognition result;
    wherein the pre-trained action recognition model is trained by the action recognition model training method according to any one of claims 1-5.
  9. An action recognition model training apparatus, comprising:
    a sample acquisition module, configured to acquire video images, action data, and the action labels corresponding to the video images and action data;
    a network training module, configured to perform network training on a dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and a prediction result;
    a classification training module, configured to train a pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and a classification result;
    a model merging module, configured to merge the trained network model and the trained classification model into a local recognition model, and to obtain a local recognition result according to the prediction result and the classification result;
    a joint learning module, configured to upload model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters; and
    a model update module, configured to receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as the trained action recognition model.
  10. An action recognition apparatus, comprising:
    a data acquisition module, configured to acquire an image to be recognized and motion data corresponding to the image to be recognized; and
    an action recognition module, configured to input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition, to obtain a recognition result;
    wherein the pre-trained action recognition model is trained by the action recognition model training method according to any one of claims 1-5.
  11. A computer device, wherein the computer device comprises a memory and a processor;
    the memory is configured to store a computer program; and
    the processor is configured to execute the computer program and, when executing the computer program, to implement the following steps:
    acquiring video images, action data, and the action labels corresponding to the video images and action data;
    performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and a prediction result;
    training a pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and a classification result;
    merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result;
    uploading model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters; and
    receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model;
    as well as the following steps: acquiring an image to be recognized and motion data corresponding to the image to be recognized; and
    inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition, to obtain a recognition result;
    wherein the pre-trained action recognition model is trained by the action recognition model training method described above.
  12. The computer device according to claim 11, wherein performing the network training on the dual-stream convolutional neural network based on the video images and the corresponding action labels, to obtain the trained network model and the prediction result, comprises:
    Extracting, according to the video images, optical flow images corresponding to the video images;
    Training the spatial-stream convolutional network in the dual-stream convolutional neural network using the video images and the corresponding action labels, to obtain a spatial prediction result;
    Training the temporal-stream convolutional network in the dual-stream convolutional neural network using the optical flow images and the corresponding action labels, to obtain a temporal prediction result;
    Aggregating the spatial prediction result and the temporal prediction result to obtain the prediction result (see the illustrative sketch below).
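For illustration only: a minimal two-stream training step, assuming PyTorch. The small placeholder networks, the 10-class output, and the averaging of softmax scores as the aggregation rule are assumptions for illustration; the claim fixes neither the network architectures nor the aggregation method. The 2-channel optical flow input could be computed with, for example, OpenCV's calcOpticalFlowFarneback.

    import torch
    import torch.nn as nn

    spatial_net = nn.Sequential(   # spatial-stream CNN over RGB frames (placeholder)
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
    temporal_net = nn.Sequential(  # temporal-stream CNN over optical flow (placeholder)
        nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(
        list(spatial_net.parameters()) + list(temporal_net.parameters()), lr=0.01)

    def train_step(rgb, flow, labels):
        # Train each stream on its own input against the shared action labels.
        optimizer.zero_grad()
        spatial_logits = spatial_net(rgb)     # spatial prediction result
        temporal_logits = temporal_net(flow)  # temporal prediction result
        loss = criterion(spatial_logits, labels) + criterion(temporal_logits, labels)
        loss.backward()
        optimizer.step()
        # Aggregate the two streams (here: mean of softmax scores) into the prediction result.
        return (spatial_logits.softmax(1) + temporal_logits.softmax(1)) / 2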
  13. The computer device according to claim 11, wherein obtaining the local recognition result according to the prediction result and the classification result comprises:
    Obtaining the local recognition result from the prediction result and the classification result based on a weight calculation formula;
    The weight calculation formula comprises:
    R = λ1·Pa + λ2·Pb
    Where R denotes the local recognition result, Pa denotes the highest-probability result among the prediction results, λ1 denotes the weight coefficient of Pa, Pb denotes the highest-probability result among the classification results, and λ2 denotes the weight coefficient of Pb (see the illustrative sketch below).
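For illustration only: the weight calculation formula R = λ1·Pa + λ2·Pb as a short computation; the example weights 0.6 and 0.4 are assumptions, since the claim leaves λ1 and λ2 open.

    import numpy as np

    def local_recognition_result(prediction, classification, lam1=0.6, lam2=0.4):
        p_a = np.max(prediction)      # highest-probability result of the network model
        p_b = np.max(classification)  # highest-probability result of the classifier
        return lam1 * p_a + lam2 * p_b  # R, the local recognition result

    # Example: if the network's best score is 0.8 and the classifier's is 0.7,
    # R = 0.6 * 0.8 + 0.4 * 0.7 = 0.76.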
  14. The computer device according to claim 11, wherein uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning comprises:
    Encrypting the model parameters of the local recognition model and the local recognition result to obtain encrypted data;
    Uploading the encrypted data to the cloud server for joint learning (see the illustrative sketch below).
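For illustration only: encrypting the serialized parameters and result before upload, assuming the Python cryptography package's Fernet symmetric scheme; the endpoint URL and the payload layout are hypothetical, and the claim does not specify any particular encryption algorithm.

    import json
    import requests
    from cryptography.fernet import Fernet

    def encrypt_and_upload(model_params: dict, local_result: float, key: bytes) -> None:
        payload = json.dumps({"params": model_params, "result": local_result}).encode()
        encrypted = Fernet(key).encrypt(payload)  # the encrypted data
        # Hypothetical cloud endpoint for the joint-learning service.
        requests.post("https://cloud.example.com/joint-learning", data=encrypted, timeout=30)

A shared key would be generated once with Fernet.generate_key() and provisioned to both the device and the cloud server.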
  15. The computer device according to claim 11, wherein before uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning, the method comprises:
    Uploading the trained network model and the prediction result to the cloud server for joint learning, to obtain a joint network model; receiving the joint network model sent by the cloud server, and using the joint network model as the trained network model; and/or
    Uploading the trained classification model and the classification result to the cloud server for joint learning, to obtain a joint classification model; receiving the joint classification model sent by the cloud server, and using the joint classification model as the trained classification model (see the illustrative sketch below).
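For illustration only: one server-side joint-learning round, sketched here as plain federated averaging of the uploaded parameters. Averaging is one common aggregation rule, named as an assumption; the claim does not fix the aggregation method.

    import numpy as np

    def joint_learning_round(client_params: list) -> dict:
        """Average each named parameter across all participating clients; the
        result is sent back to each client as the joint (learning) parameters.
        client_params: list of dicts mapping parameter names to np.ndarray."""
        return {name: np.mean([p[name] for p in client_params], axis=0)
                for name in client_params[0]}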
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the following steps:
    Acquiring video images, action data, and action labels corresponding to the video images and the action data;
    Performing network training on a dual-stream convolutional neural network based on the video images and the corresponding action labels, to obtain a trained network model and a prediction result;
    Training a pre-configured classifier based on the action data and the corresponding action labels, to obtain a trained classification model and a classification result;
    Merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result;
    Uploading model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters;
    Receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model;
    And the following steps: acquiring an image to be recognized and motion data corresponding to the image to be recognized;
    Inputting the image to be recognized and the motion data into the pre-trained action recognition model for action recognition, to obtain a recognition result;
    Wherein the pre-trained action recognition model is trained by the above-described action recognition model training method.
  17. The computer-readable storage medium according to claim 16, wherein performing the network training on the dual-stream convolutional neural network based on the video images and the corresponding action labels, to obtain the trained network model and the prediction result, comprises:
    Extracting, according to the video images, optical flow images corresponding to the video images;
    Training the spatial-stream convolutional network in the dual-stream convolutional neural network using the video images and the corresponding action labels, to obtain a spatial prediction result;
    Training the temporal-stream convolutional network in the dual-stream convolutional neural network using the optical flow images and the corresponding action labels, to obtain a temporal prediction result;
    Aggregating the spatial prediction result and the temporal prediction result to obtain the prediction result.
  18. The computer-readable storage medium according to claim 16, wherein obtaining the local recognition result according to the prediction result and the classification result comprises:
    Obtaining the local recognition result from the prediction result and the classification result based on a weight calculation formula;
    The weight calculation formula comprises:
    R = λ1·Pa + λ2·Pb
    Where R denotes the local recognition result, Pa denotes the highest-probability result among the prediction results, λ1 denotes the weight coefficient of Pa, Pb denotes the highest-probability result among the classification results, and λ2 denotes the weight coefficient of Pb.
  19. The computer-readable storage medium according to claim 16, wherein uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning comprises:
    Encrypting the model parameters of the local recognition model and the local recognition result to obtain encrypted data;
    Uploading the encrypted data to the cloud server for joint learning.
  20. The computer-readable storage medium according to claim 16, wherein before uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning, the method comprises:
    Uploading the trained network model and the prediction result to the cloud server for joint learning, to obtain a joint network model; receiving the joint network model sent by the cloud server, and using the joint network model as the trained network model; and/or
    Uploading the trained classification model and the classification result to the cloud server for joint learning, to obtain a joint classification model; receiving the joint classification model sent by the cloud server, and using the joint classification model as the trained classification model.
PCT/CN2020/135245 2020-10-21 2020-12-10 Model training method and apparatus, action recognition method and apparatus, and device and storage medium WO2021189952A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011133950.3 2020-10-21
CN202011133950.3A CN112257579A (en) 2020-10-21 2020-10-21 Model training method, action recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021189952A1

Family

ID=74263410

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/135245 WO2021189952A1 (en) 2020-10-21 2020-12-10 Model training method and apparatus, action recognition method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN112257579A (en)
WO (1) WO2021189952A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023082788A1 (en) * 2021-11-11 2023-05-19 新智我来网络科技有限公司 Method and apparatus for predicting oxygen content in flue gas and load, method and apparatus for selecting prediction model, and method and apparatus for predicting flue gas emission
CN117132790B (en) * 2023-10-23 2024-02-02 南方医科大学南方医院 Digestive tract tumor diagnosis auxiliary system based on artificial intelligence

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3605394A1 (en) * 2018-08-03 2020-02-05 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for recognizing body movement
CN109011419A (en) * 2018-08-22 2018-12-18 重庆电政信息科技有限公司 A kind of athletic performance training method based on MEMS sensor
CN110909672A (en) * 2019-11-21 2020-03-24 江苏德劭信息科技有限公司 Smoking action recognition method based on double-current convolutional neural network and SVM
CN111428620A (en) * 2020-03-20 2020-07-17 深圳前海微众银行股份有限公司 Identity recognition method, device, equipment and medium based on federal in-vivo detection model

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114070547A (en) * 2021-11-16 2022-02-18 河南大学 Integrated learning-based multi-layer composite recognition method for cryptographic algorithm
CN114399698A (en) * 2021-11-30 2022-04-26 西安交通大学 Hand washing quality scoring method and system based on smart watch
CN114399698B (en) * 2021-11-30 2024-04-02 西安交通大学 Hand washing quality scoring method and system based on intelligent watch
CN114464289A (en) * 2022-02-11 2022-05-10 武汉大学 ERCP report generation method and device, electronic equipment and computer readable storage medium
CN115908948A (en) * 2023-01-05 2023-04-04 北京霍里思特科技有限公司 Intelligent sorting system for online adjustment model and control method thereof
CN115908948B (en) * 2023-01-05 2024-04-26 北京霍里思特科技有限公司 Intelligent sorting system for online adjustment model and control method thereof

Also Published As

Publication number Publication date
CN112257579A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
WO2021189952A1 (en) Model training method and apparatus, action recognition method and apparatus, and device and storage medium
JP6639700B2 (en) Method and system for generating a multimodal digital image
US20230082173A1 (en) Data processing method, federated learning training method, and related apparatus and device
US10235562B2 (en) Emotion recognition in video conferencing
WO2020199693A1 (en) Large-pose face recognition method and apparatus, and device
CN105874474B (en) System and method for face representation
KR102611454B1 (en) Storage device for decentralized machine learning and machine learning method thereof
US9805305B2 (en) Boosted deep convolutional neural networks (CNNs)
CN110135185A (en) The machine learning of privatization is carried out using production confrontation network
WO2020253127A1 (en) Facial feature extraction model training method and apparatus, facial feature extraction method and apparatus, device, and storage medium
CN109101602A (en) Image encrypting algorithm training method, image search method, equipment and storage medium
WO2022206498A1 (en) Federated transfer learning-based model training method and computing nodes
CN113505882B (en) Data processing method based on federal neural network model, related equipment and medium
WO2020238353A1 (en) Data processing method and apparatus, storage medium, and electronic apparatus
CN113366542A (en) Techniques for implementing augmented based normalized classified image analysis computing events
US20200143169A1 (en) Video recognition using multiple modalities
CN111414879A (en) Face shielding degree identification method and device, electronic equipment and readable storage medium
Purwanto et al. Extreme low resolution action recognition with spatial-temporal multi-head self-attention and knowledge distillation
US20200082002A1 (en) Determining contextual confidence of images using associative deep learning
CN110097004B (en) Facial expression recognition method and device
Wang et al. Efficient global-local memory for real-time instrument segmentation of robotic surgical video
Gupta et al. Toward asynchronously weight updating federated learning for AI-on-edge IoT systems
CN108228823A (en) A kind of binary-coding method and system of high dimensional image dimensionality reduction
Mahmud et al. Gaze estimation with eye region segmentation and self-supervised multistream learning
CN114049417B (en) Virtual character image generation method and device, readable medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20927039; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20927039; Country of ref document: EP; Kind code of ref document: A1)