WO2021189952A1 - Model training method and apparatus, action recognition method and apparatus, and device and storage medium - Google Patents

Model training method and apparatus, action recognition method and apparatus, and device and storage medium Download PDF

Info

Publication number
WO2021189952A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
result
action
trained
classification
Prior art date
Application number
PCT/CN2020/135245
Other languages
French (fr)
Chinese (zh)
Inventor
李泽远
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021189952A1 publication Critical patent/WO2021189952A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an action recognition model training method, an action recognition method, a device, equipment, and a storage medium.
  • This application provides a method for training an action recognition model, and the method includes:
  • This application also provides an action recognition method, which includes:
  • wherein the model is obtained by training using the above-mentioned action recognition model training method.
  • This application also provides a device for training an action recognition model, the device including:
  • a sample acquisition module, used to acquire video images, action data, and the action labels corresponding to the video images and action data;
  • a network training module, used to perform network training on a dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and prediction results;
  • a classification training module, used to train a pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and classification results;
  • a model merging module, used to merge the trained network model and the trained classification model into a local recognition model, and to obtain a local recognition result according to the prediction result and the classification result;
  • a joint learning module, used to upload the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters;
  • a model update module, used to receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as the trained action recognition model.
  • This application also provides an action recognition device, which includes:
  • a data acquisition module, used to acquire the image to be recognized and the motion data corresponding to the image to be recognized; and an action recognition module, used to input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
  • This application also provides a computer device, the computer device including a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing the computer program, to implement the following steps:
  • acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model; and
  • the following steps: acquiring the image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
  • This application also provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the processor implements the following steps: acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model; and
  • the following steps: acquiring the image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
  • This application discloses an action recognition model training method, an action recognition method, a device, equipment, and a storage medium.
  • Video images, action data, and the action labels corresponding to the video images and action data are acquired; then, based on the video images and the corresponding action labels, network training is performed on a dual-stream convolutional neural network to obtain a trained network model and prediction results.
  • At the same time, a pre-configured classifier is trained based on the action data and the corresponding action labels to obtain a trained classification model and classification results.
  • The trained network model and the trained classification model are then merged to obtain a local recognition model, the local recognition result is obtained according to the prediction result and the classification result, and the model parameters of the local recognition model and the local recognition result are uploaded to a cloud server for joint learning to obtain the learning parameters.
  • Finally, each participant receives the learning parameters sent by the cloud server and updates its local recognition model according to the learning parameters to complete the model training.
  • Each participant trains a model locally to obtain its own local recognition model and then uploads it to the cloud server for joint learning, which expands the number of samples available when training the model and improves the recognition accuracy of the trained action recognition model; and since each participant trains its model locally, the training data is never exchanged, which also ensures the security and privacy of the data.
  • FIG. 1 is a schematic flowchart of an action recognition model training method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a network structure of a network model provided by an embodiment of the present application
  • FIG. 3 is a schematic flowchart of sub-steps of the action recognition model training method provided in FIG. 1;
  • FIG. 4 is a schematic flowchart of an action recognition method provided by an embodiment of the present application;
  • FIG. 5 is a schematic block diagram of an apparatus for training an action recognition model according to an embodiment of the present application
  • FIG. 6 is a schematic block diagram of a device for action recognition according to an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application.
  • the embodiments of the present application provide an action recognition model training method, device, computer equipment, and storage medium.
  • The action recognition model training method can be used to train an action recognition model that recognizes human actions, improving the recognition accuracy of the trained model and, in turn, the accuracy of action recognition.
  • FIG. 1 is a schematic flowchart of an action recognition model training method provided by an embodiment of the present application.
  • the action recognition model training method can be applied to each participant, that is, each local client.
  • By jointly training on the sample data of multiple participants, the action recognition model training method enriches the number of samples and improves the recognition accuracy of the trained action recognition model.
  • the method for training an action recognition model specifically includes: step S101 to step S106.
  • Since the action recognition model includes two parts, namely a network model and a classification model, video images, action data, and the action labels corresponding to the video images and action data are acquired to train the network model and the classification model respectively.
  • Specifically, the user wears a smart wearable device while performing an action, and the user's action is filmed to obtain a video image.
  • At the same time, the user's movement data during the action is collected.
  • The action performed by the user serves as the action label corresponding to the video image and the action data.
  • The video images, action data, and corresponding action labels are the local data of each participant; that is, each participant performs model training based on its own local data and does not need to share data with other participants, thereby improving data security and reliability.
  • S102 Perform network training on the dual-stream convolutional neural network based on the video image and the corresponding action label, and obtain a trained network model and a prediction result.
  • the dual-stream convolutional neural network is trained to obtain the network model.
  • The network model includes a spatial stream convolutional network and a temporal stream convolutional network.
  • The spatial stream convolutional network and the temporal stream convolutional network each include several convolutional layers, fully connected layers, and a softmax layer.
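  • As a minimal illustrative sketch (not part of the application), the two-stream structure described above can be expressed in PyTorch as follows; the layer sizes, the 10 action classes, and the 10 stacked optical-flow fields are assumptions:

```python
import torch.nn as nn

def make_stream(in_channels: int, num_actions: int) -> nn.Sequential:
    # Several convolutional layers, then a fully connected layer and a
    # softmax layer, as the text describes for both streams.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, num_actions),
        nn.Softmax(dim=1),
    )

spatial_stream = make_stream(in_channels=3, num_actions=10)        # one RGB video frame
temporal_stream = make_stream(in_channels=2 * 10, num_actions=10)  # 10 stacked flow fields (x and y)
```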
  • Step S102 specifically includes step S1021 to step S1024.
  • S1021 Extract an optical flow image corresponding to the video image according to the video image.
  • S1022 Use the video image and the corresponding action label to train the spatial stream convolutional network in the dual-stream convolutional neural network, and obtain a spatial prediction result.
  • Specifically, each frame of the video image is input into the spatial stream convolutional network for training. The loss between the spatial prediction result and the corresponding action label is then calculated. When the loss value reaches the preset condition, training of the spatial stream convolutional network is considered complete, and the spatial prediction result is obtained.
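  • A minimal sketch of such a training loop (the toy data, the 5 action classes, and the loss threshold of 0.05 used as the preset condition are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 5))   # stand-in spatial stream
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

frames = torch.randn(16, 3, 64, 64)        # one batch of video frames
labels = torch.randint(0, 5, (16,))        # corresponding action labels
for step in range(1000):
    loss = loss_fn(model(frames), labels)
    if loss.item() < 0.05:                 # preset condition: training is complete
        break
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```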
  • During training, the spatial stream convolutional network can use L2 regularization to monitor the loss and prevent overfitting.
  • The objective function expression using L2 regularization is:
  • L = J(θ) + λ·Σ_{i=1}^{k} θ_i²
  • where L represents the loss function with regularization; J(θ) represents the original loss function; θ represents all the parameters in the convolutional neural network; λ represents the regularization coefficient; Σθ_i² refers to the sum of the squares of the weights; i represents the action number of each recognized action; and k represents the total number of recognized actions.
  • The regularization coefficient λ can be set according to the actual situation.
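  • A minimal sketch of the regularized objective L = J(θ) + λ·Σθ_i² (the value of λ is an assumption; in practice an optimizer's weight_decay option has the same effect):

```python
import torch
import torch.nn.functional as F

def l2_regularized_loss(model, frames, labels, lam=1e-4):
    j_theta = F.cross_entropy(model(frames), labels)      # J(theta)
    l2 = sum((p ** 2).sum() for p in model.parameters())  # sum of squared weights
    return j_theta + lam * l2                             # L = J(theta) + lambda * sum
```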
  • After obtaining the output of the fully connected layer, the softmax layer performs data conversion on the output values of the fully connected layer, so that the final spatial prediction result that is output is the probability that the video image is predicted to be a certain action.
  • The conversion formula of the softmax layer can be:
  • S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}
  • where V_i is the output value of the fully connected layer for the i-th action type; i denotes any one of the action types; N denotes the total number of action types; and S_i is the ratio of the exponential of the current output value to the sum of the exponentials of all output values of the fully connected layer, that is, the probability output by the spatial stream convolutional network.
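  • A minimal sketch of this conversion applied to the fully connected layer's output vector V (subtracting the maximum first is a standard numerical-stability refinement not stated in the text):

```python
import torch

def softmax_probs(v: torch.Tensor) -> torch.Tensor:
    e = torch.exp(v - v.max())   # e^{V_i}, shifted for numerical stability
    return e / e.sum()           # S_i = e^{V_i} / sum_j e^{V_j}
```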
  • S1023 Use the optical flow image and the corresponding action label to train the temporal stream convolutional network in the dual-stream convolutional neural network, and obtain a temporal prediction result.
  • Specifically, the optical flow image includes the motion state information between frames.
  • The optical flow image and the corresponding action label are input into the temporal stream convolutional network in the dual-stream convolutional neural network for training, and the loss between the temporal prediction result and the corresponding action label is calculated.
  • When the loss value reaches the preset condition, training of the temporal stream convolutional network is considered complete, and the temporal prediction result is obtained.
  • During training, the temporal stream convolutional network can likewise use L2 regularization to monitor the loss and prevent overfitting.
  • The objective function expression using L2 regularization is the same as for the spatial stream:
  • L = J(θ) + λ·Σ_{i=1}^{k} θ_i²
  • where L represents the loss function with regularization; J(θ) represents the original loss function; θ represents all the parameters in the convolutional neural network; λ represents the regularization coefficient; Σθ_i² refers to the sum of the squares of the weights; i represents the action number of each recognized action; and k represents the total number of recognized actions.
  • The regularization coefficient λ can be set according to the actual situation.
  • After obtaining the output of the fully connected layer, the softmax layer performs data conversion on the output values of the fully connected layer, so that the final temporal prediction result that is output is the probability that the optical flow image is predicted to be a certain action.
  • The conversion formula of the softmax layer can be:
  • S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}
  • where V_i is the output value of the fully connected layer for the i-th action type; i denotes any one of the action types; N denotes the total number of action types; and S_i is the ratio of the exponential of the current output value to the sum of the exponentials of all output values of the fully connected layer, that is, the probability output by the temporal stream convolutional network.
  • S1024 Aggregate the spatial prediction result and the temporal prediction result to obtain the prediction result. For aggregation, the direct average method can be used (taking the average of the two results), or an SVM can be used.
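  • A minimal sketch of the direct average option (the SVM-based aggregation would instead train a classifier on the two streams' outputs):

```python
import torch

def aggregate(spatial_probs: torch.Tensor, temporal_probs: torch.Tensor) -> torch.Tensor:
    # Direct average of the two streams' class probabilities.
    return (spatial_probs + temporal_probs) / 2
```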
  • S103 Train a pre-configured classifier based on the action data and the corresponding action label, to obtain a trained classification model and a classification result.
  • the motion data includes the three-axis angular velocity data collected by the gyroscope sensor mounted in the smart wearable device and the three-axis acceleration data collected by the acceleration sensor when the user performs the corresponding action.
  • the pre-configured classifier may be a support vector machine.
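  • A minimal sketch using scikit-learn's SVC as the pre-configured classifier (the feature layout, class count, and random training data are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# Each row: [gyro_x, gyro_y, gyro_z, acc_x, acc_y, acc_z] for one sample.
X_train = np.random.randn(200, 6)
y_train = np.random.randint(0, 5, size=200)   # 5 hypothetical action classes

classifier = SVC(kernel="rbf", probability=True)  # probabilities feed the fusion step
classifier.fit(X_train, y_train)
classification_result = classifier.predict_proba(X_train[:1])
```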
  • each participant merges the trained network model and the trained classification model to obtain a local recognition model. After the local recognition model is obtained, the local recognition result is obtained according to the prediction result and the classification result.
  • the obtaining a local recognition result according to the prediction result and the classification result includes: obtaining a local recognition result according to the prediction result and the classification result based on a weight calculation formula.
  • Specifically, the local recognition model includes two parts: the network model and the classification model.
  • The local recognition model weights the prediction results of the network model and the classification results of the classification model according to preset weight coefficients, so as to obtain the final local recognition result.
  • The weight calculation formula includes:
  • R = γ₁·P_a + γ₂·P_b
  • where R represents the local recognition result; P_a represents the most probable result among the prediction results; γ₁ represents the weight coefficient of the most probable result P_a; P_b represents the most probable result among the classification results; and γ₂ represents the weight coefficient of the most probable result P_b.
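  • A minimal sketch of this weighting; the γ values are assumptions, and the weights are applied per class before taking the most probable action, which is one common way to realize R = γ₁·P_a + γ₂·P_b when both models output class probabilities:

```python
import numpy as np

def local_recognition(pred_probs: np.ndarray, cls_probs: np.ndarray,
                      gamma1: float = 0.6, gamma2: float = 0.4):
    fused = gamma1 * pred_probs + gamma2 * cls_probs   # weight the two models
    return int(fused.argmax()), float(fused.max())     # recognized action and score R
```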
  • The model parameters of the local recognition model and its local recognition result are uploaded to the cloud server, and the cloud server performs joint learning based on the received information to obtain the learning parameters.
  • Specifically, the cloud server can use a global averaging method for the joint learning: it calculates the average value of the model parameters across the uploaded local recognition models, and then lowers the weight of parameters that deviate too much from the average, to obtain the learning parameters.
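  • A minimal sketch of the global averaging step on the cloud side (the outlier down-weighting the text mentions is simplified to a plain mean here):

```python
import numpy as np

def global_average(param_sets):
    # param_sets: one {parameter_name: ndarray} dict per participant.
    names = param_sets[0].keys()
    return {name: np.mean([p[name] for p in param_sets], axis=0) for name in names}
```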
  • In some embodiments, uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning includes: encrypting the model parameters of the local recognition model and the local recognition result to obtain encrypted data; and uploading the encrypted data to the cloud server for joint learning.
  • Each participant encrypts the data that needs to be uploaded to obtain the encrypted data, and then uploads the encrypted data to the cloud server.
  • The cloud server decrypts the encrypted data and then conducts joint learning based on the decrypted data, which reduces data leakage during transmission and improves data security.
  • For data encryption, privacy-preserving computation methods such as homomorphic encryption, differential privacy, or secure multi-party computation can be used. It should be noted that when homomorphic encryption is used, the cloud server need not decrypt the encrypted data and can conduct joint learning directly on the encrypted data.
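  • A minimal sketch of the homomorphic option using the python-paillier library (`phe`), which is one possible choice rather than anything the application prescribes; the cloud averages ciphertexts without ever decrypting them:

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Participants encrypt a shared model parameter before upload.
uploads = [public_key.encrypt(w) for w in (0.52, 0.47, 0.55)]

# Cloud side: additive homomorphism allows summing ciphertexts directly.
encrypted_mean = sum(uploads[1:], uploads[0]) * (1.0 / len(uploads))

# Only a holder of the private key can read the aggregated value.
print(private_key.decrypt(encrypted_mean))
```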
  • In some embodiments, before step S104, the method includes: uploading the trained network model and the prediction result to a cloud server for joint learning to obtain a joint network model; and receiving the joint network model sent by the cloud server and using the joint network model as the trained network model; and/or uploading the trained classification model and the classification result to a cloud server for joint learning to obtain a joint classification model; and receiving the joint classification model sent by the cloud server and using the joint classification model as the trained classification model.
  • Specifically, the model parameters and prediction results of the locally trained network model can be uploaded to the cloud server, so that the cloud server performs joint learning on the model parameters and prediction results of the trained network models uploaded by each participant to obtain a joint network model.
  • The joint network model is used as the trained network model; that is, after the cloud server obtains the joint network model, it sends the parameters of the joint network model to each participant, and each participant receives these model parameters, updates its locally trained network model according to them, and then uses the updated network model as the trained network model.
  • Likewise, the model parameters and classification results of the locally trained classification model can be uploaded to the cloud server, so that the cloud server performs joint learning on the model parameters and classification results of the trained classification models uploaded by each participant to obtain a joint classification model.
  • The joint classification model is used as the trained classification model; that is, after the cloud server obtains the joint classification model, it delivers the parameters of the joint classification model to each participant, and each participant receives these model parameters, updates its locally trained classification model according to them, and then uses the updated classification model as the trained classification model.
  • In the action recognition model training method, at most three different rounds of joint learning can be performed: joint learning of the locally trained network model, joint learning of the locally trained classification model, and joint learning of the local recognition model.
  • S106 Receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as a trained action recognition model.
  • Each participant receives the learning parameters sent by the cloud server, and updates the local recognition model according to the learning parameters, and uses the updated local recognition model as the trained action recognition model to complete the training of the action recognition model.
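  • A minimal sketch of this update on the participant side (a PyTorch state_dict is assumed as the parameter format):

```python
import torch

def update_local_model(model: torch.nn.Module, learning_params: dict) -> torch.nn.Module:
    # Overwrite the local recognition model's weights with the joint parameters.
    model.load_state_dict(learning_params)
    return model
```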
  • The action recognition model training method described above acquires video images, motion data, and the action labels corresponding to the video images and motion data, and then performs network training on the dual-stream convolutional neural network based on the video images and the corresponding action labels to obtain the trained network model and prediction results.
  • A pre-configured classifier is trained based on the action data and corresponding action labels to obtain the trained classification model and classification results; the trained network model and the trained classification model are then merged to obtain a local recognition model, the local recognition result is obtained according to the prediction result and the classification result, and the model parameters of the local recognition model and the local recognition result are uploaded to the cloud server for joint learning to obtain the learning parameters. Finally, each participant receives the learning parameters sent by the cloud server and updates its local recognition model according to them, completing the model training.
  • Each participant trains a model locally to obtain its own local recognition model and then uploads it to the cloud server for joint learning, which expands the number of samples available when training the model and improves the recognition accuracy of the trained action recognition model; and since each participant trains its model locally, the training data is never exchanged, which also ensures the security and privacy of the data.
  • FIG. 4 is a schematic flowchart of an action recognition method provided by an embodiment of the present application.
  • The action recognition method specifically includes: step S201 and step S202.
  • S201 Acquire an image to be recognized and motion data corresponding to the image to be recognized.
  • Specifically, the image to be recognized, captured while the user performs an action, and the motion data corresponding to the image to be recognized can be acquired.
  • the motion data includes the three-axis angular velocity data collected by the gyroscope sensor mounted in the smart wearable device and the three-axis acceleration data collected by the acceleration sensor when the user performs a corresponding action.
  • the pre-trained action recognition model refers to a model trained according to the aforementioned action recognition model training method.
  • The pre-trained action recognition model includes a network model and a classification model.
  • the image to be recognized is input into the network model
  • the network model performs action prediction based on the image to be recognized to obtain a prediction result.
  • the motion data is input into the classification model, and the classification model performs action classification according to the motion data to obtain the classification result.
  • The prediction result obtained by the network model and the classification result obtained by the classification model are each weighted according to the corresponding weight coefficients, and finally a definite recognition result is obtained; the action recognition is completed and the recognition result is output.
  • In other embodiments, the image to be recognized alone, or the motion data alone, can also be input into the pre-trained action recognition model for action recognition.
  • The above embodiment provides an action recognition method that acquires the image to be recognized and the motion data corresponding to the image to be recognized, and then inputs both into a pre-trained action recognition model for action recognition to obtain the recognition result, completing the action recognition. Performing action recognition based on both the image to be recognized and the motion data, and combining the recognition results of the two, improves the accuracy of action recognition.
  • FIG. 5 is a schematic block diagram of an apparatus for training an action recognition model according to an embodiment of the present application.
  • the apparatus for training an action recognition model is used to execute the aforementioned method for training an action recognition model.
  • The action recognition model training device can be configured in a server or a terminal.
  • the server can be an independent server or a server cluster.
  • the terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device.
  • the action recognition model training device 300 includes: a sample acquisition module 301, a network training module 302, a classification training module 303, a model merging module 304, a joint learning module 305, and a model update module 306.
  • the sample acquisition module 301 is used to acquire video images, motion data, and motion tags corresponding to the video images and motion data.
  • the network training module 302 is configured to perform network training on the dual-stream convolutional neural network based on the video image and the corresponding action label, and obtain the trained network model and the prediction result.
  • The network training module 302 includes an optical flow extraction sub-module 3021, a spatial training sub-module 3022, a temporal training sub-module 3023, and a result aggregation sub-module 3024.
  • the optical flow extraction sub-module 3021 is configured to extract an optical flow image corresponding to the video image according to the video image.
  • the spatial training sub-module 3022 is used to train the spatial stream convolutional network in the dual-stream convolutional neural network by using the video image and the corresponding action label, and obtain the spatial prediction result.
  • The temporal training sub-module 3023 is used to train the temporal stream convolutional network in the dual-stream convolutional neural network by using the optical flow image and the corresponding action label, and obtain the temporal prediction result.
  • the result aggregation sub-module 3024 is configured to aggregate the spatial prediction result and the temporal prediction result to obtain a prediction result.
  • the classification training module 303 is configured to train a pre-configured classifier based on the action data and the corresponding action label to obtain the trained classification model and the classification result.
  • the model merging module 304 is configured to merge the trained network model and the trained classification model to obtain a local recognition model, and obtain a local recognition result according to the prediction result and the classification result.
  • the joint learning module 305 is configured to upload the model parameters of the local recognition model and the local recognition results to a cloud server for joint learning to obtain learning parameters.
  • the model update module 306 is configured to receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as a trained action recognition model.
  • FIG. 6 is a schematic block diagram of an action recognition device provided in an embodiment of the present application; the action recognition device is used to execute the aforementioned action recognition method.
  • the action recognition device can be configured in a server or a terminal.
  • the server can be an independent server or a server cluster.
  • the terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device.
  • the action recognition device 400 includes: a data acquisition module 401 and an action recognition module 402.
  • the data acquisition module 401 is configured to acquire the image to be recognized and the motion data corresponding to the image to be recognized.
  • The action recognition module 402 is configured to input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition and obtain a recognition result; wherein the pre-trained action recognition model is trained according to the aforementioned action recognition model training method.
  • the aforementioned action recognition model training device and action recognition device can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 7.
  • FIG. 7 is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application.
  • the computer equipment can be a server or a terminal.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium can store an operating system and a computer program.
  • the computer program includes program instructions, and when the program instructions are executed, the processor can execute any action recognition model training method or action recognition method.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • the internal memory provides an environment for the operation of the computer program in the non-volatile storage medium.
  • the processor can execute any action recognition model training method or action recognition method.
  • the network interface is used for network communication, such as sending assigned tasks.
  • FIG. 7 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • The specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • The processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • In one embodiment, the processor is used to run a computer program stored in the memory to implement the action recognition model training method described above.
  • When performing the network training of the dual-stream convolutional neural network based on the video image and the corresponding action label to obtain the trained network model and the prediction result, the processor is configured to implement:
  • extracting, according to the video image, an optical flow image corresponding to the video image; using the video image and the corresponding action label to train the spatial stream convolutional network in the dual-stream convolutional neural network and obtain a spatial prediction result;
  • using the optical flow image and the corresponding action label to train the temporal stream convolutional network in the dual-stream convolutional neural network and obtain a temporal prediction result; and aggregating the spatial prediction result and the temporal prediction result to obtain the prediction result.
  • When obtaining the local recognition result according to the prediction result and the classification result, the processor is configured to implement: obtaining the local recognition result according to the prediction result and the classification result based on a weight calculation formula, where the weight calculation formula includes:
  • R = γ₁·P_a + γ₂·P_b
  • where R represents the local recognition result; P_a represents the most probable result among the prediction results; γ₁ represents the weight coefficient of the most probable result P_a; P_b represents the most probable result among the classification results; and γ₂ represents the weight coefficient of the most probable result P_b.
  • When uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, the processor is configured to implement: encrypting the model parameters of the local recognition model and the local recognition result to obtain encrypted data; and uploading the encrypted data to the cloud server for joint learning.
  • Before uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, the processor is configured to implement: uploading the trained network model and the prediction result to the cloud server for joint learning to obtain a joint network model; receiving the joint network model sent by the cloud server, and using the joint network model as the trained network model; and/or uploading the trained classification model and the classification result to the cloud server for joint learning to obtain a joint classification model; and receiving the joint classification model sent by the cloud server, and using the joint classification model as the trained classification model.
  • the processor is used to run a computer program stored in a memory, and when implementing the action recognition method, it is used to implement the following steps:
  • the model is obtained by training according to the above-mentioned action recognition model training method.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores a computer program.
  • the computer program includes program instructions, and the processor executes the program instructions to implement any of the methods provided in the embodiments of the present application.
  • the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiment, for example, the hard disk or memory of the computer device.
  • The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (SD) card, or a flash card equipped on the computer device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

An action recognition model training method and apparatus, an action recognition method and apparatus, and a device and a storage medium. The action recognition model training method comprises: acquiring a video image, action data, and action labels corresponding to the video image and the action data (S101); performing network training on a two-stream convolutional neural network on the basis of the video image and the corresponding action label, so as to obtain a network model and a prediction result (S102); training a classifier on the basis of the action data and the corresponding action label, so as to obtain a classification model and a classification result (S103); merging the network model and the classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result (S104); uploading the local recognition model and the local recognition result to perform joint learning, so as to obtain a learning parameter (S105); and receiving the learning parameter, and updating the local recognition model according to the learning parameter (S106). The present application improves the recognition accuracy of an action recognition model obtained by means of training.

Description

Model training method, action recognition method, device, equipment and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 21, 2020, with application number 2020111339503 and the invention title "Model Training Method, Action Recognition Method, Device, Equipment, and Storage Medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence, and in particular to an action recognition model training method, an action recognition method, a device, equipment, and a storage medium.
Background
In the fields of human interaction and collaboration, intelligent nursing, intelligent monitoring, and motion analysis, it is necessary to recognize human actions in order to determine the type of human behavior. The inventor realized that traditional action recognition methods mostly use computer image processing to extract motion trajectories and person features from video frames and then train a classifier to recognize human behavior, which yields low accuracy and slow recognition speed. Action recognition models built with methods such as convolutional neural networks also suffer from small sample sizes, which leads to unsatisfactory training results and, in turn, low recognition accuracy.
Therefore, how to improve the recognition accuracy of the trained action recognition model has become an urgent problem to be solved.
Summary of the invention
This application provides a method for training an action recognition model, and the method includes:
acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model.
This application also provides an action recognition method, which includes:
acquiring the image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
This application also provides a device for training an action recognition model, the device including:
a sample acquisition module, used to acquire video images, action data, and the action labels corresponding to the video images and action data; a network training module, used to perform network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; a classification training module, used to train a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; a model merging module, used to merge the trained network model and the trained classification model into a local recognition model and to obtain a local recognition result according to the prediction result and the classification result; a joint learning module, used to upload the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and a model update module, used to receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as the trained action recognition model.
This application also provides an action recognition device, which includes:
a data acquisition module, used to acquire the image to be recognized and the motion data corresponding to the image to be recognized; and an action recognition module, used to input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
This application also provides a computer device, the computer device including a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing the computer program, to implement the following steps: acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model; and
the following steps: acquiring the image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
This application also provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the processor implements the following steps: acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and prediction results; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and classification results; merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model; and
the following steps: acquiring the image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result; wherein the pre-trained action recognition model is obtained by training using the above-mentioned action recognition model training method.
This application discloses an action recognition model training method, an action recognition method, a device, equipment, and a storage medium. Video images, action data, and the action labels corresponding to the video images and action data are acquired; then, based on the video images and the corresponding action labels, network training is performed on a dual-stream convolutional neural network to obtain a trained network model and prediction results. At the same time, a pre-configured classifier is trained based on the action data and the corresponding action labels to obtain a trained classification model and classification results. The trained network model and the trained classification model are then merged to obtain a local recognition model, the local recognition result is obtained according to the prediction result and the classification result, and the model parameters of the local recognition model and the local recognition result are uploaded to a cloud server for joint learning to obtain the learning parameters. Finally, each participant receives the learning parameters sent by the cloud server and updates its local recognition model according to the learning parameters to complete the model training. Each participant trains a model locally to obtain its own local recognition model and then uploads it to the cloud server for joint learning, which expands the number of samples available when training the model and improves the recognition accuracy of the trained action recognition model; and since each participant trains its model locally, the training data is never exchanged, which also ensures the security and privacy of the data.
Description of the drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the following briefly introduces the drawings used in the description of the embodiments. Obviously, the drawings described below illustrate some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
FIG. 1 is a schematic flowchart of an action recognition model training method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the network structure of a network model provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of sub-steps of the action recognition model training method provided in FIG. 1;
FIG. 4 is a schematic flowchart of an action recognition method provided by an embodiment of the present application;
FIG. 5 is a schematic block diagram of an action recognition model training device provided by an embodiment of the present application;
FIG. 6 is a schematic block diagram of an action recognition device provided by an embodiment of the present application;
FIG. 7 is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.

The flowcharts shown in the drawings are only examples; they need not include all of the content and operations/steps, nor must the steps be executed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may vary with the actual situation.

It should be understood that the terms used in this specification are only for describing specific embodiments and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should also be understood that the term "and/or" used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

The embodiments of this application provide an action recognition model training method, apparatus, computer device, and storage medium. The training method can be used to train an action recognition model for recognizing human actions, improving the recognition accuracy of the trained model and, in turn, the accuracy of action recognition.

Some implementations of this application are described in detail below with reference to the drawings. Where no conflict arises, the following embodiments and the features within them may be combined with one another.
Please refer to FIG. 1, a schematic flowchart of an action recognition model training method provided by an embodiment of this application. The method can be applied at each participant, that is, at each local client. By jointly training on the sample data of multiple participants, the method enriches the number of training samples and improves the recognition accuracy of the resulting action recognition model.

As shown in FIG. 1, the action recognition model training method specifically includes steps S101 to S106.
S101. Acquire video images, action data, and the action labels corresponding to the video images and action data.

Since the action recognition model consists of two parts, a network model and a classification model, video images, action data, and the corresponding action labels can be acquired to train the network model and the classification model separately.

A user wearing a smart wearable device performs an action; the action is filmed to obtain video images, while the gyroscope and acceleration sensors in the wearable device record the user's action data during the movement. The action the user performs serves as the action label for both the video images and the action data.

The video images, action data, and corresponding action labels are all local data of each individual participant; that is, each participant trains its model on local data and need not share data with other participants, which improves data security and reliability.
S102. Perform network training on a dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and a prediction result.

The dual-stream convolutional neural network is trained on the video images and corresponding action labels, yielding the network model.

FIG. 2 shows the network structure of this model. It comprises a spatial-stream convolutional network and a temporal-stream convolutional network, each consisting of several convolutional layers, fully connected layers, and a softmax layer.
In some embodiments, as shown in FIG. 3, step S102 specifically includes steps S1021 to S1024.

S1021. Extract, from the video images, the optical flow images corresponding to the video images.

When extracting the optical flow image from a video, OpenCV can be used to process a frame of the video to obtain key points; gradients are then computed across adjacent frames to obtain the pixel displacement of those key points, i.e. the optical flow. That frame and several subsequent frames are superimposed into an optical flow stack, i.e. the optical flow image, as sketched below.
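The following is a minimal, illustrative sketch of this step, assuming OpenCV's dense Farneback flow (the source does not name a specific OpenCV routine); the stack length and flow parameters are placeholder choices:

```python
# Hypothetical sketch: build an optical-flow stack from consecutive frames.
import cv2
import numpy as np

def optical_flow_stack(video_path, start_frame=0, stack_len=10):
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    ok, first = cap.read()
    if not ok:
        raise ValueError("could not read the starting frame")
    prev_gray = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)
    flows = []
    for _ in range(stack_len):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Gradient-based dense flow between adjacent frames; result is H x W x 2.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        prev_gray = gray
    cap.release()
    # Superimpose the per-frame flows into one optical-flow stack (H x W x 2L).
    return np.concatenate(flows, axis=2)
```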
S1022. Train the spatial-stream convolutional network of the dual-stream convolutional neural network with the video images and corresponding action labels, and obtain a spatial prediction result.

The video images and corresponding action labels are fed into the spatial-stream convolutional network for training; in a specific implementation, each frame of the video is input to the spatial-stream network separately. The loss between the spatial prediction result and the corresponding action label is then computed; when the loss value reaches a preset condition, training of the spatial-stream convolutional network is considered complete and the spatial prediction result is obtained.
In an embodiment, the spatial-stream convolutional network can use L2 regularization to supervise the loss and prevent overfitting. The objective function with L2 regularization is:
L = J(θ) + λ·Σ_{i=1}^{k} θ_i²

where L is the regularized loss function, J(θ) is the loss function, θ denotes all parameters of the convolutional neural network, λ is the regularization coefficient, the summation term is the sum of the squared weights, i is the index of each recognized action, and k is the total number of recognized actions. The regularization coefficient can be chosen according to the actual situation.
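As a hedged illustration only (the source does not specify a framework), the regularized objective could be computed in PyTorch as follows; the coefficient value is an arbitrary assumption:

```python
# Sketch of L = J(theta) + lambda * sum_i theta_i^2 for a PyTorch model.
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # plays the role of J(theta)
lam = 1e-4                          # regularization coefficient (assumed value)

def regularized_loss(model, logits, labels):
    data_loss = criterion(logits, labels)
    # Sum of squared weights over all network parameters.
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return data_loss + lam * l2
```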
After the output of the fully connected layer is obtained, the softmax layer converts that output so that the final spatial prediction result is the probability that the video image depicts a given action.
The conversion formula of the softmax layer can be:

S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}

where V_i is the output value of the fully connected layer for action class i, N is the total number of action classes, and S_i is the ratio of the exponential of the current output value to the sum of the exponentials of all output values of the fully connected layer, i.e. the probability output by the spatial-stream convolutional network.
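For illustration, this conversion can be reproduced in a few lines of NumPy (the max-subtraction is a common numerical-stability detail not mentioned by the source):

```python
# S_i = exp(V_i) / sum_j exp(V_j) over the fully connected outputs V.
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # subtract max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))  # per-action probabilities, sum to 1
```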
S1023. Train the temporal-stream convolutional network of the dual-stream convolutional neural network with the optical flow images and corresponding action labels, and obtain a temporal prediction result.

Since the optical flow images encode the motion state between frames, the optical flow images and corresponding action labels are fed into the temporal-stream convolutional network for training. The loss between the temporal prediction result and the corresponding action label is then computed; when the loss value reaches a preset condition, training of the temporal-stream convolutional network is considered complete and the temporal prediction result is obtained.
In an embodiment, the temporal-stream convolutional network can likewise use L2 regularization to supervise the loss and prevent overfitting, with the same objective:

L = J(θ) + λ·Σ_{i=1}^{k} θ_i²

where L is the regularized loss function, J(θ) is the loss function, θ denotes all parameters of the convolutional neural network, λ is the regularization coefficient, the summation term is the sum of the squared weights, i is the index of each recognized action, and k is the total number of recognized actions; the regularization coefficient can be chosen according to the actual situation.
After the output of the fully connected layer is obtained, the softmax layer converts that output so that the final temporal prediction result is the probability that the optical flow image depicts a given action.
The conversion formula of the softmax layer can be:

S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}

where V_i is the output value of the fully connected layer for action class i, N is the total number of action classes, and S_i is the ratio of the exponential of the current output value to the sum of the exponentials of all output values of the fully connected layer, i.e. the probability output by the temporal-stream convolutional network.
S1024. Aggregate the spatial prediction result and the temporal prediction result to obtain the prediction result.

Once the spatial-stream network has output the spatial prediction result and the temporal-stream network has output the temporal prediction result, the two can be aggregated into the prediction result P_A = {a_1: p_a1; a_2: p_a2; …; a_n: p_an}, where a_1, a_2, …, a_n are action labels and p_a1, p_a2, …, p_an are the predicted probabilities of the corresponding human actions. Aggregation may use direct averaging (taking the mean of the two results) or an SVM-based method; the direct-averaging variant is sketched below.
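A minimal sketch of the direct-averaging variant; the action labels and probabilities are made-up placeholders:

```python
# Direct average of spatial and temporal per-action probabilities.
def aggregate(spatial, temporal):
    # spatial / temporal: dict mapping action label -> predicted probability
    return {a: (spatial[a] + temporal[a]) / 2.0 for a in spatial}

p_a = aggregate({"wave": 0.7, "run": 0.3}, {"wave": 0.9, "run": 0.1})
# p_a == {"wave": 0.8, "run": 0.2}
```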
S103. Train a pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and a classification result.

The motion data comprise the three-axis angular velocity data collected by the gyroscope sensor and the three-axis acceleration data collected by the acceleration sensor of the smart wearable device while the user performs the corresponding action.

The mean, variance, and root mean square of the three-axis angular velocity and three-axis acceleration data are computed and assembled into a feature matrix; the feature matrix and corresponding action labels are then fed into the pre-configured classifier for action classification, yielding the trained classification model and the classification result P_B = {b_1: p_b1; b_2: p_b2; …; b_n: p_bn}, where b_1, b_2, …, b_n are human action labels and p_b1, p_b2, …, p_bn are the predicted probabilities of the corresponding human actions. The pre-configured classifier may be a support vector machine; a sketch follows.
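The sketch below illustrates one way this feature matrix and classifier could look, assuming scikit-learn's SVC as the support vector machine and random stand-in data (both are assumptions, not specified by the source):

```python
# Mean, variance and RMS per axis -> 18-dim feature vector; SVM classifier.
import numpy as np
from sklearn.svm import SVC

def features(sample):
    # sample: (T, 6) array of 3-axis angular velocity + 3-axis acceleration
    mean = sample.mean(axis=0)
    var = sample.var(axis=0)
    rms = np.sqrt((sample ** 2).mean(axis=0))
    return np.concatenate([mean, var, rms])

samples = [np.random.randn(100, 6) for _ in range(20)]  # placeholder recordings
labels = np.random.randint(0, 3, size=20)               # placeholder action tags

X = np.stack([features(s) for s in samples])
clf = SVC(probability=True).fit(X, labels)
p_b = clf.predict_proba(X[:1])  # classification probabilities for one sample
```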
S104. Merge the trained network model and the trained classification model into a local recognition model, and obtain a local recognition result according to the prediction result and the classification result.

Since the local recognition model consists of the network model and the classification model, each participant merges its trained network model with its trained classification model to obtain the local recognition model. The local recognition result is then derived from the prediction result and the classification result.

In an embodiment, obtaining the local recognition result according to the prediction result and the classification result includes: obtaining the local recognition result from the prediction result and the classification result based on a weight calculation formula.

Because the local recognition model contains both a network model and a classification model, it can weight the network model's prediction result and the classification model's classification result by preset weight coefficients to produce the final local recognition result.
The weight calculation formula is:

R = λ₁·P_a + λ₂·P_b

where R is the local recognition result, P_a is the highest-probability result among the prediction results, λ₁ is the weight coefficient of P_a, P_b is the highest-probability result among the classification results, and λ₂ is the weight coefficient of P_b. The formula is illustrated below.
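A one-line illustration, with assumed weight values:

```python
# R = lambda1 * P_a + lambda2 * P_b, using each model's highest-probability result.
lam1, lam2 = 0.6, 0.4  # weight coefficients (assumed values)

def local_result(pred, cls):
    # pred / cls: dict of action -> probability from the network / classifier
    return lam1 * max(pred.values()) + lam2 * max(cls.values())
```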
S105. Upload the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters.

After each participant has obtained its local recognition model, it uploads the model parameters and the local recognition result to the cloud server, which performs joint learning on the received information to obtain the learning parameters.

In a specific implementation, the cloud server may use a global averaging method for joint learning: it computes the average of each model parameter across the local recognition models and then lowers the weight of parameters that deviate too far from the average, thereby obtaining the learning parameters. A minimal sketch of the averaging step is given below.
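This sketch assumes the uploaded parameters arrive as PyTorch state dictionaries of floating-point tensors (an assumption on our part; the outlier down-weighting described above is omitted):

```python
# Global average of each parameter across participants' uploaded models.
import torch

def global_average(param_dicts):
    avg = {}
    for name in param_dicts[0]:
        avg[name] = torch.stack([d[name] for d in param_dicts]).mean(dim=0)
    return avg  # the learning parameters sent back to every participant
```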
In an embodiment, uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning includes: encrypting the model parameters and the local recognition result to obtain encrypted data; and uploading the encrypted data to the cloud server for joint learning.

Each participant encrypts the data to be uploaded and sends the resulting ciphertext to the cloud server; after receiving it, the server decrypts the data and performs joint learning on it, which reduces leakage during transmission and improves data security.

Privacy-preserving techniques such as homomorphic encryption, differential privacy, or secure multi-party computation may be used for the encryption. Note that with homomorphic encryption, the cloud server need not decrypt the data and can perform joint learning directly on the ciphertext.
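As one hedged example of additively homomorphic encryption, the python-paillier ("phe") package could be used; the package choice and the parameter values are assumptions, not named by the source:

```python
# Paillier encryption: the server can add ciphertexts without decrypting them.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()
enc = [public_key.encrypt(w) for w in [0.12, -0.53, 0.07]]  # model parameters
enc_sum = enc[0] + enc[1] + enc[2]       # homomorphic addition on the server
print(private_key.decrypt(enc_sum))      # decryption stays with the key holder
```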
In an embodiment, before step S104, the method includes: uploading the trained network model and the prediction result to the cloud server for joint learning to obtain a joint network model, receiving the joint network model sent by the cloud server, and using the joint network model as the trained network model; and/or uploading the trained classification model and the classification result to the cloud server for joint learning to obtain a joint classification model, receiving the joint classification model sent by the cloud server, and using the joint classification model as the trained classification model.

After each participant finishes training its local network model, it uploads the model parameters and prediction results of that model to the cloud server, which jointly learns from the uploads of all participants to obtain the joint network model.

Using the joint network model as the trained network model means that, after obtaining the joint network model, the cloud server distributes its parameters to every participant; each participant receives these parameters, updates its locally trained network model with them, and takes the updated network model as the trained network model.

Likewise, after each participant finishes training its local classification model, it uploads the model parameters and classification results of that model to the cloud server, which jointly learns from the uploads of all participants to obtain the joint classification model.

Using the joint classification model as the trained classification model means that the cloud server distributes the parameters of the joint classification model to every participant; each participant updates its locally trained classification model with them and takes the updated model as the trained classification model.

That is, at most three distinct rounds of joint learning can occur in this training method: joint learning of the locally trained network model, joint learning of the locally trained classification model, and joint learning of the local recognition model.
S106. Receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as the trained action recognition model.

Each participant receives the learning parameters sent by the cloud server, updates its local recognition model accordingly, and takes the updated local recognition model as the trained action recognition model, completing the training.
In the action recognition model training method provided by the above embodiment, video images, action data, and their corresponding action labels are acquired; a dual-stream convolutional neural network is trained on the video images and labels to obtain a trained network model and a prediction result, while a pre-configured classifier is trained on the action data and labels to obtain a trained classification model and a classification result; the two trained models are merged into a local recognition model, and a local recognition result is derived from the prediction and classification results; the model parameters of the local recognition model and the local recognition result are uploaded to a cloud server for joint learning to obtain learning parameters; finally, each participant receives the learning parameters sent by the cloud server and updates its local recognition model, completing model training. Because each participant trains locally and only uploads its local recognition model for joint learning, the effective number of training samples is enlarged and the recognition accuracy of the resulting action recognition model is improved, while the training data are never exchanged, preserving data security and privacy.
Please refer to FIG. 4, which shows an action recognition method provided by an embodiment of this application.

As shown in FIG. 4, the action recognition method specifically includes steps S201 and S202.
S201. Acquire an image to be recognized and the motion data corresponding to the image to be recognized.

When recognizing the action of a user wearing a wearable device, the image to be recognized and the action data corresponding to that image, captured while the user performs the action, can be acquired.

The motion data comprise the three-axis angular velocity data collected by the gyroscope sensor and the three-axis acceleration data collected by the acceleration sensor of the smart wearable device while the user performs the corresponding action.

S202. Input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition, to obtain a recognition result.

Here, the pre-trained action recognition model is a model trained by the aforementioned action recognition model training method.
Since the pre-trained action recognition model contains a network model and a classification model, the image to be recognized is fed into the network model, which predicts the action from the image to produce a prediction result, while the motion data are fed into the classification model, which classifies the action from the motion data to produce a classification result.

Then, according to the weight coefficients configured in the action recognition model, the network model's prediction result and the classification model's classification result are each weighted by their corresponding coefficients and combined to yield a single, definite recognition result, completing the action recognition; the recognition result is then output. A sketch of this fusion at inference time follows.
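A hedged sketch of the inference-time fusion, with placeholder probabilities standing in for the two sub-models' outputs:

```python
# Weighted fusion of network prediction and classifier output at inference.
lam1, lam2 = 0.6, 0.4  # configured weight coefficients (assumed values)

def recognize(pred, cls):
    # pred: network-model probabilities for the image to be recognized
    # cls: classifier probabilities for the corresponding motion data
    fused = {a: lam1 * pred[a] + lam2 * cls.get(a, 0.0) for a in pred}
    return max(fused, key=fused.get)  # the recognized action label

action = recognize({"wave": 0.8, "run": 0.2}, {"wave": 0.6, "run": 0.4})
```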
Note that if only an image to be recognized or only motion data is available, that input alone can still be fed into the pre-trained action recognition model for action recognition.

In the action recognition method provided by the above embodiment, an image to be recognized and its corresponding motion data are acquired and fed into a pre-trained action recognition model, yielding a recognition result and completing the recognition. Performing recognition on both the image and the motion data and combining the two results improves the accuracy of action recognition.
Please refer to FIG. 5, a schematic block diagram of an action recognition model training apparatus provided by an embodiment of this application; the apparatus is configured to execute the aforementioned action recognition model training method and may be deployed in a server or a terminal.

The server may be a standalone server or a server cluster. The terminal may be an electronic device such as a mobile phone, tablet computer, notebook computer, desktop computer, personal digital assistant, or wearable device.
As shown in FIG. 5, the action recognition model training apparatus 300 includes: a sample acquisition module 301, a network training module 302, a classification training module 303, a model merging module 304, a joint learning module 305, and a model update module 306.

The sample acquisition module 301 is configured to acquire video images, action data, and the action labels corresponding to the video images and action data.

The network training module 302 is configured to perform network training on the dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and a prediction result.

The network training module 302 includes an optical flow extraction sub-module 3021, a spatial training sub-module 3022, a temporal training sub-module 3023, and a result aggregation sub-module 3024.

Specifically, the optical flow extraction sub-module 3021 is configured to extract, from the video images, the optical flow images corresponding to the video images. The spatial training sub-module 3022 is configured to train the spatial-stream convolutional network of the dual-stream convolutional neural network with the video images and corresponding action labels and obtain a spatial prediction result. The temporal training sub-module 3023 is configured to train the temporal-stream convolutional network with the optical flow images and corresponding action labels and obtain a temporal prediction result. The result aggregation sub-module 3024 is configured to aggregate the spatial and temporal prediction results into the prediction result.

The classification training module 303 is configured to train the pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and a classification result.

The model merging module 304 is configured to merge the trained network model and the trained classification model into a local recognition model, and to obtain a local recognition result according to the prediction result and the classification result.

The joint learning module 305 is configured to upload the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters.

The model update module 306 is configured to receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as the trained action recognition model.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the action recognition model training apparatus and its modules described above may be found in the corresponding processes of the foregoing action recognition model training method embodiment and are not repeated here.
Please refer to FIG. 6, a schematic block diagram of an action recognition apparatus provided by an embodiment of this application; the apparatus is configured to execute the aforementioned action recognition method and may be deployed in a server or a terminal.

The server may be a standalone server or a server cluster. The terminal may be an electronic device such as a mobile phone, tablet computer, notebook computer, desktop computer, personal digital assistant, or wearable device.

As shown in FIG. 6, the action recognition apparatus 400 includes: a data acquisition module 401 and an action recognition module 402.

The data acquisition module 401 is configured to acquire the image to be recognized and the motion data corresponding to the image to be recognized.

The action recognition module 402 is configured to input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition and obtain a recognition result, where the pre-trained action recognition model is trained by the above action recognition model training method.

It should be noted that, for convenience and brevity of description, the specific working processes of the action recognition apparatus and its modules described above may be found in the corresponding processes of the foregoing action recognition method embodiment and are not repeated here.
The action recognition model training apparatus and the action recognition apparatus described above may be implemented in the form of a computer program that can run on a computer device as shown in FIG. 7.

Please refer to FIG. 7, a schematic block diagram of the structure of a computer device provided by an embodiment of this application. The computer device may be a server or a terminal.

As shown in FIG. 7, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.

The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause the processor to perform any of the action recognition model training methods or action recognition methods.

The processor provides computing and control capabilities and supports the operation of the entire computer device.

The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when executed by the processor, the program causes the processor to perform any of the action recognition model training methods or action recognition methods.

The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in FIG. 7 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.

It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
In one embodiment, the processor is configured to run a computer program stored in the memory and, when implementing the action recognition model training method, to perform the following steps:

acquiring video images, action data, and the action labels corresponding to the video images and action data; performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels to obtain a trained network model and a prediction result; training a pre-configured classifier based on the action data and corresponding action labels to obtain a trained classification model and a classification result; merging the trained network model and the trained classification model into a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result; uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning to obtain learning parameters; and receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model.

In one embodiment, when performing the network training of the dual-stream convolutional neural network based on the video images and corresponding action labels to obtain the trained network model and the prediction result, the processor is configured to: extract, from the video images, the optical flow images corresponding to the video images; train the spatial-stream convolutional network of the dual-stream convolutional neural network with the video images and corresponding action labels and obtain a spatial prediction result; train the temporal-stream convolutional network with the optical flow images and corresponding action labels and obtain a temporal prediction result; and aggregate the spatial and temporal prediction results into the prediction result.

In one embodiment, when obtaining the local recognition result according to the prediction result and the classification result, the processor is configured to: obtain the local recognition result from the prediction result and the classification result based on a weight calculation formula; the weight calculation formula is:
R = λ₁·P_a + λ₂·P_b

where R is the local recognition result, P_a is the highest-probability result among the prediction results, λ₁ is the weight coefficient of P_a, P_b is the highest-probability result among the classification results, and λ₂ is the weight coefficient of P_b.
In one embodiment, when uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning, the processor is configured to: encrypt the model parameters and the local recognition result to obtain encrypted data; and upload the encrypted data to the cloud server for joint learning.

In one embodiment, before the model parameters of the local recognition model and the local recognition result are uploaded to the cloud server for joint learning, the processor is configured to: upload the trained network model and the prediction result to the cloud server for joint learning to obtain a joint network model, receive the joint network model sent by the cloud server, and use the joint network model as the trained network model; and/or upload the trained classification model and the classification result to the cloud server for joint learning to obtain a joint classification model, receive the joint classification model sent by the cloud server, and use the joint classification model as the trained classification model.

In one embodiment, the processor is configured to run a computer program stored in the memory and, when implementing the action recognition method, to perform the following steps:

acquiring an image to be recognized and the motion data corresponding to the image to be recognized; and inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition to obtain a recognition result, where the pre-trained action recognition model is trained by the above action recognition model training method.
An embodiment of this application further provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program that includes program instructions; a processor executes the program instructions to implement any of the … methods provided by the embodiments of this application.

The computer-readable storage medium may be an internal storage unit of the computer device of the foregoing embodiments, such as the hard disk or memory of the computer device. It may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device.

The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any equivalent modification or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. An action recognition model training method, wherein the method comprises:
    acquiring video images, action data, and the action labels corresponding to the video images and action data;
    performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and a prediction result;
    training a pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and a classification result;
    merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result;
    uploading model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters; and
    receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model.
  2. The action recognition model training method according to claim 1, wherein the performing network training on the dual-stream convolutional neural network based on the video images and corresponding action labels to obtain the trained network model and the prediction result comprises:
    extracting, from the video images, the optical flow images corresponding to the video images;
    training a spatial-stream convolutional network of the dual-stream convolutional neural network with the video images and corresponding action labels, and obtaining a spatial prediction result;
    training a temporal-stream convolutional network of the dual-stream convolutional neural network with the optical flow images and corresponding action labels, and obtaining a temporal prediction result; and
    aggregating the spatial prediction result and the temporal prediction result to obtain the prediction result.
  3. The action recognition model training method according to claim 1, wherein the obtaining a local recognition result according to the prediction result and the classification result comprises:
    obtaining the local recognition result according to the prediction result and the classification result based on a weight calculation formula;
    the weight calculation formula being:
    R = λ₁·P_a + λ₂·P_b
    where R is the local recognition result, P_a is the highest-probability result among the prediction results, λ₁ is the weight coefficient of P_a, P_b is the highest-probability result among the classification results, and λ₂ is the weight coefficient of P_b.
  4. The action recognition model training method according to claim 1, wherein the uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning comprises:
    encrypting the model parameters of the local recognition model and the local recognition result to obtain encrypted data; and
    uploading the encrypted data to the cloud server for joint learning.
  5. The action recognition model training method according to claim 1, wherein, before the uploading the model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, the method comprises:
    uploading the trained network model and the prediction result to the cloud server for joint learning to obtain a joint network model; receiving the joint network model sent by the cloud server, and using the joint network model as the trained network model; and/or
    uploading the trained classification model and the classification result to the cloud server for joint learning to obtain a joint classification model; receiving the joint classification model sent by the cloud server, and using the joint classification model as the trained classification model.
  6. The action recognition model training method according to claim 2, wherein the spatial prediction result and the temporal prediction result are aggregated by a direct averaging method or an SVM method.
  7. The action recognition model training method according to claim 4, wherein the model parameters of the local recognition model and the local recognition result are encrypted using any one of a homomorphic encryption algorithm, a differential privacy algorithm, and a secure multi-party computation algorithm.
  8. An action recognition method, comprising:
    acquiring an image to be recognized and motion data corresponding to the image to be recognized; and
    inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition, to obtain a recognition result;
    wherein the pre-trained action recognition model is trained by the action recognition model training method according to any one of claims 1-5.
  9. An action recognition model training apparatus, comprising:
    a sample acquisition module, configured to acquire video images, action data, and the action labels corresponding to the video images and action data;
    a network training module, configured to perform network training on a dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and a prediction result;
    a classification training module, configured to train a pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and a classification result;
    a model merging module, configured to merge the trained network model and the trained classification model into a local recognition model, and to obtain a local recognition result according to the prediction result and the classification result;
    a joint learning module, configured to upload model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters; and
    a model update module, configured to receive the learning parameters sent by the cloud server, update the local recognition model according to the learning parameters, and use the updated local recognition model as the trained action recognition model.
  10. An action recognition apparatus, comprising:
    a data acquisition module, configured to acquire an image to be recognized and motion data corresponding to the image to be recognized; and
    an action recognition module, configured to input the image to be recognized and the motion data into a pre-trained action recognition model for action recognition, to obtain a recognition result;
    wherein the pre-trained action recognition model is trained by the action recognition model training method according to any one of claims 1-5.
  11. A computer device, wherein the computer device comprises a memory and a processor;
    the memory is configured to store a computer program; and
    the processor is configured to execute the computer program and, when executing the computer program, to implement the following steps:
    acquiring video images, action data, and the action labels corresponding to the video images and action data;
    performing network training on a dual-stream convolutional neural network based on the video images and corresponding action labels, to obtain a trained network model and a prediction result;
    training a pre-configured classifier based on the action data and corresponding action labels, to obtain a trained classification model and a classification result;
    merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result;
    uploading model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters; and
    receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model;
    as well as the following steps: acquiring an image to be recognized and motion data corresponding to the image to be recognized; and
    inputting the image to be recognized and the motion data into a pre-trained action recognition model for action recognition, to obtain a recognition result;
    wherein the pre-trained action recognition model is trained by the action recognition model training method described above.
  12. The computer device according to claim 11, wherein performing the network training on the dual-stream convolutional neural network based on the video images and the corresponding action labels, to obtain the trained network model and the prediction result, comprises:
    Extracting, according to the video images, optical flow images corresponding to the video images;
    Training the spatial-stream convolutional network in the dual-stream convolutional neural network using the video images and the corresponding action labels, to obtain a spatial prediction result;
    Training the temporal-stream convolutional network in the dual-stream convolutional neural network using the optical flow images and the corresponding action labels, to obtain a temporal prediction result;
    Aggregating the spatial prediction result and the temporal prediction result to obtain the prediction result (see the illustrative sketch below).
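For illustration only: a minimal two-stream training step, assuming PyTorch. The small placeholder networks, the 10-class output, and the averaging of softmax scores as the aggregation rule are assumptions for illustration; the claim fixes neither the network architectures nor the aggregation method. The 2-channel optical flow input could be computed with, for example, OpenCV's calcOpticalFlowFarneback.

    import torch
    import torch.nn as nn

    spatial_net = nn.Sequential(   # spatial-stream CNN over RGB frames (placeholder)
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
    temporal_net = nn.Sequential(  # temporal-stream CNN over optical flow (placeholder)
        nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(
        list(spatial_net.parameters()) + list(temporal_net.parameters()), lr=0.01)

    def train_step(rgb, flow, labels):
        # Train each stream on its own input against the shared action labels.
        optimizer.zero_grad()
        spatial_logits = spatial_net(rgb)     # spatial prediction result
        temporal_logits = temporal_net(flow)  # temporal prediction result
        loss = criterion(spatial_logits, labels) + criterion(temporal_logits, labels)
        loss.backward()
        optimizer.step()
        # Aggregate the two streams (here: mean of softmax scores) into the prediction result.
        return (spatial_logits.softmax(1) + temporal_logits.softmax(1)) / 2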
  13. The computer device according to claim 11, wherein obtaining the local recognition result according to the prediction result and the classification result comprises:
    Obtaining the local recognition result from the prediction result and the classification result based on a weight calculation formula;
    The weight calculation formula comprises:
    R = λ1·Pa + λ2·Pb
    Where R denotes the local recognition result, Pa denotes the highest-probability result among the prediction results, λ1 denotes the weight coefficient of Pa, Pb denotes the highest-probability result among the classification results, and λ2 denotes the weight coefficient of Pb (see the illustrative sketch below).
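For illustration only: the weight calculation formula R = λ1·Pa + λ2·Pb as a short computation; the example weights 0.6 and 0.4 are assumptions, since the claim leaves λ1 and λ2 open.

    import numpy as np

    def local_recognition_result(prediction, classification, lam1=0.6, lam2=0.4):
        p_a = np.max(prediction)      # highest-probability result of the network model
        p_b = np.max(classification)  # highest-probability result of the classifier
        return lam1 * p_a + lam2 * p_b  # R, the local recognition result

    # Example: if the network's best score is 0.8 and the classifier's is 0.7,
    # R = 0.6 * 0.8 + 0.4 * 0.7 = 0.76.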
  14. The computer device according to claim 11, wherein uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning comprises:
    Encrypting the model parameters of the local recognition model and the local recognition result to obtain encrypted data;
    Uploading the encrypted data to the cloud server for joint learning (see the illustrative sketch below).
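For illustration only: encrypting the serialized parameters and result before upload, assuming the Python cryptography package's Fernet symmetric scheme; the endpoint URL and the payload layout are hypothetical, and the claim does not specify any particular encryption algorithm.

    import json
    import requests
    from cryptography.fernet import Fernet

    def encrypt_and_upload(model_params: dict, local_result: float, key: bytes) -> None:
        payload = json.dumps({"params": model_params, "result": local_result}).encode()
        encrypted = Fernet(key).encrypt(payload)  # the encrypted data
        # Hypothetical cloud endpoint for the joint-learning service.
        requests.post("https://cloud.example.com/joint-learning", data=encrypted, timeout=30)

A shared key would be generated once with Fernet.generate_key() and provisioned to both the device and the cloud server.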
  15. The computer device according to claim 11, wherein before uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning, the method comprises:
    Uploading the trained network model and the prediction result to the cloud server for joint learning, to obtain a joint network model; receiving the joint network model sent by the cloud server, and using the joint network model as the trained network model; and/or
    Uploading the trained classification model and the classification result to the cloud server for joint learning, to obtain a joint classification model; receiving the joint classification model sent by the cloud server, and using the joint classification model as the trained classification model (see the illustrative sketch below).
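For illustration only: one server-side joint-learning round, sketched here as plain federated averaging of the uploaded parameters. Averaging is one common aggregation rule, named as an assumption; the claim does not fix the aggregation method.

    import numpy as np

    def joint_learning_round(client_params: list) -> dict:
        """Average each named parameter across all participating clients; the
        result is sent back to each client as the joint (learning) parameters.
        client_params: list of dicts mapping parameter names to np.ndarray."""
        return {name: np.mean([p[name] for p in client_params], axis=0)
                for name in client_params[0]}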
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the following steps:
    Acquiring video images, action data, and action labels corresponding to the video images and the action data;
    Performing network training on a dual-stream convolutional neural network based on the video images and the corresponding action labels, to obtain a trained network model and a prediction result;
    Training a pre-configured classifier based on the action data and the corresponding action labels, to obtain a trained classification model and a classification result;
    Merging the trained network model and the trained classification model to obtain a local recognition model, and obtaining a local recognition result according to the prediction result and the classification result;
    Uploading model parameters of the local recognition model and the local recognition result to a cloud server for joint learning, to obtain learning parameters;
    Receiving the learning parameters sent by the cloud server, updating the local recognition model according to the learning parameters, and using the updated local recognition model as the trained action recognition model;
    And the following steps: acquiring an image to be recognized and motion data corresponding to the image to be recognized;
    Inputting the image to be recognized and the motion data into the pre-trained action recognition model for action recognition, to obtain a recognition result;
    Wherein the pre-trained action recognition model is trained by the above-described action recognition model training method.
  17. The computer-readable storage medium according to claim 16, wherein performing the network training on the dual-stream convolutional neural network based on the video images and the corresponding action labels, to obtain the trained network model and the prediction result, comprises:
    Extracting, according to the video images, optical flow images corresponding to the video images;
    Training the spatial-stream convolutional network in the dual-stream convolutional neural network using the video images and the corresponding action labels, to obtain a spatial prediction result;
    Training the temporal-stream convolutional network in the dual-stream convolutional neural network using the optical flow images and the corresponding action labels, to obtain a temporal prediction result;
    Aggregating the spatial prediction result and the temporal prediction result to obtain the prediction result.
  18. The computer-readable storage medium according to claim 16, wherein obtaining the local recognition result according to the prediction result and the classification result comprises:
    Obtaining the local recognition result from the prediction result and the classification result based on a weight calculation formula;
    The weight calculation formula comprises:
    R = λ1·Pa + λ2·Pb
    Where R denotes the local recognition result, Pa denotes the highest-probability result among the prediction results, λ1 denotes the weight coefficient of Pa, Pb denotes the highest-probability result among the classification results, and λ2 denotes the weight coefficient of Pb.
  19. The computer-readable storage medium according to claim 16, wherein uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning comprises:
    Encrypting the model parameters of the local recognition model and the local recognition result to obtain encrypted data;
    Uploading the encrypted data to the cloud server for joint learning.
  20. The computer-readable storage medium according to claim 16, wherein before uploading the model parameters of the local recognition model and the local recognition result to the cloud server for joint learning, the method comprises:
    Uploading the trained network model and the prediction result to the cloud server for joint learning, to obtain a joint network model; receiving the joint network model sent by the cloud server, and using the joint network model as the trained network model; and/or
    Uploading the trained classification model and the classification result to the cloud server for joint learning, to obtain a joint classification model; receiving the joint classification model sent by the cloud server, and using the joint classification model as the trained classification model.
PCT/CN2020/135245 2020-10-21 2020-12-10 Model training method and apparatus, action recognition method and apparatus, and device and storage medium WO2021189952A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011133950.3 2020-10-21
CN202011133950.3A CN112257579A (en) 2020-10-21 2020-10-21 Model training method, action recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021189952A1

Family

ID=74263410

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/135245 WO2021189952A1 (en) 2020-10-21 2020-12-10 Model training method and apparatus, action recognition method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN112257579A (en)
WO (1) WO2021189952A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023082788A1 (en) * 2021-11-11 2023-05-19 新智我来网络科技有限公司 Method and apparatus for predicting oxygen content in flue gas and load, method and apparatus for selecting prediction model, and method and apparatus for predicting flue gas emission
CN117132790B (en) * 2023-10-23 2024-02-02 南方医科大学南方医院 Digestive tract tumor diagnosis auxiliary system based on artificial intelligence

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3605394A1 (en) * 2018-08-03 2020-02-05 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for recognizing body movement
CN109011419A (en) * 2018-08-22 2018-12-18 重庆电政信息科技有限公司 A kind of athletic performance training method based on MEMS sensor
CN110909672A (en) * 2019-11-21 2020-03-24 江苏德劭信息科技有限公司 Smoking action recognition method based on double-current convolutional neural network and SVM
CN111428620A (en) * 2020-03-20 2020-07-17 深圳前海微众银行股份有限公司 Identity recognition method, device, equipment and medium based on federal in-vivo detection model

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114070547A (en) * 2021-11-16 2022-02-18 河南大学 Integrated learning-based multi-layer composite recognition method for cryptographic algorithm
CN114399698A (en) * 2021-11-30 2022-04-26 西安交通大学 Hand washing quality scoring method and system based on smart watch
CN114399698B (en) * 2021-11-30 2024-04-02 西安交通大学 Hand washing quality scoring method and system based on intelligent watch
CN114464289A (en) * 2022-02-11 2022-05-10 武汉大学 ERCP report generation method and device, electronic equipment and computer readable storage medium
CN115908948A (en) * 2023-01-05 2023-04-04 北京霍里思特科技有限公司 Intelligent sorting system for online adjustment model and control method thereof
CN115908948B (en) * 2023-01-05 2024-04-26 北京霍里思特科技有限公司 Intelligent sorting system for online adjustment model and control method thereof

Also Published As

Publication number Publication date
CN112257579A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
WO2021189952A1 (en) Model training method and apparatus, action recognition method and apparatus, and device and storage medium
JP6639700B2 (en) Method and system for generating a multimodal digital image
US20230082173A1 (en) Data processing method, federated learning training method, and related apparatus and device
US10235562B2 (en) Emotion recognition in video conferencing
WO2020199693A1 (en) Large-pose face recognition method and apparatus, and device
CN105874474B (en) System and method for face representation
KR102611454B1 (en) Storage device for decentralized machine learning and machine learning method thereof
US9805305B2 (en) Boosted deep convolutional neural networks (CNNs)
CN110135185A (en) The machine learning of privatization is carried out using production confrontation network
WO2020253127A1 (en) Facial feature extraction model training method and apparatus, facial feature extraction method and apparatus, device, and storage medium
CN109101602A (en) Image encrypting algorithm training method, image search method, equipment and storage medium
WO2022206498A1 (en) Federated transfer learning-based model training method and computing nodes
CN113505882B (en) Data processing method based on federal neural network model, related equipment and medium
WO2020238353A1 (en) Data processing method and apparatus, storage medium, and electronic apparatus
CN113366542A (en) Techniques for implementing augmented based normalized classified image analysis computing events
US20200143169A1 (en) Video recognition using multiple modalities
CN111414879A (en) Face shielding degree identification method and device, electronic equipment and readable storage medium
Purwanto et al. Extreme low resolution action recognition with spatial-temporal multi-head self-attention and knowledge distillation
US20200082002A1 (en) Determining contextual confidence of images using associative deep learning
CN110097004B (en) Facial expression recognition method and device
Wang et al. Efficient global-local memory for real-time instrument segmentation of robotic surgical video
Gupta et al. Toward asynchronously weight updating federated learning for AI-on-edge IoT systems
CN108228823A (en) A kind of binary-coding method and system of high dimensional image dimensionality reduction
Mahmud et al. Gaze estimation with eye region segmentation and self-supervised multistream learning
CN114049417B (en) Virtual character image generation method and device, readable medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20927039; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20927039; Country of ref document: EP; Kind code of ref document: A1)