CN113642472A - Training method and action recognition method of discriminator model

Info

Publication number: CN113642472A
Application number: CN202110939838.7A
Authority: CN
Inventors: 范锡睿; 赵亚飞; 陈超; 张世昌; 郭紫垣
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-08-13
Filing date: 2021-08-13
Publication date: 2021-11-12

Abstract

The disclosure provides a training method and an action recognition method of a discriminator model, and relates to the field of artificial intelligence, in particular to the fields of deep learning and computer vision. The specific implementation scheme is as follows: determining a first positive sample pair, wherein the positive sample pair comprises first template video data and second template video data, and the first template video data and the second template video data both comprise a first template action; determining a first negative sample pair, the negative sample pair comprising third template video data and at least one of the first template video data and the second template video data, the third template video data comprising a second template action different from the first template action; and training the discriminator model by using the first positive sample pair and the first negative sample pair.

Description

Training method and action recognition method of discriminator model

Technical Field

The present disclosure relates to the field of artificial intelligence technology, and further relates to the field of deep learning and computer vision, and in particular, to a training method, an action recognition method, an apparatus, an electronic device, and a storage medium for a discriminator model.

Background

With the spread of video equipment, the growth of video software, and rising network speeds, a large and exponentially growing volume of video circulates on the network. Video content is varied and abundant, far beyond what humans can process manually. It is therefore necessary to devise an in-video action recognition method suitable for applications such as video recommendation, human behavior analysis, and video surveillance.

Disclosure of Invention

The disclosure provides a training method of a discriminator model, an action recognition method, an apparatus, an electronic device, and a storage medium.

According to an aspect of the present disclosure, there is provided a training method of a discriminator model, including: determining a first positive sample pair, the positive sample pair comprising first template video data and second template video data, the first template video data and the second template video data each comprising a first template action; determining a first negative sample pair comprising third template video data and at least one of the first template video data and the second template video data, the third template video data comprising a second template action different from the first template action; and training the discriminator model by using the first positive sample pair and the first negative sample pair.

According to another aspect of the present disclosure, there is provided an action recognition method including: inputting a first time sequence feature vector of a video sequence to be recognized, which comprises an action to be recognized, and a second time sequence feature vector of a template video sequence, which comprises a template action, into a discriminator model to obtain a target template action that belongs to the same category as at least part of the action represented by the first time sequence feature vector; and determining the action category of the action to be recognized according to the target template action; wherein the discriminator model is trained by the training method described above.

According to another aspect of the present disclosure, there is provided a training apparatus of a discriminator model, including: a first determining module, configured to determine a first positive sample pair, where the positive sample pair includes first template video data and second template video data, and the first template video data and the second template video data both include a first template action; a second determining module, configured to determine a first negative sample pair, where the negative sample pair includes third template video data and at least one of the first template video data and the second template video data, and the third template video data includes a second template action different from the first template action; and a first training module for training the discriminator model using the first positive sample pair and the first negative sample pair.

According to another aspect of the present disclosure, there is provided an action recognition apparatus including: an input module, configured to input a first time sequence feature vector of a video sequence to be recognized, which comprises an action to be recognized, and a second time sequence feature vector of a template video sequence, which comprises a template action, into a discriminator model to obtain a target template action that belongs to the same category as at least part of the action represented by the first time sequence feature vector; and a determining module, configured to determine the action category of the action to be recognized according to the target template action; wherein the discriminator model is trained by the training apparatus described above.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described above.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 schematically illustrates an exemplary system architecture to which the action recognition method and apparatus may be applied, according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of a method of training a discriminator model according to an embodiment of the present disclosure;

FIG. 3 schematically shows a flow chart of an action recognition method according to an embodiment of the present disclosure;

FIG. 4A schematically illustrates a schematic diagram of data acquisition and discriminator training in accordance with an embodiment of the present disclosure;

FIG. 4B schematically shows a schematic diagram of feature extraction according to an embodiment of the present disclosure;

FIG. 4C schematically shows a schematic diagram of a discriminator process according to an embodiment of the present disclosure;

FIG. 4D schematically illustrates a diagram of post-processing output results according to an embodiment of the disclosure;

FIG. 5 schematically illustrates a block diagram of a training apparatus for a discriminator model according to an embodiment of the present disclosure;

FIG. 6 schematically shows a block diagram of an action recognition apparatus according to an embodiment of the present disclosure; and

FIG. 7 illustrates a schematic block diagram of an example electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information of the users involved all comply with relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.

Real-time action recognition based on a monocular RGB (red, green, blue) ...

Action recognition is usually implemented by using a deep neural network to train a multi-class classifier directly end to end, which generally involves five steps: (1) data acquisition, in which data is collected for the application scenario; (2) data labeling, in which a labeling rule is defined according to the target action categories and the data is labeled accordingly; (3) network training, in which the labeled data is fed into a deep neural network for training; (4) network deployment, in which the trained network model is deployed in an engineering environment; and (5) network inference, in which an image is input into the network and the inferred category is output.

In the course of implementing the disclosed concept, the inventors found that end-to-end training of a multi-class classifier is tightly bound to the application scenario. For each scenario, data must be collected and corresponding labeling rules formulated to determine which actions need to be recognized, and the data must then be annotated manually. These two steps incur significant labor and time costs, usually measured in days. The subsequent network training step is also time-consuming. In addition, the action categories supported by a multi-class classifier are fixed once it is trained; supporting a new action later requires repeating the three steps of data acquisition, data labeling, and network training, so the cost is high and flexibility and extensibility are poor. Furthermore, a multi-class classifier trained end to end is very sensitive to external conditions such as environment, viewing angle, and clothing, and generalizes poorly. If the distribution of the actual data differs greatly from that of the training data, performance deteriorates severely.

Therefore, the current dynamic motion recognition technology is not mature enough, and a stable and reliable general solution is not available. In the digital live broadcast scene, highly stable and accurate dynamic action recognition capability is required as a support.

Fig. 1 schematically illustrates an exemplary system architecture to which the motion recognition method and apparatus may be applied, according to an embodiment of the present disclosure.

It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the motion recognition method and apparatus may be applied may include a terminal device, but the terminal device may implement the motion recognition method and apparatus provided in the embodiments of the present disclosure without interacting with a server.

As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a knowledge reading application, a web browser application, a search application, an instant messaging tool, a mailbox client, and/or social platform software, etc. (by way of example only).

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (for example only) providing support for content browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data such as a user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that addresses the high management difficulty and weak service extensibility of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.

It should be noted that the action recognition method provided by the embodiment of the present disclosure may generally be executed by the terminal device 101, 102, or 103. Accordingly, the action recognition apparatus provided by the embodiment of the present disclosure may also be disposed in the terminal device 101, 102, or 103.

Alternatively, the action recognition method provided by the embodiment of the present disclosure may generally be executed by the server 105. Accordingly, the action recognition apparatus provided by the embodiment of the present disclosure may generally be disposed in the server 105. The action recognition method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the action recognition apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.

For example, when an action needs to be recognized, the terminal devices 101, 102, and 103 may input the first time sequence feature vector of the video sequence to be recognized, which includes the action to be recognized, and the second time sequence feature vector of the template video sequence, which includes the template action, into the discriminator model, so as to obtain a target template action belonging to the same category as at least part of the action represented by the first time sequence feature vector. The action category of the action to be recognized is then determined according to the target template action. Alternatively, this processing may be performed by a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105, which then determines the action category of the action to be recognized.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

According to an embodiment of the present disclosure, the action recognition method may be implemented using a discriminator model obtained through adversarial training.

FIG. 2 schematically shows a flow chart of a method of training a discriminator model according to an embodiment of the present disclosure.

As shown in fig. 2, the method includes operations S210 to S230.

In operation S210, a first positive sample pair is determined, the positive sample pair including first template video data and second template video data, the first template video data and the second template video data each including a first template action.

In operation S220, a first negative sample pair is determined, the negative sample pair including third template video data and at least one of the first template video data and the second template video data, the third template video data including a second template action different from the first template action.

In operation S230, a discriminator model is trained using the first positive sample pair and the first negative sample pair.

According to an embodiment of the present disclosure, a plurality of template videos may be acquired in advance, and the first template video data, the second template video data, and the third template video data may be determined from them. The template videos include all required actions, namely the template actions. The first template video data and the second template video data may be determined from two template videos that include the same template action, thereby determining the first positive sample pair. At least one of the pair formed by the first template video data and the third template video data and the pair formed by the second template video data and the third template video data may be determined from two template videos that include different template actions, thereby determining the first negative sample pair. The discriminator model can be trained with the first positive sample pair and the first negative sample pair, so that it can recognize actions of the same category as the first template action.
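
By way of illustration only, the pairing of template videos described above might be sketched as follows. The function name `build_pairs`, the clip identifiers, and the action labels are hypothetical and do not appear in the disclosure; the sketch simply treats any two templates with the same action as a positive pair and any two with different actions as a negative pair.

```python
from itertools import combinations


def build_pairs(templates):
    """Split all template combinations into positive and negative pairs.

    templates: list of (clip_id, action_label) tuples (hypothetical format).
    """
    positives, negatives = [], []
    for (id_a, act_a), (id_b, act_b) in combinations(templates, 2):
        if act_a == act_b:
            positives.append((id_a, id_b))  # same template action
        else:
            negatives.append((id_a, id_b))  # different template actions
    return positives, negatives


# Two clips of the same action recorded in different scenes, plus one other action.
templates = [
    ("wave_outdoor", "wave"),
    ("wave_indoor", "wave"),
    ("clap_indoor", "clap"),
]
positives, negatives = build_pairs(templates)
```

In this toy input, the two "wave" clips form the positive pair, and each of them paired with the "clap" clip forms a negative pair.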

According to an embodiment of the present disclosure, each of the first template video data, the second template video data, and the third template video data includes an annotation specific to a video frame, which may characterize a current action category of each video frame.

According to the embodiment of the disclosure, the discriminator is trained on positive sample pairs and negative sample pairs formed from the annotated template video data. The discriminator takes two samples as input and outputs whether they belong to the same category, and the discriminator model obtained through such training can be used for action recognition.
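
As a minimal sketch of a discriminator that takes two samples and outputs whether they belong to the same category, the following uses a logistic model on the element-wise absolute difference of two feature vectors. This model choice, the toy two-dimensional features, and the names `pair_features`, `predict`, and `train` are assumptions made for illustration; the disclosure does not specify the network structure or the loss.

```python
import math


def pair_features(a, b):
    # Element-wise absolute difference between the two samples (an assumed choice).
    return [abs(x - y) for x, y in zip(a, b)]


def predict(weights, bias, a, b):
    # Probability that samples a and b show the same action category.
    z = bias + sum(w * f for w, f in zip(weights, pair_features(a, b)))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, dim, epochs=500, lr=0.5):
    # Plain stochastic gradient descent on the logistic (cross-entropy) loss.
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for a, b, label in pairs:
            err = predict(weights, bias, a, b) - label
            feats = pair_features(a, b)
            weights = [w - lr * err * f for w, f in zip(weights, feats)]
            bias -= lr * err
    return weights, bias


# Toy feature vectors standing in for template video data (hypothetical values).
wave_1, wave_2, clap = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
pairs = [
    (wave_1, wave_2, 1),  # positive pair: same template action
    (wave_1, clap, 0),    # negative pairs: different template actions
    (wave_2, clap, 0),
]
weights, bias = train(pairs, dim=2)
same = predict(weights, bias, wave_1, wave_2)
different = predict(weights, bias, wave_1, clap)
```

After training on these three pairs, `same` exceeds 0.5 and `different` falls below it, which is the same/different decision the text describes.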

Through the embodiment of the disclosure, the discriminator model is trained by utilizing the positive sample pair and the negative sample pair, so that the robustness of the discriminator model can be effectively improved.

The method shown in fig. 2 is further described below with reference to specific embodiments.

According to an embodiment of the present disclosure, determining the first positive sample pair comprises: based on the first generation scenario, first template video data including a first template action is generated. Based on the second generated scene, second template video data including the first template action is generated. Wherein the first generation scenario is different from the second generation scenario.

According to the embodiment of the disclosure, the first generated scene and the second generated scene may include a scene corresponding to at least one of different background sounds, different background colors, different recording angles, different character objects executing the first template action, and the like. The generation scenes of the two template video data in the positive sample pair may be different. For example, the first template video data may be video data generated by an object a in an outdoor scene, the second template video data may be video data generated by an object b in an indoor scene, and the like.

Through the embodiment of the disclosure, the first template video data and the second template video data containing the same template action are generated in different scenes and used for adversarial training of the discriminator model, so that the influence of external factors on action recognition can be effectively removed and the model is robust to environment, identity, and viewing angle.

According to the embodiment of the disclosure, when the discriminator model needs to support recognition of a newly added action, one or more template videos of the newly added action can be shot and labeled, and the label can represent the current action category of each video frame. On this basis, the training method of the discriminator model may further include: in response to an indication to add a third template action different from the first template action, generating fourth template video data including the third template action and fifth template video data including the third template action as a second positive sample pair; generating sixth template video data including a fourth template action different from the third template action; taking the sixth template video data and at least one of the fourth template video data and the fifth template video data as a second negative sample pair; and training the discriminator model by using the second positive sample pair and the second negative sample pair.

According to an embodiment of the present disclosure, the third template action may be a template action that needs to be added. The third template action may be the same as the second template action. The fourth template video data and the fifth template video data may be determined from the additionally shot template video. The sixth template video data can be obtained by shooting, or can be determined according to the first template video data and the second template video data. The sixth template video data may also be determined from the third template video data in the event that the third template action is different from the second template action.

According to the embodiment of the disclosure, the fourth template video data and the fifth template video data can be determined from the additionally shot template videos that include the same newly added template action, thereby determining the second positive sample pair. At least one of the pair formed by the fourth template video data and the sixth template video data and the pair formed by the fifth template video data and the sixth template video data may be determined from two template videos that include different template actions, thereby determining the second negative sample pair. The discriminator model can be trained with the second positive sample pair and the second negative sample pair, so that it can recognize actions of the same category as the third template action, namely the newly added template action.
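
The extension to a newly added action might be sketched as follows, under the assumption that previously collected templates are reused as the source of the sixth template video data. The function name `pairs_for_new_action` and all clip identifiers are hypothetical.

```python
def pairs_for_new_action(new_action, new_clips, existing):
    """Build the second positive and negative sample pairs for a newly added action.

    new_clips: clip ids that all show new_action (e.g. shot in different scenes).
    existing:  list of (clip_id, action_label) for previously collected templates.
    """
    # Every two clips of the new action form a positive pair.
    positives = [(a, b) for i, a in enumerate(new_clips) for b in new_clips[i + 1:]]
    # A new clip paired with any existing clip of a different action is a negative pair.
    negatives = [
        (clip, old_id)
        for clip in new_clips
        for old_id, old_action in existing
        if old_action != new_action
    ]
    return positives, negatives


existing = [("wave_outdoor", "wave"), ("clap_indoor", "clap")]
second_pos, second_neg = pairs_for_new_action("nod", ["nod_a", "nod_b"], existing)
```

Only the new clips need to be shot and labeled; the negatives come from templates that already exist.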

Through the embodiment of the disclosure, new positive sample pairs and negative sample pairs can be determined in a mode of additionally shooting the template video, the types of actions which can be supported and identified by the discriminator model are increased through training the discriminator model, and the flexible expansibility of the discriminator model is improved in a simpler implementation mode. In addition, the template acquisition and labeling steps only need single-section video shooting and simple labeling, and can be completed within several minutes, so that the time cost caused by data acquisition, data labeling and the like is reduced.

According to an embodiment of the present disclosure, generating fourth template video data including the third template action and fifth template video data including the third template action includes: based on the third generated scene, fourth template video data including the third template action is generated. Based on the fourth generated scene, fifth template video data including the third template action is generated. Wherein the third generation scenario and the fourth generation scenario are different.

According to an embodiment of the present disclosure, the third generated scene and the fourth generated scene may likewise include scenes that differ in at least one of background sound, background color, recording angle, the character object executing the third template action, and the like, but are not limited thereto.

According to the embodiment of the disclosure, the action recognition method can be implemented by using the above discriminator model obtained through adversarial training.

Fig. 3 schematically shows a flow chart of a motion recognition method according to an embodiment of the present disclosure.

As shown in fig. 3, the method includes operations S310 to S320.

In operation S310, a first time sequence feature vector of a to-be-recognized video sequence including a to-be-recognized action and a second time sequence feature vector of a template video sequence including a template action are input into a discriminator model, so as to obtain a target template action belonging to the same category as at least part of actions represented by the first time sequence feature vector.

In operation S320, an action category of the action to be recognized is determined according to the target template action.

According to an embodiment of the present disclosure, the first time sequence feature vector may be a depth feature vector corresponding to a portion of the video sequence to be recognized. The second time sequence feature vector may be a depth feature vector corresponding to a portion of the template video sequence. The template video may include at least one of the first template video data, the second template video data, the fourth template video data, and the fifth template video data used in the training process, or may be another template video newly added in a new round of training. The "at least part of the action" may be the partial action corresponding to any incomplete action segment of the action to be recognized.

According to the embodiment of the disclosure, each depth feature vector can be extracted frame by frame from the video to be recognized and the template video by a human body feature extraction network, such as FrankMocap (a 3D human body pose and shape estimation algorithm) or OpenPose (a human body pose estimation algorithm).

According to the embodiment of the disclosure, in human body action recognition, under the condition that the depth features to be extracted include body part features and gesture part features, the body part features and the gesture part features are different in feature dimension, and the body part features and the gesture part features can be respectively normalized and then spliced to obtain the depth feature vector representing the whole human body action.
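
A minimal sketch of normalizing the two feature groups separately and then splicing them is given below. L2 normalization is an assumption made here for illustration, since the disclosure does not name the normalization; the function names and the toy values are likewise hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-12):
    # Scale the vector to unit length; eps guards against division by zero.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def fuse_body_and_hand(body_feat, hand_feat):
    # Normalize each part on its own so that neither dominates because of its
    # dimension or scale, then concatenate into one per-frame vector.
    return l2_normalize(body_feat) + l2_normalize(hand_feat)


frame_vec = fuse_body_and_hand([3.0, 4.0], [0.0, 5.0, 0.0])
```

Here the 2-dimensional body part and the 3-dimensional gesture part each have unit length before being joined into a 5-dimensional frame vector.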

According to an embodiment of the present disclosure, the depth feature vectors extracted for the template video may be saved as a separate file to support multiplexing.
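
One possible way to keep template features in a separate file for reuse is sketched below. The JSON format, the file name, and the layout (action label mapped to a list of per-frame vectors) are assumptions for illustration; the disclosure only states that the features may be saved as a separate file.

```python
import json
import os
import tempfile


def save_template_features(path, features_by_action):
    # Persist the extracted template features so they need not be recomputed.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(features_by_action, fh)


def load_template_features(path):
    # Reload previously extracted template features.
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


cache_path = os.path.join(tempfile.mkdtemp(), "template_features.json")
save_template_features(cache_path, {"wave": [[0.6, 0.8], [0.5, 0.9]]})
restored = load_template_features(cache_path)
```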

According to the embodiment of the disclosure, the action to be recognized can be input into the trained discriminator model together with the template actions, and the discriminator model can determine the target template actions belonging to the same category as each partial action of the action to be recognized. For each partial action, one or more target template actions belonging to the same category can be obtained. From the target template actions matched to the partial actions, the target template action belonging to the same category as the whole action to be recognized can be further determined, and the action category of the action to be recognized can then be determined.
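
The disclosure does not state how the matches of the partial actions are combined into a category for the whole action. One simple possibility, shown purely as an assumption, is a majority vote over the per-window matches; the function name `vote_action` and the parameter `min_votes` are hypothetical.

```python
from collections import Counter


def vote_action(per_window_matches, min_votes=1):
    """Pick the template action matched most often across partial actions.

    per_window_matches: one list of matched template labels per partial action.
    Returns None when nothing matched or the winner has fewer than min_votes.
    """
    votes = Counter(label for matches in per_window_matches for label in matches)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None


overall = vote_action([["wave"], ["wave", "clap"], [], ["wave"]])
```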

Through the embodiment of the disclosure, action recognition is performed with the discriminator model obtained through adversarial training, so that the robustness of the action recognition process can be improved.

The method shown in fig. 3 is further described below with reference to specific embodiments.

According to the embodiment of the present disclosure, before the action recognition method is performed, the first time sequence feature vector needs to be acquired. The process of determining each first time sequence feature vector may include: for a target video frame in the video sequence to be recognized, acquiring a first frame sequence including the target video frame and at least one video frame adjacent to the target video frame; and taking the feature vector of the first frame sequence as the first time sequence feature vector.

According to the embodiment of the present disclosure, the target video frame may include one or more specified video frames in the video sequence to be recognized, or every video frame in that sequence. In view of the temporal continuity of actions, the feature vector of each video frame in the video to be recognized may, for example, be concatenated with the feature vectors of one or more adjacent video frames in the form of a window sequence to serve as the dynamic sequence feature of that video frame, that is, the first time sequence feature vector. For example, the feature vectors of the 30 frames before and after each video frame may be concatenated to serve as the dynamic sequence feature of that video frame. These 30 frames before and after may constitute the first frame sequence.
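
The window sequence described above might be sketched as follows. The sketch assumes that "30 frames before and after" means 30 frames on each side of the target frame, and it repeats the first or last frame at the video boundaries; both points are assumptions, as is the function name `window_feature`.

```python
def window_feature(frame_feats, index, radius=30):
    """Concatenate the features of the frames around `index` into one vector.

    frame_feats: list of per-frame feature vectors (lists of floats).
    radius:      number of neighboring frames taken on each side (assumed meaning).
    """
    last = len(frame_feats) - 1
    window = []
    for i in range(index - radius, index + radius + 1):
        clamped = min(max(i, 0), last)  # repeat border frames at the edges
        window.extend(frame_feats[clamped])
    return window


# Five frames with a one-dimensional feature each, for readability.
frames = [[float(i)] for i in range(5)]
edge_window = window_feature(frames, 0, radius=2)
full_window = window_feature(frames, 2)
```

With the default radius, each time sequence feature vector covers 61 frames (30 before, the target frame, and 30 after), so its length is 61 times the per-frame feature dimension.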

By the embodiment of the disclosure, the time sequence characteristic vector of each video frame is determined based on the consideration of the time sequence continuity of the action, so that the accuracy of the identification result can be effectively enhanced.

According to the embodiment of the present disclosure, before the action recognition method is performed, the second time sequence feature vector also needs to be acquired. The process of determining each second time sequence feature vector may include: for a target template video frame in the template video sequence, acquiring a second frame sequence including the target template video frame and the template video frames adjacent to the target template video frame; and taking the feature vector of the second frame sequence as the second time sequence feature vector.

According to an embodiment of the present disclosure, the target template video frame may include one or more specified template video frames in the template video sequence, or every template video frame in that sequence. For example, the feature vector of each video frame in the template video may likewise be concatenated with the feature vectors of one or more adjacent video frames in the form of a window sequence to serve as the dynamic sequence feature of that video frame, that is, the second time sequence feature vector. To achieve better discrimination, the feature vectors of the 30 frames before and after each video frame in the template video may, for example, be concatenated to serve as the dynamic sequence feature of that video frame. These 30 frames before and after may constitute the second frame sequence.

According to the embodiment of the disclosure, since the template video includes a plurality of template actions, in order to further improve the working efficiency of the discriminator model, for example, a template video segment corresponding to an action of each category may be determined according to a label in the template video, and then a second time sequence feature vector may be determined for each template video segment in combination with a window sequence.
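As an illustration of splitting a template video by its labels, the following sketch groups consecutive frames that carry the same label into segments. It assumes the labels are available as one label per frame, which the disclosure does not specify; the function name `label_segments` is hypothetical.

```python
from itertools import groupby


def label_segments(frame_labels):
    """Return (label, start, end) for each run of identical consecutive labels.

    frame_labels: sequence with one action label per template video frame.
    `end` is exclusive, so frames[start:end] is the segment.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments
```

Window splicing could then be applied inside each segment separately, so that a second time sequence feature vector never mixes frames from two different template actions.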

According to an embodiment of the present disclosure, inputting the first time sequence feature vector of the video sequence to be recognized, which includes the action to be recognized, and the second time sequence feature vector of the template video sequence, which includes the template action, into the discriminator model to obtain the target template action belonging to the same category as part of the action to be recognized includes: inputting the first time sequence feature vector and at least one second time sequence feature vector into the discriminator model to obtain a target second time sequence feature vector, wherein the similarity between the target second time sequence feature vector and the first time sequence feature vector is greater than or equal to a preset threshold; and taking the template action represented by the target second time sequence feature vector as the target template action.

According to the embodiment of the present disclosure, whether two actions belong to the same category may be determined, for example, according to the similarity of time-series feature vectors corresponding to the two actions, and a preset threshold may be constructed as a criterion for determining whether the actions belong to the same category. For example, when the similarity of the time sequence feature vectors corresponding to two actions is greater than or equal to the preset threshold, it may be determined that the two actions belong to the same category; when the similarity of the time sequence feature vectors corresponding to the two actions is smaller than the preset threshold, it can be determined that the two actions do not belong to the same category.

According to an embodiment of the present disclosure, the two actions may include a partial action in the action to be recognized and a labeled template action, or a partial action in the action to be recognized and a partial action in a labeled template action. When the similarity between the first time sequence feature vector corresponding to a certain part of the actions to be recognized and the second time sequence feature vector corresponding to a certain part of the labeled template actions is larger than or equal to a preset threshold value, the labeled template actions or the part of the actions can be used as target template actions belonging to the same category as the part of the actions to be recognized.

Through the embodiment of the disclosure, whether the action to be recognized and the template action belong to the same category is determined based on the similarity of the time sequence feature vectors, so that the accuracy of the discriminator model in the action recognition process can be effectively enhanced.
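The threshold rule can be sketched as follows. In the disclosure the similarity comes from the trained discriminator model; here cosine similarity is swapped in purely as a stand-in so that the example runs, and the names `cosine_score` and `match_templates` as well as the default threshold of 0.8 are hypothetical.

```python
import numpy as np


def cosine_score(a, b) -> float:
    """Stand-in similarity; a trained discriminator would supply this score."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def match_templates(first_vec, second_vecs, template_labels,
                    threshold=0.8, score_fn=cosine_score):
    """Return the labels of template vectors whose score reaches the threshold.

    first_vec: first time sequence feature vector (part of the action to recognize).
    second_vecs: second time sequence feature vectors (template actions).
    template_labels: action label of each entry in second_vecs.
    """
    matches = []
    for vec, label in zip(second_vecs, template_labels):
        # "Greater than or equal to" follows the rule stated in the text.
        if score_fn(first_vec, vec) >= threshold:
            matches.append(label)
    return matches
```

Passing the scoring function as a parameter keeps the threshold logic independent of how the similarity is produced.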

According to an embodiment of the present disclosure, the target template action includes a plurality of target template actions. Determining the action category of the action to be recognized according to the target template action includes: determining the number of occurrences of each of the plurality of target template actions; and taking the action category of the target template action with the largest number of occurrences as the action category of the action to be recognized.

According to the embodiment of the disclosure, a plurality of partial actions can be obtained correspondingly according to the action to be recognized, and the target template action belonging to the same category as each partial action can comprise one or more actions. The target template action for determining the action category of the action to be recognized may be a target template action that occurs the most frequently among a plurality of target template actions belonging to the same category as all of the partial actions in the action to be recognized.

For example, the template actions include lifting one hand, lifting both hands, waving hands, clasping the head with both hands, tying hair, and the like. Suppose that a first part of the action to be recognized is highly similar to a certain part of each of lifting one hand, lifting both hands, waving hands, clasping the head with both hands, and tying hair. If the similarity is greater than the preset threshold, lifting one hand, lifting both hands, waving hands, clasping the head with both hands, and tying hair can all be used as target template actions belonging to the same category as the first part of the action. A second part of the action to be recognized is then recognized as extending both hands upwards, which is highly similar, for example, to a certain part of each of lifting both hands, clasping the head with both hands, and tying hair; if the similarity is greater than the preset threshold, these three actions can be used as target template actions belonging to the same category as the second part of the action. For a further part of the action to be recognized, if its similarity to lifting both hands is greater than the preset threshold, lifting both hands can be used as the target template action belonging to the same category as that part of the action. Because lifting both hands belongs to the same category as at least three parts of the action to be recognized, that is, it occurs most often when the action to be recognized is compared against the templates, the action category of the action to be recognized can be determined to include lifting both hands.

By the embodiment of the disclosure, the target template action capable of determining the action type of the action to be recognized is determined based on the occurrence frequency of the target template action belonging to the same type as each part of the action to be recognized, and the action type of the action to be recognized is further determined, so that the accuracy of action recognition can be further enhanced.
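The occurrence count can be sketched with a simple vote. The sketch assumes that the target template actions found for each part of the action to be recognized are given as lists of labels; the function name `vote_action_category` is hypothetical, and ties are resolved in favour of the label seen first, which the disclosure does not specify.

```python
from collections import Counter


def vote_action_category(matches_per_part):
    """Pick the template action that occurs most often across all parts.

    matches_per_part: one list of matched template action labels per part
    of the action to be recognized.
    Returns the most frequent label, or None if nothing matched.
    """
    counts = Counter(label for matches in matches_per_part for label in matches)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For the hand-lifting example in the text, lifting both hands is matched by three parts while every other template action is matched by at most two, so the vote selects lifting both hands.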

The above-mentioned motion recognition method will be further described with reference to fig. 4A to 4D in conjunction with specific embodiments.

According to an embodiment of the present disclosure, the motion recognition method implemented based on the discriminator model mainly includes the following steps: data acquisition and discriminator training, feature extraction, discriminator processing, and post-processing to output the result.

Fig. 4A schematically illustrates a schematic diagram of data acquisition and discriminator training in accordance with an embodiment of the present disclosure.

As shown in fig. 4A, data acquisition may be implemented by collecting multiple segments of template video 401, which include the same or different template actions, in the same or different scenes. Different template actions in the template video 401 may be distinguished by annotations 403, which determine the action category of each part of the template action. The deep feature extraction network 402 may extract template features 404 from the template video 401. Adversarial training 405 can then be performed on the discriminator model 406 by using positive sample pairs constructed from template features representing the same template action and negative sample pairs constructed from template features representing different template actions, so that the discriminator model 406 can recognize the action to be recognized based on the template features.
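How positive and negative sample pairs might be assembled from annotated template features is sketched below. It assumes every template feature comes with an action label and simply enumerates all unordered pairs; the function name `build_pairs` is hypothetical, and a real pipeline would probably sample pairs rather than enumerate them all.

```python
from itertools import combinations


def build_pairs(samples):
    """Split all unordered pairs of samples into positive and negative pairs.

    samples: list of (feature, action_label) tuples.
    A pair is positive when both labels match and negative otherwise.
    """
    positives, negatives = [], []
    for (feat_a, label_a), (feat_b, label_b) in combinations(samples, 2):
        if label_a == label_b:
            positives.append((feat_a, feat_b))
        else:
            negatives.append((feat_a, feat_b))
    return positives, negatives
```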

It should be noted that the related process of the annotation 403 can be completed in the template video 401. The template features 404 extracted by the deep feature extraction network 402 may be stored as a separate file for reuse in subsequent discriminator processing.
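Storing the template features as a separate, reusable file could look like the following sketch. The disclosure does not name a file format; NumPy's `.npz` archive is an assumption, and the function names `save_template_features` and `load_template_features` are hypothetical.

```python
import numpy as np


def save_template_features(path, features, labels):
    """Write template features and their action labels to one .npz file."""
    np.savez(path, features=np.asarray(features), labels=np.asarray(labels))


def load_template_features(path):
    """Read back the arrays written by save_template_features."""
    with np.load(path) as data:
        return data["features"], data["labels"]
```

Because the features are computed once and read back at recognition time, only the video to be identified needs to pass through the feature extraction network later.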

Fig. 4B schematically illustrates a schematic diagram of feature extraction according to an embodiment of the present disclosure.

As shown in fig. 4B, since the template features 404 have already been saved as a separate, reusable file, this part of the feature extraction process only needs to extract the video features 408 of the video 407 to be identified. This part of the feature extraction process may also be implemented by the deep feature extraction network 402.

It should be noted that the video feature 408 may represent a first time-series feature vector corresponding to each part of the motion in the video 407 to be recognized. Template features 404 may represent a second temporal feature vector corresponding to portions of actions in the template video data.

FIG. 4C schematically shows a schematic diagram of a discriminator process according to an embodiment of the disclosure.

As shown in fig. 4C, the discriminator model 409 may take the template features 404 and the video features 408 as input and output, as the discrimination result 410, a similarity indicating whether a template feature and a video feature belong to the same category. One or more template features belonging to the same category as a video feature can then be determined from the discrimination result in combination with the preset threshold.

FIG. 4D schematically shows a diagram of post-processing output results according to an embodiment of the disclosure.

As shown in fig. 4D, the discrimination result 410 may embody one or more template features belonging to the same category as the respective video features. The post-processing module 411 may determine an action category of the action in the video to be recognized according to the occurrence number of each template feature, and output the action category through the output category 412.

The embodiments of the present disclosure provide a lightweight, high-precision, low-cost action recognition method. Compared with existing mainstream methods, the method greatly reduces the time consumed by the pipeline, scales well, and supports flexible extension to any newly added action. While maintaining the recognition effect, it greatly reduces research, development, and deployment costs. The method can be applied to products related to 3D virtual digital humans, including virtual anchors, virtual customer service, virtual assistants, virtual teachers, virtual idols, and the like, and can also be applied to fields such as label identification and agenda detection, supporting fast product iteration with good extensibility and excellent performance.

FIG. 5 schematically shows a block diagram of a training apparatus for a discriminator model according to an embodiment of the present disclosure.

As shown in fig. 5, the training apparatus 500 for the discriminator model includes a first determining module 510, a second determining module 520, and a first training module 530.

A first determination module 510 for determining a first positive sample pair. The positive sample pair includes first template video data and second template video data. The first template video data and the second template video data each include a first template action.

A second determining module 520 for determining the first negative sample pair. The negative sample pair includes third template video data and at least one of the first template video data and the second template video data. The third template video data includes a second template action different from the first template action.

A first training module 530 for training the discriminator model using the first positive sample pair and the first negative sample pair.
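To make the role of the positive and negative sample pairs concrete, the sketch below trains a very small discriminator on such pairs. It is not the training procedure of the disclosure: logistic regression on the element-wise absolute difference of the two feature vectors is swapped in as the simplest runnable stand-in, and the names `pair_input`, `train_discriminator`, and `score` as well as the learning rate and step count are hypothetical.

```python
import numpy as np


def pair_input(a, b):
    """Turn a pair of feature vectors into one input: their absolute difference."""
    return np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))


def train_discriminator(positive_pairs, negative_pairs, lr=0.1, steps=500):
    """Fit logistic regression so positive pairs score high and negative pairs low."""
    x = np.array([pair_input(a, b) for a, b in positive_pairs + negative_pairs])
    y = np.array([1.0] * len(positive_pairs) + [0.0] * len(negative_pairs))
    w = np.zeros(x.shape[1])
    bias = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + bias)))  # predicted same-category probability
        grad = p - y                               # gradient of the cross-entropy loss
        w -= lr * (x.T @ grad) / len(y)
        bias -= lr * grad.mean()
    return w, bias


def score(model, a, b) -> float:
    """Similarity in [0, 1] that the two features show the same action."""
    w, bias = model
    return float(1.0 / (1.0 + np.exp(-(pair_input(a, b) @ w + bias))))
```

The returned score plays the role of the similarity that is later compared against the preset threshold.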

According to the embodiment of the disclosure, the training device of the discriminator model further comprises a first generation module, a second generation module, a first definition module and a second training module.

A first generating module to generate, as a second positive sample pair, fourth template video data including a third template action and fifth template video data including the third template action in response to an indication to add the third template action different from the first template action.

And the second generation module is used for generating sixth template video data comprising a fourth template action different from the third template action.

And the first definition module is used for taking the sixth template video data and at least one of the fourth template video data and the fifth template video data as a second negative sample pair.

And the second training module is used for training the discriminator model by utilizing the second positive sample pair and the second negative sample pair.
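Adding a new template action only requires pairs that involve the new action. The sketch below assumes existing template features are kept in a dictionary keyed by action label and that the data containing a different action is drawn from that dictionary; the function name `pairs_for_new_action` and the storage layout are hypothetical.

```python
def pairs_for_new_action(feature_store, new_label, new_features):
    """Build the second positive and negative sample pairs for a newly added action.

    feature_store: dict mapping each existing action label to a list of features.
    new_label: label of the newly added template action.
    new_features: features of template videos that contain the new action.
    """
    # Positive pairs: two different recordings of the new action.
    positives = [
        (a, b)
        for i, a in enumerate(new_features)
        for b in new_features[i + 1:]
    ]
    # Negative pairs: a recording of the new action with a recording of another action.
    negatives = [
        (new_feat, old_feat)
        for label, old_feats in feature_store.items()
        if label != new_label
        for old_feat in old_feats
        for new_feat in new_features
    ]
    return positives, negatives
```

Pairs among the previously known actions do not need to be rebuilt, which is one way the flexible extension to newly added actions could be kept inexpensive.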

According to an embodiment of the present disclosure, the first determination module includes a first generation unit and a second generation unit.

A first generation unit configured to generate first template video data including a first template action based on the first generation scene.

And a second generation unit configured to generate second template video data including the first template action based on the second generation scene.

Wherein the first generation scenario is different from the second generation scenario.

According to an embodiment of the present disclosure, the first generation module includes a third generation unit and a fourth generation unit.

A third generating unit configured to generate fourth template video data including a third template action based on the third generation scene.

And a fourth generating unit configured to generate fifth template video data including the third template action based on the fourth generation scene.

Wherein the third generation scenario and the fourth generation scenario are different.

Fig. 6 schematically shows a block diagram of a motion recognition apparatus according to an embodiment of the present disclosure.

As shown in fig. 6, the motion recognition apparatus 600 includes an input module 610 and a determination module 620.

The input module 610 is configured to input a first time sequence feature vector of a to-be-recognized video sequence including a to-be-recognized action and a second time sequence feature vector of a template video sequence including a template action into the discriminator model, so as to obtain a target template action that belongs to the same category as at least part of actions represented by the first time sequence feature vector.

And the determining module 620 is configured to determine an action category of the action to be recognized according to the target template action.

Wherein the discriminator model is obtained based on the apparatus of any one of claims 10 to 13.

According to an embodiment of the present disclosure, an input module includes an input unit and a first defining unit.

And the input unit is used for inputting the first time sequence feature vector and at least one second time sequence feature vector into the discriminator model to obtain a target second time sequence feature vector, wherein the similarity between the target second time sequence feature vector and the first time sequence feature vector is greater than or equal to a preset threshold value.

And the first defining unit is used for taking the template action represented by the target second time sequence feature vector as the target template action.

According to an embodiment of the present disclosure, the target template action includes a plurality of target template actions, and the determination module includes a determination unit and a second definition unit.

A determining unit for determining the number of occurrences of each of the plurality of target template actions.

And the second definition unit is used for taking the action type of the target template action with the largest occurrence frequency as the action type of the action to be recognized.

According to an embodiment of the present disclosure, the action recognition apparatus further includes a first obtaining module and a second defining module.

The first obtaining module is used for obtaining, for a target video frame in the video sequence to be identified, a first frame sequence comprising the target video frame and at least one video frame adjacent to the target video frame.

And the second defining module is used for taking the feature vector of the first frame sequence as the first time sequence feature vector.

According to an embodiment of the present disclosure, the action recognition apparatus further includes a second obtaining module and a third defining module.

And the second obtaining module is used for obtaining, for a target template video frame in the template video sequence, a second frame sequence comprising the target template video frame and a template video frame adjacent to the target template video frame.

And the third defining module is used for taking the feature vector of the second frame sequence as a second time sequence feature vector.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.

According to an embodiment of the present disclosure, a non-transitory computer readable storage medium has computer instructions stored thereon, the computer instructions being used to cause a computer to perform the method as described above.

According to an embodiment of the present disclosure, a computer program product comprises a computer program which, when executed by a processor, implements the method as described above.

FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The computing unit 701 may be any of a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 executes the methods and processes described above, such as the training method of the discriminator model and the action recognition method. For example, in some embodiments, the training method of the discriminator model and the action recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method of the discriminator model and the action recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the discriminator model and the action recognition method.

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method of training a discriminator model, comprising:

determining a first positive sample pair, the positive sample pair comprising first template video data and second template video data, the first template video data and the second template video data each comprising a first template action;

determining a first negative sample pair comprising third template video data and at least one of the first template video data and the second template video data, the third template video data comprising a second template action different from the first template action; and

training the discriminator model using the first positive sample pair and the first negative sample pair.

2. The method of claim 1, further comprising:

in response to an indication to add a third template action different from the first template action, generating fourth template video data comprising the third template action and fifth template video data comprising the third template action as a second positive sample pair;

generating sixth template video data comprising a fourth template action different from the third template action;

taking the sixth template video data and at least one of the fourth template video data and the fifth template video data as a second negative sample pair; and

training the discriminator model using the second positive sample pair and the second negative sample pair.

3. The method of claim 1, wherein the determining a first positive sample pair comprises:

generating first template video data including the first template action based on a first generation scenario; and

generating second template video data including the first template action based on a second generation scene,

wherein the first generation scene is different from the second generation scene.

4. The method of claim 2, wherein the generating fourth template video data comprising the third template action and fifth template video data comprising the third template action comprises:

generating fourth template video data including the third template action based on a third generation scene; and

generating fifth template video data including the third template action based on a fourth generation scene,

wherein the third generation scene is different from the fourth generation scene.

5. A motion recognition method, comprising:

inputting a first time sequence feature vector of a video sequence to be recognized comprising a motion to be recognized and a second time sequence feature vector of a template video sequence comprising a template motion into a discriminator model to obtain a target template motion which belongs to the same category as at least part of the motion represented by the first time sequence feature vector; and

determining the action type of the action to be recognized according to the target template action;

wherein the discriminator model is trained based on the method of any one of claims 1 to 4.

6. The method of claim 5, wherein inputting a first time-series feature vector of a video sequence to be recognized including an action to be recognized and a second time-series feature vector of a template video sequence including a template action into a discriminator model to obtain a target template action belonging to a same category as at least part of the actions characterized by the first time-series feature vector comprises:

inputting a first time sequence feature vector and at least one second time sequence feature vector into the discriminator model to obtain a target second time sequence feature vector, wherein the similarity between the target second time sequence feature vector and the first time sequence feature vector is greater than or equal to a preset threshold; and

taking the template action represented by the target second time sequence feature vector as the target template action.

7. The method of claim 5, wherein the target template actions include a plurality of target template actions, and determining the action category of the action to be recognized from the target template actions includes:

determining a number of occurrences of each of the plurality of target template actions; and

taking the action type of the target template action with the largest occurrence number as the action type of the action to be recognized.

8. The method of claim 5, further comprising:

aiming at a target video frame in the video sequence to be identified, acquiring a first frame sequence comprising the target video frame and at least one video frame adjacent to the target video frame; and

taking the feature vector of the first frame sequence as the first time sequence feature vector.

9. The method of claim 5, further comprising:

aiming at a target template video frame in the template video sequence, acquiring a second frame sequence comprising the target template video frame and a template video frame adjacent to the target template video frame; and

taking the feature vector of the second frame sequence as the second time sequence feature vector.

10. An apparatus for training a discriminator model, comprising:

a first determining module, configured to determine a first positive sample pair, where the positive sample pair includes first template video data and second template video data, and the first template video data and the second template video data both include a first template action;

a second determining module, configured to determine a first negative sample pair, where the negative sample pair includes third template video data and at least one of the first template video data and the second template video data, and the third template video data includes a second template action different from the first template action; and

a first training module for training the discriminator model using the first positive sample pair and the first negative sample pair.
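The positive and negative sample pairs handled by the determining modules of claim 10 can be illustrated as follows. This is a minimal sketch under stated assumptions: template videos are represented by identifiers grouped by template action, a label of 1 marks a positive pair and 0 a negative pair, and the name `build_sample_pairs` is hypothetical. It does not reproduce the training procedure itself.

```python
from itertools import combinations


def build_sample_pairs(videos_by_action):
    """Build labeled pairs from template videos grouped by template action.

    videos_by_action: dict mapping a template action to a list of video ids.
    Returns (positives, negatives), each a list of (video_a, video_b, label).
    """
    positives, negatives = [], []
    actions = list(videos_by_action)
    # Positive pairs: two template videos containing the same template action.
    for action in actions:
        for a, b in combinations(videos_by_action[action], 2):
            positives.append((a, b, 1))
    # Negative pairs: template videos containing different template actions.
    for act_a, act_b in combinations(actions, 2):
        for a in videos_by_action[act_a]:
            for b in videos_by_action[act_b]:
                negatives.append((a, b, 0))
    return positives, negatives
```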

11. The apparatus of claim 10, further comprising:

a first generation module, configured to generate, in response to an indication of addition of a third template action different from the first template action, fourth template video data including the third template action and fifth template video data including the third template action as a second positive sample pair;

a second generation module for generating sixth template video data including a fourth template action different from the third template action;

a first defining module, configured to use the sixth template video data together with at least one of the fourth template video data and the fifth template video data as a second negative sample pair; and

12. The apparatus of claim 10, wherein the first determining module comprises:

a first generation unit configured to generate first template video data including the first template action based on a first generation scene; and

a second generation unit configured to generate second template video data including the first template action based on a second generation scene,

13. The apparatus of claim 11, wherein the first generation module comprises:

a third generating unit configured to generate fourth template video data including the third template action based on a third generation scene; and

a fourth generation unit configured to generate fifth template video data including the third template action based on a fourth generation scene,

14. An action recognition apparatus, comprising:

an input module, configured to input a first time sequence feature vector of a video sequence to be recognized, which includes an action to be recognized, and a second time sequence feature vector of a template video sequence, which includes a template action, into a discriminator model to obtain a target template action that belongs to the same category as at least part of the action represented by the first time sequence feature vector; and

a determining module, configured to determine the action category of the action to be recognized according to the target template action;

wherein the discriminator model is trained by the apparatus of any one of claims 10 to 13.

15. The apparatus of claim 14, wherein the input module comprises:

an input unit, configured to input the first time sequence feature vector and at least one second time sequence feature vector into the discriminator model to obtain a target second time sequence feature vector, wherein the similarity between the target second time sequence feature vector and the first time sequence feature vector is greater than or equal to a preset threshold; and

16. The apparatus of claim 14, wherein the target template action comprises a plurality of target template actions, the determining module comprising:

a determination unit configured to determine a number of occurrences of each of the plurality of target template actions; and

17. The apparatus of claim 14, further comprising:

a first obtaining module, configured to obtain, for a target video frame in the video sequence to be recognized, a first frame sequence including the target video frame and at least one video frame adjacent to the target video frame; and

a second defining module, configured to use the feature vector of the first frame sequence as the first time sequence feature vector.

18. The apparatus of claim 14, further comprising:

a second obtaining module, configured to obtain, for a target template video frame in the template video sequence, a second frame sequence including the target template video frame and a template video frame adjacent to the target template video frame; and

a third defining module, configured to use the feature vector of the second frame sequence as the second time sequence feature vector.

19. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4 or 5-9.

20. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-4 or 5-9.

21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-4 or 5-9.