CN115527080A - Method for generating video motion recognition model and electronic equipment - Google Patents

Method for generating video motion recognition model and electronic equipment

Info

Publication number
CN115527080A
Authority
CN
China
Prior art keywords
action
video
sampling
video data
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211101991.3A
Other languages
Chinese (zh)
Inventor
孙熠
孙凯
杨晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Application filed by Alibaba China Co Ltd
Priority to CN202211101991.3A
Publication of CN115527080A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition

Abstract

The embodiment of the application discloses a method for generating a video action recognition model and an electronic device. The method comprises the following steps: acquiring training sample data, wherein the training sample data comprises a plurality of pieces of video data; and, in the process of training the video action recognition model, for the unsupervised training part, generating input data of the model by performing video frame sampling on original video data that has not been segmented into action segments, wherein, for each sampling, the sampling range in the time dimension is determined according to the average number of frames over which a single action lasts. The embodiment of the application can save the manual processing cost of the training sample data.

Description

Method for generating video motion recognition model and electronic equipment
Technical Field
The present application relates to the field of video motion recognition technologies, and in particular, to a method for generating a video motion recognition model and an electronic device.
Background
Video understanding and recognition is one of the basic tasks of computer vision, and video action recognition is a challenging task among them with high practical application value. Video action recognition means using a video action recognition model to recognize the specific action categories that occur in a video.
In the prior art, each piece of video data needs to be watched manually and segmented into action segments action by action, and the segments may further need to be labeled, so the manual operation cost is very high.
Disclosure of Invention
The application provides a method for generating a video motion recognition model and electronic equipment, which can save the manual processing cost of training sample data.
The application provides the following scheme:
a method of generating a video motion recognition model, comprising:
acquiring training sample data, wherein the training sample data comprises a plurality of video data;
in the process of training a video motion recognition model, for an unsupervised training part, generating input data of the model based on a mode of performing video frame sampling on original video data which is not segmented according to motion segments, wherein in each sampling, a sampling range in a time dimension is determined according to an average continuous frame number of a single motion.
Wherein the method further comprises:
when the same group of differentiated input data is constructed by performing video frame sampling on the same video data at least twice, each video frame sampling has the same sampling starting point and the same sampling range on a time axis, so that the training effect of the algorithm model is evaluated through the constructed multiple groups of differentiated input data.
Wherein, in the process of performing at least two video frame samplings on the same video data, one of the video frame samplings determines the frame interval by uniform sampling, and the other video frame samplings determine the sampling frame interval by adding an offset on the basis of the uniform sampling.
Wherein the offset is a fixed value or a random value.
Wherein the method further comprises:
and selecting part of representative video data from the multiple pieces of video data for segmenting into a plurality of action segments, and acquiring action category marking information of each action segment so as to train the algorithm model in a mode of combining supervision and unsupervised.
Wherein, the selecting part of representative video data from the plurality of video data comprises:
and selecting partial video data corresponding to different characters at different generation times from the multiple pieces of video data.
Wherein, the selecting part of representative video data from the plurality of video data comprises:
predicting, by using the algorithm model, the action segments contained in the video data and the corresponding action categories;
determining, from a plurality of action segments corresponding to the same action category, partial action segments that differ relatively greatly from each other, as representative action segments corresponding to that action category;
and taking the video data in which the representative action segments are located as the selected part of representative video data participating in the supervised training.
Wherein the determining, from the plurality of action segments corresponding to the same action category, of the partial action segments that differ relatively greatly from each other comprises:
acquiring, by using the algorithm model, feature vectors of the plurality of action segments included in the same action category;
determining difference quantization values between different action segments by calculating distances between the feature vectors of the plurality of action segments;
and determining the partial action segments that differ relatively greatly from each other according to the difference quantization values between the different action segments.
A video processing method, comprising:
carrying out video acquisition on the production working process of workers in a target factory to obtain video data;
identifying the action type executed by the worker in the video data by utilizing a pre-generated video action identification model; when the video motion recognition model is trained, generating input data of the model based on a mode of sampling video frames of original video data which are not segmented according to motion segments for an unsupervised training part, wherein a sampling range on a time dimension is determined according to an average continuous frame number of a single motion during each sampling;
and determining whether the action executed by the worker in the production working process meets the specification or not according to the identified action type and the action type specification information required to be executed by the production link where the worker is located.
Wherein the method further comprises:
and if the action executed by the worker in the production working process does not meet the specification, sending prompt information to the worker.
Wherein the method further comprises:
and if the action executed by the worker in the production working process does not meet the specification, determining corresponding production object identification information and providing quality inspection prompt information for a corresponding quality inspection user.
An apparatus for generating a video motion recognition model, comprising:
the training sample data acquisition unit is used for acquiring training sample data, and the training sample data comprises a plurality of video data;
and the input data construction unit is used for generating input data of the model based on a mode of performing video frame sampling on original video data which is not segmented according to the action segments for an unsupervised training part in the process of training the video action recognition model, wherein in each sampling, a sampling range in a time dimension is determined according to the average continuous frame number of single action.
A video processing apparatus comprising:
the video acquisition unit is used for carrying out video acquisition on the production working process of workers in a target factory so as to obtain video data;
the action recognition unit is used for recognizing the action type executed by the worker in the video data by utilizing a pre-generated video action recognition model; when the video motion recognition model is trained, generating input data of the model based on a mode of performing video frame sampling on original video data which is not segmented according to motion segments for an unsupervised training part, wherein a sampling range in a time dimension is determined according to an average continuous frame number of single motion during each sampling;
and the judging unit is used for determining whether the action executed by the worker in the production working process accords with the standard or not according to the identified action type and the action type standard information required to be executed in the production link where the worker is located.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the preceding claims.
An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding claims.
According to the specific embodiments provided herein, the present application discloses the following technical effects:
according to the scheme provided by the embodiment of the application, in the process of training the video motion recognition model, if the unsupervised training is involved, the unsupervised training can be directly carried out by adopting the original video data which is not segmented, namely, the specific input data is constructed by directly carrying out video frame sampling from the original video data. Therefore, the original video data does not need to be segmented or cut, and therefore the manual operation cost in the model training process can be reduced. In addition, during sampling each time, the sampling range on the time dimension can be determined according to the average continuous frame number of a single action, so that the probability of the input data crossing action categories is greatly reduced, and the effectiveness of the model input data can be ensured to a certain extent.
In addition, when the same group of differentiated input data is constructed by performing video frame sampling at least twice on the same video data, each video frame sampling may have the same sampling start point and the same sampling range on the time axis, so that the training effect of the algorithm model can be evaluated through the constructed multiple groups of differentiated input data. In this way, the requirement that the differentiated input data within the same group correspond to the same or similar action content can be met, so that the multiple groups of differentiated input data can be used to evaluate the training effect of the model.
Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic diagram of a training sample processing method in the prior art;
FIG. 2 is a schematic diagram of a training sample processing method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a sampling mode provided by an embodiment of the present application;
FIG. 4 is a flow chart of a first method provided by an embodiment of the present application;
FIG. 5 is a flow chart of a second method provided by embodiments of the present application;
FIG. 6 is a schematic diagram of a first apparatus provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a second apparatus provided by an embodiment of the present application;
fig. 8 is a schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.
In order to facilitate understanding of the solutions provided in the embodiments of the present application, an algorithm model training method in the prior art is first described below.
Algorithm model training can generally be divided into three modes, namely supervised training, unsupervised training and semi-supervised training, and video motion recognition models are no exception. The three modes are briefly introduced below.
1. Supervised training means that each piece of input data has corresponding label information, and the label information is used as the theoretical (ground-truth) result for model training. That is, supervised training requires that all input data pass through a data labeling process. If the video motion recognition model is trained entirely in the supervised mode, a large amount of action video data needs to be collected and labeled in full. Specifically, during labeling, an annotator needs to watch the complete video data manually, mark the start and stop times of a plurality of action segments and the corresponding action categories, and thereby divide one piece of video data into a plurality of action segments, where each action segment contains only one complete action together with its action category labeling information. Although this method can obtain relatively accurate prediction results, the labeling process takes a lot of labor, so the cost is relatively high. In addition, the generalization of a model trained in this way is relatively poor. Generalization refers to the gap between the accuracy of a model when processing data it has never seen and its accuracy when processing the training data: the smaller the gap, the stronger the generalization of the model.
2. Unsupervised training refers to model training according to the similarities and differences between the relationships among multiple pieces of input data and the relationships among the corresponding model outputs, where the input data have no known correct label information. If the video motion recognition model is trained in the unsupervised mode, the specific action categories do not need to be labeled, which means that labor cost is reduced, and compared with the supervised mode the model can obtain stronger generalization. However, since an original video is usually composed of a plurality of action segments connected together on the time axis, while the algorithm model is trained on the premise that one piece of input data contains only one action, in the prior art the original video data still needs to be segmented into individual action segments even when the model is trained in the unsupervised mode, with each action segment containing only one complete action. In addition, in the process of segmenting the original video data, the start and end time points of each segment still need to be marked by manual viewing. That is, compared with labeling data in the supervised mode, only the step of labeling the action category of a specific action segment is omitted. However, in the overall process of action video annotation, action segmentation is the step that consumes most of the manual judgment time: by the time an annotator has segmented an action, he or she has already watched the action segment completely and therefore already knows its action category, so skipping the category labeling only saves a simple click, while the time required to watch the video completely for segmentation cannot be avoided.
3. Semi-supervised training refers to a combination of supervised training and unsupervised training, that is, a part of training data is labeled, and the rest of training data may not be labeled. Specifically, the labeled data and the unlabeled data may be mixed in a certain ratio, for example, the labeled data may be no more than 10%, specifically, the labeled data is 10%, the unlabeled data is 90%, and the like. In the training process, the two modes can be performed alternately, or can be performed simultaneously, for example, part of the data input into the model at one time can be labeled data, and the other part can be unlabeled data, and the like. The semi-supervised mode can give consideration to the advantages of supervised training and unsupervised training, and can obtain a model with higher prediction accuracy and stronger generalization through lower cost.
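As a simple illustration of the mixing described above, the following Python sketch draws one training batch containing roughly 10% labeled and 90% unlabeled samples; the helper name and data layout are illustrative assumptions, not part of this application.

```python
import random

def build_mixed_batch(labeled_pool, unlabeled_pool, batch_size=32, labeled_ratio=0.1):
    """Draw one training batch that mixes labeled and unlabeled samples.

    labeled_pool:   list of (clip, action_label) pairs
    unlabeled_pool: list of clips with no label
    """
    n_labeled = max(1, int(batch_size * labeled_ratio))
    batch = random.sample(labeled_pool, n_labeled)
    batch += random.sample(unlabeled_pool, batch_size - n_labeled)
    random.shuffle(batch)  # labeled and unlabeled data are fed to the model together
    return batch
```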
However, in the process of training the video motion recognition model by the semi-supervised mode, since the unsupervised training part is still involved, if the mode in the prior art is adopted, the specific video data also needs to be segmented to obtain a plurality of motion segments, so as to ensure that each motion segment only contains one complete motion. That is, as shown in fig. 1, it shows a data processing flow of the semi-supervised training in the above-described manner. That is, in the supervised training part, the video data in the sample needs to be divided into a plurality of action segments, and each action segment is labeled with specific action category information; in the unsupervised training part, the video data in the sample also needs to be cut into a plurality of action segments, but the specific action category does not need to be labeled. Therefore, the semi-supervised training mode still has the problem of high labor cost occupied in the segmentation process.
In view of the above situation, in the embodiment of the present application, a semi-supervised training mode may be adopted, and meanwhile, the data input form of the unsupervised training part included therein is improved to reduce the training cost, and of course, the improved mode is also applicable to a completely unsupervised training mode.
In the process of the unsupervised training, original video data which are not segmented can be directly used as input data of the model for training. Due to the fact that segmentation is not needed, each video does not need to be seen in advance in a manual mode, starting and stopping time marks do not need to be added, and the like, and therefore labor cost can be greatly saved. For example, as shown in fig. 2, in the embodiment of the present application, if a semi-supervised training mode is adopted, a small part of video data may be subjected to processing such as segmentation and labeling, and another large part of data may be directly involved in training of the model without segmentation or labeling.
Of course, since the raw video data that is not segmented may be formed by connecting a plurality of motion segments, it does not meet the assumption that only one motion is included in one input data in the specific model training process. In view of the above situation, when the input data is constructed, a sampling process of video frames is involved (the length of each input data needs to be kept consistent, that is, the number of frames is the same, so that the purpose can be achieved by sampling), and therefore, the above assumption can be satisfied as much as possible by improving a specific sampling manner.
Specifically, regarding the sampling of video frames, the prior-art scheme in which action segments are segmented in advance is first taken as an example. Motion recognition processes a motion video clip (i.e., an action segment) as the input of the recognition model, and the model outputs a classification result for the action category of that clip. Although the algorithm model can process multiple pieces of input data at a time, the classification task generally requires that the dimension (i.e., the length; in this embodiment, specifically, the number of video frames contained in each piece of input data) of a single piece of input data be fixed, whereas the numbers of video frames contained in different action segments are usually not the same. Therefore, each action segment needs to undergo video frame sampling, that is, a fixed number of video frames are sampled from the complete action segment in a preset sampling manner to represent the action segment, and are input to the model for processing.
In the training process of the model, one premise is that one piece of input data only contains the same action content, and for the condition that action segments are segmented in advance in the prior art, each piece of input data is obtained by sampling video frames from the action segments, so that the premise can be naturally met. However, in the embodiment of the present application, since the original video that is not sliced is sampled, and the original video may include a plurality of different types of action contents, if control is not performed, a situation that the same input data includes a plurality of different action contents may result. For this reason, in the embodiment of the present application, a specific sampling manner is improved, and specifically, a sampling range may be limited according to an average number of continuous frames of a single motion. For example, assuming that the average number of continuous frames for a single action is 25 frames, each sample may be made in the range of 25 frames. For example, the length of the input data is 8 frames, the starting point of a certain sampling is the 10 th frame in the original video, then this sampling can sample 8 video frames from the 10 th frame to the 35 th frame in the video as one piece of input data, and so on. Thus, since a single action will last around 25 frames on average (assumed for the purposes of illustration only), one sample, if taken within the sample range, will greatly reduce the probability that the input data will cross the action class. In other words, in this way, the constructed input data can be made to include the action content of only one action type as much as possible, and even if the action type is crossed, the action type is only crossed by two action types as much as possible, instead of multiple action types, so that the effectiveness of most input data can be guaranteed at least.
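As a rough illustration of this window-limited sampling, the following is a minimal Python sketch under the assumptions used in the example above (8 input frames, an average single-action duration of 25 frames); the function name is illustrative and not part of this application.

```python
import random

def sample_clip(num_video_frames, clip_len=8, window=25):
    """Sample clip_len frame indices from within one window of `window` consecutive
    frames, so that a piece of input data rarely spans more than one action."""
    start = random.randint(0, max(0, num_video_frames - window))
    end = min(start + window, num_video_frames)
    step = (end - start) / clip_len
    # Spread the clip_len sampled frames evenly over the window.
    return [start + int(i * step) for i in range(clip_len)]

# Example: one 8-frame piece of input data drawn from a 300-frame raw video.
print(sample_clip(300))
```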
In addition, in unsupervised training, since the real labels of the input data are unknown, in order to evaluate the training effect of the model, at least two different transformations are usually performed on the same action segment to construct a group of differentiated input data, and the model is then trained under the constraint that the actual categories of the differentiated data within the same group are consistent, by maximizing the similarity between the two corresponding outputs. Specifically, during self-supervised training for motion recognition, not only does the image content of each sampled video frame need to be transformed, but the time span covered by the sampled video frames also needs to be varied by using different sampling modes and sampling parameters. For example, as shown in fig. 3 (A), if 31 denotes an action segment, two different pieces of input data (i.e., a group of differentiated input data) can be constructed by sampling the action segment twice; the two pieces of input data contain the same number of video frames but differ in which video frames they are composed of. For example, the first piece of input data may be the 1st, 4th, 7th and 10th frames of the action segment, the second piece may be the 3rd, 6th, 9th and 12th frames, and so on. Since the two pieces of input data are sampled from the same action segment, if the model is trained well, the outputs of the model for the two pieces of input data should be the same, or at least highly similar. Therefore, a loss function can be constructed from the similarity between the model outputs corresponding to the two pieces of differentiated input data constructed from the same action segment, and the parameter values are continuously adjusted during model training so as to maximize that similarity, until the algorithm converges.
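The similarity-based objective described above can be sketched as follows; NumPy and cosine similarity are assumptions for illustration, since the embodiment does not prescribe a particular similarity measure.

```python
import numpy as np

def pair_loss(out_a, out_b):
    """Negative cosine similarity between the model outputs for two differentiated
    clips sampled from the same action content; minimizing this loss maximizes the
    similarity between the two outputs."""
    cos = np.dot(out_a, out_b) / (np.linalg.norm(out_a) * np.linalg.norm(out_b) + 1e-8)
    return -float(cos)

# Stand-in outputs for one group of differentiated input data.
z1 = np.array([0.2, 0.9, 0.1])
z2 = np.array([0.25, 0.85, 0.12])
print(pair_loss(z1, z2))  # close to -1 when the two outputs agree
```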
In the case where the video has been divided into action segments in advance, since the differentiated input data are constructed on the basis of a specific action segment, each sampling of the same action segment may use a fixed frame interval combined with a random initial position. For example, in the example shown in fig. 3 (A), the two sampling operations may both use an interval of 3 frames, with the first sampling starting from the 1st frame and the second sampling starting from the 3rd frame, and so on.
In the embodiment of the present application, original video data that has not been segmented is directly used for training, and different pieces of original video data differ in length, so such original video data also needs to be sampled. However, since the original video data is not segmented, it may contain a plurality of different action segments, some non-action segments, and so on. For example, as shown at 32 in fig. 3 (B), the action segments and the non-action segments are represented by blocks of different gray levels; the figure only shows the distinction between action segments and non-action segments and does not show the distinctions between different action segments, but it should be understood that different action segments in the same piece of video data generally correspond to different actions.
In the situation shown in fig. 3 (B), if the sampling mode of a fixed interval combined with a random initial position shown in fig. 3 (A) is still adopted, the group of differentiated data constructed by the two (or more) samplings may differ too much, that is, the actual action content contained in the fixed number of video frames obtained by the two samplings may be very different, so that the precondition required in unsupervised training, namely that the differentiated results are transformations of the same data, no longer holds, and the recognition result deteriorates. As shown in fig. 3 (B), the upper and lower diagrams illustrate two sampling operations on the same video data, where the arrows indicate the positions of the sampled video frames in the video. It can be seen that, since the starting positions of the two samplings are random, the video frames of the first sampling mainly come from the 1st action segment and the 2nd non-action segment, plus some video frames of the 2nd action segment, while the video frames of the second sampling mainly come from the 2nd and 3rd action segments, and so on. As a result, the two constructed pieces of input data correspond to different action segments, and obviously such a pair of input data cannot be used to evaluate the effect of model training.
In view of the above situation, in the embodiment of the present application, on the basis of performing unsupervised training with un-segmented original video data, in order to construct multiple groups of differentiated input data that can effectively evaluate the model training effect, a differentiated data construction mode of "fixed sampling start point + fixed sampling range" may be adopted. That is, in the process of performing at least two sampling operations on the same video data to construct a group of differentiated input data, each sampling operation may have the same sampling start point and the same sampling range on the time axis. For example, as shown in fig. 3 (C), each sampling may start from the 10th frame of the video and sample video frames within the 25 frames following that frame, and so on. This sampling method fully considers the fact that the distribution of valid action segments in an un-segmented video is unknown, and limits the coverage of each group of sampled frames by fixing the sampling start point and the sampling range, thereby ensuring to the greatest extent that the sampled frames contain similar action content. Of course, in order to keep the sampling results differentiated from each other, the frame interval can be varied between samplings: for example, one video frame sampling may determine the frame interval by uniform sampling, and the other video frame samplings may determine the sampling frame interval by adding an offset on the basis of the uniform sampling, where the offset may be random or a fixed value, and so on. In addition, the aforementioned sampling range can likewise be determined according to the average number of frames over which a single action lasts. For example, assuming that a single action typically lasts 25 frames, the sampling range may be set to 25 frames, which reduces the probability that the same piece of input data spans multiple action categories.
In summary, according to the embodiment of the present application, when a video motion recognition model is trained, if an unsupervised training mode is involved, the training may be performed directly based on raw video data that is not segmented, but in order to ensure the validity of input data, a sampling range of each sampling may be limited, and specifically, the sampling range may be determined according to an average continuous frame number of a single motion, so that a probability of a cross-motion category in the obtained input data is reduced. In addition, in constructing a plurality of sets of differential input data for evaluating the training effect of the algorithm model by sampling video data, when sampling is performed a plurality of times based on the same video data to construct the same set of differential input data, each video frame sample may have the same sampling start point and the same sampling range on a time axis. In this way, the similarity of action contents contained in input data obtained by multiple times of sampling is ensured to the maximum extent, so that the method can be used for effectively evaluating the model training result.
The following describes in detail specific implementations provided in embodiments of the present application.
Example one
First, an embodiment of the present application provides a method for generating a video motion recognition model, referring to fig. 4, where the method may include:
s401: obtaining training sample data, wherein the training sample data comprises a plurality of video data.
Specifically, the embodiment of the present application mainly trains a video motion recognition model, so the specific training samples may mainly be video data containing specific action content. Moreover, a specific video motion recognition model may be trained for a specific scene; for example, in the intelligent factory scenario described in the second embodiment, if the actions of workers in a smart garment factory need to be recognized, the specific training samples may be video data collected from the work process of workers in such a smart garment factory. Specifically, new video data may be generated every day and for every worker, and such video data may be used as training samples in the embodiment of the present application to participate in the training of the video motion recognition model.
In the embodiment of the present application, a semi-supervised training mode may mainly be adopted. The semi-supervised training process may include a supervised training part and an unsupervised training part. Therefore, after the specific training samples are obtained, a part of the training samples may be set aside in a certain proportion to participate in the supervised training process; this part of the training samples needs to be divided in advance into a plurality of action segments, so that each action segment contains the content of only one action, and these action segments are labeled, that is, a specific action category is annotated for each action segment. The other training samples, which participate in the unsupervised training, do not need to be segmented or labeled.
It should be noted that, in the semi-supervised training mode, although a combination of supervised and unsupervised modes is involved, specific algorithm models may be the same, that is, the same algorithm model is trained in the supervised and unsupervised modes, and in the specific training process, the supervised mode and the unsupervised mode may be performed alternately or simultaneously, and so on. Of course, the scheme provided by the embodiment of the application can also be applied to a pure unsupervised training process.
S402: in the process of training the video motion recognition model, for an unsupervised training part, generating input data of the model based on a mode of performing video frame sampling on original video data which is not segmented according to motion segments, wherein in each sampling, a sampling range in a time dimension is determined according to an average continuous frame number of single motion.
In the embodiment of the present application, the unsupervised training part can be performed directly on original video data that has not been segmented; specifically, when a piece of input data is constructed for the algorithm model, video frame sampling is performed directly on such original video data, so that each piece of input data has a fixed dimension. To ensure the validity of the input data, the sampling range may be limited. In the embodiment of the present application, the specific sampling range may be set to a relatively small range, and specifically may be determined according to the average number of frames over which a single action lasts, so as to reduce the probability that the same piece of input data contains different actions of multiple categories. As described above, if a piece of input data contains actions of several different categories, the probability with which the model recognizes the input data as belonging to any single action category may be low; therefore, the number of action categories spanned by a piece of input data should be reduced as much as possible, and this can be achieved by reducing the sampling range. For example, assuming that the average number of frames over which a single action lasts is 25 frames, the sampling range may be set to 25; that is, if the Nth frame is taken as the starting point, sampling may be performed in the range from the Nth frame to the (N + 25)th frame, thereby reducing the probability that the same piece of input data spans multiple different action categories.
It should be noted that, since an original video may be relatively long, the present application does not restrict the sampling start position, that is, sampling may start anywhere; the purpose of reducing the probability that a sampling result spans multiple action categories is achieved by controlling the sampling range to be within the average number of frames of a single action. For example, if the sampling start point happens to be exactly where a certain action segment begins, which is the ideal case, controlling the sampling range makes the sampled input data contain only a single action content as far as possible. If the sampling start point falls in the middle of a certain action segment, then even with the sampling range controlled to the duration of a single action, two different action contents may appear in the same piece of input data. Or, if the sampling start point falls near the end of a certain action segment and the next action segment happens to be short, three or more different action contents may appear in the same piece of input data, and so on. Of course, in practical applications, once the sampling range is limited, the probability of the third case is greatly reduced. That is to say, since the average number of frames over which a single action lasts can be obtained by statistics, and some non-action segments may be interspersed between different action segments in the same video, in general, when an arbitrary point is taken as the sampling start point, the probability that the subsequent sampling range contains many different action segments is relatively low, and in most cases the sampled content can be kept within one or two actions.
In the case where the same piece of input data spans two categories, if the model is trained well, although the probability with which the input data is recognized as belonging to a certain action category may be only 80% (because 20% of the video content belongs to another action category), the recognition result still has a certain value compared with the higher-probability recognition result obtained for a single action, and can play a positive role in the learning and training of the model. Of course, if, after a piece of input data constructed in the manner provided by the embodiment of the present application is fed into the model, the model outputs only low probabilities for every action category, this may be because the input data contains action content of multiple categories; in that case, such data (usually only a small portion) can be removed to avoid a negative influence on the accuracy of the model training result.
In addition, in order to construct a loss function capable of evaluating the training effect of the unsupervised training and update the parameters in the model, as described above, a plurality of groups of differentiated input data need to be constructed, specifically, the same video data may be sampled at least twice to construct a group of differentiated input data based on the same or similar action content, and then the model training effect is evaluated according to the similarity between the action category results predicted by the model for the same group of differentiated input data.
In the process of constructing the differentiated input data, since a specific piece of original video data may contain action content of multiple different categories, the embodiment of the present application may adopt a sampling manner with a fixed sampling start point and a fixed sampling range; that is, when a group of differentiated input data is constructed on the basis of the same video data, each sampling starts from the same start point and is performed within the same range. Of course, in order to keep the sampling results different from each other, different sampling intervals may be used. For example, in a specific implementation, one of the video frame samplings may determine the frame interval by uniform sampling, and the other video frame samplings may determine the frame interval by adding an offset on the basis of the uniform sampling. The specific offset may be a fixed value or a random value, and so on.
For example, assuming that the sampling start point is the first frame, the range is 25 frames, and 8 frames in total need to be sampled, then with uniform sampling the frame interval may be 3 frames, that is, the sampling result may be the 1st, 4th, 7th, 10th, 13th, 16th, 19th and 22nd frames; the second time, an offset may be added on this basis, e.g. shifting one frame later, so that the sampling result may be the 1st, 5th, 8th, 11th, 14th, 17th, 20th and 23rd frames, and so on. Alternatively, the specific offset may be random; for example, the offset may be forward or backward, and some of the offsets may be 0, and so on.
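One possible reading of this "fixed sampling start point + fixed sampling range" pair construction, reproducing the frame numbers in the example above, is sketched below; the helper names and the exact way the offset is applied (kept at 0 for the first frame, clipped to the window) are assumptions for illustration only.

```python
import random

def uniform_indices(start, interval, clip_len):
    # First view: evenly spaced frames from the fixed start point.
    return [start + i * interval for i in range(clip_len)]

def offset_indices(start, interval, clip_len, window, max_offset=1):
    # Second view: same start point and window, uniform indices plus an offset.
    base = uniform_indices(start, interval, clip_len)
    shifted = []
    for i, f in enumerate(base):
        off = 0 if i == 0 else random.randint(0, max_offset)  # keep the start fixed
        shifted.append(min(f + off, start + window - 1))       # stay inside the window
    return shifted

start, interval, clip_len, window = 1, 3, 8, 25
view_1 = uniform_indices(start, interval, clip_len)         # [1, 4, 7, 10, 13, 16, 19, 22]
view_2 = offset_indices(start, interval, clip_len, window)  # e.g. [1, 5, 8, 11, 14, 17, 20, 23]
print(view_1, view_2)
```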
It should be noted that, since a piece of original video data is generally long (compared with an action segment containing a single action), multiple groups of differentiated input data may be constructed from one piece of original video data, and each group may include at least two pieces of differentiated input data. It is only necessary to ensure that the same group of differentiated input data is obtained by sampling from the same sampling start point and within the same sampling range. For example, sampling twice from the 1st frame of a certain piece of video data as the sampling start point and within the following 25 frames (i.e., the 1st to 25th frames) yields one group of differentiated input data; sampling twice from the 30th frame of the video data as the sampling start point and within the following 25 frames (i.e., the 30th to 55th frames) yields another group of differentiated input data, and so on.
The above description concerns the unsupervised training. In a specific implementation, since it is usually combined with supervised training to perform semi-supervised training, as mentioned above, the process of segmenting and labeling part of the training samples is also involved. In a specific implementation, part of the video data can be selected from the multiple pieces of video data to be segmented into multiple action segments, and the action category labeling information of each action segment is acquired, so that the algorithm model can be trained in a manner combining supervised and unsupervised training.
Specifically, in selecting the part of the video data used for supervised training, although only a small part of the video data needs to be selected, in order to improve the effect and efficiency of model training, the quality of the training samples should be improved as much as possible; that is, video data of representative significance can be selected to participate in the supervised training. Specifically, assuming that the algorithm model needs to recognize 5 action categories, the specific training samples (action segments) may include action segments corresponding to these 5 action categories, and each action category may include a plurality of specific action segments. Even within the same action category, different people may perform the action in different ways, or the action may be performed differently in different scenes, and so on. Therefore, in order to obtain a better model training effect, the action segments under the same action category should cover as many of these different ways of performing the action as possible, so as to improve the generalization capability of the model. In other words, under the same action category, action segments that embody a plurality of different ways of performing the action are more meaningful for model training. Therefore, when selecting the training data (that is, deciding which training samples are to be segmented and labeled), the quality of the samples participating in the supervised training can be improved through some means of control, thereby improving the training effect and efficiency of the model.
In order to achieve the above object, a simpler way may be to select parts of video data corresponding to different persons at different generation times from a plurality of pieces of video data. That is, the samples can be distributed at different times and different persons, and since the probability of differences between specific practices is relatively high when different persons execute the same action type at different times, the above purpose can be achieved in this way.
Alternatively, in another mode, the algorithm model may first be trained in an unsupervised manner, so that, on the basis of the algorithm model already having a certain prediction capability, the algorithm model is used to predict the action segments contained in each piece of video data in the training samples and the corresponding action categories. Specifically, the algorithm model may recognize the video data segment by segment; for example, a certain number of video frames are extracted from each 2 s video segment, the action category contained in each segment is recognized, and adjacent video segments of the same category are connected together to obtain a specific action segment (of course, in the unsupervised manner, the name of the specific action category cannot be given directly). Then, from a plurality of action segments corresponding to the same action category, partial action segments that differ relatively greatly from each other may be determined as representative action segments corresponding to that action category, and the video data in which these representative action segments are located may be used as the part of the video data participating in the supervised training.
Specifically, in the process of performing motion recognition by using the algorithm model, corresponding feature vectors can be generated for specific motion segments, so that the feature vectors of multiple motion segments included in the same motion category can be obtained by using the algorithm model, and then, a difference quantization value between different motion segments can be determined by calculating distances between the feature vectors of multiple motion segments; further, the partial motion segments having a large difference from each other may be determined according to the difference quantization value between the different motion segments.
That is, after the model has been trained in an unsupervised manner, action segments can be recognized from the original video data; although the category name of a specific action segment is not known, it can be roughly determined which action segments belong to the same category, and a specific feature vector can be generated for each action segment. In this way, the degree of difference between the action contents contained in action segments of the same category can be determined according to the distances between their feature vectors. Since each action segment belongs to some piece of original video data, the video data containing the more representative action segments can be roughly determined and selected, and those action segments can then be segmented and labeled manually, thereby improving the quality of the training data participating in the supervised training.
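A minimal sketch of this difference quantification step is given below; NumPy and the Euclidean distance are assumptions for illustration, since the embodiment only requires some distance between feature vectors.

```python
import numpy as np

def pick_representative(features, k=2):
    """features: (num_segments, dim) feature vectors produced by the model for
    action segments of one category. Returns the indices of the k segments whose
    total distance to the other segments is largest, i.e. the most 'different' ones."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # pairwise distances (difference quantization values)
    score = dist.sum(axis=1)               # overall dissimilarity of each segment
    return list(np.argsort(score)[::-1][:k])

feats = np.random.rand(10, 128)            # stand-in feature vectors for 10 segments of one category
print(pick_representative(feats, k=3))
```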
After the training of the video motion recognition model is completed, the model can be applied to recognizing or predicting the action categories contained in a video. The specific object of prediction may be a pre-recorded video file, or video stream content generated in real time, and so on. In either case, the specific prediction process may be to perform prediction segment by segment and then aggregate the action category prediction results corresponding to the segments, for example, connecting adjacent segments of the same category together to form an action segment, and so on. When performing segment-by-segment prediction, 2 s (or another value) may be used as the segment length, and each 2 s segment may be sampled and input to the model for prediction. The prediction or recognition result for an action segment can then be used to guide or correct the operation specifications of workers, so as to help improve the quality of the specific produced product, and the like.
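The aggregation of segment-by-segment predictions described above can be sketched as follows; this is a minimal illustration in which the 2 s segment length and the category names are only example assumptions.

```python
def merge_segments(per_segment_labels, segment_seconds=2.0):
    """Merge adjacent segments with the same predicted category into action segments.
    Returns (start_time, end_time, category) tuples."""
    merged = []
    for i, label in enumerate(per_segment_labels):
        start, end = i * segment_seconds, (i + 1) * segment_seconds
        if merged and merged[-1][2] == label:
            merged[-1] = (merged[-1][0], end, label)  # extend the previous action segment
        else:
            merged.append((start, end, label))
    return merged

print(merge_segments(["sew", "sew", "fold", "fold", "fold", "pack"]))
# [(0.0, 4.0, 'sew'), (4.0, 10.0, 'fold'), (10.0, 12.0, 'pack')]
```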
In summary, according to the embodiment of the present application, in the process of training the video motion recognition model, if unsupervised training is involved, the unslit original video data can be directly used for unsupervised training, that is, specific input data is constructed in a manner of directly sampling video frames from the original video data. Therefore, the original video data does not need to be cut or cropped, and therefore the manual operation cost in the model training process can be reduced. In addition, during sampling each time, the sampling range on the time dimension can be determined according to the average continuous frame number of a single action, so that the probability of the input data crossing action categories is greatly reduced, and the effectiveness of the model input data can be ensured to a certain extent.
In addition, when the same group of differentiated input data is constructed by performing video frame sampling at least twice on the same video data, each video frame sampling may have the same sampling start point and the same sampling range on the time axis, so that the training effect of the algorithm model can be evaluated through the constructed multiple groups of differentiated input data. In this way, the requirement that the differentiated input data within the same group correspond to the same or similar action content can be met, so that the multiple groups of differentiated input data can be used to evaluate the training effect of the model.
Example two
The second embodiment is mainly introduced from the application perspective of a specific video motion recognition model. In particular, video motion recognition has a wide range of application scenarios. For example, in an intelligent factory or a digital factory (specifically, an intelligent clothing factory, etc.), workers in each link perform their actions according to preset criteria to collectively complete a production task. Therefore, the normative action of workers in each link often determines the quality of the finally produced product. In the process, the production flow of workers can be managed through a digital means to improve the product quality, so that an effective video action recognition model needs to be generated to automatically recognize whether the actions of the workers meet the specifications or not from the recorded video or the video stream collected in real time, and the like.
Specifically, the second embodiment of the present application provides a video processing method for an application in the above intelligent factory or digital factory, and referring to fig. 5, the method may include:
s501: and carrying out video acquisition on the production working process of workers in the target factory to obtain video data.
The video data may be a recorded video file, or may also be a video stream acquired in real time. In specific implementation, video acquisition equipment can be deployed in a working area in a target factory so as to be used for carrying out video acquisition on the production working process of a specific worker. That is, the particular video data may include actions that are specifically performed by a particular worker during a production job.
S502: identifying the action type executed by the worker in the video data by utilizing a pre-generated video action identification model; when the video motion recognition model is trained, for an unsupervised training part, generating input data of the model in a mode of carrying out video frame sampling on original video data which are not segmented according to motion segments, wherein in each sampling, a sampling range in a time dimension is determined according to the average continuous frame number of a single motion.
After the video data to be processed is determined, the action category performed by the worker in the video data can be recognized by using the pre-generated video action recognition model, where the model may be built in the manner described in the first embodiment. Specifically, when recognizing the action category, as described in the first embodiment, the video data may be predicted segment by segment, and then the action category prediction results corresponding to the segments are aggregated, for example, adjacent segments of the same category are connected together to form an action segment, and so on. When performing segment-by-segment prediction, 2 s (or another value) may be used as the segment length, and each 2 s segment may be sampled and input to the model for prediction.
S503: and determining whether the action executed by the worker in the production working process meets the specification or not according to the identified action type and the action type specification information required to be executed by the production link where the worker is located.
In order to judge whether the actions of a worker are up to standard, the specification information of the action categories that need to be performed in the production link where the worker is located can be stored in advance, so that after the action categories actually performed by the specific worker in the corresponding production link are recognized, they can be compared with the specification information to judge whether the worker's actions meet the specification. For example, suppose a certain production link requires action 1, action 2 and action 3 to be performed in sequence, and during work a worker performs action 2 in an insufficiently standard way so that the algorithm model cannot recognize action 2; that is, the worker is found to have performed only actions 1 and 3 normatively, and to have skipped action 2 or performed it non-normatively, which may affect the quality of the final production object. Therefore, the judgment result in the embodiment of the present application can be used to guide or correct the operation specifications of workers and improve the quality of the specific production object.
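A minimal sketch of the comparison against the stored specification information is shown below; the list-of-required-actions format and the action names are assumptions for illustration, and only the presence of each required action is checked (an order check could be added in the same way).

```python
def missing_actions(recognized, required):
    """Return the required action categories that never appear in the recognized
    action sequence for one production link."""
    seen = set(recognized)
    return [action for action in required if action not in seen]

required = ["action_1", "action_2", "action_3"]
recognized = ["action_1", "action_3"]         # action_2 was skipped or not performed normatively
print(missing_actions(recognized, required))   # ['action_2'] -> prompt the worker / quality inspection
```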
In a specific implementation, a large-screen device may be provided in a specific factory, or a terminal device may be provided for a specific worker, and after recognizing that an action performed by a certain worker does not meet a specification, a prompt message may be provided to the worker. For example, the reminder information may be presented on a large screen device, may be sent directly to a terminal device of a particular individual, and so on.
In addition, if the actions performed by the workers in the production work process do not meet the specifications, the corresponding production object identification information can be further determined, and quality inspection prompt information can be provided for the corresponding quality inspection users. For example, a batch identifier of the corresponding production object may be determined according to a time period in which the specifically detected non-compliant operation is performed, and then the batch identifier may be provided to a quality inspection user to prompt the quality inspection user to increase attention in a quality inspection process of the production object of the batch, and so on.
For the parts of the second embodiment that are not described in detail, reference may be made to the description in the first embodiment; details are not repeated here.
It should be noted that the embodiments of the present application may involve user data. In practical applications, user-specific personal data may be used in the schemes described herein only within the scope permitted by applicable laws and regulations and under conditions that meet their requirements (for example, with the user's explicit consent and after actually notifying the user).
Corresponding to the first embodiment, the present application further provides an apparatus for generating a video motion recognition model. Referring to fig. 6, the apparatus may include:
a training sample data obtaining unit 601, configured to obtain training sample data, where the training sample data includes multiple pieces of video data;
an input data constructing unit 602, configured to, in the process of training the video motion recognition model, generate, for the unsupervised training part, input data of the model by sampling video frames from original video data that has not been segmented into action segments, where, in each sampling, the sampling range in the time dimension is determined according to the average number of continuous frames of a single action.
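A minimal sketch of this unsupervised input construction follows, assuming the average number of frames that a single action lasts is known in advance (for example, from statistics over a few labeled segments); the frame list and parameter values are illustrative assumptions.

```python
# Sample one training clip from unsegmented raw video: the temporal sampling range is
# a window whose length equals the average number of continuous frames of a single
# action, and a fixed number of frames is sampled uniformly inside that window.

import random

def sample_unsupervised_clip(frames, avg_action_frames, frames_per_clip=8):
    max_start = max(0, len(frames) - avg_action_frames)
    start = random.randint(0, max_start)
    window = frames[start:start + avg_action_frames]
    step = max(1, len(window) // frames_per_clip)
    return window[::step][:frames_per_clip]
```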
Specifically, the apparatus may further include:
a differentiated input data construction unit, configured to construct a same group of differentiated input data by performing at least two video frame samplings on the same video data, where each video frame sampling has the same sampling starting point and the same sampling range on the time axis, so that the training effect of the algorithm model can be evaluated through the multiple constructed groups of differentiated input data.
Specifically, in the process of performing at least two video frame samplings on the same video data, one of the video frame samplings determines the frame interval in a uniform sampling manner, and the other video frame samplings determine the sampling frame interval by adding an offset on the basis of the uniform sampling.
Wherein the offset is a fixed value or a random value.
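The following sketch builds one group of differentiated inputs under these assumptions: the same window is sampled twice with the same starting point and range, once uniformly and once with an offset added to each uniform index (a random offset is used here; a fixed value would also fit the description).

```python
# Build one group of differentiated input data from the same video window.

import random

def sample_differentiated_pair(frames, start, range_len, frames_per_clip=8, max_offset=2):
    window = frames[start:start + range_len]
    step = max(1, len(window) // frames_per_clip)
    uniform_idx = [i * step for i in range(frames_per_clip)]

    # second sampling: same starting point and range, but each index is shifted by an offset
    offset_idx = [
        min(len(window) - 1, i + random.randint(1, max_offset)) for i in uniform_idx
    ]
    return [window[i] for i in uniform_idx], [window[i] for i in offset_idx]
```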
In a specific implementation, the apparatus may further include:
a sample selecting unit, configured to select part of representative video data from the multiple pieces of video data, segment it into a plurality of action segments, and acquire action category labeling information of each action segment, so that the algorithm model is trained in a manner combining supervised and unsupervised training.
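One way such combined training could look is sketched below in PyTorch-style code: an unsupervised consistency loss on two samplings of the same unlabeled window plus a supervised cross-entropy loss on a labeled representative segment. The `encode`/`classify` methods, the particular losses, and the weighting are assumptions, not the specific training objective prescribed by this application.

```python
# Combine an unsupervised consistency term (two samplings of the same window should
# give similar features) with a supervised classification term on labeled segments.

import torch.nn.functional as F

def training_step(model, unlabeled_pair, labeled_clip, label, alpha=1.0):
    clip_a, clip_b = unlabeled_pair                       # two samplings of the same window
    feat_a, feat_b = model.encode(clip_a), model.encode(clip_b)
    unsup_loss = 1 - F.cosine_similarity(feat_a, feat_b, dim=-1).mean()

    logits = model.classify(labeled_clip)                 # supervised branch
    sup_loss = F.cross_entropy(logits, label)
    return sup_loss + alpha * unsup_loss
```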
In particular, the sample selecting unit may be specifically configured to select, from the multiple pieces of video data, partial video data corresponding to different persons and different generation times.
Alternatively, the sample selecting unit may specifically include:
a prediction subunit, configured to predict, by using the algorithm model, the action segments contained in the video data and their corresponding action categories;
a difference calculation subunit, configured to determine, from a plurality of action segments corresponding to a same action category, partial action segments with larger differences as representative action segments corresponding to the action category;
and a selecting subunit, configured to take the video data in which the representative action segments are located as the representative video data of the selected part participating in supervised training.
Wherein the difference calculation subunit is specifically configured to:
acquiring, by using the algorithm model, feature vectors of a plurality of action segments included in the same action category;
determining difference quantization values between different action segments by calculating distances between the feature vectors of the plurality of action segments;
and determining the partial action segments with larger differences according to the difference quantization values between the different action segments.
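A possible reading of this difference-based selection is a greedy farthest-point strategy over the segment feature vectors, as sketched below; the selection strategy itself is an assumption for illustration.

```python
# Select the most mutually different segments within one action category by repeatedly
# picking the segment farthest (in feature-vector distance) from those already selected.

import numpy as np

def select_representative_segments(feature_vectors, num_select):
    feats = np.asarray(feature_vectors)
    selected = [0]                                   # start from an arbitrary segment
    while len(selected) < min(num_select, len(feats)):
        dists = np.linalg.norm(feats[:, None] - feats[selected][None, :], axis=-1)
        min_dist = dists.min(axis=1)                 # distance to the nearest selected segment
        min_dist[selected] = -1                      # never re-pick an already selected one
        selected.append(int(min_dist.argmax()))      # pick the farthest remaining segment
    return selected
```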
Corresponding to the foregoing embodiment, an embodiment of the present application further provides a video processing apparatus, and referring to fig. 7, the apparatus may include:
a video acquisition unit 701, configured to perform video acquisition on the production work process of workers in a target factory to obtain video data;
a motion recognition unit 702, configured to identify the type of action performed by the worker in the video data by using a pre-generated video motion recognition model, where, when the video motion recognition model is trained, for the unsupervised training part, input data of the model is generated by sampling video frames from original video data that has not been segmented into action segments, and in each sampling the sampling range in the time dimension is determined according to the average number of continuous frames of a single action;
a determining unit 703, configured to determine whether the actions performed by the worker in the production work process meet the specification according to the identified action types and the action type specification information required to be executed in the production link where the worker is located.
In a specific implementation, the apparatus may further include:
a first prompting unit, configured to send prompt information to the worker if an action performed by the worker in the production work process does not meet the specification.
Alternatively, the apparatus may further include:
a second prompting unit, configured to, if an action performed by the worker in the production work process does not meet the specification, determine corresponding production object identification information and provide quality inspection prompt information to a corresponding quality inspection user.
In addition, the present application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method described in any of the preceding method embodiments.
And an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding method embodiments.
Fig. 8 illustrates an architecture of an electronic device, which may include, in particular, a processor 810, a video display adapter 811, a disk drive 812, an input/output interface 813, a network interface 814, and a memory 820. The processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813, the network interface 814, and the memory 820 may be communicatively connected by a communication bus 830.
The processor 810 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the present application.
The memory 820 may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 820 may store an operating system 821 for controlling the operation of the electronic device 800 and a Basic Input/Output System (BIOS) for controlling low-level operation of the electronic device 800. In addition, a web browser 823, a data storage management system 824, a model generation processing system 825, and the like may also be stored. The model generation processing system 825 may be an application program that implements the operations of the foregoing steps in the embodiments of the present application. In short, when the technical solution provided in the present application is implemented by software or firmware, the relevant program code is stored in the memory 820 and is called by the processor 810 for execution.
The input/output interface 813 is used for connecting an input/output module to realize information input and output. The input/output module may be configured in the device as a component (not shown in the figure) or may be external to the device to provide corresponding functions. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and the like, and the output devices may include a display, a speaker, a vibrator, an indicator light, and the like.
The network interface 814 is used for connecting a communication module (not shown in the figure) to implement communication interaction between this device and other devices. The communication module may communicate in a wired manner (for example, USB or a network cable) or in a wireless manner (for example, a mobile network, WIFI, or Bluetooth).
Bus 830 includes a pathway for communicating information between various components of the device, such as processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, and memory 820.
It should be noted that although only the processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813, the network interface 814, the memory 820, and the bus 830 are shown above, in a specific implementation the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the device described above may also include only the components necessary to implement the solution of the present application, rather than all of the components shown in the figure.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present application may essentially, or in part, be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in some parts of the embodiments of the present application.
The embodiments in this specification are described in a progressive manner; for the same or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system and apparatus embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related details, reference may be made to the descriptions of the method embodiments. The system and apparatus embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which can be understood and implemented by those of ordinary skill in the art without inventive effort.
The method for generating a video motion recognition model and the electronic device provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may, in accordance with the idea of the present application, make changes to the specific implementations and the scope of application. In view of the above, the content of this specification should not be construed as limiting the present application.

Claims (13)

1. A method of generating a video motion recognition model, comprising:
acquiring training sample data, wherein the training sample data comprises a plurality of pieces of video data;
in the process of training the video motion recognition model, for an unsupervised training part, generating input data of the model by performing video frame sampling on original video data that is not segmented into action segments, wherein, in each sampling, a sampling range in a time dimension is determined according to an average number of continuous frames of a single action.
2. The method of claim 1, further comprising:
constructing a same group of differentiated input data by performing at least two video frame samplings on the same video data, wherein each video frame sampling has the same sampling starting point and the same sampling range on a time axis, so that a training effect of the algorithm model is evaluated through multiple constructed groups of differentiated input data.
3. The method of claim 2,
in the process of performing at least two video frame samplings on the same video data, one of the video frame samplings determines a frame interval in a uniform sampling manner, and the other video frame samplings determine a sampling frame interval by adding an offset on the basis of the uniform sampling.
4. The method of claim 3,
the offset is a fixed value or a random value.
5. The method of claim 1, further comprising:
selecting part of representative video data from the multiple pieces of video data, segmenting the representative video data into a plurality of action segments, and acquiring action category labeling information of each action segment, so as to train the algorithm model in a manner combining supervised and unsupervised training.
6. The method of claim 5,
the selecting part of representative video data from the multiple pieces of video data comprises:
selecting, from the multiple pieces of video data, partial video data corresponding to different persons and different generation times.
7. The method of claim 5,
the selecting part of representative video data from the multiple pieces of video data comprises:
predicting, by using the algorithm model, action segments contained in the video data and their corresponding action categories;
determining, from a plurality of action segments corresponding to a same action category, partial action segments with larger differences as representative action segments corresponding to the action category;
and taking the video data in which the representative action segments are located as the representative video data of the selected part participating in supervised training.
8. The method of claim 7,
the determining, from a plurality of action segments corresponding to a same action category, partial action segments with larger differences comprises:
acquiring, by using the algorithm model, feature vectors of a plurality of action segments included in the same action category;
determining difference quantization values between different action segments by calculating distances between the feature vectors of the plurality of action segments;
and determining the partial action segments with larger differences according to the difference quantization values between the different action segments.
9. A video processing method, comprising:
carrying out video acquisition on the production working process of workers in a target factory to obtain video data;
identifying the type of action performed by the worker in the video data by using a pre-generated video motion recognition model, wherein, when the video motion recognition model is trained, for an unsupervised training part, input data of the model is generated by performing video frame sampling on original video data that is not segmented into action segments, and, in each sampling, a sampling range in a time dimension is determined according to an average number of continuous frames of a single action;
and determining whether the actions performed by the worker in the production work process meet the specification according to the identified action types and the action type specification information required to be executed in the production link where the worker is located.
10. The method of claim 9, further comprising:
and if the action executed by the worker in the production working process does not meet the specification, sending prompt information to the worker.
11. The method of claim 9, further comprising:
and if the action executed by the worker in the production working process does not meet the specification, determining corresponding production object identification information and providing quality inspection prompt information for a corresponding quality inspection user.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 11.
13. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of claims 1 to 11.
CN202211101991.3A 2022-09-09 2022-09-09 Method for generating video motion recognition model and electronic equipment Pending CN115527080A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211101991.3A CN115527080A (en) 2022-09-09 2022-09-09 Method for generating video motion recognition model and electronic equipment


Publications (1)

Publication Number Publication Date
CN115527080A 2022-12-27

Family

ID=84697589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211101991.3A Pending CN115527080A (en) 2022-09-09 2022-09-09 Method for generating video motion recognition model and electronic equipment

Country Status (1)

Country Link
CN (1) CN115527080A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311536A (en) * 2023-05-18 2023-06-23 讯龙(广东)智能科技有限公司 Video action scoring method, computer-readable storage medium and system
CN116311536B (en) * 2023-05-18 2023-08-08 讯龙(广东)智能科技有限公司 Video action scoring method, computer-readable storage medium and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination