CN116310952A - Sample data set generation method, device, equipment and storage medium - Google Patents

Sample data set generation method, device, equipment and storage medium

Info

Publication number
CN116310952A
Authority
CN
China
Prior art keywords
video
candidate
target
video segment
video segments
Prior art date
Legal status
Pending
Application number
CN202310127837.1A
Other languages
Chinese (zh)
Inventor
张靖
张博雅
余钱红
张伯宁
Current Assignee
Great Wall Motor Co Ltd
Original Assignee
Great Wall Motor Co Ltd
Priority date
Filing date
Publication date
Application filed by Great Wall Motor Co Ltd
Priority to CN202310127837.1A
Publication of CN116310952A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 — Scenes; Scene-specific elements
    • G06V20/40 — Scenes; Scene-specific elements in video content
    • G06V10/00 — Arrangements for image or video recognition or understanding
    • G06V10/70 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 — Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 — Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 — Movements or behaviour, e.g. gesture recognition
    • G06V2201/00 — Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 — Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a sample data set generation method, device, equipment and storage medium, and belongs to the technical field of computers. The method comprises: dividing a video stream into video segments to obtain a plurality of candidate video segments, where the candidate video segments are video segments of the video stream that contain at least one object; performing target detection and gesture recognition on the plurality of candidate video segments to obtain at least one target video segment among them, where a target video segment is a candidate video segment that contains a single object whose gesture meets a target gesture condition; and generating a sample data set based on the at least one target video segment and labels of the at least one target video segment, where each label indicates whether the object in the corresponding target video segment is in the target gesture, and the sample data set is used for training a gesture recognition model. In this way, the sample data set for training the gesture recognition model can be generated automatically, which reduces labor cost and improves the efficiency of producing the sample data set.

Description

Sample data set generation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating a sample data set.
Background
At present, in order to enhance the alertness of staff during work in a park or a factory and to improve the accuracy of their actions, staff are required to perform special prescribed actions, thereby improving their safety during work. For example, in some factories, employees are required to perform finger pointing-and-calling actions so that they remain focused while performing a job. However, performing the prescribed action relies on the employee's own initiative, and employees with poor self-discipline may fail to perform it, so each employee's behavior needs to be supervised during work.
In the related art, supervision of employee behavior is achieved in an automated manner. That is, the actions of the staff during work are recognized by a neural network model, so as to judge whether the staff has performed the prescribed actions.
However, in the above manner, the data set used to train the neural network model is generally obtained through manual processing, which increases labor cost and results in low data set production efficiency.
Disclosure of Invention
The application provides a sample data set generation method, device, equipment and storage medium, which can automatically generate a sample data set, reduce labor cost and improve the manufacturing efficiency of the data set. The technical scheme is as follows:
In a first aspect, there is provided a sample data set generation method, the method comprising:
dividing video segments of a video stream to obtain a plurality of candidate video segments, wherein the plurality of candidate video segments are video segments containing at least one object in the video stream;
performing target detection and gesture recognition on the plurality of candidate video segments to obtain at least one target video segment in the plurality of candidate video segments, wherein the target video segment is a candidate video segment containing one object, and the gesture of the object in the target video segment accords with a target gesture condition;
a sample data set is generated based on at least one of the target video segments and labels of at least one of the target video segments, the labels being used to indicate whether objects in the corresponding target video segment are in a target pose, the sample data set being used to train a pose recognition model.
In the application, video segments of a video stream are divided to obtain a plurality of candidate video segments, that is, a plurality of candidate video segments containing at least one object are obtained based on the video stream division. And then carrying out target detection and gesture recognition on the plurality of candidate video segments to obtain at least one target video segment in the plurality of candidate video segments, namely, screening out candidate video segments only comprising one object from the plurality of candidate video segments, wherein the gesture of the object in the candidate video segments accords with the target gesture condition. And then generating a sample data set based on at least one target video segment and labels of the at least one target video segment, wherein the labels can indicate whether objects in the corresponding target video segment are in target gestures, and the generated sample data set can be used for training a gesture recognition model. Thus, the sample data set for training the gesture recognition model can be automatically generated, so that the labor cost is reduced, and the manufacturing efficiency of the sample data set is improved. In addition, the target video segment only comprising one object can be automatically obtained, so that the sample data set can be suitable for training the gesture recognition model under the single object scene, and the recognition efficiency of the gesture recognition model under the single object scene can be improved.
Optionally, the video segment dividing is performed on the video stream to obtain a plurality of candidate video segments, including any one of the following:
performing target detection on the video stream for any object in the video stream to obtain a plurality of target video frames, wherein the plurality of target video frames comprise the object; aggregating the target video frames to obtain candidate video segments corresponding to the objects;
or, for any object in the video stream, performing object detection on the video stream to obtain a reference video frame, wherein the reference video frame contains the object; and carrying out target tracking in the video stream by taking the reference video frame as a starting point to obtain a candidate video segment corresponding to the object.
Optionally, before performing object detection and gesture recognition on the plurality of candidate video segments to obtain at least one object video segment in the plurality of candidate video segments, the method further includes:
deleting a candidate video segment of any one of the plurality of candidate video segments under the condition that the candidate video segment does not meet a preset condition to obtain a plurality of reference video segments, wherein the preset condition is that the number of frames of the candidate video segment is greater than or equal to a first preset number of frames and less than or equal to a second preset number of frames, and the first preset number of frames is less than the second preset number of frames;
Optionally, the performing object detection and gesture recognition on the plurality of candidate video segments to obtain at least one object video segment in the plurality of candidate video segments includes:
and performing target detection and gesture recognition on the plurality of reference video segments to obtain at least one target video segment in the plurality of reference video segments.
Optionally, the performing object detection and gesture recognition on the plurality of candidate video segments to obtain at least one object video segment in the plurality of candidate video segments includes:
for any one candidate video segment in the plurality of candidate video segments, sliding the candidate frame on the candidate video segment;
classifying the regions covered by the candidate frame in the sliding process of the candidate frame to obtain region classification results of a plurality of regions of the candidate video segment;
determining at least one detection frame of the candidate video segment based on the region classification results of the plurality of regions, the at least one detection frame surrounding at least one object in the candidate video segment;
screening a plurality of single-object video segments from the plurality of candidate video segments based on at least one detection frame in the plurality of candidate video segments, wherein the single-object video segments are candidate video segments containing one object;
And carrying out gesture recognition on the plurality of single-object video segments to obtain at least one target video segment.
Optionally, the screening, based on at least one detection frame in the plurality of candidate video segments, a plurality of single-object video segments from the plurality of candidate video segments includes:
for any one candidate video segment in the plurality of candidate video segments, deleting the candidate video segment under the condition that a plurality of detection frames are detected in the candidate video segment;
and under the condition that one detection frame is detected in the candidate video segments, determining the candidate video segments as the single-object video segments.
Optionally, the performing gesture recognition on the plurality of single-object video segments to obtain at least one target video segment includes:
extracting key points of the objects in the single-object video segment for any single-object video segment in the plurality of single-object video segments to obtain a plurality of key points of the objects in the single-object video segment;
judging whether the plurality of key points meet a target attitude condition or not;
and determining the single-object video segment as the target video segment under the condition that the plurality of key points meet the target gesture condition.
Optionally, the method further comprises:
determining a plurality of keypoints for objects in at least one of the target video segments based on the sample dataset;
the gesture recognition model is trained based on a plurality of keypoints of the object in at least one of the target video segments.
Optionally, the training the gesture recognition model based on a plurality of keypoints of the object in at least one of the target video segments includes:
inputting a plurality of key points of an object in at least one target video segment into the gesture recognition model, performing gesture recognition on the plurality of key points of the object in the target video segment through the gesture recognition model, and outputting a prediction recognition result;
and adjusting model parameters of the gesture recognition model based on difference information between the prediction recognition result and labels corresponding to the target video segments.
In a second aspect, there is provided a sample data set generating apparatus, the apparatus comprising:
the first processing module is used for dividing video segments of a video stream to obtain a plurality of candidate video segments, wherein the candidate video segments are video segments containing at least one object in the video stream;
The second processing module is used for carrying out target detection and gesture recognition on the plurality of candidate video segments to obtain at least one target video segment in the plurality of candidate video segments, wherein the target video segment is a candidate video segment containing one object, and the gesture of the object in the target video segment accords with a target gesture condition;
and the generating module is used for generating a sample data set based on at least one target video segment and the label of the at least one target video segment, wherein the label is used for indicating whether an object in the corresponding target video segment is in a target gesture or not, and the sample data set is used for training a gesture recognition model.
Optionally, the first processing module is configured to:
performing target detection on the video stream for any object in the video stream to obtain a plurality of target video frames, wherein the plurality of target video frames comprise the object; aggregating the target video frames to obtain candidate video segments corresponding to the objects;
or, for any object in the video stream, performing object detection on the video stream to obtain a reference video frame, wherein the reference video frame contains the object; and carrying out target tracking in the video stream by taking the reference video frame as a starting point to obtain a candidate video segment corresponding to the object.
Optionally, the apparatus further comprises:
a first filtering module, configured to delete, for any one candidate video segment of the plurality of candidate video segments, the candidate video segment to obtain a plurality of reference video segments when the candidate video segment does not satisfy a preset condition, where the preset condition is that a number of frames of the candidate video segment is greater than or equal to a first preset number of frames and less than or equal to a second preset number of frames, and the first preset number of frames is less than the second preset number of frames;
optionally, the second processing module is configured to:
and performing target detection and gesture recognition on the plurality of reference video segments to obtain at least one target video segment in the plurality of reference video segments.
Optionally, the second processing module is configured to:
for any one candidate video segment in the plurality of candidate video segments, sliding the candidate frame on the candidate video segment;
classifying the regions covered by the candidate frame in the sliding process of the candidate frame to obtain region classification results of a plurality of regions of the candidate video segment;
determining at least one detection frame of the candidate video segment based on the region classification results of the plurality of regions, the at least one detection frame surrounding at least one object in the candidate video segment;
Screening a plurality of single-object video segments from the plurality of candidate video segments based on at least one detection frame in the plurality of candidate video segments, wherein the single-object video segments are candidate video segments containing one object;
and carrying out gesture recognition on the plurality of single-object video segments to obtain at least one target video segment.
Optionally, the second processing module is configured to:
for any one candidate video segment in the plurality of candidate video segments, deleting the candidate video segment under the condition that a plurality of detection frames are detected in the candidate video segment;
and under the condition that one detection frame is detected in the candidate video segments, determining the candidate video segments as the single-object video segments.
Optionally, the second processing module is configured to:
for any one single-object video segment in a plurality of single-object video segments, extracting key points of objects in the single-object video segment to obtain a plurality of key points of the objects in the single-object video segment;
judging whether the plurality of key points meet a target attitude condition or not;
and determining the single-object video segment as the target video segment under the condition that the plurality of key points meet the target gesture condition.
Optionally, the apparatus further comprises:
a determining module, configured to determine, based on the sample data set, a plurality of keypoints of an object in at least one of the target video segments;
and the training module is used for training the gesture recognition model based on a plurality of key points of the object in at least one target video segment.
Optionally, the training module is configured to:
inputting a plurality of key points of an object in at least one target video segment into the gesture recognition model, performing gesture recognition on the plurality of key points of the object in the target video segment through the gesture recognition model, and outputting a prediction recognition result;
and adjusting model parameters of the gesture recognition model based on difference information between the prediction recognition result and labels corresponding to the target video segments.
In a third aspect, a computer device is provided, the computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the computer program implementing the sample data set generation method described above when executed by the processor.
In a fourth aspect, a computer readable storage medium is provided, the computer readable storage medium storing a computer program, which when executed by a processor, implements the sample data set generation method described above.
In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of the sample data set generation method described above.
It will be appreciated that the advantages of the second, third, fourth and fifth aspects may be found in the relevant description of the first aspect, and are not repeated here.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a sample data set generation method provided in an embodiment of the present application;
FIG. 2 is a flow chart of another sample data set generation method provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a sample data set generating device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that reference herein to "a plurality" means two or more. In the description of the present application, "/" means "or" unless otherwise indicated; for example, A/B may represent A or B. "And/or" herein merely describes an association relationship between associated objects, and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, for the purpose of clearly describing the technical solutions of the present application, the words "first", "second", etc. are used to distinguish between identical or similar items having substantially the same function and effect. It will be appreciated by those of skill in the art that the words "first", "second", and the like do not limit the quantity or order of execution, and do not necessarily indicate that the items differ.
Before explaining the embodiments of the present application in detail, application scenarios of the embodiments of the present application are described.
At present, in order to strengthen the alertness of staff during work and improve their safety during work, staff may be required to perform certain special actions. However, employees with poor self-discipline may not perform the prescribed special actions, so employee behavior is supervised during the employees' work.
In general, staff behavior can be supervised manually during work, that is, a person judges whether the staff performs the special actions. However, this method requires a lot of manpower, which increases unnecessary labor cost, and errors such as missed detection and false detection may occur during manual supervision, which reduces the efficiency of supervising employee behavior.
With the development of technology, supervision of employee behaviors can also be achieved through an automatic mode. The actions of the staff in the working process are generally identified through the neural network model, so that whether the staff makes special actions is judged.
In addition, the data set for training the neural network model is typically derived from historical video information within the plant. Specifically, the historical video information of the operation area collected by cameras in the factory is first obtained, and since the historical video information contains redundant information, such as periods with no person or with multiple people, the historical video information also needs to be post-processed. The portions of the historical video information in which employees are present are typically preserved through manual processing to obtain a data set for training the neural network model. However, obtaining the data set for training the neural network model through manual processing in this way still consumes manpower, which increases labor cost and results in low data set production efficiency.
To this end, the embodiment of the application provides a sample data set generating method, which can be applied to a scene for making a sample data set for training a gesture recognition model. Specifically, video segment division is performed on video stream data to obtain a plurality of video segments containing at least one object. And then screening the plurality of video segments containing at least one object to obtain at least one video segment containing only one object, wherein the gestures of the objects in the video segment containing only one object meet specific conditions, and finally generating a sample data set based on the obtained at least one video segment containing one object and corresponding labels, wherein the sample data set can be used for training a gesture recognition model. Thus, the sample data set for training the gesture recognition model can be automatically generated, so that the labor cost is reduced, and the manufacturing efficiency of the data set is improved. In addition, in the embodiment of the application, the video segment containing only one object can be automatically obtained, so that the sample data set can be suitable for training the gesture recognition model in the single-object scene, and the recognition efficiency of the gesture recognition model in the single-object scene can be improved.
The sample data set generating method provided in the embodiment of the present application is explained in detail below.
Fig. 1 is a flowchart of a sample data set generating method according to an embodiment of the present application. The method may be applied to a computer device, see fig. 1, the method comprising the following steps.
Step 101: the computer equipment divides the video segments of the video stream to obtain a plurality of candidate video segments.
The video stream is video data containing a plurality of objects. For example, when generating a finger pointing-and-calling data set, the video stream may be video data of a work area captured by a camera in a factory, the video data containing a plurality of workers in the work area.
The plurality of candidate video segments are video segments of the video stream containing at least one object, and each of the plurality of candidate video segments is a video segment corresponding to the object in the video stream. That is, each candidate video segment of the plurality of candidate video segments is a video segment obtained from the video stream, and the video segment includes at least one object.
Specifically, the operation of step 101 may be implemented in two possible ways as follows.
A first possible way may include the following steps (1) -step (2).
(1) For any object in the video stream, the computer device performs object detection on the video stream to obtain a plurality of object video frames.
Object detection is used to detect a plurality of objects in the video stream. Specifically, the object in each of the plurality of video frames is detected by performing object detection on a plurality of video frames of the video stream, and as an example, after the computer device performs object detection on one of the video frames in the video stream, an ID (Identity Document, identification code) of the object is marked in the video frame, and the ID is used for uniquely identifying the object.
Each of the plurality of target video frames contains the object. As an example, after the target detection is performed on the plurality of video frames of the video stream, the computer device may know the object in each of the plurality of video frames, and then reserve the video frame containing the object in the plurality of video frames, that is, obtain a plurality of target video frames corresponding to the object.
In this case, the computer device may obtain all video frames including the object in the video stream, that is, obtain a plurality of target video frames, where the plurality of target video frames are video frames corresponding to the object, so that the candidate video segments corresponding to the object may be obtained subsequently according to the plurality of target video frames.
Alternatively, the operation of step (1) may be: the computer device employing the candidate frame to slide on the video stream; in the sliding process of the candidate frame, the computer equipment classifies the areas covered by the candidate frame to obtain area classification results of a plurality of areas of the video stream; the computer device determines a plurality of target video frames based on the region classification results for the plurality of regions of the video stream.
The region classification result is used for indicating whether the region covered by the candidate frame contains the object or not, and the region classification result is a probability value that a region contains the object.
Specifically, the computer device employs the candidate frame to slide over a plurality of video frames of the video stream; in the sliding process of the candidate frame, the computer equipment classifies the areas covered by the candidate frame to obtain area classification results of a plurality of areas on a plurality of video frames of the video stream; the computer device determines a plurality of target video frames of the video stream based on region classification results for a plurality of regions over the plurality of video frames.
As an example, for any one of a plurality of video frames of the video stream, sliding over the one video frame by the candidate frame, and classifying the region on the one video frame covered by the candidate frame, thereby determining a probability value of the object contained in the region covered by the candidate frame, that is, determining a region classification result of the one region in the one video frame covered by the candidate frame. Then, based on the region classification result of the plurality of regions covered by the candidate frame during the sliding process of the video frame, it can be determined whether the object exists in the video frame. The candidate frame slides on each video frame, so that the region classification result of a plurality of regions covered by the candidate frame on each video frame can be obtained, and then the region classification result of a plurality of regions covered by the candidate frame on each video frame can be used for determining which video frames in the plurality of video frames have the object. In case this object is present in a video frame, this video frame is preserved, i.e. the target video frame is obtained. Thus, when the object exists in some video frames in the plurality of video frames, the video frames in the plurality of video frames can be reserved, that is, a plurality of target video frames corresponding to the object are obtained.
Alternatively, the step size of the sliding of the candidate frame may be set in advance, and the step size may be set smaller.
Therefore, when the candidate frame slides on each video frame of the video stream according to the step length, each position on each video frame can be covered by the candidate frame, and then the area covered by the candidate frame is classified, so that whether the object is contained in each position covered by the candidate frame can be judged, and whether the object is contained in the video stream can be more accurately determined.
Alternatively, the computer device may generate a plurality of candidate boxes, which may be different in size. Then, for each of the plurality of candidate frames, the candidate frame may be used to slide over each video frame of the video stream, thereby obtaining a region classification result for a plurality of regions covered by the candidate frame over each video frame of the video stream. By adopting the above operation for each of the plurality of candidate frames, a region classification result of a plurality of regions covered by the plurality of candidate frames on each video frame of the video stream can be obtained, and then the plurality of target video frames can be determined based on the region classification result of a plurality of regions covered by the plurality of candidate frames on each video frame of the video stream.
In this case, since the sizes of the areas that the plurality of candidate frames can cover are different and the computer device does not know the sizes of the objects in the video stream, by sliding the candidate frames of different sizes over each video frame of the video stream, the objects in each video frame can be covered as much as possible, and thus, whether the objects are contained in the video stream can be determined more accurately.
Optionally, the computer device may also perform object detection on the video stream through an object detection algorithm, for example, the YOLO (You Only Look Once) algorithm or the Fast R-CNN (Fast Region-based Convolutional Network) algorithm; the embodiment of the present application does not limit the object detection algorithm.
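As an illustrative, hedged sketch of step (1) above (and not the specific implementation of this application), the following Python example shows one possible way to record, for every object, which video frames contain it. The helper detect_objects(frame) is a hypothetical per-frame detector (for example, a wrapper around a YOLO or Fast R-CNN model) that returns the IDs of the objects it finds in a frame.

```python
from typing import Callable, Dict, List, Set
import numpy as np

def collect_target_frames(
    frames: List[np.ndarray],
    detect_objects: Callable[[np.ndarray], Set[int]],
) -> Dict[int, List[int]]:
    """Map each object ID to the indices of the video frames that contain it
    (these are the 'target video frames' of that object)."""
    target_frames: Dict[int, List[int]] = {}
    for idx, frame in enumerate(frames):
        for obj_id in detect_objects(frame):  # hypothetical per-frame detector
            target_frames.setdefault(obj_id, []).append(idx)
    return target_frames
```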
(2) And the computer equipment aggregates the target video frames to obtain candidate video segments corresponding to the object.
Because each target video frame in the plurality of target video frames contains the object, the plurality of target video frames are aggregated to obtain a section of video where the object is located in the video stream, that is, to obtain a candidate video segment corresponding to the object. In addition, since the computer device reserves the video frame containing the object, not only the object but also other objects may be contained in the video frame of the video stream, so that other objects may be contained in the multiple target video frames corresponding to the object, and then other objects may be contained in the candidate video segment obtained by aggregating the multiple target video frames, that is, at least one object is contained in the candidate video segment.
In addition, consider a case where at least two objects exist in the video stream and the motion processes of the at least two objects are the same, the at least two objects may correspond to the same candidate video segment, and the candidate video segment corresponding to the at least two objects includes the at least two objects. In this case, the plurality of target video frames of the at least two objects obtained by the computer device are the same, and the at least two objects are included in the plurality of target video frames. The candidate video segments aggregated by the computer device for the plurality of target video frames are then the same for the at least two objects, so the at least two objects may correspond to the same candidate video segment.
It should be noted that, for each of the plurality of objects of the video stream, the computer device may obtain the candidate video segment corresponding to each of the plurality of objects through the steps (1) - (2) above.
In this case, the candidate video segments corresponding to each object may respectively represent the motion track of each object in the video stream, so that a sample data set with rich motion may be generated subsequently according to the candidate video segment corresponding to each object in the plurality of objects in the video stream.
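Continuing the sketch above, the recorded target video frames of each object may then be aggregated into one candidate video segment per object. The following hedged example simply collects each object's frames in temporal order; the exact aggregation strategy of the application may differ.

```python
from typing import Dict, List
import numpy as np

def aggregate_candidate_segments(
    frames: List[np.ndarray],
    target_frames: Dict[int, List[int]],
) -> Dict[int, List[np.ndarray]]:
    """Aggregate the target video frames of each object into one candidate
    video segment per object, keeping the original temporal order."""
    segments: Dict[int, List[np.ndarray]] = {}
    for obj_id, indices in target_frames.items():
        segments[obj_id] = [frames[i] for i in sorted(indices)]
    return segments
```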
A second possible way may include the following steps (1) -step (2).
(1) For any object in the video stream, the computer device performs object detection on the video stream to obtain a reference video frame.
The reference video frame contains this object. Alternatively, the reference video frame may be the first video frame of the plurality of video frames of the video stream that contains the object. As an example, the computer device sequentially performs object detection on the plurality of video frames from a first video frame of the video stream until the object is detected for the first time, and the video frame where the object is detected for the first time is the reference video frame. For example, the computer device performs object detection on a first video frame of the video stream, where the first video frame does not have the object, then the computer device performs object detection on a second video frame, and if the second video frame has the object, then the second video frame in the video stream is the reference video frame.
In this case, the computer device may be aware of the video frame in the video stream in which this object first appears. And then candidate video segments corresponding to the object can be obtained from the video stream starting from the video frame in which the object appears for the first time.
Alternatively, the operation of step (1) may be: the computer equipment sequentially slides on a plurality of video frames of the video stream by adopting the candidate frames; in the process that the candidate frame slides on the plurality of video frames in sequence, the computer equipment classifies the areas covered by the candidate frame to obtain an area classification result of the plurality of areas on one video frame; the computer device determining whether the object is present on the video frame based on the region classification results for the plurality of regions on the video frame; in the event that the object is present on the video frame, the video frame is determined to be the reference video frame.
As an example, the computer device may first slide a candidate frame over a first video frame of the video stream, and in the process of sliding the candidate frame over the first video frame, the computer device classifies the region covered by the candidate frame to obtain a region classification result for a plurality of regions over the first video frame. The computer device then determines that the object is not present on the first video frame based on the region classification of the plurality of regions on the first video frame, and the computer device determines that the first video frame is not a reference video frame. The computer device may then slide over the second video frame with the candidate frame, classify the region covered by the candidate frame during the sliding of the candidate frame over the second video frame to obtain a region classification result for the plurality of regions over the second video frame, and then determine that the object is present over the second video frame based on the region classification result for the plurality of regions over the second video frame, and then determine that the second video frame is the reference video frame.
(2) And carrying out target tracking in the video stream by taking the reference video frame as a starting point to obtain a candidate video segment corresponding to the object.
The object tracking is used for tracking the object in the video stream, and the whole motion process from appearance to disappearance of the object in the video stream can be obtained through the object tracking for any one object. In addition, consider a case where at least two objects exist in the video stream and the motion processes of the at least two objects are the same, the at least two objects may correspond to the same candidate video segment, and the candidate video segment corresponding to the at least two objects includes the at least two objects.
The reference video frame obtained through the object detection is a video frame in which the object appears for the first time in the video stream, in this case, by performing object tracking with the reference video frame as a starting point, the whole process from appearance to disappearance of the object can be obtained, that is, a plurality of candidate video segments corresponding to the object are obtained.
It should be noted that, by performing the steps (1) - (2) above on each of the plurality of objects in the video stream, the whole process from appearance to disappearance of each object can be obtained, that is, a plurality of candidate video segments corresponding to each of the plurality of objects can be obtained.
Alternatively, the operation of step (2) may be: the computer device retains the reference video frames in the video stream and extracts first features of the reference video frames; extracting a plurality of second features of any one of a plurality of video frames subsequent to the reference video frame in the video stream; retaining the video frame in the video stream if a similarity between one of the plurality of second features and the first feature satisfies a similarity threshold; deleting the video frame and all video frames after the video frame in the video stream under the condition that the similarity between each of the plurality of second features and the first feature does not meet the similarity threshold; and aggregating all video frames in the reserved video stream to obtain a candidate video segment corresponding to the object.
The first feature is a feature of this object in the reference video frame. The plurality of second features are features of all objects in one video frame.
The similarity threshold may be set in advance, and the similarity threshold may be set larger. In this case, if the similarity between a second feature and the first feature meets the similarity threshold, it indicates that the similarity between the second feature and the first feature is larger, that is, the second feature is closer to the first feature, and it indicates that the object to which the second feature belongs is one object with the object. Then the object is said to be present in the video frame, and the video frame in the video stream may be preserved. If the similarity between each of the plurality of second features and the first feature does not meet the similarity threshold, it is indicated that the similarity between each of the plurality of second features and the first feature is smaller, that is, each of the plurality of second features is not close enough to the first feature, it is indicated that the object to which each of the plurality of second features belongs may not be an object with the object. Then it is stated that the object may not be present in the video frame and may not be present in every subsequent video frame, then the video frame and all video frames following the video frame in the video stream may be deleted.
Thus, it can be determined which of the video frames the object is in from appearance to disappearance in the video stream, and the video segments formed by the video frames are the candidate video segments corresponding to the object. Then, by performing the above operation on each of the plurality of objects, the candidate video segments corresponding to each object can be obtained, that is, the plurality of candidate video segments are obtained.
Optionally, the computer device may also perform target tracking on the video stream through a target tracking algorithm, for example, the DeepSORT (Deep Simple Online and Realtime Tracking) algorithm or the StrongSORT algorithm (an enhanced version of DeepSORT), which is not limited in this embodiment of the present application.
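As a hedged illustration of the feature-similarity tracking described above (not the patented implementation), the following sketch keeps frames after the reference frame as long as some object in the frame is sufficiently similar to the reference object, and stops at the first frame with no match. The extract_object_features helper and the cosine-similarity threshold of 0.8 are assumptions made for illustration.

```python
from typing import Callable, List
import numpy as np

def track_from_reference(
    frames: List[np.ndarray],
    ref_index: int,
    extract_object_features: Callable[[np.ndarray], List[np.ndarray]],
    similarity_threshold: float = 0.8,
) -> List[np.ndarray]:
    """Starting from the reference frame, retain every subsequent frame that
    contains an object similar enough to the reference object; stop (and drop
    the remaining frames) at the first frame with no similar object."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Assume the tracked object is the first one detected in the reference frame.
    ref_feature = extract_object_features(frames[ref_index])[0]
    segment = [frames[ref_index]]
    for frame in frames[ref_index + 1:]:
        feats = extract_object_features(frame)  # second features: all objects in the frame
        if any(cosine(f, ref_feature) >= similarity_threshold for f in feats):
            segment.append(frame)
        else:
            break  # the object is assumed to have disappeared; later frames are not kept
    return segment
```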
Step 102: and the computer equipment performs target detection and gesture recognition on the plurality of candidate video segments to obtain at least one target video segment in the plurality of candidate video segments, wherein the target video segment is a candidate video segment containing one object, and the gesture of the object in the target video segment accords with a target gesture condition.
The gesture recognition is used for recognizing the gesture of the object in the plurality of candidate video segments so as to judge whether the gesture of the object meets the target gesture condition.
The target gesture condition corresponds to the target gesture; it may be set in advance and may be set according to the target gesture. As an example, if the target gesture is a finger pointing-and-calling action, the target gesture condition may be set to the presence of an angle at the arm of the object in the candidate video segment. As another example, where the target gesture is a high leg lift in physical training, the target gesture condition may be set to the presence of an angle at the leg of the object in the candidate video segment.
In this case, the computer device may acquire at least one target video segment from the plurality of candidate video segments, that is, acquire at least one video segment containing a single object from the plurality of candidate video segments, and the pose of the object satisfies the target pose condition. In this manner, the computer device obtains data that may be used to train the object pose model.
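As a purely illustrative sketch of such a target gesture condition (assuming the object's skeletal keypoints, described later, have already been extracted), an arm-angle check could look like the following; the keypoint names and the 30–150 degree range are assumed example values, not values from this application.

```python
from typing import Dict, Tuple
import math

Point = Tuple[float, float]

def elbow_angle(shoulder: Point, elbow: Point, wrist: Point) -> float:
    """Angle in degrees at the elbow, between the upper arm and the forearm."""
    v1 = (shoulder[0] - elbow[0], shoulder[1] - elbow[1])
    v2 = (wrist[0] - elbow[0], wrist[1] - elbow[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2) + 1e-8
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def meets_target_gesture_condition(keypoints: Dict[str, Point]) -> bool:
    """Example condition: an angle exists at the right arm of the object."""
    angle = elbow_angle(keypoints["right_shoulder"],
                        keypoints["right_elbow"],
                        keypoints["right_wrist"])
    return 30.0 <= angle <= 150.0  # illustrative thresholds only
```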
Optionally, before the computer device performs object detection and gesture recognition on the multiple candidate video segments to obtain at least one object video segment in the multiple candidate video segments, the computer device may further filter the multiple candidate video segments to keep the candidate video segments with valid information in the multiple candidate video segments.
Specifically, for any one of the plurality of candidate video segments, if one candidate video segment does not satisfy a preset condition, the candidate video segment is deleted, so as to obtain a plurality of reference video segments.
The preset condition is that the number of frames of one candidate video segment is greater than or equal to a first preset number of frames and less than or equal to a second preset number of frames, the first preset number of frames and the second preset number of frames can be preset by a technician according to actual requirements, and the first preset number of frames is less than the second preset number of frames.
As an example, the first preset frame number may be set to 5 frames, and the second preset frame number may be set to 50 frames, so the preset condition is that the number of frames of one candidate video segment is greater than or equal to 5 frames and less than or equal to 50 frames. Video segments having fewer than 5 frames or more than 50 frames may then be deleted, that is, among the plurality of candidate video segments, video segments whose length is less than 1 second or greater than 10 seconds (at a frame rate of, for example, 5 frames per second) may be deleted. Thus, video segments with insufficient information or with too much redundant information can be removed from the plurality of candidate video segments, so that candidate video segments with effective information are retained.
In this case, when a candidate video segment does not satisfy the preset condition, it is indicated that the candidate video segment is a video segment with insufficient information or a video segment with too much redundant information, and the candidate video segment may be deleted. When a candidate video segment meets a preset condition, the candidate video segment is indicated to be a video segment with effective information, and the candidate video segment can be reserved. Thus, the filtering operation is performed on all the plurality of candidate video segments, so that a plurality of candidate video segments with effective information can be filtered from the plurality of candidate video segments, that is, the plurality of reference video segments are obtained.
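A minimal sketch of this frame-count filter, using the example thresholds of 5 and 50 frames given above, might look as follows; the threshold values are example values only.

```python
from typing import List
import numpy as np

def filter_by_length(
    candidate_segments: List[List[np.ndarray]],
    min_frames: int = 5,   # first preset number of frames (example value)
    max_frames: int = 50,  # second preset number of frames (example value)
) -> List[List[np.ndarray]]:
    """Keep only candidate segments whose frame count lies within
    [min_frames, max_frames]; the survivors are the reference video segments."""
    return [seg for seg in candidate_segments if min_frames <= len(seg) <= max_frames]
```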
The operation of step 102 in this case may be: and the computer equipment performs target detection and gesture recognition on the plurality of reference video segments to obtain at least one target video segment in the plurality of reference video segments.
Therefore, by carrying out target detection and gesture recognition on the plurality of reference video segments obtained after filtering, the effective information in the reference video segments can be fully utilized, so that at least one target video segment can be more accurately obtained from the plurality of reference videos.
Alternatively, the operation of step 102 may be achieved by steps (1) -step (3) as follows.
(1) For any one of the plurality of candidate video segments, the computer device slides over the candidate video segment using the candidate frame; classifying the regions covered by the candidate frame in the sliding process of the candidate frame to obtain region classification results of a plurality of regions of the candidate video segment; at least one detection box for the candidate video segment is determined based on the region classification results for the plurality of regions.
The region classification result of the plurality of regions of the candidate video segment is used to indicate whether or not there is an object in the plurality of regions covered by the candidate frame.
Specifically, the operation of step (1) may be: for any one of the plurality of candidate video segments, the computer device employing the candidate frame to slide over a plurality of video frames of the candidate video segment; classifying the areas covered by the candidate frames in the sliding process of the candidate frames to obtain the area classification results of the plurality of areas on the plurality of video frames of the candidate video segment; at least one detection box for the candidate video segment is determined based on the region classification of the plurality of regions over the plurality of video frames.
As an example, for any one of the plurality of video frames of the candidate video segment, the candidate frame is slid over the one video frame, and the region of the video frame covered by the candidate frame is classified, so as to determine a probability value of an object contained in the region covered by the candidate frame, that is, determine a region classification result of the region of the video frame covered by the candidate frame. Then, based on the region classification result of the plurality of regions covered in the sliding process on the video frame of the candidate frame, it can be determined which regions on the video frame have objects. The candidate frame slides on each video frame, so that the region classification result of a plurality of regions covered by the candidate frame on each video frame can be obtained, and then the region classification result of a plurality of regions covered by the candidate frame on each video frame can be used for determining which regions on each video frame have objects. In case an object is present in an area, a candidate frame covering this area is reserved, so that when an object is present in an area covered by at least one candidate frame, the at least one candidate frame may be reserved, i.e. the at least one detection frame is obtained.
In this way, the computer device may search each video frame of the candidate video segment for an area in which an object may exist, classify all areas covered by the candidate frame, and obtain at least one detection frame of the candidate video segment based on the area classification result of all areas, so as to know which objects exist in the candidate video segment.
Alternatively, the step size of the candidate frame sliding may also be set in advance, and the step size may be set smaller.
Therefore, when the candidate frame slides on each video frame of the candidate video segment according to the step length, each position in each video frame of the candidate video segment can be fully covered by the candidate frame, and then the area covered by the candidate frame is classified, so that whether the object is contained in each position covered by the candidate frame can be judged, and at least one detection frame of the candidate video segment can be more accurately determined.
Alternatively, the computer device may generate a plurality of candidate boxes, which may be different in size. Then, for each candidate frame in the plurality of candidate frames, the candidate frame may be used to slide over each video frame of the candidate video segment, so as to obtain a region classification result of a plurality of regions covered by the candidate frame over each video frame of the candidate video segment. By adopting the above operation for each candidate frame, the region classification result of the plurality of regions covered by the plurality of candidate frames on each video frame of the candidate video segment can be obtained, and then at least one detection frame in the candidate video segment can be determined based on the region classification result of the plurality of regions covered by the plurality of candidate frames on each video frame of the candidate video segment.
In this case, by sliding the candidate frames with different sizes over each video frame of the candidate video segment, the object in each video frame of the candidate video segment can be covered as much as possible, so that at least one detection frame of the candidate video segment can be determined more accurately, that is, at least one object included in the candidate video segment can be determined more accurately.
(2) The computer device screens a plurality of single-object video segments from the plurality of candidate video segments based on at least one detection box in the plurality of candidate video segments.
A single object video segment is a candidate video segment that contains one object.
In this case, the computer device may obtain a candidate video segment in which one object exists from the plurality of candidate video segments, and thus, the plurality of single-object video segments may be used to generate a training data set of the gesture recognition model in a single-object scene, so that the gesture recognition task in any single-object scene may be dealt with later.
Optionally, when the computer device screens out a plurality of single-object video segments from the plurality of candidate video segments, the coordinate values of the objects in the plurality of single-object video segments may be further output, so that the position of each object in the plurality of single-object video segments may be known.
Specifically, the operation of step (2) may be: the computer equipment deletes any one candidate video segment in the candidate video segments under the condition that a plurality of detection frames are detected in the candidate video segment; and determining the candidate video segment as a single-object video segment under the condition that a detection frame is detected in the candidate video segment.
In the embodiment of the present application, the target gesture recognition under the single-object scene is mainly aimed at, so that a plurality of single-object video segments need to be acquired from the plurality of candidate video segments.
In this case, if a plurality of detection frames are detected in the candidate video segment, which indicates that the candidate video segment includes a plurality of objects, the candidate video segment may be deleted. If a detection frame is detected in the candidate video segment, which indicates that the candidate video segment contains an object and accords with the single-object scene, the candidate video segment can be determined as the single-object video segment.
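As a hedged sketch of this single-object screening (assuming a hypothetical detect_boxes helper that returns one bounding box per object found in a segment), the following example keeps only segments with exactly one detection box and also returns the object's coordinates, as mentioned above.

```python
from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) coordinates of a detection box

def filter_single_object_segments(
    segments: List[List[np.ndarray]],
    detect_boxes: Callable[[List[np.ndarray]], List[Box]],
) -> List[Tuple[List[np.ndarray], Box]]:
    """Keep only segments in which exactly one detection box is found, and
    return each kept segment together with the object's coordinate values."""
    single_object = []
    for seg in segments:
        boxes = detect_boxes(seg)   # hypothetical detector over the whole segment
        if len(boxes) == 1:         # one detection box: a single-object video segment
            single_object.append((seg, boxes[0]))
        # segments with several detection boxes are discarded
    return single_object
```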
(3) The computer device performs gesture recognition on the plurality of single-object video segments to obtain the at least one target video segment.
Specifically, the operation of step (3) may include the following steps a-c.
a. For any one single-object video segment in the plurality of single-object video segments, the computer device extracts key points of the object in the single-object video segment to obtain a plurality of key points of the object in the single-object video segment.
The key points may be skeletal key points of the object; in the case where the object in the video segment is a person, they may be human body key points. For example, the plurality of key points may be: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle.
Specifically, the computer device may input the single-object video segment into a key point extraction model, perform key point extraction on the object in the single-object video segment through the key point extraction model, and output a plurality of key points of the object in the single-object video segment.
The key point extraction model is used for extracting key points of an object.
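As an illustration only, the per-frame extraction step could look like the sketch below; keypoint_model is a hypothetical stand-in for the trained key point extraction model and is assumed to map one video frame to the 17 key points listed above.

```python
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def extract_segment_keypoints(segment_frames, keypoint_model):
    """Run the key point extraction model on every frame of a single-object
    video segment and collect the (x, y) coordinates of the key points."""
    keypoints_per_frame = []
    for frame in segment_frames:
        # keypoint_model is assumed to return a dict {keypoint_name: (x, y)}
        # for the object in the frame.
        keypoints_per_frame.append(keypoint_model(frame))
    return keypoints_per_frame
```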
It is noted that the key point extraction model needs to be trained before the computer device inputs the single-object video segment into the key point extraction model.
Specifically, the computer device may obtain a plurality of training samples, and train the neural network model using the plurality of training samples to obtain the keypoint extraction model. For example, the plurality of training samples may be a COCO dataset.
The plurality of training samples may be preset. Each of the plurality of training samples includes sample data and sample labels, the sample data being video information comprising sample objects, the sample labels being a plurality of keypoints of the sample objects in the sample data. That is, the input data in each of the plurality of training samples is sample data including a sample object, and the samples are labeled as a plurality of key points of the sample object in the sample data.
The neural network model may include a plurality of network layers including an input layer, a plurality of hidden layers, and an output layer. The input layer is responsible for receiving input data; the output layer is responsible for outputting the processed data; a plurality of hidden layers are located between the input layer and the output layer, responsible for processing data, the plurality of hidden layers being invisible to the outside. Alternatively, the neural network model may be a deep neural network or the like, and may be a convolutional neural network or the like in the deep neural network, for example, the neural network model may be an HRNet (High-Resolution Network) model.
When the computer equipment trains the neural network model by using a plurality of training samples, for each training sample in the plurality of training samples, input data in the training sample can be input into the neural network model to obtain output data; determining a loss value between the output data and a sample marker in the training sample by a loss function; and adjusting parameters in the neural network model according to the loss value. After the parameters in the neural network model are adjusted based on each training sample in the training samples, the neural network model with the adjusted parameters is the key point extraction model.
The operation of the computer device to adjust the parameters in the neural network model according to the loss value may refer to the related art, which is not described in detail in the embodiments of the present application.
For example, the computer device may adjust any one parameter W in the neural network model by the formula W' = W - α·dW, where W' is the adjusted parameter and W is the parameter before adjustment. α is the learning rate, which may be preset, for example, 0.001, 0.000001, etc., which is not limited in the embodiment of the present application. dW is the derivative of the loss function with respect to W and can be derived from the loss value.
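Restated as code, the parameter update above is a plain gradient-descent step; this sketch assumes the parameter and its gradient are held in NumPy arrays.

```python
import numpy as np

def update_parameter(W: np.ndarray, dW: np.ndarray,
                     learning_rate: float = 0.001) -> np.ndarray:
    """One gradient-descent step: W' = W - learning_rate * dW, where dW is
    the derivative of the loss function with respect to W."""
    return W - learning_rate * dW

# Example usage with a stand-in parameter and gradient.
W = np.random.randn(3, 3)
dW = np.random.randn(3, 3)  # would come from backpropagating the loss value
W = update_parameter(W, dW, learning_rate=0.001)
```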
b. The computer device determines whether the plurality of keypoints meets a target gesture condition.
As one example, in the case where the target gesture is a finger call, the gesture recognition model is a finger call recognition model. The target gesture condition may be set to the presence of an angle in the arm of the object in the video segment, that is, the arm being bent. In this case, the computer device may first obtain the key points related to the arms from the plurality of key points, that is, the left shoulder, right shoulder, left elbow, right elbow, left wrist, and right wrist key points. Then, whether an angle exists among the left shoulder, left elbow, and left wrist key points and whether an angle exists among the right shoulder, right elbow, and right wrist key points are calculated, that is, whether the left arm or the right arm is bent is determined. When either the left arm or the right arm is bent, it may be determined that the plurality of key points satisfy the target gesture condition; when neither the left arm nor the right arm is bent, it is determined that the plurality of key points do not satisfy the target gesture condition.
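A minimal sketch of this arm-bend check is given below. The 150-degree threshold used to decide that an arm is bent is an illustrative assumption, not a value stated in this application; the key points are assumed to be given as (x, y) coordinates keyed by name.

```python
import math

def joint_angle(a, b, c):
    """Angle at point b, in degrees, formed by the segments b->a and b->c,
    e.g. the elbow angle formed by the shoulder, elbow and wrist."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 180.0  # degenerate case: treat the arm as straight
    cos_angle = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_angle))

def meets_target_gesture_condition(kp, bend_threshold=150.0):
    """True if either arm is bent, i.e. the elbow angle is clearly smaller
    than that of a straight arm (threshold is an assumption)."""
    left = joint_angle(kp["left_shoulder"], kp["left_elbow"], kp["left_wrist"])
    right = joint_angle(kp["right_shoulder"], kp["right_elbow"], kp["right_wrist"])
    return left < bend_threshold or right < bend_threshold
```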
c. In the event that the plurality of keypoints meets the target pose condition, the computer device determines this single-object video segment as the target video segment.
In the above example, if the plurality of key points satisfy the target gesture condition, it indicates that the object in the single-object video segment has an arm gesture, and the arm gesture may or may not be the finger call gesture. Because the finger call is also an arm gesture, training the finger call recognition model with video segments that have an arm gesture makes the trained finger call recognition model more accurate, so the single-object video segment can be determined as a target video segment, and a sample data set is generated based on the target video segment. If the plurality of key points do not satisfy the target gesture condition, it indicates that the object in the single-object video segment does not have an arm gesture, and it can be directly determined that the gesture of the object in the single-object video segment is not the finger call gesture; in this case, the single-object video segment has no reference value for model training and can be deleted.
In this case, when the plurality of key points meet the target gesture condition, the single-object video segment is determined to be the target video segment, so that the gesture recognition model can be trained later through the target video segment, and the accuracy of the trained gesture recognition model is higher.
It is noted that after the computer device performs the steps a-c for each of the plurality of single-object video segments, the at least one target video segment may be obtained, so that a sample data set may be generated based on the at least one target video segment, that is, the step 103 may be further performed.
Step 103: the computer device generates a sample dataset based on the at least one target video segment and a label for the at least one target video segment, the label being used to indicate whether an object in the corresponding target video segment is in a target pose, the sample dataset being used to train a pose recognition model.
Alternatively, the labeling of the at least one target video segment may be obtained by means of manual classification. Specifically, for each target video segment in the at least one target video segment, whether the object in the target video segment is in the target pose can be judged by manually observing each target video segment, so that each target video segment is classified, and a classification result (label) of each target video segment is obtained.
Under the condition, the computer equipment can automatically generate a sample data set for training the gesture recognition model, and the sample data set can be used for training the gesture recognition model in a single object scene, so that the labor cost is reduced, the manufacturing efficiency of the data set is improved, and the accuracy of the gesture recognition model in the single object scene is higher.
Notably, after generating the sample data set, the computer device can also train a gesture recognition model for recognizing the target gesture based on the sample data set.
Specifically, the operation of the computer device to train the gesture recognition model may include the following steps (1) -step (2).
(1) The computer device determines a plurality of keypoints for the object in the at least one target video segment based on the sample dataset.
The plurality of keypoints of the object in the at least one target video segment are the plurality of keypoints of the object in each video frame of each target video segment in the at least one target video segment.
Optionally, for any one of the at least one target video segment, the computer device may input the target video segment into the keypoint extraction model, extract, by using the keypoint extraction model, a plurality of keypoints of the object in each video frame of the target video segment, and output a plurality of keypoints of the object in each video frame of the target video segment.
Specifically, the details of the key point extraction model are described in the above step 102, and are not repeated here.
Optionally, for any one of the at least one target video segment, the computer device may obtain the coordinate values of the object in each video frame of the target video segment before inputting the target video segment into the key point extraction model. The coordinate values of the object in each video frame and the target video segment can then be input into the key point extraction model, and the position of the object in each video frame is first determined through the key point extraction model; a plurality of key points of the object indicated by the position in each video frame are then extracted through the key point extraction model, and the plurality of key points of the object in each video frame of the target video segment are output.
In this case, the key point extraction model can accurately find the object in each video frame, and the key point extraction can be directly performed on the object indicated by the coordinate value in each video frame, so that the accuracy of extracting the key point of the object in each video frame of the target video segment is improved.
Optionally, when the computer device outputs the plurality of key points of the object in each video frame of the target video segment through the key point extraction model, the confidence level of the plurality of key points of the object in each video frame of the target video segment may also be output through the key point extraction model.
The confidence is used to indicate how accurate the plurality of key points of the object in each video frame are. The higher the confidence of the plurality of key points, the more accurate the plurality of key points of the object in each video frame output by the key point extraction model; the lower the confidence of the plurality of key points, the less accurate the plurality of key points of the object in each video frame output by the key point extraction model.
Optionally, the computer device may further save the plurality of key points of the object and the confidence of the plurality of key points in each video frame of the target video segment after obtaining the plurality of key points of the object and the confidence of the plurality of key points in each video frame of the target video segment.
For example, the computer device may save the plurality of key points and the confidence of the plurality of key points of the object in each video frame of this target video segment by generating a pkl file (a file format used in Python for saving data).
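Purely as a sketch (the file name and dictionary layout are assumptions), the key points and confidences of one target video segment could be written to a pkl file with Python's pickle module:

```python
import pickle

def save_segment_keypoints(path, keypoints_per_frame, confidences_per_frame):
    """Persist the per-frame key points and their confidences for one
    target video segment as a .pkl file."""
    payload = {
        "keypoints": keypoints_per_frame,      # list of {name: (x, y)} per frame
        "confidence": confidences_per_frame,   # list of {name: float} per frame
    }
    with open(path, "wb") as f:
        pickle.dump(payload, f)

# Example usage:
# save_segment_keypoints("segment_0001.pkl", keypoints, confidences)
```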
It should be noted that the computer device may obtain a plurality of key points of the object in each video frame of each target video segment by performing the above operation on each target video segment in the at least one target video segment. The computer device may then train the gesture recognition model based on the plurality of keypoints of the object in each video frame of each target video segment, i.e., proceed to step (2) below.
(2) The computer device trains the gesture recognition model based on a plurality of keypoints of the object in the at least one target video segment.
In this case, the gesture recognition model may be trained, so that the target gesture may be recognized later by the gesture recognition model. In addition, the gesture recognition model can be applied to single-object scenes, namely, the target gesture in any single-object scene can be accurately recognized through the gesture recognition model.
Specifically, the operation of step (2) may be: inputting a plurality of key points of an object in the target video segment into the gesture recognition model for any one of the at least one target video segment, performing gesture recognition on the plurality of key points of the object in the target video segment through the gesture recognition model, and outputting a prediction recognition result; and adjusting model parameters of the gesture recognition model based on the difference information between the prediction recognition result and the labels corresponding to the target video segments.
The predicted recognition result is whether the object in the target video segment predicted by the gesture recognition model is in the target gesture. The label corresponding to the target video segment is whether the object in the target video segment is actually in the target gesture.
The difference information may be a loss value between the predictive recognition result and the corresponding annotation determined by a loss function.
In this case, whether the object in the target video segment is in the target pose is predicted by the pose recognition model, and then the model parameters of the pose recognition model are adjusted based on the difference information between the predicted recognition result and the actual result (label) of the target video segment, so as to improve the recognition accuracy of the pose recognition model.
Specifically, the entire training process may be: the computer device may obtain a plurality of key points of the object in the at least one target video segment and the label corresponding to each target video segment, where the sample data is the plurality of key points of the object in the target video segment and the sample label is the label corresponding to the target video segment. That is, the input data of the gesture recognition model is the plurality of key points of the object in the target video segment, and the sample labels are the labels corresponding to the target video segment.
The gesture recognition model may include a plurality of network layers including an input layer, a plurality of hidden layers, and an output layer. The input layer is responsible for receiving input data; the output layer is responsible for outputting the processed data; a plurality of hidden layers are located between the input layer and the output layer, responsible for processing data, the plurality of hidden layers being invisible to the outside. Alternatively, the gesture recognition model may be a deep neural network or the like, and may be a convolutional neural network or the like in the deep neural network, for example, the gesture recognition model may be a PoseC3D (pose recognition based on 3-dimensional convolution) model based on the pyskl environment.
When the computer device trains the gesture recognition model, the input data of the gesture recognition model can be input into the gesture recognition model, the input data is processed through the gesture recognition model to obtain a two-dimensional vector, and the two-dimensional vector is then classified through a classification function to obtain the output data (prediction recognition result); a loss value between the output data (prediction recognition result) and the sample label is determined through a loss function; and the model parameters in the gesture recognition model are adjusted according to the loss value. After the parameters in the gesture recognition model have been adjusted based on the plurality of key points in each target video segment in the at least one target video segment and the labels corresponding to the target video segments, the trained gesture recognition model is obtained. The classification function may be an argmax function, a softmax function, or the like, which is not limited in the embodiment of the present application.
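The sketch below is not the PoseC3D model itself; it substitutes a small fully connected network (a stated assumption) purely to show the loop described above: the model maps the key points of one segment to a two-dimensional vector, a softmax classification produces the prediction recognition result, and the loss between the prediction and the label drives the parameter adjustment.

```python
import torch
import torch.nn as nn

T = 32  # assumed number of frames per target video segment

# Stand-in recognition network: input is the flattened key points of one
# segment (T frames x 17 key points x 2 coordinates), output is a
# two-dimensional vector (target gesture / not target gesture).
model = nn.Sequential(nn.Flatten(), nn.Linear(T * 17 * 2, 128),
                      nn.ReLU(), nn.Linear(128, 2))
loss_fn = nn.CrossEntropyLoss()  # loss between prediction and label
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

def train_step(keypoints, label):
    """One training step: predict, compute the loss value, adjust parameters."""
    logits = model(keypoints)                   # two-dimensional vector
    prediction = torch.softmax(logits, dim=-1)  # classification function
    loss = loss_fn(logits, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return prediction.detach(), loss.item()

# Example usage with random stand-in data.
keypoints = torch.randn(1, T, 17, 2)  # key points of one segment
label = torch.tensor([1])             # 1: the object is in the target gesture
train_step(keypoints, label)
```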
Optionally, the operation of step (2) may further be: inputting a plurality of key points of an object in the target video segment and the confidence degrees of the key points into the gesture recognition model for any one of the at least one target video segment, performing gesture recognition on the key points of the object in the target video segment based on the confidence degrees of the key points through the gesture recognition model, and outputting a prediction recognition result; and adjusting model parameters of the gesture recognition model based on the difference information between the prediction recognition result and the labels corresponding to the target video segments.
The confidence of the plurality of key points may indicate the accuracy of the plurality of key points. In this case, the gesture recognition model can recognize the key points with high confidence more easily, so the accuracy of the gesture recognition model can be improved in the training process.
Notably, the computer device may also first divide the sample data set into three portions, a training set, a validation set, and a test set. The training set is used for training the gesture recognition model, the verification set is used for verifying the gesture recognition model after training, and the testing set is used for testing the accuracy of the gesture recognition model.
After training the gesture recognition model, the gesture recognition model can be verified by adopting a verification set, and the gesture recognition model with the best performance on the verification set is selected to be tested on a test set.
Therefore, after the gesture recognition model is trained, the gesture recognition model is verified and tested, so that the recognition accuracy of the gesture recognition model can be improved.
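A minimal sketch of this split, where the 70/15/15 ratios are illustrative assumptions rather than values specified in this application:

```python
import random

def split_dataset(samples, train_ratio=0.7, val_ratio=0.15, seed=0):
    """Divide the sample data set into a training set, a validation set and
    a test set (the ratios here are assumptions)."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    train = samples[:n_train]
    validation = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, validation, test
```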
In the embodiment of the application, a sample data set may be generated through the steps 101-103, and then a gesture recognition model that may be used for recognition in a single object scene may be trained based on the sample data set. When the target gesture needs to be identified, the gesture identification model can be used for identifying the target gesture, and the identification accuracy rate of the single object scene can be improved.
For ease of understanding, taking the case where the target gesture is the finger call and a finger call data set is generated as an example, the sample data set generating method provided in the embodiment of the application is described by way of example with reference to fig. 2.
Referring to fig. 2, the process of generating the finger call data set involves a video stream 201, a plurality of candidate video segments 202, portrait ID coordinate values 203, single-person video segments 204, target video segments 205, and a finger call data set 206.
Referring to fig. 2, the overall process of generating the finger call data set includes the following steps (1) -step (4).
(1) The computer device first performs video segment division on the video stream 201 to obtain a plurality of candidate video segments 202, where the plurality of candidate video segments 202 are video segments corresponding to a plurality of portraits in the video stream 201.
(2) The computer device performs object detection on each candidate video segment 202 in the plurality of candidate video segments 202, and screens out the candidate video segments 202 that contain one portrait to obtain a plurality of single-person video segments 204. In addition, the computer device may also obtain the coordinate values 203 of the portrait ID in each candidate video segment 202, that is, obtain the position of the portrait in the candidate video segment.
(3) The computer device performs gesture recognition on each of the single video segments 204 in the plurality of single video segments 204, and retains the single video segment of the single video segments 204 that performs the arm motion, thereby obtaining at least one target video segment 205.
(4) For each target video segment 205 in the at least one target video segment 205, the target video segment 205 is first manually classified, and whether the portrait in the target video segment 205 is in the finger call gesture is determined, so as to obtain the label of each target video segment 205. The computer device then generates a finger call data set 206 based on the at least one target video segment 205 and the corresponding label of the target video segment 205.
Notably, after the finger call data set 206 is generated, the finger call data set 206 can also be used to train a finger call recognition model. Referring to the model training process in fig. 2, it includes a keypoint extraction model 207, a plurality of keypoints and confidence levels 208, and a finger call recognition model 209.
Referring to fig. 2, the overall model training process includes the following steps (1) -step (3).
(1) The computer device inputs the finger call data set 206 and the ID coordinate values 203 output in the process of generating the finger call data set into the key point extraction model 207.
(2) The computer device extracts a plurality of keypoints of the object in the target video segment 205 through a keypoint extraction model 207, resulting in a plurality of keypoints and keypoint confidence 208.
(3) The computer device uses the plurality of keypoints and keypoint confidence 208 of the object in at least one target video segment in the finger call data set to train, verify, and test the finger call recognition model 209, so as to obtain a finger call recognition model 209 that can ultimately be used for recognition.
In this embodiment of the present application, the computer device performs video segment division on the video stream to obtain a plurality of candidate video segments, that is, obtains a plurality of candidate video segments including at least one object based on the video stream division. And then carrying out target detection and gesture recognition on the plurality of candidate video segments to obtain at least one target video segment in the plurality of candidate video segments, namely, screening out candidate video segments only comprising one object from the plurality of candidate video segments, wherein the gesture of the object in the candidate video segments accords with the target gesture condition. And then generating a sample data set based on at least one target video segment and labels of the at least one target video segment, wherein the labels can indicate whether objects in the corresponding target video segment are in target gestures, and the generated sample data set can be used for training a gesture recognition model. Thus, the sample data set for training the gesture recognition model can be automatically generated, so that the labor cost is reduced, and the manufacturing efficiency of the sample data set is improved. In addition, the target video segment only comprising one object can be automatically obtained, so that the sample data set can be suitable for training the gesture recognition model under the single object scene, and the recognition efficiency of the gesture recognition model under the single object scene can be improved.
Fig. 3 is a schematic structural diagram of a sample data set generating device according to an embodiment of the present application. The sample data set generating means may be implemented by software, hardware or a combination of both as part or all of a computer device, which may be a computer device as shown in fig. 4 below. Referring to fig. 3, the apparatus includes: a first processing module 301, a second processing module 302, a generating module 303.
The first processing module 301 is configured to divide a video segment of a video stream to obtain a plurality of candidate video segments, where the plurality of candidate video segments are video segments that include at least one object in the video stream;
the second processing module 302 is configured to perform object detection and gesture recognition on the plurality of candidate video segments to obtain at least one target video segment in the plurality of candidate video segments, where the target video segment is a candidate video segment including an object, and the gesture of the object in the target video segment meets a target gesture condition;
a generating module 303, configured to generate a sample data set based on the at least one target video segment and a label of the at least one target video segment, where the label is used to indicate whether an object in a corresponding target video segment is in a target pose, and the sample data set is used to train a pose recognition model.
Optionally, the first processing module 301 is configured to:
for any object in the video stream, performing target detection on the video stream to obtain a plurality of target video frames, wherein the target video frames comprise the object; aggregating the target video frames to obtain candidate video segments corresponding to the object;
or, for any object in the video stream, performing object detection on the video stream to obtain a reference video frame, wherein the reference video frame contains the object; and carrying out target tracking in the video stream by taking the reference video frame as a starting point to obtain a candidate video segment corresponding to the object.
Optionally, the apparatus further comprises:
the first filtering module is configured to delete, for any one of the plurality of candidate video segments, the candidate video segment when the candidate video segment does not satisfy a preset condition, so as to obtain a plurality of reference video segments, where the preset condition is that a number of frames of the candidate video segment is greater than or equal to a first preset number of frames and less than or equal to a second preset number of frames, and the first preset number of frames is less than the second preset number of frames;
optionally, the second processing module 302 is configured to:
and performing target detection and gesture recognition on the plurality of reference video segments to obtain at least one target video segment in the plurality of reference video segments.
Optionally, the second processing module 302 is configured to:
for any one candidate video segment in the plurality of candidate video segments, sliding the candidate frame on the candidate video segment;
classifying the regions covered by the candidate frame in the sliding process of the candidate frame to obtain region classification results of a plurality of regions of the candidate video segment;
determining at least one detection frame of the candidate video segment based on the region classification results of the plurality of regions, the at least one detection frame surrounding at least one object in the candidate video segment;
screening a plurality of single-object video segments from the plurality of candidate video segments based on at least one detection frame in the plurality of candidate video segments, wherein the single-object video segments are candidate video segments containing one object;
and carrying out gesture recognition on the plurality of single-object video segments to obtain the at least one target video segment.
Optionally, the second processing module 302 is configured to:
for any one candidate video segment in the plurality of candidate video segments, deleting the candidate video segment when a plurality of detection frames are detected in the candidate video segment;
and under the condition that a detection frame is detected in the candidate video segment, determining the candidate video segment as the single-object video segment.
Optionally, the second processing module 302 is configured to:
for any one single-object video segment in a plurality of single-object video segments, extracting key points of the objects in the single-object video segment to obtain a plurality of key points of the objects in the single-object video segment;
judging whether the plurality of key points meet the target attitude condition;
in the case where the plurality of key points satisfy the target pose condition, the single-object video segment is determined as the target video segment.
Optionally, the apparatus further comprises:
a determining module for determining a plurality of keypoints of the object in the at least one target video segment based on the sample dataset;
and the training module is used for training the gesture recognition model based on a plurality of key points of the object in the at least one target video segment.
Optionally, the training module is configured to:
inputting a plurality of key points of an object in the target video segment into the gesture recognition model for any one of the at least one target video segment, performing gesture recognition on the plurality of key points of the object in the target video segment through the gesture recognition model, and outputting a prediction recognition result;
and adjusting model parameters of the gesture recognition model based on the difference information between the prediction recognition result and the labels corresponding to the target video segments.
In the embodiment of the present application, video segments of a video stream are divided to obtain a plurality of candidate video segments, that is, a plurality of candidate video segments including at least one object are obtained based on the video stream division. And then carrying out target detection and gesture recognition on the plurality of candidate video segments to obtain at least one target video segment in the plurality of candidate video segments, namely, screening out candidate video segments only comprising one object from the plurality of candidate video segments, wherein the gesture of the object in the candidate video segments accords with the target gesture condition. And then generating a sample data set based on at least one target video segment and labels of the at least one target video segment, wherein the labels can indicate whether objects in the corresponding target video segment are in target gestures, and the generated sample data set can be used for training a gesture recognition model. Thus, the sample data set for training the gesture recognition model can be automatically generated, so that the labor cost is reduced, and the manufacturing efficiency of the sample data set is improved. In addition, the target video segment only comprising one object can be automatically obtained, so that the sample data set can be suitable for training the gesture recognition model under the single object scene, and the recognition efficiency of the gesture recognition model under the single object scene can be improved.
It should be noted that: the sample data set generating device provided in the above embodiment is only exemplified by the division of the above functional modules when generating a sample data set, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the functions described above.
The functional units and modules in the above embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the embodiments of the present application.
The sample data set generating device and the sample data set generating method provided in the foregoing embodiments belong to the same concept, and specific working processes and technical effects brought by units and modules in the foregoing embodiments may be referred to a method embodiment section, and are not repeated herein.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 4, the computer device 4 includes: a processor 40, a memory 41 and a computer program 42 stored in the memory 41 and executable on the processor 40, the processor 40 implementing the steps in the sample dataset generation method in the above-described embodiments when executing the computer program 42.
The computer device 4 may be a general purpose computer device or a special purpose computer device. In a specific implementation, the computer device 4 may be a terminal or a network server such as a desktop, a portable computer, a palm computer, a mobile phone, a tablet computer, etc., and the embodiments of the present application are not limited to the type of the computer device 4. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the computer device 4 and is not meant to be limiting as the computer device 4 may include more or fewer components than shown, or may combine certain components, or may include different components, such as may also include input-output devices, network access devices, etc.
Processor 40 may be a central processing unit (Central Processing Unit, CPU), and processor 40 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. The general purpose processor may be a microprocessor or may be any conventional processor.
The memory 41 may in some embodiments be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. The memory 41 may also be an external storage device of the computer device 4 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the computer device 4. Further, the memory 41 may also include both an internal storage unit and an external storage device of the computer device 4. The memory 41 is used to store an operating system, application programs, boot Loader (Boot Loader), data, and other programs. The memory 41 may also be used to temporarily store data that has been output or is to be output.
The embodiment of the application also provides a computer device, which comprises: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, which when executed by the processor performs the steps of any of the various method embodiments described above.
The present application also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the respective method embodiments described above.
The present embodiments provide a computer program product which, when run on a computer, causes the computer to perform the steps of the various method embodiments described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. With such understanding, the present application implements all or part of the flow of the above-described method embodiments, and may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, may implement the steps of the above-described method embodiments. Wherein the computer program comprises computer program code which may be in the form of source code, object code, executable files or in some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal device, recording medium, computer Memory, ROM (Read-Only Memory), RAM (Random Access Memory ), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage device, and so forth. The computer readable storage medium mentioned in the present application may be a non-volatile storage medium, in other words, a non-transitory storage medium.
It should be understood that all or part of the steps to implement the above-described embodiments may be implemented by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The computer instructions may be stored in the computer-readable storage medium described above.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and in part, not described or illustrated in any particular embodiment, reference is made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in this application, it should be understood that the disclosed apparatus/computer device and method may be implemented in other ways. For example, the apparatus/computer device embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (11)

1. A method of generating a sample dataset, the method comprising:
dividing video segments of a video stream to obtain a plurality of candidate video segments, wherein the plurality of candidate video segments are video segments containing at least one object in the video stream;
Performing target detection and gesture recognition on the plurality of candidate video segments to obtain at least one target video segment in the plurality of candidate video segments, wherein the target video segment is a candidate video segment containing one object, and the gesture of the object in the target video segment accords with a target gesture condition;
a sample data set is generated based on at least one of the target video segments and labels of at least one of the target video segments, the labels being used to indicate whether objects in the corresponding target video segment are in a target pose, the sample data set being used to train a pose recognition model.
2. The method of claim 1, wherein the video segment partitioning of the video stream results in a plurality of candidate video segments, comprising any one of:
performing target detection on the video stream for any object in the video stream to obtain a plurality of target video frames, wherein the plurality of target video frames comprise the object; aggregating the target video frames to obtain candidate video segments corresponding to the objects;
or, for any object in the video stream, performing object detection on the video stream to obtain a reference video frame, wherein the reference video frame contains the object; and carrying out target tracking in the video stream by taking the reference video frame as a starting point to obtain a candidate video segment corresponding to the object.
3. The method of claim 1, wherein prior to performing object detection and gesture recognition on the plurality of candidate video segments to obtain at least one object video segment of the plurality of candidate video segments, further comprising:
deleting a candidate video segment of any one of the plurality of candidate video segments under the condition that the candidate video segment does not meet a preset condition to obtain a plurality of reference video segments, wherein the preset condition is that the number of frames of the candidate video segment is greater than or equal to a first preset number of frames and less than or equal to a second preset number of frames, and the first preset number of frames is less than the second preset number of frames;
the performing object detection and gesture recognition on the plurality of candidate video segments to obtain at least one object video segment in the plurality of candidate video segments includes:
and performing target detection and gesture recognition on the plurality of reference video segments to obtain at least one target video segment in the plurality of reference video segments.
4. The method of claim 1, wherein performing object detection and gesture recognition on the plurality of candidate video segments to obtain at least one object video segment of the plurality of candidate video segments comprises:
For any one candidate video segment in the plurality of candidate video segments, sliding the candidate frame on the candidate video segment;
classifying the regions covered by the candidate frame in the sliding process of the candidate frame to obtain region classification results of a plurality of regions of the candidate video segment;
determining at least one detection frame of the candidate video segment based on the region classification results of the plurality of regions, the at least one detection frame surrounding at least one object in the candidate video segment;
screening a plurality of single-object video segments from the plurality of candidate video segments based on at least one detection frame in the plurality of candidate video segments, wherein the single-object video segments are candidate video segments containing one object;
and carrying out gesture recognition on the plurality of single-object video segments to obtain at least one target video segment.
5. The method of claim 4, wherein the screening a plurality of single-object video segments from the plurality of candidate video segments based on at least one detection frame comprises:
for any one candidate video segment in the plurality of candidate video segments, deleting the candidate video segment under the condition that a plurality of detection frames are detected in the candidate video segment;
And under the condition that one detection frame is detected in the candidate video segments, determining the candidate video segments as the single-object video segments.
6. The method of claim 4, wherein said performing gesture recognition on said plurality of single-object video segments to obtain at least one of said target video segments comprises:
extracting key points of the objects in the single-object video segment for any single-object video segment in the plurality of single-object video segments to obtain a plurality of key points of the objects in the single-object video segment;
judging whether the plurality of key points meet a target attitude condition or not;
and determining the single-object video segment as the target video segment under the condition that the plurality of key points meet the target gesture condition.
7. The method of claim 1, wherein the method further comprises:
determining a plurality of keypoints for objects in at least one of the target video segments based on the sample dataset;
the gesture recognition model is trained based on a plurality of keypoints of the object in at least one of the target video segments.
8. The method of claim 7, wherein the training the gesture recognition model based on a plurality of keypoints of objects in at least one of the target video segments comprises:
Inputting a plurality of key points of an object in at least one target video segment into the gesture recognition model, performing gesture recognition on the plurality of key points of the object in the target video segment through the gesture recognition model, and outputting a prediction recognition result;
and adjusting model parameters of the gesture recognition model based on difference information between the prediction recognition result and labels corresponding to the target video segments.
9. A sample data set generating apparatus, the apparatus comprising:
the first processing module is used for dividing video segments of a video stream to obtain a plurality of candidate video segments, wherein the candidate video segments are video segments containing at least one object in the video stream;
the second processing module is used for carrying out target detection and gesture recognition on the plurality of candidate video segments to obtain at least one target video segment in the plurality of candidate video segments, wherein the target video segment is a candidate video segment containing one object, and the gesture of the object in the target video segment accords with a target gesture condition;
and the generating module is used for generating a sample data set based on at least one target video segment and the label of the at least one target video segment, wherein the label is used for indicating whether an object in the corresponding target video segment is in a target gesture or not, and the sample data set is used for training a gesture recognition model.
10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, which computer program, when executed by the processor, implements the method according to any of claims 1 to 8.
11. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed, implements the method according to any of claims 1 to 8.
CN202310127837.1A 2023-02-17 2023-02-17 Sample data set generation method, device, equipment and storage medium Pending CN116310952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310127837.1A CN116310952A (en) 2023-02-17 2023-02-17 Sample data set generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310127837.1A CN116310952A (en) 2023-02-17 2023-02-17 Sample data set generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116310952A true CN116310952A (en) 2023-06-23

Family

ID=86784334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310127837.1A Pending CN116310952A (en) 2023-02-17 2023-02-17 Sample data set generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116310952A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination