CN110363220B - Behavior class detection method and device, electronic equipment and computer readable medium

Behavior class detection method and device, electronic equipment and computer readable medium

Info

Publication number
CN110363220B
CN110363220B (application CN201910503133.3A)
Authority
CN
China
Prior art keywords
behavior
human body
category
preset
frame
Prior art date
Legal status
Active
Application number
CN201910503133.3A
Other languages
Chinese (zh)
Other versions
CN110363220A (en)
Inventor
杨洋
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201910503133.3A
Publication of CN110363220A
Application granted
Publication of CN110363220B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Abstract

The embodiment of the application discloses a behavior class detection method and device, electronic equipment and a computer readable medium. An embodiment of the method comprises: carrying out human body detection on a frame in a target video, and determining a human body object region in the frame; determining a scene area in the frame, and respectively inputting the human body object area and the scene area into a pre-trained behavior type detection model to obtain behavior type detection results respectively corresponding to the human body object area and the scene area; and counting the obtained behavior type detection result to determine the behavior type of the human body object in the frame. This embodiment improves the accuracy of the detection of the behavior class of the human object in the video frame.

Description

Behavior class detection method and device, electronic equipment and computer readable medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a behavior class detection method, a behavior class detection device, electronic equipment and a computer readable medium.
Background
Video understanding, as a precondition and means for automatically analyzing and processing videos, is of important value and significance in video recommendation, highlight segment extraction, video tagging and the like. For example, in videos such as movies and television series, important behavior actions are often the key to analyzing the plot of the video content. Therefore, detecting the behavior categories of humans in the frames of a video can provide support for video analysis.
In the related method, usually only the action performance of the human body in the video frame is focused on, and the behavior category is determined directly based on that action performance. However, in many scenes (for example, in movies and television series), different behaviors may have similar action performances (for example, drinking water and eating), so determining the behavior category only according to the action performance of the human body generally yields low accuracy.
Disclosure of Invention
The embodiment of the application provides a behavior class detection method, a behavior class detection device, electronic equipment and a computer readable medium, and aims to solve the technical problem that in the prior art, the detection of the behavior class is not accurate enough because the behavior class is determined only according to the performance of a human body.
In a first aspect, an embodiment of the present application provides a behavior class detection method, where the method includes: carrying out human body detection on a frame in a target video, and determining a human body object region in the frame; determining a scene area in a frame, and respectively inputting a human body object area and the scene area into a pre-trained behavior type detection model to obtain behavior type detection results respectively corresponding to the human body object area and the scene area, wherein the behavior type detection model is used for representing the corresponding relation between an image and a behavior type; and counting the obtained behavior type detection result to determine the behavior type of the human body object in the frame.
In some embodiments, the behavior class detection result includes a probability that the behavior class is each preset behavior class; and the counting of the obtained behavior class detection results to determine the behavior class of the human body object in the frame comprises: counting the probabilities of the same preset behavior category in the obtained behavior category detection results to obtain the score of each preset behavior category; and determining the behavior category of the human body object in the frame based on the scores of the preset behavior categories.
In some embodiments, before determining the behavior class of the human body object in the frame based on the obtained scores, the counting of the obtained behavior class detection results to determine the behavior class of the human body object in the frame further comprises: inputting the frame into a pre-trained object detection model to obtain an object detection result, wherein the object detection model is used for detecting an object in the image; and for each preset behavior category related to interaction with an object, determining an interactive object corresponding to the preset behavior category, extracting the score of the interactive object from the object detection result, weighting the score of the preset behavior category and the score of the interactive object, and taking the weighted result as the score of the preset behavior category so as to update the score.
In some embodiments, determining the behavior class of the human object in the frame based on the score of each preset behavior class includes: determining whether a score greater than a preset threshold exists; and in response to determining that the score exists, determining a preset behavior category corresponding to the maximum value of the score as the behavior category of the human body object in the frame.
In some embodiments, determining the behavior category of the human object in the frame based on the score of each preset behavior category further comprises: in response to determining that no score larger than a preset threshold exists, selecting at least one preset behavior category according to the sequence of scores from large to small; for each selected preset behavior category, extracting a behavior category judgment model matched with the preset behavior category, and inputting the frame to the behavior category judgment model to obtain a judgment result, wherein the behavior category judgment model is used for judging whether the behavior category of the human body object in the image is the preset behavior category; and determining the behavior type of the human body object in the frame based on the judgment result corresponding to each selected preset behavior type.
In some embodiments, the method further comprises: and according to the time sequence of the frames in the video, smoothing the behavior type of the human body object in each frame of the target video to generate a behavior type information sequence.
In some embodiments, the behavior class detection model is trained by the following model training steps: acquiring a training sample set, wherein samples in the training sample set comprise training image samples and first marking information, and the first marking information is used for indicating behavior categories of human body objects in the training image samples; and training by using a machine learning method to obtain a behavior type detection model by taking training image samples in the training sample set as input and first labeling information corresponding to the input training image samples as output.
In some embodiments, after training the behavior class detection model, the model training step further comprises: acquiring a test sample set, wherein samples in the test sample set comprise test image samples and second marking information, and the second marking information is used for indicating behavior types of human body objects in the test image samples; extracting samples in the test sample set, and executing the following test steps: inputting a test image sample in the extracted samples to a behavior class detection model; judging whether the behavior type detection result output by the behavior type detection model is matched with second labeling information in the extracted sample or not; in response to determining a mismatch, the extracted sample is determined to be a difficult sample.
In some embodiments, after the testing step is performed, the model training step further comprises: adding each hard sample into a corresponding target sample set according to behavior categories, wherein the behavior categories correspond to the target sample sets one to one; and for each target sample set, taking the behavior class corresponding to the target sample set as a target behavior class, taking the test image samples of the samples in the target sample set as input, taking the second labeling information corresponding to the input test image samples as output, and training by using a machine learning method to obtain a behavior class judgment model corresponding to the target behavior class.
In a second aspect, an embodiment of the present application provides a behavior class detection apparatus, including: the human body detection unit is configured to detect a human body of a frame in the target video and determine a human body object region in the frame; the behavior type detection unit is configured to determine a scene area in a frame, input a human body object area and the scene area into a pre-trained behavior type detection model respectively, and obtain behavior type detection results corresponding to the human body object area and the scene area respectively, wherein the behavior type detection model is used for representing the corresponding relation between an image and a behavior type; and the counting unit is configured to count the obtained behavior type detection result and determine the behavior type of the human body object in the frame.
In some embodiments, the behavior class detection result includes a probability that the behavior class is a preset behavior class; and a statistical unit comprising: the statistical module is configured to count the probability of the same preset behavior category in the obtained behavior category detection result to obtain the score of each preset behavior category; a determining module configured to determine a behavior class of the human object in the frame based on the score of each preset behavior class.
In some embodiments, the statistics module is further configured to: inputting the frame into a pre-trained object detection model to obtain an object detection result, wherein the object detection model is used for detecting an object in the image; and for each preset behavior category related to interaction with the object, determining an interactive object corresponding to the preset behavior category, extracting the score of the interactive object from the object detection result, weighting the score of the preset behavior category and the score of the interactive object, and taking the weighted result as the score of the preset behavior category to update the score.
In some embodiments, the determination module is further configured to: determining whether a score greater than a preset threshold exists; and in response to determining that the score exists, determining a preset behavior category corresponding to the maximum value of the score as the behavior category of the human body object in the frame.
In some embodiments, the determination module is further configured to: in response to determining that no score larger than a preset threshold exists, selecting at least one preset behavior category according to the sequence of scores from large to small; for each selected preset behavior category, extracting a behavior category judgment model matched with the preset behavior category, and inputting the frame to the behavior category judgment model to obtain a judgment result, wherein the behavior category judgment model is used for judging whether the behavior category of the human body object in the image is the preset behavior category; and determining the behavior type of the human body object in the frame based on the judgment result corresponding to each selected preset behavior type.
In some embodiments, the apparatus further comprises: and the smoothing unit is configured to smooth the behavior type of the human body object in each frame of the target video according to the time sequence of the frames in the video, and generate a behavior type information sequence.
In some embodiments, the behavior class detection model is trained by the following model training steps: acquiring a training sample set, wherein samples in the training sample set comprise training image samples and first marking information, and the first marking information is used for indicating behavior categories of human body objects in the training image samples; and training by using a machine learning method to obtain a behavior type detection model by taking training image samples in the training sample set as input and first labeling information corresponding to the input training image samples as output.
In some embodiments, after training the behavior class detection model, the model training step further comprises: acquiring a test sample set, wherein samples in the test sample set comprise test image samples and second marking information, and the second marking information is used for indicating behavior types of human body objects in the test image samples; extracting samples in the test sample set, and executing the following test steps: inputting a test image sample in the extracted samples to a behavior class detection model; judging whether the behavior type detection result output by the behavior type detection model is matched with second labeling information in the extracted sample or not; in response to determining a mismatch, the extracted sample is determined to be a difficult sample.
In some embodiments, after the testing step is performed, the model training step further comprises: adding each hard sample into a corresponding target sample set according to behavior categories, wherein the behavior categories correspond to the target sample sets one to one; and for each target sample set, taking the behavior class corresponding to the target sample set as a target behavior class, taking the test image samples of the samples in the target sample set as input, taking the second labeling information corresponding to the input test image samples as output, and training by using a machine learning method to obtain a behavior class judgment model corresponding to the target behavior class.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; storage means having one or more programs stored thereon which, when executed by one or more processors, cause the one or more processors to implement a method as in any one of the embodiments of the first aspect described above.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the method according to any one of the embodiments of the first aspect.
According to the behavior type detection method and device provided by the embodiment of the application, the human body object area in the frame is determined by carrying out human body detection on the frame in the target video. Then the scene area in the frame is determined, so that the human body object area and the scene area are respectively input into a pre-trained behavior type detection model to obtain behavior type detection results respectively corresponding to the human body object area and the scene area. Finally, the obtained behavior type detection results are counted, and the behavior type of the human body object in the frame is determined. Therefore, the action performance of the human body and the scene in which it appears can be combined, so that more information is used in the detection of the behavior category, and the accuracy of detecting the behavior category of the human body object in the video frame is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow diagram of one embodiment of a behavior class detection method according to the present application;
FIG. 2 is a flow diagram of yet another embodiment of a behavior category detection method according to the present application;
FIG. 3 is a schematic diagram of one process of a behavior class detection method according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a behavior class detection apparatus according to the present application;
FIG. 5 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Referring to fig. 1, a flow 100 of one embodiment of a behavior class detection method according to the present application is shown. The behavior class detection method comprises the following steps:
step 101, performing human body detection on a frame in a target video, and determining a human body object region in the frame.
In this embodiment, an execution subject (e.g., an electronic device such as a server) of the behavior class detection method may perform human body detection on a frame in a target video. Here, the target video may be any video currently to be processed. In practice, a video may be described in terms of frames, where a frame is the smallest visual unit constituting a video. Each frame is a static image, and a temporally successive sequence of frames is composited together to form the video.
Here, the execution body may perform the human body detection using various manners. For example, a frame in the target video may be input to a human detection model trained in advance, and a human object region in the frame may be determined. The human body detection model is used for detecting a human body object region in an image. Here, the human body detection model may be obtained by performing supervised training on an existing Convolutional Neural Network (CNN) based on a sample set (including human body image samples and labels indicating positions of human body object regions) by using a machine learning method. Various existing structures can be used for the convolutional neural network, such as DenseBox, VGGNet, ResNet, SegNet, and the like.
In practice, a convolutional neural network is a feedforward neural network whose artificial neurons respond to surrounding units within a local coverage range (receptive field), and it performs excellently on image processing, so a convolutional neural network can be used to extract and process the features of the frames in the target video. A convolutional neural network may include convolutional layers, pooling layers, fully-connected layers, and the like. Among these, convolutional layers may be used to extract image features, and pooling layers may be used to down-sample the incoming information. It should be noted that the machine learning method and the supervised training method are well-known technologies that are widely researched and applied at present, and are not described herein again.
In some optional implementations of the present embodiment, the human body detection model may be trained using a Faster-RCNN network structure. In practice, the Faster-RCNN network structure is a neural network structure that can be used for target detection and can accurately locate the position of an object in an image, so this network can be used to determine the human body object region. Because the Faster-RCNN network structure includes an RPN (Region Proposal Network), it can quickly determine the region containing a specified object in an image, and thus can detect the human body object region faster than other network structures.
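For illustration, the sketch below uses an off-the-shelf Faster-RCNN detector from torchvision to locate human body object regions in a frame; the specific model, the score threshold and the use of COCO label 1 ("person") are assumptions made for the example, not details fixed by this embodiment.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Pre-trained Faster-RCNN, used here as an assumed stand-in for the human body detection model.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

def detect_human_regions(frame_rgb, score_thresh=0.7):
    """Return [x1, y1, x2, y2] boxes of persons detected in one video frame."""
    with torch.no_grad():
        pred = detector([to_tensor(frame_rgb)])[0]
    boxes = []
    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if label.item() == 1 and score.item() >= score_thresh:  # COCO class 1 = person
            boxes.append([int(v) for v in box.tolist()])
    return boxes
```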
And 102, determining a scene area in the frame, and respectively inputting the human body object area and the scene area into a pre-trained behavior type detection model to obtain behavior type detection results respectively corresponding to the human body object area and the scene area.
In this embodiment, the execution body may first determine a scene area in the frame. The scene area may be the area of the scene in which the human body object in the frame is located. For example, if a human body object in a frame is having a meal in a restaurant, the restaurant area may be used as the scene area. Here, the scene area may be determined by using an existing detection method such as selective search or a sliding window approach.
Optionally, when the Faster-RCNN network structure is used as the structure of the human body detection model, the RPN in that network structure may also be reused to determine the scene area, thereby improving the detection speed of the scene area.
After the scene area is determined, the execution subject may input the human body object area and the scene area to a pre-trained behavior class detection model, respectively, to obtain a behavior class detection result corresponding to the human body object area, and obtain a behavior class detection result corresponding to the scene area. The behavior class detection model can be used for representing the corresponding relation between the image and the behavior class. As an example, the behavior class detection model may be a correspondence table that is previously established by a technician based on a large amount of data statistics and used for characterizing the correspondence of the image with the behavior class. As yet another example, the behavior class detection model may be a model trained by a machine learning method, and the model may perform behavior class detection on a human object in an image.
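As a minimal sketch of this step, assuming the behavior class detection model is an image classifier that outputs one probability per preset behavior category, the human body object area and the scene area can be cropped and scored separately; behavior_model and the 224x224 input size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import to_tensor, resize

def classify_region(behavior_model, frame_rgb, box):
    """Run the behavior class detection model on one region (box) of the frame."""
    x1, y1, x2, y2 = box
    crop = to_tensor(frame_rgb)[:, y1:y2, x1:x2]   # C x H x W crop of the region
    crop = resize(crop, [224, 224]).unsqueeze(0)   # assumed model input size
    with torch.no_grad():
        logits = behavior_model(crop)
    return F.softmax(logits, dim=1).squeeze(0)     # one probability per preset behavior category

# human_probs = classify_region(behavior_model, frame, human_box)
# scene_probs = classify_region(behavior_model, frame, scene_box)
```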
In some optional implementations of this embodiment, the behavior class detection model may be obtained by training through the following model training steps:
in the first step, a training sample set is obtained. The samples in the training sample set may include training image samples and first label information. The first labeling information may be used to indicate a behavior class of a human object in the training image sample.
And secondly, training to obtain a behavior type detection model by using a machine learning method by taking the training image samples in the training sample set as input and taking the first label information corresponding to the input training image samples as output. Here, the initial model for training the behavior class detection model may adopt various existing neural network structures (e.g., DenseBox, VGGNet, ResNet, SegNet, etc.) having a classification function.
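A training-loop sketch for the model training steps above, assuming the training sample set is exposed as a standard (image tensor, behavior class id) dataset; the ResNet-18 backbone, the optimizer and the epoch count are assumptions, not choices prescribed by this embodiment.

```python
import torch
import torchvision
from torch.utils.data import DataLoader

def train_behavior_class_model(train_dataset, num_classes, epochs=10, lr=1e-3):
    """Supervised training: training image samples as input, first labeling information as target."""
    model = torchvision.models.resnet18(pretrained=True)
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:  # labels: behavior class ids from the first labeling information
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```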
And 103, counting the obtained behavior type detection results, and determining the behavior type of the human body object in the frame.
In this embodiment, the execution subject may count the obtained behavior type detection results (including the behavior type detection result corresponding to the human body object region and the behavior type detection result corresponding to the scene region), and determine the behavior type of the human body object in the frame. Here, the statistics may be performed in various ways. For example, the category detection results may include scores for different behavior categories. The higher the score of a certain behavior class, the greater the likelihood of belonging to that behavior class. The execution subject may add or perform weighted calculation on scores of the same behavior class in the two class detection results, and determine the behavior class corresponding to the maximum value of the calculation results as the behavior class of the human subject.
In some optional implementations of this embodiment, the behavior class detection result may include the probability that the behavior class is each preset behavior class. The executing entity may first perform statistics on the probabilities of the same preset behavior category in the obtained behavior class detection results (for example, by direct addition or a weighted sum) to obtain the score of each preset behavior category. Then, the behavior class of the human body object in the frame can be determined based on the score of each preset behavior class. As an example, the behavior category corresponding to the maximum value of the score may be directly determined as the behavior category of the human body object.
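The statistics can be as simple as a per-category weighted sum of the two detection results, as in the sketch below; the 0.6/0.4 weights and the dict-based interface are illustrative assumptions.

```python
def combine_scores(human_probs, scene_probs, w_human=0.6, w_scene=0.4):
    """Per-category weighted sum of the human-region and scene-region probabilities."""
    return {cls: w_human * human_probs[cls] + w_scene * scene_probs[cls]
            for cls in human_probs}

# Example: combine_scores({"drink": 0.7, "eat": 0.3}, {"drink": 0.4, "eat": 0.6})
# -> {"drink": 0.58, "eat": 0.42}; the top-scoring category is "drink".
```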
Optionally, in the implementation manner, after obtaining the score of each preset behavior category, the execution main body may further detect an object in the frame, and adjust the score of each preset behavior category according to an object detection result. The method can be specifically executed according to the following steps:
first, the frame may be input to a pre-trained object detection model to obtain an object detection result. The object detection model is used for detecting an object in an image. In practice, the YOLO framework (including the convolutional layer and the fully-connected layer) for object detection can be used for object detection.
Then, for each preset behavior category related to interaction with the object, an interactive object corresponding to the preset behavior category may be determined, a score of the interactive object is extracted from the object detection result, the score of the preset behavior category and the score of the interactive object are weighted, and the weighted result is used as the score of the preset behavior category to update the score. Thus, after the score is updated, the behavior category of the human body object in the frame is determined based on the updated score.
It should be noted that the preset behavior category related to interaction with the object and the corresponding relationship between the preset behavior category and the interaction object may be known. As an example, a certain preset behavior category is playing guitar. Since playing guitar requires interaction with an item (i.e. guitar), the preset behavior category is the behavior category interacting with the object, and the interacting object corresponding to the preset behavior category is the guitar. It should be noted that the weighting factor of the score of the preset behavior category and the weighting factor of the score of the above-mentioned interactive object may be preset by the skilled person as required.
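A sketch of this optional score update: for each preset behavior category that involves an interactive object, the category score is re-weighted with the detected object's score; the category-to-object mapping and the 0.7/0.3 weights below are illustrative assumptions.

```python
# Hypothetical mapping from interaction-related categories to their interactive objects.
INTERACTIVE_OBJECT = {"play_guitar": "guitar", "ride_horse": "horse"}

def update_scores_with_objects(scores, object_scores, w_cls=0.7, w_obj=0.3):
    """Re-weight interaction-related category scores using the object detection result."""
    updated = dict(scores)
    for category, obj in INTERACTIVE_OBJECT.items():
        if category in scores:
            obj_score = object_scores.get(obj, 0.0)  # 0 if the object was not detected
            updated[category] = w_cls * scores[category] + w_obj * obj_score
    return updated
```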
In the method provided by the above embodiment of the present application, the human body object region in the frame is determined by performing human body detection on the frame in the target video. And then determining a scene area in the frame so as to input the human body object area and the scene area into a pre-trained behavior type detection model respectively, and obtaining behavior type detection results corresponding to the human body object area and the scene area respectively. And finally, counting the obtained behavior type detection result to determine the behavior type of the human body object in the frame. Therefore, the human body expression and the scene can be combined, more information is combined in the detection process of the behavior category, and the accuracy of detecting the behavior category of the human body object in the video frame is improved.
With further reference to fig. 2, a flow 200 of yet another embodiment of a behavior class detection method is shown. The flow 200 of the behavior class detection method includes the following steps:
step 201, performing human body detection on a frame in a target video, and determining a human body object region in the frame.
In this embodiment, an execution subject (e.g., an electronic device such as a server) of the behavior class detection method may train the human body detection model using the Faster-RCNN network structure. In practice, the Faster-RCNN network structure is a neural network structure that can be used for target detection and can accurately locate the position of an object in an image, so this network can be used to determine the human body object region. Since the Faster-RCNN network structure includes an RPN (Region Proposal Network), the detection of the human body object region can be performed faster than with other network structures.
Step 202, determining a scene area in the frame, and inputting the human body object area and the scene area to a pre-trained behavior type detection model respectively to obtain behavior type detection results corresponding to the human body object area and the scene area respectively.
In this embodiment, the execution body may first determine a scene area in the frame. Here, the RPN in the network structure may be reused to determine the scene area, thereby improving the detection speed of the scene area. After the scene area is determined, the execution subject may input the human body object area and the scene area to the pre-trained behavior class detection model, respectively, to obtain a behavior class detection result corresponding to the human body object area and a behavior class detection result corresponding to the scene area. Here, the behavior class detection result may include the probability that the behavior class is each preset behavior class.
In this embodiment, the behavior class detection model may be obtained by training through the following model training steps:
in the first step, a training sample set is obtained. The samples in the training sample set may include training image samples and first label information. The first labeling information may be used to indicate a behavior class of a human object in the training image sample.
And secondly, training to obtain a behavior type detection model by using a machine learning method by taking the training image samples in the training sample set as input and taking the first label information corresponding to the input training image samples as output. Here, the initial model for training the behavior class detection model may adopt various existing neural network structures (e.g., DenseBox, VGGNet, ResNet, SegNet, etc.) having a classification function.
In this embodiment, after the behavior class detection model is trained, the following steps may be further performed to determine a sample (which may be referred to as a difficult sample) that is difficult to detect by the behavior class detection model:
in a first step, a test sample set is obtained. The samples in the test sample set may include a test image sample and second label information. The second labeling information may be used to indicate a behavior class of the human object in the test image sample.
Secondly, extracting samples in the test sample set, and executing the following test steps: first, a test image sample among the extracted samples is input to the above-described behavior class detection model. And then, determining whether the behavior type detection result output by the behavior type detection model is matched with the second marking information in the extracted sample. Specifically, the class detection model may calculate a probability that the behavior of the human object in the test image sample belongs to each preset behavior class. It may be determined whether the preset behavior class corresponding to the maximum probability value is the same as the behavior class indicated by the second label information. If the two types of behavior detection results are the same, it can be determined that the behavior type detection result output by the behavior type detection model matches with the second labeling information in the extracted sample. Otherwise, it is not matched. In response to determining a mismatch, the extracted sample may be determined to be a difficult sample.
And thirdly, adding each hard sample into a corresponding target sample set according to the behavior type, wherein the behavior type corresponds to the target sample set one by one. Here, a target sample set corresponding to each behavior category may be established in advance. And when a certain sample is determined to be a difficult sample, determining the behavior type corresponding to the sample according to the second marking information in the sample. Thus, the sample may be added to the set of target samples corresponding to the behavior class. In addition to adding the difficult samples to the target sample set corresponding to a certain behavior class, the normal samples corresponding to the behavior class may be added in advance.
In this embodiment, the class determination model may be trained using a target sample set to which a hard sample is added. For a certain behavior class, the class determination model trained by using the target sample set corresponding to the behavior class can be used for determining whether the behavior of the human body object in the image belongs to the behavior class. Specifically, for each target sample set, a test image sample of a sample in the target sample set may be used as an input, second label information corresponding to the input test image sample may be used as an output, and a behavior type determination model corresponding to the target behavior type may be obtained through training by using a machine learning method. In practice, the above-mentioned class determination model may also adopt an existing neural network structure (e.g., DenseBox, VGGNet, ResNet, SegNet, etc.) having a classification function.
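The hard-sample mining and per-category judgment models described above can be sketched as follows; predict_category and train_binary_classifier are assumed helpers standing in for the behavior class detection model's inference and for ordinary supervised training, respectively.

```python
from collections import defaultdict

def mine_hard_samples(detection_model, test_samples, predict_category):
    """test_samples: iterable of (image, labeled_category) pairs (second labeling information)."""
    target_sets = defaultdict(list)  # one target sample set per behavior category
    for image, labeled_category in test_samples:
        if predict_category(detection_model, image) != labeled_category:
            target_sets[labeled_category].append((image, labeled_category))  # difficult sample
    return target_sets

def train_judgment_models(target_sets, train_binary_classifier):
    """Train one behavior class judgment model per behavior category."""
    return {category: train_binary_classifier(samples)
            for category, samples in target_sets.items()}
```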
And 203, counting the probability of the same preset behavior category in the obtained behavior category detection result to obtain the score of each preset behavior category.
In this embodiment, the executing entity may first perform statistics on the probabilities of the same preset behavior category in the obtained behavior category detection results (for example, by direct addition or a weighted sum) to obtain the score of each preset behavior category.
Step 204, inputting the frame to a pre-trained object detection model to obtain an object detection result.
In this embodiment, the execution subject may input the frame to a pre-trained object detection model to obtain an object detection result. The object detection model can be used for detecting an object in an image. In practice, the YOLO framework (including the convolutional layer and the fully-connected layer) for object detection can be used for object detection.
Step 205, for each preset behavior category related to interaction with the object, determining an interactive object corresponding to the preset behavior category, extracting a score of the interactive object from the object detection result, weighting the score of the preset behavior category and the score of the interactive object, and taking the weighted result as the score of the preset behavior category to update the score.
In this embodiment, for each preset behavior category related to interaction with an object, the executing entity may determine an interaction object corresponding to the preset behavior category. Then, the score of the interactive object may be extracted from the object detection result obtained in step 204, and the score of the preset behavior category and the score of the interactive object may be weighted. Finally, the weighting result can be used as the score of the preset behavior category to update the score.
Therefore, the object information in the video frame is further combined in the detection process of the behavior category, and the accuracy of detecting the behavior category of the human body object in the video frame is further improved.
And step 206, determining the behavior category of the human body object in the frame based on the score of each preset behavior category.
In this embodiment, the execution subject may determine the behavior class of the human object in the frame based on the score of each preset behavior class. As an example, a preset behavior category corresponding to the maximum value of the score may be determined as the behavior category of the human object in the frame.
In some optional implementations of this embodiment, the executing entity may first determine whether there is a score greater than a preset threshold. In response to determining that the score exists, a preset behavior category corresponding to the maximum value of the score may be determined as the behavior category of the human object in the frame.
In the foregoing implementation manner, in response to determining that there is no score greater than the preset threshold (in this case, it may be understood that the behavior class detection model cannot accurately determine the behavior class), the following steps may be performed:
firstly, at least one preset behavior category is selected according to the sequence of scores from large to small.
And then, for each selected preset behavior type, extracting a behavior type judgment model matched with the preset behavior type, and inputting the frame to the behavior type judgment model to obtain a judgment result. The behavior type determination model may be configured to determine whether a behavior type of the human body object in the image is the preset behavior type. Here, the generation step of the category discrimination model is already described in step 202, and is not described here again.
As an example, the preset behavior categories selected in descending order of score are playing guitar, riding a horse, and singing a song. In this case, a first behavior class determination model for determining whether the behavior class is "playing guitar", a second behavior class determination model for determining whether the behavior class is "riding a horse", and a third behavior class determination model for determining whether the behavior class is "singing a song" may be selected. Then, the frame may be input into the three behavior class determination models respectively to obtain the determination result output by each. The determination result may contain the probability of belonging to the respective behavior class. For example, the probability output by the first behavior class determination model is 0.8, the probability output by the second behavior class determination model is 0.2, and the probability output by the third behavior class determination model is 0.5.
Finally, the behavior type of the human body object in the frame can be determined based on the determination result corresponding to each selected preset behavior type. Here, the category corresponding to the behavior category determination model that outputs the maximum probability may be determined as the behavior category of the human object in the frame.
Therefore, when the behavior type detection model cannot detect the behavior type, the behavior type is determined through the behavior type judgment model, so that the detection precision of the behavior type of the human body object in the video frame is improved, and the detection accuracy is further improved.
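Putting the two branches together, a decision sketch might look like the following; judgment_models is assumed to map each category name to a callable returning the probability that the frame shows that category, and the threshold and top-k value are illustrative.

```python
def decide_behavior_category(scores, judgment_models, frame, threshold=0.5, top_k=3):
    """Pick the behavior category from the scores, falling back to the judgment models."""
    best = max(scores, key=scores.get)
    if scores[best] > threshold:
        return best  # the detection model is confident enough
    # Otherwise re-check the top-k candidates with their matched judgment models.
    candidates = sorted(scores, key=scores.get, reverse=True)[:top_k]
    probs = {c: judgment_models[c](frame) for c in candidates if c in judgment_models}
    return max(probs, key=probs.get) if probs else best
```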
And step 207, smoothing the behavior type of the human body object in each frame of the target video according to the time sequence of the frames in the video, and generating a behavior type information sequence.
In this embodiment, the execution subject may perform smoothing processing on the behavior class of the human body object in each frame of the target video according to the time sequence of the frames in the video, and generate a behavior class information sequence. Here, the behavior category information in the behavior category information sequence may be used to indicate the behavior category.
Here, for each frame, it may first be determined whether the behavior category information corresponding to the frames before and after it is the same. If the behavior category information of the preceding and following frames is the same but differs from that of the current frame, the behavior category information of the current frame may be deleted, thereby smoothing the behavior category information. In this way, behavior category information that was identified in error can be eliminated, and the accuracy of identification is further improved.
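A sketch of this smoothing step, assuming one detected category per frame; replacing an isolated outlier with its neighbors' category is one illustrative way to realize the removal of erroneous behavior category information described above.

```python
def smooth_category_sequence(categories):
    """Remove isolated single-frame outliers from the per-frame category sequence."""
    smoothed = list(categories)
    for i in range(1, len(categories) - 1):
        prev_c, next_c = categories[i - 1], categories[i + 1]
        if prev_c == next_c and categories[i] != prev_c:
            smoothed[i] = prev_c  # drop the outlier and follow the surrounding frames
    return smoothed

# Example: smooth_category_sequence(["eat", "eat", "drink", "eat", "eat"])
# -> ["eat", "eat", "eat", "eat", "eat"]
```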
As can be seen from fig. 2, compared with the embodiment corresponding to fig. 1, the flow 200 of the behavior class detection method in the present embodiment relates to a step of determining an object detection result by using an object detection model, and relates to a step of weighting a score of a preset behavior class and a score of an interactive object. Therefore, the scheme described in this embodiment further combines the object information in the video frame in the detection process of the behavior category, thereby further improving the accuracy of detecting the behavior category of the human body object in the video frame. On the other hand, when the behavior type detection model cannot detect the behavior type, the behavior type is determined by the behavior type determination model, so that the accuracy of detecting the behavior type of the human body object in the video frame is improved. Meanwhile, according to the time sequence of the frames in the video, the behavior types of the human body objects in the frames of the target video are smoothed, so that the behavior type detection result can be smoothed in a time domain, and the detection accuracy is further improved.
With continued reference to fig. 3, fig. 3 is a schematic diagram of a processing procedure of the behavior class detection method according to the present embodiment. In the processing procedure of fig. 3, behavior class detection needs to be performed on the target video. The electronic device that performs the behavior class detection may store therein a behavior class detection model, an object detection model, a scene detection model, and the like that are trained in advance.
After the electronic device obtains the target video, human body detection can be performed on a frame in the target video, and a human body object area in the frame is determined.
Then, the electronic device may perform scene analysis on the frame, determine a scene region in the frame, and input the human body object region and the scene region to a pre-trained behavior type detection model respectively to obtain behavior type detection results corresponding to the human body object region and the scene region respectively. Then, the electronic device may count the probabilities of the same preset behavior categories in the obtained behavior category detection result to obtain the scores of the preset behavior categories.
Then, the electronic device may perform object detection on the frame. Specifically, the frame may be input to a pre-trained object detection model to obtain an object detection result.
Then, the electronic device may further count the object detection result. Specifically, for each preset behavior category related to interaction with an object, the electronic device may determine an interaction object corresponding to the preset behavior category. Then, from the object detection result, the score of the interactive object is extracted. And then, weighting the score of the preset behavior category and the score of the interactive object, and taking the weighted result as the score of the preset behavior category to update the scores.
Then, the electronic device may determine the behavior class of the human body object in the frame based on the further statistical result (i.e., the updated score of each preset behavior class).
Next, the electronic device may smooth the behavior type of the human body object in each frame of the target video according to the time sequence of the frames in the video, and generate a behavior type information sequence.
Therefore, object detection, human body detection, scene analysis and the like are carried out in the detection process of the behavior categories, so that various information is fused, and the accuracy of detecting the behavior categories of human body objects in the video frame is improved.
With further reference to fig. 4, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of a behavior class detection apparatus, which corresponds to the method embodiment shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 4, the behavior class detection apparatus 400 according to the present embodiment includes: a human body detection unit 401 configured to perform human body detection on a frame in a target video, and determine a human body object region in the frame; a behavior type detection unit 402 configured to determine a scene region in the frame, and input the human body object region and the scene region into a pre-trained behavior type detection model respectively to obtain behavior type detection results corresponding to the human body object region and the scene region respectively, wherein the behavior type detection model is used for representing a corresponding relationship between an image and a behavior type; a statistical unit 403 configured to perform statistics on the obtained behavior type detection results to determine the behavior type of the human body object in the frame.
In some optional implementations of this embodiment, the behavior class detection result may include a probability that the behavior class is each preset behavior class. The statistical unit 403 may include a statistical module 4031 and a determination module 4032. The statistical module 4031 may be configured to count the probabilities of the same preset behavior categories in the obtained behavior category detection result to obtain scores of the preset behavior categories. The determining module 4032 may be configured to determine the behavior category of the human object in the frame based on the score of each preset behavior category.
In some optional implementations of this embodiment, the statistics unit 403 may be further configured to: inputting the frame to a pre-trained object detection model to obtain an object detection result, wherein the object detection model is used for detecting an object in an image; and for each preset behavior category related to interaction with the object, determining an interactive object corresponding to the preset behavior category, extracting the score of the interactive object from the object detection result, weighting the score of the preset behavior category and the score of the interactive object, and taking the weighted result as the score of the preset behavior category to update the score.
In some optional implementations of this embodiment, the determining module may be further configured to: determining whether a score greater than a preset threshold exists; and in response to the determination of existence, determining a preset behavior category corresponding to the maximum value of the scores as the behavior category of the human body object in the frame.
In some optional implementations of this embodiment, the determining module may be further configured to: in response to determining that no score larger than the preset threshold exists, selecting at least one preset behavior category according to the sequence of scores from large to small; for each selected preset behavior category, extracting a behavior category judgment model matched with the preset behavior category, and inputting the frame to the behavior category judgment model to obtain a judgment result, wherein the behavior category judgment model is used for judging whether the behavior category of the human body object in the image is the preset behavior category; and determining the behavior type of the human body object in the frame based on the judgment result corresponding to each selected preset behavior type.
In some optional implementations of this embodiment, the apparatus may further include a smoothing unit 404. The smoothing unit 404 may be configured to smooth the behavior class of the human body object in each frame of the target video according to the time sequence of the frames in the video, and generate a behavior class information sequence.
In some optional implementations of this embodiment, the behavior class detection model may be obtained by training through the following model training steps: acquiring a training sample set, wherein samples in the training sample set comprise training image samples and first marking information, and the first marking information is used for indicating behavior categories of human body objects in the training image samples; and training to obtain a behavior type detection model by using a machine learning method by taking the training image samples in the training sample set as input and taking the first label information corresponding to the input training image samples as output.
In some optional implementation manners of this embodiment, after the training obtains the behavior class detection model, the model training step may further include: acquiring a test sample set, wherein samples in the test sample set comprise test image samples and second marking information, and the second marking information is used for indicating behavior types of human body objects in the test image samples; extracting samples in the test sample set, and executing the following test steps: inputting a test image sample in the extracted samples to the behavior class detection model; determining whether the behavior type detection result output by the behavior type detection model is matched with second marking information in the extracted sample; in response to determining a mismatch, determining the extracted sample as a difficult sample; and adding each hard sample to a corresponding target sample set according to the behavior class, wherein the behavior class corresponds to the target sample set one by one.
In some optional implementations of this embodiment, the model training step may further include: and for each target sample set, taking the behavior type corresponding to the target sample set as a target behavior type, taking the test image samples of the samples in the target sample set as input, taking the second label information corresponding to the input test image samples as output, and training by using a machine learning method to obtain a behavior type judgment model corresponding to the target behavior type.
In the apparatus provided by the above embodiment of the present application, the human body detection unit 401 performs human body detection on a frame in the target video, so as to determine a human body object region in the frame. Then, the behavior type detection unit 402 determines the scene region in the frame so as to input the human body object region and the scene region to a pre-trained behavior type detection model, respectively, and obtains behavior type detection results corresponding to the human body object region and the scene region, respectively. Finally, the statistical unit 403 performs statistics on the obtained behavior class detection results to determine the behavior class of the human body object in the frame. Therefore, the human body expression and the scene can be combined, more information is combined in the detection process of the behavior category, and the accuracy of detecting the behavior category of the human body object in the video frame is improved.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in FIG. 5 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present application.
As shown in FIG. 5, the computer system 500 includes a Central Processing Unit (CPU) 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read out therefrom is installed into the storage section 508 as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 501.

It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

In the present application, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including a human body detection unit, a behavior class detection unit, and a statistical unit. In some cases, the names of these units do not constitute a limitation on the units themselves; for example, the human body detection unit may also be described as a "unit for performing human body detection on a frame in a target video and determining a human body object region in the frame".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: perform human body detection on a frame in a target video and determine a human body object region in the frame; determine a scene region in the frame, and input the human body object region and the scene region respectively into a pre-trained behavior class detection model to obtain behavior class detection results corresponding to the human body object region and the scene region, respectively; and perform statistics on the obtained behavior class detection results to determine the behavior class of the human body object in the frame.
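Purely as a sketch under the assumptions already stated, the per-frame steps listed above, together with the threshold check, the judgment-model fallback and the temporal smoothing described in the claims, could be strung together as follows. All callables passed in (the region detectors, the behavior and object detection models, the aggregate_scores helper from the earlier sketch, and the per-class judgment models) are assumed; the threshold, top_k and voting window values are placeholders.

```python
# Illustrative sketch only: an end-to-end driver for a single frame, plus a
# simple sliding-window majority vote over the per-frame behavior classes.
def classify_frame(frame, detect_person_region, detect_scene_region,
                   behavior_model, object_model, aggregate_scores,
                   judgment_models, threshold=0.6, top_k=2):
    person_region = detect_person_region(frame)          # human body detection
    scene_region = detect_scene_region(frame)            # scene area
    region_probs = [behavior_model(person_region), behavior_model(scene_region)]
    object_scores = object_model(frame)                  # object detection result
    scores = aggregate_scores(region_probs, object_scores)

    best = max(scores, key=scores.get)
    if scores[best] > threshold:                         # a score above the preset threshold exists
        return best

    # Otherwise fall back to the per-class behavior class judgment models,
    # checking the top-k candidates in descending order of score.
    for category in sorted(scores, key=scores.get, reverse=True)[:top_k]:
        model = judgment_models.get(category)
        if model is not None and model(frame):            # judgment result for this category
            return category
    return None                                           # no behavior class confirmed for this frame

def smooth_labels(per_frame_labels, window=5):
    """Temporal smoothing of per-frame behavior classes by majority vote."""
    smoothed = []
    for i in range(len(per_frame_labels)):
        lo, hi = max(0, i - window // 2), min(len(per_frame_labels), i + window // 2 + 1)
        votes = [lbl for lbl in per_frame_labels[lo:hi] if lbl is not None]
        smoothed.append(max(set(votes), key=votes.count) if votes else None)
    return smoothed
```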
The above description is only a preferred embodiment of the present application and an illustration of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention disclosed herein is not limited to the particular combination of the above features, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention, for example, arrangements in which the above features are interchanged with (but not limited to) features having similar functions disclosed in the present application.

Claims (16)

1. A method for behavior class detection, the method comprising:
carrying out human body detection on a frame in a target video, and determining a human body object region in the frame;
determining a scene area in the frame, and respectively inputting the human body object area and the scene area into a pre-trained behavior type detection model to obtain behavior type detection results respectively corresponding to the human body object area and the scene area, wherein the behavior type detection model is used for representing the corresponding relation between an image and a behavior type;
counting the obtained behavior type detection result, and determining the behavior type of the human body object in the frame;
the behavior type detection result comprises the probability that the behavior type is each preset behavior type; the counting the obtained behavior class detection result to determine the behavior class of the human body object in the frame includes:
counting the probability of the same preset behavior category in the obtained behavior category detection result to obtain the score of each preset behavior category;
inputting the frame into a pre-trained object detection model to obtain an object detection result, wherein the object detection model is used for detecting an object in an image;
for each preset behavior category related to interaction with an object, determining an interactive object corresponding to the preset behavior category, extracting a score of the interactive object from the object detection result, weighting the score of the preset behavior category and the score of the interactive object, and taking the weighted result as the score of the preset behavior category to update the score;
and determining the behavior category of the human body object in the frame based on the scores of the preset behavior categories.
2. The behavior class detection method according to claim 1, wherein the determining the behavior class of the human object in the frame based on the score of each preset behavior class comprises:
determining whether a score greater than a preset threshold exists;
and in response to determining that a score greater than the preset threshold exists, determining the preset behavior category corresponding to the maximum value of the scores as the behavior category of the human body object in the frame.
3. The behavior class detection method according to claim 2, wherein the determining the behavior class of the human object in the frame based on the score of each preset behavior class further comprises:
in response to determining that no score larger than the preset threshold exists, selecting at least one preset behavior category according to the sequence of scores from large to small;
for each selected preset behavior category, extracting a behavior category judgment model matched with the preset behavior category, and inputting the frame to the behavior category judgment model to obtain a judgment result, wherein the behavior category judgment model is used for judging whether the behavior category of the human body object in the image is the preset behavior category;
and determining the behavior type of the human body object in the frame based on the judgment result corresponding to each selected preset behavior type.
4. The behavior class detection method according to claim 1, further comprising:
and according to the time sequence of the frames in the video, smoothing the behavior type of the human body object in each frame of the target video to generate a behavior type information sequence.
5. The behavior class detection method according to one of claims 1 to 4, wherein the behavior class detection model is trained by the following model training steps:
acquiring a training sample set, wherein samples in the training sample set comprise training image samples and first marking information, and the first marking information is used for indicating behavior categories of human body objects in the training image samples;
and training to obtain a behavior type detection model by using a machine learning method by taking the training image samples in the training sample set as input and taking the first labeling information corresponding to the input training image samples as output.
6. The behavior class detection method according to claim 5, wherein after the training obtains the behavior class detection model, the model training step further comprises:
obtaining a test sample set, wherein samples in the test sample set comprise test image samples and second marking information, and the second marking information is used for indicating behavior categories of human body objects in the test image samples;
extracting samples in the test sample set, and executing the following test steps: inputting a test image sample of the extracted samples to the behavior class detection model; judging whether the behavior type detection result output by the behavior type detection model is matched with second labeling information in the extracted sample or not; in response to determining a mismatch, the extracted sample is determined to be a difficult sample.
7. The behavior class detection method according to claim 6, wherein the model training step further comprises, after the testing step is completed:
adding each hard sample into a corresponding target sample set according to behavior categories, wherein the behavior categories correspond to the target sample sets one to one;
and for each target sample set, taking the behavior class corresponding to the target sample set as a target behavior class, taking the test image samples of the samples in the target sample set as input, taking the second labeling information corresponding to the input test image samples as output, and training by using a machine learning method to obtain a behavior class judgment model corresponding to the target behavior class.
8. A behavior class detection apparatus, characterized in that the apparatus comprises:
the human body detection unit is configured to detect a human body of a frame in a target video and determine a human body object region in the frame;
the behavior type detection unit is configured to determine a scene area in the frame, input the human body object area and the scene area into a pre-trained behavior type detection model respectively, and obtain behavior type detection results corresponding to the human body object area and the scene area respectively, wherein the behavior type detection model is used for representing the corresponding relation between an image and a behavior type;
a counting unit configured to count the obtained behavior class detection result and determine a behavior class of the human body object in the frame;
the behavior type detection result comprises the probability that the behavior type is each preset behavior type; and
the statistical unit comprises:
the statistical module is configured to count the probability of the same preset behavior category in the obtained behavior category detection result to obtain the score of each preset behavior category;
inputting the frame into a pre-trained object detection model to obtain an object detection result, wherein the object detection model is used for detecting an object in an image;
for each preset behavior category related to interaction with an object, determining an interactive object corresponding to the preset behavior category, extracting a score of the interactive object from the object detection result, weighting the score of the preset behavior category and the score of the interactive object, and taking the weighted result as the score of the preset behavior category to update the score;
a determining module configured to determine a behavior class of the human object in the frame based on the score of each preset behavior class.
9. The behavior category detection device of claim 8, wherein the determination module is further configured to:
determining whether a score greater than a preset threshold exists;
and in response to determining that a score greater than the preset threshold exists, determining the preset behavior category corresponding to the maximum value of the scores as the behavior category of the human body object in the frame.
10. The behavior category detection device of claim 9, wherein the determination module is further configured to:
in response to determining that no score larger than the preset threshold exists, selecting at least one preset behavior category according to the sequence of scores from large to small;
for each selected preset behavior category, extracting a behavior category judgment model matched with the preset behavior category, and inputting the frame to the behavior category judgment model to obtain a judgment result, wherein the behavior category judgment model is used for judging whether the behavior category of the human body object in the image is the preset behavior category;
and determining the behavior type of the human body object in the frame based on the judgment result corresponding to each selected preset behavior type.
11. The behavior class detection device according to claim 8, characterized in that the device further comprises:
and the smoothing unit is configured to smooth the behavior type of the human body object in each frame of the target video according to the time sequence of the frames in the video to generate a behavior type information sequence.
12. The behavior class detection device according to one of claims 8 to 11, wherein the behavior class detection model is trained by the following model training steps:
acquiring a training sample set, wherein samples in the training sample set comprise training image samples and first marking information, and the first marking information is used for indicating behavior categories of human body objects in the training image samples;
and training to obtain a behavior type detection model by using a machine learning method by taking the training image samples in the training sample set as input and taking the first labeling information corresponding to the input training image samples as output.
13. The behavior class detection device according to claim 12, wherein after the training obtains the behavior class detection model, the model training step further comprises:
obtaining a test sample set, wherein samples in the test sample set comprise test image samples and second marking information, and the second marking information is used for indicating behavior categories of human body objects in the test image samples;
extracting samples in the test sample set, and executing the following test steps: inputting a test image sample of the extracted samples to the behavior class detection model; judging whether the behavior type detection result output by the behavior type detection model is matched with second labeling information in the extracted sample or not; in response to determining a mismatch, the extracted sample is determined to be a difficult sample.
14. The behavior class detection device according to claim 13, wherein the model training step further comprises, after the testing step is performed:
adding each hard sample into a corresponding target sample set according to behavior categories, wherein the behavior categories correspond to the target sample sets one to one;
and for each target sample set, taking the behavior class corresponding to the target sample set as a target behavior class, taking the test image samples of the samples in the target sample set as input, taking the second labeling information corresponding to the input test image samples as output, and training by using a machine learning method to obtain a behavior class judgment model corresponding to the target behavior class.
15. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
16. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN201910503133.3A 2019-06-11 2019-06-11 Behavior class detection method and device, electronic equipment and computer readable medium Active CN110363220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910503133.3A CN110363220B (en) 2019-06-11 2019-06-11 Behavior class detection method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910503133.3A CN110363220B (en) 2019-06-11 2019-06-11 Behavior class detection method and device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN110363220A CN110363220A (en) 2019-10-22
CN110363220B true CN110363220B (en) 2021-08-20

Family

ID=68217225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910503133.3A Active CN110363220B (en) 2019-06-11 2019-06-11 Behavior class detection method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN110363220B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027390B (en) * 2019-11-11 2023-10-10 北京三快在线科技有限公司 Object class detection method and device, electronic equipment and storage medium
CN111242007A (en) * 2020-01-10 2020-06-05 上海市崇明区生态农业科创中心 Farming behavior supervision method
CN111325141B (en) * 2020-02-18 2024-03-26 上海商汤临港智能科技有限公司 Interactive relationship identification method, device, equipment and storage medium
CN113642360A (en) * 2020-04-27 2021-11-12 杭州海康威视数字技术股份有限公司 Behavior timing method and device, electronic equipment and storage medium
CN112784760B (en) 2021-01-25 2024-04-12 北京百度网讯科技有限公司 Human behavior recognition method, device, equipment and storage medium
CN113723226A (en) * 2021-08-13 2021-11-30 浙江大华技术股份有限公司 Mobile stall detection method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710865A (en) * 2018-05-28 2018-10-26 电子科技大学 A kind of driver's anomaly detection method based on neural network
CN108986474A (en) * 2018-08-01 2018-12-11 平安科技(深圳)有限公司 Fix duty method, apparatus, computer equipment and the computer storage medium of traffic accident
CN109492595A (en) * 2018-11-19 2019-03-19 浙江传媒学院 Behavior prediction method and system suitable for fixed group

Also Published As

Publication number Publication date
CN110363220A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110363220B (en) Behavior class detection method and device, electronic equipment and computer readable medium
CN109344908B (en) Method and apparatus for generating a model
CN109117777B (en) Method and device for generating information
US11645554B2 (en) Method and apparatus for recognizing a low-quality article based on artificial intelligence, device and medium
CN109034069B (en) Method and apparatus for generating information
CN109376267B (en) Method and apparatus for generating a model
CN108171203B (en) Method and device for identifying vehicle
CN109492128B (en) Method and apparatus for generating a model
CN109447156B (en) Method and apparatus for generating a model
CN109145828B (en) Method and apparatus for generating video category detection model
CN111931859B (en) Multi-label image recognition method and device
CN109389096B (en) Detection method and device
CN109214501B (en) Method and apparatus for identifying information
WO2020056995A1 (en) Method and device for determining speech fluency degree, computer apparatus, and readable storage medium
CN113763348A (en) Image quality determination method and device, electronic equipment and storage medium
CN111680753A (en) Data labeling method and device, electronic equipment and storage medium
CN115100739B (en) Man-machine behavior detection method, system, terminal device and storage medium
CN111385659B (en) Video recommendation method, device, equipment and storage medium
CN110472673B (en) Parameter adjustment method, fundus image processing device, fundus image processing medium and fundus image processing apparatus
CN117409419A (en) Image detection method, device and storage medium
CN116492634B (en) Standing long jump testing method based on image visual positioning
CN112183289A (en) Method, device, equipment and medium for detecting patterned screen
CN112801960B (en) Image processing method and device, storage medium and electronic equipment
CN114882315A (en) Sample generation method, model training method, device, equipment and medium
CN110059743B (en) Method, apparatus and storage medium for determining a predicted reliability metric

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant