CN110796069A - Behavior detection method, system, equipment and machine readable medium

Behavior detection method, system, equipment and machine readable medium

Info

Publication number
CN110796069A
Authority
CN
China
Prior art keywords
behavior
target behavior
target
continuous frame
proposal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911031016.8A
Other languages
Chinese (zh)
Other versions
CN110796069B (en)
Inventor
周曦
姚志强
李继伟
高伽林
施志祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Boyan Intelligent Technology Co Ltd
Original Assignee
Guangzhou Boyan Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Boyan Intelligent Technology Co Ltd filed Critical Guangzhou Boyan Intelligent Technology Co Ltd
Priority to CN201911031016.8A priority Critical patent/CN110796069B/en
Publication of CN110796069A publication Critical patent/CN110796069A/en
Application granted granted Critical
Publication of CN110796069B publication Critical patent/CN110796069B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a behavior detection method, system, device and machine-readable medium, comprising: obtaining one or more proposals containing target behaviors; classifying the proposals containing target behaviors according to one or more classifiers to obtain at least two classification results; and merging the at least two classification results to obtain a target behavior detection result. The invention enables a continuous frame image to perceive all of the information contained in the video, generates scores from the continuous frame image, sorts the scores, and selects the candidate time domains and regions with high score values, so that missed detection and false detection do not occur. At the same time, the boundaries of the proposals are adjusted, so that the start time of the target behavior can be accurately located, and the method adapts to diverse human behaviors and to human behaviors of different time scales.

Description

Behavior detection method, system, equipment and machine readable medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a behavior detection method, system, device, and machine-readable medium for performing behavior detection.
Background
With the rapid development of the internet industry and the popularization of high-definition cameras, millions of videos are generated every day. Whether for video content review or public safety monitoring, automated video processing techniques are required to analyze the content of these videos and detect human behavior. At present, human behavior is mostly detected by using the perception capability of artificial intelligence techniques to sense the video content information. Existing human behavior detection approaches can be divided into two categories: (1) sliding-window-based detection; (2) behavior-score-based detection.
Sliding-window detection has two important drawbacks: 1) the detected behavior boundaries are inaccurate, i.e., the start and end times of the behavior cannot be precisely located; 2) it cannot adapt to diverse human behaviors, i.e., to human behaviors at different time scales simultaneously. Behavior-score-based detection relies heavily on the scoring mechanism, which causes two problems: 1) the current scoring mechanism focuses only on the current content and cannot perceive global context information, so the scoring quality is poor; 2) candidate time-domain proposals generated from poor-quality behavior scores easily lead to missed detections, false detections and other problems.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, it is an object of the present invention to provide a behavior detection method, system, device and machine-readable medium for solving the problems existing in the prior art.
To achieve the above and other related objects, the present invention provides a behavior detection method, including:
obtaining one or more proposals containing target behaviors;
classifying the proposal containing target behaviors according to one or more classifiers to obtain at least two classification results;
and combining the at least two classification results to obtain a target behavior detection result.
Optionally, the proposal includes at least one of: target behavior probability proposal and target behavior time scale proposal.
Optionally, the classification result comprises at least one of: target behavior probability proposal and target behavior time scale proposal.
Optionally, acquiring one or more first continuous frame images and second continuous frame images containing the target behaviors; representing a first continuous frame image by using one or more neural networks, so that the first continuous frame image can acquire target behavior characteristic information of a second continuous frame image;
and generating a proposal containing target behaviors according to the first continuous frame image and the second continuous frame image.
Optionally, the first continuous frame images are characterized by one or more neural networks, so that each frame image in the first continuous frame images can perceive the target behavior feature information of the current frame image and the target behavior feature information of the other frame images.
Optionally, the second continuous frame images are characterized by one or more neural networks, so that each frame image in the second continuous frame images can perceive the target behavior feature information of the current frame image and the target behavior feature information of the other frame images.
Optionally, the target behavior detection result includes at least one of: intermediate detection results of the target behaviors and final detection results of the target behaviors.
Optionally, the parameters in the one or more classifiers are updated according to the intermediate detection result of the target behavior or the final detection result of the target behavior.
Optionally, the method further includes labeling the first continuous frame image and the second continuous frame image, where the labeling includes at least one of: marking one or more target behavior categories, and marking the starting time of one or more target behaviors.
Optionally, the target behavior feature comprises at least one of: target behavior category, target behavior start time.
Optionally, the method further includes converting the target behavior start time into a first value, and converting the target behavior end time into a second value.
Optionally, before the characterizing the first continuous frame image or the second continuous frame image by using one or more neural networks, the method further includes performing feature segmentation on the continuous frame images, and segmenting the continuous frame images into one or more first continuous frame images.
Optionally, the feature segmentation comprises at least one of: divided by frame rate, divided by frame number, or divided by time value.
Optionally, a first continuous frame image is obtained, and the first continuous frame image is input to one or more convolutional neural networks for convolution, so as to obtain a feature matrix; and after the feature matrix is convolved again, inputting the feature matrix into a preset perceptron model for perception, or directly inputting the initial feature matrix into the preset perceptron model for perception, so that the first continuous frame image can perceive the target behavior feature information of the second continuous frame image.
Optionally, the perception process of the preset perceptron model includes:
performing data encoding on the convolved initial feature matrix or the initial feature matrix;
compressing the coded feature matrix in a time scale manner to obtain mean statistics;
inputting the mean value statistic value into one or more convolution neural networks for convolution, and normalizing convolution results;
and merging the normalized feature matrix and the initial feature matrix.
Optionally, the first continuous frame image is input to one or more convolution neural networks for convolution, and a feature matrix with a time sequence relation is obtained.
Optionally, the timing relation comprises at least one of: a positive timing sequence relation; a negative timing sequence relation; or both the positive and the negative timing sequence relation at the same time;
the positive timing sequence relation indicates that convolving a frame image of the continuous frame images involves information from the subsequent frame images of that frame; the negative timing sequence relation indicates that convolving a frame image of the continuous frame images involves information from the preceding frame images of that frame.
Optionally, a feature matrix simultaneously containing a positive timing sequence relation and a negative timing sequence relation is selected, the feature matrix with the positive timing sequence relation and the feature matrix with the negative timing sequence relation are cascaded, and convolution fusion is performed through one or more convolution neural networks to obtain a feature matrix after convolution fusion.
Optionally, one or more first continuous frame images are acquired, the first continuous frame images are convolved through one or more convolutional neural networks to obtain a probability for predicting a target behavior in the first continuous frame images, and different thresholds are set according to the probability to generate one or more target behavior probability proposals.
Optionally, each classifier comprises one or more layers, each layer comprises one or more templates and one or more convolutional neural networks to be trained; the template comprises one or more of the perceptron models, one or more trained convolutional neural networks;
acquiring the initial feature matrix after convolution, or inputting the initial feature matrix into the one or more templates for convolution and perception;
changing the time scale of the feature matrix after convolution and perception to make the time scale of the feature matrix after convolution and perception in the current layer in the classifier be twice as long as the time scale of the feature matrix after convolution and perception in the next layer;
and then, performing up-sampling on the characteristic matrix in the classifier, inputting the up-sampling result into one or more convolutional neural networks to be trained for training, and generating one or more target behavior time scale classification results.
Optionally, obtaining a target behavior probability proposal and a target behavior time scale proposal, and determining the coincidence degree of the target behavior probability proposal and the target behavior occurring in the target behavior time scale proposal; and screening out a target behavior probability proposal with the highest coincidence degree and a target behavior time scale proposal, and fusing the target behavior probability proposal with the highest coincidence degree and the target behavior time scale proposal to obtain a target behavior detection result.
Optionally, acquiring one or more first consecutive frame images;
calculating the percentage of overlap of the time scale of the target behavior in the training samples in the classifier template used for pre-training the convolutional neural network and the time scale of the target behavior in each first continuous frame image;
acquiring the maximum overlapping percentage, and setting all training samples in the template corresponding to the maximum overlapping percentage as first training samples; the training samples in the remaining templates are set as the second training sample.
Optionally, all the second training samples are obtained, whether the overlapping percentages corresponding to all the second training samples are smaller than a preset threshold value or not is judged, and the second training samples with the overlapping percentages smaller than the preset threshold value are screened out.
Optionally, a second training sample with an overlap percentage smaller than a preset threshold is obtained, and a loss function for updating the one or more classifiers is established according to the second training sample with the overlap percentage smaller than the preset threshold.
The invention also provides a behavior detection system, comprising:
the proposal acquisition module is used for acquiring one or more proposals containing target behaviors;
the classification detection module comprises one or more classifiers and is used for classifying a proposal containing target behaviors according to the one or more classifiers to obtain at least two classification results;
And the result merging module is used for merging the at least two classification results to obtain a target behavior detection result.
Optionally, the proposal includes at least one of: target behavior probability proposal and target behavior time scale proposal.
Optionally, the classification result comprises at least one of: target behavior probability proposal and target behavior time scale proposal.
Optionally, the system further comprises an image characterization module, configured to obtain one or more first continuous frame images and second continuous frame images that include the target behavior; and characterizing the first continuous frame image by using one or more neural networks, so that the first continuous frame image can acquire target behavior characteristic information of the second continuous frame image.
Optionally, the target behavior detection result includes at least one of: intermediate detection results of the target behaviors and final detection results of the target behaviors.
Optionally, the parameters in the one or more classifiers are updated according to the intermediate detection result of the target behavior or the final detection result of the target behavior.
Optionally, the image characterization module includes:
the first characterization unit is used for acquiring a first continuous frame image, inputting the first continuous frame image into one or more convolution neural networks for convolution, and acquiring an initial characteristic matrix;
and the second characterization unit is used for convolving the initial characteristic matrix again and inputting the convolved initial characteristic matrix into a preset perceptron model for perception, or directly inputting the initial characteristic matrix into the preset perceptron model for perception, so that the first continuous frame image can perceive the target behavior characteristic information of the second continuous frame image.
Optionally, the preset perceptron model includes:
the encoding unit is used for carrying out data encoding on the convolved initial characteristic matrix or the initial characteristic matrix;
the compression unit is used for compressing the coded feature matrix in a time scale manner to obtain mean value statistics;
the processing unit is used for inputting the mean value statistic value into one or more convolution neural networks for convolution and normalizing the convolution result;
and the merging unit is used for merging the normalized feature matrix and the initial feature matrix.
Optionally, the first continuous frame image is input to one or more convolution neural networks for convolution, and a feature matrix with a time sequence relation is obtained.
Optionally, the timing relation comprises at least one of: a positive timing sequence relation; a negative timing sequence relation; or both the positive and the negative timing sequence relation at the same time;
wherein the positive timing sequence relation represents convolution of the first continuous frame image from the current time to the next time; the inverse timing relationship represents the convolution of the first successive frame images from the next time instant to the current time instant.
Optionally, a feature matrix simultaneously containing a positive timing sequence relation and a negative timing sequence relation is selected, the feature matrix with the positive timing sequence relation and the feature matrix with the negative timing sequence relation are cascaded, and convolution fusion is performed through one or more convolution neural networks to obtain a feature matrix after convolution fusion.
Optionally, the system further includes a behavior probability module, where the behavior probability module is configured to obtain one or more first continuous frame images, perform convolution on the first continuous frame images through one or more convolutional neural networks, obtain a probability for predicting occurrence of a target behavior in the first continuous frame images, and set different thresholds according to the probability to generate one or more target behavior probability proposals.
Optionally, each classifier in the classification detection module comprises one or more layers, each layer comprises one or more templates and one or more convolutional neural networks to be trained; the template comprises one or more of the perceptron models, one or more trained convolutional neural networks;
the classification detection module further comprises a second detection unit, wherein the second detection unit is used for acquiring the convolved initial feature matrix, or inputting the initial feature matrix into the one or more templates for convolution and perception; changing the time scale of the feature matrix after convolution and perception to make the time scale of the feature matrix after convolution and perception in the current layer in the classifier be twice as long as the time scale of the feature matrix after convolution and perception in the next layer; and then, performing up-sampling on the characteristic matrix in the classifier, inputting the up-sampling result into one or more convolutional neural networks to be trained for training, and generating one or more target behavior time scale proposals.
Optionally, the result merging module obtains a target behavior probability proposal and a target behavior time scale proposal, and determines the coincidence degree of the target behavior probability proposal and the target behavior occurring in the target behavior time scale proposal; and screening out a target behavior probability proposal with the highest coincidence degree and a target behavior time scale proposal, and fusing the target behavior probability proposal with the highest coincidence degree and the target behavior time scale proposal to obtain a target behavior detection result.
Optionally, the system further comprises a sample calibration module, wherein the sample calibration module comprises:
the first segmentation unit is used for acquiring one or more second continuous frame images and segmenting each second continuous frame image into a plurality of first continuous frame images according to the number of frames;
the overlap percentage unit is used for calculating the overlap percentage of the time scale of the target behavior in the training sample used for training the convolutional neural network in advance in the classifier template and the time scale of the target behavior in each first continuous frame image after segmentation;
the calibration unit is used for acquiring the maximum overlapping percentage and setting all training samples in the template corresponding to the maximum overlapping percentage as first training samples; the training samples in the remaining templates are set as the second training sample.
Optionally, the sample calibration module further includes a sample screening unit, where the sample screening unit is configured to obtain all the second training samples, determine whether the overlapping percentages corresponding to all the second training samples are smaller than a preset threshold, and screen out the second training samples whose overlapping percentages are smaller than the preset threshold.
Optionally, the sample calibration module further includes an updating unit, where the updating unit is configured to obtain a second training sample with an overlap percentage smaller than a preset threshold, and establish a loss function for updating the one or more classifiers according to the second training sample with the overlap percentage smaller than the preset threshold.
The invention also provides a behavior detection device, comprising:
obtaining one or more proposals containing target behaviors;
one or more classifiers for classifying the proposal containing target behaviors to obtain at least two classification results;
and combining the at least two classification results to obtain a target behavior detection result.
The present invention also provides an apparatus comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform a method as described in one or more of the above.
The present invention also provides one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the methods as described in one or more of the above.
As described above, the behavior detection method, system, device and machine-readable medium provided by the present invention have the following beneficial effects: the invention obtains one or more proposals containing target behaviors; classifies the proposals containing target behaviors according to one or more classifiers to obtain at least two classification results; and merges the at least two classification results to obtain a target behavior detection result. While adopting a scoring mechanism, the method enables the selected video segment to perceive the target behavior feature information of the whole complete video; and by merging the classification results generated by the classifiers, it better adapts to the diversity of target behaviors.
Drawings
Fig. 1 is a flow chart illustrating a behavior detection method according to an embodiment.
Fig. 2 is a schematic connection diagram of a behavior detection system in an embodiment.
Fig. 3 is a schematic hardware structure diagram of a proposal obtaining module according to an embodiment.
Fig. 4 is a schematic connection diagram of a behavior detection system in another embodiment.
Fig. 5 is a schematic hardware structure diagram of an image segmentation module according to an embodiment.
Fig. 6 is a schematic hardware structure diagram of an image characterization module according to an embodiment.
FIG. 7 is a schematic diagram of a perceptron model in an embodiment.
Fig. 8 is a schematic hardware structure diagram of the classification detection module according to an embodiment.
Fig. 9 is a schematic hardware structure diagram of a sample calibration module according to an embodiment.
Fig. 10 is a schematic hardware structure diagram of a terminal device according to an embodiment.
Fig. 11 is a schematic diagram of a hardware structure of a terminal device according to another embodiment.
Description of the element reference numerals
M10 proposal acquisition module; m20 classification detection module; an M30 result merging module; an M40 image characterization module;
an M50 image segmentation module; an M60 behavior probability detection module; an M70 sample calibration module;
a D10 image labeling unit; a D20 normalization unit; d30 frame rate segmentation unit; d40 frame number partition unit;
d50 time division unit; a D60 convolution unit; a D70 encoding unit; a D80 compression unit;
a D90 processing element; d100 merging unit; d110 a first detection unit; d120 a second detection unit;
d130 a first segmentation unit; d140 overlap percentage unit; a D150 calibration unit; d160 screening unit;
d170 updating unit; d210, a first characterization unit; d220, a second characterization unit;
1100 input device; 1101 a first processor; 1102 an output device; 1103 a first memory; 1104 a communication bus;
1200 a processing component; 1201 a second processor; 1202 a second memory; 1203 a communication component;
1204 a power supply component; 1205 multimedia components; 1206 a voice component; 1207 input/output interface; 1208 sensor assembly.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Referring to fig. 1, the present invention provides a behavior detection method, including:
s100, acquiring one or more proposals containing target behaviors; wherein the proposal comprises at least one of: target behavior probability proposal and target behavior time scale proposal.
S200, classifying the proposal containing the target behavior according to one or more classifiers to obtain at least two classification results; the classification result includes at least one of: a target behavior probability proposal and a target behavior time scale proposal.
And S300, merging the at least two classification results to obtain a target behavior detection result.
This scheme solves the problem in the prior art that behavior-score-based detection focuses only on the current content and cannot perceive global context information, as well as the problem that sliding-window detection cannot adapt to the diversity of human behaviors. With this scheme, while a scoring mechanism is adopted, the selected video segment can perceive the target behavior feature information of the whole complete video; and by merging the classification results generated by the classifiers, the diversity of target behaviors is better accommodated.
In an exemplary embodiment, the method further comprises acquiring one or more first continuous frame images and second continuous frame images containing the target behaviors; characterizing the first continuous frame image with one or more neural networks, so that the first continuous frame image can acquire the target behavior feature information of the second continuous frame image; and generating a proposal containing the target behavior according to the first continuous frame image and the second continuous frame image. For example, the entire complete video data is set as the second continuous frame image, and one of its segments is selected as the first continuous frame image.
In an exemplary embodiment, the method further includes characterizing the first continuous frame images by using one or more neural networks, so that each frame image in the first continuous frame images can perceive the target behavior feature information of the current frame image and the target behavior feature information of the other frame images.
In another exemplary embodiment, the method further includes characterizing the second continuous frame images by using one or more neural networks, so that each frame image in the second continuous frame images can perceive the target behavior characteristic information of the current frame image and the target behavior characteristic information of the other frame images.
According to the above exemplary embodiments, the first continuous frame image and the second continuous frame image are, for example, captured or input video data of a certain duration containing human behaviors, animal behaviors or plant behaviors. Taking human behavior as an example, the labeling of human behavior categories can follow the classification described in the textbook "Sports Biomechanics" (1st edition, December 2013). For example, the upper limb movement may include at least one of: pushing, pulling and whipping; the lower limb movement may include at least one of: buffering, pedaling, extending and whipping; the whole body movement may include at least one of: swinging, twisting and opposite movement. The start time of a human behavior can be marked by directly viewing the video data, or by taking the required video data segment for user-defined marking.
In an exemplary embodiment, the target behavior detection result includes at least one of: an intermediate detection result of the target behavior and a final detection result of the target behavior. The intermediate detection result of the target behavior includes classification results generated by the classifier, for example, a target behavior action score value, a ranking of target behavior action score values, a proposal generated from the target behavior action score value, a proposal generated from the ranking of target behavior action score values, and the like. The final detection result of the target behavior includes classification results generated by the classifier and final results generated from the intermediate detection results, for example, the target behavior action score value, the ranking of target behavior action score values, the proposal generated from the ranking of score values, the target behavior action identified from the ranking of score values, the target behavior action identified from the proposal generated from that ranking, or the target behavior action identified from a proposal obtained by combining the proposal generated from the score values with the proposal generated from the ranking of score values, and the like.
In an exemplary embodiment, the parameters in the one or more neural networks are also updated according to the intermediate detection result of the target behavior or the final detection result of the target behavior. Inputting the intermediate or final detection result of the target behavior to update the parameters of the neural network trains and optimizes the network, so that target behavior actions can be identified more quickly and accurately. For example, the convolutional neural networks and the perceptron are trained and optimized with the intermediate or final detection results of human behavior, so that the neural networks recognize human behavior more quickly and accurately. In addition, updating the parameters of, or training, the neural networks may also be accomplished using one or more continuous frame images, or the target behavior features in one or more continuous frame images. For example, for a human behavior action to be recognized, training can be performed with video data recorded by different people in the same environment, with video data recorded by the same person in different environments, or with video data recorded by different people in different environments; the recorded video data contain the human behavior actions to be identified or detected.
In an exemplary embodiment, the parameters in the one or more classifiers are updated according to the target behavior intermediate detection result or the target behavior final detection result. The intermediate detection result of the target behavior or the final detection result of the target behavior is input to update parameters in the classifier, and the classifier can be trained and optimized, so that the classification of the target behavior action is faster and more accurate. For example, the classifier is trained and optimized by using the intermediate detection result of the human behavior or the final detection result of the human behavior, so that the classifier is quicker and more accurate when classifying or scoring the human behavior and cannot exceed the boundary of the human behavior.
In an exemplary embodiment, the method further includes labeling the first continuous frame image and the second continuous frame image, where the labeling includes at least one of: labeling one or more target behavior categories, and labeling the start times of one or more target behaviors. For example, the labeling of human behavior categories can follow the classification described in the textbook "Sports Biomechanics" (1st edition, December 2013). For example, the upper limb movement may include at least one of: pushing, pulling and whipping; the lower limb movement may include at least one of: buffering, pedaling, extending and whipping; the whole body movement may include at least one of: swinging, twisting and opposite movement. The start time of a human behavior can be marked by directly viewing the video data, or by taking the required video data segment for user-defined marking.
As noted in the above exemplary embodiments, the target behavior characteristics include at least one of: target behavior category, target behavior start time. Target behavior characteristics are found out through the target behavior category and the target behavior starting time, and corresponding labeling can be carried out on target behavior video data to be trained or target behavior video data to be detected. The training optimization in the early stage and the subsequent identification detection process are facilitated.
In an exemplary embodiment, the method further includes converting the target behavior start time into a first value, and converting the target behavior end time into a second value. The start time and the end time of the target behavior are both converted, for example, normalization and the like can be completed. For example, the start time of a certain motion in the video data of human behavior may be set to 0, and the end time of the motion may be set to 1.
According to the above exemplary embodiments, before characterizing the first continuous frame image or the second continuous frame image with one or more neural networks, the method further includes performing feature segmentation on the acquired first continuous frame image or second continuous frame image, segmenting the continuous frame images to be detected into one or more feature segments. The feature segmentation includes at least one of: division by frame rate, division by frame number, or division by time value. In the embodiment of the present application, for example, video data containing the human behavior "push" may be divided by frame rate into a plurality of video segments and the number of segments counted; video data containing the human behavior "pull" may be divided by frame number into a plurality of video segments and the number of segments counted; and video data containing the human behavior "swing" may be divided by video time into a plurality of video segments and the number of segments counted.
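As a minimal illustration of the segmentation step above (the segment length and helper name are assumptions, not taken from the patent), the following Python sketch splits a video's frame sequence into fixed-length segments by frame number and counts them:

```python
import numpy as np

def segment_by_frame_count(frames: np.ndarray, segment_len: int = 16) -> list:
    """Split a (num_frames, H, W, 3) frame array into consecutive segments of
    `segment_len` frames; leftover frames at the end are discarded, mirroring
    the embodiment that drops video shorter than the time scale."""
    num_segments = len(frames) // segment_len
    return [frames[i * segment_len:(i + 1) * segment_len] for i in range(num_segments)]

# Example: a dummy 130-frame clip yields 8 segments of 16 frames each.
dummy_video = np.zeros((130, 112, 112, 3), dtype=np.uint8)
segments = segment_by_frame_count(dummy_video, segment_len=16)
print(len(segments))  # 8
```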
In an exemplary embodiment, one or more first continuous frame images are obtained and input into one or more three-dimensional convolutional neural networks for convolution to obtain an initial feature matrix; the initial feature matrix is convolved one or more times and input into a preset perceptron model for perception, enabling the first continuous frame image to perceive the target behavior feature information of the second continuous frame image.
For example, all human behavior video data are processed: feature segmentation is performed on the video data frame by frame, and the total number t of pictures generated from each video after segmentation is counted. The segmented picture sequences are combined into video clips w_i according to a time scale T, so that a video can be represented as a combination of segments {w_1, w_2, ..., w_n}. Let T be 16, for example, and discard videos whose length is less than the time scale. Each video segment w_i is convolved with a trained three-dimensional convolutional neural network to extract an initial feature matrix T × C, i.e., a feature map with time scale T and feature dimension C. The initial feature matrix T × C is convolved one or more times and input into the preset perceptron model for perception, or the initial feature matrix is directly input into the preset perceptron model for perception. One of the segments is taken as the first continuous frame image, and the entire complete video data is taken as the second continuous frame image.
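As a rough sketch of how a per-segment T × C feature matrix could be extracted (the small Conv3d stack below is only an assumed stand-in for the trained three-dimensional convolutional neural network, and the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class Tiny3DBackbone(nn.Module):
    """Assumed stand-in for the trained 3D CNN: maps a video segment
    (B, 3, T, H, W) to a per-time-step feature matrix (B, T, C)."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool H and W
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        x = self.features(segment)                        # (B, C, T, 1, 1)
        return x.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, C)

backbone = Tiny3DBackbone(feat_dim=256)
segment = torch.randn(1, 3, 16, 112, 112)  # one segment, T = 16 frames
with torch.no_grad():
    feature_matrix = backbone(segment)
print(feature_matrix.shape)  # torch.Size([1, 16, 256])
```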
The preset perceptron model is obtained by training on historical first continuous frame images or on the current first continuous frame image. The perception process of the perceptron model in the present application includes:
performing data encoding on the convolved initial feature matrix;
compressing the coded feature matrix in a time scale manner to obtain mean statistics;
inputting the mean value statistic value into one or more convolution neural networks for convolution, and normalizing convolution results;
and merging the normalized feature matrix and the initial feature matrix.
Specifically, one or more first continuous frame images are acquired as the current first continuous frame image, and the current first continuous frame image is input into one or more three-dimensional convolutional neural networks for convolution to acquire the convolved initial feature matrix T′ × C; for example, feature segmentation is performed frame by frame on arbitrary human behavior video data, and convolution is performed respectively using three one-dimensional convolutional neural networks α, β and γ to obtain the convolved initial feature matrix T′ × C.
For example, the convolved initial feature matrix T′ × C is mapped into a coding space for data encoding, and matrix multiplication is performed on the outputs of the convolutional neural networks α and β to obtain a feature matrix T′ × T′.
To obtain a global information representation, the feature matrix T′ × T′ is compressed on the time scale to obtain a mean statistic. The feature matrix T′ × T′ comprises an upper triangular matrix and a lower triangular matrix; the upper triangular matrix is connected with the lower triangular matrix through a connecting layer, the output of the convolutional neural network γ is matrix-multiplied with the feature matrix T′ × T′ produced by the connecting layer, and the result of the matrix multiplication is input into a global average pooling layer for processing. The feature matrix T′ × T′ processed by the global average pooling layer is compressed on the time scale to obtain the mean statistic.
The mean statistic is then convolved through two trained convolutional neural networks, and the convolution result is normalized to obtain a normalized feature matrix 1 × C.
The feature matrix 1 × C and the convolved initial feature matrix T′ × C are then merged or summed: the information in the feature matrix 1 × C is added to the convolved initial feature matrix T′ × C, so that the convolved initial feature matrix T′ × C is enhanced with global information, and any segment time Ti in the convolved initial feature matrix T′ × C, together with the other times Tj, can perceive the target behavior feature information or human behavior feature information of the entire video. Here, within a single piece of video data or a single feature segment, time Ti occurs earlier than time Tj.
In another exemplary embodiment, the initial feature matrix T × C may also be obtained directly and input into the perceptron model for perception, which likewise enables any segment time Ti in the initial feature matrix T × C, together with the other times Tj, to perceive the target behavior feature information or human behavior feature information of the entire video. The perception of the initial feature matrix T × C includes: perceiving the initial feature matrix T × C with the perceptron model trained on the convolved initial feature matrix T′ × C, the perception process being the same as that for the convolved initial feature matrix.
The time scale T and the feature dimension C may be set according to the actual situation, but the time scale T must be a multiple of 2; for example, in this embodiment T may be set to 16, 32, 64, 128, 256, 512, and so on.
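The following PyTorch sketch illustrates one possible reading of the perceptron model described above, with three 1-D convolutions (alpha, beta, gamma), a T′ × T′ relation matrix, time-scale compression into a mean statistic, two refining convolutions, normalization, and a merge back into the initial features; all layer widths and the exact normalization are assumptions rather than the patented implementation:

```python
import torch
import torch.nn as nn

class GlobalContextPerceptron(nn.Module):
    """Illustrative sketch (an assumption, not the patented design): encode the
    T' x C features with three 1-D convolutions, build a T' x T' relation matrix
    by matrix multiplication, compress it over the time scale into a mean
    statistic, refine with two convolutions, normalize, and add the result back
    to the initial feature matrix."""
    def __init__(self, feat_dim: int = 256, hidden: int = 64):
        super().__init__()
        self.alpha = nn.Conv1d(feat_dim, hidden, kernel_size=1)
        self.beta = nn.Conv1d(feat_dim, hidden, kernel_size=1)
        self.gamma = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)
        self.refine = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim // 4, kernel_size=1), nn.ReLU(),
            nn.Conv1d(feat_dim // 4, feat_dim, kernel_size=1),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)  # global average pooling over time

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T') per-segment feature matrix from the 3D backbone
        a, b, g = self.alpha(x), self.beta(x), self.gamma(x)
        relation = torch.matmul(a.transpose(1, 2), b)   # (B, T', T') relation matrix
        relation = torch.softmax(relation, dim=-1)
        context = torch.matmul(g, relation)             # (B, C, T') re-weighted features
        mean_stat = self.pool(context)                  # (B, C, 1) mean statistic
        weights = torch.sigmoid(self.refine(mean_stat)) # (B, C, 1), normalized
        return x + weights                              # merge with the initial features

block = GlobalContextPerceptron(feat_dim=256)
features = torch.randn(2, 256, 16)   # batch of T' = 16, C = 256 feature matrices
print(block(features).shape)         # torch.Size([2, 256, 16])
```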
In an exemplary embodiment, feature segmentation is performed by frame number, and the second continuous frame image is segmented into one or more first continuous frame images by frame number; the segmented first continuous frame images are then input into one or more convolutional neural networks for convolution to obtain a feature matrix with a timing relation. Since the convolved initial feature matrix is also divided by frame number, the convolved initial feature matrix T′ × C also has a timing relation. The timing relation includes a positive timing sequence relation, a negative timing sequence relation, or both at the same time. The positive timing sequence relation indicates that the first continuous frame images are convolved from the current time to the next time; the negative timing sequence relation indicates that the first continuous frame images are convolved from the next time to the current time.
If the timing relation is not considered, the relation matrix T′ × T′ is symmetric, indicating that the relation between time Ti and time Tj is the same in the positive and negative directions. If only the positive timing is considered, i.e., from time Ti to time Tj, the matrix is an upper triangular matrix, meaning that only past information can act on the future and not the reverse. If only the negative timing is considered, i.e., from time Tj to time Ti, the matrix is a lower triangular matrix, indicating that future information can be used to update past information. If the positive and negative timings are considered at the same time, a bidirectional relation is obtained: the positive and negative results are cascaded with a connecting layer, and a convolution layer is used to learn how to fuse them, thereby obtaining updated information and a feature matrix after convolution fusion, which is still T′ × C.
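A hedged sketch of the positive/negative timing relations described above, using upper- and lower-triangular masks on the T′ × T′ relation matrix and a convolution layer to fuse the cascaded results (an assumed reading, not the patented implementation):

```python
import torch
import torch.nn as nn

def directional_context(features: torch.Tensor, fuse: nn.Conv1d) -> torch.Tensor:
    # features: (B, C, T') per-segment feature matrix
    B, C, T = features.shape
    relation = torch.matmul(features.transpose(1, 2), features)  # (B, T', T') relation matrix
    upper = torch.triu(torch.ones(T, T, dtype=torch.bool))       # positive timing: past -> future
    lower = torch.tril(torch.ones(T, T, dtype=torch.bool))       # negative timing: future -> past
    # Normalize over the "source" time index (rows) for each target time (columns).
    fwd_attn = torch.softmax(relation.masked_fill(~upper, float("-inf")), dim=1)
    rev_attn = torch.softmax(relation.masked_fill(~lower, float("-inf")), dim=1)
    fwd = torch.matmul(features, fwd_attn)                       # (B, C, T')
    rev = torch.matmul(features, rev_attn)                       # (B, C, T')
    return fuse(torch.cat([fwd, rev], dim=1))                    # fused back to (B, C, T')

feat_dim, t_len = 256, 16
fuse_conv = nn.Conv1d(2 * feat_dim, feat_dim, kernel_size=1)
x = torch.randn(2, feat_dim, t_len)
print(directional_context(x, fuse_conv).shape)  # torch.Size([2, 256, 16])
```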
In an exemplary embodiment, one or more first continuous frame images capable of perceiving the target behavior features in the second continuous frame image are obtained, and the first continuous frame images are convolved through one or more convolutional neural networks to obtain a probability for predicting the target behavior in the first continuous frame images. Specifically, for example, in the embodiment of the present application, the feature matrix T′ × C after convolution fusion may be selected and convolved again one or more times to obtain a curve of dimension T × 1; the probability of a behavior occurring in each video segment is determined from this curve. Different thresholds are then set on this probability, and one or more target behavior probability proposals, or first-class proposals (temporal action grouping proposals), are generated accordingly; the target behavior probability proposals can be used to correct the target behavior time scale proposals, i.e., the first-class proposals can be used to correct the boundaries of the second-class proposals, making the boundaries more accurate.
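As an illustrative sketch of this proposal-generation step (threshold values and helper names are assumptions), consecutive time points whose behavior probability exceeds each threshold are grouped into candidate proposals:

```python
import numpy as np

def actionness_proposals(probabilities: np.ndarray, thresholds=(0.3, 0.5, 0.7)) -> list:
    """`probabilities` is the T x 1 behavior-probability curve; for each threshold,
    every maximal run of consecutive time points above the threshold becomes one
    candidate proposal (start_index, end_index, threshold)."""
    proposals = []
    for thr in thresholds:
        above = probabilities >= thr
        start = None
        for t, flag in enumerate(above):
            if flag and start is None:
                start = t
            elif not flag and start is not None:
                proposals.append((start, t - 1, float(thr)))
                start = None
        if start is not None:
            proposals.append((start, len(above) - 1, float(thr)))
    return proposals

curve = np.array([0.1, 0.2, 0.6, 0.8, 0.75, 0.4, 0.1, 0.55, 0.65, 0.2])
for p in actionness_proposals(curve):
    print(p)  # e.g. (2, 5, 0.3) ... one candidate temporal region per run and threshold
```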
In an exemplary embodiment, each classifier comprises one or more layers, each layer comprises one or more templates (anchors), one or more convolutional neural networks to be trained; the template comprises one or more of the perceptron models, one or more trained convolutional neural networks; by setting the template, the target behavior action can be extracted quickly, the identification time is saved, and the identification efficiency is improved.
Acquiring a feature matrix, and inputting the feature matrix into the one or more templates for convolution and perception; changing the time scale of the feature matrix after convolution and perception to make the time scale of the feature matrix after convolution and perception in the current layer in the classifier be twice as long as the time scale of the feature matrix after convolution and perception in the next layer; and then, performing upsampling on the feature matrix subjected to convolution and sensing in the classifier, inputting the upsampling result into one or more convolution neural networks to be trained for training, and generating one or more target behavior time scale proposals or second-class proposals (relationship-aware pyramid proposals).
Specifically, an initial feature matrix T × C is obtained and input into the one or more templates for convolution and perception;
changing the time scale of the convolved and sensed initial characteristic matrix T multiplied by C to make the time scale of the convolved and sensed initial characteristic matrix T multiplied by C in the current layer in the classifier be twice as long as the time scale of the convolved and sensed initial characteristic matrix T multiplied by C in the next layer;
and then performing upsampling on the convolved and perceived initial feature matrix T × C in the classifier, where the last layer is not upsampled and the other layers are upsampled; the upsampling results are input into one or more convolutional neural networks to be trained, and training generates one or more target behavior time scale proposals or second-class proposals (relationship-aware pyramid proposals), which can be used to detect target behaviors of different time scales.
Specifically, for example, if the classifier is constructed as a pyramid, the top layer is the initial feature matrix T × C, the i-th layer is a feature matrix of size (T/2^i) × C, and so on, so that the temporal length of the previous layer is twice that of the next layer; accordingly, the time scale of the i-th layer, i.e., the time unit represented by each of its time points, is 2^i times that of the top layer.
Except for the last layer of the pyramid, which is not upsampled, the other pyramid layers are upsampled; the upsampling results are input into a convolutional neural network to be trained for convolutional training, and one or more target behavior time scale proposals or second-class proposals (relationship-aware pyramid proposals) are generated according to the training result.
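A simplified sketch of such a pyramid-style classifier is shown below; the number of levels, the channel widths and the choice to upsample every level back to a common length are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class TemporalPyramid(nn.Module):
    """Illustrative pyramid classifier: each level halves the temporal length of
    the previous one, so the i-th level holds a (T / 2**i) x C feature matrix;
    level outputs are upsampled back to a common length before per-level
    prediction heads are applied (a simplification made for this sketch)."""
    def __init__(self, feat_dim: int = 256, num_levels: int = 3, num_classes: int = 2):
        super().__init__()
        self.down = nn.ModuleList(
            [nn.Conv1d(feat_dim, feat_dim, kernel_size=3, stride=2, padding=1)
             for _ in range(num_levels - 1)]
        )
        self.heads = nn.ModuleList(
            [nn.Conv1d(feat_dim, num_classes, kernel_size=1) for _ in range(num_levels)]
        )

    def forward(self, x: torch.Tensor) -> list:
        # x: (B, C, T) initial feature matrix; returns one prediction map per level.
        levels, cur = [x], x
        for conv in self.down:
            cur = torch.relu(conv(cur))  # temporal length halves at each level
            levels.append(cur)
        T = x.shape[-1]
        outputs = []
        for feat, head in zip(levels, self.heads):
            up = nn.functional.interpolate(feat, size=T, mode="linear", align_corners=False)
            outputs.append(head(up))     # (B, num_classes, T) per level
        return outputs

pyramid = TemporalPyramid(feat_dim=256, num_levels=3)
preds = pyramid(torch.randn(1, 256, 16))
print([p.shape for p in preds])  # three maps, each torch.Size([1, 2, 16])
```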
In order to identify human behavior more accurately, the boundary information of the above two kinds of proposals needs to be adjusted. Specifically, a target behavior probability proposal and a target behavior time scale proposal are obtained, and the coincidence degree between the target behavior probability proposal and the target behavior occurring in the target behavior time scale proposal is determined. The target behavior probability proposal and target behavior time scale proposal with the highest coincidence degree are screened out and fused proportionally to obtain adjusted boundary information; the proposals with adjusted boundary information are scored and ranked, and the target behavior detection result is determined from the proposal with the highest rank or the highest score value. Because the current frame image can perceive the target behavior feature information of the other frame images, score values are generated through this perception mechanism, and candidate time domains and regions are generated from the score values, missed detection and false detection do not occur. The fusion proportion in the embodiment of the present application can be set flexibly according to the actual situation. By adjusting the boundary information of the proposals, the method makes up for the shortcomings of the prior art: it can determine the start time and end time of the target behavior, adapt to the diversity of human behaviors, and adapt to human behaviors of different time scales.
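The boundary adjustment can be illustrated with the following sketch, where the coincidence degree is computed as a temporal intersection-over-union and the fusion ratio is an assumed, freely adjustable parameter:

```python
def temporal_iou(a, b):
    """Coincidence degree (temporal IoU) between two proposals (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def fuse_boundaries(prob_proposal, scale_proposal, ratio=0.5):
    """Proportionally fuse the probability proposal and the time-scale proposal
    with the highest coincidence degree (the ratio is an assumption)."""
    start = ratio * prob_proposal[0] + (1 - ratio) * scale_proposal[0]
    end = ratio * prob_proposal[1] + (1 - ratio) * scale_proposal[1]
    return (start, end)

probability_proposals = [(2.0, 5.5), (7.0, 9.0)]
time_scale_proposals = [(1.5, 6.0), (7.5, 8.5)]

adjusted = []
for sp in time_scale_proposals:
    best = max(probability_proposals, key=lambda pp: temporal_iou(pp, sp))
    adjusted.append(fuse_boundaries(best, sp))
print(adjusted)  # boundary-adjusted proposals, ready to be scored and ranked
```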
According to some exemplary embodiments, the method further includes calibrating the sample, specifically, acquiring one or more second continuous frame images, and dividing each second continuous frame image into a plurality of first continuous frame images according to the number of frames; calculating the overlapping percentage of the time scale of the target behavior in the first continuous frame images used for pre-training the convolutional neural network in the classifier template and the time scale of the target behavior in each segmented first continuous frame image; acquiring the maximum overlapping percentage, and setting a training sample used for pre-training the convolutional neural network in a template corresponding to the maximum overlapping percentage as a first training sample, namely setting a first continuous frame image used for pre-training the convolutional neural network as the first training sample; the training samples in the remaining templates are set as the second training sample.
For each piece of video data, the overlap percentage between the time scale of the target behavior in the training video data in each template and the time scale of the target behavior in the label of that video is calculated; all training samples in the template corresponding to the maximum overlap percentage are set as first training samples or given one label, and the samples in the remaining templates are set accordingly or given a different label. For example, in the embodiment of the present application, all training samples in the template corresponding to the maximum overlap percentage may be set as positive samples or given the label 1, and the training samples in the remaining templates may be set as negative samples or given the label 0. This way of setting samples or labels introduces an imbalance between the positive and negative sample categories, so the negative samples need to be screened. Specifically, all the second training samples are obtained, it is judged whether the overlap percentage corresponding to each second training sample is smaller than a preset threshold θ, and the second training samples whose overlap percentage is smaller than the preset threshold θ are screened out. The preset threshold θ can be set flexibly according to actual conditions. The second training samples screened out with an overlap percentage smaller than the preset threshold θ can be regarded as true negative samples, and the remaining negative samples are not used for calculating the loss function. The loss function is used for determining the direction in which the convolutional neural network and the classifier update their parameters according to the second training samples. For example, in the embodiment of the present application, the terms of the loss function may be set as: the template center point, the template scale, the template confidence, the target behavior category corresponding to the template, the behavior score, and the like.
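A rough sketch of this calibration and screening is given below, assuming each template is summarized by the time span it covers, the overlap percentage is computed as a temporal IoU, and θ is the screening threshold; all names and values are illustrative.

```python
def overlap_percentage(span_a, span_b):
    """Temporal overlap (IoU-style) between two (start, end) spans."""
    inter = max(0.0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))
    union = max(span_a[1], span_b[1]) - min(span_a[0], span_b[0])
    return inter / union if union > 0 else 0.0

def calibrate_samples(template_spans, gt_span, theta=0.3):
    """Label the template whose span overlaps the annotated behavior most as
    positive (label 1); the rest become negatives (label 0), and only negatives
    whose overlap is below theta are kept for the loss ("true" negatives)."""
    overlaps = [overlap_percentage(span, gt_span) for span in template_spans]
    positive_idx = max(range(len(overlaps)), key=lambda i: overlaps[i])
    labels, keep_for_loss = [], []
    for i, ov in enumerate(overlaps):
        if i == positive_idx:
            labels.append(1)
            keep_for_loss.append(True)
        else:
            labels.append(0)
            keep_for_loss.append(ov < theta)   # screen out ambiguous negatives
    return labels, keep_for_loss

# toy usage: three templates against one annotated behavior span
print(calibrate_samples([(0, 4), (3, 8), (8, 12)], gt_span=(2, 7), theta=0.3))
```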
When the scoring mechanism of the present invention is used, the target behavior feature information contained in each frame of the selected segment images can perceive the target behavior feature information contained in the whole video, which solves the problem that the scoring mechanisms of the prior art only focus on the current content. The generated scores are sorted, and the candidate time domains and candidate regions with high score values are selected, thereby avoiding missed detections, false detections, and the like. Meanwhile, the method adjusts the boundaries of the time scale proposals by means of the behavior probability proposals, so that it can not only locate the start time of the target behavior accurately, but also adapt to diverse human behaviors and to human behaviors with different time scales during human behavior detection.
As shown in fig. 2 to 9, the present invention further provides a behavior detection system, which includes:
a proposal obtaining module M10, configured to obtain one or more proposals containing target behaviors; wherein the proposal comprises at least one of: target behavior probability proposal and target behavior time scale proposal.
The classification detection module M20 includes one or more classifiers, and is configured to classify a proposal including a target behavior according to the one or more classifiers, and obtain at least two classification results; wherein the classification result comprises at least one of the following: target behavior probability proposal and target behavior time scale proposal.
And the result merging module M30 is configured to merge the at least two classification results to obtain a target behavior detection result.
Through this scheme, two problems of the prior art can be solved: a behavior-score detection method that focuses only on the current content and cannot perceive global context information, and a sliding-window detection method that cannot adapt to the diversity of human behaviors. With this scheme, a scoring mechanism can be adopted while the selected segment video can still perceive the target behavior feature information of the whole complete video; and by combining the target behavior probability proposal with the target behavior time scale proposal, the diversity of target behaviors can be better accommodated.
In an exemplary embodiment, the system further includes an image characterization module M40, configured to obtain one or more first continuous frame images and second continuous frame images containing the target behavior; to characterize a first continuous frame image by using one or more neural networks, so that the first continuous frame image can acquire the target behavior feature information of a second continuous frame image; and to generate a proposal containing the target behavior according to the first continuous frame image and the second continuous frame image. For example, the entire video data is taken as the second continuous frame image, and one of its segments is selected as the first continuous frame image.
In an exemplary embodiment, the method further includes characterizing the first continuous frame images by using one or more neural networks, so that each frame image in the first continuous frame images can perceive the target behavior feature information of the current frame image and the target behavior feature information of the other frame images.
In another exemplary embodiment, the method further includes characterizing the second continuous frame images by using one or more neural networks, so that each frame image in the second continuous frame images can perceive the target behavior characteristic information of the current frame image and the target behavior characteristic information of the other frame images.
According to the above exemplary embodiments, the first continuous frame image and the second continuous frame image are, for example, captured or input video data of a certain duration containing human behaviors, animal behaviors, or plant behaviors. Taking human behavior as an example, human behavior categories can be divided according to the descriptions in the textbook Sports Biomechanics (1st edition, December 2013). For example, the upper limb movement may include at least one of: pushing, pulling and whipping; the lower limb movement may include at least one of: buffering, pedaling, extending and whipping; the whole body movement may include at least one of: swinging, twisting and opposite movement. The start time of a human behavior can be labeled by directly watching the video data, or by extracting the required video data segment for user-defined labeling.
In an exemplary embodiment, the target behavior detection result includes at least one of: an intermediate detection result of the target behavior and a final detection result of the target behavior. The intermediate detection result of the target behavior includes a classification result generated by the classifier, for example: a target behavior action score value, a ranking of target behavior action score values, a proposal (Proposal) generated from a target behavior action score value, a proposal generated from a ranking of target behavior action score values, and the like. The final detection result of the target behavior includes a classification result generated by the classifier as well as a final result generated from the intermediate detection result, for example: a target behavior action score value, a ranking of target behavior action score values, a proposal generated from such a ranking, a target behavior action identified from such a ranking, a target behavior action identified from a proposal generated from such a ranking, or a target behavior action identified from a proposal obtained by combining the proposal generated from the score values with the proposal generated from their ranking, and the like.
In an exemplary embodiment, the parameters in the one or more neural networks are also updated according to the intermediate detection result or the final detection result of the target behavior. Feeding the intermediate or final detection result back to update the parameters of the neural networks trains and optimizes them, so that target behavior actions can be identified more quickly and accurately. For example, the convolutional neural network and the perceptron are trained and optimized using the intermediate or final detection result of human behavior, so that they recognize human behavior more quickly and accurately. In addition, the parameters of the neural networks can also be updated, or the networks trained, using one or more continuous frame images or the target behavior features in one or more continuous frame images. For example, for a human behavior action to be recognized, training can be performed with video data recorded by different people in the same environment, by the same person in different environments, or by different people in different environments, where the recorded video data contains the human behavior action to be identified or detected.
In an exemplary embodiment, the parameters in the one or more classifiers are updated according to the target behavior intermediate detection result or the target behavior final detection result. The intermediate detection result of the target behavior or the final detection result of the target behavior is input to update parameters in the classifier, and the classifier can be trained and optimized, so that the classification of the target behavior action is faster and more accurate. For example, the classifier is trained and optimized by using the intermediate detection result of the human behavior or the final detection result of the human behavior, so that the classifier is quicker and more accurate when classifying or scoring the human behavior and cannot exceed the boundary of the human behavior.
In an exemplary embodiment, as shown in fig. 3, the proposal obtaining module M10 further includes an image labeling unit D10, wherein the image labeling unit D10 is configured to label the first continuous frame image and the second continuous frame image. The labeling includes at least one of: labeling one or more target behavior categories, and labeling the start time of one or more target behaviors. For example, human behavior categories may be divided according to the descriptions in the textbook Sports Biomechanics (1st edition, December 2013). For example, the upper limb movement may include at least one of: pushing, pulling and whipping; the lower limb movement may include at least one of: buffering, pedaling, extending and whipping; the whole body movement may include at least one of: swinging, twisting and opposite movement. The start time of a human behavior can be labeled by directly watching the video data, or by extracting the required video data segment for user-defined labeling. As noted in the above exemplary embodiments, the target behavior features include at least one of: the target behavior category and the target behavior start time. By locating the target behavior features through the target behavior category and the target behavior start time, the target behavior video data to be trained or to be detected can be labeled accordingly, which facilitates the earlier training and optimization as well as the subsequent identification and detection.
In an exemplary embodiment, as shown in fig. 3, the proposal acquisition module M10 further includes a normalization unit D20 connected to the image annotation unit D10. The normalization unit D20 is configured to convert the target behavior start time into a first value, and convert the target behavior end time into a second value. Both the start time and the end time of the target behavior are converted; for example, they can be normalized. For example, the start time of a certain action in the human behavior video data may be set to 0, and the end time of that action may be set to 1.
In an exemplary embodiment, as shown in fig. 4, an image segmentation module M50 is further included, where the image segmentation module M50 is configured to perform feature segmentation on the acquired first continuous frame image and the acquired second continuous frame image, segment the second continuous frame image into one or more first continuous frame images, or segment the first continuous frame image into a plurality of video segments. As shown in fig. 5, the image segmentation module includes at least one of the following:
a frame rate dividing unit D30, configured to divide the continuous frame image into one or more first continuous frame images according to a frame rate;
a frame number dividing unit D40 for dividing the continuous frame images by frame number into one or more first continuous frame images;
a time division unit D50 for dividing the continuous frame image into one or more first continuous frame images according to time.
In the embodiment of the present application, for example, video data containing the human behavior action "push" may be divided according to frame rate into a plurality of video segments or first continuous frame images, and the number of segments obtained may be counted. Video data containing the human behavior action "pull" may be divided according to the number of frames into a plurality of video segments or first continuous frame images, and the number of segments counted. Video data containing the human behavior action "swing" may be divided according to video time into a plurality of video segments or first continuous frame images, and the number of segments counted.
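A loose sketch of the three segmentation options is given below, with frames represented by a plain Python list; the exact meaning of dividing "by frame rate" is not spelled out in the text, so the subsampling interpretation below is an assumption, as are all parameter values.

```python
def split_by_frame_count(frames, frames_per_segment=16):
    """Divide a continuous frame sequence into segments of a fixed number of frames."""
    return [frames[i:i + frames_per_segment]
            for i in range(0, len(frames) - frames_per_segment + 1, frames_per_segment)]

def split_by_time(frames, fps, seconds_per_segment=2.0):
    """Divide by time value: convert the time window into a frame count first."""
    return split_by_frame_count(frames, int(round(fps * seconds_per_segment)))

def split_by_frame_rate(frames, fps, target_fps=8):
    """Divide by frame rate: subsample to the target rate, then segment."""
    step = max(1, int(round(fps / target_fps)))
    return split_by_frame_count(frames[::step])

# toy usage: 160 dummy frames recorded at 32 fps
frames = list(range(160))
print(len(split_by_frame_count(frames)),      # 10 segments of 16 frames
      len(split_by_time(frames, fps=32)),     # 2-second windows of 64 frames
      len(split_by_frame_rate(frames, 32)))   # subsampled, then 16-frame segments
```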
In an exemplary embodiment, as shown in fig. 6, the image characterization module M40 includes:
the first characterization unit D210 is configured to obtain one or more first continuous frame images, input the first continuous frame images to one or more three-dimensional convolutional neural networks for convolution, and obtain an initial feature matrix;
the second characterization unit D220 is configured to convolve the initial feature matrix one or more times and input the convolved initial feature matrix into a preset sensor model for sensing, or directly input the initial feature matrix into the preset sensor model for sensing, so that each frame of image can sense target behavior feature information of the current frame of image and target behavior feature information of other frames of images.
For example, all human behavior video data are processed: each piece of video data is feature-segmented frame by frame, and the total number of pictures t generated per video after segmentation is counted. The segmented picture sequence is combined into video segments w_i according to a time scale T, so that a video can be represented as a combination of segments {w_1, w_2, ..., w_n}, where n = ⌊t/T⌋.
Let T = 16, for example, and discard trailing segments whose length is less than the time scale. For each video segment w_i, convolution is performed with the trained three-dimensional convolutional neural network, and an initial feature matrix T × C (a feature map with time scale T and feature dimension C) is extracted. The initial feature matrix T × C is convolved one or more times and then input into the preset perceptron model for perception, or it is input into the preset perceptron model directly. Here the whole video data is taken as the second continuous frame image, and a single video segment is taken as the first continuous frame image.
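The extraction of the initial feature matrix can be sketched as follows, with a tiny stand-in for the trained three-dimensional convolutional network that keeps the temporal axis and pools space away; the layer sizes, input resolution and feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class SegmentBackbone(nn.Module):
    """Stand-in for the trained 3-D convolutional network: maps a segment of
    T RGB frames to an initial feature matrix of shape T x C (the temporal
    axis is preserved, the spatial axes are pooled away)."""

    def __init__(self, feature_dim=64):
        super().__init__()
        self.conv = nn.Conv3d(3, feature_dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))   # keep time, pool H and W

    def forward(self, clip):                     # clip: (batch, 3, T, H, W)
        x = torch.relu(self.conv(clip))          # (batch, C, T, H, W)
        x = self.pool(x).squeeze(-1).squeeze(-1)  # (batch, C, T)
        return x.transpose(1, 2)                 # (batch, T, C)

backbone = SegmentBackbone(feature_dim=64)
segment = torch.randn(1, 3, 16, 112, 112)        # one segment w_i: T = 16 frames of 112x112 RGB
with torch.no_grad():
    feature_matrix = backbone(segment)           # shape (1, 16, 64): the T x C matrix
print(feature_matrix.shape)
```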
The preset sensor model is obtained by training according to a historical first continuous frame image or a current first continuous frame image. As shown in fig. 7, the sensing process of the sensor model in the present application at least includes:
an encoding unit D70, configured to perform data encoding on the convolved initial feature matrix;
the compression unit D80 is used for performing time scale compression on the coded feature matrix to obtain mean statistics;
the processing unit D90 is used for inputting the mean value statistic value into one or more convolution neural networks for convolution and normalizing the convolution result;
and a merging unit D100, configured to merge the normalized feature matrix and the initial feature matrix.
Specifically, the convolution unit D60 is configured to obtain one or more first continuous frame images as the current first continuous frame image, input the current first continuous frame image into one or more three-dimensional convolutional neural networks for convolution, and obtain a convolved initial feature matrix T′ × C. For example, any piece of human behavior video data is feature-segmented frame by frame and convolved by three one-dimensional convolutional neural networks α, β and γ, respectively, so as to obtain the convolved initial feature matrix T′ × C.
The encoding unit D70 is configured to perform data encoding on the convolved initial feature matrix T′ × C to obtain a feature matrix T′ × T′. For example, the convolved initial feature matrix T′ × C is mapped to an encoding space for data encoding, and the outputs of the convolutional neural networks α and β are matrix-multiplied to obtain the feature matrix T′ × T′.
The compression unit D80 is configured to compress the feature matrix T′ × T′ along the time scale to obtain a mean statistic. The feature matrix T′ × T′ comprises an upper triangular matrix and a lower triangular matrix; the upper triangular matrix is connected with the lower triangular matrix through a connection layer, the output of the convolutional neural network γ is matrix-multiplied with the feature matrix T′ × T′ obtained from the connection layer, and the result of the matrix multiplication is input into a global average pooling layer for processing. Time scale compression is then performed on the result of the global average pooling layer to obtain the mean statistic.
The processing unit D90 is used for convolving the mean value statistic value through one or more convolutional neural networks, normalizing the convolution result and acquiring a normalized feature matrix 1 × C;
The merging unit D100 merges the feature matrix 1 × C and the convolved initial feature matrix T′ × C. For example, the feature matrix 1 × C is merged or summed with the convolved initial feature matrix T′ × C, so that the information in the feature matrix 1 × C is added to the convolved initial feature matrix T′ × C and the convolved initial feature matrix T′ × C is enhanced with global information; any segment time T_i in the convolved initial feature matrix T′ × C, together with the other times T_j, can then perceive the target behavior feature information or human behavior feature information of the entire video. Here, within a single piece of video data or a single feature segment, time T_i occurs earlier than time T_j.
In another exemplary embodiment, the initial feature matrix T × C can also be obtained directly and input into the perceptron model for perception, which likewise enables any segment time T_i in the initial feature matrix T × C, together with the other times T_j, to perceive the target behavior feature information or human behavior feature information of the entire video. The perception of the initial feature matrix T × C includes: perceiving the initial feature matrix T × C with the perceptron model trained on the convolved initial feature matrix T′ × C, where the perception process for the initial feature matrix is the same as that for the convolved initial feature matrix; this allows any segment time T_i in the initial feature matrix T × C, together with the other times T_j, to perceive the target behavior feature information or human behavior feature information of the entire video.
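Putting units D60–D100 together, a hedged sketch of the perceptron model could look like the following, where three 1-D convolutions play the roles of α, β and γ, the T′ × T′ relation matrix is split into its upper- and lower-triangular (forward and backward time) parts, global average pooling produces the mean statistic, and the normalized 1 × C vector is broadcast-added back to the features; the sigmoid normalization and the simple sum of the two triangular parts are assumptions standing in for the connection layer.

```python
import torch
import torch.nn as nn

class GlobalPerception(nn.Module):
    """Hedged sketch of the perceptron model: alpha/beta/gamma encode the
    T' x C features, a T' x T' relation matrix is formed, compressed by
    global average pooling into a mean statistic, convolved and normalized
    into a 1 x C vector, and merged back into the original features so every
    time point carries global context."""

    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Conv1d(channels, channels, kernel_size=1)
        self.beta = nn.Conv1d(channels, channels, kernel_size=1)
        self.gamma = nn.Conv1d(channels, channels, kernel_size=1)
        self.proj = nn.Conv1d(channels, channels, kernel_size=1)
        self.norm = nn.Sigmoid()               # normalization choice is an assumption

    def forward(self, feat):                   # feat: (batch, C, T')
        a, b, g = self.alpha(feat), self.beta(feat), self.gamma(feat)
        relation = torch.bmm(a.transpose(1, 2), b)     # (batch, T', T') encoding
        upper = torch.triu(relation)                   # positive time direction
        lower = torch.tril(relation)                   # negative time direction
        relation = upper + lower                       # simple stand-in for the connection layer
        context = torch.bmm(g, relation)               # (batch, C, T')
        context = context.mean(dim=2, keepdim=True)    # global average pooling -> mean statistic
        context = self.norm(self.proj(context))        # normalized 1 x C vector
        return feat + context                          # merge: broadcast-add global info to T' x C

# toy usage: T' = 16 time points, C = 64 channels
module = GlobalPerception(channels=64)
print(module(torch.randn(2, 64, 16)).shape)            # torch.Size([2, 64, 16])
```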
The time scale T and the feature dimension C may be set according to actual conditions, but the time scale T must be a multiple of 2; for example, T in the present embodiment may be set to 16, 32, 64, 128, 256, 512, and the like.
In an exemplary embodiment, feature segmentation is performed by the number of frames, and the second continuous frame image is segmented into one or more first continuous frame images by the number of frames; the segmented first continuous frame images are input into one or more convolutional neural networks for convolution to obtain a feature matrix with a time sequence relation. Since the convolved initial feature matrix T′ × C is also obtained from frames divided by number, it likewise has a time sequence relation. The time sequence relation includes a positive time sequence relation, a negative time sequence relation, or both at the same time. The positive time sequence relation represents convolving the first continuous frame images from the current time to the next time; the negative time sequence relation represents convolving the first continuous frame images from the next time to the current time.
If the time sequence relation is not considered, the relation matrix T′ × T′ is symmetric, indicating that the relationship between time T_i and time T_j is the same in the positive and negative time directions. If only the positive time sequence is considered, i.e., from time T_i to time T_j, the matrix is an upper triangular matrix, meaning that only past information can act on the future, and not the reverse. If only the negative time sequence is considered, i.e., from time T_j to time T_i, the matrix is a lower triangular matrix, meaning that future information can be used to update past information. If the positive and negative sequences are considered at the same time, a bidirectional relation is obtained: the positive and negative sequences are cascaded with a connection layer, and a convolution layer learns how to fuse them, so that updated information is obtained and a convolution-fused feature matrix is produced, which is still T′ × C.
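One possible reading of the bidirectional case is sketched below: the upper-triangular (positive time) and lower-triangular (negative time) views of the relation matrix are stacked on a channel axis and mixed by a learned convolution, mirroring the "connection layer plus convolution layer" description; the kernel size and channel counts are assumptions.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Fuse the positive-timing and negative-timing views of a T' x T' relation matrix."""

    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)   # learns how to mix the two directions

    def forward(self, relation):                     # relation: (batch, T', T')
        forward_view = torch.triu(relation)          # past -> future only
        backward_view = torch.tril(relation)         # future -> past only
        stacked = torch.stack([forward_view, backward_view], dim=1)  # cascade on a channel axis
        return self.fuse(stacked).squeeze(1)         # fused (batch, T', T') relation

fusion = BidirectionalFusion()
print(fusion(torch.randn(2, 16, 16)).shape)          # torch.Size([2, 16, 16])
```

Applying the fused relation back to the features, for instance with a matrix multiplication as in the perception sketch above, would then yield the convolution-fused feature matrix that is still T′ × C.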
In an exemplary embodiment, as shown in fig. 4, the system further includes a behavior probability module M60 connected to the classification detection module M20, where the behavior probability module M60 is configured to obtain one or more first continuous frame images capable of perceiving the target behavior features of a second continuous frame image, and to convolve the first continuous frame images through one or more convolutional neural networks to obtain the probability, predicted from the first continuous frame images, that the target behavior occurs in the second continuous frame image. As shown in fig. 8, the classification detection module M20 includes a first detection unit D110 configured to obtain the probabilities from the behavior probability module, set different thresholds according to the probabilities, and generate one or more target behavior probability proposals, i.e., first-class proposals (temporal action grouping proposals). For example, in the embodiment of the present application, the convolution-fused feature matrix T′ × C may be selected and convolved one or more times to obtain a T × 1 curve, and the probability of a behavior occurring in each video segment is read from this curve. Different thresholds are then set according to the probabilities, and one or more target behavior probability proposals, i.e., first-class proposals (temporal action grouping proposals, behavior probability grouping proposals in the time dimension), are generated; the target behavior probability proposals can be used to correct the boundaries of the target behavior time scale proposals, or the first-class proposals can be used to correct the boundaries of the second-class proposals, making the boundaries more accurate.
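The first-class (temporal action grouping) proposals can be illustrated with the small routine below, which groups consecutive time points whose probability stays above a threshold and repeats this for several thresholds; the threshold values and the mapping from time points to seconds are assumptions.

```python
def probability_proposals(prob_curve, thresholds=(0.3, 0.5, 0.7), segment_seconds=0.5):
    """Group consecutive time points whose behavior probability stays above a
    threshold into one proposal; repeating this for several thresholds yields
    the first-class (temporal action grouping) proposals."""
    proposals = []
    for th in thresholds:
        start = None
        for t, p in enumerate(prob_curve):
            if p >= th and start is None:
                start = t
            elif p < th and start is not None:
                proposals.append((start * segment_seconds, t * segment_seconds, th))
                start = None
        if start is not None:                   # behavior still running at the end
            proposals.append((start * segment_seconds, len(prob_curve) * segment_seconds, th))
    return proposals

# toy usage: a T x 1 probability curve over 10 video segments
curve = [0.1, 0.2, 0.6, 0.8, 0.9, 0.7, 0.3, 0.1, 0.1, 0.05]
print(probability_proposals(curve))
```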
In an exemplary embodiment, each classifier in the classification detection module M20 includes one or more layers, each layer including one or more templates, one or more convolutional neural networks to be trained; the template comprises one or more of the perceptron models, one or more trained convolutional neural networks; the classification detection module further comprises a second detection unit, wherein the second detection unit is used for acquiring the convolved initial feature matrix, or inputting the initial feature matrix into the one or more templates for convolution and perception; changing the time scale of the feature matrix after convolution and perception to make the time scale of the feature matrix after convolution and perception in the current layer in the classifier be twice as long as the time scale of the feature matrix after convolution and perception in the next layer; and then, performing upsampling on the feature matrix subjected to convolution and sensing in the classifier, inputting an upsampling result into one or more convolution neural networks to be trained for training, and generating one or more target behavior time scale proposals or second-class proposals (relationship-aware pyramid proposals).
As shown in fig. 8, the classification detection module M20 further includes a second detection unit D120, where the second detection unit D120 is configured to obtain the feature matrix T × C and input the feature matrix T × C into the one or more templates for convolution and perception; to change the time scale of the convolved and perceived initial feature matrix T × C so that, in the classifier, the time scale of the convolved and perceived initial feature matrix in the current layer is twice as long as that in the next layer; and then to upsample the convolved and perceived initial feature matrix T × C in the classifier (the last layer is not upsampled, while the other layers are upsampled), input the upsampling results into one or more convolutional neural networks to be trained, and train to generate one or more target behavior time scale proposals or second-class proposals (relationship-aware pyramid proposals), which can be used for detecting target behaviors with different time scales.
Specifically, for example, one classifier is constructed as a pyramid: the top layer is the initial feature matrix T × C, the second layer is (T/2) × C, and so on, so that the time scale (the temporal length) of the previous layer is twice as long as that of the next layer. Equivalently, the time scale in the sense of the time unit represented by each time point is, for the i-th layer, 2^(i-1) times that of the top layer.
All layers of the pyramid except the last are upsampled; the upsampled results are input into a convolutional neural network to be trained for convolution training, and one or more target behavior time scale proposals or second-class proposals (relationship-aware pyramid proposals) are generated from the training result.
In an exemplary embodiment, the result merging module M30 obtains a target behavior probability proposal and a target behavior time scale proposal, and determines the coincidence degree between the target behavior occurring in the target behavior probability proposal and in the target behavior time scale proposal; it screens out the target behavior probability proposal and target behavior time scale proposal with the highest coincidence degree and fuses them in proportion to obtain adjusted boundary information, scores and ranks the proposals with adjusted boundary information, and determines the target behavior detection result from the top-ranked proposal or the proposal with the highest score. Because the current frame image can perceive the target behavior feature information of other frame images, a score value is generated through this perception mechanism, and candidate time domains and regions are generated according to the score value, so missed detections and false detections are avoided. The proportion used in the embodiment of the present application can be set flexibly according to actual conditions. By adjusting the boundary information of the proposals, the system makes up for the defects of the prior art: it can determine the start time and end time of the target behavior, and it can adapt to the diversity of human behaviors and to human behaviors with different time scales.
In an exemplary embodiment, as shown in fig. 4 and 9, a sample calibration module M70 is further included, and the sample calibration module M70 includes:
a first dividing unit D130, configured to acquire one or more second continuous frame images, and divide each of the second continuous frame images into a plurality of first continuous frame images according to a frame number;
an overlap percentage unit D140, configured to calculate an overlap percentage between a time scale of a target behavior in first continuous frame images used for training a convolutional neural network in advance in the classifier template and a time scale of the target behavior in each of the segmented first continuous frame images;
the calibration unit D150 is used for acquiring the maximum overlap percentage and setting all training samples used for training the convolutional neural network in advance in the template corresponding to the maximum overlap percentage as first training samples; the training samples in the remaining templates are set as the second training sample.
For each piece of video data, the overlap percentage between the time scale of the target behavior in the training video data in each template and the time scale of the target behavior in the label of that video is calculated; all training samples in the template corresponding to the maximum overlap percentage are set as first training samples or given one label, and the samples in the remaining templates are set accordingly or given different labels. For example, in the embodiment of the present application, the training samples in the template corresponding to the maximum overlap percentage may be set as positive samples or given the label 1, and the training samples in the remaining templates may be set as negative samples or given the label 0. This way of setting samples or labels introduces an imbalance between the positive and negative sample categories, so the negative samples need to be screened.
Specifically, the sample calibration module M70 further includes a sample screening unit D160, where the sample screening unit D160 is configured to obtain all the second training samples, judge whether the overlap percentage corresponding to each second training sample is smaller than a preset threshold θ, and screen out the second training samples whose overlap percentage is smaller than the preset threshold θ. The preset threshold θ can be set flexibly according to actual conditions. The second training samples screened out with an overlap percentage smaller than the preset threshold θ can be regarded as true negative samples, and the remaining negative samples are not used for calculating the loss function. The sample calibration module further includes an updating unit D170, where the updating unit is configured to obtain the second training samples with an overlap percentage smaller than the preset threshold, and to establish, according to these second training samples, a loss function for updating the one or more neural networks and the one or more classifiers. The loss function is used for determining the direction in which the convolutional neural network and the classifier update their parameters according to the second training samples. For example, in the embodiment of the present application, the terms of the loss function may be set as: the template center point, the template scale, the template confidence, the target behavior category corresponding to the template, the behavior score, and the like.
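For illustration only, the listed loss terms could be combined as in the sketch below, where regression terms are applied to the positive templates and the confidence term to the positives plus the "true" negatives kept after screening with θ; the specific loss functions, weights and tensor layout are assumptions, not the patented formulation.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred, target, positive_mask, keep_mask, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Illustrative combination of the listed loss terms: template centre point,
    template scale, template confidence, target behaviour category and behaviour
    score. `positive_mask` selects positive templates; `keep_mask` selects the
    positives plus the negatives kept after screening with theta."""
    w_center, w_scale, w_conf, w_cls, w_score = weights
    pos, keep = positive_mask, keep_mask
    loss = (
        w_center * F.smooth_l1_loss(pred["center"][pos], target["center"][pos])
        + w_scale * F.smooth_l1_loss(pred["scale"][pos], target["scale"][pos])
        + w_conf * F.binary_cross_entropy_with_logits(pred["conf"][keep],
                                                      target["label"][keep].float())
        + w_cls * F.cross_entropy(pred["category"][pos], target["category"][pos])
        + w_score * F.smooth_l1_loss(pred["score"][pos], target["score"][pos])
    )
    return loss
```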
When the scoring mechanism of this system is used, the target behavior feature information contained in each frame of image can perceive the target behavior feature information contained in the whole video, which solves the problem that the scoring mechanisms of the prior art only focus on the current content. The generated scores are sorted, and the candidate time domains and candidate regions with high score values are selected, thereby avoiding missed detections, false detections, and the like. Meanwhile, the system adjusts the boundaries of the time scale proposals by means of the behavior probability proposals, so that it can not only locate the start time of the target behavior accurately, but also adapt to diverse human behaviors and to human behaviors with different time scales during human behavior detection.
The embodiment of the application also provides behavior detection equipment, which is used for acquiring continuous frame images containing target behaviors;
representing the continuous frame images by using one or more neural networks, so that each frame image can sense target behavior characteristic information of the current frame image and target behavior characteristic information of other frame images;
one or more classifiers, which classify the characterized continuous frame images according to the one or more classifiers to obtain at least two classification results;
and combining the at least two classification results to obtain a target behavior detection result.
In this embodiment, the behavior detection device executes the system or the method, and specific functions and technical effects may refer to the above embodiments, which are not described herein again.
An embodiment of the present application further provides an apparatus, which may include: one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of fig. 1. In practical applications, the device may be used as a terminal device, and may also be used as a server, where examples of the terminal device may include: the mobile terminal includes a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, an intelligent television, a wearable device, and the like.
The present embodiment also provides a non-volatile readable storage medium, where one or more modules (programs) are stored in the storage medium, and when the one or more modules are applied to a device, the device may execute instructions (instructions) included in the behavior detection method in fig. 1 according to the present embodiment.
Fig. 10 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.
In this embodiment, the processor of the terminal device includes functions for executing the modules of the behavior detection system described above; for specific functions and technical effects, reference may be made to the above embodiments, which are not described herein again.
Fig. 11 is a schematic hardware structure diagram of a terminal device according to an embodiment of the present application. FIG. 11 is a specific embodiment of the implementation of FIG. 10. As shown in fig. 11, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: communication component 1203, power component 1204, multimedia component 1205, speech component 1206, input/output interfaces 1207, and/or sensor component 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the behavior detection method described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power components 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia components 1205 include a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the speech component 1206 further comprises a speaker for outputting speech signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect an open/closed state of the terminal device, relative positioning of the components, presence or absence of user contact with the terminal device. The sensor assembly 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor assembly 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card therein, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.
From the above, the communication component 1203, the voice component 1206, the input/output interface 1207 and the sensor component 1208 involved in the embodiment of fig. 11 can be implemented as the input device in the embodiment of fig. 10.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas of the present invention be covered by the claims of the present invention.

Claims (44)

1. A method for behavioral detection, comprising:
obtaining one or more proposals containing target behaviors;
classifying, by one or more classifiers, the proposal containing the target behavior to obtain at least two classification results;
and combining the at least two classification results to obtain a target behavior detection result.
2. The behavior detection method according to claim 1, characterized in that the proposal comprises at least one of: target behavior probability proposal and target behavior time scale proposal.
3. The behavior detection method according to claim 2, wherein the classification result includes at least one of: target behavior probability proposal and target behavior time scale proposal.
4. The behavior detection method according to claim 2, wherein one or more of a first continuous frame image and a second continuous frame image containing the target behavior are acquired; representing a first continuous frame image by using one or more neural networks, so that the first continuous frame image can acquire target behavior characteristic information of a second continuous frame image;
and generating a proposal containing target behaviors according to the first continuous frame image and the second continuous frame image.
5. The behavior detection method according to claim 4, wherein the first continuous frame images are characterized by one or more neural networks, so that each frame image in the first continuous frame images can perceive the target behavior feature information of the current frame image and the target behavior feature information of the other frame images.
6. The behavior detection method according to claim 4, wherein the second continuous frame images are characterized by one or more neural networks, so that each frame image in the second continuous frame images can perceive the target behavior feature information of the current frame image and the target behavior feature information of the other frame images.
7. The behavior detection method according to claim 2, wherein the target behavior detection result includes at least one of: intermediate detection results of the target behaviors and final detection results of the target behaviors.
8. The behavior detection method according to claim 7, wherein parameters in the one or more classifiers are updated according to the target behavior intermediate detection result or the target behavior final detection result.
9. The behavior detection method according to claim 4, further comprising labeling the first and second consecutive frame images, wherein the labeling comprises at least one of: marking one or more target behavior categories, and marking the starting time of one or more target behaviors.
10. The behavior detection method according to claim 9, wherein the target behavior feature comprises at least one of: target behavior category, target behavior start time.
11. The method according to claim 10, further comprising converting the target behavior start time to a first value and converting the target behavior end time to a second value.
12. The behavior detection method according to claim 5 or 6, wherein before characterizing the first or second consecutive frame images using one or more neural networks, the method further comprises performing feature segmentation on the consecutive frame images to segment the consecutive frame images into one or more first consecutive frame images.
13. The behavior detection method according to claim 12, characterized in that the feature segmentation comprises at least one of: divided by frame rate, divided by frame number, or divided by time value.
14. The behavior detection method according to claim 4, wherein a first continuous frame image is obtained, and the first continuous frame image is input to one or more convolution neural networks for convolution to obtain an initial feature matrix; and after the initial characteristic matrix is convolved again, inputting the initial characteristic matrix into a preset perceptron model for perception, or directly inputting the initial characteristic matrix into the preset perceptron model for perception, so that the first continuous frame image can perceive the target behavior characteristic information of the second continuous frame image.
15. The behavior detection method according to claim 14, wherein the sensing process of the preset sensing model comprises:
performing data encoding on the convolved initial feature matrix or the initial feature matrix;
compressing the coded feature matrix in a time scale manner to obtain mean statistics;
inputting the mean value statistic value into one or more convolution neural networks for convolution, and normalizing convolution results;
and merging the normalized feature matrix and the initial feature matrix.
16. The behavior detection method according to claim 4 or 15, characterized in that the first continuous frame images are input to one or more convolution neural networks for convolution, and a characteristic matrix with a time sequence relation is obtained.
17. The behavior detection method according to claim 16, wherein the timing relationship comprises at least one of: the timing sequence relation, the reverse timing sequence relation, the timing sequence relation and the reverse timing sequence relation are contained at the same time;
wherein, the positive timing sequence relation indicates that the information of the subsequent frame image of a frame is involved when the convolution is carried out on one frame image of the continuous frame images; the anti-timing relationship indicates that convolution of one frame of image of successive frames involves the previous frame of image information for that frame.
18. The behavior detection method according to claim 17, wherein a feature matrix including both a positive timing relationship and a negative timing relationship is selected, the feature matrix having the positive timing relationship and the feature matrix having the negative timing relationship are concatenated, and convolution fusion is performed through one or more convolutional neural networks to obtain a convolution-fused feature matrix.
19. The behavior detection method according to claim 14 or 15, characterized in that one or more first continuous frame images are acquired, the first continuous frame images are convolved by one or more convolutional neural networks to obtain a probability for predicting occurrence of a target behavior in the first continuous frame images, different thresholds are set according to the probability, and one or more target behavior probability proposals are generated.
20. The behavior detection method according to claim 19, wherein each classifier comprises one or more layers, each layer comprising one or more templates, one or more convolutional neural networks to be trained; the template comprises one or more of the perceptron models, one or more trained convolutional neural networks;
acquiring the initial feature matrix after convolution, or inputting the initial feature matrix into the one or more templates for convolution and perception;
changing the time scale of the feature matrix after convolution and perception to make the time scale of the feature matrix after convolution and perception in the current layer in the classifier be twice as long as the time scale of the feature matrix after convolution and perception in the next layer;
and then, performing up-sampling on the characteristic matrix in the classifier, inputting the up-sampling result into one or more convolutional neural networks to be trained for training, and generating one or more target behavior time scale classification results.
21. The behavior detection method according to claim 20, wherein a target behavior probability proposal and a target behavior time scale proposal are obtained, and the coincidence of the target behavior probability proposal and the target behavior occurring in the target behavior time scale proposal is determined; and screening out a target behavior probability proposal with the highest coincidence degree and a target behavior time scale proposal, and fusing the target behavior probability proposal with the highest coincidence degree and the target behavior time scale proposal to obtain a target behavior detection result.
22. A method as claimed in claim 20, wherein one or more first successive frame images are acquired;
calculating the percentage of overlap of the time scale of the target behavior in the training samples in the classifier template used for pre-training the convolutional neural network and the time scale of the target behavior in each first continuous frame image;
acquiring the maximum overlapping percentage, and setting all training samples in the template corresponding to the maximum overlapping percentage as first training samples; the training samples in the remaining templates are set as the second training sample.
23. The behavior detection method according to claim 21, wherein all the second training samples are obtained, whether the overlapping percentages corresponding to all the second training samples are smaller than a preset threshold value is judged, and the second training samples with the overlapping percentages smaller than the preset threshold value are screened out.
24. The method according to claim 23, wherein a second training sample with an overlap percentage smaller than a preset threshold is obtained, and a loss function for updating the one or more classifiers is established according to the second training sample with the overlap percentage smaller than the preset threshold.
25. A behavior detection system, comprising:
the proposal acquisition module is used for acquiring one or more proposals containing target behaviors;
the classification detection module comprises one or more classifiers and is used for classifying a proposal containing target behaviors according to the one or more classifiers to obtain at least two classification results;
And the result merging module is used for merging the at least two classification results to obtain a target behavior detection result.
26. The behavior detection system of claim 25, wherein the proposal comprises at least one of: target behavior probability proposal and target behavior time scale proposal.
27. The behavior detection system according to claim 26, wherein the classification result comprises at least one of: target behavior probability proposal and target behavior time scale proposal.
28. The behavior detection system of claim 27, further comprising an image characterization module configured to obtain one or more of a first continuous frame image and a second continuous frame image containing the target behavior; and characterizing the first continuous frame image by using one or more neural networks, so that the first continuous frame image can acquire target behavior characteristic information of the second continuous frame image.
29. The behavior detection system of claim 28, wherein the target behavior detection result comprises at least one of: intermediate detection results of the target behaviors and final detection results of the target behaviors.
30. The behavior detection system according to claim 29, wherein parameters in the one or more classifiers are updated based on the intermediate detection result of the target behavior or the final detection result of the target behavior.
31. The behavior detection system according to claim 28, wherein the image characterization module comprises:
the first characterization unit is used for acquiring a first continuous frame image, inputting the first continuous frame image into one or more convolutional neural networks for convolution, and acquiring an initial feature matrix;
and the second characterization unit is used for convolving the initial characteristic matrix again and inputting the convolved initial characteristic matrix into a preset perceptron model for perception, or directly inputting the initial characteristic matrix into the preset perceptron model for perception, so that the first continuous frame image can perceive the target behavior characteristic information of the second continuous frame image.
32. The behavior detection system according to claim 31, wherein the perception process of the preset perceptron model comprises:
the encoding unit is used for carrying out data encoding on the convolved initial feature matrix or the initial feature matrix;
the compression unit is used for compressing the coded feature matrix along the time scale to obtain a mean statistic;
the processing unit is used for inputting the mean statistic into one or more convolutional neural networks for convolution and normalizing the convolution result;
and the merging unit is used for merging the normalized feature matrix and the initial feature matrix.
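One way to read the perception flow of claim 32 is as a temporal squeeze-and-excitation style block. The sketch below assumes a (batch, channels, time) feature matrix, 1x1 convolutions for the encoding and processing units, a sigmoid for the normalization, and a gated residual merge; all of these concrete choices are assumptions rather than the claimed design.

```python
# Hedged sketch of the claim-32 perception process on a (B, C, T) feature
# matrix; layer choices, sigmoid normalization and the gated-residual merge
# are illustrative assumptions.
import torch
import torch.nn as nn

class PerceptionBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.encode = nn.Conv1d(channels, channels, kernel_size=1)   # encoding unit
        self.process = nn.Conv1d(channels, channels, kernel_size=1)  # processing unit
        self.norm = nn.Sigmoid()                                     # normalization

    def forward(self, x):                       # x: initial feature matrix (B, C, T)
        coded = self.encode(x)                  # data encoding
        stat = coded.mean(dim=2, keepdim=True)  # compress along the time scale -> mean statistic
        gate = self.norm(self.process(stat))    # convolve the statistic, then normalize
        return x + gate * x                     # merging unit: combine with the initial matrix
```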
33. The behavior detection system according to claim 31, wherein the first continuous frame images are input to one or more convolutional neural networks for convolution to obtain a feature matrix with a timing relationship.
34. The behavior detection system of claim 33, wherein the timing relationship comprises at least one of: a positive timing relationship, a reverse timing relationship, or both a positive timing relationship and a reverse timing relationship;
wherein the positive timing relationship represents convolution of the first continuous frame images from the current time to the next time; the reverse timing relationship represents convolution of the first continuous frame images from the next time to the current time.
35. The behavior detection system according to claim 34, wherein feature matrices that contain both a positive timing relationship and a reverse timing relationship are selected, the feature matrices with the positive timing relationship and the feature matrices with the reverse timing relationship are concatenated, and convolution fusion is performed through one or more convolutional neural networks to obtain a convolution-fused feature matrix.
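Claims 33-35 describe feature matrices carrying positive and reverse timing relationships that are concatenated and fused by convolution. A minimal sketch, assuming (batch, channels, time) features, 1D convolutions as the two branches, and a time flip to realize the reverse order; the module and branch names are illustrative.

```python
# Hedged sketch of claims 33-35: forward- and reverse-time branches whose
# outputs are concatenated on the channel axis and fused by a 1x1 convolution.
import torch
import torch.nn as nn

class BiTemporalFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fwd = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bwd = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, x):                                           # x: (B, C, T)
        forward_feat = self.fwd(x)                                  # positive timing relationship
        backward_feat = self.bwd(x.flip(dims=[2])).flip(dims=[2])   # reverse timing relationship
        stacked = torch.cat([forward_feat, backward_feat], dim=1)   # concatenation
        return self.fuse(stacked)                                   # convolution fusion
```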
36. The behavior detection system according to claim 30 or 31, further comprising a behavior probability module, wherein the behavior probability module is configured to obtain one or more first continuous frame images, perform convolution on the first continuous frame images through one or more convolutional neural networks, obtain a probability for predicting occurrence of a target behavior in the first continuous frame images, set different thresholds according to the probability, and generate one or more target behavior probability proposals.
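The behavior probability module of claim 36 can be sketched as a small convolutional head followed by thresholding. In the sketch below, the 256-channel input, the threshold values, and the rule that turns each threshold's active frames into one proposal are assumptions for illustration only.

```python
# Hedged sketch of the claim-36 behavior probability module: per-frame
# probability prediction plus several thresholds producing probability proposals.
import torch
import torch.nn as nn

head = nn.Sequential(nn.Conv1d(256, 64, kernel_size=3, padding=1), nn.ReLU(),
                     nn.Conv1d(64, 1, kernel_size=1), nn.Sigmoid())

def probability_proposals(features, thresholds=(0.5, 0.7, 0.9)):
    """features: (1, 256, T) features of a first continuous frame image."""
    prob = head(features).squeeze()                 # (T,) per-frame probability
    proposals = []
    for th in thresholds:
        active = (prob > th).nonzero().flatten()
        if len(active) > 0:                         # one coarse proposal per threshold
            proposals.append((int(active.min()), int(active.max()), th))
    return proposals
```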
37. The behavior detection system according to claim 36, wherein each classifier in the classification detection module comprises one or more layers, each layer comprising one or more templates and one or more convolutional neural networks to be trained; the template comprises one or more of the perceptron models and one or more trained convolutional neural networks;
the classification detection module further comprises a second detection unit, wherein the second detection unit is used for acquiring the convolved initial feature matrix, or inputting the initial feature matrix into the one or more templates for convolution and perception; changing the time scale of the feature matrix after convolution and perception so that the time scale of the feature matrix after convolution and perception in the current layer of the classifier is twice the time scale of the feature matrix after convolution and perception in the next layer; and then performing up-sampling on the feature matrix in the classifier, inputting the up-sampling result into one or more convolutional neural networks to be trained, and generating one or more target behavior time scale proposals.
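One reading of the layered classifier in claim 37 is a temporal feature pyramid: each layer halves the time scale of the perceived features (so the current layer is twice as long as the next), and each level is upsampled and passed to a trainable convolution that emits time-scale proposals. The layer count, channel sizes, and the two-channel proposal head below are illustrative assumptions.

```python
# Hedged sketch of the claim-37 layered classifier as a temporal pyramid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPyramid(nn.Module):
    def __init__(self, channels, num_layers=3):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1)
            for _ in range(num_layers))              # each layer halves the time scale
        self.heads = nn.ModuleList(
            nn.Conv1d(channels, 2, kernel_size=1)    # assumed start/end head per level
            for _ in range(num_layers))

    def forward(self, x):                            # x: (B, C, T) perceived features
        proposals, target_len = [], x.shape[2]
        for down, head in zip(self.down, self.heads):
            x = down(x)                              # current layer: half the previous time scale
            up = F.interpolate(x, size=target_len, mode='linear', align_corners=False)
            proposals.append(head(up))               # upsampled result -> trainable proposal head
        return proposals
```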
38. The behavior detection system according to claim 37, wherein the result merging module obtains a target behavior probability proposal and a target behavior time scale proposal, and determines a degree of coincidence between the target behavior occurring in the target behavior probability proposal and the target behavior occurring in the target behavior time scale proposal; and screens out the target behavior probability proposal and the target behavior time scale proposal with the highest degree of coincidence, and fuses them to obtain a target behavior detection result.
39. The behavior detection system according to claim 38, further comprising a sample calibration module, the sample calibration module comprising:
the first segmentation unit is used for acquiring one or more second continuous frame images and segmenting each second continuous frame image into a plurality of first continuous frame images according to the number of frames;
the overlap percentage unit is used for calculating the overlap percentage between the time scale of the target behavior in the training samples of the classifier template used for pre-training the convolutional neural network and the time scale of the target behavior in each segmented first continuous frame image;
the calibration unit is used for acquiring the maximum overlap percentage and setting all training samples in the template corresponding to the maximum overlap percentage as first training samples; the training samples in the remaining templates are set as second training samples.
40. The behavior detection system according to claim 39, wherein the sample calibration module further comprises a sample screening unit, and the sample screening unit is configured to obtain all the second training samples, determine whether the overlapping percentages corresponding to all the second training samples are smaller than a preset threshold, and screen out the second training samples whose overlapping percentages are smaller than the preset threshold.
41. The behavior detection system according to claim 40, wherein the sample calibration module further comprises an updating unit, and the updating unit is configured to obtain a second training sample with an overlap percentage smaller than a preset threshold, and establish a loss function for updating the one or more classifiers according to the second training sample with the overlap percentage smaller than the preset threshold.
42. A behavior detection device, comprising:
obtaining one or more proposals containing target behaviors;
one or more classifiers and a processing module, wherein the one or more classifiers are used for classifying the proposal containing target behaviors to obtain at least two classification results;
and combining the at least two classification results to obtain a target behavior detection result.
43. An apparatus, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method recited by one or more of claims 1-24.
44. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method recited by one or more of claims 1-24.
CN201911031016.8A 2019-10-28 2019-10-28 Behavior detection method, system, equipment and machine readable medium Active CN110796069B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911031016.8A CN110796069B (en) 2019-10-28 2019-10-28 Behavior detection method, system, equipment and machine readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911031016.8A CN110796069B (en) 2019-10-28 2019-10-28 Behavior detection method, system, equipment and machine readable medium

Publications (2)

Publication Number Publication Date
CN110796069A true CN110796069A (en) 2020-02-14
CN110796069B CN110796069B (en) 2021-02-05

Family

ID=69441668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911031016.8A Active CN110796069B (en) 2019-10-28 2019-10-28 Behavior detection method, system, equipment and machine readable medium

Country Status (1)

Country Link
CN (1) CN110796069B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950393A (en) * 2020-07-24 2020-11-17 杭州电子科技大学 Time sequence action fragment segmentation method based on boundary search agent
CN112487913A (en) * 2020-11-24 2021-03-12 北京市地铁运营有限公司运营四分公司 Labeling method and device based on neural network and electronic equipment
CN113205138A (en) * 2021-04-30 2021-08-03 四川云从天府人工智能科技有限公司 Human face and human body matching method, equipment and storage medium
CN113810751A (en) * 2020-06-12 2021-12-17 阿里巴巴集团控股有限公司 Video processing method and device, electronic device and server

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105894020A (en) * 2016-03-30 2016-08-24 重庆大学 Specific target candidate box generating method based on gauss model
US20170083798A1 (en) * 2015-09-17 2017-03-23 Canon Kabushiki Kaisha Devices, systems, and methods for generating a temporal-adaptive representation for video-event classification
CN106778854A (en) * 2016-12-07 2017-05-31 西安电子科技大学 Activity recognition method based on track and convolutional neural networks feature extraction
CN107733900A (en) * 2017-10-23 2018-02-23 中国人民解放军信息工程大学 One kind communication network users abnormal call behavioral value method for early warning
CN107729799A (en) * 2017-06-13 2018-02-23 银江股份有限公司 Crowd's abnormal behaviour vision-based detection and analyzing and alarming system based on depth convolutional neural networks
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase
CN108898047A (en) * 2018-04-27 2018-11-27 中国科学院自动化研究所 The pedestrian detection method and system of perception are blocked based on piecemeal
CN109034062A (en) * 2018-07-26 2018-12-18 南京邮电大学 A kind of Weakly supervised anomaly detection method based on temporal consistency
CN109284667A (en) * 2018-07-26 2019-01-29 同济大学 A kind of three streaming human motion action space area detecting methods towards video
US20190050996A1 (en) * 2017-08-04 2019-02-14 Intel Corporation Methods and apparatus to generate temporal representations for action recognition systems
CN109460734A (en) * 2018-11-08 2019-03-12 山东大学 The video behavior recognition methods and system shown based on level dynamic depth projection difference image table
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 A kind of Activity recognition method, apparatus and storage medium
CN109886358A (en) * 2019-03-21 2019-06-14 上海理工大学 Human bodys' response method based on multi-space information fusion convolutional neural networks
CN110046568A (en) * 2019-04-11 2019-07-23 中山大学 A kind of video actions recognition methods based on Time Perception structure
CN110147876A (en) * 2019-05-09 2019-08-20 中山大学 The neural network and its movement motion generation method of view-based access control model characteristic similarity
CN110188733A (en) * 2019-06-10 2019-08-30 电子科技大学 Timing behavioral value method and system based on the region 3D convolutional neural networks
US10445582B2 (en) * 2016-12-20 2019-10-15 Canon Kabushiki Kaisha Tree structured CRF with unary potential function using action unit features of other segments as context feature

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170083798A1 (en) * 2015-09-17 2017-03-23 Canon Kabushiki Kaisha Devices, systems, and methods for generating a temporal-adaptive representation for video-event classification
CN105894020A (en) * 2016-03-30 2016-08-24 重庆大学 Specific target candidate box generating method based on gauss model
CN106778854A (en) * 2016-12-07 2017-05-31 西安电子科技大学 Activity recognition method based on track and convolutional neural networks feature extraction
US10445582B2 (en) * 2016-12-20 2019-10-15 Canon Kabushiki Kaisha Tree structured CRF with unary potential function using action unit features of other segments as context feature
CN107729799A (en) * 2017-06-13 2018-02-23 银江股份有限公司 Crowd's abnormal behaviour vision-based detection and analyzing and alarming system based on depth convolutional neural networks
US20190050996A1 (en) * 2017-08-04 2019-02-14 Intel Corporation Methods and apparatus to generate temporal representations for action recognition systems
CN107733900A (en) * 2017-10-23 2018-02-23 中国人民解放军信息工程大学 One kind communication network users abnormal call behavioral value method for early warning
CN108898047A (en) * 2018-04-27 2018-11-27 中国科学院自动化研究所 The pedestrian detection method and system of perception are blocked based on piecemeal
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase
CN109284667A (en) * 2018-07-26 2019-01-29 同济大学 A kind of three streaming human motion action space area detecting methods towards video
CN109034062A (en) * 2018-07-26 2018-12-18 南京邮电大学 A kind of Weakly supervised anomaly detection method based on temporal consistency
CN109460734A (en) * 2018-11-08 2019-03-12 山东大学 The video behavior recognition methods and system shown based on level dynamic depth projection difference image table
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 A kind of Activity recognition method, apparatus and storage medium
CN109886358A (en) * 2019-03-21 2019-06-14 上海理工大学 Human bodys' response method based on multi-space information fusion convolutional neural networks
CN110046568A (en) * 2019-04-11 2019-07-23 中山大学 A kind of video actions recognition methods based on Time Perception structure
CN110147876A (en) * 2019-05-09 2019-08-20 中山大学 The neural network and its movement motion generation method of view-based access control model characteristic similarity
CN110188733A (en) * 2019-06-10 2019-08-30 电子科技大学 Timing behavioral value method and system based on the region 3D convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HUIJUAN XU ET AL.: "R-C3D: Region Convolutional 3D Network for Temporal Activity Detection", 《ARXIV:1703.07814V2》 *
JIALIN GAO ET AL.: "Relation-Aware Pyramid Network (RapNet) for temporal action proposal: Submission to ActivityNet Challenge 2019", 《ARXIV:1908.03448V1》 *
SHAOQING REN ET AL.: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 *
YIN HONGPENG ET AL.: "A Review of Vision-Based Object Detection and Tracking", 《自动化学报》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113810751A (en) * 2020-06-12 2021-12-17 阿里巴巴集团控股有限公司 Video processing method and device, electronic device and server
CN113810751B (en) * 2020-06-12 2022-10-28 阿里巴巴集团控股有限公司 Video processing method and device, electronic device and server
CN111950393A (en) * 2020-07-24 2020-11-17 杭州电子科技大学 Time sequence action fragment segmentation method based on boundary search agent
CN111950393B (en) * 2020-07-24 2021-05-04 杭州电子科技大学 Time sequence action fragment segmentation method based on boundary search agent
CN112487913A (en) * 2020-11-24 2021-03-12 北京市地铁运营有限公司运营四分公司 Labeling method and device based on neural network and electronic equipment
CN113205138A (en) * 2021-04-30 2021-08-03 四川云从天府人工智能科技有限公司 Human face and human body matching method, equipment and storage medium
CN113205138B (en) * 2021-04-30 2024-07-09 四川云从天府人工智能科技有限公司 Face and human body matching method, equipment and storage medium

Also Published As

Publication number Publication date
CN110796069B (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN110796069B (en) Behavior detection method, system, equipment and machine readable medium
CN108090458B (en) Human body falling detection method and device
CN110602527B (en) Video processing method, device and storage medium
US8750573B2 (en) Hand gesture detection
WO2020253127A1 (en) Facial feature extraction model training method and apparatus, facial feature extraction method and apparatus, device, and storage medium
US20120027252A1 (en) Hand gesture detection
CN109002766A (en) A kind of expression recognition method and device
CN111160434B (en) Training method and device for target detection model and computer readable storage medium
WO2023040506A1 (en) Model-based data processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN110222582B (en) Image processing method and camera
CN110781980B (en) Training method of target detection model, target detection method and device
CN108986137B (en) Human body tracking method, device and equipment
CN110796071B (en) Behavior detection method, system, machine-readable medium and device
CN111126347B (en) Human eye state identification method, device, terminal and readable storage medium
CN113496208B (en) Video scene classification method and device, storage medium and terminal
CN113780201B (en) Hand image processing method and device, equipment and medium
CN111414910A (en) Small target enhancement detection method and device based on double convolutional neural network
CN110796039B (en) Face flaw detection method and device, electronic equipment and storage medium
CN113793323A (en) Component detection method, system, equipment and medium
CN111291666A (en) Game live video identification method and device, electronic equipment and storage medium
CN110751004A (en) Two-dimensional code detection method, device, equipment and storage medium
CN112364918A (en) Abnormality recognition method, terminal, and computer-readable storage medium
CN111818364B (en) Video fusion method, system, device and medium
CN115879002A (en) Training sample generation method, model training method and device
CN115620378A (en) Multi-view cow face intelligent acquisition method, device and system and related equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 511458 room 1110, jinmaowan, Nansha, No.1, West 4th Street, Jinmao, Nansha District, Guangzhou City, Guangdong Province (self compiled T4 building) (for office use only)

Applicant after: Guangzhou yuncongboyan Intelligent Technology Co., Ltd

Address before: 511458 room 1110, jinmaowan, Nansha, No.1, West 4th Street, Jinmao, Nansha District, Guangzhou City, Guangdong Province (self compiled T4 building) (for office use only)

Applicant before: Guangzhou Boyan Intelligent Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant