CN111079507B - Behavior recognition method and device, computer device and readable storage medium - Google Patents

Behavior recognition method and device, computer device and readable storage medium

Info

Publication number
CN111079507B
CN111079507B
Authority
CN
China
Prior art keywords
target
displacement difference
module
training sample
pixel displacement
Prior art date
Legal status
Active
Application number
CN201910995333.5A
Other languages
Chinese (zh)
Other versions
CN111079507A (en)
Inventor
陈海波
Current Assignee
Shenlan Technology Chongqing Co ltd
Original Assignee
Shenlan Technology Chongqing Co ltd
Priority date
Filing date
Publication date
Application filed by Shenlan Technology Chongqing Co ltd filed Critical Shenlan Technology Chongqing Co ltd
Priority to CN201910995333.5A priority Critical patent/CN111079507B/en
Publication of CN111079507A publication Critical patent/CN111079507A/en
Application granted granted Critical
Publication of CN111079507B publication Critical patent/CN111079507B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/269 - Analysis of motion using gradient-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a behavior recognition method and device, a computer device and a readable storage medium. The method comprises the following steps: inputting a training sample into a behavior recognition model to be trained, wherein the behavior recognition model to be trained comprises a total variation network and a double-flow convolution network; extracting a basic optical flow field and a distorted optical flow field from the training sample through the total variation network; respectively inputting the basic optical flow field, the distorted optical flow field and the pixel information of each frame of image of each source video extracted from the training sample into the double-flow convolution network to obtain classification results of time-flow classification and space-flow classification of the training sample; performing convolution calculation on the classification results of the time flow and the space flow to obtain a target classification result; taking the behavior recognition model corresponding to a target classification result meeting a preset error range as a target behavior recognition model; and inputting a video to be identified into the target behavior recognition model, and determining the behavior category included in the video to be identified.

Description

Behavior recognition method and device, computer device and readable storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a behavior recognition method and apparatus, a computer apparatus, and a readable storage medium.
Background
With the rapid development of Internet technology, the emergence of various short-video apps has reduced the difficulty and cost of producing short videos, and a large number of new short videos are generated on the network every day. How to efficiently identify the behaviors contained in a video is therefore an important direction for video understanding applications.
Currently, the mainstream behavior recognition methods based on optical flow features are as follows. The first is recognition based on optimized optical flow features. Specifically, the optical flow between every two consecutive frames is first extracted by a total variation method based on the L1 norm (TV-L1), so as to obtain the optical flow field at each point, namely the pixel displacement of the target between consecutive frames. Then, convolutional neural network models are trained on the image pixel information (such as pixel coordinates and the RGB values of pixel points) and on the optical flow information respectively to judge the category: two-dimensional convolution captures the appearance information, while the optical flow field captures the motion and timing relationship features. Finally, the corresponding behavior category is identified by a classification algorithm. The second is recognition based on an optical flow network. Specifically, an optical flow field is first calculated by an optical flow network and then input into a double-flow convolution network for behavior recognition. The optical flow network is composed of a deep convolutional network, and the optical flow calculation is performed through stacked optimization modules.
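By way of illustration only, the TV-L1 extraction step of the first approach can be reproduced with OpenCV's contrib module; the sketch below is not part of the patent and assumes opencv-contrib-python is installed, with placeholder file names.

```python
# Sketch of the prior-art step: TV-L1 optical flow between two consecutive frames.
# Assumes opencv-contrib-python (which provides cv2.optflow); file names are placeholders.
import cv2

prev_frame = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
next_frame = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

tvl1 = cv2.optflow.createOptFlow_DualTVL1()      # total variation / L1-norm optical flow solver
flow = tvl1.calc(prev_frame, next_frame, None)   # (H, W, 2): per-pixel displacement (u, v)

print(flow.shape)  # the optical flow field of each point between the two frames
```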
In the prior art, both behavior recognition methods are only suitable for behaviors with a small motion amplitude. The calculation of the optical flow features is time-consuming and is not suitable for recognizing targets that move too fast. Moreover, when the acquired scene motion information is too little, recognition precision and accuracy cannot be guaranteed; when the acquired scene motion information is too much, the calculation amount and load are large, the process is time-consuming, and real-time recognition cannot be achieved.
Therefore, the existing behavior recognition method based on the optical flow features has the technical problem of low recognition efficiency.
Disclosure of Invention
The embodiment of the invention provides a behavior recognition method and device, a computer device and a readable storage medium, which are used for solving the technical problem of low recognition efficiency of the existing behavior recognition method based on optical flow characteristics.
In a first aspect, an embodiment of the present invention provides a behavior recognition method, including:
inputting a training sample to a behavior recognition model to be trained, wherein the behavior recognition model to be trained comprises a total variation network and a double-flow convolution network, the training sample comprises a plurality of source videos, and each source video in the plurality of source videos comprises a specific behavior;
extracting a basic optical flow field and a distorted optical flow field from the training sample through the total variation network;
Respectively inputting the basic optical flow field, the distorted optical flow field and pixel information of each frame of image included in each source video extracted from the training sample into the double-flow convolution network to obtain a classification result of time flow classification and space flow classification of the training sample;
performing convolution calculation on the classification results of the time stream and the space stream to obtain a target classification result;
taking a behavior recognition model corresponding to the target classification result meeting a preset error range as a target behavior recognition model;
inputting the video to be identified into the target behavior identification model, and determining the behavior category included in the video to be identified.
In the technical scheme of the embodiment of the invention, a training sample is first input into a behavior recognition model to be trained, which comprises a total variation network and a double-flow convolution network. A basic optical flow field and a distorted optical flow field are then extracted from the training sample through the total variation network and, together with the pixel information in the training sample, are respectively input into the double-flow convolution network, so that time-flow classification and space-flow classification of the training sample are realized. Convolution calculation is then performed on the classification results of the time flow and the space flow to obtain a target classification result. The behavior recognition model corresponding to a target classification result meeting the preset error range is taken as the target behavior recognition model (namely, the trained behavior recognition model), so that the behavior category included in a video to be identified can be determined by inputting the video to be identified into the target behavior recognition model. That is, the basic optical flow field and the distorted optical flow field are extracted through the total variation network, the time flow and the space flow are classified by the double-flow convolution network, and the classification results are finally fused through a convolution model, thereby realizing behavior recognition of the video to be identified. Because the basic optical flow field and the distorted optical flow field share the total variation network parameters, the whole recognition process is faster. In addition, combining the basic optical flow field, the distorted optical flow field, the pixel information and the double-flow convolution network effectively suppresses the influence of background movement and improves the recognition accuracy. It can be seen that the recognition efficiency of the whole behavior recognition method is higher.
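To make the data flow concrete, a minimal sketch of such a model is given below. PyTorch is assumed; the TVFlowNetwork placeholder, the tiny stream backbones, the class names, the layer sizes and the 0.5/0.5 flow weighting are illustrative assumptions and not the patented implementation.

```python
import torch
import torch.nn as nn

class TVFlowNetwork(nn.Module):
    """Placeholder for the total variation network that outputs an optical flow field."""
    def forward(self, frames):                       # frames: (B, T, 3, H, W)
        b, t, _, h, w = frames.shape
        return torch.zeros(b, t - 1, 2, h, w)        # (u, v) displacement per frame pair

class TwoStreamRecognizer(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.tv_net = TVFlowNetwork()                # shared by the basic and distorted flows
        self.temporal_stream = nn.Sequential(        # time flow: consumes optical flow fields
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes))
        self.spatial_stream = nn.Sequential(         # space flow: consumes RGB pixel information
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes))
        self.fusion = nn.Conv1d(2, 1, kernel_size=1)  # convolutional fusion of the two results

    def forward(self, frames, warped_frames):
        basic_flow = self.tv_net(frames)              # basic optical flow field
        warped_flow = self.tv_net(warped_frames)      # distorted optical flow field
        flow_in = (0.5 * basic_flow + 0.5 * warped_flow).mean(dim=1)   # weighted combination
        time_scores = self.temporal_stream(flow_in)
        space_scores = self.spatial_stream(frames.mean(dim=1))
        stacked = torch.stack([time_scores, space_scores], dim=1)      # (B, 2, num_classes)
        return self.fusion(stacked).squeeze(1)                         # target classification result
```

A forward pass is then simply model(frames, warped_frames), where warped_frames are the homography-mapped frames discussed later in the detailed description.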
Optionally, before the inputting the training sample to the behavior recognition model to be trained, the method further comprises:
collecting the plurality of source videos;
editing video clips which are matched with a preset behavior category and last for a preset duration from the plurality of source videos;
segmenting the video segment according to a preset rule to obtain a multi-segment interval;
sampling at the frame granularity in each section of interval, and converting into multi-frame images;
the pixel size of each frame of image is adjusted to be a preset value;
and taking each frame of adjusted image as the training sample.
In the technical scheme of the embodiment of the invention, a plurality of source videos are collected first, then video fragments which are matched with a preset behavior category and last for a preset duration are clipped, then the video fragments are sampled with a segmented frame granularity, converted into multi-frame images, and then the image sizes are unified. That is, the sampled multiple source videos are clipped and segmented, so that segmented sampling is realized, redundant images are reduced, the consistency of the test sample and the training sample is ensured after the sizes of the images are unified, and the accuracy of behavior recognition is further improved.
Optionally, the extracting the basic optical flow field from the training sample through the total variation network includes:
Inputting each frame of image in the training sample into the total variation network to obtain a basic optical flow field output by the total variation network;
the extraction process of the basic optical flow field of any two adjacent frame images including the first frame image and the second frame image in the training sample through the total variation network is as follows:
acquiring a first brightness value and a second brightness value of a pixel point at the same position of the first frame image and the second frame image, and a first initial pixel displacement difference and a first initial dual vector field between the two frame images;
inputting the first brightness value, the second brightness value, the first initial pixel displacement difference and the first initial dual vector field into a first module in the total variation network, and outputting a first target pixel displacement difference and a first target dual vector field, wherein the total variation network comprises N modules arranged according to a preset sequence, the first module is positioned at the head of the preset sequence, and N is a positive integer greater than 2;
taking the first target pixel displacement difference and the first target dual vector field as learned parameters to participate in training of each module in the total variation network;
Obtaining an Nth target pixel displacement difference and an Nth target dual vector field which are output by an Nth module in the total variation network;
and acquiring a basic optical flow field comprising the N-th target pixel displacement difference and the N-th target dual vector field, wherein the N-th module is positioned at the tail part of the preset sequence.
Optionally, the inputting the first luminance value, the second luminance value, the first initial pixel displacement difference, and the first initial dual vector field into the first module in the total variation network outputs a first target pixel displacement difference and a first target dual vector field, including:
converting the first brightness value and the second brightness value in the first module by using bilinear interpolation deformation and a convolution layer to obtain a first auxiliary variable;
and respectively carrying out convolution calculation on the first auxiliary variable and the first initial pixel displacement difference and the first initial dual vector field, and outputting the first target pixel displacement difference and the first target dual vector field through the first module.
Optionally, the extracting the distorted optical flow field from the training sample through the total variation network includes:
mapping the images in the training samples through a homography matrix to obtain mapped images;
Inputting the mapped image into the total variation network to obtain a distorted optical flow field output by the total variation network;
the extraction process of the distorted optical flow field of any two adjacent frames of images including the third frame of image and the fourth frame of image in the mapped image comprises the following steps:
acquiring a third brightness value and a fourth brightness value of a pixel point at the same position of the third frame image and the fourth frame image, and a second initial pixel displacement difference and a second initial dual vector field between the two frame images;
inputting the third luminance value, the fourth luminance value, the second initial pixel displacement difference and the second initial dual vector field into the first module, and outputting a first pixel displacement difference and a first dual vector field;
taking the first pixel displacement difference and the first dual vector field as learned parameters to participate in training of each module in the total variation network;
obtaining an Nth pixel displacement difference and an Nth dual vector field which are output by the Nth module;
and acquiring a distorted optical flow field comprising the Nth pixel displacement difference and the Nth dual vector field.
Optionally, said inputting the third luminance value, the fourth luminance value, the second initial pixel displacement difference, and the second initial dual vector field into the first module outputs a first pixel displacement difference and a first dual vector field, comprising:
Converting the third brightness value and the fourth brightness value in the first module by using bilinear interpolation deformation and a convolution layer to obtain a second auxiliary variable;
and respectively carrying out convolution calculation on the second auxiliary variable and the second initial pixel displacement difference and the second initial dual vector field, and outputting the first pixel displacement difference and the first dual vector field through the first module.
Optionally, if the coordinates of any pixel point of the fifth frame image in the training sample are (x1, y1, 1) and the coordinates of that pixel point after mapping by the homography matrix are (x2, y2, 1), the mapping of the fifth frame image by the homography matrix is expressed as (x2, y2, 1)^T = H(x1, y1, 1)^T,
wherein the homography matrix H is a 3×3 matrix which may be expressed as H = [h11 h12 h13; h21 h22 h23; h31 h32 h33].
optionally, the obtaining the classification result of the time stream classification and the space stream classification on the training sample includes:
performing weighted calculation on the basic optical flow field and the distorted optical flow field, and taking the weighted calculation result as the input of time flow in the double-flow convolution network to obtain a first classification result of the training sample after the time flow is classified;
and taking the pixel information of each frame of image in the obtained training sample as the input of the space flow in the double-flow convolution network to obtain a second classification result of the training sample after the space flow is classified.
In a second aspect, an embodiment of the present invention further provides a behavior recognition apparatus, including:
the system comprises an input unit, a training unit and a processing unit, wherein the input unit is used for inputting a training sample to a behavior recognition model to be trained, the behavior model to be trained comprises a total variation network and a double-flow convolution network, the training sample comprises a plurality of source videos, and each source video in the plurality of source videos comprises a specific behavior;
the extraction unit is used for extracting a basic optical flow field and a distorted optical flow field from the training sample through the total variation network;
the first obtaining unit is used for respectively inputting the basic optical flow field, the distorted optical flow field and pixel information of each frame of image included in each source video extracted from the training sample into the double-flow convolution network to obtain a classification result of time flow classification and space flow classification on the training sample;
the second obtaining unit is used for carrying out convolution calculation on the classification results of the time stream and the space stream to obtain a target classification result;
the determining unit is used for taking a behavior recognition model corresponding to the target classification result meeting a preset error range as a target behavior recognition model;
and the third obtaining unit inputs the video to be identified into the target behavior identification model and determines the behavior category included in the video to be identified.
Optionally, the behavior recognition device further includes: the processing unit is specifically used for:
collecting the plurality of source videos;
editing video clips which are matched with a preset behavior category and last for a preset duration from the plurality of source videos;
segmenting the video segment according to a preset rule to obtain a multi-segment interval;
sampling at the frame granularity in each section of interval, and converting into multi-frame images;
the pixel size of each frame of image is adjusted to be a preset value;
and taking each frame of adjusted image as the training sample.
Optionally, the extraction unit is specifically configured to:
inputting each frame of image in the training sample into the total variation network to obtain a basic optical flow field output by the total variation network;
the extraction unit extracts a basic optical flow field of any two adjacent frame images including a first frame image and a second frame image in the training sample through the total variation network, wherein the extraction process comprises the following steps:
acquiring a first brightness value and a second brightness value of a pixel point at the same position of the first frame image and the second frame image, and a first initial pixel displacement difference and a first initial dual vector field between the two frame images;
Inputting the first brightness value, the second brightness value, the first initial pixel displacement difference and the first initial dual vector field into a first module in the total variation network, and outputting a first target pixel displacement difference and a first target dual vector field, wherein the total variation network comprises N modules arranged according to a preset sequence, the first module is positioned at the head of the preset sequence, and N is a positive integer greater than 2;
taking the first target pixel displacement difference and the first target dual vector field as learned parameters to participate in training of each module in the total variation network;
obtaining an Nth target pixel displacement difference and an Nth target dual vector field which are output by an Nth module in the total variation network;
and acquiring a basic optical flow field comprising the N-th target pixel displacement difference and the N-th target dual vector field, wherein the N-th module is positioned at the tail part of the preset sequence.
Optionally, the extraction unit is specifically configured to:
converting the first brightness value and the second brightness value in the first module by using bilinear interpolation deformation and a convolution layer to obtain a first auxiliary variable;
and respectively carrying out convolution calculation on the first auxiliary variable and the first initial pixel displacement difference and the first initial dual vector field, and outputting the first target pixel displacement difference and the first target dual vector field through the first module.
Optionally, the extraction unit is specifically configured to:
mapping the images in the training samples through a homography matrix to obtain mapped images;
inputting the mapped image into the total variation network to obtain a distorted optical flow field output by the total variation network;
the extraction unit extracts a distorted optical flow field of any two adjacent frames of images including a third frame of image and a fourth frame of image from the mapped images, wherein the extraction process comprises the following steps:
acquiring a third brightness value and a fourth brightness value of a pixel point at the same position of the third frame image and the fourth frame image, and a second initial pixel displacement difference and a second initial dual vector field between the two frame images;
inputting the third luminance value, the fourth luminance value, the second initial pixel displacement difference and the second initial dual vector field into the first module, and outputting a first pixel displacement difference and a first dual vector field;
taking the first pixel displacement difference and the first dual vector field as learned parameters to participate in training of each module in the total variation network;
obtaining an Nth pixel displacement difference and an Nth dual vector field which are output by the Nth module;
And acquiring a distorted optical flow field comprising the Nth pixel displacement difference and the Nth dual vector field.
Optionally, the extraction unit is specifically configured to:
converting the third brightness value and the fourth brightness value in the first module by using bilinear interpolation deformation and a convolution layer to obtain a second auxiliary variable;
and respectively carrying out convolution calculation on the second auxiliary variable and the second initial pixel displacement difference and the second initial dual vector field, and outputting the first pixel displacement difference and the first dual vector field through the first module.
Optionally, the first obtaining unit is specifically configured to:
performing weighted calculation on the basic optical flow field and the distorted optical flow field, and taking the weighted calculation result as the input of time flow in the double-flow convolution network to obtain a first classification result of the training sample after the time flow is classified;
and taking the pixel information of each frame of image in the obtained training sample as the input of the space flow in the double-flow convolution network to obtain a second classification result of the training sample after the space flow is classified.
In a third aspect, an embodiment of the present invention further provides a computer apparatus, including: a processor and a memory, wherein the memory stores a computer program, and the processor is configured to read the program in the memory to perform the steps of the behavior recognition method according to the first aspect.
In a fourth aspect, embodiments of the present invention also provide a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the behavior recognition method according to the first aspect.
Drawings
FIG. 1 is a flow chart of a behavior recognition method according to an embodiment of the present invention;
fig. 2 is a flowchart of a method before step S101 in a behavior recognition method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a process for extracting a basic optical flow field of an adjacent frame image in a training sample by a total variation network in the behavior recognition method according to the embodiment of the present invention;
FIG. 4 is a diagram of a total variation network structure in a behavior recognition method according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a process of extracting a distorted optical flow field through a total variation network in step S102 in a behavior recognition method according to an embodiment of the present invention;
fig. 6 is a schematic diagram of an extraction process of a distorted optical flow field of two adjacent frames of images in a mapped image in a behavior recognition method according to an embodiment of the present invention;
fig. 7 is a flowchart of a method for step S602 in a behavior recognition method according to an embodiment of the present invention;
Fig. 8 is a flowchart of a method for step S103 in a behavior recognition method according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a behavior recognition model to be trained in a behavior recognition method according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a behavior recognition device according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The terms "first," "second," and the like in the description and in the claims and in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise," "include," and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to better understand the above technical solutions, the following detailed description of the technical solutions of the present invention is made by using the accompanying drawings and specific embodiments, and it should be understood that the specific features of the embodiments and the embodiments of the present invention are detailed descriptions of the technical solutions of the present invention, and not limiting the technical solutions of the present invention, and the embodiments and the technical features of the embodiments of the present invention may be combined with each other without conflict.
Referring to fig. 1, an embodiment of the present invention provides a behavior recognition method, which includes the following steps:
s101: inputting a training sample to a behavior recognition model to be trained, wherein the behavior model to be trained comprises a total variation network and a double-flow convolution network, the training sample comprises a plurality of source videos, and each source video in the plurality of source videos comprises a specific behavior;
in the embodiment of the invention, source videos of all behavior categories are firstly selected and used as videos to be classified to be input into a behavior recognition model to be trained, wherein the behavior model to be trained is a model which is built in advance by utilizing a total variation network and a double-flow convolution network. The specific behavior may be any type of behavior selected by those skilled in the art according to actual needs. Such as brushing teeth, washing the face, running, etc.
In the embodiment of the invention, the total variation network essentially imitates the iterative idea of the total variation L1-norm (TV-L1) algorithm, converting the iterative process into layer-to-layer transformations in a neural network. The total variation network comprises N modules arranged according to a preset sequence, wherein N is a positive integer greater than 2; for example, the total variation network comprises a first module, a second module, ..., and an N-th module, wherein the first module is positioned at the head of the preset sequence and the N-th module is positioned at the tail of the preset sequence.
In an embodiment of the invention, the dual stream convolutional network includes both time streams and spatial streams. Wherein, the time sequence characteristics of the video can be utilized to classify the video through the time flow. The spatial features of the video (such as pixel information of images in the video) can be utilized by the spatial stream to classify the video.
S102: extracting a basic optical flow field and a distorted optical flow field from the training sample through the total variation network;
in the embodiment of the invention, the basic optical flow field is specifically extracted from each frame of image of the source video, the distorted optical flow field is specifically extracted from each frame of image of the source video after the distortion transformation, and the basic optical flow field and the distorted optical flow field can represent the behavior records of different directions on the same target object.
S103: respectively inputting the basic optical flow field, the distorted optical flow field and pixel information of each frame of image included in each source video extracted from the training sample into the double-flow convolution network to obtain a classification result of time flow classification and space flow classification of the training sample;
in the embodiment of the invention, the basic optical flow field and the distorted optical flow field are cooperatively used for time sequence characteristics, specifically, the basic optical flow field and the distorted optical flow field are input into a time flow in a double-flow convolution network, and a classification result of classifying the time flow of a training sample is obtained. And inputting pixel information of each frame of image included in each source video extracted from the training samples into a spatial stream in the double-stream convolution network, and obtaining a classification result of spatially classifying the training samples.
S104: performing convolution calculation on the classification results of the time stream and the space stream to obtain a target classification result;
In the embodiment of the invention, the convolution calculation specifically weights the classification result of the time stream and the classification result of the space stream through a convolution kernel to obtain the target classification result. The whole process effectively suppresses the influence of background movement, improves the accuracy of model recognition, and makes it possible to recognize fast-moving behaviors.
S105: taking a behavior recognition model corresponding to the target classification result meeting a preset error range as a target behavior recognition model;
In the embodiment of the invention, the preset error range is determined according to the objective function of minimizing the classification error. In a specific implementation process, mini-batch stochastic gradient descent is selected as the optimization algorithm, and the minimized classification error is used as the objective function. The optimization algorithm is used to optimize the parameters of the convolution calculation so that the objective function is optimal; at that point the target classification result and the actual classification satisfy the preset error range. The target behavior recognition model is the trained behavior recognition model obtained when the target classification result meets the preset error range.
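A training loop consistent with this description is sketched below. It reuses the TwoStreamRecognizer sketch given earlier; PyTorch, the cross-entropy loss as a stand-in for the minimized classification error, and all hyper-parameters and tensor sizes are assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

model = TwoStreamRecognizer(num_classes=10)             # sketch model from the earlier listing
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # mini-batch SGD
criterion = nn.CrossEntropyLoss()                        # stands in for the minimized classification error

# dummy mini-batch for illustration: 4 clips, 8 frames each, 160 x 160 pixels
frames = torch.randn(4, 8, 3, 160, 160)
warped_frames = torch.randn(4, 8, 3, 160, 160)
labels = torch.randint(0, 10, (4,))

for step in range(10):
    optimizer.zero_grad()
    target_scores = model(frames, warped_frames)          # fused target classification result
    loss = criterion(target_scores, labels)
    loss.backward()
    optimizer.step()
# Training stops once the classification error falls within the preset error range;
# the weights at that point constitute the target behavior recognition model.
```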
S106: inputting the video to be identified into the target behavior identification model, and determining the behavior category included in the video to be identified.
In an embodiment of the present invention, referring to fig. 2, in order to improve the behavior recognition speed, before step S101, the method further includes:
s201: collecting the plurality of source videos;
s202: editing video clips which are matched with a preset behavior category and last for a preset duration from the plurality of source videos;
S203: segmenting the video segment according to a preset rule to obtain a multi-segment interval;
s204: sampling at the frame granularity in each section of interval, and converting into multi-frame images;
s205: the pixel size of each frame of image is adjusted to be a preset value;
s206: and taking each frame of adjusted image as the training sample.
In the implementation process, the specific implementation process of step S201 to step S206 is as follows:
First, a plurality of source videos are collected; the duration of each source video is 5s-15s, and each behavior category in the plurality of source videos includes action behaviors of multiple individuals and groups. Then, video clips which match a preset behavior category and last for a preset duration are clipped from the plurality of source videos, wherein the preset behavior category is a behavior selected by a person skilled in the art according to actual needs, such as closing a door, opening a window, cooking and the like, and the preset duration is a duration set by a person skilled in the art according to the actual usage habits of users, for example 2s, 3s and the like. That is, the plurality of source videos are clipped so that irrelevant actions (such as noise or behaviors unrelated to the motion of the subject) are eliminated and the main action is ensured to be continuous and stable for a certain period of time. Then, the clipped video clip is segmented according to a preset rule to obtain a plurality of intervals. The preset rule may be a rule set according to a specific duration; for example, when the clipped video clip lasts 3s, it is segmented at a granularity of 1s to obtain 3 intervals. Then, sampling is performed at frame granularity in each interval, and each interval is converted into a multi-frame image.
In a specific implementation process, in order to improve the speed of behavior recognition, after converting a video of each section of interval into a multi-frame image, the pixel size of each frame of image is adjusted to a preset value so as to unify the image size. In this way, when the behavior of the training sample and the test sample is identified, the consistency of the data sizes of the training sample and the test sample is ensured by adjusting the image sizes in advance. For example, the preset value is 160×160 pixels, and of course, those skilled in the art can set the preset value according to actual needs. Then, each frame of the adjusted image is used as a training sample.
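The clipping, segmentation and sampling described above can be sketched as follows. OpenCV and NumPy are assumed, the clip path is a placeholder, and taking one middle frame per interval is an illustrative choice (the description only requires sampling at frame granularity within each interval).

```python
import cv2
import numpy as np

def sample_training_frames(clip_path, num_segments=3, size=(160, 160)):
    """Split a clipped video into intervals, sample a frame per interval, unify the pixel size."""
    cap = cv2.VideoCapture(clip_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    # segment the clip into equal intervals (e.g. a 3 s clip into three 1 s intervals)
    intervals = np.array_split(np.arange(len(frames)), num_segments)
    sampled = [frames[idx[len(idx) // 2]] for idx in intervals if len(idx) > 0]

    # adjust the pixel size of every frame to a preset value (160 x 160 here)
    return [cv2.resize(f, size) for f in sampled]

training_images = sample_training_frames("closing_door_clip.mp4")  # placeholder clip path
```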
In the embodiment of the present invention, in order to achieve rapid extraction of a basic optical flow field and improve behavior recognition efficiency, extracting the basic optical flow field from the training sample through the total variation network in step S102 includes: inputting each frame of image in the training sample into the total variation network to obtain a basic optical flow field output by the total variation network;
in a specific implementation process, please refer to fig. 3, an extraction process of a basic optical flow field of any two adjacent frame images including the first frame image and the second frame image in the training sample through the total variation network is:
S301: acquiring a first brightness value and a second brightness value of a pixel point at the same position of the first frame image and the second frame image, and a first initial pixel displacement difference and a first initial dual vector field between the two frame images;
s302: inputting the first brightness value, the second brightness value, the first initial pixel displacement difference and the first initial dual vector field into a first module in the total variation network, and outputting a first target pixel displacement difference and a first target dual vector field, wherein the total variation network comprises N modules arranged according to a preset sequence, the first module is positioned at the head of the preset sequence, and N is a positive integer greater than 2;
s303: taking the first target pixel displacement difference and the first target dual vector field as learned parameters to participate in training of each module in the total variation network;
s304: obtaining an Nth target pixel displacement difference and an Nth target dual vector field which are output by an Nth module in the total variation network;
s305: and acquiring a basic optical flow field comprising the N-th target pixel displacement difference and the N-th target dual vector field, wherein the N-th module is positioned at the tail part of the preset sequence.
In the implementation process, the implementation process of step S301 to step S305 is as follows:
First, the first luminance value and the second luminance value of pixel points at the same position in the first frame image and the second frame image, as well as the first initial pixel displacement difference and the first initial dual vector field between the two frame images, are acquired. Let the first brightness value be I0, the second brightness value be I1, the first initial pixel displacement difference be u0 and the first initial dual vector field be p0; the structure of the total variation network corresponding to step S302 to step S304 is shown in fig. 4. Specifically, I0, I1, u0 and p0 are input into the first module in the total variation network, which outputs the first target pixel displacement difference u and the first target dual vector field p. The values u and p then participate, as learned parameters, in the training of each module.
In the specific implementation process, after I0, I1, u0 and p0 are input into the first module in the total variation network, the first module converts I0 and I1 by using bilinear interpolation deformation and a convolution layer to obtain the first auxiliary variable v. Convolution calculations are then performed on the first auxiliary variable v with u0 and with p0 respectively, and u and p are output by the first module.
In the first to Nth modules, the first auxiliary variable v obtained by each module is the same value. After the first module outputs u and p, they are input into the second module and convolved with the auxiliary variable v, and the second module outputs the target pixel displacement difference u1 and the dual vector field p1. The same processing is repeated until the Nth module finally outputs the Nth target pixel displacement difference uN and the Nth target dual vector field pN.
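A rough structural sketch of one unrolled module and of the chained network follows. PyTorch is assumed; grid_sample stands in for the bilinear interpolation deformation, and the convolution shapes and channel counts (including treating the dual vector field p as a 4-channel tensor) are illustrative assumptions, not the patent's actual parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TVModule(nn.Module):
    """One unrolled module: warp I1 by the current flow u, then refine (u, p) with convolutions."""
    def __init__(self):
        super().__init__()
        self.aux_conv = nn.Conv2d(2, 1, 3, padding=1)  # produces the auxiliary variable v
        self.u_conv = nn.Conv2d(3, 2, 3, padding=1)    # refines the pixel displacement difference u
        self.p_conv = nn.Conv2d(5, 4, 3, padding=1)    # refines the dual vector field p (4 channels assumed)

    def forward(self, i0, i1, u, p):
        b, _, h, w = i0.shape
        # bilinear interpolation deformation: warp i1 towards i0 using the current flow u
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        offset = torch.stack((u[:, 0] * 2 / max(w - 1, 1), u[:, 1] * 2 / max(h - 1, 1)), dim=-1)
        warped = F.grid_sample(i1, grid + offset, align_corners=True)
        v = self.aux_conv(torch.cat((i0, warped), dim=1))  # auxiliary variable v (same for all modules)
        u = self.u_conv(torch.cat((v, u), dim=1))          # updated pixel displacement difference
        p = self.p_conv(torch.cat((v, p), dim=1))          # updated dual vector field
        return u, p, v

class UnrolledTVNetwork(nn.Module):
    """N modules arranged in sequence; the basic and distorted flows reuse the same network."""
    def __init__(self, num_modules=5):
        super().__init__()
        self.stages = nn.ModuleList([TVModule() for _ in range(num_modules)])

    def forward(self, i0, i1):
        b, _, h, w = i0.shape
        u = torch.zeros(b, 2, h, w)   # first initial pixel displacement difference u0
        p = torch.zeros(b, 4, h, w)   # first initial dual vector field p0
        for stage in self.stages:
            u, p, _ = stage(i0, i1, u, p)
        return u, p                   # uN, pN: together they form the optical flow field

flow_u, flow_p = UnrolledTVNetwork()(torch.rand(1, 1, 160, 160), torch.rand(1, 1, 160, 160))
```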
In the embodiment of the invention, the source video with shorter duration is usually shot by a fixed camera, and the orientation of the camera may be changed in the actual recording process of the video, so that misjudgment is easy to occur in behavior recognition of the training sample. In order to improve the accuracy of behavior recognition, mapping transformation is performed on images in training samples by estimating a homography matrix. Referring to fig. 5, in step S102, a distorted optical flow field is extracted from the training sample through the total variation network, including:
s501: mapping the images in the training samples through a homography matrix to obtain mapped images;
s502: and inputting the mapped image into the total variation network to obtain a distorted optical flow field output by the total variation network.
In the implementation process, the specific implementation process of step S501 to step S502 is as follows:
First, the images in the training sample are mapped through the homography matrix to obtain the mapped images. That is, the original image frames in the training sample are warped by using the homography matrix to obtain distorted images, from which the distorted optical flow field is then acquired. Taking the coordinates of any pixel point of the fifth frame image in the training sample as (x1, y1, 1) for example, and the coordinates of that pixel point after mapping by the homography matrix as (x2, y2, 1), the mapping of the fifth frame image by the homography matrix is expressed as (x2, y2, 1)^T = H(x1, y1, 1)^T,
wherein the homography matrix H is a 3×3 matrix which may be expressed as H = [h11 h12 h13; h21 h22 h23; h31 h32 h33].
and then, inputting the mapped image into a total variation network to obtain a distorted optical flow field output by the total variation network.
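In code, warping each frame with a homography before flow extraction might look like the sketch below. OpenCV is assumed; the matrix values are arbitrary placeholders, and in practice H would be estimated (for example from matched feature points with cv2.findHomography), which the patent does not prescribe.

```python
import cv2
import numpy as np

frame = cv2.imread("frame_005.png")            # placeholder: the "fifth frame image"
h, w = frame.shape[:2]

# 3x3 homography H (placeholder values; normally estimated, e.g. with cv2.findHomography)
H = np.array([[1.0,    0.02,  5.0],
              [0.01,   1.0,  -3.0],
              [0.0001, 0.0,   1.0]], dtype=np.float64)

# every pixel (x1, y1, 1) is mapped to (x2, y2, 1) = H @ (x1, y1, 1) up to scale
warped = cv2.warpPerspective(frame, H, (w, h))
```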
In the embodiment of the present invention, please refer to fig. 6, the process of extracting the distorted optical flow field of any two adjacent frames of images including the third frame of image and the fourth frame of image in the mapped image is:
s601: acquiring a third brightness value and a fourth brightness value of a pixel point at the same position of the third frame image and the fourth frame image, and a second initial pixel displacement difference and a second initial dual vector field between the two frame images;
s602: inputting the third luminance value, the fourth luminance value, the second initial pixel displacement difference and the second initial dual vector field into the first module, and outputting a first pixel displacement difference and a first dual vector field;
S603: taking the first pixel displacement difference and the first dual vector field as learned parameters to participate in training of each module in the total variation network;
s604: obtaining an Nth pixel displacement difference and an Nth dual vector field which are output by the Nth module;
s605: and acquiring a distorted optical flow field comprising the Nth pixel displacement difference and the Nth dual vector field.
In the embodiment of the present invention, referring to fig. 7, the specific implementation process of step S602 includes:
s701: converting the third brightness value and the fourth brightness value in the first module by using bilinear interpolation deformation and a convolution layer to obtain a second auxiliary variable;
s702: and respectively carrying out convolution calculation on the second auxiliary variable and the second initial pixel displacement difference and the second initial dual vector field, and outputting the first pixel displacement difference and the first dual vector field through the first module.
In the implementation process, the basic optical flow field and the distorted optical flow field share the total variation network parameters, and the implementation process of steps S601 to S605 is the same as the process of steps S301 to S305, which will not be described again.
In the embodiment of the present invention, in order to improve accuracy of behavior recognition, after outputting the basic optical flow field and the distorted optical flow field through the total variation network, please refer to fig. 8, and the step S103 of obtaining the classification result of performing time flow classification and space flow classification on the training sample includes:
S801: performing weighted calculation on the basic optical flow field and the distorted optical flow field, and taking the weighted calculation result as the input of time flow in the double-flow convolution network to obtain a first classification result of the training sample after the time flow is classified;
s802: and taking the RGB information of each frame of image in the obtained training sample as the input of the space flow in the double-flow convolution network to obtain a second classification result of the training sample after the space flow is classified.
In the implementation process, the specific implementation process of step S801 to step S802 is as follows:
In the embodiment of the invention, the basic optical flow field and the distorted optical flow field are subjected to weighted calculation, and the result of the weighted calculation is used as the input of the time flow in the double-flow convolution network, so that the first classification result of the training sample after time-flow classification is obtained. For example, suppose user A is running on a playground, a tree appears in one frame of the video acquired by the camera, and the adjacent frame contains no tree; performing behavior recognition directly on such frames easily causes misjudgment, whereas taking the weighted combination of the acquired basic optical flow field and distorted optical flow field as the input of the time flow in the double-flow convolution network can effectively avoid such misjudgment. That is, by weighting the basic optical flow field and the distorted optical flow field, the influence of background movement is effectively suppressed, and the accuracy of behavior recognition is improved.
In the embodiment of the invention, RGB information of each frame of image in the acquired training sample is used as the input of the space flow in the double-flow convolution network, and a second classification result of the training sample after the space flow is classified is obtained.
In the specific implementation process, the time stream and the space stream can be trained through the pretrained VGG-16 convolutional network, so that the parameter learning process is reduced, the training speed is accelerated, and the recognition efficiency is improved. Of course, those skilled in the art can also train the time stream and the space stream using other convolution networks according to actual needs.
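A sketch of this arrangement is given below. torchvision's pretrained VGG-16 is assumed (older versions use pretrained=True instead of the weights argument); replacing the first convolution so the temporal stream accepts a 2-channel flow input, and the 0.5/0.5 fusion weights, are illustrative choices rather than values taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

def make_stream(in_channels, num_classes):
    """Build one VGG-16 based stream; adapt the input and output layers as needed."""
    net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)  # pretrained backbone
    if in_channels != 3:  # temporal stream: flow field (u, v) instead of RGB
        net.features[0] = nn.Conv2d(in_channels, 64, kernel_size=3, padding=1)
    net.classifier[6] = nn.Linear(4096, num_classes)
    return net

num_classes = 10                                      # illustrative
temporal_stream = make_stream(2, num_classes)          # input: weighted optical flow field
spatial_stream = make_stream(3, num_classes)           # input: RGB pixel information

basic_flow = torch.randn(4, 2, 160, 160)               # basic optical flow field (dummy)
warped_flow = torch.randn(4, 2, 160, 160)              # distorted optical flow field (dummy)
rgb = torch.randn(4, 3, 160, 160)                      # RGB frames (dummy)

flow_input = 0.5 * basic_flow + 0.5 * warped_flow      # weighted calculation of the two flow fields
first_result = temporal_stream(flow_input)             # time-stream classification result
second_result = spatial_stream(rgb)                    # space-stream classification result
```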
In the embodiment of the invention, after a first classification result of a training sample after time stream classification and a second classification result of the training sample after space stream classification are obtained, in order to improve the accuracy of behavior identification, convolution calculation is performed on the first classification result and the second classification result to obtain a target classification result, namely, the behavior category of the video to be identified is determined.
In the embodiment of the present invention, when the training samples include adjacent continuous frames RGB image 1, RGB image 2 and RGB image 3, the behavior recognition model to be trained provided in the embodiment of the present invention is shown in fig. 9, where 101 represents the training samples, 102 represents the total variation network, and 103 represents the double-flow convolution network. The specific process of behavior recognition by the model is already described in detail in the foregoing, and will not be described in detail here.
Based on the same inventive concept, please refer to fig. 10, an embodiment of the present invention further provides a behavior recognition apparatus, including:
an input unit 10, configured to input a training sample to a behavior recognition model to be trained, where the behavior model to be trained includes a total variation network and a double-flow convolution network, and the training sample includes a plurality of source videos, where each of the plurality of source videos includes a specific behavior;
an extraction unit 20 for extracting a basic optical flow field and a distorted optical flow field from the training sample through the total variation network;
a first obtaining unit 30, configured to input the pixel information of each frame of image included in the basic optical flow field, the distorted optical flow field, and each source video extracted from the training sample into the dual-stream convolutional network, respectively, to obtain a classification result of performing time stream classification and space stream classification on the training sample;
a second obtaining unit 40 for performing convolution calculation on the classification results of the time stream and the space stream to obtain a target classification result;
a determining unit 50, configured to use, as a target behavior recognition model, a behavior recognition model corresponding to the target classification result satisfying a preset error range;
The third obtaining unit 60 inputs the video to be identified into the target behavior identification model, and determines the behavior category included in the video to be identified.
In an embodiment of the present invention, the behavior recognition apparatus further includes: the processing unit is specifically used for:
collecting the plurality of source videos;
editing video clips which are matched with a preset behavior category and last for a preset duration from the plurality of source videos;
segmenting the video segment according to a preset rule to obtain a multi-segment interval;
sampling at the frame granularity in each section of interval, and converting into multi-frame images;
the pixel size of each frame of image is adjusted to be a preset value;
and taking each frame of adjusted image as the training sample.
In the embodiment of the present invention, the extraction unit 20 is specifically configured to:
inputting each frame of image in the training sample into the total variation network to obtain a basic optical flow field output by the total variation network;
the extraction unit 20 extracts, through the total variation network, a basic optical flow field of any two adjacent frame images including the first frame image and the second frame image in the training sample, where the basic optical flow field comprises:
acquiring a first brightness value and a second brightness value of a pixel point at the same position of the first frame image and the second frame image, and a first initial pixel displacement difference and a first initial dual vector field between the two frame images;
Inputting the first brightness value, the second brightness value, the first initial pixel displacement difference and the first initial dual vector field into a first module in the total variation network, and outputting a first target pixel displacement difference and a first target dual vector field, wherein the total variation network comprises N modules arranged according to a preset sequence, the first module is positioned at the head of the preset sequence, and N is a positive integer greater than 2;
taking the first target pixel displacement difference and the first target dual vector field as learned parameters to participate in training of each module in the total variation network;
obtaining an N target pixel displacement difference and an N target dual vector field which are output by an N module in the total variation network;
and acquiring a basic optical flow field comprising the N-th target pixel displacement difference and the N-th target dual vector field, wherein the N-th module is positioned at the tail part of the preset sequence.
In the embodiment of the present invention, the extraction unit 20 is specifically configured to:
converting the first brightness value and the second brightness value in the first module by using bilinear interpolation deformation and a convolution layer to obtain a first auxiliary variable;
and respectively carrying out convolution calculation on the first auxiliary variable and the first initial pixel displacement difference and the first initial dual vector field, and outputting the first target pixel displacement difference and the first target dual vector field through the first module.
In the embodiment of the present invention, the extraction unit 20 is specifically configured to:
mapping the images in the training samples through a homography matrix to obtain mapped images;
inputting the mapped image into the total variation network to obtain a distorted optical flow field output by the total variation network;
the extraction unit 20 extracts the distorted optical flow field of any two adjacent frames of images including the third frame of image and the fourth frame of image from the mapped image, where the extraction process is as follows:
acquiring a third brightness value and a fourth brightness value of a pixel point at the same position of the third frame image and the fourth frame image, and a second initial pixel displacement difference and a second initial dual vector field between the two frame images;
inputting the third brightness value, the fourth brightness value, the second initial pixel displacement difference and the second initial dual vector field into the first module, and outputting a first pixel displacement difference and a first dual vector field;
taking the first pixel displacement difference and the first dual vector field as learned parameters to participate in training of each module in the total variation network;
obtaining an Nth pixel displacement difference and an Nth dual vector field output by the Nth module;
and acquiring a distorted optical flow field comprising the Nth pixel displacement difference and the Nth dual vector field.
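The embodiment does not specify how the homography matrix is obtained. As one hedged illustration, it could be estimated from feature matches between adjacent frames and applied as a perspective warp, after which the mapped image is fed to the same total variation network to produce the distorted optical flow field:

```python
import cv2
import numpy as np


def warp_by_homography(prev_frame, next_frame):
    """Map next_frame onto prev_frame's plane via a homography (illustrative only).

    Estimating the homography from ORB feature matches with RANSAC is one common
    choice, not necessarily the one used in the embodiment.
    """
    orb = cv2.ORB_create()
    k1, d1 = orb.detectAndCompute(cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY), None)
    k2, d2 = orb.detectAndCompute(cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = prev_frame.shape[:2]
    # the mapped image is then fed to the total variation network for the distorted optical flow field
    return cv2.warpPerspective(next_frame, H, (w, h))
```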
In the embodiment of the present invention, the extraction unit 20 is specifically configured to:
converting the third brightness value and the fourth brightness value in the first module by using bilinear interpolation deformation and a convolution layer to obtain a second auxiliary variable;
and respectively carrying out convolution calculation on the second auxiliary variable and the second initial pixel displacement difference and the second initial dual vector field, and outputting the first pixel displacement difference and the first dual vector field through the first module.
In the embodiment of the present invention, the first obtaining unit 30 is specifically configured to:
performing weighted calculation on the basic optical flow field and the distorted optical flow field, and taking the weighted calculation result as the input of time flow in the double-flow convolution network to obtain a first classification result of the training sample after the time flow is classified;
and taking the RGB information of each frame of image in the obtained training sample as the input of the space flow in the double-flow convolution network to obtain a second classification result of the training sample after the space flow is classified.
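As a rough sketch of the double-flow classification described above; the backbone networks, the 0.6/0.4 weighting of the two optical flow fields and the 1x1 fusion convolution are placeholders rather than values taken from the embodiment:

```python
import torch
import torch.nn as nn


class TwoStreamHead(nn.Module):
    """Hypothetical sketch of the double-flow classification and fusion.

    temporal_net and spatial_net stand for arbitrary CNN backbones that each
    output a class-score vector; they are not defined by the embodiment.
    """
    def __init__(self, temporal_net, spatial_net):
        super().__init__()
        self.temporal_net = temporal_net
        self.spatial_net = spatial_net
        # convolution over the stacked stream scores yields the target classification result
        self.fuse = nn.Conv1d(2, 1, kernel_size=1)

    def forward(self, basic_flow, warped_flow, rgb_frames):
        # weighted combination of the basic and distorted optical flow fields (weights are placeholders)
        flow_input = 0.6 * basic_flow + 0.4 * warped_flow
        score_t = self.temporal_net(flow_input)           # first classification result (time flow)
        score_s = self.spatial_net(rgb_frames)            # second classification result (space flow)
        stacked = torch.stack([score_t, score_s], dim=1)  # (batch, 2, num_classes)
        return self.fuse(stacked).squeeze(1)              # target classification result
```

The fusion convolution here corresponds to the convolution calculation that combines the time-flow and space-flow classification results into the target classification result.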
Based on the same inventive concept, please refer to fig. 11, which is a schematic structural diagram of a computer device according to an embodiment of the present invention. The computer device includes: a processor 70, a memory 80, a transceiver 90 and a bus interface, wherein the memory 80 stores a computer program, and the processor 70 is configured to read the program in the memory 80 to perform the steps of the behavior recognition method according to the first aspect.
The processor 70 is responsible for managing the bus architecture and general processing, and the memory 80 may store data used by the processor 70 in performing operations. The transceiver 90 is used to receive and transmit data under the control of the processor 70.
The bus architecture may include any number of interconnected buses and bridges; specifically, one or more processors represented by the processor 70 and various circuits of memory represented by the memory 80 are linked together. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators and power management circuits, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface.
The flow disclosed in the embodiments of the present invention may be applied to the processor 70 or implemented by the processor 70. In implementation, the steps of the above flow may be accomplished by integrated logic circuitry in hardware or by software instructions in the processor 70. The processor 70 may be a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the invention. The general purpose processor may be a microprocessor or any conventional processor. The steps of the behavior recognition method disclosed in connection with the embodiments of the invention may be embodied directly as being performed by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in a random access memory, a flash memory, a read only memory, a programmable read only memory, an electrically erasable programmable memory, a register, or another storage medium well known in the art. The storage medium is located in the memory 80, and the processor 70 reads the information in the memory 80 and, in combination with its hardware, performs the steps of the above flow.
Specifically, the processor 70 is configured to read the program in the memory 80 and perform any of the steps of the behavior recognition method described above.
Based on the same technical idea, an embodiment of the present application also provides a readable storage medium having a computer program stored thereon. The computer program, when executed by a processor, performs any of the steps described in the behavior recognition method.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (9)

1. A method of behavior recognition, comprising:
inputting a training sample to a behavior recognition model to be trained, wherein the behavior recognition model to be trained comprises a total variation network and a double-flow convolution network, the training sample comprises a plurality of source videos, and each source video in the plurality of source videos comprises a specific behavior;
extracting a basic optical flow field and a distorted optical flow field from the training sample through the total variation network;
wherein extracting a basic optical flow field from the training sample through the total variation network comprises:
inputting each frame of image in the training sample into the total variation network to obtain a basic optical flow field output by the total variation network;
the extraction process of the basic optical flow field of any two adjacent frame images including the first frame image and the second frame image in the training sample through the total variation network is as follows:
acquiring a first brightness value and a second brightness value of a pixel point at the same position of the first frame image and the second frame image, and a first initial pixel displacement difference and a first initial dual vector field between the two frame images;
inputting the first brightness value, the second brightness value, the first initial pixel displacement difference and the first initial dual vector field into a first module in the total variation network, and outputting a first target pixel displacement difference and a first target dual vector field, wherein the total variation network comprises N modules arranged according to a preset sequence, the first module is positioned at the head of the preset sequence, and N is a positive integer greater than 2;
inputting the first target pixel displacement difference and the first target dual vector field into a second module as learned parameters, obtaining a second target pixel displacement difference and a second target dual vector field output by the second module, and repeating the same processing until the Nth module;
obtaining an Nth target pixel displacement difference and an Nth target dual vector field output by the Nth module in the total variation network;
acquiring a basic optical flow field comprising the Nth target pixel displacement difference and the Nth target dual vector field, wherein the Nth module is positioned at the tail of the preset sequence;
respectively inputting the basic optical flow field, the distorted optical flow field and pixel information of each frame of image included in each source video extracted from the training sample into the double-flow convolution network to obtain a classification result of time flow classification and space flow classification of the training sample;
performing convolution calculation on the classification results of the time stream and the space stream to obtain a target classification result;
taking a behavior recognition model corresponding to the target classification result meeting a preset error range as a target behavior recognition model;
inputting the video to be identified into the target behavior recognition model, and determining the behavior category included in the video to be identified.
2. The method of claim 1, wherein prior to the inputting training samples to the behavior recognition model to be trained, the method further comprises:
collecting the plurality of source videos;
editing, from the plurality of source videos, video clips which match a preset behavior category and last for a preset duration;
segmenting the video clips according to a preset rule to obtain a plurality of intervals;
sampling each interval at frame granularity and converting the samples into multiple frames of images;
adjusting the pixel size of each frame of image to a preset value;
and taking each adjusted frame of image as the training sample.
3. The method of claim 2, wherein said inputting the first brightness value, the second brightness value, the first initial pixel displacement difference and the first initial dual vector field into the first module in the total variation network and outputting the first target pixel displacement difference and the first target dual vector field comprises:
converting the first brightness value and the second brightness value in the first module by using bilinear interpolation deformation and a convolution layer to obtain a first auxiliary variable;
and respectively carrying out convolution calculation on the first auxiliary variable and the first initial pixel displacement difference and the first initial dual vector field, and outputting the first target pixel displacement difference and the first target dual vector field through the first module.
4. The method of claim 2, wherein the extracting the distorted optical flow field from the training sample through the total variation network comprises:
mapping the images in the training samples through a homography matrix to obtain mapped images;
inputting the mapped image into the total variation network to obtain a distorted optical flow field output by the total variation network;
the extraction process of the distorted optical flow field of any two adjacent frames of images including the third frame of image and the fourth frame of image in the mapped image comprises the following steps:
acquiring a third brightness value and a fourth brightness value of a pixel point at the same position of the third frame image and the fourth frame image, and a second initial pixel displacement difference and a second initial dual vector field between the two frame images;
inputting the third brightness value, the fourth brightness value, the second initial pixel displacement difference and the second initial dual vector field into the first module, and outputting a first pixel displacement difference and a first dual vector field;
inputting the first pixel displacement difference and the first dual vector field into the second module as learned parameters to obtain a second pixel displacement difference and a second dual vector field output by the second module, and repeating the same processing until the Nth module;
obtaining an Nth pixel displacement difference and an Nth dual vector field output by the Nth module;
and acquiring a distorted optical flow field comprising the Nth pixel displacement difference and the Nth dual vector field.
5. The method of claim 4, wherein said inputting the third brightness value, the fourth brightness value, the second initial pixel displacement difference and the second initial dual vector field into the first module and outputting a first pixel displacement difference and a first dual vector field comprises:
converting the third brightness value and the fourth brightness value in the first module by using bilinear interpolation deformation and a convolution layer to obtain a second auxiliary variable;
and respectively carrying out convolution calculation on the second auxiliary variable and the second initial pixel displacement difference and the second initial dual vector field, and outputting the first pixel displacement difference and the first dual vector field through the first module.
6. The method of claim 1, wherein the obtaining of the classification results of the temporal and spatial flow classification of the training samples comprises:
performing weighted calculation on the basic optical flow field and the distorted optical flow field, and taking the weighted calculation result as the input of time flow in the double-flow convolution network to obtain a first classification result of the training sample after the time flow is classified;
and taking the obtained pixel information of each frame of image in the training sample as the input of the space flow in the double-flow convolution network to obtain a second classification result of the training sample after the space flow is classified.
7. A behavior recognition apparatus, comprising:
an input unit, used for inputting a training sample to a behavior recognition model to be trained, wherein the behavior recognition model to be trained comprises a total variation network and a double-flow convolution network, the training sample comprises a plurality of source videos, and each source video in the plurality of source videos comprises a specific behavior;
the extraction unit is used for extracting a basic optical flow field and a distorted optical flow field from the training sample through the total variation network, wherein extracting the basic optical flow field from the training sample through the total variation network comprises the following steps: inputting each frame of image in the training sample into the total variation network to obtain a basic optical flow field output by the total variation network; the extraction process of the basic optical flow field of any two adjacent frames of images including the first frame image and the second frame image in the training sample through the total variation network is as follows: acquiring a first brightness value and a second brightness value of a pixel point at the same position of the first frame image and the second frame image, and a first initial pixel displacement difference and a first initial dual vector field between the two frame images; inputting the first brightness value, the second brightness value, the first initial pixel displacement difference and the first initial dual vector field into a first module in the total variation network, and outputting a first target pixel displacement difference and a first target dual vector field, wherein the total variation network comprises N modules arranged according to a preset sequence, the first module is positioned at the head of the preset sequence, and N is a positive integer greater than 2; inputting the first target pixel displacement difference and the first target dual vector field into a second module as learned parameters, obtaining a second target pixel displacement difference and a second target dual vector field output by the second module, and repeating the same processing until the Nth module; obtaining an Nth target pixel displacement difference and an Nth target dual vector field output by the Nth module in the total variation network; acquiring a basic optical flow field comprising the Nth target pixel displacement difference and the Nth target dual vector field, wherein the Nth module is positioned at the tail of the preset sequence;
The first obtaining unit is used for respectively inputting the basic optical flow field, the distorted optical flow field and pixel information of each frame of image included in each source video extracted from the training sample into the double-flow convolution network to obtain a classification result of time flow classification and space flow classification on the training sample;
the second obtaining unit is used for carrying out convolution calculation on the classification results of the time stream and the space stream to obtain a target classification result;
the determining unit is used for taking a behavior recognition model corresponding to the target classification result meeting a preset error range as a target behavior recognition model;
and the third obtaining unit is used for inputting the video to be identified into the target behavior recognition model and determining the behavior category included in the video to be identified.
8. A computer apparatus, comprising: a processor and a memory, wherein the memory stores a computer program, the processor being configured to read the program in the memory to perform the steps of the behavior recognition method of any one of claims 1 to 6.
9. A readable storage medium having stored thereon a computer program, which when executed by a processor implements the method of any of claims 1 to 6.
CN201910995333.5A 2019-10-18 2019-10-18 Behavior recognition method and device, computer device and readable storage medium Active CN111079507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910995333.5A CN111079507B (en) 2019-10-18 2019-10-18 Behavior recognition method and device, computer device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910995333.5A CN111079507B (en) 2019-10-18 2019-10-18 Behavior recognition method and device, computer device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111079507A CN111079507A (en) 2020-04-28
CN111079507B true CN111079507B (en) 2023-09-01

Family

ID=70310575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910995333.5A Active CN111079507B (en) 2019-10-18 2019-10-18 Behavior recognition method and device, computer device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111079507B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215130B (en) * 2020-10-10 2022-08-16 吉林大学 Human behavior identification method based on 2.5D/3D hybrid convolution model
CN112597824A (en) * 2020-12-07 2021-04-02 深延科技(北京)有限公司 Behavior recognition method and device, electronic equipment and storage medium
CN112926474A (en) * 2021-03-08 2021-06-08 商汤集团有限公司 Behavior recognition and feature extraction method, device, equipment and medium
CN113537165B (en) * 2021-09-15 2021-12-07 湖南信达通信息技术有限公司 Detection method and system for pedestrian alarm
CN114500879A (en) * 2022-02-09 2022-05-13 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8553943B2 (en) * 2011-06-14 2013-10-08 Qualcomm Incorporated Content-adaptive systems, methods and apparatus for determining optical flow

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107240131A (en) * 2017-06-09 2017-10-10 大连理工大学 A kind of breast image method for registering based on iteration texture Deformation Field
CN109118431A (en) * 2018-09-05 2019-01-01 武汉大学 A kind of video super-resolution method for reconstructing based on more memories and losses by mixture
CN109325430A (en) * 2018-09-11 2019-02-12 北京飞搜科技有限公司 Real-time Activity recognition method and system
CN109711380A (en) * 2019-01-03 2019-05-03 电子科技大学 A kind of timing behavior segment generation system and method based on global context information
CN109961507A (en) * 2019-03-22 2019-07-02 腾讯科技(深圳)有限公司 A kind of Face image synthesis method, apparatus, equipment and storage medium
CN110287820A (en) * 2019-06-06 2019-09-27 北京清微智能科技有限公司 Activity recognition method, apparatus, equipment and medium based on LRCN network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition; Limin Wang et al.; arXiv:1608.00859v1; 2016-08-02; Section 3, Section 4.1, Fig. 1 *

Also Published As

Publication number Publication date
CN111079507A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN111079507B (en) Behavior recognition method and device, computer device and readable storage medium
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
Baldwin et al. Time-ordered recent event (TORE) volumes for event cameras
CN112308200B (en) Searching method and device for neural network
CN109165573B (en) Method and device for extracting video feature vector
CN111291809B (en) Processing device, method and storage medium
Sun et al. Ess: Learning event-based semantic segmentation from still images
CN110717851A (en) Image processing method and device, neural network training method and storage medium
US20160019698A1 (en) Systems and methods for people counting in sequential images
CN111079764B (en) Low-illumination license plate image recognition method and device based on deep learning
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
CN112541459A (en) Crowd counting method and system based on multi-scale perception attention network
CN111382647B (en) Picture processing method, device, equipment and storage medium
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN114332911A (en) Head posture detection method and device and computer equipment
Kim et al. Event-guided deblurring of unknown exposure time videos
CN109492610A (en) A kind of pedestrian recognition methods, device and readable storage medium storing program for executing again
CN115018039A (en) Neural network distillation method, target detection method and device
Yuan et al. Optical flow training under limited label budget via active learning
Oñoro-Rubio et al. Learning short-cut connections for object counting
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
Zhang et al. Video extrapolation in space and time
CN113850135A (en) Dynamic gesture recognition method and system based on time shift frame

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant