CN116363761B - Behavior recognition method and device based on image and electronic equipment - Google Patents

Behavior recognition method and device based on image and electronic equipment

Info

Publication number
CN116363761B
Authority
CN
China
Prior art keywords
image
target
region
behavior
interest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310635823.0A
Other languages
Chinese (zh)
Other versions
CN116363761A
Inventor
周波
梁书玉
邹小刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Haiqing Zhiyuan Technology Co., Ltd.
Original Assignee
Shenzhen Haiqing Zhiyuan Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Haiqing Zhiyuan Technology Co., Ltd.
Priority to CN202310635823.0A
Publication of CN116363761A
Application granted
Publication of CN116363761B
Legal status: Active

Links

Classifications

    • G06V40/20 — Recognition of biometric, human-related or animal-related patterns in image or video data: movements or behaviour, e.g. gesture recognition
    • G06V10/25 — Image preprocessing: determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/764 — Recognition using pattern recognition or machine learning: classification, e.g. of video objects
    • G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V20/70 — Scenes; scene-specific elements: labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V2201/07 — Indexing scheme relating to image or video recognition or understanding: target detection
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an image-based behavior recognition method and device and an electronic device. Multiple frames of behavior images to be recognized are acquired and input into a behavior recognition model to obtain a behavior recognition result. Target detection is performed on the multiple frames of behavior images to be recognized to obtain annotation images carrying annotation categories and boundary frames, and key frame annotation images are determined from the annotation images. The multiple frames of behavior images to be recognized are extracted at equal intervals according to a preset fast-slow ratio to obtain fast and slow images to be classified, and image recognition is performed on them to generate a target feature image. Region-of-interest recognition is performed on the target feature image according to the key frame annotation images to determine a target region of interest; behavior recognition and classification are performed on the target region of interest according to the fast-channel and slow-channel regions of interest to obtain a class label; the annotation categories in the annotation images are replaced with the class label; and the replaced annotation images are combined to obtain the behavior recognition result.

Description

Behavior recognition method and device based on image and electronic equipment
Technical Field
The disclosure relates to the technical field of image processing, in particular to an image-based behavior recognition method and device and electronic equipment.
Background
Image recognition is applied ever more widely, for example face recognition at subway and high-speed rail station gates, or obstacle recognition while a vehicle is driving, where the behavior of a target object can be derived from the recognition result. Among image recognition techniques, the Improved Dense Trajectories (IDT) method mainly encodes the features of the target object with a Fisher Vector (FV), then trains a Support Vector Machine (SVM) classifier on the encoded feature vectors, so that behavior recognition is realized by the trained SVM classifier. However, the feature dimension produced by IDT is very high and the feature files are far larger than the original images, so image recognition or behavior recognition is relatively slow.
Behavior recognition based on skeleton key points recognizes the behavior of the target object from skeleton data. However, some environmental information strongly affects motion detection, so the recognition accuracy is poor. For behaviors such as smoking and eating, for example, recognition using skeleton data alone can hardly distinguish whether the target is smoking or eating, because the two behaviors produce consistent skeleton data.
Behavior recognition based on 3D convolution adds a time dimension to ordinary 2D convolution, so that spatio-temporal features are extracted simultaneously for classification. However, adding temporal information to 3D convolution generates a large number of parameters during convolution, and training takes a long time.
In behavior recognition based on a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) network, errors in the feature sequences produced by the convolutional neural network easily cause loss of temporal information, so the RNN or LSTM cannot extract effective temporal features and the accuracy of behavior recognition is low.
In behavior recognition based on a two-stream structure, two network channels extract spatial and temporal features respectively, and the two-stream structure requires optical-flow maps to be produced in advance from the video, so the computation is heavy, time-consuming and poor in real-time performance. Behavior recognition with a SlowFast network is still rather time-consuming, and because the detection information takes only the middle image frames of a period while the position of the target object changes across the preceding and following frames, the accuracy of the detection result is low.
Disclosure of Invention
The invention aims to provide an image-based behavior recognition method and device and an electronic device, so as to solve the technical problem of low accuracy in behavior recognition of a target object in the related art.
To achieve the above object, a first aspect of embodiments of the present disclosure provides an image-based behavior recognition method, the method including:
acquiring a multi-frame behavior image to be identified of a target object;
inputting the multi-frame behavior image to be recognized into a pre-trained behavior recognition model to obtain a behavior recognition result aiming at the target object, which is output by the pre-trained behavior recognition model;
the behavior recognition model comprises a target detection network and a target classification network; the target detection network performs target detection on the multi-frame behavior image to be recognized to obtain an annotation image carrying an annotation category and a boundary frame for each frame of behavior image to be recognized, and key frame annotation images are determined from the annotation images;
the target classification network extracts image frames from the multi-frame behavior image to be recognized at equal intervals according to a preset fast-slow ratio to obtain fast images to be classified and slow images to be classified, wherein in the preset fast-slow ratio the proportion of fast images to be classified is larger than that of slow images to be classified, and performs image recognition on the fast images to be classified and the slow images to be classified respectively to generate a target feature image;
and the behavior recognition model performs region-of-interest recognition on the target feature image according to the key frame annotation images, determines the target region of interest in the target feature image, performs behavior recognition and classification on the target region of interest according to the fast-channel region of interest in the fast images to be classified and the slow-channel region of interest in the slow images to be classified to obtain a class label of the multi-frame behavior image to be recognized, replaces the annotation categories in the annotation images according to the class label, and combines the replaced annotation images to obtain the behavior recognition result for the target object.
In a preferred embodiment, the behavior recognition model performing region-of-interest recognition on the target feature image according to the key frame annotation image and determining the target region of interest in the target feature image includes:
the behavior recognition model determines a target image area from the target characteristic image according to the key frame annotation image;
determining region corner mapping coordinates of each specified region corner in a preset mapping feature layer according to the coordinates of a plurality of specified region corners in the target image area and the step distance of the preset mapping feature layer relative to the target image area;
randomly selecting a preset number of sampling points from the target image area, and determining center point coordinates of the sampling points in the preset mapping feature layer according to the region corner mapping coordinates of the specified region corners;
determining the four feature points closest to each sampling point from the target image area, and determining sampling point mapping coordinates of the sampling points according to the coordinates of the four feature points in the preset mapping feature layer and the center point coordinates of the sampling points;
carrying out maximum pooling on the mapping coordinates of the sampling points to obtain a space positioning image of interest;
and identifying the object of interest on the space positioning image of interest, and determining a target region of interest in the target characteristic image.
In a preferred embodiment, the identifying an object of interest on the space positioning image of interest and determining the target region of interest in the target feature image includes:
determining a target feature region of interest from the space positioning image of interest, and placing a mask of a preset size centered on the target feature region of interest to obtain cascade feature regions of interest corresponding to the target feature region of interest;
calculating a super-parameter value of each cascade feature region of interest relative to the target feature region of interest according to preset super-parameter conditions of interest;
and determining the target region of interest in the target feature image according to the cascade feature region of interest with the maximum super-parameter value and the target feature region of interest.
In a preferred embodiment, the preset super-parameter conditions of interest include: the length of a shorter side of the cascade feature region of interest is greater than or equal to one third of the total length, along that side, of the cascade feature region of interest and the corresponding target feature region of interest, and the length of a longer side of the cascade feature region of interest is less than or equal to the total length, along that side, of the cascade feature region of interest and the corresponding target feature region of interest;
and the intersection-over-union of the cascade feature region of interest and the target feature region of interest is greater than or equal to a preset intersection-over-union threshold.
In a preferred embodiment, the plurality of specified region corner points are region corner points of a lower left corner and a lower right corner, and the preset number is 4.
In a preferred embodiment, the target detection network performing target detection on the multi-frame behavior image to be recognized to obtain the annotation category and the annotation image of the boundary frame corresponding to each frame of behavior image to be recognized includes:
determining a GT (ground truth) center point of each frame of behavior image to be recognized;
determining associated grids from the behavior image to be recognized by means of random translation and random scaling;
determining, from preset feature candidate frames, target feature candidate frames for the targets in the grid where the GT center point is located and in the associated grids of each frame of behavior image to be recognized;
and adding a boundary frame and an annotation category to the corresponding frame of behavior image to be recognized according to the target feature candidate frames, so as to obtain the annotation category and the annotation image of the boundary frame corresponding to each frame of behavior image to be recognized.
In a preferred embodiment, the determining a key frame annotation image from each of the annotation images includes:
determining a boundary area covered by the boundary frame in the annotation image;
determining a center point of the boundary area;
and determining the region image where the center point is located as the key frame annotation image of the annotation image.
In a preferred embodiment, the object detection network is a FastestDet network and the object classification network is a SlowFast network.
In a preferred embodiment, in the preset fast-slow ratio, the ratio of fast images to be classified to slow images to be classified is 8:1.
In a second aspect of embodiments of the present disclosure, there is provided an image-based behavior recognition apparatus, the apparatus including:
the acquisition module is configured to acquire a multi-frame behavior image to be identified of the target object;
the input module is configured to input the multi-frame behavior image to be recognized into a pre-trained behavior recognition model to obtain a behavior recognition result which is output by the pre-trained behavior recognition model and aims at the target object;
the behavior recognition model comprises a target detection network and a target classification network; the target detection network performs target detection on the multi-frame behavior image to be recognized to obtain an annotation image carrying an annotation category and a boundary frame for each frame of behavior image to be recognized, and key frame annotation images are determined from the annotation images;
the target classification network extracts image frames from the multi-frame behavior image to be recognized at equal intervals according to a preset fast-slow ratio to obtain fast images to be classified and slow images to be classified, wherein in the preset fast-slow ratio the proportion of fast images to be classified is larger than that of slow images to be classified, and performs image recognition on the fast images to be classified and the slow images to be classified respectively to generate a target feature image;
and the behavior recognition model performs region-of-interest recognition on the target feature image according to the key frame annotation images, determines the target region of interest in the target feature image, performs behavior recognition and classification on the target region of interest according to the fast-channel region of interest in the fast images to be classified and the slow-channel region of interest in the slow images to be classified to obtain a class label of the multi-frame behavior image to be recognized, replaces the annotation categories in the annotation images according to the class label, and combines the replaced annotation images to obtain the behavior recognition result for the target object.
In a preferred embodiment, the input module is configured to:
the behavior recognition model determines a target image area from the target characteristic image according to the key frame annotation image;
determining region corner mapping coordinates of each specified region corner in a preset mapping feature layer according to the coordinates of a plurality of specified region corners in the target image area and the step distance of the preset mapping feature layer relative to the target image area;
randomly selecting a preset number of sampling points from the target image area, and determining center point coordinates of the sampling points in the preset mapping feature layer according to the region corner mapping coordinates of the specified region corners;
determining the four feature points closest to each sampling point from the target image area, and determining sampling point mapping coordinates of the sampling points according to the coordinates of the four feature points in the preset mapping feature layer and the center point coordinates of the sampling points;
carrying out maximum pooling on the mapping coordinates of the sampling points to obtain a space positioning image of interest;
and identifying the object of interest on the space positioning image of interest, and determining a target region of interest in the target characteristic image.
In a preferred embodiment, the input module is configured to:
determining a target feature region of interest from the space positioning image of interest, and placing a mask of a preset size centered on the target feature region of interest to obtain cascade feature regions of interest corresponding to the target feature region of interest;
calculating a super-parameter value of each cascade feature region of interest relative to the target feature region of interest according to preset super-parameter conditions of interest;
and determining the target region of interest in the target feature image according to the cascade feature region of interest with the maximum super-parameter value and the target feature region of interest.
In a preferred embodiment, the preset super-parameter conditions of interest include: the length of a shorter side of the cascade feature region of interest is greater than or equal to one third of the total length, along that side, of the cascade feature region of interest and the corresponding target feature region of interest, and the length of a longer side of the cascade feature region of interest is less than or equal to the total length, along that side, of the cascade feature region of interest and the corresponding target feature region of interest;
and the intersection-over-union of the cascade feature region of interest and the target feature region of interest is greater than or equal to a preset intersection-over-union threshold.
In a preferred embodiment, the plurality of specified region corner points are region corner points of a lower left corner and a lower right corner, and the preset number is 4.
In a preferred embodiment, the input module is configured to:
determining a GT center point of each frame of behavior image to be recognized;
determining associated grids from the behavior image to be recognized by means of random translation and random scaling;
determining, from preset feature candidate frames, target feature candidate frames for the targets in the grid where the GT center point is located and in the associated grids of each frame of behavior image to be recognized;
and adding a boundary frame and an annotation category to the corresponding frame of behavior image to be recognized according to the target feature candidate frames, so as to obtain the annotation category and the annotation image of the boundary frame corresponding to each frame of behavior image to be recognized.
In a preferred embodiment, the input module is configured to:
determining a boundary area covered by the boundary frame in the annotation image;
determining a center point of the boundary area;
and determining the region image where the center point is located as the key frame annotation image of the annotation image.
In a preferred embodiment, the object detection network is a FastestDet network and the object classification network is a SlowFast network.
In a preferred embodiment, in the preset fast-slow ratio, the ratio of fast images to be classified to slow images to be classified is 8:1.
In a third aspect of the embodiments of the present disclosure, there is provided an electronic device, including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any of the first aspects.
The beneficial effects are that:
the invention provides an image-based behavior recognition method and device and electronic equipment. Compared with the prior art, the method has the following beneficial effects:
The target detection network performs target detection on the multi-frame behavior image to be recognized to obtain the annotation category and the annotation image of the boundary frame corresponding to each frame of behavior image to be recognized, and determines key frame annotation images from the annotation images, which speeds up target detection and improves the effect of behavior recognition. The target classification network extracts image frames from the multi-frame behavior image to be recognized at equal intervals according to a preset fast-slow ratio to obtain fast images to be classified and slow images to be classified, where the proportion of fast images to be classified in the preset fast-slow ratio is larger than that of slow images to be classified, and performs image recognition on the fast images to be classified and the slow images to be classified respectively to generate a target feature image. The behavior recognition model performs region-of-interest recognition on the target feature image according to the key frame annotation images, determines the target region of interest in the target feature image, performs behavior recognition and classification on the target region of interest according to the fast-channel region of interest in the fast images to be classified and the slow-channel region of interest in the slow images to be classified to obtain a class label of the multi-frame behavior image to be recognized, replaces the annotation categories in the annotation images with the class label, and combines the replaced annotation images to obtain the behavior recognition result for the target object. Context information near the region of interest is thus extracted while more accurate spatial positioning information is obtained, which alleviates the problem of the person's position changing across the preceding and following frames and makes the network's recognition of behavior more accurate.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate the disclosure and together with the description serve to explain, but do not limit the disclosure. In the drawings:
fig. 1 is a flow chart illustrating a method of image-based behavior recognition according to an embodiment of the present disclosure.
Fig. 2 is a flow chart illustrating a method of determining a target region of interest according to an embodiment of the specification.
Fig. 3 is a schematic diagram illustrating a determination of a spatially localized image of interest according to an embodiment of the present disclosure.
Fig. 4 is a flowchart for implementing S26 in fig. 2, according to an embodiment of the disclosure.
Fig. 5 is a block diagram of an image-based behavior recognition apparatus according to an embodiment of the present disclosure.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the disclosure, are not intended to limit the disclosure.
To improve the accuracy and speed of behavior recognition for a target object, the present disclosure provides an image-based behavior recognition method. Fig. 1 is a flowchart of an image-based behavior recognition method according to an embodiment; referring to fig. 1, the method includes:
S11, acquiring a multi-frame behavior image to be identified of the target object.
In the embodiment of the disclosure, behavior images to be recognized are stored as they are received and grouped quantitatively; when the number of stored image frames reaches a preset number, one group of multi-frame behavior images to be recognized is obtained, for example as sketched below.
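The grouping step can be pictured with a small sketch. This is an illustrative assumption, not code from the patent; the group size of 27 merely matches the worked example given later in this description, and `FrameGrouper` is a hypothetical name.

```python
# Hypothetical sketch of quantitative grouping of received behavior images;
# the group size of 27 is an assumed value matching the later example.
class FrameGrouper:
    def __init__(self, group_size: int = 27):
        self.group_size = group_size
        self.buffer = []

    def push(self, frame):
        """Store one received behavior image; emit a full group when the
        number of stored frames reaches the preset number."""
        self.buffer.append(frame)
        if len(self.buffer) == self.group_size:
            group, self.buffer = self.buffer, []
            return group  # one multi-frame behavior image to be recognized
        return None
```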
S12, inputting the multi-frame behavior image to be recognized into a pre-trained behavior recognition model to obtain a behavior recognition result aiming at the target object, which is output by the pre-trained behavior recognition model.
The behavior recognition model comprises a target detection network and a target classification network. The target detection network performs target detection on the multi-frame behavior image to be recognized to obtain an annotation image carrying an annotation category and a boundary frame for each frame of behavior image to be recognized, and key frame annotation images are determined from the annotation images.
The target classification network extracts image frames from the multi-frame behavior image to be recognized at equal intervals according to a preset fast-slow ratio to obtain fast images to be classified and slow images to be classified, where the proportion of fast images to be classified in the preset fast-slow ratio is larger than that of slow images to be classified. Image recognition is performed on the fast images to be classified and the slow images to be classified respectively to generate a target feature image.
The behavior recognition model performs region-of-interest recognition on the target feature image according to the key frame annotation images, determines the target region of interest in the target feature image, performs behavior recognition and classification on the target region of interest according to the fast-channel region of interest in the fast images to be classified and the slow-channel region of interest in the slow images to be classified to obtain a class label of the multi-frame behavior image to be recognized, replaces the annotation categories in the annotation images according to the class label, and combines the replaced annotation images to obtain the behavior recognition result for the target object.
In the embodiment of the disclosure, the behavior recognition result is converted into video output.
According to the above technical scheme, the target detection network performs target detection on the multi-frame behavior image to be recognized to obtain the annotation category and the annotation image of the boundary frame corresponding to each frame of behavior image to be recognized, and determines key frame annotation images from the annotation images, which speeds up target detection and improves the effect of behavior recognition. The target classification network extracts image frames at equal intervals according to the preset fast-slow ratio to obtain fast images to be classified and slow images to be classified, where the proportion of fast images to be classified is larger than that of slow images to be classified, and performs image recognition on them respectively to generate a target feature image. The behavior recognition model performs region-of-interest recognition on the target feature image according to the key frame annotation images, determines the target region of interest, performs behavior recognition and classification on it according to the fast-channel and slow-channel regions of interest to obtain a class label, replaces the annotation categories in the annotation images with the class label, and combines the replaced annotation images into the behavior recognition result for the target object. Context information near the region of interest is thus extracted while more accurate spatial positioning information is obtained, which alleviates the problem of the person's position changing across the preceding and following frames and makes the network's recognition of behavior more accurate. A minimal sketch of this flow is given below.
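As an orientation aid, the overall flow of S11-S12 can be condensed into a few lines. This is a minimal sketch under assumed interfaces: `detect`, `pick_keyframe`, `split_fast_slow`, `classify` and `merge` are hypothetical stand-ins for the patent's networks and post-processing, passed in as callables rather than defined here.

```python
# Minimal pipeline sketch; every callable below is a hypothetical stand-in,
# not an API defined by the patent. `detect` is assumed to return
# (image, boundary_frame, annotation_category) triples.
def recognize_behavior(frames, detect, pick_keyframe, split_fast_slow,
                       classify, merge):
    labeled = [detect(f) for f in frames]   # annotation category + boundary frame per frame
    keyframe = pick_keyframe(labeled)       # key frame annotation image
    fast, slow = split_fast_slow(frames)    # equidistant extraction at the preset fast-slow ratio
    label = classify(fast, slow, keyframe)  # class label from the regions of interest
    relabeled = [(img, box, label) for (img, box, _) in labeled]  # replace annotation categories
    return merge(relabeled)                 # combined behavior recognition result
```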
In a preferred embodiment, referring to fig. 2, the behavior recognition model performing region-of-interest recognition on the target feature image according to the key frame annotation image and determining the target region of interest in the target feature image includes:
S21, the behavior recognition model determines a target image area from the target feature image according to the key frame annotation image.
S22, region corner mapping coordinates of each specified region corner in a preset mapping feature layer are determined according to the coordinates of a plurality of specified region corners in the target image area and the step distance (stride) of the preset mapping feature layer relative to the target image area.
S23, a preset number of sampling points are randomly selected from the target image area, and center point coordinates of the sampling points in the preset mapping feature layer are determined according to the region corner mapping coordinates of the specified region corners.
S24, the four feature points closest to each sampling point are determined from the target image area, and sampling point mapping coordinates of the sampling points are determined according to the coordinates of the four feature points in the preset mapping feature layer and the center point coordinates of the sampling points.
And S25, carrying out maximum pooling on the sampling point mapping coordinates of each sampling point to obtain a space positioning image of interest.
S26, identifying the object of interest on the space positioning image of interest, and determining a target region of interest in the target feature image.
In the embodiment of the disclosure, each target image area keeps floating-point boundaries without quantization; the target image area is divided into 2×2 units, and the boundaries of each unit are likewise not quantized. Four fixed sampling points are determined in each unit, the coordinates of the four sampling points in the preset mapping feature layer are calculated by bilinear interpolation, and a maximum pooling operation is performed on the coordinates of the four sampling points to obtain the space positioning image of interest.
Referring to fig. 3, suppose for example that a target image area obtained by an RPN lies on the target feature image, and that the specified region corners are the upper-left and lower-right corners, where the coordinates of the upper-left corner on the target feature image are (10, 10), the coordinates of the lower-right corner are (124, 124), the step distance of the preset mapping feature layer relative to the target image area is 32, and the output expected by RoIAlign is 2×2. The calculation proceeds as follows: 10/32 = 0.3125 and 124/32 = 3.875, so the mapping coordinates of the upper-left region corner in the preset mapping feature layer are (0.3125, 0.3125) and the mapping coordinates of the lower-right region corner are (3.875, 3.875); the center point coordinates of each sampling point in the target image area are then calculated. The four feature points nearest to a sampling point in the target image area are determined, and the sampling point mapping coordinates of that sampling point are obtained from them by bilinear interpolation. One sampling point is taken here as an example for explanation; 4 sampling points are generally selected when the RoIAlign operation is performed, the mapping coordinates of the sampling points are obtained by calculation, the mapping coordinates of the sampling points are then max-pooled, and the 2×2 units are computed to obtain the space positioning image of interest, so that more accurate spatial positioning information can be obtained and the effect of behavior recognition improved. A simplified numeric sketch follows.
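The corner mapping and the bilinear interpolation over the four nearest feature points can be written out directly. The sketch below reuses the example's stride of 32 and is a simplified, assumed rendering of the RoIAlign sampling step, not the patent's exact implementation.

```python
import math

def map_point(x, y, stride=32):
    """Map coordinates from the target image area onto the preset mapping
    feature layer without quantization: (10, 10) -> (0.3125, 0.3125),
    (124, 124) -> (3.875, 3.875)."""
    return x / stride, y / stride

def bilinear_sample(feature, x, y):
    """Interpolate a 2-D feature grid at float coordinates (x, y) from the
    four nearest feature points, as RoIAlign does for each sampling point."""
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feature[y0][x0]
            + wx * (1 - wy) * feature[y0][x1]
            + (1 - wx) * wy * feature[y1][x0]
            + wx * wy * feature[y1][x1])

# Max pooling over the sampled values of each 2x2 unit then yields one
# output value per unit, giving the space positioning image of interest.
```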
In a preferred embodiment, referring to fig. 4, in S26, the identifying an object of interest on the space positioning image of interest and determining the target region of interest in the target feature image includes:
S261, determining a target feature region of interest from the space positioning image of interest, and placing a mask of a preset size centered on the target feature region of interest to obtain cascade feature regions of interest corresponding to the target feature region of interest.
S262, calculating a super-parameter value of each cascade feature region of interest relative to the target feature region of interest according to preset super-parameter conditions of interest.
S263, determining the target region of interest in the target feature image according to the cascade feature region of interest with the maximum super-parameter value and the target feature region of interest.
In the embodiment of the present disclosure, a mask with a preset size of 3×3 is placed centered on the target feature region of interest. Rc denotes the target feature region of interest in the 3×3 grid of cells, and the 8 cascade feature regions of interest excavated in the grid cells are respectively: the left, upper-left, lower-left, right, upper-right, lower-right, upper and lower cascade feature regions of interest. The target RoI is thus described by a cascaded RoI map of size 9D×ph×pw.
Further, a candidate RoI set Ω = {b0, …, bn} is defined, where b0 represents an anchor RoI located at the center of the target feature region of interest Rc of the grid cell and having half the height and half the width of the cell. All cascade feature regions of interest bi meeting the preset super-parameter conditions of interest are then enumerated.
Further, to select the most influential one, a fully connected layer with a single output is introduced, which calculates the score of each candidate context RoI; the highest-scoring cascade feature region of interest is selected together with the target feature region of interest, and the target region of interest in the target feature image is determined, i.e. each surrounding cell is max-pooled over the whole candidate set. The fully connected layer is shared among all 8 cascade feature regions of interest. In this way, context information near the region of interest is mined while more accurate spatial positioning information is obtained, alleviating the problem of the person's position changing across the preceding and following frames, so that the network recognizes behaviors more accurately.
In a preferred embodiment, the preset super-parameter conditions of interest include: the length of a shorter side of the cascade feature region of interest is greater than or equal to one third of the total length, along that side, of the cascade feature region of interest and the corresponding target feature region of interest, and the length of a longer side of the cascade feature region of interest is less than or equal to the total length, along that side, of the cascade feature region of interest and the corresponding target feature region of interest, so that the cascade feature region of interest is neither too small nor too large;
and the intersection-over-union of the cascade feature region of interest and the target feature region of interest is greater than or equal to a preset intersection-over-union threshold, so that the cascade feature region of interest is not placed too far from the center of the grid cell. A sketch of one reading of these conditions, together with the candidate scoring, follows.
In a preferred embodiment, the plurality of specified region corner points are region corner points of a lower left corner and a lower right corner, and the preset number is 4.
In a preferred embodiment, the target detection network performing target detection on the multi-frame behavior image to be recognized to obtain the annotation category and the annotation image of the boundary frame corresponding to each frame of behavior image to be recognized includes the following steps.
The GT center point of each frame of behavior image to be recognized is determined.
Associated grids are determined from the behavior image to be recognized by means of random translation and random scaling.
Target feature candidate frames for the targets in the grid where the GT center point is located and in the associated grids of each frame are determined from preset feature candidate frames.
A boundary frame and an annotation category are added to the corresponding frame of behavior image to be recognized according to the target feature candidate frames, giving the annotation category and the annotation image of the boundary frame corresponding to each frame of behavior image to be recognized; a sketch of the cross-grid candidate selection is given below.
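As an illustration of the cross-grid idea (the grid holding the GT center point plus nearby grids as candidates, described again for FastestDet further below), a hedged sketch; the "three nearby grids" figure is taken from that later description, and the cell-selection rule here is an assumption.

```python
# Assumed sketch: locate the grid cell containing the GT center point and
# its three nearest neighbouring cells as candidate targets.
def candidate_cells(cx, cy, cell, grid_w, grid_h):
    gx, gy = int(cx // cell), int(cy // cell)
    fx, fy = cx / cell - gx, cy / cell - gy
    dx = 1 if fx > 0.5 else -1        # nearest horizontal neighbour
    dy = 1 if fy > 0.5 else -1        # nearest vertical neighbour
    cells = [(gx, gy), (gx + dx, gy), (gx, gy + dy), (gx + dx, gy + dy)]
    return [(x, y) for x, y in cells
            if 0 <= x < grid_w and 0 <= y < grid_h]  # clamp to the grid
```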
In a preferred embodiment, the determining a key frame annotation image from each of the annotation images includes:
and determining a boundary area covered by the boundary box in the marked image.
A center point of the border region is determined.
And determining the region image where the center point is located as a key frame annotation image of the annotation image.
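A minimal sketch of this selection follows, assuming (x1, y1, x2, y2) boundary frames. Taking the union of the per-frame boundary areas and returning the frame whose boundary-frame center lies nearest the union's center is one plausible, assumed realization of "the region image where the center point is located", matching the `pick_keyframe` helper named earlier.

```python
def pick_keyframe(annotated):
    """annotated: list of (image, (x1, y1, x2, y2)) pairs.
    Hypothetical realization: the boundary area covered across frames is the
    union of the boxes; the key frame annotation image is the frame whose
    box center lies closest to the center of that area."""
    x1 = min(b[0] for _, b in annotated)
    y1 = min(b[1] for _, b in annotated)
    x2 = max(b[2] for _, b in annotated)
    y2 = max(b[3] for _, b in annotated)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0

    def dist2(box):
        bx, by = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
        return (bx - cx) ** 2 + (by - cy) ** 2

    return min(annotated, key=lambda ib: dist2(ib[1]))[0]
```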
In a preferred embodiment, the object detection network is a FastestDet network and the object classification network is a SlowFast network.
A single lightweight detection head is adopted; in the network structure, a parallel structure of 5×5 grouped convolutions is used, which is expected to fuse features from different receptive fields so that the single detection head suits targets of different scales. An original anchor-based algorithm needs to perform an anchor-bias operation on the data set when training the model; anchor-bias can be understood as clustering the widths and heights of the annotated objects in the data set to obtain a set of prior widths and heights, on the basis of which the network optimizes the width and height of the prediction frame. FastestDet adopts an anchor-free algorithm: the model directly regresses the scale values of the GT in the width and height of the feature map without needing prior widths and heights, which simplifies the post-processing of the model. Moreover, in an anchor-based algorithm each feature point of a feature map corresponds to N anchor candidate frames, whereas in an anchor-free algorithm each feature point corresponds to only one candidate frame, which is also an advantage in inference speed. Cross-grid multi-candidate targets are adopted: not only the grid where the GT center point is located is taken as a candidate target, but three nearby grids are also brought into the calculation, increasing the number of positive-sample candidate frames. Dynamic positive and negative sample allocation is also used: the mean of the intersection-over-union values calculated between the prediction frames and the GT is set as the threshold for allocating positive and negative samples, and a prediction frame whose intersection-over-union with the GT exceeds the mean is a positive sample, and vice versa; a sketch of this rule is given below. Data enhancement is kept simple, with only random translation and random scaling. The FastestDet network therefore has an extremely high detection speed and can rapidly provide detection information for the SlowFast network.
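The dynamic allocation rule lends itself to a short sketch, reusing the `iou` helper from the context-RoI sketch above; the rule itself, mean IoU as the split threshold, is as stated, while the data layout is assumed.

```python
def assign_samples(pred_boxes, gt_box):
    """Dynamic positive/negative allocation: the mean IoU between the
    prediction frames and the GT is the threshold; predictions above the
    mean become positive samples, the rest negative."""
    ious = [iou(p, gt_box) for p in pred_boxes]
    threshold = sum(ious) / len(ious)
    return [v > threshold for v in ious]  # True = positive sample
```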
Further, the slow path can capture spatial semantic information; this path runs at a low frame rate and a slow refresh rate. The fast path can capture rapidly changing motion, operating at a fast refresh rate and high temporal resolution, and accounts for only about 20% of the total computation. This is because this path has fewer channels and a weaker capacity to process spatial information, since spatial information can be provided by the slow path. The fast and slow paths are fused by lateral connections. Because of its light weight, the fast path needs no temporal pooling operations and can run at a high frame rate in all intermediate layers, maintaining temporal accuracy. The slow path, with its slower temporal rate, focuses more on spatial semantics. By processing the original video at different temporal rates, each path acquires its own unique function in processing the image video: the fast path grasps temporal information and the slow path grasps spatial information.
Moreover, the fast path of the SlowFast network is lighter, and the SlowFast network needs no optical-flow computation, so the SlowFast network model is learned end to end from the raw data. However, the SlowFast network needs detection information, so the overall recognition speed is tied to the detection network placed before it; combining the FastestDet network with the SlowFast network and abandoning the originally time-consuming Faster-RCNN network can therefore raise the overall recognition speed. A structural sketch of the two-pathway idea follows.
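The sketch below assumes PyTorch and illustrative sizes: 64 slow channels, 8 fast channels and an 8:1 temporal ratio, with a time-strided convolution as the lateral connection, which is one common choice. None of these sizes are prescribed by the patent; this illustrates the SlowFast design, not the patent's exact network.

```python
import torch
import torch.nn as nn

class TwoPathway(nn.Module):
    """Illustrative SlowFast-style stem: slow path (few frames, many
    channels) and fast path (many frames, few channels), fused by a
    lateral connection; all sizes are assumptions for illustration."""
    def __init__(self, slow_ch=64, fast_ch=8):
        super().__init__()
        self.slow = nn.Conv3d(3, slow_ch, (1, 7, 7), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, fast_ch, (5, 7, 7), padding=(2, 3, 3))
        # time-strided conv aligns T fast frames with T/8 slow frames
        self.lateral = nn.Conv3d(fast_ch, 2 * fast_ch, (5, 1, 1),
                                 stride=(8, 1, 1), padding=(2, 0, 0))

    def forward(self, slow_clip, fast_clip):
        s = self.slow(slow_clip)                    # (N, 64, T/8, H, W)
        f = self.fast(fast_clip)                    # (N, 8,  T,   H, W)
        s = torch.cat([s, self.lateral(f)], dim=1)  # lateral fusion
        return s, f
```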
In a preferred embodiment, in the preset fast-slow ratio, the ratio of fast images to be classified to slow images to be classified is 8:1.
For example, taking 27 frames of behavior images to be recognized of a target object as an example: first 8 frames are extracted as fast images to be classified, then 1 frame as a slow image to be classified, then another 8 frames as fast images to be classified, then 1 frame as a slow image to be classified, and so on, so that frames 1 to 8, 10 to 17 and 19 to 26 are taken as fast images to be classified and frames 9, 18 and 27 as slow images to be classified, as the snippet below reproduces.
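The indexing of that example can be checked with a short snippet; `split_fast_slow` is the same hypothetical helper named in the pipeline sketch earlier, and only its behavior on this worked example is taken from the text.

```python
def split_fast_slow(frames, fast=8, slow=1):
    """Equidistant extraction at the preset fast-slow ratio: each period of
    fast + slow frames contributes its first `fast` frames to the fast
    images to be classified and the remainder to the slow images."""
    period = fast + slow
    fast_imgs = [f for i, f in enumerate(frames) if i % period < fast]
    slow_imgs = [f for i, f in enumerate(frames) if i % period >= fast]
    return fast_imgs, slow_imgs

frames = list(range(1, 28))             # frames numbered 1..27
fast_imgs, slow_imgs = split_fast_slow(frames)
assert slow_imgs == [9, 18, 27]         # matches the example above
```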
The embodiment of the present disclosure further provides an image-based behavior recognition apparatus, referring to fig. 5, the image-based behavior recognition apparatus 500 includes:
an obtaining module 510, configured to obtain a multi-frame behavior image to be identified of the target object;
the input module 520 is configured to input the multi-frame behavior image to be identified into a pre-trained behavior identification model, and obtain a behavior identification result for the target object output by the pre-trained behavior identification model;
the behavior recognition model comprises a target detection network and a target classification network; the target detection network performs target detection on the multi-frame behavior image to be recognized to obtain an annotation image carrying an annotation category and a boundary frame for each frame of behavior image to be recognized, and key frame annotation images are determined from the annotation images;
the target classification network extracts image frames from the multi-frame behavior image to be recognized at equal intervals according to a preset fast-slow ratio to obtain fast images to be classified and slow images to be classified, wherein in the preset fast-slow ratio the proportion of fast images to be classified is larger than that of slow images to be classified, and performs image recognition on the fast images to be classified and the slow images to be classified respectively to generate a target feature image;
and the behavior recognition model performs region-of-interest recognition on the target feature image according to the key frame annotation images, determines the target region of interest in the target feature image, performs behavior recognition and classification on the target region of interest according to the fast-channel region of interest in the fast images to be classified and the slow-channel region of interest in the slow images to be classified to obtain a class label of the multi-frame behavior image to be recognized, replaces the annotation categories in the annotation images according to the class label, and combines the replaced annotation images to obtain the behavior recognition result for the target object.
In a preferred embodiment, the input module 520 is configured to:
The behavior recognition model determines a target image area from the target characteristic image according to the key frame annotation image;
determining region corner mapping coordinates of each specified region corner in a preset mapping feature layer according to the coordinates of a plurality of specified region corners in the target image area and the step distance of the preset mapping feature layer relative to the target image area;
randomly selecting a preset number of sampling points from the target image area, and determining center point coordinates of the sampling points in the preset mapping feature layer according to the region corner mapping coordinates of the specified region corners;
determining the four feature points closest to each sampling point from the target image area, and determining sampling point mapping coordinates of the sampling points according to the coordinates of the four feature points in the preset mapping feature layer and the center point coordinates of the sampling points;
carrying out maximum pooling on the mapping coordinates of the sampling points to obtain a space positioning image of interest;
and identifying the object of interest on the space positioning image of interest, and determining a target region of interest in the target characteristic image.
In a preferred embodiment, the input module 520 is configured to:
Determining a target feature region of interest from the space positioning image of interest, and placing a mask of a preset size centered on the target feature region of interest to obtain cascade feature regions of interest corresponding to the target feature region of interest;
calculating a super-parameter value of each cascade feature region of interest relative to the target feature region of interest according to preset super-parameter conditions of interest;
and determining the target region of interest in the target feature image according to the cascade feature region of interest with the maximum super-parameter value and the target feature region of interest.
In a preferred embodiment, the preset super-parameter conditions of interest include: the length of a shorter side of the cascade feature region of interest is greater than or equal to one third of the total length, along that side, of the cascade feature region of interest and the corresponding target feature region of interest, and the length of a longer side of the cascade feature region of interest is less than or equal to the total length, along that side, of the cascade feature region of interest and the corresponding target feature region of interest;
and the intersection-over-union of the cascade feature region of interest and the target feature region of interest is greater than or equal to a preset intersection-over-union threshold.
In a preferred embodiment, the plurality of specified region corner points are region corner points of a lower left corner and a lower right corner, and the preset number is 4.
In a preferred embodiment, the input module 520 is configured to:
determining a GT center point of each frame of behavior image to be recognized;
determining associated grids from the behavior image to be recognized by means of random translation and random scaling;
determining, from preset feature candidate frames, target feature candidate frames for the targets in the grid where the GT center point is located and in the associated grids of each frame of behavior image to be recognized;
and adding a boundary frame and an annotation category to the corresponding frame of behavior image to be recognized according to the target feature candidate frames, so as to obtain the annotation category and the annotation image of the boundary frame corresponding to each frame of behavior image to be recognized.
In a preferred embodiment, the input module 520 is configured to:
determining a boundary area covered by the boundary frame in the annotation image;
determining a center point of the boundary area;
and determining the region image where the center point is located as the key frame annotation image of the annotation image.
In a preferred embodiment, the object detection network is a FastestDet network and the object classification network is a SlowFast network.
In a preferred embodiment, in the preset fast-slow ratio, the ratio of fast images to be classified to slow images to be classified is 8:1.
with respect to the image-based behavior recognition apparatus 500 in the above-described embodiment, a specific manner in which each module performs an operation has been described in detail in the embodiment regarding the method, and will not be described in detail herein.
It will be appreciated by those skilled in the art that the above-described embodiments of the apparatus are merely illustrative; for example, the division into modules is merely a logical functional division, and in practice a plurality of modules may be combined or one module may be divided into a plurality of sub-modules.
Further, the modules illustrated as separate components may or may not be physically separate. Also, each module may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. When implemented in hardware, may be implemented in whole or in part in the form of an integrated circuit or chip.
The embodiment of the disclosure also provides an electronic device, including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any of the preceding embodiments.
The embodiment of the disclosure also provides a material vehicle, comprising: the electronic device described in the foregoing embodiment.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, but the present disclosure is not limited to the specific details of the above embodiments, and various changes, modifications, substitutions and alterations can be made to these embodiments within the scope of the technical idea of the present disclosure, which all fall within the scope of protection of the present disclosure.
It should be further noted that, where specific features described in the foregoing embodiments are combined in any suitable manner, they should also be regarded as disclosure of the present disclosure, and various possible combinations are not separately described in order to avoid unnecessary repetition. The technical scope of the present application is not limited to the contents of the specification, and must be determined according to the scope of claims.

Claims (9)

1. An image-based behavior recognition method, the method comprising:
acquiring multiple frames of behavior images to be recognized of a target object;
inputting the multiple frames of behavior images to be recognized into a pre-trained behavior recognition model to obtain a behavior recognition result for the target object output by the pre-trained behavior recognition model;
wherein the behavior recognition model comprises a target detection network and a target classification network; the target detection network performs target detection on the multiple frames of behavior images to be recognized to obtain, for each frame of behavior image to be recognized, a labeling category and a labeled image with a bounding box, and key frame annotation images are determined from the labeled images;
the target classification network extracts frames from the multiple frames of behavior images to be recognized at equal intervals according to a preset fast-slow ratio to obtain fast images to be classified and slow images to be classified, wherein the proportion of fast images to be classified in the preset fast-slow ratio is larger than that of slow images to be classified; image recognition is performed on the fast images to be classified and the slow images to be classified respectively to generate a target feature image;
the behavior recognition model performs region-of-interest recognition on the target feature image according to the key frame annotation image and determines a target region of interest in the target feature image; the target region of interest is recognized and classified according to the fast-channel region of interest in the fast images to be classified and the slow-channel region of interest in the slow images to be classified to obtain a category label of the multiple frames of behavior images to be recognized; the labeling category in each labeled image is replaced according to the category label, and the replaced labeled images are combined to obtain the behavior recognition result for the target object;
wherein the behavior recognition model performing region-of-interest recognition on the target feature image according to the key frame annotation image and determining a target region of interest in the target feature image comprises:
the behavior recognition model determining a target image region from the target feature image according to the key frame annotation image;
determining corner mapping coordinates of each designated region corner point in a preset mapping feature layer according to the coordinates of a plurality of designated region corner points in the target image region and the stride of the preset mapping feature layer relative to the target image region;
randomly selecting a preset number of sampling points from the target image region, and determining center point coordinates of the sampling points in the preset mapping feature layer according to the corner mapping coordinates of the designated region corner points;
determining, for each sampling point, the four feature points closest to the sampling point in the target image region, and determining sampling point mapping coordinates of the sampling point according to the coordinates of the four feature points in the preset mapping feature layer and the center point coordinates of the sampling point;
performing maximum pooling on the sampling point mapping coordinates to obtain a spatial positioning image of interest;
performing object-of-interest recognition on the spatial positioning image of interest, and determining the target region of interest in the target feature image;
wherein the performing object-of-interest recognition on the spatial positioning image of interest and determining the target region of interest in the target feature image comprises:
determining a target feature region of interest from the spatial positioning image of interest, and placing a mask of a preset size centered on the target feature region of interest to obtain cascade feature regions of interest corresponding to the target feature region of interest;
calculating hyperparameter values of the cascade feature regions of interest relative to the target feature region of interest according to preset region-of-interest hyperparameter conditions;
and determining the target region of interest in the target feature image according to the cascade feature region of interest with the maximum hyperparameter value and the target feature region of interest.
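The corner-mapping, sampling, and maximum-pooling steps above resemble ROIAlign-style pooling. The sketch below is one hedged reading of those steps, assuming a single-channel feature map, bilinear interpolation over the four nearest feature points, and a random choice of sampling points; the function name roi_max_pool, the stride parameter, and everything beyond the claim text are assumptions.

```python
# Hedged sketch: map region corners onto a feature layer by its stride,
# bilinearly sample random points inside the mapped region, and max-pool.
import numpy as np

def roi_max_pool(feature_map: np.ndarray, region: tuple, stride: int,
                 num_samples: int = 4,
                 rng: np.random.Generator = np.random.default_rng()) -> float:
    h, w = feature_map.shape
    x1, y1, x2, y2 = (c / stride for c in region)  # corner mapping coordinates
    xs = rng.uniform(x1, x2, num_samples)          # random sampling points
    ys = rng.uniform(y1, y2, num_samples)
    samples = []
    for x, y in zip(xs, ys):
        # Four nearest feature points, clipped to stay inside the map.
        x0 = int(np.clip(np.floor(x), 0, w - 2))
        y0 = int(np.clip(np.floor(y), 0, h - 2))
        dx, dy = x - x0, y - y0
        samples.append((1 - dx) * (1 - dy) * feature_map[y0, x0]
                       + dx * (1 - dy) * feature_map[y0, x0 + 1]
                       + (1 - dx) * dy * feature_map[y0 + 1, x0]
                       + dx * dy * feature_map[y0 + 1, x0 + 1])
    return float(max(samples))                     # maximum pooling
```

In the claim's terms, region plays the role of the designated region corner points and stride is the step distance of the preset mapping feature layer relative to the target image region.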
2. The method of claim 1, wherein the preset region-of-interest hyperparameter conditions comprise: the length of the shorter side of the cascade feature region of interest is greater than or equal to one third of the total length of that side and the corresponding side of the target feature region of interest, and the length of the longer side of the cascade feature region of interest is less than or equal to the total length of that side and the corresponding side of the target feature region of interest;
and the intersection-over-union of the cascade feature region of interest and the target feature region of interest is greater than or equal to a preset intersection-over-union threshold.
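One possible reading of these conditions, expressed as a predicate over axis-aligned boxes, is sketched below; the (x1, y1, x2, y2) box format, the 0.5 threshold default, and the pairing of "corresponding" sides are assumptions.

```python
def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def satisfies_conditions(cascade: tuple, target: tuple,
                         iou_threshold: float = 0.5) -> bool:
    """Check the claim-2 side-length and IoU conditions (one interpretation)."""
    cw, ch = cascade[2] - cascade[0], cascade[3] - cascade[1]
    tw, th = target[2] - target[0], target[3] - target[1]
    short_c, long_c = min(cw, ch), max(cw, ch)
    # Pair each cascade side with the "corresponding" target side.
    short_t, long_t = (tw, th) if cw <= ch else (th, tw)
    short_ok = short_c >= (short_c + short_t) / 3
    long_ok = long_c <= long_c + long_t  # trivially true for non-degenerate boxes
    return short_ok and long_ok and iou(cascade, target) >= iou_threshold
```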
3. The method of claim 1, wherein the plurality of designated region corner points are the region corner points of the lower left corner and the lower right corner, and the preset number is 4.
4. The method according to claim 1, wherein the target detection network performing target detection on the multiple frames of behavior images to be recognized to obtain a labeling category and a labeled image with a bounding box for each frame of behavior image to be recognized comprises:
determining a GT (ground-truth) center point of each frame of behavior image to be recognized;
determining associated grids from the behavior image to be recognized by means of random translation and random scaling;
determining, from preset feature candidate boxes, target feature candidate boxes for the targets in the grid where the GT center point is located and in the associated grids of each frame of behavior image to be recognized;
and adding a bounding box, together with a labeling category, to the behavior image to be recognized of the corresponding frame according to the target feature candidate boxes, to obtain the labeling category and the labeled image with the bounding box for each frame of behavior image to be recognized.
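A rough sketch of this assignment step follows: locate the grid cell holding the GT center, derive associated cells by random translation and scaling, and match a preset candidate box by shape. The grid size, jitter ranges, number of associated cells, and the shape-matching rule are all assumptions layered on the claim text.

```python
# Sketch of GT-center grid assignment with random translation and scaling.
import random

def assign_candidate_boxes(gt_box: tuple, grid_size: int,
                           candidate_shapes: list, n_assoc: int = 2) -> list:
    """Return (grid cell, candidate shape) pairs for a single GT box.
    gt_box: (x1, y1, x2, y2); candidate_shapes: list of preset (w, h)."""
    x1, y1, x2, y2 = gt_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2                   # GT center point
    cells = {(int(cx) // grid_size, int(cy) // grid_size)}  # cell holding the center
    for _ in range(n_assoc):                                # associated grids
        jx = cx + random.uniform(-0.5, 0.5) * grid_size     # random translation
        jy = cy + random.uniform(-0.5, 0.5) * grid_size
        s = random.uniform(0.8, 1.2)                        # random scaling
        cells.add((int(jx * s) // grid_size, int(jy * s) // grid_size))
    gw, gh = x2 - x1, y2 - y1
    # Choose the preset candidate whose shape best matches the GT box.
    best = min(candidate_shapes, key=lambda wh: abs(wh[0] - gw) + abs(wh[1] - gh))
    return [(cell, best) for cell in cells]
```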
5. The method of claim 1, wherein said determining key frame annotation images from the labeled images comprises:
determining a boundary region covered by the bounding box in the labeled image;
determining the center point of the boundary region;
and determining the region image in which the center point is located as the key frame annotation image of the labeled image.
6. The method according to any one of claims 1-5, wherein the target detection network is a FastestDet network and the target classification network is a SlowFast network.
7. The method according to any one of claims 1-5, wherein the preset fast-slow ratio of fast images to be classified to slow images to be classified is 8:1.
8. An image-based behavior recognition apparatus, the apparatus comprising:
an acquisition module configured to acquire multiple frames of behavior images to be recognized of a target object;
an input module configured to input the multiple frames of behavior images to be recognized into a pre-trained behavior recognition model to obtain a behavior recognition result for the target object output by the pre-trained behavior recognition model;
wherein the behavior recognition model comprises a target detection network and a target classification network; the target detection network performs target detection on the multiple frames of behavior images to be recognized to obtain, for each frame of behavior image to be recognized, a labeling category and a labeled image with a bounding box, and key frame annotation images are determined from the labeled images;
the target classification network extracts frames from the multiple frames of behavior images to be recognized at equal intervals according to a preset fast-slow ratio to obtain fast images to be classified and slow images to be classified, wherein the proportion of fast images to be classified in the preset fast-slow ratio is larger than that of slow images to be classified; image recognition is performed on the fast images to be classified and the slow images to be classified respectively to generate a target feature image;
the behavior recognition model performs region-of-interest recognition on the target feature image according to the key frame annotation image and determines a target region of interest in the target feature image; the target region of interest is recognized and classified according to the fast-channel region of interest in the fast images to be classified and the slow-channel region of interest in the slow images to be classified to obtain a category label of the multiple frames of behavior images to be recognized; the labeling category in each labeled image is replaced according to the category label, and the replaced labeled images are combined to obtain the behavior recognition result for the target object;
wherein the input module is further configured such that:
the behavior recognition model determines a target image region from the target feature image according to the key frame annotation image;
corner mapping coordinates of each designated region corner point in a preset mapping feature layer are determined according to the coordinates of a plurality of designated region corner points in the target image region and the stride of the preset mapping feature layer relative to the target image region;
a preset number of sampling points are randomly selected from the target image region, and center point coordinates of the sampling points in the preset mapping feature layer are determined according to the corner mapping coordinates of the designated region corner points;
for each sampling point, the four feature points closest to the sampling point are determined in the target image region, and sampling point mapping coordinates of the sampling point are determined according to the coordinates of the four feature points in the preset mapping feature layer and the center point coordinates of the sampling point;
maximum pooling is performed on the sampling point mapping coordinates to obtain a spatial positioning image of interest;
a target feature region of interest is determined from the spatial positioning image of interest, and a mask of a preset size is placed centered on the target feature region of interest to obtain cascade feature regions of interest corresponding to the target feature region of interest;
hyperparameter values of the cascade feature regions of interest relative to the target feature region of interest are calculated according to preset region-of-interest hyperparameter conditions;
and the target region of interest in the target feature image is determined according to the cascade feature region of interest with the maximum hyperparameter value and the target feature region of interest.
9. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any one of claims 1-7.
CN202310635823.0A 2023-06-01 2023-06-01 Behavior recognition method and device based on image and electronic equipment Active CN116363761B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310635823.0A CN116363761B (en) 2023-06-01 2023-06-01 Behavior recognition method and device based on image and electronic equipment

Publications (2)

Publication Number Publication Date
CN116363761A CN116363761A (en) 2023-06-30
CN116363761B true CN116363761B (en) 2023-08-18

Family

ID=86934841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310635823.0A Active CN116363761B (en) 2023-06-01 2023-06-01 Behavior recognition method and device based on image and electronic equipment

Country Status (1)

Country Link
CN (1) CN116363761B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11270121B2 (en) * 2019-08-20 2022-03-08 Microsoft Technology Licensing, Llc Semi supervised animated character recognition in video
US11343474B2 (en) * 2019-10-02 2022-05-24 Qualcomm Incorporated Image capture based on action recognition

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222585A (en) * 2019-05-15 2019-09-10 华中科技大学 A kind of motion target tracking method based on cascade detectors
CN110942009A (en) * 2019-11-22 2020-03-31 南京甄视智能科技有限公司 Fall detection method and system based on space-time hybrid convolutional network
CN112487949A (en) * 2020-11-27 2021-03-12 华中师范大学 Learner behavior identification method based on multi-modal data fusion
CN113269257A (en) * 2021-05-27 2021-08-17 中山大学孙逸仙纪念医院 Image classification method and device, terminal equipment and storage medium
CN113901972A (en) * 2021-12-09 2022-01-07 深圳市海清视讯科技有限公司 Method, device and equipment for detecting remote sensing image building and storage medium
CN114419105A (en) * 2022-03-14 2022-04-29 深圳市海清视讯科技有限公司 Multi-target pedestrian trajectory prediction model training method, prediction method and device
CN114359739A (en) * 2022-03-18 2022-04-15 深圳市海清视讯科技有限公司 Target identification method and device
CN114419465A (en) * 2022-03-31 2022-04-29 深圳市海清视讯科技有限公司 Method, device and equipment for detecting change of remote sensing image and storage medium
CN114419739A (en) * 2022-03-31 2022-04-29 深圳市海清视讯科技有限公司 Training method of behavior recognition model, behavior recognition method and equipment
CN114882586A (en) * 2022-04-19 2022-08-09 北京昭衍新药研究中心股份有限公司 System and method for monitoring active state of group monkey movement
CN114943936A (en) * 2022-06-17 2022-08-26 北京百度网讯科技有限公司 Target behavior identification method and device, electronic equipment and storage medium
CN116152696A (en) * 2022-11-21 2023-05-23 中国大唐集团科学技术研究总院有限公司 Intelligent security image identification method and system for industrial control system
CN116012756A (en) * 2022-12-28 2023-04-25 网络通信与安全紫金山实验室 Behavior action detection method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Climbing-over behavior detection based on deep learning; Wang Lin et al.; Computer Systems & Applications (《计算机系统应用》); Vol. 32, No. 5; pp. 262-272 *

Also Published As

Publication number Publication date
CN116363761A (en) 2023-06-30

Similar Documents

Publication Publication Date Title
Luo et al. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net
Xu et al. Segment as points for efficient online multi-object tracking and segmentation
Wang et al. Semi-supervised video object segmentation with super-trajectories
Jana et al. YOLO based Detection and Classification of Objects in video records
CN110991311B (en) Target detection method based on dense connection deep network
Yao et al. When, where, and what? A new dataset for anomaly detection in driving videos
Chen et al. Asynchronous tracking-by-detection on adaptive time surfaces for event-based object tracking
CN112836639A (en) Pedestrian multi-target tracking video identification method based on improved YOLOv3 model
CN114648665B (en) Weak supervision target detection method and system
CN111767847A (en) Pedestrian multi-target tracking method integrating target detection and association
Wang et al. Multi-target pedestrian tracking based on yolov5 and deepsort
Gündüz et al. Efficient multi-object tracking by strong associations on temporal window
Yang et al. Deep learning-based object detection improvement for fine-grained birds
CN116310688A (en) Target detection model based on cascade fusion, and construction method, device and application thereof
Xu et al. Semantic segmentation of sparsely annotated 3D point clouds by pseudo-labelling
Xu et al. Bilateral association tracking with parzen window density estimation
Guo et al. UDTIRI: An online open-source intelligent road inspection benchmark suite
Nugroho et al. Comparison of deep learning-based object classification methods for detecting tomato ripeness
CN116363761B (en) Behavior recognition method and device based on image and electronic equipment
Su et al. Occlusion-aware detection and re-id calibrated network for multi-object tracking
CN116843719A (en) Target tracking method based on independent search twin neural network
Liu et al. Yolo-3DMM for Simultaneous Multiple Object Detection and Tracking in Traffic Scenarios
CN112199984B (en) Target rapid detection method for large-scale remote sensing image
Wu et al. RSF: a novel saliency fusion framework for image saliency detection
CN112966556B (en) Moving object detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant