CN110738101B - Behavior recognition method, behavior recognition device and computer-readable storage medium - Google Patents

Behavior recognition method, behavior recognition device and computer-readable storage medium

Info

Publication number
CN110738101B
CN110738101B (Application CN201910832181.7A)
Authority
CN
China
Prior art keywords
frame
image
behavior
human body
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910832181.7A
Other languages
Chinese (zh)
Other versions
CN110738101A (en)
Inventor
罗郑楠
周俊琨
肖玉宾
许扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910832181.7A priority Critical patent/CN110738101B/en
Priority to PCT/CN2019/117803 priority patent/WO2021042547A1/en
Publication of CN110738101A publication Critical patent/CN110738101A/en
Application granted granted Critical
Publication of CN110738101B publication Critical patent/CN110738101B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present scheme relates to artificial intelligence and provides a behavior recognition method, a behavior recognition device and a storage medium. The behavior recognition method comprises the following steps: dividing a video stream into a sequence of image frames; detecting the human body contour in each frame of image and marking each human body with a first rectangular frame; calculating the distance between any two first rectangular frames in each frame of image; if the distance between two first rectangular frames in a certain frame of image is smaller than a threshold, enclosing the two first rectangular frames with a two-person combined frame; searching the frames before and after that frame for the same two persons as those in the two-person combined frame, forming two-person combined frames for them as well, and composing the two-person combined frames of those frames into a two-person combined frame sequence; and inputting the two-person combined frame sequence into a neural network model for behavior recognition. The invention avoids the large amount of computation that redundant background would impose on the neural network model, retains the features between the two persons that are valuable for behavior judgment, and improves the accuracy of behavior recognition in complex scenes.

Description

Behavior recognition method, behavior recognition device and computer-readable storage medium
Technical Field
The present invention relates to artificial intelligence, and more particularly, to a behavior recognition method, apparatus, and computer-readable storage medium.
Background
In the security field, judging the behavior of human bodies in video is a frequently encountered problem. Events within the monitored area are detected by a camera, for example: a moving human body in the current area is detected and the behavior of that human body is identified. In the traditional behavior recognition field, the human body contour in a video frame is generally extracted, and recognition and judgment are performed according to the posture change of that contour. In more complex scenes, or when many people in the background interfere, treating behavior recognition as a simple classification of human posture changes leads to a high misjudgment rate.
In addition, the behavior between two people provides a very valuable basis for judging whether an interaction occurs, whereas the background outside the region between the two people is of little value for that judgment. Moreover, in surveillance video the human bodies are usually small and the background large, so directly feeding an entire video frame into a neural network model for computation causes a huge amount of calculation.
Disclosure of Invention
In order to solve the technical problems, the invention provides a behavior recognition method, which is applied to an electronic device and comprises the following steps:
s1, acquiring a video stream, and dividing the video stream into an image frame sequence consisting of multi-frame images;
s2, detecting the human body outline in each frame of image, and marking each human body by using a first rectangular frame;
s3, calculating the distance between any two first rectangular frames in each frame of image;
s4, if the distance between two first rectangular frames in a certain frame of image is smaller than a set distance threshold value, surrounding the two first rectangular frames by adopting a two-person combined frame, wherein the two-person combined frame is the smallest rectangular frame surrounding the two first rectangular frames;
s5, searching the multiple frames before and after the certain frame image for the same two persons as those in the two-person combined frame, forming two-person combined frames for them in those frames as well, and composing the two-person combined frames of the certain frame image and of the preceding and following multiple frames into a two-person combined frame sequence;
s6, inputting the two-person combined frame sequence into a neural network model, and performing human behavior recognition through the neural network model to obtain a recognition result and confirm whether the recognition result belongs to a preset behavior category.
In addition, the invention also provides an electronic device, which comprises: the system comprises a memory and a processor, wherein a behavior recognition program is stored in the memory, and the behavior recognition program realizes the following steps when being executed by the processor:
s1, acquiring a video stream, and dividing the video stream into an image frame sequence consisting of multi-frame images;
s2, detecting the human body outline in each frame of image, and marking each human body by using a first rectangular frame;
s3, calculating the distance between any two first rectangular frames in each frame of image;
s4, if the distance between two first rectangular frames in a certain frame of image is smaller than a set distance threshold value, surrounding the two first rectangular frames by adopting a two-person combined frame, wherein the two-person combined frame is the smallest rectangular frame surrounding the two first rectangular frames;
s5, searching the multiple frames before and after the certain frame image for the same two persons as those in the two-person combined frame, forming two-person combined frames for them in those frames as well, and composing the two-person combined frames of the certain frame image and of the preceding and following multiple frames into a two-person combined frame sequence;
s6, inputting the two-person combined frame sequence into a neural network model, and performing human behavior recognition through the neural network model to obtain a recognition result and confirm whether the recognition result belongs to a preset behavior category.
In addition, the present invention also provides a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a processor, implement the behavior recognition method as described above.
According to the method, the two human bodies in the image and the region between them are segmented from the other regions, so that the background outside the two human bodies is removed. This avoids the large amount of computation that redundant background would impose on the neural network model, while retaining the features of the two human bodies that are valuable for behavior recognition and judgment. It also effectively eliminates the interference of unrelated people on the behavior judgment, greatly improving the accuracy of behavior recognition in complex scenes.
Drawings
The above-mentioned features and technical advantages of the present invention will become more apparent and readily appreciated from the following description of the embodiments thereof, taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of a behavior recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of one example of the impact of the background of an embodiment of the invention on behavior recognition;
FIG. 3 is a schematic diagram of a neural network model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a hardware architecture of an electronic device according to an embodiment of the invention;
fig. 5 is a schematic block diagram of a behavior recognition program according to an embodiment of the present invention.
Detailed Description
Embodiments of a behavior recognition method, apparatus, and computer-readable storage medium according to the present invention will be described below with reference to the accompanying drawings. Those skilled in the art will recognize that the described embodiments may be modified in various different ways, or combinations thereof, without departing from the spirit and scope of the invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive in scope. Furthermore, in the present specification, the drawings are not drawn to scale, and like reference numerals denote like parts.
The behavior recognition method can be used for recognizing interaction behaviors between two people, such as greeting, tailing, fighting, stealing and the like. The present embodiment is described taking fighting between two persons as an example.
Fig. 1 is a flow chart of a behavior recognition method according to an embodiment of the present invention, where the method includes the following steps:
S1, acquiring a video stream and dividing it into an image frame sequence, i.e. frame-by-frame images.
S2, for each frame of image, detecting the human body contour, locating its position, and framing each human body with a first rectangular frame. The human body contour in the image is detected with a neural network. Specifically, a sliding window is slid over the image, and the content of the window is classified by a CNN (convolutional neural network) model to determine whether the window contains a human body. Each frame of image serves as the input of the CNN model; the output of the CNN model may be the two classes "human body" and "non-human body", or more than two classes. When the CNN model is used only to identify whether a human body is present in the image, no position is produced; to mark the position, additional parameters must be output. For example, when the CNN model outputs the class "human body", the coordinates of the four corner points of the sliding window are also output, so that the human body is framed.
The sliding window moves from left to right and from top to bottom, and an SVM classifier is used to identify the target. Windows of different sizes and aspect ratios are used; the region selected by each sliding window is sent into the CNN for feature extraction, the extracted spatial features are sent into the SVM classifier for classification to determine whether a human body is present in each frame of image, and each detected human body is framed with a first rectangular frame.
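As an illustration of this detection step, the following Python sketch slides fixed-size windows over an image and keeps the windows that a classifier labels as containing a human body. The `cnn_features` feature extractor and the trained `svm` classifier are assumed to exist (for example a sklearn `SVC`); they are placeholders, not details given by the patent.

```python
import numpy as np

def detect_bodies(image, cnn_features, svm, win_sizes=((64, 128), (96, 192)), stride=32):
    """Slide windows over the image and return first rectangular frames
    around windows classified as containing a human body.

    `cnn_features(crop)` is assumed to return a 1-D feature vector for a crop,
    and `svm` is assumed to be a trained binary classifier with a `predict`
    method where label 1 means "human body".
    """
    h, w = image.shape[:2]
    boxes = []
    for win_w, win_h in win_sizes:                      # different sizes / aspect ratios
        for top in range(0, h - win_h + 1, stride):      # top-to-bottom
            for left in range(0, w - win_w + 1, stride): # left-to-right
                crop = image[top:top + win_h, left:left + win_w]
                feat = cnn_features(crop).reshape(1, -1) # spatial features from the CNN
                if svm.predict(feat)[0] == 1:            # SVM decides "human body"
                    # the window's corner coordinates become the first rectangular frame
                    boxes.append((left, top, left + win_w, top + win_h))
    return boxes
```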
S3, calculating the distance between any two human body contours in the image. The distance between the contours of any two persons may be calculated from the corner coordinates of the first rectangular frames that mark the contours; for example, calculating the distance between the lower-left corner coordinates of the two first rectangular frames 100 gives the distance between the two human body contours.
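A minimal sketch of this distance computation, assuming boxes are given as (left, top, right, bottom) pixel coordinates with y growing downward and, as in the example above, measuring between lower-left corners:

```python
import math

def box_distance(box_a, box_b):
    """Distance between two first rectangular frames, taken between their
    lower-left corners; boxes are (left, top, right, bottom)."""
    ax, ay = box_a[0], box_a[3]   # lower-left corner of box A
    bx, by = box_b[0], box_b[3]   # lower-left corner of box B
    return math.hypot(ax - bx, ay - by)
```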
S4, if the distance between the human body contours of two persons in a certain frame of image is smaller than a set distance threshold, enclosing the two persons with a two-person combined frame 200, the two-person combined frame being the smallest rectangular frame surrounding the two rectangular frames.
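Continuing the sketch (and reusing `box_distance` from above), the two-person combined frame is simply the smallest rectangle enclosing both first rectangular frames, formed for every pair of boxes closer than the threshold; the pairing loop is an illustrative assumption:

```python
def combined_box(box_a, box_b):
    """Two-person combined frame: the smallest rectangle enclosing both
    first rectangular frames (left, top, right, bottom)."""
    left   = min(box_a[0], box_b[0])
    top    = min(box_a[1], box_b[1])
    right  = max(box_a[2], box_b[2])
    bottom = max(box_a[3], box_b[3])
    return (left, top, right, bottom)

def pair_boxes(boxes, dist_threshold):
    """Form a combined frame for every pair of body boxes closer than the threshold."""
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_distance(boxes[i], boxes[j]) < dist_threshold:
                pairs.append(((i, j), combined_box(boxes[i], boxes[j])))
    return pairs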
S5, using ReID (pedestrian re-identification) to judge whether the human bodies in the b frames of images before and after the certain frame are the same persons. That is, the same two persons are found in the preceding and following b frames, a two-person combined frame is formed for them in those frames as well, and the two-person combined frames of the two persons in the preceding and following b frames and in the certain frame are composed into a two-person combined frame sequence. For several persons in an image, several two-person combined frame sequences may be formed. ReID considers not only the content information of the image but also the motion information between frames: spatial features are extracted with a CNN while temporal features are extracted with an RNN (recurrent neural network). Each image is passed through the CNN to extract the human contours, which are then input into the RNN network to extract the final features. The final features fuse the spatial features of the single-frame image with the features of the optical-flow diagram between frames, so as to judge whether the human bodies in the multiple frames are the same.
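The following sketch shows one way the two-person combined frame sequence could be assembled over the b frames before and after the current frame. The `same_person` ReID predicate is assumed (the actual CNN+RNN re-identification network is not implemented here), and `combined_box` comes from the earlier sketch.

```python
def build_pair_sequence(frames, boxes_per_frame, center_idx, pair, b, same_person):
    """Collect two-person combined frames for the same two people in the
    b frames before and after frame `center_idx`.

    `pair` is (box_p, box_q) for the two people in the center frame;
    `same_person(frame_a, box_a, frame_b, box_b)` is an assumed ReID predicate
    deciding whether two boxes show the same person across frames.
    """
    box_p, box_q = pair
    sequence = [combined_box(box_p, box_q)]                  # center frame first
    for k in range(max(0, center_idx - b), min(len(frames), center_idx + b + 1)):
        if k == center_idx:
            continue
        match_p = match_q = None
        for box in boxes_per_frame[k]:
            if match_p is None and same_person(frames[center_idx], box_p, frames[k], box):
                match_p = box
            elif match_q is None and same_person(frames[center_idx], box_q, frames[k], box):
                match_q = box
        if match_p is not None and match_q is not None:
            sequence.append(combined_box(match_p, match_q))  # same two people re-found
    return sequence
```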
S6, inputting the multiple two-person combined frame sequences into a neural network model, performing human behavior recognition on the input images through the neural network model to obtain recognition results, and confirming whether the recognition results belong to a preset behavior category. For fighting, the images are classified into "fighting" and "non-fighting", thereby judging whether the two persons in each two-person combined frame sequence are fighting.
The background between the two persons is retained inside the two-person combined frame, and that background may contain objects such as a knife or a wine bottle that help judge whether a fight is occurring. For example, a knife between two persons may be used to cut objects, or used in a fight; a wine bottle may be used for drinking, or as a weapon in a fight. Training data can be set for these situations: when the knife touches both persons at the same time, it is considered a fight; when the knife is raised above shoulder height, it is considered a fight; if there are blood stains on the knife, it is considered a fight. Likewise, if the bottle is held by one person and raised above shoulder height, the probability of a fight is considered high; if the bottle is held by one person with its mouth facing downwards but there is no glass below it, the probability of a fight is considered high; a blood-stained bottle is considered to indicate a fight. Many kinds of hard objects, such as bricks, wrenches and hammers, may serve as weapons, and the probability of a fight is determined by setting the state and position of the object in combination with the characteristics of the different objects.
In addition, there may be another person between the two people, and that person may or may not be involved in the fight between them. Therefore, after judging whether a two-person combined frame sequence shows a fight, the result can influence the subsequent judgments. For example, as shown in fig. 2, suppose the middle person has already been judged to be fighting with both the left and the right person. Then the probability that the left person and the right person are fighting each other is small; or, if the left person has fought with the middle person, the probability that the left person also fights the right person is relatively large, and the probability that the middle person fights the right person is relatively large. When the training data set is constructed, the fight probability is set according to these different situations, and the other limb movements of the two persons are combined to comprehensively judge whether a fight is occurring.
In addition, the background may also include displacement of objects knocked over during the fight, such as an overturned cabinet, objects scattered on the ground, blood stains on the ground, other fallen human bodies, and so on. These can assist in determining whether fighting is occurring. Evaluating the behavior together with the background makes the judgment more accurate.
The above merely illustrates the effect of some backgrounds on the judgment of fighting behavior; a fight probability may be set for each background situation and used to train the neural network model.
In addition, enclosing the rectangular frames of the two human bodies with the two-person combined frame segments the original image: the two human bodies and the region between them are separated from the background, and the background outside the two human bodies is removed. This avoids the large amount of computation that redundant background would impose on the neural network model, while retaining the features of the two human bodies that are most valuable for the fight judgment. Moreover, in surveillance video the persons are usually small and the background large, and directly feeding a whole video frame into the neural network model would cause a huge amount of computation. Therefore, cropping only the two persons and the region between them as the model input greatly reduces the computation, while the background retained between the two persons can still provide a basis for judging whether they are fighting.
In order to identify fighting behavior, a training set is first constructed that contains a large number of images of two persons fighting, for example images with a weapon such as a knife, a steel pipe or a brick between the two persons. Fighting can also be judged from the expressions of the two persons, changes in the state of their clothes, sound, speech and the like, and the corresponding training images are labelled as fighting behavior. The images in the training set are input into the neural network model, the model classifies them, and the quality of the classification is judged through a loss function, so that the recognition accuracy of the neural network model is continuously improved; when the recognition accuracy reaches the expected range, the model can be used to recognize fighting. The pictures in the two-person combined frame 200 are then input into the neural network model, and fighting behavior can be identified.
Further, as shown in fig. 3, the basic structure of the neural network model takes S1-SN, the N images sampled from the video, as input. For each image, a first 2D convolution sub-network W_2D is employed to obtain a plurality of feature maps, and a feature set is obtained after the feature maps of the N images are stacked. The first 2D convolution sub-network W_2D comprises at least one sub-network connected in sequence, where the output of one sub-network is the input of the next. Each sub-network comprises 4 branches: the 1st branch performs two convolutions, [1x1] and [3x3]; the 2nd branch performs three convolutions using [1x1] and [3x3] kernels; the 3rd branch is [1x1] max pooling, with both the convolution and max-pooling strides equal to 2; and the 4th branch is a [1x1] convolution.
The obtained feature set is input both to a 3D convolution sub-network W_3D and to a second 2D convolution sub-network V_2D for processing. The 3D convolution sub-network W_3D comprises a first convolution segment, a second convolution segment, a third convolution segment, a fourth convolution segment, a fifth convolution segment and an average pooling layer, connected in sequence.
The first convolution segment comprises a convolution of 7x7x7, 64, where 7x7x7 represents the convolution kernel (7 x7 is the spatial dimension and finally 7 is the temporal dimension) and 64 represents the number of channels;
the second convolution section comprises two second convolution units which are connected in sequence, wherein each second convolution unit comprises two convolution layers of 3x3,64 and 3x3, 64;
the third convolution section comprises two third convolution units which are connected in sequence, and each third convolution unit comprises two convolution layers of 3x3,128 and 3x3, 128;
the fourth convolution section comprises two fourth convolution units which are connected in sequence, and each fourth convolution unit comprises two convolution layers of 3x3,256 and 3x3, 256;
the fifth convolution segment includes two fifth convolution units connected in sequence, and each of the fifth convolution units includes two convolution layers of 3x3,512 and 3x3, 512.
The second 2D convolution sub-network comprises 5 sub-networks, a max-pooling sub-network and two further sub-networks connected in sequence, and finally average pooling is used to obtain the features of the multi-frame images. Each sub-network comprises 4 branches: the 1st branch performs two convolutions, [1x1] and [3x3]; the 2nd branch performs three convolutions using [1x1] and [3x3] kernels; the 3rd branch is [1x1] max pooling, with both the convolution and max-pooling strides equal to 2; and the 4th branch is a [1x1] convolution. The output of the second 2D convolution sub-network is then fused with the output of the 3D convolution sub-network, thereby obtaining the final classification result. A rough sketch of the 3D convolution sub-network is given below.
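The sketch below is a rough PyTorch rendering of the 3D convolution sub-network W_3D only, following the five-segment description above. The ReLU activations, paddings, spatial-only downsampling between segments and 3x3x3 kernels are assumptions made so the example runs, not details given in the patent.

```python
import torch
import torch.nn as nn

class ConvUnit3D(nn.Module):
    """Two stacked 3x3(x3) convolutions with the same width, as in each unit
    of the second to fifth convolution segments."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class Conv3DSubNetwork(nn.Module):
    """W_3D: a 7x7x7, 64-channel stem, then four segments of two units each
    (64, 128, 256, 512 channels) and a final average pooling layer."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
        )
        layers, prev = [], 64
        for width in (64, 128, 256, 512):
            layers += [ConvUnit3D(prev, width), ConvUnit3D(width, width),
                       nn.MaxPool3d(kernel_size=(1, 2, 2))]   # assumed spatial downsampling
            prev = width
        self.segments = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool3d(1)                    # average pooling layer

    def forward(self, x):                                      # x: (batch, C, frames, H, W)
        x = self.segments(self.stem(x))
        return self.pool(x).flatten(1)                         # clip-level feature vector

# e.g. Conv3DSubNetwork()(torch.randn(1, 3, 8, 112, 112)) -> tensor of shape (1, 512)
```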
Further, the neural network model may preferably judge fighting behavior in combination with one or more of facial expression, exposed-skin color, speech and contact areas.
The process of detecting the exposed skin color of each person in the image comprises:
converting the image into an HSV image, scanning the image from left to right and from top to bottom, marking connected regions by comparing each pixel with its neighboring pixel values, counting in each connected region the number of pixels whose hue value is between 340 and 360, and determining that fighting behavior exists if the pixel count of at least one connected region is larger than a threshold.
Connected-region labeling proceeds by scanning one row from left to right, then moving down one row and continuing the left-to-right scan; each time a pixel is scanned, the pixel values immediately to its left and above it are checked, or alternatively the pixel values of all four neighbors (up, down, left and right) are checked.
The specific steps are described below, taking the left-and-above check as an example:
assuming that the pixel value of the current position is 255, two adjacent pixels to the left and above it are checked (these two pixels must be scanned before the current pixel). The combination of these two pixel values and the label is the following four cases:
1) The pixel values on the left and the upper are 0, and a new mark (which indicates the start of a new connected domain) is given to the pixel at the current position;
2) Only one pixel value at the left side and the upper side is 255, and the pixel at the current position is the same as the pixel with the pixel value of 255 in the marks;
3) The pixel values on the left and the upper are 255 and the marks are the same, and the marks of the pixels at the current position are the same as the marks of the pixels on the left and the upper;
4) The pixel values on the left and the upper are 255 and the marks are different, the smaller mark is assigned to the pixel at the current position, and then the 4 steps are respectively executed after each trace back from the right to the left until the pixel at the beginning of the region is traced back.
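A compact OpenCV sketch of the hue-based check described above. OpenCV stores hue in the range 0-179, so the 340-360 degree band maps to roughly 170-179; the `min_pixels` threshold is an assumed value, and OpenCV's built-in labeling replaces the two-pass procedure.

```python
import cv2
import numpy as np

def has_blood_like_region(image_bgr, min_pixels=200):
    """Return True if some connected region contains more than `min_pixels`
    pixels whose hue falls in the 340-360 degree band (170-179 in OpenCV)."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0]
    mask = ((hue >= 170) & (hue <= 179)).astype(np.uint8)
    num_labels, labels = cv2.connectedComponents(mask)   # connected-region labeling
    for label in range(1, num_labels):                   # label 0 is the background
        if int(np.count_nonzero(labels == label)) > min_pixels:
            return True
    return False
```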
Contact-position judgment refers to whether a fist, elbow, foot or knee is in contact with a sensitive or vital part of another person. For example, a large number of training pictures is used to train the neural network model; the training pictures are labelled according to whether a sensitive part is contacted, the training pictures are input into the neural network model, and the trained neural network model is obtained by optimizing the loss function.
Judging whether fighting exists from speech refers to judging from the semantics, intonation and volume of the audio that accompanies the video stream. A speech spectrogram is generated from the audio data, and a first feature vector of the spectrogram is extracted with a DCNN (deep convolutional neural network) and an RNN connected in sequence; MFCCs (Mel-frequency cepstral coefficients) are extracted from the audio data and converted into a second feature vector through a nonlinear transformation; the first feature vector and the second feature vector are projected into a joint feature space to form joint features. The joint features are input into a fully-connected layer whose output is passed to a softmax layer for classification, completing the recognition of fighting behavior from speech.
Preferably, several of these methods are used to judge the fighting behavior; each method yields one judgment result, and the results are combined by a weighted average to obtain the final judgment.
Preferably, the neural network model can combine the speed of a limb and the part contacted by the limb movement to judge whether a fight is occurring. For example, a palm moving quickly towards another person's face is a fighting action, while a palm moving slowly to another person's face is an expression of affection. For the same person, the displacement of the human body between two consecutive frames is obtained by comparing the images; for fighting it is usually an upper or lower limb that moves. For example, if the lower-left corner of the person's first rectangular frame has coordinate A in the previous frame and coordinate B in the next frame, the ratio of the coordinate difference to the time difference between the frames is taken as the movement speed of the human body. This speed is compared with a set speed threshold; if it is higher than the threshold and the person's limb is in contact with another person's body, it is considered a fight.
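A minimal sketch of the speed check, assuming corner coordinates in pixels and the frame time gap in seconds; the threshold value itself is application-specific:

```python
def is_fast_motion(corner_prev, corner_next, dt, speed_threshold):
    """Ratio of the corner-coordinate difference to the time gap between two
    frames, compared against a set speed threshold (pixels per second)."""
    dx = corner_next[0] - corner_prev[0]
    dy = corner_next[1] - corner_prev[1]
    speed = (dx * dx + dy * dy) ** 0.5 / dt
    return speed > speed_threshold
```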
In addition, preferably, in step S6, whether the two persons are fighting can be judged comprehensively by a weighted summation of the limb-movement judgment of the two persons and the background judgment. For example, the limb-movement judgment has a weight of 0.8 and the background a weight of 0.2, with the weights summing to 1. The background judgment uses a trained CNN model to detect whether dangerous objects are present in the background and what state they are in, thereby judging whether the two persons in the two-person combined frame are fighting. The fight judgment corresponding to the two persons and the fight judgment corresponding to the background are weighted by these values and summed; if the result is higher than the set fight-probability threshold, the neural network model judges it to be a fight. Taking the influence of the background into account allows fighting behavior to be recognized more accurately and quickly.
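A small sketch of the weighted fusion with the example weights above; `limb_prob` and `background_prob` are the assumed outputs of the two judgments, each in [0, 1]:

```python
def fused_fight_score(limb_prob, background_prob, limb_weight=0.8, background_weight=0.2):
    """Weighted combination of the two-person limb-motion judgment and the
    background judgment; the 0.8 / 0.2 weights follow the example above."""
    return limb_weight * limb_prob + background_weight * background_prob

# e.g. fused_fight_score(0.9, 0.4) == 0.8, to be compared with the fight-probability threshold
```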
Furthermore, preferably, the loss function of the neural network model is optimized by gradient descent according to the following equation:

L_C(θ_C) = −Σ_{m∈M} [ Σ_{t∈T(m)} log P_fight(t | m; θ_C) + Σ_{f∈F(m)} log P_non-fight(f | m; θ_C) ] + λ‖θ_C‖_1

where C denotes the neural network model, θ_C denotes the first weight matrix of the neural network model to be optimized, L_C(θ_C) denotes the loss incurred when the first weight matrix of the model is θ_C, and m denotes a feature extracted from the feature set M. T(m) denotes the set of features m corresponding to fighting behavior, F(m) denotes the set of features m not corresponding to fighting behavior, t denotes any fighting feature taken from T(m), f denotes any non-fighting feature taken from F(m), P_fight(·) denotes the predicted probability of fighting behavior, and P_non-fight(·) denotes the predicted probability of non-fighting behavior. The neural network model is obtained by minimizing this negative conditional log-likelihood function (the loss function) with an added L1 regularization term, where λ is the regularization parameter.
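A small PyTorch sketch of this loss, assuming `p_fight` holds the predicted fight probabilities for features in T(m), `p_non_fight` the predicted non-fight probabilities for features in F(m), and `theta_c` the first weight matrix; the tensor layout is an assumption for illustration:

```python
import torch

def fight_nll_l1_loss(p_fight, p_non_fight, theta_c, lam):
    """Negative conditional log-likelihood plus L1 regularisation of theta_c,
    matching the reconstructed L_C(theta_C) above."""
    nll = -(torch.log(p_fight).sum() + torch.log(p_non_fight).sum())
    return nll + lam * theta_c.abs().sum()

# e.g. fight_nll_l1_loss(torch.tensor([0.9, 0.8]), torch.tensor([0.7]),
#                        torch.randn(4, 4), lam=1e-3)
```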
In an alternative embodiment, an association matrix of the persons in the two-person combined frame is also constructed to correct the feature map obtained in the neural network model. For example, for one person in the two-person combined frame, all persons whose distance from that person's contour is smaller than a second distance threshold are extracted from the a to b frames of images before and after the certain frame, the second distance threshold being larger than the first distance threshold. ReID (pedestrian re-identification) is then used to judge whether, in the certain frame and the b frames before and after it, any of those persons has already been judged to be fighting; if so, a distance vector is formed from the distances. Since that person was close (within the second distance threshold) to someone already fighting in the preceding frames, the probability that this person is fighting is larger.
Similarly, a distance vector is obtained for the other person in the two-person combined frame; the distance vectors of the two persons form an association matrix, which is multiplied with the pixel matrix corresponding to the feature map to correct the feature map, and the corrected feature map then goes through the subsequent recognition in the neural network model.
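A rough sketch of this correction; the 1/(1+d) weighting, the outer-product form of the association matrix and the element-wise multiplication (with the matrix assumed already resized to the feature map's spatial shape) are illustrative assumptions rather than details given in the patent:

```python
import numpy as np

def association_matrix(dists_a, dists_b):
    """Association matrix built from the two persons' distance vectors to
    previously flagged fighters; closer flagged neighbours give larger entries."""
    a = 1.0 / (1.0 + np.asarray(dists_a, dtype=np.float32))
    b = 1.0 / (1.0 + np.asarray(dists_b, dtype=np.float32))
    return np.outer(a, b)

def correct_feature_map(feature_map, assoc):
    """Element-wise correction of a feature map with the association matrix,
    assuming the matrix has been resized to the map's spatial shape."""
    return feature_map * assoc
```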
The above concerns the judgment of fighting between two persons; for other behaviors, the background between the two persons can likewise be used to assist the judgment. For example, a free-combat match between two people is not a fight, and this can be judged from the background: spectators in the background, the clothing of the two persons, a referee, a fence surrounding the two persons and the like serve as background that assists the judgment. Or, when an object is handed over between two people, the object is also used as background to help judge whether the hand-over occurs, rather than judging only from the human body movements. The background between two people has a different meaning for recognizing different behaviors; the present invention only describes this briefly and does not go into further detail.
Fig. 4 is a schematic diagram of a hardware architecture of an electronic device according to an embodiment of the invention. In this embodiment, the electronic device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. For example, it may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including a stand-alone server or a server cluster composed of a plurality of servers), etc. As shown in fig. 4, the electronic device 2 includes at least, but is not limited to, a memory 21, a processor 22 and a network interface 23, which are communicatively connected to each other via a system bus. The memory 21 includes at least one type of computer-readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as a hard disk or memory of the electronic device 2. In other embodiments, the memory 21 may also be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the electronic device 2. Of course, the memory 21 may also comprise both an internal storage unit of the electronic device 2 and an external storage device thereof. In this embodiment, the memory 21 is generally used for storing the operating system and the various application software installed in the electronic device 2, such as the behavior recognition program code. Further, the memory 21 may be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the electronic device 2, such as performing control and processing related to data interaction or communication with the electronic device 2. In this embodiment, the processor 22 is configured to execute the program code or process data stored in the memory 21, for example, execute the behavior recognition program.
The network interface 23 may comprise a wireless network interface or a wired network interface, and is typically used for establishing a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 is configured to connect the electronic device 2 to a push platform through a network, and to establish a data transmission channel and a communication connection between the electronic device 2 and the push platform. The network may be an Intranet, the Internet, a Global System for Mobile communication (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
Optionally, the electronic device 2 may also comprise a display, which may also be referred to as a display screen or display unit. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) display, or the like. The display is used for displaying information processed in the electronic device 2 and for displaying a visualized user interface.
It is noted that fig. 4 only shows an electronic device 2 having components 21-23, but it is understood that not all of the illustrated components are required to be implemented, and that more or fewer components may alternatively be implemented.
The memory 21 containing the readable storage medium may include an operating system, a behavior recognition program 50, and the like. The processor 22 implements the following steps when executing the behavior recognition program 50 in the memory 21:
s1, acquiring a video stream, and dividing the video stream into an image frame sequence consisting of multi-frame images;
s2, detecting the human body outline in each frame of image, and marking each human body by using a first rectangular frame;
s3, calculating the distance between any two first rectangular frames in each frame of image;
s4, if the distance between two first rectangular frames in a certain frame of image is smaller than a set distance threshold value, surrounding the two first rectangular frames by adopting a two-person combined frame, wherein the two-person combined frame is the smallest rectangular frame surrounding the two first rectangular frames;
s5, searching the multiple frames before and after the certain frame image for the same two persons as those in the two-person combined frame, forming two-person combined frames for them in those frames as well, and composing the two-person combined frames of the certain frame image and of the preceding and following multiple frames into a two-person combined frame sequence;
s6, inputting the two-person combined frame sequence into a neural network model, performing human behavior recognition on the input images through the neural network model to obtain a recognition result, and determining whether the recognition result belongs to a preset behavior category. For fighting, the images are classified into "fighting" and "non-fighting".
In the present embodiment, the behavior recognition program stored in the memory 21 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to complete the present invention. For example, fig. 5 shows a schematic diagram of the program modules of the behavior recognition program. In this embodiment, the behavior recognition program 50 may be divided into a video stream segmentation module 501, a human contour marking module 502, a distance acquisition module 503, a combined frame module 504, a combined frame sequence forming module 505 and a behavior recognition module 506. A program module referred to herein is a series of computer program instruction segments capable of performing a specific function, better suited than a whole program to describing the execution of the behavior recognition program in the electronic device 2. The functions of these program modules are described below.
The video stream segmentation module 501 is configured to obtain a video stream, and segment the video stream into a sequence of image frames, i.e. images frame by frame.
The human body contour marking module 502 detects the human body contour in each frame of image using the CNN model, locates the contour, and frames each human body with a first rectangular frame. The CNN model also outputs the coordinates of the four corner points of the sliding window, from which the human body frame is selected.
The distance obtaining module 503 is configured to calculate a distance between any two human contours in the image. The distance between the contours of any two persons in the image may be calculated according to the coordinates of the corner points of the first rectangular frames of the calibrated human contours, for example, the distance between the lower left corner coordinates of the two first rectangular frames 100 is calculated, so that the distance between the two human contours can be obtained.
The combined frame module 504 is configured to enclose the two human bodies with the two-person combined frame 200 if the distance between the human body contours of two persons in a certain frame of image is smaller than the set distance threshold, the two-person combined frame being the smallest rectangular frame surrounding the two rectangular frames.
The combined frame sequence forming module 505 uses ReID to determine whether the human bodies in the b frames of images before and after the certain frame are the same persons: the same two persons are found in the preceding and following b frames, a two-person combined frame is formed for them in those frames as well, and the two-person combined frames of the two persons in the preceding and following b frames and in the certain frame are composed into a two-person combined frame sequence.
The behavior recognition module 506 is configured to input the two-person combined frame sequences into the neural network model and classify the input images into "fighting" and "non-fighting" through the model, thereby judging whether the two persons in each two-person combined frame sequence are fighting.
In addition, an embodiment of the present invention also provides a computer-readable storage medium, which can be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory and the like. The computer-readable storage medium includes a behavior recognition program and the like; when executed by the processor 22, the behavior recognition program 50 performs the following operations:
s1, acquiring a video stream, and dividing the video stream into an image frame sequence consisting of multi-frame images;
s2, detecting the human body outline in each frame of image, and marking each human body by using a first rectangular frame;
s3, calculating the distance between any two first rectangular frames in each frame of image;
s4, if the distance between two first rectangular frames in a certain frame of image is smaller than a set distance threshold value, surrounding the two first rectangular frames by adopting a two-person combined frame, wherein the two-person combined frame is the smallest rectangular frame surrounding the two first rectangular frames;
s5, searching the multiple frames before and after the certain frame image for the same two persons as those in the two-person combined frame, forming two-person combined frames for them in those frames as well, and composing the two-person combined frames of the certain frame image and of the preceding and following multiple frames into a two-person combined frame sequence;
s6, inputting the two-person combined frame sequence into a neural network model, and performing human behavior recognition on the input image through the neural network model to obtain a recognition result, and determining whether the recognition result belongs to a preset behavior category.
The embodiment of the computer readable storage medium of the present invention is substantially the same as the above-mentioned behavior recognition method and the embodiment of the electronic device 2, and will not be described herein.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The behavior recognition method is applied to the electronic device and is characterized by comprising the following steps of:
s1, acquiring a video stream, and dividing the video stream into an image frame sequence consisting of multi-frame images;
s2, detecting the human body outline in each frame of image, and marking each human body by using a first rectangular frame;
s3, calculating the distance between any two first rectangular frames in each frame of image;
s4, if the distance between two first rectangular frames in a certain frame of image is smaller than a set distance threshold value, surrounding the two first rectangular frames by adopting a two-person combined frame, wherein the two-person combined frame is the smallest rectangular frame surrounding the two first rectangular frames;
s5, searching the multiple frames before and after the certain frame image for the same two persons as those in the two-person combined frame, forming two-person combined frames for them in those frames as well, and composing the two-person combined frames of the certain frame image and of the preceding and following multiple frames into a two-person combined frame sequence;
s6, inputting the two-person combined frame sequence into a neural network model, and performing human behavior recognition through the neural network model to obtain a recognition result and confirm whether the recognition result belongs to a preset behavior category.
2. The behavior recognition method according to claim 1, wherein the step of detecting a human body contour in each frame of the image and marking each human body with a first rectangular frame comprises:
the method comprises the steps of sliding a sliding window on an image according to a preset track, extracting spatial features of objects in the sliding window through a CNN model, classifying the extracted spatial features through an SVM classifier, determining whether a human body exists in the sliding window, outputting coordinates of four corner points of the sliding window, and forming a first rectangular frame to mark the human body outline according to the coordinates of the four corner points.
3. The behavior recognition method according to claim 2, wherein step S5 further comprises:
identifying whether the same two persons as those in the two-person combined frame exist in the preceding and following multiple frames of images; and extracting the optical-flow diagram features of each first rectangular frame in the preceding and following multiple frames, and inputting them, combined with the spatial features of the first rectangular frames, into an RNN model to extract temporal features, so as to judge whether the same two persons as those in the two-person combined frame exist.
4. The behavior recognition method according to claim 1, wherein the neural network model includes a plurality of first 2D convolution sub-networks W_2D connected in sequence, and, in parallel, a 3D convolution sub-network W_3D and a second 2D convolution sub-network V_2D; for each frame of image, the first 2D convolution sub-network W_2D is used to obtain a plurality of feature maps, and the feature maps obtained from all frames of the image frame sequence are combined into a feature set; the feature set is respectively input into the 3D convolution sub-network W_3D and the second 2D convolution sub-network V_2D for processing, and the output of the second 2D convolution sub-network is fused with the output of the 3D convolution sub-network to obtain a recognition result, confirming whether the recognition result belongs to a preset behavior category.
5. The behavior recognition method according to claim 1, wherein in step S6, the method for performing behavior recognition on the input image by the neural network model is as follows:
the human body behaviors are judged by combining at least one mode of facial expression recognition, skin color recognition of exposed parts, voice recognition and contact part recognition, one judgment result is obtained corresponding to each mode, and each judgment result is weighted and averaged to be used as a final judgment result.
6. The behavior recognition method of claim 5, wherein the step of determining the behavior of the human body by combining at least one of facial expression recognition, bare site skin color recognition, voice recognition, contact site recognition comprises:
converting the image into an HSV image, scanning the image from left to right and from top to bottom, marking connected regions by comparing each pixel with its neighboring pixel values, counting in each connected region the number of pixels whose hue value is between 340 and 360, and determining that fighting behavior exists if the pixel count of at least one connected region is larger than a threshold.
7. The behavior recognition method according to claim 5, wherein the step of determining the behavior of the human body by combining at least one of facial expression recognition, bare site skin color recognition, voice recognition, contact site recognition further comprises:
extracting the audio in the video stream and generating a speech spectrogram, extracting a first feature vector of the speech spectrogram with a DCNN and an RNN connected in sequence, extracting MFCCs from the audio data and converting them into a second feature vector through a nonlinear transformation, projecting the first feature vector and the second feature vector into a joint feature space to form joint features, inputting the joint features into a fully-connected layer, and passing the output of the fully-connected layer to a softmax layer to judge whether the behavior belongs to a preset behavior category.
8. The behavior recognition method according to claim 1, wherein the preset behavior category is fighting behavior, and in step S6, the method by which the neural network model classifies the behavior category of the input image is as follows: the ratio of the difference between the coordinates of corresponding corner points of the first rectangular frame in two successive frames to the time difference between the frames is taken as the movement speed of the human body in the first rectangular frame; the movement speed is compared with a set speed threshold, and if it is higher than the speed threshold, it is detected whether the other human body in the two-person combined frame containing the first rectangular frame is in contact with the human body of the first rectangular frame in those frames; if there is contact, it is judged to be fighting behavior.
9. An electronic device, comprising: the system comprises a memory and a processor, wherein a behavior recognition program is stored in the memory, and the behavior recognition program realizes the following steps when being executed by the processor:
s1, acquiring a video stream, and dividing the video stream into an image frame sequence consisting of multi-frame images;
s2, detecting the human body outline in each frame of image, and marking each human body by using a first rectangular frame;
s3, calculating the distance between any two first rectangular frames in each frame of image;
s4, if the distance between two first rectangular frames in a certain frame of image is smaller than a set distance threshold value, surrounding the two first rectangular frames by adopting a two-person combined frame, wherein the two-person combined frame is the smallest rectangular frame surrounding the two first rectangular frames;
s5, searching the multiple frames before and after the certain frame image for the same two persons as those in the two-person combined frame, forming two-person combined frames for them in those frames as well, and composing the two-person combined frames of the certain frame image and of the preceding and following multiple frames into a two-person combined frame sequence;
s6, inputting the two-person combined frame sequence into a neural network model, and performing human behavior recognition through the neural network model to obtain a recognition result and confirm whether the recognition result belongs to a preset behavior category.
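To make the S1–S6 flow of the device claim above concrete, here is a minimal end-to-end sketch assuming OpenCV for frame extraction; detect_people and behavior_model are hypothetical callables standing in for the person detector and the neural network model, and the centre-to-centre distance measure, the thresholds, and the context window are illustrative choices.

```python
import cv2

def pair_box(box_a, box_b):
    """S4: smallest rectangle enclosing two person boxes (x1, y1, x2, y2)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def box_distance(box_a, box_b):
    """S3: centre-to-centre distance (one reasonable reading of 'distance')."""
    ca = ((box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2)
    cb = ((box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2)
    return ((ca[0] - cb[0]) ** 2 + (ca[1] - cb[1]) ** 2) ** 0.5

def recognize(video_path, detect_people, behavior_model,
              dist_threshold=150.0, context=5):
    """S1-S6 pipeline sketch. detect_people(frame) -> list of person boxes and
    behavior_model(list of crops) -> class label are hypothetical callables."""
    cap = cv2.VideoCapture(video_path)              # S1: split video into frames
    frames, ok = [], True
    while ok:
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()

    results = []
    for i, frame in enumerate(frames):
        boxes = detect_people(frame)                # S2: first rectangular frames
        for a in range(len(boxes)):
            for b in range(a + 1, len(boxes)):
                if box_distance(boxes[a], boxes[b]) >= dist_threshold:
                    continue                        # S3/S4: keep close pairs only
                x1, y1, x2, y2 = map(int, pair_box(boxes[a], boxes[b]))
                # S5: crop the same pair region from the surrounding frames.
                lo, hi = max(0, i - context), min(len(frames), i + context + 1)
                clip = [frames[j][y1:y2, x1:x2] for j in range(lo, hi)]
                results.append(behavior_model(clip))  # S6: classify the sequence
    return results
```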
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, implement the behavior recognition method of any one of claims 1 to 8.
CN201910832181.7A 2019-09-04 2019-09-04 Behavior recognition method, behavior recognition device and computer-readable storage medium Active CN110738101B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910832181.7A CN110738101B (en) 2019-09-04 2019-09-04 Behavior recognition method, behavior recognition device and computer-readable storage medium
PCT/CN2019/117803 WO2021042547A1 (en) 2019-09-04 2019-11-13 Behavior identification method, device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910832181.7A CN110738101B (en) 2019-09-04 2019-09-04 Behavior recognition method, behavior recognition device and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN110738101A CN110738101A (en) 2020-01-31
CN110738101B true CN110738101B (en) 2023-07-25

Family

ID=69267767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910832181.7A Active CN110738101B (en) 2019-09-04 2019-09-04 Behavior recognition method, behavior recognition device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110738101B (en)
WO (1) WO2021042547A1 (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11763551B2 (en) * 2020-03-03 2023-09-19 Assa Abloy Ab Systems and methods for fine tuning image classification neural networks
CN111401169A (en) * 2020-03-06 2020-07-10 国网湖南省电力有限公司 Power supply business hall service personnel behavior identification method based on monitoring video information
CN111325292B (en) * 2020-03-11 2023-05-02 中国电子工程设计院有限公司 Object behavior recognition method and device
CN111753724A (en) * 2020-06-24 2020-10-09 上海依图网络科技有限公司 Abnormal behavior identification method and device
CN111813996B (en) * 2020-07-22 2022-03-01 四川长虹电器股份有限公司 Video searching method based on sampling parallelism of single frame and continuous multi-frame
CN111860430B (en) * 2020-07-30 2023-04-07 浙江大华技术股份有限公司 Identification method and device of fighting behavior, storage medium and electronic device
CN112560700A (en) * 2020-12-17 2021-03-26 北京赢识科技有限公司 Information association method and device based on motion analysis and electronic equipment
CN113239874B (en) * 2021-06-01 2024-05-03 平安科技(深圳)有限公司 Behavior gesture detection method, device, equipment and medium based on video image
CN113283381B (en) * 2021-06-15 2024-04-05 南京工业大学 Human body action detection method suitable for mobile robot platform
CN113408433B (en) * 2021-06-22 2023-12-05 华侨大学 Intelligent monitoring gesture recognition method, device, equipment and storage medium
CN113408435B (en) * 2021-06-22 2023-12-05 华侨大学 Security monitoring method, device, equipment and storage medium
CN113780158B (en) * 2021-09-08 2023-10-31 宁波书写芯忆科技有限公司 Intelligent concentration detection method
CN114444895A (en) * 2021-12-31 2022-05-06 深圳云天励飞技术股份有限公司 Cleaning quality evaluation method and related equipment
CN114639136B (en) * 2022-01-22 2024-03-08 西北工业大学 Long video micro expression detection method based on shallow network
CN114429617A (en) * 2022-01-26 2022-05-03 中煤科工集团重庆智慧城市科技研究院有限公司 Abnormal recognition result processing method applied to smart city box body detection
CN114842372A (en) * 2022-03-31 2022-08-02 北京的卢深视科技有限公司 Contact type foul detection method and device, electronic equipment and storage medium
CN115082836B (en) * 2022-07-23 2022-11-11 深圳神目信息技术有限公司 Behavior recognition-assisted target object detection method and device
CN117115926B (en) * 2023-10-25 2024-02-06 天津大树智能科技有限公司 Human body action standard judging method and device based on real-time image processing

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015184764A1 (en) * 2014-11-17 2015-12-10 中兴通讯股份有限公司 Pedestrian detection method and device
CN107463912A (en) * 2017-08-10 2017-12-12 武汉大学深圳研究院 Video human Activity recognition method based on motion conspicuousness
WO2018058573A1 (en) * 2016-09-30 2018-04-05 富士通株式会社 Object detection method, object detection apparatus and electronic device
CN108960114A (en) * 2018-06-27 2018-12-07 腾讯科技(深圳)有限公司 Human body recognition method and device, computer readable storage medium and electronic equipment
WO2019041519A1 (en) * 2017-08-29 2019-03-07 平安科技(深圳)有限公司 Target tracking device and method, and computer-readable storage medium
CN109657533A (en) * 2018-10-27 2019-04-19 深圳市华尊科技股份有限公司 Pedestrian recognition methods and Related product again
WO2019091417A1 (en) * 2017-11-09 2019-05-16 清华大学 Neural network-based identification method and device
EP3602397A1 (en) * 2017-05-15 2020-02-05 Deepmind Technologies Limited Neural network systems for action recognition in videos

Also Published As

Publication number Publication date
WO2021042547A1 (en) 2021-03-11
CN110738101A (en) 2020-01-31

Similar Documents

Publication Publication Date Title
CN110738101B (en) Behavior recognition method, behavior recognition device and computer-readable storage medium
CN110532984B (en) Key point detection method, gesture recognition method, device and system
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
CN109359538B (en) Training method of convolutional neural network, gesture recognition method, device and equipment
CN107358149B (en) Human body posture detection method and device
US10534957B2 (en) Eyeball movement analysis method and device, and storage medium
CN108898047B (en) Pedestrian detection method and system based on blocking and shielding perception
WO2021051601A1 (en) Method and system for selecting detection box using mask r-cnn, and electronic device and storage medium
CN109960742B (en) Local information searching method and device
CN108648211B (en) Small target detection method, device, equipment and medium based on deep learning
CN110738207A (en) character detection method for fusing character area edge information in character image
CN109448007B (en) Image processing method, image processing apparatus, and storage medium
CN109685037B (en) Real-time action recognition method and device and electronic equipment
US10650234B2 (en) Eyeball movement capturing method and device, and storage medium
US20190228209A1 (en) Lip movement capturing method and device, and storage medium
CN110532925B (en) Driver fatigue detection method based on space-time graph convolutional network
CN109740416B (en) Target tracking method and related product
WO2019033570A1 (en) Lip movement analysis method, apparatus and storage medium
CN110674680B (en) Living body identification method, living body identification device and storage medium
CN112381061B (en) Facial expression recognition method and system
CN109117746A (en) Hand detection method and machine readable storage medium
CN111723687A (en) Human body action recognition method and device based on neural network
TWI776176B (en) Device and method for scoring hand work motion and storage medium
CN111126254A (en) Image recognition method, device, equipment and storage medium
CN111178310A (en) Palm feature recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant