WO2020168960A1 - Video analysis method and apparatus - Google Patents


Info

Publication number
WO2020168960A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
video image
classification information
information
intercepted
Application number
PCT/CN2020/074895
Other languages
French (fr)
Chinese (zh)
Inventor
范慧慧 (Fan Huihui)
王天宇 (Wang Tianyu)
高在伟 (Gao Zaiwei)
Original Assignee
杭州海康威视数字技术股份有限公司 (Hangzhou Hikvision Digital Technology Co., Ltd.)
Application filed by 杭州海康威视数字技术股份有限公司
Publication of WO2020168960A1


Classifications

    • G: PHYSICS; G06: COMPUTING; CALCULATING OR COUNTING
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F 18/00: Pattern recognition
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42: Higher-level, semantic classification or understanding of sport video content
    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 40/168: Feature extraction; Face representation

Definitions

  • This application relates to the field of surveillance technology, and in particular to a video analysis method and device.
  • A monitoring device is installed in a scene that needs to be monitored. The monitoring device collects a video stream of the scene and analyzes it to determine whether persons or vehicles have illegally broken into the scene.
  • In the related art, the video stream is analyzed in its entirety, that is, the target in every frame of video image of the video stream is accurately classified and recognized, which requires a large amount of calculation.
  • the purpose of the embodiments of the present application is to provide a video analysis method and device to reduce the amount of calculation.
  • an embodiment of the present application provides a video analysis method, including:
  • the detecting the monitoring target in the collected video stream includes: detecting the moving target in the collected video stream;
  • the intercepting the video image containing the monitoring target from the video stream includes:
  • Intercepting one or more frames of video images containing the moving target from the video stream.
  • The classification and recognition of the monitoring target in the intercepted video image to obtain the classification information of the monitoring target includes:
  • Inputting the intercepted video image into a pre-trained first neural network model, and using the first neural network model to classify the moving target in the video image to obtain the classification information of the moving target output by the model.
  • the detecting the monitoring target in the collected video stream includes: performing face recognition in the collected video stream to obtain a recognition result;
  • the intercepting the video image containing the monitoring target from the video stream includes:
  • the classifying and identifying the monitoring target in the intercepted video image to obtain the classification information of the monitoring target includes:
  • the intercepted video image is matched with the face data stored in the face database to obtain the classification information of the face.
  • the matching the captured video image with the face data stored in the face database to obtain the classification information of the face includes:
  • the modeling data is matched with the face data stored in the face database to obtain the classification information of the face.
  • The classification information of the face is the first tag information or the second tag information. The first tag information indicates that the face database contains face data that successfully matches the modeling data, and the second tag information indicates that the face database contains no face data that successfully matches the modeling data.
  • the method further includes:
  • the method further includes:
  • the preset alarm condition includes: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle.
  • the method further includes:
  • the preset alarm condition includes: the classification information of the human face is the first tag information, or the classification information of the human face is the second tag information.
  • an embodiment of the present application also provides a video analysis device, including:
  • The detection module is used to detect the monitoring target in the collected video stream;
  • The interception module is used to intercept a video image containing the monitoring target from the video stream;
  • The classification module is used to classify and recognize the monitoring target in the intercepted video image to obtain the classification information of the monitoring target.
  • the detection module is specifically configured to: detect a moving target in the collected video stream;
  • the interception module is specifically configured to intercept one or more frames of video images containing the moving target from the video stream.
  • The classification module is specifically used for: inputting the intercepted video image into a pre-trained first neural network model, and using the first neural network model to classify the moving target in the video image to obtain the classification information of the moving target output by the model.
  • the detection module is specifically configured to: perform face recognition in the collected video stream to obtain a recognition result
  • the interception module is specifically configured to: according to the recognition result, intercept a face area from a video image containing a face in the video stream as the intercepted video image;
  • the classification module is specifically configured to match the intercepted video image with the face data stored in the face database to obtain the classification information of the face.
  • the classification module is specifically used for:
  • the modeling data is matched with the face data stored in the face database to obtain the classification information of the face.
  • The classification information of the face is the first tag information or the second tag information. The first tag information indicates that the face database contains face data that successfully matches the modeling data, and the second tag information indicates that the face database contains no face data that successfully matches the modeling data.
  • the device further includes:
  • the first judgment module is used to judge whether the classification information of the monitoring target meets the preset alarm condition; if it does, trigger the first alarm module;
  • the first alarm module is used to output alarm information.
  • the device further includes:
  • The second judgment module is used to judge whether the classification information of the moving target meets a preset alarm condition, where the preset alarm condition includes: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle; if the condition is met, the second alarm module is triggered;
  • the second alarm module is used to output alarm information.
  • the device further includes:
  • The third judgment module is used to judge whether the classification information of the face meets a preset alarm condition, where the preset alarm condition includes: the classification information of the face is the first tag information, or the classification information of the face is the second tag information; if the condition is met, the third alarm module is triggered;
  • the third alarm module is used to output alarm information.
  • an embodiment of the present application also provides an electronic device, including a processor and a memory;
  • the memory is used to store computer programs
  • the processor is configured to execute a program stored in the memory to implement any of the steps of the video analysis method described above.
  • an embodiment of the present application further provides a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the steps of any of the foregoing video analysis methods are implemented.
  • an embodiment of the present application further provides a computer program, which, when executed by a processor, implements the steps of any of the foregoing video analysis methods.
  • In the embodiments of the present application, the monitoring target is detected in the collected video stream; the video image containing the monitoring target is intercepted from the video stream; and the monitoring target in the intercepted video image is classified and recognized to obtain the classification information of the monitoring target. It can be seen that, in the solution provided by the embodiments of the present application, the monitoring targets in all video images of the video stream are not accurately classified and recognized; instead, the video images containing the monitoring target are intercepted, and only the monitoring targets in the intercepted video images are accurately classified and recognized, reducing the amount of calculation.
  • FIG. 1 is a schematic diagram of the first flow of a video analysis method provided by an embodiment of this application;
  • FIG. 2 is a schematic diagram of a second flow of a video analysis method provided by an embodiment of this application;
  • FIG. 3 is a schematic diagram of the interaction between a monitoring point and an NVR provided by an embodiment of the application
  • FIG. 4 is a schematic diagram of a third process of a video analysis method provided by an embodiment of the application.
  • FIG. 5 is another schematic diagram of the interaction between the monitoring point and the NVR provided by the embodiment of the application.
  • FIG. 6 is a schematic structural diagram of a video analysis device provided by an embodiment of the application.
  • FIG. 7 is a schematic diagram of a structure of an electronic device provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of a video analysis system provided by an embodiment of the application.
  • embodiments of the present application provide a video analysis method and device.
  • The method and device can be applied to a camera, such as an IPC (IP Camera, i.e., a network camera), to an NVR (Network Video Recorder), to other electronic equipment, or to a video analysis system; this is not specifically limited.
  • FIG. 1 is a schematic diagram of the first flow of a video analysis method provided by an embodiment of this application, including:
  • the monitoring target may be a person, a vehicle, an object, an animal, etc.
  • the captured video stream includes multiple frames of video images. Based on this, the foregoing detection of the monitoring target in the collected video stream may specifically be: detecting the monitoring target in each frame of video image of the video stream. Since the video stream includes multiple frames of video images, the monitoring target may be detected in the multiple frames of video images included in the video stream.
  • the monitoring target may be a moving target.
  • S101 may include: detecting a moving target in the collected video stream.
  • The frame difference method, the background subtraction algorithm, or the optical flow method can be used to detect the moving target in the video stream.
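  • As an illustrative sketch (not part of the application), the frame difference method can be reduced to thresholding the mean absolute difference between two consecutive grayscale frames; the 4×4 frames, threshold value, and function name below are hypothetical:

```python
def frame_difference_motion(prev_frame, curr_frame, threshold=15.0):
    """Detect motion via the mean absolute pixel difference between two
    grayscale frames (2D lists of 0-255 ints). The threshold is a
    hypothetical tuning value, not taken from the application."""
    total, count = 0, 0
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        for p, c in zip(prev_row, curr_row):
            total += abs(p - c)
            count += 1
    return (total / count) > threshold

# A static scene versus one where a bright "target" enters:
static = [[10] * 4 for _ in range(4)]
moved = [row[:] for row in static]
moved[1][1] = moved[1][2] = 255  # simulated moving object

print(frame_difference_motion(static, static))  # False: no change
print(frame_difference_motion(static, moved))   # True: motion detected
```

This deliberately cheap check matches the role described below for the "rough detection algorithm": only frames where it fires go on to the expensive classification step.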
  • S101 may include: detecting the moving target in each frame of video image of the video stream.
  • the monitoring target may be a human face.
  • S101 may include: performing face recognition in the collected video stream to obtain the recognition result.
  • a face recognition algorithm can be used to recognize faces in a video stream and obtain the recognition results.
  • S101 may include: performing face recognition in each frame of video image in the video stream to obtain the recognition result of each frame of video image.
  • the recognition result of each frame of video image may include a first recognition result or a second recognition result, where the first recognition result indicates that the video image includes a human face, and the second recognition result indicates that the video image does not include a human face.
  • the recognition result of a frame of video image includes the first recognition result
  • the recognition result of the frame of video image may also include position information of the face region in the video image.
  • the face area is the area where the face in the video image is located.
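  • As a hedged sketch of the per-frame recognition result described above (the class and field names are illustrative, not from the application), the first/second recognition result and the optional face-region position can be modeled as:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class RecognitionResult:
    """Per-frame face recognition result: contains_face=True corresponds to
    the first recognition result, False to the second; face_region carries
    the position information of the face area, if any."""
    contains_face: bool
    face_region: Optional[Tuple[int, int, int, int]] = None  # (x, y, w, h)

hit = RecognitionResult(contains_face=True, face_region=(40, 20, 64, 64))
miss = RecognitionResult(contains_face=False)
print(hit.face_region)   # region later used to crop the face area in S102
print(miss.face_region)  # None: no face, nothing to intercept
```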
  • S102 Intercept a video image containing the monitoring target.
  • the video image is an image in a video stream.
  • the foregoing S102 may specifically be: intercepting a video image containing a monitoring target from a video stream.
  • the monitoring target is detected in multiple frames of video images, it is possible to intercept multiple frames of video images.
  • the monitoring target is a moving target.
  • S102 may include: intercepting one or more frames of video images containing the moving target from the video stream.
  • Specifically, S102 may include: if one frame of video image in the video stream contains a moving target, intercepting that video image from the video stream; if multiple frames of video images in the video stream all contain moving targets, intercepting those multiple frames of video images from the video stream.
  • Alternatively, it may be determined whether a key frame of the video stream contains a moving target. If it does, multiple frames of video images within a few seconds before and after the key frame are intercepted and formed into a short video. In this way, detection does not need to be run on all video images of the video stream, which further reduces the amount of calculation.
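  • A minimal sketch of intercepting the frames within a few seconds before and after a key frame, as described above (the timestamps, window width, and function name are hypothetical):

```python
def clip_around_key_frame(frame_times, key_time, window_s=2.0):
    """Return the indices of frames whose timestamps fall within
    +/- window_s seconds of the key frame, forming the short video.
    window_s is an assumed value; the application only says 'a few seconds'."""
    return [i for i, t in enumerate(frame_times)
            if abs(t - key_time) <= window_s]

# Toy stream: one frame every 0.5 s, key frame at t = 2.0 s.
times = [i * 0.5 for i in range(10)]  # 0.0, 0.5, ..., 4.5
print(clip_around_key_frame(times, key_time=2.0))  # indices near t = 2.0
```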
  • the monitoring target is a human face.
  • S102 may include: according to the above-mentioned recognition result, intercepting a face area from a video image containing a face in the video stream as the intercepted video image. Or, according to the recognition result, one or more frames of video images containing the face area may be intercepted in the video stream.
  • S102 may include: according to the above recognition result, if it is determined that a frame of the video image of the video stream contains a human face, intercepting the human face area from the video image as the intercepted video image.
  • S102 may further include: according to the above recognition result, if it is determined that one frame of the video image of the video stream contains a human face, intercepting the video image from the video stream.
  • S103 Obtain classification information of the monitored target by recognizing the intercepted video image.
  • the surveillance target in the intercepted video image is classified and identified to obtain the classification information of the surveillance target.
  • the monitoring target is a moving target.
  • Specifically, S103 may include: inputting the intercepted video image into the first neural network model obtained by pre-training, and using the first neural network model to classify the moving target in the video image to obtain the classification information of the moving target output by the first neural network model.
  • If the monitoring target is a moving target and multiple frames of video images are intercepted, the intercepted video images can be input into the pre-trained first neural network model, and the first neural network model is used to classify the moving target in each video image to obtain the classification information of the moving target contained in each video image output by the model.
  • the moving target can be people, vehicles, and so on.
  • the first neural network model is a model for classifying moving targets.
  • The process of training the first neural network model may include: obtaining sample images, which may contain moving targets such as people or vehicles; adding a label to each sample image indicating the type of its moving target, such as vehicle or person; inputting the sample images into a neural network of a preset structure and iteratively adjusting the parameters of the network with the labels as supervision; and completing training when the iteration end condition is met.
  • the first neural network model can be a deep neural network or a convolutional neural network.
  • the video image intercepted in S102 is input to the first neural network model, and the first neural network model outputs classification information of the moving target in the video image.
  • the classification information of the moving target is the type of the moving target, such as vehicles, people and so on.
  • The first neural network model is used to recognize the moving target in the video image and obtain its classification information. If the classification information is a person or a vehicle, the relevant personnel can be reminded in time for subsequent processing, achieving effective perimeter protection.
  • the moving target detection algorithm is a rough detection algorithm.
  • The calculation complexity of the moving target detection algorithm is low, so the amount of calculation is small; after the moving target is detected, only a small part of the video images in the video stream is intercepted, and accurate classification and recognition are performed only on the moving targets in that small part of the video images.
  • the first neural network model is used to identify the classification information of the moving targets.
  • In other words, accurate classification and recognition of moving targets is not performed on all video images of the video stream; compared with doing so, the amount of calculation is reduced.
  • the monitoring target is a human face.
  • S103 may include: matching the intercepted video image with the face data stored in the face database to obtain the classification information of the face contained in the intercepted video image.
  • the face data may be a face image or feature information extracted from a face image. The embodiment of the present application does not limit this.
  • If multiple video images are intercepted, each intercepted video image is matched with the face data stored in the face database to obtain the classification information of the face contained in each intercepted video image.
  • the face data of authorized persons can be stored in the face database, and the video image intercepted in S102 can be matched with the face data stored in the face database to determine whether the person in the video stream is an authorized person.
  • the classification information of the face contained in the video image may be the first tag information or the second tag information.
  • The first tag information indicates that the face database contains face data that successfully matches the intercepted video image, and the second tag information indicates that the face database contains no face data that successfully matches the intercepted video image.
  • the human face classification information contained in the intercepted video image may also be an authorized person or an unauthorized person. In this case, the person corresponding to the intercepted video image is an authorized person or an unauthorized person.
  • the face data of a designated person can be stored in the face database, and the video image intercepted in S102 can be matched with the face data stored in the face database to determine whether the person in the video stream is the designated person.
  • the classification information of the face contained in the video image may be the first tag information or the second tag information.
  • The first tag information indicates that the face database contains face data that successfully matches the intercepted video image, and the second tag information indicates that the face database contains no face data that successfully matches the intercepted video image.
  • the human face classification information contained in the intercepted video image may also be a designated person or a non-designated person. In this case, the person corresponding to the intercepted video image is a designated person or a non-designated person.
  • Specifically, S103 may include: inputting the intercepted video image into a second neural network model obtained by pre-training, using the second neural network model to convert the intercepted video image into modeling data, and matching the modeling data with the face data stored in the face database to obtain the classification information of the face contained in the intercepted video image.
  • the human face classification information contained in the intercepted video image is the first tag information or the second tag information.
  • the aforementioned modeling data is the data output after the second neural network model processes the intercepted video image.
  • If multiple video images are intercepted, they can be input into the pre-trained second neural network model, which converts each intercepted video image into its corresponding modeling data; the modeling data corresponding to each intercepted video image is then matched with the face data stored in the face database to obtain the classification information of the face contained in each intercepted video image.
  • The second neural network model may be a face modeling model, which converts a face image into modeling data; the modeling data is a kind of structured data.
  • the process of training to obtain the second neural network model may include: obtaining a sample face image and labels of objects in the sample face image; inputting the sample face image into a neural network with a preset structure to The label is supervised, and the parameters of the neural network are adjusted iteratively; when the iteration end condition is met, the second neural network model that has been trained is obtained.
  • the neural network with a preset structure can be a deep neural network or a convolutional neural network.
  • The network layer that outputs the modeling data may be specified first.
  • the face data stored in the face database is modeling data obtained after the sample face image is transformed by the second neural network model. That is, the face data stored in the face database is data output by the specified network layer of the second neural network model after the sample face image is input to the second neural network model.
  • The modeling data obtained by converting the video image intercepted in S102 is matched with the face data in the face database. If the matching succeeds, the person corresponding to the intercepted video image is an authorized person or a designated person; if the matching fails, the person corresponding to the intercepted video image is an unauthorized person or an unspecified person.
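  • As an illustrative sketch of matching modeling data against the face database (the application does not specify a distance measure; cosine similarity, the threshold, the tag labels, and all names below are assumptions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_face(modeling_data, face_db, threshold=0.9):
    """Return the first tag (match found) or second tag (no match).
    face_db maps person IDs to stored modeling data; the threshold is
    a hypothetical tuning value."""
    for person_id, stored in face_db.items():
        if cosine_similarity(modeling_data, stored) >= threshold:
            return ("first_tag", person_id)   # face data matched in database
    return ("second_tag", None)               # no match: stranger/unspecified

db = {"alice": [1.0, 0.0, 0.2], "bob": [0.1, 1.0, 0.0]}
probe = [0.98, 0.02, 0.21]  # modeling data of an intercepted face
print(match_face(probe, db))
```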
  • The second neural network model is used to obtain the classification information of the face contained in the intercepted video image. Based on the classification information, it is determined whether the person is an authorized person or a designated person, and relevant personnel are promptly reminded according to the judgment result for follow-up processing, which enables effective stranger alarms or identification of designated personnel.
  • face recognition is performed on the video stream first.
  • the face recognition algorithm is a rough detection algorithm.
  • The calculation complexity of the face recognition algorithm is low, so the amount of calculation is small; after a face is detected, only a small part of the video images or image areas in the video stream is intercepted, and accurate classification and recognition are performed only on the faces in those intercepted images or areas to determine the classification information of the faces.
  • In other words, accurate classification and recognition of faces is not performed on all video images of the video stream; compared with doing so, the amount of calculation is reduced.
  • After S103, the method may further include: judging whether the classification information of the monitoring target meets the preset alarm condition; if so, outputting alarm information; if not, no processing is required.
  • the monitoring target is a moving target.
  • the preset alarm condition may include: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle.
  • Taking as an example a preset alarm condition in which the classification information of the moving target is a person and/or the classification information of the moving target is a vehicle: it is judged whether the classification information of the moving target is a person or a vehicle, and if the judgment result is a person or a vehicle, alarm information is output.
  • the monitoring target is a human face.
  • the preset alarm condition may include: the classification information of the human face is the first tag information, or the classification information of the human face is the second tag information.
  • In one case, the face data of authorized persons is stored in the face database, and it is judged whether the face database contains face data that successfully matches the modeling data corresponding to the intercepted video image. If such data exists, the person corresponding to the intercepted video image is an authorized person, the classification information of the face contained in the intercepted video image is an authorized person, and no processing is required. If it does not exist, the person corresponding to the intercepted video image is a stranger, the classification information of the face contained in the intercepted video image is a stranger, and alarm information is output.
  • In another case, the face data of a designated person is stored in the face database, and it is determined whether the face database contains face data that successfully matches the modeling data corresponding to the intercepted video image. If it exists, the person corresponding to the intercepted video image is the designated person, the classification information of the face contained in the intercepted video image is the designated person, and alarm information is output. If it does not exist, the person corresponding to the intercepted video image is an unspecified person, the classification information of the face contained in the intercepted video image is an unspecified person, and no processing is required.
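  • The two preset alarm conditions above can be sketched as follows (the tag strings and mode names are hypothetical labels, not from the application):

```python
def moving_target_alarm(classification):
    """Preset alarm condition for moving targets: the classification
    information is a person, and/or it is a vehicle."""
    return classification in ("person", "vehicle")

def face_alarm(tag, database_mode):
    """Alarm on the second tag (no match: stranger) when the database
    stores authorized persons, or on the first tag (match: designated
    person found) when it stores designated persons."""
    if database_mode == "authorized":
        return tag == "second_tag"    # stranger alarm
    if database_mode == "designated":
        return tag == "first_tag"     # designated-person alarm
    raise ValueError(f"unknown database mode: {database_mode}")

print(moving_target_alarm("vehicle"))            # True
print(face_alarm("second_tag", "authorized"))    # True: stranger detected
print(face_alarm("second_tag", "designated"))    # False: unspecified person
```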
  • In one implementation, S101 and S102 may be executed by an IPC, which then sends the intercepted video image to an NVR, and the NVR executes the subsequent steps.
  • In summary, the monitoring target is detected in the collected video stream; the video image containing the monitoring target is intercepted from the video stream; and the monitoring target in the intercepted video image is classified and recognized to obtain its classification information. It can be seen that, in the solution provided by the embodiment of the present application, the monitoring targets in all video images of the video stream are not accurately classified and recognized; instead, the video images containing the monitoring target are intercepted, and only the monitoring targets in the intercepted video images are accurately classified and recognized, reducing the amount of calculation.
  • FIG. 2 is a schematic diagram of the second flow of the video analysis method provided by an embodiment of the application, including:
  • S202 Intercept one or more frames of video images containing the moving target from the video stream.
  • S203 Input the intercepted video image into the pre-trained first neural network model, and use the first neural network model to classify the moving target in the intercepted video image to obtain the moving target output by the first neural network model Classification information.
  • S204 Determine whether the classification information meets the preset alarm condition; where the preset alarm condition includes: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle. If it matches, execute S205.
  • the classification information in the above step S204 is the classification information of the moving target.
  • If the classification information of the moving target is a person or a vehicle, it is determined in S204 that the classification information meets the preset alarm condition, and S205 is executed to output alarm information. If the classification information of the moving target is neither a person nor a vehicle, it is determined that the preset alarm condition is not met, and no processing is performed.
  • the embodiment shown in FIG. 2 of the present application can be used to determine whether a person or vehicle enters the scene, and an alarm is issued if the determination result is yes.
  • the first neural network model is used to identify the moving target in the video image to obtain the classification information of the moving target. If the classification information is a person or a vehicle, the relevant personnel can be reminded for subsequent processing in time. Achieve effective perimeter prevention.
  • the moving target detection algorithm can be understood as a rough detection algorithm.
  • the computational complexity of the moving target detection algorithm is low and the amount of calculation is small; after a moving target is detected, only a small portion of the video images in the video stream is intercepted, and accurate classification and recognition of the moving target is performed only on this small portion.
  • the first neural network model is used to identify the classification information of the moving target.
  • in other words, the solution provided in the embodiment of this application does not perform accurate classification and recognition of moving targets in all video images of the video stream; compared with doing so, the amount of calculation is reduced.
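The two-stage idea described above (a cheap motion detector on every frame, the expensive classifier only on intercepted frames) can be sketched roughly as follows. The frame format, difference threshold, and `classify()` stub are illustrative assumptions, not the application's actual algorithms.

```python
def frame_diff(prev, curr):
    """Mean absolute pixel difference between two grayscale frames."""
    total = sum(abs(p - c)
                for row_p, row_c in zip(prev, curr)
                for p, c in zip(row_p, row_c))
    return total / (len(curr) * len(curr[0]))

def classify(frame):
    # Stand-in for the expensive first neural network model.
    return "person"

def analyze(stream, threshold=10.0):
    """Run the cheap detector on every frame; classify only intercepted ones."""
    intercepted = []
    prev = stream[0]
    for curr in stream[1:]:
        if frame_diff(prev, curr) > threshold:  # rough motion detection
            intercepted.append((curr, classify(curr)))
        prev = curr
    return intercepted

static = [[0] * 4 for _ in range(4)]   # toy 4x4 grayscale frames
changed = [[50] * 4 for _ in range(4)]
results = analyze([static, static, changed, static])
print(len(results))  # only the frames around the change reach the classifier
```

The saving is exactly the point made in the text: `classify` runs on two frames here instead of all four.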
  • in a related solution, infrared detectors are used to emit infrared lasers, and the area covered by the infrared lasers forms a monitoring area.
  • when someone breaks in, the waveform of the infrared laser changes, so whether someone has broken into the monitoring area can be judged from the waveform of the infrared laser.
  • in this infrared-laser-based monitoring solution, since the infrared laser emitted by one infrared detector covers a limited area, multiple infrared detectors need to be installed if the monitoring area is large, and the monitoring cost is high.
  • in the solution of this application, the monitoring area is monitored based on the images collected by an image acquisition device.
  • the field of view of one image acquisition device is relatively large, so a single image acquisition device can monitor a larger area, and its cost is lower than that of multiple infrared detectors, which reduces monitoring costs.
  • the following describes an implementation manner in which the video analysis method provided in an embodiment of the present application is applied in a perimeter defense scenario in conjunction with FIG. 3.
  • the monitoring point in Figure 3 can be an IPC (IP camera).
  • the monitoring point collects the video stream and performs moving target detection on it. According to the detection result, one or more frames of video images containing the moving target are intercepted from the video stream, and the intercepted video images are sent to the NVR (network video recorder).
  • the NVR receives the video image sent by the monitoring point, inputs it into the pre-trained first neural network model, and uses the first neural network model to classify the moving target in the video image, obtaining the classification information of the moving target output by the model.
  • the classification information of the moving target can be persons, vehicles, objects, etc., and is not specifically limited.
  • the preset alarm condition is: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle. If the classification information of the moving target output by the first neural network model is a vehicle or a person, the NVR outputs alarm information.
  • outputting the alarm information in this way can reduce false alarms caused by disturbances, pet interference, and light changes, and improve the alarm accuracy.
  • FIG. 4 is a schematic diagram of a third process of a video analysis method provided by an embodiment of this application, including:
  • S401 Perform face recognition in the collected video stream to obtain a recognition result.
  • Step S402 may specifically be: according to the recognition result, intercepting a face area in a video image containing a face in the video stream as the intercepted video image.
  • the captured video image can be considered as a face image.
  • S403 Input the intercepted face area into a second neural network model obtained through pre-training, and use the second neural network model to convert the face area into modeling data.
  • Step S403 may specifically include: inputting the intercepted video image into a second neural network model obtained in advance, and using the second neural network model to convert the intercepted video image into modeling data.
  • S404 Obtain classification information of the face region by matching the modeling data with the face data stored in the face database.
  • Step S404 may specifically be: matching the modeling data with the face data stored in the face database to obtain the classification information of the face contained in the intercepted video image.
  • the classification information of the human face contained in the intercepted video image is the first tag information or the second tag information.
  • the face image of an authorized person or a designated person can be collected in advance, the face image can be converted into modeling data using the second neural network model, and the converted modeling data can be stored as face data in the face database in. Then, the second neural network model is used to convert the intercepted video image into modeling data, and the modeling data is matched with the face data stored in the face database to obtain the classification information of the face area.
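A minimal sketch of this enroll-then-match flow, assuming a toy modeling-data vector and cosine-similarity matching. The threshold, vectors, and tag-name strings are illustrative assumptions, not taken from this application.

```python
import math

def cosine(a, b):
    """Cosine similarity between two modeling-data vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Enrollment: modeling data of authorized persons stored in the face database.
face_db = {"authorized_person": [0.9, 0.1, 0.3]}

def classify_face(modeling_data, threshold=0.95):
    """Return first tag info on a successful match, second tag info otherwise."""
    for enrolled in face_db.values():
        if cosine(modeling_data, enrolled) >= threshold:
            return "first_tag"   # database contains a matching face
    return "second_tag"          # no matching face: stranger

print(classify_face([0.9, 0.1, 0.3]))   # enrolled vector: first_tag
print(classify_face([-0.3, 0.9, 0.1]))  # dissimilar vector: second_tag
```

In practice the vectors would come from the second neural network model, and the similarity metric and threshold would be tuned to that model.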
  • S405 Determine whether the classification information meets the preset alarm condition, where the preset alarm condition includes: the classification information of the face contained in the intercepted video image is the first tag information, or the classification information of the face contained in the intercepted video image is the second tag information. If the condition is met, execute S406.
  • the classification information in the above step S405 is the classification information of the human face contained in the intercepted video image.
  • in one case, the preset alarm condition includes that the classification information of the face contained in the intercepted video image is the first tag information. If the classification information obtained in step S404 is the first tag information, step S406 is executed to output the alarm information; if it is the second tag information, the preset alarm condition is not met and no processing is required.
  • in another case, the preset alarm condition includes that the classification information of the face contained in the intercepted video image is the second tag information. If the classification information obtained in step S404 is the second tag information, step S406 is executed to output the alarm information; if it is the first tag information, the preset alarm condition is not met and no processing is required.
  • the embodiment in Figure 4 of this application can be applied as follows: the face data of authorized persons is stored in a face database, and the modeling data obtained by converting the intercepted video image is matched against the face data stored in the database to determine whether the person in the video stream is an authorized person. If the person is determined to be a stranger, an alarm is issued.
  • the second neural network model is used to obtain the classification information of the face contained in the intercepted video image; according to this classification information, it is determined whether the person is an authorized or designated person, and based on the result the relevant personnel are promptly reminded for follow-up processing, which enables effective stranger alarms or identification of designated personnel.
  • face recognition is performed on the video stream first.
  • the face recognition algorithm can be understood as a rough detection algorithm: its computational complexity is low and the amount of calculation is small. After a face is detected, only a small portion of the video images, or only the face regions within those images, is intercepted from the video stream, and accurate classification and recognition, that is, face matching, is performed only on this small portion. This solution does not perform accurate classification and recognition of faces in all video images of the video stream; compared with doing so, the amount of calculation is reduced.
  • the following describes an implementation manner in which the video analysis method provided by an embodiment of the present application is applied to a stranger alarm scenario with reference to FIG. 5.
  • the monitoring point in Figure 5 can be an IPC.
  • the monitoring point collects the video stream, performs face recognition on it, and, according to the recognition result, intercepts one or more frames of video images containing faces from the video stream, or intercepts the face areas in those video images; the intercepted video images or face areas are then sent to the NVR.
  • the intercepted video images or face regions are collectively referred to as face images.
  • the NVR receives the face image sent by the monitoring point, inputs it into the pre-trained second neural network model, and uses the second neural network model to convert the face image into modeling data; the converted modeling data is then matched with the face data stored in the face database. If the matching is successful, the person corresponding to the face image is an authorized person, and the classification information of the face contained in the face image is "authorized person". If the matching is unsuccessful, the person corresponding to the face image is a stranger, the classification information of the face is "stranger", and an alarm message is output.
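The NVR-side decision just described can be sketched as follows. The `embed()` stub, the tolerance-based matcher, and the database contents are illustrative assumptions standing in for the second neural network model and the face database.

```python
# Toy face database: name -> enrolled modeling data (assumed 2-d vectors).
FACE_DB = [("employee_1", (1.0, 0.0)), ("employee_2", (0.0, 1.0))]

def embed(face_image):
    # Placeholder for the second neural network model: image -> modeling data.
    return face_image

def match(modeling_data, tol=0.1):
    """Return the enrolled name whose data is within tolerance, else None."""
    for name, enrolled in FACE_DB:
        if all(abs(a - b) <= tol for a, b in zip(modeling_data, enrolled)):
            return name
    return None

def nvr_handle(face_image):
    """Match succeeds -> authorized person; fails -> stranger plus alarm."""
    name = match(embed(face_image))
    if name is None:
        return ("stranger", "ALARM: stranger detected")
    return ("authorized: " + name, None)

print(nvr_handle((1.0, 0.05)))  # close to employee_1: authorized, no alarm
print(nvr_handle((0.5, 0.5)))   # matches nobody: stranger alarm
```

Only the final decision logic matters here; a real system would use learned embeddings and a calibrated distance threshold rather than element-wise tolerance.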
  • an embodiment of the present application also provides a video analysis device, as shown in FIG. 6, including:
  • the detection module 601 is used to detect a monitoring target in the collected video stream;
  • the interception module 602 is used to intercept the video image containing the monitoring target from the video stream;
  • the classification module 603 is used to classify and recognize the monitoring target in the intercepted video image to obtain the classification information of the monitoring target.
  • the detection module 601 is specifically configured to: detect a moving target in the collected video stream;
  • the interception module 602 is specifically configured to intercept one or more frames of video images containing the moving target from the video stream.
  • the classification module 603 is specifically configured to: input the intercepted video image into the pre-trained first neural network model, and use the first neural network model to classify the moving target in the intercepted video image to obtain the classification information of the moving target output by the first neural network model.
  • the detection module 601 is specifically configured to: perform face recognition in the collected video stream to obtain a recognition result;
  • the interception module 602 is specifically used for: intercepting the face area in the video image containing the face in the video stream according to the recognition result as the intercepted video image;
  • the classification module 603 is specifically configured to match the captured video image with the face data stored in the face database to obtain the classification information of the face.
  • in this case, the classification module 603 is specifically configured to: input the intercepted video image into the pre-trained second neural network model, and use the second neural network model to convert the intercepted video image into modeling data;
  • the modeling data is then matched with the face data stored in the face database to obtain the classification information of the face.
  • the classification information of the face is the first tag information or the second tag information; the first tag information indicates that the face database contains face data that successfully matches the modeling data, and the second tag information indicates that the face database contains no face data that successfully matches the modeling data.
  • the above-mentioned video analysis device may further include: a first judgment module and a first alarm module (not shown in the figure), wherein:
  • the first judgment module is used to judge whether the classification information of the monitoring target meets the preset alarm condition; if it meets, the first alarm module is triggered;
  • the first alarm module is used to output alarm information.
  • the above-mentioned video analysis device may further include: a second judgment module and a second alarm module (not shown in the figure), wherein:
  • the second judgment module is used to judge whether the classification information of the moving target meets the preset alarm conditions;
  • the preset alarm conditions include: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle; if the condition is met, trigger the second alarm module;
  • the second alarm module is used to output alarm information.
  • the above-mentioned video analysis device may further include: a third judgment module and a third alarm module (not shown in the figure), wherein:
  • the third judgment module is used to judge whether the classification information of the face meets the preset alarm condition;
  • the preset alarm condition includes: the classification information of the face is the first tag information, or the classification information of the face is the second tag information; if the condition is met, trigger the third alarm module;
  • the third alarm module is used to output alarm information.
  • in the embodiments of this application, the monitoring target is detected in the collected video stream; the video image containing the monitoring target is intercepted from the video stream; and the monitoring target in the intercepted video image is classified and identified to obtain its classification information. It can be seen that the solution provided by the embodiments of this application does not perform accurate classification and recognition on the monitoring targets in all video images of the video stream; instead, the video image containing the monitoring target is intercepted, and accurate classification and recognition is performed only on the monitoring target in that image, which reduces the amount of calculation.
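The module decomposition above (detection module 601, interception module 602, classification module 603) can be sketched as a simple composition, with toy stand-ins for each module. All stubs are illustrative assumptions, not the application's implementation.

```python
class VideoAnalysisDevice:
    """Illustrative composition of the three modules in FIG. 6."""

    def __init__(self, detect, intercept, classify):
        self.detect = detect        # detection module 601
        self.intercept = intercept  # interception module 602
        self.classify = classify    # classification module 603

    def analyze(self, stream):
        hits = self.detect(stream)             # where targets appear
        frames = self.intercept(stream, hits)  # intercept only those frames
        return [self.classify(f) for f in frames]

device = VideoAnalysisDevice(
    detect=lambda s: [i for i, frame in enumerate(s) if frame],  # toy detector
    intercept=lambda s, idx: [s[i] for i in idx],
    classify=lambda frame: "person",  # stand-in for the neural network model
)
print(device.analyze([0, 1, 0, 1]))  # only 2 of 4 frames are classified
```

The composition makes the cost argument concrete: the classifier is invoked once per intercepted frame, not once per frame of the stream.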
  • An embodiment of the present application also provides an electronic device, as shown in FIG. 7, including a processor 701 and a memory 702,
  • the memory 702 is used to store a computer program;
  • the processor 701 is configured to implement any of the above-mentioned video analysis methods when executing a program stored in the memory 702.
  • the memory mentioned in the above electronic device may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk storage.
  • the memory may also be at least one storage device located far away from the foregoing processor.
  • the aforementioned processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the embodiments of the present application also provide a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, any of the above-mentioned video analysis methods is implemented.
  • the embodiments of the present application also provide a computer program, which implements any of the above-mentioned video analysis methods when the computer program is executed by a processor.
  • An embodiment of the present application also provides a video analysis system, as shown in FIG. 8, including: a monitoring point and processing equipment, where:
  • the monitoring point is used to detect the monitoring target in the collected video stream; intercept the video image containing the monitoring target from the video stream; send the intercepted video image to the processing device;
  • the processing device is used to receive the video image, identify the monitoring target in the received video image, and obtain the classification information of the monitoring target.
  • the monitoring point may be an IPC, and the processing device may be an NVR; this is not specifically limited.
  • in the embodiments of this application, the monitoring target is not accurately identified in all video images of the video stream; instead, the video image containing the monitoring target is intercepted, and only the intercepted video image is accurately identified, which reduces the amount of calculation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

Provided are a video analysis method and apparatus. The method comprises: detecting a monitored target in a collected video stream (S101); capturing a video image containing the monitored target (S102); and performing recognition on the captured video image to obtain classification information of each monitored target (S103). It can be seen that the method involves capturing a video image containing a monitored target and performing classification and recognition only on the captured video image, instead of performing classification and recognition on the whole video stream, thereby reducing the amount of calculations.

Description

Video analysis method and device

This application claims priority to the Chinese patent application No. 201910121021.1, filed with the Chinese Patent Office on February 19, 2019 and entitled "Video analysis method and device", the entire contents of which are incorporated herein by reference.
Technical field

This application relates to the field of surveillance technology, and in particular to a video analysis method and device.
Background

In a related solution, a monitoring device is installed in a scene that needs to be monitored; the monitoring device collects a video stream of the scene and analyzes it to determine whether any person or vehicle has illegally entered the scene. This solution analyzes the video stream as a whole, that is, it performs accurate classification and recognition on the target in every frame of the video stream, which requires a large amount of calculation.
Summary of the invention

The purpose of the embodiments of this application is to provide a video analysis method and device so as to reduce the amount of calculation.
To achieve the above objective, an embodiment of this application provides a video analysis method, including:

detecting a monitoring target in a collected video stream;

intercepting, from the video stream, a video image containing the monitoring target;

performing classification and recognition on the monitoring target in the intercepted video image to obtain classification information of the monitoring target.
Optionally, detecting the monitoring target in the collected video stream includes: detecting a moving target in the collected video stream;

and intercepting the video image containing the monitoring target from the video stream includes:

intercepting, from the video stream, one or more frames of video images containing the moving target.
Optionally, performing classification and recognition on the monitoring target in the intercepted video image to obtain the classification information of the monitoring target includes:

inputting the intercepted video image into a pre-trained first neural network model, and using the first neural network model to classify the moving target in the intercepted video image to obtain the classification information of the moving target output by the first neural network model.
Optionally, detecting the monitoring target in the collected video stream includes: performing face recognition in the collected video stream to obtain a recognition result;

intercepting the video image containing the monitoring target from the video stream includes:

intercepting, according to the recognition result, a face region from a video image containing a face in the video stream as the intercepted video image;

and performing classification and recognition on the monitoring target in the intercepted video image to obtain the classification information of the monitoring target includes:

matching the intercepted video image with face data stored in a face database to obtain classification information of the face.
Optionally, matching the intercepted video image with the face data stored in the face database to obtain the classification information of the face includes:

inputting the intercepted video image into a pre-trained second neural network model, and using the second neural network model to convert the intercepted video image into modeling data;

matching the modeling data with the face data stored in the face database to obtain the classification information of the face, where the classification information of the face is first tag information or second tag information, the first tag information indicating that the face database contains face data that successfully matches the modeling data, and the second tag information indicating that the face database contains no face data that successfully matches the modeling data.
Optionally, after performing classification and recognition on the monitoring target in the intercepted video image to obtain the classification information of the monitoring target, the method further includes:

determining whether the classification information of the monitoring target meets a preset alarm condition;

and if so, outputting alarm information.
Optionally, after obtaining the classification information of the moving target output by the first neural network model, the method further includes:

determining whether the classification information of the moving target meets a preset alarm condition, and if so, outputting alarm information;

where the preset alarm condition includes: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle.
Optionally, after obtaining the classification information of the face, the method further includes:

determining whether the classification information of the face meets a preset alarm condition, and if so, outputting alarm information;

where the preset alarm condition includes: the classification information of the face is the first tag information, or the classification information of the face is the second tag information.
To achieve the above objective, an embodiment of this application also provides a video analysis device, including:

a detection module, configured to detect a monitoring target in a collected video stream;

an interception module, configured to intercept, from the video stream, a video image containing the monitoring target;

a classification module, configured to perform classification and recognition on the monitoring target in the intercepted video image to obtain classification information of the monitoring target.
Optionally, the detection module is specifically configured to detect a moving target in the collected video stream;

and the interception module is specifically configured to intercept, from the video stream, one or more frames of video images containing the moving target.
Optionally, the classification module is specifically configured to:

input the intercepted video image into a pre-trained first neural network model, and use the first neural network model to classify the moving target in the intercepted video image to obtain the classification information of the moving target output by the first neural network model.
Optionally, the detection module is specifically configured to perform face recognition in the collected video stream to obtain a recognition result;

the interception module is specifically configured to intercept, according to the recognition result, a face region from a video image containing a face in the video stream as the intercepted video image;

and the classification module is specifically configured to match the intercepted video image with face data stored in a face database to obtain classification information of the face.
Optionally, the classification module is specifically configured to:

input the intercepted video image into a pre-trained second neural network model, and use the second neural network model to convert the intercepted video image into modeling data;

match the modeling data with the face data stored in the face database to obtain the classification information of the face, where the classification information of the face is first tag information or second tag information, the first tag information indicating that the face database contains face data that successfully matches the modeling data, and the second tag information indicating that the face database contains no face data that successfully matches the modeling data.
Optionally, the device further includes:

a first judgment module, configured to determine whether the classification information of the monitoring target meets a preset alarm condition and, if so, to trigger a first alarm module;

the first alarm module, configured to output alarm information.
Optionally, the device further includes:

a second judgment module, configured to determine whether the classification information of the moving target meets a preset alarm condition, the preset alarm condition including that the classification information of the moving target is a person and/or that the classification information of the moving target is a vehicle, and, if so, to trigger a second alarm module;

the second alarm module, configured to output alarm information.
Optionally, the device further includes:

a third judgment module, configured to determine whether the classification information of the face meets a preset alarm condition, the preset alarm condition including that the classification information of the face is the first tag information or that the classification information of the face is the second tag information, and, if so, to trigger a third alarm module;

the third alarm module, configured to output alarm information.
To achieve the above objective, an embodiment of this application also provides an electronic device, including a processor and a memory;

the memory is configured to store a computer program;

the processor is configured to execute the program stored in the memory to implement the steps of any of the video analysis methods described above.
To achieve the above objective, an embodiment of this application also provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the steps of any of the video analysis methods described above are implemented.
To achieve the above objective, an embodiment of this application also provides a computer program which, when executed by a processor, implements the steps of any of the video analysis methods described above.
In the embodiments of this application, a monitoring target is detected in a collected video stream; a video image containing the monitoring target is intercepted from the video stream; and classification and recognition are performed on the monitoring target in the intercepted video image to obtain classification information of the monitoring target. It can be seen that in the solution provided by the embodiments of this application, accurate classification and recognition is not performed on the monitoring target in every video image of the video stream; instead, a video image containing the monitoring target is intercepted, and accurate classification and recognition is performed only on the monitoring target in the intercepted video image, which reduces the amount of calculation.
附图说明Description of the drawings
为了更清楚地说明本申请实施例或相关技术中的技术方案，下面将对实施例或相关技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to describe the technical solutions in the embodiments of the present application or in the related art more clearly, the accompanying drawings required for describing the embodiments or the related art are briefly introduced below. Obviously, the accompanying drawings in the following description show merely some embodiments of the present application, and those of ordinary skill in the art may derive other drawings from these drawings without creative effort.
图1为本申请实施例提供的视频分析方法的第一种流程示意图;FIG. 1 is a schematic diagram of the first flow of a video analysis method provided by an embodiment of this application;
图2为本申请实施例提供的视频分析方法的第二种流程示意图;FIG. 2 is a schematic diagram of a second flow of a video analysis method provided by an embodiment of this application;
图3为本申请实施例提供的监控点与NVR交互的一种示意图;FIG. 3 is a schematic diagram of the interaction between a monitoring point and an NVR provided by an embodiment of the application;
图4为本申请实施例提供的视频分析方法的第三种流程示意图;FIG. 4 is a schematic diagram of a third process of a video analysis method provided by an embodiment of the application;
图5为本申请实施例提供的监控点与NVR交互的另一种示意图;FIG. 5 is another schematic diagram of the interaction between the monitoring point and the NVR provided by the embodiment of the application;
图6为本申请实施例提供的视频分析装置的一种结构示意图;FIG. 6 is a schematic structural diagram of a video analysis device provided by an embodiment of the application;
图7为本申请实施例提供的电子设备的一种结构示意图;FIG. 7 is a schematic diagram of a structure of an electronic device provided by an embodiment of the application;
图8为本申请实施例提供的视频分析系统的一种结构示意图。FIG. 8 is a schematic structural diagram of a video analysis system provided by an embodiment of the application.
具体实施方式Detailed Description
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
为了解决上述技术问题，本申请实施例提供了一种视频分析方法及装置，该方法及装置可以应用于摄像机，如IPC(IP Camera，网络摄像机)，或者可以应用于NVR(Network Video Recorder，网络硬盘录像机)，或者可以应用于其他电子设备，或者可以应用于视频分析系统，具体不做限定。下面首先对本申请实施例提供的视频分析方法进行详细介绍。To solve the foregoing technical problems, embodiments of the present application provide a video analysis method and apparatus. The method and apparatus may be applied to a camera such as an IPC (IP Camera), to an NVR (Network Video Recorder), to other electronic devices, or to a video analysis system; this is not specifically limited herein. The video analysis method provided by the embodiments of the present application is first described in detail below.
图1为本申请实施例提供的视频分析方法的第一种流程示意图,包括:FIG. 1 is a schematic diagram of the first flow of a video analysis method provided by an embodiment of this application, including:
S101:在采集的视频流中检测监控目标。S101: Detect a monitoring target in the captured video stream.
本申请实施例中,监控目标可以为人员、车辆、物体、动物等。采集的视频流包括多帧视频图像。基于此,上述在采集的视频流中检测监控目标,具体可以为:在视频流的每帧视频图像中检测监控目标。由于视频流包括多帧视频图像,因此,可能在视频流包括的多帧视频图像中检测到监控目标。In the embodiment of the present application, the monitoring target may be a person, a vehicle, an object, an animal, etc. The captured video stream includes multiple frames of video images. Based on this, the foregoing detection of the monitoring target in the collected video stream may specifically be: detecting the monitoring target in each frame of video image of the video stream. Since the video stream includes multiple frames of video images, the monitoring target may be detected in the multiple frames of video images included in the video stream.
一种实施方式中,监控目标可以为运动目标。这种情况下,S101可以包括:在采集的视频流中检测运动目标。比如,可以采用帧差法、或者背景减除算法、或者光流法等算法,检测视频流中的运动目标。In one implementation, the monitoring target may be a moving target. In this case, S101 may include: detecting a moving target in the collected video stream. For example, the frame difference method, or the background subtraction algorithm, or the optical flow method can be used to detect the moving target in the video stream.
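The frame-difference idea named above can be illustrated with a minimal sketch. This is not part of the application text: frames are modeled as 2D lists of grayscale intensities, and the threshold values are hypothetical.

```python
# Illustrative frame-difference motion detection (assumed thresholds).
# A frame is a 2D list of grayscale pixel values (0-255).

def detect_motion(prev_frame, curr_frame, pixel_thresh=30, count_thresh=5):
    """Return True if enough pixels changed between two consecutive frames."""
    changed = 0
    for row_prev, row_curr in zip(prev_frame, curr_frame):
        for p, c in zip(row_prev, row_curr):
            if abs(c - p) > pixel_thresh:   # per-pixel intensity difference
                changed += 1
    return changed >= count_thresh          # crude "moving target present" flag
```

Background subtraction and optical flow follow the same pattern at this level: a cheap per-frame decision that marks frames as candidates for the later, more expensive classification step.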
具体的,若监控目标为运动目标,则S101可以包括:在视频流的每帧视频图像中检测运动目标。Specifically, if the monitoring target is a moving target, S101 may include: detecting the moving target in each frame of video image of the video stream.
另一种实施方式中,监控目标可以为人脸。这种情况下,S101可以包括:在采集的视频流中进行人脸识别,得到识别结果。比如,可以利用人脸识别算法,识别视频流中的人脸,得到识别结果。In another embodiment, the monitoring target may be a human face. In this case, S101 may include: performing face recognition in the collected video stream to obtain the recognition result. For example, a face recognition algorithm can be used to recognize faces in a video stream and obtain the recognition results.
具体的，若监控目标为人脸，则S101可以包括：在视频流的每帧视频图像中进行人脸识别，得到每帧视频图像的识别结果。每帧视频图像的识别结果可以包括第一识别结果或第二识别结果，其中，第一识别结果指示视频图像包括人脸，第二识别结果指示视频图像不包括人脸。若一帧视频图像的识别结果包括第一识别结果，则该帧视频图像的识别结果还可以包括视频图像中人脸区域的位置信息。其中，人脸区域即为该视频图像中人脸所在区域。Specifically, if the monitoring target is a human face, S101 may include: performing face recognition on each frame of video image in the video stream to obtain a recognition result of each frame of video image. The recognition result of each frame of video image may include a first recognition result or a second recognition result, where the first recognition result indicates that the video image includes a human face, and the second recognition result indicates that the video image does not include a human face. If the recognition result of a frame of video image includes the first recognition result, the recognition result of that frame may further include position information of the face region in the video image, where the face region is the region in which the face is located in the video image.
S102:截取包含该监控目标的视频图像。S102: Intercept a video image containing the monitored target.
本申请实施例中,视频图像为视频流中的图像。基于此,上述S102具体可以为:从视频流中截取包含监控目标的视频图像。这里,若在多帧视频图像中检测到监控目标,则可能截取到多帧视频图像。In this embodiment of the application, the video image is an image in a video stream. Based on this, the foregoing S102 may specifically be: intercepting a video image containing a monitoring target from a video stream. Here, if the monitoring target is detected in multiple frames of video images, it is possible to intercept multiple frames of video images.
一种实施方式中,监控目标为运动目标。这种情况下,S102可以包括:从视频流中截取包含运动目标的一帧或多帧视频图像。In one embodiment, the monitoring target is a moving target. In this case, S102 may include: intercepting one or more frames of video images containing the moving target from the video stream.
具体的，当监控目标为运动目标时，S102可以包括：若视频流中的一帧视频图像中包含运动目标，则从视频流中截取该视频图像；若在视频流中的多帧视频图像中均包含运动目标，则从视频流中截取这多帧视频图像。Specifically, when the monitoring target is a moving target, S102 may include: if one frame of video image in the video stream contains the moving target, intercepting that video image from the video stream; and if multiple frames of video images in the video stream all contain the moving target, intercepting those multiple frames of video images from the video stream.
一个实施例中，可以检测视频流的关键帧中是否包含运动目标。若包含，则截取该关键帧前后几秒内的多帧视频图像，由这多帧视频图像组成小视频。这样，不必检测视频流的所有视频图像，进一步减少了计算量。In one embodiment, it can be detected whether a key frame of the video stream contains a moving target. If it does, multiple frames of video images within a few seconds before and after the key frame are intercepted, and these frames form a short video. In this way, it is not necessary to run detection on every video image of the video stream, which further reduces the amount of computation.
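The key-frame clipping described in this embodiment can be sketched as follows. This is an illustrative stand-in, not the application's implementation: the window size is hypothetical, and `has_motion` stands for any cheap motion check such as the frame-difference sketch above.

```python
# Illustrative key-frame clipping: motion is checked only on key frames;
# when a key frame contains a moving target, the frames around it are
# collected into a short clip. `window` (in frames) is an assumed value.

def clip_around_keyframes(frames, key_indices, has_motion, window=2):
    """Collect indices of frames within `window` of any motion key frame."""
    clipped = set()
    for k in key_indices:
        if has_motion(frames[k]):                 # cheap check on key frames only
            lo, hi = max(0, k - window), min(len(frames) - 1, k + window)
            clipped.update(range(lo, hi + 1))     # frames before and after
    return sorted(clipped)
```

Only the frames returned here are passed on to the expensive classification step; every other frame is skipped entirely.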
另一种实施方式中,监控目标为人脸。这种情况下,S102可以包括:根据上述识别结果,在视频流中包含人脸的视频图像中截取人脸区域,作为所截取的视频图像。或者,也可以根据该识别结果,在视频流中截取包含人脸区域的一帧或多帧视频图像。In another embodiment, the monitoring target is a human face. In this case, S102 may include: according to the above-mentioned recognition result, intercepting a face area from a video image containing a face in the video stream as the intercepted video image. Or, according to the recognition result, one or more frames of video images containing the face area may be intercepted in the video stream.
具体的,当监控目标为人脸时,则S102可以包括:根据上述识别结果,若确定视频流的一帧视频图像中包含人脸,则从该视频图像中截取人脸区域,作为所截取的视频图像。当监控目标为人脸时,则S102还可以包括:根据上述识别结果,若确定视频流的一帧视频图像中包含人脸,则从视频流中截取该视频图像。Specifically, when the monitoring target is a human face, S102 may include: according to the above recognition result, if it is determined that a frame of the video image of the video stream contains a human face, intercepting the human face area from the video image as the intercepted video image. When the monitoring target is a human face, S102 may further include: according to the above recognition result, if it is determined that one frame of the video image of the video stream contains a human face, intercepting the video image from the video stream.
S103:通过对所截取的视频图像进行识别,得到监控目标的分类信息。S103: Obtain classification information of the monitored target by recognizing the intercepted video image.
本申请实施例中，在从视频流中截取到视频图像后，对所截取的视频图像中监控目标进行分类识别，得到监控目标的分类信息。In the embodiments of the present application, after video images are intercepted from the video stream, the monitoring target in the intercepted video images is classified and recognized to obtain the classification information of the monitoring target.
一种实施方式中，监控目标为运动目标。这种情况下，S103可以包括：将所截取的视频图像输入至预先训练得到的第一神经网络模型中，利用第一神经网络模型对视频图像中的运动目标进行分类，得到第一神经网络模型输出的运动目标的分类信息。In one implementation, the monitoring target is a moving target. In this case, S103 may include: inputting the intercepted video image into a pre-trained first neural network model, and classifying the moving target in the video image by using the first neural network model to obtain classification information of the moving target output by the first neural network model.
当监控目标为运动目标时，若所截取的视频图像为多个，将所截取的视频图像分别输入至预先训练得到的第一神经网络模型中，利用第一神经网络模型对每个视频图像中的运动目标进行分类，得到第一神经网络模型输出的每个视频图像包含的运动目标的分类信息。When the monitoring target is a moving target, if there are multiple intercepted video images, the intercepted video images are separately input into the pre-trained first neural network model, and the first neural network model classifies the moving target in each video image to obtain the classification information of the moving target contained in each video image output by the first neural network model.
举例来说，运动目标可以为人员、车辆等。该第一神经网络模型为一种对运动目标进行分类的模型。训练得到该第一神经网络模型的过程可以包括：获取样本图像，该样本图像中可以包括人员、或者车辆等运动目标；基于样本图像中的运动目标添加标签，该标签用于指示运动目标的种类，如车辆、人员等等；将样本图像输入至预设结构的神经网络中，以该标签为监督，对该神经网络的参数进行迭代调整；当满足迭代结束条件时，便得到了训练完成的第一神经网络模型。预设结构的神经网络可以为深度神经网络，也可以为卷积神经网络。For example, the moving target may be a person, a vehicle, or the like. The first neural network model is a model for classifying moving targets. The process of training the first neural network model may include: obtaining sample images, which may include moving targets such as persons or vehicles; adding labels based on the moving targets in the sample images, where a label indicates the type of the moving target, such as a vehicle or a person; inputting the sample images into a neural network with a preset structure, and iteratively adjusting the parameters of the neural network with the labels as supervision; when an iteration end condition is met, the trained first neural network model is obtained. The neural network with the preset structure may be a deep neural network or a convolutional neural network.
将S102中截取的视频图像输入该第一神经网络模型,该第一神经网络模型输出视频图像中运动目标的分类信息。运动目标的分类信息为运动目标的种类,如车辆、人员等等。The video image intercepted in S102 is input to the first neural network model, and the first neural network model outputs classification information of the moving target in the video image. The classification information of the moving target is the type of the moving target, such as vehicles, people and so on.
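The classification step above can be sketched as follows. The class set and the scoring function are hypothetical stand-ins (not the application's actual model): the trained first neural network model is treated here as a black box that returns one score per class, and the top-scoring class is the classification information.

```python
# Illustrative classification step. `score_fn` stands in for the trained
# first neural network model; CLASSES is an assumed label set.

CLASSES = ["person", "vehicle", "other"]

def classify_moving_target(image, score_fn):
    """Run the model on one intercepted image and return the top class."""
    scores = score_fn(image)                        # e.g. softmax output
    best = max(range(len(CLASSES)), key=lambda i: scores[i])
    return CLASSES[best]                            # classification information
```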
举例来说，一些场景对安全性的要求较高，需要对这些场景进行周界防范，也就是判断是否有人员或者车辆进入场景。应用本申请实施例提供的方案，一方面，利用第一神经网络模型识别视频图像中的运动目标，得到运动目标的分类信息，若分类信息为人员或者车辆，可以及时提醒相关人员进行后续处理，实现了有效的周界防范。另一方面，先对视频流进行运动目标检测，运动目标检测算法为一种粗糙的检测算法，运动目标检测算法的运算复杂度较低，计算量较小；在检测到运动目标后，截取视频流中的小部分视频图像，仅对这小部分视频图像中的运动目标进行精确的分类识别，具体的，利用第一神经网络模型识别运动目标的分类信息，本申请实施例提供的方案中，未对视频流的所有视频图像中的运动目标进行精确的分类识别，相比于对视频流的所有视频图像中的运动目标进行精确的分类识别，减少了计算量。For example, some scenes have high security requirements and need perimeter protection, that is, determining whether a person or vehicle enters the scene. With the solution provided by the embodiments of the present application, on the one hand, the first neural network model is used to recognize the moving target in the video image and obtain its classification information; if the classification information is a person or a vehicle, relevant personnel can be promptly reminded to perform subsequent processing, achieving effective perimeter protection. On the other hand, moving-target detection is performed on the video stream first; the moving-target detection algorithm is a coarse detection algorithm with low computational complexity and a small amount of computation. After a moving target is detected, only a small portion of the video images in the video stream is intercepted, and accurate classification and recognition is performed only on the moving targets in this small portion, specifically by using the first neural network model to obtain their classification information. In the solution provided by the embodiments of the present application, accurate classification and recognition is not performed on the moving targets in all video images of the video stream, which reduces the amount of computation compared with doing so for all video images.
另一种实施方式中,监控目标为人脸。这种情况下,S103可以包括:将所截取的视频图像与人脸数据库中存储的人脸数据进行匹配,得到所截取的视频图像包含的人脸的分类信息。其中,人脸数据可以为人脸图像,也可以为从人脸图像中提取的特征信息。本申请实施例对此不进行限定。In another embodiment, the monitoring target is a human face. In this case, S103 may include: matching the intercepted video image with the face data stored in the face database to obtain the classification information of the face contained in the intercepted video image. Among them, the face data may be a face image or feature information extracted from a face image. The embodiment of the present application does not limit this.
当监控目标为人脸时，若所截取的视频图像为多个，将所截取的视频图像分别与人脸数据库中存储的人脸数据进行匹配，得到所截取的每一视频图像包含的人脸的分类信息。When the monitoring target is a human face, if there are multiple intercepted video images, the intercepted video images are separately matched against the face data stored in the face database to obtain the classification information of the face contained in each intercepted video image.
举例来说，一些场景只允许授权人员进入，需要对这些场景进行陌生人(即非授权人员)识别。这种情况下可以采用本申请实施例提供的方案。比如，可以在人脸数据库中存储授权人员的人脸数据，将S102中所截取的视频图像与人脸数据库中存储的人脸数据进行匹配，以判断视频流中的人员是否为授权人员。视频图像包含的人脸的分类信息可以为第一标签信息或第二标签信息，第一标签信息指示人脸数据库中存在与所截取的视频图像匹配成功的人脸数据，第二标签信息指示人脸数据库中不存在与所截取的视频图像匹配成功的人脸数据。所截取的视频图像包含的人脸的分类信息也可以为授权人员或者非授权人员，此时，所截取的视频图像对应的人员为授权人员或者非授权人员。For example, some scenes allow only authorized personnel to enter, and strangers (that is, unauthorized personnel) need to be identified in these scenes. In this case, the solution provided by the embodiments of the present application can be used. For example, the face data of authorized personnel can be stored in the face database, and the video image intercepted in S102 can be matched against the face data stored in the face database to determine whether the person in the video stream is an authorized person. The classification information of the face contained in the video image may be first tag information or second tag information, where the first tag information indicates that the face database contains face data that successfully matches the intercepted video image, and the second tag information indicates that the face database contains no face data that successfully matches the intercepted video image. The classification information of the face contained in the intercepted video image may also be "authorized person" or "unauthorized person"; in this case, the person corresponding to the intercepted video image is an authorized person or an unauthorized person.
再举一例，一些场景中需要对指定人员进行识别，比如考勤场景，或者VIP(Very Important Person，重要人物)识别场景。这些场景中也可以采用本申请实施例提供的技术方案。比如，可以在人脸数据库中存储指定人员的人脸数据，将S102中所截取的视频图像与人脸数据库中存储的人脸数据进行匹配，以判断视频流中的人员是否为指定人员。视频图像包含的人脸的分类信息可以为第一标签信息或第二标签信息，第一标签信息指示人脸数据库中存在与所截取的视频图像匹配成功的人脸数据，第二标签信息指示人脸数据库中不存在与所截取的视频图像匹配成功的人脸数据。所截取的视频图像包含的人脸的分类信息也可以为指定人员或者非指定人员，此时，所截取的视频图像对应的人员为指定人员或者非指定人员。To give another example, some scenes require recognition of designated persons, such as attendance scenes or VIP (Very Important Person) recognition scenes. The technical solution provided by the embodiments of the present application can also be used in these scenes. For example, the face data of designated persons can be stored in the face database, and the video image intercepted in S102 can be matched against the face data stored in the face database to determine whether the person in the video stream is a designated person. The classification information of the face contained in the video image may be first tag information or second tag information, where the first tag information indicates that the face database contains face data that successfully matches the intercepted video image, and the second tag information indicates that the face database contains no such face data. The classification information of the face contained in the intercepted video image may also be "designated person" or "non-designated person"; in this case, the person corresponding to the intercepted video image is a designated person or a non-designated person.
当所截取的视频图像为人脸区域时，一种实施方式中，S103可以包括：将所截取的视频图像输入至预先训练得到的第二神经网络模型中，利用第二神经网络模型将所截取的视频图像转化为建模数据；将建模数据与人脸数据库中存储的人脸数据进行匹配，得到所截取的视频图像包含的人脸的分类信息。所截取的视频图像包含的人脸的分类信息为第一标签信息或第二标签信息。上述建模数据为第二神经网络模型对所截取的视频图像处理后所输出的数据。When the intercepted video image is a face region, in one implementation, S103 may include: inputting the intercepted video image into a pre-trained second neural network model, and converting the intercepted video image into modeling data by using the second neural network model; and matching the modeling data against the face data stored in the face database to obtain the classification information of the face contained in the intercepted video image. The classification information of the face contained in the intercepted video image is the first tag information or the second tag information. The modeling data is the data output by the second neural network model after processing the intercepted video image.
当所截取的视频图像为人脸区域时，若所截取的视频图像为多个，可以将所截取的视频图像分别输入至预先训练得到的第二神经网络模型中，利用第二神经网络模型将所截取的每一视频图像转化为该视频图像对应的建模数据；将所截取的每一视频图像对应的建模数据与人脸数据库中存储的人脸数据进行匹配，得到所截取的每一视频图像包含的人脸的分类信息。When the intercepted video image is a face region, if there are multiple intercepted video images, the intercepted video images can be separately input into the pre-trained second neural network model, the second neural network model converts each intercepted video image into modeling data corresponding to that video image, and the modeling data corresponding to each intercepted video image is matched against the face data stored in the face database to obtain the classification information of the face contained in each intercepted video image.
第二神经网络模型可以为一种人脸建模模型,该第二神经网络模型可以将人脸图像转化为建模数据,建模数据是一种结构体数据。The second neural network model may be a face modeling model, and the second neural network model may convert a face image into modeling data, and the modeling data is a kind of structure data.
本申请实施例中，训练得到该第二神经网络模型的过程可以包括：获取样本人脸图像以及样本人脸图像中对象的标签；将样本人脸图像输入至预设结构的神经网络中，以标签为监督，对该神经网络的参数进行迭代调整；当满足迭代结束条件时，便得到了训练完成的第二神经网络模型。预设结构的神经网络可以为深度神经网络，也可以为卷积神经网络。本申请实施例中可以先指定输出建模数据的网络层。In the embodiment of the present application, the process of training the second neural network model may include: obtaining sample face images and labels of the objects in the sample face images; inputting the sample face images into a neural network with a preset structure, and iteratively adjusting the parameters of the neural network with the labels as supervision; when an iteration end condition is met, the trained second neural network model is obtained. The neural network with the preset structure may be a deep neural network or a convolutional neural network. In the embodiment of the present application, the network layer whose output serves as the modeling data may be specified in advance.
本申请实施例中，人脸数据库中存储的人脸数据为样本人脸图像经过第二神经网络模型转化后得到的建模数据。也就是，人脸数据库中存储的人脸数据为样本人脸图像输入第二神经网络模型后第二神经网络模型的指定网络层输出的数据。将S102中截取的视频图像转化后得到的建模数据与人脸数据库中的人脸数据进行匹配，如果匹配成功，则表示所截取的视频图像对应的人员为授权人员或指定人员；如果匹配不成功，则表示所截取的视频图像对应的人员为非授权人员或非指定人员。In the embodiment of the present application, the face data stored in the face database is modeling data obtained after sample face images are converted by the second neural network model. That is, the face data stored in the face database is the data output by the specified network layer of the second neural network model after a sample face image is input into the second neural network model. The modeling data obtained by converting the video image intercepted in S102 is matched against the face data in the face database. If the matching succeeds, the person corresponding to the intercepted video image is an authorized person or a designated person; if the matching fails, the person corresponding to the intercepted video image is an unauthorized person or a non-designated person.
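The matching step above can be sketched as follows. The details are assumptions for illustration only: the application does not specify the similarity measure or threshold, so this sketch treats the modeling data as an embedding vector and uses cosine similarity with a hypothetical threshold, with the two outcomes standing in for the first and second tag information.

```python
# Illustrative face-database matching. Embeddings stand in for the
# "modeling data"; the cosine measure and 0.8 threshold are assumptions.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_face(embedding, face_db, threshold=0.8):
    """Return 'authorized' if any stored embedding matches, else 'stranger'."""
    for stored in face_db:
        if cosine(embedding, stored) >= threshold:
            return "authorized"   # first tag information: a match exists
    return "stranger"             # second tag information: no match
```

In practice, the stored vectors would be the specified network layer's outputs for each enrolled sample face image, computed once at enrollment time.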
应用本申请实施例提供的方案，一方面，利用第二神经网络模型，得到所截取的视频图像包含的人脸的分类信息，根据该分类信息，判断人员是否为授权人员或指定人员，并根据判断结果及时提醒相关人员进行后续处理，这样能够实现有效的陌生人报警、或者指定人员识别。另一方面，先对视频流进行人脸识别，人脸识别算法为一种粗糙的检测算法，人脸识别算法的运算复杂度较低，计算量较小；在检测到运动目标后，截取视频流中的小部分视频图像或图像区域，仅对所截取的小部分视频图像或图像区域中的人脸进行精确的分类识别，确定人脸的分类信息，本申请实施例提供的方案中，未对视频流的所有视频图像中的人脸进行精确的分类识别，相比于对视频流的所有视频图像中的人脸进行精确的分类识别，减少了计算量。With the solution provided by the embodiments of the present application, on the one hand, the second neural network model is used to obtain the classification information of the face contained in the intercepted video image; based on this classification information, it is determined whether the person is an authorized or designated person, and relevant personnel are promptly reminded to perform subsequent processing according to the determination result, which enables effective stranger alarms or designated-person recognition. On the other hand, face recognition is performed on the video stream first; the face recognition algorithm is a coarse detection algorithm with low computational complexity and a small amount of computation. After the target is detected, only a small portion of the video images or image regions in the video stream is intercepted, and accurate classification and recognition is performed only on the faces in the intercepted portion to determine the classification information of the faces. In the solution provided by the embodiments of the present application, accurate classification and recognition is not performed on the faces in all video images of the video stream, which reduces the amount of computation compared with doing so for all video images.
作为一种实施方式,在S103之后,还可以包括:判断监控目标的分类信息是否符合预设报警条件;如果符合,输出报警信息。如果不符合,则可以不做任何处理。As an implementation manner, after S103, it may further include: judging whether the classification information of the monitoring target meets the preset alarm condition; if so, outputting the alarm information. If it does not meet the requirements, no treatment is required.
一种实施方式中,监控目标为运动目标。这种情况下,预设报警条件可以包括:运动目标的分类信息为人员,和/或,运动目标的分类信息为车辆。In one embodiment, the monitoring target is a moving target. In this case, the preset alarm condition may include: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle.
如上所述，如果需要进行周界防范，也就是判断是否有人员或者车辆进入场景，可以采用本申请实施例提供的方案。以预设报警条件包括运动目标的分类信息为人员和运动目标的分类信息为车辆为例，判断运动目标的分类信息是否为人员或者车辆，如果判断结果为运动目标的分类信息是人员，或者判断结果为运动目标的分类信息是车辆，则输出报警信息。As described above, if perimeter protection is required, that is, determining whether a person or vehicle enters the scene, the solution provided by the embodiments of the present application can be used. Taking as an example a preset alarm condition that includes both "the classification information of the moving target is a person" and "the classification information of the moving target is a vehicle", it is determined whether the classification information of the moving target is a person or a vehicle; if the classification information is determined to be a person, or determined to be a vehicle, alarm information is output.
另一种实施方式中,监控目标为人脸。这种情况下,预设报警条件可以包括:人脸的分类信息为第一标签信息,或人脸的分类信息为第二标签信息。In another embodiment, the monitoring target is a human face. In this case, the preset alarm condition may include: the classification information of the human face is the first tag information, or the classification information of the human face is the second tag information.
如上所述，如果需要对陌生人进行识别，可以采用本申请实施例提供的方案：人脸数据库中存储授权人员的人脸数据，判断人脸数据库中是否存在与所截取的视频图像对应的建模数据匹配成功的人脸数据。如果存在，表示所截取的视频图像对应的人员为授权人员，所截取的视频图像包含的人脸的分类信息为授权人员，可以不做任何处理。如果不存在，表示所截取的视频图像对应的人员为陌生人，所截取的视频图像包含的人脸的分类信息为陌生人，输出报警信息。As described above, if strangers need to be identified, the solution provided by the embodiments of the present application can be used: the face database stores the face data of authorized personnel, and it is determined whether the face database contains face data that successfully matches the modeling data corresponding to the intercepted video image. If it does, the person corresponding to the intercepted video image is an authorized person, the classification information of the face contained in the intercepted video image is "authorized person", and no processing is required. If it does not, the person corresponding to the intercepted video image is a stranger, the classification information of the face is "stranger", and alarm information is output.
如果需要对指定人员进行识别，也可以采用本申请实施例提供的方案：人脸数据库中存储指定人员的人脸数据，判断人脸数据库中是否存在与所截取的视频图像对应的建模数据匹配成功的人脸数据。如果存在，表示所截取的视频图像对应的人员为指定人员，所截取的视频图像包含的人脸的分类信息为指定人员，输出报警信息。如果不存在，表示所截取的视频图像对应的人员为非指定人员，所截取的视频图像包含的人脸的分类信息为非指定人员，可以不做任何处理。If designated persons need to be identified, the solution provided by the embodiments of the present application can also be used: the face database stores the face data of designated persons, and it is determined whether the face database contains face data that successfully matches the modeling data corresponding to the intercepted video image. If it does, the person corresponding to the intercepted video image is a designated person, the classification information of the face contained in the intercepted video image is "designated person", and alarm information is output. If it does not, the person corresponding to the intercepted video image is a non-designated person, the classification information of the face is "non-designated person", and no processing is required.
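The alarm decision described in the preceding paragraphs can be sketched in a few lines. The condition set below is an illustrative assumption covering the examples in the text (person/vehicle for perimeter protection, stranger for unauthorized-person alarms, designated person for designated-person recognition), not an exhaustive list from the application.

```python
# Illustrative preset-alarm-condition check. The class set is assumed;
# classes not in the set produce no alarm (no processing is performed).

ALARM_CLASSES = {"person", "vehicle", "stranger", "designated person"}

def maybe_alarm(classification):
    """Return alarm information when a preset condition is met, else None."""
    if classification in ALARM_CLASSES:
        return f"ALARM: {classification} detected"
    return None
```

Which classes trigger an alarm would be configured per deployment: a perimeter-protection site alarms on person/vehicle, while an access-controlled site alarms on stranger.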
一种实施方式中，S101和S102可以由IPC执行，然后IPC将截取的视频图像发送至NVR，由NVR执行后续步骤。In an implementation manner, S101 and S102 may be executed by an IPC; the IPC then sends the intercepted video images to an NVR, and the NVR executes the subsequent steps.
应用本申请图1所示实施例，在采集的视频流中检测监控目标；从视频流中截取包含监控目标的视频图像；对所截取的视频图像中的监控目标进行分类识别，得到监控目标的分类信息。可见，本申请实施例提供的方案中，并不是对视频流的所有视频图像中的监控目标进行精确的分类识别，而是截取包含监控目标的视频图像，仅对截取的视频图像中的监控目标进行精确的分类识别，减少了计算量。With the embodiment shown in FIG. 1 of the present application, a monitoring target is detected in a captured video stream; video images containing the monitoring target are intercepted from the video stream; and the monitoring target in the intercepted video images is classified and recognized to obtain classification information of the monitoring target. It can be seen that in the solution provided by the embodiments of the present application, accurate classification and recognition is not performed on the monitoring targets in all video images of the video stream; instead, video images containing the monitoring target are intercepted, and accurate classification and recognition is performed only on the monitoring target in the intercepted video images, which reduces the amount of computation.
基于图1所示实施例,本申请实施例还提供了一种视频分析方法。参考图2,图2为本申请实施例提供的视频分析方法的第二种流程示意图,包括:Based on the embodiment shown in FIG. 1, the embodiment of the present application also provides a video analysis method. Referring to FIG. 2, FIG. 2 is a schematic diagram of the second flow of the video analysis method provided by an embodiment of the application, including:
S201:在采集的视频流中检测运动目标。S201: Detect a moving target in the collected video stream.
S202:截取包含该运动目标的一帧或多帧视频图像。S202: Capture one or more frames of video images containing the moving target.
S203:将所截取的视频图像输入至预先训练得到的第一神经网络模型中，利用第一神经网络模型对所截取的视频图像中的运动目标进行分类，得到第一神经网络模型输出的运动目标的分类信息。S203: Input the intercepted video images into the pre-trained first neural network model, and classify the moving targets in the intercepted video images by using the first neural network model to obtain the classification information of the moving targets output by the first neural network model.
S204:判断该分类信息是否符合预设报警条件;其中,预设报警条件包括:运动目标的分类信息为人员,和/或运动目标的分类信息为车辆。如果符合,执行S205。S204: Determine whether the classification information meets the preset alarm condition; where the preset alarm condition includes: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle. If it matches, execute S205.
上述步骤S204中的分类信息即为运动目标的分类信息。The classification information in the above step S204 is the classification information of the moving target.
S205:输出报警信息。S205: Output alarm information.
若步骤S204中得到运动目标的分类信息为人员，或运动目标的分类信息为车辆，则判定运动目标的分类信息符合预设报警条件，执行步骤S205，输出报警信息。若步骤S204中得到运动目标的分类信息既不是人员，也不是车辆，则判定运动目标的分类信息不符合预设报警条件，可以不进行任何处理。If the classification information of the moving target obtained in step S204 is a person, or the classification information of the moving target is a vehicle, it is determined that the classification information meets the preset alarm condition, and step S205 is executed to output alarm information. If the classification information of the moving target obtained in step S204 is neither a person nor a vehicle, it is determined that the classification information does not meet the preset alarm condition, and no processing may be performed.
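The S201–S205 flow above can be condensed into one illustrative driver function. The `detect`, `clip`, and `classify` callables are hypothetical stand-ins for the steps described in the text; they are not part of the application.

```python
# Illustrative end-to-end sketch of the S201-S205 perimeter flow.

def perimeter_pipeline(frames, detect, clip, classify,
                       alarm_classes=("person", "vehicle")):
    """Return (frame index, label) pairs that triggered an alarm."""
    alarms = []
    for idx in detect(frames):               # S201: detect moving targets
        for image in clip(frames, idx):      # S202: intercept images containing them
            label = classify(image)          # S203: first-model classification
            if label in alarm_classes:       # S204: preset alarm condition
                alarms.append((idx, label))  # S205: output alarm information
    return alarms
```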
举例来说,在需要进行周界防范的场景中,可以应用本申请图2所示实施例,判断是否有人员或者车辆进入场景,并在判断结果为是的情况下进行报警。For example, in a scenario where perimeter prevention is required, the embodiment shown in FIG. 2 of the present application can be used to determine whether a person or vehicle enters the scene, and an alarm is issued if the determination result is yes.
应用本申请图2实施例，第一方面，利用第一神经网络模型识别视频图像中的运动目标，得到运动目标的分类信息，若分类信息为人员或者车辆，可以及时提醒相关人员进行后续处理，实现了有效的周界防范。另一方面，先对视频流进行运动目标检测，运动目标检测算法可以理解为一种粗糙的检测算法，运动目标检测算法的运算复杂度较低，计算量较小；在检测到运动目标后，截取视频流中的小部分视频图像，仅对这小部分视频图像中的运动目标进行精确的分类识别，具体的，利用第一神经网络模型识别运动目标的分类信息，本申请实施例提供的方案中，未对视频流的所有视频图像中的运动目标进行精确的分类识别，相比于对视频流的所有视频图像中的运动目标进行精确的分类识别，减少了计算量。With the embodiment of FIG. 2 of the present application, in a first aspect, the first neural network model is used to recognize the moving target in the video image and obtain its classification information; if the classification information is a person or a vehicle, relevant personnel can be promptly reminded to perform subsequent processing, achieving effective perimeter protection. In another aspect, moving-target detection is performed on the video stream first; the moving-target detection algorithm can be understood as a coarse detection algorithm with low computational complexity and a small amount of computation. After a moving target is detected, only a small portion of the video images in the video stream is intercepted, and accurate classification and recognition is performed only on the moving targets in this small portion, specifically by using the first neural network model to obtain their classification information. In the solution provided by the embodiments of the present application, accurate classification and recognition is not performed on the moving targets in all video images of the video stream, which reduces the amount of computation compared with doing so for all video images.
In some related solutions, an infrared detector emits infrared laser light, and the area covered by the infrared laser forms the monitored area. When someone enters the monitored area, the waveform of the infrared laser changes, so whether someone has entered the monitored area can be determined based on the waveform of the infrared laser. However, in such an infrared-laser-based monitoring solution, the area covered by the infrared laser emitted by a single infrared detector is limited; if the monitored area is large, multiple infrared detectors must be installed, resulting in a high monitoring cost.
With the technical solution provided by the embodiments of the present application, the monitored area is monitored based on images collected by an image acquisition device. A single image acquisition device has a relatively large field of view, so one image acquisition device is sufficient to monitor a large area, and the cost of one image acquisition device is lower than the cost of multiple infrared detectors, which reduces the monitoring cost.
The following describes, with reference to FIG. 3, an implementation in which the video analysis method provided by an embodiment of the present application is applied to a perimeter protection scenario. The monitoring point in FIG. 3 may be an IPC.
The monitoring point collects a video stream, performs moving-target detection on the video stream, intercepts one or more frames of video images containing the moving target from the video stream according to the detection result, and sends the intercepted video images to the NVR.
The NVR receives the video images sent by the monitoring point, inputs the video images into a pre-trained first neural network model, and uses the first neural network model to classify the moving targets in the video images to obtain the classification information of the moving targets output by the first neural network model. The classification information of a moving target may be person, vehicle, object, and so on, which is not specifically limited.
Assume that the preset alarm condition is: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle. If the classification information of the moving target output by the first neural network model is a vehicle or a person, the NVR outputs alarm information.
In the technical solution provided by the embodiments of the present application, alarm information is output only when the classification information of the moving target meets the preset alarm condition, which can reduce false alarms caused by wind-blown vegetation, pet interference, or lighting changes, and improves alarm accuracy.
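The preset alarm condition can be expressed as a simple membership test over a configurable set of alarm-worthy classes. The sketch below is an illustration only; the class names and message format are assumptions, not part of the patented scheme.

```python
# Preset alarm condition: classification is person, and/or classification is vehicle.
ALARM_CLASSES = {"person", "vehicle"}  # assumed, configurable set

def meets_alarm_condition(classification: str) -> bool:
    """Return True only for classifications covered by the preset condition."""
    return classification in ALARM_CLASSES

def handle_classification(classification: str):
    """Output alarm information only when the preset condition is met."""
    if meets_alarm_condition(classification):
        return f"ALARM: {classification} detected"
    # e.g. wind-blown vegetation, pets, lighting changes: no alarm is raised
    return None
```

Classes outside the set (for example a pet or a moving shadow) produce no output, which is exactly how the scheme suppresses false alarms.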
Based on the embodiment shown in FIG. 1, an embodiment of the present application further provides a video analysis method. Referring to FIG. 4, FIG. 4 is a schematic flowchart of a third video analysis method provided by an embodiment of the present application, including:
S401: Perform face recognition on the collected video stream to obtain a recognition result.

S402: According to the recognition result, intercept a face region from an image containing a face.

Step S402 may specifically be: according to the recognition result, intercepting a face region from a video image containing a face in the video stream, as the intercepted video image. Here, the intercepted video image can be regarded as a face image.

S403: Input the intercepted face region into a pre-trained second neural network model, and use the second neural network model to convert the face region into modeling data.

Step S403 may specifically be: inputting the intercepted video image into the pre-trained second neural network model, and using the second neural network model to convert the intercepted video image into modeling data.

S404: Match the modeling data against the face data stored in a face database to obtain classification information of the face region.
Step S404 may specifically be: matching the modeling data against the face data stored in the face database to obtain the classification information of the face contained in the intercepted video image, where the classification information of the face contained in the intercepted video image is first tag information or second tag information.

For example, face images of authorized or designated persons can be collected in advance, converted into modeling data using the second neural network model, and the resulting modeling data can be stored in the face database as face data. Afterwards, the second neural network model is used to convert the intercepted video image into modeling data, and this modeling data is matched against the face data stored in the face database to obtain the classification information of the face region.
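The conversion-and-matching step can be sketched as follows, treating the "modeling data" as a normalized feature vector and performing matching by cosine similarity against the stored vectors. Both the stand-in feature extractor (`to_modeling_data` uses a normalized pixel vector rather than a learned embedding) and the similarity threshold are assumptions made for this sketch.

```python
import numpy as np

MATCH_THRESHOLD = 0.8  # assumed similarity threshold for a successful match

def to_modeling_data(face_image: np.ndarray) -> np.ndarray:
    """Stand-in for the second neural network model: convert a face image
    into modeling data (here, an L2-normalized flattened pixel vector)."""
    v = face_image.astype(np.float64).ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def classify_face(face_image: np.ndarray, face_database) -> str:
    """Match the modeling data against the face data stored in the database.

    Returns 'first_tag' when matching face data exists in the database,
    and 'second_tag' when no stored face data matches."""
    query = to_modeling_data(face_image)
    for stored in face_database:
        if float(query @ stored) >= MATCH_THRESHOLD:
            return "first_tag"   # matching face data exists
    return "second_tag"          # no matching face data
```

Enrollment and query use the same conversion function, so a face stored in the database matches itself with similarity 1.0, while unrelated images fall below the threshold.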
S405: Determine whether the classification information meets a preset alarm condition, where the preset alarm condition includes: the classification information of the face contained in the intercepted video image is the first tag information; or the classification information of the face contained in the intercepted video image is the second tag information. If the condition is met, execute S406.

The classification information in step S405 is the classification information of the face contained in the intercepted video image.

S406: Output alarm information.
When the preset alarm condition includes that the classification information of the face contained in the intercepted video image is the first tag information: if the classification information obtained in step S405 is the first tag information, it is determined that the classification information meets the preset alarm condition, and step S406 is executed to output alarm information; if the classification information obtained in step S405 is the second tag information, it is determined that the classification information does not meet the preset alarm condition, and no processing needs to be performed.

When the preset alarm condition includes that the classification information of the face contained in the intercepted video image is the second tag information: if the classification information obtained in step S405 is the second tag information, it is determined that the classification information meets the preset alarm condition, and step S406 is executed to output alarm information; if the classification information obtained in step S405 is the first tag information, it is determined that the classification information does not meet the preset alarm condition, and no processing needs to be performed.
For example, when strangers need to be identified, the embodiment shown in FIG. 4 of the present application can be applied: the face data of authorized persons is stored in the face database, and the modeling data obtained by converting the intercepted video image is matched against the face data stored in the face database to determine whether the person in the video stream is an authorized person. If the person in the video stream is determined to be a stranger, an alarm is issued.

As another example, when a designated person needs to be identified, the embodiment shown in FIG. 4 of the present application can be applied: the face data of the designated person is stored in the face database, and the modeling data obtained by converting the intercepted video image is matched against the face data stored in the face database to determine whether the person in the video stream is the designated person. If the person in the video stream is determined to be the designated person, an alarm is issued.
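The two examples above differ only in which tag triggers the alarm: with an authorized-person database, the second tag (a failed match, i.e. a stranger) alarms; with a designated-person database, the first tag (a successful match) alarms. A sketch of that decision follows; the mode and tag names are invented for illustration.

```python
def should_alarm(face_tag: str, mode: str) -> bool:
    """Decide whether to output alarm information for a face classification.

    mode 'stranger_alarm': the database holds authorized persons, so an
    alarm is raised on a failed match (second tag, i.e. a stranger).
    mode 'designated_person': the database holds designated persons, so an
    alarm is raised on a successful match (first tag)."""
    if mode == "stranger_alarm":
        return face_tag == "second_tag"
    if mode == "designated_person":
        return face_tag == "first_tag"
    raise ValueError(f"unknown mode: {mode}")
```

The matching machinery is identical in both scenarios; only the alarm predicate and the contents of the database change.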
By applying the embodiment shown in FIG. 4 of the present application, on the one hand, the second neural network model is used to obtain the classification information of the face contained in the intercepted video image; based on this classification information, it is determined whether the person is an authorized or designated person, and relevant personnel are promptly notified for follow-up handling according to the determination result, so that effective stranger alarming or designated-person recognition can be achieved. On the other hand, face recognition is first performed on the video stream; the face recognition algorithm is a coarse detection algorithm with low computational complexity and a small amount of calculation. Only a small portion of video images or image regions is intercepted from the video stream, and accurate classification and recognition, that is, face matching, is performed only on the faces in the intercepted small portion of video images or image regions. This solution does not perform accurate classification and recognition on the faces in all video images of the video stream, which reduces the amount of calculation compared with performing accurate classification and recognition on the faces in all video images of the video stream.
The following describes, with reference to FIG. 5, an implementation in which the video analysis method provided by an embodiment of the present application is applied to a stranger alarm scenario. The monitoring point in FIG. 5 may be an IPC.
The monitoring point collects a video stream, performs face recognition on the video stream, and, according to the recognition result, intercepts one or more frames of video images containing a face from the video stream, or intercepts the face region from a video image containing a face in the video stream; the intercepted video image or face region is then sent to the NVR. For convenience of description, the intercepted video images or face regions are collectively referred to as face images.

The NVR receives the face image sent by the monitoring point, inputs the face image into the pre-trained second neural network model, uses the second neural network model to convert the face image into modeling data, and matches the resulting modeling data against the face data stored in the face database. If the matching succeeds, the person corresponding to the face image is an authorized person, and the classification information of the face contained in the face image is authorized person. If the matching fails, the person corresponding to the face image is a stranger, the classification information of the face contained in the face image is stranger, and alarm information is output.
Corresponding to the foregoing method embodiments, an embodiment of the present application further provides a video analysis apparatus, as shown in FIG. 6, including:

a detection module 601, configured to detect a monitoring target in a collected video stream;

an interception module 602, configured to intercept a video image containing the monitoring target from the video stream; and

a classification module 603, configured to classify and recognize the monitoring target in the intercepted video image to obtain classification information of the monitoring target.
In an implementation, the detection module 601 is specifically configured to detect a moving target in the collected video stream;

and the interception module 602 is specifically configured to intercept one or more frames of video images containing the moving target from the video stream.
In an implementation, the classification module 603 is specifically configured to:

input the intercepted video image into a pre-trained first neural network model, and use the first neural network model to classify the moving target in the intercepted video image to obtain the classification information of the moving target output by the first neural network model.
In an implementation, the detection module 601 is specifically configured to perform face recognition on the collected video stream to obtain a recognition result;

the interception module 602 is specifically configured to intercept, according to the recognition result, a face region from a video image containing a face in the video stream, as the intercepted video image; and

the classification module 603 is specifically configured to match the intercepted video image against the face data stored in a face database to obtain classification information of the face.
In an implementation, the classification module 603 is specifically configured to:

input the intercepted video image into a pre-trained second neural network model, and use the second neural network model to convert the intercepted video image into modeling data; and

match the modeling data against the face data stored in the face database to obtain the classification information of the face, where the classification information of the face is first tag information or second tag information, the first tag information indicating that face data successfully matching the modeling data exists in the face database, and the second tag information indicating that no face data successfully matching the modeling data exists in the face database.
In an implementation, the video analysis apparatus may further include a first judgment module and a first alarm module (not shown in the figure), where:

the first judgment module is configured to determine whether the classification information of the monitoring target meets a preset alarm condition and, if so, to trigger the first alarm module; and

the first alarm module is configured to output alarm information.
In an implementation, the video analysis apparatus may further include a second judgment module and a second alarm module (not shown in the figure), where:

the second judgment module is configured to determine whether the classification information of the moving target meets a preset alarm condition and, if so, to trigger the second alarm module, the preset alarm condition including: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle; and

the second alarm module is configured to output alarm information.
In an implementation, the video analysis apparatus may further include a third judgment module and a third alarm module (not shown in the figure), where:

the third judgment module is configured to determine whether the classification information of the face meets a preset alarm condition and, if so, to trigger the third alarm module, the preset alarm condition including: the classification information of the face is the first tag information, or the classification information of the face is the second tag information; and

the third alarm module is configured to output alarm information.
In the embodiments of the present application, a monitoring target is detected in a collected video stream; a video image containing the monitoring target is intercepted from the video stream; and the monitoring target in the intercepted video image is classified and recognized to obtain classification information of the monitoring target. It can be seen that in the solution provided by the embodiments of the present application, accurate classification and recognition is not performed on the monitoring targets in all video images of the video stream; instead, video images containing the monitoring target are intercepted, and accurate classification and recognition is performed only on the monitoring targets in the intercepted video images, which reduces the amount of calculation.
An embodiment of the present application further provides an electronic device, as shown in FIG. 7, including a processor 701 and a memory 702, where:

the memory 702 is configured to store a computer program; and

the processor 701 is configured to implement any one of the foregoing video analysis methods when executing the program stored in the memory 702.
The memory mentioned in the above electronic device may include a random access memory (RAM), and may also include a non-volatile memory (NVM), for example, at least one disk memory. In an implementation, the memory may also be at least one storage apparatus located far away from the foregoing processor.
The foregoing processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements any one of the foregoing video analysis methods.
An embodiment of the present application further provides a computer program, which, when executed by a processor, implements any one of the foregoing video analysis methods.
An embodiment of the present application further provides a video analysis system, as shown in FIG. 8, including a monitoring point and a processing device, where:

the monitoring point is configured to detect a monitoring target in a collected video stream, intercept a video image containing the monitoring target from the video stream, and send the intercepted video image to the processing device; and

the processing device is configured to receive the video image and recognize the monitoring target in the received video image to obtain classification information of the monitoring target.

For example, the monitoring point may be an IPC and the processing device may be an NVR, which is not specifically limited.

In this solution, the monitoring target is not accurately recognized in all video images of the video stream; instead, video images containing the monitoring target are intercepted, and accurate recognition is performed only on the intercepted video images, which reduces the amount of calculation.
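The division of labor between the monitoring point and the processing device can be illustrated with an in-process queue standing in for the network link between IPC and NVR. The queue-based handoff, the end-of-stream marker, and the function names are all assumptions made for this sketch.

```python
from queue import Queue

def monitoring_point(frames, channel: Queue, has_target) -> int:
    """IPC side: run cheap detection and forward only frames containing a target."""
    sent = 0
    for frame in frames:
        if has_target(frame):      # coarse detection at the edge
            channel.put(frame)     # only intercepted frames cross the link
            sent += 1
    channel.put(None)              # end-of-stream marker (assumption)
    return sent

def processing_device(channel: Queue, classify) -> list:
    """NVR side: run the accurate classifier only on the frames it receives."""
    results = []
    while (frame := channel.get()) is not None:
        results.append(classify(frame))
    return results
```

The expensive classifier on the processing device sees only the intercepted frames, so both network traffic and NVR-side computation scale with the number of detections rather than with the full frame rate of the stream.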
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or further includes elements inherent to such a process, method, article, or device. Without further limitation, an element preceded by the phrase "including a ..." does not preclude the existence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a related manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, for the video analysis apparatus embodiment, the electronic device embodiment, the computer-readable storage medium embodiment, the computer program embodiment, and the video analysis system embodiment, since they are basically similar to the video analysis method embodiments, the description is relatively simple, and for relevant parts, reference may be made to the partial description of the video analysis method embodiments.
The above are only preferred embodiments of the present application and are not intended to limit the protection scope of the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (19)

  1. A video analysis method, comprising:
    detecting a monitoring target in a collected video stream;
    intercepting a video image containing the monitoring target from the video stream; and
    classifying and recognizing the monitoring target in the intercepted video image to obtain classification information of the monitoring target.
  2. The method according to claim 1, wherein
    the detecting a monitoring target in a collected video stream comprises:
    detecting a moving target in the collected video stream; and
    the intercepting a video image containing the monitoring target from the video stream comprises:
    intercepting one or more frames of video images containing the moving target from the video stream.
  3. The method according to claim 2, wherein the classifying and recognizing the monitoring target in the intercepted video image to obtain classification information of the monitoring target comprises:
    inputting the intercepted video image into a pre-trained first neural network model, and using the first neural network model to classify the moving target in the intercepted video image to obtain the classification information of the moving target output by the first neural network model.
  4. The method according to claim 1, wherein
    the detecting a monitoring target in a collected video stream comprises:
    performing face recognition on the collected video stream to obtain a recognition result;
    the intercepting a video image containing the monitoring target from the video stream comprises:
    intercepting, according to the recognition result, a face region from a video image containing a face in the video stream, as the intercepted video image; and
    the classifying and recognizing the monitoring target in the intercepted video image to obtain classification information of the monitoring target comprises:
    matching the intercepted video image against face data stored in a face database to obtain classification information of the face.
  5. The method according to claim 4, wherein the matching the intercepted video image against face data stored in a face database to obtain classification information of the face comprises:
    inputting the intercepted video image into a pre-trained second neural network model, and using the second neural network model to convert the intercepted video image into modeling data; and
    matching the modeling data against the face data stored in the face database to obtain the classification information of the face, wherein the classification information of the face is first tag information or second tag information, the first tag information indicates that face data successfully matching the modeling data exists in the face database, and the second tag information indicates that no face data successfully matching the modeling data exists in the face database.
  6. The method according to claim 1, further comprising, after the classifying and recognizing the monitoring target in the intercepted video image to obtain classification information of the monitoring target:
    determining whether the classification information of the monitoring target meets a preset alarm condition; and
    if so, outputting alarm information.
  7. The method according to claim 3, further comprising, after obtaining the classification information of the moving target output by the first neural network model:
    determining whether the classification information of the moving target meets a preset alarm condition, and if so, outputting alarm information;
    wherein the preset alarm condition comprises: the classification information of the moving target is a person, and/or the classification information of the moving target is a vehicle.
  8. The method according to claim 5, further comprising, after obtaining the classification information of the face:
    determining whether the classification information of the face meets a preset alarm condition, and if so, outputting alarm information;
    wherein the preset alarm condition comprises: the classification information of the face is the first tag information, or the classification information of the face is the second tag information.
  9. A video analysis apparatus, comprising:
    a detection module, configured to detect a monitoring target in a collected video stream;
    an interception module, configured to intercept a video image containing the monitoring target from the video stream; and
    a classification module, configured to classify and recognize the monitoring target in the intercepted video image to obtain classification information of the monitoring target.
  10. The apparatus according to claim 9, wherein the detection module is specifically configured to detect a moving target in the captured video stream;
    and the interception module is specifically configured to intercept, from the video stream, one or more frames of video images containing the moving target.
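One simple way to realise the detect-then-intercept pairing of claim 10 (the claims leave the detection method open, so this is an illustrative assumption, not the patented method) is frame differencing: a frame is taken to contain a moving target when enough pixels change relative to the previous frame, and those frames are intercepted. The thresholds are arbitrary; frames are modelled as 2-D lists of grey values in [0, 1]:

```python
def intercept_moving_frames(frames, pixel_delta=0.2, min_changed=2):
    """Return the frames (after the first) whose pixel change versus the
    previous frame suggests a moving target. Threshold values are
    illustrative, not taken from the patent."""
    kept = []
    for prev, cur in zip(frames, frames[1:]):
        # Count pixels that changed by more than pixel_delta.
        changed = sum(
            1
            for row_p, row_c in zip(prev, cur)
            for a, b in zip(row_p, row_c)
            if abs(a - b) > pixel_delta
        )
        if changed >= min_changed:
            kept.append(cur)  # "intercept" the frame containing motion
    return kept
```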
  11. The apparatus according to claim 10, wherein the classification module is specifically configured to:
    input the intercepted video image into a pre-trained first neural network model, and classify the moving target in the intercepted video image using the first neural network model, to obtain the classification information of the moving target output by the first neural network model.
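Claim 11's classification step can be sketched with a stand-in for the pre-trained first neural network model. The real model's architecture and weights are outside this text, so the stub below just returns fake per-class scores from mean pixel intensity; only the surrounding control flow (image in, argmax of scores out) reflects the claim:

```python
CLASSES = ["person", "vehicle", "other"]  # assumed label set, for illustration

def stub_first_model(image):
    """Stand-in for the pre-trained first neural network model: returns
    per-class scores from a trivial mean-intensity heuristic (fake)."""
    mean = sum(sum(row) for row in image) / (len(image) * len(image[0]))
    return [mean, 1.0 - mean, 0.1]

def classify_moving_target(image, model=stub_first_model):
    """Feed the intercepted video image to the model and return the
    classification information (the highest-scoring class label)."""
    scores = model(image)
    return CLASSES[scores.index(max(scores))]
```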
  12. The apparatus according to claim 9, wherein the detection module is specifically configured to perform face recognition on the captured video stream to obtain a recognition result;
    the interception module is specifically configured to intercept, according to the recognition result, a face region from a video image containing a face in the video stream, as the intercepted video image; and
    the classification module is specifically configured to match the intercepted video image against face data stored in a face database to obtain classification information of the face.
  13. The apparatus according to claim 12, wherein the classification module is specifically configured to:
    input the intercepted video image into a pre-trained second neural network model, and convert the intercepted video image into modeling data using the second neural network model; and
    match the modeling data against face data stored in a face database to obtain the classification information of the face, the classification information of the face being first tag information or second tag information, wherein the first tag information indicates that the face database contains face data successfully matching the modeling data, and the second tag information indicates that the face database contains no face data successfully matching the modeling data.
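Claim 13's matching step can be sketched as follows. The second neural network model is assumed to map the intercepted face image to modeling data (an embedding vector); matching then compares that vector against the face database and returns first tag information on a successful match, second tag information otherwise. The embedding stub and cosine-similarity threshold are illustrative assumptions, not the patent's model:

```python
import math

def stub_second_model(face_image):
    """Stand-in for the second neural network model: produce modeling
    data as a unit-normalised pixel vector (not a real face embedding)."""
    flat = [p for row in face_image for p in row]
    norm = math.sqrt(sum(p * p for p in flat)) or 1.0
    return [p / norm for p in flat]

def match_face(face_image, face_db, threshold=0.9, model=stub_second_model):
    """Return 'tag_1' (a stored face matches the modeling data) or
    'tag_2' (no stored face matches)."""
    query = model(face_image)
    for stored in face_db:
        similarity = sum(a * b for a, b in zip(query, stored))  # cosine
        if similarity >= threshold:
            return "tag_1"
    return "tag_2"
```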
  14. The apparatus according to claim 9, further comprising:
    a first judgment module, configured to judge whether the classification information of the monitoring target meets a preset alarm condition, and if so, to trigger a first alarm module; and
    the first alarm module, configured to output alarm information.
  15. The apparatus according to claim 11, further comprising:
    a second judgment module, configured to judge whether the classification information of the moving target meets a preset alarm condition, the preset alarm condition comprising: the classification information of the moving target being a person, and/or the classification information of the moving target being a vehicle; and if so, to trigger a second alarm module; and
    the second alarm module, configured to output alarm information.
  16. The apparatus according to claim 13, further comprising:
    a third judgment module, configured to judge whether the classification information of the face meets a preset alarm condition, the preset alarm condition comprising: the classification information of the face being the first tag information, or the classification information of the face being the second tag information; and if so, to trigger a third alarm module; and
    the third alarm module, configured to output alarm information.
  17. An electronic device, comprising a processor and a memory;
    the memory being configured to store a computer program; and
    the processor being configured to execute the program stored in the memory to implement the method steps of any one of claims 1-8.
  18. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method steps of any one of claims 1-8.
  19. A computer program which, when executed by a processor, implements the method steps of any one of claims 1-8.
PCT/CN2020/074895 2019-02-19 2020-02-12 Video analysis method and apparatus WO2020168960A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910121021.1A CN111582006A (en) 2019-02-19 2019-02-19 Video analysis method and device
CN201910121021.1 2019-02-19

Publications (1)

Publication Number Publication Date
WO2020168960A1 true WO2020168960A1 (en) 2020-08-27

Family

ID=72112900

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/074895 WO2020168960A1 (en) 2019-02-19 2020-02-12 Video analysis method and apparatus

Country Status (2)

Country Link
CN (1) CN111582006A (en)
WO (1) WO2020168960A1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148896A (en) * 2020-09-10 2020-12-29 京东数字科技控股股份有限公司 Data processing method and device for terminal media monitoring and broadcasting
CN112329517A (en) * 2020-09-17 2021-02-05 中国南方电网有限责任公司超高压输电公司南宁监控中心 Transformer substation disconnecting link confirmation video image analysis method and system
CN112464030A (en) * 2020-11-25 2021-03-09 浙江大华技术股份有限公司 Suspicious person determination method and device
CN112653874A (en) * 2020-12-01 2021-04-13 杭州勋誉科技有限公司 Storage device and intelligent video monitoring system
CN112818757A (en) * 2021-01-13 2021-05-18 上海应用技术大学 Gas station safety detection early warning method and system
CN112989934A (en) * 2021-02-05 2021-06-18 方战领 Video analysis method, device and system
CN113112754A (en) * 2021-03-02 2021-07-13 深圳市哈威飞行科技有限公司 Drowning alarm method, drowning alarm device, drowning alarm platform, drowning alarm system and computer readable storage medium
CN113139679A (en) * 2021-04-06 2021-07-20 青岛以萨数据技术有限公司 Urban road rescue early warning method, system and equipment based on neural network
CN113177459A (en) * 2021-04-25 2021-07-27 云赛智联股份有限公司 Intelligent video analysis method and system for intelligent airport service
CN113824926A (en) * 2021-08-17 2021-12-21 衢州光明电力投资集团有限公司赋腾科技分公司 Portable video analysis device and method
CN113888827A (en) * 2021-10-14 2022-01-04 深圳市巨龙创视科技有限公司 Camera control method and system
CN114630104A (en) * 2020-12-10 2022-06-14 北京市博汇科技股份有限公司 Artificial intelligence multi-model alarm processing method and device for broadcast television
CN114639061A (en) * 2022-04-02 2022-06-17 山东博昂信息科技有限公司 Vehicle detection method, system and storage medium
CN114821844A (en) * 2021-01-28 2022-07-29 深圳云天励飞技术股份有限公司 Attendance checking method and device based on face recognition, electronic equipment and storage medium
CN114821957A (en) * 2022-05-13 2022-07-29 湖南工商大学 AI video analysis system and method
CN114821934A (en) * 2021-12-31 2022-07-29 北京无线电计量测试研究所 Garden perimeter security control system and method
CN115278361A (en) * 2022-07-20 2022-11-01 重庆长安汽车股份有限公司 Driving video data extraction method, system, medium and electronic equipment

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101154B (en) * 2020-09-02 2023-12-15 腾讯科技(深圳)有限公司 Video classification method, apparatus, computer device and storage medium
CN112183353B (en) * 2020-09-28 2022-09-20 腾讯科技(深圳)有限公司 Image data processing method and device and related equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070269082A1 (en) * 2004-08-31 2007-11-22 Matsushita Electric Industrial Co., Ltd. Surveillance Recorder and Its Method
CN101854516A (en) * 2009-04-02 2010-10-06 北京中星微电子有限公司 Video monitoring system, video monitoring server and video monitoring method
CN103268680A (en) * 2013-05-29 2013-08-28 北京航空航天大学 Intelligent monitoring and anti-theft system for family
CN106372576A (en) * 2016-08-23 2017-02-01 南京邮电大学 Deep learning-based intelligent indoor intrusion detection method and system
CN109002744A (en) * 2017-06-06 2018-12-14 中兴通讯股份有限公司 Image-recognizing method, device and video monitoring equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184388A (en) * 2011-05-16 2011-09-14 苏州两江科技有限公司 Face and vehicle adaptive rapid detection system and detection method
CN106446754A (en) * 2015-08-11 2017-02-22 阿里巴巴集团控股有限公司 Image identification method, metric learning method, image source identification method and devices
CN206164722U (en) * 2016-09-21 2017-05-10 深圳市泛海三江科技发展有限公司 Discuss super electronic monitoring system based on face identification
CN106845385A (en) * 2017-01-17 2017-06-13 腾讯科技(上海)有限公司 The method and apparatus of video frequency object tracking
CN108122246A (en) * 2017-12-07 2018-06-05 中国石油大学(华东) Video monitoring intelligent identifying system
CN108596140A (en) * 2018-05-08 2018-09-28 青岛海信移动通信技术股份有限公司 A kind of mobile terminal face identification method and system
CN109241349B (en) * 2018-08-14 2022-03-25 中国电子科技集团公司第三十八研究所 Monitoring video multi-target classification retrieval method and system based on deep learning



Also Published As

Publication number Publication date
CN111582006A (en) 2020-08-25

Similar Documents

Publication Publication Date Title
WO2020168960A1 (en) Video analysis method and apparatus
TWI749113B (en) Methods, systems and computer program products for generating alerts in a video surveillance system
CN110674761B (en) Regional behavior early warning method and system
US20210364356A1 (en) System and method for using artificial intelligence to enable elevated temperature detection of persons using commodity-based thermal cameras
CN111814510A (en) Detection method and device for remnant body
CN110717357B (en) Early warning method and device, electronic equipment and storage medium
WO2020167155A1 (en) Method and system for detecting troubling events during interaction with a self-service device
CN112132048A (en) Community patrol analysis method and system based on computer vision
Moorthy et al. CNN based smart surveillance system: a smart IoT application post covid-19 era
US20220335724A1 (en) Processing apparatus, processing method, and non-transitory storage medium
KR102142315B1 (en) ATM security system based on image analyses and the method thereof
Dekkati et al. AI and Machine Learning for Remote Suspicious Action Detection and Recognition
Varghese et al. Video anomaly detection in confined areas
Khodadin et al. An intelligent camera surveillance system with effective notification features
WO2023124451A1 (en) Alarm event generating method and apparatus, device, and storage medium
El Gemayel et al. Automated face detection and control system using computer vision based video analytics to avoid the spreading of Covid-19
US11676439B2 (en) Face authentication system and face authentication method
Dirgantara et al. Design of Face Recognition Security System on Public Spaces
Saxena et al. Robust Home Alone Security System Using PIR Sensor and Face Recognition
Nandhini et al. IoT Based Smart Home Security System with Face Recognition and Weapon Detection Using Computer Vision
CN113095110B (en) Method, device, medium and electronic equipment for dynamically warehousing face data
Chua et al. Hierarchical audio-visual surveillance for passenger elevators
KR102332699B1 (en) Event processing system for detecting changes in spatial environment conditions using image model-based AI algorithms
KR20180001705A (en) A visitor detection method with face tracking using Haar-Like-Feature in the M2M environment
US20230058106A1 (en) Multi-camera system to perform movement pattern anomaly detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20759192

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20759192

Country of ref document: EP

Kind code of ref document: A1
