CN112364692A - Image processing method and device based on monitoring video data and storage medium - Google Patents

Image processing method and device based on monitoring video data and storage medium

Info

Publication number
CN112364692A
Authority
CN
China
Prior art keywords
image
video data
training
target object
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011085900.2A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Terminus Technology Group Co Ltd
Original Assignee
Terminus Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Terminus Technology Group Co Ltd filed Critical Terminus Technology Group Co Ltd
Priority to CN202011085900.2A priority Critical patent/CN112364692A/en
Publication of CN112364692A publication Critical patent/CN112364692A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image processing method and device based on monitoring video data, and a storage medium. The method comprises the following steps: inputting an image to be recognized into an image recognition model for recognition, and outputting a recognition result, wherein the recognition result comprises the recognized objects of each category and the corresponding candidate frames; and determining whether a target object is present in the monitored video data according to the recognition result: if the recognition result shows that the objects of the categories within the candidate frames include the target object, it is determined that the target object is present in the monitored video data; otherwise, it is determined that it is not. With the embodiments of the application, the objects of each category in the image to be recognized can be recognized automatically, and whether a target object is present in the monitored video data can be judged automatically from the recognition result, achieving the following effect: videos containing a target object can be screened out quickly and accurately from the massive video data of a surveillance video library.

Description

Image processing method and device based on monitoring video data and storage medium
Technical Field
The invention relates to the technical field of video data processing, in particular to an image processing method, an image processing device and a storage medium based on monitoring video data.
Background
With the popularization of camera technology, more and more cameras are installed in every corner of users' lives, in shopping malls, residential areas and the like, producing massive amounts of surveillance data. However, existing surveillance data are often kept in isolation, and, in order to protect user privacy, it is difficult to retrieve surveillance data across different areas.
Existing surveillance data are too voluminous and too disorderly classified for users to screen out the data they need quickly and accurately from a massive surveillance database. In addition, existing approaches to processing massive surveillance video data are too cumbersome, and the processing results obtained, whether by manual screening or by random screening, are not accurate enough.
Disclosure of Invention
The embodiment of the application provides an image processing method and device based on monitoring video data and a storage medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, an embodiment of the present application provides an image processing method based on surveillance video data, where the method includes:
selecting an image to be identified from a monitoring video database;
training according to each monitoring data in the monitoring database to obtain an image recognition model;
inputting the image to be recognized into an image recognition model for recognition, and outputting a recognition result, wherein the recognition result comprises recognized objects of various categories and corresponding candidate frames;
and determining whether a target object is present in the monitored video data according to the recognition result: if the recognition result shows that the objects of the categories within the candidate frames include the target object, it is determined that the target object is present in the monitored video data; otherwise, it is determined that the target object is not present.
Optionally, the training according to each monitoring data in the monitoring database to obtain an image recognition model includes:
selecting a plurality of data sets from the surveillance video data, the data sets including a first data set, a second data set, and a third data set;
taking the first data set and the second data set as training samples, and taking the third data set as detection samples;
training the training sample to obtain an initial image recognition model;
and correcting the initial image recognition model according to the detection sample to obtain the image recognition model.
Optionally, the inputting the image to be recognized into an image recognition model for recognition, and outputting a recognition result includes:
extracting candidate regions of the image to be identified to obtain a plurality of candidate regions;
respectively extracting the features of each candidate region, and extracting the image features of corresponding dimensions;
and classifying each candidate region through a support vector machine to obtain the identification result.
Optionally, the classifying each candidate region by the support vector machine includes:
obtaining a support vector machine classifier corresponding to each category;
and classifying the objects of each candidate region of the image to be recognized through the support vector machine classifier corresponding to each category, and determining the classification category of the maximum probability corresponding to each candidate region.
Optionally, after the outputting the recognition result, the method further includes:
selecting any one object from the objects in each category;
obtaining coordinates of a bounding box used for determining the object, wherein the coordinates comprise an abscissa and an ordinate;
determining the position of the bounding box and the size of the bounding box according to each coordinate;
correcting the position of the boundary frame according to a linear regression correction model to obtain the corrected position of the boundary frame, and/or,
and correcting the size of the boundary frame according to the linear regression correction model to obtain the corrected size of the boundary frame.
Optionally, before the correcting the position of the bounding box according to the linear regression correction model, the method further includes:
and training corresponding bounding box regressors for the objects of all classes.
Optionally, the image recognition model is an image recognition model established based on a regional convolutional neural network.
In a second aspect, an embodiment of the present application provides an image processing apparatus based on surveillance video data, the apparatus including:
the selection unit is used for selecting an image to be identified from the monitoring video database;
the training unit is used for training according to each monitoring data in the monitoring database to obtain an image recognition model;
the identification unit is used for inputting the image to be identified selected by the selection unit into an image identification model for identification and outputting an identification result, wherein the identification result comprises the identified objects of each category and corresponding candidate frames;
and the processing unit is used for determining whether a target object is present in the monitored video data according to the identification result from the identification unit: if the identification result shows that the objects of the categories within the candidate frames include the target object, it determines that the target object is present in the monitored video data; otherwise, it determines that the target object is not present.
Optionally, the training unit is specifically configured to:
selecting a plurality of data sets from the surveillance video data, the data sets including a first data set, a second data set, and a third data set;
taking the first data set and the second data set as training samples, and taking the third data set as detection samples;
training the training sample to obtain an initial image recognition model;
and correcting the initial image recognition model according to the detection sample to obtain the image recognition model.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
in the embodiment of the application, an image to be recognized is input into an image recognition model for recognition, and a recognition result is output, the recognition result comprising the recognized objects of each category and the corresponding candidate frames; whether a target object is present in the monitored video data is then determined according to the recognition result: if the recognition result shows that the objects of the categories within the candidate frames include the target object, the target object is present in the monitored video data; otherwise, it is not. By adopting the embodiment of the application, the objects of each category in the image to be recognized can be recognized automatically, and whether a target object is present in the monitored video data can be judged automatically from the recognition result, achieving the following effect: videos containing a target object can be screened out quickly and accurately from the massive video data of a surveillance video library.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a schematic flowchart of an image processing method based on surveillance video data according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an image processing apparatus based on surveillance video data according to an embodiment of the present application.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following describes in detail an image processing method based on surveillance video data according to an embodiment of the present application with reference to fig. 1. The image processing method may be implemented by relying on a computer program, and may be run on an image processing apparatus based on surveillance video data.
Referring to fig. 1, a schematic flow chart of an image processing method based on surveillance video data is provided in an embodiment of the present application. As shown in fig. 1, the image processing method based on surveillance video data according to the embodiment of the present application may include the following steps:
and S102, selecting an image to be identified from the monitoring video database.
In the embodiment of the application, a screenshot of a specific piece of video data in the surveillance video database may be selected as the image to be recognized according to a user's selection instruction; alternatively, a screenshot of some piece of video data may be selected at random as the image to be recognized. The image processing method provided in the embodiment of the present application does not specifically limit the image to be recognized.
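As a concrete illustration, such a frame can be grabbed from a stored video file with OpenCV. The sketch below is a minimal example under that assumption; the file path and frame index are illustrative, not prescribed by this application:

```python
import cv2

def grab_frame(video_path, frame_index=0):
    """Read one frame of a surveillance video to use as the image to be recognized."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)  # jump to the requested frame
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise IOError(f"could not read frame {frame_index} from {video_path}")
    return frame  # BGR image as a NumPy array

# e.g. a user-selected video and frame:
image = grab_frame("surveillance/cam01.mp4", frame_index=120)
```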
And S104, training according to each monitoring data in the monitoring database to obtain an image recognition model.
In the embodiment of the application, the training according to each monitoring data in the monitoring database to obtain the image recognition model comprises the following steps:
selecting a plurality of data sets from the monitoring video data, wherein the data sets comprise a first data set, a second data set and a third data set;
taking the first data set and the second data set as training samples, and taking the third data set as detection samples;
training the training sample to obtain an initial image recognition model;
and correcting the initial image recognition model according to the detection sample to obtain the image recognition model.
Here, this is merely an example; more data sets may be selected according to the training-accuracy requirement of the image recognition model. For example, five data sets may be configured, of which four are used as training samples and one as a detection sample; the initial image recognition model obtained from training is then corrected to finally obtain the corrected image recognition model.
In practical applications, in order to ensure a better detection result, several detection samples are often used: for example, each of the five data sets is taken in turn as the detection sample, yielding five corresponding detection results; the five detection results are then compared, and the image recognition model finally determined is the one that brings the predicted values closest to the true values.
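This rotation of the detection sample over the data sets is, in effect, k-fold cross-validation. A minimal sketch under that reading, where train_model and evaluate are hypothetical stand-ins for the training and correction steps described above:

```python
from sklearn.model_selection import KFold

def select_model(data, labels, train_model, evaluate, k=5):
    """Train on k-1 folds, detect on the held-out fold, keep the best model."""
    best_model, best_score = None, float("inf")
    for train_idx, detect_idx in KFold(n_splits=k, shuffle=True).split(data):
        model = train_model(data[train_idx], labels[train_idx])
        # Detection result: how far the predicted values are from the true values.
        score = evaluate(model, data[detect_idx], labels[detect_idx])
        if score < best_score:
            best_model, best_score = model, score
    return best_model
```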
In the embodiment of the application, the image recognition model is an image recognition model established based on a regional convolution neural network.
The following description is made for a training process of an image recognition model in the embodiment of the present application:
the specific training process adopts a gradient descent method.
For any parameter w of the regional convolutional neural network, the gradient-descent update rule used for training is:

w ← w − η · ∂LOSS/∂w

where η is the learning rate, a small positive number specified before training that plays the role of the step size of each gradient-descent step; ∂LOSS/∂w denotes the partial derivative of the loss with respect to w; and the vector of all such partial derivatives, ∇LOSS, is referred to as the gradient.

The specific loss used by the regional convolutional neural network is the squared error:

LOSS = (OUT − expected output)²

For a neuron weight w, the update formula is:

w ← w − η · ∂LOSS/∂w

and for a bias b, the update formula is:

b ← b − η · ∂LOSS/∂b

According to the definition of the derivative, for any differentiable function f:

f′(x) = lim(Δx→0) [f(x + Δx) − f(x)] / Δx

thus, differentiating the squared-error loss with respect to the output gives:

∂LOSS/∂OUT = 2 · (OUT − expected output)

and the partial derivatives with respect to each weight and bias are then obtained from this term by the chain rule, propagating it backward through the network layer by layer.
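To make the update rule concrete, the following sketch applies it to a single scalar parameter, using the definition-of-the-derivative approximation above; the toy one-parameter network is an illustrative assumption, not the patent's model:

```python
def loss(out, expected):
    # LOSS = (OUT - expected output)^2
    return (out - expected) ** 2

def numeric_grad(f, w, dw=1e-6):
    # From the definition of the derivative: f'(w) ~ (f(w + dw) - f(w)) / dw
    return (f(w + dw) - f(w)) / dw

# Toy one-parameter "network": OUT = w * x for a fixed input x.
x, expected = 2.0, 10.0
w = 0.5       # initial weight
eta = 0.01    # learning rate, fixed before training

for step in range(200):
    grad = numeric_grad(lambda p: loss(p * x, expected), w)
    w = w - eta * grad      # w <- w - eta * dLOSS/dw

print(w)  # converges toward 5.0, since 5.0 * 2.0 == 10.0
```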
the core idea of the area convolution neural network series is to convert the traditional image processing technology into processing by using a neural network and reuse the processing as much as possible to reduce the calculation amount.
The nonlinear activation function used in the training of the image recognition model in the embodiment of the present application may be any one of ReLU, sigmoid, and tanh.
The ReLU activation function is explained as follows:
ReLU is defined as:

f(x) = max(0, x), i.e. f(x) = x for x > 0 and f(x) = 0 otherwise.

The derivative of ReLU is:

f′(x) = 1 for x > 0, and f′(x) = 0 otherwise.
the nonlinear activation function ReLU has the following advantages:
ReLU is simple and fast to operate;
ReLU can avoid gradient disappearance;
the ReLU has better network performance.
sigmoid is defined as:

f(x) = 1 / (1 + e^(−x))

The derivative of sigmoid is: f′(x) = f(x)(1 − f(x)).

tanh is defined as:

f(x) = (e^x − e^(−x)) / (e^x + e^(−x))

The derivative of tanh is: f′(x) = 1 − f(x)².
The following steps illustrate how an image recognition model is obtained by training on all the data of a specific application scenario, assuming 20,000 data samples and a batch size of 100 (a code sketch of this loop follows the steps):
Step a1: read all the data (20,000 samples);
Step a2: divide the data into a training set and a test set, for example the first 18,000 samples as the training set and the last 2,000 as the test set;
Step a3: start the first epoch (one forward and one backward propagation of all training samples through the neural network is called an epoch); shuffle the 18,000 training samples, keeping inputs and outputs in correct one-to-one correspondence;
Step a4: feed the 1st batch, i.e. samples 1-100 of the shuffled data, into the network;
Step a5: forward propagation; compare the output with the expected value and compute the loss;
Step a6: backward propagation; update the network parameters;
Step a7: feed the 2nd batch, i.e. samples 101-200 of the shuffled data, into the network, and repeat the above process;
Step a8: the 1st epoch is complete once the whole training set has been processed, i.e. after 18000/100 = 180 batches;
Step a9: feed the test set into the network; forward propagate; compare the test results with the expected values and compute the loss;
Step a10: compare the training-set loss with the test-set loss and adjust accordingly, for example by adjusting the learning rate, or by stopping training and modifying the hyper-parameters;
Step a11: begin the 2nd epoch: shuffle the 18,000 training samples again and repeat the process.
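A minimal sketch of the epoch/batch loop above, assuming NumPy arrays and a hypothetical model object whose forward, loss, backward and update methods stand in for steps a5-a6:

```python
import numpy as np

def train(model, x, y, epochs=10, batch_size=100, train_count=18000):
    # Step a2: first 18000 samples form the training set, the rest the test set.
    x_train, y_train = x[:train_count], y[:train_count]
    x_test, y_test = x[train_count:], y[train_count:]

    for epoch in range(epochs):
        # Steps a3/a11: shuffle, keeping inputs and labels aligned.
        order = np.random.permutation(len(x_train))
        x_train, y_train = x_train[order], y_train[order]

        for start in range(0, len(x_train), batch_size):
            xb = x_train[start:start + batch_size]   # steps a4/a7: one batch
            yb = y_train[start:start + batch_size]
            out = model.forward(xb)                  # step a5: forward pass
            train_loss = model.loss(out, yb)
            model.backward(out, yb)                  # step a6: backward pass,
            model.update()                           # then update parameters

        test_loss = model.loss(model.forward(x_test), y_test)  # step a9
        # Step a10: compare losses; adjust the learning rate or stop here.
        print(epoch, train_loss, test_loss)
```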
And S106, inputting the image to be recognized into the image recognition model for recognition, and outputting a recognition result, wherein the recognition result comprises recognized objects of various categories and corresponding candidate frames.
In the embodiment of the application, the image to be recognized is input into the image recognition model for recognition, and the output of the recognition result comprises the following steps:
step a: extracting candidate regions of an image to be identified to obtain a plurality of candidate regions;
specifically, the candidate region extraction usually adopts a classical target detection algorithm, and uses a sliding window to sequentially judge each possible region. In practical application, the optimization can be performed by using an object detection algorithm, a series of candidate regions of possible objects are extracted in advance by adopting selective search, and then, features are extracted only on the candidate regions, so that the calculation amount is greatly reduced.
Step b: respectively extracting the features of each candidate region, and extracting the image features of corresponding dimensions;
specifically, all candidate regions are cropped and scaled to a fixed size, and then, for each candidate region, multi-dimensional image features are extracted using 5 convolutional layers and 2 fully-connected layers.
Step c: and classifying each candidate region through a support vector machine to obtain an identification result.
Specifically, the classification of each candidate region by the support vector machine includes the following steps:
obtaining a support vector machine classifier corresponding to each category;
and classifying the objects of each candidate region of the image to be recognized through the support vector machine classifier corresponding to each category, and determining the classification category of the maximum probability corresponding to each candidate region.
Each category has its own support vector machine classifier; the objects (including background) in the 1,000 candidate regions are classified by the 21 classifiers, and the most probable category of each candidate region is determined, such as person, flower or pet.
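A scikit-learn sketch of this per-category classification, training one binary SVM per class and keeping the highest-scoring class for each candidate region; the 21 classes follow the text, while the feature arrays and helper names are assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_classifiers(features, labels, num_classes=21):
    """One binary SVM per category (class 0 conventionally the background)."""
    return [LinearSVC().fit(features, (labels == c).astype(int))
            for c in range(num_classes)]

def classify_regions(classifiers, region_features):
    """Score every region with every per-class SVM; keep the best class."""
    scores = np.stack([clf.decision_function(region_features)
                       for clf in classifiers], axis=1)
    return scores.argmax(axis=1), scores.max(axis=1)
```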
In practical application, in order to avoid the same object being covered by several different frames, after the classifiers have assigned each candidate region its possible category, non-maximum suppression is applied to the classification results to remove redundant frames. For example, where the same object has several different frames, the redundant ones are removed and only one frame is retained.
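A standard NumPy implementation of non-maximum suppression, assuming boxes in (x1, y1, x2, y2) format:

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box and drop boxes that overlap it too much."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]           # best box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # drop redundant boxes
    return keep
```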
In one possible implementation, after outputting the recognition result, the method further includes:
selecting any one object from the objects in each category;
acquiring coordinates of a bounding box for determining an object, wherein the coordinates comprise an abscissa and an ordinate;
determining the position of the bounding box and the size of the bounding box according to each coordinate;
correcting the position of the boundary frame according to the linear regression correction model to obtain the corrected position of the boundary frame, and/or,
and correcting the size of the boundary frame according to the linear regression correction model to obtain the corrected size of the boundary frame. By performing bounding-box regression with the linear regression correction model in this way, each candidate frame is fine-tuned and calibrated so that the object in each candidate region is framed accurately, which ultimately improves the accuracy of object recognition.
In one possible implementation, before the position of the bounding box is corrected according to the linear regression correction model, the method further includes the following steps:
and training corresponding bounding box regressors for the objects of all classes.
For example, there is a support vector machine classifier for each category, and the objects (including background) in the 1,000 candidate regions are classified by the 21 classifiers to determine the most probable category of each candidate region, such as person, flower or pet. Once the 21 categories are determined, a corresponding bounding-box regressor is trained for each of the 21 classes of objects. The method for training a bounding-box regressor is conventional and is not described again here.
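A sketch of per-class bounding-box regression using ordinary linear regression, in the spirit of the linear regression correction model above; the feature layout and the (dx, dy, dw, dh) offset parameterization are assumptions borrowed from the standard R-CNN formulation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def train_bbox_regressors(features_by_class, offsets_by_class):
    """One linear regressor per class, mapping region features to
    (dx, dy, dw, dh) corrections of the candidate box."""
    return {c: LinearRegression().fit(f, offsets_by_class[c])
            for c, f in features_by_class.items()}

def refine_box(regressor, feature, box):
    """Apply the predicted correction to an (x, y, w, h) candidate box."""
    dx, dy, dw, dh = regressor.predict(feature[None, :])[0]
    x, y, w, h = box
    return (x + w * dx, y + h * dy, w * np.exp(dw), h * np.exp(dh))
```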
And S108, determining whether a target object is present in the monitored video data according to the recognition result: if the recognition result shows that the objects of the categories within the candidate frames include the target object, it is determined that the target object is present in the monitored video data; otherwise, it is determined that no target object is present.
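The decision in S108 then reduces to a membership test over the recognized categories; a minimal sketch with illustrative label names:

```python
def target_present(recognition_result, target_object):
    """recognition_result: list of (category, candidate_box) pairs."""
    return any(category == target_object for category, _ in recognition_result)

# e.g. keep only the videos whose sampled frame contains a person:
# matches = [v for v in videos if target_present(recognize(v), "person")]
```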
In the embodiment of the application, an image to be recognized is input into an image recognition model for recognition, and a recognition result is output, the recognition result comprising the recognized objects of each category and the corresponding candidate frames; whether a target object is present in the monitored video data is then determined according to the recognition result: if the recognition result shows that the objects of the categories within the candidate frames include the target object, the target object is present in the monitored video data; otherwise, it is not. By adopting the embodiment of the application, the objects of each category in the image to be recognized can be recognized automatically, and whether a target object is present in the monitored video data can be judged automatically from the recognition result, achieving the following effect: videos containing a target object can be screened out quickly and accurately from the massive video data of a surveillance video library.
The following is an embodiment of an image processing apparatus of the present invention that can be used to perform an embodiment of an image processing method of the present invention. For details that are not disclosed in the embodiments of the image processing apparatus of the present invention, refer to the embodiments of the image processing method of the present invention.
Referring to fig. 2, a schematic structural diagram of an image processing apparatus based on surveillance video data according to an exemplary embodiment of the present invention is shown. The image processing apparatus may be implemented as all or a part of the terminal by software, hardware, or a combination of both. The image processing apparatus includes a selecting unit 10, a training unit 20, a recognizing unit 30, and a processing unit 40.
Specifically, the selecting unit 10 is configured to select an image to be identified from a surveillance video database;
the training unit 20 is used for training according to each monitoring data in the monitoring database to obtain an image recognition model;
the recognition unit 30 is used for inputting the image to be recognized selected by the selection unit 10 into the image recognition model for recognition and outputting a recognition result, wherein the recognition result comprises recognized objects of various categories and corresponding candidate frames;
and the processing unit 40 is configured to determine whether a target object is present in the monitored video data according to the identification result from the identification unit 30: if the identification result shows that the objects of the categories within the candidate frames include the target object, it determines that the target object is present in the monitored video data; otherwise, it determines that the target object is not present.
Optionally, the training unit 20 is specifically configured to:
selecting a plurality of data sets from the monitoring video data, wherein the data sets comprise a first data set, a second data set and a third data set;
taking the first data set and the second data set as training samples, and taking the third data set as detection samples;
training the training sample to obtain an initial image recognition model;
and correcting the initial image recognition model according to the detection sample to obtain the image recognition model.
Optionally, the identification unit 30 is configured to:
extracting candidate regions of an image to be identified to obtain a plurality of candidate regions;
respectively extracting the features of each candidate region, and extracting the image features of corresponding dimensions;
and classifying each candidate region through a support vector machine to obtain an identification result.
Optionally, the identification unit 30 is specifically configured to:
obtaining a support vector machine classifier corresponding to each category;
and classifying the objects of each candidate region of the image to be recognized through the support vector machine classifier corresponding to each category, and determining the classification category of the maximum probability corresponding to each candidate region.
Optionally, after the recognition unit 30 outputs the recognition result, the processing unit 40 is further specifically configured to:
selecting any one object from the objects in each category;
acquiring coordinates of a bounding box for determining an object, wherein the coordinates comprise an abscissa and an ordinate;
determining the position of the bounding box and the size of the bounding box according to each coordinate;
correcting the position of the boundary frame according to the linear regression correction model to obtain the corrected position of the boundary frame, and/or,
and correcting the size of the boundary frame according to the linear regression correction model to obtain the corrected size of the boundary frame.
Optionally, before the processing unit 40 corrects the position of the bounding box according to the linear regression correction model, the training unit 20 is further configured to:
and training corresponding bounding box regressors for the objects of all classes.
Optionally, the image recognition model is an image recognition model established based on a regional convolutional neural network.
It should be noted that, when the image processing apparatus provided in the foregoing embodiment executes the image processing method, only the division of the functional modules is illustrated, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. In addition, the image processing apparatus and the image processing method provided by the above embodiments belong to the same concept, and details of implementation processes thereof are referred to in the method embodiments and are not described herein again.
In the embodiment of the application, the identification unit inputs the image to be identified into the image identification model for identification and outputs an identification result, wherein the identification result comprises the identified objects of each category and the corresponding candidate frames; the processing unit then determines whether a target object is present in the monitored video data according to the identification result of the identification unit: if the identification result shows that the objects of the categories within the candidate frames include the target object, the target object is present in the monitored video data; otherwise, it is not. By adopting the embodiment of the application, the objects of each category in the image to be recognized can be recognized automatically, and whether a target object is present in the monitored video data can be judged automatically from the recognition result, achieving the following effect: videos containing a target object can be screened out quickly and accurately from the massive video data of a surveillance video library.
In one embodiment, a computer device is proposed, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: selecting an image to be recognized from a surveillance video database; training according to the monitoring data in the monitoring database to obtain an image recognition model; inputting the image to be recognized into the image recognition model for recognition, and outputting a recognition result, wherein the recognition result comprises the recognized objects of each category and the corresponding candidate frames; and determining whether a target object is present in the monitored video data according to the recognition result: if the recognition result shows that the objects of the categories within the candidate frames include the target object, it is determined that the target object is present in the monitored video data; otherwise, it is determined that the target object is not present.
In one embodiment, the step of training, performed by the processor, according to each monitoring data in the monitoring database to obtain the image recognition model includes: selecting a plurality of data sets from the monitoring video data, wherein the data sets comprise a first data set, a second data set and a third data set; taking the first data set and the second data set as training samples, and taking the third data set as detection samples; training the training sample to obtain an initial image recognition model; and correcting the initial image recognition model according to the detection sample to obtain the image recognition model.
In one embodiment, the step of inputting the image to be recognized into the image recognition model for recognition and outputting the recognition result, which is executed by the processor, includes: extracting candidate regions of an image to be identified to obtain a plurality of candidate regions; respectively extracting the features of each candidate region, and extracting the image features of corresponding dimensions; and classifying each candidate region through a support vector machine to obtain an identification result.
In one embodiment, the step of classifying each candidate region by a support vector machine performed by the processor comprises: obtaining a support vector machine classifier corresponding to each category; and classifying the objects of each candidate region of the image to be recognized through the support vector machine classifier corresponding to each category, and determining the classification category of the maximum probability corresponding to each candidate region.
In one embodiment, a storage medium is provided that stores computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of: selecting an image to be recognized from a surveillance video database; training according to the monitoring data in the monitoring database to obtain an image recognition model; inputting the image to be recognized into the image recognition model for recognition, and outputting a recognition result, wherein the recognition result comprises the recognized objects of each category and the corresponding candidate frames; and determining whether a target object is present in the monitored video data according to the recognition result: if the recognition result shows that the objects of the categories within the candidate frames include the target object, it is determined that the target object is present in the monitored video data; otherwise, it is determined that the target object is not present.
In one embodiment, the step of training, performed by the processor, according to each monitoring data in the monitoring database to obtain the image recognition model includes: selecting a plurality of data sets from the monitoring video data, wherein the data sets comprise a first data set, a second data set and a third data set; taking the first data set and the second data set as training samples, and taking the third data set as detection samples; training the training sample to obtain an initial image recognition model; and correcting the initial image recognition model according to the detection sample to obtain the image recognition model.
In one embodiment, the step of inputting the image to be recognized into the image recognition model for recognition and outputting the recognition result, which is executed by the processor, includes: extracting candidate regions of an image to be identified to obtain a plurality of candidate regions; respectively extracting the features of each candidate region, and extracting the image features of corresponding dimensions; and classifying each candidate region through a support vector machine to obtain an identification result.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An image processing method based on monitoring video data, characterized in that the method comprises:
selecting an image to be identified from a monitoring video database;
training according to each monitoring data in the monitoring database to obtain an image recognition model;
inputting the image to be recognized into an image recognition model for recognition, and outputting a recognition result, wherein the recognition result comprises recognized objects of various categories and corresponding candidate frames;
and determining whether a target object is present in the monitored video data according to the recognition result: if the recognition result shows that the objects of the categories within the candidate frames include the target object, it is determined that the target object is present in the monitored video data; otherwise, it is determined that the target object is not present.
2. The method of claim 1, wherein the training according to each monitoring data in the monitoring database to obtain an image recognition model comprises:
selecting a plurality of data sets from the surveillance video data, the data sets including a first data set, a second data set, and a third data set;
taking the first data set and the second data set as training samples, and taking the third data set as detection samples;
training the training sample to obtain an initial image recognition model;
and correcting the initial image recognition model according to the detection sample to obtain the image recognition model.
3. The method according to claim 1, wherein the inputting the image to be recognized into an image recognition model for recognition and the outputting the recognition result comprises:
extracting candidate regions of the image to be identified to obtain a plurality of candidate regions;
respectively extracting the features of each candidate region, and extracting the image features of corresponding dimensions;
and classifying each candidate region through a support vector machine to obtain the identification result.
4. The method of claim 3, wherein the classifying the respective candidate regions by a support vector machine comprises:
obtaining a support vector machine classifier corresponding to each category;
and classifying the objects of each candidate region of the image to be recognized through the support vector machine classifier corresponding to each category, and determining the classification category of the maximum probability corresponding to each candidate region.
5. The method of claim 1, after said outputting a recognition result, further comprising:
selecting any one object from the objects in each category;
obtaining coordinates of a bounding box used for determining the object, wherein the coordinates comprise an abscissa and an ordinate;
determining the position of the bounding box and the size of the bounding box according to each coordinate;
correcting the position of the boundary frame according to a linear regression correction model to obtain the corrected position of the boundary frame, and/or,
and correcting the size of the boundary frame according to the linear regression correction model to obtain the corrected size of the boundary frame.
6. The method of claim 5, further comprising, prior to said modifying the position of the bounding box according to a linear regression modification model:
and training corresponding bounding box regressors for the objects of all classes.
7. The method according to any one of claims 1 to 6,
the image recognition model is established based on a regional convolution neural network.
8. An image processing apparatus based on surveillance video data, the apparatus comprising:
the selection unit is used for selecting an image to be identified from the monitoring video database;
the training unit is used for training according to each monitoring data in the monitoring database to obtain an image recognition model;
the identification unit is used for inputting the image to be identified selected by the selection unit into an image identification model for identification and outputting an identification result, wherein the identification result comprises the identified objects of each category and corresponding candidate frames;
and the processing unit is used for determining whether a target object is present in the monitored video data according to the identification result from the identification unit: if the identification result shows that the objects of the categories within the candidate frames include the target object, it determines that the target object is present in the monitored video data; otherwise, it determines that the target object is not present.
9. The apparatus of claim 8,
the training unit is specifically configured to:
selecting a plurality of data sets from the surveillance video data, the data sets including a first data set, a second data set, and a third data set;
taking the first data set and the second data set as training samples, and taking the third data set as detection samples;
training the training sample to obtain an initial image recognition model;
and correcting the initial image recognition model according to the detection sample to obtain the image recognition model.
10. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 7.
CN202011085900.2A 2020-10-12 2020-10-12 Image processing method and device based on monitoring video data and storage medium Pending CN112364692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011085900.2A CN112364692A (en) 2020-10-12 2020-10-12 Image processing method and device based on monitoring video data and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011085900.2A CN112364692A (en) 2020-10-12 2020-10-12 Image processing method and device based on monitoring video data and storage medium

Publications (1)

Publication Number Publication Date
CN112364692A true CN112364692A (en) 2021-02-12

Family

ID=74507722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011085900.2A Pending CN112364692A (en) 2020-10-12 2020-10-12 Image processing method and device based on monitoring video data and storage medium

Country Status (1)

Country Link
CN (1) CN112364692A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106384345A (en) * 2016-08-31 2017-02-08 上海交通大学 RCNN based image detecting and flow calculating method
CN107247956A (en) * 2016-10-09 2017-10-13 成都快眼科技有限公司 A kind of fast target detection method judged based on grid
WO2019223582A1 (en) * 2018-05-24 2019-11-28 Beijing Didi Infinity Technology And Development Co., Ltd. Target detection method and system
CN109460753A (en) * 2018-05-25 2019-03-12 江南大学 A method of detection over-water floats
CN111488804A (en) * 2020-03-19 2020-08-04 山西大学 Labor insurance product wearing condition detection and identity identification method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梦里寻梦: "通俗易懂理解—R-CNN", pages 1 - 13, Retrieved from the Internet <URL:《https://zhuanlan.zhihu.com/p/67928873》> *

Similar Documents

Publication Publication Date Title
CN109741292B (en) Method for detecting abnormal image in first image data set by using countermeasure self-encoder
CN105808610B (en) Internet picture filtering method and device
CN109271958B (en) Face age identification method and device
US20170364742A1 (en) Lip-reading recognition method and apparatus based on projection extreme learning machine
CN109376696B (en) Video motion classification method and device, computer equipment and storage medium
CN111814902A (en) Target detection model training method, target identification method, device and medium
KR20170091716A (en) Automatic defect classification without sampling and feature selection
CN109871821B (en) Pedestrian re-identification method, device, equipment and storage medium of self-adaptive network
US10929978B2 (en) Image processing apparatus, training apparatus, image processing method, training method, and storage medium
JP7058941B2 (en) Dictionary generator, dictionary generation method, and program
CN110096938B (en) Method and device for processing action behaviors in video
KR20160096460A (en) Recognition system based on deep learning including a plurality of classfier and control method thereof
US9842279B2 (en) Data processing method for learning discriminator, and data processing apparatus therefor
CN111008643B (en) Picture classification method and device based on semi-supervised learning and computer equipment
JP2017228224A (en) Information processing device, information processing method, and program
JP6924031B2 (en) Object detectors and their programs
CN109934129B (en) Face feature point positioning method, device, computer equipment and storage medium
CN111242176A (en) Computer vision task processing method and device and electronic system
CN116266387A (en) YOLOV4 image recognition algorithm and system based on re-parameterized residual error structure and coordinate attention mechanism
CN111310516A (en) Behavior identification method and device
CN109101858B (en) Action recognition method and device
CN112364692A (en) Image processing method and device based on monitoring video data and storage medium
CN115761842A (en) Automatic updating method and device for human face base
CN112699809B (en) Vaccinia category identification method, device, computer equipment and storage medium
CN114119970A (en) Target tracking method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination