CN105678267A - Scene recognition method and device - Google Patents

Scene recognition method and device

Info

Publication number
CN105678267A
Authority
CN
China
Prior art keywords
network model
image
event type
target
detection network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610013913.6A
Other languages
Chinese (zh)
Inventor
朱旭东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Uniview Technologies Co Ltd
Original Assignee
Zhejiang Uniview Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Uniview Technologies Co Ltd filed Critical Zhejiang Uniview Technologies Co Ltd
Priority to CN201610013913.6A priority Critical patent/CN105678267A/en
Publication of CN105678267A publication Critical patent/CN105678267A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a scene recognition method and device. The method comprises: obtaining a foreground region image of a target monitoring image; inputting the foreground region image into a preset target detection network model and a preset first event type detection network model respectively, to obtain a target feature and an event feature corresponding to the foreground region image; and inputting the target feature and the event feature into a preset second event type detection network model, to obtain the event type corresponding to the target monitoring image. Embodiments of the present invention improve the accuracy of scene recognition.

Description

Scene recognition method and device
Technical field
The present invention relates to the technical field of video monitoring, and in particular to a scene recognition method and device.
Background technology
An important use of intelligent monitoring is to detect abnormal conditions within a camera's coverage in time and to raise an alarm promptly. Detecting abnormal scenes by intelligent video monitoring not only allows improper scenes to be discovered in time so that staff can handle them promptly and illegal scenes can be prevented, but also saves a substantial amount of storage space and spares staff from searching through massive footage for evidence after an illegal scene has occurred. Some monitoring models based on motion vectors and simple foreground/background modeling schemes, such as intrusion, staying and loitering detection, have found fairly general application. However, the analysis of more complex scenes, such as fall detection, crowding, or motor vehicles failing to yield to pedestrians, still suffers from considerable false positives and missed detections.
Summary of the invention
The present invention provides a scene recognition method and device, to improve the accuracy of scene recognition.
According to a first aspect of embodiments of the present invention, a scene recognition method is provided, comprising:
obtaining a foreground region image of a target monitoring image;
inputting the foreground region image into a preset target detection network model and a preset first event type detection network model respectively, to obtain a target feature and an event feature corresponding to the foreground region image; and
inputting the target feature and the event feature into a preset second event type detection network model, to obtain the event type corresponding to the target monitoring image.
According to a second aspect of embodiments of the present invention, a scene recognition device is provided, comprising:
an acquiring unit, configured to obtain a foreground region image of a target monitoring image;
a feature extraction unit, configured to input the foreground region image into a preset target detection network model and a preset first event type detection network model respectively, to obtain a target feature and an event feature corresponding to the foreground region image; and
a scene recognition unit, configured to input the target feature and the event feature into a preset second event type detection network model, to obtain the event type corresponding to the target monitoring image.
By applying the embodiments of the present invention, the foreground region image of a target monitoring image is obtained and input into a preset target detection network model and a preset first event type detection network model respectively, to obtain the target feature and the event feature corresponding to the foreground region image; the target feature and the event feature are then input into a preset second event type detection network model, to obtain the event type corresponding to the target monitoring image. This improves the accuracy of scene recognition.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a scene recognition method provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a scene recognition device provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of another scene recognition device provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another scene recognition device provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another scene recognition device provided by an embodiment of the present invention.
Detailed description of the invention
To help those skilled in the art better understand the technical solutions in the embodiments of the present invention, and to make the above objects, features and advantages of the embodiments clearer and easier to understand, the technical solutions in the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, which is a schematic flowchart of a scene recognition method provided by an embodiment of the present invention, the scene recognition method may comprise the following steps.
Step 101: obtain the foreground region image of a target monitoring image.
In the embodiment of the present invention, the above method can be applied to a video monitoring system, for example, to the background server of a video monitoring system. For ease of description, the following takes the case where the method is executed by a server as an example.
In the embodiment of the present invention, the target monitoring image does not refer to a particular fixed monitoring image, but may refer to any monitoring image on which scene recognition needs to be performed.
In the embodiment of the present invention, when the server needs to perform scene recognition on a target monitoring image, in order to eliminate the interference of the background area of the target monitoring image on scene recognition, the server may separate the foreground area from the background area of the target monitoring image, to obtain the foreground region image of the target monitoring image.
Optionally, the server may separate the foreground area from the background area of the target monitoring image by establishing a GMM (Gaussian Mixture Model), and thereby obtain the foreground region image of the target monitoring image. The specific implementation is not repeated here.
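The GMM-based foreground/background separation can be illustrated with a deliberately simplified sketch that keeps a single Gaussian per pixel rather than a full mixture; all function names, thresholds and the learning rate below are illustrative assumptions, not part of the patent:

```python
# Simplified per-pixel background model (one Gaussian per pixel).
# A production GMM keeps several Gaussians per pixel; this sketch keeps
# one to illustrate how foreground pixels are separated from background.

def update_background(mean, var, frame, alpha=0.05, k=2.5):
    """Return (foreground mask, updated mean, updated var) for one frame.

    mean, var, frame: 2-D lists of grayscale values (same shape).
    alpha: learning rate; k: distance threshold in standard deviations.
    """
    h, w = len(frame), len(frame[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = frame[y][x] - mean[y][x]
            # A pixel is foreground if it deviates too far from the model.
            if d * d > (k * k) * var[y][x]:
                mask[y][x] = 1
            else:
                # Blend the matched pixel into the background model.
                mean[y][x] += alpha * d
                var[y][x] = (1 - alpha) * var[y][x] + alpha * d * d
    return mask, mean, var

# Static 2x2 background of value 10 with variance 4; a bright object
# (value 200) enters the top-left pixel.
mean = [[10.0, 10.0], [10.0, 10.0]]
var = [[4.0, 4.0], [4.0, 4.0]]
frame = [[200, 10], [10, 10]]
mask, mean, var = update_background(mean, var, frame)
print(mask)  # [[1, 0], [0, 0]] -- only the object pixel is foreground
```

The foreground mask would then be used to crop the foreground region image from the monitoring frame.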
Step 102: input the foreground region image into the preset target detection network model and the preset first event type detection network model respectively, to obtain the target feature and the event feature corresponding to the foreground region image.
In the embodiment of the present invention, after the server obtains the foreground region image of the target monitoring image, it does not perform scene recognition directly on that image. Instead, it first determines the target feature and the event feature of the foreground region image separately, and then performs scene recognition based on both features, so as to improve the accuracy of scene recognition.
The specific implementation by which the server obtains the target feature and the event feature of the foreground region image will be described below, and is not repeated here.
As an optional embodiment, before the foreground region image is input into the preset target detection network model and the preset first event type detection network model respectively, the method may further comprise the following steps:
11) judging whether a preset target exists in the foreground region image;
12) if so, determining to perform the above step of inputting the foreground region image into the preset target detection network model and the preset first event type detection network model respectively.
In this embodiment, the events that need attention can be preset, that is, the event types the server needs to detect can be set, such as fighting, motor vehicle congestion, and crowding. The targets that need attention can then be determined from these events of interest: for fighting, the targets of interest may include people; for motor vehicle congestion, the targets of interest may include motor vehicles; for crowding, the targets of interest may include people and bicycles.
Correspondingly, based on the above settings, after obtaining the foreground region image of the target monitoring image, the server may determine whether a preset target exists in the foreground region image, for example, whether one or more of people, motor vehicles and bicycles are present. If so, the foreground region image is input into the preset target detection network model and the preset first event type detection network model to extract the target feature and the event feature.
Optionally, the server may screen foreground region images by means of an SVM (Support Vector Machine) classifier. Specifically, assuming that the preset targets include three classes — motor vehicles, bicycles and people — the SVM classifier can be trained on positive and negative samples collected in advance (samples that include a preset target are positive samples; samples that do not include any preset target are negative samples), so that the accuracy with which the SVM classifier distinguishes positive from negative samples (for example, outputting 1 for a positive sample and 0 for a negative sample) meets a preset condition. Based on this training, after the server obtains the foreground region image of the target monitoring image, it can input the image into the trained SVM classifier. If the SVM classifier outputs 1, a preset target exists in the foreground region image, and the server can proceed to extract the target feature and the event feature from it; if the SVM classifier outputs 0, no preset target exists in the foreground region image, and the server can discard it without performing the subsequent feature extraction.
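The screening step can be sketched as a linear SVM decision over a per-image feature vector. This is a minimal sketch: the weights, bias, feature dimensionality and frame identifiers are illustrative stand-ins for a classifier actually trained on positive/negative samples as described above.

```python
# Sketch of the screening step: a (pre-trained) linear SVM decides
# whether a foreground region image contains any preset target before
# the heavier feature extraction runs.

def svm_predict(features, weights, bias):
    """Linear SVM decision: 1 (preset target present) or 0 (absent)."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1 if score > 0 else 0

def screen_foreground(images, weights, bias):
    """Keep only foreground images in which a preset target is found."""
    return [img for img in images
            if svm_predict(img["features"], weights, bias) == 1]

# Illustrative 3-D feature vectors (e.g. crude shape/size descriptors).
weights, bias = [1.0, -0.5, 2.0], -1.0
images = [
    {"id": "frame_001", "features": [2.0, 1.0, 0.5]},  # score 1.5 -> keep
    {"id": "frame_002", "features": [0.1, 2.0, 0.0]},  # score -1.9 -> drop
]
kept = screen_foreground(images, weights, bias)
print([img["id"] for img in kept])  # ['frame_001']
```

Images that are dropped here never reach the two detection network models, which is the point of the pre-filter.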
Step 103: input the target feature and the event feature into the preset second event type detection network model, to obtain the event type corresponding to the target monitoring image.
In the embodiment of the present invention, after the server extracts the target feature and the event feature corresponding to the foreground region image of the target monitoring image, it can input them into the preset second event type detection network model, to obtain the event type corresponding to the target monitoring image.
Optionally, after obtaining the target feature and the event feature corresponding to the foreground region image, the server can serially concatenate them into a new feature combination, input this feature combination into the second event type detection network model, and obtain the corresponding event type from the output of that model.
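The serial combination of the two features amounts to vector concatenation; the feature values below are illustrative placeholders:

```python
def combine_features(target_feature, event_feature):
    """Serially concatenate the two feature vectors into one new
    combined vector, as fed to the second event type detection model."""
    return list(target_feature) + list(event_feature)

target_feature = [0.2, 0.7, 0.1]  # e.g. from the target detection model
event_feature = [0.9, 0.05]       # e.g. from the first event type model
combined = combine_features(target_feature, event_feature)
print(combined)       # [0.2, 0.7, 0.1, 0.9, 0.05]
print(len(combined))  # 5 -- dimensionality is the sum of the two
```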
As an optional embodiment, the second event type detection network model may be an SVM event type classifier.
In this embodiment, the server can obtain the target feature and the event feature corresponding to each training sample collected in advance, in the manner described in step 101 and step 102. For any training sample, the target feature and the event feature corresponding to that sample are serially concatenated into a new feature combination and input into a pre-built SVM event type classifier for training, so that the output of the SVM event type classifier meets a preset condition, for example, that the recognition rate of event types exceeds a preset threshold.
Based on the trained SVM event type classifier, when the server needs to perform scene recognition, it can input the target feature and the event feature corresponding to the foreground region image of the target monitoring image into the trained classifier, to obtain the event type corresponding to the target monitoring image.
It can be seen that in the method flow shown in Fig. 1, the foreground region image of the target monitoring image is obtained, the target feature and the event feature of that image are extracted separately, and scene recognition is then performed on the target monitoring image according to both features, which improves the accuracy of scene recognition.
To help those skilled in the art better understand the technical solution provided by the embodiments of the present invention, the structure and application of the target detection network model and the first event type detection network model are described below, taking the case where both models are convolutional neural networks (CNN) as an example.
In this embodiment, both the target detection network model and the first event type detection network model consist mainly of an input layer, convolution and pooling layers, fully connected layers, and a softmax layer.
Specifically, for the target detection network model:
In this embodiment, for each training sample collected in advance, after obtaining the foreground region image of the sample, the server can segment the foreground region image according to the targets it contains (such as people, motor vehicles, bicycles or other targets) to obtain multiple target candidate areas, and judge for each target candidate area whether the target it contains is an effective target (i.e. a preset target) and determine the target type. By traversing all target candidate areas, the position information of all effective targets in the foreground region image is obtained.
After the server has identified the position information of each effective target in the foreground region image, it can divide the foreground region image into several independent sub-image areas Area{}. For example, assuming the foreground region image contains n effective targets, n independent sub-image areas Area_1, Area_2, ..., Area_n can be obtained, each containing one independent effective target. The server can then input these n independent sub-image areas into the preset target detection network model for training.
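The division into per-target sub-image areas Area_1 ... Area_n can be sketched as a cropping step. In this sketch the bounding boxes are given directly as inputs, whereas in the patent they come from the traversal of target candidate areas; the toy image and box coordinates are illustrative.

```python
def crop_subregions(image, boxes):
    """Crop one independent sub-image area Area_i per effective target.

    image: 2-D list (H x W); boxes: list of (top, left, bottom, right)
    bounding boxes, one per detected effective target.
    """
    regions = []
    for (t, l, b, r) in boxes:
        regions.append([row[l:r] for row in image[t:b]])
    return regions

# 4x6 toy foreground image with two "targets" marked by nonzero values.
image = [
    [0, 5, 5, 0, 0, 0],
    [0, 5, 5, 0, 7, 7],
    [0, 0, 0, 0, 7, 7],
    [0, 0, 0, 0, 0, 0],
]
boxes = [(0, 1, 2, 3), (1, 4, 3, 6)]  # one box per effective target
areas = crop_subregions(image, boxes)
print(len(areas))  # 2 independent sub-image areas
print(areas[0])    # [[5, 5], [5, 5]]
```

Each cropped area then becomes an independent training input to the target detection network model.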
For the i-th sub-image area Area_i, the first-layer convolution function output of the target detection network model is:

$$\mathrm{Out}^{(l,k)}_{(x,y)\in N_i} = \tanh\left(\sum_{t=0}^{f-1}\sum_{r=0}^{k_h}\sum_{c=0}^{k_w} w^{(k,t)}_{(r,c)}\,\mathrm{Out}^{(l-1,t)}_{(x+r,\,y+c)} + \mathrm{Bias}^{(l,k)}\right)$$

where tanh is the hyperbolic tangent function, l-1 denotes the input layer and l the output layer, k is a preset constant parameter, w is a convolution kernel of size k_w × k_h, f is the number of convolution kernels, Bias is the output bias, k_h and k_w are the height and width of the convolution kernel, N_i is the set of coordinate values in the two-dimensional image plane of the pixels in the i-th sub-image area, and (x, y) is the coordinate value of a pixel in the i-th sub-image area in the two-dimensional image plane.
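A direct (naive) implementation makes the formula concrete. This is a sketch under stated assumptions: a valid-convolution boundary (the output map shrinks), one output channel k, and toy shapes — none of these specifics are fixed by the patent.

```python
import math

def conv_layer_output(prev, kernels, bias):
    """First-layer convolution output, following the formula above:
    Out^{(l,k)}_{(x,y)} = tanh( sum_t sum_r sum_c w^{(k,t)}_{(r,c)}
                                * Out^{(l-1,t)}_{(x+r, y+c)} + Bias )

    prev: list of f input channels, each a 2-D list;
    kernels: list of f kernels (one per input channel), each kh x kw.
    """
    f = len(kernels)
    kh, kw = len(kernels[0]), len(kernels[0][0])
    h = len(prev[0]) - kh + 1
    w = len(prev[0][0]) - kw + 1
    out = [[0.0] * w for _ in range(h)]
    for x in range(h):
        for y in range(w):
            s = bias
            for t in range(f):
                for r in range(kh):
                    for c in range(kw):
                        s += kernels[t][r][c] * prev[t][x + r][y + c]
            out[x][y] = math.tanh(s)
    return out

# One 3x3 input channel, one 2x2 averaging kernel, zero bias.
# Every 2x2 window covers the centre pixel (value 4), so each
# output equals tanh(0.25 * 4) = tanh(1.0).
prev = [[[0, 0, 0], [0, 4, 0], [0, 0, 0]]]
kernels = [[[0.25, 0.25], [0.25, 0.25]]]
out = conv_layer_output(prev, kernels, bias=0.0)
print(len(out), len(out[0]))  # 2 2 -- valid convolution shrinks the map
print(round(out[0][0], 4))    # 0.7616, i.e. tanh(1.0)
```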
For the first event type detection network model:
Input layer: the server can use the foreground region image of the training sample as the input of the first event type detection network model, connected to the input layer.
Convolutional layers: in each convolutional layer, the convolution result of the previous layer is convolved in two dimensions with the layer's convolution kernels, and the output is obtained by passing the sum of the excitation result and the bias through a nonlinear activation function.
Classification layer (softmax): suitable convolution kernel and max-pooling rectangle sizes are chosen so that the output of the last convolutional layer is downsampled to a single pixel; the connection between the output of the last convolutional layer and the fully connected layer is a one-dimensional matrix. The last layer is generally a fully connected layer, with each output corresponding to one event type. In this embodiment, softmax regression is adopted as the excitation function of the last layer, and the output of each neuron is interpreted as the probability of the event type corresponding to the input image.
Alternatively, the training parameter adjustment formula of the first event type detection network is as follows:
$$\mathrm{Out}^{(l,k)}_{(x,y)\in N} = \tanh\left(\sum_{t=0}^{f-1}\sum_{r=0}^{k_h}\sum_{c=0}^{k_w} w^{(k,t)}_{(r,c)}\,\mathrm{Out}^{(l-1,t)}_{(x+r,\,y+c)} + \mathrm{Bias}^{(l,k)}\right)$$

where N is the set of coordinate values in the two-dimensional image plane of all pixels in the whole foreground region image, and (x, y) is the coordinate value of a pixel in the whole foreground region image in the two-dimensional image plane.
It can be seen that, in the target detection network model, the parameter training mode within a single sub-image area is weight sharing, but parameter training between different sub-image areas does not follow the weight-sharing principle; that is, the parameter training mode of the target detection network model is partial weight sharing (shared within a sub-image area, not shared between sub-image areas). The first event type detection network model, by contrast, uses the weight-sharing training mode directly for parameter training.
In this embodiment, after the server has constructed and trained the above target detection network model and first event type detection network model, it can obtain the target feature and the event feature corresponding to the foreground region image of the target monitoring image through these two models, connect the target feature and the event feature to obtain a high-dimensional combined feature, input this combined feature into the trained second event type detection network model, such as the SVM event type classifier, and determine the event type corresponding to the target monitoring image according to the output result.
It can be seen from the above description that, in the technical solution provided by the embodiments of the present invention, the foreground region image of the target monitoring image is obtained and input into the preset target detection network model and the preset first event type detection network model respectively, to obtain the target feature and the event feature corresponding to the foreground region image; the target feature and the event feature are then input into the preset second event type detection network model, to obtain the event type corresponding to the target monitoring image, which improves the accuracy of scene recognition.
Referring to Fig. 2, which is a schematic structural diagram of a scene recognition device provided by an embodiment of the present invention: the scene recognition device can be applied to the video monitoring system in the above method embodiment, for example, to the background server of a video monitoring system. As shown in Fig. 2, the device may comprise:
an acquiring unit 210, configured to obtain the foreground region image of a target monitoring image;
a feature extraction unit 220, configured to input the foreground region image into the preset target detection network model and the preset first event type detection network model respectively, to obtain the target feature and the event feature corresponding to the foreground region image; and
a scene recognition unit 230, configured to input the target feature and the event feature into the preset second event type detection network model, to obtain the event type corresponding to the target monitoring image.
Referring to Fig. 3, which is a schematic structural diagram of another scene recognition device provided by an embodiment of the present invention: on the basis of the embodiment shown in Fig. 2, the scene recognition device shown in Fig. 3 may further comprise:
a judging unit 240, configured to judge whether a preset target exists in the foreground region image.
Correspondingly, the feature extraction unit 220 may be specifically configured to input the foreground region image into the preset target detection network model and the preset first event type detection network model respectively when the judging unit determines that a preset target exists.
Referring to Fig. 4, which is a schematic structural diagram of another scene recognition device provided by an embodiment of the present invention: on the basis of the embodiment shown in Fig. 2, the feature extraction unit 220 in the scene recognition device shown in Fig. 4 may comprise:
a dividing subunit 221, configured to divide the foreground area into multiple independent sub-image areas according to the effective targets existing in the foreground area, wherein each sub-image area includes one independent effective target; and
an extracting subunit 222, configured to input the multiple independent sub-image areas into the preset target detection network model, to obtain the target feature corresponding to the foreground region image.
In an alternative embodiment, the target detection network model is a convolutional neural network model, and for the i-th sub-image area, the first-layer convolution function output formula of the target detection network model is:

$$\mathrm{Out}^{(l,k)}_{(x,y)\in N_i} = \tanh\left(\sum_{t=0}^{f-1}\sum_{r=0}^{k_h}\sum_{c=0}^{k_w} w^{(k,t)}_{(r,c)}\,\mathrm{Out}^{(l-1,t)}_{(x+r,\,y+c)} + \mathrm{Bias}^{(l,k)}\right)$$

where tanh is the hyperbolic tangent function, l-1 denotes the input layer and l the output layer, k is a preset constant parameter, w is a convolution kernel of size k_w × k_h, f is the number of convolution kernels, Bias is the output bias, k_h and k_w are the height and width of the convolution kernel, N_i is the set of coordinate values in the two-dimensional image plane of the pixels in the i-th sub-image area, and (x, y) is the coordinate value of a pixel in the i-th sub-image area in the two-dimensional image plane.
In an alternative embodiment, the second event type detection network model is a support vector machine (SVM) event type classifier.
Correspondingly, referring to Fig. 5, which is a schematic structural diagram of another scene recognition device provided by an embodiment of the present invention: on the basis of the embodiment shown in Fig. 2, the scene recognition unit 230 in the scene recognition device shown in Fig. 5 may comprise:
a feature combination subunit 231, configured to connect the target feature and the event feature, to obtain a corresponding feature combination; and
a scene recognition subunit 232, configured to input the feature combination into the trained SVM event type classifier, and determine the event type corresponding to the target monitoring image according to the output result of the SVM event type classifier.
For the implementation process of the functions and effects of the units in the above device, refer to the implementation process of the corresponding steps in the above method, which is not repeated here.
As for the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the relevant parts of the description of the method embodiment. The device embodiment described above is merely schematic: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place, or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement this without creative work.
In the description of this specification, the description of reference terms such as "an embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in conjunction with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the above terms do not necessarily refer to the same embodiment or example. Moreover, the described specific features, structures, materials or characteristics can be combined in an appropriate manner in any one or more embodiments or examples. In addition, provided there is no conflict, those skilled in the art can combine the features of different embodiments or examples described in this specification.
In addition, the terms "first" and "second" are used for descriptive purposes only, and shall not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless otherwise expressly and specifically limited.
As can be seen from the above embodiments, the foreground region image of the target monitoring image is obtained and input into the preset target detection network model and the preset first event type detection network model respectively, to obtain the target feature and the event feature corresponding to the foreground region image; the target feature and the event feature are then input into the preset second event type detection network model, to obtain the event type corresponding to the target monitoring image, which improves the accuracy of scene recognition.
Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. The present application is intended to cover any variations, uses or adaptations of the present invention that follow the general principles of the present invention and include common general knowledge or customary technical means in the art not disclosed by the present invention. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present invention are indicated by the following claims.
It should be understood that the present invention is not limited to the precise structure described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (10)

1. A scene recognition method, characterized by comprising:
obtaining a foreground region image of a target monitoring image;
inputting the foreground region image into a preset target detection network model and a preset first event type detection network model respectively, to obtain a target feature and an event feature corresponding to the foreground region image; and
inputting the target feature and the event feature into a preset second event type detection network model, to obtain the event type corresponding to the target monitoring image.
2. The method according to claim 1, characterized in that, before the foreground region image is input into the preset target detection network model and the preset first event type detection network model respectively, the method further comprises:
judging whether a preset target exists in the foreground region image;
if so, determining to perform the step of inputting the foreground region image into the preset target detection network model and the preset first event type detection network model respectively.
3. The method according to claim 1, characterized in that inputting the foreground region image into the preset target detection network model, to obtain the target feature corresponding to the foreground region image, comprises:
dividing the foreground area into multiple independent sub-image areas according to the effective targets existing in the foreground area, wherein each sub-image area includes one independent effective target; and
inputting the multiple independent sub-image areas into the preset target detection network model, to obtain the target feature corresponding to the foreground region image.
4. The method according to claim 3, characterized in that the target detection network model is a convolutional neural network model, and for the i-th sub-image area, the first-layer convolution function output formula of the target detection network model is:

$$\mathrm{Out}^{(l,k)}_{(x,y)\in N_i} = \tanh\left(\sum_{t=0}^{f-1}\sum_{r=0}^{k_h}\sum_{c=0}^{k_w} w^{(k,t)}_{(r,c)}\,\mathrm{Out}^{(l-1,t)}_{(x+r,\,y+c)} + \mathrm{Bias}^{(l,k)}\right)$$

wherein tanh is the hyperbolic tangent function, l-1 denotes the input layer and l the output layer, k is a preset constant parameter, w is a convolution kernel of size k_w × k_h, f is the number of convolution kernels, Bias is the output bias, k_h and k_w are the height and width of the convolution kernel, N_i is the set of coordinate values in the two-dimensional image plane of the pixels in the i-th sub-image area, and (x, y) is the coordinate value of a pixel in the i-th sub-image area in the two-dimensional image plane.
5. The method according to claim 1, wherein the second event type detection network model is a support vector machine (SVM) event type classifier;
inputting the target feature and the event feature into the preset second event type detection network model to obtain the event type corresponding to the target monitoring image comprises:
concatenating the target feature and the event feature to obtain a corresponding feature combination;
inputting the feature combination into the trained SVM event type classifier, and determining the event type corresponding to the target monitoring image according to the output result of the SVM event type classifier.
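Claim 5 concatenates the target feature with the event feature and classifies the combination with a trained SVM. A toy sketch of that decision step using hand-set linear one-vs-rest weights (all weights, feature values, and event labels here are invented for illustration; a real system would use a trained classifier):

```python
# Sketch of claim 5: concatenate target and event features, then classify the
# combination with a (hypothetical, pre-trained) linear one-vs-rest SVM rule.
import numpy as np

def classify_event(target_feat, event_feat, svm_w, svm_b, labels):
    combined = np.concatenate([target_feat, event_feat])  # the feature combination
    scores = svm_w @ combined + svm_b                     # one score per event type
    return labels[int(np.argmax(scores))]                 # highest-scoring type wins

target_feat = np.array([0.2, 0.9])        # e.g. pedestrian / vehicle responses
event_feat = np.array([0.1, 0.8, 0.05])   # e.g. per-event-type responses
svm_w = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 1.0, 0.0]])
svm_b = np.zeros(2)
print(classify_event(target_feat, event_feat, svm_w, svm_b,
                     ["normal", "gathering"]))  # → gathering
```

The key point of the claim is only the concatenation plus a trained classifier; the linear decision rule stands in for whatever kernel the trained SVM actually uses.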
6. A scene recognition device, comprising:
an acquiring unit, configured to acquire a foreground region image of a target monitoring image;
a feature extraction unit, configured to respectively input the foreground region image into a preset target detection network model and a preset first event type detection network model, to obtain a target feature and an event feature corresponding to the foreground region image;
a scene recognition unit, configured to input the target feature and the event feature into a preset second event type detection network model, to obtain an event type corresponding to the target monitoring image.
7. The device according to claim 6, further comprising:
a judging unit, configured to judge whether a preset target exists in the foreground region image;
the feature extraction unit being specifically configured to, when the judging unit determines that the preset target exists, respectively input the foreground region image into the preset target detection network model and the preset first event type detection network model.
8. The device according to claim 7, wherein the feature extraction unit comprises:
a dividing subunit, configured to divide the foreground region into a plurality of independent sub-image regions according to the effective targets existing in the foreground region, wherein each sub-image region contains one independent effective target;
an extraction subunit, configured to input the plurality of independent sub-image regions into the preset target detection network model, to obtain the target feature corresponding to the foreground region image.
9. The device according to claim 8, wherein the target detection network model is a convolutional neural network model, and for the i-th sub-image region, the first-layer convolution output of the target detection network model is:
Out^(l,k)_{(x,y)∈N_i} = tanh( Σ_{t=0}^{f−1} Σ_{r=0}^{k_h} Σ_{c=0}^{k_w} w^(k,t)_{(r,c)} · Out^(l−1,t)_{(x+r, y+c)} + Bias^(l,k) )
where tanh is the hyperbolic tangent function, l−1 denotes the input layer and l the output layer, k is a preset constant parameter, w is a convolution kernel of size k_w × k_h, f is the number of convolution kernels, Bias is the output bias, k_h and k_w are the height and width of the convolution kernel, N_i is the set of two-dimensional image-plane coordinates of the pixels in the i-th sub-image region, and (x, y) is the coordinate of a pixel of the i-th sub-image region in the two-dimensional image plane.
10. The device according to claim 6, wherein the second event type detection network model is a support vector machine (SVM) event type classifier;
the scene recognition unit comprises:
a feature combination subunit, configured to concatenate the target feature and the event feature to obtain a corresponding feature combination;
a scene recognition subunit, configured to input the feature combination into the trained SVM event type classifier, and determine the event type corresponding to the target monitoring image according to the output result of the SVM event type classifier.
CN201610013913.6A 2016-01-08 2016-01-08 Scene recognition method and device Pending CN105678267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610013913.6A CN105678267A (en) 2016-01-08 2016-01-08 Scene recognition method and device


Publications (1)

Publication Number Publication Date
CN105678267A true CN105678267A (en) 2016-06-15

Family

ID=56299736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610013913.6A Pending CN105678267A (en) 2016-01-08 2016-01-08 Scene recognition method and device

Country Status (1)

Country Link
CN (1) CN105678267A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110051999A1 (en) * 2007-08-31 2011-03-03 Lockheed Martin Corporation Device and method for detecting targets in images based on user-defined classifiers
CN102289686A (en) * 2011-08-09 2011-12-21 北京航空航天大学 Method for identifying classes of moving targets based on transfer learning
CN102945603A (en) * 2012-10-26 2013-02-27 青岛海信网络科技股份有限公司 Method for detecting traffic event and electronic police device
CN103488993A (en) * 2013-09-22 2014-01-01 北京联合大学 Crowd abnormal behavior identification method based on FAST
CN103679189A (en) * 2012-09-14 2014-03-26 华为技术有限公司 Method and device for recognizing scene
CN104063719A (en) * 2014-06-27 2014-09-24 深圳市赛为智能股份有限公司 Method and device for pedestrian detection based on depth convolutional network


Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682685A (en) * 2016-12-06 2017-05-17 重庆大学 Microwave heating temperature field distribution characteristic deep learning-based local temperature variation anomaly detection method
CN106682685B (en) * 2016-12-06 2020-05-01 重庆大学 Local temperature change abnormity detection method based on microwave heating temperature field distribution characteristic deep learning
CN107316035A (en) * 2017-08-07 2017-11-03 北京中星微电子有限公司 Object identifying method and device based on deep learning neural network
CN108875750A (en) * 2017-08-25 2018-11-23 北京旷视科技有限公司 object detecting method, device and system and storage medium
CN108875750B (en) * 2017-08-25 2021-08-10 北京旷视科技有限公司 Object detection method, device and system and storage medium
CN107622498A (en) * 2017-09-29 2018-01-23 北京奇虎科技有限公司 Image penetration management method, apparatus and computing device based on scene cut
CN107610146A (en) * 2017-09-29 2018-01-19 北京奇虎科技有限公司 Image scene segmentation method, apparatus, computing device and computer-readable storage medium
CN107622498B (en) * 2017-09-29 2021-06-04 北京奇虎科技有限公司 Image crossing processing method and device based on scene segmentation and computing equipment
CN107563357A (en) * 2017-09-29 2018-01-09 北京奇虎科技有限公司 Live dress ornament based on scene cut, which is dressed up, recommends method, apparatus and computing device
CN107563357B (en) * 2017-09-29 2021-06-04 北京奇虎科技有限公司 Live-broadcast clothing dressing recommendation method and device based on scene segmentation and computing equipment
CN107610146B (en) * 2017-09-29 2021-02-23 北京奇虎科技有限公司 Image scene segmentation method and device, electronic equipment and computer storage medium
CN108090458A (en) * 2017-12-29 2018-05-29 南京阿凡达机器人科技有限公司 Tumble detection method for human body and device
WO2019128304A1 (en) * 2017-12-29 2019-07-04 南京阿凡达机器人科技有限公司 Human body fall-down detection method and device
CN108647660A (en) * 2018-05-16 2018-10-12 中国科学院计算技术研究所 A method of handling image using neural network chip
CN108566537A (en) * 2018-05-16 2018-09-21 中国科学院计算技术研究所 Image processing apparatus for carrying out neural network computing to video frame
CN108764141A (en) * 2018-05-25 2018-11-06 广州虎牙信息科技有限公司 A kind of scene of game describes method, apparatus, equipment and its storage medium
US11423634B2 (en) 2018-08-03 2022-08-23 Huawei Cloud Computing Technologies Co., Ltd. Object detection model training method, apparatus, and device
US11605211B2 (en) 2018-08-03 2023-03-14 Huawei Cloud Computing Technologies Co., Ltd. Object detection model training method and apparatus, and device
CN110213530A (en) * 2019-04-26 2019-09-06 视联动力信息技术股份有限公司 Method for early warning, device and readable storage medium storing program for executing
CN110110666A (en) * 2019-05-08 2019-08-09 北京字节跳动网络技术有限公司 Object detection method and device
CN110795999A (en) * 2019-09-21 2020-02-14 万翼科技有限公司 Garbage delivery behavior analysis method and related product
CN111259718A (en) * 2019-10-24 2020-06-09 杭州安脉盛智能技术有限公司 Escalator retention detection method and system based on Gaussian mixture model
CN112101290A (en) * 2020-09-27 2020-12-18 成都睿畜电子科技有限公司 Information prompting method, device, medium and electronic equipment for feeding environment
CN112365690A (en) * 2020-10-30 2021-02-12 四川大学华西医院 Intelligent wearable device based on scene recognition and early warning method thereof
CN113052048A (en) * 2021-03-18 2021-06-29 北京百度网讯科技有限公司 Traffic incident detection method and device, road side equipment and cloud control platform
CN113052048B (en) * 2021-03-18 2024-05-10 阿波罗智联(北京)科技有限公司 Traffic event detection method and device, road side equipment and cloud control platform
CN113205037A (en) * 2021-04-28 2021-08-03 北京百度网讯科技有限公司 Event detection method and device, electronic equipment and readable storage medium
CN113205037B (en) * 2021-04-28 2024-01-26 北京百度网讯科技有限公司 Event detection method, event detection device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN105678267A (en) Scene recognition method and device
CN109087510B (en) Traffic monitoring method and device
DE102011100927B4 (en) Object and vehicle detection and tracking using 3-D laser rangefinder
WO2019063416A1 (en) Method and device for operating a driver assistance system, and driver assistance system and motor vehicle
DE102018100469A1 (en) GENERATED SIMULATED SENSOR DATA FOR TRAINING AND VERIFYING RECOGNITION MODELS
Chen et al. Conflict analytics through the vehicle safety space in mixed traffic flows using UAV image sequences
CN110430401A (en) Vehicle blind zone method for early warning, prior-warning device, MEC platform and storage medium
CN110738150B (en) Camera linkage snapshot method and device and computer storage medium
CN110516518A (en) A kind of illegal manned detection method of non-motor vehicle, device and electronic equipment
CN102792314A (en) Cross traffic collision alert system
CN110210474A (en) Object detection method and device, equipment and storage medium
EP2951804A1 (en) Creation of an environment model for a vehicle
CN109543493A (en) A kind of detection method of lane line, device and electronic equipment
DE102021125932A1 (en) System and method for neural network based autonomous driving
DE102010005290A1 (en) Vehicle controlling method for vehicle operator i.e. driver, involves associating tracked objects based on dissimilarity measure, and utilizing associated objects in collision preparation system to control operation of vehicle
DE102013012930A1 (en) Method for determining a current distance and / or a current speed of a target object from a reference point in a camera image, camera system and motor vehicle
DE102023104789A1 (en) TRACKING OF MULTIPLE OBJECTS
KR20240047408A (en) Detected object path prediction for vision-based systems
CN112418126A (en) Unmanned aerial vehicle-based vehicle illegal parking detection method and device and storage medium
CN113255444A (en) Training method of image recognition model, image recognition method and device
CN111444891A (en) Unmanned rolling machine operation scene perception system and method based on airborne vision
Xu et al. Vehicle counting based on double virtual lines
CN113192217A (en) Fee evasion detection method, fee evasion detection device, computer equipment and medium
CN108399357B (en) Face positioning method and device
US11314974B2 (en) Detecting debris in a vehicle path

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160615
