CN110309789A - Video monitoring human face clarity evaluation method and device based on deep learning - Google Patents

Video monitoring human face clarity evaluation method and device based on deep learning

Info

Publication number
CN110309789A
Authority
CN
China
Prior art keywords
convolution
deep learning
video monitoring
human face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910598757.8A
Other languages
Chinese (zh)
Inventor
刘远
崔长鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wei Lian Zhong Cheng Technology Co Ltd
Original Assignee
Beijing Wei Lian Zhong Cheng Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wei Lian Zhong Cheng Technology Co Ltd
Priority to CN201910598757.8A
Publication of CN110309789A
Withdrawn (current legal status)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 - Detection; Localisation; Normalisation
    • G06V 40/166 - Detection; Localisation; Normalisation using acquisition arrangements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 - Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a deep-learning-based video surveillance face clarity evaluation method and device. The invention realizes the construction and selection of features with a convolutional neural network, which helps improve the accuracy of the evaluation results. To address problems such as the excessive number of network parameters and the long computation time, it also proposes transforming the traditional convolution structure into a two-stage convolution mechanism to increase computation speed. The video surveillance face clarity evaluation method of the present invention can assess face clarity accurately while processing faces quickly.

Description

Video monitoring human face clarity evaluation method and device based on deep learning
Technical field
The present invention relates to a deep-learning-based video surveillance face clarity evaluation method and device, and belongs to the technical field of video surveillance.
Background art
In video surveillance applications, faces are first detected and localized in the image; the detections belonging to the same person may comprise one image or several. If only a single face is detected for a person, that image must be used for face recognition regardless of its quality. Most of the time, however, the surveillance system captures multiple face images of differing clarity. Selectively choosing the clearer pictures for face recognition can therefore effectively improve the recognition rate. How to discard blurred face captures and retain only the clear face images has long been one of the primary research problems in both industry and academia.
Image clarity evaluation has attracted the attention of more and more researchers in recent years. Existing methods can be divided roughly into the following classes: methods based on just-noticeable blur (JNB); on the cumulative probability of blur detection (CPBD); on local perceived sharpness (S3); on local phase coherence (LPC); on maximum local variation (MLV); the perceptual sharpness index (PSI) based on edge gradients; the method of Zhang Tianyu et al., which obtains horizontal and vertical gradients with the Sobel operator, extracts the image's strong edges, builds a strong-edge histogram, and obtains a clarity score by weighting; spatial-domain methods that do not consider edge information; transform-domain methods; and methods that evaluate image clarity with the discrete cosine transform.
However, the above face clarity evaluation methods have the following shortcomings: 1) low accuracy; 2) poor robustness; 3) slow running speed.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a deep-learning-based video surveillance face clarity evaluation method and device. Deep learning is applied to face clarity evaluation: a dedicated clarity assessment model is built with a convolutional neural network, and the traditional convolution structure is modified to overcome the high computational complexity that makes real-time computation difficult. The specific technical solution is as follows:
The deep-learning-based video surveillance face clarity evaluation method comprises the following steps: a clarity evaluation model is built with a convolutional neural network; the input face image repeatedly undergoes convolution operations that extract local features, while the dimensionality of the processed feature maps keeps increasing; global features are finally extracted by a fully connected layer and classified with softmax, producing a four-node output in which the four nodes represent the probabilities of the four clarity grades.
As an improvement of the above technical solution, the convolution operation is a separable convolution operation.
As an improvement of the above technical solution, the separable convolution operation decomposes the standard convolution into two modules, a depthwise (per-channel) convolution and a 1×1 convolution: each kernel of the depthwise convolution convolves only its corresponding input channel, and the subsequent 1×1 convolution merges the results computed by the previous layer.
As an improvement of the above technical solution, the first convolutional layer of the convolutional neural network is not decomposed; in addition, an average down-sampling layer is appended at the end, so that the network performs global average down-sampling over each whole feature map.
The deep-learning-based video surveillance face clarity evaluation device comprises:
a memory for storing program instructions;
a processor for running said program instructions to execute the steps of the deep-learning-based video surveillance face clarity evaluation method.
Beneficial effects of the present invention:
1) The present invention replaces traditional approaches with a convolutional neural network for face clarity evaluation, and the separable convolution method greatly reduces the amount of computation and improves processing speed.
2) The deep-learning-based video surveillance face clarity evaluation method has high accuracy, good robustness, fast running speed and good real-time performance, and is particularly suitable for face clarity evaluation under surveillance video.
Brief description of the drawings
Fig. 1 is a schematic diagram of a convolutional layer structure;
Fig. 2 is a comparison of full connection and local connection;
Fig. 3 is a schematic diagram of average down-sampling;
Fig. 4 is a comparison of convolution on a standard convolutional layer and on the improved convolutional layer;
Fig. 5 shows example images of clear and relatively clear faces;
Fig. 6 shows example images of blurred and relatively blurred faces.
Specific embodiments
In order to make the objectives, technical solutions and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present invention and do not limit it.
The deep-learning-based video surveillance face clarity evaluation method comprises the following steps: a clarity evaluation model is built with a convolutional neural network; the input face image repeatedly undergoes convolution operations that extract local features, while the dimensionality of the processed feature maps keeps increasing; global features are finally extracted by a fully connected layer and then classified with softmax, producing a four-node output in which the four nodes represent the probabilities of the four clarity grades.
1. Convolutional neural networks
Convolutional neural networks are a kind of neural network that has proven particularly effective in computer vision. The main reason is their weight-sharing property, which greatly reduces the complexity of the model while also reducing the number of weights. This is a very important advantage, especially when the input is a multidimensional image, because the image can then be fed directly into the network.
1.1 Convolutional layer
In general, the convolutional layer is the network layer dedicated to feature extraction. As shown in Fig. 1, each neuron in a convolutional layer is connected to a small patch of the previous layer's feature map through the weights of a set of filters. In the present invention the height and width of these filter matrices are set manually, usually to 3×3 or 5×5. The convolutional layer analyzes each small patch of its input more deeply to obtain features with a higher level of abstraction.
Fig. 2 compares the two connection types. Under full connection, b1 and b2 must be connected to all four input elements a1, a2, a3, a4. Under the local connection of a convolutional layer, a1 connects only to b1 and not to b2, while a4 connects only to b2 and not to b1. This local-perception property greatly reduces the number of parameters in the network and thus greatly speeds up training.
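As a rough illustration of how much weight sharing saves, the following sketch counts the weights of a fully connected mapping versus a shared 3×3 convolution between two feature maps; the concrete sizes are illustrative assumptions, not values from the patent.

```python
# Toy weight count: fully connected vs. shared-weight convolution.
h = w = 56            # feature map size (illustrative)
c_in, c_out = 1, 32   # input / output channels (illustrative)
k = 3                 # convolution kernel size

fc_weights = (h * w * c_in) * (h * w * c_out)  # every output unit sees every input unit
conv_weights = k * k * c_in * c_out            # one shared kxk kernel per channel pair

print(f"fully connected: {fc_weights:,} weights")    # 314,703,872
print(f"convolution:     {conv_weights:,} weights")  # 288
```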
1.2 Down-sampling layer
The down-sampling layer computes the maximum or average value of small local patches of one or several feature maps. As shown in Fig. 3, the average of each patch of four values is extracted. This reduces the dimensionality of the representation while making it insensitive to variations such as tilt, rotation and changes in the target's relative position.
Doing so can improve precision and alleviates over-fitting to some extent. For ease of understanding, the down-sampling operation can also be thought of as converting a higher-resolution picture into a lower-resolution one.
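A minimal PyTorch sketch of the 2×2 average down-sampling of Fig. 3; the tensor shape is an illustrative assumption.

```python
import torch
import torch.nn as nn

# 2x2 average down-sampling as in Fig. 3: each block of four values
# is replaced by its mean, halving the spatial resolution.
pool = nn.AvgPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 32, 56, 56)   # (batch, channels, height, width); sizes illustrative
print(pool(x).shape)             # torch.Size([1, 32, 28, 28])
```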
1.3 Fully connected layer and softmax classifier
After processing by a few rounds of convolutional and down-sampling layers, the information in the image has been abstracted into features with higher information content. The role of the fully connected layer is to combine all these features and hand the output values to a classifier (such as a softmax classifier).
In summary, convolutional neural networks avoid the cumbersome feature extraction and data reconstruction of traditional recognition algorithms, which reduces running time and increases processing speed.
2. Face clarity evaluation network based on separable convolution
Because the research background of the invention is face clarity evaluation under surveillance video, real-time performance is required. The amount of computation in traditional convolution is large; the method below improves the way the convolution operates.
2.1 Separable convolution
The present invention uses the improved convolution operation known as separable convolution, which decomposes the standard convolution into two modules, a depthwise (per-channel) convolution and a 1×1 convolution. This can be understood as decomposing one original convolutional layer into two convolutional layers: each kernel of the preceding depthwise convolution convolves only its corresponding input channel, and the subsequent 1×1 convolution merges the results computed by the previous layer.
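A minimal PyTorch sketch of this decomposition (module and parameter names are illustrative): the depthwise stage uses grouped convolution so that each kernel sees only its own channel, and a 1×1 pointwise convolution merges the per-channel results.

```python
import torch
import torch.nn as nn

class SeparableConv(nn.Module):
    """Separable convolution as described above: a depthwise stage in
    which each kernel convolves only its own input channel, followed
    by a 1x1 pointwise convolution that merges the results."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        # groups=in_ch restricts each kernel to a single input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel,
                                   padding=kernel // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 28, 28)           # shapes illustrative
print(SeparableConv(32, 64)(x).shape)    # torch.Size([1, 64, 28, 28])
```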
Fig. 4 compares convolution on the standard convolutional layer with the improved convolutional layer. X is the number of input feature maps and Y is the number of kernels in the layer, which can be understood as the number of output channels. For a traditional convolution operation, as in the first row of Fig. 4, suppose the input is X feature maps of size K×K and convolution is done with Y kernels of size W×W×X; the output is Y feature maps of size H×H, so the total computation generated is W×W×X×Y×H×H.
The second row of Fig. 4 shows the first layer after decomposition. The input is the same, but here X depthwise kernels of size W×W×1 each perform the corresponding convolution with one of the X input feature maps, yielding X results that are not summed with one another. The output of this first layer is therefore H×H×X, and its computation is H×H×W×W×X.
In the third row of Fig. 4, the X feature maps of size H×H output by the previous layer serve as input and are convolved with Y kernels of size 1×1×X. This standard 1×1 convolution effectively raises or lowers the number of channels and realizes cross-channel interaction and information integration; it is a kernel type common in several outstanding network architectures of recent years. The final output is Y feature maps of size H×H, so the total computation of this layer is X×Y×1×1×H×H.
The ratio CR between the computation of the two convolution structures is obtained with the following formula:

CR = (H×H×W×W×X + X×Y×1×1×H×H) / (W×W×X×Y×H×H) = 1/Y + 1/W²

In the formula above, the numerator is the computation of the decomposed convolution operation and the denominator is the computation of ordinary convolution. When convolving with a 3×3 kernel, the computation time of the convolution can therefore theoretically be reduced to about 1/9 of the original.
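The closed form can be sanity-checked numerically; the concrete shapes below are illustrative assumptions.

```python
# Numerical check of the ratio CR; X, Y, W, H follow the notation above,
# and the concrete values are illustrative assumptions.
X, Y, W, H = 32, 64, 3, 28

standard  = W * W * X * Y * H * H      # ordinary convolution
depthwise = H * H * W * W * X          # per-channel stage
pointwise = X * Y * 1 * 1 * H * H      # 1x1 merging stage

print((depthwise + pointwise) / standard)  # 0.1267...
print(1 / Y + 1 / W ** 2)                  # identical: 1/Y + 1/W^2, ~1/9 for large Y
```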
2.2 Face clarity evaluation network
The network takes face image blocks of size 56×56 as input and, after a series of computations, finally produces a 4-node output, as shown in Table 1:
Table 1. Face clarity evaluation network structure
The input face image repeatedly undergoes convolution operations that extract local features, while the dimensionality of the processed feature maps keeps increasing; global features are finally extracted by a fully connected layer and then classified with softmax. The four output nodes here represent the probabilities of the four clarity grades.
The present invention does not decompose the first convolutional layer of this network, and an average down-sampling layer is appended at the end. Unlike some classical network architectures, which habitually place one or two fully connected layers before the final classification result, the network here performs global average down-sampling over each whole feature map. This operation saves many parameters and reduces the network's size. The "dw" in Table 1 means depthwise (per-channel).
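Since Table 1 itself is not reproduced in this text, the channel widths and depth in the following sketch are assumptions; only the design points stated above come from the description: a 56×56 input block, an undecomposed first convolutional layer, separable (dw) convolutions after it, global average down-sampling in place of large fully connected layers, and a four-node softmax output.

```python
import torch
import torch.nn as nn

def separable(in_ch, out_ch):
    # depthwise + 1x1 pointwise, as in the section 2.1 sketch
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

class ClarityNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),  # first layer: not decomposed
            nn.ReLU(),
            separable(32, 64), nn.ReLU(),
            separable(64, 128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                   # global average down-sampling
        )
        self.classifier = nn.Linear(128, 4)            # four clarity grades

    def forward(self, x):
        f = self.features(x).flatten(1)
        return torch.softmax(self.classifier(f), dim=1)

probs = ClarityNetSketch()(torch.randn(1, 3, 56, 56))  # one 56x56 face block
print(probs.shape, float(probs.sum()))                 # torch.Size([1, 4]) 1.0
```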
3. Implementation and comparison
3.1 Data set acquisition and annotation
The data required for network training is a set of face snapshots captured under actual surveillance. The pictures were manually annotated: according to their readability they were divided into four parts, clear, relatively clear, relatively blurred and blurred. Each picture was scored by several annotators, and the class with the most votes determined its label. Table 2 gives the reference criteria used during annotation. In total about 3000 surveillance face pictures were collected and annotated; 200 pictures of each of the four classes, 800 faces in total, were picked out as the test data set, and the remaining 2200 faces were used for training.
Table 2. Face clarity annotation criteria
3.2 Model training and testing
In order to increase the amount of training data, entire faces are not used directly as model input; instead, small 56×56 image blocks are cropped from the face images, and each small block is manually assigned a score as its label. When testing the model's effect on a given face, all available non-overlapping small blocks are scored and the average is taken as the clarity evaluation value of that face.
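A sketch of the test-time scoring just described, averaging over all non-overlapping 56×56 blocks; the `score_block` callable stands in for the trained model and is an assumed name.

```python
import numpy as np

def score_face(face: np.ndarray, score_block, patch: int = 56) -> float:
    """Average the block scores over all non-overlapping 56x56 blocks
    of a face image; `score_block` is the (assumed) trained scorer."""
    h, w = face.shape[:2]
    scores = [
        score_block(face[y:y + patch, x:x + patch])
        for y in range(0, h - patch + 1, patch)
        for x in range(0, w - patch + 1, patch)
    ]
    return float(np.mean(scores))
```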
3.3 Precision of the clarity evaluation
The evaluation of face clarity relies on the subjective standard scores given by human raters and the objective scores given by the trained model. The Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SROCC) are then computed to measure the consistency of the two sets of scores: PLCC is mainly used to measure accuracy, and SROCC to measure monotonicity. The closer PLCC and SROCC are to 1, the better the evaluation effect of the objective model.
Because the subjective and objective scores may be on different scales, the objective scores must first be mapped by regression before PLCC and SROCC are computed.
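A minimal sketch of this consistency computation with SciPy; the scores and the linear fit used for the regression mapping are illustrative assumptions (the description does not specify the regression model).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

subjective = np.array([1.0, 2.0, 2.5, 3.5, 4.0])   # human scores (illustrative)
objective  = np.array([0.2, 0.5, 0.55, 0.8, 0.9])  # model scores (illustrative)

# Map the objective scores onto the subjective scale before comparing;
# a simple linear fit stands in for the unspecified regression.
a, b = np.polyfit(objective, subjective, deg=1)
mapped = a * objective + b

plcc, _ = pearsonr(subjective, mapped)    # accuracy
srocc, _ = spearmanr(subjective, mapped)  # monotonicity
print(plcc, srocc)                        # both close to 1 here
```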
3.4 Tests and comparison on the face clarity data set
The other methods compared in the first three experiments are: just-noticeable blur (JNB), cumulative probability of blur detection (CPBD), local perceived sharpness (S3), local phase coherence (LPC) and maximum local variation (MLV). To prove that the invention is effective on both high-quality and low-quality faces, the 400 better-quality faces of the clear and relatively clear parts of the test set were used in the first group of experiments, and the poorer-quality blurred and relatively blurred parts in the second group.
The first experiment tests on the clear and relatively clear data sets; Fig. 5 shows two examples from the clear set and the relatively clear set.
Table 3 gives the experimental results: the six compared methods and the precision of each.
Table 3. Comparison results of the six methods (clear and relatively clear)
The blurred and relatively blurred data sets are tested next. The image quality of this part is poorer, consisting mainly of defocus blur produced when the camera's focus is not adjusted properly and motion blur produced by relative motion between the person's face and the camera. Fig. 6 shows example pictures of blurred and relatively blurred faces. Table 4 gives the experimental results, again a comparison of the six methods.
Table 4. Comparison results of the six methods (blurred and relatively blurred)
The next group of experiments tests the running speed of the model and of the other methods. The tests were carried out under Visual Studio 2013 on a computer configured with an Intel Core i5-6500 CPU at 3.20 GHz, 8.00 GB of memory and an NVIDIA GeForce GTX 960 GPU. The speed metric is frames per second (fps); the test results are shown in Table 5:
Table 5. Running speed comparison of the six methods
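For reference, fps can be measured with a simple timing loop of the following kind; the function and data names are illustrative assumptions, not the benchmark code actually used.

```python
import time

def measure_fps(evaluate, faces, warmup: int = 5) -> float:
    """Frames per second of `evaluate` over a list of face images;
    a few warm-up calls avoid timing one-off initialization costs."""
    for face in faces[:warmup]:
        evaluate(face)
    start = time.perf_counter()
    for face in faces:
        evaluate(face)
    return len(faces) / (time.perf_counter() - start)
```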
The last group of control experiments compares against several classical network architectures, including AlexNet and GoogLeNet, with the test platform placed on an ARM board. The tests use the clear and relatively clear data sets, the PLCC coefficient as the assessment metric, and the recorded running times to compute fps; see Table 6:
Table 6. Comparison results of the three methods on the ARM platform
Comparing the results of Tables 3 and 4 shows that, among the four later methods, the gap between the evaluation effect on clear and relatively clear faces and that on blurred and relatively blurred faces grows larger. JNB performs clearly worse than the methods that follow it; CPBD, an improved version of JNB, improves considerably over JNB but still trails the four methods behind it. Both of these classical methods essentially compute clarity from the width of edges, and when an image is severely blurred it is hard to detect edges, so their accuracy on the blurred data sets declines faster than the other methods'. The S3 method takes both the spatial and the frequency domain into account and performs better than considering the spatial domain alone. LPC computes image clarity in the transform domain; it also performs well and is stable, although its results in the several groups of experiments are all slightly worse than the invention's. MLV is a fast and effective method with considerable computational efficiency, yet its effect is also slightly worse than the invention's. Compared with the other clarity evaluation algorithms, the invention's PLCC and SROCC indices are the highest, showing that its evaluation method is better than the preceding ones.
The results of Table 5 show that the MLV method has the fastest running speed and the S3 method the slowest. Although the invention has been heavily optimized for speed, it still trails MLV, the fastest method, by a considerable margin in CPU computation speed; however, it is substantially faster than the first four methods and can achieve real-time processing. On the GPU in particular, its processing speed even approaches that of MLV, mainly because a GPU is better than a CPU at handling highly parallel tasks (such as convolution operations) and can better exploit floating-point performance.
The last experiment, on the clear and relatively clear data sets, was placed on the embedded ARM platform, which better verifies the efficiency of the network (Table 6). In the end, GoogLeNet's PLCC coefficient is slightly higher than the invention's, by 0.8%, while the invention's method is 0.3% higher than AlexNet. AlexNet has the slowest processing speed of the three networks, only 5.9 fps, whereas the invention's network reaches 11.5 fps, almost twice AlexNet. GoogLeNet's processing speed is about 1.3 times AlexNet's, reaching 7.6 fps, but still only 0.66 times the invention's. Therefore, with accuracy almost identical, the invention has a huge advantage in speed.
In the above embodiments, a convolutional neural network replaces traditional approaches for face clarity evaluation, and the separable convolution method greatly reduces the amount of computation and improves processing speed.
Face recognition is widely used in daily life, and face clarity evaluation, one of its key technologies, has become a popular research topic. However, traditional methods based on manually extracted features are lacking in both effect and robustness. The present invention therefore realizes the construction and selection of features with a convolutional neural network, which helps improve the accuracy of the evaluation results. At the same time, to address problems such as the excessive number of network parameters and the long computation time, it also proposes transforming the traditional convolution structure into a two-stage convolution mechanism to increase computation speed.
Extensive experiments show that the proposed deep-learning-based video surveillance face clarity evaluation method can assess face clarity accurately and with faster processing speed.
In addition, the invention also relates to a deep-learning-based video surveillance face clarity evaluation device, comprising:
a memory for storing program instructions;
a processor for running said program instructions to execute the steps of the deep-learning-based video surveillance face clarity evaluation method; the specific steps are the same as the steps of the method described above and are not repeated here.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be realized by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention; any modifications, equivalent replacements and improvements made within the spirit and principle of the invention shall be included in its scope of protection.

Claims (5)

1. A deep-learning-based video surveillance face clarity evaluation method, characterized by comprising the following steps: building a clarity evaluation model with a convolutional neural network, wherein the input face image repeatedly undergoes convolution operations that extract local features while the dimensionality of the processed feature maps keeps increasing, global features are finally extracted by a fully connected layer and classified with softmax, and a four-node output is obtained in which the four nodes represent the probabilities of the four clarity grades.
2. The deep-learning-based video surveillance face clarity evaluation method according to claim 1, characterized in that the convolution operation is a separable convolution operation.
3. The deep-learning-based video surveillance face clarity evaluation method according to claim 2, characterized in that the separable convolution operation decomposes the standard convolution into two modules, a depthwise (per-channel) convolution and a 1×1 convolution, wherein each kernel of the preceding depthwise convolution convolves only its corresponding input channel, and the subsequent 1×1 convolution merges the results computed by the previous layer.
4. The deep-learning-based video surveillance face clarity evaluation method according to claim 1, characterized in that the first convolutional layer of the convolutional neural network is not decomposed, an average down-sampling layer is additionally appended at the end, and the convolutional neural network performs global average down-sampling over each whole feature map.
5. A deep-learning-based video surveillance face clarity evaluation device, characterized by comprising:
a memory for storing program instructions;
a processor for running said program instructions to execute the steps of the deep-learning-based video surveillance face clarity evaluation method according to any one of claims 1 to 4.
CN201910598757.8A 2019-07-04 2019-07-04 Video monitoring human face clarity evaluation method and device based on deep learning Withdrawn CN110309789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910598757.8A CN110309789A (en) 2019-07-04 2019-07-04 Video monitoring human face clarity evaluation method and device based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910598757.8A CN110309789A (en) 2019-07-04 2019-07-04 Video monitoring human face clarity evaluation method and device based on deep learning

Publications (1)

Publication Number Publication Date
CN110309789A true CN110309789A (en) 2019-10-08

Family

ID=68078176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910598757.8A Withdrawn CN110309789A (en) 2019-07-04 2019-07-04 Video monitoring human face clarity evaluation method and device based on deep learning

Country Status (1)

Country Link
CN (1) CN110309789A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740758A (en) * 2015-12-31 2016-07-06 上海极链网络科技有限公司 Internet video face recognition method based on deep learning
CN109409305A (en) * 2018-11-01 2019-03-01 深圳技术大学(筹) A kind of facial image clarity evaluation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Qi et al., "Face clarity evaluation under video surveillance with deep learning" (深度学习的视频监控下的人脸清晰度评价), Journal of China University of Metrology (《中国计量大学学报》) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445433A (en) * 2019-10-14 2020-07-24 北京华宇信息技术有限公司 Method and device for detecting blank page and fuzzy page of electronic file
CN111242911A (en) * 2020-01-08 2020-06-05 来康科技有限责任公司 Method and system for determining image definition based on deep learning algorithm
CN111862040A (en) * 2020-07-20 2020-10-30 中移(杭州)信息技术有限公司 Portrait picture quality evaluation method, device, equipment and storage medium
CN111862040B (en) * 2020-07-20 2023-10-31 中移(杭州)信息技术有限公司 Portrait picture quality evaluation method, device, equipment and storage medium
CN112215342A (en) * 2020-09-28 2021-01-12 南京俊禄科技有限公司 Multichannel parallel CNN accelerator for marine meteorological radar photographic device
CN112215342B (en) * 2020-09-28 2024-03-26 南京俊禄科技有限公司 Multi-channel parallel CNN accelerator of marine weather radar photographing device

Similar Documents

Publication Publication Date Title
CN110309789A (en) Video monitoring human face clarity evaluation method and device based on deep learning
CN106446930B (en) Robot operative scenario recognition methods based on deep layer convolutional neural networks
CN106650737B (en) Automatic image cutting method
CN104143079B (en) The method and system of face character identification
CN110020606A (en) A kind of crowd density estimation method based on multiple dimensioned convolutional neural networks
CN110009679A (en) A kind of object localization method based on Analysis On Multi-scale Features convolutional neural networks
CN108647588A (en) Goods categories recognition methods, device, computer equipment and storage medium
CN106462771A (en) 3D image significance detection method
CN106296638A (en) Significance information acquisition device and significance information acquisition method
CN110135500A (en) Method for tracking target under a kind of more scenes based on adaptive depth characteristic filter
CN108960167A (en) Hair style recognition methods, device, computer readable storage medium and computer equipment
CN107341505B (en) Scene classification method based on image significance and Object Bank
CN112001241B (en) Micro-expression recognition method and system based on channel attention mechanism
CN106503672A (en) A kind of recognition methods of the elderly's abnormal behaviour
WO2021068781A1 (en) Fatigue state identification method, apparatus and device
CN108710893A (en) A kind of digital image cameras source model sorting technique of feature based fusion
CN105956570B (en) Smiling face's recognition methods based on lip feature and deep learning
Kavitha et al. Hierarchical classifier for soft and hard exudates detection of retinal fundus images
CN106529377A (en) Age estimating method, age estimating device and age estimating system based on image
CN109033935A (en) Wrinkles on one's forehead detection method and device
CN110263731A (en) A kind of single step face detection system
CN106529441A (en) Fuzzy boundary fragmentation-based depth motion map human body action recognition method
Sakthimohan et al. Detection and Recognition of Face Using Deep Learning
JP7404535B2 (en) Conduit characteristic acquisition method based on computer vision, intelligent microscope, conduit tissue characteristic acquisition device, computer program, and computer equipment
CN108694398A (en) A kind of image analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20191008