CN110502995A - Driver yawning detection method based on subtle facial action recognition - Google Patents

Driver yawning detection method based on subtle facial action recognition

Info

Publication number
CN110502995A
CN110502995A
Authority
CN
China
Prior art keywords
frame
key
driver
frames
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910658690.2A
Other languages
Chinese (zh)
Other versions
CN110502995B (en)
Inventor
闵卫东
杨浩
韩清
熊辛
张愚
汪琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang University
Original Assignee
Nanchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang University filed Critical Nanchang University
Priority to CN201910658690.2A priority Critical patent/CN110502995B/en
Publication of CN110502995A publication Critical patent/CN110502995A/en
Application granted granted Critical
Publication of CN110502995B publication Critical patent/CN110502995B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition

Abstract

The present invention provides a driver yawning detection method based on subtle facial action recognition, comprising the following steps. Step 1: pre-process the driver driving video captured by the vehicle-mounted camera, performing face detection and segmentation, image size normalization, and denoising. Step 2: propose a key-frame extraction algorithm that combines picture-histogram similarity-threshold screening with rejection of outlier-similarity pictures, so as to extract the key frames in a subtle action sequence. Step 3: according to the selected key frames, establish a 3D deep-learning network with low temporal sampling (3D-LTS) to detect various yawning behaviors. The present invention extracts the key frames of subtle actions with the key-frame extraction algorithm and then extracts spatio-temporal features through the established 3D-LTS network to detect various subtle facial actions. The method outperforms existing methods in recognition rate and overall performance, can effectively distinguish yawning from other subtle facial actions, and effectively reduces the false detection rate of driver yawning behavior.

Description

Driver yawning detection method based on subtle facial action recognition
Technical field
The present invention relates to the technical field of computer vision, and specifically to a driver yawning detection method based on subtle facial action recognition.
Background art
Intelligent driving, which includes providing pre-warning signals and monitoring and assisting vehicle control, has been a hot research topic for improving road safety in recent years. Every year, thousands of people die or suffer major injuries because drivers fall asleep at the wheel, and driver fatigue seriously threatens road safety. A survey by the National Highway Traffic Safety Administration showed that more than one third of respondents acknowledged having experienced fatigue while driving. Among fatigue-related accidents, 10% of drivers acknowledged that such an accident had happened to them within the past month or the past year. Researchers have found that driver fatigue causes 22% of traffic accidents, and that without any warning, driving fatigue makes a collision or near-collision six times more likely than normal driving. Therefore, research on methods for identifying driver fatigue is extremely important for improving road safety. Over the past few decades, many driver fatigue detection methods have been proposed to help drivers drive safely and improve traffic safety. The behavioral features of a fatigued driver include blinking, nodding, eye closing, and yawning; among these behaviors, yawning is one of the principal manifestations of fatigue, so researchers have studied yawning detection extensively. Compared with the actions handled by traditional action recognition, these facial actions can be regarded as subtle facial actions.
Although many researchers have proposed different methods to detect yawning, these methods still face great challenges. Because of the complicated facial actions and expressions of drivers in real driving environments, existing methods have difficulty detecting yawning accurately and stably, especially when some facial actions and expressions produce mouth deformations similar to yawning, which easily causes false detections. Therefore, facing the new features and new challenges of the driving environment, how to detect driver yawning behavior quickly and accurately is a problem that needs to be studied.
Summary of the invention
The object of the present invention is to solve the problem that existing driver yawning detection algorithms cannot effectively distinguish certain special yawns and yawn-like behaviors, such as special yawning behaviors like yawning while singing, and yawn-like behaviors like shouting. To this end, a driver yawning detection method based on subtle facial action recognition is proposed.
To achieve the above object, the invention provides the following technical scheme: a driver yawning detection method based on subtle facial action recognition, comprising the following steps:
Step 1: pre-process the driver driving video captured by the vehicle-mounted camera, performing face detection and segmentation, image size normalization, and denoising;

Step 2: propose a key-frame extraction algorithm that combines picture-histogram similarity-threshold screening with rejection of outlier-similarity pictures, so as to extract the key frames in a subtle action sequence;

Step 3: according to the selected key frames, establish a 3D deep-learning network with low temporal sampling (3D-LTS) to detect various yawning behaviors.
Further, pre-processing the driver driving video captured by the vehicle-mounted camera includes: detecting the driver's face region with the Viola-Jones face detection algorithm, segmenting out the driver's facial area, and denoising with a fast median filtering method.
Further, the key-frame extraction algorithm extracts a series of key frames K = {K_i, i = 1, ..., M} from a series of original video frames F = {F_j, j = 1, ..., N}, where M denotes the number of key frames selected from the original frames and N denotes the number of original frames. The key-frame extraction algorithm comprises two selection stages:

In the first selection stage, the RGB color histogram of each video frame is calculated; then the similarity between the color histograms γ_j and γ_{j+1} of two successive frames is calculated using the Euclidean distance:

S_j = √( Σ_{k=1}^{n} ( γ_j(k) − γ_{j+1}(k) )² )   (1)

where 1 ≤ j ≤ N−1 and n is the dimension of the picture color histogram.
The similarity threshold T_s is calculated by formula (2):

T_s = μ_s   (2)

where μ_s = Mean(S) and S is the set of the S_j. When S_j > T_s, the similarity between F_j and F_{j+1} is considered small, and F_j is added to the candidate key-frame queue.
In the second selection stage, the candidate key frames with outlier features are rejected to obtain the final key frames. Two image similarity measures are used: the Euclidean distance (ED) and the root-mean-square error (RMSE). The median absolute deviation (MAD) is used to detect the frames with outlier features and is calculated according to formula (3):

MAD = median( |X_i − median(X)| )   (3)

Two consecutive candidate key frames are denoted K_{i,i+1}. For all K_{i,i+1}, their RMSE and ED values are calculated; for each calculated RMSE(K_{i,i+1}) and ED(K_{i,i+1}), the MAD values are computed and denoted α = MAD(RMSE) and β = MAD(ED). RMSE(K_{i,i+1}) is calculated by formula (4) and ED(K_{i,i+1}) by formula (5). When RMSE(K_{i,i+1}) is less than α and ED(K_{i,i+1}) is less than β, K_i is considered a candidate key frame with outlier features and is removed from the candidate key frames.

RMSE(K_{i,i+1}) = √( (1/n) Σ_{p=1}^{n} ( K_i(p) − K_{i+1}(p) )² )   (4)

where n denotes the size of K_i.

ED(K_{i,i+1}) = √( Σ_{k=1}^{m} ( γ_i(k) − γ_{i+1}(k) )² )   (5)

where m denotes the picture color histogram dimension of the candidate key frames, and γ_i and γ_{i+1} are the color histograms of K_i and K_{i+1}.
Further, the 3D-LTS network is used for spatio-temporal feature extraction and subtle action recognition. The 3D-LTS network uses 8 non-overlapping frames as input and extracts spatio-temporal features from the successive frames with four 3D convolutional layers. All convolution filters are 3 × 3 × 3 with a stride of 1 × 1 × 1, and all pooling layers use max pooling. The kernel size of the first and second pooling layers is 1 × 2 × 2, the filter numbers of the four convolutional layers are 32, 64, 128 and 256 respectively, and the kernel size of the third pooling layer is 2 × 4 × 4. The convolutional layers are followed by a fully connected layer used to map the features; a fully connected layer with 1024 outputs is used to integrate the feature distribution.
Compared with the prior art, the beneficial effects of the present invention are:

The present invention proposes a driver yawning detection method based on subtle facial action recognition. First, a two-stage key-frame extraction algorithm is proposed; the algorithm has the advantages of fast calculation speed and effective extraction of the key frames of subtle actions from the original frame sequence. Second, the invention also provides a subtle-action recognition network based on a three-dimensional convolutional network, used to extract spatio-temporal features and detect various subtle facial actions. The proposed method outperforms existing methods in recognition rate and overall performance, can effectively distinguish yawning from other subtle facial actions, and effectively reduces the false detection rate of driver yawning behavior.
Description of the drawings
Fig. 1 is the overall framework of the driver yawning detection method based on subtle facial action recognition of the present invention;
Fig. 2 shows example key-frame extraction results of the present invention;
Fig. 3 is a comparison of two-dimensional convolution and three-dimensional convolution;
Fig. 4 is the structure of the 3D-LTS network proposed by the present invention;
Fig. 5 shows some frame samples from the YawDDR dataset;
Fig. 6 shows image sequences of two actions in the YawDDR dataset: (a) talking, (b) yawning;
Fig. 7 shows the positions of the high-definition camera and the driver in the present invention;
Fig. 8 shows image sequences of three facial actions in the MFAY dataset of the present invention: (a) yawning, (b) singing, (c) shouting;
Fig. 9 shows the number of video sequences in the MFAY dataset of the present invention;
Fig. 10 shows the detection results of the method of the present invention and four state-of-the-art methods on the MFAY dataset.
Specific embodiments
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here only explain the technical solution of the present invention and do not limit the present invention.

The present invention provides the following technical solution: a driver yawning detection method based on subtle facial action recognition, whose overall framework is shown in Fig. 1, comprising the following steps:
Step 1: pre-process the driver driving video captured by the vehicle-mounted camera, performing face detection and segmentation, image size normalization, and denoising;

Step 2: propose a key-frame extraction algorithm that combines picture-histogram similarity-threshold screening with rejection of outlier-similarity pictures, so as to extract the key frames in a subtle action sequence;

Step 3: according to the selected key frames, establish a 3D deep-learning network with low temporal sampling (3D-LTS) to detect various yawning behaviors.
Video pre-processing is an essential step in our work. Since driver yawning detection concerns driver fatigue, it must meet real-time requirements, so the video recorded by the vehicle-mounted camera must be processed with fast and effective video processing techniques. We first split the video into frames. The frames contain much redundant information, such as the background, which is useless for the subsequent classification and instead causes considerable interference. Our goal is to classify the driver's facial actions, so our region of interest is the driver's facial region. We detect the driver's face region with the Viola-Jones face detection algorithm, which is fast, stable and accurate and is one of the most widely used face detection algorithms, although its detection performance degrades for faces rotated beyond a certain angle. In the present invention the camera directly faces the driver's face, which ensures a 100% face recall rate. After detecting the face region, we resize the successive frames to a uniform 200 × 200.

A real driving environment vibrates because of the motion of the car, so our vehicle-mounted camera produces noise and interference during shooting. To reduce this interference as much as possible, we denoise with a fast median filtering algorithm, a GPU-accelerated version of median filtering. Median filtering effectively removes shot noise and salt-and-pepper noise, and the noise produced by vibration mostly belongs to these two kinds, so the fast median filtering algorithm achieves the best denoising effect and minimizes the interference caused by noise.
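As an illustration only, a minimal Python sketch of this pre-processing step might look as follows. It uses OpenCV's Haar-cascade implementation of Viola-Jones and the CPU medianBlur as a stand-in for the GPU-accelerated fast median filter described above; the cascade file, the 3 × 3 filter aperture and the largest-detection heuristic are our assumptions, while the 200 × 200 output size comes from the description:

import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_frame(frame):
    # Viola-Jones (Haar cascade) face detection on the grayscale frame.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # the camera faces the driver, so misses should be rare
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # assumed: keep largest face
    face = cv2.resize(frame[y:y + h, x:x + w], (200, 200))  # unify frame size
    return cv2.medianBlur(face, 3)  # removes shot and salt-and-pepper noise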
In the pre-processing stage, the video is split into a frame sequence at 30 frames per second. Since the information difference between adjacent frames in the original sequence is very small, there are many redundant frames; these redundant frames reduce the accuracy of action classification, especially for subtle actions with small motion amplitude. To solve these problems, the present invention proposes an effective real-time key-frame extraction algorithm based on picture-similarity-threshold screening and outlier-similarity rejection. Fig. 2 shows a key-frame sequence selected by the algorithm.

The proposed key-frame extraction algorithm extracts a series of key frames K = {K_i, i = 1, ..., M} from a series of original video frames F = {F_j, j = 1, ..., N}, where M denotes the number of key frames selected from the original frames and N denotes the number of original frames. The present invention combines threshold-based histogram similarity filtering with outlier detection. Picture histograms have the advantage of low computational cost, and compared with local features, global features such as image distance and histograms can effectively reduce false alarms in classification.

The proposed key-frame extraction algorithm comprises two selection stages:
In the first selection stage, the RGB color histogram of each video frame is calculated; then the similarity between the color histograms γ_j and γ_{j+1} of two successive frames is calculated using the Euclidean distance:

S_j = √( Σ_{k=1}^{n} ( γ_j(k) − γ_{j+1}(k) )² )   (1)

where 1 ≤ j ≤ N−1 and n is the dimension of the picture color histogram. Through this calculation we obtain a set S containing the similarities of F_j and F_{j+1}. We then need to determine a similarity threshold T_s for key-frame selection; this threshold should represent the average level of inter-frame similarity. We considered two threshold calculation methods: half of the sum of the maximum and minimum similarities, and the average similarity. We used both thresholds to select key frames from our self-collected dataset and from the processed YawDD benchmark dataset; our network was trained on the self-collected dataset and tested on the processed YawDD dataset. The results are shown in Table 1, where s denotes the similarity set and YT abbreviates yawning while talking. The results show that using the average similarity as the threshold lets our yawning detection method achieve the best overall results. Half of the sum of the maximum and minimum frame similarities fuses two extreme similarities and cannot represent the average similarity of these facial actions, whereas the average similarity as threshold selects the most representative key frames.
Table 1. Experimental results under the two thresholds (unit: %)
The similarity threshold T_s is calculated by formula (2):

T_s = μ_s   (2)

where μ_s = Mean(S) and S is the set of the S_j. When S_j > T_s, the similarity between F_j and F_{j+1} is considered small, and F_j is added to the candidate key-frame queue.
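A minimal Python sketch of this first selection stage might read as follows; the 8 × 8 × 8-bin normalized RGB histogram is our assumption, since the description specifies only an RGB color histogram, not the bin count:

import cv2
import numpy as np

def color_histogram(frame, bins=8):
    # Normalized RGB color histogram flattened to an n = bins**3 dimensional vector.
    hist = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3,
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def select_candidates(frames):
    # Stage one: S_j is the Euclidean distance between the histograms of
    # F_j and F_(j+1) (formula (1)); F_j becomes a candidate key frame when
    # S_j exceeds the mean similarity T_s = mu_s (formula (2)).
    hists = [color_histogram(f) for f in frames]
    s = np.array([np.linalg.norm(hists[j] - hists[j + 1])
                  for j in range(len(frames) - 1)])
    t_s = s.mean()
    return [frames[j] for j in range(len(s)) if s[j] > t_s]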
In the second selection stage, the candidate key frames with outlier features are rejected to obtain the final key frames. Two image similarity measures are used: the Euclidean distance (ED) and the root-mean-square error (RMSE). The median absolute deviation (MAD) is used to detect the frames with outlier features and is calculated according to formula (3):

MAD = median( |X_i − median(X)| )   (3)

Two consecutive candidate key frames are denoted K_{i,i+1}. For all K_{i,i+1}, their RMSE and ED values are calculated; for each calculated RMSE(K_{i,i+1}) and ED(K_{i,i+1}), the MAD values are computed and denoted α = MAD(RMSE) and β = MAD(ED). RMSE(K_{i,i+1}) is calculated by formula (4) and ED(K_{i,i+1}) by formula (5). When RMSE(K_{i,i+1}) is less than α and ED(K_{i,i+1}) is less than β, K_i is considered a candidate key frame with outlier features and is removed from the candidate key frames.

RMSE(K_{i,i+1}) = √( (1/n) Σ_{p=1}^{n} ( K_i(p) − K_{i+1}(p) )² )   (4)

where n denotes the size of K_i.

ED(K_{i,i+1}) = √( Σ_{k=1}^{m} ( γ_i(k) − γ_{i+1}(k) )² )   (5)

where m denotes the picture color histogram dimension of the candidate key frames, and γ_i and γ_{i+1} are the color histograms of K_i and K_{i+1}.
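The second selection stage could be sketched in Python as follows. This is an illustrative reading of formulas (3)-(5): it reuses color_histogram from the previous sketch for ED and treats RMSE as a pixel-wise measure between the two candidate frames, which is our interpretation of "n denotes the size of K_i":

import numpy as np

def mad(x):
    # Median absolute deviation, formula (3).
    x = np.asarray(x)
    return np.median(np.abs(x - np.median(x)))

def remove_outliers(candidates):
    # Stage two: drop candidate K_i when both RMSE(K_i,i+1) < alpha and
    # ED(K_i,i+1) < beta, with alpha = MAD(RMSE) and beta = MAD(ED).
    if len(candidates) < 3:
        return list(candidates)
    pairs = list(zip(candidates[:-1], candidates[1:]))
    rmse = np.array([np.sqrt(np.mean((a.astype(float) - b.astype(float)) ** 2))
                     for a, b in pairs])                       # formula (4)
    ed = np.array([np.linalg.norm(color_histogram(a) - color_histogram(b))
                   for a, b in pairs])                         # formula (5)
    alpha, beta = mad(rmse), mad(ed)
    kept = [candidates[i] for i in range(len(pairs))
            if not (rmse[i] < alpha and ed[i] < beta)]
    kept.append(candidates[-1])  # the last frame has no successor to compare with
    return kept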
Another important contribution of the present invention is the introduction of an action recognition mechanism into yawning detection. In recent years, action recognition has made great progress in both accuracy and speed, and researchers have proposed various networks to recognize actions. The widely used action recognition frameworks are two-stream fusion networks and 3D convolutional networks.
3D convolutional networks have attracted much attention in action recognition, scene and object recognition, and action similarity analysis. Compared with other spatio-temporal feature extraction methods based on two-stream networks, three-dimensional convolutional networks have the advantages of fast calculation speed and high accuracy. Some researchers have tried to stack the successive feature maps of 2D convolutions to classify video actions, but temporal information is lost in the 2D convolution process. In contrast, a 3D convolutional network uses multiple successive video frames as input, as shown in Fig. 3, and achieves better temporal modeling through 3D convolution and 3D pooling operations. Experiments have found that 3D convolution kernels of size 3 × 3 × 3 extract the most representative spatio-temporal features.
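The difference is easy to see in code. In the following PyTorch snippet (PyTorch is our choice of framework for illustration; the description does not name one), a 3D convolution keeps the temporal axis of an 8-frame clip, while a 2D convolution applied frame by frame cannot mix information across time:

import torch
import torch.nn as nn

clip = torch.randn(1, 3, 8, 112, 112)  # (batch, channels, frames, height, width)

conv3d = nn.Conv3d(3, 32, kernel_size=3, stride=1, padding=1)
print(conv3d(clip).shape)  # (1, 32, 8, 112, 112): the 8-frame temporal axis is kept

frames = clip.transpose(1, 2).reshape(8, 3, 112, 112)  # treat each frame separately
conv2d = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
print(conv2d(frames).shape)  # (8, 32, 112, 112): per-frame features, no mixing over time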
Based on three-dimensional convolution, the present invention proposes a 3D-LTS network with low temporal sampling for spatio-temporal feature extraction and subtle action recognition. The 3D-LTS network uses 3D convolutions to extract spatio-temporal features and a softmax layer for classification. After data pre-processing and key-frame selection, determining how many frames to use as the input of the 3D-LTS network is extremely important for obtaining the best recognition performance. We compared the results of the 3D-LTS network with different numbers of input frames; our network was trained on the self-collected dataset and tested on the processed YawDD dataset, with the experimental results shown in Table 2. From the overall recognition results, our 3D-LTS network is not very sensitive to the number of input frames, and using 8 non-overlapping frames as input shows the best performance. 3D-LTS extracts spatio-temporal features from the successive frames with four 3D convolutional layers; its structure is shown in Fig. 4. As the structure shows, all convolution filters are 3 × 3 × 3 with a stride of 1 × 1 × 1, and all pooling layers use max pooling. If we slow down the pooling rate of the shallow pooling layers along the temporal dimension, the deep convolutional layers can extract more representative temporal features from the shallow convolutional layers, which is extremely important for recognizing subtle behaviors. Based on this analysis, the kernel size of the first and second pooling layers in our 3D-LTS is 1 × 2 × 2. The filter numbers of the four convolutional layers are 32, 64, 128 and 256 respectively, and the kernel size of the third pooling layer is 2 × 4 × 4. The convolutional layers are followed by a fully connected layer used to map the features; we use a fully connected layer with 1024 outputs to integrate the feature distribution. We found that our 3D-LTS obtains the best recognition performance when the convolutional layers are followed by one fully connected layer.
Table 2. Experimental results with different numbers of network input frames (unit: %)
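For illustration, the 3D-LTS architecture described above could be sketched in PyTorch as follows. The four 3 × 3 × 3 convolutional layers (32/64/128/256 filters, stride 1 × 1 × 1), the 1 × 2 × 2, 1 × 2 × 2 and 2 × 4 × 4 max-pooling kernels and the 1024-output fully connected layer follow the description; the padding, ReLU activations, 200 × 200 input size and six-class output are our assumptions:

import torch
import torch.nn as nn

class LTS3D(nn.Module):
    def __init__(self, num_classes=6, in_frames=8, size=200):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),   # spatial pooling only: all 8 frames kept
            nn.Conv3d(32, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),   # still no temporal downsampling
            nn.Conv3d(64, 128, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 4, 4)),   # late, "low" temporal sampling
            nn.Conv3d(128, 256, 3, stride=1, padding=1), nn.ReLU(),
        )
        with torch.no_grad():  # infer the flattened feature size for the FC layer
            flat = self.features(torch.zeros(1, 3, in_frames, size, size)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 1024), nn.ReLU(),  # the 1024-output fully connected layer
            nn.Linear(1024, num_classes),      # softmax is applied by the training loss
        )

    def forward(self, x):  # x: (batch, 3, frames, height, width)
        return self.classifier(self.features(x))

model = LTS3D()
print(model(torch.randn(2, 3, 8, 200, 200)).shape)  # torch.Size([2, 6])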
In the experiments of the present invention, we first tested our system with a standard yawning detection dataset, the YawDD dataset. YawDD is a public yawning detection dataset that can be used to validate face detection, facial feature extraction, yawning detection and other algorithms. The dataset collects action videos of volunteers of different genders, ages, countries and races, and contains 351 videos. Three or four videos were recorded for each driver, covering different mouth situations, such as talking, yawning, and yawning while talking.

Since most of the video clips in the YawDD dataset last longer than one minute and contain multiple facial actions, we needed to split the videos in the YawDD dataset into clips containing only a single action. In this way we built the YawDDR dataset on the basis of the YawDD dataset. The video length in the YawDDR dataset is about 8 seconds. There are three actions in this dataset: talking (T), yawning (Y), and yawning while talking (YT). 486 image sequences were collected in YawDDR. Some examples from the dataset (before and after face segmentation) are shown in Fig. 5 and Fig. 6. We used this dataset to verify the validity of our method.

Many face datasets are used for identity recognition, facial expression recognition and face detection. However, no public driver yawning detection dataset contains various facial actions. The purpose of collecting our own dataset is to verify the efficiency of our method for driver yawning detection under various facial actions. Therefore, we built our MFAY dataset with a high-definition camera in a real driving environment. We divided the facial actions that may occur during driving into six classes: talking (T); yawning while talking (YT); yawning (Y); singing (S); yawning while singing (YS); and shouting (ST). In view of the danger of fatigued driving, the collection site was a broad road with few pedestrians. Without affecting driving, a mini high-definition camera was installed facing the drivers to capture their facial actions. During the experiment, the drivers drove under different illumination and road conditions. In the front passenger seat, a researcher continuously monitored the facial action changes of each subject to annotate the ground truth of each facial action. The positions of the high-definition camera and the driver are shown in Fig. 7.

The facial videos of 20 testers (ranging in age from 20 to 46) were obtained under different situations while the vehicle was moving. Sample images of the MFAY dataset are shown in Fig. 8. All videos were converted to the audio-video interleaved (AVI) format with a frame rate of 30 fps. Finally, as shown in Fig. 9, 347 image sequences (53652 images) were extracted from the obtained videos. The length of each image sequence is about 5 seconds (150 frames).
Based on the YawDDR and MFAY datasets, the present invention carried out the following three experiments:
Experiment 1: to prove that our key-frame extraction algorithm can effectively select the key frames in driving video frame sequences, we performed the following experiment on the YawDDR and MFAY datasets. First, picture histograms are used to delete the frames with very small differences and to select candidate key frames; we denote this operation as stage one. To verify that our algorithm effectively improves the recognition rate of various facial actions, we also provide the recognition results obtained without any key-frame extraction, which we denote as the "not used" case. The results are shown in Table 3. After stage one, the accuracy improves. On the basis of stage one, we use MAD to reject the candidate key frames with outlier features; after this processing we obtain the required key frames. As can be seen from Table 3, compared with stage one alone and with no key-frame extraction, our two-stage key-frame extraction algorithm achieves the best recognition performance, demonstrating the validity of our key-frame extraction algorithm.
Table 3. Experimental results of the different key-frame selection stages (unit: %)
Experiment 2: in this experiment we focus on comparative experiments between our proposed method and some other existing image-based methods. We compare our method with the kernel fuzzy rough sets based method proposed by Du Y et al., with the two-fold expert system with agents proposed by Anitha C et al., and with two methods based on convolutional neural networks. To verify the validity of our method, we use the following model training and testing strategy: the training set consists of video clips extracted at random from the MFAY and YawDDR datasets according to their classes, and the remaining video clips are used to test the models. All video clips are processed by our proposed key-frame extraction algorithm, and the selected key frames are used to train and test the network models. Since the image-based methods cannot effectively detect actions such as YT, the experimental results for these cases are not recorded in our table and figure. As shown in Table 4 and Fig. 10, compared with the other methods, the recognition rate of the yawning detection method based on subtle facial actions and video key frames is significantly improved. Our method for recognizing various facial actions is better than existing methods and effectively reduces false detections. The video-based method can efficiently extract sufficient spatio-temporal motion features and realize dynamic yawning detection, which further demonstrates the robustness of our proposed method.
Table 4. Detection results of the proposed method and four state-of-the-art methods on the YawDDR dataset (unit: %)
Experiment 3: in this experiment we compare the image-based methods with the video-based method. Our method uses successive frames as input and is a video-based method. For the image-based methods, the frame images in the YawDDR and MFAY datasets are used for training and testing: we uniformly extract frames from the two datasets and assign them labels according to their classes. The data processing steps and validation algorithms of these experiments are identical. The experimental results are shown in Table 5. The results show that the video-based method performs better than the image-based methods, because yawning is a continuous action rather than a static one. The video-based method can detect yawning under various facial situations. If only a single frame is used for recognition, the critical inter-frame action information is lost, and the features indicating yawning may be confused with those indicating singing or shouting. In contrast, the video-based method provides enough spatio-temporal action information and can classify actions through the sequence of action frames. Treating yawning as an action rather than a static state significantly alleviates the large number of false detections present in methods based on still images.
Table 5. Experimental results of the image-based and video-based detection methods (unit: %)
The experiments show that the method proposed by the present invention outperforms existing methods in recognition rate and overall performance, can effectively distinguish yawning from other subtle facial actions, and effectively reduces the false detection rate of driver yawning behavior.
The above only expresses preferred embodiments of the present invention, and their description is relatively specific and detailed, but it cannot therefore be interpreted as limiting the patent scope of the present invention. It should be pointed out that, for those of ordinary skill in the art, several deformations, improvements and substitutions can be made without departing from the concept of the present invention, and these belong to the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (4)

  1. A driver yawning detection method based on subtle facial action recognition, characterized by comprising the following steps:
    Step 1: pre-process the driver driving video captured by the vehicle-mounted camera, performing face detection and segmentation, image size normalization, and denoising;
    Step 2: propose a key-frame extraction algorithm that combines picture-histogram similarity-threshold screening with rejection of outlier-similarity pictures, so as to extract the key frames in a subtle action sequence;
    Step 3: according to the selected key frames, establish a 3D deep-learning network with low temporal sampling (3D-LTS) to detect various yawning behaviors.
  2. The driver yawning detection method based on subtle facial action recognition according to claim 1, characterized in that pre-processing the driver driving video captured by the vehicle-mounted camera includes: detecting the driver's face region with the Viola-Jones face detection algorithm, segmenting out the driver's facial area, and denoising with a fast median filtering algorithm.
  3. The driver yawning detection method based on subtle facial action recognition according to claim 1, characterized in that the key-frame extraction algorithm extracts a series of key frames K = {K_i, i = 1, ..., M} from a series of original video frames F = {F_j, j = 1, ..., N}, where M denotes the number of key frames selected from the original frames and N denotes the number of original frames, and the key-frame extraction algorithm comprises two selection stages:
    In the first selection stage, the RGB color histogram of each video frame is calculated; then the similarity between the color histograms γ_j and γ_{j+1} of two successive frames is calculated using the Euclidean distance:
    S_j = √( Σ_{k=1}^{n} ( γ_j(k) − γ_{j+1}(k) )² )   (1)
    where 1 ≤ j ≤ N−1 and n is the dimension of the picture color histogram;
    the similarity threshold T_s is calculated by formula (2):
    T_s = μ_s   (2)
    where μ_s = Mean(S) and S is the set of the S_j; when S_j > T_s, the similarity between F_j and F_{j+1} is considered small, and F_j is added to the candidate key-frame queue;
    in the second selection stage, the candidate key frames with outlier features are rejected to obtain the final key frames; two image similarity measures are used, the Euclidean distance (ED) and the root-mean-square error (RMSE), and the median absolute deviation (MAD) is used to detect the frames with outlier features, MAD being calculated according to formula (3):
    MAD = median( |X_i − median(X)| )   (3)
    two consecutive candidate key frames are denoted K_{i,i+1}; for all K_{i,i+1}, their RMSE and ED values are calculated, and for each calculated RMSE(K_{i,i+1}) and ED(K_{i,i+1}) the MAD values are computed and denoted α = MAD(RMSE) and β = MAD(ED); RMSE(K_{i,i+1}) is calculated by formula (4) and ED(K_{i,i+1}) by formula (5); when RMSE(K_{i,i+1}) is less than α and ED(K_{i,i+1}) is less than β, K_i is considered a candidate key frame with outlier features and is removed from the candidate key frames;
    RMSE(K_{i,i+1}) = √( (1/n) Σ_{p=1}^{n} ( K_i(p) − K_{i+1}(p) )² )   (4)
    where n denotes the size of K_i;
    ED(K_{i,i+1}) = √( Σ_{k=1}^{m} ( γ_i(k) − γ_{i+1}(k) )² )   (5)
    where m denotes the picture color histogram dimension of the candidate key frames, and γ_i and γ_{i+1} are the color histograms of K_i and K_{i+1}.
  4. The driver yawning detection method based on subtle facial action recognition according to claim 1, characterized in that the 3D-LTS network is used for spatio-temporal feature extraction and subtle action recognition; the 3D-LTS network uses 8 non-overlapping frames as input and extracts spatio-temporal features from the successive frames with four 3D convolutional layers; all convolution filters are 3 × 3 × 3 with a stride of 1 × 1 × 1; all pooling layers use max pooling; the kernel size of the first and second pooling layers is 1 × 2 × 2; the filter numbers of the four convolutional layers are 32, 64, 128 and 256 respectively; the kernel size of the third pooling layer is 2 × 4 × 4; the convolutional layers are followed by a fully connected layer used to map the features, and a fully connected layer with 1024 outputs is used to integrate the feature distribution.
CN201910658690.2A 2019-07-19 2019-07-19 Driver yawning detection method based on fine facial action recognition Active CN110502995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910658690.2A CN110502995B (en) 2019-07-19 2019-07-19 Driver yawning detection method based on fine facial action recognition


Publications (2)

Publication Number Publication Date
CN110502995A true CN110502995A (en) 2019-11-26
CN110502995B CN110502995B (en) 2023-03-14

Family

ID=68586658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910658690.2A Active CN110502995B (en) 2019-07-19 2019-07-19 Driver yawning detection method based on fine facial action recognition

Country Status (1)

Country Link
CN (1) CN110502995B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170340292A1 (en) * 2016-05-31 2017-11-30 Stmicroelectronics S.R.L. Method for the detecting electrocardiogram anomalies and corresponding system
CN106203283A (en) * 2016-06-30 2016-12-07 重庆理工大学 Based on Three dimensional convolution deep neural network and the action identification method of deep video
CN107087211A (en) * 2017-03-30 2017-08-22 北京奇艺世纪科技有限公司 A kind of anchor shots detection method and device
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108830157A (en) * 2018-05-15 2018-11-16 华北电力大学(保定) Human bodys' response method based on attention mechanism and 3D convolutional neural networks
CN108875610A (en) * 2018-06-05 2018-11-23 北京大学深圳研究生院 A method of positioning for actuation time axis in video based on border searching
CN109145823A (en) * 2018-08-22 2019-01-04 佛山铮荣科技有限公司 A kind of market monitoring device
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 A kind of Activity recognition method, apparatus and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445548A (en) * 2020-03-21 2020-07-24 南昌大学 Multi-view face image generation method based on non-paired images
CN111445548B (en) * 2020-03-21 2022-08-09 南昌大学 Multi-view face image generation method based on non-paired images
CN113724211A (en) * 2021-08-13 2021-11-30 扬州美德莱医疗用品有限公司 Fault automatic identification method and system based on state induction

Also Published As

Publication number Publication date
CN110502995B (en) 2023-03-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant