CN110334627A - Device and system for detecting the behavior of personnel - Google Patents

Device and system for detecting the behavior of personnel

Info

Publication number
CN110334627A
CN110334627A (application CN201910561189.4A)
Authority
CN
China
Prior art keywords: state, convolutional layer, yawn, layer, combination
Prior art date
Legal status: Pending (an assumption, not a legal conclusion)
Application number
CN201910561189.4A
Other languages
Chinese (zh)
Inventor
马乾力
Current Assignee
Shenzhen Micro & Nano Integrated Circuit And System Application Institute
Original Assignee
Shenzhen Micro & Nano Integrated Circuit And System Application Institute
Priority date
Filing date
Publication date
Application filed by Shenzhen Micro & Nano Integrated Circuit And System Application Institute
Priority: CN201910561189.4A
Publication: CN110334627A
Legal status: Pending

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification
    • G06V40/20: Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a detection device for detecting the behavior of personnel. The device is configured to process video and/or images containing personnel and to simultaneously detect, in each frame, the eye state, mouth state, cigarette state, and telephone state of the personnel. The detection is performed by an algorithm based on a convolutional neural network comprising at least one convolutional layer, at least one residual layer, one global pooling layer, and one fully connected layer, where each convolutional layer is followed by a BN (batch normalization) layer and a LeakyReLU layer.

Description

Device and system for detecting the behavior of personnel
Technical field
The present invention relates to the field of visual detection, and more particularly to a device and system for detecting the behavior of personnel.
Background technique
The current domestic and international anti-terrorism and public-security situation is complex, so security personnel are routinely stationed at crowded places in major cities such as subways, airports, buses, and bus stations.
Security personnel are usually supplied to clients by labor-outsourcing or security companies. To maximize efficiency, such a company may move a group of personnel to another site as soon as one job is finished, using its staff to the greatest possible extent and increasing its income. The accompanying problem is that, as working hours and intensity increase, attention declines and fatigued work increases, so the likelihood of security loopholes caused by human factors grows significantly.
To avoid these problems, the client needs a complete system: completing attendance through face recognition and verifying that the person on duty matches the shift report; detecting the behavior of on-duty personnel (fatigue, attention, etc.) through deep learning; and feeding detected problems back to the back office and the responsible person in time for real-time intervention.
Behavior detection is generally divided into fatigue detection and violation-action detection. On-duty fatigue refers to a decline in responsiveness caused by insufficient sleep or prolonged duty; it manifests as drowsiness, dozing, operational errors, or a complete loss of security capability. Early fatigue detection was performed mainly from a medical angle: measuring physiological characteristics with medical devices, studying the causes and risk factors of dozing, and looking for ways to reduce the danger. Another approach is to develop intelligent alarm systems that prevent on-duty personnel from dozing; for example, signal-processing methods can extract fatigue data such as blink frequency and duration to judge whether a person is dozing or asleep. Violation actions are actions or behaviors during duty that affect the work, such as making phone calls, playing with a mobile phone, smoking, or chatting. Violation detection is generally solved by computer vision, with a different algorithm for each action. Because these algorithms differ greatly, deploying all of them to a mobile terminal would be too large and redundant, so they are seldom applied on mobile devices.
Summary of the invention
In view of the high cost of existing sensor-based schemes and the redundancy and complexity of traditional image-recognition schemes, the present invention proposes an end-to-end, neural-network-based computer-vision detection scheme that solves in one system many needs concerning security personnel: clock-in/clock-out attendance, post matching, fatigue on duty, and detection of smoking, phone calls, and other actions. The invention uses a convolutional neural network to extract features, detects the state of the target human body at each instant (eyes open/closed, mouth open/closed, making a phone call, smoking), comprehensively judges the fatigue state and on-duty violation state of the target over a fixed period of time, and issues an alarm signal when the security personnel are fatigued on duty or perform a violation action.
The present invention provides a device for detecting the behavior of personnel. The device is configured to process video and/or images containing personnel and to simultaneously detect, in each frame, the eye state, mouth state, cigarette state, and telephone state of the personnel. The detection is performed by an algorithm based on a convolutional neural network comprising at least one convolutional layer, at least one residual layer, one global pooling layer, and one fully connected layer, where each convolutional layer is followed by a BN layer and a LeakyReLU layer.
The present invention provides an end-to-end, neural-network-based system for security-personnel identity and behavior detection, avoiding the high cost of traditional detection methods and the redundant, complex computation of traditional image processing: all judgment results are output by a single system in one pass. Compared with traditional image-detection methods, the deep-learning approach used here needs no image-enhancement preprocessing and adapts to extreme environments such as uneven illumination, diversified target features, and complex backgrounds. It also supports incremental training for a given scene: in actual use, training samples can be regularly and manually calibrated to raise the accuracy in the dedicated scene. Because the invention includes both face detection and behavior detection, it can efficiently analyze and give early warning about the behavior of the personnel concerned, avoiding problems in traditional schemes where the two functions sit in different systems or products, with complex interfaces and slow response.
Other features of the present invention will become apparent from the following description of exemplary embodiments with reference to the accompanying drawings.
Detailed description of the invention
The accompanying drawings, which are included in and form part of the description, illustrate exemplary embodiments, features, and aspects of the present invention and, together with the written description, serve to explain its principles.
Fig. 1 is the module diagram according to the detection system of one aspect of the invention.
Fig. 2 is the schematic diagram according to the Face datection process of one aspect of the invention.
Fig. 3 is the schematic diagram according to the detection process of one aspect of the invention.
Fig. 4 is the structural schematic diagram of the convolutional neural networks of one aspect according to the present invention.
Fig. 5 is a flowchart of the detection of fatigue or violation states according to one aspect of the present invention.
Specific embodiment
Various exemplary embodiments, features, and aspects of the present invention are described in detail below with reference to the accompanying drawings. Unless otherwise stated, the relative arrangement of components, the numerical expressions, and the numerical values described in these embodiments do not limit the scope of the invention. The following embodiments are not intended to limit the scope of the invention as recited in the claims, and not all combinations of features described in the embodiments are essential to the invention.
To solve the above problems, as shown in Fig. 1, the present invention provides a system for detecting human behavior, comprising an image acquisition unit, a behavior detection unit, and a data storage unit (or "storage module"). Preferably, the detection system further includes a face detection unit and an alarm unit. In addition, the system includes a communication unit and an interface unit for communicating with external devices.
As an example, the image acquisition unit is preferably a device that obtains high-fidelity images, such as a camera or video camera. It acquires video and/or images of a fixed area (for example, fire-prone regions such as forests, residential blocks, basements, and overhead lines, or large public areas with heavy foot traffic in a city, such as shopping malls) and sends the collected video and/or images, for example over the RTSP protocol, to the detection unit and the data storage unit. The camera may be, without limitation, an analog camera or an IP camera. The image acquisition unit is configured to continuously record the behavior of the personnel concerned, forming a video of many frames. The images can be sent to the behavior detection unit or face detection unit described below for feature extraction and computation. Alternatively, only frames that meet a predetermined condition (for example, the 5th, 10th, 15th, 20th frame, and so on) may be transmitted to those units for feature extraction and computation.
As an example, the behavior detection unit processes the video and/or images collected by the image acquisition unit, extracts features according to a pre-stored model and network, and judges whether on-duty personnel in the monitored area are in violation states such as fatigue, making phone calls, playing with a mobile phone, or smoking. If no abnormality is found, detection continues; if an abnormality is found, the abnormality information and the image/video data are saved to the data storage unit, and an early warning is sent back to the back office by wired or wireless means. The behavior detection unit consists of an algorithm part and a computing-hardware part; the hardware includes, without limitation, general-purpose GPUs, embedded GPUs, CPUs, and dedicated AI chips. The algorithm part is described in detail below.
<processing of image>
As shown in Fig. 3, the detection system (behavior detection unit) of the invention extracts features from video or images through a convolutional neural network and classifies them, detecting the position and state of each relevant body part as well as the presence and position of any cigarette or phone in the image. The open/closed state of the eyes and mouth, the smoking state, and the phone-call (talking) state are thus obtained directly from the convolutional neural network, and the continuous duration or count of each state can be computed. If continuous eye closure or yawning is detected, the person can be judged to be fatigued and the system can output a corresponding warning signal; if smoking or a phone call is detected, the person can be judged to be in a violation state, and a corresponding warning message can be output or used directly as an alarm.
<<convolutional neural networks architectural overview>>
As an overview, the convolutional neural network of the invention is composed of a series of 1*1 and 3*3 convolutional layers, each followed by a BN layer and a LeakyReLU layer. To counter the performance degradation caused by increasing network depth, residual layers are introduced. Finally, a global pooling layer and a fully connected layer are added at the end of the network, followed by softmax classification. The convolution stride (strides) defaults to (1, 1), and the padding (whether boundary pixels are lost during convolution) defaults to "same": before the convolution, the image is padded with a one-pixel border of zeros, and the convolution is then performed, so the spatial size is preserved. Preferably, "same" padding is used throughout the network of the invention.
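As an illustration of the "same"-padding behavior just described, the following Python sketch (not part of the patent; the helper name and the alternating stride pattern are illustrative assumptions) shows that with "same" padding the spatial output size depends only on the stride:

```python
import math

def conv_output_size(in_size: int, stride: int) -> int:
    """Spatial output size of a convolution with 'same' padding.

    With 'same' padding the output size depends only on the stride,
    not on the kernel size: out = ceil(in / stride).
    """
    return math.ceil(in_size / stride)

# A stride-1 3x3 convolution preserves size; a stride-2 one halves it,
# which is how the network described here downsamples without pooling.
size = 256
for stride in (1, 2, 1, 2, 1, 2, 1, 2, 1, 2):  # feature / downsample pairs
    size = conv_output_size(size, stride)
print(size)  # 8 after five stride-2 stages: 256 -> 128 -> 64 -> 32 -> 16 -> 8
```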
<<first embodiment of convolutional neural networks structure>>
Hereinafter, the specific structure of the convolutional neural network of the invention is described in detail with reference to Fig. 4.
First, the video or series of images collected by the image acquisition unit is input to the 1st convolutional layer ("Convolutional") for preliminary feature extraction. An image of size 256*256 is used here as an example; those skilled in the art will understand that images of other sizes may also be input. Of course, the subsequent convolutional structure then changes with the input size, for example by adding or removing convolutional layers, changing the size and number of convolution kernels, increasing or decreasing the number of residual layers, or changing the positions of the residual layers in the network. As an example, the kernel size of the 1st convolutional layer is set to 3*3 and the number of kernels to 32. After this layer, the output image size is 256*256.
Next, the output of the 1st convolutional layer enters the 2nd convolutional layer, which downsamples the image to reduce its size. The kernel size of the 2nd convolutional layer is 3*3 with stride 2 (written 3*3/2), and the number of kernels is 64. The 2nd convolutional layer reduces the image to 128*128 and outputs it.
The output of the 2nd convolutional layer then enters the 3rd combination layer, which extracts features and increases network depth. The 3rd combination layer comprises the 31st convolutional layer, the 32nd convolutional layer, and a residual layer. The kernel size of the 31st convolutional layer is 1*1 with 32 kernels; that of the 32nd convolutional layer is 3*3 with 64 kernels. After this layer, the output is still a 128*128 image.
The output of the 3rd combination layer then enters the 4th convolutional layer for downsampling. Its kernel size is 3*3/2 and the number of kernels is 128; after this layer, the output size is 64*64.
The output of the 4th convolutional layer then passes through 2 (2x) 5th combination layers in sequence, extracting features and increasing network depth. Each 5th combination layer comprises the 51st convolutional layer (1*1, 64 kernels), the 52nd convolutional layer (3*3, 128 kernels), and a residual layer. After the two 5th combination layers, the output is still a 64*64 image.
Next, the output of the 5th combination layers enters the 6th convolutional layer for downsampling (kernel 3*3/2, 256 kernels); after this layer, the output size is 32*32.
The output of the 6th convolutional layer then passes through 4 7th combination layers in sequence, extracting features and increasing network depth. Each 7th combination layer comprises the 71st convolutional layer (1*1, 128 kernels), the 72nd convolutional layer (3*3, 256 kernels), and a residual layer. After these layers, the output is still a 32*32 image.
The output of the 7th combination layers then enters the 8th convolutional layer for downsampling (kernel 3*3/2, 512 kernels); after this layer, the output size is 16*16.
Subsequently, the output of the 8th convolutional layer passes through 4 9th combination layers in sequence, extracting features and increasing network depth. Each 9th combination layer comprises the 91st convolutional layer (1*1, 256 kernels), the 92nd convolutional layer (3*3, 512 kernels), and a residual layer. After these layers, the output is still a 16*16 image.
The output of the 9th combination layers then enters the 10th convolutional layer for downsampling (kernel 3*3/2, 1024 kernels); after this layer, the output size is 8*8.
The output of the 10th convolutional layer then passes through 2 11th combination layers in sequence, extracting features and increasing network depth. Each 11th combination layer comprises the 111th convolutional layer (1*1, 512 kernels), the 112th convolutional layer (3*3, 1024 kernels), and a residual layer. After these layers, the output size is 8*8.
Finally, the output of the 11th combination layers passes through the global pooling layer and the fully connected layer for classification. The global pooling layer pools each 8*8 feature map globally, yielding one feature point per map. In the fully connected stage, the feature points are processed by a two-layer neural network with input dimension 256 and output dimension 2; the first layer uses a TanH activation function, and the second layer is followed by a softmax function.
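The layer-by-layer walk above can be checked with a short Python sketch. The table below transcribes the first embodiment under the stated assumptions ("same" padding throughout, stride 2 only in the downsampling layers); the layer labels are informal names, not taken from the patent.

```python
# (label, stride, out_channels, repeats) for the first embodiment,
# transcribed from the description above; labels are informal.
LAYERS = [
    ("conv1",  1,   32, 1),  # 3x3
    ("conv2",  2,   64, 1),  # 3x3/2 downsample
    ("comb3",  1,   64, 1),  # 1x1(32) + 3x3(64) + residual
    ("conv4",  2,  128, 1),
    ("comb5",  1,  128, 2),
    ("conv6",  2,  256, 1),
    ("comb7",  1,  256, 4),
    ("conv8",  2,  512, 1),
    ("comb9",  1,  512, 4),
    ("conv10", 2, 1024, 1),
    ("comb11", 1, 1024, 2),
]

def trace_sizes(in_size=256):
    """Return the spatial size after each stage ('same' padding assumed)."""
    sizes, size = [], in_size
    for name, stride, _ch, repeats in LAYERS:
        for _ in range(repeats):
            size = -(-size // stride)  # ceiling division
        sizes.append((name, size))
    return sizes

trace = dict(trace_sizes())
print(trace["comb11"])  # 8, matching the 8*8 maps fed to global pooling
```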
<<second embodiment of convolutional neural networks structure>>
To reduce the parameters and computation of the network, one can reduce the network parameters appropriately on the one hand, and remove part of the network layers on the other, without significantly affecting accuracy. For example, the second embodiment is obtained by slightly deforming the first. The parameter settings and arrangement of the convolutional and combination layers that are identical to the first embodiment are not repeated here. There are two differences. First, the second embodiment has no 7th combination layers; the output of the 6th convolutional layer enters the 8th convolutional layer directly. Second, after the 11th combination layers, a 12th convolutional layer and a 13th combination layer are added.
As an example, the kernel size of the 12th convolutional layer is 3*3/2 with 1024 kernels; after this layer, the output size is 8*8.
As an example, the 13th combination layer comprises the 131st convolutional layer (1*1, 512 kernels), the 132nd convolutional layer (3*3, 1024 kernels), and a residual layer. After this layer, the output is still an 8*8 image, which then enters the global pooling layer.
<<training methods and parameter of convolutional neural networks>>
The convolution kernels in the convolutional layers and the fully connected layers are initialized with Gaussian random numbers of mean 0 and standard deviation 0.1; the bias terms are initialized with uniform random numbers on the interval [0, 1].
In the batch normalization layers, the momentum is set to 0.95 and the constant to 0.01.
The weights are trained with the AdaDelta gradient-descent algorithm, with a batch size of 64.
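A minimal Python sketch of the stated initialization scheme, using only the standard library (the function names and sample counts are illustrative assumptions, not the patent's implementation):

```python
import random

def init_weights(n: int, rng: random.Random) -> list:
    """Gaussian init (mean 0, std 0.1) for conv kernels and FC weights,
    as stated in the text."""
    return [rng.gauss(0.0, 0.1) for _ in range(n)]

def init_biases(n: int, rng: random.Random) -> list:
    """Uniform init on [0, 1] for bias terms, as stated in the text."""
    return [rng.uniform(0.0, 1.0) for _ in range(n)]

rng = random.Random(0)  # seeded for reproducibility
w = init_weights(10_000, rng)
b = init_biases(10_000, rng)
print(len(w), len(b))  # 10000 10000
```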
The data are divided into a training set, a validation set, and a test set in a certain proportion. During 20 generations of training, a validation run is performed after every generation; the generation with the best validation result is saved as the model and used for the test set, and that result is the result of the whole training.
The full data-training cycle is set to 100 generations. During training, the ratio of positive to negative samples in the training set is 10:1; in each generation, a shuffled 20% of the negative samples and all of the positive samples are trained, so one training cycle is completed once all negative samples have been trained.
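The schedule above implies a simple arithmetic check: if each generation consumes a disjoint 20% slice of the negatives, one full pass over the negatives (one cycle) takes 5 generations, so 100 generations correspond to 20 cycles. A hedged sketch, assuming disjoint slices:

```python
def negatives_cycle_length(fraction_per_generation: float = 0.20) -> int:
    """Number of generations needed to exhaust all negative samples when
    each generation uses a disjoint fraction of them."""
    gens, covered = 0, 0.0
    while covered < 1.0 - 1e-9:  # tolerance guards float accumulation
        covered += fraction_per_generation
        gens += 1
    return gens

cycle = negatives_cycle_length(0.20)
total_generations = 100
print(cycle, total_generations // cycle)  # 5 generations per cycle, 20 cycles
```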
The above experimental methods and parameters were obtained through extensive experiments on a scientific basis. They are well suited to the personnel environments targeted by the invention, and their effect is especially pronounced when detecting the eye state, mouth state, smoking state, and talking state.
<judgement of fatigue or violation state>
The video or images pass through the convolutional neural network for feature extraction. Each image is first divided into an 11*11 grid of sub-boxes; centered on each grid cell, 5 random candidate boxes are generated, and each candidate box is classified in the last fully connected layer, yielding the classification result and position of each candidate box. In network training, the following states are defined: the position and open/closed state of the eyes or mouth, whether the person has lifted a phone to the face and the phone's position, and the position of any cigarette. The state judgment and alarm conditions are as follows.
Fatigue state: a closed eye is the characterization of eye fatigue; if the eyes remain continuously closed for more than 3 s (the eye-closure threshold, which could also be 5 s, 10 s, etc.), an eye-closure fatigue state is recognized. A wide-open mouth is the characterization of mouth fatigue; if the mouth remains continuously wide open for more than 1 s (the yawn threshold, e.g. 2 s) and this is detected 3 or more times within the yawn set time (for example at least 60 s, 100 s, or 120 s), a yawn fatigue state is recognized. The eye-closure fatigue state and the yawn fatigue state are collectively called the fatigue state.
Smoking state: detecting the presence of a cigarette close to the mouth is defined as the smoking state. If this state occurs 3, 4, or 5 times (the smoking count threshold) within the smoking set time (e.g. 5 s, 10 s, or 20 s), the person is judged to be smoking in violation of the rules.
Talking state: lifting a phone and holding it to the face is defined as the talking state. If this state continues for, e.g., 5 s or more (the call duration threshold, e.g. 6 s, 8 s, or 15 s), the person is judged to be making a call in violation of the rules.
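The smoking and talking rules can be sketched as small Python predicates. The 10 s window and count of 3 are examples drawn from the ranges above; the function names and the event-time representation are illustrative assumptions:

```python
def smoking_violation(event_times, window=10.0, min_events=3):
    """True if at least min_events cigarette-near-mouth detections fall
    within any window-second span (the 'smoking set time' above)."""
    times = sorted(event_times)
    for i in range(len(times)):
        # count events inside the window starting at times[i]
        count = sum(1 for t in times[i:] if t - times[i] <= window)
        if count >= min_events:
            return True
    return False

def calling_violation(durations, min_duration=5.0):
    """True if any continuous phone-at-face episode lasts at least
    min_duration seconds (the call duration threshold above)."""
    return any(d >= min_duration for d in durations)

print(smoking_violation([0.0, 3.0, 7.0]))  # True: 3 events within 10 s
print(calling_violation([2.0, 6.5]))       # True: one 6.5 s call
```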
As a detection example for the eye-closure fatigue state: during video-stream detection, when the eyes are first detected to be closed, the current time (e.g. 10:10:10) and/or the number of the current frame is recorded (i.e. a time or a frame count; the same applies below). If the state continues to be detected in subsequent frames, a counter variable is accumulated; if several consecutive frames, or the immediately following frame, no longer show the state, the eyes have opened and the accumulation is interrupted. The accumulated value (unit: frames), or the difference between the recorded start time and the end time (unit: seconds), is the continuous duration of the eye-closed state. The invention sets the maximum continuous eye-closure time (the eye-closure threshold) to 3 s; as those skilled in the art know, other values such as 4 s or 5 s can also be used.
As an example, if no eye closure is detected in frames 1-10, the eye-closure start time and the continuous eye-closure duration are both set to 0. If closed eyes are detected at frame 11, the current time, say 10:10:10, is recorded and set as the eye-closure start time. If the eyes remain closed up to frame 20, the current time is continuously refreshed up to the time of frame 20, say 10:10:11; the continuous closure duration is then 1 s, which does not reach the threshold, so the person cannot yet be judged to be in the eye-closure fatigue state. If open eyes are detected at frame 21, the person is not continuously closing the eyes, the possibility of fatigued work is excluded, and the eye-closure start time and duration are reset to 0. Alternatively, if closed eyes are detected continuously from frame 11 through frame 60, the recorded current time keeps refreshing (from the time of frame 12 through the time of frame 60), say to 10:10:15, and the continuous closure duration is updated to 5 s. At this point the duration exceeds the eye-closure threshold (3 s in this embodiment), so the person is judged to be sleeping or dozing; the alarm unit is triggered to emit a sound or light alarm, and the relevant image or video is transmitted to an external device (e.g. a console). After the alarm, the eye-closure start time and duration are reset to 0, and the next round of detection begins.
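The eye-closure example above can be sketched as a small state machine in Python. The 10 fps frame rate and class name are illustrative assumptions; the reset-on-open and reset-after-alarm behavior follows the description:

```python
class EyeClosureTracker:
    """Tracks continuous eye closure across frames and raises an alarm
    once the closure lasts longer than threshold_s (3 s in the text)."""

    def __init__(self, threshold_s: float = 3.0):
        self.threshold_s = threshold_s
        self.start = None  # timestamp when closure began; None if eyes open

    def update(self, timestamp_s: float, eyes_closed: bool) -> bool:
        """Feed one frame; return True when the fatigue alarm fires."""
        if not eyes_closed:
            self.start = None          # accumulation interrupted: eyes opened
            return False
        if self.start is None:
            self.start = timestamp_s   # first closed frame: record start time
        if timestamp_s - self.start > self.threshold_s:
            self.start = None          # reset after alarm, next round begins
            return True
        return False

# Mirrors the worked example: closure detected from t=0 and held for 5 s.
tracker = EyeClosureTracker(threshold_s=3.0)
alarms = [tracker.update(t / 10, True) for t in range(51)]  # 10 fps, 5 s
print(alarms.index(True))  # 31: the alarm fires just past the 3 s mark
```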
As a detection example for the yawn fatigue state: during video-stream detection, when a wide-open mouth is first detected, the current time (e.g. 10:10:10) and/or the number of the current frame is recorded. If the state continues to be detected, a counter variable is accumulated; if several consecutive frames, or the immediately following frame, no longer show the state, the accumulation is interrupted. The accumulated value (unit: frames), or the difference between the recorded start time and the end time (unit: seconds), is the continuous duration of the yawn state. The invention sets the maximum continuous yawn time (the yawn threshold) to 1 s; as those skilled in the art know, other values can also be used.
As an example, if no wide-open mouth is detected in frames 1-10, the yawn start time and the continuous yawn duration are both set to 0. If a wide-open mouth is first detected in the 11th frame, the current time, e.g. 10:10:10, is recorded and set as the yawn start time. If the mouth is still detected as wide open at the 15th frame, the current time, e.g. 10:10:10.5, is recorded, giving a continuous yawn duration of 0.5 s; since this has not reached the predetermined yawn duration (1 s in this embodiment), it cannot yet be determined that the person is in a yawn fatigue state. If the mouth remains wide open from the 11th frame through the 40th frame, the recorded current time is continuously refreshed (recording starts from the time of the 12th frame and is refreshed up to the time of the 40th frame), e.g. to 10:10:12, and the continuous yawn duration is updated to 2 s. Since the continuous yawn duration now reaches (in this embodiment, exceeds) the predetermined yawn duration (e.g. 1 s), the yawn count is updated from 0 to 1, indicating that the person has yawned once, and the yawn start time and the continuous yawn duration are reset to 0. Detection then continues. If a wide-open mouth is next detected at the 100th frame, the current time, e.g. 10:10:16, is recorded and set as the yawn start time. If the mouth remains wide open from the 100th frame through the 140th frame, the recorded current time is continuously refreshed (from the time of the 101st frame up to the time of the 140th frame), e.g. to 10:10:18, and the continuous yawn duration is updated to 2 s. Since this again reaches (in this embodiment, exceeds) the predetermined yawn duration (e.g. 1 s), the yawn count is updated from 1 to 2, indicating that the person has now yawned twice, and the yawn start time and the continuous yawn duration are again reset to 0, and so on. If, within the yawn setting period (e.g. 30 s, 40 s, 50 s) starting from the yawn start time of the first yawn, the detected yawn count of 4 (or 5, or 6) exceeds the predetermined yawn count of 3, the person is determined to be in a yawn fatigue state. The alarm unit is then triggered to emit a sound or light alarm, and the relevant images or video are transmitted to an external device (e.g. a console). After the alarm, the yawn start time, the continuous yawn duration and the yawn count are reset to 0, and the next round of detection begins.
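The yawn bookkeeping above combines a duration test (one yawn = mouth held wide open for at least 1 s) with a count test (more than 3 yawns within the setting period). An illustrative sketch, with our own names and the thresholds of this embodiment:

```python
class YawnCounter:
    """Counts completed yawns (mouth wide open for at least `yawn_s` seconds)
    and flags fatigue when more than `max_yawns` occur within `window_s` of
    the first yawn. A sketch of the bookkeeping above, not the patent's code."""

    def __init__(self, yawn_s=1.0, max_yawns=3, window_s=30.0):
        self.yawn_s, self.max_yawns, self.window_s = yawn_s, max_yawns, window_s
        self.open_start = None     # when the current mouth-open run began
        self.first_yawn = None     # start time of the first counted yawn
        self.count = 0

    def update(self, mouth_open, now):
        """Feed one frame's detection result; return True when fatigue is flagged."""
        if mouth_open:
            if self.open_start is None:
                self.open_start = now
            if now - self.open_start >= self.yawn_s:
                self.count += 1                    # one yawn completed
                if self.first_yawn is None:
                    self.first_yawn = self.open_start
                self.open_start = None             # duration reset after counting
        else:
            self.open_start = None
        if (self.first_yawn is not None
                and now - self.first_yawn <= self.window_s
                and self.count > self.max_yawns):
            self.count, self.first_yawn = 0, None  # alarm: reset for next round
            return True
        return False
```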
As an example of smoking violation detection, during video-stream detection, when a cigarette is detected and is detected close to the mouth for the first time, the smoking count is set to 1. During subsequent continuous detection, the counter variable is incremented each time this state is detected again. In the present invention, the maximum smoking count (i.e. the predetermined smoking count) is set to 3; as those skilled in the art will appreciate, other counts such as 4 or 5 may also be set as the predetermined smoking count.
As an example, if no cigarette is detected in frames 1-10, the smoking count is set to 0. If a cigarette is detected close to the mouth in the 11th frame and moves away from the mouth by the 20th frame, the smoking count is incremented by 1. If the cigarette is again detected close to the mouth at the 50th frame and moves away by the 60th frame, the smoking count is incremented again, to 2, and so on. If the smoking count increases to 3 (or 4, 5, etc.) within the smoking setting period (e.g. 10 s, 20 s, 60 s, 90 s, 120 s, etc.), the person is determined to be in a smoking violation state; the alarm unit is then triggered to emit a sound or light alarm, and the relevant images or video are transmitted to an external device (e.g. a console). After the alarm, the smoking count is reset to 0, and the next round of detection begins.
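The smoking count logic above can be sketched as follows. Counting one event per approach of the cigarette to the mouth (an edge detection on the per-frame state) is our simplification, and all names and defaults are illustrative:

```python
class SmokingCounter:
    """Counts cigarette-to-mouth events and flags a violation once the count
    reaches `max_puffs` within `window_s` of the first event. Illustrative
    sketch; the 3-event threshold follows the example above."""

    def __init__(self, max_puffs=3, window_s=60.0):
        self.max_puffs, self.window_s = max_puffs, window_s
        self.near_mouth = False   # previous frame's state, to count once per approach
        self.first = None         # time of the first counted event
        self.count = 0

    def update(self, cigarette_near_mouth, now):
        """Feed one frame's detection result; return True when a violation fires."""
        if cigarette_near_mouth and not self.near_mouth:
            self.count += 1                        # new approach detected
            if self.first is None:
                self.first = now
        self.near_mouth = cigarette_near_mouth
        if (self.first is not None and now - self.first <= self.window_s
                and self.count >= self.max_puffs):
            self.count, self.first = 0, None       # alarm: reset for next round
            return True
        return False
```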
As an example of phone-call violation detection, during video-stream detection, when a phone is detected near the mouth for the first time, the current time (e.g. 10:10:10) and/or the number of the current frame is recorded. During subsequent continuous detection, a counter variable is accumulated as long as this state persists; if the state is not detected for several consecutive frames, or in the immediately following frame, the count is interrupted. The accumulated value over this interval (unit: frames), or the difference between the recorded start time and end time (unit: seconds (s)), is the continuous duration of the phone-call state. In the present invention, the maximum continuous call time (i.e. the predetermined call duration) is set to 5 s; as those skilled in the art will appreciate, other durations such as 10 s may also be set as the maximum continuous call time.
As an example, if no phone is detected near the mouth in frames 1-10, the call start time and the continuous call duration are both set to 0. If a phone is first detected near the mouth in the 11th frame, the current time, e.g. 10:10:10, is recorded and set as the call start time. If the phone is continuously detected near the mouth through the 20th frame, the current time is continuously updated up to the time of the 20th frame, e.g. 10:10:11; the continuous call duration is then 1 s, which has not reached the predetermined call duration, so it cannot yet be determined that the person is in a call violation state. If the phone is detected leaving the mouth in the 21st frame, the person is determined not to be on a call and the possibility of a violation is excluded; the call start time and the continuous call duration are then reset to 0. Alternatively, if the phone is continuously detected near the mouth from the 11th frame through the 20th frame and on through the 60th frame, the recorded current time is continuously refreshed (from the time of the 12th frame up to the time of the 60th frame), e.g. to 10:10:15, and the continuous call duration is updated to 5 s. Since the continuous call duration now reaches the predetermined call duration (e.g. 5 s), the person is determined to be in a call violation state; the alarm unit is triggered to emit a sound or light alarm, and the relevant images or video are transmitted to an external device (e.g. a console). After the alarm, the call start time and the continuous call duration are reset to 0, and the next round of detection begins.
The above embodiments of the present invention are merely exemplary. The selection of video frames may be periodic or aperiodic, without restriction here. For example, one frame may be captured every 10 ms or every 0.5 s; or the first 100 frames may be captured at 10 ms intervals and the following 100 frames at 5 ms intervals. For instance, the 1st frame may be selected at 10:10:10, the 10th frame at 10:10:11, and the 100th frame at 10:10:15. In addition, the above examples record times in order to judge durations, counts, etc.; those skilled in the art may instead judge durations, counts, etc. by recording the numbers of the current frames, and this is not a limitation of the present invention.
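When durations are judged from frame numbers rather than recorded timestamps, a constant frame rate must be assumed; the conversion is then a one-liner (an illustrative helper, not part of the patent):

```python
def duration_from_frames(start_frame, end_frame, fps):
    """Continuous-state duration in seconds computed from frame indices.
    Assumes a constant frame rate; with irregular sampling, as in the
    mixed-interval example above, store timestamps instead."""
    return (end_frame - start_frame) / fps

# e.g. a state held from frame 11 to frame 36 at 25 fps lasts 1 s
assert duration_from_frames(11, 36, 25) == 1.0
```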
As an example, the data storage unit is mainly used to store the video and image files obtained by the image acquisition unit, and supports functions such as video playback and data backup when needed. The data storage unit can connect to third-party and major-brand cameras supporting the ONVIF, PSIA and RTSP protocols; supports network protocols such as IPv4, IPv6, HTTP, UPnP, NTP, SADP, SNMP, PPPoE, DNS, FTP, ONVIF and PSIA; supports access of up to 64 network video channels; supports access of surveillance cameras at a variety of mainstream resolutions; supports local playback and querying of images; supports sending the detection results of the detection unit to the alarm unit; and supports sending the detection results of the detection unit to other platforms that require integration.
As an example, as shown in Fig. 2, after receiving the image and video data transmitted by the image acquisition unit, the face detection unit performs feature extraction according to a pre-stored model and network, compares the extracted features with the personnel features recorded in the database, and confirms the identity of the person on duty. After the identity is confirmed, it is further compared against the duty roster to confirm that the operator at the current post matches the roster. If the identity cannot be verified or does not match the roster, the exception information and the image/video material are saved to the data storage unit, and an early warning is sent to the back end via the transmission unit. The attendance unit described here comprises two parts, algorithm and computing hardware; the hardware includes but is not limited to general-purpose GPUs, embedded GPUs, CPUs, dedicated artificial-intelligence chips, etc.
The face detection unit performs the following steps, using MTCNN, a face detection algorithm of currently superior performance. The MTCNN of the present invention consists of a first-stage network and a second-stage network. The first-stage network can be regarded as a random forest in which every tree has the same model. All input pictures are reduced in size, and P_Net is used to detect all face-like regions in the image. As a result, roughly 70% of the image content unrelated to faces can be discarded. While the first-stage network thus reduces the image by about 70% and decreases the number of candidate face regions to be examined, no face region is missed, so the algorithm complexity can be significantly reduced. The value "70%" is configured according to the specific scene, including the maximum number of people that may appear, the number of pixels a face occupies, and so on. The second-stage network then uses R_Net to re-confirm the faces detected by the first-stage network, yielding images that contain only faces; this likewise greatly reduces the false-alarm rate. Unlike the existing MTCNN algorithm, the improved MTCNN of the present invention does not need a third-stage network, i.e. no further refinement of the detected faces is needed.
As an example, when the first-stage network detects an image, the image needs to be scaled to different degrees, and each scaled image is detected with P_Net. The existing MTCNN requires around 10 scale values, because the image under detection may contain many people; in the present invention the human body has already been localized, so only 4, 3 or even fewer scale values are needed. The number of candidate face regions extracted by the first-stage network is thus only about one fifth of that of the existing MTCNN, and the MTCNN of the present invention additionally omits the third-stage network. The improved MTCNN algorithm of the present invention therefore runs nearly 8 times faster while maintaining performance, and is suitable for most embedded devices. For example, on an Allwinner A64 chip, the runtime for a single frame improves from about 200 ms to 20-30 ms.
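The reduction in scale values can be reproduced with the standard MTCNN image-pyramid bookkeeping (the 0.709 scale factor and the 12-pixel P_Net window are conventional MTCNN values, assumed here; the patent does not state them): raising the minimum face size that must be detected, which is legitimate once the person has already been localized, shrinks the pyramid from about 10 levels toward 4.

```python
def pyramid_scales(img_min_side, min_face=20, factor=0.709, cell=12):
    """Scale values at which P_Net is run: resize so a `min_face`-pixel face
    fills the 12x12 P_Net window, then shrink by `factor` until the image's
    smaller side drops below the window. Standard MTCNN bookkeeping; the
    exact parameter values are assumptions, not from the patent."""
    scales, s = [], cell / min_face
    while img_min_side * s >= cell:
        scales.append(round(s, 4))
        s *= factor
    return scales
```

For a 500-pixel image, `min_face=20` yields 10 scales, while `min_face=160` (a face known to be large in the crop) yields only 4, matching the roughly fivefold reduction in candidate regions described above.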
Further, the first-stage network uses a small 4-layer convolutional neural network with an input image size of 12 × 12 × 3. The second-stage network uses a small 4-layer convolutional neural network with an input image size of 24 × 24 × 3. As an example, the training method and parameters of the convolutional neural networks are as follows:
The convolution kernels of the convolutional layers and the fully connected layers are initialized with Gaussian-distributed random numbers with mean 0 and standard deviation 0.1.
The weights are trained with a stochastic gradient descent algorithm, with the batch size set to 128; during training, back-propagation is performed on only 70% of the samples in each batch, selected by their loss values.
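The per-batch loss selection can be sketched as below. We assume the selection keeps the hardest (largest-loss) 70% of samples, as in standard MTCNN online hard example mining; the function name and the example losses are illustrative:

```python
import numpy as np

def select_hard_samples(losses, keep_ratio=0.7):
    """Return indices of the `keep_ratio` fraction of samples with the
    largest losses; only these are back-propagated. Assumes the standard
    MTCNN hard-example-mining convention, not the patent's exact rule."""
    k = int(len(losses) * keep_ratio)
    return np.argsort(losses)[::-1][:k]   # indices sorted by descending loss
```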
The training samples comprise 15,000 frontal face images, 50,000 images without faces, and 50,000 partial-face images; the ratio of positive, negative and partial samples in the training set is 3:1:1. The frontal face images come from multiple open-source face image sets and from image sets with face-localization coordinates, and the partial-face images are largely obtained by cropping the frontal face images. All data are cropped to the two sizes 12 × 12 × 3 and 24 × 24 × 3, respectively.
The data are divided into a training set, a validation set and a test set in a certain proportion; after 10 epochs of training, a validation-set test is performed every epoch. Those skilled in the art can set the number of iterations and the proportions of the training, validation and test sets according to known practice.
The total number of training epochs over the full data set is set to 1000; more epochs (1200, 2000) or fewer (500, 800) are also possible.
It will be understood by those skilled in the art that the figures appearing in the above training method and parameters are not limiting. Those skilled in the art may use different image sizes, training samples and iteration periods depending on the application scenario, so that both computation speed and accuracy are optimized.
The simplified face detection algorithm thus runs nearly 8 times faster while maintaining performance, and is suitable for most embedded devices.
As an example, the alarm unit includes an on-site mode and a background mode. After receiving the warning information from the detection unit, it issues an early warning at the site where the image acquisition unit is arranged, through different sensors such as sound, light and electricity, to alert the surroundings to the situation. Alternatively, the alarm unit can send a live signal to the early-warning platform of the emergency management department with which the system is integrated; the information content includes the camera number, the camera position, the time of the alert, the type of alert, live images/video, etc. This helps the emergency management department judge and decide quickly, shortening the response time.
Fig. 5 is a schematic flow diagram of the method of the present invention for detecting human behavior. In step S501, the eye state, mouth state, cigarette state and phone state of the person are acquired. Then, in step S502, it is simultaneously detected whether the eye state matches a closed-eye state, whether the mouth state matches a yawn state, whether the cigarette state matches a smoking state, and whether the phone state matches a call state, and it is further judged whether the person is in a closed-eye fatigue state, a yawn fatigue state, a smoking violation state or a call violation state. Finally, in step S503, if any of the above matches a fatigue or violation state, an alarm is issued. Preferably, the acquired video or pictures showing that the person is in a fatigue or violation state can be sent to an external device, such as a central control room or a security room.
The present invention provides an end-to-end, neural-network-based mode of human behavior detection, solving problems such as the high cost of traditional detection methods and the redundant, complex computation of traditional image processing: all judgment results are output uniformly by a single network in one pass. Compared with traditional image detection means, the present invention uses deep learning and requires no image-enhancement preprocessing; it adapts to various extreme environments such as uneven illumination, diversified target features and complex backgrounds, and supports incremental training for specific scenes. In actual use, training samples are calibrated through appropriate periodic manual intervention, raising the accuracy in dedicated scenes.
Each convolutional layer may be followed by a BN layer and a LeakyReLU layer, and residual layers are introduced to solve the degradation problem caused by network depth; the training method and parameters are likewise techniques and parameters shown by a large number of experimental results to be preferable. Application: the application of convolution operations to human behavior detection, solving the detection problem directly end to end and simplifying the complex, redundant detection means of traditional systems.
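The conv → BN → LeakyReLU pattern with a residual shortcut can be sketched in NumPy. This is a reduced illustration (1*1 convolutions only, inference-style batch normalization), not the patent's implementation; the 0.1-standard-deviation Gaussian weight initialization matches the training description above:

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # per-channel normalization over the batch and spatial axes (channels last)
    mu = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def conv1x1(x, w):
    # 1x1 convolution = per-pixel matmul over channels; x: (N,H,W,Cin), w: (Cin,Cout)
    return x @ w

def residual_block(x, w1, w2):
    """conv -> BN -> LeakyReLU twice, plus an identity shortcut: the
    combination-layer pattern described above, reduced to 1x1 convolutions
    to stay short (the real layers use 1x1 and 3x3 kernels)."""
    y = leaky_relu(batch_norm(conv1x1(x, w1)))
    y = leaky_relu(batch_norm(conv1x1(y, w2)))
    return x + y   # residual addition requires matching shapes
```

The channel bottleneck (e.g. 16 → 8 → 16) mirrors the 1*1-then-3*3 structure of the combination layers in the claims below.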
The detection device and system provided by the present invention have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present invention; the above embodiments are intended only to help understand the method of the present invention and its core ideas. It should be pointed out that, for those skilled in the art, several improvements and modifications can be made to the present invention without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims (10)

1. A device for detecting the behavior of personnel, characterized in that it is configured to process video and/or images containing personnel and to simultaneously detect, in each frame of image, the eye state, mouth state, cigarette state and phone state of the personnel, wherein the detection device performs the above detection using an algorithm based on a convolutional neural network, the convolutional neural network comprising at least one convolutional layer, at least one residual layer, a global pooling layer and a fully connected layer, wherein each convolutional layer is followed by a BN layer and a LeakyReLU layer.
2. The device according to claim 1, characterized in that the detection device is configured to:
detect whether the eye state matches a closed-eye state; if the eye state is detected not to match the closed-eye state, set both the closed-eye start time and the continuous closed-eye duration to 0; if the eye state is detected to match the closed-eye state for the first time, set the current time as the closed-eye start time; if the eye state is detected to match the closed-eye state in a previous frame image and in a next frame image consecutive with the previous frame image, set the duration between the current time of the next frame image and the closed-eye start time as the continuous closed-eye duration;
detect whether the mouth state matches a yawn state; if the mouth state is detected not to match the yawn state, set the yawn start time, the continuous yawn duration and the yawn count all to 0; if the mouth state is detected to match the yawn state for the first time, set the current time as the yawn start time; if the mouth state is detected to match the yawn state in a previous frame image and in a next frame image consecutive with the previous frame image, set the duration between the current time of the next frame image and the yawn start time as the continuous yawn duration;
detect whether the cigarette state matches a smoking state; if the cigarette state is detected not to match the smoking state, initialize the smoking count to 0; if the cigarette state is detected to match the smoking state, increment the smoking count by 1, and reset the smoking count to 0 after the subsequent smoking setting period;
detect whether the phone state matches a call state; if the phone state is detected not to match the call state, set both the call start time and the continuous call duration to 0; if the phone state is detected to match the call state for the first time, record the current time as the call start time; if the phone state is detected to match the call state in a previous frame image and in a next frame image consecutive with the previous frame image, set the duration between the current time of the next frame image and the call start time as the continuous call duration.
3. The device according to claim 1, characterized in that the convolutional neural network is constructed to comprise the following layers connected in sequence:
a 1st convolutional layer, into which the image is directly input,
a 2nd convolutional layer,
one 3rd combination layer, comprising a 31st convolutional layer, a 32nd convolutional layer and a residual layer,
a 4th convolutional layer,
two 5th combination layers, each comprising a 51st convolutional layer, a 52nd convolutional layer and a residual layer,
a 6th convolutional layer,
four 7th combination layers, each comprising a 71st convolutional layer, a 72nd convolutional layer and a residual layer,
an 8th convolutional layer,
four 9th combination layers, each comprising a 91st convolutional layer, a 92nd convolutional layer and a residual layer,
a 10th convolutional layer,
two 11th combination layers, each comprising a 111th convolutional layer, a 112th convolutional layer and a residual layer,
a global pooling layer, and
a fully connected layer.
4. The device according to claim 1, characterized in that the convolutional neural network is constructed to comprise the following layers connected in sequence:
a 1st convolutional layer, into which the image is directly input,
a 2nd convolutional layer,
one 3rd combination layer, comprising a 31st convolutional layer, a 32nd convolutional layer and a residual layer,
a 4th convolutional layer,
two 5th combination layers, each comprising a 51st convolutional layer, a 52nd convolutional layer and a residual layer,
a 6th convolutional layer,
an 8th convolutional layer,
four 9th combination layers, each comprising a 91st convolutional layer, a 92nd convolutional layer and a residual layer,
a 10th convolutional layer,
two 11th combination layers, each comprising a 111th convolutional layer, a 112th convolutional layer and a residual layer,
a 12th convolutional layer,
one 13th combination layer, comprising a 131st convolutional layer, a 132nd convolutional layer and a residual layer,
a global pooling layer, and
a fully connected layer.
5. The device according to claim 3 or 4, characterized in that:
the convolution kernel size of the 1st convolutional layer is 3*3 with 32 kernels, and the output image size is 256*256;
the convolution kernel size of the 2nd convolutional layer is 3*3 with stride 2 and 64 kernels, and the output image size is 128*128;
the convolution kernel size of the 31st convolutional layer is 1*1 with 32 kernels, the convolution kernel size of the 32nd convolutional layer is 3*3 with 64 kernels, and the output image size of the 3rd combination layer is 128*128;
the convolution kernel size of the 4th convolutional layer is 3*3 with stride 2 and 128 kernels, and the output image size is 64*64;
the convolution kernel size of the 51st convolutional layer is 1*1 with 64 kernels, the convolution kernel size of the 52nd convolutional layer is 3*3 with 128 kernels, and the output image size of the 5th combination layer is 64*64;
the convolution kernel size of the 6th convolutional layer is 3*3 with stride 2 and 256 kernels, and the output image size is 32*32;
the convolution kernel size of the 71st convolutional layer is 1*1 with 128 kernels, the convolution kernel size of the 72nd convolutional layer is 3*3 with 256 kernels, and the output image size of the 7th combination layer is 32*32;
the convolution kernel size of the 8th convolutional layer is 3*3 with stride 2 and 512 kernels, and the output image size is 16*16;
the convolution kernel size of the 91st convolutional layer is 1*1 with 256 kernels, the convolution kernel size of the 92nd convolutional layer is 3*3 with 512 kernels, and the output image size of the 9th combination layer is 16*16;
the convolution kernel size of the 10th convolutional layer is 3*3 with stride 2 and 1024 kernels, and the output image size is 8*8;
the convolution kernel size of the 111th convolutional layer is 1*1 with 512 kernels, the convolution kernel size of the 112th convolutional layer is 3*3 with 1024 kernels, and the output image size of the 11th combination layer is 8*8;
the convolution kernel size of the 12th convolutional layer is 3*3 with stride 2 and 1024 kernels, and the output image size is 8*8;
the convolution kernel size of the 131st convolutional layer is 1*1 with 512 kernels, the convolution kernel size of the 132nd convolutional layer is 3*3 with 1024 kernels, and the output image size of the 13th combination layer is 8*8.
6. A system for detecting human behavior, characterized by comprising:
an image acquisition unit, used to obtain video/images of the personnel in action, so as to simultaneously obtain multiple frames of images of each of the eye state, mouth state, cigarette state and phone state of the personnel;
a behavior detection unit, which is a detection device according to any one of the preceding claims;
a judging unit, configured to judge whether the personnel are in a closed-eye fatigue state, a yawn fatigue state, a smoking violation state or a call violation state; and
a data storage unit, used to save the video and/or images acquired by the image acquisition unit and the detection results of the behavior detection unit.
7. The system according to claim 6, characterized in that the judging unit is configured to simultaneously perform the following closed-eye fatigue state judgment, yawn fatigue state judgment, smoking violation state judgment and call violation state judgment steps:
in the closed-eye fatigue state judgment, judging whether the continuous closed-eye duration reaches the predetermined closed-eye duration; if so, the person is in a closed-eye fatigue state;
in the yawn fatigue state judgment, judging whether the continuous yawn duration reaches the predetermined yawn duration; if it does, the yawn count is incremented by 1, and if the yawn count reaches the predetermined yawn count within the yawn setting period, the person is in a yawn fatigue state;
in the smoking violation state judgment, judging whether the smoking count reaches the predetermined smoking count within the smoking setting period; if so, the person is in a smoking violation state;
in the call violation state judgment, judging whether the continuous call duration reaches the predetermined call duration; if so, the person is in a call violation state.
8. The system according to claim 7, characterized in that the closed-eye fatigue state refers to the state of the eyes being closed, the yawn fatigue state refers to the mouth being wide open, the smoking violation state refers to a cigarette being close to the mouth, and the call violation state refers to a mobile phone being held near the face;
the predetermined closed-eye duration is set to at least 3 seconds,
the predetermined yawn duration is set to at least 1 second,
the predetermined yawn count is set to at least 3,
the yawn setting period is at least 30 seconds,
the predetermined smoking count is set to at least 3,
the smoking setting period is at least 10 seconds, and
the predetermined call duration is set to at least 5 seconds.
9. The system according to claim 6, characterized by further comprising a face detection unit configured to determine the identity of the personnel.
10. The system according to claim 6, characterized by further comprising an alarm unit which, when the detection result of the behavior detection unit requires an early warning, issues an early warning at the site through a sound, light or electric sensor, or sends out warning information.
CN201910561189.4A 2019-06-26 2019-06-26 The device and system that the behavior of personnel is detected Pending CN110334627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910561189.4A CN110334627A (en) 2019-06-26 2019-06-26 The device and system that the behavior of personnel is detected

Publications (1)

Publication Number Publication Date
CN110334627A true CN110334627A (en) 2019-10-15

Family

ID=68142393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910561189.4A Pending CN110334627A (en) 2019-06-26 2019-06-26 The device and system that the behavior of personnel is detected

Country Status (1)

Country Link
CN (1) CN110334627A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027390A (en) * 2019-11-11 2020-04-17 北京三快在线科技有限公司 Object class detection method and device, electronic equipment and storage medium
CN111027390B (en) * 2019-11-11 2023-10-10 北京三快在线科技有限公司 Object class detection method and device, electronic equipment and storage medium
CN112836556A (en) * 2019-11-25 2021-05-25 西安诺瓦星云科技股份有限公司 Behavior recognition device and method and behavior recognition system based on display screen
CN111611966A (en) * 2020-05-29 2020-09-01 北京每日优鲜电子商务有限公司 Target person detection method, device, equipment and storage medium
CN111970489A (en) * 2020-08-05 2020-11-20 北京必可测科技股份有限公司 Intelligent monitoring management method and system based on man-machine two-way
CN113194293A (en) * 2021-04-30 2021-07-30 重庆天智慧启科技有限公司 Personnel monitoring system
CN113194293B (en) * 2021-04-30 2022-09-30 重庆天智慧启科技有限公司 Personnel monitoring system
CN113392800A (en) * 2021-06-30 2021-09-14 浙江商汤科技开发有限公司 Behavior detection method and device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657236A (en) * 2017-09-29 2018-02-02 厦门知晓物联技术服务有限公司 Vehicle security drive method for early warning and vehicle-mounted early warning system
KR20180066610A (en) * 2016-12-09 2018-06-19 동국대학교 산학협력단 Device and method for classifying open and close eyes based on convolutional neural network
CN108491858A (en) * 2018-02-11 2018-09-04 南京邮电大学 Method for detecting fatigue driving based on convolutional neural networks and system
CN108596087A (en) * 2018-04-23 2018-09-28 合肥湛达智能科技有限公司 A kind of driving fatigue degree detecting regression model based on dual network result
CN108764034A (en) * 2018-04-18 2018-11-06 浙江零跑科技有限公司 A kind of driving behavior method for early warning of diverting attention based on driver's cabin near infrared camera
CN108960065A (en) * 2018-06-01 2018-12-07 浙江零跑科技有限公司 A kind of driving behavior detection method of view-based access control model


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Joseph Redmon et al.: "YOLOv3: An Incremental Improvement", arXiv, pages 1-6 *
Yingyu Ji et al.: "Fatigue State Detection Based on Multi-Index Fusion and State Recognition Network", IEEE Access, pages 64136-64147 *
Liu Xiaoshuang; Fang Zhijun; Liu Xiang; Gao Yongbin; Zhang Xiangxiang: "Fatigue detection system based on eye and mouth states", Transducer and Microsystem Technologies, vol. 37, no. 10, pages 108-110 *
Liu Weihuang; Qian Jinhao; Yao Zengwei; Jiao Xintao; Pan Jiahui: "Driver fatigue detection algorithm based on multi-facial-feature fusion", Computer Systems & Applications, vol. 27, no. 10, pages 177-182 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027390A (en) * 2019-11-11 2020-04-17 北京三快在线科技有限公司 Object class detection method and device, electronic equipment and storage medium
CN111027390B (en) * 2019-11-11 2023-10-10 北京三快在线科技有限公司 Object class detection method and device, electronic equipment and storage medium
CN112836556A (en) * 2019-11-25 2021-05-25 西安诺瓦星云科技股份有限公司 Behavior recognition device and method and behavior recognition system based on display screen
CN111611966A (en) * 2020-05-29 2020-09-01 北京每日优鲜电子商务有限公司 Target person detection method, device, equipment and storage medium
CN111970489A (en) * 2020-08-05 2020-11-20 北京必可测科技股份有限公司 Intelligent monitoring and management method and system based on two-way human-machine interaction
CN113194293A (en) * 2021-04-30 2021-07-30 重庆天智慧启科技有限公司 Personnel monitoring system
CN113194293B (en) * 2021-04-30 2022-09-30 重庆天智慧启科技有限公司 Personnel monitoring system
CN113392800A (en) * 2021-06-30 2021-09-14 浙江商汤科技开发有限公司 Behavior detection method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110334627A (en) Device and system for detecting personnel behavior
CN109769099B (en) Method and device for detecting abnormality of call person
CN106951867B (en) Face identification method, device, system and equipment based on convolutional neural networks
CN109886209A (en) Abnormality detection method and device, and vehicle-mounted equipment
CN110674790B (en) Abnormal scene processing method and system in video monitoring
CN109948447B (en) Character network relation discovery and evolution presentation method based on video image recognition
CN109697416A (en) Video data processing method and related apparatus
CN101753992A (en) Multi-mode intelligent monitoring system and method
CN110309760A (en) Method for detecting the driving behavior of a driver
CN107220633A (en) Intelligent mobile law-enforcement system and method
CN108534832A (en) Civil engineering construction monitoring system
CN110135476A (en) Personal safety equipment detection method, apparatus, device and system
CN110321780A (en) Abnormal fall behavior detection method based on spatiotemporal motion characteristics
CN108109331A (en) Monitoring method and monitoring system
CN113411542A (en) Intelligent working condition monitoring equipment
CN112541393A (en) Transformer substation personnel detection method and device based on deep learning
CN110442742A (en) Image retrieval method and device, processor, electronic equipment and storage medium
CN109063977A (en) Frictionless transaction risk monitoring method and device
CN106973039A (en) Network security situation awareness model training method and device based on information fusion technology
CN110263680A (en) Image processing method, device and system and storage medium
CN110427824A (en) Automatic security protection testing method and system for artificial-intelligence virtual scenes
CN111325133A (en) Image processing system based on artificial intelligence recognition
CN107495971A (en) Skeleton-recognition-based medical alarm system for sudden illness and its detection method
CN108776452A (en) Special equipment on-site maintenance monitoring method and system
CN110807117A (en) User relationship prediction method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination