CN109063581A - Enhanced face detection and face tracking method and system for resource-limited embedded vision systems - Google Patents
- Publication number: CN109063581A (application number CN201810747849.3A)
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00221—Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
- G06K9/00268—Feature extraction; Face representation
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computer systems based on biological models
- G06N3/02—Computer systems based on biological models using neural network models
- G06N3/04—Architectures, e.g. interconnection topology
- G06N3/0454—Architectures, e.g. interconnection topology using a combination of multiple neural nets
The present application relates generally to the fields of machine learning and artificial intelligence, and in particular to systems, devices and techniques for performing real-time face detection, face tracking and duplicate-face detection on digital images captured by a resource-limited embedded vision system.
Deep learning (DL) is a branch of machine learning based on a set of algorithms inspired by artificial neural networks, which attempt to model high-level abstractions in data by using artificial neural networks with many processing layers. A typical DL architecture can include many layers of neurons and millions of parameters. These parameters can be trained with massive amounts of data on high-speed computers equipped with GPUs, guided by training techniques that remain effective in deep networks, such as rectified linear units (ReLU), dropout, dataset augmentation and stochastic gradient descent (SGD).
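Of the training components just listed, the rectified linear unit is the simplest to make concrete. Below is a minimal sketch in plain Python; the function names are ours, not from the patent:

```python
def relu(x):
    """Rectified linear unit: passes positive inputs through, zeroes out negatives."""
    return x if x > 0.0 else 0.0

def relu_layer(activations):
    """Apply ReLU element-wise, as a deep network does between processing layers."""
    return [relu(a) for a in activations]

print(relu_layer([-2.0, -0.5, 0.0, 1.5, 3.0]))  # [0.0, 0.0, 0.0, 1.5, 3.0]
```

In a real CNN the same non-linearity is applied to every element of a feature map after each convolution.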
Among existing DL architectures, the convolutional neural network (CNN) is one of the most popular. Although the ideas behind CNNs were discovered more than twenty years ago, the true power of CNNs has only been realized after recent developments in deep learning theory. To date, CNNs have achieved immense success in many artificial intelligence and machine learning applications, such as face recognition, image classification, image captioning, visual question answering and self-driving cars.
Face detection, i.e., detecting and locating the position of every face in an image, is usually the first step in many face recognition applications. A modern face detection system generally includes two main modules: a face detection module and a face tracking module. The face detection module typically employs a DL architecture (such as a CNN) to detect faces in digital images. Once the face detection module detects a new face (i.e., a new person) in an image frame of a video, the face tracking module tracks that person through each subsequent image frame of the video in order to find/re-identify the person. For some low-complexity embedded devices, the face tracking module can be implemented with simple tracking techniques based on, for example, a Kalman filter and the Hungarian algorithm.
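A low-complexity tracker of the kind described can be approximated by greedy intersection-over-union (IoU) matching between existing tracks and new detections. The sketch below is a simplified stand-in for a Kalman-filter/Hungarian-algorithm pipeline, under our own assumptions about box layout and the 0.3 matching threshold:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter else 0.0

def match_tracks(tracks, detections, iou_threshold=0.3):
    """Greedily pair each existing track (id -> box) with its best-overlapping
    new detection; unmatched tracks are candidates for 'disappeared' faces."""
    pairs, used = [], set()
    for t_id, t_box in tracks.items():
        best, best_iou = None, iou_threshold
        for d_idx, d_box in enumerate(detections):
            if d_idx in used:
                continue
            score = iou(t_box, d_box)
            if score > best_iou:
                best, best_iou = d_idx, score
        if best is not None:
            used.add(best)
            pairs.append((t_id, best))
    return pairs
```

The Hungarian algorithm would instead find the globally optimal assignment; the greedy version trades that optimality for very low cost, which is the point on a constrained embedded device.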
In many face detection applications, because each person's head/face can have a different orientation, i.e., a different pose in different images, face pose estimation must be performed. In addition, to avoid repeatedly transmitting and storing facial images of the same person, the pose changes of each face need to be tracked so that only the facial image corresponding to the "best pose" is transmitted, e.g., the image of each detected face closest to a frontal view (having the smallest rotation angles). Various techniques can be used to estimate a person's head/face pose. One technique first estimates the positions of certain facial landmarks, such as the eyes, nose and mouth, and then estimates the face pose based on those landmark positions. Another technique represents the head pose with three Euler angles (pitch, yaw and roll) and directly estimates the face/head pose in terms of these three angles. When a tracked person/face is lost by the face tracking module, the corresponding tracked face disappears; the facial image with the best pose is then transmitted and stored for future use. Using a CNN-based DL architecture, face detection and face pose estimation can be performed jointly.
Many face detection techniques can easily detect near-frontal faces at short range. However, achieving robust and fast face detection in unconstrained conditions remains very difficult, because such conditions are usually associated with large variations among faces, including pose changes, occlusions, exaggerated expressions and extreme illumination changes. Effective CNN-based face detection techniques that can handle these unconstrained conditions include: (1) the cascaded convolutional neural network structure described in "A Convolutional Neural Network Cascade for Face Detection" (Li et al., Proc. IEEE Conf. on Computer Vision and Pattern Recognition, June 1, 2015) (hereinafter "cascaded CNN" or "cascaded CNN architecture"); and (2) the multitask cascaded CNN structure described in "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks" (Zhang et al., IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, October 2016) (hereinafter "MTCNN" or "MTCNN architecture").
However, due to the high complexity of the MTCNN architecture and the limited computational resources typical of embedded systems, achieving satisfactory real-time detection results with MTCNN-based face detection on an embedded system remains very challenging. In addition, the simple face tracking techniques used in embedded systems often cause many approximately duplicate faces to be tracked and transmitted, wasting computational resources and network bandwidth.
Summary of the invention
Various embodiments described herein provide examples of real-time face detection, face tracking and face pose selection subsystems in embedded vision systems. In one aspect, this application discloses a process for identifying approximately duplicate facial images and selectively transmitting best-pose facial images from such a system to a server for processing. The process includes the steps of: when it is determined that a tracked face has disappeared, determining the best-pose facial image of the tracked face; extracting an image feature from the best-pose facial image; computing a set of similarity values between the extracted image feature and each stored image feature in a set of stored image features in a feature buffer, wherein the stored image features were extracted from a set of previously transmitted best-pose facial images; determining whether any of the similarity values exceeds a predetermined threshold; and, if no similarity value exceeds the predetermined threshold, transmitting the best-pose facial image to the server and storing the extracted image feature in the feature buffer.
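The transmission-suppression steps above can be sketched in plain Python. The feature representation, the buffer handling and the 0.9 threshold here are illustrative assumptions, not values taken from the patent:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def maybe_transmit(feature, feature_buffer, threshold=0.9):
    """Transmit only if the new best-pose feature is not an approximate
    duplicate of any previously transmitted feature; on transmit, cache it."""
    if any(cosine_similarity(feature, stored) > threshold
           for stored in feature_buffer):
        return False                # near-duplicate: suppress transmission
    feature_buffer.append(feature)  # remember what was sent
    return True                     # send the best-pose image to the server
```

For example, after transmitting a feature `[1.0, 0.0, 0.0]`, a very similar feature such as `[0.99, 0.1, 0.0]` would be suppressed, while an orthogonal feature would be sent.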
In some embodiments, the tracked face is determined to have disappeared when it has been occluded by an object for a predetermined number of image frames.
In some embodiments, the image feature consists of one or more of the following: a histogram of oriented gradients (HOG) feature, a Haar-like feature, a scale-invariant feature transform (SIFT) feature, or a deep-learning-based face feature.
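As a rough illustration of the first listed feature type, the core of a HOG descriptor, a per-cell histogram of gradient orientations, can be sketched for a small grayscale patch. The bin count and the unsigned-orientation convention are our assumptions, and a real HOG descriptor additionally applies block normalization:

```python
import math

def hog_histogram(patch, bins=9):
    """Accumulate gradient magnitudes into unsigned-orientation bins (0-180 deg)
    for one cell of a grayscale patch (nested lists of intensities)."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]   # central difference in x
            gy = patch[y + 1][x] - patch[y - 1][x]   # central difference in y
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(ang / 180.0 * bins) % bins] += mag
    return hist
```

A full face descriptor would tile the image with such cells and concatenate the normalized histograms, which is what a classifier would then compare.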
In some embodiments, the stored image features and the extracted image feature are of the same type.
In some embodiments, the similarity value between the extracted image feature and a stored image feature in the feature buffer is computed as the cosine similarity between the extracted image feature and the stored image feature.
In some embodiments, if the image features are deep-learning (DL)-based face features, the comparison between the extracted image feature and the stored image features can be used to determine whether the tracked face is a duplicate face, i.e., whether it belongs to the same person as a stored face but with a different pose.
In some embodiments, if any computed similarity value exceeds the predetermined threshold, the best-pose facial image is prevented from being transmitted to the server.
In another aspect, a method for performing real-time face detection on grayscale input images captured by a camera is disclosed. The method includes the following steps: receiving a facial image training dataset; converting the color images in the facial image training dataset to grayscale training images; training a convolutional neural network (CNN) face detection module with the converted grayscale images; receiving a grayscale input image from the camera; and performing a face detection operation on the received grayscale input image using the CNN face detection module trained with the converted grayscale images.
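The color-to-grayscale conversion step in this method can be sketched with the common ITU-R BT.601 luma weights; the nested-list image layout is our assumption:

```python
def rgb_to_gray(image):
    """Convert an RGB image (rows of (r, g, b) tuples) to grayscale using
    the common ITU-R BT.601 luma weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in image]

def grayscale_training_set(color_images):
    """Regenerate a grayscale training set from a color face image dataset
    before training the CNN face detection module."""
    return [rgb_to_gray(img) for img in color_images]
```

Training on the converted set matches the statistics of the grayscale images the camera produces at inference time, which is the motivation stated in the method.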
In some embodiments, the facial image training dataset is a large-scale public training dataset composed mainly of color images.
In some embodiments, the grayscale input image is captured under monochromatic or grayscale lighting conditions.
In some embodiments, the grayscale lighting conditions include LED light illumination.
In some embodiments, the camera captures only grayscale images.
In some embodiments, the CNN face detection module includes a multitask cascaded CNN (MTCNN).
In yet another aspect, an embedded system that identifies approximately duplicate facial images and selectively transmits best-pose facial images to a server is disclosed. The embedded system includes: a processor; and a memory coupled to the processor. The memory stores instructions that, when executed by the processor, cause the system to: when it is determined that a tracked face has disappeared, determine the best-pose facial image of the tracked face; extract an image feature from the best-pose facial image; compute a set of similarity values between the extracted image feature and each stored image feature in a set of stored image features in a feature buffer, wherein the stored image features were extracted from a set of previously transmitted best-pose facial images; determine whether any of the similarity values exceeds a predetermined threshold; and, if no similarity value exceeds the predetermined threshold, transmit the best-pose facial image to the server and store the extracted image feature in the feature buffer.
In some embodiments, the embedded system is one of a surveillance camera system, a machine vision system, a drone system, a robotic system, a self-driving car, or a mobile device.
Brief description of the drawings
The structure and operation of the present application can be understood by reading the detailed description below together with the accompanying drawings, in which like reference numerals indicate like components, and in which:
Fig. 1 shows an exemplary embedded vision system including real-time face detection, face tracking and face pose estimation according to some embodiments of the present application;
Fig. 2 shows a block diagram of an exemplary implementation of the face detection and tracking subsystem in the embedded vision system of Fig. 1 according to some embodiments of the present application;
Fig. 3 shows a flowchart of an exemplary process for real-time face detection, face pose estimation and face tracking in the embedded vision system according to some embodiments of the present application;
Fig. 4 shows a flowchart for detecting whether a tracked person has disappeared from the video according to some embodiments of the present application;
Fig. 5 shows a captured video frame sequence and a corresponding subset of processed video frames according to some embodiments of the present application;
Fig. 6 shows a flowchart of an exemplary process for performing face detection and tracking on unprocessed video frames based on processed video frames according to some embodiments of the present application;
Fig. 7 shows a block diagram of an exemplary implementation of the face detection and tracking subsystem in the embedded vision system of Fig. 1 according to some embodiments of the present application;
Fig. 8 shows a flowchart of an exemplary process for improving face detection performance when processing grayscale input images according to some embodiments of the present application;
Fig. 9 shows a flowchart of an exemplary process for identifying approximately duplicate faces and selectively transmitting the best-pose face to a server according to some embodiments of the present application;
Fig. 10 shows a client-server network environment in which the embedded vision system is realized according to some embodiments of the present application.
The detailed description below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The accompanying drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein and may be practiced without these specific details. In some instances, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
Unless the context clearly indicates otherwise, the following terms used throughout the specification have the meanings defined herein. The terms "head pose", "face pose" and "pose" are used interchangeably to denote the specific orientation of a person's head in an image. The terms "tracked person", "person being tracked" and "face-tracked person" are used interchangeably to denote a person detected and tracked in the digital video images of the face detection and face tracking video system disclosed in this application.
Embedded Vision System
Fig. 1 shows an exemplary embedded vision system that, according to some of the embodiments described herein, can implement real-time face detection, face pose estimation and face tracking functions. The embedded vision system 100 can be integrated into or implemented as a surveillance camera system, a machine vision system, a drone system, a robotic system, a self-driving car system or a mobile device. As can be seen in Fig. 1, the embedded vision system 100 can include a bus 102, a processor 104, a memory 106, a storage device 108, a camera subsystem 110, a face detection and tracking subsystem 112, an output device interface 120 and a network interface 122. In some embodiments, the embedded vision system 100 is a low-cost embedded system.
The bus 102 collectively represents all system, peripheral and chipset buses that can connect the various components of the embedded vision system 100. For example, the bus 102 can communicatively connect the processor 104 with the memory 106, the storage device 108, the camera subsystem 110, the face detection and tracking subsystem 112, the output device interface 120 and the network interface 122.
The processor 104 retrieves instructions to execute and data to process from the memory 106 in order to control the various components of the embedded vision system 100. The processor 104 can include any type of processor, including but not limited to a microprocessor, a mainframe computer, a digital signal processor (DSP), a personal organizer, a device controller, a computational engine within an appliance, and any other processor now known or later developed. Further, the processor 104 can include one or more cores. The processor 104 itself can include a cache that stores code and data for execution by the processor 104.
The memory 106 can include any type of memory that can store code and data for execution by the processor 104. This includes but is not limited to dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, read-only memory (ROM), and any other type of memory now known or later developed.
The storage device 108 can include any type of non-volatile storage device that can be integrated with the embedded vision system 100. This includes but is not limited to magnetic, optical and magneto-optical storage devices, as well as storage devices based on flash memory and/or battery-backed memory.
The bus 102 is also connected to the camera subsystem 110. The camera subsystem 110 is configured to acquire still images and/or video images at a predetermined resolution, and to transmit the acquired image or video data over the bus 102 to various components in the embedded vision system 100, e.g., to the memory 106 for buffering, and to the face detection and tracking subsystem 112 for face detection, face pose estimation, face tracking and best-pose selection. The camera subsystem 110 can include one or more digital cameras. In some embodiments, the camera subsystem 110 includes one or more digital cameras equipped with wide-angle lenses. The images or videos acquired by the camera subsystem 110 can have different resolutions, including high resolutions such as 1280x720p, 1920x1080p or other high resolutions.
The face detection and tracking subsystem 112 further includes a face tracking module 114, a joint face detection and face pose estimation module 116, and a best-pose selection module 118. In some embodiments, the face detection and tracking subsystem 112 is configured to receive captured video images, such as high-resolution video images received over the bus 102, and to perform CNN-based face detection and face pose estimation on the received video images using the joint face detection and face pose estimation module 116, thereby detecting faces in each video image and generating a face pose estimate for each detected face. The face detection and tracking subsystem 112 is further configured to track each detected face over a sequence of video images using the face tracking module 114, and to determine, using the best-pose selection module 118, the best pose of each tracked face when the tracked face disappears from the video. The joint face detection and face pose estimation module 116 can be implemented in one or more hardware CNN modules. In general, the face detection and tracking subsystem 112 is configured to track multiple persons at the same time; ideally, it should be able to track as many persons simultaneously as possible. In some embodiments, the joint face detection and face pose estimation module 116 can use a coarse-to-fine multistage MTCNN architecture. If the embedded vision system 100 is a low-cost embedded system, the joint face detection and face pose estimation module 116 can be implemented in one or more low-cost hardware CNN modules, for example the embedded CNN module in the HiSilicon Hi3519 system on chip (SoC).
The output device interface 120 is also connected to the bus 102; the output device interface 120 can, for example, display the results generated by the face detection and tracking subsystem 112. Output devices used with the output device interface 120 include, for example, printers and display devices such as cathode ray tube (CRT) displays, light-emitting diode (LED) displays, liquid crystal displays (LCD), organic light-emitting diode (OLED) displays, plasma displays, or electronic paper.
Finally, as shown in Fig. 1, the bus 102 also connects the embedded vision system 100 to a network (not shown) through the network interface 122. In this way, the embedded vision system 100 can be part of a network, such as a local area network ("LAN"), a wide area network ("WAN"), an intranet, or a network of networks such as the Internet. In some embodiments, the face detection and tracking subsystem 112 transmits, through the network interface 122 and the network, the detected face having the best pose among multiple detected faces of a particular person to a control center or primary server. Any or all components of the embedded vision system 100 can be used in conjunction with the disclosed subject matter.
Face detection and tracking structure
Fig. 2 shows a block diagram of an exemplary implementation 200 of the face detection and tracking subsystem 112 in the embedded vision system 100 according to some embodiments described herein. As shown in Fig. 2, the face detection and tracking subsystem 200 receives a sequence of video images 202 from a captured video as its input, and produces as output a best-pose facial image 220 for each distinct detected face/person in the video 202. It will be appreciated that the face detection and tracking subsystem 200 includes at least a motion detection module 204, a face detection module 206, a face pose estimation module 208, a best-pose selection module 210 and a face tracking module 212. As mentioned above, the face detection and tracking subsystem 200 is configured to track multiple persons simultaneously in the captured video 202; ideally, it can track as many persons simultaneously as possible. The face detection and tracking subsystem 200 can also include other modules not shown in Fig. 2. Each part of the face detection and tracking subsystem 200 is described in further detail below.
As can be seen from the figure, the motion detection module 204 first receives a given video image 202 of the captured video. In some embodiments, it is assumed that capturing a face in the video images 202 is motion-related: the process begins when a person enters the camera's field of view and lasts until the person leaves the field of view or is occluded by other people or objects. Therefore, to reduce the computational complexity of the face detection and tracking subsystem 200, the motion detection module 204 can preprocess each video frame to locate and identify the regions associated with motion in each frame, i.e., the moving areas. In this way, the face detection module 206 only needs to operate on the detected moving areas to detect facial images; conversely, the remaining areas unrelated to motion are ignored (in other words, the face detection and tracking subsystem 200 does not further process the remaining areas), improving overall system efficiency and image processing speed. However, there is also the case in which a person enters the video range and then stops moving; in this case, an initially moving face becomes a static face. Techniques that can detect and handle such static faces are provided below.
In some embodiments, the motion detection module 204 detects the moving areas in the most recently received video image by directly computing a difference image between the current video image and a previous video image in a sequence of video frames. In one embodiment, the current video image is compared with the video frame immediately preceding it in the video sequence. In some embedded systems, embedded motion detection hardware, such as a DSP, can be used to implement the motion detection module 204. For example, when the face detection and tracking subsystem 200 is implemented on the Hi3519 system on chip, the motion detection module 204 can be realized with the embedded motion detection function of the DSP in the Hi3519 SoC. The output of the motion detection module 204 includes a set of identified moving areas 214, which can have many different sizes. It will be appreciated that, for a detected moving human object, the moving area associated with the human object includes both the face and the body, and can also include more than one face. Therefore, each identified moving area can be passed to the subsequent face detection module 206 to detect most or all of the faces within it.
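The difference-image step can be sketched for two grayscale frames as a per-pixel absolute difference followed by thresholding; the threshold of 25 intensity levels is an illustrative assumption:

```python
def motion_mask(prev_frame, curr_frame, threshold=25):
    """Mark pixels whose absolute intensity change exceeds a threshold;
    connected regions of marked pixels would then form the moving areas."""
    return [[1 if abs(c - p) > threshold else 0
             for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev_frame, curr_frame)]

def has_motion(mask, min_pixels=1):
    """A frame is considered 'moving' if enough pixels changed."""
    return sum(sum(row) for row in mask) >= min_pixels
```

A hardware implementation such as the DSP motion detector mentioned above performs essentially this comparison, plus grouping of changed pixels into rectangular regions, at video rate.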
The motion detection module 204 is explained in more detail in U.S. Patent Application 15/789,957 (filing date: October 20, 2017; title of invention: face detection and head pose angle estimation based on small-scale convolutional neural network modules for embedded systems), the content of which is incorporated herein by reference.
Face detection and face pose estimation
For each detected moving area 214 generated by the motion detection module 204, the CNN-based face detection module 206 is used to detect some or all of the faces in that moving area. The face detection module 206 can be implemented with many different techniques. For example, it can be implemented with a histogram of oriented gradients (HOG) feature comparison technique combined with a support vector machine (SVM) classifier. In some embodiments, the face detection module 206 is implemented with the coarse-to-fine multistage CNN structure described in "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks" (Zhang et al., IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, October 2016), i.e., the MTCNN architecture. However, other CNN-based face detection structures and techniques, now known or later developed, can also be used to implement the face detection module 206 without departing from the scope of the present application. The face detection module 206 generates a set of detected faces 216 and corresponding bounding box positions. It will be appreciated that the face tracking module 212 can track previously detected faces in processed video images based on the current output of the face detection module 206, where the current output is associated with the most recently processed video image.
When a person enters the video's field of view, the person's head/face can have different orientations, i.e., different poses in different video images. Estimating the pose of each detected face makes it possible to track the pose changes of each face across a sequence of video frames; when a tracked person is deemed lost and needs to be removed, the face image corresponding to the "best pose" is sent to the main server for face recognition, where the "best pose" face image is the face image closest to the frontal view of each detected face (i.e., having the smallest rotation angles). Face detection and tracking subsystem 200 uses face pose estimation module 208 to estimate the pose of each detected face output by face detection module 206 and to generate face pose estimates 218. Best pose selection module 210 uses the output of face pose estimation module 208 to update the best pose of each tracked person as people move through a sequence of video frames.
In one technique, the face pose is estimated based on the positions of a set of facial landmarks (e.g., eyes, nose, and mouth) by computing the distances between these landmarks and the corresponding landmarks of a frontal view. Another technique represents the head pose with three Euler angles, namely pitch, yaw, and roll, and estimates the pose directly from these three angles. The angle-based pose estimation method generally has lower complexity than the landmark-based method, because the angle-based method needs only three values, whereas the landmark-based method typically requires more than three landmark coordinates in its estimation procedure. Moreover, the angle-based pose estimation method can perform a simple best-pose evaluation by using the sum of the absolute values of the three pose angles.
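The angle-based metric in the last sentence is simple enough to state directly. A minimal sketch (function names are illustrative; the text only specifies the sum of absolute Euler angles, where zero corresponds to a perfectly frontal face):

```python
def pose_metric(pitch, yaw, roll):
    """Sum of absolute Euler angles; smaller means closer to frontal."""
    return abs(pitch) + abs(yaw) + abs(roll)

def is_better_pose(metric_a, metric_b):
    """True if pose A is closer to the frontal view than pose B."""
    return metric_a < metric_b
```

Best-pose selection then reduces to keeping the face image with the smallest metric seen so far for each tracked person.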
Both of the above face pose estimation techniques can be implemented with conventional methods, i.e., without deep neural networks, or with deep neural networks such as CNNs. When implemented with CNNs, face detection module 206 and face pose estimation module 208 can be implemented jointly as a single neural network. U.S. Patent Application 15/789,957 (filed October 20, 2017; entitled "Face Detection and Head Pose Angle Estimation Using Small-Scale Convolutional Neural Network Modules for Embedded Systems") describes such a joint CNN-based face detection and head pose angle estimation system and technique, the content of which is incorporated into the present application by reference.
In face detection and tracking subsystem 200, face pose estimation module 208 is followed by best pose selection module 210, which determines and updates the "best pose" of each tracked person based on the sequence of pose estimates associated with the sequence of detected faces of that person across a sequence of video frames. In some embodiments, the "best pose" is defined as the face pose closest to the frontal view (in other words, the pose with the smallest overall head rotation angles). As can be seen in FIG. 2, best pose selection module 210 can be coupled to face tracking module 212 to receive face tracking information. Because face pose estimation module 208 continuously estimates each person's pose, and best pose selection module 210 continuously updates each person's best pose, best pose selection module 210 can maintain the best pose of each tracked person. In some embodiments, when face tracking module 212 determines that a tracked person has disappeared from the video (that is, the tracked person is determined to be gone from the video), best pose selection module 210 can transmit the detected face image corresponding to the current best pose of that tracked person (i.e., best-pose face image 220) to a control center or the main server for face recognition tasks.
Face tracking module 212 can implement multiple techniques for determining whether a tracked person has disappeared from the video. For example, one technique keeps track of each person's most recently computed face bounding box. Assuming no extreme motion, there should be some degree of overlap between the most recently computed bounding box of a tracked person and the immediately preceding bounding box of that person. Hence, when face tracking module 212 determines that there is no overlap between the most recently computed bounding box of the tracked person and the immediately preceding bounding box of that person, face tracking module 212 determines that the tracked person has disappeared from the video. In another technique, face tracking module 212 can keep track of the labels assigned to all detected faces (i.e., a set of associations between unique labels and the corresponding detected faces). Then, if the label previously assigned to a tracked person in a previously processed video image is not assigned to any detected face in the currently processed video image, the tracked person associated with that label can be considered to have disappeared from the video.
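The first disappearance test above reduces to an overlap check between consecutive bounding boxes. A minimal sketch, assuming (x, y, w, h) boxes and using intersection-over-union as the overlap measure (the text only requires "some degree of overlap", so treating zero IoU as disappearance matches it):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix0, iy0 = max(ax, bx), max(ay, by)
    ix1, iy1 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def has_disappeared(prev_box, curr_box):
    """A tracked face is declared lost when there is no detection in the
    current frame or the consecutive boxes no longer overlap."""
    return curr_box is None or iou(prev_box, curr_box) == 0.0
```

The same `iou` value also serves as the matching score used later for multi-face association.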
In some scenarios, a person who stays in the video for a long time can prevent that person's best pose from being transmitted to the control center or main server for a long time. To address this problem, some embodiments provide an early pose submission technique. In one embodiment, while a sequence of video images is being processed, if the estimated face pose of a tracked person in one of the video frames is good enough (for example, the face pose is deemed good enough when compared against a threshold), the detected face image corresponding to that "good-enough face pose" can be transmitted to the server immediately, without waiting for the tracked person to leave the field of view of the video. More specifically, if the neural network also generates facial landmarks while detecting face images, the distances of the determined landmarks from the reference facial landmarks associated with a fully frontal face pose can be compared with a threshold distance. Alternatively, if the neural network also generates the three pose angles while detecting face images, the sum of the absolute values of these estimated pose angles can be compared with a threshold angle. In both cases, when the most recently computed pose metric of the tracked person falls below the corresponding threshold, that pose metric is considered "good enough," and the corresponding face image of the tracked person can be transmitted to the server. In these embodiments, to avoid transmitting the same person's face repeatedly, after the tracked face is determined to have disappeared from the video, the determined best pose of the person is sent to the control center or server only if no "good-enough" face image has already been submitted.
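The combined best-pose-plus-early-submission bookkeeping can be sketched as a small per-person state object. This is an illustrative reading of the paragraph above, not the patent's implementation; the class and method names are assumptions, and the metric is the sum of absolute pose angles from the angle-based technique:

```python
class BestPoseTracker:
    """Per-person state: smallest pose metric seen so far, plus early
    submission once the metric drops below a 'good enough' threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.best_metric = float("inf")
        self.best_face = None
        self.submitted = False

    def update(self, face_image, metric):
        """Record a new pose estimate; return a face to submit early, or None."""
        if metric < self.best_metric:
            self.best_metric = metric
            self.best_face = face_image
        if not self.submitted and metric < self.threshold:
            self.submitted = True          # avoid repeated transmission
            return face_image
        return None

    def on_disappear(self):
        """On track loss, return the best face only if nothing went out early."""
        return None if self.submitted else self.best_face
```

A person whose pose never crosses the threshold still gets their best face sent once, when the track ends.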
In face detection and tracking subsystem 200, in order to find the best pose of each tracked person, the person must be tracked from the moment the person is first detected in the video until the person is determined to have disappeared from the video, and the position of the tracked person must be captured in each frame of the video throughout this entire process.
In some embodiments, the CNN-based face detection module 206 and face pose estimation module 208 are applied to every frame of the captured video 202. In other words, the input to motion detection module 204 in subsystem 200 includes every frame of the captured video 202. This is feasible for a high-performance embedded vision system 100, or when the captured video frame rate is very low. In these embodiments, face detection and tracking subsystem 200 generates a set of detected faces and the corresponding bounding box coordinates for every frame of the captured video 202. Using the face detection information from the sequence of video frames, face tracking module 212 can perform single-face tracking or multi-face tracking. For example, if the captured video 202 contains only a single person, the face tracking based on the processed video frames only involves determining when the tracked face disappears from the video.
If the captured video 202 contains multiple people, face tracking module 212 needs to perform multi-face tracking, i.e., track multiple people simultaneously. In one embodiment, the multiple people can initially be detected in a single video frame. In another embodiment, because multiple people enter the video at different times, the multiple people can be detected separately across multiple video frames. In some embodiments, after the multiple people have been detected and labeled, face tracking module 212 performs multi-face tracking by matching the set of labeled bounding boxes in a previous video frame with the bounding boxes identified in the most recently processed video frame, thereby tracking the multiple detected people. In one embodiment, the Hungarian algorithm can be used to associate the set of labeled bounding boxes in the previous video frame with the bounding boxes identified in the most recently processed video frame. For example, a similarity matrix can be constructed between the bounding boxes in the previous video frame and the bounding boxes in the most recently processed video frame, where each matrix element measures the matching score between a given bounding box in the previous video frame and a bounding box in the most recently processed video frame. Different metrics can be used to compute this matching score; one of them is the intersection-over-union (IoU) of a pair of bounding boxes. To associate the bounding boxes of two consecutive video frames, other data association techniques can substitute for the Hungarian algorithm, and CNN features generated for data association, as well as features other than bounding boxes, can also be used. For example, to improve face association performance, some low-cost face features can be considered, such as face size, aspect ratio, LBP, HOG, color histograms, and so on.
Fig. 3 presents a flowchart illustrating an exemplary process 300 for performing real-time face detection, face pose estimation, and face tracking in an embedded vision system in accordance with some embodiments described herein. Process 300 begins by receiving a video image from a sequence of video frames of a captured video (step 302). In some embodiments, the embedded vision system includes a surveillance camera system, a machine vision system, a self-driving vehicle, or a mobile phone. Next, process 300 performs a face detection operation on the video image to detect a set of faces in the video image (step 304). In some embodiments, step 304 includes: identifying a set of moving areas in the video image using the motion detection module described above, and, for each moving area in the set of identified moving areas, detecting whether the moving area contains a face using a CNN-based face detection technique. In some embodiments, each detected face image is delimited by a bounding box within the original video image.
Next, process 300 determines, based on the detected faces, that a new person has entered the video (step 306). For example, process 300 can perform a face association operation between a set of labeled detected faces in the immediately preceding video image and a set of unlabeled detected faces in the current video image. The process then determines each detected face that is not associated with any previously detected face to be a new person. Next, process 300 tracks the new person in subsequent video images of the captured video (step 308). For example, process 300 can detect a sequence of new positions of the new person in the subsequent video images. For each subsequent video image containing a new position of the new person, process 300 computes the face pose of the new person's detected face at that new position, and updates the best pose of the new person.
Next, process 300 detects whether the tracked new person has disappeared from the video (step 310). Fig. 4 presents a flowchart illustrating an exemplary process 400 for detecting whether a tracked person has disappeared from the video in accordance with some embodiments described herein. Process 400 begins by determining that the tracked person has no corresponding detected face in the current video frame (step 402). In some embodiments, step 402 includes: failing to detect the tracked person at and around the predicted new position in the current video frame. Next, the process detects whether there is a face corresponding to the tracked person at the corresponding position in the current video frame, where the corresponding position is the position of the detected face of the tracked person in the video frame immediately preceding the current video frame (step 404). If so, the process determines that the tracked person has not moved and has a stationary face (step 406). Otherwise, the process determines that the tracked face has indeed disappeared from the video (step 408).
Returning to Fig. 3, if step 310 determines that the tracked new person has disappeared from the video, process 300 then sends the detected face image corresponding to the determined best pose of the tracked new person to the server (step 312). Note that there is no need to send all of the tracked face images; only the face image corresponding to the best pose needs to be transmitted, which significantly reduces the network bandwidth and storage requirements. Otherwise, if the tracked new person is still in the video, process 300 continues to track the person in subsequent video images and to update the person's best pose (step 314).
The discussion above assumes that the CNN-based face detection module 206 and face pose estimation module 208 are applied to every video frame of the captured video 202. However, due to the resource and performance limitations of the embedded system, in some embodiments it is impractical for the embedded system to process every video frame with the CNN modules. Moreover, to reduce the computational complexity and improve the real-time video processing speed, the CNN-based face detection module 206 and face pose estimation module 208 do not have to be applied to every video frame. In some embodiments, motion detection module 204 receives only a subset of the captured video frames, e.g., one video frame out of every N video frames; similarly, the CNN-based face detection module 206 and face pose estimation module 208 can be applied only to that subset of video frames. Fig. 5 illustrates a captured sequence of video frames 502 and the corresponding processed video frames 504 in accordance with some embodiments described herein. In the illustrated embodiment, only 1 out of every 4 video frames is processed for face detection and face pose estimation (N=4). Consequently, the video frames between every two processed frames have no associated tracking information.
However, it may be necessary to perform face detection and face tracking on the "intermediate" video frames not processed by the CNN modules. For example, in applications where the bounding boxes of tracked faces are displayed continuously on a monitor, bounding boxes need to be generated for these intermediate frames; otherwise the displayed bounding boxes would keep flickering. Moreover, when N is large, the multiple tracked people can undergo a large amount of motion, so tracking multiple people from one processed video frame 504 to the next processed video frame 504 using face association techniques becomes more difficult. In this case, face tracking may again need to be performed on these "intermediate" video frames.
In some embodiments, face tracking module 212 can be used to locate and label the tracked faces in the intermediate video frames without applying face detection module 206. In some embodiments, face tracking module 212 determines the position of each tracked face in an unprocessed video frame (e.g., frame 2) immediately following a processed frame 504 (e.g., frame 1) based on the determined position of the tracked face in that processed frame (e.g., frame 1). For example, Fig. 6 presents a flowchart illustrating an exemplary process 600 for performing face localization and tracking on unprocessed video frames based on a processed video frame in accordance with some embodiments described herein. In some embodiments, exemplary process 600 is implemented in face tracking module 212.
For each detected face in the processed video frame, process 600 designates the corresponding bounding box location of the detected face as the reference block, and designates the detected face image within the bounding box as the search block (step 602). Next, in the subsequent unprocessed video frame, process 600 performs a search within a search window of predetermined size centered at the same position as the reference block (step 604). More specifically, multiple positions having the size of the reference block (e.g., 64 different positions) can be searched within the search window. At each search position within the search window, the search block is compared with the image block at that position (step 606). Then, at the search position where the search block best matches the corresponding image block, process 600 determines the same detected face in the unprocessed video frame (step 608). Note that as long as the search window in the unprocessed video frame is sufficiently large and the position of the detected face does not change significantly between two consecutive frames (i.e., assuming no extreme motion), this direct search technique is sufficiently accurate to locate the detected face in the unprocessed video frame without using a neural network, regardless of whether the motion of the detected face is linear or nonlinear. Next, process 600 can place the corresponding bounding box at the search position determined to be the best match, thereby determining the new position of the detected face in the unprocessed video frame (step 610). Note that after process 600 has been applied to one unprocessed video frame (e.g., frame 2 in Fig. 5), process 600 can be repeated for the other unprocessed video frames (e.g., frames 3-4 in Fig. 5) immediately following that most recently processed video frame, based on the detected faces in the most recently processed video frame (e.g., frame 2).
In one embodiment, process 600 compares the search block with the image block at a given search position within the search window by computing a similarity score between the search block and that image block. In some embodiments, the similarity between the search block and the image block can simply be computed as the difference between the search block and the image block.
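Steps 602-610 amount to classic exhaustive block matching. A minimal sketch under stated assumptions: grayscale NumPy frames, (x, y, w, h) boxes, a square search window, and sum of absolute differences (SAD) as the "difference" similarity the paragraph mentions; the function name and radius are illustrative:

```python
import numpy as np

def track_face(prev_frame, curr_frame, box, search_radius=8):
    """Relocate the face delimited by `box` (x, y, w, h) in `curr_frame`
    by comparing the face patch from `prev_frame` (the search block),
    via SAD, against every candidate position within +/- search_radius
    pixels of the original location (the search window)."""
    x, y, w, h = box
    template = prev_frame[y:y + h, x:x + w].astype(np.int32)
    best_sad, best_pos = None, (x, y)
    for dy in range(-search_radius, search_radius + 1):
        for dx in range(-search_radius, search_radius + 1):
            nx, ny = x + dx, y + dy
            if nx < 0 or ny < 0 or nx + w > curr_frame.shape[1] or ny + h > curr_frame.shape[0]:
                continue  # candidate block would fall outside the frame
            patch = curr_frame[ny:ny + h, nx:nx + w].astype(np.int32)
            sad = int(np.abs(patch - template).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_pos = sad, (nx, ny)
    return (best_pos[0], best_pos[1], w, h)
```

No neural network is involved, which is the point of the technique: the cost is one SAD per candidate position, well within the budget of intermediate frames.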
In some embodiments, to speed up the search process described above, face tracking module 212 predicts the new position of each detected face in the unprocessed video frame based on the reference position in the processed video image and the predicted motion of the detected face. More specifically, for each detected face in a processed video frame (e.g., frame #5 in Fig. 5), face tracking module 212 first predicts the estimated position of the detected face in the unprocessed video frame (e.g., frame #6 in Fig. 5). In these embodiments, the motion of the detected face (e.g., trajectory and speed) is first predicted based on multiple face positions of the detected face in previously processed video frames. For example, in the video sequence 502 depicted in Fig. 5, the positions of the detected face in frames 1 and 5 can be used to predict the new positions of the detected face in frames 6-8. Note that the motion prediction can include linear prediction and nonlinear prediction. For linear prediction, the trajectory and speed of the motion can be predicted. For nonlinear prediction, a Kalman filter method can be used.
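The linear-prediction case can be written in a few lines. A sketch under stated assumptions: face positions are (x, y) centers from two processed frames a known number of frames apart, and the motion is extrapolated at constant velocity (the Kalman variant the text mentions is not shown here):

```python
def predict_positions(pos_a, pos_b, frames_apart, steps):
    """Linearly extrapolate face-center positions.

    pos_a, pos_b : (x, y) centers in two processed frames
    frames_apart : frame distance between those two processed frames
    steps        : number of subsequent unprocessed frames to predict
    """
    vx = (pos_b[0] - pos_a[0]) / frames_apart   # per-frame velocity
    vy = (pos_b[1] - pos_a[1]) / frames_apart
    return [(pos_b[0] + vx * k, pos_b[1] + vy * k) for k in range(1, steps + 1)]
```

In the Fig. 5 example, the positions in frames 1 and 5 (frames_apart=4) would yield predictions for frames 6-8 (steps=3), and the block-matching search then only needs a small window around each prediction.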
Next, for each detected face in the processed video frame, face tracking module 212 uses the corresponding search block to search for the detected face at the estimated new position (i.e., the search position) in the unprocessed video frame. Note that because the accuracy of the estimated position is improved, face tracking module 212 does not need to search many positions near the estimated position. For each search position centered at the estimated position, the search block is compared with the image block at that search position. Then, at the search position where the search block best matches the corresponding image block, the same detected face is determined in the unprocessed video frame. As a variation of the above process, face tracking module 212 can directly place the bounding box of the detected face from the previous frame at the estimated position in the unprocessed video frame, and take the image block within that bounding box as the image block of the detected face in the unprocessed video frame.
To further reduce the computational complexity and speed up the face tracking process, motion detection module 204 can perform motion detection on downsampled/low-resolution versions of the sequence of frames, which can be combined with many standard face detection schemes. One technique for generating a downsampled version of an input video frame and performing face detection on that image is described in U.S. Patent Application 15/657,109 (filed July 21, 2017; entitled "Face Detection Using Small-Scale Convolutional Neural Network Modules in Embedded Systems"), the content of which is incorporated into the present application by reference.
Face Tracking at a Low Frame-Processing Rate
As noted previously, because multi-stage CNN operations are involved, applying face detection module 206 and face pose estimation module 208 to every input video frame would require a large amount of computation. For embedded systems that capture images at a high frame rate, performing real-time face detection and face pose estimation on the captured video images becomes extremely difficult. Some low-end embedded video systems cannot process every input video frame as new video frames are captured, because their processing speed cannot keep up with the high input frame rate. In these cases, some embedded systems can perform the CNN-based face detection and face pose estimation only on a subset of the video frames, e.g., choosing one input video frame out of every N input video frames (e.g., N=4). Consequently, no face detection or face pose estimation is performed on these unprocessed or "intermediate" video frames. In such a system, the face tracking performance is poor and tracked faces are easily lost.
To address this problem, one approach is to use the face positions determined in the first two or more processed video frames to predict the face positions in the unprocessed frames immediately following the last of those processed video frames. For example, in the example depicted in Fig. 5, processed frames 1 and 5 can be used to predict the face positions in frames 6-8. In another example, processed frames 1, 5, and 9 can be used to predict the face positions in frames 10-12. In some embodiments, the face position prediction for the unprocessed frames can use linear prediction or more sophisticated schemes such as a Kalman filter method.
Another approach to the low frame-processing-rate problem uses motion estimation to search, in one or more subsequent unprocessed video frames, for the new position of each detected face from the previous processed frame. Taking Fig. 5 as an example again, suppose the CNN-based face detection module 206 and face pose estimation module 208 are applied to frames 1 and 5. Because of the large gap between these two frames, it is relatively difficult to directly associate the labels in frame 1 with the detected faces in frame 5. In some embodiments, the search and labeling process described above in conjunction with Fig. 6 can be applied recursively from frame 1 up to frame 5. More specifically, using the faces detected and labeled in frame 1, the corresponding faces are searched for in frame 2, and the faces found in frame 2 are labeled. Next, using the faces detected and labeled in frame 2 as references, the faces in frame 3 are searched for and labeled, and so on. Finally, based on the face detection information originating from frame 1, the faces in frame 4 are searched for and labeled. Then, using one of the face association techniques, the faces previously detected in frame 5 are labeled using the faces labeled in frame 4.
In one specific example, the CNN-based face detection module 206 and face pose estimation module 208 can be applied to every other video frame, e.g., frame 1, frame 3, frame 5, frame 7, etc. (i.e., N=2). The CNN-processed frames are labeled using one of the face association techniques, e.g., the intersection-over-union (IoU) technique mentioned above. Next, for each unprocessed intermediate video frame (e.g., frame 2), the position of each tracked face can be simply determined by interpolating between the corresponding positions of the tracked face in the immediately preceding and immediately following processed video frames (frames 1 and 3). Note that the above technique can easily be extended to scenarios in which the CNN-based face detection module 206 and face pose estimation module 208 are applied to one out of every three video frames (i.e., N=3).
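The interpolation step for the intermediate frame is a one-liner. A sketch assuming (x, y, w, h) boxes; t=0.5 gives the midpoint for the N=2 case, and other values of t cover the N=3 case (t=1/3, 2/3):

```python
def interpolate_box(box_before, box_after, t=0.5):
    """Linearly interpolate a (x, y, w, h) bounding box for an intermediate
    frame lying a fraction t of the way between two processed frames."""
    return tuple(round(a + (b - a) * t) for a, b in zip(box_before, box_after))
```

Unlike the extrapolation used for frames after the latest processed frame, interpolation is bounded by two real detections, so it cannot drift.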
Detecting Stationary Faces
As described above, if the people detected in the captured video are assumed to be always moving, and the motion detection operation pre-processes the video frames only by extracting and processing the moving areas, then when a person stops at some point in the video images, the person's face can no longer be detected in subsequent video frames. Note that if motion detection module 204 is removed from face detection and tracking subsystem 200 and the entire input video image is processed by the subsequent modules, this problem no longer exists. A technique for detecting a person who has stopped moving was described above in conjunction with Fig. 4. As long as the person remains stationary, that technique can also be used to continue monitoring the stopped person over additional video frames until the person starts moving again. The face tracking techniques described above can then be applied to the moving person again.
High-Frame-Rate Display
Many surveillance systems include a video preview feature, which allows the control center to preview the video captured by each camera live. However, some low-cost CNN-based face detection modules 206 implemented in embedded vision system 100 cannot run at a high frame rate (e.g., a video capture rate of 30 frames per second (30 fps)). Because of the significant reduction in frame rate, displaying only the subset of processed frames and the associated detected face bounding boxes would result in very poor visual quality.
In some embodiments, to improve the visual quality of the preview mode, the technique disclosed in the present application generates a high-frame-rate display by introducing a delay into the display. Taking Fig. 5 as an example again, the input video sequence is processed once every 4 frames. To display the captured video, the display is first delayed by 4 frames; the first processed frame is displayed, followed by the three unprocessed frames 2-4. Next, processed frame 5 is displayed, followed by the three unprocessed frames 6-8, and the process repeats in this manner. Although not every displayed frame shows the bounding boxes of the detected faces, the displayed video can be played at the original frame rate and can be as smooth as the original video. Note that the 4-frame delay 506 described above is merely exemplary. The delay 506 can be determined based on the required processing time, and the delay is set to be greater than the determined processing delay.
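The delayed display amounts to a fixed-length frame buffer: a frame is shown only after `delay` newer frames have been captured, giving the CNN pipeline time to attach bounding boxes. A minimal sketch (the class name is an assumption; real code would overlay the boxes before release):

```python
from collections import deque

class DelayedDisplay:
    """Buffers incoming frames and releases each one `delay` captures later,
    so processing can finish before a frame reaches the screen."""

    def __init__(self, delay):
        self.delay = delay
        self.buffer = deque()

    def push(self, frame):
        """Accept a captured frame; return the frame to display now,
        or None while the initial delay is still filling."""
        self.buffer.append(frame)
        if len(self.buffer) > self.delay:
            return self.buffer.popleft()
        return None
```

With delay=4 and the Fig. 5 schedule, frame 1 appears on screen as frame 5 is captured, i.e., exactly when its processed results are ready, and playback continues at the original frame rate.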
Improved Face Detection and Face Tracking Subsystem
Fig. 7 presents a block diagram of an exemplary implementation 700 of face detection and tracking subsystem 112 in embedded vision system 100 in accordance with some embodiments described herein. Note that face detection and tracking subsystem 700 is essentially identical to face detection and tracking subsystem 200 in overall structure. That is, face detection and tracking subsystem 700 receives a sequence of video images 702 from the captured video as input and generates, for each independently detected face in the video images 702, a set of best-pose face images 720 as output. Moreover, face detection and tracking subsystem 700 also includes motion detection module 704, face detection module 706, face pose estimation module 708, best pose selection module 710, and face tracking module 712. While detecting new faces, face detection and tracking subsystem 700 also tracks multiple faces in the captured video images 702. Ideally, face detection and tracking subsystem 700 is configured to track as many people simultaneously as possible.
In some embodiments, face detection module 706 can be implemented with a DL-based MTCNN architecture, one version of which is disclosed in "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks" (Zhang et al., IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, October 2016). The MTCNN architecture combines the face-detection and face-alignment operations in a unified cascaded CNN through a multi-task learning process. This cascaded-CNN structure is similar to the one described in "A Convolutional Neural Network Cascade for Face Detection" (Li et al., Proc. IEEE Conf. on Computer Vision and Pattern Recognition, June 1, 2015): like that structure, MTCNN also processes input images at different resolutions through multiple coarse-to-fine CNN stages. However, in each stage of the MTCNN architecture, a single CNN is jointly trained for facial-landmark localization, face/non-face binary classification, and bounding-box calibration. As a result, the MTCNN architecture requires only three CNN stages. More specifically, the first stage of MTCNN quickly generates candidate face windows through a shallow CNN. Next, the second stage of MTCNN rejects a large number of non-face windows through a more complex CNN and refines the remaining candidate windows. Finally, the third stage of MTCNN uses an even more powerful CNN to decide whether each input window is a face; if it is, the positions of five facial landmarks are further estimated. In general, compared with the cascaded-CNN structure, the MTCNN architecture is better suited to resource-limited embedded vision systems. Besides the MTCNN architecture, face detection module 706 can also be implemented with other known or future CNN-based face-detection architectures and techniques without departing from the scope of this application. Face detection module 706 generates a set of detected faces 716 and the corresponding bounding-box positions.
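The three-stage, coarse-to-fine pipeline described above can be sketched as follows. This is a minimal illustration only: `pnet`, `rnet`, and `onet` are placeholder callables standing in for the three trained CNN stages, not the patent's actual networks.

```python
# Illustrative sketch of the three-stage MTCNN cascade (placeholder nets).
def mtcnn_cascade(image, pnet, rnet, onet):
    """Run the coarse-to-fine stages of an MTCNN-style detector on one image."""
    # Stage 1: a shallow CNN quickly proposes candidate face windows.
    candidates = pnet(image)
    # Stage 2: a deeper CNN rejects most non-face windows and refines the rest;
    # here a stage-2 rejection is modeled as returning None.
    refined = [w for w in (rnet(image, c) for c in candidates) if w is not None]
    # Stage 3: a stronger CNN makes the final face/non-face decision and, for
    # faces, also regresses the positions of five facial landmarks.
    results = []
    for w in refined:
        out = onet(image, w)
        if out is not None:
            box, landmarks = out
            results.append((box, landmarks))
    return results
```

The key design point the patent highlights is that each stage is multi-task (classification, box calibration, landmarks), so three stages suffice where earlier cascades needed separate networks per task.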
Although the MTCNN architecture works well for most embedded vision systems, many such systems (e.g., those based on the Hi3519 SoC) are resource-constrained, and the original MTCNN architecture still consumes a large amount of computing resources. This limits the speed of face detection and, in turn, the number of people that can be tracked simultaneously through face detection module 706. In some embodiments, instead of running the original MTCNN architecture, face detection module 706 can run a "slimmed-down" version of the original MTCNN architecture to speed up the face-detection operation. This technique is disclosed in "MobileNets: Efficient Convolutional Networks for Mobile Vision Applications" (Howard et al., Computing Research Repository, April 2017) (hereinafter "MobileNets").
Using the MobileNets technique described above, face detection module 706 can be implemented with a modified MTCNN architecture that uses a reduction parameter α to reduce the number of filters in each CONV layer. For a given CONV layer and a given reduction parameter α, the number of input channels M becomes αM and the number of output channels N becomes αN. In some embodiments, the value of α is between 0 and 1; typical settings include 0.25, 0.5, 0.75, and 1, where 1 represents no reduction. For example, in MobileNets, when all CONV layers keep 75% of their filters (i.e., α = 0.75), classification accuracy decreases only slightly. Similarly, keeping 75% of the filters in all CONV layers of MTCNN's three networks yields a "0.75 MTCNN." As a test, this 0.75 MTCNN was deployed on an embedded platform with an ARM Cortex A7 core, trained on the "WIDER FACE" dataset, and evaluated on the "Face Detection Data Set and Benchmark" (FDDB) with all R/G/B channels as input data. The test result: with the 0.75 MTCNN architecture, face detection lost only about 1% in accuracy, while the processing speed of the system improved by about 50%. In actual operation, a face detection module 706 based on a "slimmed-down" MTCNN architecture with a reduction parameter α between 0.5 and 0.75 achieves satisfactory face-detection accuracy while significantly speeding up processing.
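The effect of the reduction parameter α on a CONV layer's channel counts can be shown with a small sketch (the helper name is hypothetical, not the patent's code):

```python
# Sketch: applying a MobileNets-style width-reduction parameter alpha
# to a CONV layer's input/output channel counts.
def reduced_channels(in_channels, out_channels, alpha):
    """Scale a CONV layer's channel counts M, N to alpha*M, alpha*N."""
    assert 0.0 < alpha <= 1.0
    return max(1, int(alpha * in_channels)), max(1, int(alpha * out_channels))

# A "0.75 MTCNN": every CONV layer keeps 75% of its filters.
m, n = reduced_channels(32, 64, 0.75)  # -> (24, 48)
```

Applying the same α uniformly across all three MTCNN networks is what produces the "0.75 MTCNN" variant discussed in the text.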
Some cameras in embedded vision systems (such as conventional surveillance systems) can only capture grayscale video images 702. Moreover, video images 702 may be captured under monochromatic or grayscale lighting conditions, in which case they are close to grayscale or colorless images even when captured in color. For example, urban night lighting increasingly uses LED light sources, such as bus-stop light boxes and street lamps, most of which are monochromatic or grayscale in nature. Likewise, public transit facilities, such as subway stations and indoor metro platforms, increasingly use LED light sources. In such cases, even if the camera can capture color images, the captured surveillance video is essentially grayscale. Therefore, when the video images 702 input to face-detection-and-tracking subsystem 700 are all grayscale, the image patches 714 fed to face detection module 706 can use a single grayscale channel instead of the original three R/G/B channels.
However, when face detection module 706 is trained on a conventional training dataset and the input video images 702 are grayscale, the performance of face detection module 706 degrades. This is because public training datasets, such as WIDER FACE, consist mostly of RGB images captured under natural lighting. Hence, when the CNNs in face detection module 706 are trained on RGB images but the input video images 702 are grayscale, a large data-distribution mismatch arises between the training set and the input images. As a result, the accuracy of the trained CNNs drops when processing grayscale images 702. In one performance test, face detection module 706 used the original MTCNN architecture trained on RGB images, and each grayscale input image was treated as three identical channels; in that test, face-detection performance was adequate. In another performance test, however, when face detection module 706 was implemented with the previously described 0.75 MTCNN, the 0.75 MTCNN trained on RGB training images performed poorly on grayscale input images 702.
To address the above problem, one way to reduce the data-distribution mismatch is to make the training images more consistent with the processed video images. In particular, when the input video images 702 are grayscale, face detection module 706 should also be trained on grayscale images. Directly collecting a large number of grayscale training images is impractical, however. In some embodiments, the training images taken from a large-scale training database, such as WIDER FACE, are first converted to grayscale images, and these converted grayscale images are then used as training data for face detection module 706. The same idea can be applied to input images of other formats. In other words, based on certain characteristics of video images 702, the training dataset can be modified to make the training images more consistent with the processed video images, thereby reducing the data-distribution mismatch.
Fig. 8 shows a flowchart of an exemplary process for improving face-detection performance on grayscale input images, according to some embodiments of the application. First, a face-image training dataset, such as the color FERET database or WIDER FACE, is received (step 802). Next, the color images in the training dataset are converted to grayscale images (step 804). Next, the face-detection CNN module is trained on the converted grayscale images (step 806). Next, in a face-detection-and-tracking application, grayscale video images are received, and face detection is performed on the grayscale images using the face-detection CNN module trained on the grayscale training images (step 808).
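Step 804 above, converting the color training images to grayscale, can be sketched as follows. This is a minimal illustration using the standard BT.601 luma weights; an actual pipeline would use an image library, and the function names are hypothetical.

```python
# Sketch of step 804: convert RGB training images to single-channel grayscale
# before training the face-detection CNN (BT.601 luma weights).
def rgb_to_gray(pixel):
    """One grayscale value per (R, G, B) pixel."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def convert_dataset(rgb_images):
    """Turn each H x W list-of-RGB-rows image into a single-channel image."""
    return [[[rgb_to_gray(px) for px in row] for row in img] for img in rgb_images]
```

The point of the conversion is distribution matching: the CNN is then trained and deployed on the same single-channel input format, removing the RGB-vs-grayscale mismatch described above.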
Based on the above technique, the previously described performance test was further modified. Specifically, the face detection module using the 0.75 MTCNN architecture (the "0.75 MTCNN module") was trained on grayscale images converted from WIDER FACE, and the trained 0.75 MTCNN was then used to process the images in FDDB. The performance of the grayscale-trained 0.75 MTCNN module is only slightly worse (~1.5%) than that of the RGB-trained full MTCNN module. Moreover, on grayscale input images, the grayscale-trained 0.75 MTCNN module performs substantially better than the RGB-trained 0.75 MTCNN module on the same grayscale input images. Thus, the proposed slimmed-down MTCNN architecture and the CNN training technique can be combined to improve each other's performance while increasing the speed at which face detection module 706 processes grayscale input images.
As described above in conjunction with Figs. 2-3, during real-time face tracking, face-detection-and-tracking subsystem 700 keeps tracking the best-pose face of each tracked face; when a tracked face disappears from the video and needs to be removed, its current best-pose face image is sent to the server for further analysis, such as face recognition. Typically, when a person's face is temporarily occluded by another person or an object in the same image frame, the tracked person is considered lost. In a subsequent image frame, when the occluding obstacle moves away and the face reappears, the face is detected again. Ideally, this re-detected face should be recognized as an already-existing face rather than a new face. However, simple face-tracking techniques, such as the Kalman filter and the Hungarian algorithm, cannot handle such occlusion cases effectively, which often results in many near-duplicate faces. If these near-duplicate faces are not identified by face-detection-and-tracking subsystem 700, their best-pose faces are uploaded to the server, wasting network bandwidth and storage space.
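The frame-to-frame association step that such trackers perform can be sketched as follows. For brevity this uses a greedy IoU match standing in for the full Hungarian assignment the text mentions, and all names are illustrative, not the patent's code:

```python
# Simplified sketch of track-to-detection association by bounding-box overlap.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def match_tracks(tracks, detections, min_iou=0.3):
    """Greedily pair each existing track with at most one new detection."""
    pairs, used = [], set()
    for ti, t in enumerate(tracks):
        best, best_j = min_iou, None
        for dj, d in enumerate(detections):
            if dj not in used and iou(t, d) >= best:
                best, best_j = iou(t, d), dj
        if best_j is not None:
            pairs.append((ti, best_j))
            used.add(best_j)
    return pairs
```

The failure mode the patent targets is visible here: after an occlusion, the reappearing face overlaps no surviving track, so it starts a new track and ultimately yields a near-duplicate best-pose face.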
In some embodiments, to solve the above problem, a near-duplicate face-detection procedure is executed when a tracked face has been lost but its best-pose face has not yet been sent to the server. In more detail, before each best-pose face image is transmitted to the server, a feature-extraction operation is performed on the best-pose face image to extract predetermined image features from it. For example, the predetermined image features can be HoG features, Haar-like features, SIFT features, or DL-based face features. In some embodiments, more than one kind of image feature can be extracted from the best-pose face image; for example, both HoG features and Haar-like features can be extracted.
Next, the newly extracted image features are compared one by one against the stored features extracted from previously transmitted best-pose face images. These features can be stored in a local buffer of the embedded vision system's face-tracking operation. Alternatively, the stored features can be kept remotely, for example on a cloud server, separate from the embedded vision system's face-tracking operation. In some embodiments, for one-dimensional (1D) or two-dimensional (2D) features (such as HoG features and Haar-like features), a cosine similarity or Euclidean similarity can be computed between the features newly extracted from the best-pose face image and the features stored from previously transmitted best-pose face images. When DL-based face features are used, comparing the extracted features with the stored features can detect a duplicate face of the same stored person even when the stored face and the duplicate face have different poses. In some embodiments, if any similarity value computed between the newly extracted features and the stored features exceeds a preset threshold (e.g., 0.8), the best-pose face is identified as a near-duplicate face and is not sent to the server. If, however, all similarity values computed between the newly extracted features and the stored features are below the preset threshold, the best-pose face is treated as a unique face; its image is then transmitted to the server, and the associated extracted features are stored in the feature buffer or memory. With this technique, the number of near-duplicate faces and the number of transmitted best-pose faces can be substantially reduced, saving network bandwidth and storage resources.
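The similarity comparison and thresholding described above can be sketched as follows. This is a minimal illustration for 1D feature vectors; the 0.8 threshold follows the text, the feature extractor itself is abstracted away, and the function names are hypothetical.

```python
import math

# Sketch of the near-duplicate check: compare a newly extracted feature
# vector against the buffer of features from previously transmitted faces.
def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def is_near_duplicate(new_feat, stored_feats, threshold=0.8):
    """True if any stored feature is similar enough to the new one;
    such a face is suppressed rather than sent to the server."""
    return any(cosine_similarity(new_feat, f) > threshold for f in stored_feats)
```

When `is_near_duplicate` returns False, the caller would transmit the image and append `new_feat` to the buffer, matching steps 910 and 912 of Fig. 9.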
Fig. 9 shows a flowchart of an exemplary process for identifying near-duplicate face images and selectively transmitting best-pose face images to the server, according to some embodiments of the application. At the end of a face-tracking operation, the process first receives the determined best-pose face image, i.e., the face image determined at the time the tracked face disappears from the video (step 902). In some embodiments, the tracked face is determined to have disappeared when it has been occluded by an object for a predetermined number of consecutive image frames. Next, image features are extracted from the best-pose face image (step 904). As described above, the extracted features can be HoG features, Haar-like features, SIFT features, DL-based face features, or a combination thereof. Next, the process computes a set of similarity scores between the newly extracted features and each of a set of stored features that were extracted from previously transmitted best-pose face images and held in the feature buffer (step 906). Next, the process determines whether any score in the set of similarity scores exceeds the preset threshold (step 908). If all similarity scores are determined not to exceed the preset threshold, the best-pose face image is transmitted to the server, and the associated extracted features are stored in the feature buffer (step 910). Otherwise, the best-pose face image is determined to be a near-duplicate face image, and its transmission to the server is prevented (step 912).
Figure 10 shows an exemplary client-server network environment for implementing the embedded vision system disclosed in this application, according to some embodiments of the application. Network environment 1000 includes several embedded vision systems 1002, 1004, and 1006 communicatively connected to a server 1010 through a network 1008. One or more remote servers 1020 are connected to server 1010 and/or to one or more of embedded vision systems 1002, 1004, and 1006.
In some exemplary embodiments, embedded vision systems 1002, 1004, and 1006 can include surveillance camera systems, machine vision systems, drones, robots, self-driving cars, smartphones, PDAs, portable media players, notebook computers, or other embedded systems integrated with one or more digital cameras. In one example, each embedded vision system 1002, 1004, and 1006 includes one or more cameras, a CPU, a DSP, and one or more small-scale CNN modules.
Server 1010 includes a processing unit 1012 and a face database 1014. Processing unit 1012 executes programs that perform face analysis on the face images received from embedded vision systems 1002, 1004, and 1006, based on the face information stored in face database 1014. Processing unit 1012 also stores the processed face images in face database 1014.
In some instances, server 1010 can be a single computing device, such as a computer server. In other embodiments, server 1010 can represent more than one computing device working together to perform the actions of a server computer (e.g., cloud computing). Server 1010 can run a web server that communicates through network 1008 with browsers on client devices (such as embedded vision systems 1002, 1004, and 1006). In one example, server 1010 can run client applications to schedule, during service dispatch, client-initiated services and service-provider-initiated services between the provider of a client device and a client. Server 1010 can also establish communication with one or more remote servers 1020 through network 1008 or through other networks or communication means.
The one or more remote servers 1020 can, individually or jointly with server 1010, perform the various functions and/or storage capabilities described above for server 1010. Each of the one or more remote servers 1020 can host multiple services. For example, services hosted by server 1020 can include: providing information related to one or more suggested locations, such as web interfaces or websites associated with the suggested locations; determining the location of one or more users or businesses; a search engine for identifying user queries; one or more user-review or query services; or one or more other services for providing inquiries about, or feedback on, one or more businesses and consumers.
Server 1010 can also maintain, or establish communication with, a social networking service hosted on one or more remote servers. The one or more social networking services can provide various services enabling users to create profiles and associate themselves with other users of the remote social networking service. Server 1010 and/or the one or more remote servers 1020 can also facilitate the generation and maintenance of a social graph that includes the user-created associations. The social graph can include, for example, a list of all users of the remote social networking service and each user's associations with other users of the remote social networking service.
Each of the one or more remote servers 1020 can be a single computing device, such as a computer server, or can represent more than one computing device working together to perform the actions of a server computer (e.g., cloud computing). In one embodiment, server 1010 and the one or more remote servers 1020 can be implemented by a single server or by multiple servers working together. In one example, server 1010 and the one or more remote servers 1020 can establish communication through network 1008 with user agents on the client devices (e.g., embedded vision systems 1002, 1004, and 1006).
Using client applications installed on embedded vision systems 1002, 1004, and 1006, users of those systems can interact with the system hosted on server 1010 and/or with one or more services hosted on remote servers 1020. Alternatively, users can interact with the system and the one or more social networking services through web-based browser applications on embedded vision systems 1002, 1004, and 1006. Communication between embedded vision systems 1002, 1004, and 1006 and the system and/or the one or more services can be facilitated through a network (e.g., network 1008).
Communication between embedded vision systems 1002, 1004, and 1006 and server 1010 and/or one or more remote servers 1020 can be facilitated through various communication protocols. In some aspects, such communication can include wireless communication established through a communication interface (not shown), which can include digital signal processing circuitry where necessary. The communication interface can provide communication under various modes or protocols, including Global System for Mobile Communications (GSM) voice calls, Short Message Service (SMS), Enhanced Messaging Service (EMS) or Multimedia Messaging Service (MMS) messaging, Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Personal Digital Cellular (PDC), Wideband CDMA (WCDMA), CDMA2000, or General Packet Radio Service (GPRS), among others. For example, the communication can be realized through a radio-frequency transceiver (not shown). Short-range communication can also be realized using Bluetooth, WiFi, or other such transceivers.
Network 1008 can include, for example, a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), the Internet, and the like. In addition, network 1008 can include, but is not limited to, one or more of the following network topologies: a bus network, a star network, a ring network, a mesh network, a star-bus network, a tree or hierarchical network, and the like.
The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the overall system. Skilled artisans can implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The hardware used to implement the various illustrative logics, logical blocks, modules, and circuits described in connection with the aspects disclosed herein can be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor can be a microprocessor, but in the alternative, the processor can be any conventional processor, controller, microcontroller, or state machine. A processor can also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Alternatively, some steps or methods can be performed by circuitry that is specific to a given function.
In one or more exemplary aspects, the functions described can be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions can be stored as one or more instructions or code on a non-transitory computer-readable storage medium or a non-transitory processor-readable storage medium. The steps of the methods or algorithms disclosed herein can be embodied in processor-executable instructions that can reside on a non-transitory computer-readable or processor-readable storage medium. A non-transitory computer-readable or processor-readable storage medium can be any storage medium that can be accessed by a computer or a processor. By way of example but not limitation, such non-transitory computer-readable or processor-readable storage media can include RAM, ROM, EEPROM, flash memory, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of non-transitory computer-readable and processor-readable media. Additionally, the operations of a method or algorithm can reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable storage medium and/or a computer-readable storage medium, which can be incorporated into a computer program product.
Although this patent document contains many specifics, these should not be construed as limitations on the scope of any disclosed technology or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular techniques. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.
This patent document describes only a few implementations and examples; other implementations, enhancements, and variations can be made based on what is described and illustrated in this patent document.
Priority Applications (6)
|Application Number||Priority Date||Filing Date||Title|
|US15/789,957 US10467458B2 (en)||2017-07-21||2017-10-20||Joint face-detection and head-pose-angle-estimation using small-scale convolutional neural network (CNN) modules for embedded systems|
|US15/796,798 US10510157B2 (en)||2017-10-28||2017-10-28||Method and apparatus for real-time face-tracking and face-pose-selection on embedded vision systems|
|US15/943,728 US10691925B2 (en)||2017-10-28||2018-04-03||Enhanced face-detection and face-tracking for resource-limited embedded vision systems|
|Publication Number||Publication Date|
|CN109063581A true CN109063581A (en)||2018-12-21|
Family Applications (1)
|Application Number||Title||Priority Date||Filing Date|
|CN201810747849.3A CN109063581A (en)||2017-07-21||2018-07-06||Enhanced Face datection and face tracking method and system for limited resources embedded vision system|
Country Status (1)
|CN (1)||CN109063581A (en)|
Cited By (1)
|Publication number||Priority date||Publication date||Assignee||Title|
|CN110505446A (en) *||2019-07-29||2019-11-26||西安电子科技大学||The hotel's video security protection system calculated based on mist|
- 2018-07-06 CN CN201810747849.3A patent/CN109063581A/en active Search and Examination
|Garcia-Garcia et al.||A survey on deep learning techniques for image and video semantic segmentation|
|Sixt et al.||Rendergan: Generating realistic labeled data|
|Kang et al.||Object detection in videos with tubelet proposal networks|
|Mühlfellner et al.||Summary maps for lifelong visual localization|
|Poleg et al.||Temporal segmentation of egocentric videos|
|Wang et al.||Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks|
|Zhong et al.||Fully convolutional networks for building and road extraction: Preliminary results|
|Choi et al.||A general framework for tracking multiple people from a moving camera|
|Tripathi et al.||Convolutional neural networks for crowd behaviour analysis: a survey|
|Wang et al.||CDnet 2014: An expanded change detection benchmark dataset|
|Dubé et al.||SegMap: 3d segment mapping using data-driven descriptors|
|US8744125B2 (en)||Clustering-based object classification|
|JP2019509566A (en)||Recurrent network using motion-based attention for video understanding|
|Chi et al.||Automated object identification using optical video cameras on construction sites|
|US20150154457A1 (en)||Object retrieval in video data using complementary detectors|
|US20200250436A1 (en)||Video object segmentation by reference-guided mask propagation|
|CN106897670B (en)||Express violence sorting identification method based on computer vision|
|CN105051754B (en)||Method and apparatus for detecting people by monitoring system|
|Leng et al.||A survey of open-world person re-identification|
|Varadarajan et al.||A sequential topic model for mining recurrent activities from long term video logs|
|Ko||A survey on behavior analysis in video surveillance for homeland security applications|
|GB2538847A (en)||Joint Depth estimation and semantic segmentation from a single image|
|US8866845B2 (en)||Robust object recognition by dynamic modeling in augmented reality|
|EP2864930B1 (en)||Self learning face recognition using depth based tracking for database generation and update|
|US10380431B2 (en)||Systems and methods for processing video streams|
|SE01||Entry into force of request for substantive examination|