WO2012128382A1 - Device and method for lip motion detection - Google Patents

Device and method for lip motion detection

Info

Publication number
WO2012128382A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
lip motion
motion detection
visual feature
mouth region
Prior art date
Application number
PCT/JP2012/057677
Other languages
French (fr)
Inventor
Wang Yan
Original Assignee
Sharp Kabushiki Kaisha
Priority date
Filing date
Publication date
Application filed by Sharp Kabushiki Kaisha filed Critical Sharp Kabushiki Kaisha
Publication of WO2012128382A1 publication Critical patent/WO2012128382A1/en


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/24: Speech recognition using non-acoustical features
    • G10L 15/25: Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/467: Encoded features or binary features, e.g. local binary patterns [LBP]

Definitions

  • the invention relates to video processing, and more particularly, to a device and method for video-based lip motion detection.
  • US7343289B2 discloses a system and method for audio/video speaker detection.
  • the method is directed to detecting the speaker, i.e., the subject of lip motion, based on both visual and audio information.
  • the disclosed method includes the following steps: finding a face in a video frame; finding and extracting the mouth region; extracting the mouth openness with Linear Discriminant Analysis (LDA) as a visual feature; extracting the energy of the audio signal corresponding to the video frame as an audio feature; and inputting both features into a trained Time-Delayed Neural Network (TDNN) and detecting lip motion according to the output of the TDNN.
  • the above method extracts the visual feature from each frame separately.
  • the feature includes rich information on the subject's identity and is thus somewhat individual-dependent.
  • for a subject not included in the training set, the detection rate will therefore be degraded significantly.
  • a device for video-based lip motion detection which comprises: a face search unit adapted for searching a face in an input video frame; a mouth region extraction unit adapted for extracting a mouth region from the searched face; a visual feature extraction unit adapted for extracting a visual feature of the mouth region; and a detection unit adapted for detecting lip motion based on the extracted visual feature of the mouth region.
  • the visual feature extraction unit is adapted for extracting at least one of a gradient on spatial-temporal plane and a Local Binary Pattern (LBP) code on spatial-temporal plane for each pixel in a spatial-temporal window and extracting a visual feature of the mouth region based on the extraction result.
  • the detection unit is pre-trained with the extracted visual feature of the mouth region.
  • the device for video-based lip motion detection further comprises a smoothing unit adapted for smoothing the detection result of the detection unit.
  • the device for video-based lip motion detection further comprises: an audio feature extraction unit adapted for extracting an audio feature corresponding to the input video frame.
  • the detection unit is adapted for detecting lip motion based on the visual feature extracted by the visual feature extraction unit and the audio feature extracted by the audio feature extraction unit.
  • the detection unit is pre-trained with the extracted visual feature and audio feature.
  • the visual feature comprises a Local Binary Pattern on Three Orthogonal Planes (LBP-TOP)-based visual feature.
  • the mouth region is a rectangle, the center of the rectangle being located at the middle of the line connecting two mouth corners and the longer edges of the rectangle being parallel with the line connecting two mouth corners.
  • the detection unit comprises a Support Vector Machine (SVM).
  • the smoothing unit comprises a median filter.
  • the face search unit comprises a Viola-Jones face detector.
  • the mouth region extraction unit is adapted for extracting the mouth region from the searched face by using an Active Shape Model (ASM).
  • the visual feature extraction unit is further adapted for extracting at least one of a gradient on image plane and a Local Binary Pattern (LBP) code on image plane for each pixel in a spatial-temporal window and extracting a visual feature of the mouth region based on the overall extraction result.
  • a method for video-based lip motion detection comprises the following steps: searching a face in an input video frame; extracting a mouth region from the searched face; extracting a visual feature of the mouth region; and detecting lip motion based on the extracted visual feature of the mouth region.
  • the method for video-based lip motion detection further comprises: extracting at least one of a gradient on spatial-temporal plane and a Local Binary Pattern (LBP) code on spatial-temporal plane for each pixel in a spatial-temporal window, prior to extracting a visual feature of the mouth region.
  • the visual feature of the mouth region is extracted based on the extraction result.
  • the method for video-based lip motion detection further comprises: pre-training with the extracted visual feature of the mouth region, prior to detecting lip motion.
  • the method for video-based lip motion detection further comprises: smoothing the detection result.
  • the method for video-based lip motion detection further comprises: extracting an audio feature corresponding to the input video frame.
  • the lip motion is detected based on the extracted visual feature and audio feature.
  • the method further comprises: pre-training with the extracted visual feature and audio feature, prior to detecting lip motion.
  • the visual feature comprises a Local Binary Pattern on Three Orthogonal Planes (LBP-TOP)-based visual feature.
  • the mouth region is a rectangle, the center of the rectangle being located at the middle of the line connecting two mouth corners and the longer edges of the rectangle being parallel with the line connecting two mouth corners.
  • the lip motion is detected by using a Support Vector Machine (SVM).
  • the detection result is smoothed by using a median filter.
  • the face is searched from the input video frame by using a Viola-Jones face detector.
  • the mouth region is extracted from the searched face by using an Active Shape Model (ASM).
  • the method for video-based lip motion detection further comprises: extracting at least one of a gradient on image plane and a Local Binary Pattern (LBP) code on image plane for each pixel in a spatial-temporal window and extracting a visual feature of the mouth region based on the overall extraction result.
  • a speech recognition system which comprises: a microphone adapted for capturing an audio signal; a camera adapted for capturing a video signal; a device for video-based lip motion detection, adapted for detecting lip motion based on the video signal captured by the camera to obtain start time and end time of the lip motion; a speech segment detector adapted for extracting a speech segment based on the audio signal captured by the microphone and the start time and the end time of the lip motion; a feature extractor adapted for extracting an audio feature from the extracted speech segment; and a speech recognizer adapted for recognizing speech based on the extracted audio feature.
  • a video conference system which comprises: a microphone adapted for capturing an audio signal; a camera adapted for capturing a video signal; a device for video-based lip motion detection, adapted for detecting lip motion based on the video signal captured by the camera to obtain start time and end time of the lip motion; and a transmitter.
  • the device for video-based lip motion detection is adapted for controlling the transmitter to transmit the audio signal captured by the microphone and the video signal captured by the camera at the start time of the lip motion and for controlling the transmitter to transmit only the video signal captured by the camera at the end time of the lip motion.
  • the video conference system further comprises: a video frame cropper adapted for cropping video from the video signal captured by the camera.
  • the device for video-based lip motion detection is adapted for enabling the video frame cropper and controlling the transmitter to transmit the audio signal captured by the microphone and the video cropped by the video frame cropper at the start time of the lip motion and for disabling the video frame cropper and controlling the transmitter to transmit only the video signal captured by the camera at the end time of the lip motion.
  • the video frame cropper is adapted for cropping a close-up of a speaker who is currently speaking by means of zooming.
  • with the present invention, it is possible to implement subject-independent lip motion detection for a training set containing a limited number of subjects. Compared with the prior art, the present invention enables a higher detection rate for a subject not contained in the training set. According to the present invention, retraining or adaptation for different users to improve the detection rate is no longer necessary, and thus the usability is improved.
  • Fig. 1 shows a block diagram of a device for video-based lip motion detection according to an embodiment of the present invention.
  • Fig. 2 shows an example for calculating the LBP code according to an embodiment of the present invention.
  • Fig. 3 shows an example for extracting an LBP-TOP-based feature according to an embodiment of the present invention.
  • Fig. 4 shows a block diagram of a device for video-based lip motion detection according to another embodiment of the present invention.
  • Fig. 5 shows a block diagram of a device for video-based lip motion detection according to another embodiment of the present invention.
  • Fig. 6 illustrates a flowchart of a method for video-based lip motion detection according to an embodiment of the present invention.
  • Fig. 7 shows a block diagram of a video-aided speech recognition system having a device for lip motion detection according to an embodiment of the present invention.
  • Fig. 9 shows a block diagram of a video conference system having a device for lip motion detection according to an embodiment of the present invention.
  • Fig. 1 shows a block diagram of a device 10 for video-based lip motion detection according to an embodiment of the present invention.
  • the lip motion detection device 10 includes: a face search unit 110 adapted for searching a face in an input video frame; a mouth region extraction unit 120 adapted for extracting a mouth region from the searched face; a visual feature extraction unit 130 adapted for extracting a visual feature of the mouth region; and a detection unit 140 adapted for detecting lip motion based on the extracted visual feature of the mouth region.
  • the face search unit 110 searches a face in each input video frame. If any face is found, its position will be transferred to the mouth region extraction unit 120 as input information. On the other hand, no further processing will be performed on a video frame in which no face is found.
  • the face search unit 110 can be implemented using various known techniques for face detection and tracking, such as but not limited to: the Viola-Jones face detector, the Rowley face detector, the meanshift tracker and the particle filtering tracker.
  • the mouth region extraction unit 120 searches a mouth region in the face found by the face search unit 110 and extracts it from the face.
  • two mouth corners are searched first.
  • the mouth region is determined based on the two found mouth corners.
  • the two mouth corners can be located by the well-known Active Shape Model (ASM).
  • alternatively, the Active Appearance Model (AAM) or Snakes (also known as the Active Contour Model) can be used to locate the mouth corners.
  • a rectangular region can be determined, with the center of the rectangular region located at the middle of the line connecting two mouth corners and the longer edges of the rectangular region being parallel with the line connecting two mouth corners.
  • the rectangular region is determined as the mouth region.
  • the aspect ratio of the rectangular region is preferably 3:2. However, other aspect ratios are also applicable.
  • the mouth region can be of any other shape, such as an ellipse, which contains the entire outer contour of the lip.
  • these shapes are not necessarily symmetric and their centers are not necessarily coincident with the middle of the line connecting the mouth corners.
  • any shape having a large intersection with the outer contour of the lip can also be used as the mouth region.
  • the visual feature extraction unit 130 extracts a visual feature from a spatial-temporal window which contains one or more consecutive mouth regions.
  • the visual feature is described as a Local Binary Pattern on Three Orthogonal Planes (LBP-TOP) feature, which is a spatial-temporal extension of the well-known Local Binary Pattern (LBP) feature.
  • g_c is the value of the central pixel (x_c, y_c) and g_p is the value of the p-th neighbor.
  • the values of the central pixel and its neighbors uniformly distributed on a surrounding unit circle are shown in (a) of Fig. 2.
  • a neighbor can be represented as 1 if its value is not smaller than the value of the central pixel; otherwise it can be represented as 0.
  • (b) of Fig. 2 shows the comparison result.
  • starting from the horizontal neighbor on the left, the comparison results can be arranged, in a counter-clockwise order, into a binary value which is the LBP code of the central pixel.
  • any other order or starting pixel is also applicable.
  • besides the basic form, there are many known variations of the LBP code, e.g., uniform LBP, rotation-invariant LBP and combinations thereof.
  • other LBP code variations and other values of P and R are also applicable.
  • the LBP-TOP feature is extracted from a spatial-temporal window which contains one or more consecutive mouth regions.
  • Fig. 3 shows an example of the extraction process.
  • a window (as shown in (a) of Fig. 3) is first divided into one or more spatial-temporal blocks.
  • the LBP codes for each pixel in each block (as shown in (b) of Fig. 3) are extracted with respect to its neighbors on the XY, XT and/or YT planes, respectively.
  • an LBP code histogram for each plane is extracted and the histograms from one or more specific planes are then concatenated into the LBP-TOP feature of the block.
  • the histogram from the XY plane contains more information on the subject's identity, while the histograms from the XT and YT planes contain more information on motion, which is less individual-dependent. Finally, the LBP-TOP features from all blocks are concatenated into the LBP-TOP feature of the spatial-temporal window.
  • the spatial-temporal window comprises five consecutive mouth regions and is divided uniformly into 6 x 4 x 1 (corresponding to the X, Y and T axes, respectively) blocks with 50% overlap.
  • Only LBP histograms from the XT and YT planes are used.
  • the present invention works well on subjects not contained in the training set.
  • other numbers of mouth regions, other types of window division and other combinations of LBP histograms from different planes are also applicable. It can be appreciated by those skilled in the art that the feature based on the LBP code can be extracted by post-processing the histograms or otherwise.
  • the histograms can be normalized.
  • the normalization can be carried out separately for each plane of each block.
  • the histograms of the same plane for spatially and temporally adjacent blocks, the histograms of different planes for the same block, or the histograms of different planes for spatially and temporally adjacent blocks can be normalized collectively.
  • the criterion of normalization may cause the sum, or squared sum, of the normalized vector elements to be 1. After a first normalization, a value in a histogram exceeding a certain threshold can be changed to that threshold and a further normalization can be performed.
  • the normalized histograms can be concatenated into the feature of the spatial-temporal window.
  • the LBP code(s) on one or more particular planes for all pixels of each block can be considered as a feature vector which is dimension-reduced by means of sub-space analysis, such as Principal Component Analysis or Linear Discriminant Analysis, and then used as the feature of the block or plane.
  • the LBP-TOP features of all blocks and/or all planes are concatenated into the feature of the spatial-temporal window.
  • the gradient(s) on one or more planes for all pixels of each block can be considered as a feature vector which is dimension-reduced by means of sub-space analysis, such as Principal Component Analysis or Linear Discriminant Analysis, and then used as the feature of the block or plane.
  • the detection unit 140 detects lip motion based on the extracted visual feature of the mouth region.
  • the detection unit 140 can be a classifier capable of distinguishing two classes of mouth regions, i.e., mouth regions with lip motion and mouth regions without lip motion.
  • there are a number of known classifiers.
  • a Support Vector Machine (SVM) is used.
  • other classifiers such as the k-Nearest Neighbor classifier, Adaboost classifier, Neural Network classifier, Gaussian Process classifier and threshold classifier based on feature similarity are also applicable.
  • the detection unit 140 can be pre-trained with the extracted visual feature of the mouth region.
  • such training can be achieved by assigning a label to each extracted visual feature. For example, if there is lip motion in the mouth region corresponding to a visual feature, a label of +1 is assigned to the feature; otherwise a label of -1 is assigned. Then, the detection unit 140 can be trained by using a number of known training approaches.
  • Fig. 4 shows a block diagram of a device 40 for video-based lip motion detection according to another embodiment of the present invention.
  • the lip motion detection device 40 includes: a face search unit 410, a mouth region extraction unit 420, a visual feature extraction unit 430, a detection unit 440 and an audio feature extraction unit 450. Since the face search unit 410, mouth region extraction unit 420, visual feature extraction unit 430 and detection unit 440 as shown in Fig. 4 are similar to the face search unit 110, mouth region extraction unit 120, visual feature extraction unit 130 and detection unit 140 as shown in Fig. 1, only the audio feature extraction unit 450 will be detailed in the following for simplicity.
  • the audio feature extraction unit 450 extracts an audio feature corresponding to the input video frame.
  • the extracted audio feature is provided to the detection unit 440 along with the visual feature extracted by the visual feature extraction unit 430.
  • the audio feature extraction unit 450 can use any known audio-based speech endpoint detection method to detect the speech part and the non-speech part. If a frame falls in the speech part, there is a high probability of presence of lip motion in the mouth region in the frame. The detection accuracy of the detection unit 440 can be improved by detecting the lip motion in connection with the audio feature.
  • the detection unit 440 can be pre-trained with the visual feature of the mouth region as extracted by the visual feature extraction unit 430 and the audio feature extracted by the audio feature extraction unit 450.
  • the detection unit 440 can be trained by using a number of known training approaches.
  • Fig. 5 shows a block diagram of a device 50 for video-based lip motion detection according to another embodiment of the present invention.
  • the lip motion detection device 50 includes: a face search unit 510, a mouth region extraction unit 520, a visual feature extraction unit 530, a detection unit 540 and a smoothing unit 550. Since the face search unit 510, mouth region extraction unit 520, visual feature extraction unit 530 and detection unit 540 as shown in Fig. 5 are similar to the face search unit 110, mouth region extraction unit 120, visual feature extraction unit 130 and detection unit 140 as shown in Fig. 1, only the smoothing unit 550 will be detailed in the following for simplicity.
  • the smoothing unit 550 temporally smoothes the detection result of the detection unit 540. Such smoothing can improve the accuracy of the detection result by utilizing the a priori knowledge that the status of the lip motion does not change repeatedly within a short period of time.
  • the smoothing unit 550 can be implemented using a median filter, preferably a five-point median filter.
  • a median filter having another window size or any other filter can also be used.
  • both the audio feature extraction unit shown in Fig. 4 and the smoothing unit shown in Fig. 5 can be incorporated into the video-based lip motion detection device as shown in Fig. 1.
  • Fig. 6 illustrates a flowchart of a method 60 for video-based lip motion detection according to an embodiment of the present invention.
  • the method 60 starts at step S610.
  • a face is searched in an input video frame by the face search unit 110, 410 or 510. If any face is found, its position will be transferred to the next step as input information. On the other hand, no further processing will be performed on a video frame in which no face is found.
  • the face can be searched by using approaches such as Viola-Jones face detection, Rowley face detection, meanshift tracking and particle filtering tracking.
  • a mouth region is extracted from the searched face by the mouth region extraction unit 120, 420 or 520.
  • two mouth corners are searched first.
  • the mouth region is determined based on the two found mouth corners.
  • the two mouth corners can be located by the well-known Active Shape Model (ASM).
  • alternatively, the Active Appearance Model (AAM) or Snakes (also known as the Active Contour Model) can be used to locate the mouth corners.
  • a rectangular region can be determined, with the center of the rectangular region located at the middle of the line connecting two mouth corners and the longer edges of the rectangular region being parallel with the line connecting two mouth corners.
  • the rectangular region is determined as the mouth region.
  • the aspect ratio of the rectangular region is preferably 3:2. However, other aspect ratios are also applicable.
  • the mouth region can be of any other shape, such as an ellipse, which contains the entire outer contour of the lip.
  • these shapes are not necessarily symmetric and their centers are not necessarily coincident with the middle of the line connecting the mouth corners.
  • any shape having a large intersection with the outer contour of the lip can also be used as the mouth region.
  • a visual feature of the mouth region is extracted from a spatial-temporal window which contains one or more consecutive mouth regions by the visual feature extraction unit 130, 430 or 530.
  • the visual feature is described as an LBP-TOP feature.
  • the visual feature can also be described using other features, such as but not limited to the above-mentioned spatial-temporal extension of the gradient type feature.
  • the step S640 can be performed by the visual feature extraction unit 130 as shown in Fig. 1.
  • the lip motion is detected based on the extracted visual feature of the mouth region by the detection unit 140 or 540.
  • a classifier can be used to distinguish two classes of mouth regions, i.e., mouth regions with lip motion and mouth regions without lip motion, so as to obtain the detection result.
  • such classifiers include, but are not limited to, the Support Vector Machine (SVM), k-Nearest Neighbor classifier, Adaboost classifier, Neural Network classifier, Gaussian Process classifier and threshold classifier based on feature similarity.
  • prior to step S650 for detecting the lip motion, it is possible to pre-train with the visual feature of the mouth region as extracted in step S640.
  • such training can be achieved by assigning a label to each extracted visual feature. For example, if there is lip motion in the mouth region corresponding to a visual feature, a label of +1 is assigned to the feature; otherwise a label of -1 is assigned. Then, the pre-training can be carried out using a number of known training approaches.
  • an audio feature corresponding to the input video frame can be extracted by the audio feature extraction unit 450.
  • the extracted audio feature is provided to step S650 along with the extracted visual feature for the detection unit 440 to detect the lip motion.
  • any known audio-based speech endpoint detection method can be used to detect the speech part and the non-speech part. If a frame falls in the speech part, there is a high probability of presence of lip motion in the mouth region in the frame.
  • the detection accuracy can be improved by detecting the lip motion in connection with the audio feature. Accordingly, before detecting the lip motion, it is possible to pre-train with the visual feature and the audio feature.
  • the method can smooth the detection result obtained in step S650 by the smoothing unit 550.
  • smoothing can improve the accuracy of the detection result by utilizing the a priori knowledge that the status of the lip motion does not change repeatedly within a short period of time.
  • the smoothing can be implemented using a median filter, preferably a five-point median filter.
  • a median filter having another window size or any other filter can also be used.
  • Fig. 7 shows a block diagram of a video-aided speech recognition system 70 having a device for lip motion detection according to an embodiment of the present invention.
  • a speech segment can be detected based on video, such that the accuracy of speech detection in a noisy environment can be improved.
  • the system of Fig. 7 includes a microphone 710, a camera 720, a lip motion detection device 730, a speech segment detector 740, a feature extractor 750 and a speech recognizer 760.
  • the microphone 710 and the camera 720 capture audio and video signals in real time, respectively.
  • the speaker faces the camera 720 and the microphone 710 when speaking.
  • the captured video is sent to the lip motion detection device 730 and the captured audio is sent to the speech segment detector 740.
  • the lip motion detection device 730 can be implemented by the lip motion detection device 10 shown in Fig. 1, the lip motion detection device 40 shown in Fig. 4 or the lip motion detection device 50 shown in Fig. 5.
  • when the lip motion detection device 730 detects lip motion, it sends the start time and end time of the lip motion to the speech segment detector 740, which then extracts a speech segment based on the received start time and end time (a minimal sketch follows below). (a) of Fig. 8 shows the audio signal received by the microphone 710 and (b) of Fig. 8 shows the lip motion detection result. It is clear that the speech endpoints and the lip motion signal endpoints match each other quite well. (c) of Fig. 8 shows the speech segment extracted based on the lip motion signal.
  • the extracted speech segment is then sent to the feature extractor 750, in which an audio feature is extracted and sent to the speech recognizer 760 for recognizing the speech and outputting a recognition result.
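As an illustration of the segment extraction step referenced above, the following Python sketch cuts the captured audio between the detected lip-motion start and end times. The function name, the array-based audio representation and the padding margin are illustrative assumptions, not details from the patent.

```python
import numpy as np

def extract_speech_segment(audio, sample_rate, start_time, end_time, pad=0.2):
    """Cut the audio samples between the lip-motion start/end times.
    `pad` adds a small margin (in seconds) on both sides; the margin is
    an assumption, not something specified by the patent."""
    i0 = max(0, int((start_time - pad) * sample_rate))
    i1 = min(len(audio), int((end_time + pad) * sample_rate))
    return audio[i0:i1]
```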
  • Fig. 9 shows a block diagram of a video conference system 90 having a device for lip motion detection according to an embodiment of the present invention.
  • the system is capable of automatically turning the microphone on and off and providing a close-up of the current speaker in a case where there are multiple subjects.
  • the video conference system 90 includes a microphone 910, a camera 920, a lip motion detection device 930, a video frame cropper 940 and a transmitter 950.
  • the microphone 910 captures the audio signal in real time and the camera 920 captures video in real time.
  • the speaker faces the camera when speaking to the other party of the conference.
  • the captured video is sent to both the lip motion detection device 930 and the video frame cropper 940.
  • the lip motion detection device 930 can be implemented by the lip motion detection device 10 shown in Fig. 1, the lip motion detection device 40 shown in Fig. 4 or the lip motion detection device 50 shown in Fig. 5.
  • the video frame cropper 940 crops video from the video captured by the camera 920.
  • at the start of lip motion, the lip motion detection device 930 turns on the video frame cropper 940 and sends the position of the speaker to the video frame cropper 940. If there are multiple subjects (speakers), the video frame cropper crops the video frame and resizes it (by means of zooming) to provide a close-up of the current speaker.
  • the cropped video and the audio from the microphone 910 are sent to the transmitter 950 for transmission.
  • at the end of lip motion, the lip motion detection device 930 turns off the video frame cropper 940 to stop cropping video.
  • the transmitter 950 then transmits only the video captured by the camera 920.
  • at the start time of the lip motion, the lip motion detection device 930 controls the transmitter 950 to transmit the audio captured by the microphone 910 and the video captured by the camera 920.
  • at the end time of the lip motion, the lip motion detection device 930 controls the transmitter 950 to transmit only the video captured by the camera 920 (a minimal sketch of this gating logic is given after this list).
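The transmitter control described in the preceding items can be summarized in a few lines. This is a hypothetical sketch: the cropper and transmitter interfaces and the event representation are invented for illustration and do not come from the patent.

```python
def on_lip_motion_event(event, frame, audio_chunk, speaker_box, cropper, transmitter):
    """Hypothetical gating logic for the video conference system."""
    if event == "start":                     # lip motion begins
        cropper.enable(speaker_box)          # close-up of the current speaker
        transmitter.send(video=cropper.crop(frame), audio=audio_chunk)
    elif event == "end":                     # lip motion ends
        cropper.disable()                    # stop cropping
        transmitter.send(video=frame)        # video only, no audio
```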

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A device for video-based lip motion detection is provided, which comprises: a face search unit adapted for searching a face in an input video frame; a mouth region extraction unit adapted for extracting a mouth region from the searched face; a visual feature extraction unit adapted for extracting at least one of a gradient on spatial-temporal plane and a Local Binary Pattern (LBP) code on spatial-temporal plane for each pixel in a spatial-temporal window and extracting a visual feature of the mouth region based on the extraction result; and a detection unit adapted for detecting lip motion based on the extracted visual feature of the mouth region. A method for video-based lip motion detection is also provided. According to the present invention, the detection accuracy can be improved without re-training or adaptation for a particular user.

Description

DESCRIPTION
TITLE OF INVENTION :
DEVICE AND METHOD FOR LIP MOTION DETECTION
TECHNICAL FIELD
The invention relates to video processing, and more particularly, to a device and method for video-based lip motion detection.
BACKGROUND ART
In a noisy environment, it is difficult to accurately detect a speech segment based on the audio signal only. The speech segment may sometimes be improperly cropped and/or appended with noise. As a consequence, the speech recognition accuracy may be degraded.
It is known that lip motion is a good indicator of speech .
US7343289B2 discloses a system and method for audio/video speaker detection. The method is directed to detecting the speaker, i.e., the subject of lip motion, based on both visual and audio information. In particular, the disclosed method includes the following steps: finding a face in a video frame; finding and extracting the mouth region; extracting the mouth openness with Linear Discriminant Analysis (LDA) as a visual feature; extracting the energy of the audio signal corresponding to the video frame as an audio feature; and inputting both features into a trained Time-Delayed Neural Network (TDNN) and detecting lip motion according to the output of the TDNN.
However, the above method extracts the visual feature from each frame separately. The feature includes rich information on the subject's identity and is thus somewhat individual-dependent. As a consequence, if the method is used to detect lip motion of a subject not included in the training set of the TDNN, the detection rate will be degraded significantly.
SUMMARY OF INVENTION
In order to solve the above technical problem, according to an aspect of the present invention, a device for video-based lip motion detection is provided, which comprises: a face search unit adapted for searching a face in an input video frame; a mouth region extraction unit adapted for extracting a mouth region from the searched face; a visual feature extraction unit adapted for extracting a visual feature of the mouth region; and a detection unit adapted for detecting lip motion based on the extracted visual feature of the mouth region.
Preferably, the visual feature extraction unit is adapted for extracting at least one of a gradient on spatial-temporal plane and a Local Binary Pattern (LBP) code on spatial-temporal plane for each pixel in a spatial-temporal window and extracting a visual feature of the mouth region based on the extraction result.
Preferably, the detection unit is pre-trained with the extracted visual feature of the mouth region.
Preferably, the device for video-based lip motion detection further comprises a smoothing unit adapted for smoothing the detection result of the detection unit.
Preferably, the device for video-based lip motion detection further comprises: an audio feature extraction unit adapted for extracting an audio feature corresponding to the input video frame. The detection unit is adapted for detecting lip motion based on the visual feature extracted by the visual feature extraction unit and the audio feature extracted by the audio feature extraction unit.
Preferably, the detection unit is pre-trained with the extracted visual feature and audio feature.
Preferably, the visual feature comprises a Local Binary Pattern on Three Orthogonal Planes (LBP-TOP)-based visual feature.
Preferably, the mouth region is a rectangle, the center of the rectangle being located at the middle of the line connecting two mouth corners and the longer edges of the rectangle being parallel with the line connecting two mouth corners.
Preferably, the detection unit comprises a Support Vector Machine (SVM).
Preferably, the smoothing unit comprises a median filter. Preferably, the face search unit comprises a Viola-Jones face detector.
Preferably, the mouth region extraction unit is adapted for extracting the mouth region from the searched face by using an Active Shape Model (ASM).
Preferably, the visual feature extraction unit is further adapted for extracting at least one of a gradient on image plane and a Local Binary Pattern (LBP) code on image plane for each pixel in a spatial-temporal window and extracting a visual feature of the mouth region based on the overall extraction result.
According to another aspect of the present invention, a method for video-based lip motion detection is provided, which comprises the following steps of: searching a face in an input video frame; extracting a mouth region from the searched face; extracting a visual feature of the mouth region; and detecting lip motion based on the extracted visual feature of the mouth region.
Preferably, the method for video-based lip motion detection further comprises: extracting at least one of a gradient on spatial-temporal plane and a Local Binary Pattern (LBP) code on spatial-temporal plane for each pixel in a spatial-temporal window, prior to extracting a visual feature of the mouth region. The visual feature of the mouth region is extracted based on the extraction result.
Preferably, the method for video-based lip motion detection further comprises: pre-training with the extracted visual feature of the mouth region, prior to detecting lip motion.
Preferably, the method for video-based lip motion detection further comprises: smoothing the detection result.
Preferably, the method for video-based lip motion detection further comprises: extracting an audio feature corresponding to the input video frame. The lip motion is detected based on the extracted visual feature and audio feature.
Preferably, the method further comprises: pre-training with the extracted visual feature and audio feature, prior to detecting lip motion.
Preferably, the visual feature comprises a Local Binary Pattern on Three Orthogonal Planes (LBP-TOP)-based visual feature.
Preferably, the mouth region is a rectangle, the center of the rectangle being located at the middle of the line connecting two mouth corners and the longer edges of the rectangle being parallel with the line connecting two mouth corners.
Preferably, the lip motion is detected by using a Support Vector Machine (SVM).
Preferably, the detection result is smoothed by using a median filter.
Preferably, the face is searched from the input video frame by using a Viola-Jones face detector. Preferably, the mouth region is extracted from the searched face by using an Active Shape Model (ASM) .
Preferably, the method for video-based lip motion detection further comprises: extracting at least one of a gradient on image plane and a Local Binary Pattern (LBP) code on image plane for each pixel in a spatial-temporal window and extracting a visual feature of the mouth region based on the overall extraction result.
According to another aspect of the present invention, a speech recognition system is provided, which comprises: a microphone adapted for capturing an audio signal; a camera adapted for capturing a video signal; a device for video-based lip motion detection, adapted for detecting lip motion based on the video signal captured by the camera to obtain start time and end time of the lip motion; a speech segment detector adapted for extracting a speech segment based on the audio signal captured by the microphone and the start time and the end time of the lip motion; a feature extractor adapted for extracting an audio feature from the extracted speech segment; and a speech recognizer adapted for recognizing speech based on the extracted audio feature.
According to another aspect of the present invention, a video conference system is provided, which comprises: a microphone adapted for capturing an audio signal; a camera adapted for capturing a video signal; a device for video-based lip motion detection, adapted for detecting lip motion based on the video signal captured by the camera to obtain start time and end time of the lip motion; and a transmitter. The device for video-based lip motion detection is adapted for controlling the transmitter to transmit the audio signal captured by the microphone and the video signal captured by the camera at the start time of the lip motion and for controlling the transmitter to transmit only the video signal captured by the camera at the end time of the lip motion.
Preferably, the video conference system further comprises: a video frame cropper adapted for cropping video from the video signal captured by the camera. The device for video-based lip motion detection is adapted for enabling the video frame cropper and controlling the transmitter to transmit the audio signal captured by the microphone and the video cropped by the video frame cropper at the start time of the lip motion and for disabling the video frame cropper and controlling the transmitter to transmit only the video signal captured by the camera at the end time of the lip motion.
Preferably, the video frame cropper is adapted for cropping a close-up of a speaker who is currently speaking by means of zooming.
With the present invention, it is possible to implement subject-independent lip motion detection for a training set containing a limited number of subjects. Compared with the prior art, the present invention enables a higher detection rate for a subject not contained in the training set. According to the present invention, retraining or adaptation for different users to improve the detection rate is no longer necessary, and thus the usability is improved.
BRIEF DESCRIPTION OF DRAWINGS
The above and other features of the present invention will be more apparent from the following detailed description with reference to the figures, in which:
Fig. 1 shows a block diagram of a device for video-based lip motion detection according to an embodiment of the present invention;
Fig. 2 shows an example for calculating LBP code according to an embodiment of the present invention;
Fig. 3 shows an example for extracting LBP-TOP-based feature according to an embodiment of the present invention;
Fig. 4 shows a block diagram of a device for video-based lip motion detection according to another embodiment of the present invention;
Fig. 5 shows a block diagram of a device for video-based lip motion detection according to another embodiment of the present invention;
Fig. 6 illustrates a flowchart of a method for video-based lip motion detection according to an embodiment of the present invention;
Fig. 7 shows a block diagram of a video-aided speech recognition system having a device for lip motion detection according to an embodiment of the present invention;
(a) to (c) of Fig. 8 show signals in the speech recognition system of Fig. 7; and
Fig. 9 shows a block diagram of a video conference system having a device for lip motion detection according to an embodiment of the present invention.
DESCRIPTION OF EMBODIMENTS
The embodiments of the present invention will be detailed with reference to the drawings, from which the principles and implementations of the present invention will become more apparent. It is to be noted that the present invention is not limited to the particular embodiments described in the following. In addition, details of well known techniques unnecessary to the present invention are omitted herein for simplicity.
Fig. 1 shows a block diagram of a device 10 for video-based lip motion detection according to an embodiment of the present invention. As shown in Fig. 1, the lip motion detection device 10 includes: a face search unit 110 adapted for searching a face in an input video frame; a mouth region extraction unit 120 adapted for extracting a mouth region from the searched face; a visual feature extraction unit 130 adapted for extracting a visual feature of the mouth region; and a detection unit 140 adapted for detecting lip motion based on the extracted visual feature of the mouth region. In the following, the operations of the respective components included in the lip motion detection device 10 will be detailed.
The face search unit 110 searches a face in each input video frame. If any face is found, its position will be transferred to the mouth region extraction unit 120 as input information. On the other hand, no further processing will be performed on a video frame in which no face is found. The face search unit 110 can be implemented using various known techniques for face detection and tracking, such as but not limited to: the Viola-Jones face detector, the Rowley face detector, the meanshift tracker and the particle filtering tracker.
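As a concrete illustration of this step, the sketch below uses OpenCV's stock Viola-Jones cascade. The cascade file, the parameter values and the first-face-only convention are illustrative assumptions, not requirements of the patent.

```python
import cv2

# OpenCV ships a pre-trained Viola-Jones frontal face cascade
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def search_face(frame):
    """Return the first detected face as (x, y, w, h), or None if no face
    is found (in which case the frame is not processed further)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return tuple(faces[0]) if len(faces) > 0 else None
```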
The mouth region extraction unit 120 searches a mouth region in the face found by the face search unit 110 and extracts it from the face. In particular, for each found face, two mouth corners are searched first. Then, the mouth region is determined based on the two found mouth corners. The two mouth corners can be located by the well-known Active Shape Model (ASM). Alternatively, the Active Appearance Model (AAM) and Snakes (also known as the Active Contour Model) can also be used to locate the two mouth corners. After the two mouth corners are located, a rectangular region can be determined, with the center of the rectangular region located at the middle of the line connecting the two mouth corners and the longer edges of the rectangular region being parallel with that line. Then, the rectangular region is determined as the mouth region. The aspect ratio of the rectangular region is preferably 3:2. However, other aspect ratios are also applicable.
Alternatively, the mouth region can be of any other shape, such as an ellipse, which contains the entire outer contour of the lip. In addition, these shapes are not necessarily symmetric and their centers are not necessarily coincident with the middle of the line connecting the mouth corners. Further, any shape having a large intersection with the outer contour of the lip can also be used as the mouth region.
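Given the two mouth corner points (e.g., from an ASM fit), the rectangle described above can be constructed as in the following minimal sketch. The 20% margin beyond the corners is an assumption; the patent fixes only the center, the orientation and the preferred 3:2 aspect ratio.

```python
import numpy as np

def mouth_rectangle(corner_left, corner_right, aspect=3.0 / 2.0, margin=1.2):
    """Return the four vertices of the mouth rectangle: centered at the
    midpoint of the corner-connecting line, longer edges parallel to it,
    with a length:height ratio of 3:2 by default."""
    p1 = np.asarray(corner_left, dtype=float)
    p2 = np.asarray(corner_right, dtype=float)
    center = (p1 + p2) / 2.0
    u = p2 - p1
    length = np.linalg.norm(u) * margin        # rectangle length along the corner line
    u = u / np.linalg.norm(u)                  # unit vector along the corner line
    v = np.array([-u[1], u[0]])                # unit normal to the corner line
    half_l, half_h = length / 2.0, (length / aspect) / 2.0
    return [center + s * half_l * u + t * half_h * v
            for s in (-1.0, 1.0) for t in (-1.0, 1.0)]
```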
The visual feature extraction unit 130 extracts a visual feature from a spatial-temporal window which contains one or more consecutive mouth regions. In an embodiment of the present invention, the visual feature is described as a Local Binary Pattern on Three Orthogonal Planes (LBP-TOP) feature, which is a spatial-temporal extension of the well-known Local Binary Pattern (LBP) feature.
Specifically, for a given pixel (x_c, y_c), the values of its P uniformly spaced neighbors on a circle which is centered at (x_c, y_c) and has a radius of R are extracted by interpolation. The LBP code of the pixel is given by

LBP_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c) \cdot 2^p

where

s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}

and g_c is the value of the central pixel (x_c, y_c) and g_p is the value of the p-th neighbor. Fig. 2 shows an example for calculating the LBP code, where P = 8 and g_c = 70. The values of the central pixel and its neighbors uniformly distributed on a surrounding unit circle are shown in (a) of Fig. 2. First, the value of the central pixel is compared with the value of each of its neighbors. A neighbor can be represented as 1 if its value is not smaller than the value of the central pixel; otherwise it can be represented as 0. (b) of Fig. 2 shows the comparison result. Then, starting from the horizontal neighbor on the left, the comparison results can be arranged, in a counter-clockwise order, into a binary value which is the LBP code of the central pixel. Of course, any other order or starting pixel is also applicable.
Besides the basic form, there are many known variations of the LBP code, e.g., uniform LBP, rotation-invariant LBP and combinations thereof. Preferably, the basic LBP code with P = 8 and R = 1 is employed. However, other LBP code variations and other values of P and R are also applicable.
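The LBP code can be computed directly from the formula above. Below is a minimal sketch with bilinear interpolation for off-grid neighbors; the starting angle and bit order follow the Fig. 2 convention, and the pixel is assumed to lie at least R+1 pixels from the image border.

```python
import numpy as np

def lbp_code(img, xc, yc, P=8, R=1.0):
    """Basic LBP code of pixel (xc, yc): threshold P circle neighbors
    against the center value and pack the results into a P-bit code."""
    gc = float(img[yc, xc])
    code = 0
    for p in range(P):
        # start at the left horizontal neighbor and step around the circle
        theta = np.pi + 2.0 * np.pi * p / P
        x, y = xc + R * np.cos(theta), yc + R * np.sin(theta)
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        dx, dy = x - x0, y - y0
        # bilinear interpolation of the neighbor value g_p
        gp = ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1]
              + (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])
        code |= int(gp >= gc) << p              # s(g_p - g_c) * 2^p
    return code
```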
The LBP-TOP feature is extracted from a spatial-temporal window which contains one or more consecutive mouth regions. Fig. 3 shows an example of the extraction process. As illustrated in Fig. 3, a window (as shown in (a) of Fig. 3) is first divided into one or more spatial-temporal blocks. The LBP codes for each pixel in each block (as shown in (b) of Fig. 3) are extracted with respect to its neighbors on the XY, XT and/or YT planes, respectively. For each block, as shown in (c) of Fig. 3, an LBP code histogram for each plane is extracted and the histograms from one or more specific planes are then concatenated into the LBP-TOP feature of the block. The histogram from the XY plane contains more information on the subject's identity, while the histograms from the XT and YT planes contain more information on motion, which is less individual-dependent. Finally, the LBP-TOP features from all blocks are concatenated into the LBP-TOP feature of the spatial-temporal window.
As used in this embodiment, the spatial-temporal window comprises five consecutive mouth regions and is divided uniformly into 6 x 4 x 1 (corresponding to the X, Y and T axes, respectively) blocks with 50% overlap. Only the LBP histograms from the XT and YT planes are used. By combining the less individual-dependent information from the XT and YT planes, the present invention works well on subjects not contained in the training set. However, other numbers of mouth regions, other types of window division and other combinations of LBP histograms from different planes are also applicable.

It can be appreciated by those skilled in the art that the feature based on the LBP code can be extracted by post-processing the histograms or otherwise. For example, after the histogram of each plane for each spatial-temporal block is extracted, the histograms can be normalized. The normalization can be carried out separately for each plane of each block. Alternatively, the histograms of the same plane for spatially and temporally adjacent blocks, the histograms of different planes for the same block, or the histograms of different planes for spatially and temporally adjacent blocks can be normalized collectively. The criterion of normalization may cause the sum, or squared sum, of the normalized vector elements to be 1. After a first normalization, a value in a histogram exceeding a certain threshold can be changed to that threshold and a further normalization can be performed. Finally, the normalized histograms can be concatenated into the feature of the spatial-temporal window.

In addition, after the LBP code for each pixel is calculated, instead of extracting the histogram, the LBP code(s) on one or more particular planes for all pixels of each block can be considered as a feature vector which is dimension-reduced by means of sub-space analysis, such as Principal Component Analysis or Linear Discriminant Analysis, and then used as the feature of the block or plane. Finally, the LBP-TOP features of all blocks and/or all planes are concatenated into the feature of the spatial-temporal window.

Also, it is understood by those skilled in the art that other features can be used to describe a visual feature, such as but not limited to a spatial-temporal extension of a gradient type feature. A traditional gradient type feature calculates a gradient on the XY image plane and extracts a feature based on the gradient, while the spatial-temporal extension calculates gradients on the XY, XT and/or YT planes, respectively. Then, the gradient direction histograms for the respective planes in each spatial-temporal block can be extracted and normalized according to any one of the normalization approaches in LBP-TOP for extracting the feature. Alternatively, the gradient(s) on one or more planes for all pixels of each block can be considered as a feature vector which is dimension-reduced by means of sub-space analysis, such as Principal Component Analysis or Linear Discriminant Analysis, and then used as the feature of the block or plane. Finally, the gradient features from all blocks and/or all planes are concatenated into the feature of the spatial-temporal window, thereby obtaining the visual feature of the mouth region.
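To make the LBP-TOP pipeline concrete, the following simplified sketch computes L1-normalized LBP code histograms on the XT and YT planes of a single spatial-temporal block, using scikit-image's LBP routine. The division into 6 x 4 x 1 overlapping blocks and the concatenation over blocks are omitted for brevity; this is a sketch under those simplifying assumptions, not the full embodiment.

```python
import numpy as np
from skimage.feature import local_binary_pattern  # scikit-image assumed available

def lbp_top_feature(window, P=8, R=1, planes=("XT", "YT")):
    """LBP-TOP histogram feature of one (T, H, W) stack of mouth regions."""
    T, H, W = window.shape
    slices = {
        "XY": [window[t] for t in range(T)],        # one image plane per frame
        "XT": [window[:, y, :] for y in range(H)],  # one temporal slice per row
        "YT": [window[:, :, x] for x in range(W)],  # one temporal slice per column
    }
    feature = []
    for plane in planes:
        codes = np.concatenate(
            [local_binary_pattern(s, P, R).ravel() for s in slices[plane]])
        hist, _ = np.histogram(codes, bins=2 ** P, range=(0, 2 ** P))
        feature.append(hist / max(hist.sum(), 1))   # L1 normalization per plane
    return np.concatenate(feature)
```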
The detection unit 140 detects lip motion based on the extracted visual feature of the mouth region. For example, the detection unit 140 can be a classifier capable of distinguishing two classes of mouth regions, i.e., mouth regions with lip motion and mouth regions without lip motion. There are a number of known classifiers. In a preferred embodiment, a Support Vector Machine (SVM) is used. However, other classifiers such as the k-Nearest Neighbor classifier, Adaboost classifier, Neural Network classifier, Gaussian Process classifier and threshold classifier based on feature similarity are also applicable.
Preferably, before detecting the lip motion, the detection unit 140 can be pre-trained with the extracted visual feature of the mouth region. In an embodiment, such training can be achieved by assigning a label to each extracted visual feature. For example, if there is lip motion in the mouth region corresponding to a visual feature, a label of +1 is assigned to the feature; otherwise a label of -1 is assigned. Then, the detection unit 140 can be trained by using a number of known training approaches.
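With features and +1/-1 labels in hand, training the detection unit reduces to fitting a binary classifier. A hedged sketch using scikit-learn's SVM follows; the file names and the RBF kernel are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn assumed available

# one LBP-TOP feature vector per spatial-temporal window, with labels
# +1 (lip motion present) / -1 (no lip motion), as described above
X = np.load("lbp_top_features.npy")   # hypothetical pre-extracted training data
y = np.load("lip_motion_labels.npy")

clf = SVC(kernel="rbf")               # kernel choice is an assumption
clf.fit(X, y)

# at detection time, classify the feature of a new spatial-temporal window:
# prediction = clf.predict(new_feature.reshape(1, -1))   # +1 or -1
```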
Fig. 4 shows a block diagram of a device 40 for video-based lip motion detection according to another embodiment of the present invention. As shown in Fig. 4, the lip motion detection device 40 includes: a face search unit 410, a mouth region extraction unit 420, a visual feature extraction unit 430, a detection unit 440 and an audio feature extraction unit 450. Since the face search unit 410, mouth region extraction unit 420, visual feature extraction unit 430 and detection unit 440 as shown in Fig. 4 are similar to the face search unit 110, mouth region extraction unit 120, visual feature extraction unit 130 and detection unit 140 as shown in Fig. 1, only the audio feature extraction unit 450 will be detailed in the following for simplicity.
In this embodiment, the audio feature extraction unit 450 extracts an audio feature corresponding to the input video frame. The extracted audio feature is provided to the detection unit 440 along with the visual feature extracted by the visual feature extraction unit 430. In particular, if a subject in the video is speaking and synchronized audio is available, the audio feature extraction unit 450 can use any known audio-based speech endpoint detection method to detect the speech part and the non-speech part. If a frame falls in the speech part, there is a high probability of presence of lip motion in the mouth region in the frame. The detection accuracy of the detection unit 440 can be improved by detecting the lip motion in connection with the audio feature.
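As one simple stand-in for "any known audio-based speech endpoint detection method", the short-time energy of the audio samples overlapping a video frame can serve as the audio feature, as in the sketch below. The frame-to-sample mapping assumes a constant frame rate; this is an illustrative choice, not a detail from the patent.

```python
import numpy as np

def frame_energy(audio, sample_rate, fps, frame_index):
    """Mean energy of the audio samples that overlap one video frame."""
    n = int(sample_rate / fps)              # audio samples per video frame
    chunk = audio[frame_index * n:(frame_index + 1) * n].astype(float)
    return float(np.mean(chunk ** 2)) if len(chunk) else 0.0
```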
Preferably, before detecting the lip motion, the detection unit 440 can be pre-trained with the visual feature of the mouth region as extracted by the visual feature extraction unit 430 and the audio feature extracted by the audio feature extraction unit 450. The detection unit 440 can be trained by using a number of known training approaches.
Fig. 5 shows a block diagram of a device 50 for video-based lip motion detection according to another embodiment of the present invention. As shown in Fig. 5, the lip motion detection device 50 includes: a face search unit 510, a mouth region extraction unit 520, a visual feature extraction unit 530, a detection unit 540 and a smoothing unit 550. Since the face search unit 510, mouth region extraction unit 520, visual feature extraction unit 530 and detection unit 540 as shown in Fig. 5 are similar to the face search unit 110, mouth region extraction unit 120, visual feature extraction unit 130 and detection unit 140 as shown in Fig. 1, only the smoothing unit 550 will be detailed in the following for simplicity.
The smoothing unit 550 temporally smooths the detection result of the detection unit 540. Such smoothing can improve the accuracy of the detection result by utilizing the a priori knowledge that the status of the lip motion does not change repeatedly within a short period of time. For example, the smoothing unit 550 can be implemented using a median filter, preferably a five-point median filter. Alternatively, a median filter with another window size, or any other suitable filter, can also be used.
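For illustration only, a minimal sketch of such a smoothing unit, assuming SciPy is available and the per-frame detection results are a +1/-1 sequence:

```python
# Minimal sketch of the smoothing unit: a five-point median filter
# over the per-frame detection results (assumption: SciPy available).
import numpy as np
from scipy.signal import medfilt

def smooth_detections(raw_results, window=5):
    """Temporally smooth a sequence of per-frame +1/-1 detections."""
    # medfilt requires an odd window; 5 matches the preferred embodiment.
    return medfilt(np.asarray(raw_results, dtype=float), kernel_size=window)
```

A median filter is a natural choice here because it removes isolated single-frame flips without shifting the start and end times of a sustained lip motion, as a moving average would.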
It is to be understood that, as another implementation, both the audio feature extraction unit shown in Fig. 4 and the smoothing unit shown in Fig. 5 can be incorporated into the video-based lip motion detection device shown in Fig. 1.
Fig. 6 illustrates a flowchart of a method 60 for video-based lip motion detection according to an embodiment of the present invention. The method 60 starts at step S610.
At step S620, a face is searched for in an input video frame by the face search unit 110, 410 or 510. If a face is found, its position is passed to the next step as input information; no further processing is performed on a video frame in which no face is found. Preferably, the face can be searched for using approaches such as Viola-Jones face detection, Rowley face detection, mean-shift tracking and particle filter tracking.
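By way of illustration only, a minimal sketch of the Viola-Jones option using OpenCV's stock frontal-face cascade; this assumes a recent opencv-python build in which cv2.data.haarcascades points at the bundled cascade files, and a BGR frame decoded from the video:

```python
# Minimal sketch of the face search step with OpenCV's Viola-Jones
# cascade detector (assumption: opencv-python is installed).
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def search_faces(frame_bgr):
    """Return a list of (x, y, w, h) face rectangles; empty if none found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return list(cascade.detectMultiScale(gray, scaleFactor=1.1,
                                         minNeighbors=5))
```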
At step S630, a mouth region is extracted from the searched face by the mouth region extraction unit 120, 420 or 520. For each found face, two mouth corners are searched for first, and the mouth region is then determined based on the two found mouth corners. The two mouth corners can be located by the well-known Active Shape Model (ASM). Alternatively, the Active Appearance Model (AAM) or Snakes (also known as the Active Contour Model) can be used to locate the two mouth corners. After the two mouth corners are located, a rectangular region is determined, with its center located at the middle of the line connecting the two mouth corners and its longer edges parallel to that line. This rectangular region is taken as the mouth region. The aspect ratio of the rectangular region is preferably 3:2, although other aspect ratios are also applicable.
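As an illustrative sketch of this geometry only (not from the disclosure): the corners are assumed to be roughly horizontal so an axis-aligned rectangle suffices, and the 20% width margin is an assumption added so the rectangle extends past the corners:

```python
# Sketch of the mouth rectangle described above: centered at the
# midpoint of the mouth-corner line, longer edges along that line,
# preferred 3:2 aspect ratio. The margin factor is illustrative.
import numpy as np

def mouth_rectangle(left_corner, right_corner, aspect=(3, 2), margin=1.2):
    """Return (cx, cy, width, height) centered on the corner midpoint."""
    p1 = np.asarray(left_corner, dtype=float)
    p2 = np.asarray(right_corner, dtype=float)
    center = (p1 + p2) / 2.0                   # middle of the corner line
    width = np.linalg.norm(p2 - p1) * margin   # longer edge along the line
    height = width * aspect[1] / aspect[0]     # preferred 3:2 aspect ratio
    return center[0], center[1], width, height
```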
Alternatively, the mouth region can be of any other shape that contains the entire outer contour of the lips, such as an ellipse. In addition, these shapes are not necessarily symmetric, and their centers are not necessarily coincident with the middle of the line connecting the mouth corners. Further, any shape having a large intersection with the outer contour of the lips can also be used as the mouth region.
At step S640, a visual feature of the mouth region is extracted from a spatial-temporal window containing one or more consecutive mouth regions by the visual feature extraction unit 130, 430 or 530. In an embodiment of the present invention, the visual feature is described as an LBP-TOP feature. However, it can be appreciated by those skilled in the art that the visual feature can also be described using other features, such as, but not limited to, the above-mentioned spatial-temporal extension of a gradient-type feature. The step S640 can be performed by the visual feature extraction unit 130 shown in Fig. 1.
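For illustration only, a much-simplified sketch of an LBP-TOP-style descriptor: it computes a basic 8-neighbor LBP histogram on each of the three orthogonal center planes (XY, XT, YT) through a grayscale mouth-region volume and concatenates them. Published LBP-TOP descriptors typically use circular neighbor sampling and block division over all planes of the volume; those refinements are omitted here:

```python
# Simplified LBP-TOP sketch over a grayscale volume of shape (T, H, W).
# Assumptions: T, H, W >= 3; only the three center planes are used.
import numpy as np

def basic_lbp_hist(plane):
    """256-bin histogram of 3x3 LBP codes over a 2-D array."""
    c = plane[1:-1, 1:-1]
    code = np.zeros(c.shape, dtype=np.int32)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(shifts):
        neighbor = plane[1 + dy:plane.shape[0] - 1 + dy,
                         1 + dx:plane.shape[1] - 1 + dx]
        code |= (neighbor >= c).astype(np.int32) << bit
    hist, _ = np.histogram(code, bins=256, range=(0, 256))
    return hist

def lbp_top(volume):
    """Concatenate LBP histograms from the XY, XT and YT center planes."""
    t, h, w = volume.shape
    xy = volume[t // 2, :, :]   # spatial appearance
    xt = volume[:, h // 2, :]   # horizontal-temporal motion
    yt = volume[:, :, w // 2]   # vertical-temporal motion
    return np.concatenate([basic_lbp_hist(p) for p in (xy, xt, yt)])
```

Because the XT and YT histograms encode how pixel patterns change over time rather than what the mouth looks like, a feature of this kind carries less identity information than per-frame appearance features, which is what makes the detection less individual-dependent.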
Next, at step S650, the lip motion is detected based on the extracted visual feature of the mouth region by the detection unit 140 or 540. For example, a classifier can be used to distinguish two classes of mouth regions, i.e., mouth regions with lip motion and mouth regions without lip motion, so as to obtain the detection result. Such classifiers include, but are not limited to, the Support Vector Machine (SVM), k-Nearest Neighbor classifier, AdaBoost classifier, Neural Network classifier, Gaussian Process classifier and threshold classifier based on feature similarity. Finally, the method 60 ends at step S660.
Preferably, though not shown in Fig. 6, before the step S650 of detecting the lip motion, the detection unit can be pre-trained with the visual feature of the mouth region as extracted in step S640. In an embodiment, such training can be achieved by assigning a label to each extracted visual feature. For example, if there is lip motion in the mouth region corresponding to a visual feature, a label of +1 is assigned to the feature; otherwise a label of -1 is assigned. The pre-training can then be carried out using any of a number of known training approaches.
Additionally, though not shown in Fig. 6, an audio feature corresponding to the input video frame can be extracted by the audio feature extraction unit 450. The extracted audio feature is provided to step S650 along with the extracted visual feature for the detection unit 440 to detect the lip motion. In particular, if a subject in the video is speaking and synchronized audio is available, any known audio-based speech endpoint detection method can be used to separate the speech part from the non-speech part. If a frame falls in the speech part, there is a high probability that lip motion is present in the mouth region of that frame. The detection accuracy can be improved by detecting the lip motion in connection with the audio feature. Accordingly, before detecting the lip motion, it is possible to pre-train with both the visual feature and the audio feature.
Further, though not shown in Fig. 6, the method can smooth the detection result obtained in step S650 by means of the smoothing unit 550. Such smoothing can improve the accuracy of the detection result by utilizing the a priori knowledge that the status of the lip motion does not change repeatedly within a short period of time. For example, the smoothing can be implemented using a median filter, preferably a five-point median filter. Alternatively, a median filter with another window size, or any other suitable filter, can also be used.
Fig. 7 shows a block diagram of a video-aided speech recognition system 70 having a device for lip motion detection according to an embodiment of the present invention. In the speech recognition system 70, a speech segment can be detected based on video, such that the accuracy of speech detection in noisy environments can be improved.
In particular, the speech recognition system 70 shown in Fig. 7 includes a microphone 710, a camera 720, a lip motion detection device 730, a speech segment detector 740, a feature extractor 750 and a speech recognizer 760. The microphone 710 and the camera 720 capture audio and video signals in real time, respectively. The speaker faces the camera 720 and the microphone 710 when speaking. The captured video is sent to the lip motion detection device 730 and the captured audio is sent to the speech segment detector 740. Herein, the lip motion detection device 730 can be implemented by the lip motion detection device 10 shown in Fig. 1, the lip motion detection device 40 shown in Fig. 4, or the lip motion detection device 50 shown in Fig. 5.
If the lip motion detection device 730 detects a lip motion, it sends the start time and end time of the lip motion to the speech segment detector 740, which then extracts a speech segment based on the received start time and end time of the lip motion. Part (a) of Fig. 8 shows the audio signal received by the microphone 710, and part (b) of Fig. 8 shows the lip motion detection result. It is clear that the speech endpoints and the lip motion signal endpoints match each other quite well. Part (c) of Fig. 8 shows the speech segment extracted based on the lip motion signal.
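For illustration only, a minimal sketch of this extraction step, assuming the reported times are in seconds and the captured audio is a 1-D sample array:

```python
# Minimal sketch of the speech segment detector: cut the sample range
# between the lip motion start and end times reported by the detector.
def extract_speech_segment(audio, sample_rate, start_time, end_time):
    """Return the audio samples spanning the detected lip motion."""
    start = max(0, int(start_time * sample_rate))
    end = min(len(audio), int(end_time * sample_rate))
    return audio[start:end]
```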
The extracted speech segment is then sent to the feature extractor 750, in which an audio feature is extracted and sent to the speech recognizer 760 for recognizing the speech and outputting a recognition result.
Fig. 9 shows a block diagram of a video conference system 90 having a device for lip motion detection according to an embodiment of the present invention. The system is capable of automatically turning a microphone on or off and providing a close-up of the current speaker in a case where there are multiple subjects.
Specifically, the video conference system 90 includes a microphone 910, a camera 920, a lip motion detection device 930, a video frame cropper 940 and a transmitter 950. The microphone 910 captures the audio signal in real time and the camera 920 captures video in real time. The speaker faces the camera when speaking to the other party of the conference. The captured video is sent to both the lip motion detection device 930 and the video frame cropper 940. Herein, the lip motion detection device 930 can be implemented by the lip motion detection device 10 shown in Fig. 1, the lip motion detection device 40 shown in Fig. 4, or the lip motion detection device 50 shown in Fig. 5.
The video frame cropper 940 crops video from the video captured by the camera 920. At the start time of the lip motion, the lip motion detection device 930 turns on the video frame cropper 940 and sends the position of the speaker to it. If there are multiple subjects (speakers), the video frame cropper crops the video frame and resizes it (by means of zooming) to provide a close-up of the current speaker. The cropped video and the audio from the microphone 910 are sent to the transmitter 950 for transmission. At the end time of the lip motion, the lip motion detection device 930 turns off the video frame cropper 940 to stop cropping the video. At this time, the transmitter 950 transmits only the video captured by the camera 920.
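As an illustrative sketch only: the speaker position is assumed to be an (x, y, w, h) face rectangle as produced by the face search unit, and the 50% margin around it is an assumption, not from the disclosure:

```python
# Minimal sketch of the video frame cropper: crop around the reported
# speaker position and zoom back to the desired output size.
# Assumption: OpenCV frames (numpy arrays, shape (rows, cols, channels)).
import cv2

def crop_close_up(frame, speaker_rect, out_size, margin=0.5):
    """Crop a region around the speaker and resize it to out_size."""
    x, y, w, h = speaker_rect
    mx, my = int(w * margin), int(h * margin)
    x0, y0 = max(0, x - mx), max(0, y - my)
    x1 = min(frame.shape[1], x + w + mx)
    y1 = min(frame.shape[0], y + h + my)
    return cv2.resize(frame[y0:y1, x0:x1], out_size)
```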
It is to be understood that the video conference system 90 need not include the video frame cropper 940. In that case, at the start time of the lip motion, the lip motion detection device 930 controls the transmitter 950 to transmit the audio captured by the microphone 910 and the video captured by the camera 920; at the end time of the lip motion, the lip motion detection device 930 controls the transmitter 950 to transmit only the video captured by the camera 920.
With the present invention, it is possible to implement subject-independent lip motion detection with a training set containing a limited number of subjects. Compared with the prior art, the present invention enables a higher detection rate for a subject not contained in the training set. According to the present invention, retraining or adaptation for different users to improve the detection rate is no longer necessary, and the usability is thus improved.
The present invention has been described above with reference to the preferred embodiments thereof. It should be understood that various modifications, alterations and variations can be made by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the present invention is not limited to the above particular embodiments but is defined only by the attached claims and their equivalents.

Claims

1. A device for video-based lip motion detection, comprising:
-a face search unit adapted for searching a face in an input video frame;
-a mouth region extraction unit adapted for extracting a mouth region from the searched face;
-a visual feature extraction unit adapted for extracting a visual feature of the mouth region; and
-a detection unit adapted for detecting lip motion based on the extracted visual feature of the mouth region.
2. The device for video-based lip motion detection according to claim 1, wherein the visual feature extraction unit is adapted for extracting at least one of a gradient on spatial-temporal plane and a Local Binary Pattern (LBP) code on spatial-temporal plane for each pixel in a spatial-temporal window and extracting a visual feature of the mouth region based on the extraction result.
3. The device for video-based lip motion detection according to claim 1, wherein the detection unit is pre-trained with the extracted visual feature of the mouth region.
4. The device for video-based lip motion detection according to claim 1, further comprising:
-a smoothing unit adapted for smoothing the detection result of the detection unit.
5. The device for video-based lip motion detection according to claim 1, further comprising:
-an audio feature extraction unit adapted for extracting an audio feature corresponding to the input video frame;
wherein the detection unit is adapted for detecting lip motion based on the visual feature extracted by the visual feature extraction unit and the audio feature extracted by the audio feature extraction unit.
6. The device for video-based lip motion detection according to claim 5, wherein the detection unit is pre-trained with the extracted visual feature and audio feature.
7. The device for video-based lip motion detection according to claim 1, wherein the visual feature comprises a Local Binary Pattern on Three Orthogonal Planes (LBP-TOP)-based visual feature.
8. The device for video-based lip motion detection according to claim 1, wherein the mouth region is a rectangle, the center of the rectangle being located at the middle of the line connecting two mouth corners and the longer edges of the rectangle being parallel with the line connecting two mouth corners.
9. The device for video-based lip motion detection according to claim 1, wherein the detection unit comprises a Support Vector Machine (SVM).
10. The device for video-based lip motion detection according to claim 4, wherein the smoothing unit comprises a median filter.
11. The device for video-based lip motion detection according to claim 1, wherein the face search unit comprises a Viola-Jones face detector.
12. The device for video-based lip motion detection according to claim 1, wherein the mouth region extraction unit is adapted for extracting the mouth region from the searched face by using an Active Shape Model (ASM).
13. The device for video-based lip motion detection according to claim 1, wherein the visual feature extraction unit is further adapted for extracting at least one of a gradient on image plane and a Local Binary Pattern (LBP) code on image plane for each pixel in a spatial-temporal window and extracting a visual feature of the mouth region based on the overall extraction result.
14. A method for video-based lip motion detection, comprising the following steps of:
-searching a face in an input video frame;
-extracting a mouth region from the searched face;
-extracting a visual feature of the mouth region; and
-detecting lip motion based on the extracted visual feature of the mouth region.
15. The method for video-based lip motion detection according to claim 14, further comprising:
-extracting at least one of a gradient on spatial-temporal plane and a Local Binary Pattern (LBP) code on spatial-temporal plane for each pixel in a spatial-temporal window, prior to extracting a visual feature of the mouth region;
wherein the visual feature of the mouth region is extracted based on the extraction result.
16. The method for video-based lip motion detection according to claim 14, further comprising:
-pre-training with the extracted visual feature of the mouth region, prior to detecting lip motion.
17. The method for video-based lip motion detection according to claim 14, further comprising:
-smoothing the detection result.
18. The method for video-based lip motion detection according to claim 14, further comprising:
-extracting an audio feature corresponding to the input video frame;
wherein the lip motion is detected based on the extracted visual feature and audio feature.
19. The method for video-based lip motion detection according to claim 18, further comprising:
-pre-training with the extracted visual feature and audio feature, prior to detecting lip motion.
20. The method for video-based lip motion detection according to claim 14, wherein the visual feature comprises a Local Binary Pattern on Three Orthogonal Planes (LBP-TOP)-based visual feature.
21. The method for video-based lip motion detection according to claim 14, wherein the mouth region is a rectangle, the center of the rectangle being located at the middle of the line connecting two mouth corners and the longer edges of the rectangle being parallel with the line connecting two mouth corners.
22. The method for video-based lip motion detection according to claim 14, wherein the lip motion is detected by using a Support Vector Machine (SVM).
23. The method for video-based lip motion detection according to claim 17, wherein the detection result is smoothed by using a median filter.
24. The method for video-based lip motion detection according to claim 14, wherein the face is searched from the input video frame by using a Viola-Jones face detector.
25. The method for video-based lip motion detection according to claim 14, wherein the mouth region is extracted from the searched face by using an Active Shape Model (ASM).
26. The method for video-based lip motion detection according to claim 14, further comprising:
-extracting at least one of a gradient on image plane and a Local Binary Pattern (LBP) code on image plane for each pixel in a spatial-temporal window and extracting a visual feature of the mouth region based on the overall extraction result.
27. A speech recognition system, comprising:
-a microphone adapted for capturing an audio signal;
-a camera adapted for capturing a video signal;
-a device for video-based lip motion detection according to any one of claims 1-13, adapted for detecting lip motion based on the video signal captured by the camera to obtain a start time and an end time of the lip motion;
-a speech segment detector adapted for extracting a speech segment based on the audio signal captured by the microphone and the start time and the end time of the lip motion;
-a feature extractor adapted for extracting an audio feature from the extracted speech segment; and
-a speech recognizer adapted for recognizing speech based on the extracted audio feature.
28. A video conference system, comprising:
-a microphone adapted for capturing an audio signal;
-a camera adapted for capturing a video signal;
-a device for video-based lip motion detection according to any one of claims 1-13, adapted for detecting lip motion based on the video signal captured by the camera to obtain a start time and an end time of the lip motion; and
-a transmitter;
wherein the device for video-based lip motion detection is adapted for controlling the transmitter to transmit the audio signal captured by the microphone and the video signal captured by the camera at the start time of the lip motion and for controlling the transmitter to transmit only the video signal captured by the camera at the end time of the lip motion.
29. The video conference system according to claim 28, further comprising:
-a video frame cropper adapted for cropping video from the video signal captured by the camera;
wherein the device for video-based lip motion detection is adapted for enabling the video frame cropper and controlling the transmitter to transmit the audio signal captured by the microphone and the video cropped by the video frame cropper at the start time of the lip motion, and for disabling the video frame cropper and controlling the transmitter to transmit only the video signal captured by the camera at the end time of the lip motion.
30. The video conference system according to claim 29, wherein the video frame cropper is adapted for cropping a close-up of a speaker who is currently speaking by means of zooming.
PCT/JP2012/057677 2011-03-18 2012-03-19 Device and method for lip motion detection WO2012128382A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110077483.1 2011-03-18
CN2011100774831A CN102682273A (en) 2011-03-18 2011-03-18 Device and method for detecting lip movement

Publications (1)

Publication Number Publication Date
WO2012128382A1 true WO2012128382A1 (en) 2012-09-27

Family

ID=46814174

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/057677 WO2012128382A1 (en) 2011-03-18 2012-03-19 Device and method for lip motion detection

Country Status (2)

Country Link
CN (1) CN102682273A (en)
WO (1) WO2012128382A1 (en)


Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745723A (en) * 2014-01-13 2014-04-23 苏州思必驰信息科技有限公司 Method and device for identifying audio signal
CN104951730B (en) * 2014-03-26 2018-08-31 联想(北京)有限公司 A kind of lip moves detection method, device and electronic equipment
CN104298961B (en) * 2014-06-30 2018-02-16 中国传媒大学 Video method of combination based on Mouth-Shape Recognition
CN105321523A (en) * 2014-07-23 2016-02-10 中兴通讯股份有限公司 Noise inhibition method and device
JP2016173791A (en) * 2015-03-18 2016-09-29 カシオ計算機株式会社 Image processor, image processing method and program
CN104883531A (en) * 2015-05-14 2015-09-02 无锡华海天和信息科技有限公司 Implementation method for echo cancellation for video call
CN107203734A (en) * 2016-03-17 2017-09-26 掌赢信息科技(上海)有限公司 A kind of method and electronic equipment for obtaining mouth state
CN105959723B (en) * 2016-05-16 2018-09-18 浙江大学 A kind of lip-sync detection method being combined based on machine vision and Speech processing
CN106331509B (en) * 2016-10-31 2019-08-20 维沃移动通信有限公司 A kind of photographic method and mobile terminal
CN109492506A (en) * 2017-09-13 2019-03-19 华为技术有限公司 Image processing method, device and system
EP3457716A1 (en) * 2017-09-15 2019-03-20 Oticon A/s Providing and transmitting audio signal
CN109817211B (en) * 2019-02-14 2021-04-02 珠海格力电器股份有限公司 Electric appliance control method and device, storage medium and electric appliance
CN110544479A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Denoising voice recognition method and device
CN111091824B (en) * 2019-11-30 2022-10-04 华为技术有限公司 Voice matching method and related equipment
WO2021114224A1 (en) * 2019-12-13 2021-06-17 华为技术有限公司 Voice detection method, prediction model training method, apparatus, device, and medium
CN111918127B (en) * 2020-07-02 2023-04-07 影石创新科技股份有限公司 Video clipping method and device, computer readable storage medium and camera
CN113642469A (en) * 2021-08-16 2021-11-12 北京百度网讯科技有限公司 Lip motion detection method, device, equipment and storage medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE389934T1 (en) * 2003-01-24 2008-04-15 Sony Ericsson Mobile Comm Ab NOISE REDUCTION AND AUDIOVISUAL SPEECH ACTIVITY DETECTION
CN1967564A (en) * 2005-11-17 2007-05-23 中华电信股份有限公司 Method and device for detecting and identifying human face applied to set environment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07225841A (en) * 1993-12-13 1995-08-22 Sharp Corp Picture processor
JP2003195883A (en) * 2001-12-26 2003-07-09 Toshiba Corp Noise eliminator and communication terminal equipped with the eliminator
JP2004240154A (en) * 2003-02-06 2004-08-26 Hitachi Ltd Information recognition device
JP2006202276A (en) * 2004-12-22 2006-08-03 Fuji Photo Film Co Ltd Image processing method, system, and program
JP2009098901A (en) * 2007-10-16 2009-05-07 Nippon Telegr & Teleph Corp <Ntt> Method, device and program for detecting facial expression
JP2010204984A (en) * 2009-03-04 2010-09-16 Nissan Motor Co Ltd Driving support device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ATSUSHI SAYO ET AL.: "A Study on Classifier Generation Methods for Personal Authentication System Using Lip Variation", IEICE TECHNICAL REPORT, vol. 110, no. 217, 28 September 2010 (2010-09-28), pages 7 - 12 *
P.VIOLA, M.JONES: "Rapid Object Detection using a Boosted Cascade of Simple Features", IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2001, pages 511 - 518, XP010583787 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103268472A (en) * 2013-04-17 2013-08-28 哈尔滨工业大学深圳研究生院 Dual-color-space-based lip detection method
WO2015076828A1 (en) * 2013-11-22 2015-05-28 Intel Corporation Apparatus and method for voice based user enrollment with video assistance
US9406295B2 (en) 2013-11-22 2016-08-02 Intel Corporation Apparatus and method for voice based user enrollment with video assistance
EP3103260A1 (en) * 2014-02-03 2016-12-14 Google, Inc. Enhancing video conferences
CN106157972B (en) * 2015-05-12 2021-11-05 汇顶科技(香港)有限公司 Method and apparatus for acoustic context recognition using local binary patterns
CN106157972A (en) * 2015-05-12 2016-11-23 恩智浦有限公司 Use the method and apparatus that local binary pattern carries out acoustics situation identification
DE102018206216A1 (en) * 2018-04-23 2019-10-24 Bayerische Motoren Werke Aktiengesellschaft A method, apparatus and means for automatically associating a first and second video data stream with a corresponding first and second audio data stream
JP2021526048A (en) * 2018-05-28 2021-09-30 コーニンクレッカ フィリップス エヌ ヴェKoninklijke Philips N.V. Optical detection of subject's communication request
JP7304898B2 (en) 2018-05-28 2023-07-07 コーニンクレッカ フィリップス エヌ ヴェ Optical detection of subject communication requests
CN110750152A (en) * 2019-09-11 2020-02-04 云知声智能科技股份有限公司 Human-computer interaction method and system based on lip action
CN110750152B (en) * 2019-09-11 2023-08-29 云知声智能科技股份有限公司 Man-machine interaction method and system based on lip actions
CN112241521A (en) * 2020-12-04 2021-01-19 北京远鉴信息技术有限公司 Identity verification method and device of plosive, electronic equipment and medium
EP4009323A1 (en) * 2020-12-04 2022-06-08 BlackBerry Limited Speech activity detection using dual sensory based learning
US11451742B2 (en) 2020-12-04 2022-09-20 Blackberry Limited Speech activity detection using dual sensory based learning

Also Published As

Publication number Publication date
CN102682273A (en) 2012-09-19

Similar Documents

Publication Publication Date Title
WO2012128382A1 (en) Device and method for lip motion detection
US11527055B2 (en) Feature density object classification, systems and methods
US10181325B2 (en) Audio-visual speech recognition with scattering operators
CN106557726B (en) Face identity authentication system with silent type living body detection and method thereof
US9020188B2 (en) Method for object detection and apparatus using the same
CN106503691B (en) Identity labeling method and device for face picture
JP2001092974A (en) Speaker recognizing method, device for executing the same, method and device for confirming audio generation
Bendris et al. Lip activity detection for talking faces classification in TV-content
US10311287B2 (en) Face recognition system and method
CN110750152A (en) Human-computer interaction method and system based on lip action
WO2018109533A1 (en) A method for selecting frames used in face processing
CN113302907B (en) Shooting method, shooting device, shooting equipment and computer readable storage medium
Vajaria et al. Audio segmentation and speaker localization in meeting videos
Itkarkar et al. Hand gesture to speech conversion using Matlab
CN104598138B (en) electronic map control method and device
Cheng et al. The dku audio-visual wake word spotting system for the 2021 misp challenge
KR20140134549A (en) Apparatus and Method for extracting peak image in continuously photographed image
Monaci Towards real-time audiovisual speaker localization
Chin et al. Lips detection for audio-visual speech recognition system
Besson et al. A multimodal approach to extract optimized audio features for speaker detection
VINUPRIYA et al. Smart Face Recognition Using Machine Learning
Saravi et al. Real-time speaker identification for video conferencing
Wong et al. Audio-visual recognition system in compression domain
Takiguchi et al. Video editing based on situation awareness from voice information and face emotion
Sharma et al. Face detection from digital images: A comparative study

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 12760954; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 12760954; Country of ref document: EP; Kind code of ref document: A1