KR20020068235A

KR20020068235A - Method and apparatus of recognizing speech using a tooth and lip image

Info

Publication number: KR20020068235A
Application number: KR1020010009818A
Authority: KR
Inventors: 유재천
Original assignee: 유재천
Priority date: 2001-02-20
Filing date: 2001-02-20
Publication date: 2002-08-27

Abstract

PURPOSE: A system and a method for recognizing speech using images of the teeth and lips are provided to achieve a high speech recognition rate in high-noise as well as low-noise environments. CONSTITUTION: A speech recognition system using images of the teeth and lips includes an image input unit (10) for picking up an image of a speaker's teeth and lips and converting it into an electrical video signal, drivers (18, 19, 20) for driving at least one load (21, 22, 23), and a signal processor (14) for extracting tooth-lip parameters from the video signal. The system further includes a storage (16) for storing the parameters during a learning procedure, and a controller (15) that stores the parameters in the storage during the learning procedure and, during a recognition procedure, compares the speaker's tooth-lip parameters with the stored parameters to find similar ones. The controller controls the driver corresponding to the similar parameters found and provides the control signals required by the components of the system.

Description

Speech recognition device and method using tooth and lip image {Method and apparatus of recognizing speech using a tooth and lip image}

With the development of information and communication technology, speech recognition and speech synthesis have recently attracted attention. Operating computers and home appliances by voice, and even conducting electronic commerce by voice, is becoming a practical reality.

Speaker verification using the voiceprint, which is in the spotlight as a biometric security technology along with fingerprint and iris recognition, is becoming a core technology of the information society as the rapid growth of the Internet and electronic commerce makes security more important.

Speaker verification is a technology that authenticates a person's identity by using the vocal characteristics unique to each individual as a voice password. Because the voice itself serves as the password, it is an advanced security technology in which the password cannot be stolen and misused.

Speaker verification is used as a security password function in banks' home banking services, securities firms' home trading services, computers, and various household products. However, the performance of these speech recognition technologies degrades sharply in noisy environments, which greatly reduces product reliability.

One approach to this problem is U.S. Patent 4,769,845, which recognizes speech using lip images. However, because of the ambiguity of the lips, that patent does not significantly improve the speech recognition rate.

The present invention relates to a speech recognition apparatus and method using images of a speaker's teeth and lips.

In particular, it relates to a speech recognition apparatus and method that achieve a high speech recognition rate in high-noise as well as low-noise environments.

The invention provides a speech recognition method and apparatus that achieve a higher speech recognition rate by combining the speaker's voice information with image information of the teeth and lips.

Another aspect of the invention provides a speech recognition method and apparatus that achieve a higher speech recognition rate by combining the speaker's voice information with image information showing all or part of the structure inside the speaker's oral cavity, obtained by a see-through imaging device such as an X-ray device or an infrared thermograph device.

FIG. 1 is a block diagram of a system that recognizes not only the speaker's voice signal but also images of the teeth and lips, including the tongue, to achieve a high speech recognition rate even in a high-noise environment.

FIG. 2 shows an embodiment of the present invention using a computer.

FIG. 3 shows an embodiment of extracting tooth-lip parameters from the tooth and lip image signal, including the speaker's tongue, obtained by the A/D converter.

FIGS. 4A to 4C show typical lip image signals.

FIGS. 5A to 5C show the tooth and lip image signals, including the tongue, corresponding to FIGS. 4A to 4C.

FIG. 6 shows an embodiment of how the tooth-lip parameters change over time when the speaker pronounces "a-e-i-o-u" ("아에이오우").

FIG. 7 shows an embodiment of the speaker's voice parameters and tooth-lip parameters stored in the storage device during the learning process.

FIG. 8 is a flowchart of the speech recognition method using tooth and lip images.

FIG. 9 shows an embodiment in which the speech recognition apparatus and method using tooth and lip images of the present invention are implemented as a semiconductor chip (100).

FIG. 10 shows an embodiment of an image of the speaker's oral structure obtained by the see-through imaging device.

An object of the present invention is to solve the serious problem that speech recognition performance drops sharply in noisy environments, by providing a speech recognition method and apparatus that use not only the speaker's voice information but also image information of the teeth and lips, including the tongue, and thereby achieve a high speech recognition rate even in a high-noise environment. With the speaker's voice information alone, a high recognition rate cannot be achieved in a noisy environment. It is also difficult to improve the recognition rate with lip image signals alone, because the lips are ambiguous. This pronunciation-dependent ambiguity of the lips can be overcome by using image signals of the teeth and lips, including the speaker's tongue.

To achieve this object, images of the speaker's teeth and lips are captured while the speaker is speaking by a video camera, a CCD camera, an infrared thermograph device, an X-ray device, or an MRI (Magnetic Resonance Imaging) device, A/D converted, and then supplied for signal processing to a computer or a semiconductor chip capable of arithmetic processing. The computer or chip extracts tooth-lip parameters from the tooth and lip images for each phoneme, word, or sentence.

The extracted tooth-lip parameters are compared with the tooth-lip parameters stored for each phoneme, word, or sentence during a prior learning process, and the similar phoneme, word, or sentence is found, thereby recognizing the speech. In addition, the speaker's voice signal from a microphone is A/D converted and supplied to the computer or chip, and the voice parameters extracted by the computer or chip are combined with the tooth-lip parameters, providing a speech recognition method and apparatus that achieve an even higher recognition rate.

Because the present invention recognizes speech using not only the speaker's voice information but also tooth-lip parameters based on images of the speaker's teeth and lips, including the tongue, it provides a speech recognition apparatus and method using tooth and lip images that have a high recognition rate in noisy environments.

Hereinafter, the speech recognition method and apparatus using tooth and lip images according to the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a system that recognizes not only the speaker's voice signal but also images of the teeth and lips, including the tongue, to achieve a high speech recognition rate even in a high-noise environment. Reference numerals 31 and 33 denote the speaker's upper and lower lips, reference numerals 32 and 34 denote the speaker's upper and lower teeth, and reference numeral 36 denotes the speaker's tongue.

The system of FIG. 1 comprises: a load unit (21, 22, 23) consisting of at least one load; a driving unit (18, 19, 20) for driving the load unit (21, 22, 23); an image input unit (10), constituted by a video camera, a CCD camera, an infrared thermograph device, an X-ray device, or an MRI (Magnetic Resonance Imaging) device, for picking up images of the teeth (32, 34) and lips (31, 33), including the tongue (36), while the speaker is speaking and converting them into an electrical image signal; a voice input unit (11) having a microphone for picking up the speaker's voice signal; an A/D converter (12) for converting the tooth and lip image signal, including the tongue, generated by the image input unit (10) into a digital signal; and an A/D converter (13) for converting the voice signal generated by the voice input unit (11) into a digital signal.

The digitized tooth and lip image signal, including the tongue, and the digitized voice signal are supplied for signal processing to a signal processor (14) consisting of a computer or a semiconductor chip capable of arithmetic processing. The signal processor (14) extracts, for each phoneme, word, or sentence, tooth-lip parameters based on the tooth and lip image signal and voice parameters based on the speaker's voice signal.

The system further comprises a storage device (16) for storing the tooth-lip parameters and voice parameters during a prior learning process, and a controller (15). During the learning process, the controller (15) stores the tooth-lip parameters and voice parameters in the storage device (16). During the recognition process, it compares the speaker's tooth-lip parameters and voice parameters with those stored in the storage device (16) during the learning process to find the similar phoneme, word, or sentence, controls the driving unit (18, 19, 20) corresponding to the phoneme, word, or sentence found, and supplies the control signals required by each unit.

FIG. 2 shows an embodiment of the present invention using a computer.

The apparatus of FIG. 2 comprises: a load unit (47) consisting of at least one load; a driving unit (48) for driving the load unit (47); a keyboard (46) for entering data and commands; a display device (45), constituted by a computer monitor and a TV device, for displaying information; an image input unit (10), constituted by a video camera, a CCD camera, an infrared thermograph device, an X-ray device, or an MRI (Magnetic Resonance Imaging) device, for picking up images of the teeth and lips, including the tongue, while the speaker is speaking; a voice input unit (11) having a microphone for picking up the speaker's voice signal; an A/D converter (12) for converting the tooth and lip image signal, including the tongue, generated by the image input unit (10) into a digital signal; and an A/D converter (13) for converting the voice signal generated by the voice input unit (11) into a digital signal.

The digitized tooth and lip image signal, including the tongue, and the digitized voice signal are supplied for signal processing to a signal processor (14) consisting of a central processing unit or a semiconductor chip capable of arithmetic processing. The signal processor (14) extracts tooth-lip parameters based on the tooth and lip image signal for each phoneme, word, or sentence, and extracts voice parameters based on the speaker's voice signal.

The apparatus further comprises a storage device (16) for storing the tooth-lip parameters and voice parameters during the learning process, and a controller (15). During the learning process, the controller (15) stores the tooth-lip parameters and voice parameters in the storage device (16). During the recognition process, it compares the speaker's tooth-lip parameters and voice parameters with those stored in the storage device (16) during the learning process to find the similar phoneme, word, or sentence, controls the driving unit (48) corresponding to the phoneme, word, or sentence found, and supplies the control signals required by each unit. The controller (15), the A/D converters (12, 13), the signal processor (14), and the storage device (16) are housed in a computer body (43), which also contains the interface unit (18) required for the keyboard (46) and the display device (45).

FIG. 3 shows an embodiment of extracting tooth-lip parameters from the tooth and lip image signal, including the speaker's tongue, obtained by the A/D converter (12). Reference numerals 31 and 33 denote the speaker's upper and lower lips, reference numerals 32 and 34 denote the speaker's upper and lower teeth, and reference numeral 36 denotes the speaker's tongue.

In one embodiment of the present invention, the tooth-lip parameters can be the change over time of: the contour shape of the lips; the opening area of the lips (49); the horizontal lip opening size (50) and the vertical lip opening size (51); the ratio of the horizontal lip opening size (50) to the vertical lip opening size (51); the interdental height (52) between the upper teeth (32) and the lower teeth (34); the exposed size of the upper teeth (53) and the exposed size of the lower teeth (54); the ratio of the exposed size of the upper teeth (53) to the exposed size of the lower teeth (54); the number of exposed upper teeth; the number of exposed lower teeth; whether the tongue (36) is exposed and the exposed size of the tongue (36); and the vertical position and front-back position of the tongue (36).
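Two of the parameters listed above are ratios of other measurements. A minimal sketch of deriving them from per-frame measurements follows; the class and field names, the pixel units, and the zero fallback for a closed mouth are assumptions made for illustration and are not specified in the source.

```python
from dataclasses import dataclass


@dataclass
class MouthMeasurements:
    """Raw measurements for one image frame (hypothetical names, arbitrary units)."""
    lip_open_width: float       # horizontal lip opening size (50)
    lip_open_height: float      # vertical lip opening size (51)
    upper_teeth_exposed: float  # exposed size of the upper teeth (53)
    lower_teeth_exposed: float  # exposed size of the lower teeth (54)


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero.

    The 0.0 fallback (e.g. for a closed mouth) is an assumption of this sketch.
    """
    return numerator / denominator if denominator else 0.0


def derived_parameters(m: MouthMeasurements) -> dict:
    """Compute the two ratio parameters from the raw measurements."""
    return {
        "lip_ratio": safe_ratio(m.lip_open_width, m.lip_open_height),
        "teeth_ratio": safe_ratio(m.upper_teeth_exposed, m.lower_teeth_exposed),
    }


frame = MouthMeasurements(40.0, 20.0, 6.0, 3.0)
print(derived_parameters(frame))
```

Evaluating these values frame by frame would give the time course of each parameter that the comparison step works on.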

FIGS. 4A to 4C show typical lip image signals.

FIGS. 5A to 5C show the corresponding tooth and lip image signals, including the tongue.

FIG. 4A shows the lip pattern for the pronunciation "a" ("아"), FIG. 4B shows the lip pattern for "e" ("에"), and FIG. 4C shows the lip pattern for "i" ("이"). FIGS. 4A to 4C illustrate the ambiguity of the lips well: with lip images alone, it is hard to distinguish the pronunciations "a", "e", and "i" clearly. FIGS. 5A to 5C are the tooth and lip images, including the tongue, corresponding to FIGS. 4A to 4C, and show that the pronunciations can be distinguished more clearly using the tooth-lip parameters.

FIG. 6 shows an embodiment of how the tooth-lip parameters change over time when the speaker pronounces "a-e-i-o-u" ("아에이오우").

FIG. 7 shows an embodiment of the speaker's voice parameters and tooth-lip parameters stored in the storage device (16) during the learning process. Here $i$ is the index of the storage address and $N$ is the last address index of the storage device. In this embodiment, the first address stores the voice parameters and tooth-lip parameters corresponding to "TV on" ("TV켜"), and the second address stores the voice parameters and tooth-lip parameters corresponding to "TV off" ("TV꺼").

The following is an embodiment of how, during the recognition process, the speaker's tooth-lip parameters and voice parameters are compared with the tooth-lip parameters and voice parameters stored in the storage device (16) during the learning process, and the utterance spoken by the speaker is found on the basis of similarity.

The similarity $S_i$ between the parameters is calculated from a combination of distances $D_{i,k}$, and each distance $D_{i,k}$ can be calculated as the sum of the absolute values of the differences between the speaker's parameter and the parameter stored in the storage device (16), as follows:

$$D_{i,k} = \sum_{t=0}^{T} \left| P_k(t) - Q_{i,k}(t) \right|$$

Here, $i$ is the index of the storage address, $t$ denotes time, and $T$ is the total comparison interval of the parameters. $P_k(t)$ denotes the speaker's tooth-lip parameter or voice parameter as a function of time. It represents the change over time of the speaker's: lip contour shape; lip opening area (49); horizontal lip opening size (50) and vertical lip opening size (51); ratio of the horizontal lip opening size (50) to the vertical lip opening size (51); interdental height (52) between the upper teeth (32) and the lower teeth (34); exposed size of the upper teeth (53) and exposed size of the lower teeth (54); ratio of the exposed size of the upper teeth (53) to the exposed size of the lower teeth (54); number of exposed upper teeth; number of exposed lower teeth; whether the tongue (36) is exposed and the exposed size of the tongue (36); and vertical position and front-back position of the tongue (36).

$Q_{i,k}(t)$ denotes the tooth-lip parameter or voice parameter stored in the storage device (16) during the learning process, as a function of time. It represents the change over time of the following quantities stored in the storage device (16) during the learning process: lip contour shape; lip opening area (49); horizontal lip opening size (50) and vertical lip opening size (51); ratio of the horizontal lip opening size (50) to the vertical lip opening size (51); interdental height (52) between the upper teeth (32) and the lower teeth (34); exposed size of the upper teeth (53) and exposed size of the lower teeth (54); ratio of the exposed size of the upper teeth (53) to the exposed size of the lower teeth (54); number of exposed upper teeth; number of exposed lower teeth; whether the tongue (36) is exposed; exposed size of the tongue (36); and vertical position and front-back position of the tongue (36).

$k$ is an index identifying the tooth-lip parameters and the voice parameter, for example as follows.

$k = 0$: voice parameter

$k = 1$: contour shape of the lips

$k = 2$: opening area of the lips

$k = 3$: horizontal lip opening size

$k = 4$: vertical lip opening size

$k = 5$: ratio of the horizontal lip opening size to the vertical lip opening size

$k = 6$: interdental height between the upper and lower teeth

$k = 7$: exposed size of the upper teeth

$k = 8$: exposed size of the lower teeth

$k = 9$: ratio of the exposed size of the upper teeth to the exposed size of the lower teeth

$k = 10$: number of exposed upper teeth

$k = 11$: number of exposed lower teeth

$k = 12$: whether the tongue is exposed

$k = 13$: exposed size of the tongue

$k = 14$: vertical position of the tongue

$k = 15$: front-back position of the tongue
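The indexed parameter list and the absolute-difference distance described above can be sketched as follows. This is only an illustration under stated assumptions: each parameter is treated as a scalar sampled at the same instants in both sequences (the source does not say how the lip contour shape is reduced to numbers), and the names `Param` and `distance` are hypothetical.

```python
from enum import IntEnum
from typing import Sequence


class Param(IntEnum):
    """Parameter indices 0 to 15 as listed above (member names are hypothetical)."""
    VOICE = 0
    LIP_CONTOUR = 1
    LIP_OPEN_AREA = 2
    LIP_OPEN_WIDTH = 3
    LIP_OPEN_HEIGHT = 4
    LIP_OPEN_RATIO = 5
    INTERDENTAL_HEIGHT = 6
    UPPER_TEETH_EXPOSED = 7
    LOWER_TEETH_EXPOSED = 8
    TEETH_EXPOSED_RATIO = 9
    UPPER_TEETH_COUNT = 10
    LOWER_TEETH_COUNT = 11
    TONGUE_VISIBLE = 12
    TONGUE_EXPOSED_SIZE = 13
    TONGUE_VERTICAL_POS = 14
    TONGUE_FRONT_BACK_POS = 15


def distance(spoken: Sequence[float], stored: Sequence[float]) -> float:
    """Sum of absolute differences between two equally sampled parameter tracks."""
    if len(spoken) != len(stored):
        raise ValueError("tracks must cover the same comparison interval")
    return sum(abs(p - q) for p, q in zip(spoken, stored))


# Interdental height over four frames: spoken now vs. stored during learning.
spoken_height = [0.0, 2.0, 5.0, 3.0]
stored_height = [0.0, 3.0, 4.0, 3.0]
print(Param.INTERDENTAL_HEIGHT.value, distance(spoken_height, stored_height))
```

Requiring equal-length tracks is a simplification; utterances of different duration would need some form of time alignment, which the source does not describe.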

The similarity $S_i$ can be calculated as the weighted sum of the distances $D_{i,k}$, as follows:

$$S_i = \sum_{k} W_k \, D_{i,k}$$

Here $W_k$ is the weight applied to $D_{i,k}$, a variable that determines the importance of $D_{i,k}$ in the calculation of $S_i$.

In one embodiment, $S_i$ is calculated with $W_0$ kept large in a low-noise environment, to increase the contribution of the voice parameter, and with $W_0$ set small in a high-noise environment, to reduce the contribution of the voice parameter.

The smaller the value of $S_i$, the greater the similarity. The speech recognition process is therefore completed by comparing in turn with all the tooth-lip parameters and voice parameters stored in the storage device (16), calculating $S_i$ for each, and finding the index $i$ for which $S_i$ is smallest.

The smallest $S_i$ must also be no greater than a certain threshold.
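The weighted sum, the noise-dependent voice weight, and the threshold test described above can be sketched as follows. The function names, the noise cutoff, the concrete weight values, and the two-parameter example (voice plus one visual parameter) are all assumptions chosen for illustration; the source gives no numeric values.

```python
from typing import Dict, List, Optional


def voice_weight(noise_level: float, noisy_above: float = 0.5) -> float:
    """Large voice weight in quiet surroundings, small in noisy ones (illustrative values)."""
    return 0.2 if noise_level > noisy_above else 2.0


def similarity(distances: List[float], weights: List[float]) -> float:
    """Weighted sum of per-parameter distances; a smaller value means more similar."""
    return sum(w * d for w, d in zip(weights, distances))


def recognize(per_address: Dict[int, List[float]],
              weights: List[float],
              threshold: float) -> Optional[int]:
    """Return the storage address with the smallest score, or None if it exceeds the threshold."""
    if not per_address:
        return None
    scores = {addr: similarity(d, weights) for addr, d in per_address.items()}
    best = min(scores, key=lambda addr: scores[addr])
    return best if scores[best] <= threshold else None


# Distances [voice, visual] of the spoken input to the entries at addresses 0 and 1.
per_address = {0: [1.0, 4.0], 1: [3.0, 0.5]}
quiet = [voice_weight(0.1), 1.0]
noisy = [voice_weight(0.9), 1.0]
print(recognize(per_address, quiet, threshold=10.0))
print(recognize(per_address, noisy, threshold=10.0))
```

In this toy input the voice distance favors address 0 and the visual distance favors address 1, so lowering the voice weight in noise changes which entry wins.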

FIG. 8 is a flowchart of the speech recognition method using tooth and lip images.

Step 80 determines whether a voice signal is present at the voice input unit (11). If so, tooth-lip parameters are extracted from the tooth and lip image, including the tongue, from the image input unit (10) (step 81), and voice parameters are extracted from the voice signal from the voice input unit (11) (step 82). It is then determined whether the system is in the learning process (mode) (step 83). In the learning mode, the tooth-lip parameters and voice parameters are stored in the storage device (16). If step 83 finds that the system is not in the learning mode, that is, it is in the recognition mode (process), the address index $i$ is set to an initial value of zero for comparison with the parameters stored in the storage device (16) (step 85). The previously learned tooth-lip parameters and voice parameters are then read from the storage device (16) at address index $i$ (step 86), the distances $D_{i,k}$ are calculated (step 87), and, for the calculation of the similarity $S_i$, the surrounding environment is monitored (checked) and the weight values $W_k$ are set (steps 90 and 91). The similarity $S_i$ is then calculated with the set weight values. Next, the address index $i$ is incremented to the next index value for comparison with the parameters stored at the next address (step 93), and steps 86 to 93 are repeated.

Step 94 determines whether the address index $i$ is the last one. If it is, the index $i$ corresponding to the smallest of the $S_i$ values obtained during the above steps is found, completing the speech recognition process (step 95). The driving unit corresponding to the found index $i$ is then driven to execute the associated voice command (step 96). The process then returns to step 80, which checks for voice input, to recognize the next utterance.
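The control flow of the flowchart (learning branch, scan over stored addresses, selection of the smallest score, driver activation) can be sketched as follows. It is a simplified illustration: parameters are flattened into one list of numbers, the scoring function is passed in rather than fixed, and a threshold check is included; `handle_utterance`, `l1`, and the driver callbacks are hypothetical names, while the "TV on" and "TV off" commands are the examples given for FIG. 7.

```python
from typing import Callable, Dict, List, Optional, Tuple

Params = List[float]              # hypothetical flat parameter vector
Store = List[Tuple[str, Params]]  # address index -> (command, stored parameters)


def handle_utterance(params: Params, learning: bool, command: Optional[str],
                     store: Store, score: Callable[[Params, Params], float],
                     threshold: float,
                     drivers: Dict[str, Callable[[], None]]) -> Optional[str]:
    """Process one utterance whose parameters were already extracted (steps 81, 82)."""
    if learning:                                  # step 83: learning mode
        if command is None:
            raise ValueError("learning mode needs a command label")
        store.append((command, params))           # store the parameters
        return None
    best_index: Optional[int] = None
    best_score = float("inf")
    for i, (_, stored) in enumerate(store):       # steps 85, 86, 93, 94: scan addresses
        s = score(params, stored)                 # distance, weights, similarity
        if s < best_score:
            best_index, best_score = i, s
    if best_index is None or best_score > threshold:
        return None                               # nothing stored, or no close match
    recognized = store[best_index][0]             # step 95: smallest score wins
    drivers[recognized]()                         # step 96: drive the matching load
    return recognized


def l1(a: Params, b: Params) -> float:
    """Unweighted sum of absolute differences, used here as a stand-in score."""
    return sum(abs(x - y) for x, y in zip(a, b))


store: Store = []
switched: List[str] = []
drivers = {"TV on": lambda: switched.append("on"),
           "TV off": lambda: switched.append("off")}
handle_utterance([1.0, 2.0], True, "TV on", store, l1, 1.0, drivers)
handle_utterance([5.0, 5.0], True, "TV off", store, l1, 1.0, drivers)
result = handle_utterance([1.1, 2.1], False, None, store, l1, 1.0, drivers)
print(result, switched)
```

Passing the score function as an argument keeps the loop independent of how distances and weights are combined.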

FIG. 9 shows an embodiment in which the speech recognition apparatus and method using tooth and lip images of the present invention are implemented as a semiconductor chip (100).

FIG. 9 shows a speech recognition semiconductor chip using tooth and lip images that has: power terminals (pin1, pin2) for supplying power to the semiconductor chip; at least one driving unit (18, 19, 20) for driving loads connected outside the semiconductor chip (100); output terminals (pin5, pin6, pin7) for connecting the driving unit (18, 19, 20) to the external loads; an input terminal (pin3) for supplying to the semiconductor chip (100) the tooth and lip image signal, including the speaker's tongue, generated by the image input unit (10) or A/D converted; an input terminal (pin4) for supplying to the semiconductor chip the voice signal generated by the voice input unit (11) or A/D converted; a signal processor (14) for extracting tooth-lip parameters and voice parameters for each phoneme, word, or sentence from the tooth and lip image signal, including the tongue, and the voice signal; an interface terminal (88) for an external storage device (16) that stores the tooth-lip parameters and voice parameters during the learning process; and a controller (15), built into the single semiconductor chip, that stores the tooth-lip parameters and voice parameters in the external storage device (16) during the learning process and, during the recognition process, compares the speaker's tooth-lip parameters and voice parameters with those stored in the storage device (16) during the learning process to find the similar phoneme, word, or sentence, controls the driving unit (18, 19, 20) corresponding to the phoneme, word, or sentence found, and supplies the control signals required by each unit.

In yet another embodiment of the present invention, the image input unit (10) includes a see-through device such as an X-ray apparatus, an MRI (magnetic resonance imaging) apparatus, or an infrared thermography apparatus. By combining image information that shows all or part of the structure of the speaker's oral cavity with the voice information from the voice input unit (11), this embodiment provides a voice recognition method and apparatus capable of achieving an even higher recognition rate.

FIG. 10 shows an example image of the speaker's oral structure obtained by the see-through device.

The various parts of the oral cavity shown in FIG. 10 can be used as image information for voice recognition. In this embodiment, the signal processor (14) extracts oral parameters, based on the image of the speaker's oral structure, for each phoneme, word, or sentence. During the preliminary learning process, the oral parameters are stored in the storage device (16).

During the recognition process, the control unit (15) compares the speaker's oral parameters with the oral parameters stored in the storage device (16) during the learning process, finds the similar phoneme, word, or sentence, and controls the driving units (18, 19, 20) in response to the phoneme, word, or sentence found.
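
The learning and recognition steps described above can be illustrated with a minimal sketch. It assumes that each phoneme, word, or sentence is reduced to a fixed-length parameter vector and that "similar" means smallest Euclidean distance; the description does not fix either choice. The class name, the example vectors, and the mapping from words to driving units are hypothetical.

```python
import math


class TemplateRecognizer:
    """Stores labelled parameter vectors (learning) and finds the closest one (recognition)."""

    def __init__(self):
        # List of (label, vector) pairs; stands in for the storage device (16).
        self.templates = []

    def learn(self, label, params):
        """Learning process: store the extracted parameters under a label."""
        self.templates.append((label, list(params)))

    def recognize(self, params):
        """Recognition process: return the label of the most similar stored vector."""
        if not self.templates:
            raise ValueError("no templates stored; run the learning process first")
        label, _ = min(self.templates, key=lambda t: math.dist(t[1], params))
        return label


# Hypothetical mapping from a recognized word to the driving unit it activates.
DRIVER_FOR_WORD = {"on": 18, "off": 19}

recognizer = TemplateRecognizer()
recognizer.learn("on", [0.8, 0.3, 0.5])    # made-up parameter values
recognizer.learn("off", [0.2, 0.6, 0.1])

word = recognizer.recognize([0.75, 0.35, 0.45])
print(word, DRIVER_FOR_WORD[word])
```

A real system would compare time series rather than single vectors, for example with dynamic time warping or a statistical model, but the store-then-compare structure is the same.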

In another aspect of the present invention, the dental lip parameters can be used during the learning process as a training set, serving as input parameters of a hidden Markov model (HMM), and voice recognition is then performed by the trained HMM.
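
As a rough illustration of recognition with trained HMMs, the sketch below scores a sequence of quantized lip-opening observations against one discrete HMM per word using the forward algorithm and picks the most likely word. The description does not specify the model topology, the observation alphabet, or the training procedure; the two-state models and all probabilities here are made up for the example, and training (for instance by Baum-Welch) is omitted.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability of an observation sequence under a discrete HMM (forward algorithm)."""
    n_states = len(start)
    # Initialization: probability of starting in each state and emitting the first symbol.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    # Induction: propagate through the transitions, then emit the next symbol.
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


# Observation symbols (assumed): 0 = lips closed, 1 = half open, 2 = wide open.
# Hypothetical, hand-set two-state left-to-right models, one per word.
MODELS = {
    "ah": dict(start=[1.0, 0.0],
               trans=[[0.6, 0.4], [0.0, 1.0]],
               emit=[[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]),  # mostly closed, then wide
    "mm": dict(start=[1.0, 0.0],
               trans=[[0.6, 0.4], [0.0, 1.0]],
               emit=[[0.2, 0.6, 0.2], [0.8, 0.1, 0.1]]),  # mostly half open, then closed
}


def recognize(obs):
    """Return the word whose model assigns the highest likelihood to the sequence."""
    return max(MODELS, key=lambda word: forward_likelihood(obs, **MODELS[word]))


print(recognize([0, 0, 2, 2]))
print(recognize([1, 0, 0, 0]))
```

For long sequences the forward probabilities underflow, so practical implementations work with scaled or log-domain values.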

In another aspect of the present invention, the dental lip parameters can be used during the learning process as a training set, serving as input parameters of a neural network, and voice recognition is then performed by the trained neural network.

Although the present invention has been described with reference to specific embodiments and examples, it is not intended to be limited to them. Those skilled in the art will recognize that various modifications and changes are possible in addition to those described here. Such modifications also fall within the scope of the appended claims.

As described above, the present invention provides a voice recognition apparatus and method using images of the speaker's teeth and lips and, in particular, a voice recognition apparatus and method that achieve a high recognition rate in high-noise as well as low-noise environments.

The invention also provides a voice recognition method and apparatus that achieve an even higher recognition rate by combining the speaker's voice information with image information of the teeth and lips.

In yet another aspect, the invention provides a voice recognition method and apparatus that achieve an even higher recognition rate by combining the speaker's voice information with image information, obtained by a see-through device such as an X-ray apparatus or an infrared thermography apparatus, that shows all or part of the structure of the speaker's oral cavity.

Claims

A voice recognition device using tooth and lip images, comprising:

An image input unit for capturing an image of the speaker's teeth and lips and converting it into an electrical image signal;

A driving unit for driving at least one load;

A signal processor for extracting a dental lip parameter based on the speaker's teeth and lips image signal;

A storage device for storing the dental lip parameters during a learning process;

A control unit that stores the dental lip parameters in the storage device during the learning process; that, during the recognition process, finds similar dental lip parameters by comparing the speaker's dental lip parameters with the dental lip parameters stored in the storage device during the learning process; and that controls the driving unit in response to the similar dental lip parameters found and provides the control signals required by each unit.

A voice recognition device using tooth and lip images, comprising:

An image input unit for capturing an image of the speaker's teeth and lips, including the tongue, and converting it into an electrical image signal;

A driving unit for driving at least one load;

A signal processor for extracting a dental lip parameter based on the image signal of the speaker's teeth and lips, including the tongue;

A control unit that stores the dental lip parameter in the storage device during the learning process; that, during the recognition process, finds similar parameters by comparing the speaker's dental lip parameter with the dental lip parameters stored in the storage device during the learning process; and that controls the driving unit in response to the similar parameter found and provides the control signals required by each unit.

A voice recognition device using tooth and lip images, comprising:

A driving unit for driving at least one load;

A voice input unit having a microphone for picking up the speaker's voice signal;

A signal processor for extracting a dental lip parameter based on the image signal of the speaker's teeth and lips and extracting a voice parameter based on the voice signal;

A storage device for storing the dental lip parameters and voice parameters during a learning process;

A control unit that stores the dental lip parameters and voice parameters in the storage device during the learning process; that, during the recognition process, finds similar parameters by comparing the speaker's dental lip parameters and voice parameters with those stored in the storage device during the learning process; and that controls the driving unit in response to the similar parameters found and provides the control signals required by each unit.

A voice recognition device, comprising:

An image input unit for capturing an image of the internal oral structure of the speaker and converting it into an electrical image signal;

A driving unit for driving at least one load;

A signal processor for extracting oral parameters based on the image signal of the internal oral structure of the speaker and extracting a voice parameter based on the voice signal;

A storage device for storing the oral parameters and the voice parameters during the learning process;

A control unit that stores the oral parameters and voice parameters in the storage device during the learning process; that, during the recognition process, finds similar parameters by comparing the speaker's oral parameters and voice parameters with those stored in the storage device during the learning process; and that controls the driving unit in response to the similar parameters found and provides the control signals required by each unit.

The speech recognition apparatus according to any one of claims 1 to 4, wherein the image input unit comprises a video camera, a CCD camera apparatus, an infrared thermal imaging apparatus, an X-ray apparatus, or a magnetic resonance imaging apparatus.

The apparatus of claim 1, wherein the dental lip parameter is extracted by phonemes, words, or sentences.

The speech recognition apparatus of claim 4, wherein the oral parameter is extracted for each phoneme, word, or sentence.

The voice recognition device using tooth and lip images of claim 1 or 3, wherein the dental lip parameter is a parameter selected from among:

the change over time of the shape of the lip contour, the change over time of the lip opening area, the change over time of the horizontal lip opening size, the change over time of the vertical lip opening size, the change over time of the ratio of the horizontal lip opening size to the vertical lip opening size, the change over time of the interdental height between the upper and lower teeth, the change over time of the exposed size of the upper teeth, the change over time of the exposed size of the lower teeth, the change over time of the ratio of the exposed size of the upper teeth to the exposed size of the lower teeth, the change over time of the number of exposed upper teeth, and the change over time of the number of exposed lower teeth.

The voice recognition device using tooth and lip images of claim 2, wherein the dental lip parameter is a parameter selected from among:

the change over time of the shape of the lip contour, the change over time of the lip opening area, the change over time of the horizontal lip opening size, the change over time of the vertical lip opening size, the change over time of the ratio of the horizontal lip opening size to the vertical lip opening size, the change over time of the interdental height between the upper and lower teeth, the change over time of the exposed size of the upper teeth, the change over time of the exposed size of the lower teeth, the change over time of the ratio of the exposed size of the upper teeth to the exposed size of the lower teeth, the change over time of the number of exposed upper teeth, the change over time of the number of exposed lower teeth, the change over time of whether the tongue is exposed, the change over time of the exposed size of the tongue, the change over time of the vertical position of the tongue, and the change over time of the front-rear position of the tongue.

A voice recognition semiconductor chip using tooth and lip images, comprising:

A power supply terminal for supplying power to the semiconductor chip;

At least one driver for driving a load connected to the outside of the semiconductor chip;

An output terminal for connecting the driving unit and an external load;

An input terminal for providing the speaker's tooth and lip image signal to the semiconductor chip;

An input terminal for providing a speaker's voice signal to the semiconductor chip;

A signal processor for extracting a dental lip parameter and a voice parameter by phonemes, words, or sentences from the tooth and lip image signals and voice signals;

An interface terminal for connecting the semiconductor chip with an external storage device for storing the dental lip parameters and voice parameters during a learning process;

A control unit that stores the dental lip parameters and voice parameters in the external storage device during the learning process; that, during the recognition process, finds a similar phoneme, word, or sentence by comparing the speaker's dental lip parameters and voice parameters with those stored in the external storage device during the learning process; and that controls the driving unit in response to the similar phoneme, word, or sentence found and provides the control signals required by each part; wherein the driving unit and the signal processor are embedded in a single semiconductor chip.

A speech recognition method using tooth and lip image signals, comprising:

(A) capturing an image of the speaker's teeth and lips and converting it into an electrical image signal;

(B) picking up the speaker's voice signal;

(C) extracting dental lip parameters based on the image signal of the speaker's teeth and lips;

(D) extracting a voice parameter based on the speaker's voice signal;

(E) storing the dental lip parameters and voice parameters in a storage device during a learning process;

(F) reading, during the recognition process, the dental lip parameters and voice parameters that were stored in the storage device during the learning process;

(G) finding similar parameters, during the recognition process, by comparing the parameters read from the storage device with the speaker's parameters; and

(H) controlling the driving unit in response to the similar parameters found.

The voice recognition device of claim 3 or 4, wherein, in finding similar parameters during the recognition process, the weight and importance given to the image-signal-based parameters and to the voice-signal-based parameters are varied according to the noise environment.

The voice recognition device according to claim 1, 2, 3, or 4, wherein the dental lip parameters or oral parameters are used during the learning process as a training set, serving as input parameters of a hidden Markov model (HMM).

The speech recognition device according to claim 1, 2, 3, or 4, wherein the dental lip parameters or oral parameters are used during the learning process as a training set, serving as input parameters of a neural network.
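
For illustration only, the noise-dependent weighting of image-based and voice-based parameters recited in the claim dependent on claims 3 and 4 could be sketched as follows. The text states only that the weights vary with the noise environment; the linear ramp over an estimated signal-to-noise ratio, the 0 dB and 30 dB limits, the function names, and the template values are all assumptions made for this example.

```python
import math


def visual_weight(snr_db, low_db=0.0, high_db=30.0):
    """Weight of the image-based distance: 1.0 at or below low_db, 0.0 at or above high_db."""
    clipped = min(max(snr_db, low_db), high_db)
    return 1.0 - (clipped - low_db) / (high_db - low_db)


def fused_distance(lip_in, voice_in, lip_ref, voice_ref, snr_db):
    """Blend the image-based and voice-based distances according to the noise level."""
    w = visual_weight(snr_db)
    return w * math.dist(lip_in, lip_ref) + (1.0 - w) * math.dist(voice_in, voice_ref)


def recognize(lip_in, voice_in, templates, snr_db):
    """Return the label whose stored (lip, voice) parameters are closest after weighting."""
    def score(label):
        lip_ref, voice_ref = templates[label]
        return fused_distance(lip_in, voice_in, lip_ref, voice_ref, snr_db)
    return min(templates, key=score)


# Hypothetical stored parameters: label -> (dental lip vector, voice vector).
templates = {
    "on": ([0.8, 0.3], [1.0, 0.0]),
    "off": ([0.2, 0.6], [0.0, 1.0]),
}

lip = [0.75, 0.35]   # lip image clearly resembles "on"
voice = [0.1, 0.9]   # voice features corrupted so that they resemble "off"

print(recognize(lip, voice, templates, snr_db=0.0))   # noisy: image parameters dominate
print(recognize(lip, voice, templates, snr_db=30.0))  # quiet: voice parameters dominate
```

With this weighting the image-based parameters decide the result when the voice signal is unreliable, which matches the stated aim of keeping the recognition rate high in noisy environments.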