WO2016173132A1 - Voice recognition method and apparatus, and user equipment - Google Patents

Voice recognition method and apparatus, and user equipment

Info

Publication number
WO2016173132A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
voice
mouth type
recognition
slice
Prior art date
Application number
PCT/CN2015/084720
Other languages
English (en)
French (fr)
Inventor
颜蓓
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2016173132A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Definitions

  • the present invention relates to the field of communications, and in particular to a voice recognition method, apparatus, and user equipment.
  • the voice recognition function is a highlight of many intelligent terminals; its appeal is that it frees the user's hands, which is a great help especially in scenarios such as driving a car.
  • the speech recognition technology in the related art uses a speech-engine approach: sound is collected, sliced, and recognized, and the recognition rate depends entirely on the quality of the speech engine's algorithm.
  • the speech recognition method in the related art has the following defects: for example, when a person speaks indistinctly or with a slurred accent, the recognition rate is very low; when the user is in a noisy environment or a sharp noise occurs suddenly, for example when a car is driving on the road and a large truck whizzes past, the recognition rate of voice is also very low.
  • the present invention provides a voice recognition method, apparatus, and user equipment.
  • a speech recognition method comprising: collecting speech information and visual information associated with the speech information; and performing speech recognition based on the visual information and the speech information.
  • collecting the visual information comprises collecting mouth type performance information associated with the voice information.
  • performing voice recognition according to the mouth type performance information and the voice information comprises: recognizing, through voice recognition, the collected voice information as a primary voice instruction, wherein the primary voice instruction includes: voice slice information in units of language words, and one or more pre-selected language words corresponding to the voice slice information; determining the mouth type performance information corresponding to each piece of voice slice information in the primary voice instruction; and matching each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information, to obtain an ultimate voice instruction.
  • matching each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information to obtain the ultimate voice instruction includes: determining a lip language word corresponding to each piece of voice slice information according to the mouth type performance information corresponding to that piece and a preset lip language information library, wherein the preset lip language information library is configured to store correspondences between mouth type performance information and lip language words; and matching the lip language words and the pre-selected language words corresponding to the same piece of voice slice information.
  • matching each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information to obtain the ultimate voice instruction further includes: in the process of matching, screening the matched pre-selected language words by means of phrase matching and/or sentence association, to obtain the ultimate voice instruction.
  • a voice recognition apparatus comprising: a collection module configured to collect voice information and visual information associated with the voice information; and a voice recognition module configured to perform voice recognition based on the visual information and the voice information.
  • the collection module is configured to: collect mouth type performance information associated with voice information issued by the user.
  • the voice recognition module comprises: an identification unit configured to identify the collected voice information as a primary voice command by voice recognition, wherein the primary voice command comprises: voice slice information in units of language words, and One or more pre-selected language words corresponding to the voice slice information; the determining unit is configured to determine mouth type performance information corresponding to each voice slice information in the primary voice instruction; and the matching unit is configured to respectively according to the corresponding The mouth type performance information is matched for each of the voice slice information in each of the preselected language words to obtain an ultimate voice instruction.
  • the matching unit includes: a determining subunit, configured to determine a lip language word corresponding to each piece of voice slice information according to the mouth type performance information corresponding to that piece and the preset lip language information library, wherein the preset lip language information library is configured to store correspondences between mouth type performance information and lip language words; and a matching subunit, configured to match the lip language words and the pre-selected language words corresponding to the same piece of voice slice information.
  • the matching unit further includes: a screening subunit, configured to, in the process of matching each piece of voice slice information against its own pre-selected language words, screen the matched pre-selected language words by means of phrase matching and/or sentence association, to obtain the ultimate voice instruction.
  • a user equipment comprising: the voice recognition device described above.
  • a user equipment comprising: a microphone configured to collect voice information; a camera configured to collect visual information associated with the voice information; and a processor, connected to the camera and the microphone respectively, configured to perform voice recognition based on the visual information and the voice information.
  • the user equipment further comprises: a memory, connected to the processor, configured to store a visual information library, wherein the visual information library is configured to store a correspondence between the visual information and the visual language word.
  • the voice information and the visual information associated with the voice information are collected, and voice recognition is performed based on the visual information and the voice information; this solves the problem in the related art that voice recognition technology has a low recognition rate for voice, and improves the recognition rate of voice recognition.
  • FIG. 1 is a flow chart of a voice recognition method according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a voice recognition apparatus according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a user equipment according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a voice recognition apparatus according to a preferred embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a user equipment having a voice recognition function according to a preferred embodiment of the present invention.
  • FIG. 6 is a flow chart showing a voice recognition method according to a preferred embodiment of the present invention.
  • FIG. 7 is a flow chart showing the processing of step S607 according to a preferred embodiment of the present invention.
  • FIG. 1 is a flowchart of a voice recognition method according to an embodiment of the present invention. As shown in FIG. 1 , the process includes the following steps:
  • Step S102 collecting voice information and visual information associated with the voice information
  • Step S104 performing voice recognition based on the visual information and the voice information.
  • because visual information associated with the voice information is combined for voice recognition, the problem in the related art that voice recognition technology has a low recognition rate for voice is solved, the recognition rate of voice recognition is improved, and the user experience is enhanced.
  • the visual information in step S102 may be visual information that can be used to correct the recognition result of the voice information. For example, in some application scenarios, if it is detected that a television set exists in the current environment, the recognition rate of voice control commands related to the "TV" can be enhanced. The embodiments of the present invention are mainly described by taking lip information as an example of the visual information.
  • the mouth type performance information associated with the voice information may be collected by the graphic image acquisition system.
  • the graphic image acquisition system can be a front camera disposed on a panel of a user device (eg, a smart phone, etc.).
  • when collecting the mouth type performance information, it is required to be collected synchronously with the voice information, so that the corresponding mouth type performance information can assist the subsequent recognition processing of the voice information.
  • voice recognition may be performed in the following manner: the voice information is recognized as a primary voice instruction through voice recognition, wherein the primary voice instruction includes: voice slice information in units of language words, and one or more pre-selected language words corresponding to the voice slice information; the mouth type performance information corresponding to each piece of voice slice information in the primary voice instruction is determined; and each piece of voice slice information is matched against its own pre-selected language words according to the corresponding mouth type performance information, to obtain the ultimate voice instruction.
  • in this manner, an ordinary voice recognition method first matches one or more pre-selected language words for each piece of voice slice information, and then, according to the corresponding mouth type performance information, precisely matches the appropriate language word for each piece of voice slice information or eliminates inappropriate language words. In this way, a method of improving the accuracy of speech recognition by combining mouth type performance information is provided; a minimal sketch of this matching step follows below.
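  • As a rough illustration (not part of the patent text), the matching step can be sketched in Python as follows; the candidate sets and the refine_command helper are hypothetical, and a real engine would rank candidates by score rather than take a plain set intersection:

```python
# Each audio slice yields pre-selected words from the speech engine;
# each synchronized image slice yields lip-language candidates. Intersecting
# the two candidate sets prunes words whose mouth shape cannot match.

def refine_command(speech_candidates, lip_candidates):
    """speech_candidates / lip_candidates: one set of words per slice."""
    final_words = []
    for speech_set, lip_set in zip(speech_candidates, lip_candidates):
        common = speech_set & lip_set  # constrain by mouth shape
        # Fall back to the speech engine's words when the intersection is empty.
        final_words.append(common if common else speech_set)
    return final_words

# Hypothetical two-slice command "play music".
speech = [{"play", "pray", "clay"}, {"music", "mosaic"}]
lips = [{"play", "pay"}, {"music", "muse"}]
print(refine_command(speech, lips))  # [{'play'}, {'music'}]
```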
  • the lip language word corresponding to each piece of voice slice information may be determined according to the mouth type performance information corresponding to that piece and a preset lip language information library, wherein the preset lip language information library is configured to store correspondences between mouth type performance information and lip language words; the lip language words and the pre-selected language words corresponding to the same piece of voice slice information are then matched. For example, a certain piece of mouth type performance information may correspond to multiple lip language words; by matching the lip language words with the pre-selected language words, for example by taking their intersection, most of the mismatched vocabulary can be directly eliminated, which improves the recognition rate of speech recognition.
  • the foregoing preset lip language information library may be pre-configured, may be established according to related algorithms of lip language recognition technology, or may be built up autonomously by gradually learning from mouth type performance information and voice recognition results. For example, if in one voice recognition pass the language word corresponding to a certain mouth shape is recognized as "音" (sound), then, through learning, the word "音" is added to the language words recorded for that mouth shape's performance information in the preset lip language information library. Through long-term gradual learning, the mapping information in the preset lip language information library becomes richer, and the accuracy of speech recognition improves accordingly; a rough sketch of such a self-learning library follows below.
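  • The following is a minimal sketch of how such a self-learning lip library could be kept, assuming a quantized mouth-shape key and a confidence threshold (both illustrative; the patent does not specify this implementation):

```python
from collections import defaultdict

# Maps a quantized mouth-shape descriptor to the set of language words
# confirmed for it by past recognition results.
lip_library = defaultdict(set)

def learn(mouth_shape_key, recognized_word, confidence):
    # Only commit high-confidence recognitions so that errors do not
    # pollute the correspondence table; 0.9 is an illustrative threshold.
    if confidence >= 0.9:
        lip_library[mouth_shape_key].add(recognized_word)

learn("open_round", "音", 0.95)  # the '音' (sound) example from the text
print(lip_library["open_round"])  # {'音'}
```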
  • a method of screening preceding and following language words by means of phrase matching and/or sentence association is also provided. For example, suppose one language word of a voice instruction has already been recognized fairly accurately as being pronounced "dian", with possible words such as "电" (electricity), "殿" (palace), and "垫" (pad), while the next word is pronounced something like "nao", or, where an accent is present, is recognized as something like "lao"; if the phrase-matching function is used, "电脑" (computer) can then be matched as the ultimate voice instruction, as sketched below.
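  • A hedged sketch of the phrase-matching idea above, with an illustrative three-entry phrase lexicon standing in for a real phrase or n-gram model:

```python
# Candidate characters per slice, e.g. recognized from "dian" and "nao"/"lao".
slice1 = {"电", "殿", "垫"}
slice2 = {"脑", "闹", "老"}

# A small known-phrase lexicon; a real system would use a large n-gram model.
phrases = {("电", "脑"), ("电", "视"), ("垫", "子")}

matches = [a + b for a in slice1 for b in slice2 if (a, b) in phrases]
print(matches)  # ['电脑'] -> "computer" is selected as the ultimate instruction
```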
  • a speech recognition apparatus is further provided to implement the above-mentioned embodiments and preferred embodiments.
  • the descriptions of the modules involved in the apparatus are described below.
  • the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • the apparatus includes: a collection module 22 and a voice recognition module 24, wherein the collection module 22 is configured to collect voice information and visual information associated with the voice information; and the voice recognition module 24, coupled to the collection module 22, is configured to perform voice recognition based on the visual information and the voice information.
  • the collection module 22 can be configured to collect mouth type performance information associated with the voice information.
  • the voice recognition module 24 includes: an identification unit 242 configured to identify the collected voice information as a primary voice command by voice recognition, wherein the primary voice command comprises: voice slice information in units of language words, and voice slice One or more pre-selected language words corresponding to the information; the determining unit 244 is coupled to the identifying unit 242, configured to determine the mouth type performance information corresponding to each of the voice slice information in the primary voice instruction; the matching unit 246 is coupled to the determining unit 244, It is set to match each of the voice slice information in each of the pre-selected language words according to the corresponding mouth shape performance information, respectively, to obtain the ultimate voice instruction.
  • the matching unit 246 includes: a determining subunit 2462, configured to determine the lip language word corresponding to each piece of voice slice information according to the mouth type performance information corresponding to that piece and the preset lip language information library, wherein the preset lip language information library is configured to store correspondences between mouth type performance information and lip language words; and a matching subunit 2464, coupled to the determining subunit 2462, configured to match the lip language words and the pre-selected language words corresponding to the same piece of voice slice information.
  • the matching unit 246 further includes: a screening subunit 2466, coupled to the matching subunit 2464, configured to, in the process of matching each piece of voice slice information against its own pre-selected language words, screen the matched pre-selected language words by means of phrase matching and/or sentence association, to obtain the ultimate voice instruction.
  • the embodiment of the invention further provides a user equipment, comprising: the above voice recognition device.
  • the embodiment of the invention further provides a user equipment, which is used to implement the above voice recognition method.
  • the user equipment includes, but is not limited to, user equipment such as smart phones and smart tablets.
  • FIG. 3 is a schematic structural diagram of a user equipment according to an embodiment of the present invention.
  • the apparatus includes: a camera 32, a microphone 34, and a processor 36.
  • the camera 32 is coupled to the processor 36 and configured to collect the visual information associated with the voice information;
  • a microphone 34 coupled to the processor 36 for collecting voice information;
  • a processor 36 configured to perform voice recognition based on the visual information and the voice information.
  • the user equipment further includes a memory 38 connected to the processor 36 and configured to store a visual information library, wherein the visual information library is configured to store a correspondence between the visual information and the visual language word.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • a preferred embodiment of the present invention provides a mobile phone device capable of improving a speech recognition rate, and the device relates to the field of speech recognition of a terminal, and is applicable to a wireless terminal having a speech recognition function.
  • this collection device can also directly reuse the front camera to collect the speaker's mouth type information (equivalent to the above-mentioned mouth type performance information), recognize and process it into lip language, and store it in a certain memory area. This information is used to correct, once more, the recognition result of the synchronously collected user voice information, because human pronunciation may have an infinite number of varieties, whereas the lip actions (i.e., mouth shapes) when speaking are of only a limited number of kinds; using a limited kind of information to constrain an infinite kind of information filters out much useless information.
  • Visual information can help smart products recognize more of the language words they hear; at certain noise levels, the improvement can be as much as several-fold.
  • the intelligent system of the terminal can obtain certain visual information from the movement of the speaker's face and lips by means of visual lip language. Under different noise levels, the movement information of the face and the lips can help improve the recognition rate.
  • the combination of the camera and the lip recognition algorithm is used to achieve the effect of improving the speech recognition rate.
  • This approach is of great help for correcting recognition errors caused by inaccurate pronunciation, for dialects with the same mouth shapes but different pronunciations, and for speech recognition in a noisy environment.
  • the combination of visual and auditory will greatly enhance the speech recognition rate, so as to enhance the user experience.
  • FIG. 4 is a schematic structural diagram of a voice recognition apparatus according to a preferred embodiment of the present invention, which is a variant of FIG. 2 or FIG. 3. As shown in FIG. 4, the entire system is divided into four parts:
  • the mouth type information collecting module 401 (corresponding to the camera 32 described above): its function is to collect the user's mouth shape and transmit it to the baseband processing module 403 for subsequent lip language recognition and analysis;
  • the voice information collecting module 402 (corresponding to the microphone 34 described above): its function is to collect the user's voice during the call, and it also collects the surrounding background noise; this module needs to work in synchronization with the mouth type information collecting module 401, and the collected data is also transmitted to the baseband processing module 403 to be processed in parallel together with the data produced simultaneously by the mouth type information collecting module 401;
  • the baseband processing module 403 (corresponding to the processor 36 described above): the function of this module is to process and analyze the mouth type information produced by the mouth type information collecting module 401 and finally recognize it as lip language; this module also processes the voice information data sent at the same time by the voice information collecting module 402. The recognition and analysis result of the mouth type information and the voice information obtained by the voice information collecting module 402 corroborate and correct each other; through a first recognition and a second correction of the recognized information, a user instruction statement with a higher accuracy rate is obtained;
  • the instruction action generation module 404: receives the user instruction statement processed by the baseband processing module 403 and, according to the instruction, performs the smart terminal's various operations in response to the user.
  • FIG. 5 is a schematic structural diagram of a user equipment having a voice recognition function according to a preferred embodiment of the present invention
  • FIG. 5 is a modification of FIGS. 2 to 4.
  • the main mic and the camera are installed on the front of the mobile phone.
  • the main mic can also be installed on the lower right side of the mobile phone, as long as it is as close to the mouth as possible; the camera can also be multiplexed directly with the front camera, as long as the mouth type information can be clearly captured. If the front camera is used directly, the space of the mobile phone layout is greatly saved, and the production cost is greatly reduced.
  • a camera is employed as the mouth type information collecting device.
  • the camera and its accessory circuit 501: its function is to capture the user's mouth type information and transmit the captured content to the image data memory 504 in the baseband processing main chip 503, in preparation for subsequent recognition and analysis;
  • the main microphone and its accessory circuit 502: its function is to collect the user's voice during the call, and it also collects the surrounding background noise; the collected audio data is likewise transmitted to the audio data memory 505 of the baseband processing main chip 503, to be processed together with the image data produced by the camera and its accessory circuit 501;
  • the baseband processing main chip 503: its function is to process and analyze the image data in the image data memory 504, slicing the image stream data and recognizing the content in each small slice as lip language (the lip language recognition technology here may be implemented by existing techniques in the related art); at the same time, it also performs slice recognition processing on the noise-contaminated voice audio data in the audio data memory 505.
  • the word range of speech recognition and the word range of lip recognition are intersected, or processed with more complicated algorithms, to find the common words and exclude some uncertain words, thereby improving the recognition rate. Since speech recognition has context-linking methods, lip recognition can likewise be processed in conjunction with preceding and following slices.
  • the baseband processing main chip 503 also completes the operations according to the user instruction statement finally obtained.
  • Image data memory 504 is configured to store an image data stream produced by the camera and its associated circuitry 501.
  • Audio data memory 505 is configured to store the audio data stream generated by the primary microphone and its accessory circuit 502.
  • FIG. 6 is a schematic flowchart of a voice recognition method according to a preferred embodiment of the present invention. As shown in FIG. 6, the process includes the following steps:
  • Step S602: determining whether the voice recognition function has started; if it has started, proceed to step S603;
  • Step S603: The camera and its accessory circuit 501 start to work, and continuously collect image data of the user's mouth type information.
  • Step S604: The main microphone and its accessory circuit 502 start synchronous operation and continuously collect audio data, where the audio data includes the user's instruction voice component and a surrounding background noise component.
  • Step S605: The image stream data collected by the camera and its accessory circuit 501 is stored in the image data memory 504.
  • Step S606: The audio information data collected by the primary microphone and its accessory circuit 502 is stored in the audio data memory 505.
  • Step S607: The baseband processing main chip 503 performs synchronized slice analysis on the image data memory 504 and the audio data memory 505; the image data from the camera and the voice data from the microphone are analyzed and processed synchronously.
  • if all possible word ranges are obtained from the image data in image slice N according to the lip language recognition algorithm, all possible word ranges are likewise obtained from the voice information of the corresponding audio slice N using the voice recognition algorithm; the lip recognition words and the voice recognition words of corresponding slices are mutually corrected and intersected to eliminate impossible words, and the recognized words of preceding and following slices can also be linked, to finally obtain a user instruction with a higher accuracy rate.
  • the acquisition and storage of the image stream and the audio stream need to be synchronized, and a synchronization baseline is required. The image data and the audio data are sliced starting from the baseline, and the slices also need to be synchronized, for example one slice every 0.3 seconds (since a person's average speech rate is about 180 words per minute); both the image data and the audio data must then be sliced synchronously at this length, as sketched below.
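  • A sketch of the synchronized slicing under assumed capture parameters (the 0.3 s slice length comes from the text; the sample rate, frame rate, and slice_streams helper are illustrative):

```python
SLICE_SEC = 0.3      # one slice every 0.3 s (~180 words per minute)
SAMPLE_RATE = 16000  # assumed audio sample rate, samples per second
FPS = 30             # assumed camera frame rate, frames per second

def slice_streams(audio_samples, video_frames):
    """Cut both streams, from the same baseline, into aligned 0.3 s slices."""
    a_step = int(SLICE_SEC * SAMPLE_RATE)  # 4800 samples per audio slice
    v_step = int(SLICE_SEC * FPS)          # 9 frames per image slice
    n = min(len(audio_samples) // a_step, len(video_frames) // v_step)
    audio_slices = [audio_samples[i * a_step:(i + 1) * a_step] for i in range(n)]
    image_slices = [video_frames[i * v_step:(i + 1) * v_step] for i in range(n)]
    return audio_slices, image_slices  # slice i of each stream is synchronized
```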
  • the following example describes in more detail how step S607 is performed; the detailed flowchart of step S607 is shown in FIG. 7:
  • the first slice of the image data is defined as S1 and the first slice of the audio data as Y1; by analogy, the nth slice of the image data is defined as Sn and the nth slice of the audio data as Yn.
  • Step S701: When the user is driving and using the driving assistant function, the user issues the instruction "play music" ("播放音乐"), but at the same time a large truck whizzes past; the system, based on 0.3-second slices (step S701), obtains four audio slices, Y1, Y2, Y3, and Y4, and stores them in the audio data memory (step S702).
  • at the same time, the front camera collects four pieces of mouth type information, namely S1, S2, S3, and S4, and stores them in the image data memory (step S702).
  • the baseband processing chip performs speech recognition processing on Y1, Y2, Y3, and Y4, that is, a process of converting audio information into text information.
  • the image baseband processing chip performs lip language recognition processing on S1, S2, S3, and S4, that is, a process of converting the mouth type information into text information (step S703).
  • take the character "播" ("broadcast"), pronounced "bo", as an example: the characters whose mouth shape can produce this pronunciation are the characters pronounced the same as "播", such as "薄" and "伯"; there are 135 characters with this pronunciation in total.
  • this reduces the infinite number of possible texts to 135, and among the 135 there are many rare characters, seldom-used characters, characters that could not possibly serve as instructions, and so on; after these characters are eliminated (step S704), only about 10 remain.
  • the mouth information is the user's most accurate instruction: it is not affected by ambient noise, and it does not matter if the pronunciation is inaccurate, as long as the mouth shape is right. Of course, this example is the most commonly used instruction, so it succeeds more easily; for others, such as the digits of a phone number, things are more complicated, but this method can still greatly improve the recognition rate and reduce the false recognition rate. A sketch of this pruning follows below.
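  • A sketch of the pruning in this example, with a small hypothetical subset standing in for the 135 homophones and for the command lexicon:

```python
# Lip reading narrows the first slice to the homophones of "bo"
# (135 characters in the text); a small hypothetical subset is shown.
homophones_bo = {"播", "拨", "薄", "伯", "博", "卜"}

# Characters plausible as command words; rare characters are dropped (S704).
command_chars = {"播", "拨"}
candidates = homophones_bo & command_chars  # {'播', '拨'}

# The next slice's mouth shape matches "放" but not "打", so the phrase
# "拨打" (dial) is excluded and "播放" (play) survives (S706).
next_slice_lip = {"放"}
phrases = {("播", "放"), ("拨", "打")}
final = [a + b for (a, b) in phrases if a in candidates and b in next_slice_lip]
print(final)  # ['播放']
```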
  • Step S608: Perform a response operation on the smart terminal according to the user instruction obtained by the final recognition processing.
  • Step S609: determining whether the voice recognition module is turned off; if not, return to step S602;
  • Step S610: The voice recognition module is turned off, and the entire apparatus of the embodiment stops working accordingly.
  • misidentification due to inaccurate pronunciation can be corrected to some extent, and misidentification due to background noise can also be calibrated.
  • the present invention combines visual information associated with the voice information, such as lip language information and environment information, to perform speech recognition; this solves the problem in the related art that speech recognition technology has a low recognition rate for voice, improves the recognition rate of speech recognition, and enhances the user experience.
  • a storage medium is further provided, in which the above-mentioned software is stored; the storage medium includes, but is not limited to: an optical disk, a floppy disk, a hard disk, an erasable memory, and the like.
  • the modules or steps of the present invention described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by the computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module.
  • the invention is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention provides a voice recognition method and apparatus, and user equipment. The method includes: collecting voice information and visual information associated with the voice information; and performing voice recognition based on the visual information and the voice information. The present invention solves the problem in the related art that voice recognition technology has a low recognition rate for voice, and improves the recognition rate of voice recognition.

Description

Voice recognition method and apparatus, and user equipment

Technical Field

The present invention relates to the field of communications, and in particular to a voice recognition method, apparatus, and user equipment.

Background

At present, more and more intelligent terminals are on the market, and the voice recognition function is a highlight of many of them. Its appeal is that it frees the user's hands, which is a great help especially in scenarios such as driving a car.

The voice recognition technology in the related art uses a speech-engine approach: sound is collected, sliced, and recognized, and the recognition rate depends entirely on the quality of the speech engine's algorithm.

During research, the inventor found that the voice recognition method in the related art has the following defects: for example, when a person speaks indistinctly or with a slurred accent, the recognition rate is very low; when the user is in a noisy environment or a sharp noise occurs suddenly, for example when the car is driving on the road and a large truck whizzes past, the recognition rate is also very low.

For the problem in the related art that voice recognition technology has a low recognition rate for voice, no effective solution has yet been proposed.
Summary of the Invention

In order to solve the above technical problem, the present invention provides a voice recognition method, apparatus, and user equipment.

According to one aspect of the present invention, a voice recognition method is provided, including: collecting voice information and visual information associated with the voice information; and performing voice recognition based on the visual information and the voice information.

Preferably, collecting the visual information includes: collecting mouth type performance information associated with the voice information.

Preferably, in the case where the visual information is the mouth type performance information, performing voice recognition based on the mouth type performance information and the voice information includes: recognizing, through voice recognition, the collected voice information as a primary voice instruction, where the primary voice instruction includes voice slice information in units of language words and one or more pre-selected language words corresponding to each piece of voice slice information; determining the mouth type performance information corresponding to each piece of voice slice information in the primary voice instruction; and matching each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information, to obtain an ultimate voice instruction.

Preferably, matching each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information to obtain the ultimate voice instruction includes: determining a lip language word corresponding to each piece of voice slice information according to the mouth type performance information corresponding to that piece of voice slice information and a preset lip language information library, where the preset lip language information library is configured to store correspondences between mouth type performance information and lip language words; and matching the lip language words and the pre-selected language words corresponding to the same piece of voice slice information.

Preferably, matching each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information to obtain the ultimate voice instruction further includes: in the process of matching each piece of voice slice information against its own pre-selected language words, screening the matched pre-selected language words by means of phrase matching and/or sentence association, to obtain the ultimate voice instruction.

According to another aspect of the present invention, a voice recognition apparatus is further provided, including: a collection module configured to collect voice information and visual information associated with the voice information; and a voice recognition module configured to perform voice recognition based on the visual information and the voice information.

Preferably, the collection module is configured to collect mouth type performance information associated with voice information issued by the user.

Preferably, the voice recognition module includes: a recognition unit configured to recognize, through voice recognition, the collected voice information as a primary voice instruction, where the primary voice instruction includes voice slice information in units of language words and one or more pre-selected language words corresponding to each piece of voice slice information; a determining unit configured to determine the mouth type performance information corresponding to each piece of voice slice information in the primary voice instruction; and a matching unit configured to match each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information, to obtain an ultimate voice instruction.

Preferably, the matching unit includes: a determining subunit configured to determine a lip language word corresponding to each piece of voice slice information according to the mouth type performance information corresponding to that piece of voice slice information and a preset lip language information library, where the preset lip language information library is configured to store correspondences between mouth type performance information and lip language words; and a matching subunit configured to match the lip language words and the pre-selected language words corresponding to the same piece of voice slice information.

Preferably, the matching unit further includes: a screening subunit configured to, in the process of matching each piece of voice slice information against its own pre-selected language words, screen the matched pre-selected language words by means of phrase matching and/or sentence association, to obtain the ultimate voice instruction.

According to another aspect of the present invention, a user equipment is further provided, including the above voice recognition apparatus.

According to another aspect of the present invention, a user equipment is further provided, including: a microphone configured to collect voice information; a camera configured to collect visual information associated with the voice information; and a processor, connected to the camera and the microphone respectively, configured to perform voice recognition based on the visual information and the voice information.

Preferably, the user equipment further includes: a memory, connected to the processor and configured to store a visual information library, where the visual information library is configured to store correspondences between visual information and visual language words.

Through the present invention, voice information and visual information associated with the voice information are collected, and voice recognition is performed based on the visual information and the voice information. This solves the problem in the related art that voice recognition technology has a low recognition rate for voice, and improves the recognition rate of voice recognition.
Brief Description of the Drawings

The drawings described here are provided for a further understanding of the present invention and form a part of the present application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:

FIG. 1 is a flowchart of a voice recognition method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a voice recognition apparatus according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a user equipment according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a voice recognition apparatus according to a preferred embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a user equipment having a voice recognition function according to a preferred embodiment of the present invention;

FIG. 6 is a schematic flowchart of a voice recognition method according to a preferred embodiment of the present invention;

FIG. 7 is a schematic flowchart of the processing of step S607 according to a preferred embodiment of the present invention.
Detailed Description

The present invention is described in detail below with reference to the drawings and in conjunction with embodiments. It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.

Other features and advantages of the present invention will be set forth in the following description, and will in part become apparent from the description or be understood by practicing the present invention. The objectives and other advantages of the present invention can be realized and obtained through the structures particularly pointed out in the written description, the claims, and the drawings.

In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a voice recognition method. FIG. 1 is a flowchart of a voice recognition method according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:

Step S102: collecting voice information and visual information associated with the voice information;

Step S104: performing voice recognition based on the visual information and the voice information.

Through the above steps, because visual information associated with the voice information, such as lip language information and environment information, is combined for voice recognition, the problem in the related art that voice recognition technology has a low recognition rate for voice is solved, the recognition rate of voice recognition is improved, and the user experience is enhanced.

The visual information in step S102 may be visual information that can be used to correct the recognition result of the voice information. For example, in some application scenarios, if it is detected that a television set exists in the current environment, the recognition rate of voice control commands related to the "television set" can be enhanced. The embodiments of the present invention are mainly described by preferably taking lip language information as an example of the visual information.

Preferably, in the case where the visual information is mouth type performance information, in step S102 the mouth type performance information associated with the voice information may be collected by a graphic image acquisition system. The graphic image acquisition system may be a front camera disposed on the panel of a user device (for example, a smart phone). When the mouth type performance information is collected, it needs to be collected synchronously with the voice information, so that the corresponding mouth type performance information can assist the subsequent recognition processing of the voice information.

Preferably, in step S104, voice recognition may be performed in the following manner: recognizing, through voice recognition, the collected voice information as a primary voice instruction, where the primary voice instruction includes voice slice information in units of language words and one or more pre-selected language words corresponding to each piece of voice slice information; determining the mouth type performance information corresponding to each piece of voice slice information in the primary voice instruction; and matching each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information, to obtain an ultimate voice instruction. In this manner, ordinary voice recognition first matches one or more pre-selected language words for each piece of voice slice information, and then, according to the corresponding mouth type performance information, the appropriate language word is precisely matched for each piece of voice slice information, or inappropriate language words are eliminated. In this way, a method of improving the accuracy of voice recognition by combining mouth type performance information is provided.

Preferably, in the process of matching each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information, the lip language word corresponding to each piece of voice slice information may be determined according to the mouth type performance information corresponding to that piece and a preset lip language information library, where the preset lip language information library is configured to store correspondences between mouth type performance information and lip language words; the lip language words and pre-selected language words corresponding to the same piece of voice slice information are then matched. For example, a certain piece of mouth type performance information may correspond to multiple lip language words; by matching the lip language words with the pre-selected language words, for example by taking their intersection, most of the mismatched vocabulary can be directly eliminated, which improves the recognition rate of voice recognition.

It should be noted that the above preset lip language information library may be pre-configured, may be established according to related algorithms of lip language recognition technology, or may be built up autonomously by gradually learning from mouth type performance information and voice recognition results. For example, if in one voice recognition pass the language word corresponding to a certain mouth shape is recognized as "音" (sound), then, through learning, the word "音" is added to the language words recorded in the preset lip language information library as corresponding to the mouth type performance information of that mouth shape. Through long-term gradual learning, the mapping information in the preset lip language information library becomes richer, which improves the accuracy of voice recognition.

Preferably, in order to further improve the recognition rate of voice recognition, the embodiments of the present invention also provide a way of screening preceding and following language words by means of phrase matching and/or sentence association. For example, suppose one language word of a voice instruction has already been recognized fairly accurately as being pronounced "dian", with possible words such as "电" (electricity), "殿" (palace), and "垫" (pad), while the next word is pronounced something like "nao", or, where an accent is present, is recognized as something like "lao". If the phrase matching function is used, "电脑" (computer) can then be matched as the ultimate voice instruction. Similarly, with sentence association, if "电脑" (computer) has already been recognized and the fourth language word has been fairly accurately recognized as being pronounced "kai" (open), then even if the third language word sounds like "ta" or is completely slurred, sentence association can infer that its most likely recognition result is "打", thereby completing recognition of the instruction "电脑打开" (turn on the computer).
This embodiment further provides a voice recognition apparatus, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. The modules involved in the apparatus are described below. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and conceivable.

FIG. 2 is a schematic structural diagram of a voice recognition apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes a collection module 22 and a voice recognition module 24, where the collection module 22 is configured to collect voice information and visual information associated with the voice information, and the voice recognition module 24, coupled to the collection module 22, is configured to perform voice recognition based on the visual information and the voice information.

Preferably, the collection module 22 may be configured to collect mouth type performance information associated with the voice information.

Preferably, the voice recognition module 24 includes: a recognition unit 242, configured to recognize, through voice recognition, the collected voice information as a primary voice instruction, where the primary voice instruction includes voice slice information in units of language words and one or more pre-selected language words corresponding to each piece of voice slice information; a determining unit 244, coupled to the recognition unit 242, configured to determine the mouth type performance information corresponding to each piece of voice slice information in the primary voice instruction; and a matching unit 246, coupled to the determining unit 244, configured to match each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information, to obtain an ultimate voice instruction.

Preferably, the matching unit 246 includes: a determining subunit 2462, configured to determine the lip language word corresponding to each piece of voice slice information according to the mouth type performance information corresponding to that piece and the preset lip language information library, where the preset lip language information library is configured to store correspondences between mouth type performance information and lip language words; and a matching subunit 2464, coupled to the determining subunit 2462, configured to match the lip language words and the pre-selected language words corresponding to the same piece of voice slice information.

Preferably, the matching unit 246 further includes: a screening subunit 2466, coupled to the matching subunit 2464, configured to, in the process of matching each piece of voice slice information against its own pre-selected language words, screen the matched pre-selected language words by means of phrase matching and/or sentence association, to obtain the ultimate voice instruction.

An embodiment of the present invention further provides a user equipment, including the above voice recognition apparatus.

An embodiment of the present invention further provides a user equipment, which is used to implement the above voice recognition method. It should be pointed out that the user equipment includes, but is not limited to, user equipment such as smart phones and smart tablets.

FIG. 3 is a schematic structural diagram of a user equipment according to an embodiment of the present invention. As shown in FIG. 3, the device includes a camera 32, a microphone 34, and a processor 36, where the camera 32, coupled to the processor 36, is configured to collect visual information associated with the voice information; the microphone 34, coupled to the processor 36, is configured to collect voice information; and the processor 36 is configured to perform voice recognition based on the visual information and the voice information.

Preferably, the above user equipment further includes: a memory 38, connected to the processor 36 and configured to store a visual information library, where the visual information library is configured to store correspondences between visual information and visual language words.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented either in the form of hardware or in the form of a software functional unit.
To make the description of the embodiments of the present invention clearer, preferred embodiments are described and illustrated below.

A preferred embodiment of the present invention provides a mobile phone apparatus capable of improving the voice recognition rate. The apparatus relates to the field of voice recognition on terminals and is applicable to wireless terminals having a voice recognition function.

The technical solution of the preferred embodiment of the present invention is implemented as follows:

A camera is arranged near the microphone of the mobile phone (this collection device may also directly reuse the front camera) to collect the speaker's mouth type information (equivalent to the above mouth type performance information), recognize and process it into lip language, and store it in a certain memory area. This information is used to correct, once more, the recognition result of the synchronously collected user voice information, because human pronunciation may have an infinite number of varieties, whereas the lip actions (that is, mouth shapes) when speaking are of only a limited number of kinds; using a limited kind of information to constrain an infinite kind of information filters out much useless information. Visual information can help smart products recognize more of the language words they hear, and at certain noise levels this improvement can be as much as several-fold. By means of visual lip language, the intelligent system of the terminal can obtain certain visual information from the movement of the speaker's face and lips; under different noise levels, the movement information of the face and lips helps improve the recognition rate.

On the intelligent terminal, the combination of the camera and a lip language recognition algorithm is used to achieve the effect of improving the voice recognition rate. This approach is of great help for correcting recognition errors caused by inaccurate pronunciation, for dialects with the same mouth shapes but different pronunciations, and for voice recognition in noisy environments. The cooperation of vision and hearing greatly improves the voice recognition rate, thereby achieving the goal of enhancing the user experience.
In order to implement the above technical solution, an apparatus is provided in a preferred embodiment of the present invention.

FIG. 4 is a schematic structural diagram of a voice recognition apparatus according to a preferred embodiment of the present invention; this figure is a variant of FIG. 2 or FIG. 3. As shown in FIG. 4, the whole system is divided into four parts:

Mouth type information collection module 401 (corresponding to the camera 32 described above): its function is to collect the user's mouth shape and transmit it to the baseband processing module 403 for subsequent lip language recognition and analysis;

Voice information collection module 402 (corresponding to the microphone 34 described above): its function is to collect the user's voice during the call, and it also collects the surrounding background noise; this module needs to work in synchronization with the mouth type information collection module 401, and the collected data is also transmitted to the baseband processing module 403 to be processed in parallel together with the data produced simultaneously by the mouth type information collection module 401;

Baseband processing module 403 (corresponding to the processor 36 described above): the function of this module is to process and analyze the mouth type information produced by the mouth type information collection module 401 and finally recognize it as lip language; this module also processes the voice information data sent at the same time by the voice information collection module 402. The recognition and analysis result of the mouth type information produced by the mouth type information collection module 401 and the voice information obtained by the voice information collection module 402 corroborate and correct each other; by performing a first recognition and a second correction on the recognized information, a user instruction statement with a higher accuracy rate can be obtained;

Instruction action generation module 404: receives the user instruction statement processed by the baseband processing module 403, and performs the smart terminal's various operations in response to the user according to the instruction.
The preferred embodiments of the present invention are described below using examples.

FIG. 5 is a schematic structural diagram of a user equipment having a voice recognition function according to a preferred embodiment of the present invention; FIG. 5 is a variant of FIG. 2 to FIG. 4. As shown in FIG. 5, the main microphone and the camera are both installed on the front of the mobile phone; of course, the main microphone may also be installed on the lower right side of the mobile phone, as long as it is as close to the mouth as possible. The camera may also directly be multiplexed with the front camera, as long as the mouth type information can be clearly captured. If the front camera is used directly, the space of the mobile phone layout is greatly saved and the production cost is greatly reduced. In this preferred embodiment, a camera is employed as the mouth type information collection device.

The above user equipment includes the following functional modules:

Camera and its accessory circuit 501: its function is to capture the user's mouth type information and transmit the captured content to the image data memory 504 in the baseband processing main chip 503, in preparation for subsequent recognition and analysis;

Main microphone and its accessory circuit 502: its function is to collect the user's voice during the call, and it also collects the surrounding background noise; the collected audio data is also transmitted to the audio data memory 505 of the baseband processing main chip 503, to be processed together with the image data produced by the camera and its accessory circuit 501;

Baseband processing main chip 503: its function is to process and analyze the image data in the image data memory 504, slicing the image stream data and recognizing the content in each small slice as lip language (the lip language recognition technology here may be implemented by existing techniques in the related art); at the same time, it also performs slice recognition processing on the noise-contaminated voice audio data in the audio data memory 505. The word range of voice recognition and the word range of lip language recognition are intersected, or processed with more complicated algorithms, to find the common words and exclude some uncertain words, improving the recognition rate; since voice recognition has context-linking methods, lip language recognition can likewise be processed in conjunction with preceding and following slices. The baseband processing main chip 503 also completes the various operations according to the user instruction statement finally recognized.

Image data memory 504: configured to store the image data stream produced by the camera and its accessory circuit 501.

Audio data memory 505: configured to store the audio data stream produced by the main microphone and its accessory circuit 502.
FIG. 6 is a schematic flowchart of a voice recognition method according to a preferred embodiment of the present invention. As shown in FIG. 6, the flow includes the following steps:

Step S602: determining whether the voice recognition function has started; if it has started, proceed to step S603.

Step S603: the camera and its accessory circuit 501 start working and continuously collect image data of the user's mouth type information.

Step S604: the main microphone and its accessory circuit 502 start working synchronously and continuously collect audio data, where the audio data includes the user's instruction voice component and a surrounding background noise component.

Step S605: the image stream data collected by the camera and its accessory circuit 501 is stored in the image data memory 504.

Step S606: the audio information data collected by the main microphone and its accessory circuit 502 is stored in the audio data memory 505.

Step S607: the baseband processing main chip 503 performs synchronized slice analysis on the image data memory 504 and the audio data memory 505; the image data from the camera and the voice data from the microphone are analyzed and processed synchronously.

Here, if all possible word ranges are obtained from the image data in image slice N according to the lip language recognition algorithm, all possible word ranges are likewise obtained from the voice information of the corresponding audio slice N using the voice recognition algorithm. The lip language recognition words and the voice recognition words of corresponding slices are mutually corrected and intersected, among other processing, to eliminate impossible words; the recognized words of preceding and following slices can also be linked, to finally obtain a user instruction with a higher accuracy rate.

In this step, the collection and storage of the image stream and the audio stream need to be synchronized, and a synchronization baseline is required. The image data and the audio data are sliced starting from the baseline, and the slices also need to be synchronized, for example one slice every 0.3 seconds (since a person's average speech rate is about 180 words per minute); both the image data and the audio data must then be sliced synchronously at this length.

The following example describes in more detail how step S607 is performed; the detailed flowchart of step S607 is shown in FIG. 7:

The first slice of the image data is defined as S1 and the first slice of the audio data as Y1; by analogy, the nth slice of the image data is defined as Sn and the nth slice of the audio data as Yn.

Suppose the user is driving and using the driving assistant function, and issues the instruction "播放音乐" (play music), while at the same moment a large truck whizzes past. The system, based on 0.3-second slices (step S701), obtains four audio slices, Y1, Y2, Y3, and Y4, which are stored in the audio data memory (step S702). At the same time, the front camera collects four pieces of mouth type information, S1, S2, S3, and S4, which are stored in the image data memory (step S702). The baseband processing chip performs voice recognition processing on Y1, Y2, Y3, and Y4, that is, a process of converting audio information into text information. Because of the noise of the truck whizzing past, none of Y1-Y4 is pure voice, and the amplitude of the noise is far larger than that of the voice; most of the stored audio information is noise, so the probability that the baseband processing chip recognizes the four characters completely correctly is almost zero. The speech engine recognizes N possible text combinations, with N tending toward infinity, yielding countless possible characters YS1, YS2, ..., YSN (step S703), whose combinations are even more numerous and chaotic, leading to wrong instructions or complete recognition failure. Meanwhile, the image baseband processing chip performs lip language recognition processing on S1, S2, S3, and S4, that is, a process of converting mouth type information into text information (step S703). Take the character "播" (bo, "broadcast") as an example: the characters whose mouth shape can produce this pronunciation are the characters pronounced the same as "播", such as "薄" and "伯"; there are 135 characters with this pronunciation in total. This immediately narrows the infinite number of possible characters down to 135, and among these 135 there are many rare characters, seldom-used characters, characters that could not possibly serve as instructions, and so on. After eliminating these characters (step S704), only about 10 characters remain, and the ones used most often in audio instructions are "播" (play/broadcast) and "拨" (dial). Combined with the fact that the mouth shape of the subsequently recognized character "放" differs greatly from that of "打", "放" cannot possibly be processed as "打", so "拨打" (dial) is easily excluded, and the accuracy of "播放" (play) becomes very high (step S706). Moreover, once "播放" has been recognized, the processing of the following two characters combines lip language with the preceding context "播放", and the two characters "音乐" (music) are recognized almost without suspense (step S706). This example shows that a recognition task with nearly zero accuracy is improved by adding visual lip language; the underlying reason is that mouth type information is the user's most accurate instruction: it is not affected by ambient noise, and inaccurate pronunciation does not matter as long as the mouth shape is right. Of course, this example uses the most common instruction, so it succeeds more easily; for others, such as the digits of a telephone number, things are more complicated, but this method can still greatly improve the recognition rate and reduce the false recognition rate.

Step S608: a response operation is performed on the intelligent terminal according to the user instruction obtained by the final recognition processing.

Step S609: determining whether the voice recognition module is turned off; if not, return to step S602.

Step S610: the voice recognition module is turned off, and the entire apparatus of the embodiment stops working accordingly.
In summary, through the above embodiments provided by the present invention, misrecognition caused by inaccurate pronunciation can be corrected to some extent, and misrecognition caused by background noise can also be calibrated.

Industrial applicability: as can be seen from the above description, because the present invention combines visual information associated with the voice information, such as lip language information and environment information, to perform voice recognition, it solves the problem in the related art that voice recognition technology has a low recognition rate for voice, improves the recognition rate of voice recognition, and enhances the user experience.

In another embodiment, software is further provided, which is used to execute the technical solutions described in the above embodiments and preferred implementations.

In another embodiment, a storage medium is further provided, in which the above software is stored; the storage medium includes, but is not limited to, an optical disk, a floppy disk, a hard disk, an erasable memory, and the like.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that objects used in this way are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units not clearly listed or inherent to the process, method, product, or device.

Obviously, those skilled in the art should understand that the above modules or steps of the present invention may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network composed of multiple computing devices; optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; and in some cases, the steps shown or described may be performed in an order different from that here, or they may be made into individual integrated circuit modules respectively, or multiple modules or steps among them may be made into a single integrated circuit module. In this way, the present invention is not limited to any particular combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (13)

  1. A voice recognition method, comprising:
    collecting voice information and visual information associated with the voice information;
    performing voice recognition based on the visual information and the voice information.
  2. The method according to claim 1, wherein collecting the visual information comprises:
    collecting mouth type performance information associated with the voice information.
  3. The method according to claim 2, wherein, in the case where the visual information is the mouth type performance information, performing voice recognition based on the mouth type performance information and the voice information comprises:
    recognizing, through voice recognition, the collected voice information as a primary voice instruction, wherein the primary voice instruction comprises: voice slice information in units of language words, and one or more pre-selected language words corresponding to the voice slice information;
    determining the mouth type performance information corresponding to each piece of voice slice information in the primary voice instruction;
    matching each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information, to obtain an ultimate voice instruction.
  4. The method according to claim 3, wherein matching each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information to obtain the ultimate voice instruction comprises:
    determining a lip language word corresponding to each piece of voice slice information according to the mouth type performance information corresponding to that piece of voice slice information and a preset lip language information library, wherein the preset lip language information library is configured to store correspondences between mouth type performance information and lip language words;
    matching the lip language words and the pre-selected language words corresponding to the same piece of voice slice information.
  5. The method according to claim 3 or 4, wherein matching each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information to obtain the ultimate voice instruction further comprises:
    in the process of matching each piece of voice slice information against its own pre-selected language words, screening the matched pre-selected language words by means of phrase matching and/or sentence association, to obtain the ultimate voice instruction.
  6. A voice recognition apparatus, comprising:
    a collection module, configured to collect voice information and visual information associated with the voice information;
    a voice recognition module, configured to perform voice recognition based on the visual information and the voice information.
  7. The apparatus according to claim 6, wherein
    the collection module is configured to collect mouth type performance information associated with the voice information.
  8. The apparatus according to claim 7, wherein the voice recognition module comprises:
    a recognition unit, configured to recognize, through voice recognition, the collected voice information as a primary voice instruction, wherein the primary voice instruction comprises: voice slice information in units of language words, and one or more pre-selected language words corresponding to the voice slice information;
    a determining unit, configured to determine the mouth type performance information corresponding to each piece of voice slice information in the primary voice instruction;
    a matching unit, configured to match each piece of voice slice information against its own pre-selected language words according to the corresponding mouth type performance information, to obtain an ultimate voice instruction.
  9. The apparatus according to claim 8, wherein the matching unit comprises:
    a determining subunit, configured to determine a lip language word corresponding to each piece of voice slice information according to the mouth type performance information corresponding to that piece of voice slice information and a preset lip language information library, wherein the preset lip language information library is configured to store correspondences between mouth type performance information and lip language words;
    a matching subunit, configured to match the lip language words and the pre-selected language words corresponding to the same piece of voice slice information.
  10. The apparatus according to claim 7 or 8, wherein the matching unit further comprises:
    a screening subunit, configured to, in the process of matching each piece of voice slice information against its own pre-selected language words, screen the matched pre-selected language words by means of phrase matching and/or sentence association, to obtain the ultimate voice instruction.
  11. A user equipment, comprising the voice recognition apparatus according to any one of claims 6 to 10.
  12. A user equipment, comprising:
    a microphone, configured to collect the voice information;
    a camera, configured to collect visual information associated with the voice information;
    a processor, connected to the camera and the microphone respectively, configured to perform voice recognition based on the visual information and the voice information.
  13. The user equipment according to claim 12, wherein the user equipment further comprises:
    a memory, connected to the processor, configured to store a visual information library, wherein the visual information library is configured to store correspondences between visual information and visual language words.
PCT/CN2015/084720 2015-04-28 2015-07-21 Voice recognition method and apparatus, and user equipment WO2016173132A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510208370.9 2015-04-28
CN201510208370.9A CN106157957A (zh) 2015-04-28 Voice recognition method and apparatus, and user equipment

Publications (1)

Publication Number Publication Date
WO2016173132A1 (zh)

Family

ID=57199578

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/084720 WO2016173132A1 (zh) 2015-04-28 2015-07-21 Voice recognition method and apparatus, and user equipment

Country Status (2)

Country Link
CN (1) CN106157957A (zh)
WO (1) WO2016173132A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415689B (zh) * 2018-04-26 2022-02-15 富泰华工业(深圳)有限公司 Voice recognition apparatus and method
CN109377995B (zh) * 2018-11-20 2021-06-01 珠海格力电器股份有限公司 Method and apparatus for controlling a device
CN111611825B (zh) * 2019-02-25 2024-04-23 北京嘀嘀无限科技发展有限公司 Lip language content recognition method and apparatus
CN110691204B (zh) * 2019-09-09 2021-04-02 苏州臻迪智能科技有限公司 Audio and video processing method and apparatus, electronic device, and storage medium
CN111343554A (zh) * 2020-03-02 2020-06-26 开放智能机器(上海)有限公司 Hearing aid method and system combining vision and voice
CN111445912A (zh) * 2020-04-03 2020-07-24 深圳市阿尔垎智能科技有限公司 Voice processing method and system
CN112820274B (zh) * 2021-01-08 2021-09-28 上海仙剑文化传媒股份有限公司 Voice information recognition and correction method and system
CN113128228A (zh) * 2021-04-07 2021-07-16 北京大学深圳研究院 Voice instruction recognition method and apparatus, electronic device, and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030212557A1 (en) * 2002-05-09 2003-11-13 Nefian Ara V. Coupled hidden markov model for audiovisual speech recognition
US20050071166A1 (en) * 2003-09-29 2005-03-31 International Business Machines Corporation Apparatus for the collection of data for performing automatic speech recognition
CN101472066A (zh) * 2007-12-27 2009-07-01 华晶科技股份有限公司 Near-end control method for an image capture device and image capture device applying the method
CN102298443A (zh) * 2011-06-24 2011-12-28 华南理工大学 Smart home voice control system combined with a video channel and control method thereof
CN102324035A (zh) * 2011-08-19 2012-01-18 广东好帮手电子科技股份有限公司 Method and system for applying mouth-shape-assisted speech recognition in vehicle navigation
EP2562746A1 (en) * 2011-08-25 2013-02-27 Samsung Electronics Co., Ltd. Apparatus and method for recognizing voice by using lip image
CN104361276A (zh) * 2014-11-18 2015-02-18 新开普电子股份有限公司 Multimodal biometric identity authentication method and system
CN104409075A (zh) * 2014-11-28 2015-03-11 深圳创维-Rgb电子有限公司 Voice recognition method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201426733A (zh) * 2012-12-26 2014-07-01 Univ Kun Shan Lip-shape speech recognition method
CN104157285B (zh) * 2013-05-14 2016-01-20 腾讯科技(深圳)有限公司 Voice recognition method and apparatus, and electronic device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018045703A1 (zh) * 2016-09-07 2018-03-15 中兴通讯股份有限公司 Voice processing method and apparatus, and terminal device
CN109410957A (zh) * 2018-11-30 2019-03-01 福建实达电脑设备有限公司 Front-facing human-computer interaction voice recognition method and system based on computer vision assistance
CN110415701A (zh) * 2019-06-18 2019-11-05 平安科技(深圳)有限公司 Lip language recognition method and apparatus
CN114464182A (zh) * 2022-03-03 2022-05-10 慧言科技(天津)有限公司 Fast adaptive speech recognition method assisted by audio scene classification
CN114464182B (zh) * 2022-03-03 2022-10-21 慧言科技(天津)有限公司 Fast adaptive speech recognition method assisted by audio scene classification

Also Published As

Publication number Publication date
CN106157957A (zh) 2016-11-23

Similar Documents

Publication Publication Date Title
WO2016173132A1 (zh) Voice recognition method and apparatus, and user equipment
CN110310623B (zh) Sample generation method, model training method, apparatus, medium and electronic device
US10878824B2 Speech-to-text generation using video-speech matching from a primary speaker
EP3963576B1 Speaker attributed transcript generation
CN108630193B (zh) Speech recognition method and apparatus
CN107240398B (zh) Intelligent voice interaction method and apparatus
US10013977B2 Smart home control method based on emotion recognition and the system thereof
WO2016150001A1 (zh) Speech recognition method and apparatus, and computer storage medium
CN108399923B (zh) Method and apparatus for identifying a speaker in multi-person speech
CN111128223B (zh) Text-information-based auxiliary speaker separation method and related apparatus
US9553979B2 Bluetooth headset and voice interaction control thereof
WO2020237855A1 (zh) Sound separation method and apparatus, and computer-readable storage medium
EP3963901A1 Synchronization of audio signals from distributed devices
US20210407516A1 Processing Overlapping Speech from Distributed Devices
US20200349953A1 Audio-visual diarization to identify meeting attendees
EP3669264A1 System and methods for providing unplayed content
US10812921B1 Audio stream processing for distributed device meeting
US20210280172A1 Voice Response Method and Device, and Smart Device
WO2014120291A1 System and method for improving voice communication over a network
KR20080023030A Online speaker recognition method and apparatus therefor
US11626104B2 User speech profile management
CN109710949A (zh) Translation method and translation machine
CN111868823A (zh) Sound source separation method, apparatus and device
CN109102813B (zh) Voiceprint recognition method and apparatus, electronic device and storage medium
JP7400364B2 (ja) Speech recognition system and information processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 15890509
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 15890509
Country of ref document: EP
Kind code of ref document: A1