WO2016082267A1 - Speech recognition method and system - Google Patents

Speech recognition method and system

Info

Publication number
WO2016082267A1
WO2016082267A1 (PCT/CN2014/094624)
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
recognition result
lip
speech
keywords
Prior art date
Application number
PCT/CN2014/094624
Other languages
English (en)
French (fr)
Inventor
付春元
Original Assignee
深圳创维-Rgb电子有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳创维-Rgb电子有限公司
Priority to AU2014412434A (granted as AU2014412434B2)
Priority to US15/127,790 (granted as US10262658B2)
Publication of WO2016082267A1

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/24 - Speech recognition using non-acoustical features
    • G10L15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/254 - Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 - Fusion techniques of classification results, of results relating to different input data, e.g. multimodal recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of classification results, the classifiers operating on different input data, e.g. multi-modal recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/24 - Speech recognition using non-acoustical features
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/32 - Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L2015/088 - Word spotting

Definitions

  • the present invention relates to the field of voice control, and more particularly to a speech recognition method and system.
  • with the rapid development of voice interaction, controlling terminals (such as televisions and air conditioners) by voice, or inputting data by voice, has become a very widely used approach.
  • however, voice interaction still has many problems, such as inaccurate speech recognition and strong sensitivity to the environment. For example, if there is noise or background music nearby, the voice signal collected by the voice collection device includes both the speech uttered by the person and the surrounding noise, so the terminal cannot accurately recognize the received voice signal, making speech recognition insufficiently accurate.
  • the main object of the present invention is to propose a speech recognition method and system, aiming at solving the technical problem that speech recognition is not accurate enough.
  • to achieve the above object, the present invention provides a speech recognition method, which includes the following steps: when a voice signal is received, controlling an image acquisition device to capture images, and when the voice signal ends, controlling the image acquisition device to stop capturing; recognizing the received voice signal to obtain a voice signal recognition result; and performing lip language recognition on the images containing lips among the captured images, to obtain a lip language recognition result;
  • the accuracy of the voice signal recognition result and of the lip language recognition result is calculated, and the recognition result with the higher accuracy is taken as the current speech recognition result.
  • the step of performing lip language recognition on the images containing lips among the captured images to obtain a lip language recognition result comprises: determining the images containing lips among the captured images, taking the images containing lips as valid images, and determining the position of the lips in each valid image; determining the character output by the user according to the lip shape of each frame of valid image and the lip shape of the previous frame of valid image; and composing the lip language recognition result from the characters corresponding to each frame of valid image.
  • the step of determining the images containing lips, taking them as valid images, and determining the lip position in each valid image comprises: determining the facial contour in each captured frame; comparing the chromaticity value of each pixel within the facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each captured frame; determining the eye position within the face position, and determining the lip region from the relative position between the eyes and the lips; comparing the RGB chromaticity values of the pixels in the lip region; when the lip region contains pixels whose RGB chromaticity values satisfy a preset condition, determining that the frame is an image containing lips and taking it as a valid image; and determining the position of the lips from the RGB chromaticity values of the pixels in the lip region.
  • the step of recognizing the received voice signal to obtain a voice signal recognition result comprises: converting the received voice signal into a character string, and splitting the character string into multiple keywords according to a preset keyword library; tagging the part of speech of each keyword, and determining whether the parts of speech of adjacent keywords match; when the parts of speech of adjacent keywords do not match, taking the unmatched keyword as a first keyword, and determining whether a preset confusable-sound lexicon contains the first keyword; and, when the lexicon contains the unmatched keyword, determining the second keyword corresponding to the first keyword in the lexicon;
  • replacing the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombining the replaced second keyword with the other keywords into a voice signal recognition result, and taking the recombined result as the current voice signal recognition result.
  • the step of recognizing the received voice signal to obtain a voice signal recognition result further includes: when the part of speech of the replaced second keyword does not match its adjacent keywords and there are multiple second keywords, replacing the first keyword with another second keyword and determining whether the part of speech of the replaced second keyword matches its adjacent keywords, until all the second keywords have been tried, in which case the converted character string is taken as the current voice signal recognition result.
  • the step of calculating the accuracy of the voice signal recognition result and of the lip language recognition result and taking the recognition result with the higher accuracy as the current speech recognition result comprises: splitting the voice signal recognition result and the lip language recognition result into multiple keywords; determining the first degree of association between adjacent keywords among the keywords split from the voice signal recognition result, and the second degree of association between adjacent keywords among the keywords split from the lip language recognition result; summing the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and summing the determined second degrees of association to obtain the accuracy of the lip language recognition result;
  • the recognition result with the higher accuracy is taken as the current speech recognition result.
  • the present invention further provides a speech recognition system, characterized in that the speech recognition system comprises:
  • a control module configured to control the image acquisition device to capture images when a voice signal is received, and to control the image acquisition device to stop capturing when the voice signal ends;
  • a voice signal recognition module configured to recognize the received voice signal to obtain a voice signal recognition result;
  • a lip language recognition module configured to perform lip language recognition on the images containing lips among the captured images, to obtain a lip language recognition result;
  • a processing module configured to calculate the accuracy of the voice signal recognition result and of the lip language recognition result, and to take the recognition result with the higher accuracy as the current speech recognition result.
  • the lip language recognition module comprises:
  • a lip positioning submodule configured to determine the images containing lips among the captured images, take them as valid images, and determine the lip position in each valid image;
  • a determining submodule configured to determine the character output by the user from the lip shape of each frame of valid image and the lip shape of the previous frame of valid image;
  • a recombination submodule configured to compose the lip language recognition result from the characters corresponding to each frame of valid image.
  • the lip positioning submodule comprises:
  • a facial contour determining unit configured to determine the facial contour in each captured frame;
  • a face position locating unit configured to compare the chromaticity value of each pixel within the determined facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each captured frame;
  • a lip region positioning unit configured to determine the eye position within the face position, and to determine the lip region from the relative position between the eyes and the lips;
  • a comparing unit configured to compare the RGB chromaticity values of the pixels in the lip region;
  • a processing unit configured to determine, when the lip region contains pixels whose RGB chromaticity values satisfy a preset condition, that the frame is an image containing lips, and to take the image containing lips as a valid image;
  • a lip position locating unit configured to determine the position of the lips from the RGB chromaticity values of the pixels in the lip region.
  • the voice signal recognition module comprises:
  • a conversion submodule configured to convert the received voice signal into a character string;
  • a splitting submodule configured to split the character string into multiple keywords according to a preset keyword library;
  • a part-of-speech matching submodule configured to tag the part of speech of each keyword, and to determine whether the parts of speech of adjacent keywords match;
  • a determining submodule configured to take the unmatched keyword as a first keyword when the parts of speech of adjacent keywords do not match, determine whether a preset confusable-sound lexicon contains the first keyword, and, when the lexicon contains the unmatched keyword, determine the second keyword corresponding to the first keyword in the lexicon;
  • a processing submodule configured to replace the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombine the replaced second keyword with the other keywords into the voice signal recognition result, and take the recombined result as the current voice signal recognition result.
  • the processing submodule is further configured to replace the first keyword with another second keyword when the part of speech of the replaced second keyword does not match its adjacent keywords and there are multiple second keywords, and to determine whether the part of speech of the replaced second keyword matches its adjacent keywords, until all the second keywords have been tried, in which case the converted character string is taken as the current voice signal recognition result.
  • the processing module comprises:
  • a splitting submodule configured to split the voice signal recognition result and the lip language recognition result into multiple keywords;
  • an association degree calculation submodule configured to determine the first degree of association between adjacent keywords among the keywords split from the voice signal recognition result, and the second degree of association between adjacent keywords among the keywords split from the lip language recognition result;
  • an accuracy calculation submodule configured to sum the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and to sum the determined second degrees of association to obtain the accuracy of the lip language recognition result;
  • a processing submodule configured to take the recognition result with the higher accuracy as the current speech recognition result.
  • the speech recognition method and system proposed by the present invention recognize the voice signal and the lip language simultaneously, calculate the accuracy of the voice signal recognition result and of the lip language recognition result, and take the recognition result with the higher accuracy as the current recognition result, instead of merely recognizing the voice signal alone, which improves the accuracy of speech recognition.
  • FIG. 1 is a schematic flowchart of a speech recognition method according to a preferred embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the refinement of step S20 in FIG. 1;
  • FIG. 3 is a schematic diagram of the refinement of step S30 in FIG. 1;
  • FIG. 4 is a schematic diagram of the refinement of step S31 in FIG. 3;
  • FIG. 5 is a schematic diagram of the refinement of step S40 in FIG. 1;
  • FIG. 6 is a schematic diagram of the functional modules of a preferred embodiment of the speech recognition system of the present invention;
  • FIG. 7 is a schematic diagram of the refined functional modules of the voice signal recognition module of FIG. 6;
  • FIG. 8 is a schematic diagram of the refined functional modules of the lip language recognition module of FIG. 6;
  • FIG. 9 is a schematic diagram of the refined functional modules of the lip positioning submodule of FIG. 8;
  • FIG. 10 is a schematic diagram of the refined functional modules of the processing module of FIG. 6.
  • the present invention provides a speech recognition method.
  • FIG. 1 is a schematic flowchart of a speech recognition method according to a preferred embodiment of the present invention.
  • the speech recognition method proposed in this embodiment preferably runs on a controlled terminal (such as a television set or an air conditioner), and the controlled terminal performs the corresponding operation based on the speech recognition; alternatively, the speech recognition method may run on a control terminal, which transmits the code corresponding to the voice signal recognition result to the corresponding controlled terminal.
  • this embodiment provides a speech recognition method, which includes:
  • Step S10: when a voice signal is received, controlling the image acquisition device to capture images, and when the voice signal ends, controlling the image acquisition device to stop capturing;
  • in this embodiment, the image acquisition device captures images only when a voice signal is received, and remains in a sleep state when no voice signal is received, to reduce energy consumption; for example, when no voice signal is received within a preset time interval, the image acquisition device is controlled to enter the sleep state.
  • those skilled in the art will understand that the image acquisition device may instead capture images continuously or periodically; when a voice signal is received, the first time point at which the voice signal is received and the second time point at which it ends are determined, and the images captured by the image acquisition device between the first and second time points are obtained.
  • Step S20: recognizing the received voice signal to obtain a voice signal recognition result;
  • in this embodiment, the voice signal recognition result can be obtained by converting the voice signal into characters. Further, to improve the accuracy of the voice signal recognition result, the character string converted from the voice signal may be error-corrected.
  • the specific error-correction process is shown in FIG. 2, and step S20 includes:
  • Step S21: converting the received voice signal into a character string, and splitting the character string into multiple keywords according to a preset keyword library;
  • a keyword library including multiple keywords may be preset; the character string converted from the voice signal is compared with the keywords stored in the library, the keywords in the preset library that match the string are determined, and the string is split into the matching keywords.
  • those skilled in the art will understand that numeric keywords need not be stored in the keyword library: after the keywords matching the string are determined, the matching keywords can be extracted from the string first, and the remaining unmatched part of the string treated as one keyword. For example, if the character string converted from the voice signal is "television, switch to channel 23", the parts of the string matching keywords in the preset library are "television", "switch", "to" and "channel"; these are extracted directly from the string, and the remaining "23" is treated as one keyword.
  • Step S22: tagging the part of speech of each keyword, and determining whether the parts of speech of adjacent keywords match;
  • the part of speech of a keyword may be a noun, verb, adjective, adverb, preposition, and so on, and allowed combinations of the parts of speech may be predefined; for example, when adjacent keywords form a verb + adjective pair, the parts of speech of the adjacent keywords are considered not to match, and a recognition error may exist.
  • Step S23: when the parts of speech of adjacent keywords do not match, taking the unmatched keyword as a first keyword, and determining whether a preset confusable-sound lexicon contains the first keyword;
  • Step S24: when the confusable-sound lexicon contains the unmatched keyword, determining the second keyword corresponding to the first keyword in the lexicon;
  • in this embodiment, a confusable-sound lexicon may be preset; the lexicon stores keywords that are easily confused when a voice signal is converted into a character string, with each group of easily confused keywords saved in association.
  • when adjacent keywords do not match, the unmatched keyword can be taken as the first keyword and compared with the keywords in the lexicon, so that the erroneous keyword can be corrected.
  • those skilled in the art will understand that, when the confusable-sound lexicon does not contain the unmatched keyword, the converted character string may be taken as the current voice signal recognition result.
  • Step S25: replacing the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombining the replaced second keyword with the other keywords into the voice signal recognition result, and taking the recombined result as the current voice signal recognition result.
  • those skilled in the art will understand that, when the part of speech of the replaced second keyword does not match its adjacent keywords and there are multiple second keywords, the first keyword is replaced with another second keyword and the match is checked again, until all the second keywords have been tried, in which case the converted character string is taken as the current voice signal recognition result.
  • Step S30: performing lip language recognition on the images containing lips among the captured images, to obtain a lip language recognition result;
  • in this embodiment, the lip language recognition result is determined from the lip shape in each frame and the lip shape in the previous frame. The specific process is shown in FIG. 3, and step S30 includes:
  • Step S31: determining the images containing lips among the captured images, taking them as valid images, and determining the position of the lips in each valid image;
  • Step S311: determining the facial contour in each captured frame;
  • since the pixels of a captured image have different chromaticity values, the face position in each frame can be obtained directly from the distribution of pixel chromaticity values in the frame and a preset facial contour.
  • those skilled in the art will understand that, when several people are present in the acquisition area of the image acquisition device, the direction of the sound source can be located from the received voice signal, and the position of the user in the captured image determined from that direction; locating the user in the captured image from the sound source direction is prior art and is not described here.
  • when no facial contour is found in the captured images, the voice signal recognition result corresponding to the voice signal is used directly as the current speech recognition result, or the user may be prompted to input the voice signal again.
  • Step S312: comparing the chromaticity value of each pixel within the determined facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each captured frame;
  • Step S313: determining the eye position within the face position, and determining the lip region from the relative position between the eyes and the lips;
  • since the grey values of the eye pixels are smaller than those of other parts of the face, the eye position can be determined from the grey values of the pixels; the region where the lips are located can then easily be determined, below the eye position and in the lower third of the face.
  • Step S314: when the lip region contains pixels whose RGB chromaticity values satisfy a preset condition, determining that the frame is an image containing lips, and taking the image containing lips as a valid image;
  • Step S315: determining the position of the lips from the RGB chromaticity values of the pixels in the lip region.
  • since the lip region determined so far is only preliminary and contains both lip pixels and face pixels, the lip position must still be determined within that region. Because the B (blue) component of a lip pixel's RGB chromaticity value is much larger than its G (green) component, the preset condition can be that the difference between the B component and the G component is greater than a preset value; since the B component of a face pixel is smaller than its G component, the lip position can be determined by comparing the B and G components of each pixel.
  • Step S32: determining the character output by the user according to the lip shape of each frame of valid image and the lip shape of the previous frame of valid image;
  • Step S33: composing the lip language recognition result from the characters corresponding to each frame of valid image.
  • those skilled in the art will understand that the lip shape of the frame preceding the first captured frame defaults to closed lips; from the previous frame and the current frame, the trend of the user's lips can be derived, and the derived lip trend is compared with pre-stored lip trends to obtain the character currently output. The characters of the frames are combined into the lip language recognition result in the order in which the frames were captured.
  • Step S40: calculating the accuracy of the voice signal recognition result and of the lip language recognition result, and taking the recognition result with the higher accuracy as the current recognition result.
  • the specific process for calculating the accuracy of the voice signal recognition result and the lip language recognition result is shown in FIG. 5, and is as follows:
  • Step S41: splitting the voice signal recognition result and the lip language recognition result into multiple keywords;
  • Step S42: determining the first degree of association between adjacent keywords among the keywords split from the voice signal recognition result, and the second degree of association between adjacent keywords among the keywords split from the lip language recognition result;
  • the first degree of association is calculated as I(x, y) = log2(p(x, y) / (p(x) · p(y))), where, for two adjacent keywords x and y, p(x) is the number of times keyword x appears in the string, p(y) is the number of times keyword y appears in the string, and p(x, y) is the number of times the two adjacent keywords x and y appear together, adjacently, in the string. The second degree of association is calculated in the same way as the first and is not described again here.
  • Step S43: summing the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and summing the determined second degrees of association to obtain the accuracy of the lip language recognition result;
  • the first degree of association is calculated for each pair of adjacent keywords in the string, yielding several first degrees of association; summing the calculated degrees of association gives the overall accuracy of the string.
  • Step S44: taking the recognition result with the higher accuracy as the current speech recognition result.
  • the speech recognition method proposed in this embodiment recognizes the voice signal and the lip language simultaneously, calculates the accuracy of the voice signal recognition result and of the lip language recognition result, and takes the recognition result with the higher accuracy as the current recognition result, instead of merely recognizing the voice signal alone, which improves the accuracy of speech recognition.
  • the present invention further provides a speech recognition system.
  • FIG. 6 is a schematic diagram of the functional modules of a preferred embodiment of the speech recognition system of the present invention.
  • the functional module diagram shown in FIG. 6 is merely an example of a preferred embodiment; those skilled in the art can easily supplement it with new functional modules around the functional modules of the speech recognition system shown in FIG. 6. The names of the functional modules are custom names, used only to aid understanding of the program function blocks of the speech recognition system, not to limit the technical solution; the core of the technical solution is the functions to be achieved by the functional modules bearing these names.
  • the speech recognition system proposed in this embodiment preferably runs on a controlled terminal (such as a television set or an air conditioner), and the controlled terminal performs the corresponding operation based on the speech recognition; alternatively, the speech recognition system may run on a control terminal, which transmits the code corresponding to the voice signal recognition result to the corresponding controlled terminal.
  • this embodiment provides a speech recognition system, which includes:
  • the control module 10, configured to control the image acquisition device to capture images when a voice signal is received, and to control the image acquisition device to stop capturing when the voice signal ends;
  • in this embodiment, the control module 10 controls the image acquisition device to capture images only when a voice signal is received, and keeps it in a sleep state when no voice signal is received, to reduce power consumption; for example, when no voice signal is received within a preset time interval, the control module 10 controls the image acquisition device to enter the sleep state.
  • those skilled in the art will understand that the control module 10 may instead control the image acquisition device to capture images continuously or periodically; when a voice signal is received, it determines the first time point at which the voice signal is received and the second time point at which it ends, and obtains the images captured by the image acquisition device between the first and second time points.
  • the voice signal recognition module 20, configured to recognize the received voice signal to obtain a voice signal recognition result;
  • in this embodiment, the voice signal recognition module 20 can obtain the voice signal recognition result by converting the voice signal into characters. Further, to improve the accuracy of the voice signal recognition result, the character string converted from the voice signal may be error-corrected. Referring to FIG. 7, the voice signal recognition module 20 includes:
  • a conversion submodule 21, configured to convert the received voice signal into a character string;
  • a splitting submodule 22, configured to split the character string into multiple keywords according to a preset keyword library;
  • a keyword library including multiple keywords may be preset; the splitting submodule 22 compares the character string converted from the voice signal with the keywords stored in the library, determines the keywords in the preset library that match the string, and splits the string into the matching keywords. Those skilled in the art will understand that numeric keywords need not be stored in the keyword library: after the keywords matching the string are determined, the splitting submodule 22 can extract the matching keywords from the string first and treat the remaining unmatched part of the string as one keyword. For example, if the character string converted from the voice signal is "television, switch to channel 23", the parts of the string matching keywords in the preset library are "television", "switch", "to" and "channel"; these are extracted directly from the string, and the remaining "23" is treated as one keyword.
  • the part-of-speech matching submodule 23, configured to tag the part of speech of each keyword and determine whether the parts of speech of adjacent keywords match;
  • the part of speech of a keyword may be a noun, verb, adjective, adverb, preposition, and so on, and allowed combinations of the parts of speech may be predefined; for example, when adjacent keywords form a verb + adjective pair, the part-of-speech matching submodule 23 considers the parts of speech of the adjacent keywords not to match, and a recognition error may exist.
  • a determining submodule 24, configured to take the unmatched keyword as a first keyword when the parts of speech of adjacent keywords do not match, determine whether a preset confusable-sound lexicon contains the first keyword, and, when the lexicon contains the unmatched keyword, determine the second keyword corresponding to the first keyword in the lexicon;
  • in this embodiment, a confusable-sound lexicon may be preset; the lexicon stores keywords that are easily confused when a voice signal is converted into a character string, with each group of easily confused keywords saved in association. When adjacent keywords do not match, the unmatched keyword can be taken as the first keyword and compared with the keywords in the lexicon, so that the erroneous keyword can be corrected.
  • those skilled in the art will understand that, when the confusable-sound lexicon does not contain the unmatched keyword, the converted character string may be taken as the current voice signal recognition result.
  • the processing submodule 25, configured to replace the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombine the replaced second keyword with the other keywords into the voice signal recognition result, and take the recombined result as the current voice signal recognition result.
  • those skilled in the art will understand that, when the part of speech of the replaced second keyword does not match its adjacent keywords and there are multiple second keywords, the processing submodule 25 replaces the first keyword with another second keyword and determines whether the part of speech of the replaced second keyword matches its adjacent keywords, until all the second keywords have been tried, in which case the processing submodule 25 takes the converted character string as the current voice signal recognition result.
  • the lip language recognition module 30, configured to perform lip language recognition on the images containing lips among the captured images, to obtain a lip language recognition result;
  • in this embodiment, the lip language recognition result is determined from the lip shape in each frame and the lip shape in the previous frame. Referring to FIG. 8, the lip language recognition module 30 includes:
  • a lip positioning submodule 31, configured to determine the images containing lips among the captured images, take them as valid images, and determine the lip position in each valid image;
  • in this embodiment, the units that determine the lip position in each captured frame are shown in FIG. 9; the lip positioning submodule 31 includes:
  • a facial contour determining unit 311, configured to determine the facial contour in each captured frame;
  • since the pixels of a captured image have different chromaticity values, the facial contour determining unit 311 can obtain the face position in each frame directly from the distribution of pixel chromaticity values in the frame and a preset facial contour.
  • those skilled in the art will understand that, when several people are present in the acquisition area of the image acquisition device, the facial contour determining unit 311 can locate the direction of the sound source from the received voice signal and determine the position of the user in the captured image from that direction; locating the user in the captured image from the sound source direction is prior art and is not described here.
  • when no facial contour is found in the captured images, the processing module 40 uses the voice signal recognition result corresponding to the voice signal directly as the current speech recognition result, or the user may be prompted to input the voice signal again.
  • the face position locating unit 312, configured to compare the chromaticity value of each pixel within the determined facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each captured frame;
  • a lip region positioning unit 313, configured to determine the eye position within the face position, and determine the lip region from the relative position between the eyes and the lips;
  • since the grey values of the eye pixels are smaller than those of other parts of the face, the eye position can be determined from the grey values of the pixels; the region where the lips are located can then easily be determined, below the eye position and in the lower third of the face.
  • the comparing unit 314, configured to compare the RGB chromaticity values of the pixels in the lip region;
  • the processing unit 315, configured to determine, when the lip region contains pixels whose RGB chromaticity values satisfy a preset condition, that the frame is an image containing lips, and take the image containing lips as a valid image;
  • the lip position locating unit 316, configured to determine the position of the lips from the RGB chromaticity values of the pixels in the lip region.
  • since the lip region determined so far is only preliminary and contains both lip pixels and face pixels, the lip position must still be determined within that region. Because the B (blue) component of a lip pixel's RGB chromaticity value is much larger than its G (green) component, the preset condition can be that the difference between the B component and the G component is greater than a preset value; since the B component of a face pixel is smaller than its G component, the lip position can be determined by comparing the B and G components of each pixel.
  • a determining submodule 32, configured to determine the character output by the user according to the lip shape of each frame of valid image and the lip shape of the previous frame of valid image;
  • the recombination submodule 33, configured to compose the lip language recognition result from the characters corresponding to each frame of valid image.
  • those skilled in the art will understand that the lip shape of the frame preceding the first captured frame defaults to closed lips; from the previous frame and the current frame, the trend of the user's lips can be derived, and the derived lip trend is compared with pre-stored lip trends to obtain the character currently output. The characters of the frames are combined into the lip language recognition result in the order in which the frames were captured.
  • the processing module 40, configured to calculate the accuracy of the voice signal recognition result and of the lip language recognition result, and take the recognition result with the higher accuracy as the current speech recognition result.
  • the processing module 40 includes:
  • a splitting submodule 41, configured to split the voice signal recognition result and the lip language recognition result into multiple keywords;
  • an association degree calculation submodule 42, configured to determine the first degree of association between adjacent keywords among the keywords split from the voice signal recognition result, and determine the second degree of association between adjacent keywords among the keywords split from the lip language recognition result;
  • the first degree of association is calculated as I(x, y) = log2(p(x, y) / (p(x) · p(y))), where, for two adjacent keywords x and y, p(x) is the number of times keyword x appears in the string, p(y) is the number of times keyword y appears in the string, and p(x, y) is the number of times the two adjacent keywords x and y appear together, adjacently, in the string. The second degree of association is calculated in the same way as the first and is not described again here.
  • the accuracy calculation submodule 43, configured to sum the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and to sum the determined second degrees of association to obtain the accuracy of the lip language recognition result;
  • the first degree of association is calculated for each pair of adjacent keywords in the string, yielding several first degrees of association; summing the calculated degrees of association gives the overall accuracy of the string.
  • the processing submodule 44, configured to take the recognition result with the higher accuracy as the current speech recognition result.
  • the speech recognition system proposed in this embodiment recognizes the voice signal and the lip language simultaneously, calculates the accuracy of the voice signal recognition result and of the lip language recognition result, and takes the recognition result with the higher accuracy as the current recognition result, instead of merely recognizing the voice signal alone, which improves the accuracy of speech recognition.
  • the methods of the foregoing embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present invention.

Abstract

A speech recognition method and system: when a voice signal is received, an image acquisition device is controlled to capture images, and when the voice signal ends, the image acquisition device is controlled to stop capturing (S10); the received voice signal is recognized to obtain a voice signal recognition result (S20); lip language recognition is performed on the images containing lips among the captured images to obtain a lip language recognition result (S30); the accuracy of the voice signal recognition result and of the lip language recognition result is calculated, and the recognition result with the higher accuracy is taken as the current speech recognition result (S40). This technical solution improves the accuracy of speech recognition.

Description

Speech recognition method and system
TECHNICAL FIELD
The present invention relates to the field of voice control, and more particularly to a speech recognition method and system.
BACKGROUND
With the rapid development of voice interaction, controlling terminals (such as televisions and air conditioners) by voice, or inputting data by voice, has become a very widely used approach. At present, voice interaction still has many problems, such as inaccurate speech recognition and strong sensitivity to the environment. For example, if there are noisy voices or background music nearby, the voice signal collected by the voice collection device includes both the speech uttered by the person and the surrounding noise, so the terminal cannot accurately recognize the received voice signal, making speech recognition insufficiently accurate.
SUMMARY
The main object of the present invention is to propose a speech recognition method and system, aiming at solving the technical problem that speech recognition is not accurate enough.
To achieve the above object, the present invention provides a speech recognition method, which includes the following steps:
when a voice signal is received, controlling an image acquisition device to capture images, and when the voice signal ends, controlling the image acquisition device to stop capturing;
recognizing the received voice signal to obtain a voice signal recognition result;
performing lip language recognition on the images containing lips among the captured images, to obtain a lip language recognition result;
calculating the accuracy of the voice signal recognition result and of the lip language recognition result, and taking the recognition result with the higher accuracy as the current speech recognition result.
Preferably, the step of performing lip language recognition on the images containing lips among the captured images to obtain a lip language recognition result includes:
determining the images containing lips among the captured images, taking the images containing lips as valid images, and determining the position of the lips in each valid image;
determining the character output by the user according to the lip shape of each frame of valid image and the lip shape of the previous frame of valid image;
composing the lip language recognition result from the characters corresponding to each frame of valid image.
Preferably, the step of determining the images containing lips among the captured images, taking the images containing lips as valid images, and determining the lip position in each valid image includes:
determining the facial contour in each captured frame;
comparing the chromaticity value of each pixel within the facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each captured frame;
determining the eye position within the face position, and determining the lip region from the relative position between the eyes and the lips;
comparing the RGB chromaticity values of the pixels in the lip region;
when the lip region contains pixels whose RGB chromaticity values satisfy a preset condition, determining that the frame is an image containing lips, and taking the image containing lips as a valid image;
determining the position of the lips from the RGB chromaticity values of the pixels in the lip region.
Preferably, the step of recognizing the received voice signal to obtain a voice signal recognition result includes:
converting the received voice signal into a character string, and splitting the character string into multiple keywords according to a preset keyword library;
tagging the part of speech of each keyword, and determining whether the parts of speech of adjacent keywords match;
when the parts of speech of adjacent keywords do not match, taking the unmatched keyword as a first keyword, and determining whether a preset confusable-sound lexicon contains the first keyword;
when the confusable-sound lexicon contains the unmatched keyword, determining the second keyword corresponding to the first keyword in the lexicon;
replacing the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombining the replaced second keyword with the other keywords into a voice signal recognition result, and taking the recombined result as the current voice signal recognition result.
Preferably, the step of recognizing the received voice signal to obtain a voice signal recognition result further includes:
when the part of speech of the replaced second keyword does not match its adjacent keywords and there are multiple second keywords, replacing the first keyword with another second keyword, and determining whether the part of speech of the replaced second keyword matches its adjacent keywords, until all the second keywords have been tried, in which case the converted character string is taken as the current voice signal recognition result.
Preferably, the step of calculating the accuracy of the voice signal recognition result and of the lip language recognition result, and taking the recognition result with the higher accuracy as the current speech recognition result, includes:
splitting the voice signal recognition result and the lip language recognition result into multiple keywords;
determining the first degree of association between adjacent keywords among the keywords split from the voice signal recognition result, and determining the second degree of association between adjacent keywords among the keywords split from the lip language recognition result;
summing the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and summing the determined second degrees of association to obtain the accuracy of the lip language recognition result;
taking the recognition result with the higher accuracy as the current speech recognition result.
In addition, to achieve the above object, the present invention further provides a speech recognition system, characterized in that the speech recognition system includes:
a control module, configured to control an image acquisition device to capture images when a voice signal is received, and to control the image acquisition device to stop capturing when the voice signal ends;
a voice signal recognition module, configured to recognize the received voice signal to obtain a voice signal recognition result;
a lip language recognition module, configured to perform lip language recognition on the images containing lips among the captured images, to obtain a lip language recognition result;
a processing module, configured to calculate the accuracy of the voice signal recognition result and of the lip language recognition result, and to take the recognition result with the higher accuracy as the current speech recognition result.
Preferably, the lip language recognition module includes:
a lip positioning submodule, configured to determine the images containing lips among the captured images, take the images containing lips as valid images, and determine the lip position in each valid image;
a determining submodule, configured to determine the character output by the user according to the lip shape of each frame of valid image and the lip shape of the previous frame of valid image;
a recombination submodule, configured to compose the lip language recognition result from the characters corresponding to each frame of valid image.
Preferably, the lip positioning submodule includes:
a facial contour determining unit, configured to determine the facial contour in each captured frame;
a face position locating unit, configured to compare the chromaticity value of each pixel within the determined facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each captured frame;
a lip region positioning unit, configured to determine the eye position within the face position, and determine the lip region from the relative position between the eyes and the lips;
a comparing unit, configured to compare the RGB chromaticity values of the pixels in the lip region;
a processing unit, configured to determine, when the lip region contains pixels whose RGB chromaticity values satisfy a preset condition, that the frame is an image containing lips, and take the image containing lips as a valid image;
a lip position locating unit, configured to determine the position of the lips from the RGB chromaticity values of the pixels in the lip region.
Preferably, the voice signal recognition module includes:
a conversion submodule, configured to convert the received voice signal into a character string;
a splitting submodule, configured to split the character string into multiple keywords according to a preset keyword library;
a part-of-speech matching submodule, configured to tag the part of speech of each keyword, and determine whether the parts of speech of adjacent keywords match;
a determining submodule, configured to take the unmatched keyword as a first keyword when the parts of speech of adjacent keywords do not match, determine whether a preset confusable-sound lexicon contains the first keyword, and, when the lexicon contains the unmatched keyword, determine the second keyword corresponding to the first keyword in the lexicon;
a processing submodule, configured to replace the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombine the replaced second keyword with the other keywords into a voice signal recognition result, and take the recombined result as the current voice signal recognition result.
Preferably, the processing submodule is further configured to replace the first keyword with another second keyword when the part of speech of the replaced second keyword does not match its adjacent keywords and there are multiple second keywords, and to determine whether the part of speech of the replaced second keyword matches its adjacent keywords, until all the second keywords have been tried, in which case the converted character string is taken as the current voice signal recognition result.
Preferably, the processing module includes:
a splitting submodule, configured to split the voice signal recognition result and the lip language recognition result into multiple keywords;
an association degree calculation submodule, configured to determine the first degree of association between adjacent keywords among the keywords split from the voice signal recognition result, and determine the second degree of association between adjacent keywords among the keywords split from the lip language recognition result;
an accuracy calculation submodule, configured to sum the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and sum the determined second degrees of association to obtain the accuracy of the lip language recognition result;
a processing submodule, configured to take the recognition result with the higher accuracy as the current speech recognition result.
The speech recognition method and system proposed by the present invention recognize the voice signal and the lip language simultaneously, calculate the accuracy of the voice signal recognition result and of the lip language recognition result, and take the recognition result with the higher accuracy as the current recognition result, instead of merely recognizing the voice signal alone, which improves the accuracy of speech recognition.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of a speech recognition method according to a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of the refinement of step S20 in FIG. 1;
FIG. 3 is a schematic diagram of the refinement of step S30 in FIG. 1;
FIG. 4 is a schematic diagram of the refinement of step S31 in FIG. 3;
FIG. 5 is a schematic diagram of the refinement of step S40 in FIG. 1;
FIG. 6 is a schematic diagram of the functional modules of a preferred embodiment of the speech recognition system of the present invention;
FIG. 7 is a schematic diagram of the refined functional modules of the voice signal recognition module of FIG. 6;
FIG. 8 is a schematic diagram of the refined functional modules of the lip language recognition module of FIG. 6;
FIG. 9 is a schematic diagram of the refined functional modules of the lip positioning submodule of FIG. 8;
FIG. 10 is a schematic diagram of the refined functional modules of the processing module of FIG. 6.
The realization of the objects, the functional features and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
DETAILED DESCRIPTION
It should be understood that the specific embodiments described here are intended only to explain the present invention and are not intended to limit it.
The present invention provides a speech recognition method.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a speech recognition method according to a preferred embodiment of the present invention.
The speech recognition method proposed in this embodiment preferably runs on a controlled terminal (such as a television set or an air conditioner), and the controlled terminal performs the corresponding operation based on the speech recognition; alternatively, the speech recognition method may run on a control terminal, which transmits the code corresponding to the voice signal recognition result to the corresponding controlled terminal.
This embodiment provides a speech recognition method, which includes:
Step S10: when a voice signal is received, controlling the image acquisition device to capture images, and when the voice signal ends, controlling the image acquisition device to stop capturing;
In this embodiment, the image acquisition device is controlled to capture images only when a voice signal is received, and remains in a sleep state when no voice signal is received, to reduce energy consumption; for example, when no voice signal is received within a preset time interval, the image acquisition device is controlled to enter the sleep state.
Those skilled in the art will understand that the image acquisition device may instead be controlled to capture images continuously or periodically; when a voice signal is received, the first time point at which the voice signal is received and the second time point at which it ends are determined, and the images captured by the image acquisition device between the first and second time points are obtained.
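[Editor's note: as an illustrative aid, not part of the original disclosure, the following Python sketch shows one way to realize this timing scheme: the camera writes into a rolling, timestamped buffer, and the frames between the first time point (voice begins) and the second time point (voice ends) are cut out afterwards. The class name, buffer size and clock source are the editor's assumptions.]

```python
import time
from collections import deque

class CaptureController:
    """Keep a rolling, timestamped frame buffer and extract the frames
    that fall between the start and the end of a voice signal."""

    def __init__(self, max_frames=300):
        # (timestamp, frame) pairs; old frames fall off the left end
        self.buffer = deque(maxlen=max_frames)

    def on_frame(self, frame):
        # Called by the camera loop for every frame it captures.
        self.buffer.append((time.time(), frame))

    def frames_between(self, t_start, t_end):
        # Frames acquired between the first time point (voice begins)
        # and the second time point (voice ends).
        return [f for (t, f) in self.buffer if t_start <= t <= t_end]
```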
Step S20: recognizing the received voice signal to obtain a voice signal recognition result;
In this embodiment, the voice signal recognition result can be obtained by converting the voice signal into characters. Further, to improve the accuracy of the voice signal recognition result, the character string converted from the voice signal may be error-corrected; the specific error-correction process is shown in FIG. 2, and step S20 includes:
Step S21: converting the received voice signal into a character string, and splitting the character string into multiple keywords according to a preset keyword library;
A keyword library including multiple keywords may be preset; the character string converted from the voice signal is compared with the keywords stored in the library, the keywords in the preset library that match the string are determined, and the string is split into the matching keywords. Those skilled in the art will understand that numeric keywords need not be stored in the keyword library: after the keywords matching the string are determined, the matching keywords can be extracted from the string first, and the remaining unmatched part of the string treated as one keyword. For example, if the character string converted from the voice signal is "television, switch to channel 23", the parts of the string matching keywords in the preset library are "television", "switch", "to" and "channel"; these are extracted directly from the string, and the remaining "23" is treated as one keyword.
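[Editor's note: the following Python sketch, added for illustration only, implements the split just described under the assumption of a simple longest-match scan over the characters; contiguous characters not covered by any library keyword (such as the digits "23") are grouped into one leftover keyword. Punctuation and whitespace handling are omitted.]

```python
def split_keywords(text, keyword_lib):
    """Longest-match split of `text` against a preset keyword library.
    Unmatched runs of characters become a single leftover keyword."""
    keywords, leftover, i = [], "", 0
    vocab = sorted(keyword_lib, key=len, reverse=True)  # prefer longest match
    while i < len(text):
        for word in vocab:
            if text.startswith(word, i):
                if leftover:                 # flush the unmatched run first
                    keywords.append(leftover)
                    leftover = ""
                keywords.append(word)
                i += len(word)
                break
        else:
            leftover += text[i]              # no library keyword matches here
            i += 1
    if leftover:
        keywords.append(leftover)
    return keywords

# e.g. split_keywords("电视机切换至23频道", {"电视机", "切换", "至", "频道"})
# returns ["电视机", "切换", "至", "23", "频道"]
```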
Step S22: tagging the part of speech of each keyword, and determining whether the parts of speech of adjacent keywords match;
The part of speech of a keyword may be a noun, verb, adjective, adverb, preposition, and so on, and allowed combinations of the parts of speech may be predefined; for example, when adjacent keywords form a verb + adjective pair, the parts of speech of the adjacent keywords are considered not to match, and a recognition error may exist.
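[Editor's note: a minimal Python sketch of the adjacency check follows, for illustration. The set of allowed part-of-speech pairs below is the editor's stand-in for the "predefined combinations" mentioned above, not a list taken from the patent.]

```python
# Invented stand-in collocations; a real system would load these from
# configuration rather than hard-coding them.
ALLOWED_PAIRS = {
    ("noun", "verb"), ("verb", "noun"), ("adjective", "noun"),
    ("adverb", "verb"), ("preposition", "noun"), ("verb", "preposition"),
    ("noun", "noun"),
}

def first_pos_mismatch(tagged_keywords):
    """tagged_keywords: list of (keyword, part_of_speech) pairs.
    Returns the index of the first keyword whose part of speech does not
    collocate with its left neighbour (e.g. verb + adjective), or None."""
    for i in range(len(tagged_keywords) - 1):
        pair = (tagged_keywords[i][1], tagged_keywords[i + 1][1])
        if pair not in ALLOWED_PAIRS:
            return i + 1
    return None
```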
Step S23: when the parts of speech of adjacent keywords do not match, taking the unmatched keyword as a first keyword, and determining whether a preset confusable-sound lexicon contains the first keyword;
Step S24: when the confusable-sound lexicon contains the unmatched keyword, determining the second keyword corresponding to the first keyword in the lexicon;
In this embodiment, a confusable-sound lexicon may be preset; the lexicon stores keywords that are easily confused when a voice signal is converted into a character string, with each group of easily confused keywords saved in association. When adjacent keywords do not match, the unmatched keyword can be taken as the first keyword and compared with the keywords in the lexicon, so that the erroneous keyword can be corrected.
Those skilled in the art will understand that, when the confusable-sound lexicon does not contain the unmatched keyword, the converted character string may be taken as the current voice signal recognition result.
Step S25: replacing the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombining the replaced second keyword with the other keywords into the voice signal recognition result, and taking the recombined result as the current voice signal recognition result.
Those skilled in the art will understand that, when the part of speech of the replaced second keyword does not match its adjacent keywords and there are multiple second keywords, the first keyword is replaced with another second keyword and the match is checked again, until all the second keywords have been tried, in which case the converted character string is taken as the current voice signal recognition result.
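[Editor's note: the correction loop of steps S23 to S25 can be sketched as follows, for illustration only. The callables `mismatch_at` and `pos_ok_at` and the shape of `confusion_lib` (a dict from an easily confused keyword to its associated replacement candidates) are the editor's assumptions, not part of the disclosure.]

```python
def correct_with_confusions(keywords, confusion_lib, mismatch_at, pos_ok_at):
    """Try the replacement candidates for a keyword whose part of speech
    clashes with a neighbour; fall back to the original string if none fit."""
    i = mismatch_at(keywords)          # index of the clashing keyword, or None
    if i is None:
        return keywords                # nothing to correct
    first_kw = keywords[i]             # the "first keyword"
    if first_kw not in confusion_lib:
        return keywords                # lexicon lacks it: keep the string
    for second_kw in confusion_lib[first_kw]:   # try each "second keyword"
        candidate = keywords[:i] + [second_kw] + keywords[i + 1:]
        if pos_ok_at(candidate, i):    # parts of speech now match neighbours
            return candidate           # recombined recognition result
    return keywords  # every second keyword tried: fall back to the original
```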
Step S30: performing lip language recognition on the images containing lips among the captured images, to obtain a lip language recognition result;
In this embodiment, the lip language recognition result can be determined from the lip shape in each frame and the lip shape in the previous frame; the specific process is shown in FIG. 3, and step S30 includes:
Step S31: determining the images containing lips among the captured images, taking the images containing lips as valid images, and determining the position of the lips in each valid image;
In this embodiment, the specific process of determining the lip position in each captured frame is shown in FIG. 4, and is as follows:
Step S311: determining the facial contour in each captured frame;
Since the pixels of a captured image have different chromaticity values, the face position in each frame can be obtained directly from the distribution of pixel chromaticity values in the frame and a preset facial contour.
Those skilled in the art will understand that, when several people are present in the acquisition area of the image acquisition device, the direction of the sound source can be located from the received voice signal, and the position of the user in the captured image determined from that direction; locating the user in the captured image from the sound source direction is prior art and is not described here.
When no facial contour is found in the captured images, the voice signal recognition result corresponding to the voice signal is used directly as the current speech recognition result, or the user may be prompted to input the voice signal again.
Step S312: comparing the chromaticity value of each pixel within the determined facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each captured frame;
The similarity between the YUV chromaticity value of each pixel within the facial contour and the YUV chromaticity values of the pixels of the pre-stored face is determined; when the similarity exceeds a preset value, the pixel is considered a face pixel. The formula for calculating the similarity is prior art and is not described here.
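[Editor's note: the patent leaves the similarity formula to the prior art. Purely as an illustration, the sketch below assumes a per-pixel Euclidean distance in YUV space mapped to a similarity score; the 1/(1+d) mapping and the default threshold are the editor's assumptions.]

```python
import numpy as np

def face_pixel_mask(contour_yuv, template_yuv, threshold=0.5):
    """Score each pixel of the contour region against the pre-stored face
    and keep pixels whose similarity exceeds a preset value.
    contour_yuv, template_yuv: HxWx3 arrays of YUV chromaticity values."""
    d = np.linalg.norm(contour_yuv.astype(np.float32)
                       - template_yuv.astype(np.float32), axis=-1)
    similarity = 1.0 / (1.0 + d)   # distance 0 maps to similarity 1.0
    return similarity > threshold  # boolean mask of face pixels
```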
Step S313: determining the eye position within the face position, and determining the lip region from the relative position between the eyes and the lips;
In this embodiment, since the grey values of the eye pixels are smaller than the grey values of other parts of the face, the eye position can be determined from the grey values of the pixels; the region where the lips are located can then easily be determined, below the eye position and in the lower third of the face.
Step S314: when the lip region contains pixels whose RGB chromaticity values satisfy a preset condition, determining that the frame is an image containing lips, and taking the image containing lips as a valid image;
Step S315: determining the position of the lips from the RGB chromaticity values of the pixels in the lip region.
However, since the lip region determined so far is only preliminary and contains both lip pixels and face pixels, the lip position must still be determined within that region. Because the B (blue) component of a lip pixel's RGB chromaticity value is much larger than its G (green) component, the preset condition can be that the difference between the B component and the G component is greater than a preset value; since the B component of a face pixel is smaller than its G component, the lip position can be determined by comparing the B and G components of each pixel.
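[Editor's note: for illustration, the B-versus-G test of steps S314/S315 can be sketched in Python as follows. The threshold value is an assumed example, since the patent only requires the difference to be "greater than a preset value".]

```python
import numpy as np

def locate_lips(region_rgb, min_b_minus_g=20):
    """Within the candidate lip region, treat pixels whose B component
    exceeds the G component by more than a preset value as lip pixels
    (facial skin has B < G, per the description)."""
    region = region_rgb.astype(np.int16)           # avoid uint8 underflow
    b_minus_g = region[..., 2] - region[..., 1]    # channels ordered R, G, B
    mask = b_minus_g > min_b_minus_g
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None, mask  # no qualifying pixel: frame is not a valid image
    return (ys.mean(), xs.mean()), mask            # lip centre and pixel mask
```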
Step S32: determining the character output by the user according to the lip shape of each frame of valid image and the lip shape of the previous frame of valid image;
Step S33: composing the lip language recognition result from the characters corresponding to each frame of valid image.
Those skilled in the art will understand that the lip shape of the frame preceding the first captured frame defaults to closed lips; from the previous frame and the current frame, the trend of the user's lips can be derived, and the derived lip trend is compared with pre-stored lip trends to obtain the character currently output. The characters of the frames are combined into the lip language recognition result in the order in which the frames were captured.
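[Editor's note: a compact sketch of steps S32/S33 follows, for illustration. `shape_of` is assumed to return a discrete lip-shape label per frame, and `stored_trends` to map a (previous shape, current shape) transition to a character; neither name is from the patent.]

```python
def lip_characters(valid_frames, shape_of, stored_trends, closed_lips):
    """Compare the lip-shape transition between consecutive valid frames
    with pre-stored trends, yielding one character per recognized trend."""
    chars, prev = [], closed_lips   # frame before the first defaults to closed
    for frame in valid_frames:      # frames in acquisition order
        cur = shape_of(frame)
        ch = stored_trends.get((prev, cur))
        if ch is not None:
            chars.append(ch)
        prev = cur
    return "".join(chars)           # characters combined into the lip result
```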
Step S40: calculating the accuracy of the voice signal recognition result and of the lip language recognition result, and taking the recognition result with the higher accuracy as the current recognition result.
In this embodiment, the specific process of calculating the accuracy of the voice signal recognition result and the lip language recognition result is shown in FIG. 5, and is as follows:
Step S41: splitting the voice signal recognition result and the lip language recognition result into multiple keywords;
The keyword splitting process is the same as the voice signal keyword splitting process described above, and is not repeated here.
Step S42: determining the first degree of association between adjacent keywords among the keywords split from the voice signal recognition result, and determining the second degree of association between adjacent keywords among the keywords split from the lip language recognition result;
In this embodiment, the first degree of association is calculated as I(x, y) = log2(p(x, y) / (p(x) · p(y))), where, for two adjacent keywords x and y, p(x) is the number of times keyword x appears in the string, p(y) is the number of times keyword y appears in the string, and p(x, y) is the number of times the two adjacent keywords x and y appear together, adjacently, in the string. The second degree of association is calculated in the same way as the first and is not described again here.
Step S43: summing the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and summing the determined second degrees of association to obtain the accuracy of the lip language recognition result;
In this embodiment, the first degree of association is calculated for each pair of adjacent keywords in the string, yielding several first degrees of association; summing the calculated degrees of association gives the overall accuracy of the string.
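[Editor's note: the calculation of steps S42/S43 can be written directly from the formula above; in this illustrative sketch, `counts` and `pair_counts` hold the occurrence counts the description calls p(x) and p(x, y). Every adjacent pair scored appears at least once in the string being scored, so p(x, y) >= 1 and the logarithm is well defined.]

```python
import math

def association(x, y, counts, pair_counts):
    """First (or second) degree of association per the description:
    I(x, y) = log2(p(x, y) / (p(x) * p(y)))."""
    return math.log2(pair_counts[(x, y)] / (counts[x] * counts[y]))

def accuracy(keywords, counts, pair_counts):
    """Sum the association degrees of every adjacent keyword pair; the
    recognition result (speech or lip) with the larger sum wins."""
    return sum(association(keywords[i], keywords[i + 1], counts, pair_counts)
               for i in range(len(keywords) - 1))
```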
Step S44: taking the recognition result with the higher accuracy as the current speech recognition result.
The speech recognition method proposed in this embodiment recognizes the voice signal and the lip language simultaneously, calculates the accuracy of the voice signal recognition result and of the lip language recognition result, and takes the recognition result with the higher accuracy as the current recognition result, instead of merely recognizing the voice signal alone, which improves the accuracy of speech recognition.
The present invention further provides a speech recognition system.
Referring to FIG. 6, FIG. 6 is a schematic diagram of the functional modules of a preferred embodiment of the speech recognition system of the present invention.
It should be emphasized that, for those skilled in the art, the functional module diagram shown in FIG. 6 is merely an example of a preferred embodiment; those skilled in the art can easily supplement it with new functional modules around the functional modules of the speech recognition system shown in FIG. 6. The names of the functional modules are custom names, used only to aid understanding of the program function blocks of the speech recognition system and not to limit the technical solution of the present invention; the core of the technical solution of the present invention is the functions to be achieved by the functional modules bearing these custom names.
The speech recognition system proposed in this embodiment preferably runs on a controlled terminal (such as a television set or an air conditioner), and the controlled terminal performs the corresponding operation based on the speech recognition; alternatively, the speech recognition system may run on a control terminal, which transmits the code corresponding to the voice signal recognition result to the corresponding controlled terminal.
This embodiment provides a speech recognition system, which includes:
a control module 10, configured to control the image acquisition device to capture images when a voice signal is received, and to control the image acquisition device to stop capturing when the voice signal ends;
In this embodiment, the control module 10 controls the image acquisition device to capture images only when a voice signal is received, and keeps it in a sleep state when no voice signal is received, to reduce energy consumption; for example, when no voice signal is received within a preset time interval, the control module 10 controls the image acquisition device to enter the sleep state.
Those skilled in the art will understand that the control module 10 may instead control the image acquisition device to capture images continuously or periodically; when a voice signal is received, it determines the first time point at which the voice signal is received and the second time point at which it ends, and obtains the images captured by the image acquisition device between the first and second time points.
a voice signal recognition module 20, configured to recognize the received voice signal to obtain a voice signal recognition result;
In this embodiment, the voice signal recognition module 20 can obtain the voice signal recognition result by converting the voice signal into characters. Further, to improve the accuracy of the voice signal recognition result, the character string converted from the voice signal may be error-corrected. Referring to FIG. 7, the voice signal recognition module 20 includes:
a conversion submodule 21, configured to convert the received voice signal into a character string;
a splitting submodule 22, configured to split the character string into multiple keywords according to a preset keyword library;
A keyword library including multiple keywords may be preset; the splitting submodule 22 compares the character string converted from the voice signal with the keywords stored in the library, determines the keywords in the preset library that match the string, and splits the string into the matching keywords. Those skilled in the art will understand that numeric keywords need not be stored in the keyword library: after the keywords matching the string are determined, the splitting submodule 22 can extract the matching keywords from the string first and treat the remaining unmatched part of the string as one keyword. For example, if the character string converted from the voice signal is "television, switch to channel 23", the parts of the string matching keywords in the preset library are "television", "switch", "to" and "channel"; these are extracted directly from the string, and the remaining "23" is treated as one keyword.
a part-of-speech matching submodule 23, configured to tag the part of speech of each keyword and determine whether the parts of speech of adjacent keywords match;
The part of speech of a keyword may be a noun, verb, adjective, adverb, preposition, and so on, and allowed combinations of the parts of speech may be predefined; for example, when adjacent keywords form a verb + adjective pair, the part-of-speech matching submodule 23 considers the parts of speech of the adjacent keywords not to match, and a recognition error may exist.
a determining submodule 24, configured to take the unmatched keyword as a first keyword when the parts of speech of adjacent keywords do not match, determine whether a preset confusable-sound lexicon contains the first keyword, and, when the lexicon contains the unmatched keyword, determine the second keyword corresponding to the first keyword in the lexicon;
In this embodiment, a confusable-sound lexicon may be preset; the lexicon stores keywords that are easily confused when a voice signal is converted into a character string, with each group of easily confused keywords saved in association. When adjacent keywords do not match, the unmatched keyword can be taken as the first keyword and compared with the keywords in the lexicon, so that the erroneous keyword can be corrected.
Those skilled in the art will understand that, when the confusable-sound lexicon does not contain the unmatched keyword, the converted character string may be taken as the current voice signal recognition result.
a processing submodule 25, configured to replace the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombine the replaced second keyword with the other keywords into the voice signal recognition result, and take the recombined result as the current voice signal recognition result.
Those skilled in the art will understand that, when the part of speech of the replaced second keyword does not match its adjacent keywords and there are multiple second keywords, the processing submodule 25 replaces the first keyword with another second keyword and determines whether the part of speech of the replaced second keyword matches its adjacent keywords, until all the second keywords have been tried, in which case the processing submodule 25 takes the converted character string as the current voice signal recognition result.
a lip language recognition module 30, configured to perform lip language recognition on the images containing lips among the captured images, to obtain a lip language recognition result;
In this embodiment, the lip language recognition result can be determined from the lip shape in each frame and the lip shape in the previous frame. Referring to FIG. 8, the lip language recognition module 30 includes:
a lip positioning submodule 31, configured to determine the images containing lips among the captured images, take the images containing lips as valid images, and determine the lip position in each valid image;
In this embodiment, the units that determine the lip position in each captured frame are shown in FIG. 9; the lip positioning submodule 31 includes:
a facial contour determining unit 311, configured to determine the facial contour in each captured frame;
Since the pixels of a captured image have different chromaticity values, the facial contour determining unit 311 can obtain the face position in each frame directly from the distribution of pixel chromaticity values in the frame and a preset facial contour.
Those skilled in the art will understand that, when several people are present in the acquisition area of the image acquisition device, the facial contour determining unit 311 can locate the direction of the sound source from the received voice signal and determine the position of the user in the captured image from that direction; locating the user in the captured image from the sound source direction is prior art and is not described here.
When no facial contour is found in the captured images, the processing module 40 uses the voice signal recognition result corresponding to the voice signal directly as the current speech recognition result, or the user may be prompted to input the voice signal again.
a face position locating unit 312, configured to compare the chromaticity value of each pixel within the determined facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each captured frame;
The similarity between the YUV chromaticity value of each pixel within the facial contour and the YUV chromaticity values of the pixels of the pre-stored face is determined; when the similarity exceeds a preset value, the pixel is considered a face pixel. The formula for calculating the similarity is prior art and is not described here.
a lip region positioning unit 313, configured to determine the eye position within the face position, and determine the lip region from the relative position between the eyes and the lips;
In this embodiment, since the grey values of the eye pixels are smaller than the grey values of other parts of the face, the eye position can be determined from the grey values of the pixels; the region where the lips are located can then easily be determined, below the eye position and in the lower third of the face.
a comparing unit 314, configured to compare the RGB chromaticity values of the pixels in the lip region;
a processing unit 315, configured to determine, when the lip region contains pixels whose RGB chromaticity values satisfy a preset condition, that the frame is an image containing lips, and take the image containing lips as a valid image;
a lip position locating unit 316, configured to determine the position of the lips from the RGB chromaticity values of the pixels in the lip region.
However, since the lip region determined so far is only preliminary and contains both lip pixels and face pixels, the lip position must still be determined within that region. Because the B (blue) component of a lip pixel's RGB chromaticity value is much larger than its G (green) component, the preset condition can be that the difference between the B component and the G component is greater than a preset value; since the B component of a face pixel is smaller than its G component, the lip position can be determined by comparing the B and G components of each pixel.
a determining submodule 32, configured to determine the character output by the user according to the lip shape of each frame of valid image and the lip shape of the previous frame of valid image;
a recombination submodule 33, configured to compose the lip language recognition result from the characters corresponding to each frame of valid image.
Those skilled in the art will understand that the lip shape of the frame preceding the first captured frame defaults to closed lips; from the previous frame and the current frame, the trend of the user's lips can be derived, and the derived lip trend is compared with pre-stored lip trends to obtain the character currently output. The characters of the frames are combined into the lip language recognition result in the order in which the frames were captured.
a processing module 40, configured to calculate the accuracy of the voice signal recognition result and of the lip language recognition result, and take the recognition result with the higher accuracy as the current speech recognition result.
In this embodiment, referring to FIG. 10, the processing module 40 includes:
a splitting submodule 41, configured to split the voice signal recognition result and the lip language recognition result into multiple keywords;
The keyword splitting process is the same as the voice signal keyword splitting process described above, and is not repeated here.
an association degree calculation submodule 42, configured to determine the first degree of association between adjacent keywords among the keywords split from the voice signal recognition result, and determine the second degree of association between adjacent keywords among the keywords split from the lip language recognition result;
In this embodiment, the first degree of association is calculated as I(x, y) = log2(p(x, y) / (p(x) · p(y))), where, for two adjacent keywords x and y, p(x) is the number of times keyword x appears in the string, p(y) is the number of times keyword y appears in the string, and p(x, y) is the number of times the two adjacent keywords x and y appear together, adjacently, in the string. The second degree of association is calculated in the same way as the first and is not described again here.
an accuracy calculation submodule 43, configured to sum the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and sum the determined second degrees of association to obtain the accuracy of the lip language recognition result;
In this embodiment, the first degree of association is calculated for each pair of adjacent keywords in the string, yielding several first degrees of association; summing the calculated degrees of association gives the overall accuracy of the string.
a processing submodule 44, configured to take the recognition result with the higher accuracy as the current speech recognition result.
The speech recognition system proposed in this embodiment recognizes the voice signal and the lip language simultaneously, calculates the accuracy of the voice signal recognition result and of the lip language recognition result, and takes the recognition result with the higher accuracy as the current recognition result, instead of merely recognizing the voice signal alone, which improves the accuracy of speech recognition.
It should be noted that, in this document, the terms "comprise", "include" and any of their variants are intended to cover non-exclusive inclusion, so that a process, method, article or system that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or system. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or system that includes that element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the description of the foregoing embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk or optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention; any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included in the patent protection scope of the present invention.

Claims (20)

  1. A speech recognition method, characterized in that the speech recognition method includes the following steps:
    when a voice signal is received, controlling an image acquisition device to capture images, and when the voice signal ends, controlling the image acquisition device to stop capturing;
    recognizing the received voice signal to obtain a voice signal recognition result;
    performing lip language recognition on the images containing lips among the captured images, to obtain a lip language recognition result;
    calculating the accuracy of the voice signal recognition result and of the lip language recognition result, and taking the recognition result with the higher accuracy as the current speech recognition result.
  2. The speech recognition method according to claim 1, characterized in that the step of performing lip language recognition on the images containing lips among the captured images to obtain a lip language recognition result includes:
    determining the images containing lips among the captured images, taking the images containing lips as valid images, and determining the position of the lips in each valid image;
    determining the character output by the user according to the lip shape of each frame of valid image and the lip shape of the previous frame of valid image;
    composing the lip language recognition result from the characters corresponding to each frame of valid image.
  3. The speech recognition method according to claim 2, characterized in that the step of determining the images containing lips among the captured images, taking the images containing lips as valid images, and determining the lip position in each valid image includes:
    determining the facial contour in each captured frame;
    comparing the chromaticity value of each pixel within the facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each captured frame;
    determining the eye position within the face position, and determining the lip region from the relative position between the eyes and the lips;
    comparing the RGB chromaticity values of the pixels in the lip region;
    when the lip region contains pixels whose RGB chromaticity values satisfy a preset condition, determining that the frame is an image containing lips, and taking the image containing lips as a valid image;
    determining the position of the lips from the RGB chromaticity values of the pixels in the lip region.
  4. The speech recognition method according to claim 1, characterized in that the step of recognizing the received voice signal to obtain a voice signal recognition result includes:
    converting the received voice signal into a character string, and splitting the character string into multiple keywords according to a preset keyword library;
    tagging the part of speech of each keyword, and determining whether the parts of speech of adjacent keywords match;
    when the parts of speech of adjacent keywords do not match, taking the unmatched keyword as a first keyword, and determining whether a preset confusable-sound lexicon contains the first keyword;
    when the confusable-sound lexicon contains the unmatched keyword, determining the second keyword corresponding to the first keyword in the lexicon;
    replacing the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombining the replaced second keyword with the other keywords into a voice signal recognition result, and taking the recombined result as the current voice signal recognition result.
  5. The speech recognition method according to claim 4, characterized in that the step of recognizing the received voice signal to obtain a voice signal recognition result further includes:
    when the part of speech of the replaced second keyword does not match its adjacent keywords and there are multiple second keywords, replacing the first keyword with another second keyword, and determining whether the part of speech of the replaced second keyword matches its adjacent keywords, until all the second keywords have been tried, in which case the converted character string is taken as the current voice signal recognition result.
  6. The speech recognition method according to claim 2, characterized in that the step of recognizing the received voice signal to obtain a voice signal recognition result includes:
    converting the received voice signal into a character string, and splitting the character string into multiple keywords according to a preset keyword library;
    tagging the part of speech of each keyword, and determining whether the parts of speech of adjacent keywords match;
    when the parts of speech of adjacent keywords do not match, taking the unmatched keyword as a first keyword, and determining whether a preset confusable-sound lexicon contains the first keyword;
    when the confusable-sound lexicon contains the unmatched keyword, determining the second keyword corresponding to the first keyword in the lexicon;
    replacing the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombining the replaced second keyword with the other keywords into a voice signal recognition result, and taking the recombined result as the current voice signal recognition result.
  7. The speech recognition method according to claim 3, characterized in that the step of recognizing the received voice signal to obtain a voice signal recognition result includes:
    converting the received voice signal into a character string, and splitting the character string into multiple keywords according to a preset keyword library;
    tagging the part of speech of each keyword, and determining whether the parts of speech of adjacent keywords match;
    when the parts of speech of adjacent keywords do not match, taking the unmatched keyword as a first keyword, and determining whether a preset confusable-sound lexicon contains the first keyword;
    when the confusable-sound lexicon contains the unmatched keyword, determining the second keyword corresponding to the first keyword in the lexicon;
    replacing the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombining the replaced second keyword with the other keywords into a voice signal recognition result, and taking the recombined result as the current voice signal recognition result.
  8. The speech recognition method according to claim 1, characterized in that the step of calculating the accuracy of the voice signal recognition result and of the lip language recognition result and taking the recognition result with the higher accuracy as the current speech recognition result includes:
    splitting the voice signal recognition result and the lip language recognition result into multiple keywords;
    determining the first degree of association between adjacent keywords among the keywords split from the voice signal recognition result, and determining the second degree of association between adjacent keywords among the keywords split from the lip language recognition result;
    summing the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and summing the determined second degrees of association to obtain the accuracy of the lip language recognition result;
    taking the recognition result with the higher accuracy as the current speech recognition result.
  9. The speech recognition method according to claim 2, characterized in that the step of calculating the accuracy of the voice signal recognition result and of the lip language recognition result and taking the recognition result with the higher accuracy as the current speech recognition result includes:
    splitting the voice signal recognition result and the lip language recognition result into multiple keywords;
    determining the first degree of association between adjacent keywords among the keywords split from the voice signal recognition result, and determining the second degree of association between adjacent keywords among the keywords split from the lip language recognition result;
    summing the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and summing the determined second degrees of association to obtain the accuracy of the lip language recognition result;
    taking the recognition result with the higher accuracy as the current speech recognition result.
  10. The speech recognition method according to claim 3, characterized in that the step of calculating the accuracy of the voice signal recognition result and of the lip language recognition result and taking the recognition result with the higher accuracy as the current speech recognition result includes:
    splitting the voice signal recognition result and the lip language recognition result into multiple keywords;
    determining the first degree of association between adjacent keywords among the keywords split from the voice signal recognition result, and determining the second degree of association between adjacent keywords among the keywords split from the lip language recognition result;
    summing the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and summing the determined second degrees of association to obtain the accuracy of the lip language recognition result;
    taking the recognition result with the higher accuracy as the current speech recognition result.
  11. A speech recognition system, characterized in that the speech recognition system includes:
    a control module, configured to control an image acquisition device to capture images when a voice signal is received, and to control the image acquisition device to stop capturing when the voice signal ends;
    a voice signal recognition module, configured to recognize the received voice signal to obtain a voice signal recognition result;
    a lip language recognition module, configured to perform lip language recognition on the images containing lips among the captured images, to obtain a lip language recognition result;
    a processing module, configured to calculate the accuracy of the voice signal recognition result and of the lip language recognition result, and to take the recognition result with the higher accuracy as the current speech recognition result.
  12. The speech recognition system according to claim 11, characterized in that the lip language recognition module includes:
    a lip positioning submodule, configured to determine the images containing lips among the captured images, take the images containing lips as valid images, and determine the lip position in each valid image;
    a determining submodule, configured to determine the character output by the user according to the lip shape of each frame of valid image and the lip shape of the previous frame of valid image;
    a recombination submodule, configured to compose the lip language recognition result from the characters corresponding to each frame of valid image.
  13. The speech recognition system according to claim 12, characterized in that the lip positioning submodule includes:
    a facial contour determining unit, configured to determine the facial contour in each captured frame;
    a face position locating unit, configured to compare the chromaticity value of each pixel within the determined facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each captured frame;
    a lip region positioning unit, configured to determine the eye position within the face position, and determine the lip region from the relative position between the eyes and the lips;
    a comparing unit, configured to compare the RGB chromaticity values of the pixels in the lip region;
    a processing unit, configured to determine, when the lip region contains pixels whose RGB chromaticity values satisfy a preset condition, that the frame is an image containing lips, and take the image containing lips as a valid image;
    a lip position locating unit, configured to determine the position of the lips from the RGB chromaticity values of the pixels in the lip region.
  14. The speech recognition system according to claim 11, characterized in that the voice signal recognition module includes:
    a conversion submodule, configured to convert the received voice signal into a character string;
    a splitting submodule, configured to split the character string into multiple keywords according to a preset keyword library;
    a part-of-speech matching submodule, configured to tag the part of speech of each keyword, and determine whether the parts of speech of adjacent keywords match;
    a determining submodule, configured to take the unmatched keyword as a first keyword when the parts of speech of adjacent keywords do not match, determine whether a preset confusable-sound lexicon contains the first keyword, and, when the lexicon contains the unmatched keyword, determine the second keyword corresponding to the first keyword in the lexicon;
    a processing submodule, configured to replace the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombine the replaced second keyword with the other keywords into a voice signal recognition result, and take the recombined result as the current voice signal recognition result.
  15. The speech recognition system according to claim 14, characterized in that the processing submodule is further configured to replace the first keyword with another second keyword when the part of speech of the replaced second keyword does not match its adjacent keywords and there are multiple second keywords, and to determine whether the part of speech of the replaced second keyword matches its adjacent keywords, until all the second keywords have been tried, in which case the converted character string is taken as the current voice signal recognition result.
  16. The speech recognition system according to claim 12, characterized in that the voice signal recognition module includes:
    a conversion submodule, configured to convert the received voice signal into a character string;
    a splitting submodule, configured to split the character string into multiple keywords according to a preset keyword library;
    a part-of-speech matching submodule, configured to tag the part of speech of each keyword, and determine whether the parts of speech of adjacent keywords match;
    a determining submodule, configured to take the unmatched keyword as a first keyword when the parts of speech of adjacent keywords do not match, determine whether a preset confusable-sound lexicon contains the first keyword, and, when the lexicon contains the unmatched keyword, determine the second keyword corresponding to the first keyword in the lexicon;
    a processing submodule, configured to replace the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombine the replaced second keyword with the other keywords into a voice signal recognition result, and take the recombined result as the current voice signal recognition result.
  17. The speech recognition system according to claim 13, characterized in that the voice signal recognition module includes:
    a conversion submodule, configured to convert the received voice signal into a character string;
    a splitting submodule, configured to split the character string into multiple keywords according to a preset keyword library;
    a part-of-speech matching submodule, configured to tag the part of speech of each keyword, and determine whether the parts of speech of adjacent keywords match;
    a determining submodule, configured to take the unmatched keyword as a first keyword when the parts of speech of adjacent keywords do not match, determine whether a preset confusable-sound lexicon contains the first keyword, and, when the lexicon contains the unmatched keyword, determine the second keyword corresponding to the first keyword in the lexicon;
    a processing submodule, configured to replace the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches its adjacent keywords, recombine the replaced second keyword with the other keywords into a voice signal recognition result, and take the recombined result as the current voice signal recognition result.
  18. The speech recognition system according to claim 11, wherein the processing module comprises:
    a splitting submodule, configured to split the speech signal recognition result and the lip-reading recognition result each into a plurality of keywords;
    a correlation degree calculation submodule, configured to determine a first correlation degree between each pair of adjacent keywords obtained by splitting the speech signal recognition result, and to determine a second correlation degree between each pair of adjacent keywords obtained by splitting the lip-reading recognition result;
    an accuracy calculation submodule, configured to sum the determined first correlation degrees to obtain the accuracy of the speech signal recognition result, and to sum the determined second correlation degrees to obtain the accuracy of the lip-reading recognition result;
    a processing submodule, configured to take the recognition result with the higher accuracy as the current speech recognition result.
  19. The speech recognition system according to claim 12, wherein the processing module comprises:
    a splitting submodule, configured to split the speech signal recognition result and the lip-reading recognition result each into a plurality of keywords;
    a correlation degree calculation submodule, configured to determine a first correlation degree between each pair of adjacent keywords obtained by splitting the speech signal recognition result, and to determine a second correlation degree between each pair of adjacent keywords obtained by splitting the lip-reading recognition result;
    an accuracy calculation submodule, configured to sum the determined first correlation degrees to obtain the accuracy of the speech signal recognition result, and to sum the determined second correlation degrees to obtain the accuracy of the lip-reading recognition result;
    a processing submodule, configured to take the recognition result with the higher accuracy as the current speech recognition result.
  20. The speech recognition system according to claim 13, wherein the processing module comprises:
    a splitting submodule, configured to split the speech signal recognition result and the lip-reading recognition result each into a plurality of keywords;
    a correlation degree calculation submodule, configured to determine a first correlation degree between each pair of adjacent keywords obtained by splitting the speech signal recognition result, and to determine a second correlation degree between each pair of adjacent keywords obtained by splitting the lip-reading recognition result;
    an accuracy calculation submodule, configured to sum the determined first correlation degrees to obtain the accuracy of the speech signal recognition result, and to sum the determined second correlation degrees to obtain the accuracy of the lip-reading recognition result;
    a processing submodule, configured to take the recognition result with the higher accuracy as the current speech recognition result.
PCT/CN2014/094624 2014-11-28 2014-12-23 Voice recognition method and system WO2016082267A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU2014412434A AU2014412434B2 (en) 2014-11-28 2014-12-23 Voice recognition method and system
US15/127,790 US10262658B2 (en) 2014-11-28 2014-12-23 Voice recognition method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410714386.2 2014-11-28
CN201410714386.2A CN104409075B (zh) 2014-11-28 2014-11-28 Voice recognition method and system

Publications (1)

Publication Number Publication Date
WO2016082267A1 (zh) 2016-06-02

Family

ID=52646698

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/094624 WO2016082267A1 (zh) 2014-11-28 2014-12-23 Voice recognition method and system

Country Status (4)

Country Link
US (1) US10262658B2 (zh)
CN (1) CN104409075B (zh)
AU (1) AU2014412434B2 (zh)
WO (1) WO2016082267A1 (zh)

Families Citing this family (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157956A * 2015-03-24 2016-11-23 中兴通讯股份有限公司 Speech recognition method and apparatus
CN106157957A * 2015-04-28 2016-11-23 中兴通讯股份有限公司 Speech recognition method and apparatus, and user equipment
CN105334743B * 2015-11-18 2018-10-26 深圳创维-Rgb电子有限公司 Emotion-recognition-based smart home control method and system
CN106971722B * 2016-01-14 2020-07-17 芋头科技(杭州)有限公司 Remote speech recognition system and method provided with correlation degrees
CN107452381B * 2016-05-30 2020-12-29 中国移动通信有限公司研究院 Multimedia speech recognition apparatus and method
CN106250829A * 2016-07-22 2016-12-21 中国科学院自动化研究所 Digit recognition method based on lip texture structure
CN106529502B * 2016-08-01 2019-09-24 深圳奥比中光科技有限公司 Lip-reading recognition method and apparatus
CN106648530B * 2016-11-21 2020-09-08 海信集团有限公司 Voice control method and terminal
US11132429B2 * 2016-12-14 2021-09-28 Telefonaktiebolaget Lm Ericsson (Publ) Authenticating a user subvocalizing a displayed text
CN106875941B * 2017-04-01 2020-02-18 彭楚奥 Speech semantic recognition method for a service robot
CN107369449B * 2017-07-14 2019-11-26 上海木木机器人技术有限公司 Effective speech recognition method and apparatus
CN107293300A * 2017-08-01 2017-10-24 珠海市魅族科技有限公司 Speech recognition method and apparatus, computer apparatus and readable storage medium
CN107702273B * 2017-09-20 2020-06-16 珠海格力电器股份有限公司 Air conditioner control method and apparatus
US10936705B2 * 2017-10-31 2021-03-02 Baidu Usa Llc Authentication method, electronic device, and computer-readable program medium
US11095502B2 2017-11-03 2021-08-17 Otis Elevator Company Adhoc protocol for commissioning connected devices in the field
CN108363745B 2018-01-26 2020-06-30 阿里巴巴集团控股有限公司 Method and apparatus for switching from robot customer service to human customer service
CN108320747A * 2018-02-08 2018-07-24 广东美的厨房电器制造有限公司 Household appliance control method, device, terminal and computer-readable storage medium
CN108427548A * 2018-02-26 2018-08-21 广东小天才科技有限公司 Microphone-based user interaction method, apparatus and device, and storage medium
CN108596107A 2018-04-26 2018-09-28 京东方科技集团股份有限公司 AR-device-based lip-reading recognition method and apparatus, and AR device
CN109031201A * 2018-06-01 2018-12-18 深圳市鹰硕技术有限公司 Voice positioning method and apparatus based on behavior recognition
CN110837758B * 2018-08-17 2023-06-02 杭州海康威视数字技术股份有限公司 Keyword input method and apparatus, and electronic device
CN109102805A * 2018-09-20 2018-12-28 北京长城华冠汽车技术开发有限公司 Voice interaction method, apparatus and implementation apparatus
CN109377995B * 2018-11-20 2021-06-01 珠海格力电器股份有限公司 Method and apparatus for controlling a device
KR20200073733A 2018-12-14 2020-06-24 삼성전자주식회사 Method for executing a function of an electronic device, and electronic device using the same
CN109817211B * 2019-02-14 2021-04-02 珠海格力电器股份有限公司 Electrical appliance control method and apparatus, storage medium, and electrical appliance
CN109979450B * 2019-03-11 2021-12-07 海信视像科技股份有限公司 Information processing method and apparatus, and electronic device
CN110545396A * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Speech recognition method and apparatus based on localization denoising
CN110827823A * 2019-11-13 2020-02-21 联想(北京)有限公司 Voice-assisted recognition method and apparatus, storage medium, and electronic device
CN111445912A * 2020-04-03 2020-07-24 深圳市阿尔垎智能科技有限公司 Speech processing method and system
CN111447325A * 2020-04-03 2020-07-24 上海闻泰电子科技有限公司 Call assistance method, apparatus, terminal and storage medium
CN111626310B * 2020-05-27 2023-08-29 百度在线网络技术(北京)有限公司 Image comparison method, apparatus and device, and storage medium
CN113763941A * 2020-06-01 2021-12-07 青岛海尔洗衣机有限公司 Speech recognition method, speech recognition system and electrical appliance
CN112037788B * 2020-09-10 2021-08-24 中航华东光电(上海)有限公司 Speech correction and fusion method
CN112820274B * 2021-01-08 2021-09-28 上海仙剑文化传媒股份有限公司 Speech information recognition and correction method and system
CN113068058A * 2021-03-19 2021-07-02 安徽宝信信息科技有限公司 Real-time on-screen-subtitle live streaming system based on speech recognition and transcription technology
CN117217212A * 2022-05-30 2023-12-12 青岛海尔科技有限公司 Corpus recognition method, apparatus and device, and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6594629B1 (en) * 1999-08-06 2003-07-15 International Business Machines Corporation Methods and apparatus for audio-visual speech detection and recognition
AU2001296459A1 (en) * 2000-10-02 2002-04-15 Clarity, L.L.C. Audio visual speech processing
US7526425B2 (en) * 2001-08-14 2009-04-28 Evri Inc. Method and system for extending keyword searching to syntactically and semantically annotated data
US7165029B2 (en) * 2002-05-09 2007-01-16 Intel Corporation Coupled hidden Markov model for audiovisual speech recognition
JP4867654B2 * 2006-12-28 2012-02-01 日産自動車株式会社 Speech recognition device and speech recognition method
CN102013103B * 2010-12-03 2013-04-03 上海交通大学 Real-time dynamic lip tracking method
CN102298443B * 2011-06-24 2013-09-25 华南理工大学 Smart home voice control system combined with a video channel and control method thereof
WO2013097075A1 * 2011-12-26 2013-07-04 Intel Corporation Vehicle based determination of occupant audio and visual input
CN103678271B * 2012-09-10 2016-09-14 华为技术有限公司 Text correction method and user equipment
CN102932212A * 2012-10-12 2013-02-13 华南理工大学 Smart home control system based on multi-channel interaction
KR101482430B1 * 2013-08-13 2015-01-15 포항공과대학교 산학협력단 Preposition correction method and apparatus for performing the same
CN105096935B * 2014-05-06 2019-08-09 阿里巴巴集团控股有限公司 Voice input method, apparatus and system
US9854139B2 (en) * 2014-06-24 2017-12-26 Sony Mobile Communications Inc. Lifelog camera and method of controlling same using voice triggers

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633844B1 (en) * 1999-12-02 2003-10-14 International Business Machines Corporation Late integration in audio-visual continuous speech recognition
JP2002304194A * 2001-02-05 2002-10-18 Masanobu Kujirada System, method and program for voice and/or mouth-shape input
US20030171932A1 (en) * 2002-03-07 2003-09-11 Biing-Hwang Juang Speech recognition
US20100063820A1 (en) * 2002-09-12 2010-03-11 Broadcom Corporation Correlating video images of lip movements with audio signals to improve speech recognition
CN102023703A * 2009-09-22 2011-04-20 现代自动车株式会社 Multimode interface system combining lip reading and speech recognition
US20130054240A1 (en) * 2011-08-25 2013-02-28 Samsung Electronics Co., Ltd. Apparatus and method for recognizing voice by using lip image

Also Published As

Publication number Publication date
AU2014412434A1 (en) 2016-10-20
US10262658B2 (en) 2019-04-16
CN104409075A (zh) 2015-03-11
AU2014412434B2 (en) 2020-10-08
US20170098447A1 (en) 2017-04-06
CN104409075B (zh) 2018-09-04

Similar Documents

Publication Publication Date Title
WO2016082267A1 (zh) 2016-06-02 Speech recognition method and system
WO2019051899A1 (zh) 2019-03-21 Terminal control method and apparatus, and storage medium
WO2019051890A1 (zh) 2019-03-21 Terminal control method and apparatus, and computer-readable storage medium
WO2019041406A1 (zh) 2019-03-07 Indecent picture recognition method, terminal and device, and computer-readable storage medium
WO2015127859A1 (en) 2015-09-03 Sensitive text detecting method and apparatus
WO2020246844A1 (en) 2020-12-10 Device control method, conflict processing method, corresponding apparatus and electronic device
WO2019051895A1 (zh) 2019-03-21 Terminal control method and apparatus, and storage medium
WO2019085495A1 (zh) 2019-05-09 Micro-expression recognition method, apparatus and system, and computer-readable storage medium
WO2018166236A1 (zh) 2018-09-20 Claim settlement bill recognition method, apparatus and device, and computer-readable storage medium
WO2015184760A1 (zh) 2015-12-10 Mid-air gesture input method and apparatus
WO2019029261A1 (zh) 2019-02-14 Micro-expression recognition method and apparatus, and storage medium
WO2019205323A1 (zh) 2019-10-31 Air conditioner, parameter adjustment method and apparatus therefor, and readable storage medium
WO2016018004A1 (en) 2016-02-04 Method, apparatus, and system for providing translated content
WO2019114269A1 (zh) 2019-06-20 Program resume-playback method, television device, and computer-readable storage medium
WO2015007007A1 (zh) 2015-01-22 ADC automatic calibration method and apparatus
WO2015158132A1 (zh) 2015-10-22 Voice control method and system
WO2018149191A1 (zh) 2018-08-23 Insurance policy underwriting method, apparatus and device, and computer-readable storage medium
WO2018076569A1 (zh) 2018-05-03 Trip-computer-based program flashing method and apparatus
WO2019041851A1 (zh) 2019-03-07 Household appliance after-sales consultation method, electronic device and computer-readable storage medium
WO2021029627A1 (en) 2021-02-18 Server that supports speech recognition of device, and operation method of the server
WO2021261830A1 (en) 2021-12-30 Video quality assessment method and apparatus
WO2016127458A1 (zh) 2016-08-18 Improved semantic-dictionary-based word similarity calculation method and apparatus
WO2015158133A1 (zh) 2015-10-22 Voice control instruction error correction method and system
WO2019165723A1 (zh) 2019-09-06 Audio and video processing method, system and device, and storage medium
WO2019051934A1 (zh) 2019-03-21 Service personnel assessment method, assessment platform, and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14907044

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 15127790

Country of ref document: US

ENP Entry into the national phase

Ref document number: 2014412434

Country of ref document: AU

Date of ref document: 20141223

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13/10/2017)

122 Ep: pct application non-entry in european phase

Ref document number: 14907044

Country of ref document: EP

Kind code of ref document: A1