WO2016082267A1 - Speech recognition method and system - Google Patents
- Publication number
- WO2016082267A1 (PCT/CN2014/094624)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- keyword
- recognition result
- lip
- speech
- keywords
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
- G06F18/256—Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
- G06V10/811—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
Definitions
- the present invention relates to the field of voice control, and more particularly to a voice recognition method and system.
- with the rapid development of voice interaction, controlling terminals (such as televisions and air conditioners) by voice and inputting data by voice have become very widely used methods.
- however, there are still many problems in voice interaction, such as inaccurate speech recognition and strong environmental influence. For example, when there is noise or background music nearby, the voice signal collected by the voice collection device includes both the voice signal uttered by the person and the surrounding noise signal, so that the terminal cannot accurately recognize the received speech signal, resulting in inaccurate speech recognition.
- the main object of the present invention is to propose a speech recognition method and system, aiming at solving the technical problem that speech recognition is not accurate enough.
- the present invention provides a speech recognition method, and the speech recognition method includes the following steps:
- the accuracy of the speech signal recognition result and the lip language recognition result is calculated, and the recognition result with higher accuracy is used as the current speech recognition result.
- the step of performing lip language recognition on the image including the lip in the collected image to obtain a lip recognition result comprises:
- determining, among the collected images, the images that include a lip, using each image including a lip as an effective image, and determining a position of the lip in the effective image;
- the lip language recognition result is composed based on characters corresponding to the effective image of each frame.
- the step of determining the image of the lip included in the collected image, using the image containing the lip as the effective image, and determining the position of the lip in the effective image comprises:
- the frame image is determined to be an image including a lip, and the image including the lip is taken as an effective image;
- the position of the lip is determined based on the RGB chromaticity values of the individual pixels in the lip region.
- the step of identifying the received voice signal to obtain a voice signal recognition result comprises:
- the unmatched keyword is used as the first keyword, and it is determined whether the preset confusable-word lexicon contains the first keyword;
- the first keyword is replaced with a second keyword, and when the part of speech of the replaced second keyword matches that of the adjacent keyword, the replaced second keyword and the other keywords are recombined into a speech signal recognition result, and the recombined speech signal recognition result is used as the current speech signal recognition result.
- the step of identifying the received voice signal to obtain a voice signal recognition result further includes:
- the first keyword is replaced with another second keyword, and it is determined whether the part of speech of the replaced second keyword matches that of the adjacent keyword, until all the second keywords have been tried; the converted character string is then used as the current speech signal recognition result.
- the step of calculating the accuracy of the speech signal recognition result and the lip language recognition result, and using the higher accuracy recognition result as the current speech recognition result comprises:
- the recognition result with higher accuracy is taken as the current speech recognition result.
- the present invention further provides a speech recognition system, characterized in that the speech recognition system comprises:
- control module configured to control the image acquisition device to perform image acquisition when receiving the voice signal, and control the image collection device to stop image collection when the voice signal ends;
- a voice signal recognition module configured to identify the received voice signal to obtain a voice signal recognition result
- a lip language recognition module configured to perform lip language recognition on an image containing a lip in the collected image to obtain a lip language recognition result
- the processing module is configured to calculate the accuracy of the speech signal recognition result and the lip language recognition result, and use the recognition result with higher accuracy as the current speech recognition result.
- the lip language recognition module comprises:
- a lip positioning sub-module configured to determine an image including a lip in the collected image, using the image including the lip as an effective image, and determining a lip position in the effective image
- a recombination submodule configured to form a lip language recognition result based on characters corresponding to the effective image of each frame.
- the lip positioning submodule comprises:
- a facial contour determining unit configured to determine a facial contour in each of the acquired images
- a face position locating unit configured to compare the chromaticity value of each pixel in the determined facial contour with the chromaticity value of each pixel in a pre-stored face, to determine the face position in each captured image
- a lip region positioning unit for determining an eye position in the face position, and determining a lip region based on the relative position between the eye position and the lip position;
- a comparing unit for comparing the RGB chromaticity values of respective pixels in the lip region
- a processing unit configured to: when there is a pixel point in the lip region that the RGB chromaticity value satisfies a preset condition, determine the frame image as an image including a lip, and use the image including the lip as an effective image;
- a lip position locating unit for determining a position of the lip based on RGB chromaticity values of respective pixels in the lip region.
- the voice signal identification module comprises:
- a conversion submodule for converting the received voice signal into a character string
- the character string is split into multiple keywords
- a part of speech matching sub-module for labeling the part of speech of each of the keywords, and determining whether the part of speech of each adjacent keyword matches
- a determining sub-module configured to use the unmatched keyword as a first keyword when the parts of speech of adjacent keywords do not match, to determine whether the preset confusable-word lexicon contains the first keyword, and, when the first keyword exists in the confusable-word lexicon, to determine a second keyword corresponding to the first keyword in the confusable-word lexicon;
- a processing sub-module configured to replace the first keyword with a second keyword, and, when the part of speech of the replaced second keyword matches that of the adjacent keyword, to recombine the replaced second keyword and the other keywords into the speech signal recognition result and use the recombined speech signal recognition result as the current speech signal recognition result.
- the processing sub-module is further configured to: when the part of speech of the replaced second keyword does not match that of the adjacent keyword and there are multiple second keywords, replace the first keyword with another second keyword and determine whether the part of speech of the replaced second keyword matches that of the adjacent keyword, until all the second keywords have been tried; the converted character string is then used as the current speech signal recognition result.
- the processing module comprises:
- a splitting sub-module for splitting the speech signal recognition result and the lip language recognition result into a plurality of keywords
- a correlation degree calculation sub-module configured to determine a first degree of association between adjacent keywords among the keywords into which the speech signal recognition result is split, and to determine a second degree of association between adjacent keywords among the keywords into which the lip language recognition result is split;
- an accuracy calculation sub-module configured to sum the determined first association degrees to obtain the accuracy of the speech signal recognition result, and to sum the determined second association degrees to obtain the accuracy of the lip language recognition result;
- a processing sub-module for using the recognition result with higher accuracy as the current speech recognition result
- the speech recognition method and system proposed by the invention recognize the speech signal and the lip language simultaneously, calculate the accuracy of the speech signal recognition result and of the lip language recognition result, and use the recognition result with higher accuracy as the current recognition result, instead of recognizing only a single voice signal, thereby improving the accuracy of speech recognition.
- FIG. 1 is a schematic flow chart of a voice recognition method according to a preferred embodiment of the present invention.
- FIG. 2 is a schematic diagram of a refinement process of step S20 in FIG. 1;
- FIG. 3 is a schematic diagram of a refinement process of step S30 in FIG. 1;
- FIG. 4 is a schematic diagram of a refinement process of step S31 in FIG. 3;
- FIG. 5 is a schematic diagram showing the refinement process of step S40 in FIG. 1;
- FIG. 6 is a schematic diagram of functional modules of a preferred embodiment of a speech recognition system of the present invention.
- FIG. 7 is a schematic diagram of a refinement function module of the speech signal recognition module of FIG. 6;
- FIG. 8 is a schematic diagram of a refinement function module of the lip language recognition module of FIG. 6;
- FIG. 9 is a schematic diagram of a refinement function module of the lip positioning sub-module of FIG. 8;
- FIG. 10 is a schematic diagram of a refinement function module of the processing module of FIG. 6.
- the invention provides a speech recognition method.
- FIG. 1 is a schematic flowchart of a voice recognition method according to a preferred embodiment of the present invention.
- the voice recognition method proposed in this embodiment preferably runs in a controlled terminal (such as a television set or an air conditioner), and the controlled terminal performs the corresponding operation based on the voice recognition; alternatively, the voice recognition method may run on a control terminal, and the control terminal transmits the code corresponding to the voice signal recognition result to the corresponding controlled terminal.
- This embodiment provides a voice recognition method, where the voice recognition method includes:
- Step S10 when receiving the voice signal, controlling the image capturing device to perform image capturing, and controlling the image capturing device to stop image capturing when the voice signal ends;
- the image acquisition device is controlled to perform image acquisition only when a voice signal is being received, and is kept in a sleep state when no voice signal is received, so as to reduce energy consumption; for example, when no voice signal is received within a preset time interval, the image capture device is controlled to enter the sleep state.
- alternatively, the image acquisition device can be controlled to perform image acquisition in real time or periodically; when a voice signal is received, the first time point at which the voice signal is received and the second time point at which the voice signal ends are obtained, and the images acquired between the first time point and the second time point are used.
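The time-windowed selection described above can be sketched as follows. This is a minimal illustration, not from the patent text: the `Frame` type, its timestamp field, and the function name are assumptions made for the example.

```python
# Hypothetical sketch: pick out the frames an always-on camera captured
# between the start (first time point) and end (second time point) of a
# voice signal. Frame layout and names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float  # seconds since capture started
    pixels: object    # image payload, left opaque here

def frames_during_speech(frames, t_start, t_end):
    """Return only the frames acquired while the voice signal was active."""
    return [f for f in frames if t_start <= f.timestamp <= t_end]

frames = [Frame(0.0, None), Frame(0.5, None), Frame(1.0, None), Frame(1.5, None)]
active = frames_during_speech(frames, 0.4, 1.2)  # frames at 0.5 and 1.0
```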
- Step S20 identifying the received voice signal to obtain a voice signal recognition result
- the speech signal recognition result can be obtained by converting the speech signal into a character signal. Further, in order to improve the accuracy of the speech signal recognition result, the character string converted by the speech signal may be corrected.
- the specific error correction process is as shown in FIG. 2, and the step S20 includes:
- Step S21 converting the received voice signal into a character string, and splitting the character string into a plurality of keywords according to a preset keyword library;
- a keyword library including a plurality of keywords may be preset; the character string converted from the voice signal is compared with the keywords stored in the library, the keywords in the preset keyword library that match the string are determined, and the string is split into the matching keywords.
- it can be understood that not every keyword needs to be preset in the keyword library: after the keywords matching the string are determined, the matching keywords can be extracted from the string first, and the remaining unmatched part of the string is used as a keyword. For example, if the character string obtained by converting the voice signal is "television, switch to channel 23", the parts of the string that match keywords in the preset keyword library are "television", "switch", "to", and "channel"; these are extracted directly from the string, and the remaining "23" is used as a keyword.
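A minimal sketch of this splitting step, under assumptions not stated in the patent (a greedy longest-match strategy and an example library): matched keywords are extracted from the string, and any run of unmatched characters is kept as its own keyword.

```python
# Illustrative sketch of keyword splitting against a preset keyword library.
# The greedy longest-match policy and library contents are assumptions.
def split_into_keywords(text, keyword_library):
    keywords = []
    i = 0
    while i < len(text):
        # pick the longest library keyword starting at position i, if any
        match = max((k for k in keyword_library if text.startswith(k, i)),
                    key=len, default=None)
        if match:
            keywords.append(match)
            i += len(match)
        else:
            # collect unmatched characters into a single remainder keyword
            j = i
            while j < len(text) and not any(text.startswith(k, j)
                                            for k in keyword_library):
                j += 1
            keywords.append(text[i:j])
            i = j
    return keywords

library = {"television", "switch", "to", "channel"}
result = split_into_keywords("televisionswitchtochannel23", library)
```

Here the unmatched remainder "23" survives as its own keyword, mirroring the "switch to channel 23" example above.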
- Step S22 marking the part of speech of each of the keywords, and determining whether the part of speech between the adjacent keywords matches;
- the part of speech can be a noun, verb, adjective, adverb, or preposition, and combinations of parts of speech may be preset; for example, when adjacent keywords form a verb + adjective combination, the parts of speech of the adjacent keywords are considered not to match, and a recognition error may exist.
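The adjacency check can be sketched as below. The patent only says that combinations such as verb + adjective count as mismatches; the concrete set of allowed pairs and the tagged example are assumptions made for illustration.

```python
# Sketch of the part-of-speech adjacency check. ALLOWED_PAIRS is a made-up
# example of the preset combinations; real systems would preset their own.
ALLOWED_PAIRS = {
    ("noun", "verb"), ("verb", "noun"), ("verb", "preposition"),
    ("preposition", "noun"), ("adjective", "noun"), ("noun", "noun"),
}

def first_mismatch(tagged_keywords):
    """Return the index of the first keyword whose part of speech clashes
    with its left neighbour, or None when every adjacent pair is allowed."""
    for i in range(1, len(tagged_keywords)):
        left_pos = tagged_keywords[i - 1][1]
        right_pos = tagged_keywords[i][1]
        if (left_pos, right_pos) not in ALLOWED_PAIRS:
            return i
    return None

# verb + adjective is not an allowed pair, so index 1 is flagged
tagged = [("switch", "verb"), ("bright", "adjective"), ("channel", "noun")]
idx = first_mismatch(tagged)
```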
- Step S23: when the parts of speech of adjacent keywords do not match, using the unmatched keyword as the first keyword, and determining whether the preset confusable-word lexicon contains the first keyword;
- Step S24: when the first keyword exists in the confusable-word lexicon, determining a second keyword corresponding to the first keyword in the confusable-word lexicon;
- the confusable-word lexicon can be preset; it stores keywords that are easily confused with one another when a voice signal is converted into a character string, with the mutually confusable keywords saved in association with each other.
- the unmatched keyword may be used as the first keyword and compared with the keywords in the confusable-word lexicon, so as to correct erroneous keywords.
- when the confusable-word lexicon does not contain the first keyword, the converted character string can be used as the current speech signal recognition result.
- Step S25: replacing the first keyword with the second keyword, and, when the part of speech of the replaced second keyword matches that of the adjacent keyword, recombining the replaced second keyword and the other keywords into the speech signal recognition result, and using the recombined speech signal recognition result as the current speech signal recognition result.
- when the part of speech of the replaced second keyword does not match that of the adjacent keyword and other second keywords exist, the first keyword is replaced with another second keyword, and it is determined whether the parts of speech of the replaced second keyword and the adjacent keyword match, until all the second keywords have been tried; the converted character string is then used as the current speech signal recognition result.
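The correction loop of steps S23–S25 can be sketched as follows. The lexicon entry, the toy part-of-speech table, and the pair-matching rule are all hypothetical placeholders; only the control flow (try each homophone until the parts of speech fit, else keep the converted string) comes from the text above.

```python
# Hedged sketch of the confusable-word correction loop. CONFUSABLE, POS and
# the pair_ok rule are illustrative assumptions, not part of the patent.
CONFUSABLE = {"weather": ["whether"]}  # hypothetical homophone lexicon

def correct_keyword(keywords, idx, pos, pair_ok):
    """Try each second keyword for keywords[idx]; return the corrected list,
    or the original list when no candidate fits its neighbours."""
    for candidate in CONFUSABLE.get(keywords[idx], []):
        trial = keywords[:idx] + [candidate] + keywords[idx + 1:]
        left_ok = idx == 0 or pair_ok(pos(trial[idx - 1]), pos(candidate))
        right_ok = (idx == len(trial) - 1
                    or pair_ok(pos(candidate), pos(trial[idx + 1])))
        if left_ok and right_ok:
            return trial          # recombine with the matching replacement
    return keywords               # all candidates tried: keep converted string

POS = {"check": "verb", "weather": "noun", "whether": "conjunction",
       "it": "pronoun", "rains": "verb"}
ok = lambda a, b: (a, b) in {("verb", "conjunction"),
                             ("conjunction", "pronoun"), ("pronoun", "verb")}
fixed = correct_keyword(["check", "weather", "it", "rains"], 1, POS.get, ok)
```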
- Step S30 performing lip language recognition on the image including the lip in the collected image to obtain a lip recognition result
- the lip recognition result is determined according to the lip shape in each frame image and the lip shape in the image of the previous frame.
- the specific process is as shown in FIG. 3, and the step S30 includes:
- Step S31 determining an image including a lip in the collected image, using the image including the lip as an effective image, and determining a position of the lip in the effective image;
- Step S311: determining a facial contour in each frame of the acquired images;
- the face position in each frame image can be directly obtained according to the chromaticity value distribution of the pixel points in each frame image and the preset facial contour.
- alternatively, the sound source direction can be located based on the received voice signal, and the user's position in the collected image can be determined based on the determined sound source direction; determining the user's position in the acquired image based on the sound source direction belongs to the prior art and is not described in detail here.
- when no facial contour exists in the collected images, the speech signal recognition result corresponding to the speech signal is directly used as the current speech recognition result, or the user may be prompted to re-enter the voice signal.
- Step S312 comparing each pixel point chromaticity value in the determined face contour with the chromaticity value of each pixel point in the pre-stored face to determine the face position in each captured image;
- Step S313 determining an eye position in the face position, and determining a lip region based on the relative position between the eye position and the lip position;
- the eye position can be determined according to the grayscale values of the individual pixels; since the lip lies below the eye position, in the lower third of the face, the region where the lip is located can easily be determined.
- Step S314 when there is a pixel point whose RGB chromaticity value satisfies a preset condition in the lip region, determining that the frame image is an image including a lip, and using the image including the lip as an effective image;
- Step S315 determining the position of the lip based on the RGB chromaticity values of the respective pixels in the lip region.
- the lip position needs to be determined within the region. Since the B (blue) component of the RGB chromaticity value of a lip pixel is much larger than its G (green) component, while for the rest of the face the B component is smaller than the G component, the preset condition can be set as the difference between the B component and the G component being greater than a preset value; the B and G components of each pixel can then be compared to determine the lip position.
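The B − G test described above can be sketched as below. The threshold value and the tiny example "region" are assumptions for illustration; the patent only fixes the rule that lip pixels satisfy B − G greater than a preset value.

```python
# Sketch of the lip-pixel test: mark pixels whose blue component exceeds
# the green component by more than a preset value. Threshold is arbitrary.
THRESHOLD = 20  # minimum B - G difference, example value only

def lip_pixels(region):
    """Return (row, col) of pixels whose RGB values satisfy B - G > THRESHOLD."""
    hits = []
    for r, row in enumerate(region):
        for c, (red, green, blue) in enumerate(row):
            if blue - green > THRESHOLD:
                hits.append((r, c))
    return hits

region = [
    [(200, 160, 150), (190, 110, 140)],   # skin-like, lip-like
    [(180, 100, 135), (205, 165, 155)],   # lip-like, skin-like
]
marked = lip_pixels(region)
```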
- Step S32 determining a character output by the user according to the lip shape of the effective image of each frame and the lip shape of the effective image of the previous frame;
- Step S33 the lip language recognition result is composed based on the characters corresponding to the effective image in each frame.
- the lip shape of the frame preceding the first frame of the acquired images is a closed lip shape by default; the user's lip trend can be derived from the previous frame image and the current frame image, and the derived lip trend is compared with the pre-stored lip trends to obtain the currently output character. The characters of the individual frame images are then combined into the lip language recognition result according to the acquisition order of the frames.
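A toy sketch of that per-frame lookup, with everything hypothetical except the control flow: real lip shapes would be feature vectors, not string labels, and the trend-to-character table here is invented for illustration.

```python
# Illustrative sketch: the transition (previous shape -> current shape) is
# looked up in a pre-stored trend table; the frame before the first frame
# is treated as closed, as stated above. All labels are made-up examples.
TREND_TABLE = {
    ("closed", "wide_open"): "a",
    ("wide_open", "rounded"): "o",
}

def characters_from_shapes(shapes):
    """shapes: one lip-shape label per effective frame, in capture order."""
    chars, prev = [], "closed"   # default: closed lips before frame one
    for shape in shapes:
        char = TREND_TABLE.get((prev, shape))
        if char is not None:
            chars.append(char)
        prev = shape
    return "".join(chars)

text = characters_from_shapes(["wide_open", "rounded"])
```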
- Step S40 calculating the accuracy of the speech signal recognition result and the lip language recognition result, and using the recognition result with higher accuracy as the current recognition result.
- FIG. 5 shows the specific process for calculating the accuracy of the speech signal recognition result and the lip language recognition result, which is as follows:
- Step S41: splitting the voice signal recognition result and the lip language recognition result into a plurality of keywords
- Step S42: determining a first degree of association between adjacent keywords among the keywords into which the voice signal recognition result is split, and determining a second degree of association between adjacent keywords among the keywords into which the lip language recognition result is split; the second degree of association is calculated in the same way as the first degree of association and is not described again here.
- Step S43: summing the determined first degrees of association to obtain the accuracy of the voice signal recognition result, and summing the determined second degrees of association to obtain the accuracy of the lip language recognition result;
- a first degree of association is calculated for each pair of adjacent keywords in the string, yielding a plurality of first degrees of association; the calculated association degrees are summed to obtain the overall accuracy of the string.
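Steps S41–S44 can be sketched as below. The association table is a toy stand-in (the patent does not define how association degrees are computed); only the sum-and-compare logic follows the text.

```python
# Sketch of the accuracy comparison: sum association degrees over adjacent
# keyword pairs in each result and keep the higher-scoring result.
# ASSOC is a made-up co-occurrence table for illustration.
ASSOC = {("switch", "channel"): 0.9, ("switch", "channels"): 0.4,
         ("channel", "23"): 0.8, ("channels", "23"): 0.3}

def accuracy(keywords):
    """Sum of association degrees over adjacent keyword pairs (step S43)."""
    return sum(ASSOC.get((a, b), 0.0) for a, b in zip(keywords, keywords[1:]))

def pick_result(speech_keywords, lip_keywords):
    """Step S44: keep whichever result has the higher accuracy."""
    return (speech_keywords
            if accuracy(speech_keywords) >= accuracy(lip_keywords)
            else lip_keywords)

speech = ["switch", "channels", "23"]   # accuracy 0.4 + 0.3 = 0.7
lip = ["switch", "channel", "23"]       # accuracy 0.9 + 0.8 = 1.7
chosen = pick_result(speech, lip)       # the lip result wins here
```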
- step S44 the recognition result with higher accuracy is taken as the current speech recognition result.
- the speech recognition method proposed in this embodiment performs speech signal recognition and lip language recognition simultaneously, calculates the accuracy of the speech signal recognition result and of the lip language recognition result, and uses the recognition result with higher accuracy as the current recognition result, rather than simply recognizing the speech signal, thereby improving the accuracy of speech recognition.
- the invention further provides a speech recognition system.
- FIG. 6 is a schematic diagram of functional modules of a preferred embodiment of the speech recognition system of the present invention.
- the functional module diagram shown in FIG. 6 is merely an exemplary diagram of a preferred embodiment; those skilled in the art may add new functional modules around the functional modules of the voice recognition system shown in FIG. 6. The name of each functional module is a custom name, used only to aid in understanding the function to be achieved by the functional module of that name.
- the voice recognition system proposed in this embodiment preferably runs in a controlled terminal (such as a television set or an air conditioner), and the controlled terminal performs the corresponding operation based on the voice recognition; alternatively, the voice recognition system may run on a control terminal, and the control terminal transmits the code corresponding to the voice signal recognition result to the corresponding controlled terminal.
- This embodiment provides a voice recognition system, where the voice recognition system includes:
- the control module 10 is configured to control the image capturing device to perform image acquisition when receiving the voice signal, and control the image capturing device to stop image capturing when the voice signal ends;
- the control module 10 controls the image acquisition device to perform image acquisition only when a voice signal is being received, and keeps it in a sleep state when no voice signal is received, so as to reduce power consumption; for example, when no voice signal is received within a preset time interval, the control module 10 controls the image capture device to enter the sleep state.
- alternatively, the control module 10 can control the image acquisition device to perform image acquisition in real time or periodically; when a voice signal is received, it determines the first time point at which the voice signal is received and the second time point at which the voice signal ends, and acquires the images captured by the image capturing device between the first time point and the second time point.
- the voice signal identification module 20 is configured to identify the received voice signal to obtain a voice signal recognition result
- the speech signal recognition module 20 can obtain the speech signal recognition result by converting the speech signal into a character signal. Further, in order to improve the accuracy of the speech signal recognition result, the speech signal converted character string may be error-corrected. Referring to FIG. 7, the speech signal recognition module 20 includes:
- a conversion sub-module 21 configured to convert the received voice signal into a character string
- a splitting sub-module 22, configured to split the character string into a plurality of keywords according to a preset keyword library
- a keyword library including a plurality of keywords may be preset; the splitting sub-module 22 compares the character string converted from the voice signal with the keywords stored in the library, determines the keywords in the preset keyword library that match the string, and splits the string into the matching keywords. It can be understood by those skilled in the art that not every keyword needs to be preset in the keyword library: after the keywords matching the string are determined, the splitting sub-module 22 may first extract the matching keywords from the string and use the remaining unmatched part of the string as a keyword. For example, if the character string obtained by converting the voice signal is "television, switch to channel 23", the parts of the string that match keywords in the preset keyword library are "television", "switch", "to", and "channel"; these are extracted directly from the string, and the remaining "23" is used as a keyword.
- the part of speech matching sub-module 23 is configured to mark the part of speech of each of the keywords, and determine whether the part of speech of each adjacent keyword matches;
- the part of speech can be a noun, verb, adjective, adverb, preposition, etc., and combinations of parts of speech may be preset; for example, when adjacent keywords form a verb + adjective combination, the part-of-speech matching sub-module 23 considers the parts of speech of the adjacent keywords not to match, and a recognition error may exist.
- the confusable-word lexicon can be preset; it stores keywords that are easily confused with one another when a voice signal is converted into a character string, with the mutually confusable keywords saved in association with each other.
- the unmatched keyword may be used as the first keyword and compared with the keywords in the confusable-word lexicon, so as to correct erroneous keywords.
- when the confusable-word lexicon does not contain the first keyword, the converted character string can be used as the current speech signal recognition result.
- the processing sub-module 25 is configured to replace the first keyword with the second keyword, and, when the part of speech of the replaced second keyword matches that of the adjacent keyword, to recombine the replaced second keyword and the other keywords into the speech signal recognition result and use the recombined speech signal recognition result as the current speech signal recognition result.
- when the part of speech of the replaced second keyword does not match that of the adjacent keyword and other second keywords exist, the processing sub-module 25 replaces the first keyword with another second keyword and determines whether the parts of speech of the replaced second keyword and the adjacent keyword match, until all the second keywords have been tried; the processing sub-module 25 then uses the converted character string as the current speech signal recognition result.
- the lip language recognition module 30 is configured to perform lip language recognition on the image including the lip in the collected image to obtain a lip language recognition result;
- the lip recognition result is determined according to the lip shape in each frame image and the lip shape in the image of the previous frame.
- the lip language recognition module 30 includes:
- a lip positioning sub-module 31 configured to determine an image including a lip in the collected image, use the image including the lip as an effective image, and determine a lip position in the effective image;
- the lip positioning sub-module 31 includes:
- a facial contour determining unit 311, configured to determine the facial contour in each frame of the acquired images;
- the facial contour determining unit 311 can directly obtain the face position in each frame image according to the chromaticity value distribution of the pixels in the frame image and a preset facial contour.
- alternatively, the facial contour determining unit 311 can locate the sound source direction based on the received voice signal and determine the position of the user in the acquired image based on the determined sound source direction; determining a user's position in an image from a sound source direction is prior art and is not described in detail here.
- when there is no face contour in the collected images, the processing module 40 directly uses the voice signal recognition result corresponding to the voice signal as the current voice recognition result, or may prompt the user to re-enter the voice signal.
- the face position locating unit 312 is configured to compare the chromaticity values of the pixels within the determined facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each captured image.
- a lip region positioning unit 313, configured to determine an eye position in the face position, and determine a lip region based on the relative position between the eye position and the lip position;
- the eye position can be determined according to the grayscale values of the pixels; the region where the lips are located can then easily be determined as lying below the eye position, in the lower third of the face.
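As a rough illustration of the step above, the candidate lip region can be taken as the lower third of the detected face; the exact proportion and the function name are assumptions, since the text describes the region only qualitatively.

```python
# Rough sketch: derive the candidate lip region from the face bounding box,
# assuming the lips lie below the eyes in the lower third of the face.
def lip_region(face_top, face_bottom, face_left, face_right):
    """Return (top, bottom, left, right) of the candidate lip region:
    the lower third of the detected face."""
    face_height = face_bottom - face_top
    region_top = face_bottom - face_height // 3  # lower third of the face
    return (region_top, face_bottom, face_left, face_right)
```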
- the comparing unit 314 is configured to compare RGB chromaticity values of respective pixel points in the lip region;
- the processing unit 315 is configured to: when there are pixels in the lip region whose RGB chromaticity values meet the preset condition, determine that the frame image is an image including a lip, and use the image including the lip as an effective image;
- the lip position locating unit 316 is configured to determine the position of the lip based on the RGB chromaticity values of the respective pixel points in the lip region.
- the lip position needs to be determined within the lip region. Since the B (blue) component of a lip pixel's RGB chromaticity value is much larger than its G (green) component, the preset condition can be set so that the difference between the B component and the G component is greater than a preset value; since the B component of the rest of the face is smaller than its G component, the B and G components of each pixel can be compared to determine the lip position.
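A minimal sketch of this pixel test, assuming a hypothetical numeric threshold for the B − G difference (the text only says the difference must be "greater than a preset value"):

```python
# Illustrative sketch of the lip-pixel test: a pixel is treated as a lip
# pixel when its B component exceeds its G component by more than a
# preset value. The threshold of 20 is an assumption for illustration.
def find_lip_pixels(region, threshold=20):
    """region: rows of (R, G, B) tuples. Returns (x, y) coordinates of
    pixels satisfying the preset condition B - G > threshold."""
    lip_pixels = []
    for y, row in enumerate(region):
        for x, (r, g, b) in enumerate(row):
            if b - g > threshold:   # preset condition on the B and G components
                lip_pixels.append((x, y))
    return lip_pixels
```

If the returned list is non-empty, the frame counts as an image including a lip (an effective image), and the matching coordinates locate the lips.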
- a determining sub-module 32 configured to determine a character output by the user according to a lip shape of the effective image of each frame and a lip shape of the effective image of a previous frame;
- the recombination sub-module 33 is configured to compose the lip language recognition result from the characters corresponding to each frame of effective image.
- by default, the lip shape in the frame preceding the first frame of the acquired images is a closed lip shape; the user's lip movement trend can be derived from the previous frame image and the current frame image.
- the derived lip trend is compared with pre-stored lip trends to obtain the currently output character; the characters of the frame images are then combined into the lip language recognition result according to the acquisition order of the frames.
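The per-frame matching just described can be sketched as follows; the trend encoding and the lookup table contents are assumptions for illustration (the patent pre-stores lip trends but does not specify their representation).

```python
# Sketch of mapping per-frame lip shapes to characters. Each (previous,
# current) lip-shape pair is looked up in a pre-stored trend table; frames
# are processed in acquisition order, and the frame before the first frame
# defaults to a closed lip shape.
CLOSED = "closed"

def recognize_characters(frame_lip_shapes, trend_table):
    """frame_lip_shapes: lip shape per effective image, in order.
    trend_table: maps (previous_shape, current_shape) to a character."""
    chars = []
    prev = CLOSED  # default lip shape before the first frame
    for shape in frame_lip_shapes:
        char = trend_table.get((prev, shape))
        if char is not None:
            chars.append(char)
        prev = shape
    return "".join(chars)
```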
- the processing module 40 is configured to calculate the accuracy of the speech signal recognition result and the lip language recognition result, and use the recognition result with higher accuracy as the current speech recognition result.
- the processing module 40 includes:
- the splitting sub-module 41 is configured to split the speech signal recognition result and the lip language recognition result into a plurality of keywords;
- the association degree calculation sub-module 42 is configured to determine the first degree of association between adjacent keywords among the keywords into which the speech signal recognition result is split, and to determine the second degree of association between adjacent keywords among the keywords into which the lip language recognition result is split;
- in the calculation of the first degree of association: p(x) is the number of times keyword x appears in the string; p(y) is the number of times keyword y, the other of the two adjacent keywords x and y, appears in the string; and p(x, y) is the number of times the two adjacent keywords x and y appear adjacently in the string.
- the second degree of association is calculated in the same way as the first degree of association and is not described again here.
- the accuracy calculation sub-module 43 is configured to sum the determined first degrees of association to obtain the accuracy of the speech signal recognition result, and to sum the determined second degrees of association to obtain the accuracy of the lip language recognition result;
- the first degree of association is calculated for each pair of adjacent keywords in the string, yielding a plurality of first degrees of association; the calculated degrees of association are then summed to obtain the overall accuracy of the string.
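A sketch of the accuracy computation. p(x), p(y), and p(x, y) are the occurrence counts defined above, but the exact combining formula is not reproduced in this text, so a pointwise-mutual-information-style ratio p(x, y) / (p(x) · p(y)) is assumed here purely for illustration.

```python
# Hedged sketch: sum an assumed association degree over adjacent keyword
# pairs, then keep whichever recognition result scores higher.
def association_sum(keywords):
    """Sum the assumed association degree p(x, y) / (p(x) * p(y)) over all
    adjacent keyword pairs; the summed value is the result's accuracy."""
    counts = {k: keywords.count(k) for k in set(keywords)}   # p(x), p(y)
    pairs = list(zip(keywords, keywords[1:]))
    total = 0.0
    for x, y in pairs:
        pair_count = pairs.count((x, y))                      # p(x, y)
        total += pair_count / (counts[x] * counts[y])         # assumed formula
    return total

def pick_result(speech_keywords, lip_keywords):
    """Use the recognition result whose summed association degree is higher."""
    if association_sum(speech_keywords) >= association_sum(lip_keywords):
        return speech_keywords
    return lip_keywords
```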
- the processing sub-module 44 is configured to use the recognition result with higher accuracy as the current speech recognition result.
- the speech recognition system proposed in this embodiment performs speech signal recognition and lip language recognition simultaneously, calculates the accuracy of the speech signal recognition result and of the lip language recognition result, and uses the recognition result with the higher accuracy as the current recognition result; compared with recognizing the speech signal alone, this improves the accuracy of speech recognition.
- the methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on such an understanding, the part of the technical solution of the present invention that is essential, or that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) that includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the methods described in the various embodiments of the present invention.
Abstract
Description
Claims (20)
- A speech recognition method, characterized in that the speech recognition method comprises the following steps: when a speech signal is received, controlling an image acquisition device to acquire images, and when the speech signal ends, controlling the image acquisition device to stop image acquisition; recognizing the received speech signal to obtain a speech signal recognition result; performing lip language recognition on the images containing lips among the acquired images to obtain a lip language recognition result; and calculating the accuracy of the speech signal recognition result and of the lip language recognition result, and using the recognition result with the higher accuracy as the current speech recognition result.
- The speech recognition method according to claim 1, characterized in that the step of performing lip language recognition on the images containing lips among the acquired images to obtain a lip language recognition result comprises: determining the images containing lips among the acquired images, using the images containing lips as effective images, and determining the position of the lips in the effective images; determining the characters output by the user according to the lip shape of each frame of effective image and the lip shape of the previous frame of effective image; and composing the lip language recognition result from the characters corresponding to each frame of effective image.
- The speech recognition method according to claim 2, characterized in that the step of determining the images containing lips among the acquired images, using the images containing lips as effective images, and determining the lip position in the effective images comprises: determining the facial contour in each acquired frame image; comparing the chromaticity values of the pixels within the facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each acquired frame image; determining the eye position within the face position, and determining the lip region based on the relative position between the eye position and the lip position; comparing the RGB chromaticity values of the pixels in the lip region; when there are pixels in the lip region whose RGB chromaticity values satisfy a preset condition, determining that the frame image is an image containing lips and using the image containing lips as an effective image; and determining the position of the lips based on the RGB chromaticity values of the pixels in the lip region.
- The speech recognition method according to claim 1, characterized in that the step of recognizing the received speech signal to obtain a speech signal recognition result comprises: converting the received speech signal into a character string, and splitting the character string into a plurality of keywords according to a preset keyword library; tagging the part of speech of each of the keywords, and determining whether the parts of speech of adjacent keywords match; when the parts of speech of adjacent keywords do not match, using the unmatched keyword as a first keyword, and determining whether the first keyword exists in a preset confusion-sound vocabulary; when the unmatched keyword exists in the confusion-sound vocabulary, determining the second keyword corresponding to the first keyword in the confusion-sound vocabulary; and replacing the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches the adjacent keywords, recombining the replaced second keyword and the other keywords into a speech signal recognition result, and using the recombined result as the current speech signal recognition result.
- The speech recognition method according to claim 4, characterized in that the step of recognizing the received speech signal to obtain a speech signal recognition result further comprises: when the parts of speech of the replaced second keyword and the adjacent keywords do not match and there are a plurality of second keywords, replacing the first keyword with another second keyword and determining whether the parts of speech of the replaced second keyword and the adjacent keywords match, until all the second keywords have been replaced, and using the converted character string as the current speech signal recognition result.
- The speech recognition method according to claim 2, characterized in that the step of recognizing the received speech signal to obtain a speech signal recognition result comprises: converting the received speech signal into a character string, and splitting the character string into a plurality of keywords according to a preset keyword library; tagging the part of speech of each of the keywords, and determining whether the parts of speech of adjacent keywords match; when the parts of speech of adjacent keywords do not match, using the unmatched keyword as a first keyword, and determining whether the first keyword exists in a preset confusion-sound vocabulary; when the unmatched keyword exists in the confusion-sound vocabulary, determining the second keyword corresponding to the first keyword in the confusion-sound vocabulary; and replacing the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches the adjacent keywords, recombining the replaced second keyword and the other keywords into a speech signal recognition result, and using the recombined result as the current speech signal recognition result.
- The speech recognition method according to claim 3, characterized in that the step of recognizing the received speech signal to obtain a speech signal recognition result comprises: converting the received speech signal into a character string, and splitting the character string into a plurality of keywords according to a preset keyword library; tagging the part of speech of each of the keywords, and determining whether the parts of speech of adjacent keywords match; when the parts of speech of adjacent keywords do not match, using the unmatched keyword as a first keyword, and determining whether the first keyword exists in a preset confusion-sound vocabulary; when the unmatched keyword exists in the confusion-sound vocabulary, determining the second keyword corresponding to the first keyword in the confusion-sound vocabulary; and replacing the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches the adjacent keywords, recombining the replaced second keyword and the other keywords into a speech signal recognition result, and using the recombined result as the current speech signal recognition result.
- The speech recognition method according to claim 1, characterized in that the step of calculating the accuracy of the speech signal recognition result and of the lip language recognition result and using the recognition result with the higher accuracy as the current speech recognition result comprises: splitting the speech signal recognition result and the lip language recognition result into a plurality of keywords; determining the first degree of association between adjacent keywords among the keywords into which the speech signal recognition result is split, and determining the second degree of association between adjacent keywords among the keywords into which the lip language recognition result is split; summing the determined first degrees of association to obtain the accuracy of the speech signal recognition result, and summing the determined second degrees of association to obtain the accuracy of the lip language recognition result; and using the recognition result with the higher accuracy as the current speech recognition result.
- The speech recognition method according to claim 2, characterized in that the step of calculating the accuracy of the speech signal recognition result and of the lip language recognition result and using the recognition result with the higher accuracy as the current speech recognition result comprises: splitting the speech signal recognition result and the lip language recognition result into a plurality of keywords; determining the first degree of association between adjacent keywords among the keywords into which the speech signal recognition result is split, and determining the second degree of association between adjacent keywords among the keywords into which the lip language recognition result is split; summing the determined first degrees of association to obtain the accuracy of the speech signal recognition result, and summing the determined second degrees of association to obtain the accuracy of the lip language recognition result; and using the recognition result with the higher accuracy as the current speech recognition result.
- The speech recognition method according to claim 3, characterized in that the step of calculating the accuracy of the speech signal recognition result and of the lip language recognition result and using the recognition result with the higher accuracy as the current speech recognition result comprises: splitting the speech signal recognition result and the lip language recognition result into a plurality of keywords; determining the first degree of association between adjacent keywords among the keywords into which the speech signal recognition result is split, and determining the second degree of association between adjacent keywords among the keywords into which the lip language recognition result is split; summing the determined first degrees of association to obtain the accuracy of the speech signal recognition result, and summing the determined second degrees of association to obtain the accuracy of the lip language recognition result; and using the recognition result with the higher accuracy as the current speech recognition result.
- A speech recognition system, characterized in that the speech recognition system comprises: a control module configured to, when a speech signal is received, control an image acquisition device to acquire images, and when the speech signal ends, control the image acquisition device to stop image acquisition; a speech signal recognition module configured to recognize the received speech signal to obtain a speech signal recognition result; a lip language recognition module configured to perform lip language recognition on the images containing lips among the acquired images to obtain a lip language recognition result; and a processing module configured to calculate the accuracy of the speech signal recognition result and of the lip language recognition result, and to use the recognition result with the higher accuracy as the current speech recognition result.
- The speech recognition system according to claim 11, characterized in that the lip language recognition module comprises: a lip positioning sub-module configured to determine the images containing lips among the acquired images, use the images containing lips as effective images, and determine the lip position in the effective images; a determining sub-module configured to determine the characters output by the user according to the lip shape of each frame of effective image and the lip shape of the previous frame of effective image; and a recombination sub-module configured to compose the lip language recognition result from the characters corresponding to each frame of effective image.
- The speech recognition system according to claim 12, characterized in that the lip positioning sub-module comprises: a facial contour determining unit configured to determine the facial contour in each acquired frame image; a face position locating unit configured to compare the chromaticity values of the pixels within the determined facial contour with the chromaticity values of the pixels of a pre-stored face, to determine the face position in each acquired frame image; a lip region locating unit configured to determine the eye position within the face position and determine the lip region based on the relative position between the eye position and the lip position; a comparing unit configured to compare the RGB chromaticity values of the pixels in the lip region; a processing unit configured to, when there are pixels in the lip region whose RGB chromaticity values satisfy a preset condition, determine that the frame image is an image containing lips and use the image containing lips as an effective image; and a lip position locating unit configured to determine the position of the lips based on the RGB chromaticity values of the pixels in the lip region.
- The speech recognition system according to claim 11, characterized in that the speech signal recognition module comprises: a conversion sub-module configured to convert the received speech signal into a character string; a splitting sub-module configured to split the character string into a plurality of keywords according to a preset keyword library; a part-of-speech matching sub-module configured to tag the part of speech of each of the keywords and determine whether the parts of speech of adjacent keywords match; a determining sub-module configured to, when the parts of speech of adjacent keywords do not match, use the unmatched keyword as a first keyword and determine whether the first keyword exists in a preset confusion-sound vocabulary, and, when the unmatched keyword exists in the confusion-sound vocabulary, determine the second keyword corresponding to the first keyword in the confusion-sound vocabulary; and a processing sub-module configured to replace the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches the adjacent keywords, recombine the replaced second keyword and the other keywords into a speech signal recognition result and use the recombined result as the current speech signal recognition result.
- The speech recognition system according to claim 14, characterized in that the processing sub-module is further configured to, when the parts of speech of the replaced second keyword and the adjacent keywords do not match and there are a plurality of second keywords, replace the first keyword with another second keyword and determine whether the parts of speech of the replaced second keyword and the adjacent keywords match, until all the second keywords have been replaced, and use the converted character string as the current speech signal recognition result.
- The speech recognition system according to claim 12, characterized in that the speech signal recognition module comprises: a conversion sub-module configured to convert the received speech signal into a character string; a splitting sub-module configured to split the character string into a plurality of keywords according to a preset keyword library; a part-of-speech matching sub-module configured to tag the part of speech of each of the keywords and determine whether the parts of speech of adjacent keywords match; a determining sub-module configured to, when the parts of speech of adjacent keywords do not match, use the unmatched keyword as a first keyword and determine whether the first keyword exists in a preset confusion-sound vocabulary, and, when the unmatched keyword exists in the confusion-sound vocabulary, determine the second keyword corresponding to the first keyword in the confusion-sound vocabulary; and a processing sub-module configured to replace the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches the adjacent keywords, recombine the replaced second keyword and the other keywords into a speech signal recognition result and use the recombined result as the current speech signal recognition result.
- The speech recognition system according to claim 13, characterized in that the speech signal recognition module comprises: a conversion sub-module configured to convert the received speech signal into a character string; a splitting sub-module configured to split the character string into a plurality of keywords according to a preset keyword library; a part-of-speech matching sub-module configured to tag the part of speech of each of the keywords and determine whether the parts of speech of adjacent keywords match; a determining sub-module configured to, when the parts of speech of adjacent keywords do not match, use the unmatched keyword as a first keyword and determine whether the first keyword exists in a preset confusion-sound vocabulary, and, when the unmatched keyword exists in the confusion-sound vocabulary, determine the second keyword corresponding to the first keyword in the confusion-sound vocabulary; and a processing sub-module configured to replace the first keyword with the second keyword and, when the part of speech of the replaced second keyword matches the adjacent keywords, recombine the replaced second keyword and the other keywords into a speech signal recognition result and use the recombined result as the current speech signal recognition result.
- The speech recognition system according to claim 11, characterized in that the processing module comprises: a splitting sub-module configured to split the speech signal recognition result and the lip language recognition result into a plurality of keywords; an association degree calculation sub-module configured to determine the first degree of association between adjacent keywords among the keywords into which the speech signal recognition result is split, and determine the second degree of association between adjacent keywords among the keywords into which the lip language recognition result is split; an accuracy calculation sub-module configured to sum the determined first degrees of association to obtain the accuracy of the speech signal recognition result, and sum the determined second degrees of association to obtain the accuracy of the lip language recognition result; and a processing sub-module configured to use the recognition result with the higher accuracy as the current speech recognition result.
- The speech recognition system according to claim 12, characterized in that the processing module comprises: a splitting sub-module configured to split the speech signal recognition result and the lip language recognition result into a plurality of keywords; an association degree calculation sub-module configured to determine the first degree of association between adjacent keywords among the keywords into which the speech signal recognition result is split, and determine the second degree of association between adjacent keywords among the keywords into which the lip language recognition result is split; an accuracy calculation sub-module configured to sum the determined first degrees of association to obtain the accuracy of the speech signal recognition result, and sum the determined second degrees of association to obtain the accuracy of the lip language recognition result; and a processing sub-module configured to use the recognition result with the higher accuracy as the current speech recognition result.
- The speech recognition system according to claim 13, characterized in that the processing module comprises: a splitting sub-module configured to split the speech signal recognition result and the lip language recognition result into a plurality of keywords; an association degree calculation sub-module configured to determine the first degree of association between adjacent keywords among the keywords into which the speech signal recognition result is split, and determine the second degree of association between adjacent keywords among the keywords into which the lip language recognition result is split; an accuracy calculation sub-module configured to sum the determined first degrees of association to obtain the accuracy of the speech signal recognition result, and sum the determined second degrees of association to obtain the accuracy of the lip language recognition result; and a processing sub-module configured to use the recognition result with the higher accuracy as the current speech recognition result.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2014412434A AU2014412434B2 (en) | 2014-11-28 | 2014-12-23 | Voice recognition method and system |
US15/127,790 US10262658B2 (en) | 2014-11-28 | 2014-12-23 | Voice recognition method and system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410714386.2 | 2014-11-28 | ||
CN201410714386.2A CN104409075B (zh) | 2014-11-28 | Voice recognition method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016082267A1 true WO2016082267A1 (zh) | 2016-06-02 |
Family
ID=52646698
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2014/094624 WO2016082267A1 (zh) | 2014-12-23 | Voice recognition method and system |
Country Status (4)
Country | Link |
---|---|
US (1) | US10262658B2 (zh) |
CN (1) | CN104409075B (zh) |
AU (1) | AU2014412434B2 (zh) |
WO (1) | WO2016082267A1 (zh) |
Families Citing this family (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106157956A (zh) * | 2015-03-24 | 2016-11-23 | 中兴通讯股份有限公司 | Speech recognition method and apparatus |
CN106157957A (zh) * | 2015-04-28 | 2016-11-23 | 中兴通讯股份有限公司 | Speech recognition method and apparatus, and user equipment |
CN105334743B (zh) * | 2015-11-18 | 2018-10-26 | 深圳创维-Rgb电子有限公司 | Emotion-recognition-based smart home control method and system |
CN106971722B (zh) * | 2016-01-14 | 2020-07-17 | 芋头科技(杭州)有限公司 | Remote speech recognition system and method provided with association degrees |
CN107452381B (zh) * | 2016-05-30 | 2020-12-29 | 中国移动通信有限公司研究院 | Multimedia speech recognition apparatus and method |
CN106250829A (zh) * | 2016-07-22 | 2016-12-21 | 中国科学院自动化研究所 | Digit recognition method based on lip texture structure |
CN106529502B (zh) * | 2016-08-01 | 2019-09-24 | 深圳奥比中光科技有限公司 | Lip reading recognition method and apparatus |
CN106648530B (zh) * | 2016-11-21 | 2020-09-08 | 海信集团有限公司 | Voice control method and terminal |
US11132429B2 (en) * | 2016-12-14 | 2021-09-28 | Telefonaktiebolaget Lm Ericsson (Publ) | Authenticating a user subvocalizing a displayed text |
CN106875941B (zh) * | 2017-04-01 | 2020-02-18 | 彭楚奥 | Speech semantic recognition method for a service robot |
CN107369449B (zh) * | 2017-07-14 | 2019-11-26 | 上海木木机器人技术有限公司 | Effective speech recognition method and apparatus |
CN107293300A (zh) * | 2017-08-01 | 2017-10-24 | 珠海市魅族科技有限公司 | Speech recognition method and apparatus, computer apparatus, and readable storage medium |
CN107702273B (zh) * | 2017-09-20 | 2020-06-16 | 珠海格力电器股份有限公司 | Air conditioner control method and apparatus |
US10936705B2 (en) * | 2017-10-31 | 2021-03-02 | Baidu Usa Llc | Authentication method, electronic device, and computer-readable program medium |
US11095502B2 (en) | 2017-11-03 | 2021-08-17 | Otis Elevator Company | Adhoc protocol for commissioning connected devices in the field |
CN108363745B (zh) | 2018-01-26 | 2020-06-30 | 阿里巴巴集团控股有限公司 | Method and apparatus for switching from robot customer service to human customer service |
CN108320747A (zh) * | 2018-02-08 | 2018-07-24 | 广东美的厨房电器制造有限公司 | Home appliance control method, device, terminal, and computer-readable storage medium |
CN108427548A (zh) * | 2018-02-26 | 2018-08-21 | 广东小天才科技有限公司 | Microphone-based user interaction method, apparatus, device, and storage medium |
CN108596107A (zh) | 2018-04-26 | 2018-09-28 | 京东方科技集团股份有限公司 | AR-device-based lip reading recognition method and apparatus, and AR device |
CN109031201A (zh) * | 2018-06-01 | 2018-12-18 | 深圳市鹰硕技术有限公司 | Behavior-recognition-based voice localization method and apparatus |
CN110837758B (zh) * | 2018-08-17 | 2023-06-02 | 杭州海康威视数字技术股份有限公司 | Keyword input method and apparatus, and electronic device |
CN109102805A (zh) * | 2018-09-20 | 2018-12-28 | 北京长城华冠汽车技术开发有限公司 | Voice interaction method, apparatus, and implementation apparatus |
CN109377995B (zh) * | 2018-11-20 | 2021-06-01 | 珠海格力电器股份有限公司 | Method and apparatus for controlling a device |
KR20200073733A (ko) | 2018-12-14 | 2020-06-24 | 삼성전자주식회사 | Method for executing a function of an electronic device, and electronic device using the same |
CN109817211B (zh) * | 2019-02-14 | 2021-04-02 | 珠海格力电器股份有限公司 | Electric appliance control method and apparatus, storage medium, and electric appliance |
CN109979450B (zh) * | 2019-03-11 | 2021-12-07 | 海信视像科技股份有限公司 | Information processing method and apparatus, and electronic device |
CN110545396A (zh) * | 2019-08-30 | 2019-12-06 | 上海依图信息技术有限公司 | Speech recognition method and apparatus based on localization and denoising |
CN110827823A (zh) * | 2019-11-13 | 2020-02-21 | 联想(北京)有限公司 | Voice-assisted recognition method and apparatus, storage medium, and electronic device |
CN111445912A (zh) * | 2020-04-03 | 2020-07-24 | 深圳市阿尔垎智能科技有限公司 | Voice processing method and system |
CN111447325A (zh) * | 2020-04-03 | 2020-07-24 | 上海闻泰电子科技有限公司 | Call assistance method and apparatus, terminal, and storage medium |
CN111626310B (zh) * | 2020-05-27 | 2023-08-29 | 百度在线网络技术(北京)有限公司 | Image comparison method and apparatus, device, and storage medium |
CN113763941A (zh) * | 2020-06-01 | 2021-12-07 | 青岛海尔洗衣机有限公司 | Speech recognition method, speech recognition system, and electric appliance |
CN112037788B (zh) * | 2020-09-10 | 2021-08-24 | 中航华东光电(上海)有限公司 | Speech correction fusion method |
CN112820274B (zh) * | 2021-01-08 | 2021-09-28 | 上海仙剑文化传媒股份有限公司 | Voice information recognition and correction method and system |
CN113068058A (zh) * | 2021-03-19 | 2021-07-02 | 安徽宝信信息科技有限公司 | Real-time subtitle on-screen live broadcast system based on speech recognition and transcription technology |
CN117217212A (zh) * | 2022-05-30 | 2023-12-12 | 青岛海尔科技有限公司 | Corpus recognition method, apparatus, device, and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002304194A (ja) * | 2001-02-05 | 2002-10-18 | Masanobu Kujirada | System, method, and program for voice and/or mouth shape input |
US20030171932A1 (en) * | 2002-03-07 | 2003-09-11 | Biing-Hwang Juang | Speech recognition |
US6633844B1 (en) * | 1999-12-02 | 2003-10-14 | International Business Machines Corporation | Late integration in audio-visual continuous speech recognition |
US20100063820A1 (en) * | 2002-09-12 | 2010-03-11 | Broadcom Corporation | Correlating video images of lip movements with audio signals to improve speech recognition |
CN102023703A (zh) * | 2009-09-22 | 2011-04-20 | 现代自动车株式会社 | Multimodal interface system combining lip reading and voice recognition |
US20130054240A1 (en) * | 2011-08-25 | 2013-02-28 | Samsung Electronics Co., Ltd. | Apparatus and method for recognizing voice by using lip image |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6594629B1 (en) * | 1999-08-06 | 2003-07-15 | International Business Machines Corporation | Methods and apparatus for audio-visual speech detection and recognition |
AU2001296459A1 (en) * | 2000-10-02 | 2002-04-15 | Clarity, L.L.C. | Audio visual speech processing |
US7526425B2 (en) * | 2001-08-14 | 2009-04-28 | Evri Inc. | Method and system for extending keyword searching to syntactically and semantically annotated data |
US7165029B2 (en) * | 2002-05-09 | 2007-01-16 | Intel Corporation | Coupled hidden Markov model for audiovisual speech recognition |
JP4867654B2 (ja) * | 2006-12-28 | 2012-02-01 | 日産自動車株式会社 | Speech recognition apparatus and speech recognition method |
CN102013103B (zh) * | 2010-12-03 | 2013-04-03 | 上海交通大学 | Real-time dynamic lip tracking method |
CN102298443B (zh) * | 2011-06-24 | 2013-09-25 | 华南理工大学 | Smart home voice control system combined with a video channel and control method thereof |
WO2013097075A1 (en) * | 2011-12-26 | 2013-07-04 | Intel Corporation | Vehicle based determination of occupant audio and visual input |
CN103678271B (zh) * | 2012-09-10 | 2016-09-14 | 华为技术有限公司 | Text correction method and user equipment |
CN102932212A (zh) * | 2012-10-12 | 2013-02-13 | 华南理工大学 | Smart home control system based on multi-channel interaction |
KR101482430B1 (ko) * | 2013-08-13 | 2015-01-15 | 포항공과대학교 산학협력단 | Preposition error correction method and apparatus for performing the same |
CN105096935B (zh) * | 2014-05-06 | 2019-08-09 | 阿里巴巴集团控股有限公司 | Voice input method, apparatus, and system |
US9854139B2 (en) * | 2014-06-24 | 2017-12-26 | Sony Mobile Communications Inc. | Lifelog camera and method of controlling same using voice triggers |
- 2014-11-28 CN CN201410714386.2A patent/CN104409075B/zh active Active
- 2014-12-23 US US15/127,790 patent/US10262658B2/en active Active
- 2014-12-23 WO PCT/CN2014/094624 patent/WO2016082267A1/zh active Application Filing
- 2014-12-23 AU AU2014412434A patent/AU2014412434B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
AU2014412434A1 (en) | 2016-10-20 |
US10262658B2 (en) | 2019-04-16 |
CN104409075A (zh) | 2015-03-11 |
AU2014412434B2 (en) | 2020-10-08 |
US20170098447A1 (en) | 2017-04-06 |
CN104409075B (zh) | 2018-09-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2016082267A1 (zh) | Voice recognition method and system | |
WO2019051899A1 (zh) | Terminal control method and apparatus, and storage medium | |
WO2019051890A1 (zh) | Terminal control method and apparatus, and computer-readable storage medium | |
WO2019041406A1 (zh) | Indecent picture recognition method, terminal, device, and computer-readable storage medium | |
WO2015127859A1 (en) | Sensitive text detecting method and apparatus | |
WO2020246844A1 (en) | Device control method, conflict processing method, corresponding apparatus and electronic device | |
WO2019051895A1 (zh) | Terminal control method and apparatus, and storage medium | |
WO2019085495A1 (zh) | Micro-expression recognition method, apparatus and system, and computer-readable storage medium | |
WO2018166236A1 (zh) | Claim settlement bill recognition method, apparatus, device, and computer-readable storage medium | |
WO2015184760A1 (zh) | Mid-air gesture input method and apparatus | |
WO2019029261A1 (zh) | Micro-expression recognition method, apparatus, and storage medium | |
WO2019205323A1 (zh) | Air conditioner, parameter adjustment method and apparatus therefor, and readable storage medium | |
WO2016018004A1 (en) | Method, apparatus, and system for providing translated content | |
WO2019114269A1 (zh) | Program resuming method, television device, and computer-readable storage medium | |
WO2015007007A1 (zh) | Method and apparatus for automatic ADC correction | |
WO2015158132A1 (zh) | Voice control method and system | |
WO2018149191A1 (zh) | Policy underwriting method, apparatus, device, and computer-readable storage medium | |
WO2018076569A1 (zh) | Trip-computer-based program flashing method and apparatus | |
WO2019041851A1 (zh) | Home appliance after-sales consultation method, electronic device, and computer-readable storage medium | |
WO2021029627A1 (en) | Server that supports speech recognition of device, and operation method of the server | |
WO2021261830A1 (en) | Video quality assessment method and apparatus | |
WO2016127458A1 (zh) | Improved semantic-dictionary-based word similarity calculation method and apparatus | |
WO2015158133A1 (zh) | Voice control instruction error correction method and system | |
WO2019165723A1 (zh) | Audio and video processing method, system, device, and storage medium | |
WO2019051934A1 (zh) | Service personnel assessment method, assessment platform, and computer-readable storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14907044; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 15127790; Country of ref document: US |
| | ENP | Entry into the national phase | Ref document number: 2014412434; Country of ref document: AU; Date of ref document: 20141223; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13/10/2017) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 14907044; Country of ref document: EP; Kind code of ref document: A1 |