CN109545196B - Speech recognition method, device and computer readable storage medium - Google Patents
- Publication number: CN109545196B (application CN201811644306.5A)
- Authority: CN (China)
- Prior art keywords: voice, user, sound, model, background sound
- Prior art date: 2018-12-29
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/063 — Speech recognition: creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/22 — Speech recognition: procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26 — Speech recognition: speech to text systems
- G10L21/0208 — Speech enhancement: noise filtering
- G10L2015/0635 — Training: updating or merging of old and new templates; mean values; weighting
- G10L2015/223 — Execution procedure of a spoken command
Abstract
The invention discloses a speech recognition method comprising the following steps: monitoring voice information uttered by a user; denoising the voice information and recognizing the user's voice command according to a pre-stored voice model; collecting background sound from the user's surroundings; recognizing the background sound according to a pre-stored background-sound model and determining the user's location from the recognition result; and combining the voice command with the location information to form and output a final recognition result. The invention also discloses a speech recognition device and a computer-readable storage medium. The invention improves the speech recognition accuracy of intelligent terminal devices.
Description
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, and computer-readable storage medium.
Background
With the development of science and technology and the progress of computer technology, speech recognition is already applied in many fields of daily life and industry, and the prior art offers a variety of speech recognition methods and devices for human-computer interaction, contributing greatly to economic development. However, existing speech recognition technology can generally only recognize the pronunciation of unimpaired speakers; when a user's pronunciation is inaccurate or a language impairment exists, recognition is difficult or inaccurate. Take the elderly as an example: with increasing age, certain language disorders, such as aphasia, occur at a high rate in the elderly population. Aphasia patients may have impaired language expression when speaking, reading or writing, although their intelligence is not affected. Existing speech recognition technology struggles to recognize the speech of people with aphasia, or its recognition accuracy drops sharply, which makes related applications difficult; for example, when speech recognition is applied to a companion robot, the robot can hardly fulfil its companion role if it cannot recognize the user's speech.
In view of the above, it is necessary to provide a speech recognition technology to improve the accuracy of speech recognition and expand the application range of the speech recognition technology.
Disclosure of Invention
The main object of the present invention is to provide a speech recognition method that improves the accuracy of speech recognition and expands the application range of speech recognition technology.
In order to achieve the above object, the present invention provides a speech recognition method, including:
monitoring voice information uttered by a user;
denoising the voice information and recognizing the user's voice command according to a pre-stored voice model;
collecting background sound from the user's surroundings;
recognizing the background sound according to a pre-stored background-sound model, and determining the user's location from the recognition result;
and combining the voice command with the location information to form and output a final recognition result.
Preferably, the denoising of the voice information and recognition of the user's voice command according to a pre-stored voice model includes:
acquiring characteristic parameters of the plosives, fricatives and nasals in the user's voice information and comparing them with the corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below the preset range, applying enhancement processing to that sound.
Preferably, the method further comprises:
performing a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and forming and storing a new voice model according to the analysis result.
Preferably, recognizing the background sound according to a pre-stored background-sound model and determining the user's location from the recognition result includes:
comparing the collected sound emitted by a preset sound source and the ambient background sound with the background-sound model respectively, and determining the user's location from the comparison result.
Preferably, the method may further include: displaying the recognition result as images and text for the user to select or confirm, and outputting the recognition result to an external device after the user's selection or confirmation; and/or broadcasting the recognition result to the user by voice and receiving the user's feedback.
The present invention also provides a speech recognition apparatus, comprising:
a voice acquisition module, configured to monitor voice information uttered by a user;
a first processing module, configured to denoise the voice information and recognize the user's voice command according to a pre-stored voice model;
a background-sound monitoring module, configured to collect background sound from the user's surroundings;
a second processing module, configured to recognize the background sound according to a pre-stored background-sound model and determine the user's location from the recognition result;
and an output module, configured to combine the voice command with the location information to form and output a final recognition result.
Preferably, the voice acquisition module is configured to:
acquiring characteristic parameters of the plosives, fricatives and nasals in the user's voice information and comparing them with the corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below the preset range, applying enhancement processing to that sound.
Preferably, the above apparatus further comprises:
an updating module, configured to perform a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and to form and store a new voice model according to the analysis result.
The present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the foregoing speech recognition method.
According to the invention, the extraction of the user's voice command is combined with recognition of the background sound in the environment; when the user's pronunciation is incomplete or unclear, the user's true intention is inferred with the help of the environment recognition result, thereby improving speech recognition accuracy.
Drawings
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating the steps of comparing the voice information of the user with the voice model to obtain the voice command of the user in the voice recognition method according to the embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a first processing module and a second processing module of a speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the present invention provides a speech recognition method, including:
Step S10: monitor voice information uttered by the user. In the embodiment of the invention, a voice monitoring device can be arranged on an intelligent device such as a mobile phone, tablet or robot to collect the voice information uttered by the user.
Step S20: denoise the voice information and recognize the user's voice command according to a pre-stored voice model. When voice information is collected, it is denoised by a voice chip to obtain the voice command issued by the user.
Step S30: collect background sound from the user's surroundings. After the voice command is obtained, a second sound-monitoring device in the intelligent device (mobile phone, tablet, robot, etc.) is woken to detect and receive the background sound in the environment.
Step S40: recognize the background sound according to a pre-stored background-sound model, and determine the user's location from the recognition result. For example, the background sound is analyzed by the voice chip; whether the user is outdoors or indoors is judged from the sound volume, and it can further be judged from the volume or type of sound whether the user is in a bedroom, living room or kitchen.
Step S50: combine the voice command with the location information to form and output a final recognition result. When both the voice command and the location information are clear, a recognition result is produced and output. In the embodiment of the invention, when the user's pronunciation is incomplete or unclear, the user's true intention is judged with the help of the environment recognition result, thereby improving the accuracy of speech recognition.
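To make the flow of steps S10 to S50 concrete, the following is a minimal sketch of the pipeline in Python. The helper functions, the feature representation, and the nearest-template matching are illustrative assumptions of this sketch, not the disclosed implementation.

```python
import numpy as np

def denoise(x: np.ndarray) -> np.ndarray:
    # Placeholder denoiser: a short moving-average filter stands in for the
    # voice chip's denoising (no specific algorithm is specified above).
    kernel = np.ones(5) / 5.0
    return np.convolve(x, kernel, mode="same")

def match_model(feat: np.ndarray, models: dict) -> str:
    # Nearest-template matching: return the name of the stored model whose
    # feature vector is closest (Euclidean distance) to the observed one.
    return min(models, key=lambda name: float(np.linalg.norm(feat - models[name])))

def recognize(voice_feat, background_feat, voice_models, background_models) -> str:
    command = match_model(denoise(voice_feat), voice_models)              # S20
    location = match_model(denoise(background_feat), background_models)   # S40
    return f"{command} in the {location}"                                 # S50
```

With hypothetical templates such as voice_models = {"turn on the air conditioner": ...} and background_models = {"bedroom": ...}, the combined result becomes "turn on the air conditioner in the bedroom", matching scene one below.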
The following application scenario helps to further illustrate the speech recognition solution of the present invention:
Scene one: with the help of the robot, an elderly person turns on the bedroom air conditioner by saying "air conditioner" or "turn on the air conditioner". The specific process is as follows:
Step A: the user issues a voice command to the accompanying robot;
Step B: the first sound-receiving device of the accompanying robot receives the user's voice signal;
Step C: the microprocessor of the accompanying robot analyzes the signal to obtain a first recognition result: "turn on the air conditioner"; at the same time, the second sound-receiving device is woken and receives a background-sound signal from the surroundings;
Step D: the microprocessor of the accompanying robot analyzes the background sound to obtain a second recognition result: "bedroom";
Step E: the microprocessor of the accompanying robot performs a comprehensive analysis to obtain the final recognition result: "turn on the air conditioner in the bedroom";
Step F: the network device of the accompanying robot sends an operation command to the bedroom air conditioner according to the location information preset in the storage device, so that the air conditioner starts up and runs.
In the embodiment of the present invention, before performing the above steps, the method may further include: training and modeling the user's voice information and the background sound to form and store a voice model and a background-sound model. In the embodiment of the invention, the voices of people with pronunciation difficulties or speech impairments are collected for training and modeling, so that the user's pronunciation can be correctly recognized in application. In addition, indoor and outdoor background sounds are collected and modeled to identify the user's environment; for example, background sounds of several bedroom environments can be collected over different time periods, trained, modeled and stored, and in actual use the background-sound model is retrieved for comparison to determine the user's environment.
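As a concrete illustration of the background-sound modeling just described, the following sketch builds one averaged-MFCC template per environment from recordings captured at different times of day. The use of librosa and of mean-MFCC templates is an assumption of this sketch; the embodiment above does not name a feature type or toolkit.

```python
import numpy as np
import librosa

def build_background_model(recordings: dict) -> dict:
    """recordings maps an environment name (e.g. 'bedroom') to a list of WAV
    paths recorded at different time periods; returns one template per room."""
    model = {}
    for room, paths in recordings.items():
        feats = []
        for path in paths:
            y, sr = librosa.load(path, sr=16000)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames)
            feats.append(mfcc.mean(axis=1))                     # time-averaged
        model[room] = np.mean(feats, axis=0)                    # room template
    return model
```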
It can be understood that the foregoing step of denoising the voice information to recognize the user's voice command includes:
denoising the received voice information to obtain the user's voice information;
comparing the user's voice information with the voice model to obtain the user's voice command;
and the step of recognizing the background sound according to the pre-stored background-sound model and determining the user's location from the recognition result includes:
denoising the collected background sound, and determining the user's location from the denoised background sound to obtain the location information.
Considering that the background-sound models of some environments may be very similar or identical, sound sources that identify each environment can be placed in different environments in advance and captured in real time by the voice acquisition module; the voice chip then compares the collected sound emitted by the preset sound source, together with the ambient background sound, with the background-sound models, and determines the user's location from the comparison result. For example, a wind chime may mark an environment as the living room or the kitchen; when the user issues a voice command in that environment, the voice chip can recognize the location from the background sound produced by the environment's sound source.
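A minimal sketch of this location step follows, assuming the averaged-MFCC templates built above; the captured audio is expected to contain both the ambient background and the room's marker source (e.g. the wind chime), and nearest-template matching is an illustrative choice rather than a stated algorithm of the embodiment.

```python
import numpy as np
import librosa

def mfcc_vector(y: np.ndarray, sr: int) -> np.ndarray:
    # Time-averaged 13-dimensional MFCC vector, matching the template format.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

def locate_user(background: np.ndarray, sr: int, model: dict) -> str:
    feat = mfcc_vector(background, sr)
    # Choose the environment whose stored template is closest in feature space.
    distances = {room: float(np.linalg.norm(feat - tmpl))
                 for room, tmpl in model.items()}
    return min(distances, key=distances.get)
```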
Specifically, the solution of the present invention can be further understood through the following application scenario:
Scene two: an elderly person turns on the lamp in his or her current environment by saying "turn on the light". The specific process is as follows:
Step A1: the user issues a voice command to the accompanying robot: "turn on the light";
Step B1: the first sound-receiving device of the accompanying robot receives the user's voice signal;
Step C1: the microprocessor of the accompanying robot retrieves the voice model and analyzes the signal to obtain a first recognition result: "turn on the light"; at the same time, the second sound-receiving device is woken and receives background-sound signals from the surroundings;
Step D1: since the user is located between two environments (e.g. the living room and the kitchen), the microprocessor of the accompanying robot captures the sounds emitted by the sound sources of the living room and the kitchen and, by analyzing their differences, obtains a second recognition result: "living room";
Step E1: the microprocessor of the accompanying robot performs a comprehensive analysis to obtain the final recognition result: "turn on the living-room lamp";
Step F1: the network device of the accompanying robot sends a command to the living-room lamp switch according to the location information preset in the storage device, so that the switch executes the lamp-on command.
The extraction and selection of acoustic features is a key step in speech recognition. Acoustic-feature extraction is a process that greatly compresses the information in the signal, and also a signal deconvolution process, with the aim of making the patterns easier for the classifier to separate.
Due to the time-varying nature of speech signals, feature extraction must be performed on short segments of the signal, i.e. short-time analysis. Each segment is treated as a stationary analysis interval, commonly referred to as a frame, and the frame-to-frame offset typically takes 1/2 or 1/3 of the frame length. The signal is usually pre-emphasized to boost the high frequencies and windowed to avoid edge effects in short-time speech segments.
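The pre-processing above can be sketched as follows; the pre-emphasis coefficient 0.97 and the 25 ms frame with a 1/2-frame offset are common defaults assumed here, not values specified in this description.

```python
import numpy as np

def preprocess(y: np.ndarray, sr: int = 16000,
               frame_ms: float = 25.0, alpha: float = 0.97) -> np.ndarray:
    """Pre-emphasize, frame with a 1/2-frame hop, and apply a Hamming window."""
    y = np.append(y[0], y[1:] - alpha * y[:-1])  # pre-emphasis boosts highs
    frame_len = int(sr * frame_ms / 1000)
    hop = frame_len // 2                         # offset = 1/2 frame length
    assert len(y) >= frame_len, "signal must contain at least one frame"
    n_frames = (len(y) - frame_len) // hop + 1
    window = np.hamming(frame_len)               # tapers the frame edges
    return np.stack([y[i * hop: i * hop + frame_len] * window
                     for i in range(n_frames)])  # (n_frames, frame_len)
```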
Some of the acoustic features that are commonly used:
(1) Linear Predictive Coefficients (LPC): linear predictive analysis starts from the human phonation mechanism; through study of a short-tube cascade model of the vocal tract, the system's transfer function is taken to have the form of an all-pole digital filter, so the signal at time n can be estimated from a linear combination of the signals at the preceding few time steps.
(2) Cepstral coefficients: using homomorphic processing, the Discrete Fourier Transform (DFT) of the speech signal is computed, the logarithm is taken, and the inverse transform (IDFT) is applied to obtain the cepstral coefficients.
(3) Mel-Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP): unlike LPC, which is derived from the study of the human phonation mechanism, MFCC and PLP are acoustic features motivated by research on the human auditory system. (A sketch of these features follows this list.)
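Here is a minimal per-frame sketch of features (1) to (3): LPC via the standard Levinson-Durbin recursion on the autocorrelation sequence, the real cepstrum via the DFT, log, inverse-DFT recipe of item (2), and MFCC via librosa. The model order 12, the 13 MFCC coefficients, and the use of librosa are assumptions of this sketch.

```python
import numpy as np
import librosa

def lpc(frame: np.ndarray, order: int = 12) -> np.ndarray:
    """Item (1): all-pole (LPC) coefficients via the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0] + 1e-10
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a[1:i + 1] += k * a[i - 1::-1][:i]                 # update a_1 .. a_i
        err *= (1.0 - k * k)
    return a

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Item (2): DFT -> log magnitude -> inverse DFT."""
    spectrum = np.fft.fft(frame)
    return np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-10)))

def mfcc(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Item (3): 13 MFCCs per frame, computed with librosa."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
```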
Chinese acoustic features: taking Mandarin pronunciation as an example, the pronunciation of a character can be divided into two parts, the initial and the final. During pronunciation, the transition from initial to final is a gradual change rather than an instantaneous one; for this reason, Right-Context-Dependent Initial/Final (RCDIF) modeling is used as the analysis method, so that the correct syllable can be identified more accurately.
Considering that the elderly and people with pronunciation difficulties find it hard to produce accurate pronunciation, consonants are divided into the following four categories according to their characteristics and modeled:
Plosive: when speaking, the lips first close and the airflow is then released, producing a burst-like sound. The amplitude of the sound decreases to a minimum (representing the closed lips) and then increases sharply.
Fricative: when sounded, the tongue presses close to the hard palate to form a narrow channel; as the airflow passes through it, turbulence creates friction and produces the sound. Because the airflow is output steadily during a fricative, the amplitude variation is smaller than that of a plosive.
Affricate: this type of sound has the articulation characteristics of both plosives and fricatives. The main articulation is fricative-like: the tongue presses close to the hard palate and friction is produced as the airflow passes. The channel is tighter, however, so the airflow can burst out instantaneously, producing plosive-like characteristics.
Nasal: when speaking, the soft palate is lowered; the airflow from the trachea is then blocked from entering the oral cavity and diverts to the nasal cavity, so the nasal cavity and the oral cavity resonate.
Referring to fig. 2, in an embodiment of the present invention, when the user speaks, characteristic parameters of the plosives, fricatives and nasals in the user's voice information are acquired and compared with the corresponding preset models; when the amplitude of a plosive, fricative or nasal is below the preset range, enhancement processing is applied to that sound. In this way, the user's voice command can be recognized accurately even if the user's pronunciation is inaccurate. For example, after the characteristic parameters of the plosives, fricatives and nasals in the user's voice information are acquired, they are compared with the corresponding preset models; when the amplitude of a plosive, fricative or nasal is within the preset range, the analysis continues and the next characteristic parameter is compared and adjusted, until all parameters have been compared and adjusted.
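The amplitude check and enhancement loop can be sketched as follows. The segmentation into labeled consonant segments and the per-class amplitude thresholds are hypothetical inputs; no concrete values are disclosed above.

```python
import numpy as np

# Hypothetical preset minimum peak amplitudes per consonant class.
PRESET_MIN = {"plosive": 0.20, "fricative": 0.10, "nasal": 0.15}

def enhance_consonants(segments):
    """segments: list of (consonant_class, samples) pairs already cut from the
    utterance. Each segment is compared against its preset model; any segment
    whose peak amplitude falls below the preset range is boosted."""
    out = []
    for kind, samples in segments:
        peak = float(np.max(np.abs(samples)))
        low = PRESET_MIN[kind]
        if 0.0 < peak < low:
            samples = samples * (low / peak)  # enhancement: scale to threshold
        out.append(samples)                   # otherwise continue to next param
    return out
```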
In the embodiment of the invention, to further improve recognition accuracy, the recognition result can be displayed as images and text for the user to select or confirm and then output to an external device, and/or broadcast to the user by voice with the user's feedback received. For example, when the user's voice command is to turn on the air conditioner but the voice chip cannot recognize it with certainty, several candidate results (turn on the air conditioner, turn on the air-conditioning fan, turn on the fan) can be sent to the user-interaction module; the user confirms via the touch screen, and the command to turn on the air conditioner is executed after confirmation.
In a preferred embodiment of the present invention, the method further includes:
performing a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and forming and storing a new voice model according to the analysis result. For example, since the language ability of the elderly gradually declines, several periods can be preset; the change in the user's voice is judged from the different pronunciations of the same voice command collected within a period, and the voice model is updated to adapt accordingly.
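One plausible reading of this linear analysis is a per-dimension least-squares trend over feature vectors of the same command collected at the preset times, extrapolated into an updated template. This is an illustrative interpretation of the sketch below, not a stated algorithm of the embodiment.

```python
import numpy as np

def update_voice_model(times: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """times: (n,) collection times; feats: (n, d) feature vectors of the same
    spoken command at each time. Fits a line per feature dimension and returns
    the template predicted at the most recent time."""
    new_model = np.empty(feats.shape[1])
    for d in range(feats.shape[1]):
        slope, intercept = np.polyfit(times, feats[:, d], deg=1)
        new_model[d] = slope * times[-1] + intercept
    return new_model
```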
The present invention also provides a speech recognition apparatus for implementing the above method, and as shown in fig. 3, the speech recognition apparatus includes:
the voice acquisition module 10 is used for intercepting voice information sent by a user; in the embodiment of the present invention, the voice collecting module 10 may be an intercepting device such as a microphone in an intelligent terminal such as a mobile phone, a tablet computer, or a robot, and is configured to collect voice information sent by a user.
The first processing module 20 is used to denoise the voice information and recognize the user's voice command according to a pre-stored voice model. The first processing module 20 may be a voice processing chip which, when voice information is collected, denoises it to obtain the voice command issued by the user.
The background-sound monitoring module 30 is used to collect background sound from the user's surroundings. The background-sound monitoring module 30 may consist of listening devices such as microphones arranged at different positions, used to collect the background sounds emitted in the environment. After the voice command is obtained, the intelligent device (mobile phone, tablet, robot, etc.) can wake the background-sound monitoring module 30 through its chip to detect and receive the background sound in the environment.
The second processing module 40 is used to recognize the background sound according to a pre-stored background-sound model and determine the user's location from the recognition result. For example, the background sound is analyzed by the voice chip; whether the user is outdoors or indoors is judged from the sound volume, and it can further be judged from the volume or type of sound whether the user is in a bedroom, living room or kitchen.
The output module 50 is used to combine the voice command with the location information to form and output a final recognition result. When both the voice command and the location information are clear, a recognition result is produced and output. In the embodiment of the invention, when the user's pronunciation is incomplete or unclear, the user's true intention is judged with the help of the environment recognition result, thereby improving the accuracy of speech recognition.
In a preferred embodiment, the speech recognition apparatus further includes:
the model-building module 60, used to train and model the user's voice information and the background sound, forming and storing a voice model and a background-sound model. In the embodiment of the present invention, the model-building module 60 collects the voices of people with pronunciation difficulties or speech impairments for training and modeling, so that the user's pronunciation can be correctly recognized in application. In addition, the model-building module 60 collects and models indoor and outdoor background sounds to identify the user's environment; for example, background sounds of several bedroom environments can be collected over different time periods, trained, modeled and stored, and in actual use the background-sound model is retrieved for comparison to determine the user's environment.
Referring to fig. 4, in one embodiment, the first processing module 20 includes:
a denoising unit 21, configured to perform denoising processing on the received voice information to obtain voice information of a user;
a voice instruction obtaining unit 22, configured to compare the voice information of the user with a voice model to obtain a voice instruction of the user;
the second processing module 40 includes:
the position information obtaining unit 41 performs denoising processing on the acquired background sound, and determines the position of the user according to the denoised background sound to obtain the position information.
Preferably, the voice instruction obtaining unit 22 is configured to:
acquiring characteristic parameters of the plosives, fricatives and nasals in the user's voice information and comparing them with the corresponding preset models; and when the amplitude of a plosive, fricative or nasal is below the preset range, applying enhancement processing to that sound. In this way, the user's voice command can be recognized accurately even if the user's pronunciation is inaccurate.
In an embodiment of the present invention, the apparatus may further include:
the updating module 70, used to perform a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and to form and store a new voice model according to the analysis result. For example, since the language ability of the elderly gradually declines, several periods can be preset; the updating module 70 judges the change in the user's voice from the different pronunciations of the same voice command collected within a period and updates the voice model to adapt accordingly.
The present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the foregoing speech recognition method. The computer-readable storage medium can store a program implementing the speech recognition method, carried and loaded on a computer device; such a device may be an intelligent terminal such as a mobile phone, tablet computer or service robot.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.
Claims (10)
1. A method of speech recognition, the method comprising:
monitoring voice information uttered by a user;
denoising the voice information and recognizing the user's voice command according to a pre-stored voice model;
collecting background sound from the user's surroundings;
recognizing the background sound according to a pre-stored background-sound model and determining the user's location from the recognition result, wherein the background-sound model is trained and modeled based on the background sounds of a plurality of environments over different time periods;
and combining the voice command with the location information, and when the user's pronunciation is incomplete or unclear, using the location information to complete the voice command into a full sentence, forming and outputting a final recognition result.
2. The method of claim 1, wherein denoising the voice information and recognizing the user's voice command according to the pre-stored voice model comprises:
acquiring characteristic parameters of the plosives, fricatives and nasals in the user's voice information and comparing them with the corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below the preset range, applying enhancement processing to that sound.
3. The method of claim 1 or 2, further comprising:
performing a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and forming and storing a new voice model according to the analysis result.
4. The method according to claim 3, wherein recognizing the background sound according to the pre-stored background-sound model and determining the user's location from the recognition result comprises:
comparing the collected sound emitted by a preset sound source and the ambient background sound with the background-sound model respectively, and determining the user's location from the comparison result.
5. The method of claim 4, further comprising: displaying the recognition result as images and text for the user to select or confirm, and outputting the recognition result to an external device after the user's selection or confirmation; and/or broadcasting the recognition result to the user by voice and receiving the user's feedback.
6. A speech recognition apparatus, comprising:
a voice acquisition module, configured to monitor voice information uttered by a user;
a first processing module, configured to denoise the voice information and recognize the user's voice command according to a pre-stored voice model;
a background-sound monitoring module, configured to collect background sound from the user's surroundings;
a second processing module, configured to recognize the background sound according to a pre-stored background-sound model and determine the user's location from the recognition result, wherein the background-sound model is trained and modeled based on the background sounds of a plurality of environments over different time periods;
and an output module, configured to combine the voice command with the location information and, when the user's pronunciation is incomplete or unclear, use the location information to complete the voice command into a full sentence, forming and outputting a final recognition result.
7. The speech recognition device of claim 6, wherein the speech acquisition module is configured to:
acquiring characteristic parameters of the plosives, fricatives and nasals in the user's voice information and comparing them with the corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below the preset range, applying enhancement processing to that sound.
8. The speech recognition device according to claim 6 or 7, further comprising:
an updating module, configured to perform a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and to form and store a new voice model according to the analysis result.
9. The speech recognition device of claim 6, wherein the first processing module comprises:
a unit configured to compare the collected sound emitted by a preset sound source and the ambient background sound with the background-sound model respectively, and to determine the user's location from the comparison result.
10. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the speech recognition method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811644306.5A CN109545196B (en) | 2018-12-29 | 2018-12-29 | Speech recognition method, device and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811644306.5A CN109545196B (en) | 2018-12-29 | 2018-12-29 | Speech recognition method, device and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109545196A CN109545196A (en) | 2019-03-29 |
CN109545196B true CN109545196B (en) | 2022-11-29 |
Family
ID=65831549
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811644306.5A Active CN109545196B (en) | 2018-12-29 | 2018-12-29 | Speech recognition method, device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109545196B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109974225A (en) * | 2019-04-09 | 2019-07-05 | 珠海格力电器股份有限公司 | Air conditioner control method and device, storage medium and air conditioner |
CN110473547B (en) * | 2019-07-12 | 2021-07-30 | 云知声智能科技股份有限公司 | Speech recognition method |
CN110867184A (en) * | 2019-10-23 | 2020-03-06 | 张家港市祥隆五金厂 | Voice intelligent terminal equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102918591A (en) * | 2010-04-14 | 2013-02-06 | 谷歌公司 | Geotagged environmental audio for enhanced speech recognition accuracy |
CN108877773A (en) * | 2018-06-12 | 2018-11-23 | 广东小天才科技有限公司 | Voice recognition method and electronic equipment |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8762143B2 (en) * | 2007-05-29 | 2014-06-24 | At&T Intellectual Property Ii, L.P. | Method and apparatus for identifying acoustic background environments based on time and speed to enhance automatic speech recognition |
CN105580071B (en) * | 2013-05-06 | 2020-08-21 | 谷歌技术控股有限责任公司 | Method and apparatus for training a voice recognition model database |
CN104143342B (en) * | 2013-05-15 | 2016-08-17 | 腾讯科技(深圳)有限公司 | A kind of pure and impure sound decision method, device and speech synthesis system |
CN105448292B (en) * | 2014-08-19 | 2019-03-12 | 北京羽扇智信息科技有限公司 | A kind of time Speech Recognition System and method based on scene |
CN105913039B (en) * | 2016-04-26 | 2020-08-18 | 北京光年无限科技有限公司 | Interactive processing method and device for dialogue data based on vision and voice |
CN106941506A (en) * | 2017-05-17 | 2017-07-11 | 北京京东尚科信息技术有限公司 | Data processing method and device based on biological characteristic |
CN107742517A (en) * | 2017-10-10 | 2018-02-27 | 广东中星电子有限公司 | A kind of detection method and device to abnormal sound |
- 2018-12-29: CN application CN201811644306.5A filed; granted as CN109545196B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN109545196A (en) | 2019-03-29 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |