CN109545196B - Speech recognition method, device and computer readable storage medium - Google Patents
- Publication number: CN109545196B (application CN201811644306.5A)
- Authority: CN (China)
- Prior art keywords: voice, user, sound, model, background sound
- Prior art date: 2018-12-29
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/063 — Speech recognition: creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/22 — Speech recognition: procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26 — Speech recognition: speech to text systems
- G10L21/0208 — Speech enhancement: noise filtering
- G10L2015/0635 — Training: updating or merging of old and new templates; mean values; weighting
- G10L2015/223 — Execution procedure of a spoken command
Abstract
The invention discloses a speech recognition method comprising the following steps: monitoring voice information uttered by a user; denoising the voice information and recognizing the user's voice command according to a pre-stored voice model; collecting background sound from the user's surroundings; recognizing the background sound according to a pre-stored background-sound model and determining the user's location from the recognition result; and combining the voice command with the location information to form and output a final recognition result. The invention also discloses a speech recognition device and a computer-readable storage medium. The invention improves the speech recognition accuracy of intelligent terminal devices.
Description
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, and computer-readable storage medium.
Background
With the development of science and technology and the progress of computer technology, speech recognition is already applied in many fields of daily life and industry, and the prior art offers a variety of speech recognition methods and devices for human-computer interaction, contributing greatly to economic development. However, existing speech recognition technology can generally only recognize the pronunciation of unimpaired speakers; when a user's pronunciation is inaccurate or a language impairment exists, recognition is difficult or inaccurate. Take the elderly as an example: with increasing age, certain language disorders, such as aphasia, occur at a high rate in the elderly population. Aphasia patients may have impaired language expression when speaking, reading or writing, although their intelligence is not affected. Existing speech recognition technology struggles to recognize the speech of people with aphasia, or its recognition accuracy drops sharply, which makes related applications difficult; for example, when speech recognition is applied to a companion robot, the robot can hardly fulfil its companion role if it cannot recognize the user's speech.
In view of the above, it is necessary to provide a speech recognition technology to improve the accuracy of speech recognition and expand the application range of the speech recognition technology.
Disclosure of Invention
The main object of the present invention is to provide a speech recognition method that improves the accuracy of speech recognition and expands the application range of speech recognition technology.
In order to achieve the above object, the present invention provides a speech recognition method, including:
monitoring voice information uttered by a user;
denoising the voice information and recognizing the user's voice command according to a pre-stored voice model;
collecting background sound from the user's surroundings;
recognizing the background sound according to a pre-stored background-sound model, and determining the user's location from the recognition result;
and combining the voice command with the location information to form and output a final recognition result.
Preferably, the denoising of the voice information and recognition of the user's voice command according to a pre-stored voice model includes:
acquiring characteristic parameters of the plosives, fricatives and nasals in the user's voice information and comparing them with the corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below the preset range, applying enhancement processing to that sound.
Preferably, the method further comprises:
performing a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and forming and storing a new voice model according to the analysis result.
Preferably, recognizing the background sound according to a pre-stored background-sound model and determining the user's location from the recognition result includes:
comparing the collected sound emitted by a preset sound source and the ambient background sound with the background-sound model respectively, and determining the user's location from the comparison result.
Preferably, the method may further include: displaying the recognition result as images and text for the user to select or confirm, and outputting the recognition result to an external device after the user's selection or confirmation; and/or broadcasting the recognition result to the user by voice and receiving the user's feedback.
The present invention also provides a speech recognition apparatus, comprising:
a voice acquisition module, configured to monitor voice information uttered by a user;
a first processing module, configured to denoise the voice information and recognize the user's voice command according to a pre-stored voice model;
a background-sound monitoring module, configured to collect background sound from the user's surroundings;
a second processing module, configured to recognize the background sound according to a pre-stored background-sound model and determine the user's location from the recognition result;
and an output module, configured to combine the voice command with the location information to form and output a final recognition result.
Preferably, the voice acquisition module is configured to:
acquiring characteristic parameters of the plosives, fricatives and nasals in the user's voice information and comparing them with the corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below the preset range, applying enhancement processing to that sound.
Preferably, the above apparatus further comprises:
an updating module, configured to perform a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and to form and store a new voice model according to the analysis result.
The present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the foregoing speech recognition method.
According to the invention, the extraction of the user's voice command is combined with recognition of the background sound in the environment; when the user's pronunciation is incomplete or unclear, the user's true intention is inferred with the help of the environment recognition result, thereby improving speech recognition accuracy.
Drawings
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating the steps of comparing the voice information of the user with the voice model to obtain the voice command of the user in the voice recognition method according to the embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a first processing module and a second processing module of a speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the present invention provides a speech recognition method, including:
Step S10: monitor voice information uttered by the user. In the embodiment of the invention, a voice monitoring device can be arranged on an intelligent device such as a mobile phone, tablet or robot to collect the voice information uttered by the user.
Step S20: denoise the voice information and recognize the user's voice command according to a pre-stored voice model. When voice information is collected, it is denoised by a voice chip to obtain the voice command issued by the user.
Step S30: collect background sound from the user's surroundings. After the voice command is obtained, a second sound-monitoring device in the intelligent device (mobile phone, tablet, robot, etc.) is woken to detect and receive the background sound in the environment.
Step S40: recognize the background sound according to a pre-stored background-sound model, and determine the user's location from the recognition result. For example, the background sound is analyzed by the voice chip; whether the user is outdoors or indoors is judged from the sound volume, and it can further be judged from the volume or type of sound whether the user is in a bedroom, living room or kitchen.
Step S50: combine the voice command with the location information to form and output a final recognition result. When both the voice command and the location information are clear, a recognition result is produced and output. In the embodiment of the invention, when the user's pronunciation is incomplete or unclear, the user's true intention is judged with the help of the environment recognition result, thereby improving the accuracy of speech recognition.
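To make the flow of steps S10 to S50 concrete, the following is a minimal sketch of the pipeline in Python. The helper functions, the feature representation, and the nearest-template matching are illustrative assumptions of this sketch, not the disclosed implementation.

```python
import numpy as np

def denoise(x: np.ndarray) -> np.ndarray:
    # Placeholder denoiser: a short moving-average filter stands in for the
    # voice chip's denoising (no specific algorithm is specified above).
    kernel = np.ones(5) / 5.0
    return np.convolve(x, kernel, mode="same")

def match_model(feat: np.ndarray, models: dict) -> str:
    # Nearest-template matching: return the name of the stored model whose
    # feature vector is closest (Euclidean distance) to the observed one.
    return min(models, key=lambda name: float(np.linalg.norm(feat - models[name])))

def recognize(voice_feat, background_feat, voice_models, background_models) -> str:
    command = match_model(denoise(voice_feat), voice_models)              # S20
    location = match_model(denoise(background_feat), background_models)   # S40
    return f"{command} in the {location}"                                 # S50
```

With hypothetical templates such as voice_models = {"turn on the air conditioner": ...} and background_models = {"bedroom": ...}, the combined result becomes "turn on the air conditioner in the bedroom", matching scene one below.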
The following application scenario helps to further illustrate the speech recognition solution of the present invention:
Scene one: with the help of the robot, an elderly person turns on the bedroom air conditioner by saying "air conditioner" or "turn on the air conditioner". The specific process is as follows:
Step A: the user issues a voice command to the accompanying robot;
Step B: the first sound-receiving device of the accompanying robot receives the user's voice signal;
Step C: the microprocessor of the accompanying robot analyzes the signal to obtain a first recognition result: "turn on the air conditioner"; at the same time, the second sound-receiving device is woken and receives a background-sound signal from the surroundings;
Step D: the microprocessor of the accompanying robot analyzes the background sound to obtain a second recognition result: "bedroom";
Step E: the microprocessor of the accompanying robot performs a comprehensive analysis to obtain the final recognition result: "turn on the air conditioner in the bedroom";
Step F: the network device of the accompanying robot sends an operation command to the bedroom air conditioner according to the location information preset in the storage device, so that the air conditioner starts up and runs.
In the embodiment of the present invention, before performing the above steps, the method may further include: training and modeling the user's voice information and the background sound to form and store a voice model and a background-sound model. In the embodiment of the invention, the voices of people with pronunciation difficulties or speech impairments are collected for training and modeling, so that the user's pronunciation can be correctly recognized in application. In addition, indoor and outdoor background sounds are collected and modeled to identify the user's environment; for example, background sounds of several bedroom environments can be collected over different time periods, trained, modeled and stored, and in actual use the background-sound model is retrieved for comparison to determine the user's environment.
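As a concrete illustration of the background-sound modeling just described, the following sketch builds one averaged-MFCC template per environment from recordings captured at different times of day. The use of librosa and of mean-MFCC templates is an assumption of this sketch; the embodiment above does not name a feature type or toolkit.

```python
import numpy as np
import librosa

def build_background_model(recordings: dict) -> dict:
    """recordings maps an environment name (e.g. 'bedroom') to a list of WAV
    paths recorded at different time periods; returns one template per room."""
    model = {}
    for room, paths in recordings.items():
        feats = []
        for path in paths:
            y, sr = librosa.load(path, sr=16000)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames)
            feats.append(mfcc.mean(axis=1))                     # time-averaged
        model[room] = np.mean(feats, axis=0)                    # room template
    return model
```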
It can be understood that the foregoing step of denoising the voice information to recognize the user's voice command includes:
denoising the received voice information to obtain the user's voice information;
comparing the user's voice information with the voice model to obtain the user's voice command;
and the step of recognizing the background sound according to the pre-stored background-sound model and determining the user's location from the recognition result includes:
denoising the collected background sound, and determining the user's location from the denoised background sound to obtain the location information.
Considering that the background-sound models of some environments may be very similar or identical, sound sources that identify each environment can be placed in different environments in advance and captured in real time by the voice acquisition module; the voice chip then compares the collected sound emitted by the preset sound source, together with the ambient background sound, with the background-sound models, and determines the user's location from the comparison result. For example, a wind chime may mark an environment as the living room or the kitchen; when the user issues a voice command in that environment, the voice chip can recognize the location from the background sound produced by the environment's sound source.
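A minimal sketch of this location step follows, assuming the averaged-MFCC templates built above; the captured audio is expected to contain both the ambient background and the room's marker source (e.g. the wind chime), and nearest-template matching is an illustrative choice rather than a stated algorithm of the embodiment.

```python
import numpy as np
import librosa

def mfcc_vector(y: np.ndarray, sr: int) -> np.ndarray:
    # Time-averaged 13-dimensional MFCC vector, matching the template format.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

def locate_user(background: np.ndarray, sr: int, model: dict) -> str:
    feat = mfcc_vector(background, sr)
    # Choose the environment whose stored template is closest in feature space.
    distances = {room: float(np.linalg.norm(feat - tmpl))
                 for room, tmpl in model.items()}
    return min(distances, key=distances.get)
```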
Specifically, the solution of the present invention can be further understood through the following application scenario:
Scene two: an elderly person turns on the lamp in his or her current environment by saying "turn on the light". The specific process is as follows:
Step A1: the user issues a voice command to the accompanying robot: "turn on the light";
Step B1: the first sound-receiving device of the accompanying robot receives the user's voice signal;
Step C1: the microprocessor of the accompanying robot retrieves the voice model and analyzes the signal to obtain a first recognition result: "turn on the light"; at the same time, the second sound-receiving device is woken and receives background-sound signals from the surroundings;
Step D1: since the user is located between two environments (e.g. the living room and the kitchen), the microprocessor of the accompanying robot captures the sounds emitted by the sound sources of the living room and the kitchen and, by analyzing their differences, obtains a second recognition result: "living room";
Step E1: the microprocessor of the accompanying robot performs a comprehensive analysis to obtain the final recognition result: "turn on the living-room lamp";
Step F1: the network device of the accompanying robot sends a command to the living-room lamp switch according to the location information preset in the storage device, so that the switch executes the lamp-on command.
The extraction and selection of acoustic features is a key step in speech recognition. Acoustic-feature extraction is a process that greatly compresses the information in the signal, and also a signal deconvolution process, with the aim of making the patterns easier for the classifier to separate.
Due to the time-varying nature of speech signals, feature extraction must be performed on short segments of the signal, i.e. short-time analysis. Each segment is treated as a stationary analysis interval, commonly referred to as a frame, and the frame-to-frame offset typically takes 1/2 or 1/3 of the frame length. The signal is usually pre-emphasized to boost the high frequencies and windowed to avoid edge effects in short-time speech segments.
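The pre-processing above can be sketched as follows; the pre-emphasis coefficient 0.97 and the 25 ms frame with a 1/2-frame offset are common defaults assumed here, not values specified in this description.

```python
import numpy as np

def preprocess(y: np.ndarray, sr: int = 16000,
               frame_ms: float = 25.0, alpha: float = 0.97) -> np.ndarray:
    """Pre-emphasize, frame with a 1/2-frame hop, and apply a Hamming window."""
    y = np.append(y[0], y[1:] - alpha * y[:-1])  # pre-emphasis boosts highs
    frame_len = int(sr * frame_ms / 1000)
    hop = frame_len // 2                         # offset = 1/2 frame length
    assert len(y) >= frame_len, "signal must contain at least one frame"
    n_frames = (len(y) - frame_len) // hop + 1
    window = np.hamming(frame_len)               # tapers the frame edges
    return np.stack([y[i * hop: i * hop + frame_len] * window
                     for i in range(n_frames)])  # (n_frames, frame_len)
```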
Some of the acoustic features that are commonly used:
(1) Linear Predictive Coefficients (LPC): linear predictive analysis starts from the human phonation mechanism; through study of a short-tube cascade model of the vocal tract, the system's transfer function is taken to have the form of an all-pole digital filter, so the signal at time n can be estimated from a linear combination of the signals at the preceding few time steps.
(2) Cepstral coefficients: using homomorphic processing, the Discrete Fourier Transform (DFT) of the speech signal is computed, the logarithm is taken, and the inverse transform (IDFT) is applied to obtain the cepstral coefficients.
(3) Mel-Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP): unlike LPC, which is derived from the study of the human phonation mechanism, MFCC and PLP are acoustic features motivated by research on the human auditory system. (A sketch of these features follows this list.)
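Here is a minimal per-frame sketch of features (1) to (3): LPC via the standard Levinson-Durbin recursion on the autocorrelation sequence, the real cepstrum via the DFT, log, inverse-DFT recipe of item (2), and MFCC via librosa. The model order 12, the 13 MFCC coefficients, and the use of librosa are assumptions of this sketch.

```python
import numpy as np
import librosa

def lpc(frame: np.ndarray, order: int = 12) -> np.ndarray:
    """Item (1): all-pole (LPC) coefficients via the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0] + 1e-10
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a[1:i + 1] += k * a[i - 1::-1][:i]                 # update a_1 .. a_i
        err *= (1.0 - k * k)
    return a

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Item (2): DFT -> log magnitude -> inverse DFT."""
    spectrum = np.fft.fft(frame)
    return np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-10)))

def mfcc(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Item (3): 13 MFCCs per frame, computed with librosa."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
```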
Chinese acoustic features: taking Mandarin pronunciation as an example, the pronunciation of a character can be divided into two parts, the initial and the final. During pronunciation, the transition from initial to final is a gradual change rather than an instantaneous one; for this reason, Right-Context-Dependent Initial/Final (RCDIF) modeling is used as the analysis method, so that the correct syllable can be identified more accurately.
Considering that the elderly and people with pronunciation difficulties find it hard to produce accurate pronunciation, consonants are divided into the following four categories according to their characteristics and modeled:
Plosive: when speaking, the lips first close and the airflow is then released, producing a burst-like sound. The amplitude of the sound decreases to a minimum (representing the closed lips) and then increases sharply.
Fricative: when sounded, the tongue presses close to the hard palate to form a narrow channel; as the airflow passes through it, turbulence creates friction and produces the sound. Because the airflow is output steadily during a fricative, the amplitude variation is smaller than that of a plosive.
Affricate: this type of sound has the articulation characteristics of both plosives and fricatives. The main articulation is fricative-like: the tongue presses close to the hard palate and friction is produced as the airflow passes. The channel is tighter, however, so the airflow can burst out instantaneously, producing plosive-like characteristics.
Nasal: when speaking, the soft palate is lowered; the airflow from the trachea is then blocked from entering the oral cavity and diverts to the nasal cavity, so the nasal cavity and the oral cavity resonate.
Referring to fig. 2, in an embodiment of the present invention, when the user speaks, characteristic parameters of the plosives, fricatives and nasals in the user's voice information are acquired and compared with the corresponding preset models; when the amplitude of a plosive, fricative or nasal is below the preset range, enhancement processing is applied to that sound. In this way, the user's voice command can be recognized accurately even if the user's pronunciation is inaccurate. For example, after the characteristic parameters of the plosives, fricatives and nasals in the user's voice information are acquired, they are compared with the corresponding preset models; when the amplitude of a plosive, fricative or nasal is within the preset range, the analysis continues and the next characteristic parameter is compared and adjusted, until all parameters have been compared and adjusted.
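The amplitude check and enhancement loop can be sketched as follows. The segmentation into labeled consonant segments and the per-class amplitude thresholds are hypothetical inputs; no concrete values are disclosed above.

```python
import numpy as np

# Hypothetical preset minimum peak amplitudes per consonant class.
PRESET_MIN = {"plosive": 0.20, "fricative": 0.10, "nasal": 0.15}

def enhance_consonants(segments):
    """segments: list of (consonant_class, samples) pairs already cut from the
    utterance. Each segment is compared against its preset model; any segment
    whose peak amplitude falls below the preset range is boosted."""
    out = []
    for kind, samples in segments:
        peak = float(np.max(np.abs(samples)))
        low = PRESET_MIN[kind]
        if 0.0 < peak < low:
            samples = samples * (low / peak)  # enhancement: scale to threshold
        out.append(samples)                   # otherwise continue to next param
    return out
```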
In the embodiment of the invention, to further improve recognition accuracy, the recognition result can be displayed as images and text for the user to select or confirm and then output to an external device, and/or broadcast to the user by voice with the user's feedback received. For example, when the user's voice command is to turn on the air conditioner but the voice chip cannot recognize it with certainty, several candidate results (turn on the air conditioner, turn on the air-conditioning fan, turn on the fan) can be sent to the user-interaction module; the user confirms via the touch screen, and the command to turn on the air conditioner is executed after confirmation.
In a preferred embodiment of the present invention, the method further includes:
performing a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and forming and storing a new voice model according to the analysis result. For example, since the language ability of the elderly gradually declines, several periods can be preset; the change in the user's voice is judged from the different pronunciations of the same voice command collected within a period, and the voice model is updated to adapt accordingly.
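One plausible reading of this linear analysis is a per-dimension least-squares trend over feature vectors of the same command collected at the preset times, extrapolated into an updated template. This is an illustrative interpretation of the sketch below, not a stated algorithm of the embodiment.

```python
import numpy as np

def update_voice_model(times: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """times: (n,) collection times; feats: (n, d) feature vectors of the same
    spoken command at each time. Fits a line per feature dimension and returns
    the template predicted at the most recent time."""
    new_model = np.empty(feats.shape[1])
    for d in range(feats.shape[1]):
        slope, intercept = np.polyfit(times, feats[:, d], deg=1)
        new_model[d] = slope * times[-1] + intercept
    return new_model
```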
The present invention also provides a speech recognition apparatus for implementing the above method, and as shown in fig. 3, the speech recognition apparatus includes:
the voice acquisition module 10 is used for intercepting voice information sent by a user; in the embodiment of the present invention, the voice collecting module 10 may be an intercepting device such as a microphone in an intelligent terminal such as a mobile phone, a tablet computer, or a robot, and is configured to collect voice information sent by a user.
The first processing module 20 is used to denoise the voice information and recognize the user's voice command according to a pre-stored voice model. The first processing module 20 may be a voice processing chip which, when voice information is collected, denoises it to obtain the voice command issued by the user.
The background-sound monitoring module 30 is used to collect background sound from the user's surroundings. The background-sound monitoring module 30 may consist of listening devices such as microphones arranged at different positions, used to collect the background sounds emitted in the environment. After the voice command is obtained, the intelligent device (mobile phone, tablet, robot, etc.) can wake the background-sound monitoring module 30 through its chip to detect and receive the background sound in the environment.
The second processing module 40 is used to recognize the background sound according to a pre-stored background-sound model and determine the user's location from the recognition result. For example, the background sound is analyzed by the voice chip; whether the user is outdoors or indoors is judged from the sound volume, and it can further be judged from the volume or type of sound whether the user is in a bedroom, living room or kitchen.
The output module 50 is used to combine the voice command with the location information to form and output a final recognition result. When both the voice command and the location information are clear, a recognition result is produced and output. In the embodiment of the invention, when the user's pronunciation is incomplete or unclear, the user's true intention is judged with the help of the environment recognition result, thereby improving the accuracy of speech recognition.
In a preferred embodiment, the speech recognition apparatus further includes:
the model-building module 60, used to train and model the user's voice information and the background sound, forming and storing a voice model and a background-sound model. In the embodiment of the present invention, the model-building module 60 collects the voices of people with pronunciation difficulties or speech impairments for training and modeling, so that the user's pronunciation can be correctly recognized in application. In addition, the model-building module 60 collects and models indoor and outdoor background sounds to identify the user's environment; for example, background sounds of several bedroom environments can be collected over different time periods, trained, modeled and stored, and in actual use the background-sound model is retrieved for comparison to determine the user's environment.
Referring to fig. 4, in one embodiment, the first processing module 20 includes:
a denoising unit 21, configured to perform denoising processing on the received voice information to obtain voice information of a user;
a voice instruction obtaining unit 22, configured to compare the voice information of the user with a voice model to obtain a voice instruction of the user;
the second processing module 40 includes:
the position information obtaining unit 41 performs denoising processing on the acquired background sound, and determines the position of the user according to the denoised background sound to obtain the position information.
Preferably, the voice instruction obtaining unit 22 is configured to:
acquiring characteristic parameters of the plosives, fricatives and nasals in the user's voice information and comparing them with the corresponding preset models; and when the amplitude of a plosive, fricative or nasal is below the preset range, applying enhancement processing to that sound. In this way, the user's voice command can be recognized accurately even if the user's pronunciation is inaccurate.
In an embodiment of the present invention, the apparatus may further include:
the updating module 70, used to perform a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and to form and store a new voice model according to the analysis result. For example, since the language ability of the elderly gradually declines, several periods can be preset; the updating module 70 judges the change in the user's voice from the different pronunciations of the same voice command collected within a period and updates the voice model to adapt accordingly.
The present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the foregoing speech recognition method. The computer-readable storage medium can store a program implementing the speech recognition method, carried and loaded on a computer device; such a device may be an intelligent terminal such as a mobile phone, tablet computer or service robot.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.
Claims (10)
1. A method of speech recognition, the method comprising:
monitoring voice information uttered by a user;
denoising the voice information and recognizing the user's voice command according to a pre-stored voice model;
collecting background sound from the user's surroundings;
recognizing the background sound according to a pre-stored background-sound model and determining the user's location from the recognition result, wherein the background-sound model is trained and modeled based on the background sounds of a plurality of environments over different time periods;
and combining the voice command with the location information, and when the user's pronunciation is incomplete or unclear, using the location information to complete the voice command into a full sentence, forming and outputting a final recognition result.
2. The method of claim 1, wherein denoising the voice information and recognizing the user's voice command according to the pre-stored voice model comprises:
acquiring characteristic parameters of the plosives, fricatives and nasals in the user's voice information and comparing them with the corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below the preset range, applying enhancement processing to that sound.
3. The method of claim 1 or 2, further comprising:
performing a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and forming and storing a new voice model according to the analysis result.
4. The method according to claim 3, wherein recognizing the background sound according to the pre-stored background-sound model and determining the user's location from the recognition result comprises:
comparing the collected sound emitted by a preset sound source and the ambient background sound with the background-sound model respectively, and determining the user's location from the comparison result.
5. The method of claim 4, further comprising: displaying the recognition result as images and text for the user to select or confirm, and outputting the recognition result to an external device after the user's selection or confirmation; and/or broadcasting the recognition result to the user by voice and receiving the user's feedback.
6. A speech recognition apparatus, comprising:
a voice acquisition module, configured to monitor voice information uttered by a user;
a first processing module, configured to denoise the voice information and recognize the user's voice command according to a pre-stored voice model;
a background-sound monitoring module, configured to collect background sound from the user's surroundings;
a second processing module, configured to recognize the background sound according to a pre-stored background-sound model and determine the user's location from the recognition result, wherein the background-sound model is trained and modeled based on the background sounds of a plurality of environments over different time periods;
and an output module, configured to combine the voice command with the location information and, when the user's pronunciation is incomplete or unclear, use the location information to complete the voice command into a full sentence, forming and outputting a final recognition result.
7. The speech recognition device of claim 6, wherein the speech acquisition module is configured to:
acquiring characteristic parameters of the plosives, fricatives and nasals in the user's voice information and comparing them with the corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below the preset range, applying enhancement processing to that sound.
8. The speech recognition device according to claim 6 or 7, further comprising:
an updating module, configured to perform a linear analysis of changes in the user's voice based on voice information collected at a plurality of preset times, and to form and store a new voice model according to the analysis result.
9. The speech recognition device of claim 6, wherein the first processing module comprises:
a unit configured to compare the collected sound emitted by a preset sound source and the ambient background sound with the background-sound model respectively, and to determine the user's location from the comparison result.
10. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the speech recognition method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811644306.5A CN109545196B (en) | 2018-12-29 | 2018-12-29 | Speech recognition method, device and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811644306.5A CN109545196B (en) | 2018-12-29 | 2018-12-29 | Speech recognition method, device and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109545196A CN109545196A (en) | 2019-03-29 |
CN109545196B true CN109545196B (en) | 2022-11-29 |
Family
ID=65831549
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811644306.5A Active CN109545196B (en) | 2018-12-29 | 2018-12-29 | Speech recognition method, device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109545196B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109974225A (en) * | 2019-04-09 | 2019-07-05 | 珠海格力电器股份有限公司 | Air conditioner control method and device, storage medium and air conditioner |
CN110473547B (en) * | 2019-07-12 | 2021-07-30 | 云知声智能科技股份有限公司 | Speech recognition method |
CN110867184A (en) * | 2019-10-23 | 2020-03-06 | 张家港市祥隆五金厂 | Voice intelligent terminal equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102918591A (en) * | 2010-04-14 | 2013-02-06 | 谷歌公司 | Geotagged environmental audio for enhanced speech recognition accuracy |
CN108877773A (en) * | 2018-06-12 | 2018-11-23 | 广东小天才科技有限公司 | Voice recognition method and electronic equipment |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8762143B2 (en) * | 2007-05-29 | 2014-06-24 | At&T Intellectual Property Ii, L.P. | Method and apparatus for identifying acoustic background environments based on time and speed to enhance automatic speech recognition |
CN105580071B (en) * | 2013-05-06 | 2020-08-21 | 谷歌技术控股有限责任公司 | Method and apparatus for training a voice recognition model database |
CN104143342B (en) * | 2013-05-15 | 2016-08-17 | 腾讯科技(深圳)有限公司 | A kind of pure and impure sound decision method, device and speech synthesis system |
CN105448292B (en) * | 2014-08-19 | 2019-03-12 | 北京羽扇智信息科技有限公司 | A kind of time Speech Recognition System and method based on scene |
CN105913039B (en) * | 2016-04-26 | 2020-08-18 | 北京光年无限科技有限公司 | Interactive processing method and device for dialogue data based on vision and voice |
CN106941506A (en) * | 2017-05-17 | 2017-07-11 | 北京京东尚科信息技术有限公司 | Data processing method and device based on biological characteristic |
CN107742517A (en) * | 2017-10-10 | 2018-02-27 | 广东中星电子有限公司 | A kind of detection method and device to abnormal sound |
- 2018-12-29: CN application CN201811644306.5A filed; granted as CN109545196B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN109545196A (en) | 2019-03-29 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |