CN111599345A - Speech recognition algorithm evaluation method, system, mobile terminal and storage medium - Google Patents
- Publication number
- CN111599345A (application CN202010257506.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/01—Assessment or evaluation of speech recognition systems
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Abstract
The invention provides a speech recognition algorithm evaluation method, system, mobile terminal and storage medium, wherein the method comprises the following steps: acquiring voice sample data, and training a time domain classifier on the time domain features of the background voice samples; calculating the volume ratio between each background voice sample and its effective voice sample, and storing the volume ratio together with the corresponding voice sample information to obtain a volume ratio database; training a frequency domain classifier on the frequency domain features of the background voice samples; controlling the voice recognition algorithm to be tested to recognize each piece of voice sample information to obtain a voice recognition result, and obtaining the failed recognition samples in the voice recognition result according to the voice sample characters; and evaluating and classifying the failed samples according to the volume ratio database, the time domain classifier and the frequency domain classifier to obtain an algorithm evaluation result. The method can evaluate a voice recognition algorithm from the three scene angles of time domain, volume ratio and frequency domain, and thereby assess the recognition effect of the voice recognition algorithm in different application scenes.
Description
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a voice recognition algorithm evaluation method, system, mobile terminal and storage medium.
Background
Speech recognition has been studied for decades. Speech recognition technology mainly comprises four parts — acoustic model modeling, language model modeling, pronunciation dictionary construction and decoding — and each part can be an independent research direction. Compared with images and text, speech data are considerably harder to acquire and label, so building a complete speech recognition system is time-consuming and difficult work, which has greatly hindered the development of speech recognition technology. With the progress of artificial intelligence, especially deep learning, end-to-end speech recognition algorithms have been proposed. Compared with traditional speech recognition methods, end-to-end speech recognition simplifies the recognition pipeline and hands a large amount of the work to deep neural networks for learning and inference, so it has attracted wide attention in recent years.
The existing voice recognition process relies on a voice recognition algorithm to achieve recognition, so performance evaluation of that algorithm is particularly important for guaranteeing recognition accuracy. However, the existing evaluation process judges an algorithm only by its overall recognition rate and cannot reflect its recognition effect in different application scenes; as a result, algorithm selection for a given application scene is prone to misjudgment, which reduces the accuracy of speech recognition algorithm evaluation.
Disclosure of Invention
The embodiments of the invention aim to provide a speech recognition algorithm evaluation method, system, mobile terminal and storage medium, so as to solve the problem that the existing speech recognition algorithm evaluation process cannot reflect the recognition effect of the algorithm in different application scenes, resulting in low evaluation accuracy.
The embodiment of the invention is realized in such a way that a speech recognition algorithm evaluation method comprises the following steps:
acquiring voice sample data, wherein the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
acquiring time domain characteristics of the background voice sample, and training a time domain classifier according to the time domain characteristics;
calculating a volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio and the corresponding voice sample information to obtain a volume ratio database;
acquiring the frequency domain characteristics of the background voice sample, and training a frequency domain classifier according to the frequency domain characteristics;
controlling a voice recognition algorithm to be tested to recognize each piece of voice sample information to obtain a voice recognition result, and obtaining the failed recognition samples in the voice recognition result according to the voice sample characters;
and evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier to obtain an algorithm evaluation result.
Further, the step of obtaining the time domain feature of the background speech sample and training the time domain classifier according to the time domain feature includes:
acquiring short-time energy characteristics, a short-time average zero-crossing rate, a short-time average amplitude difference and a zero-crossing rate ratio in the background voice sample;
intercepting the background voice sample by a short-time window signal according to a preset sampling point to obtain a sampling signal, and calculating an autocorrelation value of the sampling signal according to a preset autocorrelation function;
and constructing the time domain classifier, and training the time domain classifier according to the short-time energy characteristic, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value.
Further, the step of obtaining the frequency domain characteristics of the background speech sample and training the frequency domain classifier according to the frequency domain characteristics includes:
acquiring a spectrum centroid, sub-band energy, bandwidth, fundamental tone frequency, wavelet entropy and spectrum flow in the background voice sample;
and constructing the frequency domain classifier, and training the frequency domain classifier according to the spectrum centroid, the sub-band energy, the bandwidth, the fundamental tone frequency, the wavelet entropy and the spectrum flow.
Further, the step of performing an evaluation classification on the failed samples according to the volume ratio database, the time domain classifier and the frequency domain classifier comprises:
controlling the volume ratio database to classify the volume of the background failure samples in the failure samples, and calculating the failure number corresponding to each volume segment range according to the volume classification result;
controlling the time domain classifier to perform time domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each time domain scene according to the time domain classification result;
and controlling the frequency domain classifier to perform frequency domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each frequency domain scene according to the frequency domain classification result.
Further, the step of performing an evaluation classification on the failed sample according to the volume ratio database, the time domain classifier and the frequency domain classifier further includes:
when the failure number corresponding to any one of the volume segment ranges is judged to be larger than a first preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background volume ratio falls within that volume segment range;
when the failure number corresponding to any time domain scene is judged to be larger than a second preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the time domain scene;
and when the failure number corresponding to any one frequency domain scene is judged to be larger than a third preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the frequency domain scene.
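The pass/fail rule above can be sketched as follows. This is an illustrative sketch only: the scene labels, function names and the preset number are assumptions, since the patent does not fix an implementation.

```python
from collections import Counter

def evaluate_scenes(scene_labels, preset_number):
    """Count failed samples per scene and mark a scene 'unqualified'
    when its failure count exceeds the preset number."""
    counts = Counter(scene_labels)
    return {scene: ("unqualified" if n > preset_number else "qualified")
            for scene, n in counts.items()}

# Scene labels assigned by a classifier to each failed sample,
# judged against a preset number of 2.
labels = ["street", "street", "street", "office"]
print(evaluate_scenes(labels, preset_number=2))
# {'street': 'unqualified', 'office': 'qualified'}
```

The same rule applies unchanged to volume segment ranges, time domain scenes and frequency domain scenes, each with its own preset number.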
Further, after the step of constructing the frequency domain classifier, the method further comprises:
acquiring a Mel cepstrum coefficient, a linear prediction cepstrum coefficient, a linear spectrum pair coefficient, a spectrum crest factor and spectrum flatness in the background voice sample;
training the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the pitch frequency, the wavelet entropy, the spectral flux, the Mel cepstral coefficient, the linear prediction cepstral coefficient, the linear spectral pair coefficient, the spectral crest factor, and the spectral flatness.
It is another object of an embodiment of the present invention to provide a speech recognition algorithm evaluation system, which includes:
the system comprises a sample acquisition module, a voice processing module and a voice processing module, wherein the sample acquisition module is used for acquiring voice sample data, the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
the time domain classifier training module is used for acquiring the time domain characteristics of the background voice sample and training a time domain classifier according to the time domain characteristics;
the volume ratio calculation module is used for calculating the volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio and the corresponding voice sample information to obtain a volume ratio database;
the frequency domain classifier training module is used for acquiring the frequency domain characteristics of the background voice sample and training a frequency domain classifier according to the frequency domain characteristics;
the voice recognition module is used for controlling a voice recognition algorithm to be tested to recognize each piece of voice sample information to obtain a voice recognition result, and obtaining the failed recognition samples in the voice recognition result according to the voice sample characters;
and the algorithm evaluation module is used for evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier so as to obtain an algorithm evaluation result.
Further, the time domain classifier training module is further configured to:
acquiring short-time energy characteristics, a short-time average zero-crossing rate, a short-time average amplitude difference and a zero-crossing rate ratio in the background voice sample;
intercepting the background voice sample by a short-time window signal according to a preset sampling point to obtain a sampling signal, and calculating an autocorrelation value of the sampling signal according to a preset autocorrelation function;
and constructing the time domain classifier, and training the time domain classifier according to the short-time energy characteristic, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value.
Another object of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above-mentioned speech recognition algorithm evaluation method.
Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the speech recognition algorithm evaluation method.
The embodiment of the invention can effectively evaluate the voice recognition algorithm from three background angles of time domain, volume ratio and frequency domain so as to evaluate the recognition effect of the voice recognition algorithm in different application scenes, effectively improve the accuracy of the selection of the voice recognition algorithm in different application scenes and improve the accuracy of the subsequent voice recognition.
Drawings
FIG. 1 is a flow chart of a speech recognition algorithm evaluation method provided by a first embodiment of the present invention;
FIG. 2 is a flow chart of a speech recognition algorithm evaluation method provided by a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a speech recognition algorithm evaluation system according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a mobile terminal according to a fourth embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when," "upon," "in response to determining," or "in response to detecting." Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining," "in response to determining," "upon detecting [the described condition or event]," or "in response to detecting [the described condition or event]."
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
Example one
Referring to fig. 1, a flowchart of a speech recognition algorithm evaluation method according to a first embodiment of the present invention is shown, which includes the steps of:
step S10, acquiring voice sample data;
the voice sample data comprises at least one piece of voice sample information, the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters, and preferably, the background voice samples stored in different pieces of voice sample information are different;
step S20, acquiring the time domain characteristics of the background voice sample, and training a time domain classifier according to the time domain characteristics;
the time domain features comprise the short-time energy feature, short-time average zero-crossing rate, short-time average amplitude difference and zero-crossing rate ratio. Because these time domain features differ between background voice samples, the time domain classifier is trained on them so that the trained classifier can perform time domain classification analysis on an input voice signal. This enables the subsequent time domain analysis of the voice recognition algorithm to be tested, i.e. analyzing its recognition accuracy in different time domain scenes;
step S30, calculating the volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio and the corresponding voice sample information to obtain a volume ratio database;
the method comprises the steps that the volume ratio between a background voice sample and an effective voice sample is calculated, so that the volume of the background voice sample and the volume of the effective voice sample are digitized, the ratio between background sound and effective sound in each voice sample information is analyzed, the subsequent voice recognition algorithm to be tested can be subjected to volume analysis, the accuracy of voice recognition of the voice recognition algorithm to be tested in scenes with different volume ratios is analyzed, and the algorithm evaluation effect is effectively achieved;
step S40, acquiring the frequency domain characteristics of the background voice sample, and training a frequency domain classifier according to the frequency domain characteristics;
the frequency domain features comprise the spectrum centroid, sub-band energy, bandwidth, fundamental tone frequency, wavelet entropy and spectrum flow. Because these frequency domain features differ between background voice samples, the frequency domain classifier is trained on them so that the trained classifier can perform frequency domain classification analysis on an input voice signal. This enables the subsequent frequency domain analysis of the voice recognition algorithm to be tested, i.e. analyzing its recognition accuracy in different frequency domain scenes, thereby effectively achieving the algorithm evaluation effect;
step S50, controlling a voice recognition algorithm to be tested to recognize each piece of voice sample information to obtain a voice recognition result, and obtaining the failed recognition samples in the voice recognition result according to the voice sample characters;
when the voice recognition characters stored in the voice recognition result differ from the corresponding voice sample characters, it is judged that the voice recognition algorithm to be tested has failed to recognize the sample corresponding to those characters, thereby obtaining the failed samples;
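A minimal sketch of this failure-sample selection is given below. The recogniser here is a stand-in lambda; in the patent the algorithm under test is an external black box, and the dictionary keys are illustrative assumptions.

```python
def find_failed_samples(voice_samples, recognize):
    """Run the algorithm under test on every piece of voice sample
    information and keep the samples whose recognized text differs
    from the reference voice sample characters."""
    return [sample for sample in voice_samples
            if recognize(sample["audio"]) != sample["characters"]]

# Stand-in recogniser that simply echoes its input; the second sample
# fails because the reference characters do not match.
samples = [{"audio": "hello", "characters": "hello"},
           {"audio": "world", "characters": "word"}]
failed = find_failed_samples(samples, recognize=lambda audio: audio)
print([s["characters"] for s in failed])  # ['word']
```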
step S60, evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier to obtain an algorithm evaluation result;
the volume ratio database is used for analyzing the volume ratio between background sound and effective sound of the failed sample, the time domain classifier is used for analyzing a time domain scene of the background of the failed sample, and the frequency domain classifier is used for analyzing a frequency domain scene of the background of the failed sample;
specifically, in the step, the recognition effect of the speech recognition algorithm to be tested in different volume ratio scenes, time domain scenes and frequency domain scenes is judged through the classification evaluation result of the failed samples based on the volume ratio database, the time domain classifier and the frequency domain classifier;
the method and the device can effectively evaluate the voice recognition algorithm from three background angles of time domain, volume ratio and frequency domain so as to evaluate the recognition effect of the voice recognition algorithm in different application scenes, effectively improve the accuracy of the selection of the voice recognition algorithm in different application scenes and improve the accuracy of subsequent voice recognition.
Example two
Referring to fig. 2, a flowchart of a speech recognition algorithm evaluation method according to a second embodiment of the present invention is shown, which includes the steps of:
step S11, acquiring voice sample data;
the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
step S21, acquiring short-time energy characteristics, a short-time average zero crossing rate, a short-time average amplitude difference and a zero crossing rate ratio in the background voice sample;
the energy of a voice signal changes obviously over time: the energy of the unvoiced part is much smaller than that of the voiced part, and the choice of window function plays a decisive role in how the short-time energy characterizes the signal. The short-time energy feature is mainly applied in the following ways: first, it can distinguish unvoiced sound from voiced sound, because the energy of voiced sound is much larger than that of unvoiced sound; second, it can be used to delimit voiced and unvoiced segments, to separate initials from finals, and to segment connected speech. Therefore, in this step, the time domain features of the background voice sample are obtained by extracting the short-time energy feature from it;
preferably, the short-time average zero-crossing rate is the number of times the signal passes through zero. For a continuous speech signal, a zero crossing occurs whenever the time domain waveform crosses the time axis; for a discrete signal, the short-time average zero-crossing rate is essentially the number of sign changes between adjacent sampling points, and it can be used in the analysis of the speech signal.
Furthermore, the short-time average amplitude difference is a representation of the energy of a frame of audio signal. Since the average amplitude function involves no squaring, its dynamic range is smaller than that of the short-time energy, roughly the square root of the dynamic range of the standard energy calculation. The influence of the window length N on the average amplitude function is entirely consistent with the analysis conclusions for short-time energy, and voiced sound affects the amplitude difference in the same way as it affects the short-time energy: the amplitude difference of voiced sound is much larger than that of unvoiced sound. The short-time average amplitude function can therefore be used to distinguish unvoiced sound within the background speech sample;
furthermore, if the zero-crossing rate of a certain frame is zero, the frame is considered a zero-rate frame, and the ratio of the number of zero-rate frames in the background speech sample to the total number of frames in the whole segment is the zero-crossing rate ratio.
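The four time domain features described above can be computed, for illustration, roughly as follows. The frame length, hop size and example signal are arbitrary assumptions; the patent does not prescribe these formulas in code form.

```python
def frame_signal(x, frame_len, hop):
    """Split a signal into fixed-length frames."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Sum of squared samples; much larger for voiced than unvoiced frames."""
    return sum(s * s for s in frame)

def short_time_avg_amplitude(frame):
    """Average magnitude; no squaring, hence a smaller dynamic range."""
    return sum(abs(s) for s in frame) / len(frame)

def zero_crossings(frame):
    """Number of sign changes between adjacent sampling points."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)

def zero_rate_ratio(frames):
    """Fraction of frames whose zero-crossing rate is zero."""
    return sum(1 for f in frames if zero_crossings(f) == 0) / len(frames)

# An alternating frame has many zero crossings; a constant frame has none.
frames = frame_signal([1, -1, 1, -1, 1, 1, 1, 1], frame_len=4, hop=4)
print(zero_crossings(frames[0]), zero_rate_ratio(frames))  # 3 0.5
```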
Step S31, intercepting the background voice sample by a short time window signal according to a preset sampling point to obtain a sampling signal, and calculating an autocorrelation value of the sampling signal according to a preset autocorrelation function;
the number and coordinates of the preset sampling points can be selected as required; in this step, the time domain features of the background voice sample are further obtained by calculating the autocorrelation value;
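A minimal autocorrelation sketch is shown below (the window and lags are illustrative; the patent's preset autocorrelation function is not specified):

```python
def autocorrelation(x, lag):
    """Short-time autocorrelation of a windowed signal at one lag;
    periodic (voiced) segments peak at multiples of the pitch period."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

# A signal with period 4 correlates far more strongly at lag 4 than lag 2.
x = [0.0, 1.0, 0.0, -1.0] * 4
print(autocorrelation(x, 4), autocorrelation(x, 2))  # 6.0 -7.0
```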
step S41, constructing the time domain classifier, and training the time domain classifier according to the short-time energy feature, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value;
the trained time domain classifier can judge the time domain classification of the background voice information in the input sample information by training the time domain classifier according to the short-time energy characteristic, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value so as to analyze the time domain scene of the background voice in the sample information input into the time domain classifier;
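The patent does not specify a classifier model. As one illustrative assumption, a nearest-centroid classifier over the time domain feature vectors could be trained as sketched here; the feature vectors and scene labels are made-up examples.

```python
import math

def train_centroids(feature_vectors, scene_labels):
    """Average the training feature vectors per time domain scene."""
    sums, counts = {}, {}
    for vec, label in zip(feature_vectors, scene_labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def classify(centroids, vec):
    """Assign the scene whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))

# Toy 2-D feature vectors (e.g. normalized energy, zero-crossing rate).
centroids = train_centroids([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]],
                            ["quiet", "quiet", "street"])
print(classify(centroids, [0.85, 0.2]))  # street
```

In practice the five features above (short-time energy, zero-crossing rate, amplitude difference, zero-crossing rate ratio, autocorrelation value) would form each vector, and any trainable classifier could take the place of this sketch.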
step S51, calculating the volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio and the corresponding voice sample information to obtain a volume ratio database;
specifically, frequency analysis is performed on the background voice sample and the effective voice sample to obtain amplitude-frequency information, and the volume ratio between the background voice sample and the effective voice sample is calculated according to the amplitude-frequency information;
step S61, obtaining the spectrum centroid, sub-band energy, bandwidth, fundamental tone frequency, wavelet entropy and spectrum flow in the background voice sample;
the spectrum centroid is a parameter reflecting the brightness of the voice signal: it computes the balance point of the signal energy over the whole frequency spectrum, and thus reflects the frequency domain characteristics of the background voice sample;
preferably, the bandwidth is used to characterize the data-carrying capability of signal transmission: it reflects the range of the spectrum over which the signal energy is concentrated, and is defined as the square root of the energy-weighted mean of the squared difference between the signal's spectral components and the spectrum centroid. The bandwidth and the sub-band energy can therefore both reflect the frequency domain characteristics of the background voice sample;
further, the fundamental tone frequency (pitch frequency) measures the height of the pitch; the fundamental tone is also called the fundamental frequency. When a sounding body vibrates, the resulting sound can generally be decomposed into a number of pure sine waves; that is, all natural sounds are basically composed of sine waves of different frequencies, among which the lowest-frequency sine wave is the fundamental tone and the higher-frequency sine waves are overtones. The fundamental tone frequency can therefore reflect the frequency domain characteristics of the background voice sample;
furthermore, the wavelet entropy measures the complexity of the information, and the spectral flux is the variation of the spectral distribution between two adjacent frames, reflecting the dynamic characteristics of the signal; both can therefore reflect the frequency domain characteristics of the background voice sample;
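The spectral flux described above, the change in spectral distribution between two adjacent frames, might be computed as the distance between normalised magnitude spectra; a sketch (the normalisation and distance choice are assumptions):

```python
import math

def spectral_flux(prev_spectrum, curr_spectrum):
    """Euclidean distance between consecutive normalised magnitude
    spectra; larger values mean a faster-changing background."""
    def normalise(spec):
        total = sum(spec) or 1.0  # guard against an all-zero frame
        return [s / total for s in spec]
    p, c = normalise(prev_spectrum), normalise(curr_spectrum)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))
```

A steady hum yields near-zero flux between frames, while street noise or music yields consistently high flux.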
step S71, constructing the frequency domain classifier, and training the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the pitch frequency, the wavelet entropy and the spectral flux;
by training the frequency domain classifier on the spectral centroid, sub-band energy, bandwidth, pitch frequency, wavelet entropy and spectral flux, the trained frequency domain classifier can determine the frequency domain class of the background voice in input sample information, and thereby analyze the frequency domain scene of the background voice in sample information fed to the classifier;
preferably, in this step, after the step of constructing the frequency domain classifier, the method further includes:
acquiring Mel cepstral coefficients, linear prediction cepstral coefficients, line spectral pair coefficients, a spectral crest factor and spectral flatness in the background voice sample;
training the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the pitch frequency, the wavelet entropy, the spectral flux, the Mel cepstral coefficients, the linear prediction cepstral coefficients, the line spectral pair coefficients, the spectral crest factor, and the spectral flatness.
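The patent does not name the classifier model used for the frequency domain (or time domain) scenes. Purely as a sketch of the training flow, a nearest-centroid classifier over fixed-length feature vectors could look like this (the class name and the model choice are assumptions, not the patent's method):

```python
class SceneClassifier:
    """Minimal nearest-centroid classifier over feature vectors
    (centroid, bandwidth, pitch, flux, ...)."""

    def __init__(self):
        self.centroids = {}

    def train(self, samples):
        """samples: list of (feature_vector, scene_label) pairs."""
        sums, counts = {}, {}
        for vec, label in samples:
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, v in enumerate(vec):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        # One centroid (mean feature vector) per scene label.
        self.centroids = {
            lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()
        }

    def classify(self, vec):
        """Return the scene label whose centroid is nearest to vec."""
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(self.centroids, key=lambda lab: dist(vec, self.centroids[lab]))
```

In practice any supervised model (SVM, decision tree, neural network) could play this role; the point is that one feature vector per background sample, labelled with its scene, is enough to train it.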
step S81, controlling a voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtaining failed recognition samples in the voice recognition result according to the voice sample characters;
in this step, the voice recognition algorithm to be tested is controlled to test the voice sample information so as to obtain its voice recognition result, and the correct result corresponding to the voice sample information is compared with the voice recognition result so as to obtain the samples that the voice recognition algorithm to be tested failed to recognize;
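The comparison step described above can be sketched as follows, assuming each sample carries its reference text and the algorithm under test is exposed as a function (both assumptions for illustration):

```python
def find_failed_samples(samples, recognise):
    """Run the algorithm under test over each sample and collect the
    ones whose recognised text does not match the reference text.

    samples:   list of (audio, reference_text) pairs
    recognise: the speech-recognition function under evaluation
    """
    failed = []
    for audio, reference in samples:
        if recognise(audio).strip() != reference.strip():
            failed.append((audio, reference))
    return failed
```

A production harness would typically score with word error rate rather than exact string equality, but the output, a list of failed samples for later classification, is the same.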
step S91, controlling the volume ratio database to classify the volume of the background failure samples in the failure samples, and calculating the failure number corresponding to each volume segment range according to the volume classification result;
step S101, when the failure number corresponding to any volume segment range is judged to be greater than a first preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is within that volume segment range;
the first preset number may be set according to requirements or derived from the amount of voice sample information; for example, the first preset number may be 10%, 11% or 20% of the total number of voice sample information items;
for example, when the failure number corresponding to any volume segment range is judged to be greater than the first preset number, the speech recognition accuracy of the speech recognition algorithm to be tested is judged to be low in background scenes where the acquisition object falls within that volume segment; for instance, the recognition accuracy of the algorithm may be judged to be low when the acquisition object is indoors;
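The counting and threshold check of steps S91 and S101 might look like the sketch below; the segment boundaries and the helper name are illustrative assumptions:

```python
def unqualified_volume_segments(failed_ratios, segment_edges, threshold):
    """Count failures per volume-ratio segment and flag the segments
    whose failure count exceeds the preset threshold.

    failed_ratios: volume ratios of the failed samples
    segment_edges: ascending bucket boundaries, e.g. [0.2, 0.5]
    threshold:     the 'first preset number' of the patent
    """
    counts = [0] * (len(segment_edges) + 1)
    for r in failed_ratios:
        # Index of the segment this ratio falls into.
        idx = sum(1 for edge in segment_edges if r >= edge)
        counts[idx] += 1
    return [i for i, c in enumerate(counts) if c > threshold]
```

With edges `[0.2, 0.5]` and threshold 2, six failures at ratios `[0.1, 0.1, 0.3, 0.6, 0.7, 0.8]` flag only the top segment, i.e. the algorithm is unqualified when the background is at least half as loud as the speech.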
step S111, controlling the time domain classifier to perform time domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each time domain scene according to the time domain classification result;
step S121, when the failure number corresponding to any time domain scene is judged to be larger than a second preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the time domain scene;
the second preset number is set in the same way as the first preset number, and can likewise be set according to requirements or based on the amount of voice sample information;
for example, when the failure number corresponding to any one time domain scene is judged to be larger than a second preset number, the speech recognition accuracy of the speech recognition algorithm to be tested is judged to be low when the acquisition object is in the corresponding time domain scene;
step S131, controlling the frequency domain classifier to perform frequency domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each frequency domain scene according to the frequency domain classification result;
step S141, when the failure number corresponding to any one frequency domain scene is judged to be larger than a third preset number, judging that the voice recognition of the voice recognition algorithm to be tested is unqualified when the background is in the frequency domain scene;
the third preset number is set in the same way as the first preset number, and can likewise be set according to requirements or based on the amount of voice sample information;
for example, when the failure number corresponding to any one frequency domain scene is judged to be larger than a third preset number, the speech recognition accuracy of the speech recognition algorithm to be tested is judged to be low when the acquisition object is in the corresponding frequency domain scene;
the method and the device can effectively evaluate the voice recognition algorithm from three background perspectives, namely time domain, volume ratio and frequency domain, so as to evaluate the recognition effect of the voice recognition algorithm in different application scenes; this effectively improves the accuracy with which a voice recognition algorithm is selected for a given application scene and improves the accuracy of subsequent voice recognition.
Example Three
Referring to fig. 3, a schematic structural diagram of a speech recognition algorithm evaluation system 100 according to a third embodiment of the present invention is shown, including: a sample obtaining module 10, a time domain classifier training module 11, an energy ratio calculating module 12, a frequency domain classifier training module 13, a voice recognition module 14 and an algorithm evaluating module 15, wherein:
the sample acquisition module 10 is used for acquiring voice sample data, wherein the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
the time domain classifier training module 11 is configured to acquire a time domain feature of the background voice sample, and train a time domain classifier according to the time domain feature;
the time domain classifier training module 11 is further configured to: acquire short-time energy characteristics, a short-time average zero-crossing rate, a short-time average amplitude difference and a zero-crossing rate ratio in the background voice sample;
intercept the background voice sample with a short-time window signal at preset sampling points to obtain a sampled signal, and calculate an autocorrelation value of the sampled signal according to a preset autocorrelation function;
and construct the time domain classifier, and train the time domain classifier according to the short-time energy characteristics, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value.
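Two of the time domain features listed above, short-time energy and zero-crossing rate, can be computed per frame as in this sketch (the function names are assumptions; the other features follow the same per-frame pattern):

```python
def short_time_energy(frame):
    """Sum of squared amplitudes within one analysis frame."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ; noisy or
    unvoiced backgrounds cross zero far more often than voiced speech."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)
```

Computed over a sliding window, these per-frame values form the feature sequences on which the time domain classifier is trained.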
The energy ratio calculation module 12 is configured to calculate the volume ratio between the background voice sample and the effective voice sample in the voice sample information, and to store the volume ratio together with the corresponding voice sample information to obtain a volume ratio database.
And the frequency domain classifier training module 13 is configured to acquire frequency domain features of the background speech sample, and train a frequency domain classifier according to the frequency domain features.
The frequency domain classifier training module 13 is further configured to: acquire the spectral centroid, sub-band energy, bandwidth, pitch frequency, wavelet entropy and spectral flux in the background voice sample;
and construct the frequency domain classifier, and train the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the pitch frequency, the wavelet entropy and the spectral flux.
Preferably, the frequency domain classifier training module 13 is further configured to: acquire Mel cepstral coefficients, linear prediction cepstral coefficients, line spectral pair coefficients, a spectral crest factor and spectral flatness in the background voice sample;
and train the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the pitch frequency, the wavelet entropy, the spectral flux, the Mel cepstral coefficients, the linear prediction cepstral coefficients, the line spectral pair coefficients, the spectral crest factor, and the spectral flatness.
The voice recognition module 14 is configured to control a voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and to obtain failed recognition samples in the voice recognition result according to the voice sample characters;
and the algorithm evaluation module 15 is configured to evaluate and classify the failure sample according to the volume ratio database, the time domain classifier and the frequency domain classifier to obtain an algorithm evaluation result.
The algorithm evaluation module 15 is further configured to: control the volume ratio database to perform volume classification on the background failure samples among the failure samples, and calculate the failure number corresponding to each volume segment range according to the volume classification result;
control the time domain classifier to perform time domain classification on the background failure samples among the failure samples, and calculate the failure number corresponding to each time domain scene according to the time domain classification result;
and control the frequency domain classifier to perform frequency domain classification on the background failure samples among the failure samples, and calculate the failure number corresponding to each frequency domain scene according to the frequency domain classification result.
Preferably, the algorithm evaluation module 15 is further configured to: when the failure number corresponding to any volume segment range is judged to be greater than a first preset number, judge that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is within that volume segment range;
when the failure number corresponding to any time domain scene is judged to be greater than a second preset number, judge that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in that time domain scene;
and when the failure number corresponding to any frequency domain scene is judged to be greater than a third preset number, judge that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in that frequency domain scene.
The method and the device can effectively evaluate the voice recognition algorithm from three background perspectives, namely time domain, volume ratio and frequency domain, so as to evaluate the recognition effect of the voice recognition algorithm in different application scenes; this effectively improves the accuracy with which a voice recognition algorithm is selected for a given application scene and improves the accuracy of subsequent voice recognition.
Example Four
Referring to fig. 4, a mobile terminal 101 according to a fourth embodiment of the present invention includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal 101 execute the above-mentioned speech recognition algorithm evaluation method.
The present embodiment also provides a storage medium on which the computer program used in the above-mentioned mobile terminal 101 is stored; when executed, the program carries out the following steps:
acquiring voice sample data, wherein the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
acquiring time domain characteristics of the background voice sample, and training a time domain classifier according to the time domain characteristics;
calculating a volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio together with the corresponding voice sample information to obtain a volume ratio database;
acquiring the frequency domain characteristics of the background voice sample, and training a frequency domain classifier according to the frequency domain characteristics;
controlling a voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtaining failed recognition samples in the voice recognition result according to the voice sample characters;
and evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier to obtain an algorithm evaluation result. The storage medium may be, for example, a ROM/RAM, a magnetic disk or an optical disk.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is used as an example, in practical applications, the above-mentioned function distribution may be performed by different functional units or modules according to needs, that is, the internal structure of the storage device is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application.
Those skilled in the art will appreciate that the configuration shown in fig. 3 does not constitute a limitation of the speech recognition algorithm evaluation system of the present invention and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components, and that the speech recognition algorithm evaluation method of fig. 1-2 may be implemented using more or fewer components than those shown in fig. 3, or some components in combination, or a different arrangement of components. The units, modules, etc. referred to herein are a series of computer programs that can be executed by a processor (not shown) in the target speech recognition algorithm evaluation system and that can perform specific functions, all of which can be stored in a storage device (not shown) of the target speech recognition algorithm evaluation system.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (10)
1. A method for evaluating a speech recognition algorithm, the method comprising:
acquiring voice sample data, wherein the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
acquiring time domain characteristics of the background voice sample, and training a time domain classifier according to the time domain characteristics;
calculating a volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio together with the corresponding voice sample information to obtain a volume ratio database;
acquiring the frequency domain characteristics of the background voice sample, and training a frequency domain classifier according to the frequency domain characteristics;
controlling a voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtaining failed recognition samples in the voice recognition result according to the voice sample characters;
and evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier to obtain an algorithm evaluation result.
2. The method of claim 1, wherein the step of obtaining the time domain features of the background speech sample and training a time domain classifier based on the time domain features comprises:
acquiring short-time energy characteristics, a short-time average zero-crossing rate, a short-time average amplitude difference and a zero-crossing rate ratio in the background voice sample;
intercepting the background voice sample by a short-time window signal according to a preset sampling point to obtain a sampling signal, and calculating an autocorrelation value of the sampling signal according to a preset autocorrelation function;
and constructing the time domain classifier, and training the time domain classifier according to the short-time energy characteristic, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value.
3. The method of claim 1, wherein the step of obtaining frequency domain features of the background speech sample and training a frequency domain classifier based on the frequency domain features comprises:
acquiring a spectral centroid, sub-band energy, a bandwidth, a pitch frequency, wavelet entropy and spectral flux in the background voice sample;
and constructing the frequency domain classifier, and training the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the pitch frequency, the wavelet entropy and the spectral flux.
4. The speech recognition algorithm evaluation method of claim 1, wherein the step of evaluating and classifying the failed samples according to the volume ratio database, the time domain classifier, and the frequency domain classifier comprises:
controlling the volume ratio database to classify the volume of the background failure samples in the failure samples, and calculating the failure number corresponding to each volume segment range according to the volume classification result;
controlling the time domain classifier to perform time domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each time domain scene according to the time domain classification result;
and controlling the frequency domain classifier to perform frequency domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each frequency domain scene according to the frequency domain classification result.
5. The speech recognition algorithm evaluation method of claim 4, wherein the step of evaluating and classifying the failed samples according to the volume ratio database, the time domain classifier, and the frequency domain classifier further comprises:
when the failure number corresponding to any one of the volume segment ranges is judged to be larger than a first preset number, judging that the speech recognition of the speech recognition algorithm to be tested in the volume segment range at the background is unqualified;
when the failure number corresponding to any time domain scene is judged to be larger than a second preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the time domain scene;
and when the failure number corresponding to any one frequency domain scene is judged to be larger than a third preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the frequency domain scene.
6. The speech recognition algorithm evaluation method of claim 3, wherein after the step of constructing the frequency domain classifier, the method further comprises:
acquiring Mel cepstral coefficients, linear prediction cepstral coefficients, line spectral pair coefficients, a spectral crest factor and spectral flatness in the background voice sample;
training the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the pitch frequency, the wavelet entropy, the spectral flux, the Mel cepstral coefficients, the linear prediction cepstral coefficients, the line spectral pair coefficients, the spectral crest factor, and the spectral flatness.
7. A speech recognition algorithm evaluation system, the system comprising:
the sample acquisition module is used for acquiring voice sample data, wherein the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
the time domain classifier training module is used for acquiring the time domain characteristics of the background voice sample and training a time domain classifier according to the time domain characteristics;
the energy ratio calculation module is used for calculating the volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio together with the corresponding voice sample information to obtain a volume ratio database;
the frequency domain classifier training module is used for acquiring the frequency domain characteristics of the background voice sample and training a frequency domain classifier according to the frequency domain characteristics;
the voice recognition module is used for controlling a voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtaining failed recognition samples in the voice recognition result according to the voice sample characters;
and the algorithm evaluation module is used for evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier so as to obtain an algorithm evaluation result.
8. The speech recognition algorithm evaluation system of claim 7, wherein the time domain classifier training module is further configured to:
acquiring short-time energy characteristics, a short-time average zero-crossing rate, a short-time average amplitude difference and a zero-crossing rate ratio in the background voice sample;
intercepting the background voice sample by a short-time window signal according to a preset sampling point to obtain a sampling signal, and calculating an autocorrelation value of the sampling signal according to a preset autocorrelation function;
and constructing the time domain classifier, and training the time domain classifier according to the short-time energy characteristic, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value.
9. A mobile terminal, characterized in that it comprises a storage device for storing a computer program and a processor running the computer program to make the mobile terminal execute the speech recognition algorithm evaluation method according to any one of claims 1 to 6.
10. A storage medium, characterized in that it stores a computer program for use in a mobile terminal according to claim 9, which computer program, when being executed by a processor, carries out the steps of the speech recognition algorithm evaluation method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010257506.6A CN111599345B (en) | 2020-04-03 | 2020-04-03 | Speech recognition algorithm evaluation method, system, mobile terminal and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111599345A true CN111599345A (en) | 2020-08-28 |
CN111599345B CN111599345B (en) | 2023-02-10 |
Family
ID=72191979
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010257506.6A Active CN111599345B (en) | 2020-04-03 | 2020-04-03 | Speech recognition algorithm evaluation method, system, mobile terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111599345B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN112017639A (en) * | 2020-09-10 | 2020-12-01 | 歌尔科技有限公司 | Voice signal detection method, terminal device and storage medium
CN112017639B (en) * | 2020-09-10 | 2023-11-07 | 歌尔科技有限公司 | Voice signal detection method, terminal equipment and storage medium
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103345923A (en) * | 2013-07-26 | 2013-10-09 | 电子科技大学 | Sparse representation based short-voice speaker recognition method |
US20180033454A1 (en) * | 2016-07-27 | 2018-02-01 | Vocollect, Inc. | Distinguishing user speech from background speech in speech-dense environments |
US20180158470A1 (en) * | 2015-06-26 | 2018-06-07 | Zte Corporation | Voice Activity Modification Frame Acquiring Method, and Voice Activity Detection Method and Apparatus |
CN109036458A (en) * | 2018-08-22 | 2018-12-18 | 昆明理工大学 | A kind of multilingual scene analysis method based on audio frequency characteristics parameter |
CN109192196A (en) * | 2018-08-22 | 2019-01-11 | 昆明理工大学 | A kind of audio frequency characteristics selection method of the SVM classifier of anti-noise |
CN110335611A (en) * | 2019-07-15 | 2019-10-15 | 易诚高科(大连)科技有限公司 | A kind of voiceprint recognition algorithm appraisal procedure based on quality dimensions |
CN110610709A (en) * | 2019-09-26 | 2019-12-24 | 浙江百应科技有限公司 | Identity distinguishing method based on voiceprint recognition |
Also Published As
Publication number | Publication date |
---|---|
CN111599345B (en) | 2023-02-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||