CN111599345A - Speech recognition algorithm evaluation method, system, mobile terminal and storage medium - Google Patents


Info

Publication number
CN111599345A
Authority
CN
China
Prior art keywords
voice sample
voice
frequency domain
time domain
background
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010257506.6A
Other languages
Chinese (zh)
Other versions
CN111599345B (en)
Inventor
肖龙源
李稀敏
刘晓葳
谭玉坤
叶志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010257506.6A
Publication of CN111599345A
Application granted
Publication of CN111599345B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/01: Assessment or evaluation of speech recognition systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a speech recognition algorithm evaluation method and system, a mobile terminal, and a storage medium. The method comprises the following steps: acquiring voice sample data, acquiring time domain features of the background voice samples, and training a time domain classifier on them; calculating the volume ratio between each background voice sample and the corresponding effective voice sample, and storing the volume ratio together with the corresponding voice sample information to obtain a volume ratio database; acquiring frequency domain features of the background voice samples and training a frequency domain classifier on them; controlling the speech recognition algorithm to be tested to recognize the voice sample information to obtain a speech recognition result, and obtaining the failed recognition samples in the result according to the voice sample characters; and evaluating and classifying the failed samples according to the volume ratio database, the time domain classifier, and the frequency domain classifier to obtain an algorithm evaluation result. The method can evaluate a speech recognition algorithm from the three scene angles of time domain, volume ratio, and frequency domain, so as to evaluate its recognition effect in different application scenes.

Description

Speech recognition algorithm evaluation method, system, mobile terminal and storage medium
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a speech recognition algorithm evaluation method, system, mobile terminal, and storage medium.
Background
Research on speech recognition spans several decades. Speech recognition technology mainly comprises four parts: acoustic model modeling, language model modeling, pronunciation dictionary construction, and decoding, and each part can be an independent research direction. Compared with images and text, speech data are considerably harder to collect and label, so building a complete speech recognition system is time-consuming and difficult work, which has greatly hindered the development of the technology. With the research and development of artificial intelligence, especially deep learning, end-to-end speech recognition algorithms have been proposed. Compared with traditional speech recognition methods, the end-to-end approach simplifies the recognition pipeline and hands a large amount of the work to deep neural networks for learning and inference, so it has attracted wide attention in recent years.
The existing speech recognition process achieves its recognition effect on the basis of a speech recognition algorithm, so performance evaluation of the algorithm is particularly important for guaranteeing recognition accuracy. However, the existing evaluation process judges an algorithm only by its recognition rate and cannot reflect its recognition effect in different application scenes; as a result, misjudgment easily occurs when selecting a speech recognition algorithm for a given application scene, which reduces the accuracy of the evaluation.
Disclosure of Invention
The embodiment of the invention aims to provide a speech recognition algorithm evaluation method, a system, a mobile terminal and a storage medium, and aims to solve the problem of low evaluation accuracy caused by incapability of reflecting recognition effects of a speech recognition algorithm in different application scenes in the existing speech recognition algorithm evaluation process.
The embodiment of the invention is realized in such a way that a speech recognition algorithm evaluation method comprises the following steps:
acquiring voice sample data, wherein the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
acquiring time domain characteristics of the background voice sample, and training a time domain classifier according to the time domain characteristics;
calculating a volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio and the corresponding voice sample information to obtain a volume ratio database;
acquiring the frequency domain characteristics of the background voice sample, and training a frequency domain classifier according to the frequency domain characteristics;
controlling a voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtaining the failed recognition samples in the voice recognition result according to the voice sample characters;
and evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier to obtain an algorithm evaluation result.
Further, the step of obtaining the time domain feature of the background speech sample and training the time domain classifier according to the time domain feature includes:
acquiring short-time energy characteristics, a short-time average zero-crossing rate, a short-time average amplitude difference and a zero-crossing rate ratio in the background voice sample;
intercepting the background voice sample by a short-time window signal according to a preset sampling point to obtain a sampling signal, and calculating an autocorrelation value of the sampling signal according to a preset autocorrelation function;
and constructing the time domain classifier, and training the time domain classifier according to the short-time energy characteristic, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value.
Further, the step of obtaining the frequency domain characteristics of the background speech sample and training the frequency domain classifier according to the frequency domain characteristics includes:
acquiring a spectrum centroid, sub-band energy, bandwidth, fundamental tone frequency, wavelet entropy and spectrum flow in the background voice sample;
and constructing the frequency domain classifier, and training the frequency domain classifier according to the spectrum centroid, the sub-band energy, the bandwidth, the fundamental tone frequency, the wavelet entropy and the spectrum flow.
Further, the step of performing an evaluation classification on the failed samples according to the volume ratio database, the time domain classifier and the frequency domain classifier comprises:
controlling the volume ratio database to classify the volume of the background failure samples in the failure samples, and calculating the failure number corresponding to each volume segment range according to the volume classification result;
controlling the time domain classifier to perform time domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each time domain scene according to the time domain classification result;
and controlling the frequency domain classifier to perform frequency domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each frequency domain scene according to the frequency domain classification result.
Further, the step of performing an evaluation classification on the failed sample according to the volume ratio database, the time domain classifier and the frequency domain classifier further includes:
when the failure number corresponding to any one of the volume segment ranges is judged to be larger than a first preset number, judging that speech recognition by the speech recognition algorithm to be tested is unqualified when the background falls within that volume segment range;
when the failure number corresponding to any time domain scene is judged to be larger than a second preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the time domain scene;
and when the failure number corresponding to any one frequency domain scene is judged to be larger than a third preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the frequency domain scene.
Further, after the step of constructing the frequency domain classifier, the method further comprises:
acquiring a Mel cepstrum coefficient, a linear prediction cepstrum coefficient, a linear spectrum pair coefficient, a spectrum crest factor and spectrum flatness in the background voice sample;
training the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the pitch frequency, the wavelet entropy, the spectral flux, the Mel cepstral coefficient, the linear prediction cepstral coefficient, the linear spectral pair coefficient, the spectral crest factor, and the spectral flatness.
It is another object of an embodiment of the present invention to provide a speech recognition algorithm evaluation system, which includes:
the sample acquisition module is used for acquiring voice sample data, wherein the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
the time domain classifier training module is used for acquiring the time domain characteristics of the background voice sample and training a time domain classifier according to the time domain characteristics;
the volume ratio calculation module is used for calculating the volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio and the corresponding voice sample information to obtain a volume ratio database;
the frequency domain classifier training module is used for acquiring the frequency domain characteristics of the background voice sample and training a frequency domain classifier according to the frequency domain characteristics;
the voice recognition module is used for controlling a voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtaining the failed recognition samples in the voice recognition result according to the voice sample characters;
and the algorithm evaluation module is used for evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier so as to obtain an algorithm evaluation result.
Further, the time domain classifier training module is further configured to:
acquiring short-time energy characteristics, a short-time average zero-crossing rate, a short-time average amplitude difference and a zero-crossing rate ratio in the background voice sample;
intercepting the background voice sample by a short-time window signal according to a preset sampling point to obtain a sampling signal, and calculating an autocorrelation value of the sampling signal according to a preset autocorrelation function;
and constructing the time domain classifier, and training the time domain classifier according to the short-time energy characteristic, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value.
Another object of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above-mentioned speech recognition algorithm evaluation method.
Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the speech recognition algorithm evaluation method.
The embodiment of the invention can effectively evaluate a speech recognition algorithm from the three background angles of time domain, volume ratio, and frequency domain, so as to evaluate its recognition effect in different application scenes, effectively improve the accuracy of selecting a speech recognition algorithm for a given application scene, and improve the accuracy of subsequent speech recognition.
Drawings
FIG. 1 is a flow chart of a speech recognition algorithm evaluation method provided by a first embodiment of the present invention;
FIG. 2 is a flow chart of a speech recognition algorithm evaluation method provided by a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a speech recognition algorithm evaluation system according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a mobile terminal according to a fourth embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
Example one
Referring to fig. 1, a flowchart of a speech recognition algorithm evaluation method according to a first embodiment of the present invention is shown, which includes the steps of:
step S10, acquiring voice sample data;
the voice sample data comprises at least one piece of voice sample information, the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters, and preferably, the background voice samples stored in different pieces of voice sample information are different;
step S20, acquiring the time domain characteristics of the background voice sample, and training a time domain classifier according to the time domain characteristics;
the time domain features comprise short-time energy features, short-time average zero-crossing rate, short-time average amplitude difference and zero-crossing rate ratio, and because the time domain features in different background voice samples are different, the design of a time domain classifier is trained according to the time domain features, so that the trained time domain classifier can perform time domain classification analysis on the input voice signal, and then the subsequent time domain analysis can be performed on the voice recognition algorithm to be tested, so that the accuracy of the voice recognition algorithm to be tested in different time domain scenes is analyzed;
step S30, calculating the volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio and the corresponding voice sample information to obtain a volume ratio database;
the method comprises the steps that the volume ratio between a background voice sample and an effective voice sample is calculated, so that the volume of the background voice sample and the volume of the effective voice sample are digitized, the ratio between background sound and effective sound in each voice sample information is analyzed, the subsequent voice recognition algorithm to be tested can be subjected to volume analysis, the accuracy of voice recognition of the voice recognition algorithm to be tested in scenes with different volume ratios is analyzed, and the algorithm evaluation effect is effectively achieved;
step S40, acquiring the frequency domain characteristics of the background voice sample, and training a frequency domain classifier according to the frequency domain characteristics;
the frequency domain characteristics comprise a spectrum centroid, sub-band energy, bandwidth, fundamental tone frequency, wavelet entropy and spectrum flow, and because the frequency domain characteristics in different background voice samples are different, the design of a frequency domain classifier is trained according to the frequency domain characteristics, so that the trained frequency domain classifier can perform frequency domain classification analysis on an input voice signal, and then the subsequent frequency domain analysis can be performed on the voice recognition algorithm to be tested, so that the accuracy of the voice recognition algorithm to be tested in different frequency domain scenes is analyzed, and the algorithm evaluation effect is effectively achieved;
step S50, controlling a voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtaining the failed recognition samples in the voice recognition result according to the voice sample characters;
when the voice recognition characters stored in the voice recognition result are different from the corresponding voice sample characters, judging that the voice recognition algorithm to be tested fails to recognize the sample corresponding to the voice recognition characters so as to obtain a failed sample;
step S60, evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier to obtain an algorithm evaluation result;
the volume ratio database is used for analyzing the volume ratio between background sound and effective sound of the failed sample, the time domain classifier is used for analyzing a time domain scene of the background of the failed sample, and the frequency domain classifier is used for analyzing a frequency domain scene of the background of the failed sample;
specifically, in the step, the recognition effect of the speech recognition algorithm to be tested in different volume ratio scenes, time domain scenes and frequency domain scenes is judged through the classification evaluation result of the failed samples based on the volume ratio database, the time domain classifier and the frequency domain classifier;
the method and the device can effectively evaluate the voice recognition algorithm from three background angles of time domain, volume ratio and frequency domain so as to evaluate the recognition effect of the voice recognition algorithm in different application scenes, effectively improve the accuracy of the selection of the voice recognition algorithm in different application scenes and improve the accuracy of subsequent voice recognition.
Example two
Referring to fig. 2, a flowchart of a speech recognition algorithm evaluation method according to a second embodiment of the present invention is shown, which includes the steps of:
step S11, acquiring voice sample data;
the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
step S21, acquiring short-time energy characteristics, a short-time average zero crossing rate, a short-time average amplitude difference and a zero crossing rate ratio in the background voice sample;
the energy of the voice signal changes obviously along with time, the energy of a soft part is much smaller than that of a voiced part, and the selection of the window function plays a decisive role in the characteristics of the short-time energy representation method. The short-time energy characteristics are mainly applied to the following aspects: firstly, the short-time energy characteristics can be used for distinguishing unvoiced sound and voiced sound, because the energy of the voiced sound is much larger than that of the unvoiced sound; secondly, judging the voiced sections and the unvoiced sections by using short-time energy characteristics, decomposing initials and finals, and dividing ligatures, so that in the step, the time domain characteristics of the background voice sample are obtained by extracting the short-time energy characteristics in the background voice sample;
preferably, the short-time average zero-crossing rate is the number of times the signal passes through zero: for a continuous speech signal, a zero crossing occurs where the time domain waveform crosses the time axis, while for a discrete signal the short-time average zero-crossing rate is essentially the number of times adjacent sampling points change sign; the short-time average zero-crossing rate can thus be used for analysis of the speech signal.
Furthermore, the short-time average amplitude difference is a representation of the energy of a frame of the audio signal. Since the average amplitude function involves no squaring, its dynamic range is smaller than that of the short-time energy, roughly the square root of the dynamic range of the standard energy calculation. The influence of the window length N on the average amplitude function is entirely consistent with the analysis for short-time energy, as is the influence of voiced sounds on the amplitude difference; the amplitude difference of voiced sounds is much larger than that of unvoiced sounds, so the short-time average amplitude function can be used to distinguish unvoiced sounds within the background speech sample;
furthermore, if the zero-crossing rate of a given frame is zero, the frame is counted as a zero-crossing frame, and the ratio of the number of such frames to the number of all frames in the background speech sample is the zero-crossing rate ratio.
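The four time domain features above have standard short-time definitions, although the patent gives no explicit formulas. The following minimal numpy sketch uses common textbook definitions; the frame length, hop size, and the reading of the zero-crossing rate ratio as "frames whose rate is zero over all frames" follow the text but remain assumptions:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping short-time frames (no padding)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    """Sum of squared samples per frame."""
    return np.sum(frames.astype(float) ** 2, axis=1)

def short_time_zcr(frames):
    """Fraction of adjacent sample pairs per frame whose sign changes."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def short_time_amdf(frames, lag=1):
    """Short-time average magnitude difference at a given lag."""
    return np.mean(np.abs(frames[:, lag:] - frames[:, :-lag]), axis=1)

def zcr_ratio(frames):
    """Ratio of frames whose zero-crossing rate is zero to all frames."""
    return np.mean(short_time_zcr(frames) == 0)
```

For a constant signal every frame has zero crossing rate 0 (so the ratio is 1), while a sample-by-sample alternating signal has a per-frame rate of 1.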
Step S31, intercepting the background voice sample by a short time window signal according to a preset sampling point to obtain a sampling signal, and calculating an autocorrelation value of the sampling signal according to a preset autocorrelation function;
the number and the sampling coordinates of the preset sampling points can be selected according to requirements, and in the step, the time domain characteristics of the background voice sample are further acquired by calculating the design of the autocorrelation value;
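The short-time window plus autocorrelation step above can be sketched as follows; the Hamming window, frame length, and maximum lag are illustrative choices, since the patent leaves the preset sampling points and the autocorrelation function unspecified:

```python
import numpy as np

def windowed_autocorr(x, start, frame_len=256, max_lag=64):
    """Intercept the signal with a short-time (Hamming) window at the chosen
    sampling point and return the normalized autocorrelation up to max_lag."""
    seg = x[start:start + frame_len] * np.hamming(frame_len)
    full = np.correlate(seg, seg, mode="full")
    r = full[frame_len - 1: frame_len - 1 + max_lag + 1]  # non-negative lags
    return r / r[0]  # normalize so that r[0] == 1 (assumes non-silent segment)
```

For a periodic background (e.g. a hum), the normalized autocorrelation shows a strong positive peak at the lag equal to the period, which is the kind of time domain cue a classifier can exploit.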
step S41, constructing the time domain classifier, and training the time domain classifier according to the short-time energy feature, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value;
the trained time domain classifier can judge the time domain classification of the background voice information in the input sample information by training the time domain classifier according to the short-time energy characteristic, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value so as to analyze the time domain scene of the background voice in the sample information input into the time domain classifier;
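The patent does not name the model behind the time domain classifier, so the sketch below substitutes a deliberately simple nearest-centroid classifier over five-dimensional feature vectors (short-time energy, zero-crossing rate, amplitude difference, zero-crossing rate ratio, autocorrelation value); any trainable classifier could take its place:

```python
import numpy as np

class CentroidClassifier:
    """Minimal stand-in for the time domain classifier: each time domain
    scene is represented by the mean of its training feature vectors and a
    query is assigned to the nearest centroid."""

    def fit(self, features, labels):
        feats = np.asarray(features, dtype=float)
        labs = np.asarray(labels)
        self.labels_ = sorted(set(labels))
        self.centroids_ = np.stack(
            [feats[labs == l].mean(axis=0) for l in self.labels_])
        return self

    def predict(self, feature):
        d = np.linalg.norm(self.centroids_ - np.asarray(feature, dtype=float),
                           axis=1)
        return self.labels_[int(np.argmin(d))]
```

The scene labels ("quiet", "street", etc.) are hypothetical; in the patent's pipeline they would be whatever time domain scenes the training background samples are annotated with.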
step S51, calculating the volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio and the corresponding voice sample information to obtain a volume ratio database;
frequency calculation is carried out on the background voice sample and the effective voice sample to obtain amplitude-frequency information, and the volume ratio between the background voice sample and the effective voice sample is calculated according to the amplitude-frequency information;
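A volume ratio computation and the resulting database might look like the sketch below. The dB-scaled RMS ratio and the dictionary layout are assumptions: the patent only requires that the ratio be stored together with the corresponding voice sample information.

```python
import numpy as np

def volume_ratio_db(background, effective, eps=1e-12):
    """Volume ratio between background and effective speech, expressed in dB
    from the RMS amplitudes (an SNR-style measure; the patent fixes no formula)."""
    rms_bg = np.sqrt(np.mean(np.asarray(background, dtype=float) ** 2))
    rms_fg = np.sqrt(np.mean(np.asarray(effective, dtype=float) ** 2))
    return 20.0 * np.log10((rms_bg + eps) / (rms_fg + eps))

def build_volume_ratio_database(samples):
    """Hypothetical 'volume ratio database': sample id -> (ratio, sample info)."""
    return {s["id"]: (volume_ratio_db(s["background"], s["effective"]), s)
            for s in samples}
```

With a background at one tenth the amplitude of the effective speech, the stored ratio is -20 dB.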
step S61, obtaining the spectrum centroid, sub-band energy, bandwidth, fundamental tone frequency, wavelet entropy and spectrum flow in the background voice sample;
the spectral centroid is a parameter reflecting the brightness of the voice signal, and the spectral centroid calculates a balance point of the voice signal in the whole frequency spectrum to reflect the frequency domain characteristics of the background voice sample;
preferably, the sub-band energy reflects the signal power, or the range of the spectrum in which the signal energy is concentrated, while the bandwidth is defined as the square root of the energy-weighted mean squared deviation of the signal's spectral components from the spectral centroid; both can therefore reflect the frequency domain characteristics of the background voice sample;
further, the pitch frequency measures the height of the pitch; the pitch is also called the fundamental tone, and its frequency the fundamental frequency. When a sounding body vibrates, the resulting sound can generally be decomposed into a number of pure sine waves; that is, all natural sounds are basically composed of sine waves of different frequencies, of which the lowest-frequency component is the fundamental tone and the higher-frequency components are overtones, so the pitch frequency can reflect the frequency domain characteristics of the background voice sample;
furthermore, the wavelet entropy measures the complexity of the information, and the spectral flux is the variation of the spectral distribution between two adjacent frames and reflects the dynamic characteristics of the signal, so the wavelet entropy and the spectral flux can both reflect the frequency domain characteristics of the background voice sample;
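Two of the frequency domain features, the spectral centroid and the spectral flux, can be sketched directly from the descriptions above (the "balance point" of the spectrum; the change of the spectral distribution between adjacent frames). The L2 normalization in the flux and the 1e-12 guards are implementation choices, not from the patent:

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of one frame: the spectrum's balance point."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * mag) / (np.sum(mag) + 1e-12)

def spectral_flux(prev_frame, frame):
    """Squared change of the L2-normalized magnitude spectrum between two adjacent frames."""
    a = np.abs(np.fft.rfft(prev_frame))
    a /= (np.linalg.norm(a) + 1e-12)
    b = np.abs(np.fft.rfft(frame))
    b /= (np.linalg.norm(b) + 1e-12)
    return np.sum((b - a) ** 2)
```

A pure 1 kHz tone sampled at 8 kHz has its centroid at 1 kHz, an identical frame pair has zero flux, and two tones at different frequencies give (under this normalization) a flux near 2.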
step S71, constructing the frequency domain classifier, and training the frequency domain classifier according to the spectrum centroid, the sub-band energy, the bandwidth, the fundamental tone frequency, the wavelet entropy and the spectrum flow;
the trained frequency domain classifier can judge the frequency domain classification of background voice information in the input sample information by training the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the fundamental tone frequency, the wavelet entropy and the spectral flow so as to analyze the frequency domain scene of the background voice in the sample information input into the frequency domain classifier;
preferably, after the step of constructing the frequency domain classifier, the method further includes:
acquiring the Mel-frequency cepstral coefficients, the linear prediction cepstral coefficients, the line spectral pair coefficients, the spectral crest factor and the spectral flatness of the background voice sample;
training the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the pitch frequency, the wavelet entropy, the spectral flux, the Mel-frequency cepstral coefficients, the linear prediction cepstral coefficients, the line spectral pair coefficients, the spectral crest factor, and the spectral flatness.
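The description never fixes a model type for the frequency domain classifier. Purely as a stand-in (all names hypothetical), a minimal nearest-centroid classifier over per-scene feature vectors could look like this:

```python
import numpy as np

class NearestCentroidClassifier:
    """Minimal stand-in for the frequency domain classifier: each
    background scene is represented by the centroid of its training
    feature vectors, and a new sample is assigned to the label of the
    closest centroid."""

    def fit(self, X, y):
        self.labels_ = sorted(set(y))
        X, y = np.asarray(X, float), np.asarray(y)
        self.centroids_ = np.array(
            [X[y == lbl].mean(axis=0) for lbl in self.labels_])
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        # distance of every sample to every scene centroid
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
        return [self.labels_[i] for i in d.argmin(axis=1)]
```

In practice the feature vectors would concatenate the quantities listed above (centroid, bandwidth, flux, MFCCs, ...), and a stronger model such as an SVM could be substituted without changing the surrounding evaluation flow.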
Step S81, controlling the voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtaining the failed recognition samples in the voice recognition result according to the voice sample characters;
in this step, the voice recognition algorithm to be tested is controlled to test the voice sample information so as to obtain the voice recognition result of the algorithm, and the correct result corresponding to the voice sample information is compared with the voice recognition result so as to obtain the samples that the voice recognition algorithm to be tested failed to recognize;
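The comparison in step S81 — running the algorithm under test on every sample and keeping those whose output differs from the reference voice sample characters — can be sketched as follows (the sample fields and the `recognize` callable are assumed interfaces, not defined by the patent):

```python
def collect_failed_samples(samples, recognize):
    """Run the speech recognition algorithm under test on every voice
    sample and keep the ones whose recognised text differs from the
    reference text (the failure samples used in later evaluation)."""
    failed = []
    for sample in samples:
        if recognize(sample["audio"]) != sample["text"]:
            failed.append(sample)
    return failed
```

A softer criterion (e.g. a word error rate threshold) could replace the exact string comparison without changing the downstream per-scene counting.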
step S91, controlling the volume ratio database to classify the volume of the background failure samples in the failure samples, and calculating the failure number corresponding to each volume segment range according to the volume classification result;
step S101, when the failure number corresponding to any volume segment range is judged to be larger than a first preset number, judging that the speech recognition of the speech recognition algorithm to be tested in the volume segment range at the background is unqualified;
the first preset number may be set according to requirements or based on the amount of voice sample information; for example, the first preset number may be 10%, 11% or 20% of the total number of pieces of voice sample information;
for example, when the failure number corresponding to any volume segment range is judged to be greater than the first preset number, the speech recognition accuracy of the speech recognition algorithm to be tested is judged to be low in background scenes where the acquisition object falls within the corresponding volume segment, for example, when the acquisition object is indoors;
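The bookkeeping of steps S91 and S101 can be sketched as below; the dB segment boundaries and the threshold value are illustrative assumptions, not values from the patent:

```python
def flag_unqualified_volume_ranges(failure_ratios_db, segments, threshold):
    """Bucket the background/effective volume ratios (in dB) of the
    failure samples into segment ranges, then flag every range whose
    failure count exceeds the preset number (e.g. 10% of all samples)."""
    counts = {seg: 0 for seg in segments}
    for r in failure_ratios_db:
        for lo, hi in segments:
            if lo <= r < hi:
                counts[(lo, hi)] += 1
                break
    # keep only the volume segment ranges judged unqualified
    return {seg: n for seg, n in counts.items() if n > threshold}
```

The same counting pattern applies unchanged to the time domain and frequency domain scenes in the later steps, with the scene label replacing the volume segment.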
step S111, controlling the time domain classifier to perform time domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each time domain scene according to the time domain classification result;
step S121, when the failure number corresponding to any time domain scene is judged to be larger than a second preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the time domain scene;
the second preset number is set in the same manner as the first preset number, and may be set according to requirements or based on the amount of voice sample information;
for example, when the failure number corresponding to any one time domain scene is judged to be larger than a second preset number, the speech recognition accuracy of the speech recognition algorithm to be tested is judged to be low when the acquisition object is in the corresponding time domain scene;
step S131, controlling the frequency domain classifier to perform frequency domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each frequency domain scene according to the frequency domain classification result;
step S141, when the failure number corresponding to any one frequency domain scene is judged to be larger than a third preset number, judging that the voice recognition of the voice recognition algorithm to be tested is unqualified when the background is in the frequency domain scene;
the third preset number is set in the same manner as the first preset number, and may be set according to requirements or based on the amount of voice sample information;
for example, when the failure number corresponding to any one frequency domain scene is judged to be larger than a third preset number, the speech recognition accuracy of the speech recognition algorithm to be tested is judged to be low when the acquisition object is in the corresponding frequency domain scene;
the method and the device can effectively evaluate the voice recognition algorithm from three background perspectives, namely time domain, volume ratio and frequency domain, so as to evaluate the recognition effect of the voice recognition algorithm in different application scenes. This effectively improves the accuracy with which a voice recognition algorithm is selected for different application scenes, and thereby improves the accuracy of subsequent voice recognition.
Example three
Referring to fig. 3, a schematic structural diagram of a speech recognition algorithm evaluation system 100 according to a third embodiment of the present invention is shown, including: a sample obtaining module 10, a time domain classifier training module 11, an energy ratio calculating module 12, a frequency domain classifier training module 13, a voice recognition module 14 and an algorithm evaluating module 15, wherein:
the sample acquisition module 10 is configured to acquire voice sample data, wherein the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
the time domain classifier training module 11 is configured to acquire a time domain feature of the background voice sample, and train a time domain classifier according to the time domain feature;
wherein, the time domain classifier training module 11 is further configured to: acquiring short-time energy characteristics, a short-time average zero-crossing rate, a short-time average amplitude difference and a zero-crossing rate ratio in the background voice sample;
intercepting the background voice sample by a short-time window signal according to a preset sampling point to obtain a sampling signal, and calculating an autocorrelation value of the sampling signal according to a preset autocorrelation function;
and constructing the time domain classifier, and training the time domain classifier according to the short-time energy characteristic, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value.
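A minimal sketch of a few of the time domain features the module trains on — short-time energy, short-time average zero-crossing rate and a lag-1 autocorrelation value; the exact windowing and lag choices are assumptions, since the patent leaves them open:

```python
import numpy as np

def short_time_features(frame):
    """Short-time energy, short-time average zero-crossing rate and the
    normalised lag-1 autocorrelation of one windowed frame."""
    frame = np.asarray(frame, float)
    energy = float((frame ** 2).sum())
    signs = np.sign(frame)
    # each sign change contributes |diff| == 2, hence the division by 2N
    zcr = float(np.abs(np.diff(signs)).sum() / (2 * len(frame)))
    ac = float(frame[:-1] @ frame[1:] / max(energy, 1e-12))
    return energy, zcr, ac
```

An alternating +1/-1 signal, for instance, has a zero-crossing rate near 1 and a strongly negative lag-1 autocorrelation, while a slowly varying hum has a near-zero crossing rate and autocorrelation near 1.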
And the energy ratio calculation module 12 is configured to calculate the volume ratio between the background voice sample and the effective voice sample in the voice sample information, and store the volume ratio together with the corresponding voice sample information to obtain a volume ratio database.
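The volume (energy) ratio stored in the volume ratio database might be computed as a ratio of mean-square energies expressed in dB; the dB scale is an assumption here, since the patent does not fix a unit:

```python
import numpy as np

def volume_ratio_db(background, effective, eps=1e-12):
    """Volume ratio between the background voice sample and the
    effective voice sample, as mean-square energy ratio in dB.
    eps guards against log of zero for silent samples."""
    e_bg = float(np.mean(np.square(background)))
    e_fg = float(np.mean(np.square(effective)))
    return float(10.0 * np.log10((e_bg + eps) / (e_fg + eps)))
```

A background at one tenth the amplitude of the effective speech gives a ratio of about -20 dB, which is the kind of value the later volume-segment classification would bucket.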
And the frequency domain classifier training module 13 is configured to acquire frequency domain features of the background speech sample, and train a frequency domain classifier according to the frequency domain features.
Wherein, the frequency domain classifier training module 13 is further configured to: acquire the spectral centroid, the sub-band energy, the bandwidth, the fundamental tone frequency, the wavelet entropy and the spectral flux of the background voice sample;
and construct the frequency domain classifier, and train the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the fundamental tone frequency, the wavelet entropy and the spectral flux.
Preferably, the frequency domain classifier training module 13 is further configured to: acquire the Mel-frequency cepstral coefficients, the linear prediction cepstral coefficients, the line spectral pair coefficients, the spectral crest factor and the spectral flatness of the background voice sample;
and train the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the pitch frequency, the wavelet entropy, the spectral flux, the Mel-frequency cepstral coefficients, the linear prediction cepstral coefficients, the line spectral pair coefficients, the spectral crest factor, and the spectral flatness.
The voice recognition module 14 is configured to control the voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtain the failed recognition samples in the voice recognition result according to the voice sample characters;
and the algorithm evaluation module 15 is configured to evaluate and classify the failure sample according to the volume ratio database, the time domain classifier and the frequency domain classifier to obtain an algorithm evaluation result.
Wherein the algorithm evaluation module 15 is further configured to: controlling the volume ratio database to classify the volume of the background failure samples in the failure samples, and calculating the failure number corresponding to each volume segment range according to the volume classification result;
controlling the time domain classifier to perform time domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each time domain scene according to the time domain classification result;
and controlling the frequency domain classifier to perform frequency domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each frequency domain scene according to the frequency domain classification result.
Preferably, the algorithm evaluation module 15 is further configured to: when the failure number corresponding to any one of the volume segment ranges is judged to be larger than a first preset number, judging that the speech recognition of the speech recognition algorithm to be tested in the volume segment range at the background is unqualified;
when the failure number corresponding to any time domain scene is judged to be larger than a second preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the time domain scene;
and when the failure number corresponding to any one frequency domain scene is judged to be larger than a third preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the frequency domain scene.
The method and the device can effectively evaluate the voice recognition algorithm from three background perspectives, namely time domain, volume ratio and frequency domain, so as to evaluate the recognition effect of the voice recognition algorithm in different application scenes. This effectively improves the accuracy with which a voice recognition algorithm is selected for different application scenes, and thereby improves the accuracy of subsequent voice recognition.
Example four
Referring to fig. 4, a mobile terminal 101 according to a fourth embodiment of the present invention includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal 101 execute the above-mentioned speech recognition algorithm evaluation method.
The present embodiment also provides a storage medium on which a computer program used in the above-mentioned mobile terminal 101 is stored, which when executed, includes the steps of:
acquiring voice sample data, wherein the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
acquiring time domain characteristics of the background voice sample, and training a time domain classifier according to the time domain characteristics;
calculating a volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio and the corresponding voice sample information to obtain a volume ratio database;
acquiring the frequency domain characteristics of the background voice sample, and training a frequency domain classifier according to the frequency domain characteristics;
controlling a voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtaining the failed recognition samples in the voice recognition result according to the voice sample characters;
and evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier to obtain an algorithm evaluation result. The storage medium may be, for example, a ROM/RAM, a magnetic disk, an optical disk, or the like.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is used as an example, in practical applications, the above-mentioned function distribution may be performed by different functional units or modules according to needs, that is, the internal structure of the storage device is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application.
Those skilled in the art will appreciate that the configuration shown in fig. 3 does not constitute a limitation of the speech recognition algorithm evaluation system of the present invention and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components, and that the speech recognition algorithm evaluation method of fig. 1-2 may be implemented using more or fewer components than those shown in fig. 3, or some components in combination, or a different arrangement of components. The units, modules, etc. referred to herein are a series of computer programs that can be executed by a processor (not shown) in the target speech recognition algorithm evaluation system and that can perform specific functions, all of which can be stored in a storage device (not shown) of the target speech recognition algorithm evaluation system.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A method for evaluating a speech recognition algorithm, the method comprising:
acquiring voice sample data, wherein the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
acquiring time domain characteristics of the background voice sample, and training a time domain classifier according to the time domain characteristics;
calculating a volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio and the corresponding voice sample information to obtain a volume ratio database;
acquiring the frequency domain characteristics of the background voice sample, and training a frequency domain classifier according to the frequency domain characteristics;
controlling a voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtaining the failed recognition samples in the voice recognition result according to the voice sample characters;
and evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier to obtain an algorithm evaluation result.
2. The method of claim 1, wherein the step of obtaining the time domain features of the background speech sample and training a time domain classifier based on the time domain features comprises:
acquiring short-time energy characteristics, a short-time average zero-crossing rate, a short-time average amplitude difference and a zero-crossing rate ratio in the background voice sample;
intercepting the background voice sample by a short-time window signal according to a preset sampling point to obtain a sampling signal, and calculating an autocorrelation value of the sampling signal according to a preset autocorrelation function;
and constructing the time domain classifier, and training the time domain classifier according to the short-time energy characteristic, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value.
3. The method of claim 1, wherein the steps of obtaining frequency domain features of the background speech sample and training a frequency domain classifier based on the frequency domain features comprises:
acquiring the spectral centroid, the sub-band energy, the bandwidth, the fundamental tone frequency, the wavelet entropy and the spectral flux of the background voice sample;
and constructing the frequency domain classifier, and training the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the fundamental tone frequency, the wavelet entropy and the spectral flux.
4. The speech recognition algorithm evaluation method of claim 1 wherein the step of evaluating and classifying the failed samples according to the volume ratio database, the time domain classifier, and the frequency domain classifier comprises:
controlling the volume ratio database to classify the volume of the background failure samples in the failure samples, and calculating the failure number corresponding to each volume segment range according to the volume classification result;
controlling the time domain classifier to perform time domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each time domain scene according to the time domain classification result;
and controlling the frequency domain classifier to perform frequency domain classification on the background failure samples in the failure samples, and calculating the failure number corresponding to each frequency domain scene according to the frequency domain classification result.
5. The speech recognition algorithm evaluation method of claim 4 wherein the step of evaluating and classifying the failed samples according to the volume ratio database, the time domain classifier, and the frequency domain classifier further comprises:
when the failure number corresponding to any one of the volume segment ranges is judged to be larger than a first preset number, judging that the speech recognition of the speech recognition algorithm to be tested in the volume segment range at the background is unqualified;
when the failure number corresponding to any time domain scene is judged to be larger than a second preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the time domain scene;
and when the failure number corresponding to any one frequency domain scene is judged to be larger than a third preset number, judging that the speech recognition of the speech recognition algorithm to be tested is unqualified when the background is in the frequency domain scene.
6. The speech recognition algorithm evaluation method of claim 3, wherein after the step of constructing the frequency domain classifier, the method further comprises:
acquiring the Mel-frequency cepstral coefficients, the linear prediction cepstral coefficients, the line spectral pair coefficients, the spectral crest factor and the spectral flatness of the background voice sample;
training the frequency domain classifier according to the spectral centroid, the sub-band energy, the bandwidth, the pitch frequency, the wavelet entropy, the spectral flux, the Mel-frequency cepstral coefficients, the linear prediction cepstral coefficients, the line spectral pair coefficients, the spectral crest factor, and the spectral flatness.
7. A speech recognition algorithm evaluation system, the system comprising:
the system comprises a sample acquisition module, a voice processing module and a voice processing module, wherein the sample acquisition module is used for acquiring voice sample data, the voice sample data comprises at least one piece of voice sample information, and the voice sample information comprises a background voice sample, an effective voice sample and voice sample characters;
the time domain classifier training module is used for acquiring the time domain characteristics of the background voice sample and training a time domain classifier according to the time domain characteristics;
the energy ratio calculation module is used for calculating the volume ratio between the background voice sample and the effective voice sample in the voice sample information, and storing the volume ratio and the corresponding voice sample information to obtain a volume ratio database;
the frequency domain classifier training module is used for acquiring the frequency domain characteristics of the background voice sample and training a frequency domain classifier according to the frequency domain characteristics;
the voice recognition module is used for controlling a voice recognition algorithm to be tested to test the voice sample information to obtain a voice recognition result, and obtaining the failed recognition samples in the voice recognition result according to the voice sample characters;
and the algorithm evaluation module is used for evaluating and classifying the failure samples according to the volume ratio database, the time domain classifier and the frequency domain classifier so as to obtain an algorithm evaluation result.
8. The speech recognition algorithm evaluation system of claim 7, wherein the time domain classifier training module is further configured to:
acquiring short-time energy characteristics, a short-time average zero-crossing rate, a short-time average amplitude difference and a zero-crossing rate ratio in the background voice sample;
intercepting the background voice sample by a short-time window signal according to a preset sampling point to obtain a sampling signal, and calculating an autocorrelation value of the sampling signal according to a preset autocorrelation function;
and constructing the time domain classifier, and training the time domain classifier according to the short-time energy characteristic, the short-time average zero-crossing rate, the short-time average amplitude difference, the zero-crossing rate ratio and the autocorrelation value.
9. A mobile terminal, characterized in that it comprises a storage device for storing a computer program and a processor running the computer program to make the mobile terminal execute the speech recognition algorithm evaluation method according to any one of claims 1 to 6.
10. A storage medium, characterized in that it stores a computer program for use in a mobile terminal according to claim 9, which computer program, when being executed by a processor, carries out the steps of the speech recognition algorithm evaluation method according to any one of claims 1 to 6.
CN202010257506.6A 2020-04-03 2020-04-03 Speech recognition algorithm evaluation method, system, mobile terminal and storage medium Active CN111599345B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010257506.6A CN111599345B (en) 2020-04-03 2020-04-03 Speech recognition algorithm evaluation method, system, mobile terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010257506.6A CN111599345B (en) 2020-04-03 2020-04-03 Speech recognition algorithm evaluation method, system, mobile terminal and storage medium

Publications (2)

Publication Number Publication Date
CN111599345A true CN111599345A (en) 2020-08-28
CN111599345B CN111599345B (en) 2023-02-10

Family

ID=72191979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010257506.6A Active CN111599345B (en) 2020-04-03 2020-04-03 Speech recognition algorithm evaluation method, system, mobile terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111599345B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112017639A (en) * 2020-09-10 2020-12-01 歌尔科技有限公司 Voice signal detection method, terminal device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345923A (en) * 2013-07-26 2013-10-09 电子科技大学 Sparse representation based short-voice speaker recognition method
US20180033454A1 (en) * 2016-07-27 2018-02-01 Vocollect, Inc. Distinguishing user speech from background speech in speech-dense environments
US20180158470A1 (en) * 2015-06-26 2018-06-07 Zte Corporation Voice Activity Modification Frame Acquiring Method, and Voice Activity Detection Method and Apparatus
CN109036458A (en) * 2018-08-22 2018-12-18 昆明理工大学 A kind of multilingual scene analysis method based on audio frequency characteristics parameter
CN109192196A (en) * 2018-08-22 2019-01-11 昆明理工大学 A kind of audio frequency characteristics selection method of the SVM classifier of anti-noise
CN110335611A (en) * 2019-07-15 2019-10-15 易诚高科(大连)科技有限公司 A kind of voiceprint recognition algorithm appraisal procedure based on quality dimensions
CN110610709A (en) * 2019-09-26 2019-12-24 浙江百应科技有限公司 Identity distinguishing method based on voiceprint recognition

Also Published As

Publication number Publication date
CN111599345B (en) 2023-02-10

Similar Documents

Publication Publication Date Title
Alim et al. Some commonly used speech feature extraction algorithms
Chen et al. Semi-automatic classification of bird vocalizations using spectral peak tracks
US8428945B2 (en) Acoustic signal classification system
CN110880329B (en) Audio identification method and equipment and storage medium
CN109256138B (en) Identity verification method, terminal device and computer readable storage medium
CN101292283B (en) Voice judging system, and voice judging method
CN109979486B (en) Voice quality assessment method and device
CN103559892A (en) Method and system for evaluating spoken language
CN105308679A (en) Method and system for identifying location associated with voice command to control home appliance
KR100770895B1 (en) Speech signal classification system and method thereof
CN103366759A (en) Speech data evaluation method and speech data evaluation device
CN111724770A (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN111696580A (en) Voice detection method and device, electronic equipment and storage medium
Li et al. A comparative study on physical and perceptual features for deepfake audio detection
Wiśniewski et al. Automatic detection of disorders in a continuous speech with the hidden Markov models approach
AU2021101586A4 (en) A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model
CN111599345B (en) Speech recognition algorithm evaluation method, system, mobile terminal and storage medium
Wei et al. RMVPE: A robust model for vocal pitch estimation in polyphonic music
Li et al. A pitch estimation algorithm for speech in complex noise environments based on the radon transform
Prathosh et al. Cumulative impulse strength for epoch extraction
CN114302301B (en) Frequency response correction method and related product
CN110675858A (en) Terminal control method and device based on emotion recognition
Slaney et al. Pitch-gesture modeling using subband autocorrelation change detection.
CN114724589A (en) Voice quality inspection method and device, electronic equipment and storage medium
CN111091816B (en) Data processing system and method based on voice evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant