WO2023050301A1 - Speech quality evaluation method and apparatus, speech recognition quality prediction method and apparatus, and speech recognition quality improvement method and apparatus - Google Patents

Speech quality evaluation method and apparatus, speech recognition quality prediction method and apparatus, and speech recognition quality improvement method and apparatus

Info

Publication number
WO2023050301A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
quality
test
voice
speech recognition
Prior art date
Application number
PCT/CN2021/122149
Other languages
English (en)
Chinese (zh)
Inventor
周宇
聂为然
向腾
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2021/122149 (published as WO2023050301A1)
Priority to CN202180008040.9A (published as CN116210050A)
Publication of WO2023050301A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques specially adapted for measuring the quality of voice signals

Definitions

  • the present application relates to the technical field of speech recognition, in particular to a method and device for evaluating speech quality, a method and device for predicting speech recognition quality, a method and device for improving speech recognition quality, a vehicle, a computer-readable storage medium, and a computer program product.
  • in a voice recognition system, the voice signal is collected by a sensor, enhanced by the voice pre-processing module, and then sent to the voice recognition module for voice wake-up and recognition. The recognition effect of the voice recognition system therefore mainly depends on two factors: the performance of the voice recognition module and the quality of the voice signal.
  • the quality of the voice signal is subject to the environment, audio collection hardware and the algorithm of the voice pre-processing module.
  • the industry generally performs joint end-to-end tuning of the speech recognition module and the speech pre-processing module, testing the overall speech recognition quality for performance tuning.
  • existing standards cannot evaluate the quality of the voice signal output by the voice pre-processing module, nor provide tuning baselines and feedback for the acquisition hardware, the voice pre-processing module, and the voice recognition module.
  • this solution is not conducive to independently locating and solving issues related to voice signal quality and voice recognition module performance in an actual voice recognition business, and makes it difficult to locate the cause of voice recognition system failures. Therefore, decoupling the voice recognition module from the voice pre-processing module to obtain voice quality calibration and feedback is of great significance for improving the efficiency of module fault diagnosis.
  • the present application provides a speech quality assessment method and device, a speech recognition quality prediction method and device, a speech recognition quality improving method and device, a vehicle, a computer readable storage medium and a computer program product.
  • the first aspect of the present application provides a voice quality assessment method, including: acquiring a test voice; evaluating the voice quality of the test voice according to the semantic related information of the test voice; and outputting the voice quality assessment result.
  • because the quality assessment is based on information related to the semantics of the test voice, semantic information is reflected in the assessed voice quality, so the assessed voice quality can be used to predict the speech recognition quality of the speech recognition model on the test speech.
  • the output evaluation result may include content related to the evaluated voice quality, which the user can refer to in order to improve the voice quality.
  • when applied to a vehicle, it can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain or improve the in-vehicle voice interaction experience.
  • the vehicle can also perform corresponding operations based on the output evaluation results to improve the in-vehicle voice interaction experience, such as automatically closing the car windows, automatically reducing the sound in the car (such as the sound of music playing in the car), or automatically performing parameter tuning.
  • the evaluation result of the voice quality includes one or more of the following: quality of the test voice; factors affecting the quality of the test voice; and a manner of adjusting the quality of the test voice.
  • the output content can be flexibly set.
  • the quality of the test speech can be output as quantified parameters, as qualitative grades such as excellent, good, or fair, or as a combination of images and text.
  • the factors that affect the test voice quality can be, for example, noise from outside the car caused by an open window, or tire noise or engine noise caused by high vehicle speed; these are output for the user's reference. Ways of adjusting the test voice quality include closing the car windows, reducing the speed of the car, reducing the sound in the car, replacing a problematic microphone, and tuning the parameters of the pre-processing module.
  • evaluating the speech quality of the test speech according to the semantics-related information of the test speech includes: obtaining a first feature vector of the test speech, where the first feature vector includes a time-frequency feature vector of the test speech; obtaining a second feature vector of the test speech according to the first feature vector, where the second feature vector is related to the semantics of the test speech; and evaluating the speech quality of the test speech according to the second feature vector.
  • the first feature vector including the time-frequency feature vector of the speech can be obtained first, and then the second feature vector related to the semantics of the speech can be obtained based on this, so as to evaluate the speech quality of the test speech according to the second feature vector.
  • the first feature vector is related to the time-frequency features, which are affected by the pre-processing parameters, so the pre-processing parameters are also related to the second feature vector; therefore, the speech quality of the test speech evaluated from the second feature vector can be used as a reference for tuning the pre-processing parameters.
  • evaluating the speech quality of the test speech according to the second feature vector includes: using a first evaluation index to evaluate the speech quality of the test speech, where the first evaluation index includes a concentration index of the center positions of the groups of test speech; different groups of test speech have different semantics, the center position of each group of test speech is the center position of that group's second feature vectors in the second feature space, and the second feature space is the space where the second feature vectors are located.
  • the speech quality of the test speech can be evaluated according to the above optional evaluation index, and the evaluation index is calculated based on the second feature vector.
  • evaluating the speech quality of the test speech according to the second feature vector includes: using a second evaluation index to evaluate the speech quality of the test speech, where the second evaluation index includes an index of the degree of dispersion of each group's second feature vectors in the second feature space; different groups of test speech have different semantics, and the second feature space is the space where the second feature vectors are located.
  • the speech quality of the test speech can be evaluated according to the above optional evaluation index, and the evaluation index is calculated based on the second feature vector.
  • evaluating the speech quality of the test speech according to the second feature vector includes: using a third evaluation index to evaluate the speech quality of the test speech, where the third evaluation index includes a similarity index between the center position of each group of test speech and the center position of the semantically corresponding group of reference speech; different groups of test speech have different semantics, the center position of each group of test speech is the center position of that group's second feature vectors in the second feature space, the second feature space is the space where the second feature vectors are located, different groups of reference speech have different semantics, and the center position of each group of reference speech is the center position of that group's second feature vectors in the second feature space.
  • the speech quality of the test speech can be evaluated according to the above optional evaluation index, and the evaluation index is calculated based on the second feature vector.
  • obtaining the first feature vector of the test speech includes: obtaining the continuous frames contained in the test speech, where adjacent frames contain overlapping information; and obtaining, for each continuous frame, a feature vector including frequency-domain features, the plurality of feature vectors constituting the first feature vector.
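  • as a minimal sketch of this step (the frame length, hop size, and FFT-magnitude feature below are illustrative assumptions, not values fixed by the application), overlapping frames can be extracted and one frequency-domain feature vector stacked per frame:

```python
import numpy as np

def first_feature_vector(signal: np.ndarray, frame_len: int = 400,
                         hop: int = 160, n_bins: int = 64) -> np.ndarray:
    """Split speech into overlapping frames (hop < frame_len, so adjacent
    frames share samples) and stack one frequency-domain feature per frame."""
    assert len(signal) >= frame_len, "signal must cover at least one frame"
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    feats = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        feats.append(np.abs(np.fft.rfft(frame))[:n_bins])  # frequency-domain feature
    return np.stack(feats)  # (n_frames, n_bins): the "first feature vector"
```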
  • the second aspect of the present application provides a speech recognition quality prediction method, including: obtaining a test speech; predicting the speech recognition quality of the speech recognition model for the test speech according to a speech recognition quality function, where the speech recognition quality function is used to indicate the relationship between the speech recognition quality and the speech quality, and the speech quality is evaluated according to the method of the first aspect or any possible implementation manner of the first aspect; and outputting a prediction result of the speech recognition quality.
  • the output prediction result of the speech recognition quality includes one or more of the following: the speech recognition quality of the speech recognition model; factors affecting the speech recognition quality of the speech recognition model; and a manner of adjusting the speech recognition quality of the speech recognition model.
  • the content of the predicted result can be flexibly set.
  • the quality of a speech recognition model can be provided as quantified parameters, as qualitative grades such as excellent, good, or fair, or as a combination of images and text.
  • the output factors affecting the speech recognition quality of the speech recognition model may be factors in the method provided in the first aspect, or may be performance factors of the speech recognition model of the speech recognition module.
  • the output manner of adjusting the test voice quality may be the adjustment manner provided in the method of the first aspect, or it may be a tuning prompt for the parameters of the speech recognition model.
  • the construction process of the speech recognition quality function includes: obtaining multiple sets of degraded reference speech from the reference speech; performing speech recognition on the multiple sets of degraded reference speech with the speech recognition model and collecting statistics on the recognition quality to obtain a first statistical result; using the multiple sets of degraded reference speech as test speech and obtaining their speech quality evaluation results according to the method of the first aspect or any possible implementation manner of the first aspect, to obtain a first evaluation result; and obtaining the speech recognition quality function according to the functional relationship between the first statistical result and the first evaluation result.
  • the above is a way of constructing the speech recognition quality function through reference speech. Specifically, the method degrades the reference speech rather than introducing other reference speech, which can effectively reduce the data volume of the reference speech.
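  • the application does not fix the functional form; as a hedged sketch, the relationship between the first evaluation result (speech quality per degraded set) and the first statistical result (recognition quality per degraded set) could be fitted with a low-order polynomial. All values below are illustrative, not measured data:

```python
import numpy as np

# Illustrative data: one entry per set of degraded reference speech.
quality_scores = np.array([0.92, 0.81, 0.67, 0.55, 0.38])  # first evaluation result
recog_accuracy = np.array([0.98, 0.95, 0.88, 0.74, 0.51])  # first statistical result

# Fit the speech recognition quality function f, so that
# predicted recognition quality = f(evaluated speech quality).
quality_fn = np.poly1d(np.polyfit(quality_scores, recog_accuracy, deg=2))

# Predict the recognition quality of a new test speech from its evaluated quality.
predicted = quality_fn(0.70)
```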
  • the third aspect of the present application provides a method for improving the quality of speech recognition, including: obtaining a test speech; obtaining a speech quality evaluation result of the test speech according to the method of the first aspect or any possible implementation manner of the first aspect; and outputting the speech quality evaluation result when the speech quality is lower than a preset first baseline.
  • the speech quality evaluation result of the test speech can be obtained, and content related to the speech quality can be output according to it, where the content can include whether to adjust the pre-processing parameters.
  • for example, when applied to a vehicle, it can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain the in-vehicle voice interaction experience.
  • This process can be independent of the speech recognition process of the speech recognition model, realizing the decoupling of pre-processing and speech recognition.
  • the third aspect also includes: when the voice quality is greater than or equal to the first baseline, predicting the speech recognition quality according to the second aspect or any possible implementation manner of the second aspect; and when the speech recognition quality is lower than a preset second baseline, outputting a prediction result of the speech recognition quality.
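  • a minimal sketch of this two-baseline flow (the function names and returned strings are illustrative, not part of the claims):

```python
from typing import Callable

def diagnose(speech_quality: float, first_baseline: float,
             quality_fn: Callable[[float], float], second_baseline: float) -> str:
    """First check the evaluated speech quality; only if it passes, predict the
    speech recognition quality and check it against the second baseline."""
    if speech_quality < first_baseline:
        return "output speech quality evaluation result (pre-processing side issue)"
    if quality_fn(speech_quality) < second_baseline:
        return "output speech recognition quality prediction (recognition model issue)"
    return "speech quality and predicted recognition quality both pass"
```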
  • the fourth aspect of the present application provides a voice quality assessment device, including:
  • an obtaining module, used to obtain the test speech; an evaluation module, used to evaluate the speech quality of the test speech according to the semantics-related information of the test speech; and an output module, used to output the speech quality evaluation result.
  • because the quality assessment is based on information related to the semantics of the test voice, semantic information is reflected in the assessed voice quality, so the assessed voice quality can be used to predict the speech recognition quality of the speech recognition model on the test speech.
  • the output evaluation result may include content related to the evaluated voice quality, which the user can refer to in order to improve the voice quality. For example, when applied to a vehicle, it can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain or improve the in-vehicle voice interaction experience.
  • the vehicle can also perform corresponding operations based on the output evaluation results to improve the in-vehicle voice interaction experience, such as automatically closing the windows, automatically reducing the sound in the vehicle (such as the sound of music playing in the vehicle), or automatically performing parameter tuning.
  • the evaluation result of the voice quality output by the output module includes one or more of the following: quality of the test voice; factors affecting the quality of the test voice; and a way of adjusting the quality of the test voice.
  • the evaluation module is specifically configured to: obtain a first feature vector of the test speech, where the first feature vector includes a time-frequency feature vector of the test speech; obtain a second feature vector of the test speech according to the first feature vector, where the second feature vector is related to the semantics of the test speech; and evaluate the speech quality of the test speech according to the second feature vector.
  • when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it includes: using the first evaluation index to evaluate the speech quality of the test speech, where the first evaluation index includes the concentration index of the center positions of the groups of test speech; different groups of test speech have different semantics, the center position of each group of test speech is the center position of that group's second feature vectors in the second feature space, and the second feature space is the space where the second feature vectors are located.
  • when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it includes: using the second evaluation index to evaluate the speech quality of the test speech, where the second evaluation index includes an index of the degree of dispersion of each group's second feature vectors in the second feature space; different groups of test speech have different semantics, and the second feature space is the space where the second feature vectors are located.
  • when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it includes: using a third evaluation index to evaluate the speech quality of the test speech, where the third evaluation index includes a similarity index between the center position of each group of test speech and the center position of the semantically corresponding group of reference speech; different groups of test speech have different semantics, the center position of each group of test speech is the center position of that group's second feature vectors in the second feature space, the second feature space is the space where the second feature vectors are located, different groups of reference speech have different semantics, and the center position of each group of reference speech is the center position of that group's second feature vectors in the second feature space.
  • when the evaluation module obtains the first feature vector of the test speech, it includes: obtaining the continuous frames contained in the test speech, where adjacent frames contain overlapping information; and obtaining, for each continuous frame, a feature vector including frequency-domain features, the plurality of feature vectors constituting the first feature vector.
  • the fifth aspect of the present application provides a speech recognition quality prediction device, including: an acquisition module, used to obtain a test speech; a prediction module, used to predict the speech recognition quality of the speech recognition model for the test speech according to the speech recognition quality function, where the speech recognition quality function is used to indicate the relationship between the speech recognition quality and the speech quality, and the speech quality is evaluated according to the method of the first aspect or any possible implementation manner of the first aspect; and an output module, used to output the prediction result of the speech recognition quality.
  • the prediction result of the speech recognition quality output by the output module includes one or more of the following: the speech recognition quality of the speech recognition model; factors affecting the speech recognition quality of the speech recognition model; and a manner of adjusting the speech recognition quality of the speech recognition model.
  • the construction process of the speech recognition quality function includes: obtaining multiple sets of degraded reference speech from the reference speech; performing speech recognition on the multiple sets of degraded reference speech with the speech recognition model and collecting statistics on the recognition quality to obtain a first statistical result; using the multiple sets of degraded reference speech as test speech and obtaining their speech quality evaluation results according to the method of the first aspect or any possible implementation manner of the first aspect, to obtain a first evaluation result; and obtaining the speech recognition quality function according to the functional relationship between the first statistical result and the first evaluation result.
  • the sixth aspect of the present application provides a device for improving the quality of speech recognition, including: an acquisition module, used to acquire the test speech; an evaluation and prediction module, used to obtain the speech quality evaluation result of the test speech according to the method of the first aspect or any possible implementation manner of the first aspect; and an output module, configured to output the speech quality evaluation result when the speech quality is lower than a preset first baseline.
  • the speech quality evaluation result of the test speech can be obtained, and the content related to the speech quality can be output according to the speech quality evaluation result of the test speech, wherein the output content can include whether to adjust the pre-processing parameters.
  • for example, when applied to a vehicle, it can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain the in-vehicle voice interaction experience.
  • This process can be independent of the speech recognition process of the speech recognition model, realizing the decoupling of pre-processing and speech recognition.
  • the evaluation and prediction module is also used to predict the speech recognition quality according to the method of the second aspect or any possible implementation manner of the second aspect when the voice quality is greater than or equal to the first baseline; the output module is also used to output the prediction result of the speech recognition quality when the speech recognition quality is lower than the preset second baseline.
  • the seventh aspect of the present application provides a vehicle, including: a voice collection device for collecting the user's voice command; a pre-processing device for pre-processing the sound of the voice command; a voice recognition device for recognizing the pre-processed sound; and the device of any one of the fourth, fifth, and sixth aspects and their possible implementation manners.
  • the eighth aspect of the present application provides a computing device, including one or more processors and one or more memories, where the memories store program instructions that, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect and any possible implementation manner thereof.
  • the ninth aspect of the present application provides a computer-readable storage medium on which program instructions are stored; when the program instructions are executed by a computer, the computer implements the method of the first aspect and any possible implementation manner thereof.
  • the tenth aspect of the present application provides a computer program product, which includes program instructions, and when the program instructions are executed by a computer, the computer implements the method of the first aspect and any possible implementation manner thereof.
  • the embodiment of the present application decouples the evaluation of the speech pre-processing process from the prediction of speech recognition by the speech recognition model, so that pre-processing problems and speech recognition problems can be located separately, which is beneficial to independently locating each module's problems and to performance tuning.
  • the evaluation of the test voice quality and the prediction of the speech recognition quality can also be used to prompt the user to take corresponding operations to improve the in-vehicle voice interaction experience.
  • FIG. 1 is a schematic structural diagram of an application scenario 1 involved in an embodiment of the present application
  • FIG. 2A is a schematic flow chart of a voice quality assessment method provided in an embodiment of the present application.
  • FIG. 2B is a schematic flow chart of a voice quality assessment method for test voice provided in an embodiment of the present application
  • FIG. 3A is a schematic flowchart of a method for predicting speech recognition quality provided by an embodiment of the present application
  • FIG. 3B is a schematic flow chart of a method for constructing a speech recognition quality function provided by an embodiment of the present application
  • FIG. 4 is a schematic flowchart of a method for improving speech recognition quality provided by an embodiment of the present application
  • FIG. 5 is a schematic flow chart of a specific implementation of the method for improving the quality of speech recognition provided by the embodiment of the present application;
  • FIG. 6A is a schematic diagram of a speech quality assessment device provided in an embodiment of the present application.
  • FIG. 6B is a schematic diagram of a speech recognition quality prediction device provided in an embodiment of the present application.
  • FIG. 6C is a schematic diagram of a device for improving speech recognition quality provided by an embodiment of the present application.
  • FIG. 7A is a schematic structural diagram of a vehicle provided in an embodiment of the present application.
  • FIG. 7B is a schematic diagram of the cockpit of the vehicle provided by the embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computing device provided by an embodiment of the present application.
  • the voice quality assessment solution provided by the embodiments of the present application includes a voice quality assessment method and device, a voice recognition quality prediction method and device, a method and device for improving voice recognition quality, a computer-readable storage medium, and a computer program product. Since these technical solutions solve problems on the same or similar principles, some repeated content may be omitted in the introduction of the following specific embodiments, but the specific embodiments should be considered to reference each other and to be combinable with each other.
  • when evaluating the voice quality under test, the Mean Opinion Score (MOS) index can be used for evaluation. This index is also known as the subjective voice quality index.
  • multiple index levels can be used to evaluate the quality of the tested voice. The quality of the tested voice is obtained by averaging the scores of all test listeners.
  • differences in hearing ability and subjective listening experience between listeners cause differences in scores; especially when only a single sentence is provided without context, the scores of the test listeners differ significantly. This leads to low objectivity of the evaluation result of the tested voice quality, and the evaluation method adapts poorly.
  • when evaluating the voice quality under test, objective evaluation indicators can also be used. Objective evaluation indicators include Signal-to-Noise Ratio (SNR), Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Analysis (POLQA), Short-Time Objective Intelligibility (STOI), etc.
  • the PESQ algorithm requires a noisy attenuated signal and an original reference signal. After level adjustment, input filtering, time alignment and compensation, and auditory transformation of the two speech signals to be compared, the parameters of the two signals are extracted respectively and their time-frequency characteristics are synthesized to obtain the PESQ score; finally, this score is mapped to the subjective Mean Opinion Score (MOS).
  • POLQA is the successor to PESQ, extended to handle higher bandwidth audio signals.
  • STOI is one of the important indicators for measuring speech intelligibility; it is used to evaluate the intelligibility of noisy speech that has been masked in the time domain or short-time Fourier transformed and weighted in the frequency domain. STOI is scored by comparing the clean speech with the speech to be evaluated.
  • methods that evaluate the voice quality under test using the above objective evaluation indicators focus, from an acoustic point of view, on the correlation between sound characteristics and subjective hearing perception, but their relationship with the performance of machine-oriented speech recognition (that is, recognizing speech as text or semantics) is unclear, so they are difficult to use as an effective reference for the speech recognition module during development and tuning of a speech recognition system.
  • the embodiment of the present application provides an improved speech quality evaluation scheme, in which the semantics-related speech quality of the test speech can be determined based on the time-frequency features of the test speech; the speech recognition quality can further be predicted based on this semantics-related speech quality; and by comparing the semantics-related speech quality and the predicted speech recognition quality with baselines, the problems affecting the speech recognition quality can be determined, so that the recognition quality can be improved by locating or solving those problems.
  • the speech quality evaluation solution provided by the embodiment of the present application can be applied to application fields such as quality detection and evaluation in the speech recognition process.
  • when applied in the smart cockpit of a vehicle, it can be used to determine the quality of the currently received voice in the cockpit, or the speech recognition quality, and then give corresponding prompts or perform corresponding actions that can improve the voice quality in the cockpit, such as reducing the music volume, closing windows, or tuning parameters.
  • when applied to a smart terminal, the voice quality or voice recognition quality of the terminal's current environment can be evaluated, and corresponding prompts can then be given, such as prompting whether the microphone is blocked, or prompting to activate the camera so that lip recognition can be combined with speech recognition to improve recognition accuracy; alternatively, permitted actions can be performed directly (for example, if the corresponding application has permission to call the camera, the camera can be started to combine lip recognition with speech recognition).
  • when applied to a testing terminal used for quality testing, the testing terminal can be used to test the voice recognition function of a product under test, such as testing the speech recognition quality of a vehicle, so as to tune the vehicle's speech-recognition-related parameters.
  • the vehicle, smart terminal, and product under test usually have a microphone and a processor.
  • the microphone is used to collect the user's voice; the processor can be used to pre-process the collected voice, and perform voice recognition on the pre-processed voice to recognize it as text, and further recognize instructions based on the recognized text.
  • when applied to a vehicle or an intelligent terminal, the vehicle or intelligent terminal may also have a man-machine interface, which is used to provide the above corresponding prompts to the user through display or sound.
  • the processor may also perform parameter tuning of pre-processing or speech recognition according to user operations through a human-machine interface (Human Machine Interface, HMI).
  • the processor of the above-mentioned vehicle may be an electronic device; specifically, it may be a processor of a vehicle-mounted processing device such as a head unit or a vehicle-mounted computer, or a conventional chip processor such as a central processing unit (CPU) or a microcontroller unit (MCU).
  • when applied to a testing terminal for quality testing, the testing terminal may have a man-machine interface to provide the above corresponding prompts to the user through display or sound.
  • the speech quality assessment solution provided by the embodiments of the present application may be embedded in the above-mentioned vehicles, smart terminals, and products under test, and exist as a functional module thereof.
  • when the voice quality evaluation solution provided by the embodiment of the present application is applied to an independent quality detection terminal, the detection terminal can communicate with the device under test by wire or wirelessly to obtain the required data (for example, pre-processed voice data), based on which the voice quality can be tested or the speech recognition quality predicted, and a prompt of the test result can be given.
  • the detection terminal can also feed back the test results to the device under test, and the device under test performs operations such as parameter tuning according to the test results.
  • Fig. 1 shows a scenario in which the voice quality assessment solution provided by the embodiment of the present application is applied to a vehicle, which can be used for voice quality assessment, and can also be used for quality assessment of voice recognition.
  • the cockpit of the vehicle includes: a sound collection module 110 , a pre-processing module 120 , a speech recognition module 130 , an evaluation and prediction module 140 , and an output module 150 .
  • the pre-processing module 120, the speech recognition module 130 and the evaluation and prediction module 140 may be implemented by the same processor of the vehicle, or may be implemented by three or more processors respectively.
  • the sound collection module 110 can be a microphone, which can be used to collect test voices broadcast by users, wherein a test voice corresponds to a reference voice, and both correspond to the same sentence content, such as corresponding to the same voice command, and have the same semantics. In the stage of establishing each model for speech quality assessment, the sound collection module 110 is also used to collect reference speech.
  • the reference speech refers to the speech used for training the speech recognition module 130 or training the semantic-related feature model described later.
  • the semantic-related feature model is used for speech quality assessment, which will be further described later.
  • the pre-processing module 120 is used to perform pre-processing such as pre-emphasis, framing or windowing on the collected sound, so that the user's test voice contained in the sound can be recognized more easily.
  • pre-emphasis includes emphasizing the high-frequency part of the voice, removing the influence of lip radiation, and increasing the high-frequency resolution of the voice; framing uses the short-time stationarity of the voice signal to divide it into individual voice frames for processing, with overlap between adjacent frames so that the frames are continuous; windowing strengthens the voice waveform near each frame's samples and weakens the rest of the waveform, so as to smooth the voice.
  • the adjustable pre-processing parameters include one or more of the following: the high-frequency band targeted by the pre-emphasis processing and the degree of emphasis, the frame length and overlap length in the framing processing, and the degree of strengthening and weakening of parts of the waveform in the windowing processing.
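  • as a sketch of how these tunable parameters enter the pre-processing chain (the coefficient, frame length, and overlap below are common defaults assumed for illustration):

```python
import numpy as np

def preprocess(signal: np.ndarray, pre_emphasis: float = 0.97,
               frame_len: int = 400, overlap: int = 160) -> np.ndarray:
    """Pre-emphasis, framing with overlapping frames, and windowing; each
    keyword argument is one of the adjustable parameters named above."""
    # Pre-emphasis: first-order high-pass to strengthen the high-frequency part.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    hop = frame_len - overlap  # larger overlap -> smaller hop between frames
    assert len(emphasized) >= frame_len, "signal must cover at least one frame"
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)  # strengthens the center, weakens the edges
    return np.stack([emphasized[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])
```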
  • a speech recognition (Automatic Speech Recognition, ASR) module 130 is used for recognizing the sentence content of the pre-processed test speech. Vocabulary in the test speech can be recognized by the speech recognition module and converted into computer-readable character sequences. After the speech recognition content is obtained, control commands can be further recognized based on the content and executed by the vehicle's actuators.
  • the recognition process from speech recognition content to control instructions can identify control instructions based on keyword matching, and can also identify corresponding control instructions based on neural network semantic recognition technology.
  • adjusting the parameters of the speech recognition module refers to adjusting the parameters of the speech recognition model of the speech recognition module, for example, adjusting the parameters and hyperparameters of the neural network of the speech recognition model.
  • the assessment and prediction module 140 is used to implement voice quality assessment, and can generate a semantically related voice quality assessment result for the test voice.
  • the speech recognition quality of the speech recognition module for the test speech can also be predicted according to the speech quality evaluation result.
  • the evaluation and prediction module is also described as including an evaluation module and a prediction module, respectively implementing the above speech quality evaluation and speech recognition quality prediction.
  • the output module 150 is used for outputting information such as speech quality assessment results or prediction results of speech recognition quality.
  • the output content can be provided to the vehicle controller, so that the vehicle can perform corresponding operations accordingly.
  • the outputted content can also be provided to the user through the man-machine interface of the vehicle.
  • the outputted information includes the quality of the test voice, the speech recognition quality of the speech recognition model, factors affecting the quality of the test voice, manners of adjusting the test voice quality, etc.
  • the human-machine interface may include a display screen in the vehicle cockpit (such as a liquid crystal display screen or a head-up display (Head Up Display, HUD)) and a speaker, to prompt the user through images or sounds.
  • the man-machine interface may be a central control panel. After receiving the above-mentioned information output by the output module 150 through the central control panel, the user may use the man-machine interface to adjust the parameters of the pre-processing module 120 or the voice recognition module 130. Or control the relevant actuators in the car, such as controlling the opening and closing of the car windows, controlling the playback volume of the audio playback device in the car, etc.
  • the parameter adjustment interface provided by the man-machine interface can be displayed in a way that is easy to understand and adjust for ordinary users (such as graphical display), or can be displayed in a way for professional maintenance personnel.
  • the evaluation and prediction module 140 may also be deployed on an independent test device or in the cloud, and the output module 150 may also be deployed on an independent test device.
  • the test equipment here can be a special test equipment, and can be an intelligent terminal installed with corresponding software, for example, the intelligent terminal can be a mobile phone, a computer, a tablet computer, and the like.
  • the communication between the vehicle and the test equipment or cloud can be realized based on communication technology.
  • FIG. 2A shows the flow of an embodiment of a method for assessing voice quality.
  • the application to a vehicle is used as an example for illustration, which includes steps S210 to S230.
  • the test voice is obtained through a sound collection module arranged in the vehicle cabin.
  • the sound collection module may be a microphone, and in some embodiments, may be a plurality of microphones arranged in different positions of the vehicle cabin.
  • this step may be performed during the test or inspection of the vehicle, and the test voice may be broadcast by the tester.
  • this step may be performed while the user is using the vehicle, for example, when the vehicle is running or parked.
  • the test voice may be broadcast by the driver (i.e., the user), where the test voice matches a voice command of the vehicle. Since the driver already knows the voice content used in the voice command, the voice command broadcast by the driver can be used as the test voice.
  • this step may be triggered when the vehicle cannot accurately recognize the voice command of the user (such as the driver), and use the voice command that the user has broadcast or re-broadcast as the test voice.
  • the vehicle can also prompt the user through the man-machine interface to enter the speech quality assessment process of this embodiment, and can further guide the user to broadcast the corresponding test voices.
  • the user can broadcast a certain voice content (such as a voice command) repeatedly, and the multiple voices obtained are also called the multiple voices corresponding to that group of test voices, or multiple samples corresponding to one corpus entry.
  • the user can also repeat the broadcast several times for several voice commands respectively, so as to obtain multiple voices corresponding to these groups of test voices respectively.
  • the multiple voices of "turn on the air conditioner" broadcast by the user many times are one group of test voices
  • the multiple voices of "increase the volume" broadcast by the user many times are another group of test voices.
  • a group of test voices corresponds to a voice instruction, or corresponds to a semantic meaning, or corresponds to the same sentence content.
  • S220 Evaluate the speech quality of the test speech according to the semantic related information of the test speech.
  • the evaluation of the voice quality is performed by an evaluation module of the vehicle.
  • the evaluation module is implemented by a processor of the vehicle, and the processor is connected to the sound collection module by signal.
  • the speech quality evaluation result of the test speech is related to predetermined semantics, so that the speech quality evaluation result can not only be used to evaluate the speech quality, but also be used to predict the speech recognition quality of the speech recognition module.
  • the semantic-related information of the test speech is a semantic-related feature vector generated by the neural network for the test speech, and the feature vector is the output of any layer before the output layer of the neural network.
  • the feature vector may also be formed by cascading outputs of multiple layers before the output layer of the neural network, and the multiple layers may be any two or more layers.
  • the semantic related information of the test speech may be a one-dimensional vector formed by the output of the output layer of the neural network.
  • the output of the output layer corresponds to a one-dimensional vector formed by each confidence level of each voice instruction (ie, each semantic meaning).
  • S230 Output an evaluation result of the speech quality of the test speech.
  • the human-machine interface can include a display screen, which can be a vehicle central control screen, a head-up display (Head Up Display, HUD), or an augmented reality head-up display (Augmented Reality-HUD, AR-HUD), etc.
  • the human-machine interface can also include a speaker, and an input component, which can be a touch screen integrated in the display screen, or independent buttons, etc.
  • the prompt can be executed in the form of image, text, or voice.
  • the evaluation result includes content related to the voice quality of the test voice
  • the content related to the test voice quality may include one or a combination of the following: the quality of the test voice, factors affecting the quality of the test voice, and adjusting the quality of the test voice The way.
  • the factors that affect the quality of the test voice may include one or a combination of the following: the window is open, which introduces noise from outside the vehicle; the vehicle speed is too high, which makes tire noise or engine noise too loud; other sounds in the car are too loud, such as music played in the car; and the performance or location of the microphone in the car.
  • the way of adjusting the voice quality of the test may include one or a combination of the following: closing the car window, reducing the speed of the car, reducing the sound in the car, replacing a problematic microphone, and tuning the parameters of the pre-processing module.
  • step S220 includes steps S221 to S225.
  • S221 Obtain a first feature vector of the test speech.
  • the first feature vector includes time-frequency features of the test speech.
  • multiple feature vectors including frequency domain features of each consecutive frame of the test speech may be obtained, and the multiple feature vectors constitute the first feature vector.
  • adjacent frames may have overlapping information.
  • the pre-processing process of the pre-processing module may perform frame division processing on the acquired test voice to form the continuous frames.
  • the pre-processing process also includes processing such as pre-emphasis and windowing.
  • since the first feature vector includes the time-frequency features of the test speech, it is also called a time-frequency feature map: a two-dimensional map in which one dimension is the time coordinate, the other dimension is the frequency coordinate, and the intensity at each point is the intensity of the corresponding frequency in each successive frame.
  • the frequency domain features include one or a combination of the following: Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), and the spectrum.
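  • for instance, MFCC features per overlapping frame can be computed with an off-the-shelf library; in the sketch below the file name, sample rate, and frame sizes are illustrative assumptions:

```python
import librosa

y, sr = librosa.load("test_utterance.wav", sr=16000)   # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms frames, 15 ms overlap
print(mfcc.shape)  # (13, n_frames): one MFCC vector per frame
```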
  • S223 Obtain a second feature vector of the test voice according to the first feature vector of the test voice.
  • the second feature vector is related to the semantics of speech.
  • the semantically relevant feature model is used to extract the second feature vector based on the first feature vector; since the semantically relevant feature model represents the relationship between semantics and time-frequency features, the extracted second feature vector is related to semantics.
  • the semantically relevant feature model is constructed according to the first feature vector of the reference speech and the semantics of the reference speech.
  • a semantically relevant feature model can be constructed based on a neural network, which can be a fully connected neural network (FCNN), a recurrent neural network (RNN), a convolutional neural network (CNN), etc.; the present embodiment adopts a CNN.
  • the semantic-related feature model can be trained in combination with the pre-processing module.
  • the reference speech has semantic annotations
  • the reference speech is passed through the pre-processing module to obtain the first feature vector, which is then input into the semantically relevant feature model, and the model is trained according to whether its semantic output converges.
  • gradient descent methods, adversarial network methods, etc. can be used for training.
  • the semantic-related feature model constructed based on the neural network includes a multi-layer network, and the output of the semantic-related feature model corresponds to each semantic classification.
  • the second feature vector may be a feature vector output by any layer of the multi-layer network; since the first feature vector is used as the input of the neural network, the second feature vector is a feature vector extracted based on the first feature vector.
  • since each semantic corresponds to an output of the neural network (the neural network is equivalent to a classification network, with each category corresponding to one semantic), the second feature vector can be understood as a feature vector related to semantics.
  • when the output of a layer before the output layer of the neural network is used as the second feature vector, the second feature vector may be a feature vector with more dimensions than the first feature vector.
  • the operation results of the multiple convolution kernels constitute multiple dimensions of the second feature vector.
  • the second feature vector can also be a feature vector formed by cascading the output vectors of two or more layers of the neural network (similar to a residual connection), so that the second feature vector can contain low-level and high-level features at the same time.
  • the output of the second layer network of the neural network and the output of the fourth layer network are cascaded to form the second feature vector.
  • when the output of the output layer of the neural network is used as the second feature vector, the second feature vector is a vector composed of the confidence levels corresponding to each semantic. For example, when the output of the neural network is 20 nodes corresponding to 20 classifications (that is, for identifying one of 20 semantics), the second feature vector can be a one-dimensional vector with 20 parameters, where the value of each parameter corresponds to the confidence of the corresponding semantic.
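  • a minimal sketch of such a network and of both ways of forming the second feature vector (cascaded intermediate layers, and the output-layer confidence vector); the layer sizes follow no specification in the application, and only the 20-class setup echoes the example above:

```python
import torch
import torch.nn as nn

class SemanticFeatureModel(nn.Module):
    """CNN classifier over the first feature vector (time-frequency map);
    its intermediate activations serve as the semantics-related features."""
    def __init__(self, n_classes: int = 20):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.conv2 = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, n_classes))

    def forward(self, x: torch.Tensor):
        h1 = self.conv1(x)      # lower-level features
        h2 = self.conv2(h1)     # higher-level features
        logits = self.head(h2)  # output layer: one score per semantic class
        # Cascade pooled outputs of two layers as the second feature vector,
        # analogous to the residual-style concatenation described above.
        second_feature = torch.cat([h1.mean(dim=(2, 3)), h2.mean(dim=(2, 3))], dim=1)
        return logits, second_feature

model = SemanticFeatureModel()
x = torch.randn(1, 1, 64, 100)              # one first-feature-vector map
logits, feature = model(x)
confidences = torch.softmax(logits, dim=1)  # the 20-dim confidence-vector variant
```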
  • the reference voice set includes several groups of reference voices, where each group of reference voices corresponds to the same semantics; for example, the semantics can be command semantics such as "turn on the air conditioner", "turn off the air conditioner", "turn up the volume", and "turn down the volume".
  • the reference voice can be understood as a voice collected in an environment with little noise, or as a standard voice.
  • S225 Evaluate the voice quality of the test voice according to the second feature vector of the test voice.
  • the evaluated voice quality is also related to semantics.
  • the voice quality of the test voice can be understood as the voice quality of the test voice set.
  • the voice quality evaluation result includes the following three evaluation indicators:
  • the first evaluation index represents the concentration index of the center positions of the groups of test voices, wherein different groups of test voices have different semantics, and the center position of each group of test voices is the center position of the second feature vectors of that group in the second feature space.
  • a certain group of test voices may include multiple test voices, and the center position of the group of test voices may be obtained through calculation based on the distribution of the second feature vectors of the multiple test voices.
  • Mahalanobis distance, Euclidean distance, or other distance indicators that can measure the similarity between samples can be used for the concentration index.
  • the evaluation method of the first evaluation index can be as follows: first, for each group of test voices, calculate the distances between the center position of the group and the center positions of the other groups, and take the minimum of these distances; then take the average of the minimum distances obtained for all groups as the concentration index of the center positions of the groups of test voices.
  • the calculation method of the first evaluation index D1 can be shown in the following formula (1):

$$D_1=\frac{1}{C}\sum_{j=1}^{C}\min_{i\neq j}\left\|\mu_{tj}-\mu_{ti}\right\| \qquad (1)$$

where C represents the number of groups in the test voice set (i.e., the number of corpus types), μtj represents the center position of the j-th group of test voices in the test voice set, and μti represents the center position of the i-th group of test voices.
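  • a numpy sketch of formula (1), using Euclidean distance (the text also permits Mahalanobis or other distances); `centers` is assumed to be a (C, d) array of per-group second-feature-vector centers:

```python
import numpy as np

def d1_concentration(centers: np.ndarray) -> float:
    """Formula (1): average, over groups, of the minimum distance from each
    group's center to the centers of all other groups."""
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude each group's distance to itself
    return float(dists.min(axis=1).mean())
```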
  • the second evaluation index represents the dispersion index of the second feature vectors of each group of test speech in the second feature space, wherein different groups of test speech have different semantics, and the second feature space is the space where the second feature vectors are located.
  • the evaluation method of the second evaluation index can be as follows: first, for each group of test speech, calculate the semi-axis lengths of the distribution of the group's second feature vectors in the second feature space; then take the mean of the semi-axis lengths obtained for all groups as the dispersion index of the groups of test voices.
  • the calculation method of the second evaluation index D2 can be shown in the following formula (2):

$$D_2=\frac{1}{C}\sum_{j=1}^{C}\frac{1}{d}\sum_{k=1}^{d} f_{jk} \qquad (2)$$

where C represents the number of groups in the test speech set, d represents the dimension of the second feature space, fjk represents the k-th semi-axis length of the distribution of the second feature vectors of the j-th group of test speech, and Σtj represents the covariance matrix of the second feature vector distribution of the j-th group of test speech in the test speech set.
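  • a numpy sketch of formula (2); taking the semi-axis lengths fjk as square roots of the eigenvalues of each group's covariance matrix Σtj is an assumption consistent with the description above:

```python
import numpy as np

def d2_dispersion(groups: list[np.ndarray]) -> float:
    """Formula (2): average, over groups, of the mean semi-axis length of the
    group's second-feature-vector distribution; each group is an (n_j, d) array."""
    per_group = []
    for g in groups:
        cov = np.cov(g, rowvar=False)                  # Sigma_tj
        eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
        per_group.append(np.sqrt(eigvals).mean())      # mean of f_jk over k
    return float(np.mean(per_group))
```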
  • the third evaluation index represents the similarity index between the center position of each group of test voices and the center position of the group of reference voices with corresponding semantics; wherein different groups of test voices have different semantics, the center position of each group of test voices is the center position of the second feature vectors of that group in the second feature space, the second feature space is the space where the second feature vectors are located, different groups of reference voices have different semantics, and the center position of each group of reference voices is the center position of the second feature vectors of that group of reference voices in the second feature space.
  • the evaluation method of the third evaluation index can be as follows: first, in the second feature space, for each group of test voices, calculate the distance between the center position of that group of test voices and the center position of the group of reference voices with the same semantics, that is, corresponding to the same sentence content; then, take the average of the distances obtained for all groups of test voices as the similarity index between the test voice set and the reference voice set.
  • the center position of a certain group of test voices refers to the distribution center of the second feature vectors of the group of test voices
  • the center position of a certain group of reference voices refers to the distribution center of the second feature vectors of the group of reference voices.
  • the calculation method of the third evaluation index D3 can be shown in the following formula (3):

$$D_3=\frac{1}{C}\sum_{j=1}^{C}\operatorname{dist}\!\left(\mu_{rj},\,\mu_{tj}\right)\tag{3}$$

  • C represents the number of groups in the test speech set
  • μrj represents the center position of the distribution of the second feature vectors of the j-th group of reference speeches in the reference speech set
  • μtj represents the center position of the j-th group of test speeches
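A matching sketch of formula (3), under the same hypothetical data layout, with the test and reference groups index-aligned by sentence content:

```python
import numpy as np

def similarity_index(test_groups, ref_groups):
    """D3 of formula (3): the mean distance between each test group's
    center and the center of the reference group with the same semantics.

    test_groups, ref_groups: index-aligned lists of arrays of shape
    (n_utterances, feature_dim); entry j of both lists corresponds to
    the same sentence content.
    """
    dists = [
        np.linalg.norm(t.mean(axis=0) - r.mean(axis=0))  # dist(mu_rj, mu_tj)
        for t, r in zip(test_groups, ref_groups)
    ]
    return float(np.mean(dists))
```

Where several indicators are combined, as the next paragraph describes, a weighted sum of D1, D2 and D3 with deployment-specific weights is one straightforward realization.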
  • the speech quality assessment result may include one of the above evaluation indicators, or a combination of any number of evaluation indicators, and different evaluation indicators may carry different weights.
  • the embodiment of the present application also provides a speech recognition quality prediction method, which can predict the speech recognition quality of the test speech based on the speech quality evaluation result of the test speech.
  • FIG. 3A shows the flow of an embodiment of the speech recognition quality prediction method, which includes steps S310 to S340.
  • S320 Evaluate the speech quality of the test speech according to the semantic related information of the test speech.
  • the speech recognition quality prediction method and the aforementioned speech quality evaluation method can be integrated and executed in one process.
  • the content described in the above steps S310 and S320 can directly use the execution results of the aforementioned steps S210 and S220; there is no need to repeat the same content.
  • S330 Predict the speech recognition quality of the speech recognition model for the test speech by using the speech recognition quality function according to the evaluated test speech quality.
  • the speech recognition quality function is used to indicate the relationship between speech recognition quality and speech quality.
  • the recognition result of the speech recognition model is used in constructing the speech recognition quality function, so the predicted speech recognition quality can be used as the prediction result of the recognition quality of the speech recognition model.
  • the speech recognition model can be realized by the above-mentioned speech recognition module.
  • the outputted speech recognition quality prediction results include: the predicted speech recognition quality of the speech recognition module, factors affecting the speech recognition quality of the speech recognition module, and ways to adjust the speech recognition quality of the speech recognition module.
  • the factors affecting the voice recognition quality of the voice recognition module include: open windows, excessive vehicle speed, excessive sound inside the vehicle, the performance and deployment position of the microphone, and the performance of the voice recognition model of the voice recognition module.
  • the manner of adjusting the voice recognition quality of the voice recognition module includes one or a combination of the following: closing the window, reducing the speed of the vehicle, reducing the sound in the vehicle, replacing the problematic microphone or optimizing its deployment position, tuning the parameters of the pre-processing module, and tuning the speech recognition model parameters of the speech recognition module.
  • the prediction result can be output to the man-machine interface of the vehicle, so as to show the evaluation result to the user.
  • the speech recognition quality function in the above step S330 can be constructed as shown in FIG. 3B , and the construction process of the function includes steps S321 to S327.
  • S321 Perform multiple degradations on each reference voice in the reference voice set, where each degradation forms one degraded voice set, thereby obtaining multiple sets of degraded reference voices.
  • the reference speech is degraded to different degrees or in different ways to generate multiple sets of degraded speech; the degraded speech produced by each degree, or by each way, of degradation constitutes one degraded speech set.
  • the degradation method may be voice scrambling.
  • each reference voice in the reference voice set may be scrambled according to the possible noise environment of the vehicle, such as adding background music interference, simulated tire noise, wind resistance noise, and simulated horn noise from other vehicles outside the car.
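As a hedged sketch of such scrambling, and not a procedure specified by the original disclosure, additive mixing at a target signal-to-noise ratio is one common way to produce graded degradations; all names and SNR values here are illustrative:

```python
import numpy as np

def degrade_at_snr(reference, noise, snr_db):
    """Mix a noise recording into a reference utterance at a target
    signal-to-noise ratio, one simple way to realize the voice
    scrambling described above (tire noise, wind noise, horns, ...).

    reference, noise: 1-D float arrays at the same sample rate.
    """
    noise = np.resize(noise, reference.shape)       # loop or trim the noise
    sig_pow = np.mean(reference ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reference + scale * noise

# One degraded set per condition, e.g. several SNR levels of tire noise:
# degraded_sets = {snr: [degrade_at_snr(x, tire_noise, snr) for x in ref_set]
#                  for snr in (20, 10, 5, 0)}
```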
  • S323 Use each set of degraded reference voices as the test voice, and obtain the voice quality evaluated for each set of degraded reference voices according to the voice quality assessment method in the foregoing embodiments; the voice quality of each set of degraded reference voices constitutes the first evaluation result.
  • S325 Perform recognition and statistics on multiple sets of degraded reference voices according to the above-mentioned voice recognition model, and use the statistical voice recognition results as a first statistical result.
  • each set of degraded reference speech is recognized using the speech recognition model, generating a speech recognition result for each set; the speech recognition results of all sets of degraded reference speech constitute the first statistical result.
  • S327 Obtain a speech recognition quality function according to a functional relationship between the first statistical result and the first evaluation result.
  • the speech recognition quality function can be constructed based on machine learning: for example, for the first statistical results and the first evaluation results of the sets of degraded reference speech, the speech recognition quality function can be constructed by fitting a polynomial. It can also be constructed based on deep learning, for example by training a neural network model.
  • each evaluation indicator in the first evaluation result can be used as an independent variable to construct the speech recognition quality function
  • alternatively, one or more indicators in the first evaluation result can be combined into composite indicators and used as the independent variables to construct the speech recognition quality function.
  • the evaluation indicators here are, for example, the indicators shown in the above formula (1) to formula (3).
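For illustration only, a minimal polynomial-fitting sketch of this construction follows, mapping a single quality indicator to recognition accuracy; the polynomial degree, function names, and the commented sample values are hypothetical:

```python
import numpy as np

def fit_quality_function(d_scores, accuracies, degree=2):
    """Fit a polynomial mapping a speech-quality indicator (e.g. D1) to
    recognition accuracy, one machine-learning realization of the
    speech recognition quality function; degree 2 is an arbitrary choice.

    d_scores:   quality indicator per degraded set (first evaluation result)
    accuracies: recognition accuracy per degraded set (first statistical result)
    """
    coeffs = np.polyfit(d_scores, accuracies, deg=degree)
    return np.poly1d(coeffs)

# Hypothetical usage: predict recognition quality from a new quality score.
# q_fn = fit_quality_function([0.9, 1.4, 2.1, 3.0], [0.62, 0.78, 0.90, 0.96])
# predicted_accuracy = q_fn(2.5)
```

With several indicators as independent variables, a multivariate fit (e.g. least squares over a feature matrix, or a small neural network) would take the place of the one-dimensional polynomial.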
  • the embodiment of the present application also provides a method for improving speech recognition quality. Based on the voice quality assessment results of the test voice and the voice recognition quality prediction results, it can be determined whether the pre-processing module or the voice recognition module needs to be tuned in order to improve the voice recognition quality. As shown in Figure 4, the process of this method embodiment includes steps S410 to S480.
  • S420 Evaluate the speech quality of the test speech. For this step, reference may be made to the above step S220 or the descriptions in its various embodiments, and it will not be described in detail here.
  • S430 Determine whether the voice quality is lower than a preset first baseline. When the voice quality is lower than the first baseline, perform step S440; otherwise, perform step S450.
  • the first baseline is used to judge the evaluation index of voice quality, and is also referred to as the index baseline.
  • in some embodiments, a first baseline is set for each evaluation index used to evaluate voice quality; in other embodiments, the evaluation indexes are combined into one or more composite indexes, and corresponding first baselines are set for these composite indexes.
  • S440 Output an evaluation result of the speech quality of the test speech.
  • for this step, reference may be made to the above-mentioned step S230 or the descriptions in its various embodiments, and it will not be described in detail here.
  • this step may return to step S410, or end this process.
  • alternatively, step S450 may be continued, or step S480 may be executed to continue speech recognition.
  • S450 Predict speech recognition quality.
  • S460 Determine whether the predicted speech recognition quality is lower than a preset second baseline. When the speech recognition quality is lower than the second baseline, perform step S470; otherwise, perform step S480.
  • the second baseline is used to judge the speech recognition quality, that is, the speech recognition accuracy of the speech recognition model, and is also referred to as the accuracy baseline.
  • S470 Output the prediction result of the speech recognition quality. After this step, the process may return to step S410, or end.
  • alternatively, step S480 may also be performed to continue speech recognition.
  • S480 Recognize the test voice by the voice recognition module.
  • when the speech quality evaluated in step S430 is higher than the first baseline and the speech recognition quality predicted in step S460 is higher than the second baseline, the accuracy of the speech recognition is considered to be high, and the recognition result can be used subsequently, for example for controlling the vehicle.
  • when this step is entered with the speech quality evaluated in step S430 lower than the first baseline, or the speech recognition quality predicted in step S460 lower than the second baseline, the speech recognition result can be further prompted to the user for confirmation, so as to determine whether to use the speech recognition result.
  • the user may adjust the pre-processing parameters or the speech recognition model parameters according to the evaluation result of the speech quality output in step S440 or the prediction result of the speech recognition quality output in step S470, so as to improve the quality of speech recognition.
  • the device under test, such as a vehicle, may automatically adjust the pre-processing parameters or the speech recognition model parameters according to the evaluation result of the speech quality output in step S440 or the prediction result of the speech recognition quality output in step S470.
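The S410-S480 flow can be summarized by the following control-flow sketch; the callables, baseline values, and returned actions are assumptions for illustration, not interfaces defined by the original disclosure:

```python
def improve_recognition_flow(test_speech, evaluate, predict, recognize,
                             index_baseline, accuracy_baseline):
    """Sketch of the S410-S480 decision flow: evaluate speech quality,
    gate on the index baseline, predict recognition quality, gate on
    the accuracy baseline, and only then run recognition."""
    quality = evaluate(test_speech)                      # S420
    if quality < index_baseline:                         # S430
        return {"action": "tune_preprocessing",          # S440
                "quality": quality}
    predicted = predict(quality)                         # S450
    if predicted < accuracy_baseline:                    # S460
        return {"action": "tune_recognition_model",      # S470
                "predicted_accuracy": predicted}
    return {"action": "recognized",                      # S480
            "result": recognize(test_speech)}
```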
  • the specific implementation of the method for improving speech recognition quality involves the steps of the speech quality assessment method and the steps of the speech recognition quality prediction method. With reference to the foregoing embodiments, the steps corresponding to these two parts can also stand alone as specific implementations of the speech quality evaluation method and the speech recognition quality prediction method; to simplify the description, these separate specific implementations will not be described in detail.
  • the flow of a specific implementation of the speech recognition quality prediction method comprises the following steps:
  • S510 The vehicle acquires a test voice set based on the reference voice set through the microphone arranged in the cockpit.
  • the test voice set includes several groups.
  • the tester sits in the driver's seat of the vehicle and broadcasts each group of test voices in turn.
  • the semantics of each group can correspond to a common command.
  • Each group of test voices includes 10 test voices broadcast by the tester.
  • the content of the broadcast test voices corresponds to the content of the reference voice set.
  • the vehicle guides the testers through the man-machine interface when producing the test voices: for example, each voice content of the corresponding reference voice set and the number of times it is to be broadcast can be displayed on the screen in turn, and the testers broadcast accordingly.
  • S515 The collected test voices are pre-processed by the pre-processing module on the vehicle, including extracting the time-frequency feature vector map of each test voice in the test voice set, that is, the first feature vector.
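One conventional way to obtain such a time-frequency feature map is a short-time Fourier transform over overlapping frames; the sketch below uses SciPy, with frame and hop sizes as illustrative assumptions rather than values from the original disclosure:

```python
import numpy as np
from scipy.signal import stft

def time_frequency_features(waveform, sample_rate=16000):
    """Extract a time-frequency feature map (the first feature vector)
    from one test utterance via an STFT with overlapping frames."""
    _, _, spectrum = stft(waveform, fs=sample_rate,
                          nperseg=400,    # 25 ms frames at 16 kHz
                          noverlap=240)   # 15 ms overlap between frames
    return np.log(np.abs(spectrum) + 1e-10)  # log-magnitude spectrogram
```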
  • S520 Using the semantic correlation feature model, obtain a semantic correlation feature vector of the test voice in the test voice set, that is, a second feature vector, according to the time-frequency feature vector map.
  • S525 Evaluate the speech quality of the test speech set based on the semantically relevant feature vectors of the test speech.
  • the speech quality can be evaluated using one or more of the above formulas (1) to (3).
  • S530 According to the evaluation result of the voice quality, judge whether the voice quality is lower than the set index baseline; if so, go to step S535, otherwise go to step S540.
  • S535 Output an evaluation result of the speech quality of the test speech.
  • the evaluation result can be output to the man-machine interface, and the evaluation result includes content related to the voice quality of the test voice.
  • the content related to the speech quality of the test speech displayed on the man-machine interface may include: prompting the user to tune the parameters and algorithms in the pre-processing module.
  • the adjustable interface and parameters can be displayed in a graphical form.
  • S540 Using the speech recognition quality function, based on the speech quality of the test speech set, predict the speech recognition quality of the speech recognition module for the test speech set.
  • S545 Determine whether the predicted recognition quality is lower than the set accuracy baseline; if so, go to step S550, otherwise go to step S555.
  • S550 Output the prediction result. The prediction result can be output to the man-machine interface and includes content related to speech recognition quality.
  • the content related to the speech recognition quality displayed on the man-machine interface includes: prompting the user to optimize the speech recognition model of the speech recognition module.
  • the adjustable interface and parameters can be displayed in a graphical form.
  • the speech recognition quality function can be further optimized.
  • the steps in the speech recognition quality function construction method can be used to retrain the speech recognition quality function to optimize the speech recognition quality function.
  • the speech recognition quality function can be retrained according to the above-mentioned steps S323 to S327, thereby completing the retraining of the quality function.
  • the embodiments of the present application also provide corresponding devices. Regarding the beneficial effects or technical problems solved by the devices, reference may be made to the descriptions of the methods corresponding to each device, or to the descriptions in the summary of the invention; they are only briefly described here. Each device in this embodiment may be used to implement each optional embodiment of the foregoing methods. The device embodiments of the present application are described below with reference to the figures.
  • the speech quality assessment device provided by the embodiment of the present application can be used to implement various embodiments of the speech quality assessment method. As shown in FIG. 6A , the device has an acquisition module 610 , an evaluation module 620 and an output module 630 .
  • the obtaining module 610 is used for obtaining test voice. It is specifically used to execute the foregoing step S210 and various embodiments thereof.
  • the evaluation module 620 is used for evaluating the speech quality of the test speech according to the semantic related information of the test speech. It is specifically used to execute the foregoing step S220 and various embodiments thereof.
  • the output module 630 is configured to output the evaluation result of the voice quality. It is specifically used to execute the foregoing step S230 and various embodiments thereof.
  • the evaluation result of the voice quality output by the output module 630 includes one or more of the following: the quality of the test voice; factors affecting the quality of the test voice; adjusting the quality of the test voice The way.
  • the evaluation module 620 is specifically configured to: obtain a first feature vector of the test voice, wherein the first feature vector includes a time-frequency feature vector of the test voice; obtain a second feature vector of the test speech from the first feature vector, wherein the second feature vector is related to the semantics of the test speech; and evaluate the speech quality of the test speech according to the second feature vector.
  • when evaluating the speech quality of the test speech according to the second feature vector, the evaluation module 620 is configured to use a first evaluation index to evaluate the speech quality of the test speech, the first evaluation index including: the concentration degree index of the center positions of each group of test voices, wherein the test voices of different groups have different semantics, and the center position of each group of test voices is the center position of the second feature vectors of the group of test voices in the second feature space, the second feature space being the space in which the second feature vectors reside.
  • when evaluating the speech quality of the test speech according to the second feature vector, the evaluation module 620 is configured to use a second evaluation index to evaluate the speech quality of the test speech, the second evaluation index including: the dispersion degree index of the second feature vectors of each group of test speech in the second feature space, wherein the test speech of different groups has different semantics, and the second feature space is the space in which the second feature vectors reside.
  • when evaluating the speech quality of the test speech according to the second feature vector, the evaluation module 620 is configured to use a third evaluation index to evaluate the speech quality of the test speech, the third evaluation index including: the similarity degree index between the center position of each group of test speech and the center position of the group of reference speech with corresponding semantics; wherein the test speech of different groups has different semantics, the center position of each group of test speech is the center position of the second feature vectors of the group of test voices in the second feature space, the second feature space is the space in which the second feature vectors reside, the reference voices of different groups have different semantics, and the center position of each group of reference voices is the center position of the second feature vectors of the group of reference voices in the second feature space.
  • when obtaining the first feature vector of the test speech, the evaluation module 620 is configured to: obtain the consecutive frames contained in the test speech, wherein adjacent frames contain overlapping information; and obtain, from the consecutive frames, a plurality of feature vectors including frequency-domain features, the plurality of feature vectors constituting the first feature vector.
  • the embodiment of the present application also provides a device for speech recognition quality prediction, which can be used to implement the method embodiment of speech recognition quality prediction. As shown in Figure 6B, the device has an acquisition module 612, a prediction module 622, and an output module 632.
  • the obtaining module 612 is used for obtaining the test voice. It is specifically used to execute the above step S310 and various embodiments thereof.
  • the prediction module 622 is used to predict the speech recognition quality of the speech recognition model for the test speech according to the speech recognition quality function, where the speech recognition quality function represents the relationship between the speech recognition quality and the speech quality, and the speech quality is evaluated based on any one of the possible embodiments of the aforementioned voice quality assessment method. It is specifically used to execute the above steps S320-S330 and various embodiments thereof.
  • the output module 632 is used to output the prediction result of speech recognition quality. It is specifically used to execute the above step S340 and various embodiments thereof.
  • the prediction result of the speech recognition quality output by the output module 632 includes one or more of the following: the speech recognition quality of the speech recognition model; factors affecting the speech recognition quality of the speech recognition model; and ways of adjusting the speech recognition quality of the speech recognition model.
  • the construction process of the speech recognition quality function includes: obtaining multiple sets of degraded reference speeches of the reference speech; obtaining the speech recognition results of the multiple sets of degraded reference speeches according to the speech recognition model and using them as the first statistical result; using the multiple sets of degraded reference speeches respectively as test voices, and obtaining the voice quality assessment results of the multiple sets of degraded reference speeches according to any possible embodiment of the voice quality assessment method, to obtain the first evaluation result; and
  • obtaining the speech recognition quality function according to a functional relationship between the first statistical result and the first evaluation result.
  • the embodiment of the present application also provides a device for improving speech recognition quality, which can be used to implement the method embodiment for improving speech recognition quality. As shown in Figure 6C, the device has an acquisition module 614, an evaluation prediction module 624, and an output module 634.
  • the obtaining module 614 is used for obtaining the test voice. It is specifically used to execute the above step S410 and various embodiments thereof.
  • the evaluation prediction module 624 is used to evaluate the speech quality of the test speech. Any possible embodiment of the foregoing voice quality assessment method may be used for assessment. It is specifically used to execute the above steps S420-S430 and various embodiments thereof.
  • the output module 634 is configured to output the assessment result of the voice quality when the voice quality is lower than the preset first baseline. It is specifically used to execute the above step S440 and various embodiments thereof.
  • the evaluation and prediction module 624 is further configured to predict the speech recognition quality according to the method described in any possible embodiment of the speech recognition quality prediction method when the speech quality is greater than or equal to the first baseline . It is specifically used to execute the above steps S450-S460 and various embodiments thereof.
  • the output module 634 is further configured to output the prediction result of the speech recognition quality when the speech recognition quality is lower than a preset second baseline. It is specifically used to execute the above step S470 and various embodiments thereof.
  • the embodiment of the present application also provides a vehicle. As shown in Figure 7A and Figure 7B, the vehicle includes a sound collection device 710, a pre-processing device 720, a speech recognition device 730, and the aforementioned speech quality evaluation device, speech recognition quality prediction device, or device for improving speech recognition quality.
  • the sound collecting device 710 is used for collecting the semantic commands spoken by the driver based on the reference speech.
  • it may be a microphone in the cockpit.
  • the microphone is set at the central control panel 740, and may also be set at one or more other positions, such as the instrument panel above the steering wheel, the rearview mirror in the cockpit, or the steering wheel.
  • the pre-processing device 720 is used for pre-processing the speech broadcast by the driver.
  • the voice recognition device 730 is used to recognize the driver's command when the voice quality of the driver's command and the predicted voice recognition quality meet the requirements.
  • the aforementioned speech quality evaluation device, speech recognition quality predicting device or speech recognition quality improving device is used for the aforementioned purpose, and the user can perform corresponding operations based on this to improve speech quality and speech recognition quality.
  • a central control panel 740 serving as the man-machine interface is also shown, through which the user receives the information output by the speech quality evaluation device, the speech recognition quality prediction device, or the device for improving speech recognition quality
  • the output information can be displayed on the central control panel 740, and a parameter adjustment interface can also be displayed with its help, which is convenient for the user to perform the above-mentioned parameter adjustments by manipulating the central control panel 740.
  • the pre-processing device 720, the speech recognition device 730, and the speech quality evaluation device, speech recognition quality prediction device, or device for improving speech recognition quality can be implemented by one or more processors in the vehicle; in this embodiment, they may be implemented by a processor of the vehicle infotainment system.
  • FIG. 8 is a schematic structural diagram of a computing device 800 provided by an embodiment of the present application.
  • the computing device 800 includes: a processor 810 , a memory 820 , and a communication interface 830 .
  • the communication interface 830 in the computing device 800 shown in FIG. 8 can be used to communicate with other devices.
  • the processor 810 may be connected to the memory 820 .
  • the memory 820 can be used to store program codes and data. The memory 820 may be a storage module inside the processor 810, an external storage module independent of the processor 810, or may include both a storage module inside the processor 810 and an external storage module independent of the processor 810.
  • the processor 810 executes the computer-implemented instructions in the memory 820 to perform the operation steps of the above method.
  • the embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and the program is used to execute one or more of the solutions described in the various embodiments of the present application when executed by a processor.
  • reference to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. The phrase "in an embodiment" appearing in various places in this specification does not necessarily always refer to the same embodiment, though it may. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments, as would be apparent to one of ordinary skill in the art from this disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The present application relates to the technical field of speech recognition. Embodiments of the present application provide a speech quality evaluation method, a speech recognition quality prediction method, and a method for improving speech recognition quality. First, a test speech is acquired; the speech quality of the test speech is then evaluated according to information relating to the semantics of the test speech, to determine whether a speech pre-processing parameter needs to be adjusted, and the speech recognition quality is predicted on the basis of the evaluated speech quality, to determine whether a parameter of a speech recognition model needs to be adjusted; a corresponding evaluation result or prediction result is output, so that the parameter of the pre-processing or of the speech recognition model can be adjusted according to the evaluation result or prediction result. According to the present application, the parameter-adjustment processes of the pre-processing and of the speech recognition model within the speech recognition process can be decoupled.
PCT/CN2021/122149 2021-09-30 2021-09-30 Procédé et appareil d'évaluation de la qualité de la parole, procédé et appareil de prédiction de la qualité de la reconnaissance de la parole et procédé et appareil d'amélioration de la qualité de la reconnaissance de la parole WO2023050301A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/122149 WO2023050301A1 (fr) 2021-09-30 2021-09-30 Procédé et appareil d'évaluation de la qualité de la parole, procédé et appareil de prédiction de la qualité de la reconnaissance de la parole et procédé et appareil d'amélioration de la qualité de la reconnaissance de la parole
CN202180008040.9A CN116210050A (zh) 2021-09-30 2021-09-30 语音质量评估、语音识别质量预测与提高的方法及装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/122149 WO2023050301A1 (fr) 2021-09-30 2021-09-30 Procédé et appareil d'évaluation de la qualité de la parole, procédé et appareil de prédiction de la qualité de la reconnaissance de la parole et procédé et appareil d'amélioration de la qualité de la reconnaissance de la parole

Publications (1)

Publication Number Publication Date
WO2023050301A1 true WO2023050301A1 (fr) 2023-04-06

Family

ID=85781128

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122149 WO2023050301A1 (fr) 2021-09-30 2021-09-30 Procédé et appareil d'évaluation de la qualité de la parole, procédé et appareil de prédiction de la qualité de la reconnaissance de la parole et procédé et appareil d'amélioration de la qualité de la reconnaissance de la parole

Country Status (2)

Country Link
CN (1) CN116210050A (fr)
WO (1) WO2023050301A1 (fr)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1389059A (zh) * 2000-06-29 2003-01-01 皇家菲利浦电子有限公司 为后续的离线语音识别记录语音信息的记录装置
CN1802694A (zh) * 2003-05-08 2006-07-12 语音信号科技公司 信噪比中介的语音识别算法
CN1965218A (zh) * 2004-06-04 2007-05-16 皇家飞利浦电子股份有限公司 交互式语音识别系统的性能预测
US20150073785A1 (en) * 2013-09-06 2015-03-12 Nuance Communications, Inc. Method for voicemail quality detection
CN106297795A (zh) * 2015-05-25 2017-01-04 展讯通信(上海)有限公司 语音识别方法及装置
CN107093427A (zh) * 2016-02-17 2017-08-25 通用汽车环球科技运作有限责任公司 不流畅语言的自动语音识别
CN107221319A (zh) * 2017-05-16 2017-09-29 厦门盈趣科技股份有限公司 一种语音识别测试系统和方法
WO2020166322A1 (fr) * 2019-02-12 2020-08-20 日本電信電話株式会社 Dispositif d'acquisition de données d'apprentissage, dispositif d'apprentissage de modèle, procédés associés et programme
CN112951270A (zh) * 2019-11-26 2021-06-11 新东方教育科技集团有限公司 语音流利度检测的方法、装置和电子设备


Also Published As

Publication number Publication date
CN116210050A (zh) 2023-06-02


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21958875

Country of ref document: EP

Kind code of ref document: A1