WO2023050301A1 - Speech quality assessment method and apparatus, speech recognition quality prediction method and apparatus, and speech recognition quality improvement method and apparatus - Google Patents


Info

Publication number
WO2023050301A1
WO2023050301A1 (PCT/CN2021/122149)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
quality
test
voice
speech recognition
Prior art date
Application number
PCT/CN2021/122149
Other languages
French (fr)
Chinese (zh)
Inventor
周宇
聂为然
向腾
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2021/122149 (WO2023050301A1)
Priority to CN202180008040.9A (CN116210050A)
Publication of WO2023050301A1


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/01Assessment or evaluation of speech recognition systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Definitions

  • the present application relates to the technical field of speech recognition, in particular to a method and device for evaluating speech quality, a method and device for predicting speech recognition quality, a method and device for improving speech recognition quality, a vehicle, a computer-readable storage medium, and a computer program product.
  • the voice signal is collected by the sensor, enhanced by the voice pre-processing module, and then sent to the voice recognition module for voice wake-up and recognition. Therefore, the recognition performance of the voice recognition system mainly depends on two major factors: the performance of the voice recognition module and the quality of the voice signal.
  • the quality of the voice signal depends on the environment, the audio collection hardware, and the algorithm of the voice pre-processing module.
  • the industry generally performs joint overall debugging of the speech recognition module and the speech pre-processing module, testing the end-to-end speech recognition quality for performance tuning.
  • existing standards cannot evaluate the quality of the voice signal output by the voice pre-processing module, nor provide tuning baselines and feedback for the acquisition hardware, the voice pre-processing module, and the voice recognition module.
  • This solution is not conducive to independently locating and solving issues related to voice signal quality and voice recognition module performance in the actual voice recognition business, and it makes fault localization in the voice recognition system difficult. Therefore, decoupling the voice recognition module from the voice pre-processing module to obtain voice quality calibration and feedback is of great significance for improving the efficiency of module fault diagnosis.
  • the present application provides a speech quality assessment method and device, a speech recognition quality prediction method and device, a speech recognition quality improving method and device, a vehicle, a computer readable storage medium and a computer program product.
  • the first aspect of the present application provides a voice quality assessment method, including: acquiring a test voice; evaluating the voice quality of the test voice according to the semantic related information of the test voice; and outputting the voice quality assessment result.
  • since the quality assessment is based on information related to the semantics of the test voice, the semantic information is reflected in the assessed voice quality, so the assessed voice quality can be used to predict the speech recognition quality of the speech recognition model on the test speech.
  • the output evaluation result may include content related to the evaluated voice quality, which the user can refer to in order to improve the voice quality.
  • when applied to vehicles, the solution can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain or improve the in-vehicle voice interaction experience.
  • the vehicle can also perform corresponding operations based on the output evaluation results to improve the in-vehicle voice interaction experience, such as automatically closing the car windows, automatically reducing the sound in the car (such as the sound of music being played), or automatically performing parameter tuning.
  • the evaluation result of the voice quality includes one or more of the following: quality of the test voice; factors affecting the quality of the test voice; and a manner of adjusting the quality of the test voice.
  • the output content can be flexibly set.
  • the quality of the test speech can be output as quantified parameters, in qualitative grades such as excellent, good, and medium, or in combination with images and text.
  • factors that affect the test voice quality can include noise from outside the car caused by an open window, tire or engine noise caused by high vehicle speed, and other sounds in the car, which the user can refer to when making adjustments; ways of adjusting the test voice quality include closing the car windows, reducing the vehicle speed, reducing the sound in the car, replacing a problematic microphone, and tuning the parameters of the pre-processing module.
  • evaluating the speech quality of the test speech according to the semantically related information of the test speech includes: obtaining a first feature vector of the test speech, where the first feature vector includes a time-frequency feature vector of the test speech; obtaining a second feature vector of the test speech according to the first feature vector of the test speech, wherein the second feature vector is related to the semantics of the test speech; and evaluating the speech quality of the test speech according to the second feature vector.
  • the first feature vector including the time-frequency feature vector of the speech can be obtained first, and then the second feature vector related to the semantics of the speech can be obtained based on this, so as to evaluate the speech quality of the test speech according to the second feature vector.
  • the first feature vector is a time-frequency feature vector and is therefore affected by the pre-processing parameters; since the second feature vector is derived from the first feature vector, it is also related to the pre-processing parameters, so the speech quality of the test speech evaluated from the second feature vector can be used as a reference for tuning the pre-processing parameters.
  • evaluating the speech quality of the test speech according to the second feature vector includes: using a first evaluation index to evaluate the speech quality of the test speech, the first evaluation index including an index of the concentration of the center positions of the groups of test speech, wherein different groups of test speech have different semantics, the center position of each group of test speech is the center position of that group's second feature vectors in the second feature space, and the second feature space is the space where the second feature vectors are located.
  • the speech quality of the test speech can be evaluated according to the above optional evaluation index, and the evaluation index is calculated based on the second feature vector.
  • evaluating the speech quality of the test speech according to the second feature vector includes: using a second evaluation index to evaluate the speech quality of the test speech, the second evaluation index including an index of the degree of dispersion of each group of test speech's second feature vectors in the second feature space, wherein different groups of test speech have different semantics, and the second feature space is the space where the second feature vectors are located.
  • the speech quality of the test speech can be evaluated according to the above optional evaluation index, and the evaluation index is calculated based on the second feature vector.
  • evaluating the speech quality of the test speech according to the second feature vector includes: using a third evaluation index to evaluate the speech quality of the test speech, the third evaluation index including an index of the similarity between the center position of each group of test speech and the center position of the semantically corresponding group of reference speech; wherein different groups of test speech have different semantics, the center position of each group of test speech is the center position of that group's second feature vectors in the second feature space, the second feature space is the space where the second feature vectors are located, different groups of reference speech have different semantics, and the center position of each group of reference speech is the center position of that group's second feature vectors in the second feature space.
  • the speech quality of the test speech can be evaluated according to the above optional evaluation index, and the evaluation index is calculated based on the second feature vector.
  • obtaining the first feature vector of the test speech includes: obtaining the continuous frames contained in the test speech, wherein adjacent frames contain overlapping information; and obtaining, for the continuous frames, a plurality of feature vectors containing frequency-domain features, the plurality of feature vectors constituting the first feature vector.
  • the second aspect of the present application provides a speech recognition quality prediction method, including: obtaining a test speech; predicting the speech recognition quality of the speech recognition model for the test speech according to the speech recognition quality function, and the speech recognition quality function is used to indicate The relationship between the speech recognition quality and the speech quality, the speech quality is evaluated according to the method of the first aspect or any possible implementation manner of the first aspect; and a prediction result of the speech recognition quality is output.
  • the output prediction result of the speech recognition quality includes one or more of the following: the speech recognition quality of the speech recognition model; factors affecting the speech recognition quality of the speech recognition model; and ways of adjusting the speech recognition quality of the speech recognition model.
  • the content of the predicted result can be flexibly set.
  • the quality of a speech recognition model can be provided in quantified specific parameters, or in qualitative ways such as excellent, good, and medium, or in combination with images and text.
  • the output factors affecting the speech recognition quality of the speech recognition model may be factors in the method provided in the first aspect, or may be performance factors of the speech recognition model of the speech recognition module.
  • the output mode for adjusting the test voice quality may be the adjustment mode provided in the method provided in the first aspect, or it may be a tuning prompt for the parameters of the speech recognition model.
  • the construction process of the speech recognition quality function includes: obtaining multiple sets of degraded reference speech from the reference speech; obtaining statistics of the speech recognition results of the speech recognition model on the multiple sets of degraded reference speech to obtain a first statistical result; using the multiple sets of degraded reference speech as test speech and obtaining their speech quality evaluation results according to the method of the first aspect or any possible implementation of the first aspect to obtain a first evaluation result; and obtaining the speech recognition quality function according to the functional relationship between the first statistical result and the first evaluation result.
  • the above is a way of constructing the speech recognition quality function through reference speech. Specifically, the method of degrading the reference speech is adopted without introducing other reference speech, which can effectively reduce the data volume of the reference speech.
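  • As a non-limiting illustration (the variable names and data values below are hypothetical, not taken from the application), the following Python sketch shows one way such a speech recognition quality function could be fitted from a first statistical result and a first evaluation result, and then used for prediction:

```python
import numpy as np

# Hypothetical example data: one entry per set of degraded reference speech.
# quality_scores  -> first evaluation result (evaluated speech quality of each degraded set)
# recog_accuracy  -> first statistical result (recognition accuracy of the speech recognition model)
quality_scores = np.array([0.35, 0.48, 0.61, 0.72, 0.85, 0.93])
recog_accuracy = np.array([0.52, 0.66, 0.78, 0.86, 0.93, 0.97])

# Fit a low-order polynomial as the speech recognition quality function.
coeffs = np.polyfit(quality_scores, recog_accuracy, deg=2)
quality_function = np.poly1d(coeffs)

# Later, the fitted function predicts recognition quality from an evaluated test-speech quality.
test_speech_quality = 0.70
predicted_recognition_quality = float(quality_function(test_speech_quality))
print(f"predicted speech recognition quality: {predicted_recognition_quality:.2f}")
```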
  • the third aspect of the present application provides a method for improving speech recognition quality, including: obtaining a test speech; obtaining a speech quality evaluation result of the test speech according to the method of the first aspect or any possible implementation of the first aspect; and outputting the speech quality evaluation result when the speech quality is lower than a preset first baseline.
  • in this way, the speech quality evaluation result of the test speech can be obtained, and content related to the speech quality can be output according to that result, where the output content can include whether the pre-processing parameters should be adjusted.
  • for example, when applied to vehicles, the solution can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain the in-vehicle voice interaction experience.
  • This process can be independent of the speech recognition process of the speech recognition model, realizing the decoupling of pre-processing and speech recognition.
  • the third aspect also includes: when the voice quality is greater than or equal to the first baseline, predicting the speech recognition quality according to the second aspect or any possible implementation of the second aspect; and when the speech recognition quality is lower than a preset second baseline, outputting a prediction result of the speech recognition quality.
  • the fourth aspect of the present application provides a voice quality assessment device, including:
  • the obtaining module is used to obtain the test speech; the evaluation module is used to evaluate the speech quality of the test speech according to the semantic related information of the test speech; the output module is used to output the speech quality evaluation result.
  • since the quality assessment is based on information related to the semantics of the test voice, the semantic information is reflected in the assessed voice quality, so the assessed voice quality can be used to predict the speech recognition quality of the speech recognition model on the test speech.
  • the output evaluation result may include content related to the evaluated voice quality, which may be used by the user to refer to the evaluation result to improve the voice quality. For example, when applied to vehicles, it can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain or improve the voice interaction experience in the vehicle.
  • the vehicle can also perform corresponding operations based on the output evaluation results to improve the voice interaction experience in the vehicle, such as automatically closing the windows, automatically reducing the sound in the vehicle (such as the vehicle playing music sound), or automatically perform parameter tuning, etc.
  • the evaluation result of the voice quality output by the output module includes one or more of the following: quality of the test voice; factors affecting the quality of the test voice; and a way of adjusting the quality of the test voice.
  • the evaluation module is specifically configured to: obtain a first feature vector of the test speech, wherein the first feature vector includes a time-frequency feature vector of the test speech; obtain a second feature vector of the test speech according to the first feature vector of the test speech, wherein the second feature vector is related to the semantics of the test speech; and evaluate the speech quality of the test speech according to the second feature vector.
  • when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it uses the first evaluation index to evaluate the speech quality of the test speech, the first evaluation index including an index of the concentration of the center positions of the groups of test speech, wherein different groups of test speech have different semantics, the center position of each group of test speech is the center position of that group's second feature vectors in the second feature space, and the second feature space is the space where the second feature vectors are located.
  • when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it uses the second evaluation index to evaluate the speech quality of the test speech, the second evaluation index including an index of the degree of dispersion of each group of test speech's second feature vectors in the second feature space, wherein different groups of test speech have different semantics, and the second feature space is the space where the second feature vectors are located.
  • when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it uses a third evaluation index to evaluate the speech quality of the test speech, the third evaluation index including an index of the similarity between the center position of each group of test speech and the center position of the semantically corresponding group of reference speech; wherein different groups of test speech have different semantics, the center position of each group of test speech is the center position of that group's second feature vectors in the second feature space, the second feature space is the space where the second feature vectors are located, different groups of reference speech have different semantics, and the center position of each group of reference speech is the center position of that group's second feature vectors in the second feature space.
  • when the evaluation module obtains the first feature vector of the test speech, it obtains the continuous frames contained in the test speech, wherein adjacent frames contain overlapping information, and obtains for the continuous frames a plurality of feature vectors containing frequency-domain features, the plurality of feature vectors constituting the first feature vector.
  • the fifth aspect of the present application provides a speech recognition quality prediction device, including: an acquisition module, used to obtain a test speech; a prediction module, used to predict the speech of the speech recognition model to the test speech according to the speech recognition quality function Recognition quality, the speech recognition quality function is used to indicate the relationship between the speech recognition quality and the speech quality, and the speech quality is evaluated according to the method of the first aspect or any possible implementation manner of the first aspect; the output module is used to output the speech Prediction results of recognition quality.
  • the prediction result of the speech recognition quality output by the output module includes one or more of the following: the speech recognition quality of the speech recognition model; factors affecting the speech recognition quality of the speech recognition model; and ways of adjusting the speech recognition quality of the speech recognition model.
  • the construction process of the speech recognition quality function includes: obtaining multiple sets of degraded reference speech from the reference speech; obtaining statistics of the speech recognition results of the speech recognition model on the multiple sets of degraded reference speech to obtain a first statistical result; using the multiple sets of degraded reference speech as test speech and obtaining their speech quality evaluation results according to the method of the first aspect or any possible implementation of the first aspect to obtain a first evaluation result; and obtaining the speech recognition quality function according to the functional relationship between the first statistical result and the first evaluation result.
  • the sixth aspect of the present application provides a device for improving speech recognition quality, including: an acquisition module, used to acquire a test speech; an evaluation and prediction module, used to obtain a speech quality evaluation result of the test speech according to the method of the first aspect or any possible implementation of the first aspect; and an output module, configured to output the speech quality evaluation result when the speech quality is lower than a preset first baseline.
  • the speech quality evaluation result of the test speech can be obtained, and the content related to the speech quality can be output according to the speech quality evaluation result of the test speech, wherein the output content can include whether to adjust the pre-processing parameters.
  • for example, when applied to vehicles, the device can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain the in-vehicle voice interaction experience.
  • This process can be independent of the speech recognition process of the speech recognition model, realizing the decoupling of pre-processing and speech recognition.
  • the evaluation and prediction module is also used to predict the speech recognition quality according to the method of the second aspect or any possible implementation of the second aspect when the voice quality is greater than or equal to the first baseline; the output module is also used to output the prediction result of the speech recognition quality when the speech recognition quality is lower than the preset second baseline.
  • the seventh aspect of the present application provides a vehicle, including: a voice collection device for collecting the user's voice command; a pre-processing device for pre-processing the sound of the voice command; a voice recognition device for recognizing the pre-processed sound; and the device of any one of the fourth, fifth, and sixth aspects and their possible implementations.
  • the eighth aspect of the present application provides a computing device, including one or more processors and one or more memories, the memories storing program instructions that, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect and any possible implementation thereof.
  • the ninth aspect of the present application provides a computer-readable storage medium, on which program instructions are stored, and when the program instructions are executed by a computer, the computer implements the method of the first aspect and any possible implementation thereof.
  • the tenth aspect of the present application provides a computer program product, which includes program instructions, and when the program instructions are executed by a computer, the computer implements the method of the first aspect and any possible implementation manner thereof.
  • the embodiment of the present application decouples the evaluation of the speech pre-processing process from the prediction of speech recognition by the speech recognition model, allowing pre-processing problems and speech recognition problems to be located separately, which is beneficial to independently locating problems in each module and to coordinated performance tuning.
  • the evaluation of the test voice quality can also be used to prompt the user to perform corresponding operations to improve the in-vehicle voice interaction experience.
  • the prediction of the speech recognition quality can also be used to prompt the user to perform corresponding operations to improve the in-vehicle voice interaction experience.
  • FIG. 1 is a schematic structural diagram of an application scenario 1 involved in an embodiment of the present application
  • FIG. 2A is a schematic flow chart of a voice quality assessment method provided in an embodiment of the present application.
  • FIG. 2B is a schematic flow chart of a voice quality assessment method for test voice provided in an embodiment of the present application
  • FIG. 3A is a schematic flowchart of a method for predicting speech recognition quality provided by an embodiment of the present application
  • FIG. 3B is a schematic flow chart of a method for constructing a speech recognition quality function provided by an embodiment of the present application
  • FIG. 4 is a schematic flowchart of a method for improving speech recognition quality provided by an embodiment of the present application
  • FIG. 5 is a schematic flow chart of a specific implementation of the method for improving the quality of speech recognition provided by the embodiment of the present application;
  • FIG. 6A is a schematic diagram of a speech quality assessment device provided in an embodiment of the present application.
  • FIG. 6B is a schematic diagram of a speech recognition quality prediction device provided in an embodiment of the present application.
  • FIG. 6C is a schematic diagram of a device for improving speech recognition quality provided by an embodiment of the present application.
  • FIG. 7A is a schematic structural diagram of a vehicle provided in an embodiment of the present application.
  • FIG. 7B is a schematic diagram of the cockpit of the vehicle provided by the embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computing device provided by an embodiment of the present application.
  • the voice quality assessment solution provided by the embodiments of the present application includes a voice quality assessment method and device, a voice recognition quality prediction method and device, a method and device for improving voice recognition quality, a computer-readable storage medium, and a computer program product. Since these technical solutions solve problems on the same or similar principles, repeated content may not be described again in the following specific embodiments; the specific embodiments should be regarded as referring to one another and as combinable with one another.
  • the index of Mean Opinion Score can be used for evaluation. This index is also known as the subjective voice quality index.
  • multiple index levels can be used to evaluate the quality of the tested voice. The quality of the tested voice is obtained by averaging the scores of all test listeners.
  • the differences in auditory ability and subjective auditory experience between different listeners will cause differences in scores; especially when only a single sentence is provided without providing context, the scores of the test listeners are significantly different. This will lead to low objectivity of the evaluation result of the tested voice quality, and the evaluation method is poor in adaptability.
  • When evaluating the voice quality under test, objective evaluation indicators can also be used. Objective evaluation indicators include Signal-to-Noise Ratio (SNR), Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Analysis (POLQA), Short-Time Objective Intelligibility (STOI), etc.
  • the PESQ algorithm requires a noisy degraded signal and an original reference signal. After level adjustment, input filtering, time alignment and compensation, and auditory transformation of the two speech signals to be compared, the parameters of the two signals are extracted and their time-frequency characteristics are combined to obtain the PESQ score, which is finally mapped to a subjective Mean Opinion Score (MOS).
  • POLQA is the successor to PESQ, extended to handle higher bandwidth audio signals.
  • STOI is one of the important indicators to measure speech intelligibility, which is used to evaluate the intelligibility of noisy speech that has been masked in the time domain or short-time Fourier transformed and weighted in the frequency domain. STOI is scored by comparing the clean speech with the speech to be evaluated.
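  • For background only, a minimal NumPy sketch of the simplest of these objective indicators, SNR, computed from an aligned clean reference signal and its noisy counterpart (the function name and the small stabilizing constant are illustrative assumptions):

```python
import numpy as np

def snr_db(clean: np.ndarray, noisy: np.ndarray) -> float:
    """Signal-to-noise ratio in dB, treating (noisy - clean) as the noise component."""
    noise = noisy - clean
    signal_power = np.sum(clean ** 2)
    noise_power = np.sum(noise ** 2) + 1e-12  # avoid division by zero
    return 10.0 * np.log10(signal_power / noise_power)
```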
  • methods that evaluate the voice quality under test with the above objective evaluation indicators focus, from an acoustic point of view, on the correlation between sound characteristics and subjective hearing perception, but their relationship to machine-oriented speech recognition performance (that is, recognizing speech as text or semantics) is unclear, so they are difficult to use as an effective reference for the speech recognition module during development and tuning of the speech recognition system.
  • the embodiment of the present application provides an improved speech quality evaluation scheme, in which the semantics-related speech quality of the test speech can be determined based on the time-frequency characteristics of the test speech, the speech recognition quality for the test speech can be predicted based on that semantics-related speech quality, and, by comparing the semantics-related speech quality and the predicted speech recognition quality with baselines, the problems affecting speech recognition quality can be determined, so that the speech recognition quality can be improved by locating or solving those problems.
  • the speech quality evaluation solution provided by the embodiment of the present application can be applied to application fields such as quality detection and evaluation in the speech recognition process.
  • for example, when applied in the smart cockpit of a vehicle, the solution can be used to determine the quality of the voice currently received in the cockpit, or the quality of voice recognition, and then give corresponding prompts or perform corresponding actions that can improve the voice quality in the cockpit, such as reducing the music volume in the cockpit, closing windows, or tuning parameters.
  • when applied to a smart terminal, the voice quality or voice recognition quality of the terminal's current environment can be evaluated, and corresponding prompts can then be given, for example prompting whether the microphone is blocked, or prompting to activate the camera so that lip recognition can be combined with speech recognition to improve recognition accuracy; permitted actions can also be performed directly (for example, when the corresponding application has permission to call the camera), such as starting the camera to combine lip recognition with the speech recognition.
  • when applied to a testing terminal used for quality testing, the testing terminal can be used to test the voice recognition function of a product under test, for example to test the voice recognition quality of a vehicle, so as to tune the vehicle's voice-recognition-related parameters.
  • the vehicle, smart terminal, and product under test usually have a microphone and a processor.
  • the microphone is used to collect the user's voice; the processor can be used to pre-process the collected voice, and perform voice recognition on the pre-processed voice to recognize it as text, and further recognize instructions based on the recognized text.
  • when applied to a vehicle or an intelligent terminal, the vehicle or intelligent terminal may also have a human-machine interface, which is used to provide the above corresponding prompts to the user through display or sound.
  • the processor may also perform parameter tuning of pre-processing or speech recognition according to user operations through a human-machine interface (Human Machine Interface, HMI).
  • the processor of the above-mentioned vehicle may be an electronic device, specifically a processor of a vehicle-mounted processing device such as a head unit or an on-board computer, or a conventional chip processor such as a central processing unit (CPU) or a micro control unit (MCU).
  • when applied to a testing terminal for quality testing, the testing terminal may have a human-machine interface to provide the above corresponding prompts to the user through display or sound.
  • the speech quality assessment solution provided by the embodiments of the present application may be embedded in the above-mentioned vehicles, smart terminals, and products under test, and exist as a functional module thereof.
  • when the voice quality evaluation solution provided by the embodiment of the present application is applied to an independent quality detection terminal, the detection terminal can communicate with the device under test by wire or wirelessly to obtain the required data, for example pre-processed voice data, based on which the voice quality can be tested or the voice recognition quality can be predicted, and a prompt of the test result can be given.
  • the detection terminal can also feed back the test results to the device under test, and the device under test performs operations such as parameter tuning according to the test results.
  • Fig. 1 shows a scenario in which the voice quality assessment solution provided by the embodiment of the present application is applied to a vehicle, which can be used for voice quality assessment, and can also be used for quality assessment of voice recognition.
  • the cockpit of the vehicle includes: a sound collection module 110 , a pre-processing module 120 , a speech recognition module 130 , an evaluation and prediction module 140 , and an output module 150 .
  • the pre-processing module 120, the speech recognition module 130 and the evaluation and prediction module 140 may be implemented by the same processor of the vehicle, or may be implemented by three or more processors respectively.
  • the sound collection module 110 can be a microphone, which can be used to collect test voices broadcast by users, wherein a test voice corresponds to a reference voice, and both correspond to the same sentence content, such as corresponding to the same voice command, and have the same semantics. In the stage of establishing each model for speech quality assessment, the sound collection module 110 is also used to collect reference speech.
  • the reference speech refers to the speech used for training the speech recognition module 130 or training the semantic-related feature model described later.
  • the semantic-related feature model is used for speech quality assessment, which will be further described later.
  • the pre-processing module 120 is used to perform pre-processing such as pre-emphasis, framing or windowing on the collected sound, so that the user's test voice contained in the sound can be recognized more easily.
  • pre-emphasis includes emphasizing the high-frequency part of the voice, removing the influence of lip radiation, and increasing the high-frequency resolution of the voice; framing uses the short-term stationarity of the voice signal to divide it into individual voice frames for processing, with overlap between adjacent frames so that the frames remain continuous; windowing strengthens the speech waveform near each frame's samples and weakens the rest of the waveform, so as to make the speech smooth.
  • the adjustable pre-processing parameters include one or more of the following: the high-frequency band targeted by pre-emphasis and the degree of emphasis, the frame length and overlap length used in framing, and the degree to which part of the waveform is strengthened or weakened in windowing.
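  • As an illustrative sketch only (the pre-emphasis coefficient, frame length, and hop length below are common defaults assumed here, not values specified by the application), the pre-emphasis, framing, and windowing steps could be implemented as follows:

```python
import numpy as np

def preprocess(signal: np.ndarray, sr: int,
               pre_emphasis: float = 0.97,   # degree of high-frequency emphasis (assumed default)
               frame_ms: float = 25.0,       # frame length in milliseconds (assumed default)
               hop_ms: float = 10.0) -> np.ndarray:
    """Pre-emphasis, framing with overlap, and Hamming windowing of a mono signal."""
    # Pre-emphasis: boost the high-frequency part of the voice.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # Framing: split into overlapping frames so adjacent frames share information.
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    # Assumes the signal is at least one frame long.
    num_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(num_frames)])

    # Windowing: strengthen the waveform near each frame's centre, weaken the edges.
    return frames * np.hamming(frame_len)
```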
  • a speech recognition (Automatic Speech Recognition, ASR) module 130 is used for recognizing the sentence content of the pre-processed test speech. Vocabulary in the test speech can be recognized through the speech recognition module and converted into computer-readable character sequences. After the content of the voice recognition is obtained, the control command can be further recognized based on the content, and the control command corresponding to the voice can be executed by the vehicle actuator.
  • the recognition process from speech recognition content to control instructions can identify control instructions based on keyword matching, and can also identify corresponding control instructions based on neural network semantic recognition technology.
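  • As a purely illustrative sketch of the keyword-matching approach (the keyword table and command names are hypothetical, not taken from the application), mapping recognized text to a control instruction could look like this:

```python
# Hypothetical keyword-to-command table for mapping recognized text to control instructions.
KEYWORD_COMMANDS = {
    ("open", "air conditioner"): "AC_ON",
    ("close", "window"): "WINDOW_CLOSE",
    ("volume", "up"): "VOLUME_UP",
}

def match_command(recognized_text: str):
    """Return the first command whose keywords all appear in the recognized text, else None."""
    text = recognized_text.lower()
    for keywords, command in KEYWORD_COMMANDS.items():
        if all(word in text for word in keywords):
            return command
    return None

print(match_command("please open the air conditioner"))  # -> "AC_ON"
```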
  • adjusting the parameters of the speech recognition module refers to adjusting the parameters of the speech recognition model of the speech recognition module, for example, adjusting the parameters and hyperparameters of the neural network of the speech recognition model.
  • the assessment and prediction module 140 is used to implement voice quality assessment, and can generate a semantically related voice quality assessment result for the test voice.
  • the speech recognition quality of the speech recognition module for the test speech can also be predicted according to the speech quality evaluation result.
  • the evaluation and prediction module is also described as including an evaluation module and a prediction module, respectively implementing the above speech quality evaluation and speech recognition quality prediction.
  • the output module 150 is used for outputting information such as speech quality assessment results or prediction results of speech recognition quality.
  • the output content can be provided to the vehicle controller, so that the vehicle can perform corresponding operations accordingly.
  • the outputted content can also be provided to the user through the man-machine interface of the vehicle.
  • the outputted information includes the quality of the test voice, the speech recognition quality of the speech recognition model, factors affecting the quality of the test voice, ways of adjusting the test voice quality, and the like.
  • the human-machine interface may include a display screen in the vehicle cockpit (a display screen such as a liquid crystal display screen, a head-up display (Head Up Display, HUD) and the like), and a speaker to prompt the user through images or sounds.
  • the man-machine interface may be a central control panel. After receiving the above-mentioned information output by the output module 150 through the central control panel, the user may use the man-machine interface to adjust the parameters of the pre-processing module 120 or the voice recognition module 130, or control the relevant actuators in the car, such as controlling the opening and closing of the car windows or controlling the playback volume of the audio playback device in the car.
  • the parameter adjustment interface provided by the man-machine interface can be displayed in a way that is easy to understand and adjust for ordinary users (such as graphical display), or can be displayed in a way for professional maintenance personnel.
  • the evaluation and prediction module 140 may also be deployed on an independent test device or in the cloud, and the output module 150 may also be deployed on an independent test device.
  • the test equipment here can be dedicated test equipment, or an intelligent terminal installed with corresponding software; for example, the intelligent terminal can be a mobile phone, a computer, a tablet computer, and the like.
  • the communication between the vehicle and the test equipment or cloud can be realized based on communication technology.
  • FIG. 2A shows the flow of an embodiment of a method for assessing voice quality.
  • the application to a vehicle is used as an example for illustration, which includes steps S210 to S230.
  • the test voice is obtained through a sound collection module arranged in the vehicle cabin.
  • the sound collection module may be a microphone, and in some embodiments, may be a plurality of microphones arranged in different positions of the vehicle cabin.
  • this step may be performed during the test or inspection of the vehicle, and the test voice may be broadcast by the tester.
  • this step may be performed while the user is using the vehicle, for example, when the vehicle is running or parked.
  • the test voice may be broadcast by the driver (ie, the user). Wherein, the test voice matches the voice command of the vehicle. Since the driver already knows the voice content used in the voice command, the voice command broadcast by the driver can be used as the test voice.
  • this step may be triggered when the vehicle cannot accurately recognize the voice command of the user (such as the driver), and use the voice command that the user has broadcast or re-broadcast as the test voice.
  • the vehicle can also prompt the user through the man-machine interface to enter the speech quality assessment process of this embodiment, and can further guide the user to broadcast the corresponding test voices.
  • the user can broadcast a certain voice content (such as a voice command) repeatedly; the multiple voices obtained in this way are also called the multiple voices corresponding to one group of test voices, or the multiple samples corresponding to one corpus item.
  • the user can also repeat the broadcast several times for several voice commands respectively, so as to obtain multiple voices corresponding to these groups of test voices respectively.
  • the multiple voices of "turn on the air conditioner" broadcast by the user for many times are a group of test voices
  • the multiple voices of "increase the volume" broadcasted by the user for many times are another group of test voices.
  • a group of test voices corresponds to a voice instruction, or corresponds to a semantic meaning, or corresponds to the same sentence content.
  • S220 Evaluate the speech quality of the test speech according to the semantic related information of the test speech.
  • the evaluation of the voice quality is performed by an evaluation module of the vehicle.
  • the evaluation module is implemented by a processor of the vehicle, and the processor is connected to the sound collection module by signal.
  • the speech quality evaluation result of the test speech is related to predetermined semantics, so that the speech quality evaluation result can not only be used to evaluate the speech quality, but also be used to predict the speech recognition quality of the speech recognition module.
  • the semantic-related information of the test speech is a semantic-related feature vector generated by the neural network for the test speech, and the feature vector is the output of any layer before the output layer of the neural network.
  • the feature vector may also be formed by cascading outputs of multiple layers before the output layer of the neural network, and the multiple layers may be any two or more layers.
  • the semantic related information of the test speech may be a one-dimensional vector formed by the output of the output layer of the neural network.
  • the output of the output layer corresponds to a one-dimensional vector formed by each confidence level of each voice instruction (ie, each semantic meaning).
  • S230 Output an evaluation result of the speech quality of the test speech.
  • the human-machine interface can include a display screen, which can be a vehicle central control screen, a head-up display (Head Up Display, HUD), or an augmented reality head-up display (Augmented Reality-HUD, AR-HUD), etc.
  • the human-machine interface can also include a speaker, and can also include an input component, which can be a touch screen integrated in the display screen, an independent button, etc.
  • the prompt can be executed in the form of image, text, or voice.
  • the evaluation result includes content related to the voice quality of the test voice
  • the content related to the test voice quality may include one or a combination of the following: the quality of the test voice, factors affecting the quality of the test voice, and adjusting the quality of the test voice The way.
  • the factors that affect the quality of the test voice may include one or a combination of the following: the window is open, and this factor introduces noise from outside the vehicle; the vehicle speed is too high, and this factor makes tire noise or engine noise too loud; Other sounds in the car are too loud, such as music played by the car, etc.; the performance or location of the microphone in the car.
  • the way of adjusting the voice quality of the test may include one or a combination of the following: closing the car window, reducing the speed of the car, reducing the sound in the car, replacing a problematic microphone, and tuning the parameters of the pre-processing module.
  • step S220 includes steps S221 to S225.
  • S221 Obtain a first feature vector of the test speech.
  • the first feature vector includes time-frequency features of the test speech.
  • multiple feature vectors including frequency domain features of each consecutive frame of the test speech may be obtained, and the multiple feature vectors constitute the first feature vector.
  • adjacent frames may have overlapping information.
  • the pre-processing process of the pre-processing module may perform frame division processing on the acquired test voice to form the continuous frames.
  • the pre-processing process also includes processing such as pre-emphasis and windowing.
  • since the first feature vector includes the time-frequency features of the test speech, it is also called a time-frequency feature map: a two-dimensional representation in which one dimension is the time coordinate and the other dimension is the frequency coordinate.
  • the value at each point is the intensity of the corresponding frequency in each successive frame.
  • the frequency domain features include one or a combination of the following: Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), spectrum.
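  • As an illustrative sketch (the FFT size and log compression are assumptions; MFCC or LPCC features could be substituted for the plain magnitude spectrum used here), the first feature vector can be assembled by stacking per-frame frequency-domain features into a time-frequency map:

```python
import numpy as np

def time_frequency_features(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Per-frame magnitude spectra stacked into a 2-D map: rows = frames (time), cols = frequency."""
    # rfft of each windowed frame gives its frequency-domain content.
    spectra = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # Log compression is commonly applied before feeding features to a neural network.
    return np.log(spectra + 1e-10)

# Example: 'frames' as produced by the framing/windowing sketch above.
# first_feature_vector = time_frequency_features(frames)  # shape: (num_frames, n_fft // 2 + 1)
```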
  • S223 Obtain a second feature vector of the test voice according to the first feature vector of the test voice.
  • the second feature vector is related to the semantics of speech.
  • the semantically relevant feature model is used to extract the second feature vector based on the first feature vector, wherein the semantically relevant feature model represents the relationship between semantics and time-frequency features, so the extracted second feature vector is related to semantics .
  • the semantically relevant feature model is constructed according to the first feature vector of the reference speech and the semantics of the reference speech.
  • a semantically relevant feature model can be constructed based on a neural network, and the neural network can be a fully connected neural network (Fully Connected Neural Network, FCNN), a recurrent neural network (Recurrent Neural Network, RNN), a convolutional neural network (Convolutional Neural Network, CNN), etc. This embodiment adopts a CNN.
  • the semantic-related feature model can be trained in combination with the pre-processing module.
  • the reference speech has semantic annotations. The reference speech is passed through the pre-processing module to obtain the first feature vector; the first feature vector is then input into the semantically relevant feature model, and the model is trained according to whether its semantic output converges.
  • gradient descent, adversarial-network-based methods, and the like can be used for training.
  • the semantic-related feature model constructed based on the neural network includes a multi-layer network, and the output of the semantic-related feature model corresponds to each semantic classification.
  • the second feature vector may be a feature vector output by any layer of the multi-layer network, since the first feature vector is used as the input of the neural network, the second feature vector is a feature vector extracted based on the first feature vector .
  • since each semantic corresponds to an output of the neural network (the neural network is equivalent to a classification network, with each category corresponding to one semantic), the second feature vector can be understood as a feature vector related to semantics.
  • when the output of a layer before the output layer of the neural network is used as the second feature vector, the second feature vector may be a feature vector with more dimensions than the first feature vector.
  • the operation results of the multiple convolution kernels constitute multiple dimensions of the second feature vector.
  • the second feature vector can also be a feature vector formed by cascading the output vectors of two or more layers of the neural network (similar to a residual connection), so that the second feature vector can contain both low-level and high-level features at the same time.
  • the output of the second layer network of the neural network and the output of the fourth layer network are cascaded to form the second feature vector.
  • when the output of the output layer of the neural network is used as the second feature vector, the second feature vector is a vector composed of the confidence levels corresponding to each semantic. For example, when the output of the neural network has 20 nodes, and the 20 nodes correspond to 20 classifications (that is, to identifying one of 20 semantics), the second feature vector can be a one-dimensional vector with 20 parameters, where the value of each parameter is the confidence of the corresponding semantic.
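  • A schematic PyTorch sketch of such a semantically relevant feature model (the architecture, layer sizes, input shape, and the choice of which layer to tap are assumptions for illustration): the network classifies the semantics of a time-frequency feature map, and either an intermediate-layer activation or the output confidence vector can be taken as the second feature vector.

```python
import torch
import torch.nn as nn

class SemanticFeatureModel(nn.Module):
    """Toy CNN: input is a (1, time, freq) time-frequency map, output is per-semantic confidences."""
    def __init__(self, num_semantics: int = 20):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.conv2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4))
        self.classifier = nn.Linear(32 * 4 * 4, num_semantics)

    def forward(self, x: torch.Tensor):
        low = self.conv1(x)              # lower-level features
        high = self.conv2(low)           # higher-level features
        flat = torch.flatten(high, 1)
        logits = self.classifier(flat)   # one score per semantic class
        # An intermediate activation (or a cascade of several layers) can serve as
        # the second feature vector; the softmax output is the confidence vector.
        return flat, logits

model = SemanticFeatureModel()
tf_map = torch.randn(1, 1, 64, 257)                 # one first-feature-vector map (assumed shape)
second_feature_vector, logits = model(tf_map)
confidences = torch.softmax(logits, dim=1)          # alternative choice of second feature vector
```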
  • the reference voice set includes several groups of reference voices, each group of reference voices corresponds to the same semantics, for example, the semantics can be "turn on the air conditioner", “turn off the air conditioner”, “turn up the volume”, “turn down the volume”, etc. Semantics of the command.
  • the reference voice can be understood as a voice collected in an environment with little noise, or as a standard voice.
  • S225 Evaluate the voice quality of the test voice according to the second feature vector of the test voice.
  • the evaluated voice quality is also related to semantics.
  • the voice quality of the test voice can be understood as the voice quality of the test voice set.
  • the voice quality evaluation result includes the following three evaluation indicators:
  • the first evaluation index represents an index of the concentration of the center positions of the groups of test voices, wherein different groups of test voices have different semantics, and the center position of each group of test voices is the center position of that group's second feature vectors in the second feature space.
  • a certain group of test voices may include multiple test voices, and the center position of the group of test voices may be obtained through calculation based on the distribution of the second feature vectors of the multiple test voices.
  • Mahalanobis distance, Euclidean distance, or other distance measures that can quantify the similarity between samples can be used for the concentration index.
  • the evaluation method of the first evaluation index can be as follows: first, for each group of test voices, calculate the distances between the center position of that group and the center positions of the other groups of test voices, and take the minimum of these distances; then, take the average of the minimum distances obtained for all groups of test voices as the index of the concentration of the group center positions.
  • the calculation of the first evaluation index D1 can be shown in the following formula (1): D1 = (1/C) · Σ_{j=1..C} min_{i≠j} dist(μtj, μti), where C represents the number of groups (i.e., corpus types) in the test voice set, μtj represents the center position of the j-th group of test voices in the test voice set, μti represents the center position of the i-th group of test voices, and dist(·,·) is the chosen distance measure (e.g., Euclidean or Mahalanobis distance).
  • the second evaluation index represents the degree of dispersion index of the second feature vector of each group of the test speech in the second feature space, wherein, the test speech of different groups has different semantics, and the second feature space is The space in which the second eigenvector resides.
  • the evaluation method of the second evaluation index can be as follows: first, for each group of test speech, calculate the semi-axis lengths of the distribution of that group's second feature vectors in the second feature space; then, take the mean of the semi-axis lengths obtained for all groups of test speech as the dispersion index of the groups of test speech.
  • the calculation of the second evaluation index D2 can be shown in the following formula (2): D2 = (1/(C·d)) · Σ_{j=1..C} Σ_{k=1..d} fjk, where C represents the number of groups in the test speech set, d represents the dimension of the second feature space, fjk represents the length of the k-th semi-axis of the distribution of the j-th group's second feature vectors (obtainable, for example, from the eigenvalues of the covariance matrix), and the covariance matrix is that of the second feature vector distribution of the j-th group of test speech in the test speech set.
  • the third evaluation index is an index of the similarity between the center position of each group of test voices and the center position of the semantically corresponding group of reference voices; wherein different groups of test voices have different semantics, the center position of each group of test voices is the center position of that group's second feature vectors in the second feature space, the second feature space is the space where the second feature vectors are located, different groups of reference voices have different semantics, and the center position of each group of reference voices is the center position of that group's second feature vectors in the second feature space.
  • the evaluation method of the third evaluation index can be as follows: first, in the space where the second feature vectors are located, for each group of test voices, calculate the distance between the center position of that group of test voices and the center position of the group of reference voices having the same semantics (i.e. corresponding to the same sentence content); then, take the average of the distances obtained for all groups of test voices as the index of similarity between the test voice set and the reference voice set.
  • the center position of a certain group of test voices refers to the distribution center of the second feature vectors of the group of test voices
  • the center position of a certain group of reference voices refers to the distribution center of the second feature vectors of the group of reference voices.
  • C represents the number of grouping types in the test voice set
  • μ_rj represents the center position of the j-th group of reference voices
  • μ_tj represents the center position of the j-th group of test voices
  • represents the second feature vectors of the j-th group of reference voices in the reference voice set
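  • the line introducing formula (3) also appears to have been lost in extraction; from the evaluation method described above (distance between each test-voice group center and the center of the semantically matching reference-voice group, averaged over groups), a hedged reconstruction is:

$$D_3=\frac{1}{C}\sum_{j=1}^{C}\operatorname{dist}\left(\mu_{tj},\,\mu_{rj}\right)\qquad(3)$$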
  • the speech quality assessment result includes one of the above assessment indicators, and may also include a combination of any number of assessment indicators, and different assessment indicators may have different weights.
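  • as a concrete but purely illustrative form of such a combination (the patent does not prescribe one), the overall quality score could be a weighted sum of the individual indices,

$$D=w_1 D_1+w_2 D_2+w_3 D_3$$

  with the weights w1, w2, w3 chosen according to how strongly each index should influence the assessment.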
  • the embodiment of the present application also provides a speech recognition quality prediction method, which can predict the speech recognition quality of the test speech based on the speech quality evaluation result of the test speech.
  • FIG. 3A shows the flow of an embodiment of the speech recognition quality prediction method, which includes steps S310 to S340.
  • S320 Evaluate the speech quality of the test speech according to the semantic related information of the test speech.
  • the speech recognition quality prediction method and the aforementioned speech quality evaluation method can be integrated and executed in one process.
  • the content described in the above steps S310 and S320 can directly use the execution results of the aforementioned steps S210 and S220; there is no need to repeat the same content.
  • S330 Predict the speech recognition quality of the speech recognition model for the test speech by using the speech recognition quality function according to the evaluated test speech quality.
  • the speech recognition quality function is used to indicate the relationship between speech recognition quality and speech quality.
  • the recognition result of the speech recognition model is used in constructing the speech recognition quality function, so the predicted speech recognition quality can be used as the prediction result of the recognition quality of the speech recognition model.
  • the speech recognition model can be realized by the above-mentioned speech recognition module.
  • the outputted speech recognition quality prediction results include: the predicted speech recognition quality of the speech recognition module, factors affecting the speech recognition quality of the speech recognition module, and ways to adjust the speech recognition quality of the speech recognition module.
  • the factors affecting the voice recognition quality of the voice recognition module include: open windows, an excessively high vehicle speed, excessively loud sound inside the vehicle, the performance and mounting position of the microphone, and the performance of the voice recognition model of the voice recognition module.
  • the manner of adjusting the voice recognition quality of the voice recognition module includes one or a combination of the following: closing the windows, reducing the vehicle speed, reducing the sound inside the vehicle, replacing the problematic microphone or optimizing its placement, tuning the parameters of the pre-processing module, and tuning the parameters of the speech recognition model of the voice recognition module.
  • the prediction result can be output to the man-machine interface of the vehicle, so as to show the evaluation result to the user.
  • the speech recognition quality function in the above step S330 can be constructed as shown in FIG. 3B , and the construction process of the function includes steps S321 to S327.
  • S321 Perform multiple degradations on each reference voice in the reference voice set, each degradation forming one set of degraded voices, thereby obtaining multiple sets of degraded reference voices.
  • the reference voices are degraded to different degrees, or in different ways, to generate multiple sets of degraded voices; the degraded voices of one degree, or of one degradation method, constitute one set of degraded voices.
  • the degradation method may be voice scrambling.
  • each reference voice in the reference voice set may be scrambled according to the possible noise environment of the vehicle, such as adding background music interference, simulated tire noise, wind noise interference, and simulated interference from the horns of other vehicles outside the car.
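  • as a minimal sketch of such scrambling, the degradation can be implemented as additive noise mixing at a controlled signal-to-noise ratio; the function below and the SNR values in the comment are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def degrade_with_noise(reference: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise recording (music, tire, wind or horn noise) into a reference
    utterance at a target SNR, producing one degraded copy of the reference voice."""
    # Tile or truncate the noise so it covers the whole utterance.
    if len(noise) < len(reference):
        noise = np.tile(noise, int(np.ceil(len(reference) / len(noise))))
    noise = noise[: len(reference)]
    sig_power = np.mean(reference ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(sig_power / scaled_noise_power) equals snr_db.
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return reference + scale * noise

# One degraded set per condition, e.g. the same noise at decreasing SNR levels:
# degraded_sets = {snr: [degrade_with_noise(ref, tire_noise, snr) for ref in reference_set]
#                  for snr in (20, 10, 5, 0)}
```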
  • each set of degraded reference voices is used as the test voice, and the voice quality evaluated for each set of degraded reference voices is obtained according to the voice quality assessment method in the foregoing embodiments, and the voice quality of each set of degraded reference voices constitutes the first evaluation result.
  • S325 Perform recognition and statistics on multiple sets of degraded reference voices according to the above-mentioned voice recognition model, and use the statistical voice recognition results as a first statistical result.
  • each set of degraded reference voices is recognized by the speech recognition model, producing a speech recognition result for each set of degraded reference voices; the speech recognition results of all sets of degraded reference voices constitute the first statistical result.
  • S327 Obtain a speech recognition quality function according to a functional relationship between the first statistical result and the first evaluation result.
  • the speech recognition quality function can be constructed based on machine learning. For example, for the first statistical result and the first evaluation result of each set of degraded reference voices, the speech recognition quality function can be constructed by fitting a polynomial. It can also be constructed based on deep learning, for example by training a neural network model.
  • each evaluation index in the first evaluation result is used as an input (independent variable) of the speech recognition quality function to be constructed.
  • one or more indices in the first evaluation result are combined, and the combined index is used as the input variable to construct the speech recognition quality function.
  • the evaluation indicators here are, for example, the indicators shown in the above formula (1) to formula (3).
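  • a minimal sketch of the polynomial-fitting option mentioned above is given below; reducing the first evaluation result to a single scalar quality score per degraded set, the sample values, and the polynomial order are all illustrative assumptions — the patent equally allows multivariate fits over the individual indices or a neural-network model:

```python
import numpy as np

# One scalar quality score per degraded reference set (first evaluation result) and
# the measured recognition accuracy of the model on that set (first statistical result).
quality_scores  = np.array([0.92, 0.75, 0.61, 0.40])   # illustrative values
recognition_acc = np.array([0.98, 0.93, 0.81, 0.55])   # illustrative values

# Fit a low-order polynomial as the speech recognition quality function.
coeffs = np.polyfit(quality_scores, recognition_acc, deg=2)
quality_fn = np.poly1d(coeffs)

# Later (steps S330 / S540): predict recognition quality from an evaluated speech quality.
predicted_acc = float(quality_fn(0.70))
```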
  • the embodiment of the present application also provides a method for improving speech recognition quality. Based on the voice quality assessment result of the test voice and the voice recognition quality prediction result, it can be determined whether the pre-processing module or the voice recognition module needs to be tuned in order to improve the voice recognition quality, as shown in Figure 4.
  • the process of the method embodiment includes steps S410 to S480.
  • for the voice quality evaluation step (S420), reference may be made to the above step S220 or the descriptions in its various embodiments, which will not be repeated here.
  • S430 Determine whether the voice quality is lower than a preset first baseline. When the voice quality is lower than the first baseline, perform step S440; otherwise, perform step S450.
  • the first baseline is used to judge the evaluation index of the voice quality, and may also be referred to as the index baseline.
  • in some embodiments, a first baseline is set separately for each evaluation index used to evaluate the voice quality; in other embodiments, the evaluation indices are combined into one or more combined indices, and corresponding first baselines are set for the combined indices.
  • S440 Output an evaluation result of the speech quality of the test speech.
  • for this step, reference may be made to the above-mentioned step S230 or the descriptions in its various embodiments, which will not be repeated here.
  • this step may return to step S410, or end this process.
  • step S450 may be continued, or step S480 may be executed to continue speech recognition.
  • S450 Predict speech recognition quality.
  • S460 Determine whether the predicted speech recognition quality is lower than a preset second baseline. When the speech recognition quality is lower than the preset second baseline, perform step S470; otherwise, perform step S480.
  • the second baseline is used to judge the speech recognition quality, that is, the speech recognition accuracy of the speech recognition model, and may also be referred to as the accuracy baseline.
  • this step may return to step S410, or end this process.
  • step S480 may also be performed to continue speech recognition.
  • S480 Recognize the test voice by the voice recognition module.
  • when the speech quality evaluated in step S430 is higher than the first baseline and the speech recognition quality predicted in step S460 is higher than the second baseline, the accuracy of the speech recognition is considered likely to be high, and the recognition result can be used for subsequent purposes, for example for controlling the vehicle.
  • when this step is entered with the speech quality evaluated in step S430 lower than the first baseline, or with the speech recognition quality predicted in step S460 lower than the second baseline, the speech recognition result may additionally be presented to the user for confirmation, so as to determine whether to use the speech recognition result.
  • the user may adjust the pre-processing parameters or the speech recognition model parameters according to the evaluation result of the speech quality output in step S440 or the prediction result of the speech recognition quality output in step S470, so as to improve the speech recognition quality.
  • the device under test, such as a vehicle, may also automatically adjust the pre-processing parameters or the speech recognition model parameters according to the evaluation result of the speech quality output in step S440 or the prediction result of the speech recognition quality output in step S470.
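  • the gating logic of steps S420 to S480 can be summarised by the following sketch; the helper names are placeholders, and treating the quality and the baselines as single scalars is a simplification, since the method allows one baseline per evaluation index:

```python
def handle_utterance(test_speech, index_baseline, accuracy_baseline):
    """Sketch of the S420-S480 flow: evaluate quality, gate on the two baselines,
    and either report a tuning hint or run recognition."""
    quality = evaluate_speech_quality(test_speech)        # S420, e.g. via formulas (1)-(3)
    if quality < index_baseline:                          # S430
        report_quality_result(quality)                    # S440: hints at pre-processing tuning
        return None
    predicted = predict_recognition_quality(quality)      # S450, via the quality function
    if predicted < accuracy_baseline:                     # S460
        report_prediction_result(predicted)               # S470: hints at model tuning
        return None
    return recognize(test_speech)                         # S480
```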
  • the specific implementation of the method for improving speech recognition quality involves the steps of the speech quality assessment method and the steps of the speech recognition quality prediction method; with reference to the foregoing embodiments, the steps corresponding to these two parts can also be carried out independently as specific implementations of the speech quality assessment method and the speech recognition quality prediction method. To simplify the description, the separate specific implementations of these two parts will not be described in detail.
  • the flow of this specific implementation comprises the following steps:
  • S510 The vehicle acquires a test voice set based on the reference voice set through the microphone arranged in the cockpit.
  • the test voice set includes several groups.
  • the tester sits in the driver's seat of the vehicle and broadcasts each group of test voices in turn.
  • the semantics of each group can correspond to a common command.
  • Each group of test voices includes 10 test voices broadcast by the tester.
  • the content of the broadcast test voices corresponds to the content of the reference voice set.
  • the vehicle guides the testers to broadcast the test voices through the man-machine interface. For example, each voice content of the corresponding reference voice set and the number of times it is to be broadcast can be displayed on the screen in turn, and the testers broadcast accordingly.
  • S515 The collected test voices are pre-processed by the pre-processing module on the vehicle, which includes extracting the time-frequency feature vector map of each test voice in the test voice set, that is, the first feature vector.
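  • the patent does not fix the frame length, the overlap or the spectral representation used for the time-frequency feature vector map; the sketch below uses a conventional 25 ms window with a 10 ms hop and a log-magnitude spectrum purely for illustration:

```python
import numpy as np

def time_frequency_features(waveform: np.ndarray, sr: int = 16000,
                            frame_len: float = 0.025, hop: float = 0.010) -> np.ndarray:
    """Split the utterance into overlapping frames and take the log magnitude
    spectrum of each frame; the stacked spectra form the time-frequency
    feature vector map (the first feature vector)."""
    n = int(frame_len * sr)                  # samples per frame
    step = int(hop * sr)                     # hop size; step < n gives overlapping frames
    window = np.hanning(n)
    frames = [waveform[i:i + n] * window
              for i in range(0, len(waveform) - n + 1, step)]
    spectra = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return np.log(spectra + 1e-10)           # shape: (num_frames, num_freq_bins)
```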
  • S520 Using the semantic correlation feature model, obtain a semantic correlation feature vector of the test voice in the test voice set, that is, a second feature vector, according to the time-frequency feature vector map.
  • S525 Evaluate the speech quality of the test speech set based on the semantically relevant feature vectors of the test speech.
  • the speech quality can be evaluated using one or more of the above formulas (1) to (3).
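  • a compact numpy sketch of the three indices, following the reconstructions of formulas (1) to (3) given earlier, is shown below; using the group mean as the center position, the Euclidean distance, and covariance eigenvalues for the semi-axis lengths are all assumptions where the patent leaves the concrete choice open:

```python
import numpy as np

def evaluate_indices(test_groups, ref_groups):
    """test_groups / ref_groups: one array per semantic group, each of shape
    (num_utterances, embedding_dim), holding second feature vectors.
    Returns (D1, D2, D3); each group needs at least two utterances."""
    mu_t = np.array([g.mean(axis=0) for g in test_groups])
    mu_r = np.array([g.mean(axis=0) for g in ref_groups])

    # D1: for each group, distance to the nearest other group center, averaged.
    pair = np.linalg.norm(mu_t[:, None, :] - mu_t[None, :, :], axis=-1)
    np.fill_diagonal(pair, np.inf)
    d1 = pair.min(axis=1).mean()

    # D2: mean semi-axis length (sqrt of covariance eigenvalues) per group, averaged.
    d2 = np.mean([np.sqrt(np.clip(np.linalg.eigvalsh(np.cov(g, rowvar=False)), 0.0, None)).mean()
                  for g in test_groups])

    # D3: distance between each test-group center and its reference-group center, averaged.
    d3 = np.linalg.norm(mu_t - mu_r, axis=1).mean()
    return d1, d2, d3
```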
  • S530 According to the evaluation result of the voice quality, judge whether the voice quality is lower than the set index baseline; if it is lower than the index baseline, go to step S535, otherwise go to step S540.
  • S535 Output an evaluation result of the speech quality of the test speech.
  • the evaluation result can be output to the man-machine interface, and the evaluation result includes content related to the voice quality of the test voice.
  • the content related to the speech quality of the test speech displayed on the man-machine interface may include: prompting the user to tune the parameters and algorithms in the pre-processing module.
  • the adjustable interface and parameters can be displayed in a graphical form.
  • S540 Using the speech recognition quality function, based on the speech quality of the test speech set, predict the speech recognition quality of the speech recognition module for the test speech set.
  • S545 Determine whether the predicted recognition quality is lower than the set accuracy baseline; if it is lower than the accuracy baseline, go to step S550, otherwise go to step S555.
  • the prediction result can be output to the man-machine interface, and the prediction result includes content related to speech recognition quality.
  • the content related to the speech recognition quality displayed on the man-machine interface includes: prompting the user to optimize the speech recognition model of the speech recognition module.
  • the adjustable interface and parameters can be displayed in a graphical form.
  • the speech recognition quality function can be further optimized.
  • the steps in the speech recognition quality function construction method can be used to retrain the speech recognition quality function to optimize the speech recognition quality function.
  • the speech recognition quality function can be retrained according to the above-mentioned steps S323 to S327, thereby completing the retraining of the quality function.
  • the embodiments of the present application also provide corresponding devices. For the beneficial effects of the devices or the technical problems they solve, reference may be made to the descriptions of the methods corresponding to the respective devices, or to the descriptions in the summary of the invention; they are only described briefly here. Each device in this embodiment may be used to implement the optional embodiments of the foregoing methods. The device embodiments of the present application are described below with reference to the figures.
  • the speech quality assessment device provided by the embodiment of the present application can be used to implement various embodiments of the speech quality assessment method. As shown in FIG. 6A , the device has an acquisition module 610 , an evaluation module 620 and an output module 630 .
  • the obtaining module 610 is used for obtaining test voice. It is specifically used to execute the foregoing step S210 and various embodiments thereof.
  • the evaluation module 620 is used for evaluating the speech quality of the test speech according to the semantic related information of the test speech. It is specifically used to execute the foregoing step S220 and various embodiments thereof.
  • the output module 630 is configured to output the evaluation result of the voice quality. It is specifically used to execute the foregoing step S230 and various embodiments thereof.
  • the evaluation result of the voice quality output by the output module 630 includes one or more of the following: the quality of the test voice; factors affecting the quality of the test voice; adjusting the quality of the test voice The way.
  • the evaluation module 620 is specifically configured to: obtain a first feature vector of the test voice, wherein the first feature vector includes a time-frequency feature vector of the test voice; Obtaining a second feature vector of the test speech from the first feature vector, wherein the second feature vector is related to the semantics of the test speech; evaluating the speech quality of the test speech according to the second feature vector.
  • when evaluating the speech quality of the test speech according to the second feature vector, the evaluation module 620 is configured to: use a first evaluation index to evaluate the speech quality of the test speech, the first evaluation index including a degree-of-concentration index of the center positions of the groups of test voices, wherein the test voices of different groups have different semantics, the center position of each group of test voices is the center position of the second feature vectors of that group of test voices in the second feature space, and the second feature space is the space where the second feature vectors are located.
  • when evaluating the speech quality of the test speech according to the second feature vector, the evaluation module 620 is configured to: use a second evaluation index to evaluate the speech quality of the test speech, the second evaluation index including a degree-of-dispersion index of the second feature vectors of each group of test voices in the second feature space, wherein the test voices of different groups have different semantics and the second feature space is the space where the second feature vectors are located.
  • when evaluating the speech quality of the test speech according to the second feature vector, the evaluation module 620 is configured to: use a third evaluation index to evaluate the speech quality of the test speech, the third evaluation index including a degree-of-similarity index between the center position of each group of test voices and the center position of the group of reference voices with the corresponding semantics; wherein the test voices of different groups have different semantics, the center position of each group of test voices is the center position of the second feature vectors of that group of test voices in the second feature space, the second feature space is the space where the second feature vectors are located, the reference voices of different groups have different semantics, and the center position of each group of reference voices is the center position of the second feature vectors of that group of reference voices in the second feature space.
  • when obtaining the first feature vector of the test speech, the evaluation module 620 is configured to: obtain consecutive frames contained in the test speech, wherein adjacent frames contain overlapping information; and obtain, from the consecutive frames, a plurality of feature vectors including frequency-domain features, the plurality of feature vectors constituting the first feature vector.
  • the embodiment of the present application also provides a speech recognition quality prediction device, which can be used to implement the method embodiments of speech recognition quality prediction; as shown in Figure 6B, the device has an acquisition module 612, a prediction module 622 and an output module 632.
  • the obtaining module 612 is used for obtaining the test voice. It is specifically used to execute the above step S310 and various embodiments thereof.
  • the prediction module 622 is used to predict the speech recognition quality of the speech recognition model for the test speech according to the speech recognition quality function, the speech recognition quality function represents the relationship between the speech recognition quality and the speech quality, and the speech quality is based on the aforementioned Any one of the possible embodiments of the voice quality assessment method is evaluated. It is specifically used to execute the above steps S320-S330 and various embodiments thereof.
  • the output module 632 is used to output the prediction result of speech recognition quality. It is specifically used to execute the above step S340 and various embodiments thereof.
  • the prediction result of the speech recognition quality output by the output module 632 includes one or more of the following: the speech recognition quality of the speech recognition model; factors affecting the speech recognition quality of the speech recognition model; and ways of adjusting the speech recognition quality of the speech recognition model.
  • the construction process of the speech recognition quality function includes: obtaining multiple sets of degraded reference voices of the reference voices; obtaining speech recognition results of the multiple sets of degraded reference voices according to the speech recognition model, and using them as a first statistical result; and respectively using the multiple sets of degraded reference voices as test voices, and obtaining voice quality assessment results of the multiple sets of degraded reference voices according to any possible embodiment of the voice quality assessment method, to obtain a first evaluation result;
  • the speech recognition quality function is obtained according to a functional relationship between the first statistical result and the first evaluation result.
  • the embodiment of the present application also provides a device for improving speech recognition quality, which can be used to implement the method embodiment for improving speech recognition quality; as shown in Figure 6C, the device has an acquisition module 614, an evaluation and prediction module 624 and an output module 634.
  • the obtaining module 614 is used for obtaining the test voice. It is specifically used to execute the above step S410 and various embodiments thereof.
  • the evaluation prediction module 624 is used to evaluate the speech quality of the test speech. Any possible embodiment of the foregoing voice quality assessment method may be used for assessment. It is specifically used to execute the above steps S420-S430 and various embodiments thereof.
  • the output module 634 is configured to output the assessment result of the voice quality when the voice quality is lower than the preset first baseline. It is specifically used to execute the above step S440 and various embodiments thereof.
  • the evaluation and prediction module 624 is further configured to predict the speech recognition quality according to the method described in any possible embodiment of the speech recognition quality prediction method when the speech quality is greater than or equal to the first baseline . It is specifically used to execute the above steps S450-S460 and various embodiments thereof.
  • the output module 634 is further configured to output the prediction result of the speech recognition quality when the speech recognition quality is lower than a preset second baseline. It is specifically used to execute the above step S470 and various embodiments thereof.
  • the embodiment of the present application also provides a vehicle. As shown in Figure 7A and Figure 7B, the vehicle includes a sound collection device 710, a pre-processing device 720, a speech recognition device 730, and the aforementioned speech quality evaluation device, speech recognition quality prediction device, or device for improving speech recognition quality.
  • the sound collecting device 710 is used for collecting the semantic commands based on the reference speech spoken by the driver.
  • it may be a microphone in the cockpit.
  • the microphone is set at the central control panel 740 , and may also be set at one or more other positions such as the instrument panel above the steering wheel, the rearview mirror in the cockpit, and the steering wheel.
  • the pre-processing device 720 is used for pre-processing the speech broadcast by the driver.
  • the voice recognition device 730 is used to recognize the driver's command when the voice quality of the driver's command and the predicted voice recognition quality meet the requirements.
  • the aforementioned speech quality evaluation device, speech recognition quality predicting device or speech recognition quality improving device is used for the aforementioned purpose, and the user can perform corresponding operations based on this to improve speech quality and speech recognition quality.
  • a central control panel 740 as a man-machine interface is also shown, through which the user receives the output of a speech quality evaluation device, a speech recognition quality prediction device, or a device for improving speech recognition quality.
  • the output information can be displayed, and a parameter adjustment interface can be presented with the help of the central control panel 740, which makes it convenient for the user to perform the above-mentioned parameter adjustment by operating the central control panel 740.
  • the pre-processing device 720, the speech recognition device 730, and the speech quality evaluation device, the speech recognition quality prediction device or the device for improving the speech recognition quality can be implemented by one or more processors in the vehicle; in this embodiment, they may be implemented by a processor of the vehicle infotainment system.
  • FIG. 8 is a schematic structural diagram of a computing device 800 provided by an embodiment of the present application.
  • the computing device 800 includes: a processor 810 , a memory 820 , and a communication interface 830 .
  • the communication interface 830 in the computing device 800 shown in FIG. 8 can be used to communicate with other devices.
  • the processor 810 may be connected to the memory 820 .
  • the memory 820 can be used to store program code and data. The memory 820 may be a storage module inside the processor 810, an external storage module independent of the processor 810, or a combination of a storage module inside the processor 810 and an external storage module independent of the processor 810.
  • the processor 810 executes the computer-executable instructions in the memory 820 to perform the operation steps of the above methods.
  • the embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and the program is used to execute one or more of the solutions described in the various embodiments of the present application when executed by a processor.
  • reference to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Therefore, the phrase "in an embodiment" appearing in various places in this specification does not necessarily refer to the same embodiment, although it may refer to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments, as would be apparent to one of ordinary skill in the art from this disclosure.

Abstract

The present application relates to the technical field of speech recognition. Embodiments of the present application provide a speech quality assessment method, a speech recognition quality prediction method, and a speech recognition quality improvement method. First, a test speech is acquired; then speech quality of the test speech is assessed according to semantic related information of the test speech to determine whether a parameter of speech preprocessing needs to be adjusted, and speech recognition quality is predicted according to the assessed speech quality to determine whether a parameter of a speech recognition model needs to be adjusted; and a corresponding assessment result or prediction result is outputted, such that the parameter of the preprocessing or the speech recognition model can be adjusted according to the assessment result or the prediction result. According to the present application, the parameter adjustment process of the preprocessing and the speech recognition model in the speech recognition process can be decoupled.

Description

Method and device for speech quality assessment, speech recognition quality prediction and improvement

Technical Field
本申请涉及语音识别技术领域,尤其涉及语音质量评估方法及装置、语音识别质量预测方法及装置、提高语音识别质量的方法及装置、车辆、计算机可读存储介质及计算机程序产品。The present application relates to the technical field of speech recognition, in particular to a method and device for evaluating speech quality, a method and device for predicting speech recognition quality, a method and device for improving speech recognition quality, a vehicle, a computer-readable storage medium, and a computer program product.
Background Art
语音识别系统中语音信号经传感器采集后会经过语音前处理模块增强,然后输送到语音识别模块进行语音唤醒与识别,所以语音识别系统的识别效果主要取决于语音识别模块的性能与语音信号质量两大因素。语音信号质量受制于环境、音频采集硬件及语音前处理模块的算法。In the voice recognition system, the voice signal will be enhanced by the voice pre-processing module after being collected by the sensor, and then sent to the voice recognition module for voice wake-up and recognition. Therefore, the recognition effect of the voice recognition system mainly depends on the performance of the voice recognition module and the quality of the voice signal. big factor. The quality of the voice signal is subject to the environment, audio collection hardware and the algorithm of the voice pre-processing module.
The industry generally tests speech recognition quality for performance tuning by jointly debugging the speech recognition module and the speech pre-processing module as a whole, and lacks available systems and standards that can separate the speech pre-processing module from the speech recognition module for independent tuning, so the quality of the voice signal output by the voice pre-processing module cannot be evaluated to provide tuning baselines and feedback for the acquisition hardware, the voice pre-processing module and the voice recognition module. This approach is not conducive to independently locating and solving problems related to voice signal quality and voice recognition module performance in an actual voice recognition service, and it easily makes faults of the voice recognition system difficult to locate. Decoupling the voice recognition module from the voice pre-processing module to obtain voice quality calibration and feedback is therefore of great significance for improving the efficiency of module fault diagnosis.
Summary of the Invention
In view of this, the present application provides a speech quality assessment method and device, a speech recognition quality prediction method and device, a method and device for improving speech recognition quality, a vehicle, a computer-readable storage medium and a computer program product.
To achieve the above purpose, the first aspect of the present application provides a voice quality assessment method, including: acquiring a test voice; evaluating the voice quality of the test voice according to the semantic related information of the test voice; and outputting the voice quality assessment result.
From the above, voice quality assessment can be performed on the test voice, and the assessment is based on information related to the semantics of the test voice, so semantic information is reflected in the assessed voice quality; the assessed voice quality can therefore be used to predict the speech recognition quality of the speech recognition model for the test voice. In addition, the output assessment result may include content related to the assessed voice quality, which the user can refer to in order to improve the voice quality.
For example, when applied to a vehicle, the voice quality in the vehicle can be monitored and fed back, so that the user can take corresponding measures to maintain or improve the in-vehicle voice interaction experience. In another possible implementation, the vehicle may also perform corresponding operations based on the output assessment result to improve the in-vehicle voice interaction experience, such as automatically closing the windows, automatically reducing the sound in the vehicle (for example, the sound of music played by the vehicle), or automatically performing parameter tuning.
As a possible implementation of the first aspect, the evaluation result of the voice quality includes one or more of the following: the quality of the test voice; factors affecting the quality of the test voice; and a manner of adjusting the quality of the test voice.
From the above, the output content can be set flexibly. For example, the quality of the test voice may be output as quantified specific parameters, in qualitative grades such as excellent, good and medium, or in combination with images and text. The factors affecting the test voice quality may be, for example, noise from outside the vehicle introduced because a window is open, excessive tire or engine noise caused by a high vehicle speed, or other sounds such as music played in the vehicle being too loud, which the user can refer to when deciding on the desired adjustment. The ways of adjusting the test voice quality may include closing the windows, reducing the vehicle speed, reducing the sound in the vehicle, replacing a problematic microphone, tuning the parameters of the pre-processing module, and so on.
As a possible implementation of the first aspect, evaluating the speech quality of the test speech according to the semantic related information of the test speech includes: obtaining a first feature vector of the test speech, where the first feature vector includes a time-frequency feature vector of the test speech; obtaining a second feature vector of the test speech according to the first feature vector of the test speech, where the second feature vector is related to the semantics of the test speech; and evaluating the speech quality of the test speech according to the second feature vector.
From the above, the first feature vector including the time-frequency feature vector of the speech can be obtained first, and the second feature vector related to the semantics of the speech can then be obtained based on it, so that the speech quality of the test speech is evaluated according to the second feature vector. The first feature vector is related to the time-frequency features, so the pre-processing parameters are linked to the second feature vector; the speech quality of the test speech evaluated from the second feature vector can therefore be used as a reference for tuning the pre-processing parameters.
As a possible implementation of the first aspect, evaluating the speech quality of the test speech according to the second feature vector includes: using a first evaluation index to evaluate the speech quality of the test speech, the first evaluation index including a degree-of-concentration index of the center positions of the groups of test voices, where different groups of test voices have different semantics, the center position of each group of test voices is the center position of the second feature vectors of that group of test voices in the second feature space, and the second feature space is the space where the second feature vectors are located.
From the above, the speech quality of the test speech can be evaluated according to the above optional evaluation index, which is calculated based on the second feature vector.
As a possible implementation of the first aspect, evaluating the speech quality of the test speech according to the second feature vector includes: using a second evaluation index to evaluate the speech quality of the test speech, the second evaluation index including a degree-of-dispersion index of the second feature vectors of each group of test voices in the second feature space, where different groups of test voices have different semantics and the second feature space is the space where the second feature vectors are located.
From the above, the speech quality of the test speech can be evaluated according to the above optional evaluation index, which is calculated based on the second feature vector.
As a possible implementation of the first aspect, evaluating the speech quality of the test speech according to the second feature vector includes: using a third evaluation index to evaluate the speech quality of the test speech, the third evaluation index including a degree-of-similarity index between the center position of each group of test voices and the center position of the group of reference voices with the corresponding semantics; where different groups of test voices have different semantics, the center position of each group of test voices is the center position of the second feature vectors of that group of test voices in the second feature space, the second feature space is the space where the second feature vectors are located, different groups of reference voices have different semantics, and the center position of each group of reference voices is the center position of the second feature vectors of that group of reference voices in the second feature space.
From the above, the speech quality of the test speech can be evaluated according to the above optional evaluation index, which is calculated based on the second feature vector.
As a possible implementation of the first aspect, obtaining the first feature vector of the test speech includes: obtaining consecutive frames contained in the test speech, where adjacent frames contain overlapping information; and obtaining, from the consecutive frames, a plurality of feature vectors including frequency-domain features, the plurality of feature vectors constituting the first feature vector.
From the above, when the first feature vector is obtained, the use of overlapping adjacent frames allows the relationship between frames to be carried into the calculation of the second feature vector.
To achieve the above purpose, the second aspect of the present application provides a speech recognition quality prediction method, including: obtaining a test speech; predicting the speech recognition quality of the speech recognition model for the test speech according to a speech recognition quality function, where the speech recognition quality function is used to indicate the relationship between speech recognition quality and speech quality, and the speech quality is evaluated according to the method of the first aspect or any possible implementation of the first aspect; and outputting a prediction result of the speech recognition quality.
From the above, the speech recognition quality of the speech recognition model is predicted from the speech quality, and the user can refer to the prediction result to improve the speech recognition quality. For example, when applied to a vehicle, the quality of in-vehicle speech recognition can be monitored and fed back, so that the user can take corresponding measures to maintain the in-vehicle voice interaction experience.
As a possible implementation of the second aspect, the output prediction result of the speech recognition quality includes one or more of the following: the speech recognition quality of the speech recognition model; factors affecting the speech recognition quality of the speech recognition model; and a manner of adjusting the speech recognition quality of the speech recognition model.
From the above, the content of the prediction result can be set flexibly. For example, the quality of the speech recognition model can be provided as quantified specific parameters, in qualitative grades such as excellent, good and medium, or in combination with images and text. The output factors affecting the speech recognition quality of the speech recognition model may be the factors in the method provided in the first aspect, or may be performance factors of the speech recognition model of the speech recognition module. The output manner of adjusting the test voice quality may be the adjustment manner provided in the method of the first aspect, or may be a prompt for tuning the parameters of the speech recognition model.
As a possible implementation of the second aspect, the construction process of the speech recognition quality function includes: obtaining multiple sets of degraded reference voices of the reference voices; obtaining speech recognition results of the multiple sets of degraded reference voices according to the speech recognition model as a first statistical result; using the multiple sets of degraded reference voices respectively as test voices and obtaining voice quality evaluation results of the multiple sets of degraded reference voices according to the method of the first aspect or any possible implementation of the first aspect to obtain a first evaluation result; and obtaining the speech recognition quality function according to the functional relationship between the first statistical result and the first evaluation result.
The above is one way of constructing the speech recognition quality function from the reference voices. Specifically, the reference voices are degraded without introducing additional reference voices, which effectively reduces the amount of reference voice data required.
To achieve the above purpose, the third aspect of the present application provides a method for improving speech recognition quality, including: obtaining a test speech; obtaining a speech quality evaluation result of the test speech according to the method of the first aspect or any possible implementation of the first aspect; and outputting the speech quality evaluation result when the speech quality is lower than a preset first baseline.
From the above, the speech quality evaluation result of the test speech can be obtained, and the result includes content related to the speech quality, which may include whether the pre-processing parameters need to be adjusted. For example, when applied to a vehicle, the in-vehicle voice quality can be monitored and fed back, so that the user can take corresponding measures to maintain the in-vehicle voice interaction experience. This process can be independent of the speech recognition process of the speech recognition model, realizing the decoupling of pre-processing and speech recognition.
As a possible implementation of the third aspect, the method further includes: when the speech quality is greater than or equal to the first baseline, predicting the speech recognition quality according to the second aspect or any possible implementation of the second aspect; and when the speech recognition quality is lower than a preset second baseline, outputting a prediction result of the speech recognition quality.
From the above, it can be determined from the test speech whether the pre-processing parameters or the speech recognition model should be adjusted, realizing the decoupling of pre-processing and speech recognition, which is beneficial to the independent location and performance tuning of the problems of each module.
To achieve the above purpose, the fourth aspect of the present application provides a voice quality assessment device, including:
an obtaining module, configured to obtain a test voice; an evaluation module, configured to evaluate the voice quality of the test voice according to the semantic related information of the test voice; and an output module, configured to output the voice quality evaluation result.
From the above, voice quality assessment can be performed on the test voice, and the assessment is based on information related to the semantics of the test voice, so semantic information is reflected in the assessed voice quality; the assessed voice quality can therefore be used to predict the speech recognition quality of the speech recognition model for the test voice. In addition, the output assessment result may include content related to the assessed voice quality, which the user can refer to in order to improve the voice quality. For example, when applied to a vehicle, the in-vehicle voice quality can be monitored and fed back, so that the user can take corresponding measures to maintain or improve the in-vehicle voice interaction experience. In a possible implementation for a vehicle, the vehicle may also perform corresponding operations based on the output assessment result to improve the in-vehicle voice interaction experience, such as automatically closing the windows, automatically reducing the sound in the vehicle (for example, the sound of music played by the vehicle), or automatically performing parameter tuning.
As a possible implementation of the fourth aspect, the evaluation result of the voice quality output by the output module includes one or more of the following: the quality of the test voice; factors affecting the quality of the test voice; and a manner of adjusting the quality of the test voice.
As a possible implementation of the fourth aspect, the evaluation module is specifically configured to: obtain a first feature vector of the test speech, where the first feature vector includes a time-frequency feature vector of the test speech; obtain a second feature vector of the test speech according to the first feature vector of the test speech, where the second feature vector is related to the semantics of the test speech; and evaluate the speech quality of the test speech according to the second feature vector.
As a possible implementation of the fourth aspect, when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it includes: using a first evaluation index to evaluate the speech quality of the test speech, the first evaluation index including a degree-of-concentration index of the center positions of the groups of test voices, where different groups of test voices have different semantics, the center position of each group of test voices is the center position of the second feature vectors of that group of test voices in the second feature space, and the second feature space is the space where the second feature vectors are located.
As a possible implementation of the fourth aspect, when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it includes: using a second evaluation index to evaluate the speech quality of the test speech, the second evaluation index including a degree-of-dispersion index of the second feature vectors of each group of test voices in the second feature space, where different groups of test voices have different semantics and the second feature space is the space where the second feature vectors are located.
As a possible implementation of the fourth aspect, when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it includes: using a third evaluation index to evaluate the speech quality of the test speech, the third evaluation index including a degree-of-similarity index between the center position of each group of test voices and the center position of the group of reference voices with the corresponding semantics; where different groups of test voices have different semantics, the center position of each group of test voices is the center position of the second feature vectors of that group of test voices in the second feature space, the second feature space is the space where the second feature vectors are located, different groups of reference voices have different semantics, and the center position of each group of reference voices is the center position of the second feature vectors of that group of reference voices in the second feature space.
As a possible implementation of the fourth aspect, when the evaluation module obtains the first feature vector of the test speech, it includes: obtaining consecutive frames contained in the test speech, where adjacent frames contain overlapping information; and obtaining, from the consecutive frames, a plurality of feature vectors including frequency-domain features, the plurality of feature vectors constituting the first feature vector.
为达到上述目的,本申请的第五方面提供一种语音识别质量预测装置,包括:获取模块,用于获取测试语音;预测模块,用于根据语音识别质量函数预测语音识别模型对测试语音的语音识别质量,语音识别质量函数用于指示语音识别质量与语音质量之间的关系,语音质量根据第一方面或第一方面的任意一种可能的实施方式的方法评估;输出模块,用于输出语音识别质量的预测结果。In order to achieve the above object, the fifth aspect of the present application provides a speech recognition quality prediction device, including: an acquisition module, used to obtain a test speech; a prediction module, used to predict the speech of the speech recognition model to the test speech according to the speech recognition quality function Recognition quality, the speech recognition quality function is used to indicate the relationship between the speech recognition quality and the speech quality, and the speech quality is evaluated according to the method of the first aspect or any possible implementation manner of the first aspect; the output module is used to output the speech Prediction results of recognition quality.
由上,实现了通过语音质量来预测语音识别模型对语音识别的质量,可以用于用户参考该预测结果改善语音识别质量。例如,应用于车辆时,可以实现对车内语音识别的质量进行监测与反馈,从而用户可以采取相应措施维持车内语音交互体验。From the above, it is realized to predict the speech recognition quality of the speech recognition model through the speech quality, which can be used by the user to refer to the prediction result to improve the speech recognition quality. For example, when applied to vehicles, it is possible to monitor and give feedback on the quality of in-vehicle voice recognition, so that users can take corresponding measures to maintain the in-vehicle voice interaction experience.
作为第五方面的一种可能的实施方式,输出模块所输出的语音识别质量的预测结果,包括以下一种或多种:语音识别模型的语音识别的质量;影响语音识别模型的语音识别质量的因素;调整语音识别模型的语音识别质量的方式。As a possible implementation of the fifth aspect, the prediction result of the speech recognition quality output by the output module includes one or more of the following: the speech recognition quality of the speech recognition model; Factor; a way to tune the speech recognition quality of a speech recognition model.
作为第五方面的一种可能的实施方式,语音识别质量函数的构建过程包括:获得参考语音的多套劣化参考语音;根据语音识别模型获得多套劣化参考语音的语音识别结果,并作为第一统计结果;分别以多套劣化参考语音作为测试语音,根据第一方面或第一方面的任意一种可能的实施方式的方法获得多套劣化参考语音的语音质量评估结果,得到第一评估结果;根据第一统计结果与第一评估结果的函数关系获得语音识别质量函数。As a possible implementation of the fifth aspect, the construction process of the speech recognition quality function includes: obtaining multiple sets of degraded reference speeches of the reference speech; Statistical results; using multiple sets of degraded reference voices as test voices respectively, according to the method of the first aspect or any possible implementation mode of the first aspect, the voice quality evaluation results of multiple sets of degraded reference voices are obtained, and the first evaluation result is obtained; A speech recognition quality function is obtained according to the functional relationship between the first statistical result and the first evaluation result.
To achieve the above object, a sixth aspect of the present application provides an apparatus for improving speech recognition quality, including: an acquisition module configured to acquire a test speech; an evaluation and prediction module configured to obtain a speech quality evaluation result of the test speech according to the method of the first aspect or any possible implementation thereof; and an output module configured to output the speech quality evaluation result when the speech quality is lower than a preset first baseline.
In this way, the speech quality evaluation result of the test speech can be obtained, and content related to the speech quality can be output according to that result, where the output content may include whether the pre-processing parameters should be adjusted. For example, when applied to a vehicle, the in-vehicle speech quality can be monitored and fed back, so that the user can take corresponding measures to maintain the in-vehicle voice interaction experience. This process can be independent of the speech recognition process of the speech recognition model, decoupling pre-processing from speech recognition.
As a possible implementation of the sixth aspect, the evaluation and prediction module is further configured to predict the speech recognition quality according to the method of the second aspect or any possible implementation thereof when the speech quality is greater than or equal to the first baseline; and the output module is further configured to output the prediction result of the speech recognition quality when the speech recognition quality is lower than a preset second baseline.
To achieve the above object, a seventh aspect of the present application provides a vehicle, including: a sound collection apparatus configured to collect a user's voice command; a pre-processing apparatus configured to pre-process the sound of the voice command; a speech recognition apparatus configured to recognize the pre-processed sound; and the apparatus of any of the fourth, fifth, and sixth aspects or any possible implementation thereof.
To achieve the above object, an eighth aspect of the present application provides a computing device, including one or more processors and one or more memories, where the memories store program instructions which, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect or any possible implementation thereof.
To achieve the above object, a ninth aspect of the present application provides a computer-readable storage medium storing program instructions which, when executed by a computer, cause the computer to implement the method of the first aspect or any possible implementation thereof.
To achieve the above object, a tenth aspect of the present application provides a computer program product including program instructions which, when executed by a computer, cause the computer to implement the method of the first aspect or any possible implementation thereof.
In summary, the embodiments of the present application decouple the evaluation of the speech pre-processing stage from the prediction of recognition by the speech recognition model, so that pre-processing problems and speech recognition problems can be located separately, which facilitates independent problem localization and performance tuning of each module. Moreover, owing to this decoupling, the evaluation of the test speech quality can be used to prompt the user to take corresponding actions to improve the in-vehicle voice interaction experience, and the prediction of speech recognition quality can likewise be used to prompt the user to take corresponding actions to improve that experience.
These and other aspects of the present application will be more clearly understood from the following description of the embodiment(s).
Description of drawings
The features of the present application and the connections between them are further described below with reference to the accompanying drawings. The drawings are exemplary; some features are not shown to scale, and some drawings may omit features that are customary in the field to which the application pertains and are not essential to the application, or may additionally show features that are not essential to the application. The combinations of features shown in the drawings are not intended to limit the application. Throughout the specification, the same reference numerals denote the same items. The specific drawings are described as follows:
FIG. 1 is a schematic structural diagram of application scenario 1 involved in an embodiment of the present application;
FIG. 2A is a schematic flowchart of a speech quality evaluation method provided by an embodiment of the present application;
FIG. 2B is a schematic flowchart of a method for evaluating the speech quality of a test speech provided by an embodiment of the present application;
FIG. 3A is a schematic flowchart of a speech recognition quality prediction method provided by an embodiment of the present application;
FIG. 3B is a schematic flowchart of a method for constructing a speech recognition quality function provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for improving speech recognition quality provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of a specific implementation of the method for improving speech recognition quality provided by an embodiment of the present application;
FIG. 6A is a schematic diagram of a speech quality evaluation apparatus provided by an embodiment of the present application;
FIG. 6B is a schematic diagram of a speech recognition quality prediction apparatus provided by an embodiment of the present application;
FIG. 6C is a schematic diagram of an apparatus for improving speech recognition quality provided by an embodiment of the present application;
FIG. 7A is a schematic structural diagram of a vehicle provided by an embodiment of the present application;
FIG. 7B is a schematic diagram of the cockpit of a vehicle provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a computing device provided by an embodiment of the present application.
Detailed description of embodiments
The technical solutions provided by the present application are further described below with reference to the accompanying drawings and embodiments. It should be understood that the system structures and service scenarios provided in the embodiments of the present application are mainly intended to illustrate possible implementations of the technical solutions of the present application and should not be construed as the only limitations on them. A person of ordinary skill in the art will appreciate that, as system structures evolve and new service scenarios emerge, the technical solutions provided in the present application remain applicable to similar technical problems.
It should be understood that the speech quality evaluation solution provided by the embodiments of the present application includes a speech quality evaluation method and apparatus, a speech recognition quality prediction method and apparatus, a method and apparatus for improving speech recognition quality, a computer-readable storage medium, and a computer program product. Since these technical solutions solve problems on the same or similar principles, some repetitions may be omitted in the following description of the specific embodiments, which should be regarded as cross-referencing each other and may be combined with each other.
The quality of speech under test may be evaluated with the mean opinion score (MOS) metric. This metric, also known as a subjective speech quality metric, evaluates the quality of the speech under test on several rating levels, and the quality is obtained by averaging the scores of all listeners.
When the mean opinion score is used to evaluate the quality of the speech under test, differences in hearing ability and subjective listening experience among listeners cause scoring differences; in particular, when only a single sentence is provided without context, the listeners' scores differ significantly. This results in low objectivity of the evaluation result, and the evaluation method has poor adaptability.
The quality of the speech under test may also be evaluated with objective metrics, including signal-to-noise ratio (SNR), Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Analysis (POLQA), and Short-Time Objective Intelligibility (STOI). The PESQ algorithm requires a noisy degraded signal and an original reference signal; after level alignment, input filtering, time alignment and compensation, and an auditory transform, the parameters of the two speech signals to be compared are extracted, their time-frequency characteristics are combined to obtain a PESQ score, and this score is finally mapped to the subjective mean opinion score (MOS). POLQA is the successor of PESQ, extended to handle audio signals of higher bandwidth. STOI is one of the important metrics of speech intelligibility and is used to evaluate the intelligibility of noisy speech that has been masked in the time domain, or short-time Fourier transformed and weighted in the frequency domain; the STOI score is obtained by comparing the clean speech with the speech to be evaluated.
Methods that evaluate the quality of the speech under test with the above objective metrics focus, from an acoustic point of view, on the correlation between sound characteristics and subjective listening perception, but their relationship to machine-oriented speech recognition performance (i.e., recognizing speech as text or semantics) is unclear, so they are difficult to use as an effective reference for the speech recognition module during the development and tuning of a speech recognition system.
The embodiments of the present application provide an improved speech quality evaluation solution, in which the semantics-related speech quality of a test speech can be determined based on the time-frequency features of the test speech, the recognition quality of the test speech can be predicted based on that semantics-related speech quality, and, by comparing the semantics-related speech quality and the predicted speech recognition quality against baselines, the problem affecting the speech recognition quality can be identified, so that the speech recognition quality can be improved by locating or solving that problem.
The speech quality evaluation solution provided by the embodiments of the present application can be applied to quality inspection, evaluation, and other applications in the speech recognition process. For example, when applied in the smart cockpit of a vehicle, it can be used to determine the quality of the speech currently received in the cockpit, or the speech recognition quality, and then give corresponding prompts or perform corresponding actions, such as lowering the volume of music in the cockpit, closing the windows, or tuning parameters, so as to improve the speech quality in the cockpit. As another example, when applied to a smart terminal such as a mobile phone or a smart speaker, the speech quality or speech recognition quality in the terminal's current environment can be evaluated and corresponding prompts can be given, such as prompting whether the microphone is blocked, or prompting to start the camera so that lip reading can be combined with speech recognition to improve recognition accuracy, or performing permitted actions (for example, when the corresponding application has permission to call the camera, starting the camera to combine lip reading with speech recognition). As another example, when applied to a test terminal for quality inspection, the test terminal can be used to test the speech recognition function of a product under test, for example testing the speech recognition quality of a vehicle, so as to tune the parameters of the vehicle related to speech recognition.
In some embodiments, the above vehicle, smart terminal, or product under test typically has a microphone and a processor. The microphone is used to collect the user's speech; the processor can be used to pre-process the collected speech and to perform speech recognition on the pre-processed speech so as to recognize it as text, and can further recognize instructions based on the recognized text. In some embodiments, when applied to a vehicle or a smart terminal, the vehicle or smart terminal may also have a human-machine interface for giving the above prompts to the user by display or sound. In some embodiments, the processor may also perform parameter tuning of the pre-processing or the speech recognition according to the user's operations through the human-machine interface (HMI). In some embodiments, the processor of the vehicle may be an electronic device, specifically the processor of an in-vehicle processing apparatus such as a head unit or an on-board computer, or a conventional chip processor such as a central processing unit (CPU) or a microprocessor (micro control unit, MCU). In some embodiments, when applied to a test terminal for quality inspection, the test terminal may have a human-machine interface to give the above prompts to the user by display or sound.
In some embodiments, the speech quality evaluation solution provided by the embodiments of the present application may be embedded in the above vehicle, smart terminal, or product under test as one of its functional modules. In some embodiments, when the solution is applied to an independent quality-inspection test terminal, the test terminal may communicate with the device under test, such as the above vehicle, smart terminal, or product under test, in a wired or wireless manner to obtain the required data, for example pre-processed speech data, based on which speech quality can be checked or speech recognition quality can be predicted, and a prompt of the test result can be given. In some embodiments, the test terminal may also feed the test result back to the device under test, and the device under test performs operations such as parameter tuning according to the test result.
Below, with further reference to FIG. 1, a scenario in which the speech quality evaluation solution provided by the embodiments of the present application is applied to a vehicle is outlined.
FIG. 1 shows a scenario in which the speech quality evaluation solution provided by an embodiment of the present application is applied to a vehicle; it can be used for speech quality evaluation and also for evaluating speech recognition quality. In this scenario, the cockpit of the vehicle includes a sound collection module 110, a pre-processing module 120, a speech recognition module 130, an evaluation and prediction module 140, and an output module 150. The pre-processing module 120, the speech recognition module 130, and the evaluation and prediction module 140 may be implemented by the same processor of the vehicle, or by three or more processors respectively.
The sound collection module 110 may be a microphone and may be used to collect test speech uttered by the user, where one test speech corresponds to one reference speech and both correspond to the same sentence content, for example the same voice command with the same semantics. In the stage of building the models for speech quality evaluation, the sound collection module 110 is also used to collect reference speech. Reference speech refers to the speech used to train the speech recognition module 130, or to train the semantics-related feature model described later; the semantics-related feature model is used for speech quality evaluation and will be further described below.
The pre-processing module 120 is used to pre-process the collected sound, for example by pre-emphasis, framing, or windowing, so that the user's test speech contained in the sound is easier to recognize. Pre-emphasis includes emphasizing the high-frequency part of the speech to remove the effect of lip radiation and increase the high-frequency resolution of the speech; framing exploits the short-time stationarity of the speech signal to divide it into speech frames for processing, with overlap between adjacent frames so that the frames are continuous; windowing strengthens the speech waveform near each frame's samples and attenuates the rest of the waveform, thereby smoothing the speech. Parameter adjustment of the pre-processing includes one or more of the following: the high-frequency band targeted by pre-emphasis and the degree of emphasis, the frame length and the overlap length in framing, and the degree of waveform strengthening and attenuation in windowing.
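As an illustration of the pre-processing described above, the following sketch performs pre-emphasis, overlapping framing, and Hamming windowing with numpy; the pre-emphasis coefficient, frame length, and hop length are illustrative assumptions rather than values specified by this application.
```python
import numpy as np

def preprocess(signal, sr, pre_emph=0.97, frame_ms=25, hop_ms=10):
    """Pre-emphasis, overlapping framing, and Hamming windowing.
    Parameter values here are illustrative assumptions only."""
    # Pre-emphasis: boost high frequencies to offset lip-radiation roll-off.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)      # hop < frame_len, so adjacent frames overlap
    if len(emphasized) < frame_len:
        emphasized = np.pad(emphasized, (0, frame_len - len(emphasized)))
    num_frames = 1 + (len(emphasized) - frame_len) // hop_len

    window = np.hamming(frame_len)         # windowing smooths each frame's edges
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len] * window
                       for i in range(num_frames)])
    return frames                          # shape: (num_frames, frame_len)
```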
The automatic speech recognition (ASR) module 130 is used to recognize the sentence content of the pre-processed test speech. The speech recognition module recognizes the words in the test speech and converts them into a computer-readable character sequence. After the recognized content is obtained, a control instruction can further be recognized based on that content, and the control instruction corresponding to the speech can be executed by a vehicle actuator. In some embodiments, the step from recognized content to control instruction may identify the control instruction by keyword matching, or identify the corresponding control instruction by neural-network-based semantic recognition, as sketched below. Parameter adjustment of the speech recognition module refers to adjusting the parameters of its speech recognition model, for example adjusting the parameters and hyperparameters of the neural network implementing the speech recognition model.
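A minimal sketch of the keyword-matching option mentioned above is given below; the command names and the keyword table are hypothetical examples, not taken from this application.
```python
# Hypothetical keyword table; command names and phrases are examples only.
COMMAND_KEYWORDS = {
    "open_air_conditioner": ["turn on the air conditioner"],
    "close_window":         ["close the window", "roll up the window"],
    "volume_up":            ["turn up the volume", "louder"],
}

def text_to_command(recognized_text):
    """Map ASR output text to a control instruction by keyword matching."""
    text = recognized_text.lower()
    for command, keywords in COMMAND_KEYWORDS.items():
        if any(k in text for k in keywords):
            return command
    return None   # no match; a semantic recognition model could be tried instead
```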
The evaluation and prediction module 140 is used to implement speech quality evaluation and can generate a semantics-related speech quality evaluation result for the test speech. In some embodiments, the speech recognition quality of the speech recognition module for the test speech can also be predicted from the speech quality evaluation result. In some embodiments, the evaluation and prediction module is also described as comprising an evaluation module and a prediction module, which implement the above speech quality evaluation and speech recognition quality prediction respectively.
The output module 150 is used to output information such as the speech quality evaluation result or the prediction result of the speech recognition quality. The output content may be provided to a vehicle controller so that the vehicle performs corresponding operations accordingly. The output content may also be provided to the user through the human-machine interface of the vehicle. In some embodiments, the output information includes the quality of the test speech, the speech recognition quality of the speech recognition model, the factors affecting the test speech quality, ways to adjust the test speech quality, and so on. In some embodiments, the human-machine interface may include a display in the vehicle cockpit (for example a liquid crystal display or a head-up display (HUD)) and a speaker, to prompt the user by image or sound.
In some embodiments, the human-machine interface may be a central control screen. After receiving the above information output by the output module 150 through the central control screen, the user can adjust the parameters of the pre-processing module 120 or the speech recognition module 130 via the human-machine interface, or control relevant actuators in the vehicle, for example opening or closing the windows or controlling the playback volume of the in-vehicle audio device. The parameter adjustment interface provided by the human-machine interface may be displayed in a way that is easy for ordinary users to understand and adjust (for example graphically), or may display the parameters in a way oriented to professional maintenance personnel.
In some other embodiments, the evaluation and prediction module 140 may also be deployed on an independent test device or in the cloud, and the output module 150 may also be deployed on an independent test device. The test device here may be dedicated test equipment, or a smart terminal installed with corresponding software, for example a mobile phone, a computer, or a tablet. When the above modules are deployed on an independent test device or in the cloud, communication between the vehicle and the test device or cloud can be realized by communication technology.
The method embodiments of the present application are described below with reference to the drawings.
An embodiment of the present application provides a speech quality evaluation method, which can be used to evaluate speech quality from a set of test speech. FIG. 2A shows the flow of an embodiment of the speech quality evaluation method; in this embodiment, application to a vehicle is taken as an example for description, and the flow includes steps S210 to S230.
S210: Obtain a test speech.
In this embodiment, the test speech is obtained through a sound collection module arranged in the vehicle cockpit. The sound collection module may be a microphone, and in some embodiments may be a plurality of microphones arranged at different positions in the cockpit.
In some embodiments, this step may be performed during testing or inspection of the vehicle, and the test speech may be uttered by a tester.
In some embodiments, this step may be performed while the user is using the vehicle, for example while the vehicle is driving or parked; in this case, the test speech may be uttered by the driver (i.e., the user). The test speech matches a voice command of the vehicle; since the driver already knows the speech content used by the voice command, the voice command uttered by the driver can be used as the test speech.
In some embodiments, this step may be triggered when the vehicle cannot accurately recognize the voice command of the user (such as the driver), and the voice command already uttered, or uttered again, by the user is used as the test speech. In some embodiments, after this step is triggered, the vehicle may also prompt the user through the human-machine interface, by image, text, or speech, that the speech quality evaluation procedure of this embodiment has been entered, and may further guide the user to utter the corresponding test speeches.
In some embodiments, the user may repeat a certain speech content (such as one voice command) several times; the obtained speeches are also called the speeches corresponding to that group of test speech, or the samples corresponding to one corpus item. The user may also repeat several voice commands several times each, to obtain the speeches corresponding to these groups of test speech respectively. For example, the speeches of "turn on the air conditioner" uttered several times by the user form one group of test speech, and the speeches of "turn up the volume" uttered several times form another group. One group of test speech corresponds to one voice command, that is, to one semantic meaning, or to the same sentence content.
S220: Evaluate the speech quality of the test speech according to the semantics-related information of the test speech.
In this embodiment, the speech quality evaluation is performed by the evaluation module of the vehicle. In this embodiment, the evaluation module is implemented by a processor of the vehicle, and the processor is signal-connected to the sound collection module.
In this embodiment, the speech quality evaluation result of the test speech is related to predetermined semantics, so that the result can be used not only to evaluate speech quality but also to predict the speech recognition quality of the speech recognition module.
In this embodiment, the semantics-related information of the test speech is a semantics-related feature vector generated for the test speech by a neural network, and the feature vector is the output of any layer before the output layer of the neural network. In some other embodiments, the feature vector may also be formed by concatenating the outputs of multiple layers before the output layer, where the multiple layers may be any two or more layers.
In some embodiments, the semantics-related information of the test speech may be a one-dimensional vector formed by the output of the output layer of the above neural network, for example a one-dimensional vector of the confidences corresponding to the voice commands (i.e., to the semantics). The semantics-related information will be described in further detail in step S223 below.
S230: Output the evaluation result of the speech quality of the test speech.
In this embodiment, the result may be output to the human-machine interface of the vehicle to show the evaluation result to the user. The human-machine interface may include a display, which may be the vehicle's central control screen, a head-up display (HUD), or an augmented reality head-up display (AR-HUD); the human-machine interface may also include a speaker, and may also include an input component, which may be a touchscreen integrated in the display or a separate button. Through the human-machine interface, the prompt can be given by image, text, or speech.
In some embodiments, the evaluation result includes content related to the speech quality of the test speech, which may include one or a combination of the following: the quality of the test speech, the factors affecting the test speech quality, and ways to adjust the test speech quality.
In some embodiments, the factors affecting the test speech quality may include one or a combination of the following: a window is open, which introduces noise from outside the vehicle; the vehicle speed is too high, which makes the tire noise or engine noise too loud; other sounds in the vehicle are too loud, for example music played by the vehicle; and the performance or placement of the microphones in the vehicle.
In some embodiments, the ways to adjust the test speech quality may include one or a combination of the following: closing the windows, reducing the vehicle speed, lowering the in-vehicle sound, replacing a faulty microphone, and tuning the parameters of the pre-processing module.
In some embodiments, as shown in the flowchart of FIG. 2B, the above step S220 includes steps S221 to S225.
S221: Obtain a first feature vector of the test speech, where the first feature vector includes time-frequency features of the test speech.
In this embodiment, a plurality of feature vectors including frequency-domain features may be obtained for the consecutive frames of the test speech, and the plurality of feature vectors constitute the first feature vector. In this embodiment, adjacent frames may have overlapping information. In this embodiment, the consecutive frames may be formed by the framing performed on the acquired test speech during the pre-processing of the pre-processing module. In some embodiments, the pre-processing also includes pre-emphasis and windowing. Making adjacent frames contain overlapping information keeps the frame data complete on the one hand, and smooths the variation of the feature vector parameters on the other. In some other embodiments, adjacent frames may also contain no overlapping information, which reduces the amount of data to be processed and thereby increases the processing speed.
Since the first feature vector includes the time-frequency features of the test speech, it is also called a time-frequency feature map, which is a two-dimensional figure with one dimension being the time coordinate and the other the frequency coordinate, and the intensity of each pixel being the intensity of the corresponding frequency in each consecutive frame.
In some embodiments, the frequency-domain features include one or a combination of the following: Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC), and the spectrum.
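As a sketch of obtaining a first feature vector (a time-frequency feature map) from per-frame MFCCs, the following snippet uses the librosa library; the sampling rate, number of coefficients, and frame/hop lengths are illustrative assumptions, not values specified by this application.
```python
import librosa

def first_feature_vector(wav_path, sr=16000, n_mfcc=13):
    """Stack per-frame MFCCs into a time-frequency feature map."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms frames with a 10 ms hop, so adjacent frames overlap (illustrative values).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return mfcc.T   # shape: (num_frames, n_mfcc); one feature vector per frame
```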
S223: Obtain a second feature vector of the test speech from the first feature vector of the test speech, where the second feature vector is related to the semantics of the speech.
In some embodiments, a semantics-related feature model is used to extract the second feature vector based on the first feature vector; since the semantics-related feature model represents the relationship between semantics and time-frequency features, the extracted second feature vector is related to semantics.
In some embodiments, the semantics-related feature model is constructed from the first feature vectors of reference speech and the semantics of that reference speech. In some embodiments, the semantics-related feature model may be built on a neural network, which may be a fully connected neural network (FCNN), a recurrent neural network (RNN), a convolutional neural network (CNN), or the like; a CNN is used in this embodiment. When constructing the semantics-related feature model, the model may be trained together with the pre-processing module: for example, during training, the reference speech carries semantic labels, the reference speech is passed through the pre-processing module to obtain the first feature vector, the first feature vector is then input into the semantics-related feature model, and the model is trained according to whether the semantics output by the model converge. Gradient descent, adversarial-network methods, and the like may be used for training.
In some embodiments, the semantics-related feature model built on a neural network includes a multi-layer network, and the output of the model corresponds to the semantic classes. The second feature vector may be the feature vector output by any layer of the multi-layer network; since the first feature vector is the input of the neural network, the second feature vector is a feature vector extracted based on the first feature vector. On the other hand, since the semantics correspond to the outputs of the neural network (the neural network is effectively a classification network, with each class corresponding to one semantic meaning), the second feature vector can be understood as a semantics-related feature vector.
In some embodiments, when the output of a layer before the output layer of the neural network is used as the second feature vector, the second feature vector may have more dimensions than the first feature vector. For example, when a CNN is used and multiple convolution kernels are used to obtain the second feature vector, the results of the multiple kernels constitute the multiple dimensions of the second feature vector.
In some embodiments, the second feature vector may also be a feature vector formed by concatenating the output vectors of two or more layers of the neural network (similar to a residual connection), so that the second feature vector contains not only low-level features but also high-level features at the same time. For example, the output of the second layer and the output of the fourth layer of the neural network may be concatenated to form the second feature vector.
In some embodiments, when the output of the output layer of the neural network is used as the second feature vector, the second feature vector is a vector of the confidences corresponding to the semantics. For example, when the neural network has 20 output nodes corresponding to 20 classes (i.e., used to identify one of 20 semantics), the second feature vector may be a one-dimensional vector with 20 parameters, the value of each parameter corresponding to the confidence of each semantic meaning.
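A minimal PyTorch sketch of a CNN-based semantics-related feature model is given below; either a hidden-layer output or the output-layer confidences can be taken as the second feature vector, as described above. The architecture, layer sizes, and the 20-command output are illustrative assumptions, not the model actually used in this application.
```python
import torch
import torch.nn as nn

class SemanticFeatureModel(nn.Module):
    """Illustrative CNN classifier; 20 output classes stand for 20 voice commands."""
    def __init__(self, num_classes=20):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x):                           # x: (batch, 1, frames, freq_bins)
        hidden = self.conv(x).flatten(1)            # hidden-layer feature vector
        logits = self.fc(hidden)
        confidences = torch.softmax(logits, dim=1)  # per-command confidences
        return hidden, confidences

# Either `hidden` or `confidences` can serve as the second feature vector.
model = SemanticFeatureModel()
x = torch.randn(1, 1, 200, 13)                      # e.g. 200 frames x 13 MFCCs
hidden, conf = model(x)
```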
The reference speech set includes several groups of reference speech, and each group corresponds to the same semantic meaning; for example, the semantics may be those of common vehicle commands such as "turn on the air conditioner", "turn off the air conditioner", "turn up the volume", and "turn down the volume". Reference speech can be understood as speech collected in an environment with very little noise, or as standard speech.
S225: Evaluate the speech quality of the test speech according to the second feature vector of the test speech.
Because the second feature vector is related to semantics, the evaluated speech quality is also related to semantics. When the test speech is a test speech set composed of multiple test speeches, the speech quality of the test speech can be understood as the speech quality of the test speech set.
In some embodiments, the speech quality evaluation result includes the following three evaluation indicators:
First evaluation indicator: an indicator of the degree of concentration of the center positions of the groups of test speech, where different groups of test speech have different semantics, the center position of each group of test speech is the center position of the second feature vectors of that group in a second feature space, and the second feature space is the space in which the second feature vectors lie. A group of test speech may include multiple test speeches, and the center position of the group can be computed from the distribution of the second feature vectors of those speeches.
In some embodiments, the concentration indicator may use the Mahalanobis distance, the Euclidean distance, or another distance metric that measures the similarity between samples.
In some embodiments, the first evaluation indicator may be evaluated as follows: first, for each group of test speech, compute the distances between the center position of that group and the center positions of the other groups and take the minimum of these distances; then, average the minimum distances obtained for all groups as the indicator of the concentration of the group centers. The first evaluation indicator D1 may be calculated as in the following formula (1):
D1 = \frac{1}{C}\sum_{j=1}^{C}\min_{i\neq j}\sqrt{(\mu_{ti}-\mu_{tj})^{\mathrm{T}}\,\Sigma^{-1}\,(\mu_{ti}-\mu_{tj})}    (1)
where C is the number of groups in the test speech set, i.e., the number of corpus items in the test speech set; μ_tj is the center position of the j-th group of test speech in the test speech set; μ_ti is the center position of the i-th group of test speech in the test speech set; and Σ is the covariance matrix of the distribution of the second feature vectors of the j-th group of test speech in the test speech set.
Second evaluation indicator: an indicator of the degree of dispersion of the second feature vectors of each group of test speech in the second feature space, where different groups of test speech have different semantics and the second feature space is the space in which the second feature vectors lie.
In some embodiments, the second evaluation indicator may be evaluated as follows: first, for each group of test speech, compute the semi-principal-axis lengths of the distribution of its second feature vectors in the space where the second feature vectors lie; then, average the semi-principal-axis lengths obtained for all groups as the dispersion indicator of the groups of test speech. The second evaluation indicator D2 may be calculated as in the following formula (2):
D2 = \frac{1}{C\,d}\sum_{j=1}^{C}\sum_{k=1}^{d} f_{jk}    (2)
where C is the number of groups in the test speech set, d is the dimensionality of the second feature vectors of each group in their space, f_jk is the k-th semi-principal-axis length of the second feature vectors of the j-th group of test speech, and Σ is the covariance matrix of the distribution of the second feature vectors of the j-th group of test speech in the test speech set.
Third evaluation indicator: an indicator of how close the center positions of the groups of test speech are to the center positions of the semantically corresponding groups of reference speech, where different groups of test speech have different semantics, the center position of each group of test speech is the center position of its second feature vectors in the second feature space, the second feature space is the space in which the second feature vectors lie, different groups of reference speech have different semantics, and the center position of each group of reference speech is the center position of its second feature vectors in the second feature space.
In some embodiments, the third evaluation indicator may be evaluated as follows: first, in the space of the second feature vectors, for each group of test speech, compute the distance between the center position of that group and the center position of the group of reference speech that has the same semantics, i.e., corresponds to the same sentence content; then, average the distances obtained for all groups as the indicator of how close the test speech set is to the reference speech set. The center position of a group of test speech is the distribution center of the second feature vectors of that group, and the center position of a group of reference speech is the distribution center of the second feature vectors of that group. The third evaluation indicator D3 may be calculated as in the following formula (3):
D3 = \frac{1}{C}\sum_{j=1}^{C}\sqrt{(\mu_{rj}-\mu_{tj})^{\mathrm{T}}\,\Sigma^{-1}\,(\mu_{rj}-\mu_{tj})}    (3)
where C is the number of groups in the test speech set, μ_rj is the center position of the j-th group of reference speech, μ_tj is the center position of the j-th group of test speech, and Σ is the covariance matrix of the distribution of the second feature vectors of the j-th group of reference speech in the reference speech set.
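The three evaluation indicators can be computed from the per-utterance second feature vectors, for example as in the following numpy sketch, which follows the verbal definitions above (Mahalanobis distances between group centers, semi-principal-axis lengths derived from the covariance); the exact normalization and the eigenvalue-based semi-axis lengths are assumptions about formulas (1) to (3).
```python
import numpy as np

def mahalanobis(u, v, cov):
    """Mahalanobis distance between two points under covariance `cov`."""
    diff = u - v
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def quality_indicators(test_groups, ref_groups):
    """test_groups / ref_groups: dicts mapping a semantic label to an
    (n_samples, dim) array of second feature vectors."""
    mu  = {s: g.mean(axis=0) for s, g in test_groups.items()}
    cov = {s: np.cov(g, rowvar=False) for s, g in test_groups.items()}

    # D1: for each group j, the distance from its center to the nearest other
    # group's center (using group j's covariance), averaged over groups.
    d1 = np.mean([min(mahalanobis(mu[i], mu[j], cov[j]) for i in mu if i != j)
                  for j in mu])

    # D2: mean semi-principal-axis length of each group's feature distribution
    # (square roots of the covariance eigenvalues, an assumed definition).
    d2 = np.mean([np.sqrt(np.clip(np.linalg.eigvalsh(cov[j]), 0.0, None)).mean()
                  for j in cov])

    # D3: distance between each test-group center and the center of the
    # reference group with the same semantics, averaged over groups.
    d3 = np.mean([mahalanobis(ref_groups[j].mean(axis=0), mu[j],
                              np.cov(ref_groups[j], rowvar=False))
                  for j in mu])
    return d1, d2, d3
```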
In some embodiments, the speech quality evaluation result includes one of the above evaluation indicators, or a combination of any number of them, and different evaluation indicators may have different weights.
An embodiment of the present application further provides a speech recognition quality prediction method, which can predict the speech recognition quality of the test speech based on the above speech quality evaluation result of the test speech. FIG. 3A shows the flow of an embodiment of the speech recognition quality prediction method, which includes steps S310 to S340.
S310: Obtain a test speech.
For this step, reference may be made to the description of S210 above or its embodiments, and details are not repeated here.
S320: Evaluate the speech quality of the test speech according to the semantics-related information of the test speech.
For this step, reference may be made to the description of S220 above or its embodiments, and details are not repeated here.
In some embodiments, the speech recognition quality prediction method and the aforementioned speech quality evaluation method may be integrated and executed in one flow; in this case, the content described in steps S310 and S320 above may directly use the execution results of steps S210 and S220, and the same content need not be executed repeatedly.
S330: Predict the speech recognition quality of the speech recognition model for the test speech from the evaluated test speech quality using a speech recognition quality function, where the speech recognition quality function indicates the relationship between speech recognition quality and speech quality.
In some embodiments, the recognition results of the speech recognition model are used in constructing the speech recognition quality function, so the predicted speech recognition quality can be taken as a prediction of the recognition quality of that speech recognition model. The speech recognition model may be implemented by the above speech recognition module.
S340: Output the prediction result of the speech recognition quality.
The output prediction result of the speech recognition quality includes: the predicted speech recognition quality of the speech recognition module, the factors affecting the speech recognition quality of the speech recognition module, and ways to adjust the speech recognition quality of the speech recognition module.
In some embodiments, the factors affecting the speech recognition quality of the speech recognition module include: a window being open, the vehicle speed being too high, the in-vehicle sound being too loud, the microphone performance and placement, and the performance of the speech recognition model of the speech recognition module.
In some embodiments, the ways to adjust the speech recognition quality of the speech recognition module include one or a combination of the following: closing the windows, reducing the vehicle speed, lowering the in-vehicle sound, replacing a faulty microphone or optimizing microphone placement, tuning the parameters of the pre-processing module, and tuning the parameters of the speech recognition model of the speech recognition module.
In this embodiment, the prediction result may be output to the human-machine interface of the vehicle to show the evaluation result to the user; for details, reference may be made to the description of step S230 above.
In some embodiments, the speech recognition quality function in step S330 above may be constructed according to the flow shown in FIG. 3B, and the construction process includes steps S321 to S327.
S321: Degrade each reference speech in the reference speech set multiple times, each degradation forming one degraded speech set, thereby obtaining multiple sets of degraded reference speech.
The reference speech is degraded to different degrees to generate multiple sets of degraded speech; the speeches degraded to one degree, or degraded in one way, constitute one degraded speech set. The degradation method may be speech scrambling; in some embodiments, each reference speech in the reference speech set may be corrupted according to the possible noise environment of the vehicle, for example by adding background music interference, simulated tire noise, wind noise, or simulated interference from other vehicles' horns outside the vehicle.
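A sketch of generating the degraded reference sets by mixing noise into the reference speech at decreasing signal-to-noise ratios is given below; the SNR levels and the noise source are illustrative assumptions, not values specified by this application.
```python
import numpy as np

def degrade(clean, noise, snr_db):
    """Mix `noise` into `clean` at the requested SNR (both 1-D float arrays)."""
    noise = np.resize(noise, clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

def build_degraded_sets(reference_set, noise, snr_levels=(20, 10, 5, 0)):
    """One degraded set per SNR level (levels are illustrative)."""
    return {snr: [degrade(utt, noise, snr) for utt in reference_set]
            for snr in snr_levels}
```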
S323: According to the methods of the foregoing speech quality assessment embodiments, obtain the assessed speech quality of the multiple degraded reference speech sets as a first assessment result.
Each degraded reference speech set is used as a test speech set, and the speech quality assessed for each degraded reference speech set is obtained according to the speech quality assessment methods of the foregoing embodiments; the speech qualities of the degraded reference speech sets constitute the first assessment result.
S325: Recognize the multiple degraded reference speech sets with the above speech recognition model and gather statistics; the statistical speech recognition results serve as a first statistical result.
Each degraded reference speech set is recognized by the speech recognition model to generate the speech recognition result for that set; the speech recognition results of the degraded reference speech sets constitute the first statistical result.
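As a minimal sketch of gathering such statistics, assuming a hypothetical `recognize(waveform) -> str` wrapper around the speech recognition model and the reference transcripts of the reference speech set; sentence accuracy per degraded set is used here as an illustrative statistic.

```python
def set_recognition_accuracy(degraded_set, transcripts, recognize):
    """Fraction of utterances in one degraded set whose recognized text matches the reference transcript."""
    hits = sum(recognize(wav).strip() == ref.strip()
               for wav, ref in zip(degraded_set, transcripts))
    return hits / len(degraded_set)

def first_statistical_result(degraded_sets, transcripts, recognize):
    """One recognition-accuracy value per degraded reference speech set (the first statistical result)."""
    return {cond: set_recognition_accuracy(wavs, transcripts, recognize)
            for cond, wavs in degraded_sets.items()}
```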
S327: Obtain the speech recognition quality function from the functional relationship between the first statistical result and the first assessment result.
In some embodiments, the speech recognition quality function can be constructed based on machine learning, for example by fitting a polynomial to the first statistical result and the first assessment result of each degraded reference speech set; it can also be constructed based on deep learning, for example by training a neural network model.
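A minimal polynomial-fitting sketch, assuming a single combined quality indicator per degraded set and recognition accuracy as the statistic; the polynomial degree and variable names are illustrative assumptions.

```python
import numpy as np

def fit_quality_function(first_assessment, first_statistics, degree=2):
    """Fit recognition accuracy as a polynomial in the assessed speech quality."""
    conds = sorted(first_assessment)                        # align the two results by degradation condition
    x = np.array([first_assessment[c] for c in conds])      # assessed speech quality per set
    y = np.array([first_statistics[c] for c in conds])      # measured recognition accuracy per set
    coeffs = np.polyfit(x, y, degree)
    return lambda quality: float(np.polyval(coeffs, quality))

# Usage sketch: predicted = fit_quality_function(assessment, statistics)(quality_of_new_test_set)
```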
In some embodiments, each evaluation indicator in the first assessment result is used as an input (independent variable) of the speech recognition quality function when constructing it. In other embodiments, the evaluation indicators in the first assessment result are first combined into one or more combined indicators, and the combined indicators are then used as the inputs to construct the speech recognition quality function. The evaluation indicators here are, for example, the indicators given by the above formula (1) to formula (3).
An embodiment of the present application further provides a method for improving speech recognition quality. Based on the speech quality assessment result of the test speech and the speech recognition quality prediction result described above, it can be determined that the pre-processing module or the speech recognition module needs parameter tuning, so as to improve the speech recognition quality. FIG. 4 shows the flow of an embodiment of the method for improving speech recognition quality, which includes steps S410 to S480.
S410: Obtain a test speech.
For this step, reference may be made to the description of the above step S210 or its embodiments, which will not be detailed here.
S420: Obtain the assessed speech quality of the test speech.
For this step, reference may be made to the description of the above step S220 or its embodiments, which will not be detailed here.
S430: Determine whether the speech quality is lower than a preset first baseline. When the speech quality is lower than the first baseline, perform step S440; otherwise, perform step S450.
The first baseline judges the evaluation indicators of the speech quality and is also referred to as the indicator baseline. In some embodiments, a first baseline is set separately for each evaluation indicator of the speech quality; in other embodiments, the evaluation indicators of the speech quality are combined into one or more combined indicators, and corresponding first baselines are set for the combined indicators.
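A minimal sketch of the two baseline variants, assuming for illustration that each indicator is normalized so that larger values mean better quality; the indicator names, baseline values, and weights are assumptions, not values from the embodiments.

```python
# Per-indicator first baselines (assumed names and thresholds).
INDICATOR_BASELINES = {"center_concentration": 0.70,
                       "within_group_dispersion": 0.60,
                       "reference_closeness": 0.65}

def below_indicator_baselines(indicators):
    """True if any single evaluation indicator falls below its own first baseline."""
    return any(indicators[name] < base for name, base in INDICATOR_BASELINES.items())

def below_combined_baseline(indicators, weights=None, baseline=0.65):
    """True if a weighted combination of the indicators falls below one combined first baseline."""
    weights = weights or {name: 1 / len(indicators) for name in indicators}
    combined = sum(weights[name] * value for name, value in indicators.items())
    return combined < baseline
```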
S440: Output the assessment result of the speech quality of the test speech.
For this step, reference may be made to the description of the above step S230 or its embodiments, which will not be detailed here.
In some embodiments, after this step is performed, the flow may return to step S410 or end.
In some embodiments, considering the fault tolerance of the speech recognition model, step S450 or step S480 may also be performed to continue with speech recognition.
S450: Predict the speech recognition quality.
For this step, reference may be made to the description of the above step S330 or its embodiments, which will not be detailed here.
S460: Determine whether the predicted speech recognition quality is lower than a preset second baseline. When the speech recognition quality is lower than the preset second baseline, perform step S470; otherwise, perform step S480.
The second baseline judges the speech recognition quality, that is, the speech recognition accuracy of the speech recognition model, and is also referred to as the accuracy baseline.
S470: Output the prediction result of the speech recognition quality.
For this step, reference may be made to the description of the above step S340 or its embodiments, which will not be detailed here.
In some embodiments, after this step is performed, the flow may return to step S410 or end.
In some embodiments, considering the fault tolerance of the speech recognition model, step S480 may also be performed to continue with speech recognition.
S480: Recognize the test speech with the speech recognition module.
In some embodiments, when the speech quality assessed in step S430 is higher than the first baseline and the speech recognition quality predicted in step S460 is higher than the second baseline, the accuracy of the speech recognition is considered to be high, and the recognition result can be used subsequently, for example for controlling the vehicle.
In some embodiments, when this step is entered while the speech quality assessed in step S430 is lower than the first baseline, or the speech recognition quality predicted in step S460 is lower than the second baseline, the speech recognition result may further be presented to the user for confirmation, so as to determine whether to use the speech recognition result.
In some embodiments, the user may adjust the pre-processing parameters or the speech recognition model parameters according to the assessment result of the speech quality output in step S440 or the prediction result of the speech recognition quality output in step S470, so as to improve the quality of speech recognition. In some other embodiments, the device under test, for example the vehicle, may automatically adjust the pre-processing parameters or the speech recognition model parameters according to the assessment result of the speech quality output in step S440 or the prediction result of the speech recognition quality output in step S470.
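The overall gating logic of steps S410 to S480 can be summarized with a minimal sketch; the callables `assess_quality`, `predict_recognition_quality` and `recognize`, and the scalar baselines, are assumptions for illustration, and the variant shown continues recognition even when a baseline is not met, in line with the fault-tolerance embodiments above.

```python
def improve_recognition_flow(test_speech, first_baseline, second_baseline,
                             assess_quality, predict_recognition_quality, recognize):
    """Sketch of steps S410-S480: assess the speech, gate on the two baselines, then recognize."""
    quality = assess_quality(test_speech)                          # S420
    report = {}
    if quality < first_baseline:                                   # S430
        report["speech_quality"] = quality                         # S440: output assessment result
    else:
        predicted = predict_recognition_quality(quality)           # S450
        if predicted < second_baseline:                            # S460
            report["predicted_recognition_quality"] = predicted    # S470: output prediction result
    result = recognize(test_speech)                                # S480 (recognition may still proceed)
    return result, report
```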
In the following, to facilitate further understanding of the technical solutions of the above embodiments, a specific implementation of the method for improving speech recognition quality is described. This specific implementation involves the steps of the speech quality assessment method and the steps of the speech recognition quality prediction method; with reference to the foregoing embodiments, the steps corresponding to these two parts can also be taken out separately as specific implementations of the speech quality assessment method and of the speech recognition quality prediction method. To simplify the description, separate specific implementations of these two parts are not repeated.
FIG. 5 shows the flow of a specific implementation of the speech recognition quality prediction method, which includes the following steps:
S510: The vehicle acquires a test speech set based on a reference speech set through the microphones arranged in the cockpit.
The test speech set includes several groups. For example, a tester sits in the driver's seat of the vehicle and reads out each group of test speeches in turn. The semantics of each group may correspond to one common command, and each group of test speeches includes 10 test speeches of the same content read out by the tester.
The content of the read-out test speeches corresponds to the content of the reference speech set. In this example, the vehicle guides the tester to read out the test speeches through the human-machine interface, for example by displaying on the screen, in turn, each speech content of the corresponding reference speech set and the number of times it needs to be read out, and the tester reads accordingly.
S515: The collected test speeches are pre-processed by the pre-processing module on the vehicle, which includes extracting the time-frequency feature vector map of each test speech in the test speech set, that is, the first feature vector.
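A minimal sketch of extracting such a time-frequency feature map from overlapping frames, assuming 16 kHz mono waveforms and a log-mel representation; the frame length, hop size, and mel resolution are illustrative assumptions.

```python
import librosa

def first_feature_vector(waveform, sr=16000, frame_len=400, hop=160, n_mels=64):
    """Time-frequency feature map from overlapping frames (25 ms frames, 10 ms hop at 16 kHz):
    adjacent frames share samples, and each frame yields one frequency-domain feature vector."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=frame_len,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel).T          # shape: (num_frames, n_mels)
```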
S520: Using a semantics-related feature model, obtain the semantics-related feature vector of each test speech in the test speech set, that is, the second feature vector, from the time-frequency feature vector map.
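One possible realization of this step, sketched under the assumption that a pretrained speech encoder is available as `semantic_encoder` (a hypothetical callable, not an identifier from the embodiments), is to pool its frame-level outputs into one utterance-level vector; the mean-pooling choice is an illustrative assumption.

```python
import numpy as np

def second_feature_vector(first_feature_map, semantic_encoder):
    """Utterance-level, semantics-related embedding obtained by mean-pooling the frame-level
    outputs of an (assumed) pretrained semantics-related feature model."""
    frame_embeddings = semantic_encoder(first_feature_map)    # (num_frames, embed_dim)
    return np.mean(frame_embeddings, axis=0)                  # (embed_dim,)
```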
S525: Assess the speech quality of the test speech set based on the semantics-related feature vectors of the test speeches; the speech quality can be assessed using one or more of the above formula (1) to formula (3).
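A minimal sketch of indicators in the spirit of formulas (1) to (3); since the concrete formulas are not reproduced here, the concentration, dispersion, and closeness measures below are illustrative stand-ins computed on the second feature vectors of each semantic group (dictionaries mapping a group name to a list of vectors).

```python
import numpy as np

def group_centers(groups):
    """Center of each semantic group in the second feature space."""
    return {name: np.mean(np.stack(vecs), axis=0) for name, vecs in groups.items()}

def center_concentration(groups):
    """Stand-in for a formula-(1)-style indicator: how the centers of the
    different-semantics groups are distributed around their overall centroid."""
    centers = np.stack(list(group_centers(groups).values()))
    return float(np.mean(np.linalg.norm(centers - centers.mean(axis=0), axis=1)))

def within_group_dispersion(groups):
    """Stand-in for a formula-(2)-style indicator: average spread of each group's
    second feature vectors around that group's center."""
    dispersions = [np.mean(np.linalg.norm(np.stack(vecs) - np.mean(np.stack(vecs), axis=0), axis=1))
                   for vecs in groups.values()]
    return float(np.mean(dispersions))

def closeness_to_reference(test_groups, reference_groups):
    """Stand-in for a formula-(3)-style indicator: distance between each test group's
    center and the center of the reference group with the same semantics (smaller = closer)."""
    test_c, ref_c = group_centers(test_groups), group_centers(reference_groups)
    return float(np.mean([np.linalg.norm(test_c[name] - ref_c[name]) for name in test_c]))
```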
S530: According to the assessment result of the speech quality, determine whether the speech quality is lower than the set indicator baseline; if it is lower than the set indicator baseline, perform step S535, otherwise perform step S540.
S535: Output the assessment result of the speech quality of the test speeches.
In this example, the assessment result can be output to the human-machine interface, and the assessment result includes content related to the speech quality of the test speeches. The content related to the speech quality of the test speeches displayed on the human-machine interface may include prompting the user that the parameters and algorithms in the pre-processing module can be tuned; the tunable interface and parameters can be displayed in graphical form.
S540: Using the speech recognition quality function, predict the speech recognition quality of the speech recognition module for the test speech set based on the speech quality of the test speech set.
S545: Determine whether the predicted recognition quality is lower than the set accuracy baseline; if it is lower than the accuracy baseline, perform step S550, otherwise perform step S555.
S550: Output the prediction result of the speech recognition quality.
In this example, the prediction result can be output to the human-machine interface, and the prediction result includes content related to the speech recognition quality.
In this example, the content related to the speech recognition quality displayed on the human-machine interface includes prompting the user that the speech recognition model of the speech recognition module can be optimized; the tunable interface and parameters can be displayed in graphical form.
S555: Prompt that the speech quality assessed this time and the predicted speech recognition quality meet the standard.
In addition, in this specific implementation, after the user has tuned the corresponding parameters according to the output of step S535 or step S550, the speech recognition quality function can be further optimized. Specifically, the steps in the method for constructing the speech recognition quality function can be used to retrain the speech recognition quality function. For example, after the parameters or algorithms of the pre-processing have been optimized, the speech recognition quality function can be retrained according to the above steps S323 to S327; after the parameters of the speech recognition model have been optimized, the speech recognition quality function can be retrained according to the above steps S325 to S327.
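A minimal retraining sketch that reuses the illustrative helpers sketched above (`first_statistical_result`, `fit_quality_function`); which steps are rerun depends on whether the pre-processing or the recognition model was changed, and the `assess` callable is an assumed wrapper around the speech quality assessment of one degraded set.

```python
def retrain_quality_function(degraded_sets, transcripts, recognize, assess,
                             cached_assessment=None):
    """Rebuild the speech recognition quality function after parameter tuning.
    If only the recognition model changed (steps S325-S327), pass the previous assessment
    in `cached_assessment`; if the pre-processing changed (steps S323-S327), leave it None."""
    assessment = cached_assessment or {cond: assess(wavs) for cond, wavs in degraded_sets.items()}
    statistics = first_statistical_result(degraded_sets, transcripts, recognize)
    return fit_quality_function(assessment, statistics)
```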
The embodiments of the present application further provide corresponding apparatuses. For the beneficial effects of, or the technical problems solved by, these apparatuses, reference may be made to the descriptions in the methods respectively corresponding to the apparatuses, or to the description in the summary of the invention; only a brief description is given here. Each apparatus in these embodiments can be used to implement the optional embodiments of the methods described above. The apparatus embodiments of the present application are described below with reference to the figures.
An apparatus for speech quality assessment provided by an embodiment of the present application can be used to implement the embodiments of the speech quality assessment method. As shown in FIG. 6A, the apparatus has an acquisition module 610, an assessment module 620 and an output module 630.
The acquisition module 610 is configured to obtain a test speech, and is specifically configured to perform the foregoing step S210 and its embodiments.
The assessment module 620 is configured to assess the speech quality of the test speech according to the semantics-related information of the test speech, and is specifically configured to perform the foregoing step S220 and its embodiments.
The output module 630 is configured to output the assessment result of the speech quality, and is specifically configured to perform the foregoing step S230 and its embodiments.
In some embodiments, the assessment result of the speech quality output by the output module 630 includes one or more of the following: the quality of the test speech; the factors affecting the quality of the test speech; the ways to adjust the quality of the test speech. For specific content, reference may be made to the description in the foregoing step S230.
In some embodiments, the assessment module 620 is specifically configured to: obtain a first feature vector of the test speech, where the first feature vector includes a time-frequency feature vector of the test speech; obtain a second feature vector of the test speech according to the first feature vector of the test speech, where the second feature vector is related to the semantics of the test speech; and assess the speech quality of the test speech according to the second feature vector.
In some embodiments, when assessing the speech quality of the test speech according to the second feature vector, the assessment module 620 is configured to assess the speech quality of the test speech using a first evaluation indicator. The first evaluation indicator includes an indicator of the degree of concentration of the center positions of the groups of test speeches, where different groups of test speeches have different semantics, the center position of each group of test speeches is the center position, in a second feature space, of the second feature vectors of that group of test speeches, and the second feature space is the space in which the second feature vectors lie.
In some embodiments, when assessing the speech quality of the test speech according to the second feature vector, the assessment module 620 is configured to assess the speech quality of the test speech using a second evaluation indicator. The second evaluation indicator includes an indicator of the degree of dispersion, in the second feature space, of the second feature vectors of each group of test speeches, where different groups of test speeches have different semantics, and the second feature space is the space in which the second feature vectors lie.
In some embodiments, when assessing the speech quality of the test speech according to the second feature vector, the assessment module 620 is configured to assess the speech quality of the test speech using a third evaluation indicator. The third evaluation indicator includes an indicator of the degree of closeness between the center position of each group of test speeches and the center position of the group of reference speeches with the corresponding semantics; where different groups of test speeches have different semantics, the center position of each group of test speeches is the center position, in the second feature space, of the second feature vectors of that group of test speeches, the second feature space is the space in which the second feature vectors lie, different groups of reference speeches have different semantics, and the center position of each group of reference speeches is the center position, in the second feature space, of the second feature vectors of that group of reference speeches.
In some embodiments, when obtaining the first feature vector of the test speech, the assessment module 620 is configured to: obtain the consecutive frames contained in the test speech, where adjacent frames contain overlapping information; and obtain, from the consecutive frames, a plurality of feature vectors including frequency-domain features, the plurality of feature vectors constituting the first feature vector.
An embodiment of the present application further provides an apparatus for speech recognition quality prediction, which can be used to implement the method embodiments of speech recognition quality prediction. As shown in FIG. 6B, the apparatus has an acquisition module 612, a prediction module 622 and an output module 632.
The acquisition module 612 is configured to obtain a test speech, and is specifically configured to perform the above step S310 and its embodiments.
The prediction module 622 is configured to predict, according to a speech recognition quality function, the speech recognition quality of a speech recognition model for the test speech, where the speech recognition quality function represents the relationship between speech recognition quality and speech quality, and the speech quality is assessed according to any possible embodiment of the foregoing speech quality assessment method. The prediction module is specifically configured to perform the above steps S320 to S330 and their embodiments.
The output module 632 is configured to output the prediction result of the speech recognition quality, and is specifically configured to perform the above step S340 and its embodiments.
In some embodiments, the prediction result of the speech recognition quality output by the output module 632 includes one or more of the following: the quality of the speech recognition of the speech recognition model; the factors affecting the speech recognition quality of the speech recognition model; the ways to adjust the speech recognition quality of the speech recognition model.
In some embodiments, the construction process of the speech recognition quality function includes: obtaining multiple degraded reference speech sets of the reference speeches; obtaining the speech recognition results of the multiple degraded reference speech sets according to the speech recognition model as a first statistical result; using the multiple degraded reference speech sets respectively as test speeches and obtaining the speech quality assessment results of the multiple degraded reference speech sets according to any possible embodiment of the speech quality assessment method as a first assessment result; and obtaining the speech recognition quality function according to the functional relationship between the first statistical result and the first assessment result.
An embodiment of the present application further provides an apparatus for improving speech recognition quality, which can be used to implement the method embodiments for improving speech recognition quality. As shown in FIG. 6C, the apparatus has an acquisition module 614, an assessment and prediction module 624 and an output module 634.
The acquisition module 614 is configured to obtain a test speech, and is specifically configured to perform the above step S410 and its embodiments.
The assessment and prediction module 624 is configured to assess the speech quality of the test speech; any possible embodiment of the foregoing speech quality assessment method may be used for the assessment. The module is specifically configured to perform the above steps S420 to S430 and their embodiments.
The output module 634 is configured to output the assessment result of the speech quality when the speech quality is lower than a preset first baseline, and is specifically configured to perform the above step S440 and its embodiments.
In some embodiments, the assessment and prediction module 624 is further configured to predict the speech recognition quality according to the method described in any possible embodiment of the speech recognition quality prediction method when the speech quality is greater than or equal to the first baseline, and is specifically configured to perform the above steps S450 to S460 and their embodiments.
The output module 634 is further configured to output the prediction result of the speech recognition quality when the speech recognition quality is lower than a preset second baseline, and is specifically configured to perform the above step S470 and its embodiments.
An embodiment of the present application further provides a vehicle. As shown in FIG. 7A and FIG. 7B, the vehicle includes a sound collection device 710, a pre-processing device 720, a speech recognition device 730, and the foregoing speech quality assessment apparatus, speech recognition quality prediction apparatus, or apparatus for improving speech recognition quality.
The sound collection device 710 is configured to collect the commands, based on the semantics of the reference speeches, spoken by the driver. In FIG. 7B it may be a microphone in the cockpit. In the example shown in FIG. 7B, the microphone is arranged at the central control screen 740; it may also be arranged at one or more other positions such as the instrument panel above the steering wheel, the rear-view mirror in the cockpit, or the steering wheel.
The pre-processing device 720 is configured to pre-process the speech spoken by the driver.
The speech recognition device 730 is configured to recognize the driver's command when the speech quality of the driver's command and the predicted speech recognition quality meet the requirements.
The foregoing speech quality assessment apparatus, speech recognition quality prediction apparatus, or apparatus for improving speech recognition quality is used for the purposes described above; based on its output, the user can perform corresponding operations to improve the speech quality and the speech recognition quality.
In the vehicle cockpit shown in FIG. 7B, a central control screen 740 serving as the human-machine interface is also shown. The central control screen 740 receives and displays the information output by the speech quality assessment apparatus, the speech recognition quality prediction apparatus, or the apparatus for improving speech recognition quality, and can also display a parameter adjustment interface, so that the user can perform the above parameter adjustments by operating the central control screen 740.
In the vehicle cockpit shown in FIG. 7B, the pre-processing device 720, the speech recognition device 730, and the speech quality assessment apparatus, speech recognition quality prediction apparatus or apparatus for improving speech recognition quality may be implemented by one or more processors in the vehicle; in this embodiment, they may be implemented by the processor of the in-vehicle infotainment system.
FIG. 8 is a schematic structural diagram of a computing device 800 provided by an embodiment of the present application. The computing device 800 includes a processor 810, a memory 820 and a communication interface 830.
It should be understood that the communication interface 830 in the computing device 800 shown in FIG. 8 can be used to communicate with other devices.
The processor 810 may be connected to the memory 820. The memory 820 may be used to store the program code and data. Therefore, the memory 820 may be a storage module inside the processor 810, an external storage module independent of the processor 810, or a component including both a storage module inside the processor 810 and an external storage module independent of the processor 810.
When the computing device 800 is running, the processor 810 executes the computer-executable instructions in the memory 820 to perform the operation steps of the above methods.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program is used to perform one or more of the solutions described in the embodiments of the present application.
In the above description, the reference numbers denoting steps, such as S110, S120 and so on, do not imply that the steps must be performed in that order; where permitted, the order of the preceding and following steps may be interchanged, or the steps may be performed simultaneously.
The terms "first", "second", "third" and similar terms in the description and claims of this application are only used to distinguish similar objects and do not denote a specific ordering of the objects. It should be understood that, where permitted, the specific order or sequence may be interchanged, so that the embodiments of the application described here can be implemented in an order other than that illustrated or described here.
The term "comprising" used in the description and claims of this application should not be interpreted as being limited to what is listed thereafter; it does not exclude other elements or steps. It should therefore be interpreted as specifying the presence of the mentioned features, integers, steps or components, but does not preclude the presence or addition of one or more other features, integers, steps or components and groups thereof. Therefore, the expression "a device comprising means A and B" should not be limited to a device consisting only of components A and B.
Reference in this specification to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Therefore, the phrase "in an embodiment" appearing in various places in this specification does not necessarily always refer to the same embodiment, but may refer to the same embodiment. Furthermore, in one or more embodiments, the particular features, structures or characteristics can be combined in any suitable manner, as would be apparent to a person of ordinary skill in the art from this disclosure.
Note that the above are only embodiments of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the specific embodiments described here, and various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present application. Therefore, although the present application has been described in some detail through the above embodiments, it is not limited to the above embodiments and may include more other equivalent embodiments without departing from the concept of the present application, all of which fall within the protection scope of the present application.

Claims (28)

  1. A method for speech quality assessment, characterized by comprising:
    obtaining a test speech;
    assessing the speech quality of the test speech according to semantics-related information of the test speech;
    outputting an assessment result of the speech quality.
  2. The method according to claim 1, characterized in that the assessment result of the speech quality comprises one or more of the following:
    the quality of the test speech;
    factors affecting the quality of the test speech;
    ways to adjust the quality of the test speech.
  3. The method according to claim 1, characterized in that the assessing the speech quality of the test speech according to semantics-related information of the test speech comprises:
    obtaining a first feature vector of the test speech, wherein the first feature vector comprises a time-frequency feature vector of the test speech;
    obtaining a second feature vector of the test speech according to the first feature vector of the test speech, wherein the second feature vector is related to the semantics of the test speech;
    assessing the speech quality of the test speech according to the second feature vector.
  4. The method according to claim 3, characterized in that the assessing the speech quality of the test speech according to the second feature vector comprises: assessing the speech quality of the test speech using a first evaluation indicator,
    the first evaluation indicator comprising: an indicator of the degree of concentration of the center positions of the groups of test speeches, wherein different groups of the test speeches have different semantics, the center position of each group of the test speeches is the center position of the second feature vectors of that group of test speeches in a second feature space, and the second feature space is the space in which the second feature vectors lie.
  5. The method according to claim 3, characterized in that the assessing the speech quality of the test speech according to the second feature vector comprises: assessing the speech quality of the test speech using a second evaluation indicator,
    the second evaluation indicator comprising: an indicator of the degree of dispersion of the second feature vectors of each group of the test speeches in a second feature space, wherein different groups of the test speeches have different semantics, and the second feature space is the space in which the second feature vectors lie.
  6. The method according to claim 3, characterized in that the assessing the speech quality of the test speech according to the second feature vector comprises: assessing the speech quality of the test speech using a third evaluation indicator,
    the third evaluation indicator comprising: an indicator of the degree of closeness between the center position of each group of the test speeches and the center position of the group of reference speeches with the corresponding semantics; wherein different groups of the test speeches have different semantics, the center position of each group of the test speeches is the center position of the second feature vectors of that group of test speeches in a second feature space, the second feature space is the space in which the second feature vectors lie, different groups of the reference speeches have different semantics, and the center position of each group of the reference speeches is the center position of the second feature vectors of that group of reference speeches in the second feature space.
  7. The method according to any one of claims 3 to 6, characterized in that the obtaining a first feature vector of the test speech comprises:
    obtaining consecutive frames contained in the test speech, wherein adjacent frames contain overlapping information;
    obtaining, from the consecutive frames, a plurality of feature vectors comprising frequency-domain features, the plurality of feature vectors constituting the first feature vector.
  8. A method for speech recognition quality prediction, characterized by comprising:
    obtaining a test speech;
    predicting, according to a speech recognition quality function, the speech recognition quality of a speech recognition model for the test speech, wherein the speech recognition quality function is used to indicate a relationship between speech recognition quality and speech quality, and the speech quality is assessed according to the method of any one of claims 1 to 7;
    outputting a prediction result of the speech recognition quality.
  9. The method according to claim 8, characterized in that the prediction result of the speech recognition quality comprises one or more of the following:
    the quality of the speech recognition of the speech recognition model;
    factors affecting the speech recognition quality of the speech recognition model;
    ways to adjust the speech recognition quality of the speech recognition model.
  10. The method according to claim 9, characterized in that a construction process of the speech recognition quality function comprises:
    obtaining multiple sets of degraded reference speeches of reference speeches;
    obtaining speech recognition results of the multiple sets of degraded reference speeches according to the speech recognition model as a first statistical result;
    using the multiple sets of degraded reference speeches respectively as test speeches, and obtaining speech quality assessment results of the multiple sets of degraded reference speeches according to the method of any one of claims 1 to 7 as a first assessment result;
    obtaining the speech recognition quality function according to a functional relationship between the first statistical result and the first assessment result.
  11. A method for improving speech recognition quality, characterized by comprising:
    obtaining a test speech;
    obtaining a speech quality assessment result of the test speech according to the method of any one of claims 1 to 7;
    when the speech quality is lower than a preset first baseline, performing the outputting of the assessment result of the speech quality.
  12. The method according to claim 11, characterized by further comprising:
    when the speech quality is greater than or equal to the first baseline, predicting the speech recognition quality according to the method of claim 8 or 9;
    when the speech recognition quality is lower than a preset second baseline, performing the outputting of the prediction result of the speech recognition quality.
  13. An apparatus for speech quality assessment, characterized by comprising:
    an acquisition module, configured to obtain a test speech;
    an assessment module, configured to assess the speech quality of the test speech according to semantics-related information of the test speech;
    an output module, configured to output an assessment result of the speech quality.
  14. The apparatus according to claim 13, characterized in that the assessment result of the speech quality output by the output module comprises one or more of the following:
    the quality of the test speech;
    factors affecting the quality of the test speech;
    ways to adjust the quality of the test speech.
  15. The apparatus according to claim 13, characterized in that the assessment module is specifically configured to:
    obtain a first feature vector of the test speech, wherein the first feature vector comprises a time-frequency feature vector of the test speech;
    obtain a second feature vector of the test speech according to the first feature vector of the test speech, wherein the second feature vector is related to the semantics of the test speech;
    assess the speech quality of the test speech according to the second feature vector.
  16. The apparatus according to claim 15, characterized in that, when assessing the speech quality of the test speech according to the second feature vector, the assessment module is configured to assess the speech quality of the test speech using a first evaluation indicator,
    the first evaluation indicator comprising: an indicator of the degree of concentration of the center positions of the groups of test speeches, wherein different groups of the test speeches have different semantics, the center position of each group of the test speeches is the center position of the second feature vectors of that group of test speeches in a second feature space, and the second feature space is the space in which the second feature vectors lie.
  17. The apparatus according to claim 15, characterized in that, when assessing the speech quality of the test speech according to the second feature vector, the assessment module is configured to assess the speech quality of the test speech using a second evaluation indicator,
    the second evaluation indicator comprising: an indicator of the degree of dispersion of the second feature vectors of each group of the test speeches in a second feature space, wherein different groups of the test speeches have different semantics, and the second feature space is the space in which the second feature vectors lie.
  18. The apparatus according to claim 15, characterized in that, when assessing the speech quality of the test speech according to the second feature vector, the assessment module is configured to assess the speech quality of the test speech using a third evaluation indicator,
    the third evaluation indicator comprising: an indicator of the degree of closeness between the center position of each group of the test speeches and the center position of the group of reference speeches with the corresponding semantics; wherein different groups of the test speeches have different semantics, the center position of each group of the test speeches is the center position of the second feature vectors of that group of test speeches in a second feature space, the second feature space is the space in which the second feature vectors lie, different groups of the reference speeches have different semantics, and the center position of each group of the reference speeches is the center position of the second feature vectors of that group of reference speeches in the second feature space.
  19. The apparatus according to any one of claims 15 to 18, characterized in that, when obtaining the first feature vector of the test speech, the assessment module is configured to:
    obtain consecutive frames contained in the test speech, wherein adjacent frames contain overlapping information;
    obtain, from the consecutive frames, a plurality of feature vectors comprising frequency-domain features, the plurality of feature vectors constituting the first feature vector.
  20. An apparatus for speech recognition quality prediction, characterized by comprising:
    an acquisition module, configured to obtain a test speech;
    a prediction module, configured to predict, according to a speech recognition quality function, the speech recognition quality of a speech recognition model for the test speech, wherein the speech recognition quality function is used to indicate a relationship between speech recognition quality and speech quality, and the speech quality is assessed according to the method of any one of claims 1 to 7;
    an output module, configured to output a prediction result of the speech recognition quality.
  21. The apparatus according to claim 20, characterized in that the prediction result of the speech recognition quality output by the output module comprises one or more of the following:
    the quality of the speech recognition of the speech recognition model;
    factors affecting the speech recognition quality of the speech recognition model;
    ways to adjust the speech recognition quality of the speech recognition model.
  22. The apparatus according to claim 21, characterized in that a construction process of the speech recognition quality function comprises:
    obtaining multiple sets of degraded reference speeches of reference speeches;
    obtaining speech recognition results of the multiple sets of degraded reference speeches according to the speech recognition model as a first statistical result;
    using the multiple sets of degraded reference speeches respectively as test speeches, and obtaining speech quality assessment results of the multiple sets of degraded reference speeches according to the method of any one of claims 1 to 7 as a first assessment result;
    obtaining the speech recognition quality function according to a functional relationship between the first statistical result and the first assessment result.
  23. An apparatus for improving speech recognition quality, characterized by comprising:
    an acquisition module, configured to obtain a test speech;
    an assessment and prediction module, configured to obtain a speech quality assessment result of the test speech according to the method of any one of claims 1 to 7;
    an output module, configured to output the assessment result of the speech quality when the speech quality is lower than a preset first baseline.
  24. The apparatus according to claim 23, characterized in that:
    the assessment and prediction module is further configured to predict the speech recognition quality according to the method of any one of claims 8 to 9 when the speech quality is greater than or equal to the first baseline;
    the output module is further configured to output the prediction result of the speech recognition quality when the speech recognition quality is lower than a preset second baseline.
  25. A vehicle, characterized by comprising:
    a sound collection device, configured to collect a voice command of a user;
    a pre-processing device, configured to pre-process the sound of the voice command;
    a speech recognition device, configured to recognize the pre-processed sound;
    the apparatus according to any one of claims 13 to 24.
  26. A computing device, characterized by comprising one or more processors and one or more memories, the memories storing program instructions which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 12.
  27. A computer-readable storage medium on which program instructions are stored, characterized in that the program instructions, when executed by a computer, cause the computer to implement the method of any one of claims 1 to 12.
  28. A computer program product, characterized in that it comprises program instructions which, when executed by a computer, cause the computer to implement the method of any one of claims 1 to 12.
PCT/CN2021/122149 2021-09-30 2021-09-30 Speech quality assessment method and apparatus, speech recognition quality prediction method and apparatus, and speech recognition quality improvement method and apparatus WO2023050301A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/122149 WO2023050301A1 (en) 2021-09-30 2021-09-30 Speech quality assessment method and apparatus, speech recognition quality prediction method and apparatus, and speech recognition quality improvement method and apparatus
CN202180008040.9A CN116210050A (en) 2021-09-30 2021-09-30 Method and device for evaluating voice quality and predicting and improving voice recognition quality

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/122149 WO2023050301A1 (en) 2021-09-30 2021-09-30 Speech quality assessment method and apparatus, speech recognition quality prediction method and apparatus, and speech recognition quality improvement method and apparatus

Publications (1)

Publication Number Publication Date
WO2023050301A1 true WO2023050301A1 (en) 2023-04-06

Family

ID=85781128

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122149 WO2023050301A1 (en) 2021-09-30 2021-09-30 Speech quality assessment method and apparatus, speech recognition quality prediction method and apparatus, and speech recognition quality improvement method and apparatus

Country Status (2)

Country Link
CN (1) CN116210050A (en)
WO (1) WO2023050301A1 (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1389059A (en) * 2000-06-29 2003-01-01 皇家菲利浦电子有限公司 Speech qality estimation for off-line speech recognition
CN1802694A (en) * 2003-05-08 2006-07-12 语音信号科技公司 Signal-to-noise mediated speech recognition algorithm
CN1965218A (en) * 2004-06-04 2007-05-16 皇家飞利浦电子股份有限公司 Performance prediction for an interactive speech recognition system
US20150073785A1 (en) * 2013-09-06 2015-03-12 Nuance Communications, Inc. Method for voicemail quality detection
CN106297795A (en) * 2015-05-25 2017-01-04 展讯通信(上海)有限公司 Audio recognition method and device
CN107093427A (en) * 2016-02-17 2017-08-25 通用汽车环球科技运作有限责任公司 The automatic speech recognition of not smooth language
CN107221319A (en) * 2017-05-16 2017-09-29 厦门盈趣科技股份有限公司 A kind of speech recognition test system and method
WO2020166322A1 (en) * 2019-02-12 2020-08-20 日本電信電話株式会社 Learning-data acquisition device, model learning device, methods for same, and program
CN112951270A (en) * 2019-11-26 2021-06-11 新东方教育科技集团有限公司 Voice fluency detection method and device and electronic equipment


Also Published As

Publication number Publication date
CN116210050A (en) 2023-06-02

Similar Documents

Publication Publication Date Title
DE112017003563B4 (en) METHOD AND SYSTEM OF AUTOMATIC LANGUAGE RECOGNITION USING POSTERIORI TRUST POINT NUMBERS
Schädler et al. Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition
Tawari et al. Speech based emotion classification framework for driver assistance system
CN105593936B (en) System and method for text-to-speech performance evaluation
Liu et al. Bone-conducted speech enhancement using deep denoising autoencoder
CN110178178A (en) Microphone selection and multiple talkers segmentation with environment automatic speech recognition (ASR)
CN104078039A (en) Voice recognition system of domestic service robot on basis of hidden Markov model
CN109313892A (en) Steady language identification method and system
CN109448726A (en) A kind of method of adjustment and system of voice control accuracy rate
CN101114449A (en) Model training method for unspecified person alone word, recognition system and recognition method
Shrawankar et al. Adverse conditions and ASR techniques for robust speech user interface
Rahman et al. Dynamic time warping assisted svm classifier for bangla speech recognition
Venkatesan et al. Binaural classification-based speech segregation and robust speaker recognition system
Chen et al. InQSS: a speech intelligibility and quality assessment model using a multi-task learning network
Lavechin et al. Statistical learning models of early phonetic acquisition struggle with child-centered audio data
CN110176243A (en) Sound enhancement method, model training method, device and computer equipment
WO2023050301A1 (en) Speech quality assessment method and apparatus, speech recognition quality prediction method and apparatus, and speech recognition quality improvement method and apparatus
Hepsiba et al. Enhancement of single channel speech quality and intelligibility in multiple noise conditions using wiener filter and deep CNN
CN116230018A (en) Synthetic voice quality evaluation method for voice synthesis system
Chen et al. InQSS: a speech intelligibility assessment model using a multi-task learning network
Kumawat et al. SSQA: Speech signal quality assessment method using spectrogram and 2-D convolutional neural networks for improving efficiency of ASR devices
Nandyala et al. Real time isolated word speech recognition system for human computer interaction
CN115171878A (en) Depression detection method based on BiGRU and BiLSTM
CN115168563A (en) Airport service guiding method, system and device based on intention recognition
Boril et al. Data-driven design of front-end filter bank for Lombard speech recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21958875

Country of ref document: EP

Kind code of ref document: A1