WO2023050301A1

WO2023050301A1 - Speech quality assessment method and apparatus, speech recognition quality prediction method and apparatus, and speech recognition quality improvement method and apparatus

Info

Publication number: WO2023050301A1
Application number: PCT/CN2021/122149
Authority: WO
Inventors: 周宇; 聂为然; 向腾
Original assignee: 华为技术有限公司
Priority date: 2021-09-30
Filing date: 2021-09-30
Publication date: 2023-04-06
Also published as: CN116210050A

Abstract

The present application relates to the technical field of speech recognition. Embodiments of the present application provide a speech quality assessment method, a speech recognition quality prediction method, and a speech recognition quality improvement method. First, a test speech is acquired; then speech quality of the test speech is assessed according to semantic related information of the test speech to determine whether a parameter of speech preprocessing needs to be adjusted, and speech recognition quality is predicted according to the assessed speech quality to determine whether a parameter of a speech recognition model needs to be adjusted; and a corresponding assessment result or prediction result is outputted, such that the parameter of the preprocessing or the speech recognition model can be adjusted according to the assessment result or the prediction result. According to the present application, the parameter adjustment process of the preprocessing and the speech recognition model in the speech recognition process can be decoupled.

Description

Method and device for speech quality assessment, speech recognition quality prediction and improvement

technical field

The present application relates to the technical field of speech recognition, in particular to a method and device for evaluating speech quality, a method and device for predicting speech recognition quality, a method and device for improving speech recognition quality, a vehicle, a computer-readable storage medium, and a computer program product.

Background technique

In the voice recognition system, the voice signal will be enhanced by the voice pre-processing module after being collected by the sensor, and then sent to the voice recognition module for voice wake-up and recognition. Therefore, the recognition effect of the voice recognition system mainly depends on the performance of the voice recognition module and the quality of the voice signal. big factor. The quality of the voice signal is subject to the environment, audio collection hardware and the algorithm of the voice pre-processing module.

The industry generally adopts the overall joint debugging of the speech recognition module and the speech pre-processing module to test the quality of speech recognition for performance tuning. Standards, cannot evaluate the quality of the voice signal output by the voice pre-processing module and provide tuning baselines and feedback for the acquisition hardware, voice pre-processing module and voice recognition module. This solution is not conducive to the independent positioning and solution of voice signal quality and voice recognition module performance related issues in the actual voice recognition business, and it is easy to cause difficulty in problem location for voice recognition system failures. Therefore, the voice recognition module is decoupled from the voice pre-processing module. , to obtain voice quality calibration and feedback, which is of great significance to improve the efficiency of module fault diagnosis.

Contents of the invention

In view of this, the present application provides a speech quality assessment method and device, a speech recognition quality prediction method and device, a speech recognition quality improving method and device, a vehicle, a computer readable storage medium and a computer program product.

In order to achieve the above purpose, the first aspect of the present application provides a voice quality assessment method, including: acquiring a test voice; evaluating the voice quality of the test voice according to the semantic related information of the test voice; and outputting the voice quality assessment result.

From the above, it is possible to evaluate the voice quality of the test voice, and the quality assessment is based on the information related to the semantics of the test voice, so the semantic information can be reflected in the voice quality of the assessment, so that the voice quality of the assessment can be used It is used to predict the speech recognition quality of the speech recognition model on the test speech. In addition, the output evaluation result may include content related to the evaluated voice quality, which may be used by the user to refer to the evaluation result to improve the voice quality.

For example, when applied to vehicles, it can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain or improve the voice interaction experience in the vehicle. In another possible implementation, the vehicle can also perform corresponding operations based on the output evaluation results to improve the in-car voice interaction experience, such as automatically closing the car windows, automatically reducing the sound in the car (such as the sound of the car playing music) , or automatically perform parameter tuning, etc.

As a possible implementation manner of the first aspect, the evaluation result of the voice quality includes one or more of the following: quality of the test voice; factors affecting the quality of the test voice; and a manner of adjusting the quality of the test voice.

From the above, the output content can be flexibly set. For example, the quality of the output test speech can be output by quantified specific parameters, can also be output in qualitative ways such as excellent, good, and medium, and can also be output in combination with images and texts. For example, the factors that affect the test voice quality can be the noise outside the car caused by the window being opened; the tire noise or engine noise caused by the high speed of the car; Adjustments desired by the user for reference. Adjust the way to test the voice quality, you can close the car windows, reduce the speed of the car, reduce the sound in the car, replace the problematic microphone, adjust the parameters of the pre-processing module, etc.

As a possible implementation manner of the first aspect, evaluating the speech quality of the test speech according to the semantic related information of the test speech includes: obtaining a first feature vector of the test speech, where the first feature vector includes time-frequency features of the test speech vector; obtaining a second feature vector of the test voice according to the first feature vector of the test voice, wherein the second feature vector is related to the semantics of the test voice; evaluating the voice quality of the test voice according to the second feature vector.

From the above, the first feature vector including the time-frequency feature vector of the speech can be obtained first, and then the second feature vector related to the semantics of the speech can be obtained based on this, so as to evaluate the speech quality of the test speech according to the second feature vector. The first eigenvector is related to the time-frequency eigenvector, so the preprocessing parameters are related to the second eigenvector, so the speech quality of the test speech evaluated by the second eigenvector can be used as a reference for tuning the preprocessing parameters.

As a possible implementation manner of the first aspect, evaluating the speech quality of the test speech according to the second feature vector includes: using a first evaluation index to evaluate the speech quality of the test speech, and the first evaluation index includes: the center of each group of test speech The concentration index of position, wherein, different groups of test voices have different semantics, the center position of each group of test voices is the center position of the second eigenvector of the group of test voices in the second feature space, and the second feature space is the center position of the second feature vector The space of the two eigenvectors.

From the above, the speech quality of the test speech can be evaluated according to the above optional evaluation index, and the evaluation index is calculated based on the second feature vector.

As a possible implementation manner of the first aspect, evaluating the speech quality of the test speech according to the second feature vector includes: using a second evaluation index to evaluate the speech quality of the test speech, and the second evaluation index includes: the first An indicator of the degree of dispersion of the two feature vectors in the second feature space, wherein different groups of test speeches have different semantics, and the second feature space is the space where the second feature vectors are located.

As a possible implementation of the first aspect, evaluating the speech quality of the test speech according to the second feature vector includes: using a third evaluation index to evaluate the speech quality of the test speech, and the third evaluation index includes: the center of each group of test speech The similarity index of the center position of each group of reference voices corresponding to the position and semantics; wherein, different groups of test voices have different semantics, and the center position of each group of test voices is the second eigenvector of the group of test voices in the second feature space The center position in the second feature space is the space where the second feature vector is located. Different groups of reference voices have different semantics. The center position of each group of reference voices is the second feature vector of the group of reference voices in the second feature space central location.

As a possible implementation of the first aspect, obtaining the first feature vector of the test speech includes: obtaining continuous frames contained in the test speech, wherein adjacent frames contain overlapping information; A plurality of eigenvectors, the plurality of eigenvectors constitute the first eigenvector.

From the above, when obtaining the first feature vector, by having overlapping adjacent frames, the direct relationship between frames can be brought into the calculation of the second feature vector.

In order to achieve the above object, the second aspect of the present application provides a speech recognition quality prediction method, including: obtaining a test speech; predicting the speech recognition quality of the speech recognition model for the test speech according to the speech recognition quality function, and the speech recognition quality function is used to indicate The relationship between the speech recognition quality and the speech quality, the speech quality is evaluated according to the method of the first aspect or any possible implementation manner of the first aspect; and a prediction result of the speech recognition quality is output.

From the above, it is realized to predict the speech recognition quality of the speech recognition model through the speech quality, which can be used by the user to refer to the prediction result to improve the speech recognition quality. For example, when applied to vehicles, it is possible to monitor and give feedback on the quality of in-vehicle voice recognition, so that users can take corresponding measures to maintain the in-vehicle voice interaction experience.

As a possible implementation of the second aspect, the output prediction result of speech recognition quality includes one or more of the following: the speech recognition quality of the speech recognition model; factors affecting the speech recognition quality of the speech recognition model; adjusting the speech The way to identify the speech recognition quality of the model.

From the above, the content of the predicted result can be flexibly set. For example, the quality of a speech recognition model can be provided in quantified specific parameters, or in qualitative ways such as excellent, good, and medium, or in combination with images and text. For example, the output factors affecting the speech recognition quality of the speech recognition model may be factors in the method provided in the first aspect, or may be performance factors of the speech recognition model of the speech recognition module. The output mode for adjusting the test voice quality may be the adjustment mode provided in the method provided in the first aspect, or it may be a tuning prompt for the parameters of the speech recognition model.

As a possible implementation of the second aspect, the construction process of the speech recognition quality function includes: obtaining multiple sets of degraded reference speeches of the reference speech; Statistical results; using multiple sets of degraded reference voices as test voices respectively, according to the method of the first aspect or any possible implementation mode of the first aspect, the voice quality evaluation results of multiple sets of degraded reference voices are obtained, and the first evaluation result is obtained; A speech recognition quality function is obtained according to the functional relationship between the first statistical result and the first evaluation result.

From the above, the above is a way of constructing the speech recognition quality function through reference speech. Specifically, the method of degrading the reference speech is adopted without introducing other reference speech, which can effectively reduce the data volume of the reference speech.

In order to achieve the above object, the third aspect of the present application provides a method for improving the quality of speech recognition, including: obtaining a test speech; obtaining the speech of the test speech according to the method of the first aspect or any possible implementation manner of the first aspect A quality assessment result; when the voice quality is lower than the preset first baseline, output the voice quality assessment result.

From the above, the speech quality evaluation result of the test speech can be obtained, and according to the speech quality evaluation result of the test speech, content related to the speech quality can be included, wherein the content can include whether to adjust the pre-processing parameters. For example, when applied to vehicles, it can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain the voice interaction experience in the vehicle. This process can be independent of the speech recognition process of the speech recognition model, realizing the decoupling of pre-processing and speech recognition.

As a possible implementation manner of the third aspect, it also includes: when the voice quality is greater than or equal to the first baseline, predicting the speech recognition quality according to the second aspect or any one possible implementation manner of the second aspect; when When the voice recognition quality is lower than the preset second baseline, output a prediction result of the voice recognition quality.

From the above, it can be determined whether to adjust the pre-processing parameters or the speech recognition model according to the test voice, and realize the decoupling of pre-processing and speech recognition. It is beneficial to the independent positioning and performance tuning of each module problem.

In order to achieve the above purpose, the fourth aspect of the present application provides a voice quality assessment device, including:

The obtaining module is used to obtain the test speech; the evaluation module is used to evaluate the speech quality of the test speech according to the semantic related information of the test speech; the output module is used to output the speech quality evaluation result.

From the above, it is possible to evaluate the voice quality of the test voice, and the quality assessment is based on the information related to the semantics of the test voice, so the semantic information can be reflected in the voice quality of the assessment, so that the voice quality of the assessment can be used It is used to predict the speech recognition quality of the speech recognition model on the test speech. In addition, the output evaluation result may include content related to the evaluated voice quality, which may be used by the user to refer to the evaluation result to improve the voice quality. For example, when applied to vehicles, it can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain or improve the voice interaction experience in the vehicle. When applied to vehicles, in a possible implementation, the vehicle can also perform corresponding operations based on the output evaluation results to improve the voice interaction experience in the vehicle, such as automatically closing the windows, automatically reducing the sound in the vehicle (such as the vehicle playing music sound), or automatically perform parameter tuning, etc.

As a possible implementation manner of the fourth aspect, the evaluation result of the voice quality output by the output module includes one or more of the following: quality of the test voice; factors affecting the quality of the test voice; and a way of adjusting the quality of the test voice.

As a possible implementation of the fourth aspect, the evaluation module is specifically configured to: obtain a first feature vector of the test speech, wherein the first feature vector includes a time-frequency feature vector of the test speech; Obtaining a second feature vector of the test speech, wherein the second feature vector is related to the semantics of the test speech; evaluating the speech quality of the test speech according to the second feature vector.

As a possible implementation of the fourth aspect, when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it includes: using the first evaluation index to evaluate the speech quality of the test speech, and the first evaluation index includes: each group The concentration index of the center position of the test speech, wherein, the test speech of different groups has different semantics, the center position of each group of test speech is the center position of the second feature vector of this group of test speech in the second feature space, the second The feature space is the space where the second feature vector resides.

As a possible implementation of the fourth aspect, when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it includes: using the second evaluation index to evaluate the speech quality of the test speech, and the second evaluation index includes: each group An indicator of the degree of dispersion of the second feature vector of the test speech in the second feature space, wherein different groups of test speech have different semantics, and the second feature space is the space where the second feature vector is located.

As a possible implementation of the fourth aspect, when the evaluation module evaluates the speech quality of the test speech according to the second feature vector, it includes: using a third evaluation index to evaluate the speech quality of the test speech, and the third evaluation index includes: each group The center position of the test voice and the similarity index of the center position of each group of reference voices corresponding to the semantics; Wherein, the test voices of different groups have different semantics, and the center position of each group of test voices is the second feature vector of this group of test voices in The center position in the second feature space, the second feature space is the space where the second feature vector is located, different groups of reference speeches have different semantics, the center position of each group of reference speech is the second feature vector of the group of reference speech at the The center position in the second feature space.

As a possible implementation of the fourth aspect, when the evaluation module obtains the first feature vector of the test speech, it includes: obtaining continuous frames contained in the test speech, wherein adjacent frames contain overlapping information; A plurality of eigenvectors including frequency domain features, the plurality of eigenvectors constitute a first eigenvector.

In order to achieve the above object, the fifth aspect of the present application provides a speech recognition quality prediction device, including: an acquisition module, used to obtain a test speech; a prediction module, used to predict the speech of the speech recognition model to the test speech according to the speech recognition quality function Recognition quality, the speech recognition quality function is used to indicate the relationship between the speech recognition quality and the speech quality, and the speech quality is evaluated according to the method of the first aspect or any possible implementation manner of the first aspect; the output module is used to output the speech Prediction results of recognition quality.

As a possible implementation of the fifth aspect, the prediction result of the speech recognition quality output by the output module includes one or more of the following: the speech recognition quality of the speech recognition model; Factor; a way to tune the speech recognition quality of a speech recognition model.

As a possible implementation of the fifth aspect, the construction process of the speech recognition quality function includes: obtaining multiple sets of degraded reference speeches of the reference speech; Statistical results; using multiple sets of degraded reference voices as test voices respectively, according to the method of the first aspect or any possible implementation mode of the first aspect, the voice quality evaluation results of multiple sets of degraded reference voices are obtained, and the first evaluation result is obtained; A speech recognition quality function is obtained according to the functional relationship between the first statistical result and the first evaluation result.

In order to achieve the above object, the sixth aspect of the present application provides a device for improving the quality of speech recognition, including: an acquisition module, used to acquire test speech; an evaluation and prediction module, used according to any one of the first aspect or the first aspect The method in a possible implementation manner obtains the speech quality evaluation result of the test speech; the output module is configured to output the speech quality evaluation result when the speech quality is lower than a preset first baseline.

From the above, the speech quality evaluation result of the test speech can be obtained, and the content related to the speech quality can be output according to the speech quality evaluation result of the test speech, wherein the output content can include whether to adjust the pre-processing parameters. For example, when applied to vehicles, it can monitor and give feedback on the voice quality in the vehicle, so that users can take corresponding measures to maintain the voice interaction experience in the vehicle. This process can be independent of the speech recognition process of the speech recognition model, realizing the decoupling of pre-processing and speech recognition.

As a possible implementation of the sixth aspect, the evaluation and prediction module is also used to predict speech recognition according to the method of the second aspect or any possible implementation of the second aspect when the voice quality is greater than or equal to the first baseline The quality; output module is also used to output the prediction result of the speech recognition quality when the speech recognition quality is lower than the preset second baseline.

In order to achieve the above object, the seventh aspect of the present application provides a vehicle, including: a voice collection device for collecting the user's voice command; a pre-processing device for pre-processing the sound of the voice command; a voice recognition device for Recognition of the pre-processed sound; the device in the fourth, fifth, and sixth aspects and any possible implementation manners thereof.

To achieve the above object, the eighth aspect of the present application provides a computing device, including one or more processors and one or more memories, the memories store program instructions, and when executed by the one or more processors, the program instructions make One or more processors implement the method of the first aspect and any possible implementation manner thereof.

To achieve the above object, the ninth aspect of the present application provides a computer-readable storage medium, on which program instructions are stored, and when the program instructions are executed by a computer, the computer implements the first aspect and any possible implementation manner thereof. method.

In order to achieve the above purpose, the tenth aspect of the present application provides a computer program product, which includes program instructions, and when the program instructions are executed by a computer, the computer implements the method of the first aspect and any possible implementation manner thereof.

To sum up, the embodiment of the present application realizes the decoupling of the evaluation of the speech pre-processing process and the prediction of speech recognition by the speech recognition model, and realizes the separate positioning of the pre-processing problem and the speech recognition problem, which is beneficial to the independent positioning and coordination of each module problem. performance tuning. Moreover, due to the above decoupling, the evaluation of the test voice quality can also be used to prompt the user for corresponding operations to improve the in-vehicle voice interaction experience, and the prediction of speech recognition can also be used to prompt the user for corresponding operations. Improve the voice interaction experience in the car.

These and other aspects of the present application will be made more apparent in the following description of the embodiment(s).

Description of drawings

The various features of the present application and the connections between the various features are further described below with reference to the accompanying drawings. The drawings are exemplary, some features are not shown to scale, and in some drawings, features customary in the field to which the application pertains and are not necessary for the application may be omitted, or additionally shown for the The application is not an essential feature, and the combination of the various features shown in the drawings is not intended to limit the application. In addition, in the whole specification, the content indicated by the same reference numeral is also the same. The specific accompanying drawings are explained as follows:

FIG. 1 is a schematic structural diagram of an application scenario 1 involved in an embodiment of the present application;

FIG. 2A is a schematic flow chart of a voice quality assessment method provided in an embodiment of the present application;

FIG. 2B is a schematic flow chart of a voice quality assessment method for test voice provided in an embodiment of the present application;

FIG. 3A is a schematic flowchart of a method for predicting speech recognition quality provided by an embodiment of the present application;

FIG. 3B is a schematic flow chart of a method for constructing a speech recognition quality function provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of a method for improving speech recognition quality provided by an embodiment of the present application;

FIG. 5 is a schematic flow chart of a specific implementation of the method for improving the quality of speech recognition provided by the embodiment of the present application;

FIG. 6A is a schematic diagram of a speech quality assessment device provided in an embodiment of the present application;

FIG. 6B is a schematic diagram of a speech recognition quality prediction device provided in an embodiment of the present application;

FIG. 6C is a schematic diagram of a device for improving speech recognition quality provided by an embodiment of the present application;

FIG. 7A is a schematic structural diagram of a vehicle provided in an embodiment of the present application;

FIG. 7B is a schematic diagram of the cockpit of the vehicle provided by the embodiment of the present application;

FIG. 8 is a schematic diagram of a computing device provided by an embodiment of the present application.

Detailed ways

The technical solutions provided by the present application will be further described below in conjunction with the accompanying drawings and examples. It should be understood that the system structure and business scenarios provided in the embodiments of this application are mainly to illustrate the possible implementation of the technical solution of this application, and should not be interpreted as the only limitation on the technical solution of this application. Those skilled in the art know that, with the evolution of the system structure and the emergence of new business scenarios, the technical solutions provided in this application are also applicable to similar technical problems.

It should be understood that the voice quality assessment solution provided by the embodiments of the present application includes a voice quality assessment method and device, a voice recognition quality prediction method and device, a method and device for improving voice recognition quality, a computer-readable storage medium, and a computer program product. Since the principles of these technical solutions to solve problems are the same or similar, in the introduction of the following specific embodiments, some repetitions may not be repeated, but it should be considered that these specific embodiments have been referred to each other and can be combined with each other.

When evaluating the voice quality under test, the index of Mean Opinion Score (MOS) can be used for evaluation. This index is also known as the subjective voice quality index. During the evaluation, multiple index levels can be used to evaluate the quality of the tested voice. The quality of the tested voice is obtained by averaging the scores of all test listeners.

When the average opinion score is used to evaluate the speech quality of the tested voice, the differences in auditory ability and subjective auditory experience between different listeners will cause differences in scores; especially when only a single sentence is provided without providing context, the scores of the test listeners are significantly different. This will lead to low objectivity of the evaluation result of the tested voice quality, and the evaluation method is poor in adaptability.

When evaluating the voice quality under test, objective evaluation indicators can also be used for evaluation. Objective evaluation indicators include Signal-to-Noise Ratio (SNR), Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Analysis (POLQA), short-term objective Intelligibility (Short-Time Objective Intelligibility, STOI), etc. The PESQ algorithm requires a noisy attenuation signal and an original reference signal. After level adjustment, input filter filtering, time alignment and compensation, and auditory transformation of the two speech signals to be compared, the parameters of the two signals are extracted respectively. , synthesize its time-frequency characteristics, get the PESQ score, and finally map this score to the subjective mean opinion score (MOS). POLQA is the successor to PESQ, extended to handle higher bandwidth audio signals. STOI is one of the important indicators to measure speech intelligibility, which is used to evaluate the intelligibility of noisy speech that has been masked in the time domain or short-time Fourier transformed and weighted in the frequency domain. STOI is scored by comparing the clean speech with the speech to be evaluated.

The method of evaluating the voice quality under test by using the above-mentioned objective evaluation indicators pays attention to the correlation between sound characteristics and subjective hearing perception from an acoustic point of view, but the relationship with the performance of machine-oriented speech recognition (that is, speech recognition as text or semantics) is unclear. It is difficult to serve as an effective reference for the speech recognition module in the process of R&D and tuning of the speech recognition system.

The embodiment of the present application provides an improved speech quality evaluation scheme, wherein the speech quality of the test speech related to semantics can be determined based on the time-frequency characteristics of the test speech, and the speech quality of the test speech can also be predicted based on the speech quality of the test speech related to semantics. Recognition quality, and by comparing the test speech and semantic-related speech quality and the predicted speech recognition quality with the baseline, determine the problem that affects the speech recognition quality, so as to improve the speech recognition quality by locating or solving the problem.

The speech quality evaluation solution provided by the embodiment of the present application can be applied to application fields such as quality detection and evaluation in the speech recognition process. For example, when applied in the smart cockpit of a vehicle, it can be used to determine the quality of the currently received voice in the cockpit, or the quality of voice recognition, and then give corresponding prompts or perform corresponding actions, such as reducing the music in the cockpit Prompts or actions that can improve the voice quality in the cockpit, such as volume, closing windows, or parameter tuning. For another example, when applied to smart terminals such as mobile phones and smart speakers, the voice quality or voice recognition quality of the terminal's current environment can be evaluated, and then corresponding prompts can be given, such as prompting whether the microphone is blocked, such as prompting to activate The camera can be combined with lip recognition during speech recognition to improve the accuracy of speech recognition, etc., or perform actions that can be allowed (for example, the permission of the corresponding application is set to call the camera), for example, start the camera to combine lip recognition during speech recognition identify. As another example, when applied to a testing terminal used as a quality test, the testing terminal can be used to test the voice recognition function of a product to be tested, such as testing the voice recognition quality of a vehicle, so as to conduct voice recognition-related tests on the vehicle. Tuning of parameters.

In some embodiments, the vehicle, smart terminal, and product under test usually have a microphone and a processor. Among them, the microphone is used to collect the user's voice; the processor can be used to pre-process the collected voice, and perform voice recognition on the pre-processed voice to recognize it as text, and further recognize instructions based on the recognized text. In some embodiments, when applied to a vehicle or an intelligent terminal, the vehicle or the intelligent terminal may also have a man-machine interface, which is used to provide the above corresponding prompts to the user through display or sound. In some embodiments, the processor may also perform parameter tuning of pre-processing or speech recognition according to user operations through a human-machine interface (Human Machine Interface, HMI). In some embodiments, the processor of the above-mentioned vehicle may be an electronic device, specifically, it may be a processor of a vehicle-mounted processing device such as a vehicle machine or a vehicle-mounted computer, or it may be a central processing unit (central processing unit, CPU), a microprocessor ( micro control unit, MCU) and other conventional chip processors. In some embodiments, when applied to a testing terminal for quality testing, the testing terminal may have a man-machine interface to provide the above corresponding prompts to the user through display or sound.

In some embodiments, the speech quality assessment solution provided by the embodiments of the present application may be embedded in the above-mentioned vehicles, smart terminals, and products under test, and exist as a functional module thereof. In some embodiments, when the voice quality evaluation solution provided by the embodiment of the present application is applied to an independent quality detection detection terminal, the detection terminal can be wired or Communicate wirelessly to obtain the required data, for example, the data is pre-processed voice data, based on which the voice quality test or the voice recognition quality can be predicted, and a prompt of the test result can be given. In some embodiments, the detection terminal can also feed back the test results to the device under test, and the device under test performs operations such as parameter tuning according to the test results.

In the following, further referring to FIG. 1 , a general description will be given of a scenario when the voice quality assessment solution provided by the embodiment of the present application is applied to a vehicle.

Fig. 1 shows a scenario in which the voice quality assessment solution provided by the embodiment of the present application is applied to a vehicle, which can be used for voice quality assessment, and can also be used for quality assessment of voice recognition. In this scenario, the cockpit of the vehicle includes: a sound collection module 110 , a pre-processing module 120 , a speech recognition module 130 , an evaluation and prediction module 140 , and an output module 150 . Wherein, the pre-processing module 120, the speech recognition module 130 and the evaluation and prediction module 140 may be implemented by the same processor of the vehicle, or may be implemented by three or more processors respectively.

The sound collection module 110 can be a microphone, which can be used to collect test voices broadcast by users, wherein a test voice corresponds to a reference voice, and both correspond to the same sentence content, such as corresponding to the same voice command, and have the same semantics. In the stage of establishing each model for speech quality assessment, the sound collection module 110 is also used to collect reference speech. The reference speech refers to the speech used for training the speech recognition module 130 or training the semantic-related feature model described later. The semantic-related feature model is used for speech quality assessment, which will be further described later.

The pre-processing module 120 is used to perform pre-processing such as pre-emphasis, framing or windowing on the collected sound, so that the user's test voice contained in the sound can be recognized more easily. Among them, pre-emphasis includes emphasizing the high-frequency part of the voice, removing the influence of lip radiation, and increasing the high-frequency resolution of the voice; framing uses the short-term stationarity of the voice signal to divide the voice signal into individual voice frames for processing. There is overlap between adjacent speech frames to make each speech frame continuous; windowing is to strengthen the speech waveform near each speech frame sampling and weaken the rest of the waveform, so as to make the speech smooth. Among them, the parameter adjustment of the pre-processing includes one or more of the following: the frequency band of the high frequency targeted by the pre-emphasis processing, the degree of emphasis, the frame length in the framing processing, the length of the overlapping frame, and the partial waveform in the windowing processing The degree of strengthening and the degree of weakening of part of the waveform.

A speech recognition (Automatic Speech Recognition, ASR) module 130 is used for recognizing the sentence content of the pre-processed test speech. Vocabulary in the test speech can be recognized through the speech recognition module and converted into computer-readable character sequences. After the content of the voice recognition is obtained, the control command can be further recognized based on the content, and the control command corresponding to the voice can be executed by the vehicle actuator. In some embodiments, the recognition process from speech recognition content to control instructions can identify control instructions based on keyword matching, and can also identify corresponding control instructions based on neural network semantic recognition technology. Wherein, adjusting the parameters of the speech recognition module refers to adjusting the parameters of the speech recognition model of the speech recognition module, for example, adjusting the parameters and hyperparameters of the neural network of the speech recognition model.

The assessment and prediction module 140 is used to implement voice quality assessment, and can generate a semantically related voice quality assessment result for the test voice. In some embodiments, the speech recognition quality of the speech recognition module for the test speech can also be predicted according to the speech quality evaluation result. In some embodiments, the evaluation and prediction module is also described as including an evaluation module and a prediction module, respectively implementing the above speech quality evaluation and speech recognition quality prediction.

The output module 150 is used for outputting information such as speech quality assessment results or prediction results of speech recognition quality. The output content can be provided to the vehicle controller, so that the vehicle can perform corresponding operations accordingly. The outputted content can also be provided to the user through the man-machine interface of the vehicle. In some embodiments, the outputted information includes the quality of the test voice, the quality of the voice recognition of the voice recognition model, factors affecting the quality of the test voice, Adjusting the manner of testing voice quality, etc. In some embodiments, the human-machine interface may include a display screen in the vehicle cockpit (a display screen such as a liquid crystal display screen, a head-up display (Head Up Display, HUD) and the like), and a speaker to prompt the user through images or sounds.

In some embodiments, the man-machine interface may be a central control panel. After receiving the above-mentioned information output by the output module 150 through the central control panel, the user may use the man-machine interface to adjust the parameters of the pre-processing module 120 or the voice recognition module 130. Or control the relevant actuators in the car, such as controlling the opening and closing of the car windows, controlling the playback volume of the audio playback device in the car, etc. Among them, the parameter adjustment interface provided by the man-machine interface can be displayed in a way that is easy to understand and adjust for ordinary users (such as graphical display), or can be displayed in a way for professional maintenance personnel.

In other embodiments, the evaluation and prediction module 140 may also be deployed on an independent test device or in the cloud, and the output module 150 may also be deployed on an independent test device. The test equipment here can be a special test equipment, and can be an intelligent terminal installed with corresponding software, for example, the intelligent terminal can be a mobile phone, a computer, a tablet computer, and the like. When the above modules are deployed on independent test equipment or cloud, the communication between the vehicle and the test equipment or cloud can be realized based on communication technology.

Various method embodiments of the present application are introduced below based on various drawings.

The embodiment of the present application provides a voice quality evaluation method, which can be used to evaluate the voice quality according to the test voice set. FIG. 2A shows the flow of an embodiment of a method for assessing voice quality. In this embodiment, the application to a vehicle is used as an example for illustration, which includes steps S210 to S230.

S210: Obtain a test voice.

In this embodiment, the test voice is obtained through a sound collection module arranged in the vehicle cabin. The sound collection module may be a microphone, and in some embodiments, may be a plurality of microphones arranged in different positions of the vehicle cabin.

In some embodiments, this step may be performed during the test or inspection of the vehicle, and the test voice may be broadcast by the tester.

In some embodiments, this step may be performed while the user is using the vehicle, for example, when the vehicle is running or parked. At this time, the test voice may be broadcast by the driver (ie, the user). Wherein, the test voice matches the voice command of the vehicle. Since the driver already knows the voice content used in the voice command, the voice command broadcast by the driver can be used as the test voice.

In some embodiments, this step may be triggered when the vehicle cannot accurately recognize the voice command of the user (such as the driver), and use the voice command that the user has broadcast or re-broadcast as the test voice. In some embodiments, when this step is triggered, the vehicle can also prompt the user to enter the speech quality assessment process of this embodiment through the man-machine interface through the man-machine interface, or it can be further guided by guidance The user broadcasts the corresponding test voices.

In some embodiments, the user can repeat the broadcast multiple times for a certain voice content (such as a voice command), and the multiple voices obtained are also called multiple voices corresponding to the group of test voices, or called corresponding to a multiple samples of the corpus. The user can also repeat the broadcast several times for several voice commands respectively, so as to obtain multiple voices corresponding to these groups of test voices respectively. For example, the multiple voices of "turn on the air conditioner" broadcast by the user for many times are a group of test voices, and the multiple voices of "increase the volume" broadcasted by the user for many times are another group of test voices. Among them, a group of test voices corresponds to a voice instruction, or corresponds to a semantic meaning, or corresponds to the same sentence content.

S220: Evaluate the speech quality of the test speech according to the semantic related information of the test speech.

In this embodiment, the evaluation of the voice quality is performed by an evaluation module of the vehicle. In this embodiment, the evaluation module is implemented by a processor of the vehicle, and the processor is connected to the sound collection module by signal.

In this embodiment, the speech quality evaluation result of the test speech is related to predetermined semantics, so that the speech quality evaluation result can not only be used to evaluate the speech quality, but also be used to predict the speech recognition quality of the speech recognition module.

In this embodiment, the semantic-related information of the test speech is a semantic-related feature vector generated by the neural network for the test speech, and the feature vector is the output of any layer before the output layer of the neural network. In some other embodiments, the feature vector may also be formed by cascading outputs of multiple layers before the output layer of the neural network, and the multiple layers may be any two or more layers.

In some embodiments, the semantic related information of the test speech may be a one-dimensional vector formed by the output of the output layer of the neural network. For example, the output of the output layer corresponds to a one-dimensional vector formed by each confidence level of each voice instruction (ie, each semantic meaning). For semantic related information, further detailed introduction will be made in step S223 later.

S230: Output an evaluation result of the speech quality of the test speech.

In this embodiment, it can be output to the man-machine interface of the vehicle to show the evaluation result to the user. The human-machine interface can include a display screen, which can be a vehicle central control screen, a head-up display (Head Up Display, HUD), or an augmented reality head-up display (Augmented Reality-HUD, AR-HUD), etc. The machine interface can also include a speaker, and the human-machine interface can also include an input component, which can be a touch screen integrated in the display screen, or an independent button, etc. Through the man-machine interface, the prompt can be executed in the form of image, text, or voice.

In some embodiments, the evaluation result includes content related to the voice quality of the test voice, and the content related to the test voice quality may include one or a combination of the following: the quality of the test voice, factors affecting the quality of the test voice, and adjusting the quality of the test voice The way.

In some embodiments, the factors that affect the quality of the test voice may include one or a combination of the following: the window is open, and this factor introduces noise from outside the vehicle; the vehicle speed is too high, and this factor makes tire noise or engine noise too loud; Other sounds in the car are too loud, such as music played by the car, etc.; the performance or location of the microphone in the car.

In some embodiments, the way of adjusting the voice quality of the test may include one or a combination of the following: closing the car window, reducing the speed of the car, reducing the sound in the car, replacing a problematic microphone, and tuning the parameters of the pre-processing module.

In some embodiments, as shown in the flow chart of FIG. 2B , the above step S220 includes steps S221 to S225.

S221: Obtain a first feature vector of the test speech. Wherein, the first feature vector includes time-frequency features of the test speech.

In this embodiment, multiple feature vectors including frequency domain features of each consecutive frame of the test speech may be obtained, and the multiple feature vectors constitute the first feature vector. In this embodiment, adjacent frames may have overlapping information. In this embodiment, the pre-processing process of the pre-processing module may perform frame division processing on the acquired test voice to form the continuous frames. In some embodiments, the pre-processing process also includes processing such as pre-emphasis and windowing. By making adjacent frames contain overlapping information, on the one hand, the frame data can be complete, and on the other hand, the parameters of the feature vector can be changed smoothly. In some other embodiments, adjacent frames may not contain overlapping information, which can reduce the amount of data to be processed, thereby increasing the processing speed.

Wherein, since the first feature vector includes the time-frequency feature of the test speech, it is also called the time-frequency feature vector diagram, which is a two-dimensional figure, one dimension is the time coordinate, and the other dimension is the frequency coordinate. Intensity is the intensity of the corresponding frequency for each successive frame.

In some embodiments, the frequency domain features include one or a combination of the following: Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), spectrum.

S223: Obtain a second feature vector of the test voice according to the first feature vector of the test voice. Wherein, the second feature vector is related to the semantics of speech.

In some embodiments, the semantically relevant feature model is used to extract the second feature vector based on the first feature vector, wherein the semantically relevant feature model represents the relationship between semantics and time-frequency features, so the extracted second feature vector is related to semantics .

In some embodiments, the semantically relevant feature model is constructed according to the first feature vector of the reference speech and the semantics of the reference speech. In some embodiments, a semantically relevant feature model can be constructed based on a neural network, and the neural network can be a fully connected neural network (Fully Connected Neural Network, FCNN), a recurrent neural network (Recurrent Neural Network, RNN), a convolutional neural network (Convolutional Neural Network, CNN), etc. Wherein the present embodiment adopts CNN network. When constructing a semantic-related feature model, the semantic-related feature model can be trained in combination with the pre-processing module. For example, during training, the reference speech has semantic annotations, and the reference speech is passed through the pre-processing module to obtain the first feature vector, and then the first feature vector The semantic correlation model is input, and the semantic correlation model is trained according to whether the semantics of the semantic correlation feature model is converged. Gradient descent method, confrontation network method, etc. can be used for training.

In some embodiments, the semantic-related feature model constructed based on the neural network includes a multi-layer network, and the output of the semantic-related feature model corresponds to each semantic classification. Wherein, the second feature vector may be a feature vector output by any layer of the multi-layer network, since the first feature vector is used as the input of the neural network, the second feature vector is a feature vector extracted based on the first feature vector . On the other hand, since each semantic corresponds to the output of the neural network (the neural network is equivalent to a classification network, and each category corresponds to each semantic), it can be understood that the second feature vector is a feature vector related to semantics.

In some embodiments, when the output of a layer before the output layer of the neural network is used as the second feature vector, the second feature vector may be a feature vector with more dimensions than the first feature vector. For example, when a CNN network is used, when multiple convolution kernels are used to obtain the second feature vector, the operation results of the multiple convolution kernels constitute multiple dimensions of the second feature vector.

In some embodiments, the second feature vector can also be a feature vector formed by the cascade of two or more layers of output vectors of the neural network (similar to the residual connection), so that the second feature vector can not only have low-level features, It is also possible to have high-level features at the same time. For example, the output of the second layer network of the neural network and the output of the fourth layer network are cascaded to form the second feature vector.

In some embodiments, when the output of the output layer of the neural network is used as the second feature vector, the second feature vector is a vector composed of confidence levels corresponding to each semantic meaning. For example, when the output of the neural network is 20 nodes, and the 20 nodes correspond to 20 classifications (that is, for identifying one of the 20 semantics), then the second feature vector can be a one-dimensional vector with 20 parameters, each The value of the parameter corresponds to the confidence of each semantic.

Among them, the reference voice set includes several groups of reference voices, each group of reference voices corresponds to the same semantics, for example, the semantics can be "turn on the air conditioner", "turn off the air conditioner", "turn up the volume", "turn down the volume", etc. Semantics of the command. Wherein, the reference voice can be understood as a voice collected in an environment with little noise, or as a standard voice.

S225: Evaluate the voice quality of the test voice according to the second feature vector of the test voice.

Wherein, because the second feature vector is related to semantics, the evaluated voice quality is also related to semantics. When the test voice is a test voice set composed of multiple test voices, the voice quality of the test voice can be understood as the voice quality of the test voice set.

In some embodiments, the voice quality evaluation result includes the following three evaluation indicators:

The first evaluation index: represent the concentration degree index of the central position of each group of described test voices, wherein, the described test voices of different groups have different semantics, and the central position of each group of described test voices is the described test voice of this group The center position of the second feature vector in the second feature space, wherein the second feature space is the space where the second feature vector is located. Wherein, a certain group of test voices may include multiple test voices, and the center position of the group of test voices may be obtained through calculation based on the distribution of the second feature vectors of the multiple test voices.

In some embodiments, Mahalanobis distance, Euclidean distance, or other distance indicators that can calculate the similarity between samples can be used as the concentration index.

In some embodiments, the evaluation method of the first evaluation index can be as follows: first, for each group of test voices, calculate the distance between the center position of this group of test voices and the center positions of other groups of test voices, and obtain the Then, the average value of each distance minimum value obtained for each group of test voices is taken as an index of the concentration degree of the center position of each group of test voices. The calculation method of the first evaluation index D1 can be shown in the following formula (1):

Wherein C represents the grouping type in the test voice set, i.e. the corpus type of the test voice set, μ _tj represents the center position of the j group test voice in the test voice set, μ _ti represents the center position of the i group test voice in the test voice set, and ∑ represents the test voice The covariance matrix of the distribution of the second eigenvectors of the jth group of test utterances in the utterance set.

The second evaluation index: represents the degree of dispersion index of the second feature vector of each group of the test speech in the second feature space, wherein, the test speech of different groups has different semantics, and the second feature space is The space in which the second eigenvector resides.

In some embodiments, the evaluation method of the second evaluation index can be as follows: first, for each group of test speech, calculate the semi-axis length of the distribution of the second feature vector of the group of test speech in the space where the second feature vector is located, and then , take the mean value of the semi-principal lengths obtained for each group of test voices, and use it as the dispersion degree index of each group of test voices. The calculation method of the second evaluation index D2 can be shown in the following formula (2):

Wherein C represents the grouping type in the test speech set, d represents the dimension of the second feature vector of each group of test speech in its space, _fjk represents the k-th dimension semi-principal length of the second feature vector of the j group of test speech, ∑ represents the covariance matrix of the second eigenvector distribution of the jth test utterance set in the test utterance set.

The 3rd evaluation index: the similar degree index of the center position of each group of reference speeches corresponding to the center position of each group of described test speeches of semantics; Wherein, the described test speeches of different groups have different semantics, each group described test speeches The central position is the central position of the second feature vector of the group of test voices in the second feature space, the second feature space is the space where the second feature vector is located, and the reference voices of different groups have different Semantics, the center position of each group of reference voices is the center position of the second feature vector of the group of reference voices in the second feature space.

In some embodiments, the evaluation method of the third evaluation index can be as follows: First, in the space where the second feature vector is located, for each group of test speeches, calculate the difference between the center positions of the group of test speeches and the center positions of a group of reference speech sounds The distance between this group of reference voices is the same as the group of test voice semantics, that is, corresponding to the same sentence content, and then, the average value of each distance obtained for each group of test voices is taken as the similarity index between the test voice set and the reference voice set . Wherein, the center position of a certain group of test voices refers to the distribution center of the second feature vectors of the group of test voices, and the center position of a certain group of reference voices refers to the distribution center of the second feature vectors of the group of reference voices. The calculation method of the third evaluation index D3 can be shown in the following formula (3):

Wherein C represents the grouping type in the test speech set, μ _rj represents the central position of the jth group of reference speeches, μ _tj represents the center position of the jth group of test speeches, and ∑ represents the second feature vector of the jth group of reference speeches in the reference speech set The covariance matrix of the distribution.

In some embodiments, the speech quality assessment result includes one of the above assessment indicators, and may also include a combination of any number of assessment indicators, and different assessment indicators may have different weights.

The embodiment of the present application also provides a speech recognition quality prediction method, which can predict the speech recognition quality of the test speech based on the speech quality evaluation result of the test speech. FIG. 3A shows the flow of an embodiment of the speech recognition quality prediction method, which includes steps S310 to S340.

S310: Obtain a test voice.

Wherein, for this step, reference may be made to the descriptions in the above S210 or its various embodiments, and details are not repeated here.

S320: Evaluate the speech quality of the test speech according to the semantic related information of the test speech.

Wherein, for this step, reference may be made to the descriptions in the above S220 or its various embodiments, and details are not repeated here.

In some embodiments, the speech recognition quality prediction method and the aforementioned speech quality evaluation method can be integrated and executed in one process. In this case, the content described in the above steps S310 and S320 can directly use the execution results of the aforementioned steps S210 and S220, There is no need to repeat the same content.

S330: Predict the speech recognition quality of the speech recognition model for the test speech by using the speech recognition quality function according to the evaluated test speech quality. Wherein, the speech recognition quality function is used to indicate the relationship between speech recognition quality and speech quality.

In some embodiments, the recognition result of the speech recognition model is used in constructing the speech recognition quality function, so the predicted speech recognition quality can be used as the prediction result of the recognition quality of the speech recognition model. Wherein the speech recognition model can be realized by the above-mentioned speech recognition module.

S340: Output a prediction result of speech recognition quality.

Wherein, the outputted speech recognition quality prediction results include: the predicted speech recognition quality of the speech recognition module, factors affecting the speech recognition quality of the speech recognition module, and ways to adjust the speech recognition quality of the speech recognition module.

In some embodiments, the factors affecting the voice recognition quality of the voice recognition module include: the windows are open, the vehicle speed is too high, the sound inside the vehicle is too loud, the performance and deployment position of the microphone, and the performance of the voice recognition model of the voice recognition module.

In some embodiments, the manner of adjusting the voice recognition quality of the voice recognition module includes one or a combination of the following: closing the window, reducing the speed of the vehicle, reducing the sound in the vehicle, replacing the microphone in question or optimizing the deployment of the microphone, and adjusting the parameters of the pre-processing module. Tuning, tuning of the speech recognition model parameters of the speech recognition module.

Wherein, in this embodiment, the prediction result can be output to the man-machine interface of the vehicle, so as to show the evaluation result to the user. For details, reference may be made to the description of the above step S230.

In some embodiments, the speech recognition quality function in the above step S330 can be constructed as shown in FIG. 3B , and the construction process of the function includes steps S321 to S327.

S321: Perform multiple degradations on each reference voice in the reference voice set, each degradation forms a set of degraded voice sets, thereby obtaining multiple sets of degraded reference voice sets.

Among them, the reference speech is degraded to different degrees to generate multiple sets of degraded speech, each degraded speech of each degree, or each degraded speech of different ways constitutes a set of degraded speech sets. Wherein, the degradation method may be voice scrambling. In some embodiments, each reference voice in the reference voice set may be scrambled according to the possible noise environment of the vehicle, such as adding background music interference, simulated tire noise, and wind resistance noise interference. , Simulated interference from other car horns outside the car, etc.

S323: According to the method of the aforementioned voice quality assessment method embodiment, obtain voice quality assessments for multiple sets of degraded reference voices, and obtain a first assessment result.

Wherein, each set of degraded reference voices is used as the test voice, and the voice quality evaluated for each set of degraded reference voices is obtained according to the voice quality assessment method in the foregoing embodiments, and the voice quality of each set of degraded reference voices constitutes the first evaluation result.

S325: Perform recognition and statistics on multiple sets of degraded reference voices according to the above-mentioned voice recognition model, and use the statistical voice recognition results as a first statistical result.

Wherein, each set of degraded reference speech is recognized by using the speech recognition model, and the speech recognition result of each set of degraded reference speech recognition is generated, and the speech recognition results of each set of degraded reference speech constitute the first statistical result.

S327: Obtain a speech recognition quality function according to a functional relationship between the first statistical result and the first evaluation result.

In some embodiments, the speech recognition quality function can be constructed based on machine learning. For example, for the first statistical results and the first evaluation results of each set of degraded reference speech, the speech recognition quality function can be constructed by fitting a polynomial. It can also be based on Deep learning builds a speech recognition quality function, for example, by training a neural network model.

In some embodiments, each evaluation indicator in the first evaluation result is used as a dependent variable to construct a speech recognition quality function. In some other embodiments, one or more combined indicators in the first evaluation result are combined and used as the dependent variable to construct the speech recognition quality function. The evaluation indicators here are, for example, the indicators shown in the above formula (1) to formula (3).

The embodiment of the present application also provides a method for improving speech recognition quality. It can be determined based on the voice quality assessment results of the test voice and the voice recognition quality prediction results that the pre-processing module or the voice recognition module needs to be tuned to improve the voice recognition quality, as shown in Figure 4. Improve the voice recognition quality The process of the method embodiment includes steps S410 to S480.

S410: Obtain a test voice.

Wherein, for this step, reference may be made to the above step S210 or the descriptions in the various embodiments thereof, which will not be described in detail here.

S420: Obtain the speech quality of the evaluation test speech.

Wherein, for this step, reference may be made to the above step S220 or the descriptions in its various embodiments, and will not be described in detail here.

S430: Determine whether the voice quality is lower than a preset first baseline. Wherein, when the voice quality is lower than the first baseline, perform step S440, otherwise, perform step S450.

Wherein, the first baseline judges the evaluation index of voice quality, and is also referred to as the index baseline. In some embodiments, the first baselines are respectively set for each evaluation index for evaluating voice quality; in other embodiments, each evaluation index for evaluating voice quality is combined into one or more combined indexes, and the corresponding first baseline is set. baseline.

S440: Output an evaluation result of the speech quality of the test speech.

Wherein, for this step, reference may be made to the above-mentioned step S230 or the descriptions in its various embodiments, which will not be described in detail here.

In some embodiments, after this step is performed, it may return to step S410, or end this process.

In some embodiments, considering the fault tolerance of the speech recognition model, step S450 may be continued, or step S480 may be executed to continue speech recognition.

S450: Predict speech recognition quality.

Wherein, for this step, reference may be made to the above step S330 or the descriptions in its various embodiments, and will not be described in detail here.

S460: Determine whether the predicted speech recognition quality is lower than a preset second baseline. Wherein, when the voice recognition quality is lower than the preset second baseline, step S470 is performed; otherwise, step S480 is performed.

Wherein, the second baseline judges the speech recognition quality, that is, judges the speech recognition accuracy of the semantic recognition model, and is also the accuracy baseline.

S470: Output the prediction result of the speech recognition quality.

Wherein, for this step, reference may be made to the above-mentioned step S340 or descriptions in various embodiments thereof, and details are not described here again.

In some embodiments, considering the fault tolerance of the speech recognition model, step S480 may also be performed to continue speech recognition.

S480: Recognize the test voice by the voice recognition module.

In some embodiments, when the speech quality evaluated in step S430 is higher than the first baseline, and the speech recognition quality predicted in step S460 is higher than the second baseline, it is considered that the accuracy of the speech recognition quality will be higher, and the recognition result It can be used as a follow-up, such as for controlling the vehicle, etc.

In some embodiments, when entering this step when the speech quality evaluated in step S430 is lower than the first baseline, or the predicted speech recognition quality in step S460 is lower than the second baseline, the speech recognition result can be further prompted Give user confirmation to determine whether to use the speech recognition result.

In some embodiments, the user may adjust pre-processing parameters or speech recognition model parameters according to the evaluation result of the speech quality output in step S440 or the prediction result of the speech recognition quality output in step S470, so as to Improve the quality of speech recognition. In some other embodiments, the device under test, such as a vehicle, may automatically perform the pre-processing according to the evaluation result of the speech quality output in step S440 or the prediction result of the speech recognition quality output in step S470. Adjustment of processing parameters or speech recognition model parameters.

In the following, in order to further understand the technical solutions of the above-mentioned embodiments, the specific implementation of the method for improving the quality of speech recognition will be described, wherein the specific implementation of the method for improving the quality of speech recognition involves speech quality assessment The steps of the method, the steps of the speech recognition quality prediction method, the steps corresponding to these two parts can also be independent with reference to the foregoing embodiments, as the specific implementation of the speech quality evaluation method and the speech recognition quality prediction method, in order to simplify the description, the The separate specific implementation manners of these two parts will not be described in detail.

As shown in Fig. 5 the flow process of the specific implementation of speech recognition quality prediction method, comprises the following steps:

S510: The vehicle acquires a test voice set based on the reference voice set through the microphone arranged in the cockpit.

Wherein, the test voice set includes several groups. For example, the tester sits in the driver's seat of the vehicle and broadcasts each group of test voices in turn. The semantics of each group can correspond to a common command. Each group of test voices includes 10 test voices broadcast by the tester. A test voice with the same content.

The content of the test voice reported corresponds to the content of the reference voice set. In this example, the vehicle guides the testers to play the test voices in a guided manner through the man-machine interface. For example, each voice content of the corresponding reference voice set and the number of times to be played can be displayed on the screen in turn, and the testers will broadcast accordingly.

S515: The collected test voice is pre-processed by the pre-processing module on the vehicle, including extracting the time-frequency feature vector graph of each test voice in the test voice set, that is, the first feature vector.

S520: Using the semantic correlation feature model, obtain a semantic correlation feature vector of the test voice in the test voice set, that is, a second feature vector, according to the time-frequency feature vector map.

S525: Evaluate the speech quality of the test speech set based on the semantically relevant feature vectors of the test speech. The speech quality can be evaluated using one or more of the above formulas (1) to (3).

S530: According to the evaluation result of the voice quality, judge whether the voice quality is lower than the set index baseline, if it is lower than the set index baseline, go to step S535, otherwise go to step S540.

S535: Output an evaluation result of the speech quality of the test speech.

In this example, the evaluation result can be output to the man-machine interface, and the evaluation result includes content related to the voice quality of the test voice. In this example, the content related to the speech quality of the test speech displayed on the man-machine interface may include: prompting the user to tune the parameters and algorithms in the pre-processing module. Among them, the adjustable interface and parameters can be displayed in a graphical form.

S540: Using the speech recognition quality function, based on the speech quality of the test speech set, predict the speech recognition quality of the speech recognition module for the test speech set.

S545: Determine whether the predicted recognition quality is lower than the set accuracy baseline, if it is lower than the accuracy baseline, go to step S550, otherwise go to step S555.

S550: Output the prediction result of the speech recognition quality.

In this example, the prediction result can be output to the man-machine interface, and the prediction result includes content related to speech recognition quality.

In this example, the content related to the speech recognition quality displayed on the man-machine interface includes: prompting the user to optimize the speech recognition model of the speech recognition module. Among them, the adjustable interface and parameters can be displayed in a graphical form.

S555: Prompt that the quality of this speech evaluation and the predicted speech recognition quality meet the standard.

In addition, in this specific implementation manner, after the user performs corresponding parameter tuning according to the output content of step S535 or step S550, the speech recognition quality function can be further optimized. Specifically, the steps in the speech recognition quality function construction method can be used to retrain the speech recognition quality function to optimize the speech recognition quality function. For example, after the pre-processing parameters or algorithms are optimized, the speech recognition quality function can be retrained according to the above-mentioned steps S323-S327; Quality function retraining.

The embodiments of the present application also provide corresponding devices. Regarding the beneficial effects or technical problems solved by the devices, you can refer to the descriptions in the methods corresponding to the devices respectively, or refer to the descriptions in the summary of the invention, which are only briefly described here. Each device in this embodiment may be used to implement each optional embodiment in the foregoing method. Each device embodiment of the present application will be described below based on each figure.

The speech quality assessment device provided by the embodiment of the present application can be used to implement various embodiments of the speech quality assessment method. As shown in FIG. 6A , the device has an acquisition module 610 , an evaluation module 620 and an output module 630 .

The obtaining module 610 is used for obtaining test voice. It is specifically used to execute the foregoing step S210 and various embodiments thereof.

The evaluation module 620 is used for evaluating the speech quality of the test speech according to the semantic related information of the test speech. It is specifically used to execute the foregoing step S220 and various embodiments thereof.

The output module 630 is configured to output the evaluation result of the voice quality. It is specifically used to execute the foregoing step S230 and various embodiments thereof.

In some embodiments, the evaluation result of the voice quality output by the output module 630 includes one or more of the following: the quality of the test voice; factors affecting the quality of the test voice; adjusting the quality of the test voice The way. For specific content, refer to the description in the aforementioned step S230.

In some embodiments, the evaluation module 620 is specifically configured to: obtain a first feature vector of the test voice, wherein the first feature vector includes a time-frequency feature vector of the test voice; Obtaining a second feature vector of the test speech from the first feature vector, wherein the second feature vector is related to the semantics of the test speech; evaluating the speech quality of the test speech according to the second feature vector.

In some embodiments, when evaluating the speech quality of the test speech according to the second feature vector, the evaluation module 620 includes: using a first evaluation index to evaluate the speech quality of the test speech, and the first evaluation The indicators include: the concentration index of the center position of each group of the test voices, wherein the test voices of different groups have different semantics, and the center position of each group of the test voices is the second feature of the group of test voices The center position of the vector in the second feature space, where the second feature space is the space where the second feature vector is located.

In some embodiments, when evaluating the speech quality of the test speech according to the second feature vector, the evaluation module 620 includes: using a second evaluation index to evaluate the speech quality of the test speech, the second evaluation The index includes: the degree of dispersion index of the second feature vector of each group of the test speech in the second feature space, wherein, the test speech of different groups has different semantics, and the second feature space is the second feature space. The space of the two eigenvectors.

In some embodiments, when evaluating the speech quality of the test speech according to the second feature vector, the evaluation module 620 includes: using a third evaluation index to evaluate the speech quality of the test speech, the third evaluation Index comprises: the similar degree index of the central position of each group described test speech and the central position of each group of reference speech corresponding to semantics; Wherein, the described test speech of different groups has different semantics, the central position of each group described test speech The central position of the second feature vector of the group of test voices in the second feature space, the second feature space is the space where the second feature vector is located, and the reference voices of different groups have different semantics, The center position of each group of reference voices is the center position of the second feature vector of the group of reference voices in the second feature space.

In some embodiments, when the evaluation module 620 obtains the first feature vector of the test speech, it includes: obtaining consecutive frames contained in the test speech, wherein adjacent frames contain overlapping information; according to the A plurality of feature vectors including frequency domain features are obtained for consecutive frames, and the plurality of feature vectors constitute the first feature vector.

The embodiment of the present application also provides a device for speech recognition quality prediction, which can be used to implement the method embodiment of speech recognition quality prediction, as shown in Figure 6B, the device has an acquisition module 612, a prediction module 622 and an output Module 632.

The obtaining module 612 is used for obtaining the test voice. It is specifically used to execute the above step S310 and various embodiments thereof.

The prediction module 622 is used to predict the speech recognition quality of the speech recognition model for the test speech according to the speech recognition quality function, the speech recognition quality function represents the relationship between the speech recognition quality and the speech quality, and the speech quality is based on the aforementioned Any one of the possible embodiments of the voice quality assessment method is evaluated. It is specifically used to execute the above steps S320-S330 and various embodiments thereof.

The output module 632 is used to output the prediction result of speech recognition quality. It is specifically used to execute the above step S340 and various embodiments thereof.

In some embodiments, the prediction result of the speech recognition quality output by the output module 632 includes one or more of the following, including one or more of the following: the speech recognition quality of the speech recognition model; Factors affecting the speech recognition quality of the speech recognition model; ways of adjusting the speech recognition quality of the speech recognition model.

In some embodiments, the construction process of the speech recognition quality function includes: obtaining multiple sets of degraded reference speeches of the reference speech; obtaining speech recognition results of multiple sets of the degraded reference speeches according to the speech recognition model, and using them as the first A statistical result; respectively using multiple sets of the degraded reference voices as test voices, and obtaining multiple sets of voice quality assessment results of the degraded reference voices according to any possible embodiment of the voice quality assessment method, to obtain a first evaluation result; The speech recognition quality function is obtained according to a functional relationship between the first statistical result and the first evaluation result.

The embodiment of the present application also provides a device for improving speech recognition quality, which can be used to implement the method embodiment for improving speech recognition quality, as shown in Figure 6C, the speech quality evaluation device has an acquisition module 614, an evaluation prediction module 624 and output module 634 .

The obtaining module 614 is used for obtaining the test voice. It is specifically used to execute the above step S410 and various embodiments thereof.

The evaluation prediction module 624 is used to evaluate the speech quality of the test speech. Any possible embodiment of the foregoing voice quality assessment method may be used for assessment. It is specifically used to execute the above steps S420-S430 and various embodiments thereof.

The output module 634 is configured to output the assessment result of the voice quality when the voice quality is lower than the preset first baseline. It is specifically used to execute the above step S440 and various embodiments thereof.

In some embodiments, the evaluation and prediction module 624 is further configured to predict the speech recognition quality according to the method described in any possible embodiment of the speech recognition quality prediction method when the speech quality is greater than or equal to the first baseline . It is specifically used to execute the above steps S450-S460 and various embodiments thereof.

The output module 634 is further configured to output the prediction result of the speech recognition quality when the speech recognition quality is lower than a preset second baseline. It is specifically used to execute the above step S470 and various embodiments thereof.

The embodiment of the present application also provides a vehicle, as shown in Figure 7A and Figure 7B, the vehicle includes a sound collection device 710, a pre-processing device 720, a speech recognition device 730, and the aforementioned speech quality evaluation device, speech recognition quality Predictive means or means to improve the quality of speech recognition.

The sound collecting device 710 is used for collecting the semantic commands based on the reference speech spoken by the driver. In FIG. 7B it may be a microphone in the cockpit. In the example shown in FIG. 7B , the microphone is set at the central control panel 740 , and may also be set at one or more other positions such as the instrument panel above the steering wheel, the rearview mirror in the cockpit, and the steering wheel.

The pre-processing device 720 is used for pre-processing the speech broadcast by the driver.

The voice recognition device 730 is used to recognize the driver's command when the voice quality requirement of the driver's command and the predicted voice recognition quality meet the requirements.

The aforementioned speech quality evaluation device, speech recognition quality predicting device or speech recognition quality improving device is used for the aforementioned purpose, and the user can perform corresponding operations based on this to improve speech quality and speech recognition quality.

In the vehicle cockpit shown in FIG. 7B , a central control panel 740 as a man-machine interface is also shown, through which the user receives the output of a speech quality evaluation device, a speech recognition quality prediction device, or a device for improving speech recognition quality. The information can be displayed and displayed, and the parameter adjustment interface can be displayed with the help of the central control panel 740, which is convenient for the user to perform the above-mentioned parameter adjustment by manipulating the central control panel 740.

In the vehicle cockpit shown in Figure 7B, the pre-processing device 720, the speech recognition device 730, and the speech quality evaluation device, the speech recognition quality prediction device or the device for improving the speech recognition quality can be processed by one or more in the vehicle In this embodiment, it may be implemented by a processor of the vehicle infotainment system.

FIG. 8 is a schematic structural diagram of a computing device 800 provided by an embodiment of the present application. The computing device 800 includes: a processor 810 , a memory 820 , and a communication interface 830 .

It should be understood that the communication interface 830 in the computing device 800 shown in FIG. 8 can be used to communicate with other devices.

Wherein, the processor 810 may be connected to the memory 820 . The memory 820 can be used to store the program codes and data. Therefore, the memory 820 may be a storage module inside the processor 810, or an external storage module independent of the processor 810, or may include a storage module inside the processor 810 and an external storage module independent of the processor 810. part.

When the computing device 800 is running, the processor 810 executes the computer-implemented instructions in the memory 820 to perform the operation steps of the above method.

The embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and the program is used to execute one or more of the solutions described in the various embodiments of the present application when executed by a processor.

In the above description, the referenced numbers representing the steps, such as S110, S120, etc., do not necessarily mean that the steps will be executed, and the order of the preceding and following steps can be interchanged or executed simultaneously if allowed.

The words "first, second, third" and other similar terms in the description and claims of this application are only used to distinguish similar objects, and do not represent a specific ordering of objects. Understandably, when permitted The specific order or sequencing may be interchanged such that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein.

The term "comprising" used in the description and claims of the present application should not be interpreted as being limited to what is listed thereafter; it does not exclude other elements or steps. Therefore, it should be interpreted as specifying the presence of said features, integers, steps or components, but not excluding the presence or addition of one or more other features, integers, steps or components and groups thereof. Therefore, the expression "apparatus comprising means A and B" should not be limited to an apparatus consisting of parts A and B only.

Reference in this specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Therefore, the phrase "in an embodiment" appearing in various places in this specification does not necessarily all refer to the same embodiment, but may refer to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more embodiments.

Note that the above are only the embodiments of the present application and the applied technical principles. Those skilled in the art will understand that the present application is not limited to the specific embodiments described herein, and various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present application. Therefore, although the present application has been described in detail through the above embodiments, the present application is not limited to the above embodiments, and may include more other equivalent embodiments without departing from the concept of the present application, all of which belong to protection scope of this application.

Claims

A method for voice quality assessment, characterized in that it comprises:

Get the test voice;

Evaluating the speech quality of the test speech according to semantically relevant information of the test speech;

Outputting the evaluation result of the voice quality.
The method according to claim 1, wherein the assessment results of the voice quality include one or more of the following:

the quality of the test voice;

Factors affecting the quality of the test voice;

Adjust the method of testing voice quality.
The method according to claim 1, wherein the evaluating the voice quality of the test voice according to the semantic related information of the test voice comprises:

Obtaining a first feature vector of the test speech, wherein the first feature vector includes a time-frequency feature vector of the test speech;

Obtaining a second feature vector of the test speech according to the first feature vector of the test speech, wherein the second feature vector is related to the semantics of the test speech;

Evaluating the speech quality of the test speech according to the second feature vector.
The method according to claim 3, wherein the evaluating the voice quality of the test voice according to the second feature vector comprises: evaluating the voice quality of the test voice using a first evaluation indicator,

The first evaluation index includes: the concentration index of the central position of each group of the test voices, wherein the test voices of different groups have different semantics, and the central position of each group of the test voices is the center position of the test voices of the group. The center position of the second eigenvector in a second eigenspace, where the second eigenspace is the space where the second eigenvector is located.
The method according to claim 3, wherein the evaluating the voice quality of the test voice according to the second feature vector comprises: evaluating the voice quality of the test voice using a second evaluation index,

The second evaluation index includes: the degree of dispersion index of the second feature vector of each group of the test speech in the second feature space, wherein, the test speech of different groups has different semantics, and the second feature The space is the space where the second eigenvector is located.
The method according to claim 3, wherein the evaluating the voice quality of the test voice according to the second feature vector comprises: evaluating the voice quality of the test voice using a third evaluation indicator,

The third evaluation index includes: the similarity index of the center position of each group of the test voices and the center position of each group of reference voices corresponding to the semantics; wherein, the test voices of different groups have different semantics, and the test voices of each group have different semantics. The center position of the test voice is the center position of the second feature vector of the group of test voices in the second feature space, the second feature space is the space where the second feature vector is located, and the reference The voices have different semantics, and the center position of each group of reference voices is the center position of the second feature vector of the group of reference voices in the second feature space.
The method according to any one of claims 3 to 6, wherein said obtaining the first feature vector of said test voice comprises:

Obtaining the continuous frames contained in the test speech, wherein adjacent frames contain overlapping information;

A plurality of feature vectors including frequency domain features are obtained according to the consecutive frames, and the plurality of feature vectors constitute the first feature vector.
A method for speech recognition quality prediction, characterized in that it comprises:

Get the test voice;

Predict the speech recognition quality of the speech recognition model for the test speech according to the speech recognition quality function, the speech recognition quality function is used to indicate the relationship between speech recognition quality and speech quality, and the speech quality is according to any one of claims 1-7. one of the method evaluations;

and outputting the prediction result of the speech recognition quality.
The method according to claim 8, wherein the outputting the prediction result of the speech recognition quality comprises one or more of the following:

the quality of the speech recognition of the speech recognition model;

Factors affecting the speech recognition quality of the speech recognition model;

The way to adjust the speech recognition quality of the speech recognition model.
The method according to claim 9, wherein the construction process of the speech recognition quality function comprises:

Obtain multiple sets of degraded reference voices of the reference voice;

Obtain multiple sets of speech recognition results of the degraded reference speech according to the speech recognition model, and use it as a first statistical result;

Respectively using multiple sets of the degraded reference voices as test voices, obtaining multiple sets of voice quality evaluation results of the degraded reference voices according to any one of claims 1-7, to obtain the first evaluation result;

The speech recognition quality function is obtained according to a functional relationship between the first statistical result and the first evaluation result.
A method for improving speech recognition quality, comprising:

Get the test voice;

Obtain the speech quality assessment result of described test speech according to the method described in any one of claim 1-7;

When the voice quality is lower than a preset first baseline, the outputting the voice quality evaluation result is performed.
The method according to claim 11, further comprising:

When the speech quality is greater than or equal to the first baseline, predicting speech recognition quality according to the method of claim 8 or 9;

When the speech recognition quality is lower than a preset second baseline, outputting a prediction result of the speech recognition quality is performed.
A device for voice quality assessment, characterized in that it comprises:

Obtain module, be used for obtaining test voice;

An evaluation module, configured to evaluate the speech quality of the test speech according to the semantic related information of the test speech;

An output module, configured to output the evaluation result of the voice quality.
The device according to claim 13, wherein the evaluation results of the voice quality output by the output module include one or more of the following:

the quality of the test voice;

Factors affecting the quality of the test voice;

Adjust the method of testing voice quality.
The device according to claim 13, wherein the evaluation module is specifically used for:

Obtaining a first feature vector of the test speech, wherein the first feature vector includes a time-frequency feature vector of the test speech;

Obtaining a second feature vector of the test speech according to the first feature vector of the test speech, wherein the second feature vector is related to the semantics of the test speech;

Evaluating the speech quality of the test speech according to the second feature vector.
The device according to claim 15, wherein the evaluation module, when evaluating the speech quality of the test speech according to the second feature vector, comprises: using a first evaluation index to evaluate the speech quality of the test speech ,

The first evaluation index includes: the concentration index of the central position of each group of the test voices, wherein the test voices of different groups have different semantics, and the central position of each group of the test voices is the center position of the test voices of the group. The center position of the second eigenvector in a second eigenspace, where the second eigenspace is the space where the second eigenvector is located.
The device according to claim 15, wherein the evaluation module, when evaluating the speech quality of the test speech according to the second feature vector, comprises: using a second evaluation index to evaluate the speech quality of the test speech ,

The second evaluation index includes: the degree of dispersion index of the second feature vector of each group of the test speech in the second feature space, wherein, the test speech of different groups has different semantics, and the second feature The space is the space where the second eigenvector is located.
The device according to claim 15, wherein the evaluation module, when evaluating the speech quality of the test speech according to the second feature vector, comprises: using a third evaluation index to evaluate the speech quality of the test speech ,

The third evaluation index includes: the similarity index of the center position of each group of the test speech and the center position of each group of reference speech corresponding to the semantics; wherein, the test speeches of different groups have different semantics, and each group of the The center position of the test voice is the center position of the second feature vector of the group of test voices in the second feature space, the second feature space is the space where the second feature vector is located, and the reference The voices have different semantics, and the center position of each group of reference voices is the center position of the second feature vector of the group of reference voices in the second feature space.
The device according to any one of claims 15 to 18, wherein when the evaluation module obtains the first feature vector of the test speech, it includes:

Obtaining the continuous frames contained in the test speech, wherein adjacent frames contain overlapping information;

A plurality of feature vectors including frequency domain features are obtained according to the consecutive frames, and the plurality of feature vectors constitute the first feature vector.
A device for speech recognition quality prediction, characterized in that it comprises:

Obtain module, be used for obtaining test voice;

A prediction module, configured to predict the speech recognition quality of the speech recognition model for the test speech according to a speech recognition quality function, the speech recognition quality function being used to indicate the relationship between the speech recognition quality and the speech quality, and the speech quality according to the right The method evaluation described in any one of requirements 1-7;

An output module, configured to output the prediction result of the speech recognition quality.
The device according to claim 20, wherein the prediction result of the speech recognition quality output by the output module includes one or more of the following:

the quality of the speech recognition of the speech recognition model;

Factors affecting the speech recognition quality of the speech recognition model;

The way to adjust the speech recognition quality of the speech recognition model.
The device according to claim 21, wherein the construction process of the speech recognition quality function comprises:

Obtain multiple sets of degraded reference voices of the reference voice;

Obtain multiple sets of speech recognition results of the degraded reference speech according to the speech recognition model, and use it as a first statistical result;

Respectively using multiple sets of the degraded reference voices as test voices, obtaining multiple sets of voice quality evaluation results of the degraded reference voices according to any one of claims 1-7, to obtain the first evaluation result;

The speech recognition quality function is obtained according to a functional relationship between the first statistical result and the first evaluation result.
A device for improving speech recognition quality, characterized in that it comprises:

Obtain module, be used for obtaining test voice;

Evaluation prediction module, for obtaining the speech quality evaluation result of described test speech according to the method described in any one of claim 1-7;

An output module, configured to output the assessment result of the voice quality when the voice quality is lower than a preset first baseline.
The device according to claim 23, characterized in that:

The evaluation prediction module is also used to predict the speech recognition quality according to any one of claims 8-9 when the speech quality is greater than or equal to the first baseline;

The output module is further configured to output the prediction result of the speech recognition quality when the speech recognition quality is lower than a preset second baseline.
A vehicle, characterized in that it comprises:

A sound collection device, configured to collect the user's voice command;

A pre-processing device for pre-processing the sound of the voice command;

A speech recognition device, used for recognizing the pre-processed sound;

A device as claimed in any one of claims 13 to 24.
A computing device comprising one or more processors and one or more memories storing program instructions which, when executed by the one or more processors, cause the One or more processors implement the method of any one of claims 1-12.
A computer-readable storage medium, on which program instructions are stored, wherein when the program instructions are executed by a computer, the computer implements the method according to any one of claims 1-12.
A computer program product, characterized in that it includes program instructions, and when the program instructions are executed by a computer, the computer implements the method described in any one of claims 1-12.