CN110580917B - Voice data quality detection method, device, server and storage medium - Google Patents

Voice data quality detection method, device, server and storage medium

Info

Publication number
CN110580917B
Authority
CN
China
Prior art keywords
voice
frequency band
voice data
energy value
quality detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910870667.XA
Other languages
Chinese (zh)
Other versions
CN110580917A (en)
Inventor
丰强泽
齐红威
何鸿凌
肖永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datatang Beijing Technology Co ltd
Original Assignee
Datatang Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Datatang Beijing Technology Co ltd
Priority to CN201910870667.XA
Publication of CN110580917A
Application granted
Publication of CN110580917B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 - Speech or voice analysis techniques for measuring the quality of voice signals

Abstract

The invention provides a voice data quality detection method, device, server and storage medium. Voice data whose quality is to be detected is divided into at least one voice frame by taking a frame as a unit; for each voice frame, the spectral energy value in each of at least one preset frequency band is calculated; the spectral energy values of the at least one voice frame in a frequency band are used to calculate a target spectral energy value of the voice data in that frequency band; and the target spectral energy values of the voice data in the respective frequency bands are analyzed to obtain a quality detection result of the voice data. By calculating and analyzing the target spectral energy values of the voice data in different frequency bands, the technical scheme provided by the invention realizes the detection of voice data quality.

Description

Voice data quality detection method, device, server and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technology, and more particularly, to a method, an apparatus, a server, and a storage medium for detecting voice data quality.
Background
With the rapid development of speech recognition technology, the requirements on its accuracy keep increasing. Speech recognition depends on large amounts of speech data, and the higher the quality of that data, the higher the accuracy that can be achieved. However, because recording devices and recording environments vary widely, the quality of the voice data produced also varies greatly. A method, apparatus, server and storage medium for detecting the quality of voice data are therefore urgently needed.
Disclosure of Invention
In view of the above, the present invention provides a method, an apparatus, a server and a storage medium for detecting quality of voice data.
In order to achieve the above object, the following solutions are proposed:
the first aspect of the invention discloses a voice data quality detection method, which comprises the following steps:
dividing voice data by taking a frame as a unit to obtain at least one voice frame;
calculating the spectral energy value of each frequency band of the voice frame in at least one preset frequency band;
calculating a target spectral energy value of the voice data in the frequency band by using the spectral energy value of each voice frame in the at least one voice frame in the frequency band;
and analyzing the target spectrum energy value of the voice data in each frequency band to obtain a quality detection result of the voice data.
Optionally, the analyzing the target spectral energy value of the voice data in each frequency band to obtain a quality detection result of the voice data includes:
inputting the target frequency spectrum energy value of the voice data in each frequency band into a preset voice quality detection model to obtain a quality detection result of the voice data;
the voice quality detection model is obtained by training a voice quality detection model to be trained, with the training target that the prediction result which the model to be trained produces from the target spectral energy value of a voice data sample in each of the at least one frequency band approaches the calibration result of that voice data sample, the calibration result representing the voice quality category to which the voice data sample belongs.
Optionally, the analyzing the target spectral energy value of the voice data in each frequency band to obtain a quality detection result of the voice data includes:
selecting a target condition which is satisfied by the target spectrum energy value of the voice data in each frequency band from at least one preset condition;
and taking the preset voice quality category matched with the target condition as a quality detection result of the voice data.
Optionally, the segmenting the voice data by using a frame as a unit to obtain at least one voice frame includes:
taking a preset number of sampling points as a frame;
and segmenting the voice data according to the acquisition sequence of the sampling points in the voice data to obtain at least one voice frame.
Optionally, the calculating a spectral energy value of each frequency band of the speech frame in at least one preset frequency band includes:
determining a target frequency value of each frequency band in at least one preset frequency band;
and calculating the spectral energy value of the voice frame in the frequency band by using the target frequency value of the frequency band and the sampling value of each sampling point in the voice frame.
Optionally, the calculating a target spectral energy value of the speech data in the frequency band by using the spectral energy value of each of the at least one speech frame in the frequency band includes:
acquiring the spectral energy value of each voice frame in the at least one voice frame in the frequency band;
and determining the average value of the acquired spectrum energy values as a target spectrum energy value of the frequency band.
Optionally, a speech quality detection model training process is further included, and the process includes:
obtaining a voice data sample belonging to each of at least one voice quality category;
dividing the voice data sample by taking a frame as a unit to obtain at least one voice frame;
calculating the spectral energy value of each frequency band of the voice frame in the at least one frequency band;
calculating a target spectral energy value of the voice data sample in the frequency band by using the spectral energy value of each voice frame in the voice data sample in the frequency band;
and training the voice quality detection model to be trained to obtain the voice quality detection model by taking the prediction result of the voice data sample approaching the voice quality class to which the voice data sample belongs as a training target according to the target frequency spectrum energy value of the voice data sample in each frequency band.
The second aspect of the present invention discloses a voice data quality detection apparatus, comprising:
the first segmentation unit is used for segmenting the voice data by taking a frame as a unit to obtain at least one voice frame;
the first calculating unit is used for calculating the spectral energy value of each frequency band of the voice frame in at least one preset frequency band;
a second calculating unit, configured to calculate a target spectral energy value of the frequency band of the voice data by using a spectral energy value of each of the at least one voice frame in the frequency band;
and the quality detection result determining unit is used for analyzing the target frequency spectrum energy value of the voice data in each frequency band to obtain the quality detection result of the voice data.
A third aspect of the present invention discloses a server, comprising: at least one memory and at least one processor; the memory stores a program, and the processor calls the program stored in the memory, wherein the program is used for realizing the voice data quality detection method disclosed in any one of the first aspect of the invention.
A fourth aspect of the present invention discloses a storage medium, in which computer-executable instructions are stored, the computer-executable instructions being configured to execute the voice data quality detection method disclosed in any one of the first aspects of the present invention.
The invention provides a voice data quality detection method, device, server and storage medium. Voice data is divided into at least one voice frame by taking a frame as a unit; the spectral energy value of each voice frame in each of at least one preset frequency band is calculated; the spectral energy values of the at least one voice frame in a frequency band are used to calculate the target spectral energy value of the voice data in that frequency band; and the target spectral energy values of the voice data in the respective frequency bands are analyzed to obtain the quality detection result of the voice data. By calculating and analyzing the target spectral energy values of the voice data in different frequency bands, the technical scheme provided by the invention realizes the detection of voice data quality.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the following drawings show only embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a voice data quality detection method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the phenomenon of high-frequency energy loss in voice data according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the phenomenon of dc offset in voice data according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of a method for calculating the spectral energy value of a speech frame in each of at least one preset frequency band according to an embodiment of the present invention;
fig. 5 is a schematic flowchart of calculating a target spectral energy value of voice data in a frequency band according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart illustrating a method for generating a speech data quality detection model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a voice data quality detection apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As can be seen from the above background art, in the process of generating voice data, due to the diversity of voice devices and recording environments, some unqualified voice data may be generated. It is therefore necessary to detect the quality of the speech data.
Through research, the inventor found that the speech speed value and/or the volume value of the speaker can be obtained for each speech segment in the voice data and analyzed to determine the audio quality of the voice data. However, this approach can only detect quality problems that human hearing can clearly distinguish, such as overly long silence segments, speech that is too fast or too slow, or volume that is too high or too low. It cannot detect quality problems that human hearing cannot clearly distinguish, for example the loss of high-frequency energy in the voice data caused by the noise reduction of the recording device, the dc offset of the voice data caused by current interference from the recording hardware on the motherboard or from other electronic components, and abnormal continuous noise.
Accordingly, through further research the inventor of the present application provides the voice data quality detection method shown in fig. 1. Based on this method, not only the quality problems in voice data that human hearing can clearly distinguish, but also the quality problems that human hearing cannot clearly distinguish, can be detected.
Referring to fig. 1, an embodiment of the present invention provides a flowchart of a voice data quality detection method, where the voice data quality detection method includes the following steps:
s101: and dividing the voice data by taking the frame as a unit to obtain at least one voice frame.
In the specific process of executing step S101, a preset number of sampling points are used as a frame, and the voice data is segmented according to the collection sequence of the sampling points in the voice data to obtain at least one voice frame.
In the embodiment of the present application, the sampling points in a piece of voice data may be determined at preset time intervals starting from the starting point of the voice data. As a preferred implementation manner of the embodiment of the present application, information such as the volume value of the voice data at each sampling point may be collected. The above is only a preferred choice of the information collected at the sampling points; the inventor can set the specific information collected at the sampling points according to his own needs, which is not limited herein.
It should be noted that the number of sampling points in one frame is preset. For example, 256 samples may be taken as one frame, and when the number of samples in the speech data is 25600, the speech data is divided according to the collection order of the samples in the speech data to obtain 100 frames. The number of sampling points included in a frame may be set according to practical applications, and the embodiment of the present invention is not limited thereto.
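As an illustration of the framing in step S101, the sketch below splits a sample sequence into fixed-size frames in collection order. The function name, the use of NumPy, and the choice to drop a trailing partial frame are assumptions of this example, not requirements of the embodiment.

```python
import numpy as np

def split_into_frames(samples, frame_size=256):
    """Split the sampling values of one piece of voice data into frames,
    following the order in which the sampling points were collected."""
    n_frames = len(samples) // frame_size      # e.g. 25600 samples -> 100 frames
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]

# Illustrative use: 25600 sampling values give 100 frames of 256 samples each.
voice_data = np.random.randn(25600)
frames = split_into_frames(voice_data)
assert len(frames) == 100
```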
S102: and calculating the spectral energy value of each frequency band of the voice frame in at least one preset frequency band.
It should be noted that at least one frequency band may be preset according to frequency characteristics. For example, the spectrum of the voice data may be divided into three preset frequency bands: a low frequency band, a medium-high frequency band and a high frequency band, where the low frequency band covers frequencies below 10 Hz, the medium-high frequency band covers frequencies above 4000 Hz, and the high frequency band covers frequencies above 6000 Hz. In this embodiment of the present application, the specific content of the at least one preset frequency band may be set according to practical applications, and the embodiment of the present invention is not limited thereto.
In the specific process of step S102, the spectral energy value of the speech frame in each of the at least one preset frequency band is calculated by a window-function filtering method. For example, when the at least one preset frequency band consists of a low frequency band, a medium-high frequency band and a high frequency band, window-function filtering is used to calculate the spectral energy value of the speech frame in the low frequency band, in the medium-high frequency band and in the high frequency band respectively.
It should be noted that the window function may be a Kaiser window, a Hanning window, a Hamming window or the like, and may be set according to practical applications; the embodiment of the present invention is not limited thereto.
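The embodiment does not fix a single computation for the per-band spectral energy beyond window-function filtering; one plausible realisation is to window the frame, take an FFT, and sum the squared magnitudes of the bins falling inside each preset band. In the sketch below, the 16 kHz sampling rate, the Hamming window and the exact band limits are assumptions for illustration only.

```python
import numpy as np

# Preset bands from the example above: low (<= 10 Hz), medium-high (> 4000 Hz),
# high (> 6000 Hz); None means "up to the Nyquist frequency".
PRESET_BANDS = [(0.0, 10.0), (4000.0, None), (6000.0, None)]

def band_energies(frame, sample_rate=16000, bands=PRESET_BANDS):
    """Spectral energy value of one speech frame in each preset frequency band."""
    windowed = frame * np.hamming(len(frame))          # window-function filtering
    spectrum = np.fft.rfft(windowed)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    energies = []
    for lo, hi in bands:
        hi = sample_rate / 2 if hi is None else hi
        mask = (freqs >= lo) & (freqs <= hi)
        energies.append(float(np.sum(np.abs(spectrum[mask]) ** 2)))
    return energies        # [low band, medium-high band, high band]
```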
S103: and calculating a target spectral energy value of the voice data in the frequency band by using the spectral energy value of each voice frame in the at least one voice frame in the frequency band.
In the specific process of executing step S103, the spectral energy value of each speech frame in the at least one speech frame in the frequency band is used to calculate the average value of the spectral energy values of the speech frames in the at least one speech frame in the frequency band, and the calculated average value of the spectral energy values is determined as the target spectral energy value of the speech data in the frequency band. Wherein calculating the average value of the spectral energy of the speech frame in the target frequency band in at least one speech frame of the speech data comprises: calculating the sum of the spectral energy values of each speech frame in the at least one speech frame in the target frequency band to obtain a first result; calculating the number of the voice frames included in at least one voice frame to obtain a second result; the first result is divided by the second result to obtain a third result, and the third result can be regarded as an average value of the spectral energy of the speech frame in the target frequency band in at least one speech frame of the speech data. The target frequency band is a low frequency band, or the target frequency band is a medium-high frequency band, or the target frequency band is a high frequency band.
For example, when the voice data is segmented by taking a frame as a unit to obtain at least one voice frame, and the preset frequency bands are a low frequency band, a medium-high frequency band and a high frequency band, the spectral energy value of each voice frame in the low frequency band, in the medium-high frequency band and in the high frequency band is calculated. The average of the spectral energy values of the at least one voice frame in the low frequency band is then taken as the target spectral energy value of the voice data in the low frequency band; the average in the medium-high frequency band is taken as the target spectral energy value of the voice data in the medium-high frequency band; and the average in the high frequency band is taken as the target spectral energy value of the voice data in the high frequency band.
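Continuing the sketch, and reusing the band_energies helper assumed above, the target spectral energy value of the voice data in each band can be obtained by averaging the per-frame values over all frames, which is what step S103 describes.

```python
import numpy as np

def target_band_energies(frames, sample_rate=16000):
    """Target spectral energy value of the voice data in each preset band:
    the mean of the per-frame band energies over all speech frames.
    band_energies is the per-frame sketch shown earlier in this description."""
    per_frame = np.array([band_energies(f, sample_rate) for f in frames])
    return per_frame.mean(axis=0)   # one target value per preset frequency band
```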
S104: and analyzing the target frequency spectrum energy value of the voice data in each frequency band to obtain a quality detection result of the voice data.
As a preferred implementation manner of the embodiment of the present application, analyzing target spectral energy values of voice data in each frequency band to obtain a quality detection result of the voice data includes: and inputting the target frequency spectrum energy values of the voice data in each frequency band into a preset voice quality detection model to obtain a quality detection result of the voice data.
The voice quality detection model is obtained by training a voice quality detection model to be trained, with the training target that the prediction result which the model to be trained produces from the target spectral energy value of a voice data sample in each of the at least one frequency band approaches the calibration result of that voice data sample, where the calibration result represents the voice quality category to which the voice data sample belongs.
It should be noted that the voice quality category may be a high-frequency energy loss category, a dc offset category, an abnormal continuous noise category, a qualified category, and the like, and may be set according to practical applications, which is not limited in the embodiments of the present invention.
It should be noted that high-frequency energy loss refers to the phenomenon that the voice data loses energy at frequencies of 4 kHz and above, as shown in fig. 2, and dc offset refers to the phenomenon that the voice data continuously accumulates energy at frequencies of 1 kHz and below, as shown in fig. 3.
In the embodiment of the present invention, the quality detection result of the voice data is the voice quality category to which the voice data belongs. For example, the quality detection result may be that the voice data belongs to the high-frequency energy loss category. The result can be set according to practical applications, and the embodiment of the present invention is not limited thereto.
When the voice quality category to which the detected voice data belongs is the qualified category, the voice data is of acceptable quality; when it belongs to another category, such as the high-frequency energy loss category, the dc offset category or the abnormal continuous noise category, the voice data can be considered unqualified.
In an embodiment of the present application, a plurality of speech data samples may be obtained to train a speech quality detection model, where the plurality of speech data samples may be at least one speech data sample respectively belonging to each speech quality class.
As another preferred implementation manner of the embodiment of the present application, analyzing target spectral energy values of voice data in each frequency band to obtain a quality detection result of the voice data includes: selecting target conditions which are met by target spectrum energy values of voice data in each frequency band from at least one preset condition; and taking the preset voice quality category matched with the target condition as a quality detection result of the voice data.
In the embodiment of the present application, voice data samples respectively belonging to each voice quality category may be obtained, and a target spectrum energy value of each voice data sample in each frequency band in at least one frequency band is fitted to obtain a condition respectively matched with each voice quality category, where each obtained condition may be regarded as at least one preset condition.
It should be noted that the method for fitting the target spectral energy values of each voice data sample in each of the at least one frequency band may be a classification algorithm such as Bayesian learning, regression learning, a neural network or an SVM, and may be set according to practical applications; the embodiment of the present invention is not limited thereto.
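As a hedged sketch of how such conditions could be fitted, the example below trains a small decision tree (one of the possible classification algorithms; Bayesian learning, regression, a neural network or an SVM would serve equally) on labelled target spectral energy values and prints the learned thresholds. All numeric values and class names are illustrative placeholders, not data from this application.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Rows: [low band, medium-high band, high band] target spectral energy values.
X = np.array([
    [6.0e5, 6.0e4, 2.0e4],   # qualified
    [5.0e5, 8.0e2, 3.0e2],   # high-frequency energy loss
    [2.0e6, 5.0e4, 1.0e4],   # dc offset
    [8.0e5, 9.0e4, 3.0e7],   # abnormal continuous noise
])
y = ["qualified", "hf_loss", "dc_offset", "abnormal_noise"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# The split thresholds learned by the tree play the role of the preset conditions.
print(export_text(tree, feature_names=["low", "mid_high", "high"]))
```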
For example, the at least one condition includes 3 conditions, which are condition 1, condition 2, and condition 3, where condition 1 may be that the target spectral energy value in the medium-high frequency band is less than 1000, and the speech quality category matched with this condition 1 may be a high-frequency energy missing category; condition 2 may be that the target spectral energy value in the low frequency band is >1000000, and the voice quality class matching the condition 2 may be a dc offset class; the condition 3 may be that the target spectral energy value at the high frequency band is >20000000, and the speech quality class matching the condition 3 may be an abnormal continuous noise class.
Based on this, when the quality of voice data is detected, the target spectral energy values of the voice data in the low frequency band, the medium-high frequency band and the high frequency band are first obtained; the condition satisfied by these target spectral energy values is then selected from the at least one preset condition as the target condition, and the preset voice quality category matched with the target condition is taken as the quality detection result of the voice data.
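A minimal sketch of applying such conditions is shown below; it hard-codes the three example conditions above and, as an assumption of this sketch only, treats voice data matching none of them as qualified.

```python
def classify_by_conditions(low_energy, mid_high_energy, high_energy):
    """Map the target spectral energy values of one piece of voice data
    onto a preset voice quality category using example conditions 1-3."""
    if mid_high_energy < 1_000:          # condition 1
        return "high-frequency energy loss"
    if low_energy > 1_000_000:           # condition 2
        return "dc offset"
    if high_energy > 20_000_000:         # condition 3
        return "abnormal continuous noise"
    return "qualified"                   # assumption: no condition matched
```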
In the above embodiment of the present invention, the process of step S102 disclosed in fig. 1, namely calculating the spectral energy value of the speech frame in each of the at least one preset frequency band, is shown in fig. 4 and includes the following steps:
S401: And determining a target frequency value of each frequency band in the at least one preset frequency band.
In the process of specifically executing step S401, if the frequency band is the low frequency band (the low frequency band covers frequencies less than or equal to 10 Hz), 10 Hz may be used as the target frequency value of the low frequency band; if the frequency band is the medium-high frequency band (the medium-high frequency band covers frequencies greater than 4000 Hz), 4000 Hz may be used as the target frequency value of the medium-high frequency band; if the frequency band is the high frequency band (the high frequency band covers frequencies greater than 6000 Hz), 6000 Hz may be used as the target frequency value of the high frequency band.
The above is only a preferred way for determining the target frequency value of the frequency band provided in the embodiment of the present application, and specifically, the inventor can select any frequency value in the frequency band as the target frequency value of the frequency band according to his own requirement, which is not limited herein.
S402: and calculating the spectral energy value of the voice frame in the frequency band by using the target frequency value of the frequency band and the sampling value of each sampling point in the voice frame.
In the process of specifically executing step S402, the spectral energy value of the speech frame in the frequency band is calculated by using the target frequency value of the determined frequency band and the sampling value of each sampling point in the speech frame as the input information of the window function.
For example, when a speech frame includes 256 sampling points, the window function is a Kaiser window, and one of the at least one preset frequency band is the low frequency band, the target frequency value of the low frequency band may be determined to be 10 Hz, and the determined 10 Hz target frequency value of the low frequency band together with the sampling values of the 256 sampling points in the speech frame are used as the input information of the Kaiser window to calculate the spectral energy value of the speech frame in the low frequency band.
As a preferred embodiment of the present application, the sampling value of the sampling point may be a volume value, which is only the preferred content of the sampling value of the sampling point provided in the embodiment of the present application, and the inventor may set the sampling value according to his own requirement, which is not limited herein.
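Step S402 is described only as feeding the band's target frequency value and the frame's sampling values to the window function; one plausible reading, sketched below, is to apply a Kaiser window to the sampling values and evaluate the discrete Fourier component at the target frequency, taking its squared magnitude as the spectral energy value. The sampling rate, the Kaiser shape parameter beta and this interpretation itself are assumptions of the example.

```python
import numpy as np

def band_energy_at_target(frame, target_freq_hz, sample_rate=16000, beta=8.6):
    """Spectral energy of one speech frame evaluated at the band's target
    frequency value, using a Kaiser window on the sampling values."""
    n = len(frame)
    windowed = frame * np.kaiser(n, beta)
    t = np.arange(n) / sample_rate
    component = np.sum(windowed * np.exp(-2j * np.pi * target_freq_hz * t))
    return float(np.abs(component) ** 2)

# Illustrative use: low-band energy of a 256-sample frame with the 10 Hz target.
frame = np.random.randn(256)
low_band_energy = band_energy_at_target(frame, target_freq_hz=10.0)
```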
In the implementation of the invention, firstly, a target frequency value of each frequency band in at least one preset frequency band is determined, and secondly, the frequency spectrum energy value of the voice frame in the frequency band is calculated by using the target frequency value of the frequency band and the sampling value of each sampling point in the voice frame. The frequency spectrum energy value of the voice frame in the frequency band can be calculated more accurately by using the target frequency value of the frequency band and the sampling value of each sampling point in the voice frame.
In the above embodiment of the present invention, the process of step S103 disclosed in fig. 1, namely calculating the target spectral energy value of the voice data in a frequency band by using the spectral energy value of each of the at least one speech frame in that frequency band, is shown in fig. 5 and includes the following steps:
S501: And acquiring the spectral energy value of each speech frame in the at least one speech frame in the frequency band.
In the process of specifically executing step S501, a spectral energy value of each speech frame in at least one speech frame of the speech data in the frequency band is obtained. For example, when the speech data is segmented by taking a frame as a unit to obtain 100 speech frames, and one of the preset at least one frequency band is a low frequency band, the spectral energy value of each speech frame in the 100 speech frames in the low frequency band is obtained.
S502: and determining the average value of the acquired spectral energy values as a target spectral energy value of the frequency band.
In the process of specifically executing step S502, an average value of the spectral energy values of the obtained speech frames in the frequency band is calculated, and the calculated average value is determined as the target spectral energy value of the frequency band. For example, after the spectral energy value of each of the 100 speech frames in the low frequency band is obtained, the sum of the spectral energy values of the 100 speech frames in the low frequency band is calculated to obtain a first result, the number 100 of the speech frames in the speech data is used as a second result, the first result is divided by the second result to obtain a third result, and the third result is used as an average value of the spectral energy values of the speech frames in the 100 speech frames in the low frequency band, that is, the third result can be regarded as a target spectral energy value of the speech data in the low frequency band.
Based on the embodiment of the present invention, a calculation formula of an average value of spectral energy values of a speech frame in at least one speech frame of speech data in a frequency band is as follows:
$$\mathrm{meanEnergy} = \frac{b_k}{n}$$
wherein meanEnergy is the average value of the spectral energy values in the frequency band, n is the total number of speech frames obtained by dividing the voice data in units of frames, and $b_k$ is the sum of the spectral energy values of the n speech frames in the frequency band.
In the embodiment of the invention, the spectral energy value of each speech frame in at least one speech frame of the speech data in the frequency band is obtained, the average value of the spectral energy values of the speech frames in the frequency band in the at least one speech frame is calculated, the calculated average value is determined as the target spectral energy value of the speech data in the frequency band, and then the quality detection result of the speech data can be obtained by analyzing the target spectral energy values of the speech data in each frequency band.
In the foregoing embodiment of the present invention, the process of training the speech quality detection model based on speech data samples, which is involved in step S104 disclosed in fig. 1, is shown in fig. 6 and includes the following steps:
S601: A speech data sample belonging to each of at least one speech quality class is obtained.
In the embodiment of the present application, at least one voice quality category may be preset, and the at least one voice quality category may include a qualified category (which indicates that no quality problem occurs in voice data), a high-frequency energy loss category, a direct current offset category, and/or an abnormal continuous noise category. The above is only the preferred content of the at least one voice quality category provided in the embodiment of the present application, and the inventor may set the content of the at least one voice quality category according to his own needs, which is not limited herein.
It should be noted that the voice data sample is obtained in advance, and the voice data sample carries the voice quality class to which it belongs. For example, if the voice data sample does not have the voice quality problem, the voice quality category carried by the voice data sample is a qualified category; if the voice data sample has a voice quality problem and the voice quality problem is a direct current offset, the voice quality category carried by the voice data sample is a direct current offset category.
S602: And segmenting the voice data sample by taking a frame as a unit to obtain at least one voice frame.
In the specific process of executing step S602, a preset number of sampling points are used as a frame, and the voice data sample is segmented according to the collection sequence of the sampling points in the voice data sample to obtain at least one voice frame.
The specific implementation principle and the execution process of step S602 in fig. 6 disclosed in the embodiment of the present invention are the same as the specific implementation principle and the execution process of step S101 disclosed in fig. 1 in the embodiment of the present invention, and reference may be made to the corresponding parts in fig. 1 disclosed in the embodiment of the present invention, which are not described again here.
S603: and calculating the spectral energy value of each frequency band of the at least one frequency band of the voice frame.
In the process of specifically executing step S603, a preset target frequency value of each frequency band in at least one frequency band is first determined, and then the spectral energy value of the speech frame in the frequency band is calculated by using the target frequency value of the frequency band and the sampling value of each sampling point in the speech frame.
The specific implementation principle and the execution process of step S603 in fig. 6 disclosed in the embodiment of the present invention are the same as those of the embodiment disclosed in fig. 4 in the embodiment of the present invention, and reference may be made to the corresponding parts in fig. 4 disclosed in the embodiment of the present invention, which are not described herein again.
S604: and calculating the target spectral energy value of the voice data sample in the frequency band by using the spectral energy value of each voice frame in the voice data sample in the frequency band.
In the specific process of executing step S604, the spectral energy values of each voice frame in the voice data sample in the frequency band are obtained, and the average value of the obtained spectral energy values is determined as the target spectral energy value of the frequency band.
The specific implementation principle and the execution process of step S604 in fig. 6 disclosed in the embodiment of the present invention are the same as those of the embodiment disclosed in fig. 5 in the embodiment of the present invention, and reference may be made to corresponding parts in fig. 5 disclosed in the embodiment of the present invention, which are not described herein again.
S605: And training the voice quality detection model to be trained, with the training target that the prediction result which the model produces for the voice data sample from the target spectral energy value of the voice data sample in each frequency band approaches the voice quality category to which the voice data sample belongs, so as to obtain the voice quality detection model.
It should be noted that the speech quality detection model to be trained may be a bayesian model, a regression model, or the like, and may be set according to practical applications, which is not limited in the embodiments of the present invention.
In the specific process of executing step S605, the target spectral energy values of the voice data samples in each frequency band are input into the voice quality detection model to be trained to obtain the model's prediction results for the voice data samples. Taking the prediction result of each voice data sample approaching the voice quality category to which that sample belongs as the training target, the parameters of the voice quality detection model to be trained are updated and the model is trained until it converges, so as to obtain the voice quality detection model.
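As a hedged sketch of this training step, the example below uses a Gaussian naive Bayes classifier from scikit-learn as one concrete instance of the Bayesian model mentioned below; it is trained on target spectral energy values labelled with their calibrated voice quality categories. All numeric values, labels and the choice of library are assumptions of the sketch, and the convergence-based iterative training described above is abstracted into the single fit call here.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Training features: [low band, medium-high band, high band] target spectral
# energy values per voice data sample; labels: calibrated quality categories.
# The numbers are illustrative placeholders only.
X_train = np.array([
    [6.0e5, 6.0e4, 2.0e4],   # qualified
    [5.0e5, 8.0e2, 3.0e2],   # high-frequency energy loss
    [2.0e6, 5.0e4, 1.0e4],   # dc offset
    [8.0e5, 9.0e4, 3.0e7],   # abnormal continuous noise
])
y_train = ["qualified", "hf_loss", "dc_offset", "abnormal_noise"]

model = GaussianNB().fit(X_train, y_train)           # trained detection model

# Detection (step S104): feed the target spectral energy values of new data.
result = model.predict([[5.5e5, 9.0e2, 4.0e2]])      # e.g. array(['hf_loss'])
```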
For a better understanding of the above, the following examples are given.
For example, suppose the at least one preset voice quality category includes a qualified category, a high-frequency energy loss category, a dc offset category and an abnormal continuous noise category. At least one voice data sample belonging to each of these voice quality categories is obtained.
For example, a voice data sample 1, a voice data sample 2, a voice data sample 3, and a voice data sample 4 are obtained, where the voice quality class to which the voice data sample 1 belongs is a high-frequency energy missing class, the voice quality class to which the voice data sample 2 belongs is a dc offset class, the voice quality class to which the voice data sample 3 belongs is an abnormal continuous noise class, and the voice quality class to which the voice data sample 4 belongs is a qualified class. In the above, for convenience of understanding, one voice data sample is obtained for each voice quality class for example, but in the present application, when the voice quality detection model is trained, a plurality of voice data samples may be obtained for each voice quality class for training the voice quality detection model, so as to improve the accuracy of the training result of the voice quality detection model. The inventor can set the number of the voice data samples belonging to each voice quality category according to his own needs, which is not limited herein.
Firstly, each voice data sample is divided to obtain at least one voice frame of a voice data sample 1, at least one voice frame of a voice data sample 2, at least one voice frame of a voice data sample 3 and at least one voice frame of a voice data sample 4.
Secondly, for each voice frame of voice data sample 1, the spectral energy value in the low frequency band, in the medium-high frequency band and in the high frequency band is calculated; the spectral energy values of the voice frames of voice data sample 1 in each of these bands are then used to calculate the target spectral energy value of voice data sample 1 in the low frequency band, in the medium-high frequency band and in the high frequency band. Voice data sample 2, voice data sample 3 and voice data sample 4 are processed in the same way, so that each voice data sample obtains a target spectral energy value in the low frequency band, the medium-high frequency band and the high frequency band.
Finally, voice data sample 1, voice data sample 2, voice data sample 3 and voice data sample 4 are input as training samples into the voice quality detection model to be trained to obtain the model's prediction result for each sample. For each voice data sample, with the prediction result produced from the target spectral energy value of the sample in each frequency band approaching the voice quality category to which the sample belongs as the training target, the parameters of the model to be trained are updated. The model is trained in this way over the plurality of voice data samples until it converges, yielding the voice quality detection model.
In the embodiment of the invention, voice data samples belonging to each of the at least one voice quality category are obtained, the target spectral energy value of each sample in each frequency band is calculated, and the voice quality detection model to be trained is trained with the prediction result of each sample approaching the voice quality category to which it belongs as the training target, so as to obtain the voice quality detection model. The voice quality category to which voice data awaiting quality detection belongs can then be analyzed based on this model.
Based on the voice data quality detection method disclosed in the embodiment of the present invention, the embodiment of the present invention also correspondingly discloses a voice data quality detection apparatus, as shown in fig. 7, the voice data quality detection apparatus 700 includes:
a first dividing unit 701, configured to divide voice data by taking a frame as a unit to obtain at least one voice frame.
A first calculating unit 702, configured to calculate a spectral energy value of each frequency band of the speech frame in at least one preset frequency band.
A second calculating unit 703, configured to calculate a target spectral energy value of the voice data in the band by using the spectral energy value of each voice frame in the at least one voice frame in the band.
The quality detection result determining unit 704 is configured to analyze the target spectral energy values of the voice data in each frequency band to obtain a quality detection result of the voice data.
The specific principle and the implementation process of each unit in the voice data quality detection apparatus disclosed in the above embodiment of the present invention are the same as those of the voice data quality detection method disclosed in the above embodiment of the present invention, and reference may be made to corresponding parts in the voice data quality detection method disclosed in the above embodiment of the present invention, which are not described herein again.
The invention provides a voice data quality detection device, which is characterized in that voice data is segmented by taking a frame as a unit to obtain at least one voice frame, the spectral energy value of each voice frame in at least one preset frequency band is calculated, the spectral energy value of each voice frame in at least one voice frame in the frequency band is utilized to calculate the target spectral energy value of the voice data in the frequency band, and the target spectral energy value of the voice data in each frequency band is analyzed to obtain the quality detection result of the voice data. According to the technical scheme provided by the invention, the voice quality detection result of the voice data can be analyzed by calculating the target spectrum energy values of the voice data in different frequency bands, so that the detection of the voice data quality is realized.
Preferably, the quality detection result determining unit 704 includes: the first quality detection result determining subunit.
And the first quality detection result determining subunit is used for inputting the target frequency spectrum energy values of the voice data in each frequency band into a preset voice quality detection model to obtain a quality detection result of the voice data.
The voice quality detection model is obtained by training a voice quality detection model to be trained, with the training target that the prediction result which the model to be trained produces from the target spectral energy value of a voice data sample in each of the at least one frequency band approaches the calibration result of that voice data sample, where the calibration result represents the voice quality category to which the voice data sample belongs.
Preferably, the quality detection result determining unit 704 includes: the selecting unit and the second quality detection result determining subunit.
And the selecting unit is used for selecting target conditions which are met by the target spectrum energy values of the voice data in each frequency band from at least one preset condition.
And the second quality detection result determining subunit is used for taking the preset voice quality category matched with the target condition as the quality detection result of the voice data.
Preferably, the first dividing unit 701 includes: a frame determination unit and a first segmentation subunit.
And the frame determining unit is used for taking a preset number of sampling points as one frame.
And the first segmentation subunit is used for segmenting the voice data according to the acquisition sequence of the sampling points in the voice data to obtain at least one voice frame.
Preferably, the first calculating unit 702 includes: a target frequency value determining unit and a third calculating unit.
The target frequency value determining unit is used for determining a target frequency value of each frequency band in at least one preset frequency band;
and the third calculating unit is used for calculating the spectral energy value of the voice frame in the frequency band by using the target frequency value of the frequency band and the sampling value of each sampling point in the voice frame.
In the implementation of the invention, firstly, a target frequency value of each frequency band in at least one preset frequency band is determined, and secondly, the frequency spectrum energy value of the voice frame in the frequency band is calculated by using the target frequency value of the frequency band and the sampling value of each sampling point in the voice frame. The frequency spectrum energy value of the voice frame in the frequency band can be calculated more accurately by using the target frequency value of the frequency band and the sampling value of each sampling point in the voice frame.
Preferably, the second calculation unit 703 includes: a first obtaining unit and a target spectrum energy value determining unit.
The first obtaining unit is used for obtaining the spectral energy value of each speech frame in at least one speech frame in the frequency band.
And the target spectrum energy value determining unit is used for determining the average value of the acquired spectrum energy values as the target spectrum energy value of the frequency band.
In the embodiment of the invention, the spectral energy value of each voice frame in at least one voice frame in the frequency band is obtained, the average value of the spectral energy values of each voice frame in the obtained at least one voice frame in the frequency band is calculated, and the calculated average value is determined as the target spectral energy value of the frequency band. And further, the quality detection result of the voice data can be obtained by analyzing the target spectrum energy value of the voice data in each frequency band.
Preferably, the quality detection result determining unit 704 includes: a second obtaining unit, a second segmentation unit, a fourth calculating unit, a fifth calculating unit and a training unit.
A second obtaining unit, configured to obtain a voice data sample that belongs to each of the at least one voice quality class respectively.
And the second segmentation unit is used for segmenting the voice sample by taking the frame as a unit to obtain at least one voice frame.
And the fourth calculating unit is used for calculating the spectral energy value of each frequency band of the at least one frequency band of the voice frame.
And the fifth calculating unit is used for calculating the target spectral energy value of the voice data sample in the frequency band by using the spectral energy value of each voice frame in the voice data sample in the frequency band.
And the training unit is used for training the voice quality detection model to be trained to obtain the voice quality detection model by taking the prediction result of the voice data sample approaching to the voice quality category to which the voice data sample belongs according to the target spectral energy value of the voice data sample in each frequency band as a training target.
In the embodiment of the invention, a voice data sample belonging to each of the at least one voice quality category is obtained, the voice data sample is segmented by taking a frame as a unit to obtain at least one voice frame, and the spectral energy value of each voice frame in each of the at least one frequency band is calculated. The voice quality detection model to be trained is then trained, with the prediction result produced from the target spectral energy value of the voice data sample in each frequency band approaching the voice quality category to which the sample belongs as the training target, so as to obtain the voice quality detection model. The target spectral energy values of voice data in each frequency band are input into this preset voice quality detection model to obtain the quality detection result of the voice data.
An embodiment of the present invention provides a server, referring to fig. 8, including a memory 801 and a processor 802, where:
the memory 801 stores programs; the processor 802 is configured to execute the program stored in the memory, and in particular, to perform the voice data quality detection method according to any embodiment of the present invention.
Embodiments of the present invention provide a storage medium, where computer-executable instructions are stored to implement a voice data quality detection method according to any embodiment of the present invention.
In this document, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (9)

1. A voice data quality detection method is characterized by comprising the following steps:
dividing voice data by taking a frame as a unit to obtain at least one voice frame;
calculating the spectral energy value of each frequency band of the voice frame in at least one preset frequency band;
calculating a target spectral energy value of the voice data in the frequency band by using the spectral energy value of each voice frame in the at least one voice frame in the frequency band;
analyzing target spectrum energy values of the voice data in each frequency band to obtain a quality detection result of the voice data;
wherein the analyzing the target spectrum energy values of the voice data in each frequency band to obtain the quality detection result of the voice data comprises:
inputting the target frequency spectrum energy value of the voice data in each frequency band into a preset voice quality detection model to obtain a quality detection result of the voice data;
the voice quality detection model is generated by training a voice quality detection model to be trained, with the training target that a prediction result made by the voice quality detection model to be trained for a voice data sample, according to a target spectral energy value of the voice data sample in each frequency band of the at least one frequency band, approaches a calibration result of the voice data sample, wherein the calibration result represents a voice quality category to which the voice data sample belongs.
2. The method of claim 1, wherein the analyzing the target spectral energy value of the voice data in each frequency band to obtain the quality detection result of the voice data comprises:
selecting a target condition which is satisfied by the target spectrum energy value of the voice data in each frequency band from at least one preset condition;
and taking the preset voice quality category matched with the target condition as a quality detection result of the voice data.
3. The method of claim 1, wherein the segmenting the voice data into at least one voice frame in units of frames comprises:
taking a preset number of sampling points as a frame;
and segmenting the voice data according to the acquisition sequence of the sampling points in the voice data to obtain at least one voice frame.
4. The method of claim 1, wherein the calculating the spectral energy value of the voice frame in each frequency band of the preset at least one frequency band comprises:
determining a target frequency value of each frequency band in at least one preset frequency band;
and calculating the spectral energy value of the voice frame in the frequency band by using the target frequency value of the frequency band and the sampling value of each sampling point in the voice frame.
5. The method according to claim 1, wherein the calculating the target spectral energy value of the voice data in the frequency band by using the spectral energy value of each voice frame in the at least one voice frame in the frequency band comprises:
acquiring the spectral energy value of each voice frame in the at least one voice frame in the frequency band;
and determining the average value of the acquired spectrum energy values as a target spectrum energy value of the frequency band.
6. The method of claim 1, further comprising a speech quality detection model training process comprising:
obtaining a voice data sample belonging to each of at least one voice quality category;
dividing the voice data sample by taking a frame as a unit to obtain at least one voice frame;
calculating the spectral energy value of each frequency band of the voice frame in the at least one frequency band;
calculating a target spectral energy value of the voice data sample in the frequency band by using the spectral energy value of each voice frame in the voice data sample in the frequency band;
and training the voice quality detection model to be trained, with the training target that the prediction result made for the voice data sample according to the target spectrum energy value of the voice data sample in each frequency band approaches the voice quality category to which the voice data sample belongs, to obtain the voice quality detection model.
7. A voice data quality detection apparatus, comprising:
the first segmentation unit is used for segmenting the voice data by taking a frame as a unit to obtain at least one voice frame;
the first calculating unit is used for calculating the spectral energy value of each frequency band of the voice frame in at least one preset frequency band;
a second calculating unit, configured to calculate a target spectral energy value of the voice data in the frequency band by using the spectral energy value of each voice frame in the at least one voice frame in the frequency band;
the quality detection result determining unit is used for analyzing the target frequency spectrum energy value of the voice data in each frequency band to obtain the quality detection result of the voice data;
wherein the analyzing the target spectrum energy value of the voice data in each frequency band to obtain the quality detection result of the voice data comprises:
inputting the target frequency spectrum energy value of the voice data in each frequency band into a preset voice quality detection model to obtain a quality detection result of the voice data;
the voice quality detection model is generated by training a voice quality detection model to be trained, with the training target that a prediction result made by the voice quality detection model to be trained for a voice data sample, according to a target spectral energy value of the voice data sample in each frequency band of the at least one frequency band, approaches a calibration result of the voice data sample, wherein the calibration result represents a voice quality category to which the voice data sample belongs.
8. A server, comprising: at least one memory and at least one processor; the memory stores a program, and the processor calls the program stored in the memory, the program being used for implementing the voice data quality detection method according to any one of claims 1 to 6.
9. A storage medium having stored thereon computer-executable instructions for performing the method of voice data quality detection according to any one of claims 1-6.
CN201910870667.XA 2019-09-16 2019-09-16 Voice data quality detection method, device, server and storage medium Active CN110580917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910870667.XA CN110580917B (en) 2019-09-16 2019-09-16 Voice data quality detection method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910870667.XA CN110580917B (en) 2019-09-16 2019-09-16 Voice data quality detection method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN110580917A CN110580917A (en) 2019-12-17
CN110580917B (en) 2022-02-15

Family

ID=68811599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910870667.XA Active CN110580917B (en) 2019-09-16 2019-09-16 Voice data quality detection method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN110580917B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115086286B (en) * 2022-06-06 2023-05-16 中国联合网络通信集团有限公司 Voice service quality determining method, device, electronic equipment and medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2371619B1 (en) * 2009-10-08 2012-08-08 Telefónica, S.A. VOICE SEGMENT DETECTION PROCEDURE.
CN104143341B (en) * 2013-05-23 2015-10-21 腾讯科技(深圳)有限公司 Sonic boom detection method and device
US9984706B2 (en) * 2013-08-01 2018-05-29 Verint Systems Ltd. Voice activity detection using a soft decision mechanism
CN105931634B (en) * 2016-06-15 2018-09-21 腾讯科技(深圳)有限公司 Audio screening technique and device
CN106920545B (en) * 2017-03-21 2020-07-28 百度在线网络技术(北京)有限公司 Speech feature extraction method and device based on artificial intelligence
CN108010539A (en) * 2017-12-05 2018-05-08 广州势必可赢网络科技有限公司 A kind of speech quality assessment method and device based on voice activation detection
KR101993003B1 (en) * 2018-01-24 2019-06-26 국방과학연구소 Apparatus and method for noise reduction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A 1.8 kbps MBE-LPC vocoder implemented with a fixed-point DSP; Li Yongming et al.; Chinese Journal of Semiconductors; 2000-08-08 (No. 08); full text *
Noise estimation algorithm in non-stationary noise environments; Wang Yueming et al.; Audio Engineering; 2006-11-17 (No. 11); full text *

Also Published As

Publication number Publication date
CN110580917A (en) 2019-12-17

Similar Documents

Publication Publication Date Title
CN107004409B (en) Neural network voice activity detection using run range normalization
EP1569422B1 (en) Method and apparatus for multi-sensory speech enhancement on a mobile device
Towsey The calculation of acoustic indices derived from long-duration recordings of the natural environment
US8655656B2 (en) Method and system for assessing intelligibility of speech represented by a speech signal
May et al. Requirements for the evaluation of computational speech segregation systems
US10548534B2 (en) System and method for anhedonia measurement using acoustic and contextual cues
JP6439682B2 (en) Signal processing apparatus, signal processing method, and signal processing program
WO2017045429A1 (en) Audio data detection method and system and storage medium
May et al. Computational speech segregation based on an auditory-inspired modulation analysis
CN110580917B (en) Voice data quality detection method, device, server and storage medium
CN109997186B (en) Apparatus and method for classifying acoustic environments
US10013997B2 (en) Adaptive interchannel discriminative rescaling filter
Lee A two-stage approach using Gaussian mixture models and higher-order statistics for a classification of normal and pathological voices
Ghosal et al. Automatic male-female voice discrimination
Wirtzfeld et al. Predicting the quality of enhanced wideband speech with a cochlear model
Yarra et al. A mode-shape classification technique for robust speech rate estimation and syllable nuclei detection
CN106340310B (en) Speech detection method and device
WO2017128910A1 (en) Method, apparatus and electronic device for determining speech presence probability
CN115670397B (en) PPG artifact identification method and device, storage medium and electronic equipment
CN108919962B (en) Auxiliary piano training method based on brain-computer data centralized processing
Dov et al. Voice activity detection in presence of transients using the scattering transform
Baghel et al. Classification of multi speaker shouted speech and single speaker normal speech
CN112420079B (en) Voice endpoint detection method and device, storage medium and electronic equipment
CN111477248B (en) Audio noise detection method and device
Lili et al. Research on Recognition of CHD Heart Sound Using MFCC and LPCC

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant