WO2022068304A1

WO2022068304A1 - Sound quality detection method and device

Info

Publication number: WO2022068304A1
Application number: PCT/CN2021/105044
Authority: WO
Inventors: 郑羲光; 陈翔宇; 张晨
Original assignee: 北京达佳互联信息技术有限公司
Priority date: 2020-09-29
Filing date: 2021-07-07
Publication date: 2022-04-07
Also published as: CN112185421B; CN112185421A

Abstract

A sound quality detection method and device. Said method comprises: acquiring a target audio signal (S100); detecting an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal (S200); detecting an audio acquisition signal corresponding to the target audio signal to obtain second evaluation information related to the audio acquisition signal (S300); fusing the first evaluation information and the second evaluation information according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal (S400); and determining the category of the sound quality of the target audio signal according to the target evaluation information (S500).

Description

Sound quality detection method and device

This application claims the priority of the Chinese Patent Application No. 202011054305.2 filed with the Chinese Patent Office on September 29, 2020, the entire contents of which are incorporated herein by reference.

Technical field

The present disclosure relates to the technical field of audio processing, and in particular, to a sound quality detection method, device, electronic device and storage medium.

Background art

With the progress of society and the development of computer and network technology, people have more and more channels for receiving external information. In recent years, advances in audio processing technology have greatly expanded the use of audio information for communicating with the outside world and perceiving changes in it, and people are paying increasing attention to the quality of the audio information they send and receive. The traditional sound quality detection method is generally a full-reference method. First, the original lossless audio signal and the various lossy audio signals with reduced sound quality that correspond to it are obtained. Then the gap between a lossy audio signal and the original lossless audio signal is determined, the sound quality evaluation information of the lossy audio signal is determined from that gap, and the sound quality of the lossy audio signal is determined from the evaluation information.

Summary of the invention

The present disclosure provides a sound quality detection method, device, electronic device and storage medium.

According to a first aspect of the embodiments of the present disclosure, there is provided a sound quality detection method, including:

acquiring a target audio signal;

detecting an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;

detecting an audio acquisition signal corresponding to the target audio signal to obtain second evaluation information related to the audio acquisition signal;

fusing the first evaluation information and the second evaluation information according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal; and

determining the sound quality category of the target audio signal according to the target evaluation information.

In some embodiments, detecting the audio content signal corresponding to the target audio signal to obtain the first evaluation information related to the audio content signal includes:

classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal;

detecting the target audio signal according to the audio classification result to obtain the first evaluation information related to the audio content signal.

In some embodiments, classifying the audio content signal corresponding to the target audio signal to obtain the audio classification result corresponding to the audio content signal includes:

dividing the target audio signal according to a first time length to obtain a first number of audio clips;

for each of the audio clips, classifying the audio clip content corresponding to the audio clip to obtain a second number of target categories corresponding to the audio clip and a target probability that the audio clip belongs to each target category;

determining the second number of target categories and the second number of target probabilities corresponding to each of the audio clips as the audio classification result.

In some embodiments, detecting the target audio signal according to the audio classification result to obtain the first evaluation information related to the audio content signal includes:

for each of the audio clips, detecting the audio clip according to the second number of target categories to obtain a second number of pieces of clip content evaluation information related to the second number of target categories;

determining the clip content evaluation information corresponding to the maximum probability value among the second number of target probabilities as the clip content evaluation information corresponding to each of the audio clips; or, using the second number of target probabilities as weight coefficients, weighting the second number of pieces of clip content evaluation information related to the audio clip to obtain the clip content evaluation information corresponding to each of the audio clips; wherein the first number of audio clips corresponds to a first number of pieces of clip content evaluation information;

fusing the first number of pieces of clip content evaluation information according to a second preset weight vector to obtain the first evaluation information.

In some embodiments, detecting the audio acquisition signal corresponding to the target audio signal to obtain the second evaluation information related to the audio acquisition signal includes:

detecting a broken sound phenomenon corresponding to the target audio signal to obtain corresponding broken sound evaluation information;

detecting an external recording device corresponding to the target audio signal to obtain corresponding external recording evaluation information;

fusing the broken sound evaluation information and the external recording evaluation information according to a third preset weight vector to obtain the second evaluation information.

In some embodiments, detecting the broken sound phenomenon corresponding to the target audio signal to obtain the corresponding broken sound evaluation information includes:

dividing the target audio signal according to a second time length to obtain a third number of audio clips;

for each of the audio clips, detecting the degree of broken sound corresponding to the audio clip to obtain clip broken sound evaluation information corresponding to the audio clip, wherein the third number of audio clips corresponds to a third number of pieces of clip broken sound evaluation information;

fusing the third number of pieces of clip broken sound evaluation information according to a fourth preset weight vector to obtain the broken sound evaluation information.

According to a second aspect of the embodiments of the present disclosure, there is provided a sound quality detection device, comprising:

an audio signal acquisition unit, configured to perform acquisition of the target audio signal;

a first detection unit, configured to perform detection of an audio content signal corresponding to the target audio signal, to obtain first evaluation information related to the audio content signal;

a second detection unit, configured to perform detection of an audio acquisition signal corresponding to the target audio signal, to obtain second evaluation information related to the audio acquisition signal;

a target evaluation information determination unit, configured to fuse the first evaluation information and the second evaluation information according to a first preset weight vector, to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal;

a sound quality detection unit, configured to determine the sound quality category of the target audio signal according to the target evaluation information.

In some embodiments, the first detection unit is further configured to perform:

In some embodiments, the second detection unit is further configured to perform:

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the sound quality detection method described in any one of the embodiments of the first aspect.

According to a fourth aspect of the embodiments of the present disclosure, a storage medium is provided. When instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the sound quality detection method described in any one of the embodiments of the first aspect.

According to a fifth aspect of the embodiments of the present disclosure, there is provided a computer program product. The program product comprises a computer program stored in a readable storage medium, and at least one processor of a device reads the computer program from the readable storage medium and executes it, so that the device executes the sound quality detection method described in any one of the embodiments of the first aspect.

In the embodiments of the present disclosure, the audio content signal corresponding to the target audio signal is detected, so that the quality of the target audio signal is evaluated at the audio content signal level and the first evaluation information related to the audio content signal is obtained; and the audio acquisition signal corresponding to the target audio signal is detected, so that the quality of the target audio signal is evaluated at the audio acquisition signal level and the second evaluation information related to the audio acquisition signal is obtained. After the first evaluation information and the second evaluation information, which evaluate the target audio signal from different dimensions, are obtained, they are fused according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal, and the sound quality category of the target audio signal is determined according to the target evaluation information. Therefore, the quality of the target audio signal can be detected without acquiring the original lossless audio signal corresponding to the target audio signal. At the same time, because the target evaluation information used for evaluating the sound quality of the target audio signal is obtained from multi-dimensional characteristic attributes of the target audio signal itself, the target audio signal can be detected comprehensively, and the sound quality of the corresponding audio signal can ultimately be defined accurately.

It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.

Description of drawings

FIG. 1 is a flowchart of a sound quality detection method according to an exemplary embodiment.

FIG. 2 is a flowchart of a possible implementation of step S200 according to an exemplary embodiment.

FIG. 3 is a flowchart of a possible implementation of step S210 according to an exemplary embodiment.

FIG. 4 is a flowchart of a possible implementation of step S220 according to an exemplary embodiment.

FIG. 5 is a flowchart of a possible implementation of step S300 according to an exemplary embodiment.

FIG. 6 is a flowchart of a possible implementation of step S310 according to an exemplary embodiment.

FIG. 7 is a structural diagram of a sound quality detection system according to a specific exemplary embodiment.

FIG. 8 is a block diagram of a sound quality detection apparatus according to an exemplary embodiment.

FIG. 9 is a block diagram of an electronic device for sound quality detection according to an exemplary embodiment.

Detailed description

In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first", "second" and the like in the description and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used may be interchanged under appropriate circumstances, such that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the illustrative examples below do not represent all implementations consistent with this disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as recited in the appended claims.

FIG. 1 is a flowchart of a sound quality detection method according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps:

In step S100, the target audio signal is acquired.

In step S200, the audio content signal corresponding to the target audio signal is detected to obtain first evaluation information related to the audio content signal.

In step S300, the audio acquisition signal corresponding to the target audio signal is detected to obtain second evaluation information related to the audio acquisition signal.

In step S400, the first evaluation information and the second evaluation information are fused according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal; wherein the target evaluation information is related to the sound quality of the target audio signal.

In step S500, the sound quality category of the target audio signal is determined according to the target evaluation information.

The target audio signal refers to the audio signal whose sound quality is to be evaluated. The audio content signal refers to the specific content carried by the target audio signal; for example, the audio content signal may be music, speech or other audio (such as noise). The audio acquisition signal refers to the signal related to the acquisition process of the target audio signal. It is used to evaluate audio quality that is unrelated to the audio content signal, and mainly refers to the part of the signal in which the sound quality was damaged during acquisition. In some embodiments, the evaluation of the audio acquisition signal mainly includes broken sound evaluation and external recording evaluation. The first preset weight vector is the vector of combination coefficients used when the first evaluation information and the second evaluation information are combined. In some embodiments, the first preset weight vector may be set according to the influence of the audio content signal and the audio acquisition signal on the sound quality.

Specifically, after the target audio signal whose sound quality is to be evaluated is obtained, the quality of the target audio signal is evaluated at the audio content level to obtain the first evaluation information related to the audio content signal, and the quality of the target audio signal is evaluated at the audio acquisition level to obtain the second evaluation information related to the audio acquisition signal. After the first evaluation information and the second evaluation information, which evaluate the target audio signal from different dimensions, are obtained, they are fused according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal. The target evaluation information is related to the sound quality of the target audio signal. Exemplarily, when the score corresponding to the first evaluation information is 90, the score corresponding to the second evaluation information is 85, and the first preset weight vector is (0.6, 0.4), the score corresponding to the target evaluation information is 90*0.6+85*0.4 = 88. The target evaluation information can accurately evaluate the sound quality of the corresponding target audio signal, so the sound quality category of the target audio signal is determined according to the target evaluation information. For example, when the full score is 100 points, the target audio signal may be determined to be a high-quality audio signal when the target evaluation information is 95, a medium-quality audio signal when the target evaluation information is 70, and a low-quality audio signal when the target evaluation information is 55.

It should be noted that the above scores and classifications are only exemplary descriptions, and in a specific implementation the sound quality categories may be further divided according to actual needs.
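
The fusion in step S400 and the category decision in step S500 can be sketched as follows. This is only an illustrative sketch, not the disclosed implementation: the function names and the cut-off values of 85 and 60 are hypothetical, since the text gives 95, 70 and 55 only as examples of high, medium and low quality and does not state thresholds.

```python
def fuse(scores, weights):
    """Weighted sum of evaluation scores using a preset weight vector."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights))


def sound_quality_category(target_score, high=85.0, medium=60.0):
    """Map a 0-100 target score to a sound quality category.

    The cut-offs `high` and `medium` are assumptions made for this sketch.
    """
    if target_score >= high:
        return "high"
    if target_score >= medium:
        return "medium"
    return "low"


first_score, second_score = 90.0, 85.0   # first / second evaluation information
first_preset_weights = (0.6, 0.4)        # first preset weight vector from the example
target_score = fuse((first_score, second_score), first_preset_weights)
print(round(target_score, 2), sound_quality_category(target_score))
```

With the example values from the text, the fused score is 88, which falls in the "high" category under the assumed cut-offs.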

In the above sound quality detection method, the audio content signal corresponding to the target audio signal is detected, so that the quality of the target audio signal is evaluated at the audio content signal level and the first evaluation information related to the audio content signal is obtained; and the audio acquisition signal corresponding to the target audio signal is detected, so that the quality of the target audio signal is evaluated at the audio acquisition signal level and the second evaluation information related to the audio acquisition signal is obtained. After the first evaluation information and the second evaluation information, which evaluate the target audio signal from different dimensions, are obtained, they are fused according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal, and the sound quality category of the target audio signal is determined according to the target evaluation information. Therefore, the quality of the target audio signal can be detected without acquiring the original lossless audio signal corresponding to the target audio signal. At the same time, because the target evaluation information for evaluating the sound quality of the target audio signal is obtained from multi-dimensional characteristic attributes of the target audio signal itself, the target audio signal can be detected comprehensively, and the sound quality of the corresponding audio signal can ultimately be defined accurately.

In an exemplary embodiment, FIG. 2 is a flowchart of a possible implementation of step S200, which includes the following steps:

In step S210, the audio content signal corresponding to the target audio signal is classified to obtain an audio classification result corresponding to the audio content signal.

In step S220, the target audio signal is detected according to the audio classification result to obtain first evaluation information related to the audio content signal.

The audio category corresponding to the audio content signal may be music, speech or other (noise and other audio).

Specifically, the audio content signal corresponding to the target audio signal is classified to obtain an audio classification result corresponding to the audio content signal, where the audio classification result includes the categories corresponding to the target audio signal and the probability that the target audio signal belongs to each category. The target audio signal is then detected according to each category to obtain the evaluation information of the target audio signal in that category, and the evaluation information of the categories is combined according to the probability corresponding to each category to obtain the first evaluation information related to the audio content signal.

For example, a pre-trained audio classification network model capable of detecting the category of an audio signal can be used to obtain the probability that the target audio signal is music, speech or other audio. For example, the probability that the target audio signal is music is 0.7, the probability that it is speech is 0.2, and the probability that it is another kind of audio signal is 0.1. After the audio classification result is obtained, pre-trained evaluation networks capable of evaluating audio signals of the corresponding categories are obtained, for example, a music evaluation network model for music evaluation, a speech evaluation network model for speech evaluation, and an evaluation network model for other audio. The target audio signal is input into the corresponding evaluation network models to obtain the evaluation information of the target audio signal at the music level, the speech level and the other level. For example, the score corresponding to the evaluation information of the target audio signal at the music level is 90, the score corresponding to the evaluation information at the speech level is 80, and the score corresponding to the evaluation information at the other level is 85. Finally, the evaluation information of the categories is combined according to the probability corresponding to each category to obtain the score corresponding to the first evaluation information: 0.7*90+0.2*80+0.1*85 = 87.5.
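
The probability-weighted combination in this example can be written out as a short sketch. The per-category scores would come from the pre-trained evaluation network models described above; here they are simply the numbers from the example, and the function and variable names are hypothetical.

```python
def content_score(category_probs, category_scores):
    """Combine per-category evaluation scores, weighted by class probability."""
    return sum(p * category_scores[c] for c, p in category_probs.items())


# Values taken from the worked example in the text.
probs = {"music": 0.7, "speech": 0.2, "other": 0.1}
scores = {"music": 90.0, "speech": 80.0, "other": 85.0}

first_evaluation = content_score(probs, scores)
print(round(first_evaluation, 2))  # 87.5
```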

In the above exemplary embodiment, the audio content signal corresponding to the target audio signal is classified to obtain the audio classification result corresponding to the audio content signal, and the target audio signal is detected according to the audio classification result. The target audio signal can thus be detected in a targeted manner for the specific categories it belongs to, so that the obtained first evaluation information reflects the quality of the audio signal more comprehensively and specifically and provides a basis for the subsequent sound quality evaluation of the audio signal.

In an exemplary embodiment, FIG. 3 is a flowchart of a possible implementation of step S210, which includes the following steps:

In step S211, the target audio signal is divided according to the first time length to obtain a first number of audio segments.

In step S212, for each audio segment, classify the content of the audio segment corresponding to the audio segment to obtain a second number of target categories corresponding to the audio segment and a target probability that the audio segment is each target category.

In step S213, the second number of target categories and the second number of target probabilities corresponding to each audio segment are determined as the audio classification result.

The first time length refers to the reference duration used to divide the audio signal. In some embodiments, it may be 1 second, 10 seconds, 20 seconds or 1 minute. These durations are only exemplary and do not specifically limit the first time length.

Specifically, the target audio signal is segmented according to the first time length to obtain a first number of audio clips. For example, if the length of the target audio signal is 3 minutes and the first time length is 10 seconds, segmentation yields 18 audio clips of 10 seconds each, so the first number is 18. For each 10-second audio clip, the audio clip content corresponding to the audio clip is classified. When, for example, the probability that a 10-second audio clip is music is 0.7, the probability that it is speech is 0.2, and the probability that it is another kind of audio signal is 0.1, the second number is 3 and the target categories are music, speech and other. Finally, the second number of target categories and the second number of target probabilities corresponding to each of the 18 10-second audio clips are determined as the audio classification result.
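
The segmentation in step S211 can be illustrated with a minimal sketch. The sampling rate, the use of a plain list of samples and the handling of a trailing remainder are assumptions made for this sketch; the text does not specify them.

```python
def split_into_clips(samples, sample_rate, clip_seconds):
    """Split a sample sequence into consecutive clips of `clip_seconds` each.

    A trailing remainder shorter than one clip is kept as a final, shorter
    clip. The text does not say how a remainder is handled, so this is an
    assumption.
    """
    clip_len = int(sample_rate * clip_seconds)
    if clip_len <= 0:
        raise ValueError("clip length must be positive")
    return [samples[i:i + clip_len] for i in range(0, len(samples), clip_len)]


sample_rate = 8_000                     # hypothetical sampling rate in Hz
signal = [0.0] * (sample_rate * 180)    # 3 minutes of silence as stand-in audio
clips = split_into_clips(signal, sample_rate, clip_seconds=10)
print(len(clips))  # 18
```

Each of the 18 clips would then be passed to the classifier of step S212 to obtain its target categories and target probabilities.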

In the above exemplary embodiment, the target audio signal is divided according to the first time length to obtain a first number of audio clips, the audio clip content corresponding to each audio clip is classified to obtain a second number of target categories corresponding to the audio clip and the target probability that the audio clip belongs to each target category, and the second number of target categories and the second number of target probabilities corresponding to each audio clip are determined as the audio classification result. In this way, the target audio signal is divided on a finer time scale, so that each audio clip can subsequently be detected on that finer time scale, and the sound quality of the corresponding audio signal can ultimately be defined accurately.

In an exemplary embodiment, FIG. 4 is a flowchart of a possible implementation of step S220, which includes the following steps:

In step S221, for each audio segment, the audio segment is detected according to the second number of target categories, and a second number of segment content evaluation information related to the second number of target categories is obtained.

In step S222, the segment content evaluation information corresponding to the maximum probability value among the second number of target probabilities is determined as the segment content evaluation information corresponding to each audio segment; or, using the second number of target probabilities as weight coefficients, the second number of pieces of segment content evaluation information related to the audio segment are weighted to obtain the segment content evaluation information corresponding to each audio segment; wherein the first number of audio segments corresponds to a first number of pieces of segment content evaluation information.

In step S223, the first number of pieces of segment content evaluation information are fused according to the second preset weight vector to obtain the first evaluation information.

The first number of audio clips corresponds to the first number of pieces of clip content evaluation information. The second preset weight vector is the vector of combination coefficients used when the sound quality detection results of the multiple audio clips are combined. In some embodiments, the second preset weight vector may be set as a weighted average, or may be set according to the specific target audio signal. For example, relatively small weight coefficients may be set for the audio clips at the beginning and the end of the target audio signal and relatively large weight coefficients for the audio clips in the middle, so as to reduce the undue influence of noise at the start of the recording.

Specifically, each audio segment is detected to obtain a second number of pieces of segment content evaluation information related to the second number of target categories. Either the segment content evaluation information corresponding to the largest probability value among the second number of target probabilities is determined as the segment content evaluation information corresponding to the audio segment, or the second number of target probabilities are used as weight coefficients to weight the second number of pieces of segment content evaluation information related to the audio segment, to obtain the segment content evaluation information corresponding to the audio segment. For example, the probability that an audio clip is music is 0.7, the probability that it is speech is 0.2, and the probability that it is another kind of audio signal is 0.1; the score corresponding to its evaluation information at the music level is 90, the score corresponding to its evaluation information at the speech level is 80, and the score corresponding to its evaluation information at the other level is 85. Under the first option, the segment content evaluation information is the score of 90 that corresponds to the maximum probability of 0.7. Under the second option, the evaluation information of the categories is combined according to the probability corresponding to each category, giving a segment content evaluation score of 0.7*90+0.2*80+0.1*85 = 87.5. The first number of audio clips corresponds to the first number of pieces of clip content evaluation information, and finally the first number of pieces of clip content evaluation information are weighted and summed according to the second preset weight vector to obtain the first evaluation information.
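
Steps S222 and S223 can be sketched as follows. The two per-clip options follow the text; the edge-downweighting helper is only one hypothetical way to build a second preset weight vector that gives smaller weights to the first and last clips, and the per-clip scores in the usage example are made-up illustrative values.

```python
def clip_score_max(probs, scores):
    """Option 1: take the score of the most probable category."""
    best = max(probs, key=probs.get)
    return scores[best]


def clip_score_weighted(probs, scores):
    """Option 2: weight each category score by its probability."""
    return sum(p * scores[c] for c, p in probs.items())


def edge_downweighted(n, edge_weight=0.5):
    """Hypothetical second preset weight vector, normalized to sum to 1.

    The first and last clips get `edge_weight`, all others get 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    raw = [1.0] * n
    raw[0] = edge_weight
    raw[-1] = edge_weight
    total = sum(raw)
    return [w / total for w in raw]


def fuse_clips(clip_scores, weights):
    """Weighted sum of per-clip scores (step S223)."""
    return sum(s * w for s, w in zip(clip_scores, weights))


probs = {"music": 0.7, "speech": 0.2, "other": 0.1}
scores = {"music": 90.0, "speech": 80.0, "other": 85.0}
print(clip_score_max(probs, scores))                  # 90.0
print(round(clip_score_weighted(probs, scores), 2))   # 87.5

clip_scores = [80.0, 90.0, 90.0, 70.0]                # illustrative per-clip scores
weights = edge_downweighted(len(clip_scores))
print(round(fuse_clips(clip_scores, weights), 2))     # 85.0
```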

In the above exemplary embodiment, for each audio clip, the audio clip is detected according to the second number of target categories to obtain a second number of pieces of clip content evaluation information related to the second number of target categories. The clip content evaluation information corresponding to the maximum probability value among the second number of target probabilities is determined as the clip content evaluation information corresponding to each audio clip, or the second number of target probabilities are used as weight coefficients to weight the second number of pieces of clip content evaluation information to obtain the clip content evaluation information corresponding to each audio clip. The first number of pieces of clip content evaluation information are then fused according to the second preset weight vector to obtain the first evaluation information. The target audio signal is thereby detected in more detail on a finer time scale, and the sound quality of the corresponding audio signal can ultimately be defined accurately.

In an exemplary embodiment, FIG. 5 is a flowchart of a possible implementation of step S300, which includes the following steps:

In step S310, a broken sound phenomenon corresponding to the target audio signal is detected to obtain corresponding broken sound evaluation information.

In step S320, the external recording device corresponding to the target audio signal is detected to obtain the corresponding external recording evaluation information.

In step S330, according to the third preset weight vector, the broken sound evaluation information and the external recording evaluation information are fused to obtain the second evaluation information.

The broken sound phenomenon refers to the situation in which the level of the sound signal exceeds the upper load limit of the electronic components, so that part of the sound signal is cut off and the emitted sound contains noise. The external recording device refers to a device that transmits the sound signal to the recording system through a microphone, or through the pickup built into the recorder, and then records the sound signal on a storage medium. This recording method can conveniently record a variety of sound signals such as the human voice, but the obtained audio signal is susceptible to external interference and the sound signal is easily distorted. The third preset weight vector is the vector of combination coefficients used when the broken sound evaluation information and the external recording evaluation information are combined. In some embodiments, the third preset weight vector may be set as a weighted average, or may be set according to the influence of the broken sound evaluation information and the external recording evaluation information on the sound quality.

Specifically, the broken sound phenomenon and the external recording device corresponding to the target audio signal are detected separately, yielding the broken sound evaluation information from the broken sound detection and the external recording evaluation information from the external recording device detection. The broken sound evaluation information and the external recording evaluation information are then fused according to the third preset weight vector to obtain the second evaluation information. For example, pre-trained evaluation networks capable of detecting and evaluating the broken sound phenomenon and the external recording device are obtained, such as a broken sound evaluation network model and an external recording evaluation network model; the target audio signal is input into the corresponding evaluation network model to obtain the broken sound evaluation information and the external recording evaluation information. For example, when the score corresponding to the broken sound evaluation information of the target audio signal is 90, the score corresponding to the external recording evaluation information is 80, and the third preset weight vector is (0.6, 0.4), the score corresponding to the second evaluation information is 90*0.6 + 80*0.4 = 86.
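The fusion in step S330 can be sketched as a plain weighted sum. The function name `fuse_scores` is hypothetical and chosen only for illustration; the scores (90, 80) and the weight vector (0.6, 0.4) are the values from the example above.

```python
from typing import Sequence


def fuse_scores(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of evaluation scores under a preset weight vector."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights))


# Broken sound score 90, external recording score 80,
# third preset weight vector (0.6, 0.4).
second_evaluation_score = fuse_scores([90.0, 80.0], [0.6, 0.4])
print(second_evaluation_score)
```

The same helper would apply to any of the preset weight vectors, since each fusion step in the method is a weighted combination of scores.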

In the above exemplary embodiment, the broken sound phenomenon corresponding to the target audio signal is detected to obtain the corresponding broken sound evaluation information, and the external recording device corresponding to the target audio signal is detected to obtain the corresponding external recording evaluation information. The target audio signal is thus detected in a targeted manner for the different sound quality problems introduced by the specific acquisition device. The broken sound evaluation information and the external recording evaluation information are then fused according to the third preset weight vector, so that the resulting second evaluation information reflects the effect of the audio signal acquisition device on sound quality more comprehensively and specifically, and provides a basis for the subsequent sound quality evaluation of the audio signal.

In an exemplary embodiment, FIG. 6 is a flowchart of one possible implementation of step S310, which includes the following steps:

In step S311, the target audio signal is divided according to the second time length to obtain a third number of audio segments.

In step S312, for each audio clip, the degree of broken sound corresponding to the audio clip is detected to obtain the segment broken sound evaluation information corresponding to the audio clip; the third number of audio clips thus corresponds to a third number of pieces of segment broken sound evaluation information.

In step S313, according to the fourth preset weight vector, the third number of pieces of segment broken sound evaluation information are fused to obtain the broken sound evaluation information.

The second time length refers to a reference value for dividing the audio signal; in some embodiments, it may be 1 second, 10 seconds, 20 seconds, or 1 minute. These time lengths are only exemplary and do not specifically limit the second time length. The fourth preset weight vector is the vector of merging coefficients used when the third number of pieces of segment broken sound evaluation information are merged. For example, a fourth preset weight vector may be set for the target audio signal to account for the effect of excessive noise recorded at the beginning.

Specifically, the target audio signal is segmented according to the second time length to obtain a third number of audio clips. For example, if the target audio signal is 3 minutes long and is segmented with 10 seconds as the second time length, the third number is 18 and each audio clip is 10 seconds long. For each 10-second audio clip, the degree of broken sound corresponding to the audio clip is detected to obtain the segment broken sound evaluation information corresponding to the audio clip; the third number of audio clips thus corresponds to a third number of pieces of segment broken sound evaluation information. Each piece of segment broken sound evaluation information corresponds to one weighting coefficient, so the third number of pieces correspond to a fourth preset weight vector whose dimension is the third number. According to the fourth preset weight vector, the third number of pieces of segment broken sound evaluation information are fused to obtain the broken sound evaluation information.
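Steps S311 to S313 can be sketched as follows. This is a minimal illustration under stated assumptions, not the detector of the disclosure: `clipping_score` is a hypothetical stand-in for the broken sound evaluation network (it simply measures the share of samples that stay below a clip limit), the default uniform weight vector is an assumption, and the toy sample rate of 100 samples per second is chosen only to keep the example small.

```python
from typing import Callable, List, Optional, Sequence


def split_into_segments(samples: Sequence[float], sample_rate: int,
                        segment_seconds: float) -> List[Sequence[float]]:
    """Divide a signal into consecutive segments of the given time length.

    The last segment is shorter when the signal length is not a multiple
    of the segment length.
    """
    step = int(sample_rate * segment_seconds)
    if step <= 0:
        raise ValueError("a segment must cover at least one sample")
    return [samples[i:i + step] for i in range(0, len(samples), step)]


def clipping_score(segment: Sequence[float], limit: float = 0.99) -> float:
    """Hypothetical stand-in for the broken sound evaluation network.

    Returns a score from 0 to 100: the percentage of samples whose
    magnitude stays below the clip limit.
    """
    if len(segment) == 0:
        return 100.0
    clipped = sum(1 for x in segment if abs(x) >= limit)
    return 100.0 * (1.0 - clipped / len(segment))


def broken_sound_evaluation(
    samples: Sequence[float],
    sample_rate: int,
    segment_seconds: float,
    weights: Optional[Sequence[float]] = None,
    scorer: Callable[[Sequence[float]], float] = clipping_score,
) -> float:
    """Score each segment, then fuse the scores with a weight vector."""
    segments = split_into_segments(samples, sample_rate, segment_seconds)
    if not segments:
        raise ValueError("signal is empty")
    scores = [scorer(seg) for seg in segments]
    if weights is None:
        # Assumption: uniform weights when no fourth weight vector is given.
        weights = [1.0 / len(scores)] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("one weight is required per segment")
    return sum(s * w for s, w in zip(scores, weights))


# Toy 3-minute signal; only the first 10 seconds are clipped.
sample_rate = 100
signal = [1.0] * 1000 + [0.5] * 17000
segments = split_into_segments(signal, sample_rate, 10)
print(len(segments))  # 18
print(broken_sound_evaluation(signal, sample_rate, 10))
```

Passing an explicit `weights` list of length 18 would correspond to a non-uniform fourth preset weight vector, for instance one that gives less weight to the opening segments.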

In the above exemplary embodiment, the target audio signal is divided according to the second time length to obtain a third number of audio clips; for each audio clip, the degree of broken sound corresponding to the audio clip is detected to obtain the segment broken sound evaluation information corresponding to the audio clip, so that the third number of audio clips corresponds to a third number of pieces of segment broken sound evaluation information; and the third number of pieces of segment broken sound evaluation information are fused according to the fourth preset weight vector to obtain the broken sound evaluation information. In this way, the target audio signal is divided over a smaller time dimension, so that each audio clip can subsequently be detected at that finer scale, and the sound quality of the corresponding audio signal can finally be defined accurately.

FIG. 7 is a structural diagram of a sound quality detection system according to a specific exemplary embodiment. As shown in FIG. 7, the system is organized as follows:

The sound quality detection system divides the evaluation of the quality of the target audio signal into two parts: the first part is the sound quality evaluation related to the content; the second part is the sound quality evaluation related to the acquisition device. The first part mainly judges the different contents of the audio signal, and then conducts targeted scoring according to specific categories. The second part is mainly aimed at the acquisition equipment of audio signals, and detects whether the acquisition equipment will introduce related distortion.

For the first part, a deep learning network that classifies audio signals assigns the input target audio signal to the music, speech or other (noise and other audio) type. The network outputs a result for each audio segment of fixed length in the target audio signal (for example, once per second); the output is, for example, multiple target probabilities of the music, speech and other types, and these target probabilities sum to 1. After the target probabilities are obtained, the category with the highest probability can be selected for the subsequent scoring process. For example, if the segment is classified as music, the signal is given a no-reference music quality score (which can be regarded as setting the probabilities of the other categories to 0), and the final score is the no-reference scoring result corresponding to the category with the highest probability. Alternatively, after the probabilities are obtained, the audio signal can be sent directly to all three scoring networks in FIG. 7 for detection. After the results are obtained, the fused first detection score (the score corresponding to the first evaluation information) is given by formula (1):

Content-related fusion scoring result = music probability * no-reference music scoring result + speech probability * no-reference speech scoring result + other probability * audio event classification network score (1)

Here, the audio event classification network scores audio signals that are neither speech nor music according to whether they contain noise that degrades audio quality. Sounds such as babble noise, engine noise, and low-frequency noise in an aircraft cabin are harmful noises and correspond to low scores; sounds such as bird calls and running water are non-harmful noises and correspond to high scores.
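Formula (1) and the highest-probability alternative can be sketched together. The function name `content_fusion_score`, the dictionary keys, and the numeric probabilities and scores are all hypothetical values chosen for illustration; the three scoring networks themselves are not modeled here.

```python
from typing import Dict


def content_fusion_score(probabilities: Dict[str, float],
                         scores: Dict[str, float],
                         use_max_only: bool = False) -> float:
    """Fuse per-category scores for one audio segment.

    use_max_only=False applies formula (1): the probability-weighted sum.
    use_max_only=True keeps only the score of the most probable category,
    which is equivalent to treating the other probabilities as 0.
    """
    if probabilities.keys() != scores.keys():
        raise ValueError("probabilities and scores must share the same categories")
    if abs(sum(probabilities.values()) - 1.0) > 1e-6:
        raise ValueError("target probabilities must sum to 1")
    if use_max_only:
        best = max(probabilities, key=probabilities.get)
        return scores[best]
    return sum(probabilities[c] * scores[c] for c in probabilities)


# Hypothetical classifier output and scoring-network output for one segment.
probs = {"music": 0.7, "speech": 0.2, "other": 0.1}
category_scores = {"music": 80.0, "speech": 60.0, "other": 40.0}

weighted = content_fusion_score(probs, category_scores)
max_only = content_fusion_score(probs, category_scores, use_max_only=True)
print(weighted, max_only)
```

The weighted variant degrades gracefully when the classifier is unsure between categories, whereas the highest-probability variant only needs one scoring network to run per segment.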

The second part of the network, that is, the sound quality evaluation related to the acquisition device, aims to evaluate sound quality independently of the content, mainly the damage done to sound quality during audio signal acquisition. This part mainly includes a broken sound detection network and an external recording detection network.

The broken sound detection network divides the input audio signal into segments in units of 1 second and judges, for each segment, whether the sound is broken, so as to evaluate the degree of broken sound. The external recording detection network is designed to determine whether the audio signal under test was evidently collected by a low-quality mobile phone microphone. Because of the acquisition equipment, signals collected by low-quality mobile phone microphones usually have a narrow frequency response and a low signal-to-noise ratio, which degrades the sound quality. The external recording detection network therefore determines whether the input signal is an audio signal collected by a low-quality microphone. The fusion score (the score corresponding to the second evaluation information) jointly produced by the broken sound detection and the external recording detection is given by formula (2):

Acquisition device-related fusion scoring result = broken sound detection result * broken sound detection weight + external recording detection result * external recording detection weight (2)

The final fusion result (the score corresponding to the target evaluation information) is shown in formula (3):

Fusion scoring result = content-related fusion scoring result * content-related fusion scoring weight + acquisition device-related fusion scoring result * acquisition-device-related fusion scoring weight (3)

The final fusion result (the score corresponding to the target evaluation information) can accurately evaluate the sound quality of the corresponding target audio signal, so as to determine the sound quality category of the target audio signal according to the target evaluation information.
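Formula (3) and the subsequent category decision can be sketched as below. The weights (0.5, 0.5), the input scores, the thresholds, and the category labels are all assumptions made for illustration; the text does not fix any of these values.

```python
from typing import Sequence, Tuple


def final_fusion_score(content_score: float, acquisition_score: float,
                       first_weights: Tuple[float, float] = (0.5, 0.5)) -> float:
    """Formula (3): fuse the content-related and acquisition device-related scores."""
    w_content, w_acquisition = first_weights
    return content_score * w_content + acquisition_score * w_acquisition


def sound_quality_category(
    score: float,
    thresholds: Sequence[Tuple[float, str]] = ((80.0, "good"), (60.0, "medium")),
) -> str:
    """Map a fused score to a category using hypothetical descending thresholds."""
    for lower_bound, label in thresholds:
        if score >= lower_bound:
            return label
    return "poor"


# Hypothetical inputs: content-related score 72, acquisition device-related score 86.
target_score = final_fusion_score(72.0, 86.0)
category = sound_quality_category(target_score)
print(target_score, category)
```

In practice the thresholds would be tuned against labeled audio; the point here is only that the category follows from a single fused score.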

In the above sound quality detection system, the quality of the target audio signal can be detected without acquiring the original lossless audio signal corresponding to the target audio signal. At the same time, the target evaluation information used to evaluate the sound quality of the target audio signal is obtained from both the audio content signal and the audio acquisition signal corresponding to the target audio signal, so the target audio signal is detected comprehensively and the sound quality of the corresponding audio signal can be defined accurately.

It should be understood that although the steps in the flowcharts of FIGS. 1-7 are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited to this order, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-7 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; nor are they necessarily executed one after another, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

FIG. 8 is a block diagram of an apparatus for detecting sound quality according to an exemplary embodiment. Referring to FIG. 8, the device includes an audio signal acquisition unit 801, a first detection unit 802, a second detection unit 803, a target evaluation information determination unit 804, and a sound quality detection unit 805, specifically as follows:

An audio signal acquisition unit 801, configured to perform acquisition of a target audio signal;

The first detection unit 802 is configured to perform detection of the audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;

The second detection unit 803 is configured to perform detection of the audio collection signal corresponding to the target audio signal, and obtain second evaluation information related to the audio collection signal;

The target evaluation information determination unit 804 is configured to fuse the first evaluation information and the second evaluation information according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal; wherein the target evaluation information is related to the sound quality of the target audio signal;

The sound quality detection unit 805 is configured to determine the sound quality category of the target audio signal according to the target evaluation information.
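The division of work among units 801 to 805 can be sketched as a small composition of callables. The class name `SoundQualityDetector`, its field names, and the stub lambdas are hypothetical; each stub stands in for a unit whose internals are described in the method embodiments.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Signal = Sequence[float]


@dataclass
class SoundQualityDetector:
    acquire: Callable[[], Signal]                   # audio signal acquisition unit 801
    detect_content: Callable[[Signal], float]       # first detection unit 802
    detect_acquisition: Callable[[Signal], float]   # second detection unit 803
    first_weights: Tuple[float, float]              # used by determination unit 804
    categorize: Callable[[float], str]              # sound quality detection unit 805

    def run(self) -> str:
        signal = self.acquire()
        first = self.detect_content(signal)
        second = self.detect_acquisition(signal)
        w_first, w_second = self.first_weights
        # Unit 804: fuse the two scores with the first preset weight vector.
        target = first * w_first + second * w_second
        return self.categorize(target)


# Stub units with hypothetical scores and a hypothetical two-way threshold.
detector = SoundQualityDetector(
    acquire=lambda: [0.0, 0.1, -0.1],
    detect_content=lambda signal: 72.0,
    detect_acquisition=lambda signal: 86.0,
    first_weights=(0.5, 0.5),
    categorize=lambda score: "high" if score >= 80.0 else "low",
)
result = detector.run()
print(result)
```

Keeping each unit behind a callable mirrors the block diagram: any unit can be replaced by a different model without touching the fusion step.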

In an exemplary embodiment, the first detection unit 802 is further configured to perform: classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal; and detecting the target audio signal according to the audio classification result to obtain first evaluation information related to the audio content signal.

In an exemplary embodiment, the first detection unit 802 is further configured to perform: dividing the target audio signal according to the first time length to obtain a first number of audio clips; for each audio clip, classifying the audio clip content corresponding to the audio clip to obtain a second number of target categories corresponding to the audio clip and the target probability that the audio clip belongs to each target category; and determining the second number of target categories and the second number of target probabilities corresponding to each audio clip as the audio classification result.

In an exemplary embodiment, the first detection unit 802 is further configured to perform: for each audio clip, detecting the audio clip according to the second number of target categories to obtain a second number of pieces of segment content evaluation information related to the second number of target categories; determining the segment content evaluation information corresponding to the largest probability value among the second number of target probabilities as the segment content evaluation information corresponding to each audio clip; or, taking the second number of target probabilities as weight coefficients, weighting the second number of pieces of segment content evaluation information related to the audio clip to obtain the segment content evaluation information corresponding to each audio clip; wherein the first number of audio clips corresponds to a first number of pieces of segment content evaluation information; and fusing the first number of pieces of segment content evaluation information according to the second preset weight vector to obtain the first evaluation information.

In an exemplary embodiment, the second detection unit 803 is further configured to perform: detecting the broken sound phenomenon corresponding to the target audio signal to obtain corresponding broken sound evaluation information; detecting the external recording device corresponding to the target audio signal to obtain corresponding external recording evaluation information; and fusing the broken sound evaluation information and the external recording evaluation information according to the third preset weight vector to obtain the second evaluation information.

In an exemplary embodiment, the second detection unit 803 is further configured to perform: dividing the target audio signal according to the second time length to obtain a third number of audio clips; for each audio clip, detecting the degree of broken sound corresponding to the audio clip to obtain the segment broken sound evaluation information corresponding to the audio clip; wherein the third number of audio clips corresponds to a third number of pieces of segment broken sound evaluation information; and fusing the third number of pieces of segment broken sound evaluation information according to the fourth preset weight vector to obtain the broken sound evaluation information.

Regarding the apparatus in the above-mentioned embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment of the method, and will not be described in detail here.

FIG. 9 is a block diagram of an electronic device 900 for sound quality detection according to an exemplary embodiment. For example, device 900 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, personal digital assistant, or the like.

Referring to FIG. 9, device 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.

The processing component 902 generally controls the overall operation of the device 900, such as operations associated with display, phone calls, data communications, camera operations, and recording operations. The processing component 902 may include one or more processors 920 to execute instructions to perform all or some of the steps of the methods described above. Additionally, processing component 902 may include one or more modules to facilitate interaction between processing component 902 and other components. For example, processing component 902 may include a multimedia module to facilitate interaction between multimedia component 908 and processing component 902.

Memory 904 is configured to store various types of data to support operation of device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and the like. Memory 904 may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.

Power component 906 provides power to the various components of device 900. Power component 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to device 900.

Multimedia component 908 includes a screen that provides an output interface between the device 900 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundaries of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe action. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. When the device 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front and rear cameras can be a fixed optical lens system or have focal length and optical zoom capability.

Audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a microphone (MIC) that is configured to receive external audio signals when device 900 is in an operating mode, such as call mode, recording mode, or voice recognition mode. The received audio signal may be further stored in memory 904 or transmitted via communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.

The I/O interface 912 provides an interface between the processing component 902 and a peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: home button, volume buttons, start button, and lock button.

Sensor component 914 includes one or more sensors for providing status assessments of various aspects of device 900. For example, sensor component 914 can detect the open/closed state of device 900 and the relative positioning of components, such as the display and keypad of device 900; sensor component 914 can also detect a change in the position of device 900 or of a component of device 900, the presence or absence of user contact with device 900, the orientation or acceleration/deceleration of device 900, and a change in the temperature of device 900. Sensor component 914 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. Sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

Communication component 916 is configured to facilitate wired or wireless communication between device 900 and other devices. Device 900 may access wireless networks based on communication standards, such as WiFi, carrier networks (e.g., 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 916 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 also includes a near field communication (NFC) module to facilitate short-range communication.

In an exemplary embodiment, device 900 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.

In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as memory 904 including instructions, executable by processor 920 of device 900 to perform the method described above. For example, the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.

In an exemplary embodiment, there is also provided a computer program product comprising a computer program stored in a readable storage medium; at least one processor of the device reads the computer program from the readable storage medium and executes it, causing the device to perform the above-described method.

All the embodiments of the present disclosure can be implemented independently or in combination with other embodiments, all of which fall within the scope of protection claimed by the present disclosure.

Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the present disclosure. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or techniques in the technical field not disclosed by the present disclosure. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

A method for detecting sound quality, comprising:

Acquiring a target audio signal;

Detecting the audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;

Detecting the audio collection signal corresponding to the target audio signal to obtain second evaluation information related to the audio collection signal;

The first evaluation information and the second evaluation information are fused according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal; wherein the target evaluation information is related to the sound quality of the target audio signal;

According to the target evaluation information, the sound quality category of the target audio signal is determined.
The sound quality detection method according to claim 1, wherein the detecting the audio content signal corresponding to the target audio signal to obtain the first evaluation information related to the audio content signal, comprising:

classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal;

According to the audio classification result, the target audio signal is detected to obtain first evaluation information related to the audio content signal.
The sound quality detection method according to claim 2, wherein the classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal, comprising:

The target audio signal is divided according to the first time length to obtain a first number of audio clips;

For each of the audio clips, classify the audio clip content corresponding to the audio clip to obtain a second number of target categories corresponding to the audio clip and a target probability that the audio clip is the target category;

A second number of target categories and a second number of target probabilities corresponding to each of the audio segments are determined as the audio classification result.
The sound quality detection method according to claim 3, wherein the detecting the target audio signal according to the audio classification result to obtain the first evaluation information related to the audio content signal, comprising:

For each of the audio clips, the audio clips are detected according to the second number of target categories to obtain a second number of clip content evaluation information related to the second number of target categories;

Determining the segment content evaluation information corresponding to the maximum probability value in the second number of target probabilities as the segment content evaluation information corresponding to each of the audio segments; or, using the second number of target probabilities as a weight coefficient, Weighting the second quantity of clip content evaluation information related to the audio clip to obtain clip content evaluation information corresponding to each of the audio clips; wherein the first quantity of audio clips corresponds to the first quantity of clip content evaluation information;

According to the second preset weight vector, the content evaluation information of the first number of segments is fused to obtain the first evaluation information.
The sound quality detection method according to claim 1, wherein the detecting the audio collection signal corresponding to the target audio signal to obtain the second evaluation information related to the audio collection signal, comprising:

Detect the sound-breaking phenomenon corresponding to the target audio signal, and obtain the corresponding sound-breaking evaluation information;

Detect the external recording device corresponding to the target audio signal to obtain the corresponding external recording evaluation information;

According to the third preset weight vector, the broken sound evaluation information and the external recording evaluation information are fused to obtain the second evaluation information.
The sound quality detection method according to claim 5, wherein the detecting the audio collection signal corresponding to the target audio signal to obtain the second evaluation information related to the audio collection signal, comprising:

The target audio signal is divided according to the second time length to obtain a third number of audio clips;

For each of the audio clips, the degree of broken sound corresponding to the audio clip is detected, and the segment broken sound evaluation information corresponding to the audio clip is obtained; wherein the third number of audio clips corresponds to a third number of pieces of segment broken sound evaluation information;

According to the fourth preset weight vector, the sound breaking evaluation information of the third number of segments is fused to obtain the sound breaking evaluation information.
A sound quality detection device, comprising:

an audio signal acquisition unit, configured to perform acquisition of the target audio signal;

a first detection unit, configured to perform detection of an audio content signal corresponding to the target audio signal, to obtain first evaluation information related to the audio content signal;

a second detection unit, configured to perform detection of an audio collection signal corresponding to the target audio signal, and obtain second evaluation information related to the audio collection signal;

a target evaluation information determining unit, configured to fuse the first evaluation information and the second evaluation information according to a first preset weight vector, to obtain target evaluation information corresponding to the target audio signal; wherein the target evaluation information is related to the sound quality of the target audio signal;

a sound quality detection unit, configured to determine the sound quality category of the target audio signal according to the target evaluation information.
The sound quality detection device according to claim 7, wherein the first detection unit is further configured to perform:

classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal;

According to the audio classification result, the target audio signal is detected to obtain first evaluation information related to the audio content signal.
The sound quality detection device according to claim 8, wherein the first detection unit is further configured to perform:

The target audio signal is divided according to the first time length to obtain a first number of audio clips;

For each of the audio clips, classify the audio clip content corresponding to the audio clip to obtain a second number of target categories corresponding to the audio clip and a target probability that the audio clip is the target category;

A second number of target categories and a second number of target probabilities corresponding to each of the audio segments are determined as the audio classification result.
10. The sound quality detection device according to claim 9, wherein the first detection unit is further configured to perform:

for each of the audio clips, detecting the audio clip according to the second number of target categories to obtain a second number of pieces of clip content evaluation information related to the second number of target categories;

determining the piece of clip content evaluation information corresponding to the maximum probability value among the second number of target probabilities as the clip content evaluation information corresponding to the audio clip; or, using the second number of target probabilities as weight coefficients, weighting the second number of pieces of clip content evaluation information related to the audio clip to obtain the clip content evaluation information corresponding to the audio clip; wherein the first number of audio clips corresponds to a first number of pieces of clip content evaluation information; and

fusing the first number of pieces of clip content evaluation information according to a second preset weight vector to obtain the first evaluation information.
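
The two alternatives for deriving a clip score, followed by fusion across clips, can be sketched as below. This is a minimal illustration, assuming scalar scores and a weighted-sum fusion. Normalizing the weighted variant by the probability sum is an assumption made here, since the top categories' probabilities need not sum to 1. All names and values are hypothetical.

```python
def clip_score_by_max_probability(probabilities, scores):
    """Return the score of the category with the highest probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return scores[best]


def clip_score_by_weighting(probabilities, scores):
    """Probability-weighted average of the per-category scores.

    Dividing by the probability sum is an assumption that keeps the
    result on the same scale as the individual scores.
    """
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(p * s for p, s in zip(probabilities, scores)) / total


def fuse(scores, weights):
    """Weighted sum of per-clip scores using a preset weight vector."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights))


# Two clips, each with two candidate categories (hypothetical values).
clip_probabilities = [[0.6, 0.3], [0.2, 0.7]]
clip_category_scores = [[0.9, 0.3], [0.4, 0.8]]

max_scores = [
    clip_score_by_max_probability(p, s)
    for p, s in zip(clip_probabilities, clip_category_scores)
]
weighted_scores = [
    clip_score_by_weighting(p, s)
    for p, s in zip(clip_probabilities, clip_category_scores)
]
first_evaluation = fuse(max_scores, [0.5, 0.5])
```

The maximum-probability variant ignores the less likely categories, while the weighted variant lets every candidate category contribute in proportion to its probability.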
11. The sound quality detection device according to claim 7, wherein the second detection unit is further configured to perform:

detecting a sound-breaking phenomenon corresponding to the target audio signal to obtain corresponding sound-breaking evaluation information;

detecting an external recording device corresponding to the target audio signal to obtain corresponding external recording evaluation information; and

fusing the sound-breaking evaluation information and the external recording evaluation information according to a third preset weight vector to obtain the second evaluation information.
12. The sound quality detection device according to claim 11, wherein the second detection unit is further configured to perform:

dividing the target audio signal according to a second time length to obtain a third number of audio clips;

for each of the audio clips, detecting the degree of sound breaking corresponding to the audio clip to obtain clip sound-breaking evaluation information corresponding to the audio clip, wherein the third number of audio clips corresponds to a third number of pieces of clip sound-breaking evaluation information; and

fusing the third number of pieces of clip sound-breaking evaluation information according to a fourth preset weight vector to obtain the sound-breaking evaluation information.
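
The per-clip detection and fusion can be sketched as below. The claims do not specify how the degree of sound breaking is measured, so this sketch substitutes a simple heuristic: the fraction of samples at or near full scale in a signal normalized to [-1, 1]. That heuristic, the tolerance, the function names, and the uniform weights are all assumptions for illustration.

```python
def clipped_ratio(clip, full_scale=1.0, tolerance=1e-3):
    """Fraction of samples whose magnitude reaches (nearly) full scale."""
    if not clip:
        return 0.0
    limit = full_scale - tolerance
    clipped = sum(1 for x in clip if abs(x) >= limit)
    return clipped / len(clip)


def clip_breaking_score(clip):
    """Score in [0, 1]; 1.0 means no saturated samples were found."""
    return 1.0 - clipped_ratio(clip)


def fuse(scores, weights):
    """Weighted sum of per-clip scores using a preset weight vector."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights))


# Two hypothetical clips: the second has two saturated samples out of four.
clips = [[0.1, -0.2, 0.3, 0.0], [1.0, -1.0, 0.5, 0.2]]
clip_scores = [clip_breaking_score(c) for c in clips]
breaking_evaluation = fuse(clip_scores, [0.5, 0.5])
```

Scoring each clip separately keeps a short burst of distortion from being averaged away over a long recording, and the weight vector controls how much each clip contributes to the overall score.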
13. An electronic device, comprising:

a processor; and

a memory for storing instructions executable by the processor;

wherein the processor is configured to execute the instructions to implement the following steps:

acquiring a target audio signal;

detecting an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;

detecting an audio collection signal corresponding to the target audio signal to obtain second evaluation information related to the audio collection signal;

fusing the first evaluation information and the second evaluation information according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal; and

determining the sound quality category of the target audio signal according to the target evaluation information.
14. The electronic device of claim 13, wherein the processor is configured to execute the instructions to implement the following steps:

classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal;

detecting the target audio signal according to the audio classification result to obtain the first evaluation information related to the audio content signal.
15. The electronic device of claim 14, wherein the processor is configured to execute the instructions to implement the following steps:

dividing the target audio signal according to a first time length to obtain a first number of audio clips;

for each of the audio clips, classifying the audio clip content corresponding to the audio clip to obtain a second number of target categories corresponding to the audio clip and, for each target category, a target probability that the audio clip belongs to that target category; and

determining the second number of target categories and the second number of target probabilities corresponding to each of the audio clips as the audio classification result.
16. The electronic device of claim 15, wherein the processor is configured to execute the instructions to implement the following steps:

for each of the audio clips, detecting the audio clip according to the second number of target categories to obtain a second number of pieces of clip content evaluation information related to the second number of target categories;

determining the piece of clip content evaluation information corresponding to the maximum probability value among the second number of target probabilities as the clip content evaluation information corresponding to the audio clip; or, using the second number of target probabilities as weight coefficients, weighting the second number of pieces of clip content evaluation information related to the audio clip to obtain the clip content evaluation information corresponding to the audio clip; wherein the first number of audio clips corresponds to a first number of pieces of clip content evaluation information; and

fusing the first number of pieces of clip content evaluation information according to a second preset weight vector to obtain the first evaluation information.
17. The electronic device of claim 13, wherein the processor is configured to execute the instructions to implement the following steps:

detecting a sound-breaking phenomenon corresponding to the target audio signal to obtain corresponding sound-breaking evaluation information;

detecting an external recording device corresponding to the target audio signal to obtain corresponding external recording evaluation information; and

fusing the sound-breaking evaluation information and the external recording evaluation information according to a third preset weight vector to obtain the second evaluation information.
18. The electronic device of claim 17, wherein the processor is configured to execute the instructions to implement the following steps:

dividing the target audio signal according to a second time length to obtain a third number of audio clips;

for each of the audio clips, detecting the degree of sound breaking corresponding to the audio clip to obtain clip sound-breaking evaluation information corresponding to the audio clip, wherein the third number of audio clips corresponds to a third number of pieces of clip sound-breaking evaluation information; and

fusing the third number of pieces of clip sound-breaking evaluation information according to a fourth preset weight vector to obtain the sound-breaking evaluation information.
19. A non-transitory computer-readable storage medium storing computer instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the following steps:

acquiring a target audio signal;

detecting an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;

detecting an audio collection signal corresponding to the target audio signal to obtain second evaluation information related to the audio collection signal;

fusing the first evaluation information and the second evaluation information according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal; and

determining the sound quality category of the target audio signal according to the target evaluation information.