WO2022068304A1 - Sound quality detection method and device - Google Patents

Sound quality detection method and device

Publication number
WO2022068304A1
WO2022068304A1 PCT/CN2021/105044 CN2021105044W WO2022068304A1 WO 2022068304 A1 WO2022068304 A1 WO 2022068304A1 CN 2021105044 W CN2021105044 W CN 2021105044W WO 2022068304 A1 WO2022068304 A1 WO 2022068304A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
evaluation information
target
signal
audio signal
Prior art date
Application number
PCT/CN2021/105044
Other languages
French (fr)
Chinese (zh)
Inventor
郑羲光
陈翔宇
张晨
Original Assignee
北京达佳互联信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2022068304A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Definitions

  • the present disclosure relates to the technical field of audio processing, and in particular, to a sound quality detection method, device, electronic device and storage medium.
  • the traditional sound quality detection method is generally a full-reference sound quality detection method.
  • the original lossless audio signal and the various lossy audio signals with reduced sound quality corresponding to it are obtained.
  • the gap between each lossy audio signal and the original lossless audio signal is determined, the sound quality evaluation information of the lossy audio signal is derived from that gap, and the sound quality of the lossy audio signal is determined from the evaluation information.
  • the present disclosure provides a sound quality detection method, device, electronic device and storage medium.
  • a sound quality detection method including:
  • the first evaluation information and the second evaluation information are fused according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal; wherein the target evaluation information is related to the sound quality of the target audio signal;
  • the sound quality category of the target audio signal is determined.
  • detecting the audio content signal corresponding to the target audio signal to obtain the first evaluation information related to the audio content signal includes:
  • the target audio signal is detected to obtain first evaluation information related to the audio content signal.
  • the classification of the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal includes:
  • the target audio signal is divided according to the first time length to obtain a first number of audio clips
  • a second number of target categories and a second number of target probabilities corresponding to each of the audio segments are determined as the audio classification result.
  • the target audio signal is detected to obtain the first evaluation information related to the audio content signal, including:
  • the audio clips are detected according to the second number of target categories, to obtain a second number of clip content evaluation information related to the second number of target categories;
  • the content evaluation information of the first number of segments is fused to obtain the first evaluation information.
  • detecting the audio collection signal corresponding to the target audio signal to obtain the second evaluation information related to the audio collection signal includes:
  • the broken sound evaluation information and the external recording evaluation information are fused to obtain the second evaluation information.
  • detecting the audio collection signal corresponding to the target audio signal to obtain the second evaluation information related to the audio collection signal includes:
  • the target audio signal is divided according to the second time length to obtain a third number of audio clips;
  • the degree of broken sound corresponding to each audio clip is detected, and the segment broken sound evaluation information corresponding to the audio clip is obtained; wherein the third number of audio clips corresponds to the third number of pieces of segment broken sound evaluation information;
  • the third number of pieces of segment broken sound evaluation information are fused to obtain the broken sound evaluation information.
  • a sound quality detection device comprising:
  • an audio signal acquisition unit configured to perform acquisition of the target audio signal
  • a first detection unit configured to perform detection of an audio content signal corresponding to the target audio signal, to obtain first evaluation information related to the audio content signal
  • a second detection unit configured to perform detection of an audio collection signal corresponding to the target audio signal, and obtain second evaluation information related to the audio collection signal
  • a target evaluation information determination unit configured to fuse the first evaluation information and the second evaluation information according to a first preset weight vector, to obtain target evaluation information corresponding to the target audio signal; wherein the target evaluation information is related to the sound quality of the target audio signal;
  • a sound quality detection unit configured to determine the sound quality category of the target audio signal according to the target evaluation information.
  • the first detection unit is further configured to perform:
  • the target audio signal is detected to obtain first evaluation information related to the audio content signal.
  • the first detection unit is further configured to perform:
  • the target audio signal is divided according to the first time length to obtain a first number of audio clips
  • a second number of target categories and a second number of target probabilities corresponding to each of the audio segments are determined as the audio classification result.
  • the first detection unit is further configured to perform:
  • the audio clips are detected according to the second number of target categories, to obtain a second number of clip content evaluation information related to the second number of target categories;
  • the content evaluation information of the first number of segments is fused to obtain the first evaluation information.
  • the second detection unit is further configured to perform:
  • the broken sound evaluation information and the external recording evaluation information are fused to obtain the second evaluation information.
  • the second detection unit is further configured to perform:
  • the target audio signal is divided according to the second time length to obtain a third number of audio clips
  • the degree of broken sound corresponding to each audio clip is detected, and the segment broken sound evaluation information corresponding to the audio clip is obtained; wherein the third number of audio clips corresponds to the third number of pieces of segment broken sound evaluation information;
  • the third number of pieces of segment broken sound evaluation information are fused to obtain the broken sound evaluation information.
  • an electronic device comprising:
  • a memory for storing the processor-executable instructions
  • the processor is configured to execute the instructions to implement the sound quality detection method described in any one of the embodiments of the first aspect.
  • a storage medium, wherein when instructions in the storage medium are executed by a processor of an electronic device, the electronic device can execute the sound quality detection method described in any one of the embodiments of the first aspect above.
  • a computer program product comprising a computer program, the computer program being stored in a readable storage medium, wherein at least one processor of a device reads the computer program from the readable storage medium and executes it, so that the device executes the sound quality detection method described in any one of the embodiments of the first aspect.
  • the embodiments of the present disclosure detect the audio content signal corresponding to the target audio signal, evaluating the quality of the target audio signal at the audio content signal level, to obtain the first evaluation information related to the audio content signal; they also detect the audio collection signal corresponding to the target audio signal, evaluating the quality of the target audio signal at the audio collection signal level, to obtain the second evaluation information related to the audio collection signal. The first evaluation information and the second evaluation information thus evaluate the target audio signal from different dimensions. They are then fused according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal, and the sound quality category of the target audio signal is determined according to the target evaluation information.
  • the purpose of detecting the quality of the target audio signal can be achieved without acquiring the original lossless audio signal corresponding to the target audio signal.
  • the target evaluation information used for evaluating the sound quality of the target audio signal is obtained from the multi-dimensional characteristic attributes of the target audio signal itself, so the target audio signal can be detected from all directions, and the purpose of accurately defining the sound quality of the corresponding audio signal is finally achieved.
  • Fig. 1 is a flow chart of a method for detecting sound quality according to an exemplary embodiment.
  • FIG. 2 is a flow chart of a possible implementation manner of step S200 according to an exemplary embodiment.
  • FIG. 3 is a flowchart showing an implementation manner of step S210 according to an exemplary embodiment.
  • FIG. 4 is a flowchart showing an implementation manner of step S220 according to an exemplary embodiment.
  • Fig. 5 is a flowchart showing an implementation manner of step S300 according to an exemplary embodiment.
  • FIG. 6 is a flowchart showing an implementation manner of step S310 according to an exemplary embodiment.
  • FIG. 7 is a structural diagram of a sound quality detection system according to a specific exemplary embodiment.
  • Fig. 8 is a block diagram of an apparatus for detecting sound quality according to an exemplary embodiment.
  • Fig. 9 is a block diagram of an electronic device for sound quality detection according to an exemplary embodiment.
  • as shown in FIG. 1, the method for detecting sound quality according to an exemplary embodiment includes the following steps:
  • in step S100, the target audio signal is acquired.
  • in step S200, the audio content signal corresponding to the target audio signal is detected to obtain first evaluation information related to the audio content signal.
  • in step S300, the audio collection signal corresponding to the target audio signal is detected to obtain second evaluation information related to the audio collection signal.
  • in step S400, the first evaluation information and the second evaluation information are fused according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal; wherein the target evaluation information is related to the sound quality of the target audio signal.
  • in step S500, the sound quality category of the target audio signal is determined according to the target evaluation information.
  • the target audio signal refers to the audio signal whose sound quality is to be evaluated.
  • the audio content signal refers to a specific signal related to the target audio signal, for example, the audio content signal may be music, speech or other (noise and other audio).
  • the audio acquisition signal refers to the signal related to the acquisition process of the target audio signal, which aims to evaluate the audio quality unrelated to the audio content signal, mainly refers to the part of the signal that has damage to the sound quality during the audio signal acquisition process.
  • the audio acquisition signal mainly includes broken sound evaluation and external recording evaluation.
  • the first preset weight vector is a vector formed by the combining coefficients used when the first evaluation information and the second evaluation information are combined; it may be set according to the specific influence of the audio content signal and the audio acquisition signal on the sound quality.
  • the quality of the target audio signal is evaluated at the specific audio content level, and the first evaluation information related to the audio content signal is obtained.
  • the quality of the target audio signal is evaluated at the audio acquisition level, and the second evaluation information related to the audio acquisition signal is obtained.
  • the first evaluation information and the second evaluation information are fused according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal, which is related to the sound quality of the target audio signal.
  • for example, when the score corresponding to the first evaluation information is 90, the score corresponding to the second evaluation information is 85, and the first preset weight vector is (0.6, 0.4), the score corresponding to the target evaluation information is 90*0.6+85*0.4 = 88.
  • the target evaluation information can accurately evaluate the sound quality of the corresponding target audio signal, so that the sound quality category of the target audio signal can be determined according to the target evaluation information. For example, when the full score is 100 points and the target evaluation information is 95, the target audio signal can be determined as a high-quality audio signal; when the target evaluation information is 70, the target audio signal can be determined as a medium-quality audio signal; and when the target evaluation information is 55, the target audio signal can be determined as a low-quality audio signal. It should be noted that the above scores and classifications are only exemplary descriptions, and in the specific implementation process, sound quality categories may be further divided according to actual needs.
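The fusion in step S400 and the category decision in step S500 can be sketched as a weighted sum followed by a threshold lookup. This is a minimal illustration only: the function names and the two cut-off values are assumptions made here, since the source gives example scores (95, 70, 55) but no category boundaries.

```python
from typing import Sequence


def fuse_scores(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of evaluation scores using a preset weight vector."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights))


# Hypothetical cut-offs: the source does not state where one category ends
# and the next begins, only that 95 is high, 70 is medium and 55 is low.
HIGH_THRESHOLD = 85.0
MEDIUM_THRESHOLD = 60.0


def sound_quality_category(target_score: float) -> str:
    """Map a fused score on a 100-point scale to a coarse quality category."""
    if target_score >= HIGH_THRESHOLD:
        return "high"
    if target_score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


# Values from the example: first evaluation 90, second evaluation 85,
# first preset weight vector (0.6, 0.4).
target_score = fuse_scores([90.0, 85.0], [0.6, 0.4])
category = sound_quality_category(target_score)
print(target_score, category)
```

In practice the thresholds, and the number of categories, would be chosen according to actual needs, as the text notes.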
  • the audio content signal corresponding to the target audio signal is detected, and the quality of the target audio signal is evaluated at the audio content signal level, so as to obtain the first evaluation information related to the audio content signal.
  • the audio acquisition signal corresponding to the target audio signal is detected, and the quality of the target audio signal is evaluated at the audio acquisition signal level, so as to obtain the second evaluation information related to the audio acquisition signal; the first evaluation information and the second evaluation information thus evaluate the target audio signal from different dimensions.
  • the first evaluation information and the second evaluation information are fused according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal, and the sound quality category of the target audio signal is determined according to the target evaluation information. Therefore, the purpose of detecting the quality of the target audio signal can be achieved without acquiring the original lossless audio signal corresponding to the target audio signal.
  • the target evaluation information for evaluating the sound quality of the target audio signal is obtained from the multi-dimensional characteristic attributes of the target audio signal itself, so the target audio signal can be detected from all directions, and the purpose of accurately defining the sound quality of the corresponding audio signal is finally achieved.
  • FIG. 2 is a flowchart of a possible implementation of step S200, which includes the following steps:
  • in step S210, the audio content signal corresponding to the target audio signal is classified to obtain an audio classification result corresponding to the audio content signal.
  • in step S220, the target audio signal is detected according to the audio classification result to obtain first evaluation information related to the audio content signal.
  • the audio category corresponding to the audio content signal may be music, speech or other (noise and other audio).
  • the audio content signal corresponding to the target audio signal is classified to obtain an audio classification result corresponding to the audio content signal, where the audio classification result includes the categories corresponding to the target audio signal and the probability that the target audio signal belongs to each category.
  • the target audio signal is detected to obtain its evaluation information in each category, and the evaluation information of the categories is combined according to the probability corresponding to each category to obtain the first evaluation information related to the audio content signal.
  • for example, a pre-trained audio classification network model capable of detecting the category of an audio signal can be obtained and used to obtain the probability that the target audio signal is music, speech or other.
  • for example, the probability that the target audio signal is music is 0.7, the probability that it is speech is 0.2, and the probability that it is another audio signal is 0.1.
  • pre-trained evaluation networks capable of evaluating audio signals of the corresponding categories are obtained, for example, a music evaluation network model that can be used for music evaluation, a speech evaluation network model that can be used for speech evaluation, and an evaluation network model for other audio; the target audio signal is input into the corresponding evaluation network models to obtain the evaluation information of the target audio signal at the music level, the speech level and the other level.
  • for example, the score corresponding to the evaluation information of the target audio signal at the music level is 90, the score corresponding to the evaluation information at the speech level is 80, and the score corresponding to the evaluation information at the other level is 85.
  • the evaluation information of the categories is combined according to the probability corresponding to each category to obtain the score corresponding to the first evaluation information: 0.7*90+0.2*80+0.1*85 = 87.5.
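The probability-weighted combination in this example can be written out as a short sketch. The category labels and the function name below are illustrative assumptions; in the described method the probabilities come from a pre-trained classification network and the per-category scores from pre-trained evaluation networks, neither of which is modelled here.

```python
from typing import Mapping


def content_score(probabilities: Mapping[str, float],
                  scores: Mapping[str, float]) -> float:
    """Combine per-category scores, weighting each by its category probability."""
    return sum(probabilities[c] * scores[c] for c in probabilities)


# Example values from the text: classifier probabilities and per-category scores.
probabilities = {"music": 0.7, "speech": 0.2, "other": 0.1}
scores = {"music": 90.0, "speech": 80.0, "other": 85.0}

first_evaluation = content_score(probabilities, scores)
print(first_evaluation)
```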
  • the audio content signal corresponding to the target audio signal is classified to obtain an audio classification result corresponding to the audio content signal, and the target audio signal is detected according to the audio classification result.
  • the target audio signal is thus detected in a targeted manner for each specific category, so that the obtained first evaluation information reflects the quality of the audio signal more comprehensively and pertinently, and provides a basis for the subsequent sound quality evaluation of the audio signal.
  • a flowchart of a possible implementation of step S210 is shown in FIG. 3, including the following steps:
  • in step S211, the target audio signal is divided according to the first time length to obtain a first number of audio segments.
  • in step S212, for each audio segment, the content of the audio segment is classified to obtain a second number of target categories corresponding to the audio segment and the target probability that the audio segment belongs to each target category.
  • in step S213, the second number of target categories and the second number of target probabilities corresponding to each audio segment are determined as the audio classification result.
  • the first time length refers to a reference metric value for dividing the audio signal. In some embodiments, it can be 1 second, 10 seconds, 20 seconds, or 1 minute. Time lengths such as 1 minute are only exemplary, and do not specifically limit the first time length.
  • the target audio signal is segmented according to the first time length to obtain a first number of audio clips.
  • for example, the length of the target audio signal is 3 minutes, and the target audio signal is segmented with 10 seconds as the first time length, so that a first number of 18 audio clips, each 10 seconds long, is obtained.
  • the content of each audio clip is classified; when the probability that each 10-second audio clip is music is 0.7, the probability that it is speech is 0.2, and the probability that it is another audio signal is 0.1, the second number is 3, and the target categories are music, speech, and other.
  • the second number of target categories and the second number of target probabilities corresponding to each of the 18 10-second audio clips are determined as the audio classification result.
  • the target audio signal is divided according to the first time length to obtain a first number of audio clips, and the content of each audio clip is classified to obtain the second number of target categories corresponding to the audio clip and the target probability that the audio clip belongs to each target category; the second number of target categories and the second number of target probabilities corresponding to each audio clip are determined as the audio classification result.
  • the target audio signal is thereby divided in a smaller time dimension, so that each audio segment can subsequently be detected in a smaller time dimension, and the purpose of accurately defining the sound quality of the corresponding audio signal is finally achieved.
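The division in step S211 can be sketched as fixed-length slicing of the sample sequence. The function name, the toy sample rate, and the handling of a shorter final clip are assumptions for illustration; the source does not say what happens when the signal length is not a multiple of the first time length.

```python
from typing import List, Sequence


def split_into_clips(samples: Sequence[float], sample_rate: int,
                     clip_seconds: float) -> List[Sequence[float]]:
    """Split a signal into consecutive clips of clip_seconds each.

    Assumption: a trailing remainder shorter than clip_seconds is kept
    as its own (shorter) clip.
    """
    clip_len = int(sample_rate * clip_seconds)
    if clip_len <= 0:
        raise ValueError("clip length must be at least one sample")
    return [samples[i:i + clip_len] for i in range(0, len(samples), clip_len)]


# Toy signal: 3 minutes of silence at a deliberately low sample rate of 100 Hz.
sample_rate = 100
signal = [0.0] * (180 * sample_rate)

clips = split_into_clips(signal, sample_rate, clip_seconds=10)
print(len(clips))  # number of 10-second clips in a 3-minute signal
```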
  • a flowchart of a possible implementation of step S220 is shown in FIG. 4, including the following steps:
  • in step S221, for each audio segment, the audio segment is detected according to the second number of target categories, and a second number of pieces of segment content evaluation information related to the second number of target categories is obtained.
  • in step S222, the segment content evaluation information corresponding to the maximum probability value among the second number of target probabilities is determined as the segment content evaluation information corresponding to each audio segment; or, using the second number of target probabilities as weight coefficients, the second number of pieces of segment content evaluation information related to the audio segment are weighted to obtain the segment content evaluation information corresponding to each audio segment; wherein the first number of audio segments corresponds to the first number of pieces of segment content evaluation information.
  • in step S223, according to the second preset weight vector, the first number of pieces of segment content evaluation information are fused to obtain the first evaluation information.
  • the first quantity of audio clips corresponds to the first quantity of clip content evaluation information.
  • the second preset weight vector is a vector formed by combining coefficients of the sound quality detection results of multiple audio clips.
  • the second preset weight vector may be set in a weighted average manner, or may be set according to the specific target audio signal; for example, a relatively small weight coefficient can be set for the audio clips at the beginning and the end of the target audio signal and a relatively large weight coefficient for the middle audio clips, so as to reduce the influence of excessive noise at the beginning of the audio recording.
  • the audio segments are detected to obtain a second number of pieces of segment content evaluation information related to the second number of target categories, and the segment content evaluation information corresponding to the largest probability value among the second number of target probabilities is determined as the segment content evaluation information corresponding to each audio segment.
  • alternatively, using the second number of target probabilities as weight coefficients, the second number of pieces of segment content evaluation information related to the audio segment are weighted to obtain the segment content evaluation information corresponding to each audio segment. For example, the probability of an audio clip being music is 0.7, the probability of being speech is 0.2, and the probability of being another audio signal is 0.1; the score corresponding to the evaluation information at the music level is 90, the score at the speech level is 80, and the score corresponding to the evaluation information at the other level is 85.
  • when the segment content evaluation information corresponding to the maximum probability value among the second number of target probabilities is determined as the segment content evaluation information corresponding to each audio segment, the segment content evaluation information is the score of 90 corresponding to the maximum probability of 0.7.
  • when the evaluation information of the categories is combined according to the probability corresponding to each category, the score corresponding to the segment content evaluation information is 0.7*90+0.2*80+0.1*85 = 87.5.
  • the first quantity of audio clips corresponds to the first quantity of clip content evaluation information, and finally the first quantity of clip content evaluation information is weighted and summed according to the second preset weight vector to obtain the first evaluation information.
  • for each audio clip, the audio clip is detected according to the second number of target categories, and a second number of pieces of clip content evaluation information related to the second number of target categories is obtained;
  • the segment content evaluation information corresponding to the maximum probability value among the second number of target probabilities is determined as the segment content evaluation information corresponding to each audio segment; or, with the second number of target probabilities as weight coefficients, the second number of pieces of segment content evaluation information are weighted to obtain the segment content evaluation information corresponding to each audio segment;
  • the first number of pieces of segment content evaluation information are then fused according to the second preset weight vector to obtain the first evaluation information. Therefore, the target audio signal is detected in a smaller time dimension in more detail, and the purpose of accurately defining the sound quality of the corresponding audio signal is finally achieved.
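Steps S222 and S223 can be sketched as follows: each clip gets one score, either from its most probable category or from a probability-weighted sum, and the clip scores are then fused with a second preset weight vector. All function names are hypothetical, and `tapered_weights` is just one assumed way to give the first and last clips a smaller weight; the source does not prescribe specific weight values or normalization.

```python
from typing import List, Mapping, Sequence


def clip_score_max_probability(probs: Mapping[str, float],
                               scores: Mapping[str, float]) -> float:
    """Option 1 of step S222: take the score of the most probable category."""
    best_category = max(probs, key=lambda c: probs[c])
    return scores[best_category]


def clip_score_weighted(probs: Mapping[str, float],
                        scores: Mapping[str, float]) -> float:
    """Option 2 of step S222: weight each category score by its probability."""
    return sum(probs[c] * scores[c] for c in probs)


def tapered_weights(n: int, edge_weight: float = 0.5) -> List[float]:
    """Assumed weight vector: smaller weight on first and last clip, sums to 1."""
    raw = [1.0] * n
    if n >= 3:
        raw[0] = raw[-1] = edge_weight
    total = sum(raw)
    return [w / total for w in raw]


def fuse_clips(clip_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Step S223: weighted sum of clip scores with the second preset weight vector."""
    if len(clip_scores) != len(weights):
        raise ValueError("one weight is required per clip")
    return sum(s * w for s, w in zip(clip_scores, weights))


probs = {"music": 0.7, "speech": 0.2, "other": 0.1}
scores = {"music": 90.0, "speech": 80.0, "other": 85.0}
print(clip_score_max_probability(probs, scores))
print(clip_score_weighted(probs, scores))

# Made-up clip scores with a noisy first clip, fused two ways.
clip_scores = [50.0, 90.0, 90.0, 90.0]
uniform = fuse_clips(clip_scores, [0.25] * 4)
tapered = fuse_clips(clip_scores, tapered_weights(4))
print(uniform, tapered)
```

With the tapered weights the noisy first clip pulls the fused score down less than it does under uniform weights, which is the effect the text describes for recordings with excessive noise at the start.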
  • FIG. 5 is a flowchart of a possible implementation of step S300, which includes the following steps:
  • in step S310, the broken sound phenomenon corresponding to the target audio signal is detected to obtain the corresponding broken sound evaluation information.
  • in step S320, the external recording device corresponding to the target audio signal is detected to obtain the corresponding external recording evaluation information.
  • in step S330, according to the third preset weight vector, the broken sound evaluation information and the external recording evaluation information are fused to obtain the second evaluation information.
  • the phenomenon of broken sound refers to the phenomenon that when the sound signal level exceeds the upper limit of the load of the electronic components, a part of the sound signal is cut off, resulting in the phenomenon of noise in the emitted sound.
  • the external recording device refers to the device that transmits the sound signal to the recording system through the microphone or the pickup device that comes with the tape recorder, and then records the sound signal in the storage medium. This recording method can conveniently record a variety of sound signals such as human voice, but the obtained audio signal is susceptible to external interference and the sound signal is easily distorted.
  • the third preset weight vector is a vector formed by the combining coefficients used when the broken sound evaluation information and the external recording evaluation information are combined. In some embodiments, the third preset weight vector may be set in a weighted average manner, or may be set according to the specific influence of the broken sound evaluation information and the external recording evaluation information on the sound quality of the video.
  • the sound breaking phenomenon corresponding to the target audio signal and the external recording device are respectively detected, and the sound breaking evaluation information corresponding to the sound breaking phenomenon detection and the external recording evaluation information corresponding to the external recording device detection are obtained.
  • the broken sound evaluation information and the external recording evaluation information are fused to obtain the second evaluation information. For example, pre-trained evaluation networks that can detect and evaluate the broken sound phenomenon and the external recording device are obtained, for example, a broken sound evaluation network model that can be used for broken sound evaluation and an external recording network model that can be used for external recording evaluation; the target audio signal is input into the corresponding evaluation network models to obtain the broken sound evaluation information corresponding to the broken sound phenomenon detection and the external recording evaluation information corresponding to the external recording device detection.
  • for example, when the score corresponding to the evaluation information of the target audio signal at the broken sound level is 90, the score corresponding to the evaluation information at the external recording level is 80, and the third preset weight vector is (0.6, 0.4), the score corresponding to the second evaluation information is 90*0.6+80*0.4 = 86.
  • the broken sound phenomenon corresponding to the target audio signal is detected to obtain the corresponding broken sound evaluation information, and the external recording device corresponding to the target audio signal is detected to obtain the corresponding external recording evaluation information.
  • the target audio signal is thus detected in a targeted manner according to the different sound quality factors associated with the specific acquisition device corresponding to the target audio signal.
  • the broken sound evaluation information and the external recording evaluation information are fused, so that the obtained second evaluation information reflects more comprehensively and pertinently the sound quality effect produced by the audio signal acquisition device, and provides a basis for the subsequent sound quality evaluation of the audio signal.
  • FIG. 6 is a flowchart of a possible implementation of step S310, which includes the following steps:
  • in step S311, the target audio signal is divided according to the second time length to obtain a third number of audio segments.
  • in step S312, for each audio clip, the degree of broken sound corresponding to the audio clip is detected, and the segment broken sound evaluation information corresponding to the audio clip is obtained; wherein the third number of audio clips corresponds to the third number of pieces of segment broken sound evaluation information.
  • in step S313, according to the fourth preset weight vector, the third number of pieces of segment broken sound evaluation information are fused to obtain the broken sound evaluation information.
  • the second time length is the reference duration used to divide the audio signal. In some embodiments, it may be 1 second, 10 seconds, 20 seconds, or 1 minute. These values are only examples and do not limit the second time length.
  • the fourth preset weight vector is the vector of fusion coefficients used when the third number of pieces of clip broken sound evaluation information are fused. The fourth preset weight vector may be set for the target audio signal, for example, to account for the effect of excessive noise recorded at the beginning.
  • the target audio signal is segmented according to the second time length to obtain a third number of audio clips.
  • For example, if the length of the target audio signal is 3 minutes and the target audio signal is segmented with 10 seconds as the second time length, 18 audio clips (the third number), each 10 seconds long, are obtained.
  • For each audio clip, the degree of broken sound corresponding to the audio clip is detected to obtain the clip broken sound evaluation information corresponding to the audio clip, so that the third number of audio clips corresponds to a third number of pieces of clip broken sound evaluation information.
  • Each piece of clip broken sound evaluation information corresponds to one weight coefficient, so the third number of pieces corresponds to a fourth preset weight vector with a third number of dimensions. According to the fourth preset weight vector, the third number of pieces of clip broken sound evaluation information are fused to obtain the broken sound evaluation information.
  • the target audio signal is divided according to the second time length to obtain a third number of audio clips. For each audio clip, the degree of broken sound corresponding to the audio clip is detected to obtain the clip broken sound evaluation information corresponding to the audio clip, wherein the third number of audio clips corresponds to a third number of pieces of clip broken sound evaluation information. According to the fourth preset weight vector, the third number of pieces of clip broken sound evaluation information are fused to obtain the broken sound evaluation information.
  • Because the target audio signal is divided on a smaller time scale, each audio clip can subsequently be examined on that smaller time scale, which ultimately makes it possible to accurately determine the sound quality of the corresponding audio signal.
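Steps S311 to S313 can be sketched as follows. This is a minimal sketch under stated assumptions: the signal is a list of samples in the range [-1, 1], `placeholder_clip_score` is a hypothetical stand-in for the pre-trained broken sound network (it is not the disclosed model), and uniform weights are used when no fourth preset weight vector is supplied.

```python
from typing import List, Optional, Sequence


def split_into_clips(samples: Sequence[float], sample_rate: int,
                     clip_seconds: float) -> List[Sequence[float]]:
    """S311: divide the signal into consecutive clips of clip_seconds each."""
    clip_len = int(sample_rate * clip_seconds)
    if clip_len <= 0:
        raise ValueError("clip length must be at least one sample")
    # The last clip is shorter when the signal length is not a multiple of clip_len.
    return [samples[i:i + clip_len] for i in range(0, len(samples), clip_len)]


def placeholder_clip_score(clip: Sequence[float]) -> float:
    """S312 stand-in (assumption): 0 if any sample reaches full scale, else 100."""
    return 0.0 if any(abs(x) >= 1.0 for x in clip) else 100.0


def fuse_clip_scores(clip_scores: Sequence[float],
                     weights: Optional[Sequence[float]] = None) -> float:
    """S313: fuse per-clip scores with one weight coefficient per clip."""
    if not clip_scores:
        raise ValueError("at least one clip score is required")
    if weights is None:
        weights = [1.0 / len(clip_scores)] * len(clip_scores)
    if len(weights) != len(clip_scores):
        raise ValueError("one weight per clip is required")
    return sum(s * w for s, w in zip(clip_scores, weights))


sample_rate = 100                       # deliberately low to keep the example small
signal = [0.0] * (sample_rate * 180)    # a 3-minute signal
signal[5] = 1.0                         # one full-scale sample in the first clip
clips = split_into_clips(signal, sample_rate, 10)
clip_scores = [placeholder_clip_score(c) for c in clips]
broken_sound_score = fuse_clip_scores(clip_scores)
```

A 3-minute signal divided into 10-second clips yields 18 clips, matching the example in the text. A non-uniform weight vector could, for instance, give less weight to the first clip.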
  • FIG. 7 is a structural diagram of a sound quality detection system according to a specific exemplary embodiment. As shown in FIG. 7, the system specifically includes:
  • the sound quality detection system divides the evaluation of the quality of the target audio signal into two parts: the first part is the sound quality evaluation related to the content; the second part is the sound quality evaluation related to the acquisition device.
  • the first part mainly judges the different contents of the audio signal, and then conducts targeted scoring according to specific categories.
  • the second part is mainly aimed at the acquisition equipment of audio signals, and detects whether the acquisition equipment will introduce related distortion.
  • the input target audio signal is classified as music, speech, or other (noise and other audio) by a deep learning network that classifies audio signals. The output is the category of each fixed-length audio segment of the target audio signal (for example, 1 second, so that a result is output once per second), for example, multiple target probabilities for the music, speech, and other types, where the target probabilities sum to 1.
  • the classification with the highest probability can be selected for the subsequent scoring process. For example, if the signal is classified as music, it is scored with the no-reference music quality scoring (which can be regarded as setting the probabilities of the other categories to 0).
  • In that case, the final score is the no-reference scoring result corresponding to the category with the highest probability. Alternatively, after these probabilities are obtained, the audio signal can be sent directly to the three scoring networks in FIG. 7 for detection.
  • the first detection score (first evaluation information) obtained by fusing the scores is shown in formula (1):
  • Content-related fusion scoring result = music probability * no-reference music score + voice probability * no-reference voice score (1)
  • the audio event classification network aims to score audio signals that are neither speech nor music according to whether they contain noise that degrades audio quality.
  • Sounds such as babble noise, engine noise, and low-frequency noise in the aircraft cabin are harmful noises and correspond to low scores; sounds such as bird calls and running water are non-harmful noises and correspond to high scores.
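The two ways of combining the per-category scores described above (keeping only the highest-probability category, or weighting each score by its probability) can be sketched as follows. This is a minimal sketch: the function names, category keys, and numeric values are hypothetical, and the per-category scores stand in for the outputs of the scoring networks.

```python
from typing import Dict


def score_by_top_category(probs: Dict[str, float], scores: Dict[str, float]) -> float:
    """Use the score of the category with the highest probability."""
    top = max(probs, key=probs.get)
    return scores[top]


def score_by_probability_weighting(probs: Dict[str, float],
                                   scores: Dict[str, float]) -> float:
    """Weight each category score by its target probability and sum."""
    return sum(probs[c] * scores[c] for c in probs)


# Hypothetical classifier output for one segment (probabilities sum to 1)
# and hypothetical per-category scores.
probs = {"music": 0.7, "speech": 0.2, "other": 0.1}
scores = {"music": 80.0, "speech": 60.0, "other": 40.0}

top_score = score_by_top_category(probs, scores)
weighted_score = score_by_probability_weighting(probs, scores)
```

Formula (1) as printed lists only the music and speech terms; this sketch sums over whichever categories are supplied, so passing only those two keys gives exactly the printed form.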
  • the sound quality evaluation related to the acquisition device aims to evaluate sound quality independently of the content. It mainly concerns the damage to sound quality introduced during audio signal acquisition.
  • the broken sound detection network divides the input audio signal into segments of 1 second each and judges, for each segment, whether the sound is broken, so as to evaluate the degree of broken sound.
  • the purpose of the design of the external recording detection network is to determine whether the audio signal to be tested is obviously collected by a low-quality mobile phone microphone. Signals collected by low-quality mobile phone microphones usually have a narrow frequency response and a low signal-to-noise ratio due to the acquisition equipment, which affects the sound quality.
  • the external recording detection network determines whether the input signal is an audio signal collected by a low-quality microphone.
  • the fusion score (the score corresponding to the second evaluation information) jointly generated by the broken sound detection and the external recording detection is shown in formula (2):
  • Acquisition-device-related fusion scoring result = broken sound detection result * broken sound detection weight + external recording detection result * external recording detection weight (2)
  • the final fusion result (the score corresponding to the target evaluation information) is shown in formula (3):
  • Fusion scoring result = content-related fusion scoring result * content-related fusion scoring weight + acquisition-device-related fusion scoring result * acquisition-device-related fusion scoring weight (3)
  • the final fusion result (the score corresponding to the target evaluation information) can accurately evaluate the sound quality of the corresponding target audio signal, so as to determine the sound quality category of the target audio signal according to the target evaluation information.
  • the purpose of detecting the quality of the target audio signal can be achieved without acquiring the original lossless audio signal corresponding to the target audio signal.
  • the target evaluation information used to evaluate the sound quality of the target audio signal is derived from the target audio signal itself, so the target audio signal can be examined from multiple aspects, which ultimately makes it possible to accurately determine the sound quality of the corresponding audio signal.
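Formula (3) and the final mapping from the target score to a sound quality category can be sketched as follows. This is a minimal sketch: the default weights, the thresholds, and the category labels are all hypothetical, since the text above does not fix their values.

```python
from typing import Sequence, Tuple


def target_score(content_score: float, device_score: float,
                 weights: Tuple[float, float] = (0.5, 0.5)) -> float:
    """Formula (3): fuse the content-related and device-related scores."""
    w_content, w_device = weights
    return content_score * w_content + device_score * w_device


def quality_category(score: float,
                     thresholds: Sequence[Tuple[float, str]] = ((80.0, "high"),
                                                                (60.0, "medium"))) -> str:
    """Map a score to a category; thresholds are listed in descending order."""
    for minimum, label in thresholds:
        if score >= minimum:
            return label
    return "low"


# Hypothetical inputs: content-related score 72, device-related score 86.
final_score = target_score(72.0, 86.0)
category = quality_category(final_score)
```

No lossless reference signal appears anywhere in the computation, which is the point made in the text: the category is derived from the target audio signal alone.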
  • Although the steps in the flowcharts of FIGS. 1-7 are shown in sequence according to the arrows, these steps are not necessarily executed in the sequence shown by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-7 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time and may be executed at different times. Nor are they necessarily executed sequentially; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
  • Fig. 8 is a block diagram of an apparatus for detecting sound quality according to an exemplary embodiment.
  • the device includes an audio signal acquisition unit 801, a first detection unit 802, a second detection unit 803, a target evaluation information determination unit 804, and a sound quality detection unit 805, specifically including:
  • An audio signal acquisition unit 801 configured to perform acquisition of a target audio signal
  • the first detection unit 802 is configured to perform detection of the audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;
  • the second detection unit 803 is configured to perform detection of the audio collection signal corresponding to the target audio signal, and obtain second evaluation information related to the audio collection signal;
  • the target evaluation information determination unit 804 is configured to fuse the first evaluation information and the second evaluation information according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal;
  • the sound quality detection unit 805 is configured to determine the sound quality category of the target audio signal according to the target evaluation information.
  • the first detection unit 802 is further configured to perform: classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal; and detecting the target audio signal according to the audio classification result to obtain first evaluation information related to the audio content signal.
  • the first detection unit 802 is further configured to perform: dividing the target audio signal according to the first time length to obtain a first number of audio clips; for each audio clip, classifying the audio clip content corresponding to the audio clip to obtain a second number of target categories corresponding to the audio clip and the target probability that the audio clip belongs to each target category; and determining the second number of target categories and the second number of target probabilities corresponding to each audio clip as the audio classification result.
  • the first detection unit 802 is further configured to perform: for each audio clip, detecting the audio clip according to the second number of target categories to obtain a second number of pieces of clip content evaluation information related to the second number of target categories; determining the clip content evaluation information corresponding to the largest of the second number of target probabilities as the clip content evaluation information corresponding to the audio clip, or, taking the second number of target probabilities as weight coefficients, weighting the second number of pieces of clip content evaluation information related to the audio clip to obtain the clip content evaluation information corresponding to the audio clip, wherein the first number of audio clips corresponds to a first number of pieces of clip content evaluation information; and fusing the first number of pieces of clip content evaluation information according to the second preset weight vector to obtain the first evaluation information.
  • the second detection unit 803 is further configured to perform: detecting the broken sound phenomenon corresponding to the target audio signal to obtain corresponding broken sound evaluation information; detecting the external recording device corresponding to the target audio signal to obtain corresponding external recording evaluation information; and fusing the broken sound evaluation information and the external recording evaluation information according to the third preset weight vector to obtain the second evaluation information.
  • the second detection unit 803 is further configured to perform: dividing the target audio signal according to the second time length to obtain a third number of audio clips; for each audio clip, detecting the degree of broken sound corresponding to the audio clip to obtain the clip broken sound evaluation information corresponding to the audio clip, wherein the third number of audio clips corresponds to a third number of pieces of clip broken sound evaluation information; and fusing the third number of pieces of clip broken sound evaluation information according to the fourth preset weight vector to obtain the broken sound evaluation information.
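The division of the apparatus into units 801 to 805 can be sketched as a small composition of callables. This is a minimal sketch: the class name, field names, and stub behaviours are hypothetical illustrations of the unit boundaries, not the disclosed implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Signal = Sequence[float]


@dataclass
class SoundQualityDetector:
    acquire: Callable[[], Signal]               # audio signal acquisition unit 801
    detect_content: Callable[[Signal], float]   # first detection unit 802
    detect_capture: Callable[[Signal], float]   # second detection unit 803
    first_weights: Tuple[float, float]          # first preset weight vector (unit 804)
    classify: Callable[[float], str]            # sound quality detection unit 805

    def run(self) -> str:
        signal = self.acquire()
        first = self.detect_content(signal)      # first evaluation information
        second = self.detect_capture(signal)     # second evaluation information
        w1, w2 = self.first_weights
        target = first * w1 + second * w2        # target evaluation information
        return self.classify(target)


# Stub units with hypothetical fixed outputs, used only to exercise the flow.
detector = SoundQualityDetector(
    acquire=lambda: [0.0] * 10,
    detect_content=lambda signal: 72.0,
    detect_capture=lambda signal: 86.0,
    first_weights=(0.5, 0.5),
    classify=lambda score: "acceptable" if score >= 60.0 else "poor",
)
result = detector.run()
```

Keeping each unit behind a callable mirrors the text: the content-related and acquisition-related detections are independent, and only unit 804 needs to know the first preset weight vector.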
  • FIG. 9 is a block diagram of an electronic device 900 for sound quality detection according to an exemplary embodiment.
  • device 900 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, personal digital assistant, or the like.
  • device 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and Communication component 916.
  • the processing component 902 generally controls the overall operation of the device 900, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
  • the processing component 902 may include one or more processors 920 to execute instructions to perform all or some of the steps of the methods described above. Additionally, processing component 902 may include one or more modules to facilitate interaction between processing component 902 and other components. For example, processing component 902 may include a multimedia module to facilitate interaction between multimedia component 908 and processing component 902.
  • Memory 904 is configured to store various types of data to support operation at device 900 . Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and the like. Memory 904 may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • Power component 906 provides power to various components of device 900.
  • Power component 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to device 900.
  • Multimedia component 908 includes a screen that provides an output interface between the device 900 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundaries of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe action.
  • the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. When the device 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front and rear cameras can be a fixed optical lens system or have focal length and optical zoom capability.
  • Audio component 910 is configured to output and/or input audio signals.
  • audio component 910 includes a microphone (MIC) that is configured to receive external audio signals when device 900 is in operating modes, such as call mode, recording mode, and voice recognition mode. The received audio signal may be further stored in memory 904 or transmitted via communication component 916 .
  • audio component 910 also includes a speaker for outputting audio signals.
  • the I/O interface 912 provides an interface between the processing component 902 and a peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: home button, volume buttons, start button, and lock button.
  • Sensor assembly 914 includes one or more sensors for providing status assessments of various aspects of device 900 .
  • the sensor assembly 914 can detect the open/closed state of the device 900, the relative positioning of components, such as the display and keypad of the device 900, and the sensor assembly 914 can also detect a change in the position of the device 900 or a component of the device 900 , the presence or absence of user contact with the device 900 , the orientation or acceleration/deceleration of the device 900 and the temperature change of the device 900 .
  • Sensor assembly 914 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • Sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 916 is configured to facilitate wired or wireless communication between device 900 and other devices.
  • Device 900 may access wireless networks based on communication standards, such as WiFi, carrier networks (eg, 2G, 3G, 4G, or 5G), or a combination thereof.
  • the communication component 916 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 916 also includes a near field communication (NFC) module to facilitate short-range communication.
  • device 900 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
  • In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 904 including instructions, and the instructions are executable by the processor 920 of device 900 to perform the method described above.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
  • a computer program product is also provided, comprising a computer program stored in a readable storage medium. At least one processor of the device reads the computer program from the readable storage medium and executes it, causing the device to perform the above-described method.

Abstract

A sound quality detection method and device. Said method comprises: acquiring a target audio signal (S100); detecting an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal (S200); detecting an audio acquisition signal corresponding to the target audio signal to obtain second evaluation information related to the audio acquisition signal (S300); fusing the first evaluation information and the second evaluation information according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal (S400); and determining the category of the sound quality of the target audio signal according to the target evaluation information (S500).

Description

Sound quality detection method and device
This application claims priority to Chinese Patent Application No. 202011054305.2, filed with the China Patent Office on September 29, 2020, the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the technical field of audio processing, and in particular to a sound quality detection method, device, electronic device, and storage medium.
Background
With the progress of society and the development of computer and network technology, people have more and more channels for receiving external information. In recent years, owing to the development of audio processing technology, communicating with the outside world and perceiving external changes through audio information has developed at an unprecedented pace, and people pay increasing attention to the quality of the audio information they send and receive. Traditional sound quality detection methods are generally full-reference methods: the original lossless audio signal and various lossy audio signals of reduced sound quality corresponding to it are first obtained; the original lossless audio signal is then compared with a lossy audio signal to determine sound quality evaluation information for the lossy audio signal, and the sound quality of the lossy audio signal is determined from that evaluation information.
Summary
The present disclosure provides a sound quality detection method, device, electronic device, and storage medium.
According to a first aspect of the embodiments of the present disclosure, a sound quality detection method is provided, including:
acquiring a target audio signal;
detecting an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;
detecting an audio acquisition signal corresponding to the target audio signal to obtain second evaluation information related to the audio acquisition signal;
fusing the first evaluation information and the second evaluation information according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal; and
determining a sound quality category of the target audio signal according to the target evaluation information.
In some embodiments, detecting the audio content signal corresponding to the target audio signal to obtain the first evaluation information related to the audio content signal includes:
classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal; and
detecting the target audio signal according to the audio classification result to obtain the first evaluation information related to the audio content signal.
In some embodiments, classifying the audio content signal corresponding to the target audio signal to obtain the audio classification result corresponding to the audio content signal includes:
dividing the target audio signal according to a first time length to obtain a first number of audio clips;
for each audio clip, classifying the audio clip content corresponding to the audio clip to obtain a second number of target categories corresponding to the audio clip and a target probability that the audio clip belongs to each target category; and
determining the second number of target categories and the second number of target probabilities corresponding to each audio clip as the audio classification result.
In some embodiments, detecting the target audio signal according to the audio classification result to obtain the first evaluation information related to the audio content signal includes:
for each audio clip, detecting the audio clip according to the second number of target categories to obtain a second number of pieces of clip content evaluation information related to the second number of target categories;
determining the clip content evaluation information corresponding to the largest of the second number of target probabilities as the clip content evaluation information corresponding to the audio clip, or, taking the second number of target probabilities as weight coefficients, weighting the second number of pieces of clip content evaluation information related to the audio clip to obtain the clip content evaluation information corresponding to the audio clip, wherein the first number of audio clips corresponds to a first number of pieces of clip content evaluation information; and
fusing the first number of pieces of clip content evaluation information according to a second preset weight vector to obtain the first evaluation information.
In some embodiments, detecting the audio acquisition signal corresponding to the target audio signal to obtain the second evaluation information related to the audio acquisition signal includes:
detecting the broken sound phenomenon corresponding to the target audio signal to obtain corresponding broken sound evaluation information;
detecting the external recording device corresponding to the target audio signal to obtain corresponding external recording evaluation information; and
fusing the broken sound evaluation information and the external recording evaluation information according to a third preset weight vector to obtain the second evaluation information.
In some embodiments, detecting the audio acquisition signal corresponding to the target audio signal to obtain the second evaluation information related to the audio acquisition signal includes:
dividing the target audio signal according to a second time length to obtain a third number of audio clips;
for each audio clip, detecting the degree of broken sound corresponding to the audio clip to obtain the clip broken sound evaluation information corresponding to the audio clip, wherein the third number of audio clips corresponds to a third number of pieces of clip broken sound evaluation information; and
fusing the third number of pieces of clip broken sound evaluation information according to a fourth preset weight vector to obtain the broken sound evaluation information.
According to a second aspect of the embodiments of the present disclosure, a sound quality detection device is provided, including:
an audio signal acquisition unit, configured to acquire a target audio signal;
a first detection unit, configured to detect an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;
a second detection unit, configured to detect an audio acquisition signal corresponding to the target audio signal to obtain second evaluation information related to the audio acquisition signal;
a target evaluation information determination unit, configured to fuse the first evaluation information and the second evaluation information according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal; and
a sound quality detection unit, configured to determine a sound quality category of the target audio signal according to the target evaluation information.
在一些实施例中,所述第一检测单元还被配置为执行:In some embodiments, the first detection unit is further configured to perform:
对所述目标音频信号对应的音频内容信号进行分类,得到与所述音频内容信号对应的音频分类结果;classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal;
按照所述音频分类结果,对所述目标音频信号进行检测,得到与所述音频内容信号相关的第一评价信息。According to the audio classification result, the target audio signal is detected to obtain first evaluation information related to the audio content signal.
In some embodiments, the first detection unit is further configured to:
divide the target audio signal according to a first time length to obtain a first number of audio segments;
for each audio segment, classify the audio segment content corresponding to the audio segment to obtain a second number of target categories corresponding to the audio segment and a target probability that the audio segment belongs to each target category; and
determine the second number of target categories and the second number of target probabilities corresponding to each audio segment as the audio classification result.
In some embodiments, the first detection unit is further configured to:
for each audio segment, detect the audio segment according to the second number of target categories to obtain a second number of pieces of segment content evaluation information related to the second number of target categories;
determine the segment content evaluation information corresponding to the maximum of the second number of target probabilities as the segment content evaluation information of the audio segment; or, using the second number of target probabilities as weight coefficients, weight the second number of pieces of segment content evaluation information related to the audio segment to obtain the segment content evaluation information of the audio segment; wherein the first number of audio segments correspond to a first number of pieces of segment content evaluation information; and
fuse the first number of pieces of segment content evaluation information according to a second preset weight vector to obtain the first evaluation information.
In some embodiments, the second detection unit is further configured to:
detect the broken-sound phenomenon corresponding to the target audio signal to obtain corresponding broken-sound evaluation information;
detect the external recording device corresponding to the target audio signal to obtain corresponding external recording evaluation information; and
fuse the broken-sound evaluation information and the external recording evaluation information according to a third preset weight vector to obtain the second evaluation information.
In some embodiments, the second detection unit is further configured to:
divide the target audio signal according to a second time length to obtain a third number of audio segments;
for each audio segment, detect the degree of broken sound of the audio segment to obtain broken-sound evaluation information of the audio segment, wherein the third number of audio segments correspond to a third number of pieces of segment broken-sound evaluation information; and
fuse the third number of pieces of segment broken-sound evaluation information according to a fourth preset weight vector to obtain the broken-sound evaluation information.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the sound quality detection method described in any embodiment of the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, a storage medium is provided. When instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the sound quality detection method described in any embodiment of the first aspect.
According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided. The program product includes a computer program stored in a readable storage medium; at least one processor of a device reads the computer program from the readable storage medium and executes it, so that the device performs the sound quality detection method described in any embodiment of the first aspect.
In the embodiments of the present disclosure, the audio content signal corresponding to the target audio signal is detected, so that the quality of the target audio signal is evaluated at the level of the audio content signal and first evaluation information related to the audio content signal is obtained; and the audio collection signal corresponding to the target audio signal is detected, so that the quality of the target audio signal is evaluated at the level of the audio collection signal and second evaluation information related to the audio collection signal is obtained. After the first evaluation information and the second evaluation information, which evaluate the target audio signal from different dimensions, are obtained, they are fused according to the first preset weight vector to obtain target evaluation information corresponding to the target audio signal, where the target evaluation information is related to the sound quality of the target audio signal, and the sound quality category of the target audio signal is determined according to the target evaluation information. The quality of the target audio signal can therefore be detected without obtaining the original lossless audio signal corresponding to it. Moreover, because the target evaluation information used to evaluate the sound quality is derived from multi-dimensional attributes of the target audio signal itself, the target audio signal can be examined from multiple aspects, so that the sound quality of the audio signal can be determined accurately.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.
Description of Drawings
FIG. 1 is a flowchart of a sound quality detection method according to an exemplary embodiment.
FIG. 2 is a flowchart of one possible implementation of step S200 according to an exemplary embodiment.
FIG. 3 is a flowchart of one possible implementation of step S210 according to an exemplary embodiment.
FIG. 4 is a flowchart of one possible implementation of step S220 according to an exemplary embodiment.
FIG. 5 is a flowchart of one possible implementation of step S300 according to an exemplary embodiment.
FIG. 6 is a flowchart of one possible implementation of step S310 according to an exemplary embodiment.
FIG. 7 is a structural diagram of a sound quality detection system according to a specific exemplary embodiment.
FIG. 8 is a block diagram of a sound quality detection apparatus according to an exemplary embodiment.
FIG. 9 is a block diagram of an electronic device for sound quality detection according to an exemplary embodiment.
Detailed Description
To enable those of ordinary skill in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings.
It should be noted that the terms "first", "second" and the like in the description and claims of the present disclosure and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
FIG. 1 is a flowchart of a sound quality detection method according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps:
In step S100, a target audio signal is acquired.
In step S200, the audio content signal corresponding to the target audio signal is detected to obtain first evaluation information related to the audio content signal.
In step S300, the audio collection signal corresponding to the target audio signal is detected to obtain second evaluation information related to the audio collection signal.
In step S400, the first evaluation information and the second evaluation information are fused according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, where the target evaluation information is related to the sound quality of the target audio signal.
In step S500, the sound quality category of the target audio signal is determined according to the target evaluation information.
Here, the target audio signal refers to the audio signal whose sound quality is to be evaluated. The audio content signal refers to a specific signal related to the target audio signal; for example, the audio content signal may be music, speech, or other (noise and other audio). The audio collection signal refers to the signal related to the collection process of the target audio signal. It is used to evaluate audio quality that is independent of the audio content signal, and mainly refers to the part of the signal that degrades sound quality during audio collection. In some embodiments, the evaluation of the audio collection signal mainly includes broken-sound evaluation and external recording evaluation. The first preset weight vector is the vector formed by the combination coefficients used when the first evaluation information and the second evaluation information are combined. In some embodiments, the first preset weight vector may be set as a weighted average, or may be set according to the influence of the specific audio content signal and audio collection signal on sound quality.
Specifically, after the target audio signal whose sound quality is to be evaluated is acquired, the quality of the target audio signal is evaluated at the level of the specific audio content to obtain the first evaluation information related to the audio content signal; at the same time, the quality of the target audio signal is evaluated at the level of audio collection to obtain the second evaluation information related to the audio collection signal. After the first evaluation information and the second evaluation information, which evaluate the target audio signal from different dimensions, are obtained, they are fused according to the first preset weight vector to obtain the target evaluation information corresponding to the target audio signal, and the target evaluation information is related to the sound quality of the target audio signal. For example, when the score corresponding to the first evaluation information is 90, the score corresponding to the second evaluation information is 85, and the first preset weight vector is (0.6, 0.4), the score corresponding to the target evaluation information is (90*0.6+85*0.4), that is, 88. The target evaluation information can accurately evaluate the sound quality of the corresponding target audio signal, so the sound quality category of the target audio signal is determined according to the target evaluation information. For example, with a full score of 100, a target audio signal whose target evaluation information is 95 may be determined to be a high-quality audio signal, one whose target evaluation information is 70 may be determined to be a medium-quality audio signal, and one whose target evaluation information is 55 may be determined to be a low-quality audio signal. It should be noted that the above scores and classifications are only examples; in a specific implementation, the sound quality categories may be divided differently according to actual needs.
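The fusion in step S400 and the category decision in step S500 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names `fuse` and `quality_category` and the thresholds 85 and 60 are assumptions chosen only so that the example scores 95, 70 and 55 fall into the high, medium and low categories mentioned above.

```python
from typing import Sequence


def fuse(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Fuse evaluation scores as a weighted sum under a preset weight vector."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights))


def quality_category(target_score: float,
                     high: float = 85.0,
                     medium: float = 60.0) -> str:
    """Map a 0-100 target score to a category.

    The thresholds are hypothetical; the disclosure leaves the
    division of categories to the implementer.
    """
    if target_score >= high:
        return "high"
    if target_score >= medium:
        return "medium"
    return "low"


# First evaluation 90, second evaluation 85, first preset weight vector (0.6, 0.4).
target = fuse([90.0, 85.0], [0.6, 0.4])
print(round(target, 2), quality_category(target))
```

In practice the weight vector and thresholds would be tuned to the application, as the text notes.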
In the above sound quality detection method, the audio content signal corresponding to the target audio signal is detected, so that the quality of the target audio signal is evaluated at the level of the audio content signal and first evaluation information related to the audio content signal is obtained; and the audio collection signal corresponding to the target audio signal is detected, so that the quality of the target audio signal is evaluated at the level of the audio collection signal and second evaluation information related to the audio collection signal is obtained. After the first evaluation information and the second evaluation information, which evaluate the target audio signal from different dimensions, are obtained, they are fused according to the first preset weight vector to obtain target evaluation information corresponding to the target audio signal, where the target evaluation information is related to the sound quality of the target audio signal, and the sound quality category of the target audio signal is determined according to the target evaluation information. The quality of the target audio signal can therefore be detected without obtaining the original lossless audio signal corresponding to it. Moreover, because the target evaluation information used to evaluate the sound quality is derived from multi-dimensional attributes of the target audio signal itself, the target audio signal can be examined from multiple aspects, so that the sound quality of the audio signal can be determined accurately.
In an exemplary embodiment, FIG. 2 is a flowchart of one possible implementation of step S200, which includes the following steps:
In step S210, the audio content signal corresponding to the target audio signal is classified to obtain an audio classification result corresponding to the audio content signal.
In step S220, the target audio signal is detected according to the audio classification result to obtain the first evaluation information related to the audio content signal.
Here, the audio category corresponding to the audio content signal may be music, speech, or other (noise and other audio).
Specifically, the audio content signal corresponding to the target audio signal is classified to obtain an audio classification result corresponding to the audio content signal. The audio classification result includes the categories corresponding to the target audio signal and, for each category, the probability that the target audio signal is judged to belong to it. The target audio signal is detected according to these categories to obtain its evaluation information in each category, and the evaluation information of the categories is combined according to the probability of each category to obtain the first evaluation information related to the audio content signal.
For example, a pre-trained audio classification network model capable of detecting the category of an audio signal may be used to obtain the probabilities that the target audio signal is music, speech, or other; for instance, the probability that the target audio signal is music is 0.7, the probability that it is speech is 0.2, and the probability that it is another audio signal is 0.1. After the audio classification result is obtained, pre-trained evaluation networks capable of evaluating audio signals of the corresponding categories are used, for example a music evaluation network model for music evaluation, a speech evaluation network model for speech evaluation, and network models for other categories. The target audio signal is input into the corresponding evaluation network models to obtain its evaluation information at the music level, the speech level, and the other level. For example, the score corresponding to the evaluation information of the target audio signal is 90 at the music level, 80 at the speech level, and 85 at the other level. Finally, the evaluation information of the categories is combined according to the probability of each category, giving a score for the first evaluation information of (0.7*90+0.2*80+0.1*85), that is, 87.5.
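The probability-weighted combination in this example can be sketched as follows. The sketch assumes the classification network and the per-category evaluation networks have already produced their outputs, which are represented here as plain dictionaries; the function name `content_score` and the category keys are illustrative assumptions, not part of the disclosure.

```python
from typing import Mapping


def content_score(probs: Mapping[str, float],
                  scores: Mapping[str, float]) -> float:
    """Combine per-category evaluation scores, weighted by category probability."""
    if set(probs) != set(scores):
        raise ValueError("probs and scores must cover the same categories")
    return sum(probs[c] * scores[c] for c in probs)


# Stand-in outputs of the (hypothetical) classifier and evaluation networks.
probs = {"music": 0.7, "speech": 0.2, "other": 0.1}
scores = {"music": 90.0, "speech": 80.0, "other": 85.0}
print(round(content_score(probs, scores), 2))
```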
In the above exemplary embodiment, the audio content signal corresponding to the target audio signal is classified to obtain the audio classification result, and the target audio signal is detected according to the audio classification result. The target audio signal can thus be detected in a targeted manner for the specific categories it belongs to, and the resulting first evaluation information reflects the quality of the audio signal more comprehensively and specifically, providing a basis for the subsequent sound quality evaluation of the audio signal.
In an exemplary embodiment, a flowchart of one possible implementation of step S210 is shown in FIG. 3 and includes the following steps:
In step S211, the target audio signal is divided according to a first time length to obtain a first number of audio segments.
In step S212, for each audio segment, the audio segment content corresponding to the audio segment is classified to obtain a second number of target categories corresponding to the audio segment and a target probability that the audio segment belongs to each target category.
In step S213, the second number of target categories and the second number of target probabilities corresponding to each audio segment are determined as the audio classification result.
Here, the first time length refers to the reference measure used to divide the audio signal. In some embodiments, it may be 1 second, 10 seconds, 20 seconds, 1 minute, and so on; these values are only examples and do not limit the first time length.
Specifically, the target audio signal is divided according to the first time length to obtain the first number of audio segments. For example, if the target audio signal is 3 minutes long and is divided with a first time length of 10 seconds, the first number is 18 and each audio segment is 10 seconds long. For each 10-second audio segment, the audio segment content is classified. If, for example, the probability that a 10-second audio segment is music is 0.7, the probability that it is speech is 0.2, and the probability that it is another audio signal is 0.1, then the second number is 3 and the target categories are music, speech, and other. Finally, the second number of target categories and the second number of target probabilities corresponding to each of the 18 10-second audio segments are determined as the audio classification result.
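The segmentation in step S211 can be sketched as below. This is an illustration under stated assumptions: the signal is a flat sequence of samples, the 8000 Hz sample rate is arbitrary, and a trailing segment shorter than the first time length is kept as is (the disclosure does not say how a remainder is handled).

```python
from typing import List, Sequence


def split_into_segments(samples: Sequence[float],
                        sample_rate: int,
                        segment_seconds: float) -> List[Sequence[float]]:
    """Split a signal into consecutive segments of `segment_seconds` each.

    Assumption: a shorter trailing segment, if any, is kept unchanged.
    """
    seg_len = int(sample_rate * segment_seconds)
    if seg_len <= 0:
        raise ValueError("segment length must be at least one sample")
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]


# A 3-minute signal split with a first time length of 10 seconds.
sample_rate = 8000  # hypothetical sample rate
signal = [0.0] * (180 * sample_rate)
segments = split_into_segments(signal, sample_rate, 10)
print(len(segments))
```

Each resulting segment would then be passed to the classifier to obtain its target categories and target probabilities.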
In the above exemplary embodiment, the target audio signal is divided according to the first time length to obtain the first number of audio segments, the audio segment content corresponding to each audio segment is classified to obtain the second number of target categories and the target probability that the audio segment belongs to each target category, and the second number of target categories and the second number of target probabilities corresponding to each audio segment are determined as the audio classification result. The target audio signal is thus divided on a finer time scale, so that each audio segment can subsequently be examined in more detail, and the sound quality of the audio signal can ultimately be determined accurately.
In an exemplary embodiment, a flowchart of one possible implementation of step S220 is shown in FIG. 4 and includes the following steps:
In step S221, for each audio segment, the audio segment is detected according to the second number of target categories to obtain a second number of pieces of segment content evaluation information related to the second number of target categories.
In step S222, the segment content evaluation information corresponding to the maximum of the second number of target probabilities is determined as the segment content evaluation information of the audio segment; or, using the second number of target probabilities as weight coefficients, the second number of pieces of segment content evaluation information related to the audio segment are weighted to obtain the segment content evaluation information of the audio segment. The first number of audio segments correspond to a first number of pieces of segment content evaluation information.
In step S223, the first number of pieces of segment content evaluation information are fused according to a second preset weight vector to obtain the first evaluation information.
Here, the first number of audio segments correspond to the first number of pieces of segment content evaluation information. The second preset weight vector is the vector formed by the combination coefficients of the sound quality detection results of the audio segments. In some embodiments, the second preset weight vector may be set as a weighted average, or may be set according to the specific target audio signal. For example, relatively small weight coefficients may be assigned to the audio segments at the beginning and end of the target audio signal and relatively large weight coefficients to the audio segments in the middle, to reduce the influence of excessive noise at the start and end of an audio recording.
Specifically, each audio segment is detected to obtain the second number of pieces of segment content evaluation information related to the second number of target categories, and the segment content evaluation information corresponding to the maximum of the second number of target probabilities is determined as the segment content evaluation information of the audio segment. Alternatively, using the second number of target probabilities as weight coefficients, the second number of pieces of segment content evaluation information related to the audio segment are weighted to obtain the segment content evaluation information of the audio segment. For example, suppose the probability that an audio segment is music is 0.7, the probability that it is speech is 0.2, and the probability that it is another audio signal is 0.1, and the score corresponding to its evaluation information is 90 at the music level, 80 at the speech level, and 85 at the other level. If the segment content evaluation information corresponding to the maximum target probability is used, the segment content evaluation information is the score 90, which corresponds to the maximum probability 0.7. If instead the evaluation information of the categories is combined according to the probability of each category, the score corresponding to the segment content evaluation information is (0.7*90+0.2*80+0.1*85), that is, 87.5. The first number of audio segments correspond to the first number of pieces of segment content evaluation information. Finally, the first number of pieces of segment content evaluation information are weighted and summed according to the second preset weight vector to obtain the first evaluation information.
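The two alternatives of step S222 and the fusion of step S223 can be sketched as follows. This is a sketch under assumptions: `segment_score`, `edge_damped_weights` and `first_evaluation` are hypothetical names, and the particular weighting scheme (first and last segments at half the relative weight of the others, normalized to sum to 1) is just one way to realize the "smaller weights at the ends" idea described above.

```python
from typing import List, Mapping, Sequence


def segment_score(probs: Mapping[str, float],
                  scores: Mapping[str, float],
                  strategy: str = "weighted") -> float:
    """Per-segment content score: most probable category, or probability-weighted."""
    if strategy == "max":
        best = max(probs, key=lambda c: probs[c])
        return scores[best]
    if strategy == "weighted":
        return sum(probs[c] * scores[c] for c in probs)
    raise ValueError("strategy must be 'max' or 'weighted'")


def edge_damped_weights(n: int, edge: float = 0.5) -> List[float]:
    """Hypothetical second preset weight vector: end segments weigh less."""
    if n <= 0 or edge <= 0:
        raise ValueError("n and edge must be positive")
    raw = [edge if i in (0, n - 1) else 1.0 for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def first_evaluation(segment_scores: Sequence[float],
                     weights: Sequence[float]) -> float:
    """Weighted sum of per-segment scores under the second preset weight vector."""
    if len(segment_scores) != len(weights):
        raise ValueError("one weight per segment is required")
    return sum(s * w for s, w in zip(segment_scores, weights))


probs = {"music": 0.7, "speech": 0.2, "other": 0.1}
scores = {"music": 90.0, "speech": 80.0, "other": 85.0}
print(segment_score(probs, scores, "max"))
print(round(segment_score(probs, scores, "weighted"), 2))
print(first_evaluation([80.0, 90.0, 80.0], edge_damped_weights(3)))
```

The "max" strategy commits to a single category per segment, while the "weighted" strategy hedges across categories when the classifier is uncertain.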
In the above exemplary embodiment, for each audio segment, the audio segment is detected according to the second number of target categories to obtain the second number of pieces of segment content evaluation information related to the second number of target categories; the segment content evaluation information corresponding to the maximum of the second number of target probabilities is determined as the segment content evaluation information of the audio segment, or, using the second number of target probabilities as weight coefficients, the second number of pieces of segment content evaluation information are weighted to obtain the segment content evaluation information of the audio segment; and the first number of pieces of segment content evaluation information are fused according to the second preset weight vector to obtain the first evaluation information. The target audio signal is thus examined in more detail on a finer time scale, and the sound quality of the audio signal can ultimately be determined accurately.
In an exemplary embodiment, FIG. 5 is a flowchart of one possible implementation of step S300, which includes the following steps:
In step S310, the broken-sound phenomenon corresponding to the target audio signal is detected to obtain corresponding broken-sound evaluation information.
In step S320, the external recording device corresponding to the target audio signal is detected to obtain corresponding external recording evaluation information.
In step S330, the broken-sound evaluation information and the external recording evaluation information are fused according to a third preset weight vector to obtain the second evaluation information.
Here, the broken-sound phenomenon refers to the situation in which the level of the sound signal exceeds the upper load limit of the electronic components, so that part of the sound signal is cut off (clipped) and noise is present in the emitted sound. An external recording device refers to a device that transmits a sound signal to a recording system through a microphone or the built-in pickup of a recorder and then records the sound signal on a storage medium. This recording approach makes it easy to record human voice and many other sound signals, but the resulting audio signal is susceptible to external interference and is easily distorted. The third preset weight vector is the vector formed by the combination coefficients used when the broken-sound evaluation information and the external recording evaluation information are combined. In some embodiments, the third preset weight vector may be set as a weighted average, or may be set according to the influence of the specific broken-sound evaluation information and external recording evaluation information on sound quality.
Specifically, the broken sound phenomenon and the external recording device corresponding to the target audio signal are detected separately, yielding the broken sound evaluation information for the broken sound detection and the external recording evaluation information for the external recording device detection. The broken sound evaluation information and the external recording evaluation information are then fused according to the third preset weight vector to obtain the second evaluation information. As an example, pre-trained evaluation networks capable of detecting and evaluating the broken sound phenomenon and the external recording device are obtained, such as a broken sound evaluation network model for evaluating the broken sound phenomenon and an external recording network model for evaluating the external recording device. The target audio signal is input into the corresponding evaluation network model to obtain the broken sound evaluation information and the external recording evaluation information. For example, if the score corresponding to the evaluation information of the target audio signal at the broken sound level is 90, the score corresponding to its evaluation information at the external recording level is 80, and the third preset weight vector is (0.6, 0.4), then the score corresponding to the second evaluation information is 90*0.6 + 80*0.4 = 86.
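The weighted fusion in step S330 can be illustrated with a minimal sketch. The function name `fuse_scores` and the variable names are illustrative assumptions and do not come from the disclosure; only the scores (90, 80) and the weight vector (0.6, 0.4) are taken from the example above.

```python
def fuse_scores(scores, weights):
    """Weighted sum of evaluation scores using a preset weight vector."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights))


# Values from the example in the text.
broken_sound_score = 90
external_recording_score = 80
third_weight_vector = (0.6, 0.4)

# Score corresponding to the second evaluation information (about 86).
second_evaluation_score = fuse_scores(
    (broken_sound_score, external_recording_score), third_weight_vector
)
print(second_evaluation_score)
```

The same helper would apply to any of the preset weight vectors described in this disclosure, since each fusion step is a weighted combination of scores.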
In the above exemplary embodiment, the broken sound phenomenon corresponding to the target audio signal is detected to obtain the corresponding broken sound evaluation information, and the external recording device corresponding to the target audio signal is detected to obtain the corresponding external recording evaluation information. In this way, the target audio signal can be examined in a targeted manner across the different sound quality categories associated with the specific acquisition device of the target audio signal. The broken sound evaluation information and the external recording evaluation information are then fused according to the third preset weight vector, so that the resulting second evaluation information reflects, more comprehensively and in a more targeted way, the effect of the acquisition device on the sound quality of the audio signal, providing a basis for the subsequent sound quality evaluation of the audio signal.
In an exemplary embodiment, FIG. 6 is a flowchart of one possible implementation of step S310, which includes the following steps:
In step S311, the target audio signal is divided according to a second time length to obtain a third number of audio segments.
In step S312, for each audio segment, the degree of broken sound corresponding to the audio segment is detected to obtain segment broken sound evaluation information corresponding to the audio segment, where the third number of audio segments correspond to the third number of pieces of segment broken sound evaluation information.
In step S313, the third number of pieces of segment broken sound evaluation information are fused according to a fourth preset weight vector to obtain the broken sound evaluation information.
Here, the second time length is the reference measure used to divide the audio signal. In some embodiments it may be 1 second, 10 seconds, 20 seconds, 1 minute, and so on; these values are only illustrative and do not limit the second time length. The fourth preset weight vector is the vector of combining coefficients used when the third number of pieces of segment broken sound evaluation information are merged. In some embodiments, the fourth preset weight vector may be set in a weighted-average manner, or may be set according to the specific target audio signal. For example, relatively small weight coefficients may be assigned to the audio segments at the beginning and end of the target audio signal and relatively large weight coefficients to the segments in the middle, so as to reduce the influence of excessive noise at the start of the audio recording.
Specifically, the target audio signal is divided according to the second time length to obtain a third number of audio segments. For example, if the target audio signal is 3 minutes long and is divided using 10 seconds as the second time length, the third number is 18, and each audio segment is 10 seconds long. For each 10-second audio segment, the degree of broken sound corresponding to the segment is detected to obtain the segment broken sound evaluation information corresponding to that segment. The third number of audio segments correspond to the third number of pieces of segment broken sound evaluation information, and each piece of segment broken sound evaluation information corresponds to one weight coefficient, so the third number of pieces of segment broken sound evaluation information correspond to a fourth preset weight vector whose dimension equals the third number. The third number of pieces of segment broken sound evaluation information are fused according to the fourth preset weight vector to obtain the broken sound evaluation information.
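Steps S311 to S313 can be sketched as follows, assuming the signal is a plain list of samples. The helper names, the toy sample rate of 100 Hz, the edge weight of 0.5, and the placeholder per-segment score of 90 are all hypothetical; the disclosure obtains per-segment scores from a detection network and does not state whether the weights are normalized, so the division by the weight sum is also an assumption.

```python
def split_into_segments(samples, sample_rate, segment_seconds):
    """Split samples into consecutive segments of segment_seconds each.

    The last segment is shorter if the signal length is not an exact multiple.
    """
    seg_len = int(sample_rate * segment_seconds)
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]


def edge_downweighted(n, edge_weight=0.5):
    """Weight 1.0 for every segment except a smaller weight at both ends."""
    weights = [1.0] * n
    if n >= 2:
        weights[0] = weights[-1] = edge_weight
    return weights


def fuse_segment_scores(segment_scores, weights):
    """Normalized weighted average of per-segment scores (assumed form)."""
    if len(segment_scores) != len(weights):
        raise ValueError("one weight is required per segment score")
    return sum(s * w for s, w in zip(segment_scores, weights)) / sum(weights)


# A 3-minute signal split into 10-second segments gives 18 segments.
sample_rate = 100  # toy rate to keep the example small
signal = [0.0] * (180 * sample_rate)
segments = split_into_segments(signal, sample_rate, 10)

# Placeholder scores; a real system would score each segment with a network.
segment_scores = [90.0] * len(segments)
fused = fuse_segment_scores(segment_scores, edge_downweighted(len(segments)))
print(len(segments), fused)
```

Down-weighting the first and last segments follows the suggestion in the text that recordings tend to be noisier at their edges.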
In the above exemplary embodiment, the target audio signal is divided according to the second time length to obtain a third number of audio segments; for each audio segment, the degree of broken sound is detected to obtain the corresponding segment broken sound evaluation information; and the third number of pieces of segment broken sound evaluation information are fused according to the fourth preset weight vector to obtain the broken sound evaluation information. Dividing the target audio signal on a finer time scale allows each audio segment to be examined in more detail, which ultimately makes it possible to characterize the sound quality of the audio signal accurately.
FIG. 7 is a structural diagram of a sound quality detection system according to a specific exemplary embodiment. As shown in FIG. 7, the system is organized as follows:
The sound quality detection system divides the evaluation of the quality of the target audio signal into two parts: the first part is a content-related sound quality evaluation, and the second part is an acquisition-device-related sound quality evaluation. The first part determines what kind of content the audio signal contains and then scores it in a targeted manner according to the specific category. The second part focuses on the device that acquired the audio signal and detects whether that device introduces distortion.
For the first part, a deep learning network that classifies audio signals first classifies the input target audio signal as music, speech, or other (noise and other audio). For each fixed-length audio segment of the target audio signal (for example, one result per second), the network outputs the corresponding categories in the form of multiple target probabilities for music, speech, and other, and these target probabilities sum to 1. After the target probabilities are obtained, the category with the highest probability may be selected for the subsequent scoring. For example, if the segment is classified as music, a no-reference music quality score is computed for the signal (this can be regarded as setting the probabilities of the other categories to 0), and the final score is the no-reference score corresponding to the category with the highest probability. Alternatively, after the probabilities are obtained, the audio signal may be fed directly into all three scoring networks in FIG. 7. Once their results are obtained, the final fused score, namely the first detection score (the first evaluation information), is given by formula (1):
Content-related fusion score = music probability * no-reference music score + speech probability * no-reference speech score + other probability * audio event classification network score (1)
Here, the audio event classification network scores audio signals that are neither speech nor music according to whether they are noise that degrades audio quality. Sounds such as babble noise, engine noise, and low-frequency noise in an aircraft cabin are harmful noise and receive low scores; sounds such as birdsong and running water are non-harmful noise and receive high scores.
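Both scoring strategies for a single audio segment, selecting the most probable category or applying formula (1), can be sketched as below. The function name `content_score`, the dictionary layout, and the sample probabilities and scores are illustrative assumptions; in the disclosure these values would come from the classification network and the three scoring networks.

```python
def content_score(probabilities, scores, use_argmax=False):
    """Content-related score for one audio segment.

    probabilities: category -> target probability (expected to sum to 1).
    scores: category -> no-reference score from that category's network.
    use_argmax: if True, keep only the most probable category's score;
                otherwise apply the probability-weighted sum of formula (1).
    """
    if use_argmax:
        best = max(probabilities, key=probabilities.get)
        return scores[best]
    return sum(probabilities[c] * scores[c] for c in probabilities)


# Hypothetical outputs for one 1-second segment.
probs = {"music": 0.7, "speech": 0.2, "other": 0.1}
scores = {"music": 80.0, "speech": 60.0, "other": 40.0}

weighted = content_score(probs, scores)            # 0.7*80 + 0.2*60 + 0.1*40
top_only = content_score(probs, scores, use_argmax=True)
print(weighted, top_only)
```

The argmax variant needs only one scoring network to run per segment, whereas the weighted variant uses all three but degrades more gracefully when the classifier is uncertain.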
The second part of the system, the acquisition-device-related sound quality evaluation, evaluates sound quality independently of content. It mainly concerns the damage to sound quality introduced while the audio signal is being acquired, and it mainly comprises a broken sound detection network and an external recording detection network.
The broken sound detection network splits the input audio signal into units of, for example, 1 second, determines for each unit whether broken sound is present, and evaluates the degree of broken sound of the audio signal according to the number of units that contain broken sound (for example, 10 units out of 60 seconds). The external recording detection network is designed to determine whether the audio signal under test was clearly captured by a low-quality mobile phone microphone. Signals captured by low-quality mobile phone microphones usually exhibit a narrow frequency response, a low signal-to-noise ratio, and similar defects caused by the acquisition device, all of which affect sound quality. The external recording detection network determines whether the input signal is an audio signal captured by such a low-quality microphone. The fused score produced jointly by the broken sound detection and the external recording detection (the score corresponding to the second evaluation information) is given by formula (2):
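The per-unit counting described above can be sketched as follows. The disclosure uses a trained detection network and does not state how the count is mapped to a score, so both parts here are assumptions: `unit_is_clipped` is a simple heuristic stand-in for the network (a run of samples at full scale), and the linear mapping in `broken_sound_score` is only one plausible choice.

```python
def unit_is_clipped(unit, full_scale=1.0, min_run=3):
    """Heuristic stand-in for the detection network (an assumption).

    Flags a unit when at least min_run consecutive samples reach full scale.
    """
    run = 0
    for x in unit:
        if abs(x) >= full_scale:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0
    return False


def broken_sound_score(units, max_score=100.0):
    """Map the share of flagged units to a score (assumed linear mapping)."""
    if not units:
        raise ValueError("at least one unit is required")
    flagged = sum(1 for u in units if unit_is_clipped(u))
    return max_score * (1.0 - flagged / len(units))


# 60 one-second units, 10 of which contain a flat-topped run of samples.
clean_unit = [0.1, -0.2, 0.3, 0.0]
clipped_unit = [1.0, 1.0, 1.0, 0.2]
units = [clipped_unit] * 10 + [clean_unit] * 50

score = broken_sound_score(units)
print(score)
```

With 10 flagged units out of 60, this mapping yields a score of roughly 83 out of 100.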
Acquisition-device-related fusion score = broken sound detection result * broken sound detection weight + external recording detection result * external recording detection weight (2)
The final fusion result (the score corresponding to the target evaluation information) is given by formula (3):
Fusion score = content-related fusion score * content-related fusion weight + acquisition-device-related fusion score * acquisition-device-related fusion weight (3)
This final fusion result (the score corresponding to the target evaluation information) evaluates the sound quality of the target audio signal accurately, so that the sound quality category of the target audio signal can be determined from the target evaluation information.
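Formula (3) and the final category decision can be sketched as below. The equal weights (0.5, 0.5), the category labels, and the thresholds 80 and 60 are hypothetical values chosen for illustration; the disclosure does not specify them. The input scores 72 and 86 are likewise sample values.

```python
def target_score(content_fusion, device_fusion, first_weight_vector=(0.5, 0.5)):
    """Formula (3): fuse the content-related and device-related scores."""
    w_content, w_device = first_weight_vector
    return content_fusion * w_content + device_fusion * w_device


def quality_category(score, thresholds=((80.0, "high"), (60.0, "medium"))):
    """Map a target score to a category using assumed, descending thresholds."""
    for lower_bound, label in thresholds:
        if score >= lower_bound:
            return label
    return "low"


# Sample inputs: a content-related score of 72 and a device-related score of 86.
final_score = target_score(72.0, 86.0)
category = quality_category(final_score)
print(final_score, category)
```

Because the fusion uses only scores computed from the target signal itself, no lossless reference recording is needed at any stage.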
In the above sound quality detection system, the quality of the target audio signal can be detected without acquiring the original lossless audio signal corresponding to it. In addition, the target evaluation information used to evaluate the sound quality of the target audio signal is derived from multi-dimensional properties of the target audio signal itself, so the target audio signal is examined from all angles and the sound quality of the audio signal can ultimately be characterized accurately.
It should be understood that although the steps in the flowcharts of FIGS. 1-7 are shown in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated herein, there is no strict order in which these steps must be executed, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 1-7 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; nor are they necessarily executed one after another, and they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
FIG. 8 is a block diagram of a sound quality detection apparatus according to an exemplary embodiment. Referring to FIG. 8, the apparatus includes an audio signal acquisition unit 801, a first detection unit 802, a second detection unit 803, a target evaluation information determination unit 804, and a sound quality detection unit 805, as follows:
the audio signal acquisition unit 801 is configured to acquire a target audio signal;
the first detection unit 802 is configured to detect the audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;
the second detection unit 803 is configured to detect the audio acquisition signal corresponding to the target audio signal to obtain second evaluation information related to the audio acquisition signal;
the target evaluation information determination unit 804 is configured to fuse the first evaluation information and the second evaluation information according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, where the target evaluation information is related to the sound quality of the target audio signal;
the sound quality detection unit 805 is configured to determine the sound quality category of the target audio signal according to the target evaluation information.
In an exemplary embodiment, the first detection unit 802 is further configured to: classify the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal; and detect the target audio signal according to the audio classification result to obtain the first evaluation information related to the audio content signal.
In an exemplary embodiment, the first detection unit 802 is further configured to: divide the target audio signal according to a first time length to obtain a first number of audio segments; for each audio segment, classify the content of the audio segment to obtain a second number of target categories corresponding to the audio segment and the target probability that the audio segment belongs to each target category; and determine the second number of target categories and the second number of target probabilities corresponding to each audio segment as the audio classification result.
In an exemplary embodiment, the first detection unit 802 is further configured to: for each audio segment, detect the audio segment according to the second number of target categories to obtain a second number of pieces of segment content evaluation information related to the second number of target categories; determine the segment content evaluation information corresponding to the largest of the second number of target probabilities as the segment content evaluation information for that audio segment, or weight the second number of pieces of segment content evaluation information related to the audio segment using the second number of target probabilities as weight coefficients to obtain the segment content evaluation information for that audio segment, where the first number of audio segments correspond to the first number of pieces of segment content evaluation information; and fuse the first number of pieces of segment content evaluation information according to a second preset weight vector to obtain the first evaluation information.
In an exemplary embodiment, the second detection unit 803 is further configured to: detect the broken sound phenomenon corresponding to the target audio signal to obtain corresponding broken sound evaluation information; detect the external recording device corresponding to the target audio signal to obtain corresponding external recording evaluation information; and fuse the broken sound evaluation information and the external recording evaluation information according to a third preset weight vector to obtain the second evaluation information.
In an exemplary embodiment, the second detection unit 803 is further configured to: divide the target audio signal according to a second time length to obtain a third number of audio segments; for each audio segment, detect the degree of broken sound corresponding to the audio segment to obtain segment broken sound evaluation information corresponding to the audio segment, where the third number of audio segments correspond to the third number of pieces of segment broken sound evaluation information; and fuse the third number of pieces of segment broken sound evaluation information according to a fourth preset weight vector to obtain the broken sound evaluation information.
With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
FIG. 9 is a block diagram of an electronic device 900 for sound quality detection according to an exemplary embodiment. For example, the device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to FIG. 9, the device 900 may include one or more of the following components: a processing component 902, a memory 904, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls the overall operation of the device 900, such as operations associated with display, phone calls, data communication, camera operation, and recording. The processing component 902 may include one or more processors 920 that execute instructions to perform all or some of the steps of the method described above. In addition, the processing component 902 may include one or more modules that facilitate interaction between the processing component 902 and other components. For example, the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation of the device 900. Examples of such data include instructions for any application or method operating on the device 900, contact data, phonebook data, messages, pictures, videos, and the like. The memory 904 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 906 supplies power to the various components of the device 900. The power component 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 includes a screen that provides an output interface between the device 900 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. A touch sensor may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. When the device 900 is in an operating mode, such as a shooting mode or a video mode, the front-facing camera and/or the rear-facing camera may receive external multimedia data. Each front-facing or rear-facing camera may be a fixed optical lens system or may have focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a microphone (MIC) that is configured to receive external audio signals when the device 900 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 904 or transmitted via the communication component 916. In some embodiments, the audio component 910 also includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 914 includes one or more sensors that provide status assessments of various aspects of the device 900. For example, the sensor component 914 may detect the open/closed state of the device 900 and the relative positioning of components, such as the display and keypad of the device 900. The sensor component 914 may also detect a change in position of the device 900 or of one of its components, the presence or absence of user contact with the device 900, the orientation or acceleration/deceleration of the device 900, and changes in the temperature of the device 900. The sensor component 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the device 900 and other devices. The device 900 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 916 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 also includes a near field communication (NFC) module to facilitate short-range communication.
In an exemplary embodiment, the device 900 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the method described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 904 including instructions, where the instructions are executable by the processor 920 of the device 900 to perform the method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided. The program product includes a computer program stored in a readable storage medium, and at least one processor of a device reads the computer program from the readable storage medium and executes it, causing the device to perform the method described above.
All embodiments of the present disclosure may be implemented independently or in combination with other embodiments, and all such implementations are regarded as falling within the scope of protection claimed by the present disclosure.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the present disclosure. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (19)

  1. A sound quality detection method, comprising:
    acquiring a target audio signal;
    detecting an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;
    detecting an audio acquisition signal corresponding to the target audio signal to obtain second evaluation information related to the audio acquisition signal;
    fusing the first evaluation information and the second evaluation information according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal; and
    determining a sound quality category of the target audio signal according to the target evaluation information.
  2. The sound quality detection method according to claim 1, wherein the detecting an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal comprises:
    classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal; and
    detecting the target audio signal according to the audio classification result to obtain the first evaluation information related to the audio content signal.
  3. The sound quality detection method according to claim 2, wherein the classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal comprises:
    dividing the target audio signal according to a first time length to obtain a first number of audio segments;
    for each audio segment, classifying the audio segment content corresponding to the audio segment to obtain a second number of target categories corresponding to the audio segment and a target probability that the audio segment belongs to each target category; and
    determining the second number of target categories and the second number of target probabilities corresponding to each audio segment as the audio classification result.
  4. The sound quality detection method according to claim 3, wherein the detecting the target audio signal according to the audio classification result to obtain the first evaluation information related to the audio content signal comprises:
    for each audio segment, detecting the audio segment according to the second number of target categories to obtain a second number of pieces of segment content evaluation information related to the second number of target categories;
    determining the segment content evaluation information corresponding to the maximum probability value among the second number of target probabilities as the segment content evaluation information corresponding to the audio segment; or weighting the second number of pieces of segment content evaluation information related to the audio segment, using the second number of target probabilities as weight coefficients, to obtain the segment content evaluation information corresponding to the audio segment; wherein the first number of audio segments corresponds to a first number of pieces of segment content evaluation information; and
    fusing the first number of pieces of segment content evaluation information according to a second preset weight vector to obtain the first evaluation information.
  5. The sound quality detection method according to claim 1, wherein the detecting an audio collection signal corresponding to the target audio signal to obtain second evaluation information related to the audio collection signal comprises:
    detecting a broken-sound phenomenon corresponding to the target audio signal to obtain corresponding broken-sound evaluation information;
    detecting an external recording device corresponding to the target audio signal to obtain corresponding external recording evaluation information; and
    fusing the broken-sound evaluation information and the external recording evaluation information according to a third preset weight vector to obtain the second evaluation information.
  6. The sound quality detection method according to claim 5, wherein the detecting an audio collection signal corresponding to the target audio signal to obtain second evaluation information related to the audio collection signal comprises:
    dividing the target audio signal according to a second time length to obtain a third number of audio segments;
    for each audio segment, detecting the degree of broken sound corresponding to the audio segment to obtain broken-sound evaluation information corresponding to the audio segment, wherein the third number of audio segments corresponds to a third number of pieces of segment broken-sound evaluation information; and
    fusing the third number of pieces of segment broken-sound evaluation information according to a fourth preset weight vector to obtain the broken-sound evaluation information.
  7. A sound quality detection device, comprising:
    an audio signal acquisition unit configured to acquire a target audio signal;
    a first detection unit configured to detect an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;
    a second detection unit configured to detect an audio collection signal corresponding to the target audio signal to obtain second evaluation information related to the audio collection signal;
    a target evaluation information determination unit configured to fuse the first evaluation information and the second evaluation information according to a second preset weight vector to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal; and
    a sound quality detection unit configured to determine a sound quality category of the target audio signal according to the target evaluation information.
  8. The sound quality detection device according to claim 7, wherein the first detection unit is further configured to:
    classify the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal; and
    detect the target audio signal according to the audio classification result to obtain the first evaluation information related to the audio content signal.
  9. The sound quality detection device according to claim 8, wherein the first detection unit is further configured to:
    divide the target audio signal according to a first time length to obtain a first number of audio segments;
    for each audio segment, classify the audio segment content corresponding to the audio segment to obtain a second number of target categories corresponding to the audio segment and a target probability that the audio segment belongs to each target category; and
    determine the second number of target categories and the second number of target probabilities corresponding to each audio segment as the audio classification result.
  10. The sound quality detection device according to claim 9, wherein the first detection unit is further configured to:
    for each audio segment, detect the audio segment according to the second number of target categories to obtain a second number of pieces of segment content evaluation information related to the second number of target categories;
    determine the segment content evaluation information corresponding to the maximum probability value among the second number of target probabilities as the segment content evaluation information corresponding to the audio segment; or weight the second number of pieces of segment content evaluation information related to the audio segment, using the second number of target probabilities as weight coefficients, to obtain the segment content evaluation information corresponding to the audio segment; wherein the first number of audio segments corresponds to a first number of pieces of segment content evaluation information; and
    fuse the first number of pieces of segment content evaluation information according to a second preset weight vector to obtain the first evaluation information.
  11. The sound quality detection device according to claim 7, wherein the second detection unit is further configured to:
    detect a broken-sound phenomenon corresponding to the target audio signal to obtain corresponding broken-sound evaluation information;
    detect an external recording device corresponding to the target audio signal to obtain corresponding external recording evaluation information; and
    fuse the broken-sound evaluation information and the external recording evaluation information according to a third preset weight vector to obtain the second evaluation information.
  12. The sound quality detection device according to claim 11, wherein the second detection unit is further configured to:
    divide the target audio signal according to a second time length to obtain a third number of audio segments;
    for each audio segment, detect the degree of broken sound corresponding to the audio segment to obtain broken-sound evaluation information corresponding to the audio segment, wherein the third number of audio segments corresponds to a third number of pieces of segment broken-sound evaluation information; and
    fuse the third number of pieces of segment broken-sound evaluation information according to a fourth preset weight vector to obtain the broken-sound evaluation information.
  13. An electronic device, comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to execute the instructions to implement the following steps:
    acquiring a target audio signal;
    detecting an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;
    detecting an audio collection signal corresponding to the target audio signal to obtain second evaluation information related to the audio collection signal;
    fusing the first evaluation information and the second evaluation information according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal; and
    determining a sound quality category of the target audio signal according to the target evaluation information.
  14. The electronic device according to claim 13, wherein the processor is configured to execute the instructions to implement the following steps:
    classifying the audio content signal corresponding to the target audio signal to obtain an audio classification result corresponding to the audio content signal; and
    detecting the target audio signal according to the audio classification result to obtain the first evaluation information related to the audio content signal.
  15. The electronic device according to claim 14, wherein the processor is configured to execute the instructions to implement the following steps:
    dividing the target audio signal according to a first time length to obtain a first number of audio segments;
    for each audio segment, classifying the audio segment content corresponding to the audio segment to obtain a second number of target categories corresponding to the audio segment and a target probability that the audio segment belongs to each target category; and
    determining the second number of target categories and the second number of target probabilities corresponding to each audio segment as the audio classification result.
  16. The electronic device according to claim 15, wherein the processor is configured to execute the instructions to implement the following steps:
    for each audio segment, detecting the audio segment according to the second number of target categories to obtain a second number of pieces of segment content evaluation information related to the second number of target categories;
    determining the segment content evaluation information corresponding to the maximum probability value among the second number of target probabilities as the segment content evaluation information corresponding to the audio segment; or weighting the second number of pieces of segment content evaluation information related to the audio segment, using the second number of target probabilities as weight coefficients, to obtain the segment content evaluation information corresponding to the audio segment; wherein the first number of audio segments corresponds to a first number of pieces of segment content evaluation information; and
    fusing the first number of pieces of segment content evaluation information according to a second preset weight vector to obtain the first evaluation information.
  17. The electronic device according to claim 13, wherein the processor is configured to execute the instructions to implement the following steps:
    detecting a broken-sound phenomenon corresponding to the target audio signal to obtain corresponding broken-sound evaluation information;
    detecting an external recording device corresponding to the target audio signal to obtain corresponding external recording evaluation information; and
    fusing the broken-sound evaluation information and the external recording evaluation information according to a third preset weight vector to obtain the second evaluation information.
  18. The electronic device according to claim 17, wherein the processor is configured to execute the instructions to implement the following steps:
    dividing the target audio signal according to a second time length to obtain a third number of audio segments;
    for each audio segment, detecting the degree of broken sound corresponding to the audio segment to obtain broken-sound evaluation information corresponding to the audio segment, wherein the third number of audio segments corresponds to a third number of pieces of segment broken-sound evaluation information; and
    fusing the third number of pieces of segment broken-sound evaluation information according to a fourth preset weight vector to obtain the broken-sound evaluation information.
  19. A non-transitory computer-readable storage medium storing computer instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the following steps:
    acquiring a target audio signal;
    detecting an audio content signal corresponding to the target audio signal to obtain first evaluation information related to the audio content signal;
    detecting an audio collection signal corresponding to the target audio signal to obtain second evaluation information related to the audio collection signal;
    fusing the first evaluation information and the second evaluation information according to a first preset weight vector to obtain target evaluation information corresponding to the target audio signal, wherein the target evaluation information is related to the sound quality of the target audio signal; and
    determining a sound quality category of the target audio signal according to the target evaluation information.
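
For illustration only, the fusion steps recited in the claims above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names, the normalised weighted sum used for "fusing according to a preset weight vector", the reading of "broken sound" as samples at or near full scale, the `limit` value, and the 0.5 category threshold are all hypothetical choices that the claims do not specify.

```python
from typing import List, Sequence


def weighted_fusion(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Fuse scores with a preset weight vector.

    Assumption: fusion is a weighted sum normalised by the weight total.
    """
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


def split_segments(signal: Sequence[float], sample_rate: int,
                   seconds: float) -> List[Sequence[float]]:
    """Divide a signal into consecutive segments of a fixed time length."""
    step = max(1, int(sample_rate * seconds))
    return [signal[i:i + step] for i in range(0, len(signal), step)]


def segment_content_score(probs: Sequence[float], scores: Sequence[float],
                          use_max: bool = True) -> float:
    """Per-segment content score from per-category scores.

    use_max=True keeps the score of the most probable category;
    use_max=False weights every category score by its probability.
    """
    if use_max:
        best = max(range(len(probs)), key=lambda i: probs[i])
        return scores[best]
    return weighted_fusion(scores, probs)


def broken_sound_score(segment: Sequence[float], limit: float = 0.99) -> float:
    """Hypothetical detector: fraction of samples below the full-scale limit."""
    if not segment:
        return 1.0
    clipped = sum(1 for x in segment if abs(x) >= limit)
    return 1.0 - clipped / len(segment)


def quality_category(score: float, threshold: float = 0.5) -> str:
    """Map the fused target score to a category (threshold is an assumption)."""
    return "high" if score >= threshold else "low"


if __name__ == "__main__":
    # Toy signal: 8 samples at a hypothetical 4 Hz rate, 1-second segments.
    signal = [0.1, 0.2, 1.0, -1.0, 0.0, 0.3, 0.2, 0.1]
    segments = split_segments(signal, sample_rate=4, seconds=1.0)
    per_segment = [broken_sound_score(seg) for seg in segments]
    broken = weighted_fusion(per_segment, [1.0] * len(per_segment))
    content = segment_content_score([0.2, 0.8], [0.1, 0.9], use_max=False)
    target = weighted_fusion([content, broken], [0.5, 0.5])
    print(quality_category(target))
```

In this sketch the same `weighted_fusion` helper serves every preset weight vector (first through fourth), which keeps the per-segment, per-branch, and final fusion stages structurally identical; an actual embodiment could use a different fusion rule at each stage.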
PCT/CN2021/105044 2020-09-29 2021-07-07 Sound quality detection method and device WO2022068304A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011054305.2 2020-09-29
CN202011054305.2A CN112185421B (en) 2020-09-29 2020-09-29 Sound quality detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022068304A1 (en) 2022-04-07

Family

ID=73946047

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/105044 WO2022068304A1 (en) 2020-09-29 2021-07-07 Sound quality detection method and device

Country Status (2)

Country Link
CN (1) CN112185421B (en)
WO (1) WO2022068304A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185421B (en) * 2020-09-29 2023-11-21 北京达佳互联信息技术有限公司 Sound quality detection method and device, electronic equipment and storage medium
CN114374924B (en) * 2022-01-07 2024-01-19 上海纽泰仑教育科技有限公司 Recording quality detection method and related device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130041661A1 (en) * 2011-08-08 2013-02-14 Cellco Partnership Audio communication assessment
CN106558308A (en) * 2016-12-02 2017-04-05 深圳撒哈拉数据科技有限公司 A kind of internet audio quality of data auto-scoring system and method
CN107818797A (en) * 2017-12-07 2018-03-20 苏州科达科技股份有限公司 Voice quality assessment method, apparatus and its system
CN110277106A (en) * 2019-06-21 2019-09-24 北京达佳互联信息技术有限公司 Audio quality determines method, apparatus, equipment and storage medium
CN112185421A (en) * 2020-09-29 2021-01-05 北京达佳互联信息技术有限公司 Sound quality detection method, device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716470B (en) * 2012-09-29 2016-12-07 华为技术有限公司 The method and apparatus of Voice Quality Monitor
JP6163468B2 (en) * 2014-08-25 2017-07-12 日本電信電話株式会社 Sound quality evaluation apparatus, sound quality evaluation method, and program
JP2016180965A (en) * 2015-03-25 2016-10-13 ヤマハ株式会社 Evaluation device and program
CN107945788B (en) * 2017-11-27 2021-11-02 桂林电子科技大学 Method for detecting pronunciation error and scoring quality of spoken English related to text
CN109147804A (en) * 2018-06-05 2019-01-04 安克创新科技股份有限公司 A kind of acoustic feature processing method and system based on deep learning
CN110188356B (en) * 2019-05-30 2023-05-19 腾讯音乐娱乐科技(深圳)有限公司 Information processing method and device

Also Published As

Publication number Publication date
CN112185421B (en) 2023-11-21
CN112185421A (en) 2021-01-05

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21873966; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the EP bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.07.2023))
122 Ep: PCT application non-entry in European phase (Ref document number: 21873966; Country of ref document: EP; Kind code of ref document: A1)