CN115641873A

CN115641873A - Audio information evaluation method and device, electronic equipment and storage medium

Info

Publication number: CN115641873A
Application number: CN202211128483.4A
Authority: CN
Inventors: 魏耀都; 张晨; 郑羲光
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2022-09-16
Filing date: 2022-09-16
Publication date: 2023-01-24

Abstract

The present disclosure relates to an audio information evaluation method, apparatus, electronic device, and storage medium. The method comprises: performing singing level evaluation on the audio to be processed to obtain a singing level evaluation result corresponding to the audio to be processed; performing recording quality evaluation on the audio to be processed to obtain a recording quality evaluation result corresponding to the audio to be processed; and determining the evaluation result of the audio to be processed according to the singing level evaluation result and the recording quality evaluation result. Because the method and the device take into account the recording quality of the recording equipment, which is an inherent factor, they can improve the accuracy of the evaluation result.

Description

Audio information evaluation method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of audio processing technologies, and in particular, to an audio information evaluation method and apparatus, an electronic device, and a storage medium.

Background

With the development of terminal technology, karaoke scoring has become a standard feature of karaoke products. Karaoke scoring evaluates the user's karaoke audio to obtain a corresponding score.

In the related art, a multidimensional karaoke scoring approach can be used to describe the singing level of a user comprehensively. Such an approach generally gives a composite score across dimensions such as intonation, rhythm, breath, emotion, vocal range, skill, and voice.

Therefore, the related art evaluates only the singing level of the user and does not consider the influence of inherent factors such as the recording quality of the recording equipment, so the accuracy of the evaluation result is insufficient.

Disclosure of Invention

The present disclosure provides an audio information evaluation method, apparatus, electronic device, and storage medium, to at least solve the problem of insufficient accuracy of evaluation results in the related art. The technical solutions of the present disclosure are as follows:

According to a first aspect of the embodiments of the present disclosure, there is provided an audio information evaluation method, including:

performing singing level evaluation on the audio to be processed to obtain a singing level evaluation result corresponding to the audio to be processed;

performing recording quality evaluation on the audio to be processed to obtain a recording quality evaluation result corresponding to the audio to be processed;

and determining the evaluation result of the audio to be processed according to the singing level evaluation result and the recording quality evaluation result.

Optionally, the performing recording quality evaluation on the audio to be processed to obtain a recording quality evaluation result corresponding to the audio to be processed includes:

respectively carrying out recording quality evaluation on the audio to be processed according to at least one recording quality factor to obtain an evaluation result corresponding to each recording quality factor;

and according to the weight corresponding to each recording quality factor, carrying out weighted summation on the evaluation result corresponding to the at least one recording quality factor to obtain the recording quality evaluation result.
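As a minimal sketch of the weighted summation described above, the following Python function combines per-factor evaluation results into one recording quality evaluation result. The factor names, the 0-to-1 score range, and the weight values are illustrative assumptions; the text does not fix any of them.

```python
def recording_quality_score(factor_scores, weights):
    """Weighted sum of per-factor evaluation results.

    factor_scores: dict mapping a factor name to its evaluation result
    weights:       dict mapping the same factor names to their weights
    Both dicts and their keys are hypothetical, chosen for illustration.
    """
    missing = set(factor_scores) - set(weights)
    if missing:
        raise KeyError(f"no weight for factors: {sorted(missing)}")
    return sum(weights[name] * score for name, score in factor_scores.items())


# Example with made-up numbers for three factors.
example_scores = {"snr": 0.8, "bandwidth": 0.5, "defect": 1.0}
example_weights = {"snr": 0.5, "bandwidth": 0.25, "defect": 0.25}
example_result = recording_quality_score(example_scores, example_weights)
```

Keeping the weights in a separate mapping makes it easy to evaluate with only one factor (a single weight of 1) or with several.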

Optionally, the recording quality factor includes a signal-to-noise ratio;

the performing recording quality evaluation on the audio to be processed according to at least one recording quality factor to obtain an evaluation result corresponding to each recording quality factor includes:

dividing the audio to be processed into a plurality of first time segments according to a first time length;

determining a first time segment with voice in a plurality of first time segments to obtain at least one first voice segment;

for each first voice segment, determining the signal-to-noise ratio of a voice audio signal and a noise audio signal in the first voice segment;

and determining a signal-to-noise ratio evaluation result of the audio to be processed according to the signal-to-noise ratio of at least one first voice segment.

Optionally, the determining, for each first vocal segment, a signal-to-noise ratio of a vocal audio signal and a noise audio signal in the first vocal segment includes:

for each first vocal segment, performing blind source separation on the first vocal segment to obtain the vocal audio signal in the first vocal segment;

determining the vocal energy and the noise energy in the first vocal segment according to the vocal audio signal;

and determining the ratio of the vocal energy to the noise energy to obtain the signal-to-noise ratio of the vocal audio signal and the noise audio signal in the first vocal segment.

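Optionally, the determining, for each first vocal segment, a signal-to-noise ratio of a vocal audio signal and a noise audio signal in the first vocal segment includes:

determining each first time segment other than the first vocal segments among the plurality of first time segments as a noise segment to obtain at least one noise segment;

determining the average of the energy of the at least one noise segment as the noise energy;

for each first vocal segment, determining the energy of the first vocal segment as the vocal energy of the first vocal segment;

and determining the ratio of the vocal energy to the noise energy to obtain the signal-to-noise ratio of the vocal audio signal and the noise audio signal in the first vocal segment.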
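The noise-segment variant above can be sketched in Python as follows. This is only an illustration under stated assumptions: the voice/no-voice flags are taken as given (the text does not say how voice is detected), and "energy" is taken to be the mean squared amplitude so that segments of different lengths stay comparable.

```python
import math


def segment_energy(samples):
    """Mean squared amplitude of one segment (an assumed definition of energy)."""
    return sum(x * x for x in samples) / len(samples)


def snr_per_vocal_segment(segments, has_voice):
    """Return one signal-to-noise ratio (plain energy ratio) per vocal segment.

    segments:  list of sample lists, one per first time segment
    has_voice: parallel list of booleans from some voice detector (assumed given)
    """
    noise = [segment_energy(s) for s, v in zip(segments, has_voice) if not v]
    if not noise:
        raise ValueError("no noise segment available")
    # Noise energy is the average energy over all segments without voice.
    noise_energy = sum(noise) / len(noise)
    if noise_energy == 0:
        return [math.inf for v in has_voice if v]
    # Each vocal segment's own energy is used as its vocal energy.
    return [segment_energy(s) / noise_energy
            for s, v in zip(segments, has_voice) if v]
```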

Optionally, the determining, according to the signal-to-noise ratio of at least one first vocal segment, a signal-to-noise ratio evaluation result of the audio to be processed includes:

determining the average value of the signal-to-noise ratios of at least one first voice segment to obtain the average value of the signal-to-noise ratios;

and determining a signal-to-noise ratio evaluation result of the audio to be processed according to the signal-to-noise ratio mean value, a first signal-to-noise ratio threshold value and a second signal-to-noise ratio threshold value, wherein the first signal-to-noise ratio threshold value is smaller than the second signal-to-noise ratio threshold value.
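This passage states only that the result depends on the mean signal-to-noise ratio and two ordered thresholds; it does not give the mapping. The sketch below assumes one plausible mapping, a clamped linear ramp between the two thresholds, and the threshold values passed in are hypothetical.

```python
def threshold_score(value, low, high):
    """Map value to [0, 1]: 0 at or below low, 1 at or above high (assumed mapping)."""
    if not low < high:
        raise ValueError("low threshold must be smaller than high threshold")
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def snr_evaluation(segment_snrs, first_threshold, second_threshold):
    """Average the per-segment SNRs, then score the mean against the thresholds."""
    if not segment_snrs:
        raise ValueError("need at least one vocal segment")
    mean_snr = sum(segment_snrs) / len(segment_snrs)
    return threshold_score(mean_snr, first_threshold, second_threshold)
```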

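Optionally, the recording quality factor includes a bandwidth;

the performing recording quality evaluation on the audio to be processed according to at least one recording quality factor to obtain an evaluation result corresponding to each recording quality factor includes: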

dividing the audio to be processed into a plurality of second time segments according to a second time length;

determining a second time segment with voice in the plurality of second time segments to obtain at least one second voice segment;

for each second vocal segment, performing fundamental frequency detection on the second vocal segment frame by frame, determining the frames in which a fundamental frequency is detected as voiced signals, and determining the frames in which no fundamental frequency is detected as unvoiced signals;

determining a bandwidth of the unvoiced signal in the second vocal segment and determining a bandwidth of the voiced signal in the second vocal segment;

and determining the bandwidth evaluation result of the audio to be processed according to the bandwidth of the unvoiced signal and the bandwidth of the voiced signal corresponding to at least one second vocal segment.

Optionally, the determining the bandwidth of the unvoiced signal in the second vocal segment and the determining the bandwidth of the voiced signal in the second vocal segment include:

performing time-frequency transformation on each frame of unvoiced sound signal in the second vocal segment to obtain an amplitude spectrum of each frame of unvoiced sound signal, and performing time-frequency transformation on each frame of voiced sound signal in the second vocal segment to obtain an amplitude spectrum of each frame of voiced sound signal;

determining the bandwidth of the unvoiced signal according to the amplitude spectrum of each frame of the unvoiced signal;

and determining the bandwidth of the voiced sound signal according to the amplitude spectrum of each frame of the voiced sound signal.

Optionally, the determining the bandwidth of the unvoiced sound signal according to the magnitude spectrum of each frame of the unvoiced sound signal includes:

determining an average amplitude spectrum of the unvoiced sound signals according to the amplitude spectrum of each frame of the unvoiced sound signals, and using the average amplitude spectrum as an unvoiced sound amplitude spectrum;

determining a maximum amplitude value in the unvoiced sound amplitude spectrum as an unvoiced sound maximum amplitude value, and determining a preset proportion of the unvoiced sound maximum amplitude value as an unvoiced sound reference amplitude value;

determining the highest frequency value in the unvoiced amplitude spectrum whose amplitude value is larger than the unvoiced reference amplitude value as a first unvoiced frequency value;

determining the frequency value at which the amplitude falls fastest with increasing frequency in the unvoiced amplitude spectrum as a second unvoiced frequency value;

determining the smaller of the first unvoiced frequency value and the second unvoiced frequency value as the bandwidth of the unvoiced signal;

the determining the bandwidth of the voiced sound signal according to the amplitude spectrum of each frame of the voiced sound signal comprises the following steps:

determining an average amplitude spectrum of the voiced sound signal according to the amplitude spectrum of each frame of the voiced sound signal, and using the average amplitude spectrum as a voiced sound amplitude spectrum;

determining a maximum amplitude value in the voiced sound amplitude spectrum as a voiced sound maximum amplitude value, and determining a preset proportion of the voiced sound maximum amplitude value as a voiced sound reference amplitude value;

determining the highest frequency value in the voiced amplitude spectrum whose amplitude value is larger than the voiced reference amplitude value as a first voiced frequency value;

determining the frequency value at which the amplitude falls fastest with increasing frequency in the voiced amplitude spectrum as a second voiced frequency value;

determining the smaller of the first voiced frequency value and the second voiced frequency value as the bandwidth of the voiced signal.
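The same bandwidth procedure is applied to the unvoiced and the voiced frames, so one function can serve both. The sketch below assumes the per-frame amplitude spectra are already available (for example from a short-time Fourier transform), measures "falls fastest" as the largest linear drop between adjacent frequency bins, and uses a hypothetical default for the preset proportion; none of these choices is fixed by the text.

```python
def estimate_bandwidth(frame_spectra, freqs, proportion=0.01):
    """Estimate bandwidth in Hz from per-frame amplitude spectra.

    frame_spectra: list of amplitude spectra, one list of bin amplitudes per frame
    freqs:         bin centre frequencies in Hz, ascending, same length as a spectrum
    proportion:    preset proportion of the peak amplitude (hypothetical default)
    """
    if len(freqs) < 2 or not frame_spectra:
        raise ValueError("need at least one frame and two frequency bins")
    n_frames = len(frame_spectra)
    # Average amplitude spectrum over all frames.
    avg = [sum(frame[k] for frame in frame_spectra) / n_frames
           for k in range(len(freqs))]
    peak = max(avg)
    if peak == 0:
        return 0.0
    reference = proportion * peak
    # First candidate: highest frequency whose amplitude exceeds the reference.
    first = max(f for f, a in zip(freqs, avg) if a > reference)
    # Second candidate: lower edge of the steepest drop between adjacent bins
    # (assumed reading of "amplitude falls fastest with frequency").
    drops = [avg[k] - avg[k + 1] for k in range(len(avg) - 1)]
    steepest = max(range(len(drops)), key=drops.__getitem__)
    second = freqs[steepest]
    # The bandwidth is the smaller of the two candidates.
    return min(first, second)
```

Measuring the drop on a logarithmic (dB) spectrum instead of a linear one is an equally plausible reading and would only change how `drops` is computed.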

Optionally, the determining, according to the bandwidth of the unvoiced signal and the bandwidth of the voiced signal corresponding to the at least one second vocal segment, the bandwidth evaluation result of the audio to be processed includes:

for each second voice segment, determining a bandwidth evaluation result of the unvoiced sound signal according to the bandwidth of the unvoiced sound signal, the first unvoiced sound bandwidth threshold and the second unvoiced sound bandwidth threshold;

for each second voice segment, determining a bandwidth evaluation result of the voiced sound signal according to the bandwidth of the voiced sound signal, a first voiced sound bandwidth threshold value and a second voiced sound bandwidth threshold value;

for each second voice segment, determining an average value of the bandwidth evaluation result of the unvoiced sound signal and the bandwidth evaluation result of the voiced sound signal as the bandwidth evaluation result of the second voice segment;

and determining the average value of the bandwidth evaluation results of at least one second voice segment as the bandwidth evaluation result of the audio to be processed.
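The two-level averaging above can be sketched as follows. As with the signal-to-noise ratio, this passage does not say how a bandwidth and its two thresholds become an evaluation result, so the clamped linear mapping in `_ramp` is an assumption, and all threshold values are hypothetical.

```python
def _ramp(value, low, high):
    """Assumed mapping: 0 at or below low, 1 at or above high, linear in between."""
    return min(1.0, max(0.0, (value - low) / (high - low)))


def bandwidth_evaluation(segment_bandwidths, unvoiced_thresholds, voiced_thresholds):
    """Average bandwidth result over all second vocal segments.

    segment_bandwidths:  list of (unvoiced_bw_hz, voiced_bw_hz), one pair per segment
    unvoiced_thresholds: (first, second) unvoiced bandwidth thresholds, first < second
    voiced_thresholds:   (first, second) voiced bandwidth thresholds, first < second
    """
    if not segment_bandwidths:
        raise ValueError("need at least one vocal segment")
    per_segment = []
    for unvoiced_bw, voiced_bw in segment_bandwidths:
        unvoiced_result = _ramp(unvoiced_bw, *unvoiced_thresholds)
        voiced_result = _ramp(voiced_bw, *voiced_thresholds)
        # Segment result: mean of the unvoiced and voiced results.
        per_segment.append((unvoiced_result + voiced_result) / 2)
    # Audio result: mean over all segments.
    return sum(per_segment) / len(per_segment)
```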

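Optionally, the recording quality factor includes a recording defect;

the performing recording quality evaluation on the audio to be processed according to at least one recording quality factor to obtain an evaluation result corresponding to each recording quality factor includes: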

dividing the audio to be processed into a plurality of third time segments according to a third time length;

determining the maximum amplitude absolute value of the audio to be processed, determining a first proportion of the maximum amplitude absolute value as a first amplitude, and determining a second proportion of the maximum amplitude absolute value as a second amplitude; the first proportion is less than the second proportion;

for each third time segment, determining the number of data points in the third time segment whose amplitude absolute value lies between the first amplitude and the second amplitude as a first number of data points, and determining the number of data points in the third time segment whose amplitude absolute value lies between the second amplitude and the maximum amplitude absolute value as a second number of data points;

determining a third time segment with the second data point number larger than the first data point number as a recording defect segment;

and determining the recording defect evaluation result of the audio to be processed according to the number of recording defect segments and the total number of third time segments.
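A segment in which more samples sit near the peak than in the band just below it is a typical sign of clipping. The steps above can be sketched as follows; the two proportions, the segment length in samples, and the choice of which band boundaries are inclusive are assumptions for illustration, and the final ratio is only one way of combining the two counts.

```python
def recording_defect_ratio(samples, segment_len, first_prop=0.5, second_prop=0.9):
    """Fraction of segments flagged as recording defect segments.

    samples:     list of audio sample values
    segment_len: number of samples per third time segment (hypothetical unit)
    first_prop, second_prop: proportions of the peak, first_prop < second_prop
    """
    if not samples:
        raise ValueError("empty audio")
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return 0.0  # silent audio: nothing can be flagged
    first_amp, second_amp = first_prop * peak, second_prop * peak
    segments = [samples[i:i + segment_len]
                for i in range(0, len(samples), segment_len)]
    defects = 0
    for seg in segments:
        mags = [abs(x) for x in seg]
        # Samples in the band just below the near-peak band.
        first_count = sum(first_amp <= m < second_amp for m in mags)
        # Samples in the near-peak band.
        second_count = sum(second_amp <= m <= peak for m in mags)
        if second_count > first_count:
            defects += 1
    return defects / len(segments)
```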

Optionally, before the determining, for each third time segment, the number of data points whose amplitude absolute value lies between the first amplitude and the second amplitude, the method further includes:

determining, for each third time segment, a histogram of the third time segment according to the absolute value of the amplitude;

the determining, for each third time segment, the number of data points whose amplitude absolute value lies between the first amplitude and the second amplitude as a first number of data points includes:

determining the sum of the histogram between the first amplitude and the second amplitude as the first number of data points;

the determining the number of data points whose amplitude absolute value lies between the second amplitude and the maximum amplitude absolute value as a second number of data points includes:

determining the sum of the histogram between the second amplitude and the maximum amplitude absolute value as the second number of data points.
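The histogram variant can be sketched as below. It assumes equal-width bins over the range from 0 to the maximum amplitude absolute value and assumes that both proportions fall on bin edges (true for 0.5 and 0.9 with 10 bins); the bin count and the function name are hypothetical.

```python
def histogram_counts(segment, peak, first_prop=0.5, second_prop=0.9, n_bins=10):
    """Return (first_count, second_count) for one segment using a histogram.

    segment: sample values of one third time segment
    peak:    maximum amplitude absolute value of the whole audio, must be > 0
    """
    if peak <= 0:
        raise ValueError("peak must be positive")
    # Histogram of absolute amplitudes with n_bins equal-width bins on [0, peak].
    hist = [0] * n_bins
    for x in segment:
        k = min(int(abs(x) / peak * n_bins), n_bins - 1)
        hist[k] += 1
    # Bin indices where the two amplitude bands start.
    first_bin = round(first_prop * n_bins)
    second_bin = round(second_prop * n_bins)
    first_count = sum(hist[first_bin:second_bin])
    second_count = sum(hist[second_bin:])
    return first_count, second_count
```

Building the histogram once per segment lets both counts be read off as partial sums rather than by scanning the samples twice.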

Optionally, the determining the recording defect evaluation result of the audio to be processed according to the number of recording defect segments and the total number of third time segments includes:

determining the ratio of the number of recording defect segments to the total number as the recording defect evaluation result of the audio to be processed.

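Optionally, the performing recording quality evaluation on the audio to be processed to obtain a recording quality evaluation result corresponding to the audio to be processed includes:

acquiring a device identifier and a recording mode corresponding to the recording device of the audio to be processed;

and acquiring a hardware score corresponding to the device identifier and the recording mode, and determining the hardware score as the recording quality evaluation result corresponding to the audio to be processed.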
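The hardware-score variant amounts to a table lookup keyed by device and recording mode. The sketch below is illustrative only: the device identifiers, mode names, scores, and the idea of returning a default for unknown devices are all hypothetical.

```python
# Hypothetical pre-measured scores, keyed by (device identifier, recording mode).
HARDWARE_SCORES = {
    ("device-a", "built_in_mic"): 0.6,
    ("device-a", "wired_headset"): 0.8,
    ("device-b", "built_in_mic"): 0.7,
}


def hardware_score(device_id, recording_mode, table=HARDWARE_SCORES, default=None):
    """Look up the stored hardware score; return default when the pair is unknown."""
    return table.get((device_id, recording_mode), default)
```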

Optionally, the singing level evaluation of the audio to be processed to obtain a singing level evaluation result corresponding to the audio to be processed includes:

acquiring a fundamental frequency sequence of the audio to be processed, and acquiring a pitch template corresponding to the audio to be processed;

and determining an intonation evaluation result of the audio to be processed according to the fundamental frequency sequence and the pitch template, and taking the intonation evaluation result as the singing level evaluation result.

Optionally, the elements in the fundamental frequency sequence include time points and fundamental frequency values, and the pitch template includes start times, end times and template frequency values corresponding to notes;

the determining an intonation evaluation result of the audio to be processed according to the fundamental frequency sequence and the pitch template includes:

performing time alignment on the fundamental frequency sequence and the pitch template according to the fundamental frequency values and the template frequency values, and updating the time point corresponding to each fundamental frequency value to an aligned time point;

determining a deviation between the fundamental frequency value of each element and the template frequency value according to the aligned time point;

and determining the intonation evaluation result of the audio to be processed according to the deviation.

Optionally, the determining a deviation between the fundamental frequency value and the template frequency value of each element according to the aligned time point includes:

for each note, determining, as a target element, each element whose aligned time point lies between the start time and the end time of the note, and determining the ratio of the fundamental frequency value corresponding to the target element to the template frequency value of the note;

and determining the deviation between the fundamental frequency value corresponding to the target element and the template frequency value according to the ratio.

Optionally, the determining, according to the ratio, a deviation between a fundamental frequency value corresponding to the target element and the template frequency value includes:

if the ratio is smaller than or equal to a first ratio threshold value, determining that no deviation exists between the fundamental frequency value corresponding to the target element and the template frequency value;

if the ratio is greater than the first ratio threshold value and less than or equal to the value of the quadratic curve at the aligned time point of the target element, determining that no deviation exists between the fundamental frequency value corresponding to the target element and the template frequency value; the quadratic curve is defined between the start time and the end time, its values at the start time and at the end time both equal a second ratio threshold value, its value at the central time between the start time and the end time equals the first ratio threshold value, and the second ratio threshold value is greater than the first ratio threshold value;

and if the ratio is larger than the value of the quadratic curve corresponding to the alignment time point of the target element, determining the ratio as the deviation between the fundamental frequency value corresponding to the target element and the template frequency value.
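The three conditions on the quadratic curve determine it uniquely: a parabola equal to the first ratio threshold at the centre of the note and to the second ratio threshold at both ends, so the tolerance is widest at note boundaries. The sketch below implements that curve and the three-way rule; representing "no deviation" as 0.0 and the example threshold values are assumptions.

```python
def curve_value(t, start, end, first_threshold, second_threshold):
    """Quadratic tolerance curve over one note.

    Equals first_threshold at the centre of [start, end] and
    second_threshold at both ends.
    """
    center = (start + end) / 2
    half = (end - start) / 2
    u = (t - center) / half  # -1 at the start, 0 at the centre, +1 at the end
    return first_threshold + (second_threshold - first_threshold) * u * u


def note_deviation(ratio, t, start, end, first_threshold, second_threshold):
    """Return 0.0 when the ratio is within tolerance, otherwise the ratio itself."""
    if ratio <= first_threshold:
        return 0.0
    if ratio <= curve_value(t, start, end, first_threshold, second_threshold):
        return 0.0
    return ratio
```

For example, with hypothetical thresholds 1.03 and 1.06, a ratio of 1.05 counts as no deviation at the very start of a note but as a deviation at its centre.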

According to a second aspect of the embodiments of the present disclosure, there is provided an audio information evaluation apparatus including:

the singing level evaluation module is configured to execute singing level evaluation on the audio to be processed to obtain a singing level evaluation result corresponding to the audio to be processed;

the recording quality evaluation module is configured to perform recording quality evaluation on the audio to be processed to obtain a recording quality evaluation result corresponding to the audio to be processed;

and the comprehensive evaluation module is configured to determine the evaluation result of the audio to be processed according to the singing level evaluation result and the recording quality evaluation result.

Optionally, the recording quality evaluation module includes:

the recording quality factor evaluation unit is configured to perform recording quality evaluation on the audio to be processed according to at least one recording quality factor to obtain an evaluation result corresponding to each recording quality factor;

and the recording quality evaluation unit is configured to perform weighted summation on the evaluation result corresponding to the at least one recording quality factor according to the weight corresponding to each recording quality factor to obtain the recording quality evaluation result.

Optionally, the recording quality factor includes a signal-to-noise ratio;

the recording quality factor evaluation unit comprises:

a first segment dividing subunit configured to divide the audio to be processed into a plurality of first time segments according to a first time length;

a first vocal segment determination subunit configured to perform determining a first time segment having a vocal sound among a plurality of the first time segments, resulting in at least one first vocal segment;

a signal-to-noise ratio determination subunit configured to perform, for each of the first vocal segments, determining a signal-to-noise ratio of a vocal audio signal to a noise audio signal in the first vocal segment;

and the signal-to-noise ratio evaluation subunit is configured to determine a signal-to-noise ratio evaluation result of the audio to be processed according to the signal-to-noise ratio of at least one first voice segment.

Optionally, the signal-to-noise ratio determining subunit is configured to perform:

performing blind source separation on the first voice segment aiming at each first voice segment to obtain voice audio signals in the first voice segment;

determining first time segments except the first vocal segment in the plurality of first time segments as noise segments to obtain at least one noise segment;

Optionally, the snr evaluating subunit is configured to perform:

and determining a signal-to-noise ratio evaluation result of the audio to be processed according to the signal-to-noise ratio average value, a first signal-to-noise ratio threshold value and a second signal-to-noise ratio threshold value, wherein the first signal-to-noise ratio threshold value is smaller than the second signal-to-noise ratio threshold value.

Optionally, the recording quality factor includes a bandwidth;

the recording quality factor evaluation unit includes:

a second segment dividing subunit configured to perform dividing the audio to be processed into a plurality of second time segments according to a second time length;

a second voice segment determination subunit configured to perform determining a second time segment having voice in a plurality of the second time segments, resulting in at least one second voice segment;

a fundamental frequency detection subunit configured to perform, for each of the second human voice segments, fundamental frequency detection on the second human voice segments by frames, determine a frame in which a fundamental frequency is detected as a voiced signal, and determine a frame in which a fundamental frequency cannot be detected as an unvoiced signal;

a signal bandwidth determination subunit configured to perform determining a bandwidth of the unvoiced sound signal in the second vocal segment and determining a bandwidth of the voiced sound signal in the second vocal segment;

and the bandwidth evaluation subunit is configured to determine a bandwidth evaluation result of the audio to be processed according to the bandwidth of the unvoiced signal and the bandwidth of the voiced signal corresponding to the at least one second vocal segment.

Optionally, the signal bandwidth determining subunit includes:

the time-frequency transformation sub-module is configured to perform time-frequency transformation on each frame of unvoiced sound signals in the second vocal segment to obtain an amplitude spectrum of each frame of unvoiced sound signals, and perform time-frequency transformation on each frame of voiced sound signals in the second vocal segment to obtain an amplitude spectrum of each frame of voiced sound signals;

an unvoiced bandwidth determination sub-module configured to perform determination of a bandwidth of the unvoiced signal from a magnitude spectrum of the unvoiced signal for each frame;

a voiced bandwidth determination sub-module configured to determine the bandwidth of the voiced signal from the amplitude spectrum of each frame of the voiced signal.

Optionally, the unvoiced bandwidth determination sub-module is configured to perform:

determining the frequency value at which the amplitude falls fastest with increasing frequency in the unvoiced amplitude spectrum as a second unvoiced frequency value;

the voiced bandwidth determination sub-module is configured to perform:

determining the highest frequency value in the voiced amplitude spectrum whose amplitude value is larger than the voiced reference amplitude value as a first voiced frequency value;

determining the frequency value at which the amplitude falls fastest with increasing frequency in the voiced amplitude spectrum as a second voiced frequency value;

Optionally, the bandwidth evaluation subunit is configured to perform:

for each second voice segment, determining a bandwidth evaluation result of the unvoiced sound signal according to the bandwidth of the unvoiced sound signal, a first unvoiced sound bandwidth threshold and a second unvoiced sound bandwidth threshold;

Optionally, the recording quality factor includes recording defects;

the recording quality factor evaluation unit includes:

a third segment dividing subunit configured to perform dividing the audio to be processed into a plurality of third time segments according to a third time length;

an amplitude determination subunit configured to perform determining a maximum amplitude absolute value of the audio to be processed, and determining a first proportion of the maximum amplitude absolute value as a first amplitude, and determining a second proportion of the maximum amplitude absolute value as a second amplitude; the first proportion is less than the second proportion;

a data point number determination subunit configured to perform, for each of the third time segments, determining a number of data points of which the absolute value of the amplitude is between the first amplitude and the second amplitude in the third time segment as a first number of data points, and determining a number of data points of which the absolute value of the amplitude is between the second amplitude and the maximum absolute value of the amplitude in the third time segment as a second number of data points;

a defective segment determining subunit configured to perform determination of a third time segment in which the second data point number is greater than the first data point number as a recording defective segment;

and the recording defect evaluation subunit is configured to determine a recording defect evaluation result of the audio to be processed according to the number of the recording defect segments and the total number of the third time segments.

Optionally, the recording quality factor evaluation unit further includes:

a histogram determination subunit configured to determine, for each third time segment, a histogram of the third time segment according to the absolute value of the amplitude;

the data point number determination subunit is configured to perform:

determining, for each third time segment, the sum of the histogram between the first amplitude and the second amplitude as the first number of data points; and determining the sum of the histogram between the second amplitude and the maximum amplitude absolute value as the second number of data points.

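Optionally, the recording defect evaluation subunit is configured to perform:

determining the ratio of the number of recording defect segments to the total number as the recording defect evaluation result of the audio to be processed.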

Optionally, the recording quality evaluation module includes:

a device information acquisition unit configured to acquire a device identifier and a recording mode corresponding to the recording device of the audio to be processed;

and a recording quality evaluation result determining unit configured to acquire a hardware score corresponding to the device identifier and the recording mode, and determine the hardware score as the recording quality evaluation result corresponding to the audio to be processed.

Optionally, the singing level evaluation module includes:

a fundamental frequency information acquisition unit configured to acquire a fundamental frequency sequence of the audio to be processed and acquire a pitch template corresponding to the audio to be processed;

and a singing level evaluation unit configured to determine an intonation evaluation result of the audio to be processed according to the fundamental frequency sequence and the pitch template, and use the intonation evaluation result as the singing level evaluation result.

Optionally, the singing level evaluation unit includes:

a time alignment subunit configured to perform time alignment of the fundamental frequency sequence and the pitch template according to the fundamental frequency value and the template frequency value, and update a time point corresponding to the fundamental frequency value to an aligned time point;

a frequency deviation determination subunit configured to perform determining a deviation between the fundamental frequency value and the template frequency value for each of the elements according to the aligned time point;

an intonation evaluation subunit configured to determine an intonation evaluation result of the audio to be processed according to the deviation.

Optionally, the frequency deviation determination subunit includes:

a frequency ratio determination submodule configured to determine, for each note, as a target element, each element whose aligned time point lies between the start time and the end time of the note, and to determine the ratio of the fundamental frequency value corresponding to the target element to the template frequency value of the note;

and the frequency deviation determining submodule is configured to determine the deviation between the fundamental frequency value corresponding to the target element and the template frequency value according to the ratio.

Optionally, the frequency deviation determining submodule is configured to perform:

if the ratio is greater than the first ratio threshold value and less than or equal to the value of the quadratic curve at the aligned time point of the target element, determining that no deviation exists between the fundamental frequency value corresponding to the target element and the template frequency value; the quadratic curve is defined between the start time and the end time, its values at the start time and at the end time both equal a second ratio threshold value, its value at the central time between the start time and the end time equals the first ratio threshold value, and the second ratio threshold value is greater than the first ratio threshold value;

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the audio information evaluation method according to the first aspect.

According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions stored in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the audio information evaluation method according to the first aspect.

According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program or computer instructions, wherein the computer program or computer instructions, when executed by a processor, implement the audio information evaluation method of the first aspect.

The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:

The embodiments of the present disclosure determine the final evaluation result of the audio to be processed by combining the singing level evaluation result with the recording quality evaluation result of the audio to be processed. This takes into account the recording quality of the recording equipment, which is an inherent factor, and can improve the accuracy of the evaluation result.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

FIG. 1 is a flow diagram illustrating a method of audio information evaluation according to an exemplary embodiment;

FIG. 2 is a schematic diagram of a quadratic curve in an embodiment of the disclosure;

FIG. 3 is a flow diagram illustrating a method of audio information evaluation according to an exemplary embodiment;

FIG. 4 is a flowchart illustrating recording quality evaluation of audio to be processed with respect to signal-to-noise ratio according to an embodiment of the disclosure;

FIG. 5 is a flowchart illustrating recording quality evaluation of audio to be processed with respect to bandwidth in an embodiment of the disclosure;

FIG. 6 is a flowchart of recording quality evaluation of audio to be processed for recording defects in an embodiment of the present disclosure;

FIG. 7 is a block diagram illustrating an audio information evaluation apparatus according to an exemplary embodiment;

FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

FIG. 1 is a flowchart illustrating an audio information evaluation method according to an exemplary embodiment. The method may be used in an electronic device such as a mobile phone, a tablet computer, or a server and, as shown in FIG. 1, includes the following steps.

In step S11, performing singing level evaluation on the audio to be processed to obtain a singing level evaluation result corresponding to the audio to be processed.

The audio to be processed is the user's karaoke (K song) audio; it may be an entire song sung by the user, or a single sentence or word of the song. The embodiment of the disclosure scores the audio to be processed, and the audio information to be determined is the evaluation result of the audio to be processed, that is, its evaluation score.

Singing level evaluation of the audio to be processed mainly evaluates singing level dimensions such as intonation, rhythm, breath, emotion, vocal range, skill, and voice, and gives a comprehensive score over at least one singing level dimension to obtain the singing level evaluation result corresponding to the audio to be processed.

In step S12, the audio to be processed is subjected to recording quality evaluation, so as to obtain a recording quality evaluation result corresponding to the audio to be processed.

Both the singing level and the recording quality (the quality of the recording device) of the audio to be processed influence its evaluation result. For example, a dull timbre may be caused by insufficient breath when the user sings, but also by a recording device that cannot capture high-frequency signals; a low intonation score may be caused by the user singing off-key, but also by current noise interference in the recording device. Therefore, the embodiment of the present disclosure combines the singing level evaluation result and the recording quality evaluation result of the audio to be processed to give a final evaluation result.

When recording quality evaluation is performed on the audio to be processed, the recording quality may be evaluated according to preset recording quality factors related to recording quality to obtain the recording quality evaluation result corresponding to the audio to be processed; alternatively, the recording quality evaluation result may be determined based on the recording device of the audio to be processed.

In step S13, an evaluation result of the audio to be processed is determined according to the singing level evaluation result and the recording quality evaluation result.

The evaluation result of the audio to be processed is determined by combining the singing level evaluation result and the recording quality evaluation result. The two results may be weighted and summed based on preset weights to obtain the evaluation result of the audio to be processed; they may also be combined in other manners, such as polynomial fitting.

In an exemplary embodiment, the determining an evaluation result of the audio to be processed according to the singing level evaluation result and the recording quality evaluation result includes: and performing polynomial fitting on the singing level evaluation result and the recording quality evaluation result to obtain an evaluation result of the audio to be processed.

The singing level evaluation result and the recording quality evaluation result may be fused into the final evaluation result by polynomial fitting, with the evaluation result of the audio to be processed determined according to the following formula:

$$S = a_0 A^0 + a_1 A^1 + \dots + a_n A^n + b_0 B^0 + b_1 B^1 + \dots + b_m B^m$$

wherein $S$ is the evaluation result of the audio to be processed, $A$ is the singing level evaluation result, $B$ is the recording quality evaluation result, and $a_0, a_1, \dots, a_n$ and $b_0, b_1, \dots, b_m$ are polynomial coefficients. $n$ and $m$ may be preset.
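As an illustration only, the fusion formula can be evaluated with a few lines of Python. The function name `fuse_scores` and the coefficient values in the usage line are hypothetical and are not taken from the disclosure.

```python
def fuse_scores(singing, recording, a, b):
    """Evaluate S = sum_i a[i] * A**i + sum_j b[j] * B**j.

    singing   -- singing level evaluation result A
    recording -- recording quality evaluation result B
    a, b      -- coefficient lists [a_0..a_n] and [b_0..b_m]
    """
    singing_part = sum(coef * singing ** i for i, coef in enumerate(a))
    recording_part = sum(coef * recording ** j for j, coef in enumerate(b))
    return singing_part + recording_part


# Hypothetical coefficients with n = m = 1: reduces to a plain weighted sum.
score = fuse_scores(80.0, 60.0, a=[0.0, 0.7], b=[0.0, 0.3])
```

With n = m = 1 and zero constant terms, the polynomial collapses to the weighted sum mentioned for step S13; higher orders let the fused score respond nonlinearly to either input.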

The polynomial coefficients can be obtained by having a plurality of professional raters give comprehensive scores to singing audios of different singing levels and fitting the polynomial to those scores. For example, the process of determining the polynomial coefficients may include: acquiring a plurality of singing audios with different singing levels; acquiring the comprehensive scores given by a plurality of professional raters for each singing audio, and taking the average of the comprehensive scores of the same singing audio as the evaluation result of that singing audio; determining a singing level evaluation result and a recording quality evaluation result for each singing audio; and establishing a polynomial that fits the recording quality evaluation result and the singing level evaluation result of each singing audio to the evaluation result of that singing audio, and minimizing the mean square error through a fitting algorithm to obtain each polynomial coefficient.

After the polynomial coefficients are determined from the comprehensive scores given by the professional raters, the coefficients can be used, for each audio to be processed, to fuse the singing level evaluation result and the recording quality evaluation result into a final evaluation result by polynomial fitting. The evaluation result obtained in this way is consistent with the results given by the professional raters, so the accuracy of the evaluation result is improved.
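The coefficient fitting described above can be sketched as an ordinary least-squares problem. The sketch below is an assumption-laden illustration, not the disclosed fitting algorithm: it solves the normal equations with plain Gaussian elimination, and it merges the two constant terms into a single intercept `c`, because $A^0$ and $B^0$ are both equal to 1 and only the sum $a_0 + b_0$ can be identified from data. The function name and the sample data in the test are hypothetical.

```python
def fit_fusion_coefficients(singing, recording, targets, n=1, m=1):
    """Least-squares fit of S ~ c + sum_{i>=1} a_i A^i + sum_{j>=1} b_j B^j.

    Returns [c, a_1..a_n, b_1..b_m], where c stands for a_0 + b_0.
    """
    rows = [[1.0] + [s ** i for i in range(1, n + 1)]
                  + [r ** j for j in range(1, m + 1)]
            for s, r in zip(singing, recording)]
    k = len(rows[0])
    # Normal equations (X^T X) w = X^T y, stored as an augmented matrix.
    aug = [[sum(row[p] * row[q] for row in rows) for q in range(k)]
           + [sum(row[p] * y for row, y in zip(rows, targets))]
           for p in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(aug[idx][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    w = [0.0] * k
    for r in range(k - 1, -1, -1):
        tail = sum(aug[r][c] * w[c] for c in range(r + 1, k))
        w[r] = (aug[r][k] - tail) / aug[r][r]
    return w
```

Here `targets` would hold the averaged rater scores per singing audio. The sketch fails with a division by zero if the training data do not determine the coefficients uniquely (for example, fewer audios than coefficients); a production system would more likely rely on a numerical library.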

The audio information evaluation method provided by the exemplary embodiment determines the final evaluation result of the audio to be processed by combining the singing level evaluation result and the recording quality evaluation result of the audio to be processed, fully considers the influence of the inherent factor of the recording quality of the recording device, and can improve the accuracy of the evaluation result.

In an exemplary embodiment, the performing the recording quality evaluation on the audio to be processed to obtain a recording quality evaluation result corresponding to the audio to be processed includes: acquiring an equipment identifier and a recording mode corresponding to the recording equipment of the audio to be processed; and acquiring a hardware score corresponding to the equipment identifier and the recording mode, and determining the hardware score as a recording quality evaluation result corresponding to the audio to be processed.

Recording devices (hardware devices) on the market that record audio can be scored for recording quality in specific recording modes in advance, yielding a hardware score for each combination of device identifier and recording mode; the device identifiers, recording modes, and corresponding hardware scores are stored in a database. When recording quality evaluation is performed on the audio to be processed, the device identifier and the recording mode of the recording device used by the user to record the audio are acquired, the hardware score corresponding to that device identifier and recording mode is queried from the database, and the hardware score is determined as the recording quality evaluation result of the audio to be processed. The recording mode may include a standard mode, a conference mode, and the like.

By carrying out recording quality evaluation on various recording devices and recording modes in advance and storing the recording quality evaluation in the database, when the recording quality evaluation is carried out on the audio to be processed, the corresponding evaluation result can be directly obtained from the database, and the processing efficiency of the recording quality evaluation can be improved.
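A minimal sketch of the lookup, with an in-memory dictionary standing in for the database. The device identifiers, scores, and the `default` fallback for unknown devices are hypothetical and serve only to show the keying by (device identifier, recording mode).

```python
# Hypothetical table: (device identifier, recording mode) -> hardware score.
HARDWARE_SCORES = {
    ("phone-model-x", "standard"): 85.0,
    ("phone-model-x", "conference"): 70.0,
}


def recording_quality_from_device(device_id, mode,
                                  table=HARDWARE_SCORES, default=None):
    """Return the pre-computed hardware score, or `default` if none is stored."""
    return table.get((device_id, mode), default)
```

Returning `default` for an unknown device leaves the caller free to fall back to another recording quality evaluation, such as one based on recording quality factors.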

In an exemplary embodiment, the performing singing level evaluation on the audio to be processed to obtain a singing level evaluation result corresponding to the audio to be processed includes: obtaining a fundamental frequency sequence of the audio to be processed, and obtaining a tone template corresponding to the audio to be processed; and determining an intonation evaluation result of the audio to be processed according to the fundamental frequency sequence and the tone template, and taking the intonation evaluation result as the singing level evaluation result.

The audio to be processed can be the user's dry vocal, that is, the pure human voice without accompaniment, as recorded and without any spatial effects or post-processing. The tone template is a preset tone template for the song corresponding to the audio to be processed.

The fundamental frequency of the audio to be processed can be extracted by a time domain method or a frequency domain method to obtain the fundamental frequency sequence of the audio to be processed. A song identifier of the song corresponding to the audio to be processed is acquired, and the tone template corresponding to the song identifier is acquired. The fundamental frequency sequence is matched against the tone template, and the matching result is converted into a quantitative score according to a certain strategy to obtain an intonation evaluation result of the audio to be processed, which is taken as the singing level evaluation result.

It should be noted that, in addition to using the intonation evaluation result alone as the singing level evaluation result, the singing level evaluation may combine intonation with at least one of dimensions such as rhythm, breath, emotion, vocal range, skill, and voice, so as to obtain a more accurate singing level evaluation result.

Because the intonation evaluation result is determined from the fundamental frequency sequence and the tone template of the audio to be processed and used as the singing level evaluation result, the evaluation is simple and fast, and the efficiency of singing level evaluation can be improved.

On the basis of the technical scheme, the elements in the fundamental frequency sequence comprise time points and fundamental frequency values, and the tone template comprises start time, end time and template frequency values corresponding to the notes;

determining an intonation evaluation result of the audio to be processed according to the fundamental frequency sequence and the tone template includes: performing time alignment on the fundamental frequency sequence and the tone template according to the fundamental frequency value and the template frequency value, and updating the time point corresponding to the fundamental frequency value to an aligned time point; determining a deviation between the fundamental frequency value and the template frequency value for each of the elements according to the aligned time point; and determining the intonation evaluation result of the audio to be processed according to the deviation.

The fundamental frequency sequence is composed of a plurality of elements (one element is one data point in the fundamental frequency sequence), each element comprising a time point and a fundamental frequency value. The tone template may be in MIDI (Musical Instrument Digital Interface) format and contains a string of notes, each note containing information including, but not limited to, a start time, an end time, and a template frequency value. In the tone template, a note has a fixed template frequency value that lasts from its start time to its end time, whereas in the fundamental frequency sequence the fundamental frequency values of the same note fluctuate around the template frequency value.

The fundamental frequency sequence and the tone template are time-aligned according to the fundamental frequency values in the fundamental frequency sequence and the template frequency values in the tone template: a dynamic time warping algorithm stretches the durations of the elements corresponding to the same note in the fundamental frequency sequence to obtain a time mapping relation between the fundamental frequency sequence and the tone template, and the time point of each element in the fundamental frequency sequence is updated to an aligned time point according to the time mapping relation. The deviation between the fundamental frequency value of each element and the template frequency value at the same aligned time point is then determined, the deviations of all elements are aggregated, and a statistic of the deviations is determined as the intonation evaluation result of the audio to be processed. The statistic may be, for example, a mean or a standard deviation.

The deviation between the fundamental frequency value and the template frequency value is determined after the fundamental frequency sequence and the tone template are aligned in time, so that the obtained deviation is accurate, the intonation evaluation result of the audio to be processed is determined based on the deviation, and the accuracy of the intonation evaluation result can be improved.
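The alignment step can be illustrated with a textbook dynamic time warping routine. This is a sketch under several assumptions that are not stated in the disclosure: both the fundamental frequency sequence and the tone template are given as lists of (time, frequency) pairs with positive frequencies, the local cost is the absolute log2 frequency ratio, and each element takes the time of the latest template frame it is matched to. The function name `dtw_align` is hypothetical.

```python
import math


def dtw_align(f0, template):
    """Return, for each f0 element, the time of its matched template frame.

    f0, template -- lists of (time, frequency) pairs, frequencies > 0.
    """
    n, m = len(f0), len(template)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: distance between the two frequencies in octaves.
            d = abs(math.log2(f0[i - 1][1] / template[j - 1][1]))
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # Backtrack from the end to recover the warping path.
    aligned = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if aligned[i - 1] is None:
            aligned[i - 1] = template[j - 1][0]
        step = min((cost[i - 1][j - 1], i - 1, j - 1),
                   (cost[i - 1][j], i - 1, j),
                   (cost[i][j - 1], i, j - 1))
        i, j = step[1], step[2]
    return aligned
```

For a singer who enters a note early, the elements sung at the new pitch are mapped to the time at which the template reaches that note, so a later comparison at the aligned time point measures pitch error rather than timing error.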

On the basis of the above technical solution, said determining a deviation between the fundamental frequency value and the template frequency value of each of the elements according to the alignment time point includes:

The fundamental frequency values in the fundamental frequency sequence are compared with the template frequency values in the tone template note by note. Based on the start time and the end time of each note in the tone template, the elements of the fundamental frequency sequence whose aligned time points fall between the start time and the end time are determined as target elements. For each target element, the ratio of the corresponding fundamental frequency value to the template frequency value is determined, taking the larger of the two frequency values as the numerator and the smaller as the denominator. After the ratio is obtained, the deviation between the fundamental frequency value corresponding to each target element and the template frequency value may be determined based on the relationship between the ratio and preset thresholds.

For each note, the fundamental frequency value fluctuates above and below the template frequency value. Determining the ratio of the fundamental frequency value of each target element to the template frequency value, and then determining the deviation from that ratio, can further improve the accuracy of the deviation and therefore the accuracy of the intonation evaluation result.
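A minimal sketch of the target-element selection and ratio computation for one note. The function name is hypothetical, and skipping elements with a non-positive fundamental frequency (for example, unvoiced frames) is an added assumption that the disclosure does not mention.

```python
def ratios_for_note(elements, start, end, template_freq):
    """Return [(aligned_time, ratio)] for the elements inside one note.

    elements      -- list of (aligned_time, fundamental_frequency) pairs
    start, end    -- start and end time of the note in the tone template
    template_freq -- template frequency value of the note
    """
    out = []
    for t, f0 in elements:
        # Assumption: frames with no detected pitch (f0 <= 0) are skipped.
        if start <= t <= end and f0 > 0:
            # Larger value as numerator, so the ratio is always >= 1.
            ratio = max(f0, template_freq) / min(f0, template_freq)
            out.append((t, ratio))
    return out
```

Because the larger value is always the numerator, singing sharp and singing flat by the same musical interval produce the same ratio.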

On the basis of the above technical solution, the determining, according to the ratio, a deviation between a fundamental frequency value corresponding to the target element and the template frequency value includes:

The first ratio threshold is a value greater than 1 and less than 1.5, the second ratio threshold is a value greater than 1 and less than 1.5, and the second ratio threshold is greater than the first ratio threshold, for example, the first ratio threshold may be 1.059, and the second ratio threshold may be 1.25. Fig. 2 is a schematic diagram of a quadratic curve in the embodiment of the present disclosure, and as shown in fig. 2, the quadratic curve is determined based on a start time, an end time, a first ratio threshold and a second ratio threshold, when determining the quadratic curve, a center time between the start time and the end time is determined, the second ratio threshold c2 is determined as a value corresponding to the start time and the end time of the quadratic curve, and the first ratio threshold c1 is determined as a value corresponding to the center time of the quadratic curve.

Firstly, comparing the ratio with a first ratio threshold, if the ratio is smaller than the first ratio threshold, indicating that the fundamental frequency value fluctuates around the template frequency value, and the difference between the fundamental frequency value and the template frequency value is not large, and then determining that the fundamental frequency value of the target element is not deviated from the template frequency value; and for the target element with the ratio larger than the first ratio threshold value, further comparing the target element with the value of the quadratic curve, if the ratio corresponding to the target element is smaller than or equal to the value of the quadratic curve corresponding to the alignment time point of the target element, determining that the fundamental frequency value and the template frequency value corresponding to the target element are not deviated, and if the ratio corresponding to the target element is larger than the value of the quadratic curve corresponding to the alignment time point of the target element, determining that the fundamental frequency value and the template frequency value of the target element have a certain deviation, and determining the ratio as the deviation between the fundamental frequency value and the template frequency value corresponding to the target element.

A quadratic curve is set based on the starting time and the ending time of the notes, the ratio corresponding to the target element is compared with the value of the quadratic curve at the alignment time point, the deviation between the fundamental frequency value and the template frequency value is determined based on the comparison result, the fluctuation range of the fundamental frequency value is fully considered, the accuracy of the determined deviation can be improved, and the accuracy of the intonation evaluation result is further improved.
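The quadratic curve and the two-stage comparison can be sketched as follows. A parabola through the three stated points (second threshold at the start and end times, first threshold at the center time) is $c_1 + (c_2 - c_1)\,x^2$ with $x$ running from -1 to 1 across the note. The function names, and the choice of returning `None` to signal "no deviation", are illustrative assumptions; the default thresholds are the example values 1.059 and 1.25 given above.

```python
def quadratic_threshold(t, start, end, c1=1.059, c2=1.25):
    """Value at time t of the parabola equal to c2 at start/end and c1 at the center."""
    if end <= start:
        raise ValueError("end time must be later than start time")
    center = (start + end) / 2.0
    half = (end - start) / 2.0
    x = (t - center) / half  # -1 at start, 0 at center, +1 at end
    return c1 + (c2 - c1) * x * x


def deviation(ratio, t, start, end, c1=1.059, c2=1.25):
    """Return None when there is no deviation, otherwise the ratio itself."""
    if ratio < c1:
        return None  # within normal fluctuation around the template value
    if ratio <= quadratic_threshold(t, start, end, c1, c2):
        return None  # tolerated near the note boundaries
    return ratio
```

The curve is most tolerant at the note boundaries, where pitch transitions are expected, and strictest at the center of the note.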

Fig. 3 is a flowchart illustrating an audio information evaluation method according to an exemplary embodiment, which includes the following steps, as shown in fig. 3.

In step S31, a singing level evaluation is performed on the audio to be processed to obtain a singing level evaluation result corresponding to the audio to be processed.

In step S32, the audio to be processed is respectively subjected to recording quality evaluation according to at least one recording quality factor, so as to obtain an evaluation result corresponding to each recording quality factor.

The recording quality factor is a factor affecting the recording quality, and may be, for example, a signal-to-noise ratio, a bandwidth of an audio signal, a recording defect, and the like.

And for each recording quality factor, correspondingly processing the audio to be processed respectively, and performing recording quality evaluation on the recording quality factor to obtain an evaluation result corresponding to each recording quality factor.

In step S33, the evaluation results corresponding to the at least one recording quality factor are weighted and summed according to the weight corresponding to each recording quality factor, so as to obtain the recording quality evaluation results.

The weight corresponding to each recording quality factor can be preset according to the requirement. And then after the recording quality evaluation is carried out on the audio to be processed based on at least one recording quality factor to obtain an evaluation result corresponding to each recording quality factor, the evaluation results corresponding to at least one recording quality factor can be weighted and summed according to the weight corresponding to each recording quality factor to obtain a recording quality evaluation result. For example, three recording quality factors (signal-to-noise ratio, bandwidth of an audio signal, and recording defect) may be used to perform recording quality evaluation on the audio to be processed, where the weighting ratio of the three recording quality factors may be 1.
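A short sketch of the weighted summation in step S33. The factor names, scores, and weights are hypothetical placeholders; the disclosure only states that the weights are preset as required.

```python
def recording_quality(scores, weights):
    """Weighted sum of per-factor evaluation results.

    scores, weights -- dicts keyed by recording quality factor name.
    """
    return sum(scores[name] * weights[name] for name in scores)


# Hypothetical per-factor scores and weights (weights chosen to sum to 1).
result = recording_quality(
    {"snr": 90.0, "bandwidth": 60.0, "defect": 30.0},
    {"snr": 0.5, "bandwidth": 0.25, "defect": 0.25},
)
```

Choosing weights that sum to 1 keeps the combined result on the same scale as the individual factor scores.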

In step S34, an evaluation result of the audio to be processed is determined according to the singing level evaluation result and the recording quality evaluation result.

According to the audio information evaluation method provided by the exemplary embodiment, the recording quality evaluation is performed on the audio to be processed according to at least one recording quality factor to obtain the evaluation result corresponding to each recording quality factor, and then the evaluation results corresponding to at least one recording quality factor are subjected to weighted summation to obtain the recording quality evaluation result, so that the final recording quality evaluation result is obtained by combining the evaluation results of at least one recording quality factor, and the accuracy of the recording quality evaluation result can be improved.

In one exemplary embodiment, the sound recording quality factor includes a signal-to-noise ratio; fig. 4 is a flowchart of performing recording quality evaluation on an audio to be processed with respect to a signal-to-noise ratio in an embodiment of the present disclosure, and as shown in fig. 4, the performing recording quality evaluation on the audio to be processed according to at least one recording quality factor to obtain an evaluation result corresponding to each recording quality factor includes:

in step S41, the audio to be processed is divided into a plurality of first time segments according to a first time length.

Wherein, the first time length is a preset time length of a first time segment.

And dividing the audio to be processed according to the first time length to obtain a plurality of first time segments corresponding to the audio to be processed.

In step S42, a first time segment with voice in the plurality of first time segments is determined, and at least one first voice segment is obtained.

Human voice extraction is performed on each first time segment through a voice extraction algorithm or a voice extraction model. Each first time segment from which a human voice can be extracted is a first time segment with voice and is taken as a first voice segment, so that at least one first voice segment is obtained. A first time segment from which no human voice is extracted can be taken as a noise segment.
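Steps S41 and S42 can be sketched as below. The disclosure relies on a voice extraction algorithm or model that is not specified here, so the sketch takes the voice decision as a caller-supplied predicate `has_voice`; that parameter and the function names are hypothetical.

```python
def split_segments(samples, sample_rate, segment_seconds):
    """Divide a list of samples into consecutive fixed-length segments."""
    size = int(sample_rate * segment_seconds)
    if size <= 0:
        raise ValueError("segment length must cover at least one sample")
    # The final segment may be shorter than the others.
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def partition_by_voice(segments, has_voice):
    """Split segments into (voice segments, noise segments).

    has_voice -- callable standing in for the voice extraction step;
                 returns True when a human voice is found in the segment.
    """
    vocal = [seg for seg in segments if has_voice(seg)]
    noise = [seg for seg in segments if not has_voice(seg)]
    return vocal, noise
```

In a test one might pass a crude amplitude check such as `lambda seg: max(abs(x) for x in seg) > 0.5`; that is only a stand-in and not a substitute for a real voice extraction model.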

In step S43, for each of the first human voice segments, a signal-to-noise ratio of a human voice audio signal and a noise audio signal in the first human voice segment is determined.

Noise segments are excluded and the signal-to-noise ratio is calculated for the first vocal segment. And for each first human voice segment, respectively determining a human voice audio signal and a noise audio signal in the first human voice segment, and calculating the ratio of human voice energy corresponding to the human voice audio signal to noise energy corresponding to the noise audio signal to obtain the signal-to-noise ratio of the human voice audio signal and the noise audio signal in the first human voice segment.

In an exemplary embodiment, the determining, for each of the first vocal segments, a signal-to-noise ratio of a vocal audio signal to a noise audio signal in the first vocal segment includes: performing blind source separation on the first voice segment aiming at each first voice segment to obtain voice audio signals in the first voice segment; determining human voice energy and noise energy in the first human voice segment according to the human voice audio signal; and determining the ratio of the human voice energy to the noise energy to obtain the signal-to-noise ratio of the human voice audio signal and the noise audio signal in the first human voice segment.

In the blind source separation, under the condition that a source signal and a signal mixing parameter are not known, the source signal is estimated only according to an observed mixed signal.

The whole first human voice segment is used as a mixed signal, a human voice audio signal in the first human voice segment is used as a source signal, the first human voice segment comprises a human voice audio signal and a noise audio signal, the human voice audio signal in the first human voice segment is obtained by performing blind source separation on the first human voice segment, the energy corresponding to the human voice audio signal in the first human voice segment can be further determined, the human voice energy in the first human voice segment is obtained, the noise energy corresponding to the noise audio signal in the first human voice segment is obtained by subtracting the human voice energy from the energy of the whole first human voice segment, the ratio of the human voice energy to the noise energy is determined, and the signal-to-noise ratio of the human voice audio signal and the noise audio signal in the first human voice segment is obtained. The blind source separation may use an Independent Component Analysis (ICA) algorithm, for example.

The voice audio signals in the first voice segment are separated by performing blind source separation on the first voice segment, so that more accurate voice energy and noise energy can be obtained, and the accuracy of the signal to noise ratio can be improved.
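The energy bookkeeping of this embodiment can be sketched as follows. The separation itself (for example with ICA) is not shown: `vocal` is assumed to be the already separated human voice signal for the segment. Returning infinity when no noise energy remains is an added assumption for the degenerate case.

```python
def snr_from_separated(mixture, vocal):
    """Ratio of human voice energy to noise energy for one voice segment.

    mixture -- samples of the whole first voice segment
    vocal   -- samples of the separated human voice signal (assumed given)
    """
    total_energy = sum(x * x for x in mixture)
    vocal_energy = sum(x * x for x in vocal)
    # Noise energy is the segment energy minus the human voice energy.
    noise_energy = total_energy - vocal_energy
    if noise_energy <= 0:
        return float("inf")  # assumption: treat as noise-free
    return vocal_energy / noise_energy
```

The result is a plain energy ratio, matching the description; converting it to decibels would be a separate design choice.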

In another exemplary embodiment, the determining, for each of the first vocal segments, a signal-to-noise ratio of a vocal audio signal to a noise audio signal in the first vocal segment includes: determining first time segments except the first vocal segment in the plurality of first time segments as noise segments to obtain at least one noise segment; determining an average of the energy of at least one of the noise segments as the noise energy; for each first voice segment, determining the energy of the first voice segment as the voice energy of the first voice segment; and determining the ratio of the human voice energy to the noise energy to obtain the signal-to-noise ratio of the human voice audio signal and the noise audio signal in the first human voice segment.

The first time segments other than the first voice segments among the plurality of first time segments, that is, the first time segments without voice, are determined as noise segments, so that at least one noise segment is obtained. The energy of each noise segment is calculated, and the average of the energy of the at least one noise segment is determined as the noise energy of the noise audio signal in the first voice segments. The energy of each first voice segment is determined and taken as the human voice energy of the human voice audio signal in that first voice segment. For each first voice segment, the ratio of the human voice energy to the noise energy is calculated to obtain the signal-to-noise ratio of the human voice audio signal and the noise audio signal in that first voice segment.

The average energy value of at least one noise segment is used as noise energy, the energy of the first voice segment is used as voice energy, the calculated amount is low, and the determination efficiency of the signal to noise ratio can be improved.
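A minimal sketch of this lower-cost variant. The function name is hypothetical, and raising an error when the noise segments are silent is an added assumption, since the disclosure does not say how a zero noise energy is handled.

```python
def snr_per_voice_segment(voice_segments, noise_segments):
    """Signal-to-noise ratio of each voice segment against the mean noise energy."""
    def energy(segment):
        return sum(x * x for x in segment)

    # Noise energy: average energy over all noise segments.
    noise_energy = sum(energy(seg) for seg in noise_segments) / len(noise_segments)
    if noise_energy == 0:
        raise ValueError("noise energy is zero; ratio is undefined")
    # Human voice energy: energy of the whole voice segment.
    return [energy(seg) / noise_energy for seg in voice_segments]
```

Note that the human voice energy here still contains whatever noise is present in the voice segment, which is the price paid for avoiding source separation.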

In step S44, a signal-to-noise ratio evaluation result of the audio to be processed is determined according to a signal-to-noise ratio of at least one of the first vocal segments.

And counting the signal-to-noise ratio of at least one first voice segment, and determining a signal-to-noise ratio evaluation result of the audio to be processed according to the counting result.

In an exemplary embodiment, the determining, according to the signal-to-noise ratio of at least one of the first vocal segments, a signal-to-noise ratio evaluation result of the audio to be processed includes: determining the average value of the signal-to-noise ratio of at least one first voice segment to obtain the average value of the signal-to-noise ratio; and determining a signal-to-noise ratio evaluation result of the audio to be processed according to the signal-to-noise ratio mean value, a first signal-to-noise ratio threshold value and a second signal-to-noise ratio threshold value, wherein the first signal-to-noise ratio threshold value is smaller than the second signal-to-noise ratio threshold value.

The signal-to-noise ratios of the at least one first voice segment are averaged to obtain the signal-to-noise ratio mean. Two different signal-to-noise ratio thresholds are set, the first signal-to-noise ratio threshold being smaller than the second, and the signal-to-noise ratio mean is compared with each of them. If the signal-to-noise ratio mean is less than or equal to the first signal-to-noise ratio threshold, the signal-to-noise ratio evaluation result is determined to be 0 (that is, the signal-to-noise ratio evaluation score is 0); if the signal-to-noise ratio mean is greater than or equal to the second signal-to-noise ratio threshold, the signal-to-noise ratio evaluation result is determined to be the full score; if the signal-to-noise ratio mean is greater than the first signal-to-noise ratio threshold and less than the second signal-to-noise ratio threshold, the signal-to-noise ratio evaluation result is determined according to the following formula:

$$S_{snr} = S_{Fsnr} \cdot \frac{\overline{snr} - snr_{th1}}{snr_{th2} - snr_{th1}}$$

wherein $S_{snr}$ represents the signal-to-noise ratio evaluation result, $S_{Fsnr}$ represents the full score for the signal-to-noise ratio, $\overline{snr}$ represents the signal-to-noise ratio mean, $snr_{th1}$ represents the first signal-to-noise ratio threshold, and $snr_{th2}$ represents the second signal-to-noise ratio threshold.

By determining the average value of the signal-to-noise ratio of at least one first voice segment and determining the signal-to-noise ratio evaluation result of the audio to be processed based on the comparison result of the average value of the signal-to-noise ratio and the first signal-to-noise ratio threshold value and the second signal-to-noise ratio threshold value, a more accurate signal-to-noise ratio evaluation result can be obtained, and the accuracy of the signal-to-noise ratio evaluation is improved.
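The three-case scoring rule can be sketched as below. The sketch assumes that the intermediate case is a linear interpolation between the two thresholds, which is consistent with the 0 and full-score boundary cases; the function names and the default full score of 100 are illustrative assumptions.

```python
def threshold_score(value, th1, th2, full_score=100.0):
    """Map a value to [0, full_score] using two thresholds (th1 < th2)."""
    if th1 >= th2:
        raise ValueError("th1 must be smaller than th2")
    if value <= th1:
        return 0.0
    if value >= th2:
        return float(full_score)
    # Assumption: linear interpolation between the two thresholds.
    return full_score * (value - th1) / (th2 - th1)

def snr_score(segment_snrs, th1, th2, full_score=100.0):
    """Score the mean SNR of the voice segments."""
    if not segment_snrs:
        raise ValueError("at least one voice segment is required")
    mean_snr = sum(segment_snrs) / len(segment_snrs)
    return threshold_score(mean_snr, th1, th2, full_score)
```

For example, with thresholds 5 and 25, segment values 10 and 20 give a mean of 15, which lies halfway between the thresholds and therefore scores half of the full score.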

The audio to be processed is divided into a plurality of first time segments, at least one first voice segment among the first time segments is determined, the signal-to-noise ratio of each first voice segment is determined, and statistics are computed over the signal-to-noise ratios of the at least one first voice segment. The signal-to-noise ratio evaluation result of the audio to be processed is thereby obtained, realizing the signal-to-noise ratio evaluation of the audio to be processed.

In one exemplary embodiment, the recording quality factor includes a bandwidth; fig. 5 is a flowchart of performing recording quality evaluation on an audio to be processed according to bandwidth in the embodiment of the present disclosure, and as shown in fig. 5, performing recording quality evaluation on the audio to be processed according to at least one recording quality factor to obtain an evaluation result corresponding to each recording quality factor includes:

in step S51, the audio to be processed is divided into a plurality of second time segments according to a second time length.

The second time length is the preset length of a second time segment. The second time length may be the same as or different from the first time length.

And dividing the audio to be processed according to the second time length to obtain a plurality of second time segments corresponding to the audio to be processed.

In step S52, a second time segment with a voice in the plurality of second time segments is determined, so as to obtain at least one second voice segment.

Voice extraction is performed on each second time segment by a voice extraction algorithm or a voice extraction model. The second time segments from which voice can be extracted are the second time segments with voice, and these are taken as second voice segments, yielding at least one second voice segment.

In step S53, for each second voice segment, frame-by-frame fundamental frequency detection is performed on the second voice segment, a frame in which the fundamental frequency is detected is determined as a voiced signal, and a frame in which the fundamental frequency cannot be detected is determined as an unvoiced signal.

For each second voice segment, fundamental frequency detection is performed frame by frame. A frame in which no fundamental frequency can be detected is determined as an unvoiced signal and given an unvoiced signal flag (for example, 0); a frame in which a fundamental frequency is detected is determined as a voiced signal and given a voiced signal flag (for example, 1). The voiced/unvoiced determination obtained from fundamental frequency detection is then updated according to a preset fundamental frequency range (a typical range is 80 Hz to 1100 Hz): a voiced signal whose fundamental frequency lies outside this range is changed to an unvoiced signal, and its voiced signal flag is changed to an unvoiced signal flag.
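The flagging step can be sketched as follows, assuming some pitch detector has already produced one fundamental frequency estimate per frame and reports undetected frames as 0 or NaN. The detector itself, the function name `voicing_flags`, and that convention for "undetected" are assumptions; the 80 Hz to 1100 Hz range is the typical range quoted above.

```python
import numpy as np

def voicing_flags(f0_hz, f0_min=80.0, f0_max=1100.0):
    """Return 1 for voiced frames and 0 for unvoiced frames.

    f0_hz : per-frame fundamental frequency estimates in Hz, with 0 or NaN
            meaning "no fundamental frequency detected" (assumed convention).
    """
    f0 = np.asarray(f0_hz, dtype=np.float64)
    # Replace NaN/inf by 0 so that the comparisons below are well defined.
    f0 = np.where(np.isfinite(f0), f0, 0.0)

    detected = f0 > 0.0                         # step 1: was a pitch found?
    in_range = (f0 >= f0_min) & (f0 <= f0_max)  # step 2: range constraint
    # Frames with a detected pitch outside the range are demoted to unvoiced.
    return (detected & in_range).astype(np.int8)
```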

In step S54, the bandwidth of the unvoiced sound signal in the second vocal segment is determined, and the bandwidth of the voiced sound signal in the second vocal segment is determined.

The bandwidth of the unvoiced signal and the bandwidth of the voiced signal may be determined by applying a time-frequency transformation to the second voice segments and analyzing the transformation result.

In an exemplary embodiment, the determining the bandwidth of the unvoiced sound signal in the second vocal section and the determining the bandwidth of the voiced sound signal in the second vocal section includes: performing time-frequency transformation on each frame of unvoiced sound signal in the second vocal segment to obtain an amplitude spectrum of each frame of unvoiced sound signal, and performing time-frequency transformation on each frame of voiced sound signal in the second vocal segment to obtain an amplitude spectrum of each frame of voiced sound signal; determining the bandwidth of the unvoiced signal according to the amplitude spectrum of each frame of the unvoiced signal; and determining the bandwidth of the voiced sound signal according to the amplitude spectrum of each frame of the voiced sound signal.

Time-frequency transformation can be applied frame by frame to every second voice segment. If a second voice segment has N frames, the transformed segment also has N frames, each consisting of M complex frequency values (M being the number of data points in a frame); taking the magnitude of every complex frequency value gives an amplitude spectrum of dimension N × M. The N frames of the amplitude spectrum are then divided into two classes according to their unvoiced and voiced signal flags, namely unvoiced amplitude spectra and voiced amplitude spectra, which yields the amplitude spectrum of each unvoiced frame and the amplitude spectrum of each voiced frame.
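A minimal sketch of the framing, transformation and splitting is given below. The frame length, hop size and Hann window are assumptions, as are the function names. The sketch also uses a real-input FFT, which keeps only the M/2 + 1 non-redundant bins of each frame instead of all M complex values mentioned above.

```python
import numpy as np

def frame_magnitudes(segment, frame_len=1024, hop=512):
    """Return an (N, frame_len // 2 + 1) array of per-frame magnitudes."""
    x = np.asarray(segment, dtype=np.float64)
    if len(x) < frame_len:
        raise ValueError("segment is shorter than one frame")
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)  # assumption: Hann analysis window
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Magnitude of the complex spectrum of every frame.
    return np.abs(np.fft.rfft(frames, axis=1))

def split_by_flags(mags, flags):
    """Split frame magnitudes into (unvoiced, voiced) by 0/1 flags."""
    voiced = np.asarray(flags).astype(bool)
    if voiced.shape[0] != mags.shape[0]:
        raise ValueError("one flag per frame is required")
    return mags[~voiced], mags[voiced]
```

The flags are expected to be computed on the same frame grid as the spectra, so that row i of the magnitude array and flag i describe the same frame.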

After the amplitude spectrum of each frame of unvoiced signal is obtained, the bandwidth of the unvoiced signal can be determined based on the frequency value corresponding to the amplitude spectrum of each frame of unvoiced signal; after obtaining the magnitude spectrum of each frame of voiced sound signal, the bandwidth of the voiced sound signal may be determined based on the frequency value corresponding to the magnitude spectrum of each frame of voiced sound signal.

The time-frequency transformation is carried out on the unvoiced signal and the voiced signal, and the bandwidth of the unvoiced signal and the bandwidth of the voiced signal are determined based on the amplitude spectrum, so that the accurate bandwidth can be obtained.

In an exemplary embodiment, said determining the bandwidth of said unvoiced signal according to the magnitude spectrum of said unvoiced signal for each frame includes: determining an average amplitude spectrum of the unvoiced sound signals according to the amplitude spectrum of each frame of the unvoiced sound signals, and using the average amplitude spectrum as an unvoiced sound amplitude spectrum; determining a maximum amplitude value in the unvoiced sound amplitude spectrum as an unvoiced sound maximum amplitude value, and determining a preset proportion of the unvoiced sound maximum amplitude value as an unvoiced sound reference amplitude value; determining the highest frequency value of which the amplitude value is larger than the unvoiced reference amplitude value in the unvoiced amplitude spectrum as a first unvoiced frequency value; determining a frequency value with the amplitude falling fastest along with the frequency in the unvoiced sound amplitude spectrum as a second unvoiced sound frequency value; and determining the minimum frequency value in the first unvoiced sound frequency value and the second unvoiced sound frequency value as the effective bandwidth of the unvoiced sound signal.

The preset proportion is a proportion value preset for determining the bandwidth of the audio signal, and may be e^-6.

The amplitude spectra of all unvoiced frames are averaged to obtain the average amplitude spectrum of the unvoiced signal, which is taken as the unvoiced amplitude spectrum and denoted spectrum0. The maximum amplitude value in the unvoiced amplitude spectrum is determined as the unvoiced maximum amplitude value Max0, and a preset proportion (denoted R) of Max0 is determined as the unvoiced reference amplitude value, that is, the unvoiced reference amplitude value is R × Max0. Among the frequency values whose amplitude values in the unvoiced amplitude spectrum are larger than the unvoiced reference amplitude value, the highest frequency value f0 is taken as the first unvoiced frequency value.

To determine the second unvoiced frequency value, the frequency value currently being judged is taken as the current frequency value, and a first frequency value and a second frequency value bounding a frequency range centered on the current frequency value are determined, the first frequency value being smaller than the current frequency value and the second frequency value being larger. The first amplitude value corresponding to the first frequency value and the second amplitude value corresponding to the second frequency value are determined in the unvoiced amplitude spectrum, and the difference between the first amplitude value and the second amplitude value is determined as the amplitude drop value of the current frequency value. Each frequency value in the unvoiced amplitude spectrum is taken in turn as the current frequency value to obtain its amplitude drop value, and the frequency value with the largest amplitude drop value is determined as the second unvoiced frequency value ff0.

Finally, the first unvoiced frequency value f0 is compared with the second unvoiced frequency value ff0, and the smaller of the two is determined as the bandwidth of the unvoiced signal.

The highest frequency value whose amplitude value is larger than the unvoiced reference amplitude value is determined as the first unvoiced frequency value, the frequency value at which the amplitude falls fastest with frequency is determined as the second unvoiced frequency value, and the smaller of the two is determined as the effective bandwidth of the unvoiced signal. Both the amplitude values of the unvoiced signal and the change of amplitude with frequency are thus taken into account, which can improve the accuracy of the determined unvoiced signal bandwidth.
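The bandwidth rule above (the smaller of the "highest frequency above the reference amplitude" and the "frequency of steepest amplitude drop") can be sketched as follows. The function name, the `half_width` parameter that sets the size of the frequency range around the current frequency, and the use of linear amplitudes are assumptions; the same function applies equally to voiced frames.

```python
import numpy as np

def effective_bandwidth(mags, freqs, ratio=np.exp(-6.0), half_width=1):
    """Estimate bandwidth from per-frame magnitude spectra.

    mags       : (n_frames, n_bins) magnitudes of one class of frames
    freqs      : (n_bins,) frequency in Hz of each bin, ascending
    ratio      : preset proportion R of the maximum amplitude
    half_width : hypothetical half-size, in bins, of the range centred
                 on the current frequency
    """
    spectrum = np.asarray(mags, dtype=np.float64).mean(axis=0)
    freqs = np.asarray(freqs, dtype=np.float64)
    k = half_width
    if spectrum.max() <= 0.0 or len(spectrum) <= 2 * k:
        return 0.0  # no energy, or too few bins to evaluate a drop

    # First candidate: highest frequency with amplitude above R * Max.
    above = np.nonzero(spectrum > ratio * spectrum.max())[0]
    f_first = freqs[above[-1]]

    # Second candidate: drop[i] = amplitude k bins below minus amplitude
    # k bins above the bin i + k; pick the bin with the largest drop.
    drop = spectrum[: -2 * k] - spectrum[2 * k :]
    f_second = freqs[k + int(np.argmax(drop))]

    return float(min(f_first, f_second))
```

With linear amplitudes the steepest drop tends to sit next to the strongest spectral peak, so an implementation might prefer to evaluate the drop on a logarithmic spectrum; the text does not specify this, and the sketch follows the wording literally.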

In an exemplary embodiment, the determining the bandwidth of the voiced sound signal according to the magnitude spectrum of the voiced sound signal by each frame includes:

determining an average amplitude spectrum of the voiced sound signal according to the amplitude spectrum of each frame of the voiced sound signal, and using the average amplitude spectrum as a voiced sound amplitude spectrum; determining a maximum amplitude value in the voiced sound amplitude spectrum as a voiced sound maximum amplitude value, and determining a preset proportion of the voiced sound maximum amplitude value as a voiced sound reference amplitude value; determining the highest frequency value of which the amplitude value is larger than the voiced sound reference amplitude value in the voiced sound amplitude spectrum as a first voiced sound frequency value; determining a frequency value with the amplitude which is the fastest to descend along with the frequency in the voiced sound amplitude spectrum as a second voiced sound frequency value; determining a smallest frequency value of the first voiced frequency value and the second voiced frequency value as a bandwidth of the voiced sound signal.

The amplitude spectra of all voiced frames are averaged to obtain the average amplitude spectrum of the voiced signal, which is taken as the voiced amplitude spectrum and denoted spectrum1. The maximum amplitude value in the voiced amplitude spectrum is determined as the voiced maximum amplitude value Max1, and a preset proportion (denoted R) of Max1 is determined as the voiced reference amplitude value, that is, the voiced reference amplitude value is R × Max1. Among the frequency values whose amplitude values in the voiced amplitude spectrum are larger than the voiced reference amplitude value, the highest frequency value f1 is taken as the first voiced frequency value.

To determine the second voiced frequency value, the frequency value currently being judged is taken as the current frequency value, and a first frequency value and a second frequency value bounding a frequency range centered on the current frequency value are determined, the first frequency value being smaller than the current frequency value and the second frequency value being larger. The first amplitude value corresponding to the first frequency value and the second amplitude value corresponding to the second frequency value are determined in the voiced amplitude spectrum, and the difference between the first amplitude value and the second amplitude value is determined as the amplitude drop value of the current frequency value. Each frequency value in the voiced amplitude spectrum is taken in turn as the current frequency value to obtain its amplitude drop value, and the frequency value with the largest amplitude drop value is determined as the second voiced frequency value ff1.

Finally, the first voiced frequency value f1 is compared with the second voiced frequency value ff1, and the smaller of the two is determined as the bandwidth of the voiced signal.

The highest frequency value whose amplitude value is larger than the voiced reference amplitude value is determined as the first voiced frequency value, the frequency value at which the amplitude falls fastest with frequency is determined as the second voiced frequency value, and the smaller of the two is determined as the effective bandwidth of the voiced signal. Both the amplitude values of the voiced signal and the change of amplitude with frequency are thus taken into account, which can improve the accuracy of the determined voiced signal bandwidth.

In step S55, a bandwidth evaluation result of the audio to be processed is determined according to a bandwidth of an unvoiced signal and a bandwidth of a voiced signal corresponding to at least one of the second vocal segments.

The bandwidths of the unvoiced signals corresponding to the at least one second voice segment can be aggregated into a bandwidth statistic of the unvoiced signal, and a bandwidth evaluation result of the unvoiced signal is determined from the relation between this statistic and a bandwidth threshold. Likewise, the bandwidths of the voiced signals corresponding to the at least one second voice segment can be aggregated into a bandwidth statistic of the voiced signal, and a bandwidth evaluation result of the voiced signal is determined from the relation between this statistic and a bandwidth threshold. The average of the bandwidth evaluation result of the unvoiced signal and the bandwidth evaluation result of the voiced signal is then determined as the bandwidth evaluation result of the audio to be processed.

Alternatively, the bandwidth evaluation result of the unvoiced signal and the bandwidth evaluation result of the voiced signal are determined separately for each second voice segment, the bandwidth evaluation result of that second voice segment is determined from these two results, and the bandwidth evaluation result of the audio to be processed is then determined from the bandwidth evaluation results of the at least one second voice segment.

In an exemplary embodiment, the determining, according to a bandwidth of an unvoiced sound signal and a bandwidth of a voiced sound signal corresponding to at least one of the second vocal segments, a bandwidth evaluation result of the audio to be processed includes: for each second voice segment, determining a bandwidth evaluation result of the unvoiced sound signal according to the bandwidth of the unvoiced sound signal, a first unvoiced sound bandwidth threshold and a second unvoiced sound bandwidth threshold; for each second voice segment, determining a bandwidth evaluation result of the voiced sound signal according to the bandwidth of the voiced sound signal, a first voiced sound bandwidth threshold value and a second voiced sound bandwidth threshold value; for each second human voice segment, determining an average value of the bandwidth evaluation result of the unvoiced sound signal and the bandwidth evaluation result of the voiced sound signal as the bandwidth evaluation result of the second human voice segment; and determining the average value of the bandwidth evaluation results of at least one second voice segment as the bandwidth evaluation result of the audio to be processed.

For each second voice segment, when determining the bandwidth evaluation result of the unvoiced sound signal, a first unvoiced sound bandwidth threshold and a second unvoiced sound bandwidth threshold may be preset, where the first unvoiced sound bandwidth threshold is smaller than the second unvoiced sound bandwidth threshold, and if the bandwidth of the unvoiced sound signal is smaller than or equal to the first unvoiced sound bandwidth threshold, the bandwidth evaluation result of the unvoiced sound signal is determined to be 0 (that is, the bandwidth evaluation score of the unvoiced sound signal is 0); if the bandwidth of the unvoiced sound signal is greater than or equal to the second unvoiced sound bandwidth threshold, determining that the bandwidth evaluation result of the unvoiced sound signal is full score (namely that the bandwidth evaluation score of the unvoiced sound signal is full score); if the bandwidth of the unvoiced sound signal is greater than the first unvoiced sound bandwidth threshold and less than the second unvoiced sound bandwidth threshold, determining the bandwidth evaluation result of the unvoiced sound signal according to the following formula:

$$S_{BW\_unvoiced} = S_{BW} \cdot \frac{BW_{unvoiced} - BW_{unvoicedth1}}{BW_{unvoicedth2} - BW_{unvoicedth1}}$$

wherein $S_{BW\_unvoiced}$ represents the bandwidth evaluation result of the unvoiced signal, $S_{BW}$ represents the full score of the bandwidth evaluation, $BW_{unvoiced}$ represents the bandwidth of the unvoiced signal, $BW_{unvoicedth1}$ represents the first unvoiced bandwidth threshold, and $BW_{unvoicedth2}$ represents the second unvoiced bandwidth threshold.

For each second voice segment, when determining the bandwidth evaluation result of the voiced signal, a first voiced bandwidth threshold and a second voiced bandwidth threshold may be preset, where the first voiced bandwidth threshold is smaller than the second voiced bandwidth threshold. If the bandwidth of the voiced signal is less than or equal to the first voiced bandwidth threshold, the bandwidth evaluation result of the voiced signal is determined to be 0 (that is, the bandwidth evaluation score of the voiced signal is 0); if the bandwidth of the voiced signal is greater than or equal to the second voiced bandwidth threshold, the bandwidth evaluation result of the voiced signal is determined to be the full score (that is, the bandwidth evaluation score of the voiced signal is the full score); if the bandwidth of the voiced signal is greater than the first voiced bandwidth threshold and less than the second voiced bandwidth threshold, the bandwidth evaluation result of the voiced signal is determined according to the following formula:

$$S_{BW\_voiced} = S_{BW} \cdot \frac{BW_{voiced} - BW_{voicedth1}}{BW_{voicedth2} - BW_{voicedth1}}$$

wherein $S_{BW\_voiced}$ represents the bandwidth evaluation result of the voiced signal, $S_{BW}$ represents the full score of the bandwidth evaluation, $BW_{voiced}$ represents the bandwidth of the voiced signal, $BW_{voicedth1}$ represents the first voiced bandwidth threshold, and $BW_{voicedth2}$ represents the second voiced bandwidth threshold.

Determining the average value of the bandwidth evaluation results of the unvoiced sound signals and the bandwidth evaluation results of the voiced sound signals for each second voice segment, and determining the average value as the bandwidth evaluation result of the second voice segment; and then determining the average value of the bandwidth evaluation results of the at least one second voice segment as the bandwidth evaluation result of the audio to be processed.
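The per-segment and overall averaging can be sketched as below. As with the signal-to-noise ratio score, the intermediate case is assumed to be a linear interpolation between the two thresholds; the function names, the full score of 100 and the example thresholds are illustrative assumptions.

```python
def linear_score(value, th1, th2, full_score=100.0):
    """0 at or below th1, full_score at or above th2, linear in between."""
    if th1 >= th2:
        raise ValueError("th1 must be smaller than th2")
    if value <= th1:
        return 0.0
    if value >= th2:
        return float(full_score)
    return full_score * (value - th1) / (th2 - th1)  # assumed linear form

def bandwidth_score(segments, unvoiced_th, voiced_th, full_score=100.0):
    """Average bandwidth score over all second voice segments.

    segments    : list of (bw_unvoiced_hz, bw_voiced_hz) pairs
    unvoiced_th : (first, second) unvoiced bandwidth thresholds in Hz
    voiced_th   : (first, second) voiced bandwidth thresholds in Hz
    """
    if not segments:
        raise ValueError("at least one voice segment is required")
    per_segment = []
    for bw_unvoiced, bw_voiced in segments:
        s_u = linear_score(bw_unvoiced, unvoiced_th[0], unvoiced_th[1], full_score)
        s_v = linear_score(bw_voiced, voiced_th[0], voiced_th[1], full_score)
        per_segment.append((s_u + s_v) / 2.0)  # segment score: mean of the two
    return sum(per_segment) / len(per_segment)  # audio score: mean over segments
```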

The bandwidth evaluation result of the unvoiced signal is determined by comparing the bandwidth of the unvoiced signal with the first and second unvoiced bandwidth thresholds, and the bandwidth evaluation result of the voiced signal is determined by comparing the bandwidth of the voiced signal with the first and second voiced bandwidth thresholds; the average of the bandwidth evaluation results of the unvoiced signals and of the voiced signals over all second voice segments is then determined as the bandwidth evaluation result of the audio to be processed. A more accurate bandwidth evaluation result can thereby be obtained, improving the accuracy of the bandwidth evaluation.

The audio to be processed is divided into a plurality of second time segments, at least one second voice segment in the second time segments is determined, the bandwidth of an unvoiced signal and the bandwidth of a voiced signal in each second voice segment are determined, and then the bandwidth of the unvoiced signal and the bandwidth of the voiced signal of at least one second voice segment are counted, so that the bandwidth evaluation result of the audio to be processed can be obtained, and the bandwidth evaluation of the audio to be processed is realized.

In one exemplary embodiment, the recording quality factor includes a recording defect; fig. 6 is a flowchart of performing recording quality evaluation on the audio to be processed according to the recording defect in the embodiment of the present disclosure, and as shown in fig. 6, performing recording quality evaluation on the audio to be processed according to at least one recording quality factor to obtain an evaluation result corresponding to each recording quality factor includes:

in step S61, the audio to be processed is divided into a plurality of third time segments according to a third time length.

Wherein the third time length is a preset time length of a third time slice. The third length of time may be greater than the first length of time and the second length of time.

And dividing the audio to be processed according to the third time length so as to divide the waveform of the audio to be processed into shorter time segments and obtain a plurality of third time segments corresponding to the audio to be processed.

In step S62, determining a maximum amplitude absolute value of the audio to be processed, determining a first proportion of the maximum amplitude absolute value as a first amplitude, and determining a second proportion of the maximum amplitude absolute value as a second amplitude; the first ratio is less than the second ratio.

Counting the absolute value of the amplitude of the audio to be processed, determining the maximum absolute value of the amplitude of the audio to be processed, determining a first proportion of the maximum absolute value of the amplitude as a first amplitude, and determining a second proportion of the maximum absolute value of the amplitude as a second amplitude. The first ratio and the second ratio are both proportionality coefficients smaller than 1, for example, the first ratio may be 0.8, and the second ratio may be 0.9.

In step S63, for each third time segment, determining a number of data points in the third time segment, where the absolute value of the amplitude value is between the first amplitude value and the second amplitude value, as a first number of data points, and determining a number of data points in the third time segment, where the absolute value of the amplitude value is between the second amplitude value and the maximum absolute value, as a second number of data points.

And counting the number of data points of the absolute value of the amplitude between the first amplitude and the second amplitude as the number of first data points, and counting the number of data points of the absolute value of the amplitude between the second amplitude and the maximum absolute value as the number of second data points for each third time slice.

In step S64, a third time segment in which the second data point number is greater than the first data point number is determined as a recording defect segment.

Recording defect detection mainly targets popping sounds in the audio to be processed. A popping sound has a large absolute amplitude, so if the number of data points with a large absolute amplitude is high, a popping sound, that is, a recording defect, can be considered present. The first data point number is compared with the second data point number, and if the second data point number is greater than the first data point number, the corresponding third time segment is determined as a recording defect segment.

In step S65, a recording defect evaluation result of the audio to be processed is determined according to the number of the recording defect segments and the total number of the third time segments.

Determining the number of the recording defect fragments in the audio to be processed, determining the total number of the third time fragments in the audio to be processed, and determining the recording defect evaluation result of the audio to be processed based on the proportion of the recording defect fragments in all the third time fragments.

In an exemplary embodiment, determining a recording defect evaluation result of the audio to be processed according to the number of recording defect segments and the total number of the third time segments includes: and determining the ratio of the number of the recording defect fragments to the total number as a recording defect evaluation result of the audio to be processed.

And determining the proportion of the recording defect segments in all the third time segments, namely determining the ratio of the number of the recording defect segments to the total number, and determining the ratio as the recording defect evaluation result of the audio to be processed. The ratio of the number of the recording defect segments to the total number is determined as the recording defect evaluation result of the audio to be processed, so that a more accurate and quantized recording defect evaluation result can be obtained.
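Steps S61 to S65 can be sketched as follows. The function name, the handling of interval boundaries (lower bound inclusive) and the choice to drop a trailing partial segment are assumptions; the proportions 0.8 and 0.9 are the example values given above.

```python
import numpy as np

def defect_ratio(audio, segment_len, first_ratio=0.8, second_ratio=0.9):
    """Fraction of third time segments flagged as recording defect segments.

    audio       : 1-D array of samples
    segment_len : number of samples per third time segment
    """
    x = np.abs(np.asarray(audio, dtype=np.float64))
    n_segments = len(x) // segment_len  # assumption: drop a partial tail
    if n_segments == 0:
        raise ValueError("audio is shorter than one segment")
    peak = x.max()
    if peak == 0.0:
        return 0.0  # digital silence cannot contain popping sounds
    a1, a2 = first_ratio * peak, second_ratio * peak

    defects = 0
    for i in range(n_segments):
        seg = x[i * segment_len : (i + 1) * segment_len]
        first = np.count_nonzero((seg >= a1) & (seg < a2))  # between a1 and a2
        second = np.count_nonzero(seg >= a2)                # between a2 and peak
        if second > first:
            defects += 1
    return defects / n_segments
```

A higher ratio indicates more segments in which samples pile up near the peak amplitude; how this ratio is converted into a score for the overall evaluation is not covered by the sketch.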

In an exemplary embodiment, before determining, for each third time segment, the number of data points of the third time segment whose amplitude absolute value lies between the first amplitude value and the second amplitude value, the method further includes: determining, for each third time segment, a histogram of the third time segment according to the absolute values of the amplitude;

for each third time segment, determining the number of data points of the third time segment between the first amplitude value and the second amplitude value as a first number of data points, including: determining a sum of the histogram between the first and second magnitudes as the first number of data points;

determining, as a second data point count, a number of data points of the third time segment between the second amplitude and the absolute maximum amplitude, including: determining a sum of the histogram between the second magnitude and the maximum magnitude absolute value as the second number of data points.

For each third time segment, the number of data points at each amplitude absolute value is counted and represented as a histogram. To determine the first data point number, the histogram counts between the first amplitude and the second amplitude are summed. To determine the second data point number, the histogram counts between the second amplitude and the maximum amplitude absolute value are summed.

The histogram of the third time segment is determined according to the absolute value of the amplitude, and the number of the first data points and the number of the second data points are determined based on the histogram, so that the data processing efficiency can be improved, and the evaluation efficiency of recording defects can be improved.
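The histogram-based counting can be sketched with `numpy.histogram`. The number of bins, the function name and the mapping of the two proportions to bin indices are assumptions made for illustration.

```python
import numpy as np

def band_counts(segment, peak, n_bins=100, first_ratio=0.8, second_ratio=0.9):
    """Return (first data point number, second data point number).

    segment : samples of one third time segment
    peak    : maximum amplitude absolute value of the whole audio (> 0)
    n_bins  : hypothetical histogram resolution over [0, peak]
    """
    if peak <= 0.0:
        raise ValueError("peak must be positive")
    hist, _ = np.histogram(np.abs(segment), bins=n_bins, range=(0.0, peak))
    i1 = int(round(first_ratio * n_bins))   # first bin at or above a1
    i2 = int(round(second_ratio * n_bins))  # first bin at or above a2
    first = int(hist[i1:i2].sum())          # counts between a1 and a2
    second = int(hist[i2:].sum())           # counts between a2 and peak
    return first, second
```

Once the histogram exists, each count is a sum over a slice of bins, and the same histogram can be reused if other amplitude bands need to be inspected.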

Fig. 7 is a block diagram illustrating an audio information evaluation apparatus according to an exemplary embodiment. Referring to fig. 7, the apparatus includes a singing level evaluation module 71, a recording quality evaluation module 72, and a comprehensive evaluation module 73.

The singing level evaluation module 71 is configured to perform singing level evaluation on the audio to be processed to obtain a singing level evaluation result corresponding to the audio to be processed;

the recording quality evaluation module 72 is configured to perform recording quality evaluation on the audio to be processed to obtain a recording quality evaluation result corresponding to the audio to be processed;

the comprehensive evaluation module 73 is configured to determine an evaluation result of the audio to be processed according to the singing level evaluation result and the recording quality evaluation result.

Optionally, the recording quality evaluation module includes:

Optionally, the recording quality factor includes a signal-to-noise ratio;

the recording quality factor evaluation unit comprises:

a first vocal segment determination subunit configured to determine the first time segments that contain vocals among the plurality of first time segments, obtaining at least one first vocal segment;

Optionally, the signal-to-noise ratio evaluation subunit is configured to perform:

Optionally, the recording quality factor includes a bandwidth;

the recording quality evaluation unit includes:

a second segment dividing subunit configured to divide the audio to be processed into a plurality of second time segments according to a second time length;

Optionally, the signal bandwidth determining subunit includes:

a time-frequency transformation submodule configured to perform time-frequency transformation on each frame of the unvoiced signal in the second vocal segment to obtain a magnitude spectrum of each frame of the unvoiced signal, and to perform time-frequency transformation on each frame of the voiced signal in the second vocal segment to obtain a magnitude spectrum of each frame of the voiced signal;

a voiced-sound bandwidth determination submodule configured to determine the bandwidth of the voiced signal from the magnitude spectrum of each frame of the voiced signal.

determining the smaller of the first unvoiced frequency value and the second unvoiced frequency value as the bandwidth of the unvoiced signal;

the voiced-sound bandwidth determination submodule is configured to perform:

Optionally, the bandwidth evaluation subunit is configured to perform:

Optionally, the recording quality factor includes recording defects;

the recording quality evaluation unit includes:

a defective segment determination subunit configured to determine a third time segment in which the second number of data points is greater than the first number of data points as a recording defect segment;

and a recording defect evaluation subunit configured to determine a recording defect evaluation result of the audio to be processed according to the number of recording defect segments and the total number of third time segments.

Optionally, the recording quality evaluation unit further includes:

the data point number determination subunit is configured to perform:

determining, for each of the third time segments, the sum of the histogram between the first amplitude and the second amplitude as the first number of data points; and determining the sum of the histogram between the second amplitude and the maximum absolute amplitude as the second number of data points.

Optionally, the recording defect evaluation subunit is configured to perform:

Optionally, the recording quality evaluation module includes:

Optionally, the singing level evaluation module includes:

a fundamental frequency information acquisition unit configured to acquire a fundamental frequency sequence of the audio to be processed and to acquire the pitch template corresponding to the audio to be processed;

the singing level evaluation unit comprises:

Optionally, the frequency deviation determination subunit includes:

Optionally, the comprehensive evaluation module is configured to perform:

and performing polynomial fitting on the singing level evaluation result and the recording quality evaluation result to obtain an evaluation result of the audio to be processed.
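As a non-limiting illustration, one way of combining the two evaluation results by polynomial fitting may be sketched in Python as follows. The disclosure does not specify the polynomial terms, the fitting procedure, or any training data; the terms used here (a constant, each score, and their product), the least-squares fit through the normal equations, the score range of 0 to 1, and the synthetic target values are all assumptions made for this sketch only.

```python
def features(singing, recording):
    """Polynomial terms in the two scores (an assumed choice of terms)."""
    return [1.0, singing, recording, singing * recording]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_polynomial(samples):
    """Least-squares fit; each sample is (singing, recording, target score)."""
    rows = [features(s, q) for s, q, _ in samples]
    targets = [t for _, _, t in samples]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(ata, atb)


def evaluate(coeffs, singing, recording):
    """Apply the fitted polynomial to one pair of evaluation results."""
    return sum(c * f for c, f in zip(coeffs, features(singing, recording)))


# Synthetic, made-up targets on a 3x3 grid of scores (illustrative only).
grid = [0.0, 0.5, 1.0]
samples = [(s, q, 0.6 * s + 0.4 * s * q) for s in grid for q in grid]
coeffs = fit_polynomial(samples)
score = evaluate(coeffs, singing=0.8, recording=0.5)
```

In this sketch the product term lets the recording quality evaluation result scale the contribution of the singing level evaluation result, which is only one of many forms such a polynomial could take.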

With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.

FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 8, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power component 806 provides power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or may have focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor component 814 may also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described audio information evaluation method.

In an exemplary embodiment, a computer-readable storage medium comprising instructions is also provided, such as the memory 804 comprising instructions, where the instructions are executable by the processor 820 of the electronic device 800 to perform the audio information evaluation method described above. Optionally, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

In an exemplary embodiment, a computer program product is also provided, comprising a computer program or computer instructions, which when executed by a processor, implements the audio information evaluation method described above.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. An audio information evaluation method, comprising:

2. The method according to claim 1, wherein the performing the recording quality evaluation on the audio to be processed to obtain a recording quality evaluation result corresponding to the audio to be processed comprises:

respectively evaluating the recording quality of the audio to be processed according to at least one recording quality factor to obtain an evaluation result corresponding to each recording quality factor;

3. The method of claim 2, wherein the recording quality factor comprises a signal-to-noise ratio;

for each first vocal segment, determining a signal-to-noise ratio of a vocal audio signal and a noise audio signal in the first vocal segment;

4. The method according to claim 3, wherein the determining a signal-to-noise ratio evaluation result of the audio to be processed according to the signal-to-noise ratio of at least one of the first vocal segments comprises:

5. The method of claim 2, wherein the recording quality factor comprises bandwidth;

for each second vocal segment, performing fundamental frequency detection on the second vocal segment frame by frame, determining the frames in which a fundamental frequency is detected as voiced signals, and determining the frames in which no fundamental frequency is detected as unvoiced signals;

6. The method of claim 5, wherein determining the bandwidth of the unvoiced signal in the second vocal segment and determining the bandwidth of the voiced signal in the second vocal segment comprises:

7. The method of claim 6, wherein determining the bandwidth of the unvoiced signal according to the magnitude spectrum of each frame of the unvoiced signal comprises:

8. The method according to claim 5, wherein the determining a bandwidth evaluation result of the audio to be processed according to a bandwidth of an unvoiced signal and a bandwidth of a voiced signal corresponding to at least one of the second vocal segments comprises:

for each second vocal segment, determining the average of the bandwidth evaluation result of the unvoiced signal and the bandwidth evaluation result of the voiced signal as the bandwidth evaluation result of the second vocal segment;

9. The method of claim 2, wherein the recording quality factor comprises a recording defect;

for each third time segment, determining the number of data points in the third time segment whose absolute amplitude lies between the first amplitude and the second amplitude as a first number of data points, and determining the number of data points in the third time segment whose absolute amplitude lies between the second amplitude and the maximum absolute amplitude as a second number of data points;

10. The method of claim 9, wherein determining the recording defect evaluation result of the audio to be processed according to the number of recording defect segments and the total number of the third time segments comprises:

11. The method according to claim 1, wherein performing singing level evaluation on the audio to be processed to obtain a singing level evaluation result corresponding to the audio to be processed comprises:

12. The method of claim 11, wherein the elements in the fundamental frequency sequence comprise time points and fundamental frequency values, and wherein the pitch template comprises start times, end times, and template frequency values corresponding to notes;

performing time alignment between the fundamental frequency sequence and the pitch template according to the fundamental frequency values and the template frequency values, and updating the time point corresponding to each fundamental frequency value to an alignment time point;

determining a deviation between the fundamental frequency value and the template frequency value for each of the elements according to the alignment time point;

13. An audio information evaluation apparatus, comprising:

14. An electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the audio information evaluation method of any of claims 1 to 12.

15. A computer-readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the audio information evaluation method of any one of claims 1 to 12.