CN106548784B - Voice data evaluation method and system - Google Patents


Info

Publication number
CN106548784B
CN106548784B (application CN201510586445.7A)
Authority
CN
China
Prior art keywords
fundamental frequency
voice data
quantized
data
speech data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510586445.7A
Other languages
Chinese (zh)
Other versions
CN106548784A (en)
Inventor
傅鸿城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN201510586445.7A priority Critical patent/CN106548784B/en
Priority to PCT/CN2016/083043 priority patent/WO2017045428A1/en
Publication of CN106548784A publication Critical patent/CN106548784A/en
Application granted granted Critical
Publication of CN106548784B publication Critical patent/CN106548784B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The embodiment of the invention discloses a method and a system for evaluating voice data, applied to the technical field of information processing. In the method of the embodiment, the voice data evaluation system quantizes the voice data contained in a plurality of pieces of sound data of one accompaniment respectively, and then acquires and stores the optimal voice data of the accompaniment according to the plurality of pieces of quantized voice data. The preset standard data, namely the optimal voice data, is thus generated automatically by the voice data evaluation system, which makes it convenient for the system to evaluate the voice data of the accompaniment to be evaluated.

Description

Voice data evaluation method and system
Technical Field
The invention relates to the technical field of information processing, in particular to a method and a system for evaluating voice data.
Background
An existing evaluation system for voice data (such as songs) can evaluate voice data uploaded by a user and present the evaluation result to the user as a score. Specifically, the system compares the uploaded voice data with preset standard data and scores it according to the comparison result.
However, the standard data preset in a conventional voice data evaluation system is produced manually offline and then loaded into the system, which is costly, difficult, and slow to update.
Disclosure of Invention
The embodiment of the invention provides a method and a system for evaluating voice data, so that the preset standard data is generated automatically by the voice data evaluation system.
The embodiment of the invention provides a method for evaluating voice data, which comprises the following steps:
quantizing voice data contained in a plurality of pieces of sound data of an accompaniment respectively to obtain a plurality of pieces of quantized voice data;
clustering the quantized voice data to obtain the optimal voice data of the accompaniment;
and storing the optimal voice data of the accompaniment, wherein the optimal voice data is used for evaluating the voice data to be evaluated of the accompaniment.
An embodiment of the present invention further provides an evaluation system for voice data, including:
a first quantizing unit, configured to quantize, respectively, speech data included in a plurality of pieces of sound data of an accompaniment to obtain a plurality of pieces of quantized speech data;
the optimal acquisition unit is used for clustering a plurality of pieces of quantized voice data obtained by the first quantization unit to acquire optimal voice data of the accompaniment;
and the storage unit is used for storing the optimal voice data of the accompaniment acquired by the optimal acquisition unit, and the optimal voice data is used for evaluating the voice data to be evaluated of the accompaniment.
It can be seen that, in the method of this embodiment, the evaluation system of the speech data quantizes the speech data included in the multiple pieces of sound data of one accompaniment, and then obtains and stores the optimal speech data of the accompaniment according to the multiple pieces of quantized speech data. The preset standard data, i.e., the optimal speech data, is thus generated automatically by the evaluation system, which facilitates the evaluation of the speech data of the accompaniment to be evaluated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a method for presetting standard data in an evaluation method of voice data according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of evaluating speech data in an embodiment of the present invention;
FIG. 3 is a flowchart of a method for quantizing voice data included in sound data according to an embodiment of the present invention;
FIG. 4 is a flow chart of a method for obtaining optimal voice data in an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a system for evaluating speech data according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of another speech data evaluation system according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of another speech data evaluation system according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of another speech data evaluation system according to an embodiment of the present invention;
FIG. 9 is a flow chart of a method for extracting speech data from sound data in an embodiment of the present invention;
FIG. 10 is a flowchart of a method for warping a plurality of fundamental frequency sub-sequences in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the invention provides a method for evaluating voice data, which is mainly a method executed by a system for evaluating voice data, and a flow chart is shown in fig. 1, and comprises the following steps:
step 101, quantizing voice data included in a plurality of pieces of sound data of one accompaniment to obtain a plurality of pieces of quantized voice data, respectively.
It is understood that, for an accompaniment, any user can sing along with the accompaniment at least once and obtain at least one piece of sound data; one piece of sound data includes the accompaniment music and the user's singing data (i.e., voice data). In this embodiment, each user may upload his or her sound data to the evaluation system by operating it, and the evaluation system then processes the voice data included in each user's sound data separately. The evaluation system therefore needs to extract the voice data from the sound data first and then quantize it; the quantization standardizes the voice data into the default normalized representation used by the evaluation system, so as to facilitate subsequent processing.
Step 102, clustering the plurality of pieces of quantized voice data to obtain the optimal voice data of the accompaniment. Clustering is an analysis process that groups a set of physical or abstract objects into a plurality of classes composed of similar objects. In a concrete implementation, for each piece of quantized voice data, the distances between it and the other pieces of quantized voice data are calculated respectively, and the optimal voice data can then be obtained according to the calculated distances.
Step 103, storing the optimal voice data of the accompaniment, where the optimal voice data is the standard data used for evaluation and is mainly used for evaluating the voice data of the accompaniment to be evaluated.
Through the above steps 101 to 103, the system automatically presets the standard data used for evaluating voice data. Further, referring to fig. 2, the voice data of the accompaniment to be evaluated may be evaluated according to the following steps:
step 201, quantizing the voice data to be evaluated of an accompaniment to obtain quantized voice data to be evaluated, specifically, when a certain user sings according to an accompaniment to obtain a piece of voice data, and uploads the voice data to the evaluation system of the voice data to wait for the evaluation of the evaluation system, the evaluation system may extract the voice data to be evaluated from the uploaded voice data, and then quantizes the voice data to be evaluated according to the method of step 101.
Step 202, calculating a first distance between the quantized voice data to be evaluated obtained in step 201 and the optimal voice data, where the first distance may be a euclidean distance.
Step 203, determining the evaluation score of the voice data to be evaluated according to the first distance calculated in the step 202.
Specifically, the evaluation system of the voice data may map the first distance to a score between 0 and 100 as the evaluation score. To do so, a second distance may first be obtained: the largest distance to the optimal voice data obtained in step 102 among the plurality of pieces of quantized voice data obtained in step 101. Assuming the first distance is k and the second distance is m, the evaluation score determined by the evaluation system is 100 × (m - k)/m.
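For illustration only, the scoring rule of step 203 can be written as a small helper function; the clamping of the result to the range 0 to 100 and the handling of m = 0 are assumptions not spelled out in the text.

```python
def evaluation_score(first_distance: float, second_distance: float) -> float:
    """first_distance: the distance k between the quantized voice data to be evaluated
    and the optimal voice data; second_distance: the largest distance m between any
    stored piece of quantized voice data and the optimal voice data."""
    k, m = first_distance, second_distance
    if m <= 0:
        return 100.0  # degenerate case: every stored piece coincides with the optimum
    return max(0.0, min(100.0, 100.0 * (m - k) / m))
```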
Furthermore, the evaluation system of the voice data may also output the evaluation score, and may further output the positions where the voice data to be evaluated is inconsistent with the optimal voice data, so that the user can see intuitively where the singing falls short and improve in a targeted manner. In the process described above, the distance between each position of the voice data to be evaluated and the corresponding position of the optimal voice data is calculated; if the distance corresponding to a certain position is greater than a preset value, the voice data to be evaluated is inconsistent with the optimal voice data at that position.
It can be seen that, in the method of this embodiment, the evaluation system of the speech data quantizes the speech data included in the multiple pieces of sound data of an accompaniment, and then obtains and stores the optimal speech data of the accompaniment according to the multiple pieces of quantized speech data. The preset standard data, i.e., the optimal speech data, is thus generated automatically by the evaluation system, which facilitates the evaluation of the speech data of the accompaniment to be evaluated.
Referring to fig. 3, in a specific embodiment, when executing step 101, the evaluation system of voice data quantizes the voice data included in a certain piece of sound data to obtain a piece of quantized voice data, which may specifically be implemented by the following steps:
Step A1, extracting fundamental frequency information of the piece of sound data. Because the frequency produced by vocal cord vibration when a person utters a voice is filtered by the vocal tract and generates a large number of overtones, the evaluation system of voice data extracts, for convenience of subsequent operations, data that directly represent the fundamental tone, i.e., the vocal cord vibration frequency (the fundamental frequency information), from the voice data. Since the fundamental tone determines the pitch of a whole note, the fundamental frequency information extracted in this step can represent the voice data contained in a piece of sound data.
The extracted fundamental frequency information generally includes a fundamental frequency sequence, i.e. includes a plurality of fundamental frequency points, i.e. time points, and fundamental frequency values, i.e. frequency values, at each fundamental frequency point, the frequency values representing the speech intensity at the corresponding time point.
Step B1, converting the fundamental frequency information so that the fundamental frequency values included in the converted fundamental frequency information are small-range numerical values, i.e., values confined to a range smaller than that of the original fundamental frequency values, which makes subsequent processing easier to implement.
Specifically, the evaluation system of the voice data may directly convert the plurality of fundamental frequency values in the fundamental frequency information into small-range numerical values, for example by taking the logarithm of each fundamental frequency value.
The evaluation system of the voice data may also perform a second preprocessing on the fundamental frequency information first, and then convert the fundamental frequency values included in the preprocessed fundamental frequency information into small-range numerical values. The second preprocessing includes at least one of the following: low-pass filtering, compression, zeroing of singular fundamental frequency points, and filling of zero fundamental frequency points. The low-pass filtering may be median filtering or mean filtering. If median filtering is used and the length of the fundamental frequency sequence is smaller than a preset length (for example, 35 frames), the evaluation system performs median filtering over the actual length of the sequence; if the length of the fundamental frequency sequence is greater than or equal to the preset length, the evaluation system performs median filtering segment by segment with a fixed window, for example a 10-point median filter applied frame by frame.
The compression process may include: for the fundamental frequency sequence included in the fundamental frequency information, keeping the fundamental frequency value of one point out of every N (for example, 5) fundamental frequency points to form a new fundamental frequency sequence, which compresses the sequence by a factor of N.
The zeroing of singular fundamental frequency points may include: comparing the fundamental frequency value at a fundamental frequency point with the values at the points before and after it; if the difference exceeds a preset difference, the point is detected as a singular fundamental frequency point and its fundamental frequency value is set to zero.
The filling of zero fundamental frequency points may include: if the fundamental frequency values of several consecutive fundamental frequency points in the fundamental frequency sequence are zero and the run of consecutive points is shorter than a preset length (for example, 15 frames), the fundamental frequency values of these points are all set to the fundamental frequency value of the last non-zero point before them. Fundamental frequency points whose value is zero correspond to silent sections of the voice data, i.e., parts without a human voice.
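Purely as an illustration, the second preprocessing described above might look like the sketch below; the median-filter window, the singular-point threshold (here in raw Hz, since the log conversion happens later) and the function name are assumptions chosen to match the examples in the text (5-fold compression, 15-frame fill limit), not code from the patent.

```python
import numpy as np
from scipy.signal import medfilt

def second_preprocess(f0, compress_n=5, singular_diff=50.0, max_zero_run=15):
    """f0: 1-D array of fundamental frequency values (one per frame, 0 = silence)."""
    f0 = np.asarray(f0, dtype=float)
    # low-pass filtering: median filter with a window limited by the sequence length
    win = min(9, len(f0) if len(f0) % 2 else len(f0) - 1)
    if win >= 3:
        f0 = medfilt(f0, kernel_size=win)
    # compression: keep one value out of every compress_n fundamental frequency points
    f0 = f0[::compress_n]
    # singular-point zeroing: a point far from both neighbours is set to zero
    for i in range(1, len(f0) - 1):
        if (abs(f0[i] - f0[i - 1]) > singular_diff and
                abs(f0[i] - f0[i + 1]) > singular_diff):
            f0[i] = 0.0
    # zero-point filling: short zero runs inherit the preceding non-zero value
    i = 0
    while i < len(f0):
        if f0[i] == 0.0:
            j = i
            while j < len(f0) and f0[j] == 0.0:
                j += 1
            if i > 0 and (j - i) < max_zero_run:
                f0[i:j] = f0[i - 1]
            i = j
        else:
            i += 1
    return f0
```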
Step C1, quantizing the converted fundamental frequency information into a note sequence, or quantizing the converted fundamental frequency information after a first preprocessing into a note sequence; a piece of quantized speech data includes the information of this note sequence. The note sequence comprises a plurality of notes, and the information of each note comprises its start time, duration and pitch value: the start time is the start time of a fundamental frequency subsequence contained in the converted (or first-preprocessed) fundamental frequency information, the duration is the normalized length of that fundamental frequency subsequence, and the pitch value is the normalized frequency value of that fundamental frequency subsequence.
Here, the first preprocessing may include at least one of the following: low-pass filtering and three-point smoothing. The low-pass filtering may specifically be median filtering or mean filtering. The three-point smoothing may specifically include: if the difference between the fundamental frequency value at a point and the values at both the preceding and following points is greater than a preset difference (for example, 0.01), the value at that point is set to the value of whichever neighbouring point differs from it less.
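A minimal sketch of the three-point smoothing just described, assuming the values are already in the small-range (log) domain so that the 0.01 threshold from the text is meaningful; this is an illustration, not the patent's implementation.

```python
def three_point_smooth(f0, diff_threshold=0.01):
    """f0: sequence of fundamental frequency values (small-range domain). Returns a smoothed copy."""
    out = list(f0)
    for i in range(1, len(out) - 1):
        d_prev = abs(out[i] - out[i - 1])
        d_next = abs(out[i] - out[i + 1])
        # a point that differs from both neighbours by more than the threshold
        # is replaced by the value of the closer neighbour
        if d_prev > diff_threshold and d_next > diff_threshold:
            out[i] = out[i - 1] if d_prev <= d_next else out[i + 1]
    return out
```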
When performing the quantization of this step, the evaluation system may segment the fundamental frequency sequence in the converted (or first-preprocessed) fundamental frequency information to obtain a plurality of fundamental frequency subsequences. Specifically, if the difference between the fundamental frequency values of two adjacent fundamental frequency points is greater than a preset value, such as 0.05, a segment boundary is placed between the two points, which gives the end time of one fundamental frequency subsequence and the start time of the next. In this way the whole fundamental frequency sequence is segmented into a plurality of subsequences with their start and end times, and the length of each subsequence can be determined from the difference between its end time and start time.
The frequency value of each fundamental frequency subsequence is then represented by the median fundamental frequency value of that subsequence, and this median value is normalized to an integer between 0 and 24. Since the energy of the human voice spans about two octaves, and one octave comprises 12 semitones (so two octaves comprise 24 semitones), this operation maps the median fundamental frequency value of each subsequence to a note value.
Through the above operations a note sequence is obtained, which includes a plurality of notes; the information of each note includes its start time (the start time of the fundamental frequency subsequence), its duration (the normalized length of the subsequence) and its pitch value (the normalized median fundamental frequency value of the subsequence). The pitch value of each quantized note therefore has a precision of one semitone and a maximum value of 23, and its duration has a precision of 10 frames (i.e., 0.1 second) and a maximum value of 19.
Further, if at least two adjacent notes in the resulting note sequence have the same pitch value, they need to be merged into one note: the start time of the merged note is the earliest start time of the at least two notes, its duration is the sum of their durations, and its pitch value is the pitch value of any one of the merged notes.
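The sketch below ties the segmentation, normalization and merging steps together (later sketches reuse the Note type). The mapping of the median log2 value to a semitone integer between 0 and 24, the reference pitch, and the conversion of the segment length to a duration unit are my reading of the text and are assumptions, not the patent's code.

```python
from dataclasses import dataclass

@dataclass
class Note:
    start: int      # start time of the fundamental frequency subsequence
    duration: int   # normalized length, 0 to 20 (assumed: one unit = 10 frames = 0.1 s)
    pitch: int      # normalized pitch, 0 to 24 (one unit = one semitone)

def to_notes(f0_log2, seg_diff=0.05, pitch_ref=None):
    """f0_log2: fundamental frequency sequence in the log2 domain."""
    # 1. segment where adjacent values differ by more than seg_diff
    bounds = [0]
    for i in range(1, len(f0_log2)):
        if abs(f0_log2[i] - f0_log2[i - 1]) > seg_diff:
            bounds.append(i)
    bounds.append(len(f0_log2))
    # assumed reference pitch: the lowest voiced value of the sequence
    positive = [v for v in f0_log2 if v > 0]
    if pitch_ref is None:
        pitch_ref = min(positive) if positive else 0.0
    # 2. one note per subsequence: median value -> pitch, length -> duration
    notes = []
    for s, e in zip(bounds[:-1], bounds[1:]):
        if e <= s:
            continue
        seg = sorted(f0_log2[s:e])
        median = seg[len(seg) // 2]
        pitch = max(0, min(24, round(12 * (median - pitch_ref))))  # 12 semitones per octave
        duration = max(0, min(20, (e - s) // 10))  # adjust for any earlier compression factor
        notes.append(Note(start=s, duration=duration, pitch=pitch))
    # 3. merge adjacent notes with the same pitch value
    merged = []
    for n in notes:
        if merged and merged[-1].pitch == n.pitch:
            merged[-1].duration += n.duration
        else:
            merged.append(n)
    return merged
```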
Referring to fig. 4, in another specific embodiment, if there are n pieces of quantized speech data obtained in step 101, where n is a positive integer greater than 1, the evaluation system of speech data may implement step 102 through the following steps:
step a2, the distances between any two pieces of quantized speech data in the n pieces of quantized speech data are calculated, respectively.
It can be understood that a piece of quantized speech data obtained by the above operations includes the information of a note sequence, i.e., the start time, duration and pitch value of each of a plurality of notes. The distance D(Si, Sj) between a note Si in the first piece of quantized speech data of the n pieces of quantized speech data and a note Sj in the second piece of quantized speech data is calculated according to the following formula:
(the formula for D(Si, Sj) is reproduced only as an image, GDA0000842699270000071, in the original publication)
wherein:
Δp represents the pitch difference between the notes Si and Sj, Δp = min(abs(pi - pj), abs(pi - pj - 24) + 1.0, abs(pi - pj + 24) + 1.0), where abs denotes the absolute value, min denotes the minimum value, pi is the pitch value of note Si, and pj is the pitch value of note Sj;
Δd is the time difference between the notes Si and Sj, i.e., the difference between the durations of the two notes, and σ is a weight applied to the time difference, e.g., 0.4.
In this way, the distance between a note in one piece of quantized speech data (e.g., the first piece) and a note in another piece of quantized speech data (e.g., the second piece) is obtained; the distance between the first piece of quantized speech data and the second piece is then the maximum of the calculated distances between a note of the first piece and a note of the second piece.
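As an illustration, the Δp term and the piece-to-piece distance can be sketched as below, reusing the Note type from the earlier sketch. The way Δp and Δd are combined into D(Si, Sj) is shown only as an image in the source, so the weighted sum used here is purely an assumption; only the Δp expression, the σ weight and the "maximum over note pairs" rule are taken from the text.

```python
def delta_p(pi: int, pj: int) -> float:
    # pitch difference as defined in the text, wrapping around two octaves (24 semitones)
    return min(abs(pi - pj), abs(pi - pj - 24) + 1.0, abs(pi - pj + 24) + 1.0)

def note_distance(note_i, note_j, sigma: float = 0.4) -> float:
    # ASSUMPTION: the source gives D(Si, Sj) only as an image; a weighted sum is used here
    dp = delta_p(note_i.pitch, note_j.pitch)
    dd = abs(note_i.duration - note_j.duration)
    return dp + sigma * dd

def piece_distance(notes_a, notes_b, sigma: float = 0.4) -> float:
    # distance between two quantized pieces: the maximum note-to-note distance,
    # following the description in the text
    if not notes_a or not notes_b:
        return 0.0
    return max(note_distance(a, b, sigma) for a in notes_a for b in notes_b)
```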
Step B2, respectively calculating the sum of the distances between each piece of the n pieces of quantized voice data and the other n-1 pieces of quantized voice data, so that each piece of quantized voice data corresponds to one sum of distances.
Step C2, taking the piece of quantized speech data corresponding to the smallest sum of distances as the optimal speech data.
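A short sketch of steps A2 to C2, reusing the piece_distance helper assumed above: compute all pairwise distances, sum them per piece, and keep the piece with the smallest sum as the optimal speech data; returning the largest distance m to the optimum (needed later for scoring) is added here for convenience.

```python
def select_optimal(pieces, sigma: float = 0.4):
    """pieces: list of note sequences, one per piece of quantized speech data.
    Returns (index of the optimal piece, largest distance m to it, per-piece distance sums)."""
    n = len(pieces)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = piece_distance(pieces[i], pieces[j], sigma)
            dist[i][j] = dist[j][i] = d
    sums = [sum(row) for row in dist]
    best = min(range(n), key=lambda i: sums[i])
    m = max(dist[best])  # largest distance from any piece to the optimal one
    return best, m, sums
```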
An embodiment of the present invention further provides an evaluation system for voice data, a schematic structural diagram of which is shown in fig. 5, and the evaluation system specifically includes:
the first quantizing unit 10 is configured to quantize the speech data included in the plurality of pieces of sound data of one accompaniment to obtain a plurality of pieces of quantized speech data, respectively.
The first quantization unit 10 may extract the voice data from the sound data and then quantize it; the quantization standardizes the voice data into the default normalized representation used by the evaluation system, so as to facilitate subsequent processing.
An optimal obtaining unit 11, configured to cluster the multiple pieces of quantized speech data obtained by the first quantizing unit 10, and obtain optimal speech data of the accompaniment.
Specifically, the optimal acquisition unit 11 may calculate, for any one of the pieces of quantized speech data, a distance between the piece of quantized speech data and another piece of quantized speech data, and then may acquire the optimal speech data according to the calculated distance.
A storage unit 12, configured to store the optimal speech data of one accompaniment acquired by the optimal acquisition unit 11, where the optimal speech data is used to evaluate the speech data to be evaluated of the one accompaniment.
In the system of this embodiment, the first quantizing unit 10 quantizes the voice data included in multiple pieces of sound data of an accompaniment, and the optimal obtaining unit 11 then obtains the optimal voice data of the accompaniment from the multiple pieces of quantized voice data and stores it in the storage unit 12. The preset standard data, i.e., the optimal voice data, is thus generated automatically by the evaluation system, which facilitates the evaluation of the voice data of the accompaniment to be evaluated.
Referring to fig. 6, in a specific embodiment, the first quantization unit 10 in the speech data evaluation system may be specifically implemented by the information extraction unit 110, the conversion unit 120, and the information quantization unit 130, and the optimal acquisition unit 11 may be specifically implemented by the first calculation unit 111, the second calculation unit 121, and the optimal determination unit 131, where:
the information extracting unit 110 is configured to extract fundamental frequency information of the sound data, where the fundamental frequency information includes a fundamental frequency sequence, and the fundamental frequency sequence includes fundamental frequency values at a plurality of fundamental frequency points.
The conversion unit 120 is configured to convert the fundamental frequency information extracted by the information extraction unit 110, so that a fundamental frequency value included in the converted fundamental frequency information is a small-range numerical value.
Specifically, if the fundamental frequency information of the voice data extracted by the information extraction unit 110 includes a plurality of fundamental frequency values, the conversion unit 120 is specifically configured to directly convert the plurality of fundamental frequency values into a small-range numerical value; or, the converting unit 120 is specifically configured to perform a second preprocessing on the fundamental frequency information, and convert the fundamental frequency value included in the fundamental frequency information after the second preprocessing into a small-range numerical value. Wherein the second pretreatment comprises at least one of the following treatment modes: low-pass filtering, compression, zero setting of singular fundamental frequency points, filling of zero fundamental frequency points and the like.
An information quantization unit 130, configured to quantize the fundamental frequency information converted by the conversion unit 120 into a note sequence, or quantize the fundamental frequency information after performing the first preprocessing on the converted fundamental frequency information into a note sequence, where the piece of quantized speech data includes information of the note sequence; wherein, the first pretreatment comprises at least one of the following treatment modes: low pass filtering, three point smoothing, etc.
It is understood that the information of the note sequence includes a start time, a duration and a corresponding pitch value of each note in the notes, where the start time is a start time of a fundamental frequency sub-sequence included in the converted fundamental frequency information or the fundamental frequency information after the first preprocessing, the duration is a length of the one fundamental frequency sub-sequence after warping, and the pitch value is a frequency value of the one fundamental frequency sub-sequence after warping.
The information quantization unit 130 may segment the fundamental frequency sequence in the converted fundamental frequency information, or in the fundamental frequency information after the first preprocessing, to obtain a plurality of fundamental frequency subsequences. Specifically, if the difference between the fundamental frequency values of two adjacent fundamental frequency points is greater than a preset value, a segment boundary is placed between the two points, which gives the end time of one fundamental frequency subsequence and the start time of the next. In this embodiment, the information quantization unit 130 also needs to normalize the length of each fundamental frequency subsequence to a preset range (for example, between 0 and 20).
The information quantization unit 130 may then represent the frequency value of each fundamental frequency subsequence by the median fundamental frequency value of the fundamental frequency subsequence, and normalize the median fundamental frequency value to an integer between 0 and 24.
Further, if the pitch values of at least two adjacent notes in the resulting note sequence are the same, the information quantizing unit 130 needs to merge the at least two notes into one note: the start time of the merged note is the earliest start time of the at least two notes, its duration is the sum of their durations, and its pitch value is the pitch value of any one of the merged notes.
A first calculating unit 111, configured to calculate distances between any two pieces of quantized speech data in the n pieces of quantized speech data, respectively, if there are n pieces of quantized speech data obtained by the first quantizing unit 10, where n is a positive integer greater than 1. Assume that a piece of quantized speech data obtained by the first quantizing unit 10 includes information of a note sequence including duration and pitch value of each note of a plurality of notes;
The first calculating unit 111 is specifically configured to calculate the distance D(Si, Sj) between a note Si in the first piece of quantized speech data of the n pieces of quantized speech data and a note Sj in the second piece of quantized speech data as:
(the formula for D(Si, Sj) is reproduced only as an image, GDA0000842699270000101, in the original publication)
wherein:
Δp represents the pitch difference between the notes Si and Sj, Δp = min(abs(pi - pj), abs(pi - pj - 24) + 1.0, abs(pi - pj + 24) + 1.0), where pi is the pitch value of note Si and pj is the pitch value of note Sj;
Δd is the time difference between the notes Si and Sj, and σ is a weight applied to the time difference.
Thus, the distance between a note in the first piece of quantized speech data and a note in the second piece of quantized speech data is obtained, and the distance between the first piece of quantized speech data and the second piece of quantized speech data is: a maximum distance between a note of the first piece of quantized speech data and a note of the second piece of quantized speech data.
The second calculating unit 121 is configured to calculate, based on the distances obtained by the first calculating unit 111, the sum of the distances between each piece of the n pieces of quantized speech data and the other n-1 pieces, so that each piece of quantized speech data corresponds to one sum of distances.
An optimum determining unit 131, configured to take the piece of quantized speech data corresponding to the smallest sum of distances calculated by the second calculating unit 121 as the optimal speech data.
Referring to fig. 7, in another specific embodiment, the evaluation system of speech data may include, in addition to the structure shown in fig. 5, a second quantizing unit 13, a third calculating unit 14, a score determining unit 15, and an output unit 16, wherein:
and a second quantizing unit 13, configured to quantize the to-be-evaluated speech data of the accompaniment to obtain quantized to-be-evaluated speech data.
The second quantization unit 13 may specifically extract the speech data to be evaluated from the sound data, and then quantize the speech data to be evaluated according to the quantization method performed by the first quantization unit 10.
A third calculating unit 14, configured to calculate a first distance between the quantized voice data to be evaluated obtained by the second quantizing unit 13 and the optimal voice data obtained by the optimal obtaining unit 11;
a score determining unit 15, configured to determine an evaluation score of the to-be-evaluated speech data according to the first distance calculated by the third calculating unit 14.
Specifically, the score determining unit 15 is specifically configured to obtain a second distance, which is the largest distance from the optimal voice data, in the multiple pieces of quantized voice data, where the first distance is k, and the second distance is m; determining the rating score as 100 x (m-k)/m.
And the output unit 16 is used for outputting the position where the voice data to be evaluated is inconsistent with the optimal voice data. And the output unit 16 may also output the evaluation score determined by the score determination unit 15.
Since the optimal acquisition unit 11 calculates the distance between each position of the voice data to be evaluated and each position of the optimal voice data in the process of acquiring the optimal voice data, if the distance corresponding to a certain position is greater than a preset value, the output unit 16 determines that the voice data to be evaluated is inconsistent with the optimal voice data at the position, and outputs the inconsistent position.
An embodiment of the present invention further provides an evaluation system of voice data, a schematic structural diagram of which is shown in fig. 8. The evaluation system may vary considerably with configuration and performance, and may include one or more central processing units (CPUs) 20 (e.g., one or more processors), a memory 21, and one or more storage media 22 (e.g., one or more mass storage devices) storing an application 221 or data 222. The memory 21 and the storage medium 22 may be transient or persistent storage. The program stored on the storage medium 22 may include one or more modules (not shown), and each module may include a series of instruction operations for the evaluation system of voice data. Still further, the central processing unit 20 may be configured to communicate with the storage medium 22 and to execute, on the evaluation system of voice data, the series of instruction operations stored in the storage medium 22.
The system for evaluating voice data may also include one or more power supplies 23, one or more wired or wireless network interfaces 24, one or more input/output interfaces 25, and/or one or more operating systems 223, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the speech data evaluation system described in the above-described method embodiment may be based on the structure of the speech data evaluation system shown in fig. 8.
The method of the embodiment of the present invention is described below through a specific application example, in which the voice data evaluation system executes the evaluation method of the embodiment as follows:
1. quantizing voice data contained in sound data
For an accompaniment, such as a song, each user uploads his or her singing data (i.e., the user's sound data) to the voice data evaluation system, and the system quantizes the voice data included in each user's sound data. Specifically, for the voice data (denoted as midi) included in a certain piece of sound data:
referring to fig. 9, the evaluation system of voice data needs to extract midi included in the sound data first, specifically:
step 301, preprocessing the sound data, including: removing a direct current component of the sound data, and if the volume of the sound data is smaller than a preset volume, performing energy gain on the sound data and then performing noise reduction processing; and if the volume of the sound data is less than or equal to the preset volume, directly carrying out noise reduction processing on the sound data.
Step 302, extracting fundamental frequency information of the preprocessed sound data, the fundamental frequency information including a fundamental frequency sequence with fundamental frequency values at a plurality of fundamental frequency points; then finding the effective fundamental frequency sequence between silent sections. Specifically, if the fundamental frequency values of x consecutive fundamental frequency points in the sequence are zero and x is larger than a preset value, the fundamental frequency sequence spanned by these x points is a silent section. The effective fundamental frequency sequence is the fundamental frequency sequence with the silent sections removed.
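For illustration only, finding the effective fundamental frequency sequence of step 302 might look like the sketch below; the value of x (here 30 frames) and the decision to return the effective parts as a list of segments are assumptions.

```python
def effective_segments(f0, min_silence_run=30):
    """Split a fundamental frequency sequence into effective (non-silent) segments:
    runs of more than min_silence_run consecutive zero values are silent sections
    and are removed; everything between them is kept."""
    segments, buffer, zero_run = [], [], 0

    def flush():
        # drop the trailing zero run (the silent section) before keeping the segment
        kept = buffer[:len(buffer) - zero_run] if zero_run else list(buffer)
        if any(v != 0 for v in kept):
            segments.append(kept)

    for v in f0:
        if v == 0:
            zero_run += 1
        else:
            if zero_run > min_silence_run:
                flush()
                buffer, zero_run = [], 0
            else:
                zero_run = 0
        buffer.append(v)
    flush()
    return segments
```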
Step 303, judging whether the length of the effective fundamental frequency sequence is smaller than a threshold (for example, 35 frames); if so, executing step 304 and then step 306; otherwise, executing step 305 and then step 306.
And 304, performing median filtering according to the actual length of the effective fundamental frequency sequence.
Step 305, performing median filtering on the effective fundamental frequency sequence according to a preset window, for example, performing 10-point median filtering on each frame of the effective fundamental frequency sequence.
Step 306, reducing the granularity of the filtered fundamental frequency sequence, generally by a factor of 5, i.e., keeping the fundamental frequency value of one fundamental frequency point out of every 5 and discarding the others.
Step 307, for the granularity-reduced fundamental frequency sequence, taking the base-2 logarithm (log2) of the fundamental frequency value at each fundamental frequency point, converting it into a small-range numerical value.
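As a small illustration of steps 306 and 307, assuming the factor of 5 and the base-2 logarithm mentioned in the text (the function name is invented):

```python
import math

def reduce_and_log2(f0, factor=5):
    """Keep one fundamental frequency value out of every `factor` points, then take
    log2 of the non-zero values; zero values (silence) are left at zero."""
    reduced = f0[::factor]
    return [math.log2(v) if v > 0 else 0.0 for v in reduced]
```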
Step 308, segmenting the converted fundamental frequency sequence according to the difference between the fundamental frequency values of adjacent fundamental frequency points: if the difference between the fundamental frequency values of two adjacent fundamental frequency points is greater than 0.05, a segment boundary is placed at that point, dividing the sequence into a plurality of fundamental frequency subsequences. The information of each fundamental frequency subsequence, including its start time, length and frequency value, is then output; this is the extracted midi information.
Referring to fig. 10, the evaluation system of speech data normalizes the information of the plurality of fundamental frequency subsequences and quantizes it into a note sequence, i.e., the quantized speech data, specifically:
step 401, for each piece of midi information extracted through the above steps, normalizing the frequency value of each fundamental frequency subsequence to an integer between 0 and 24, that is, converting the frequency value to a pitch value of a note, and the precision of the pitch value is a semitone.
Step 402, the length of each base frequency subsequence is normalized to an integer between 0 and 20, that is, the length is converted into the duration of a note, and the precision of the duration is 10 frames, that is, 0.1 s.
Step 403, determining whether the adjacent notes need to be combined, specifically, if the pitch values of the adjacent notes are the same, the adjacent notes need to be combined, and if yes, performing step 404. If no merging is needed, the flow ends.
Step 404, merging the adjacent notes: the duration of the merged note is the sum of the durations of the merged notes, and its pitch value is the pitch value of any one of the merged notes. Since the duration of the merged note changes, it needs to be normalized again, so the process returns to step 402.
2. Obtaining the optimal voice data, and recording as the optimal midi
The method for clustering the multiple pieces of quantized midi of one accompaniment to obtain the optimal midi is described in the embodiment corresponding to fig. 4 and is not repeated here; in this process, the largest distance m from any midi to the optimal midi needs to be recorded.
3. Presetting the optimal midi in the evaluation system of the voice data.
4. The specific evaluation method for evaluating the voice data included in the sound data to be evaluated is the method described in the embodiment corresponding to fig. 2, and is not described herein again.
5. Outputting the evaluation score and the positions where the voice data to be evaluated is inconsistent with the optimal midi. Since the distance between a note in one midi and a note in another midi is calculated in the process of obtaining the optimal midi, if such a distance is greater than a preset value, the voice data to be evaluated at the position of that note is determined to be inconsistent with the optimal midi.
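For illustration, the inconsistent-position output of item 5 could be sketched as follows, reusing the note_distance helper assumed earlier; the position-wise pairing of notes and the threshold value are assumptions, since the text only says that a distance above a preset value marks an inconsistency.

```python
def inconsistent_positions(notes_eval, notes_opt, threshold=3.0, sigma=0.4):
    """Return the start times of notes in the voice data under evaluation whose
    distance to the corresponding note of the optimal midi exceeds the threshold."""
    positions = []
    for a, b in zip(notes_eval, notes_opt):  # assumed position-wise pairing
        if note_distance(a, b, sigma) > threshold:
            positions.append(a.start)
    return positions
```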
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The method and system for evaluating voice data provided by the embodiment of the present invention are described in detail above, and a specific example is applied in the text to explain the principle and the embodiment of the present invention, and the description of the above embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (14)

1. A method for evaluating speech data, comprising:
quantizing voice data contained in a plurality of pieces of sound data of an accompaniment respectively to obtain a plurality of pieces of quantized voice data;
clustering the quantized voice data to obtain the optimal voice data of the accompaniment;
storing the optimal voice data of the accompaniment for evaluating the voice data to be evaluated of the accompaniment,
the method further comprises the following steps:
quantizing the voice data to be evaluated of the accompaniment to obtain quantized voice data to be evaluated;
calculating a first distance between the quantized voice data to be evaluated and the optimal voice data;
determining the evaluation score of the voice data to be evaluated according to the calculated first distance,
the determining the evaluation score of the voice data to be evaluated according to the calculated first distance specifically includes:
obtaining a second distance with the maximum distance from the optimal voice data in the quantized voice data, wherein the first distance is k, and the second distance is m;
determining the rating score as 100 x (m-k)/m.
2. The method of claim 1, wherein quantizing a piece of voice data to obtain a piece of quantized voice data, comprises:
extracting fundamental frequency information of the sound data;
converting the fundamental frequency information to enable a fundamental frequency value included in the converted fundamental frequency information to be a small-range numerical value;
quantizing the converted fundamental frequency information into a note sequence, or quantizing the fundamental frequency information subjected to first preprocessing on the converted fundamental frequency information into the note sequence, wherein the piece of quantized voice data comprises information of the note sequence;
the information of the note sequence comprises the starting time, the duration and the corresponding pitch value of each note in the notes, wherein the starting time is the starting time of a fundamental frequency subsequence included in the converted fundamental frequency information or the fundamental frequency information after the first preprocessing, the duration is the length of the fundamental frequency subsequence after normalization, and the pitch value is the frequency value of the fundamental frequency subsequence after normalization.
3. The method according to claim 2, wherein the fundamental frequency information of the sound data includes a plurality of fundamental frequency values, and the converting the fundamental frequency information is performed so that the fundamental frequency values included in the converted fundamental frequency information are small-range values, specifically including: directly converting the plurality of fundamental frequency values into a small-range numerical value;
or, second preprocessing is carried out on the fundamental frequency information, and fundamental frequency values included in the fundamental frequency information after the second preprocessing are converted into small-range numerical values.
4. The method of claim 3, wherein the second pre-processing comprises at least one of: low-pass filtering, compressing, and setting a singular fundamental frequency point to be zero and filling the zero fundamental frequency point;
the first pretreatment comprises at least one of the following treatment modes: low pass filtering and three-point smoothing.
5. The method according to claim 1, wherein there are n pieces of quantized speech data, where n is a positive integer greater than 1, and the clustering the plurality of pieces of quantized speech data to obtain the optimal speech data of the accompaniment specifically includes:
respectively calculating the distance between any two quantized voice data in the n quantized voice data;
and respectively calculating the sum of the distances between each piece of quantized voice data in the n pieces of quantized voice data and other n-1 pieces of quantized voice data, and taking the piece of quantized voice data corresponding to the minimum distance sum as the optimal voice data.
6. The method of claim 5, wherein a piece of quantized speech data includes information of a note sequence including duration and pitch values of respective notes of a plurality of notes;
the distance D (Si, Sj) between the note Si in the first quantized speech data and the note Sj in the second quantized speech data in the n quantized speech data is specifically:
(the formula for D(Si, Sj) is reproduced only as an image, FDA0002289450920000021, in the original publication)
wherein:
the Δp represents the pitch difference between the notes Si and Sj, and Δp = min(abs(pi - pj), abs(pi - pj - 24) + 1.0, abs(pi - pj + 24) + 1.0), where pi is the pitch value of the note Si and pj is the pitch value of the note Sj;
the delta d is the time difference between the notes Si and Sj, and the sigma is the weight value of the time difference;
the distance between the first strip of quantized speech data and the second strip of quantized speech data is: a maximum distance between a note of the first piece of quantized speech data and a note of the second piece of quantized speech data.
7. The method of any of claims 1 to 6, further comprising:
and outputting the position of the inconsistency of the voice data to be evaluated and the optimal voice data.
8. A system for evaluating speech data, comprising:
a first quantizing unit, configured to quantize voice data contained in a plurality of pieces of sound data of an accompaniment respectively to obtain a plurality of pieces of quantized voice data;
the optimal acquisition unit is used for clustering a plurality of pieces of quantized voice data obtained by the first quantization unit to acquire optimal voice data of the accompaniment;
a storage unit configured to store the optimal speech data of one accompaniment acquired by the optimal acquisition unit, the optimal speech data being used to evaluate the speech data of the one accompaniment to be evaluated,
the system further comprises:
the second quantization unit is used for quantizing the voice data to be evaluated of the accompaniment to obtain quantized voice data to be evaluated;
the third calculation unit is used for calculating a first distance between the quantized voice data to be evaluated and the optimal voice data;
a score determining unit for determining an evaluation score of the speech data to be evaluated based on the first distance calculated by the third calculating unit,
the score determining unit is specifically configured to obtain a second distance, which is the largest distance from the optimal voice data, in the multiple pieces of quantized voice data, where the first distance is k, and the second distance is m; determining the rating score as 100 x (m-k)/m.
9. The system of claim 8, wherein the first quantization unit specifically comprises:
an information extraction unit for extracting fundamental frequency information of the sound data;
the conversion unit is used for converting the fundamental frequency information extracted by the information extraction unit so that the fundamental frequency value included in the converted fundamental frequency information is a small-range numerical value;
an information quantization unit, configured to quantize the fundamental frequency information converted by the conversion unit into a note sequence, or quantize the fundamental frequency information after performing first preprocessing on the converted fundamental frequency information into a note sequence, where the piece of quantized speech data includes information of the note sequence;
the information of the note sequence comprises the starting time, the duration and the corresponding pitch value of each note in the notes, wherein the starting time is the starting time of a fundamental frequency subsequence included in the converted fundamental frequency information or the fundamental frequency information after the first preprocessing, the duration is the length of the fundamental frequency subsequence after normalization, and the pitch value is the frequency value of the fundamental frequency subsequence after normalization.
10. The system according to claim 9, wherein the fundamental frequency information of the sound data extracted by the information extraction unit includes a plurality of fundamental frequency values, and the conversion unit is specifically configured to directly convert the plurality of fundamental frequency values into a small-range numerical value;
or, the conversion unit is specifically configured to perform second preprocessing on the fundamental frequency information, and convert the fundamental frequency value included in the fundamental frequency information after the second preprocessing into a small-range numerical value.
11. The system of claim 10, wherein the second pre-processing comprises at least one of: low-pass filtering, compressing, and setting a singular fundamental frequency point to be zero and filling the zero fundamental frequency point;
12. the system of claim 8,
the quantized voice data obtained by the first quantization unit has n pieces, where n is a positive integer greater than 1, and the optimal obtaining unit specifically includes:
a first calculating unit, configured to calculate distances between any two pieces of quantized speech data in the n pieces of quantized speech data, respectively;
a second calculating unit, configured to calculate a sum of distances between each piece of quantized speech data in the n pieces of quantized speech data and other n-1 pieces of quantized speech data;
and an optimum determining unit, configured to take the piece of quantized speech data corresponding to the minimum sum of distances calculated by the second calculating unit as the optimal speech data.
13. The system as claimed in claim 12, wherein the quantized speech data obtained by the first quantizing unit includes information of a note sequence, the information of the note sequence including duration and pitch value of each of a plurality of notes;
the first calculating unit is specifically configured to calculate a distance D (Si, Sj) between a note Si in a first piece of quantized speech data in the n pieces of quantized speech data and a note Sj in a second piece of quantized speech data, where the distance D (Si, Sj) is:
(the formula for D(Si, Sj) is reproduced only as an image, FDA0002289450920000041, in the original publication)
wherein:
the Δp represents the pitch difference between the notes Si and Sj, and Δp = min(abs(pi - pj), abs(pi - pj - 24) + 1.0, abs(pi - pj + 24) + 1.0), where pi is the pitch value of the note Si and pj is the pitch value of the note Sj;
14. the system of any one of claims 8 to 13, further comprising:
and the output unit is used for outputting the position where the voice data to be evaluated is inconsistent with the optimal voice data.
CN201510586445.7A 2015-09-16 2015-09-16 Voice data evaluation method and system Active CN106548784B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510586445.7A CN106548784B (en) 2015-09-16 2015-09-16 Voice data evaluation method and system
PCT/CN2016/083043 WO2017045428A1 (en) 2015-09-16 2016-05-23 Voice data evaluation method and system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510586445.7A CN106548784B (en) 2015-09-16 2015-09-16 Voice data evaluation method and system

Publications (2)

Publication Number Publication Date
CN106548784A CN106548784A (en) 2017-03-29
CN106548784B true CN106548784B (en) 2020-04-24

Family

ID=58288032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510586445.7A Active CN106548784B (en) 2015-09-16 2015-09-16 Voice data evaluation method and system

Country Status (2)

Country Link
CN (1) CN106548784B (en)
WO (1) WO2017045428A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060702B (en) * 2019-04-29 2020-09-25 北京小唱科技有限公司 Data processing method and device for singing pitch accuracy detection
CN110047514B (en) * 2019-05-30 2021-05-28 腾讯音乐娱乐科技(深圳)有限公司 Method for evaluating purity of accompaniment and related equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101430876A (en) * 2007-11-08 2009-05-13 中国科学院声学研究所 Singing marking system and method
CN102915725A (en) * 2012-09-10 2013-02-06 福建星网视易信息系统有限公司 Human-computer interaction song singing system and method
WO2015030319A1 (en) * 2013-08-28 2015-03-05 Lee Sung-Ho Sound source evaluation method, performance information analysis method and recording medium used therein, and sound source evaluation apparatus using same

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4204941B2 (en) * 2003-09-30 2009-01-07 ヤマハ株式会社 Karaoke equipment
CN101441865A (en) * 2007-11-19 2009-05-27 盛趣信息技术(上海)有限公司 Method and system for grading sing genus game
CN102664018B (en) * 2012-04-26 2014-01-08 杭州来同科技有限公司 Singing scoring method with radial basis function-based statistical model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101430876A (en) * 2007-11-08 2009-05-13 中国科学院声学研究所 Singing marking system and method
CN102915725A (en) * 2012-09-10 2013-02-06 福建星网视易信息系统有限公司 Human-computer interaction song singing system and method
WO2015030319A1 (en) * 2013-08-28 2015-03-05 Lee Sung-Ho Sound source evaluation method, performance information analysis method and recording medium used therein, and sound source evaluation apparatus using same

Also Published As

Publication number Publication date
CN106548784A (en) 2017-03-29
WO2017045428A1 (en) 2017-03-23

Similar Documents

Publication Publication Date Title
CN106935248B (en) Voice similarity detection method and device
US10261965B2 (en) Audio generation method, server, and storage medium
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
CN109147796B (en) Speech recognition method, device, computer equipment and computer readable storage medium
Zhao et al. Robust emotion recognition in noisy speech via sparse representation
CN109979488B (en) System for converting human voice into music score based on stress analysis
Bharti et al. Real time speaker recognition system using MFCC and vector quantization technique
CN104978962A (en) Query by humming method and system
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
CN108682432B (en) Speech emotion recognition device
CN110516102B (en) Lyric time stamp generation method based on spectrogram recognition
Yu et al. Sparse cepstral codes and power scale for instrument identification
Li et al. A comparative study on physical and perceptual features for deepfake audio detection
CN106548784B (en) Voice data evaluation method and system
Emiya et al. Automatic transcription of piano music based on HMM tracking of jointly-estimated pitches
KR100744288B1 (en) Method of segmenting phoneme in a vocal signal and the system thereof
CN106970950B (en) Similar audio data searching method and device
Zolnay et al. Extraction methods of voicing feature for robust speech recognition.
CN107025902B (en) Data processing method and device
Gao et al. Vocal melody extraction via DNN-based pitch estimation and salience-based pitch refinement
El-Henawy et al. Recognition of phonetic Arabic figures via wavelet based Mel Frequency Cepstrum using HMMs
Dong et al. Vocal Pitch Extraction in Polyphonic Music Using Convolutional Residual Network.
Gurunath Reddy et al. Predominant melody extraction from vocal polyphonic music signal by time-domain adaptive filtering-based method
CN113366567B (en) Voiceprint recognition method, singer authentication method, electronic equipment and storage medium
Daniel et al. Raga identification of carnatic music using iterative clustering approach

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 510000 Guangzhou City, Guangzhou, Guangdong, Whampoa Avenue, No. 315, self - made 1-17

Applicant after: Guangzhou KuGou Networks Co., Ltd.

Address before: 510000 Guangzhou, Tianhe District branch Yun Yun Road, No. 16, self built room 2, building 1301

Applicant before: Guangzhou KuGou Networks Co., Ltd.

GR01 Patent grant
GR01 Patent grant