CN115547299B - Quantitative evaluation and classification method and device for quality division of control voice - Google Patents


Info

Publication number
CN115547299B
Authority
CN
China
Prior art keywords
voice
index
value
evaluation index
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211469949.7A
Other languages
Chinese (zh)
Other versions
CN115547299A (en)
Inventor
潘卫军
张坚
蒋培元
蒋倩兰
王泆棣
张玉梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation Flight University of China
Original Assignee
Civil Aviation Flight University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation Flight University of China filed Critical Civil Aviation Flight University of China
Priority to CN202211469949.7A priority Critical patent/CN115547299B/en
Publication of CN115547299A publication Critical patent/CN115547299A/en
Application granted granted Critical
Publication of CN115547299B publication Critical patent/CN115547299B/en

Classifications

    • G – PHYSICS
    • G10 – MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L – SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 – Speech recognition
    • G10L15/01 – Assessment or evaluation of speech recognition systems
    • G – PHYSICS
    • G06 – COMPUTING; CALCULATING OR COUNTING
    • G06F – ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 – Handling natural language data
    • G06F40/20 – Natural language analysis
    • G06F40/279 – Recognition of textual entities
    • G06F40/289 – Phrasal analysis, e.g. finite state techniques or chunking
    • G – PHYSICS
    • G10 – MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L – SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 – Speech recognition
    • G10L15/08 – Speech classification or search
    • G – PHYSICS
    • G10 – MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L – SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 – Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 – Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 – Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 – Speech or voice analysis techniques for measuring the quality of voice signals
    • Y – GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 – TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P – CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 – Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 – Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a quantitative evaluation and classification method and device for quality division of control voice. The method comprises the following steps: S1, inputting voice data from a standard control voice database annotated with correct semantics; S2, constructing an evaluation index system for control voice quality division according to the characteristics of civil aviation ground-air communication; S3, performing qualitative analysis on each evaluation index, covering both the technical analysis method and the graded quantization unit of each index; S4, grouping the data obtained from each single-index analysis with a clustering method, and prescribing the range of values for each voice grade under that index; S5, weighting the indexes with a weighted fusion algorithm and combining them into control voice data sets of several quality levels. The device comprises at least one processor and at least one memory. The method addresses the problems that control voice quality cannot be analyzed objectively and quantitatively and that the correspondence between control voice quality and each evaluation index cannot be made explicit.

Description

Quantitative evaluation and classification method and device for quality division of control voice
Technical Field
The invention relates to the field of quality measurement of control voice data, in particular to a quantitative evaluation and classification method and device for quality division of control voice.
Background
Current mainstream speech quality evaluation revolves around models such as MOS (Mean Opinion Score), PESQ (Perceptual Evaluation of Speech Quality) and PSQM (Perceptual Speech Quality Measure). These are, however, rather fuzzy evaluation methods: the evaluation score is obtained by mapping through a machine learning algorithm or neural network model against pre-determined grade standards, so subjective factors loom large and the result is not objective enough. In addition, existing objective speech quality assessment methods either characterize quality without a reference, from certain specific parameters, or by signal-level comparison against a reference; either way they yield only an overall score, much like black-box testing, and no reasonably complete system of evaluation indexes has been formed for objective speech evaluation. The inability to analyze voice quality objectively and quantitatively, or to find a unit of measure for doing so, is a major obstacle to later study of speech recognition software performance, because the same recognition software performs inconsistently on voices of different quality.
In summary, the existing methods for classifying control voice quality are not objective enough: they can neither analyze voice quality objectively and quantitatively nor find a unit of measure for such analysis, they cannot make explicit the correspondence between control voice quality and each evaluation index, and no sound evaluation index system for control voice has been formed.
Disclosure of Invention
The invention aims to solve the problem that voice quality cannot be analyzed objectively and quantitatively and that the correspondence between control voice quality and each evaluation index cannot be made explicit. It provides a quantitative evaluation and classification method and device that, relying on a standard control voice database (containing audio and annotated text) established by the project, divides voice data by quality and designs and generates control voice test sets of different control categories and difficulty levels.
In order to achieve the above object, the present invention provides the following technical solutions:
A quantitative evaluation and classification method for control voice quality division comprises the following steps:
s1, inputting voice data of a standard control voice database marked with correct meanings;
s2, constructing an evaluation index system for controlling the quality division of the voice according to the characteristics of the civil aviation land-air communication;
s3, performing qualitative analysis on each evaluation index, wherein the qualitative analysis comprises a technical analysis method and an index grading quantization unit;
s4, grouping, with a clustering method, the voice data statistical analysis data set obtained by analyzing each single evaluation index, and prescribing the range of values for each voice grade under that index;
s5, weighting each evaluation index by adopting a weighted fusion algorithm, and combining the evaluation indexes into a voice data set with multiple levels of quality.
Preferably, in step S2 the evaluation index system for control voice quality division comprises maximal-type, intermediate-type, minimal-type and fixed-value indexes. For a maximal-type index, the larger the value, the better the voice recognition effect; it includes accent. For an intermediate-type index, the closer the value to a certain intermediate value, the better the recognition effect; it includes speech rate, tone (pitch) and intensity. For a minimal-type index, the smaller the value, the better the recognition effect; it includes continuity, interference level, technical-term ratio, gray vocabulary content and sound variation. For a fixed-value index, the recognition effect is best when the value equals a specified value; it includes language category.
Preferably, in step S4, grouping the voice data statistical analysis data set obtained from a single evaluation index with the clustering method and prescribing the range of values for each voice grade under that index comprises the following steps:
step S4-1: inputting the number of grades to be divided and a voice data statistical analysis data set obtained by each single evaluation index analysis method;
step S4-2: and outputting the clustering result and each grade range.
Preferably, the method for outputting the clustering result in the step S4-2 comprises the following steps:
step S4-2-1: determining the optimal number of categories by the elbow method or the silhouette coefficient method;
step S4-2-2: initializing the class center values, calculating the Euclidean distance from each sample point to each class center, and assigning each sample point to its nearest class to form a clustering result, where the Euclidean distance is the true distance between two points in m-dimensional space, i.e. the natural length of a vector:

$$d_{ij} = \sqrt{\sum_{k=1}^{m} \left( x_{ik} - x_{jk} \right)^2}$$

where $d_{ij}$ is the distance from sample point $x_i$ to centroid $x_j$, $x_{ik}$ is the k-th attribute of the i-th sample, $x_{jk}$ is the k-th attribute of the j-th sample, and there are m attribute dimensions in total;
step S4-2-3: calculating the average value of all samples in each category of the clustering result, and taking the average value as a new clustering center;
step S4-2-4: taking the sum of the distances from the samples to their class centers as the objective function; if the iteration converges or a stopping condition is met, outputting the result; otherwise incrementing the number of categories by 1 and returning to step S4-2-2 to repeat the calculation;
step S4-2-5: because the algorithm is iterative, a global optimum is difficult to reach; a heuristic strategy is therefore adopted, seeking a Nash equilibrium to approximate an optimal solution.
Preferably, the quality of the control voice is divided into grades 1-5; the higher the grade, the better the voice quality.
Preferably, the weighting fusion algorithm in step S5 includes a subjective weighting method and an objective weighting method, which are implemented as follows:
step S5-1: the subjective weighting method uses expert experience to adjust and optimize the objective weight values of the evaluation indexes: with the 1-9 scale method, the indexes belonging to one level are compared pairwise for their importance relative to the same index of the level above, forming a judgment matrix X; the eigenvector corresponding to the largest eigenvalue of the judgment matrix is computed by the maximum-eigenvector method and, after the judgment matrix passes the consistency check, serves as the subjective weight of each index, denoted $\alpha_j$;
step S5-2: the objective weighting method comprises the following steps:
step S5-2-1: forward-ization of the indexes, converting the minimal-type and intermediate-type indexes into maximal-type ones:

minimal-type to maximal-type:

$$x_i' = \max\{x_1, x_2, \ldots, x_n\} - x_i$$

intermediate-type to maximal-type:

$$x_{new} = 1 - \frac{\lvert x_i - x_{best} \rvert}{\max_i \lvert x_i - x_{best} \rvert}$$

where $x_{best}$ is the value optimal for recognition effect, taken from the voice set obtained by the evaluation-index method, and $x_{new}$ is the forward-ized value;
step S5-2-2: data standardization, balancing dimensional errors between the indexes:

$$z_{ij} = \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}$$

where $x_{ij}$ is the value of the i-th voice under the j-th evaluation index;
step S5-2-3: data normalization, unifying the values to the interval 0-1:

$$p_{ij} = \frac{z_{ij}}{\sum_{i=1}^{n} z_{ij}}$$

where n is the number of evaluation objects;
step S5-2-4: calculating the information entropy value of each evaluation index:

$$e_j = -\frac{1}{\ln n} \sum_{i=1}^{n} p_{ij} \ln p_{ij}$$

where n is the number of evaluation objects, m is the number of evaluation indexes, and j runs from 1 to m;
step S5-2-5: calculating the weights:

$$d_j = 1 - e_j, \qquad \beta_j = \frac{d_j}{\sum_{j=1}^{m} d_j}$$

where j runs from 1 to m;
step S5-3: subjective-objective weight fusion:

$$w_j = \frac{\alpha_j \beta_j}{\sum_{j=1}^{m} \alpha_j \beta_j}$$

where n is the number of evaluation objects, $\alpha_j$ is the subjective weight and $\beta_j$ the objective weight;
step S5-4: composite score of each voice:

$$s_i = \sum_{j=1}^{m} w_j z_{ij}$$

where $z_{ij}$ is the standardized value of the i-th evaluation object under the j-th evaluation index, and i runs from 1 to n;
step S5-5: score range of each control voice quality grade:
the composite score is computed over the whole standard control voice database by steps S5-1 to S5-4, and the sorted composite scores of the whole database are divided into 5 grades; the interval of each grade is the score range of that quality grade. The intervals lie within 0 to 1, and since all evaluation indexes have been forward-ized, a higher composite score means better quality, grade 5 being the best.
A device for quantitatively evaluating and classifying control-oriented voice quality division comprises at least one processor and a memory which is in communication connection with the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any steps of the classification method.
Compared with the prior art, the invention has the beneficial effects that:
1. constructing a control voice quality evaluation index system, quantitatively analyzing each evaluation index in the evaluation index system, objectively and quantitatively researching the control voice quality, defining a metering unit for objectively and quantitatively analyzing the control voice quality, and acquiring the corresponding relation between the control voice quality and each evaluation index;
2. based on the quantitative analysis of the evaluation indexes, a subjective-objective weighted fusion algorithm is adopted to establish an objective control voice quality division method; third-party control voice recognition software can be tested against the divided quality-level voice data sets, which makes it convenient for aviation units to select control voice recognition software and improves the efficiency, safety, reliability and effectiveness of air traffic control.
Description of the drawings:
FIG. 1 is a technical roadmap of control voice quality division;
FIG. 2 is a structural diagram of quantitative analysis for control voice quality division;
FIG. 3 is a diagram of the index classification;
FIG. 4 is a graph of the trend between index values and recognition effect;
FIG. 5 is the first part of the index quantization and grading effect graph;
FIG. 6 is the second part of the index quantization and grading effect graph;
FIG. 7 is a technical roadmap of control voice quality grading.
Detailed Description
The present invention will be described in further detail with reference to test examples and specific embodiments. The scope of the above subject matter of the invention should not be construed as limited to the following embodiments; all techniques realized based on the content of the present invention fall within the scope of the invention.
Examples
In this embodiment, control voice corpora are collected from the actual civil aviation operating environment, with the project as background, to establish a standard control voice database. The corpus covers multiple scenes, different controllers' pronunciations, different control instruction voices, different flight phases, a large number of ground-air radiotelephony words and phrases, single or mixed language pronunciation, and other air traffic control characteristics; each control voice audio in the database is annotated with its corresponding control instruction text, and the data of the database are quantitatively evaluated and classified.
The implementation process and steps of this embodiment are as follows; the flow chart is shown in FIG. 1, and the structural diagram of quantitative analysis for control voice quality division is shown in FIG. 2:
s1, inputting voice data of a standard control voice database marked with correct meanings;
s2, constructing an evaluation index system for controlling the quality division of the voice in consideration of the characteristics of the land-air communication;
s3, performing qualitative analysis on each evaluation index, wherein the qualitative analysis comprises a technical analysis method and an index grading quantization unit;
s4, grouping the voice data analysis statistical data set obtained by analyzing the single evaluation index by adopting a clustering method, and prescribing the range value of each voice grade under the single evaluation index;
s5, weighting each evaluation index by adopting a weighted fusion algorithm, and combining the evaluation indexes into 5-level quality voice data sets.
In step S2, the classification of the index is as shown in fig. 3:
the evaluation index system for control voice quality division comprises maximal-type, intermediate-type, minimal-type and fixed-value indexes. For a maximal-type index, the larger the value, the better the voice recognition effect; it includes accent. For an intermediate-type index, the closer the value to a certain intermediate value, the better; it includes speech rate, tone (pitch) and intensity. For a minimal-type index, the smaller the value, the better; it includes continuity, interference level, technical-term ratio, gray vocabulary content and sound variation. For a fixed-value index, the recognition effect is best at a specified value; it includes language category, a single language being recognized better than mixed languages.
The qualitative analysis of each evaluation index in step S3 proceeds as follows; the index quantization and grading effect graphs are shown in FIG. 5 and FIG. 6, which are the two halves of one overall figure:
step S3-1: the speech rate quantization unit is characters/second (Chinese) or syllables/second (English); the speech rate analysis method comprises the following steps:
step S3-1-1: framing, windowing and preprocessing an input voice signal;
step S3-1-2: detecting an audio segment of valid speech; calculating the number of frames of effective audio frequency and obtaining the effective pronunciation time;
step S3-1-3: processing the text corresponding to the audio to obtain the effective character number or vocabulary number of the audio text;
step S3-1-4: calculating the speech rate: speech rate = number of effective characters or syllables / effective pronunciation time (the latter derived from the number of effective audio frames in step S3-1-2);
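By way of illustration only, a minimal Python sketch of this speech-rate measure might look as follows; the energy-threshold VAD, the frame sizes and the regex syllable heuristic are assumptions of the sketch, not part of the claimed method:

```python
import re
import numpy as np

def effective_duration(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Seconds of voiced audio, via a simple energy-threshold VAD (assumption)."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame) // hop)
    energies = np.array([
        10 * np.log10(np.mean(signal[i * hop:i * hop + frame] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    return float(np.count_nonzero(energies > threshold_db)) * hop / sr

def speech_rate(signal, sr, transcript, lang="zh"):
    """Characters/s (Chinese) or rough syllables/s (English), per step S3-1-4."""
    if lang == "zh":
        units = sum(1 for ch in transcript if ch.strip())  # non-space characters
    else:
        # crude syllable count: one vowel group per word at minimum (illustrative)
        units = sum(len(re.findall(r"[aeiouy]+", w.lower())) or 1
                    for w in transcript.split())
    dur = effective_duration(np.asarray(signal, dtype=float), sr)
    return units / dur if dur > 0 else 0.0
```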
step S3-2: the tone (pitch) quantization unit is a pitch change frequency, and the tone (pitch) analysis method includes the steps of:
s3-2-1, framing, windowing and preprocessing an input voice signal, and filtering other interference factors;
s3-2-2, carrying out Fourier transformation on the preprocessed framing signals, and extracting time domain and frequency domain characteristic information of the voice waveform;
s3-2-3, directly estimating waveform change trend by a time domain and frequency domain estimation method of the voice waveform;
step S3-3: the sound intensity quantification unit is amplitude (dB), and the sound intensity analysis method comprises the following steps:
step S3-3-1: framing, windowing and preprocessing an input voice signal;
step S3-3-2: obtaining various frequency and amplitude values through short-time Fourier transformation and splitting of original signals;
step S3-3-3: carrying out normal distribution description on each amplitude value in the voice, and taking expected values of normal distribution as intensity measurement values of the voice;
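A short sketch of the intensity measure in step S3-3, under the same caveat that the framing parameters are assumptions; the expected value of the fitted normal distribution reduces to the sample mean of the per-frame amplitudes:

```python
import numpy as np

def intensity_measure(signal, sr, frame_ms=25, hop_ms=10):
    """Per-frame spectral peak amplitude in dB; the expectation of a fitted
    normal distribution (the sample mean) is the voice's intensity value."""
    signal = np.asarray(signal, dtype=float)
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    window = np.hanning(frame)
    amps = []
    for start in range(0, len(signal) - frame, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * window))
        amps.append(20 * np.log10(spectrum.max() + 1e-12))
    if not amps:
        return float("nan")
    mu, sigma = float(np.mean(amps)), float(np.std(amps))  # normal fit parameters
    return mu
```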
step S3-4: the accent quantization unit is similarity, and the accent analysis method comprises the following steps:
step S3-4-1: establishing a standard mandarin chinese phone library, and mapping different sound features into corresponding phones;
step S3-4-2: extracting phonemes of the input speech using a phoneme extraction algorithm;
step S3-4-3: comparing the difference between the standard pronunciation and the accented pronunciation input to the system: the acoustic model decodes the input voice into a voice feature sequence, which is compared with the feature sequence of standard Mandarin; both sequences are expressed as feature vectors and the similarity between the two feature vectors is calculated;
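For illustration, a minimal sketch of the similarity computation in step S3-4-3, assuming frame-level acoustic features (e.g. MFCCs) have already been extracted; averaging each sequence to a single vector and using cosine similarity is one plausible realization, not the patent's prescribed one:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def accent_similarity(feats_input, feats_reference):
    """feats_*: (n_frames, n_dims) arrays of frame-level features (e.g. MFCCs).
    Collapse each sequence to its mean vector, then compare; values near 1.0
    mean close to standard Mandarin, lower values mean a stronger accent."""
    v_in = np.asarray(feats_input, dtype=float).mean(axis=0)
    v_ref = np.asarray(feats_reference, dtype=float).mean(axis=0)
    return cosine_similarity(v_in, v_ref)
```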
step S3-5: the continuous quantification unit is the number of continuous abnormal segments in a voice, and the continuous analysis method comprises the following steps:
step S3-5-1: preprocessing input voice;
step S3-5-2: removing silence segments at the head end and the tail end of each voice by using an energy-based voice endpoint detection method, and marking out effective voice internal continuity abnormal segments;
step S3-5-3: using a human-voice-based voice endpoint detection method, marking the portions without speaker voice inside the continuity-abnormal segments of the effective voice obtained in step S3-5-2;
step S3-5-4: based on a context judgment algorithm, marking whether the marked sound segment in the step S3-5-3 belongs to a normal sentence breaking or the same voice segment, and if the marked sound segment belongs to the same voice segment, counting the duration of the voice segment;
step S3-6: the interference degree quantization unit is a noise energy value, and the interference degree analysis method comprises the following steps:
step S3-6-1: performing short-time Fourier transform on the input voice to smooth the time domain and the frequency domain respectively, so as to obtain a short-time local energy spectrum value of the voice with noise;
step S3-6-2: taking the ratio of the energy spectrum value to the local minimum value as a threshold to reject noise energy in the noisy speech;
step S3-6-3: continuously updating the noise energy according to the threshold judgment result in the judgment process until the optimal noise reduction effect is obtained, and taking the energy value when the optimal noise reduction effect is obtained as the interference degree;
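A minimal sketch of the noise-floor idea behind steps S3-6-1 to S3-6-3; the smoothing window and sliding-minimum tracker are assumptions in the spirit of minimum-statistics noise estimation:

```python
import numpy as np

def interference_level(signal, sr, frame_ms=25, hop_ms=10, win=50):
    """Average tracked noise-floor energy of the noisy speech."""
    signal = np.asarray(signal, dtype=float)
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    energies = np.array([
        np.sum(np.abs(np.fft.rfft(signal[s:s + frame])) ** 2)
        for s in range(0, len(signal) - frame, hop)
    ])
    smoothed = np.convolve(energies, np.ones(5) / 5, mode="same")  # time smoothing
    # sliding local minimum as the noise floor, minimum-statistics style
    floor = np.array([smoothed[max(0, i - win):i + 1].min()
                      for i in range(len(smoothed))])
    return float(floor.mean())
```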
step S3-7: the language category quantization unit is the language category (Chinese = 0, English = 1, Chinese-English mixed = 2). The language category analysis method comprises the following steps:
step S3-7-1: constructing Chinese and English speech recognizers, wherein each speech recognizer pertinently contains the speech characteristics of respective languages;
step S3-7-2: extracting features of the input voice, matching the features with voice features of various languages, and determining voice language types;
step S3-8: the term quantization unit is the ratio of civil aviation terms in a piece of voice text, and the term ratio analysis method comprises the following steps:
step S3-8-1: obtaining a correct text corresponding to each text in the control voice database (based on manual labeling/semi-automation);
step S3-8-2: using a text analysis algorithm to perform text sentence breaking, word segmentation, character discrimination and other processes;
step S3-8-3: establishing a dictionary of control-instruction technical terms with reference to air traffic radiotelephony phraseology, matching the vocabulary extracted in step S3-8-2 against the dictionary with a matching algorithm, and counting the number of matches as the technical-term content of the voice;
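A sketch of the dictionary-matching idea in step S3-8, assuming the jieba tokenizer for Chinese word segmentation; the miniature term dictionary here is hypothetical, whereas the patent derives its dictionary from air traffic radiotelephony phraseology:

```python
import jieba  # pip install jieba

# hypothetical miniature dictionary; the patent builds this from
# MH/T 4014-2003 radiotelephony phraseology
CONTROL_TERMS = {"跑道", "塔台", "高度", "起飞", "落地", "联系", "保持"}

def term_ratio(transcript: str) -> float:
    """Share of segmented words that match the control-term dictionary."""
    words = [w for w in jieba.lcut(transcript) if w.strip()]
    if not words:
        return 0.0
    return sum(1 for w in words if w in CONTROL_TERMS) / len(words)

print(term_ratio("国航1234 联系塔台 保持高度三千"))
```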
step S3-9: the gray vocabulary content quantization unit is gray vocabulary content, and the gray vocabulary content analysis method comprises the following steps:
step S3-9-1: training an acoustic model by adopting a word library of sound sensing words;
step S3-9-2: framing, windowing and preprocessing the input voice, and extracting voice characteristics;
step S3-9-3: the acoustic model in the step S3-9-1 receives the voice characteristics in the step S3-9-2, detects the audio segment of the input voice containing the sound sensing word, establishes a gating mechanism by combining a context discrimination algorithm, and screens the sound sensing word to determine whether the audio segment is reserved;
step S3-9-4: marking the audio segments of the sense words which are detected to be nonsensical, and counting the number of nonsensical audio segments in the whole voice;
step S3-10: the sound variation quantization unit is the number of sound variations in a voice, and the sound variation analysis method comprises the following steps:
step S3-10-1: constructing a complete multi-tone word dictionary and a merged vocabulary library which is easy to generate sound variation;
step S3-10-2: obtaining a correct text corresponding to each text in the control voice database (based on manual labeling/semi-automation);
step S3-10-3: performing word segmentation, part-of-speech tagging, character discrimination and the like by using a text analysis algorithm;
step S3-10-4: and matching the controlled voice text with the polyphonic word dictionary and the combined vocabulary in the step S3-10-1 by adopting a matching algorithm, and counting the polyphonic words and the vocabulary number contained in the text.
The relationship between index values and recognition effect is shown in FIG. 4:
the higher the accent similarity of the voice data, the better the recognition effect; the recognition effect is best when the speech rate, the fundamental frequency of tone (pitch) and the amplitude of intensity sit at a certain intermediate value; the lower the number of abnormal continuity segments, the noise energy, the technical-term ratio, the gray vocabulary content and the number of sound variations, the better the recognition effect; and the recognition effect is best when the voice is in a single language.
The implementation of the content in step S4 includes the following steps:
step S4-1: inputting the number of grades to be divided and the voice data statistical analysis data set obtained by each single-evaluation-index analysis method, $\{x_1, x_2, x_3, \ldots, x_n\}$, where n is the number of data points in the set;
step S4-2: outputting a clustering result and each grade range;
the method for outputting the clustering result in the step S4-2 comprises the following steps:
step S4-2-1: since the control voice data has no pre-specified classes (here a "class" means a grouping under one speech evaluation index, a different concept from the quality grades discussed later in this patent), the elbow method or the silhouette coefficient method is applied to the acquired data set to determine the optimal number of classes k, i.e. the number of centroids; in the centroid set $\{c_1, c_2, c_3, \ldots, c_k\}$, each $c_i$ may or may not be a value occurring in the data set;
step S4-2-2: initializing the centroid (class center) values $x_j \in \{c_1, c_2, c_3, \ldots, c_k\}$, calculating the Euclidean distance from each sample point to each class center, and assigning each sample to its nearest class to form a clustering result, where the Euclidean distance is the true distance between two points in m-dimensional space, i.e. the natural length of a vector:

$$d_{ij} = \sqrt{\sum_{k=1}^{m} \left( x_{ik} - x_{jk} \right)^2}$$

where $d_{ij}$ is the distance from sample point $x_i$ to centroid $x_j$, $x_{ik}$ is the k-th attribute of the i-th sample and $x_{jk}$ the k-th attribute of the j-th sample, with m attribute dimensions in total. In the present invention each evaluation index is analyzed into a one-dimensional data value, so m = 1 and the distance reduces to:

$$d_{ij} = \lvert x_i - x_j \rvert$$
step S4-2-3: updating the cluster centers: for each class in the clustering result, calculating the mean of all its samples and taking it as the new cluster center;
step S4-2-4: taking the sum of the distances from the samples to their class centers as the objective function; if the iteration converges or a stopping condition is met, outputting the result; otherwise incrementing the number of categories by 1 and returning to step S4-2-2 to repeat the calculation;
step S4-2-5: because the algorithm is iterative, a global optimum is difficult to reach; a heuristic strategy can therefore be adopted, seeking a Nash equilibrium to approximate an optimal solution.
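As an illustration of steps S4-1 to S4-2-5 for a single evaluation index, the sketch below selects k by silhouette score and reads each grade's range off the one-dimensional clusters; scikit-learn's k-means stands in for the hand-rolled iteration described above, so this is an assumed implementation rather than the claimed one:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def grade_ranges(values, k_candidates=range(2, 8), seed=0):
    """Cluster one index's statistics (m = 1) and return each grade's range."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    def fit(k):
        return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    best_k = max(k_candidates, key=lambda k: silhouette_score(X, fit(k)))
    labels = fit(best_k)
    return best_k, {c: (float(X[labels == c].min()), float(X[labels == c].max()))
                    for c in range(best_k)}

# synthetic demo: speech-rate statistics from two populations of utterances
rng = np.random.default_rng(0)
rates = np.concatenate([rng.normal(3.5, 0.3, 100), rng.normal(5.5, 0.4, 100)])
print(grade_ranges(rates))
```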
The control voice quality rating step is as follows, and the control voice quality rating technical roadmap is as shown in fig. 7:
With reference to the guidance of the Civil Aircraft Pilot, Flight Instructor and Ground Instructor Qualification Approval Rules (CCAR-61) and MH/T 4014-2003 Radiotelephony Communications for Air Traffic Services, together with the judgment of those skilled in the civil aviation art, a 1-5 grading of control voice quality is creatively proposed; the higher the grade, the better the voice quality, grade 5 being the highest. The judgment criteria of each grade are as follows:
1) Grade 1: the proportion of control technical vocabulary is small; speech is too fast or too slow; Chinese-English mixed pronunciation; Mandarin carries some accent under the influence of the speaker's native language or region; vocabulary audio (polyphones, homophones) that can mislead semantic understanding; heavy interference from transmission channels, surrounding noise and the like;
2) Grade 2: the proportion of control technical vocabulary is somewhat small; speech is too fast or too slow; a small amount of gray vocabulary and Chinese-English mixed pronunciation; Mandarin is slightly accented under the influence of the speaker's native language or region; a few vocabulary audio items that can mislead semantic understanding;
3) Grade 3: an ordinary proportion of technical vocabulary; normal speech rate; no gray vocabulary pronunciation; single-language speech; the speech signal may occasionally stall; little audio interference; no accent;
4) Grade 4: a large proportion of technical vocabulary; good voice clarity; normal speech rate; fluent speech; a few vocabulary audio items that can mislead semantic understanding;
5) Grade 5: a larger proportion of control technical vocabulary; low interference; fluent speech; standard Mandarin pronunciation; no vocabulary audio that misleads semantic understanding.
Step S5-1: the subjective weighting method uses expert experience to adjust and optimize the objective weight values of the evaluation indexes, making the weights more scientific and reasonable and so enabling a quantitative, intuitive display of control voice quality status. Using the 1-9 scale method, the indexes belonging to one level are compared pairwise and quantitatively for their importance relative to the same index of the level above, forming a judgment matrix X. The eigenvector corresponding to the largest eigenvalue of the judgment matrix is computed by the maximum-eigenvector method; once the judgment matrix passes the consistency test, this eigenvector can serve as the subjective weight of each index, denoted $\alpha_j$;
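A sketch of the 1-9 scale subjective weighting in step S5-1: principal-eigenvector weights plus the usual CR = CI/RI consistency check. The 3x3 judgment matrix below is a made-up example, not expert weights from the patent:

```python
import numpy as np

# random consistency indexes for matrix orders 1..9 (standard AHP table)
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(X):
    """Principal-eigenvector weights of a 1-9 scale judgment matrix + CR check."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    eigvals, eigvecs = np.linalg.eig(X)
    k = int(np.argmax(eigvals.real))
    w = np.abs(eigvecs[:, k].real)
    w /= w.sum()                           # subjective weights alpha_j
    ci = (eigvals[k].real - n) / (n - 1)   # consistency index
    cr = ci / RI[n] if RI[n] > 0 else 0.0  # consistency ratio
    return w, cr

# made-up 3-index judgment matrix, not weights from the patent
X = [[1, 3, 5], [1/3, 1, 2], [1/5, 1/2, 1]]
w, cr = ahp_weights(X)
print(w, "consistent" if cr < 0.1 else "revise the judgment matrix")
```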
step S5-2: the objective weighting method comprises the following steps:
step S5-2-1: forward-ization of the indexes, converting the minimal-type and intermediate-type indexes into maximal-type ones:

minimal-type to maximal-type:

$$x_i' = \max\{x_1, x_2, \ldots, x_n\} - x_i$$

intermediate-type to maximal-type:

$$x_{new} = 1 - \frac{\lvert x_i - x_{best} \rvert}{\max_i \lvert x_i - x_{best} \rvert}$$

where $x_{best}$ is the value optimal for recognition effect, taken from the voice set obtained by the evaluation-index method, and $x_{new}$ is the forward-ized value;
Step S5-2-2: data standardization, balancing dimensional errors between the indexes:

$$z_{ij} = \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}$$

where $x_{ij}$ is the value of the i-th voice under the j-th evaluation index;
step S5-2-3: data normalization, unifying the values to the interval 0-1:

$$p_{ij} = \frac{z_{ij}}{\sum_{i=1}^{n} z_{ij}}$$

where n is the number of evaluation objects;
step S5-2-4: calculating the information entropy value of each evaluation index:

$$e_j = -\frac{1}{\ln n} \sum_{i=1}^{n} p_{ij} \ln p_{ij}$$

where n is the number of evaluation objects, m is the number of evaluation indexes, and j runs from 1 to m;
step S5-2-5: calculating the weights:

$$d_j = 1 - e_j, \qquad \beta_j = \frac{d_j}{\sum_{j=1}^{m} d_j}$$

where j runs from 1 to m;
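A condensed sketch of the objective weighting in steps S5-2-2 to S5-2-5, following the formulas above; the small epsilon guards against division by zero are an implementation assumption:

```python
import numpy as np

def entropy_weights(X):
    """X: (n, m) matrix of n voices x m forward-ized evaluation indexes."""
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    Z = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)  # S5-2-2
    P = Z / (Z.sum(axis=0) + 1e-12)                                    # S5-2-3
    e = -np.sum(P * np.log(P + 1e-12), axis=0) / np.log(n)             # S5-2-4
    d = 1.0 - e                                                        # S5-2-5
    return d / d.sum()                          # objective weights beta_j
```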
step S5-3: subjective-objective weight fusion:

$$w_j = \frac{\alpha_j \beta_j}{\sum_{j=1}^{m} \alpha_j \beta_j}$$

where n is the number of evaluation objects, $\alpha_j$ is the subjective weight and $\beta_j$ the objective weight;
step S5-4: composite score of each voice:

$$s_i = \sum_{j=1}^{m} w_j z_{ij}$$

where $z_{ij}$ is the standardized value of the i-th evaluation object under the j-th evaluation index, and i runs from 1 to n;
step S5-5: score range of each control voice quality grade:
the composite score is computed over the whole standard control voice database by steps S5-1 to S5-4, and the sorted composite scores of the whole database are divided into 5 grades; the interval of each grade is the score range of that quality grade. The intervals lie within 0 to 1, and since all evaluation indexes have been forward-ized, a higher composite score means better quality, grade 5 being the best.
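Finally, a sketch tying steps S5-3 to S5-5 together, reusing the `ahp_weights` and `entropy_weights` sketches above; the equal-count split of the sorted scores into five grades is one plausible reading of "divided according to 5 grades", not necessarily the patent's exact partition:

```python
import numpy as np

def fuse_weights(alpha, beta):
    """Step S5-3: w_j proportional to alpha_j * beta_j."""
    w = np.asarray(alpha, dtype=float) * np.asarray(beta, dtype=float)
    return w / w.sum()

def composite_scores(Z, w):
    """Step S5-4: s_i = sum_j w_j * z_ij over the standardized matrix Z."""
    return np.asarray(Z, dtype=float) @ np.asarray(w, dtype=float)

def grade_score_ranges(scores, n_grades=5):
    """Step S5-5: split the sorted scores into grade intervals, grade 1 lowest."""
    chunks = np.array_split(np.sort(np.asarray(scores, dtype=float)), n_grades)
    return {g + 1: (float(c[0]), float(c[-1])) for g, c in enumerate(chunks)}
```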
A quantitative evaluation and classification device for control voice quality division uses a Core i7-12700 processor, a Samsung 480 PRO 1 TB solid-state drive for storage, and four NVIDIA P40 GPUs to accelerate the processing of the relevant steps.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (3)

1. A quantitative evaluation and classification method for control voice quality division is characterized by comprising the following steps:
s1, inputting voice data of a standard control voice database marked with correct meanings;
s2, constructing an evaluation index system for control voice quality division according to the characteristics of civil aviation ground-air communication, wherein the evaluation index system comprises maximal-type, intermediate-type, minimal-type and fixed-value indexes: the maximal-type index comprises accent; the intermediate-type index comprises speech rate, tone and intensity; the minimal-type index comprises continuity, interference level, technical-term ratio, gray vocabulary content and sound variation; and the fixed-value index comprises language category;
s3, performing qualitative analysis on each evaluation index, comprising a technical analysis method and a graded quantization unit for each index, wherein the qualitative analysis analyzes and processes the voice data of step S1 through each single evaluation index to obtain a voice data statistical analysis data set, and comprises the following steps:
step S3-1: the speech rate quantization unit is characters/second for Chinese and syllables/second for English, and the speech rate analysis method comprises the following steps:
step S3-1-1: framing, windowing and preprocessing an input voice signal;
step S3-1-2: detecting an audio segment of valid speech; calculating the number of frames of effective audio frequency and obtaining the effective pronunciation time;
step S3-1-3: processing the text corresponding to the audio to obtain the effective character number or vocabulary number of the audio text;
step S3-1-4: calculating the speech rate: speech rate = number of effective syllables or characters / effective pronunciation time;
step S3-2: the tone quantization unit is a pitch change frequency, and the tone analysis method comprises the following steps:
s3-2-1, framing, windowing and preprocessing an input voice signal, and filtering other interference factors;
s3-2-2, carrying out Fourier transformation on the preprocessed framing signals, and extracting time domain and frequency domain characteristic information of the voice waveform;
s3-2-3, directly estimating waveform change trend by a time domain and frequency domain estimation method of the voice waveform;
step S3-3: the tone scale unit is amplitude, and the tone scale analysis method comprises the following steps:
step S3-3-1: framing, windowing and preprocessing an input voice signal;
step S3-3-2: obtaining various frequency and amplitude values through short-time Fourier transformation and splitting of original signals;
step S3-3-3: carrying out normal distribution description on each amplitude value in the voice, and taking expected values of normal distribution as intensity measurement values of the voice;
step S3-4: the accent quantization unit is similarity, and the accent analysis method comprises the following steps:
step S3-4-1: establishing a standard mandarin chinese phone library, and mapping different sound features into corresponding phones;
step S3-4-2: extracting phonemes of the input speech using a phoneme extraction algorithm;
step S3-4-3: comparing the difference of the standard pronunciation and the pronunciation of the accent of the input system, decoding the input voice by the acoustic model to obtain a voice characteristic sequence, comparing the voice characteristic sequence with the characteristic sequence of the standard mandarin, expressing the characteristic sequence by using the characteristic vector, and calculating the similarity between the two characteristic vectors;
step S3-5: the continuous quantification unit is the number of continuous abnormal segments in one voice, and the continuous analysis method comprises the following steps:
step S3-5-1: preprocessing input voice;
step S3-5-2: removing silence segments at the head end and the tail end of each voice by using an energy-based voice endpoint detection method, and marking out effective voice internal continuity abnormal segments;
step S3-5-3: using a human-voice-based voice endpoint detection method, marking the portions without speaker voice inside the continuity-abnormal segments of the effective voice obtained in step S3-5-2;
step S3-5-4: based on a context judgment algorithm, marking whether the marked sound segment in the step S3-5-3 belongs to a normal sentence breaking or the same voice segment, and if the marked sound segment belongs to the same voice segment, counting the duration of the voice segment;
step S3-6: the interference degree quantization unit is a noise energy value, and the interference degree analysis method comprises the following steps:
step S3-6-1: performing short-time Fourier transform on the input voice to smooth the time domain and the frequency domain respectively, so as to obtain a short-time local energy spectrum value of the voice with noise;
step S3-6-2: taking the ratio of the energy spectrum value to the local minimum value as a threshold, and removing noise energy in the voice with noise;
step S3-6-3: continuously updating the noise energy according to a threshold judgment result in the judgment process until an optimal noise reduction effect is obtained, and taking the energy value when the optimal noise reduction effect is obtained as the interference degree;
step S3-7: the language category quantization unit is a language category, chinese is represented by 0, english is represented by 1, chinese-English mixing is represented by 2, and the language category analysis method comprises the following steps:
step S3-7-1: constructing Chinese and English speech recognizers, wherein each speech recognizer pertinently contains the speech characteristics of respective languages;
step S3-7-2: extracting features of the input voice, matching the features with voice features of various languages, and determining voice language types;
step S3-8: the term quantization unit is the ratio of civil aviation terms in a piece of voice text, and the term ratio analysis method comprises the following steps:
step S3-8-1: acquiring a correct text corresponding to each text in the control voice database;
step S3-8-2: using a text analysis algorithm to perform text sentence breaking, word segmentation and character discrimination processing;
step S3-8-3: establishing a dictionary of control-instruction technical terms, matching the vocabulary extracted in step S3-8-2 against the dictionary with a matching algorithm, and counting the number of matches as the technical-term content of the voice;
step S3-9: the gray vocabulary content quantization unit is gray vocabulary content, and the gray vocabulary content analysis method comprises the following steps:
step S3-9-1: training an acoustic model by adopting a word library of sound sensing words;
step S3-9-2: framing, windowing and preprocessing the input voice, and extracting voice characteristics;
step S3-9-3: the acoustic model in the step S3-9-1 receives the voice characteristics in the step S3-9-2, detects an audio segment of input voice containing a sound sensing word, establishes a gating mechanism by combining a context discrimination algorithm, and discriminates the sound sensing word to determine whether the audio segment is reserved;
step S3-9-4: marking the audio segments of the sense words which are detected to be nonsensical, and counting the number of nonsensical audio segments in the whole voice;
step S3-10: the sound variation quantization unit is the number of sound variations, and the sound variation analysis method comprises the following steps:
step S3-10-1: constructing a complete multi-tone word dictionary and a merged vocabulary library which is easy to generate sound variation;
step S3-10-2: acquiring a correct text corresponding to each text in the control voice database;
step S3-10-3: performing word segmentation, part-of-speech tagging and character discrimination processing by using a text analysis algorithm;
step S3-10-4: matching the controlled voice text with the polyphonic word dictionary and the combined vocabulary in the step S3-10-1 by adopting a matching algorithm, and counting the polyphonic words and the vocabulary number contained in the text;
s4, grouping, with a clustering method, the voice data statistical analysis data set obtained through single-evaluation-index processing, and prescribing the range of values for each voice grade under the single evaluation index, wherein the grouping comprises the following steps:
step S4-1: inputting the number of grades to be divided and a voice data statistical analysis data set obtained by each single evaluation index analysis method;
step S4-2: determining the optimal number of categories by the elbow method or the silhouette coefficient method;
step S4-3: initializing the class center values, calculating the Euclidean distance from each sample point to each class center, and assigning each sample point to its nearest class to form a clustering result, where the Euclidean distance is the true distance between two points in m-dimensional space, i.e. the natural length of a vector:

$$d_{ij} = \sqrt{\sum_{k=1}^{m} \left( x_{ik} - x_{jk} \right)^2}$$

where $d_{ij}$ is the distance from sample point $x_i$ to centroid $x_j$, $x_{ik}$ is the k-th attribute of the i-th sample, $x_{jk}$ is the k-th attribute of the j-th sample, and there are m attribute dimensions in total;
step S4-4: calculating the average value of all samples in each category of the clustering result, and taking the average value as a new clustering center;
step S4-5: taking the sum of the distances from the samples to their class centers as the objective function; if the iteration converges or a stopping condition is met, outputting the result; otherwise incrementing the number of categories by 1 and returning to step S4-3 to repeat the calculation;
step S4-6: because the algorithm is iterative, a global optimum is difficult to reach; a heuristic strategy is therefore adopted, seeking a Nash equilibrium to approximate an optimal solution;
s5, weighting each evaluation index with a weighted fusion algorithm and processing the grouped voice data statistical analysis data sets to form voice data sets of several quality levels, wherein the weighted fusion algorithm comprises a subjective weighting method and an objective weighting method, and forming the voice data sets of several quality levels comprises the following steps:
step S5-1: the subjective weighting method uses expert experience to adjust and optimize the objective weight values of the evaluation indexes: with the 1-9 scale method, the indexes belonging to one level are compared pairwise and quantitatively for their importance relative to the same index of the level above, forming a judgment matrix X; the eigenvector corresponding to the largest eigenvalue of the judgment matrix is computed by the maximum-eigenvector method and, when the judgment matrix satisfies the consistency check, serves as the subjective weight of each index, denoted $\alpha_j$;
step S5-2: the objective weighting method comprises the following steps:
step S5-2-1: forward-ization of the indexes, converting the minimal-type and intermediate-type indexes into maximal-type ones:

minimal-type to maximal-type:

$$x_i' = \max\{x_1, x_2, \ldots, x_n\} - x_i$$

intermediate-type to maximal-type:

$$x_{new} = 1 - \frac{\lvert x_i - x_{best} \rvert}{\max_i \lvert x_i - x_{best} \rvert}$$

where $x_{best}$ is the value optimal for recognition effect, taken from the voice set obtained by the evaluation-index method, and $x_{new}$ is the forward-ized value;
step S5-2-2: data standardization, balancing dimensional errors between the indexes:

$$z_{ij} = \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}$$

where $x_{ij}$ is the value of the i-th voice under the j-th evaluation index;
step S5-2-3: data normalization, unifying the values to the interval 0-1:

$$p_{ij} = \frac{z_{ij}}{\sum_{i=1}^{n} z_{ij}}$$

where n is the number of evaluation objects;
step S5-2-4: calculating the information entropy value of each evaluation index:

$$e_j = -\frac{1}{\ln n} \sum_{i=1}^{n} p_{ij} \ln p_{ij}$$

where n is the number of evaluation objects, m is the number of evaluation indexes, and j runs from 1 to m;
step S5-2-5: calculating the weights:

$$d_j = 1 - e_j, \qquad \beta_j = \frac{d_j}{\sum_{j=1}^{m} d_j}$$

where j runs from 1 to m;
step S5-3: subjective-objective weight fusion:

$$w_j = \frac{\alpha_j \beta_j}{\sum_{j=1}^{m} \alpha_j \beta_j}$$

where n is the number of evaluation objects, $\alpha_j$ is the subjective weight and $\beta_j$ the objective weight;
step S5-4: composite score of each voice:

$$s_i = \sum_{j=1}^{m} w_j z_{ij}$$

where $z_{ij}$ is the standardized value of the i-th evaluation object under the j-th evaluation index, and i runs from 1 to n;
step S5-5: score range of each control voice quality grade:
the composite score is computed over the whole standard control voice database by the method of steps S5-1 to S5-4, and the sorted composite scores of the whole database are divided into a plurality of grades; the interval of each grade is the score range of that quality grade. The intervals lie within 0 to 1, and since all evaluation indexes have been forward-ized, a higher composite score means better quality, the highest grade being the best.
2. The method for quantitatively evaluating and classifying quality of controlled speech according to claim 1, wherein the quality of the controlled speech is classified into 1-5 classes, the higher the class, the better the quality of the speech.
3. The device for quantitatively evaluating and classifying the control-oriented voice quality division is characterized by comprising at least one processor and a memory which is in communication connection with the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the classification method of claim 1.
CN202211469949.7A 2022-11-22 2022-11-22 Quantitative evaluation and classification method and device for quality division of control voice Active CN115547299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211469949.7A CN115547299B (en) 2022-11-22 2022-11-22 Quantitative evaluation and classification method and device for quality division of control voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211469949.7A CN115547299B (en) 2022-11-22 2022-11-22 Quantitative evaluation and classification method and device for quality division of control voice

Publications (2)

Publication Number Publication Date
CN115547299A CN115547299A (en) 2022-12-30
CN115547299B true CN115547299B (en) 2023-08-01

Family

ID=84721576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211469949.7A Active CN115547299B (en) 2022-11-22 2022-11-22 Quantitative evaluation and classification method and device for quality division of control voice

Country Status (1)

Country Link
CN (1) CN115547299B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116092482B (en) 2023-04-12 2023-06-20 中国民用航空飞行学院 Real-time control voice quality metering method and system based on self-attention

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105679309A (en) * 2014-11-21 2016-06-15 科大讯飞股份有限公司 Method and device for optimizing speech recognition system
CN112466335A (en) * 2020-11-04 2021-03-09 吉林体育学院 English pronunciation quality evaluation method based on accent prominence
CN112967711A (en) * 2021-02-02 2021-06-15 早道(大连)教育科技有限公司 Spoken language pronunciation evaluation method, spoken language pronunciation evaluation system and storage medium for small languages
CN113792982A (en) * 2021-08-19 2021-12-14 北京邮电大学 Scientific and technological service quality assessment method and device based on combined weighting and fuzzy gray clustering
EP4086903A1 (en) * 2021-05-04 2022-11-09 GN Audio A/S System with post-conversation evaluation, electronic device, and related methods

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4309749B2 (en) * 2003-12-02 2009-08-05 日本電信電話株式会社 Voice quality objective evaluation system considering bandwidth limitation
CN104835354B (en) * 2015-05-20 2017-03-01 青岛民航空管实业发展有限公司 A kind of control load management system and its appraisal procedure of ATC controller workload
CN107564534B (en) * 2017-08-21 2021-12-31 腾讯音乐娱乐(深圳)有限公司 Audio quality identification method and device
CN108877839B (en) * 2018-08-02 2021-01-12 南京华苏科技有限公司 Voice quality perception evaluation method and system based on voice semantic recognition technology
CN110490428A (en) * 2019-07-26 2019-11-22 合肥讯飞数码科技有限公司 Job of air traffic control method for evaluating quality and relevant apparatus
CN112435512B (en) * 2020-11-12 2023-01-24 郑州大学 Voice behavior assessment and evaluation method for rail transit simulation training
CN113779798A (en) * 2021-09-14 2021-12-10 国网江苏省电力有限公司电力科学研究院 Electric energy quality data processing method and device based on intuitive fuzzy combination empowerment

Also Published As

Publication number Publication date
CN115547299A (en) 2022-12-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant