CN115547299A - Quantitative evaluation and classification method and device for controlled voice quality division - Google Patents
- Publication number
- CN115547299A (application CN202211469949.7A)
- Authority
- CN
- China
- Prior art keywords
- index
- voice
- value
- evaluation
- evaluation index
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/01—Assessment or evaluation of speech recognition systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention discloses a quantitative evaluation and classification method and device for controlled voice quality division. The method comprises the following steps: S1, inputting voice data from a standard control voice database annotated with correct meanings; S2, constructing an evaluation index system for controlled voice quality division in view of the characteristics of civil aviation land-air communication; S3, qualitatively analyzing each evaluation index, covering the technical analysis method and the unit of index grading quantization; S4, grouping the data obtained from each single-evaluation-index analysis with a clustering method, and specifying the range of each voice grade under that index; S5, weighting the indexes with a weighted fusion algorithm to combine them into controlled voice data sets of graded quality. The device comprises at least one processor and at least one memory. The invention solves the problem that voice quality cannot be analyzed objectively and quantitatively and that the correspondence between controlled voice quality and each evaluation index cannot be made clear.
Description
Technical Field
The invention relates to the field of quality measurement of controlled voice data, in particular to a quantitative evaluation and classification method and device for controlled voice quality division.
Background
At present, mainstream speech quality evaluation methods revolve around models such as MOS (Mean Opinion Score), PESQ (Perceptual Evaluation of Speech Quality) and PSQM (Perceptual Speech Quality Measure). These are, however, rather fuzzy evaluation methods: they obtain an evaluation score by mapping through machine learning algorithms and neural network models according to grade standards determined by people in advance, so subjective factors loom large and the objectivity of the result is insufficient. In addition, existing objective speech quality assessment methods either characterize speech quality without a reference, based on certain specific parameters, or by comparison against a reference signal. Such objective methods yield only an overall evaluation result and, much like black-box testing, form no relatively complete evaluation index system during the objective speech evaluation process. The inability to analyze speech quality objectively and quantitatively, or to identify the measurement units of such analysis, is a major difficulty for later research on the performance of speech recognition software, because under the same recognition software, different speech qualities lead to different recognition performance.
In view of the above, the existing method for classifying controlled speech quality is not objective enough: it cannot analyze speech quality objectively and quantitatively or identify the measurement units of objective quantitative analysis, cannot make clear the correspondence between controlled speech quality and each evaluation index, and has not formed a sound controlled-speech evaluation index system.
Disclosure of Invention
The invention aims to solve the problems that voice quality cannot be objectively and quantitatively analyzed and that the correspondence between controlled voice quality and each evaluation index cannot be made clear. It provides a quantitative evaluation and classification method and device for controlled voice quality division, which grade the voice data quality of a standard control voice database (containing audio and labeled text) established by the project, and design and generate test sets of different control types and different difficulty levels.
In order to achieve the above object, the present invention provides the following technical solutions:
a quantitative evaluation and classification method for controlling voice quality division comprises the following steps:
s1, inputting voice data of a standard control voice database marked with correct meanings;
s2, considering the characteristics of civil aviation land-air communication, constructing an evaluation index system for controlling voice quality division;
s3, qualitatively analyzing each evaluation index, including a technical analysis method and an index grading quantization unit;
s4, grouping data obtained by analyzing the single evaluation index by adopting a clustering method, and specifying the range value of each voice grade under the single evaluation index;
s5, weighting each evaluation index by adopting a weighted fusion algorithm to combine into a control voice data set with multiple levels of quality.
Preferably, in step S2, the evaluation index system for controlled voice quality division includes extremely-large indexes, intermediate indexes, extremely-small indexes and specified indexes. For an extremely-large index (e.g. accent similarity), the larger the value, the better the speech recognition effect. For an intermediate index (speech rate, tone (pitch), sound intensity), the closer the value is to a certain intermediate value, the better the recognition effect. For an extremely-small index (continuity, interference degree, professional-term proportion, grey-vocabulary content, pitch change), the smaller the value, the better the recognition effect. For a specified index (language category), the recognition effect is best when the value equals a particular value.
Preferably, grouping the data obtained from each single-evaluation-index analysis by the clustering method in step S4 and specifying the range of each voice grade under that index comprises the following steps:
step S4-1: inputting the number of grades to be divided and a data set obtained by each single evaluation index analysis method;
step S4-2: and outputting the clustering result and each grade range.
Preferably, outputting the clustering result in step S4-2 comprises the following steps:
step S4-2-1: determining the optimal category number by adopting an elbow method or a contour coefficient method;
step S4-2-2: initializing class-center values, calculating the Euclidean distance from each sample point to each class center, and assigning each sample to the closest class to form a clustering result, wherein the Euclidean distance is the real distance between two points in m-dimensional space, i.e. the natural length of a vector:

$$d(x_i, c_j) = \sqrt{\sum_{k=1}^{m} (x_{ik} - c_{jk})^2}$$

where $d(x_i, c_j)$ is the distance from sample point $x_i$ to centroid $c_j$, $x_{ik}$ is the kth attribute of the ith sample, $c_{jk}$ is the kth attribute of the jth class center, and there are m-dimensional attributes in total;
step S4-2-3: calculating the mean value of all samples in each category of the clustering result as a new clustering center;
step S4-2-4: taking the sum of the distances from the samples to their class centers as the objective function; if the iteration converges or the stopping condition is met, output the result; otherwise increment the number of classes by 1 and return to step S4-2-2 for recalculation;
step S4-2-5: the algorithm uses iterative computation and the global optimal solution is hard to reach; a heuristic strategy is therefore adopted, using Nash equilibrium to approach the optimal solution of the problem.
Preferably, the controlled speech quality is divided into grades 1-5; the higher the grade, the better the speech quality.
Preferably, the weighted fusion algorithm adopted in step S5 comprises a subjective weighting method and an objective weighting method, implemented as follows:
step S5-1: the subjective weighting method uses expert experience to adjust and optimize the objectively derived weights of the evaluation indexes. The importance of each index at one level relative to the corresponding index at the level above is compared quantitatively on a 1-9 scale to form a judgment matrix X; the eigenvector corresponding to the largest eigenvalue of the judgment matrix is computed, and once the matrix passes the consistency check, this eigenvector is used as the weights of the indexes;
step S5-2: the objective weighting method comprises the following steps:
step S5-2-1: the indexes are forward-converted, i.e. the extremely-small and intermediate indexes are converted into extremely-large indexes:

extremely small -> extremely large:

$$\hat{x}_i = \max(x) - x_i$$

intermediate -> extremely large:

$$\hat{x}_i = 1 - \frac{|x_i - x_{best}|}{\max_i |x_i - x_{best}|}$$

where $x_{best}$ is the value giving the best recognition effect, taken as the central value of the index over the voice set, and $\hat{x}_i$ is the forward-converted value;
step S5-2-2: data normalization, to balance the dimensional differences between indexes:

$$z_{ij} = \frac{x_{ij}}{\sqrt{\sum_{i=1}^{n} x_{ij}^2}}$$

where $x_{ij}$ is the value of the ith voice under the jth evaluation index;
step S5-2-3: data normalization, unifying values to the interval 0-1:

$$p_{ij} = \frac{z_{ij}}{\sum_{i=1}^{n} z_{ij}}$$

where n is the number of evaluation objects;
step S5-2-4: calculating the information entropy of each evaluation index:

$$e_j = -\frac{1}{\ln n} \sum_{i=1}^{n} p_{ij} \ln p_{ij}$$

where n is the number of evaluation objects, m is the number of evaluation indexes, and j runs from 1 to m;
step S5-2-5: calculating the weight:

$$w_j = \frac{1 - e_j}{\sum_{j=1}^{m} (1 - e_j)}$$

where j runs from 1 to m;
step S5-3: fusion of subjective and objective weights:

$$w_j = \frac{\alpha_j \beta_j}{\sum_{j=1}^{m} \alpha_j \beta_j}$$

where m is the number of evaluation indexes, $\alpha_j$ is the subjective weight and $\beta_j$ the objective weight of the jth index;
step S5-4: each voice is given a composite score:

$$S_i = \sum_{j=1}^{m} w_j p_{ij}$$

where $p_{ij}$ is the normalized value of the ith evaluation object under the jth evaluation index, and i runs from 1 to n;
step S5-5: each controlled speech quality level score range:
the comprehensive score is calculated according to the evaluation method in the whole standard control voice database, the comprehensive score sequence of the whole database is divided according to 5 grades, the interval range of each grade is the score range of each quality grade, the interval range is 0-1, all evaluation indexes are subjected to forward processing, and therefore the quality is better when the comprehensive score value is larger, and the 5-grade quality is optimal.
A quantitative evaluation and classification device for controlled voice quality division comprises at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor to enable it to perform the steps of any of the above classification methods.
Compared with the prior art, the invention has the following beneficial effects:
1. A control voice quality evaluation index system is established and each evaluation index in it is quantitatively analyzed, realizing objective quantitative research on controlled voice quality, defining the measurement unit for its objective quantitative analysis, and obtaining the correspondence between controlled voice quality and each evaluation index;
2. Based on the quantitative analysis of the evaluation indexes and a subjective-objective weighting fusion algorithm, an objective controlled voice quality division method is established. Third-party control voice recognition software can be tested with the divided voice data sets of different quality grades, which helps aviation units select control voice recognition software and improves the efficiency, safety, reliability and effectiveness of air traffic control.
Drawings
FIG. 1 is a diagram of a partitioning technique for controlling speech quality;
FIG. 2 is a block diagram of a quantitative analysis structure for controlling speech quality classification;
FIG. 3 is a diagram of index classification;
FIG. 4 is a diagram illustrating the index values and the recognition effect trend;
FIG. 5 is a diagram of the effect of the first part of the index quantization grading;
FIG. 6 is a diagram of the effect of the second part of the index quantization grading;
FIG. 7 is a roadmap for the assessment technique for governing speech quality ratings.
Detailed Description
The present invention will be described in further detail with reference to test examples and specific embodiments. It should be understood that the scope of the above-described subject matter is not limited to the following examples, and any techniques implemented based on the disclosure of the present invention are within the scope of the present invention.
Examples
The embodiment of the application, relying on the project background, collects control voice corpora from the actual environment of civil aviation operation to establish a standard control voice database. The corpus content covers air-traffic-control characteristics such as multiple scenes, different controller pronunciations, different control instruction voices, different flight phases, a very large vocabulary of land-air radiotelephony phrases, and single or mixed language pronunciation. Each control voice audio in the database is annotated with the corresponding control instruction text, and the data in the database are quantitatively evaluated and classified.
The implementation process and steps of the embodiment of the application are as follows, the flow block diagram is shown in fig. 1, and the quantitative analysis structure block diagram for controlling the voice quality division is shown in fig. 2:
s1, inputting voice data of a standard control voice database marked with correct meanings;
s2, considering the characteristics of the land-air communication, constructing an evaluation index system for controlling voice quality division;
s3, qualitatively analyzing each evaluation index, including a technical analysis method and an index grading quantization unit;
s4, grouping data obtained by analyzing the single evaluation index by adopting a clustering method, and specifying the range value of each voice grade under the single evaluation index;
s5, weighting each evaluation index by adopting a weighted fusion algorithm to combine into a control voice data set with 5 levels of quality.
In step S2, the index is classified as shown in fig. 3:
The evaluation index system for controlled voice quality division includes extremely-large, intermediate, extremely-small and specified indexes. The larger the value of an extremely-large index (accent similarity), the better the speech recognition effect; the closer an intermediate index (speech rate, tone (pitch), sound intensity) is to a certain intermediate value, the better; the smaller an extremely-small index (continuity, interference degree, professional-term proportion, grey-vocabulary content, pitch change), the better; and a specified index (language category) gives the best recognition effect at a particular value, the recognition effect for single-language speech being better than for mixed-language speech.
The steps for qualitatively analyzing each evaluation index described in step S3 are as follows; the index quantization grading effect graphs are shown in figs. 5 and 6, which are the two halves of one whole graph:
step S3-1: the unit of speech rate quantization is word/second (Chinese), syllable/second (English), the speech rate analysis method includes the following steps:
step S3-1-1: performing framing, windowing and preprocessing on an input voice signal;
step S3-1-2: detecting an audio segment of valid speech; calculating the frame number of the effective audio frequency to obtain the effective pronunciation time;
step S3-1-3: processing a text corresponding to the audio to obtain the effective character number or vocabulary number of the audio text;
step S3-1-4: calculating the speech rate: speech rate = number of syllables (or characters) / effective pronunciation time, the effective pronunciation time being obtained from the number of valid audio frames in step S3-1-2;
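A rough sketch of steps S3-1-1 to S3-1-4: frame and window the signal, treat frames above a short-time energy threshold as valid speech to obtain the effective pronunciation time, and divide the syllable (or character) count by that time. The frame length, hop size and energy threshold are illustrative assumptions, not parameters fixed by the patent, and real valid-audio detection would be more elaborate.

```python
import numpy as np

def speech_rate(signal, sr, n_syllables, frame_ms=25, hop_ms=10, energy_ratio=0.1):
    """Estimate speech rate in syllables (or characters) per second.

    Frames + windows the signal (S3-1-1), keeps frames whose short-time
    energy exceeds a threshold as valid audio (a crude stand-in for S3-1-2),
    converts the valid frame count into an effective pronunciation time, and
    divides the syllable count by it (S3-1-4)."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    window = np.hamming(frame_len)

    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        energies.append(np.sum(frame ** 2))
    energies = np.asarray(energies)

    threshold = energy_ratio * energies.max()           # simple adaptive threshold
    n_valid_frames = int(np.sum(energies > threshold))  # frames with active speech
    effective_seconds = n_valid_frames * hop_ms / 1000  # effective pronunciation time

    return n_syllables / effective_seconds if effective_seconds > 0 else 0.0
```

For a one-second tone followed by one second of silence, the effective time comes out close to one second, so 5 syllables give a rate near 5 per second.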
step S3-2: the quantization unit of tone (pitch) is the frequency of pitch change, and the tone (pitch) analysis method comprises the following steps:
s3-2-1, performing framing, windowing and preprocessing on the input voice signal, and filtering out other interference factors;
s3-2-2, performing Fourier transform on the preprocessed framing signals, and extracting time domain and frequency domain characteristic information of the voice waveform;
s3-2-3, directly estimating the waveform variation trend by a time domain and frequency domain estimation method of a voice waveform;
step S3-3: amplitude (dB) is taken as a sound intensity quantization unit, and the sound intensity analysis method comprises the following steps:
step S3-3-1: performing framing, windowing and preprocessing on an input voice signal;
step S3-3-2: obtaining each frequency and amplitude value through short-time Fourier transform and splitting an original signal;
step S3-3-3: performing normal distribution description on each amplitude value in the voice, and taking an expected value of the normal distribution as a sound intensity measurement value of the voice;
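Steps S3-3-1 to S3-3-3 can be sketched as follows: frame and window the signal, take magnitude spectra by short-time Fourier transform, and use the expectation of the amplitude values (the mean of a fitted normal distribution) as the sound-intensity measure in dB. Frame sizes and the dB reference are illustrative assumptions.

```python
import numpy as np

def sound_intensity_db(signal, sr, frame_ms=25, hop_ms=10):
    """Sound-intensity measure per steps S3-3-1 to S3-3-3.

    Frames and windows the signal, splits it into per-frame magnitude
    spectra via short-time Fourier transform, then takes the expectation
    (sample mean) of all amplitude values as the intensity measure,
    expressed in dB relative to unit amplitude."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    window = np.hamming(frame_len)

    amplitudes = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        spectrum = np.fft.rfft(signal[start:start + frame_len] * window)
        amplitudes.append(np.abs(spectrum))
    amplitudes = np.concatenate(amplitudes)

    mean_amp = amplitudes.mean()   # expectation of the fitted normal distribution
    return 20.0 * np.log10(mean_amp + 1e-12)
```

Doubling the signal amplitude raises this measure by about 6 dB, as expected for an amplitude-based dB scale.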
step S3-4: the accent quantization unit is similarity, and the accent analysis method comprises the following steps:
step S3-4-1: establishing a standard Mandarin phoneme library and mapping different sound characteristics to corresponding phonemes;
step S3-4-2: extracting the phonemes of the input speech with a phoneme extraction algorithm;
step S3-4-3: comparing the standard pronunciation with the accented pronunciation input to the system: the acoustic model decodes the input voice into a voice feature sequence, which is compared with the feature sequence of standard Mandarin; both sequences are expressed as feature vectors, and the similarity between the two vectors is calculated;
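Step S3-4-3 only specifies "the similarity between the two feature vectors" without fixing a measure; cosine similarity is one common choice, sketched here with made-up feature vectors.

```python
import numpy as np

def accent_similarity(features_input, features_standard):
    """Cosine similarity between the input voice's feature vector and the
    standard Mandarin feature vector (an illustrative choice of similarity
    measure for step S3-4-3; the patent does not name a specific one)."""
    a = np.asarray(features_input, dtype=float)
    b = np.asarray(features_standard, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Identical vectors score 1.0 (a perfectly standard accent under this measure); orthogonal vectors score 0.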
step S3-5: the continuity quantification unit is the number of continuity abnormal segments in a piece of voice, and the continuity analysis method comprises the following steps:
step S3-5-1: preprocessing input voice;
step S3-5-2: removing the mute sections at the head end and the tail end of each voice by using an energy-based voice endpoint detection method, and marking out a continuity abnormal section in the effective voice;
step S3-5-3: using a speech-based voice endpoint detection method, marking the portions within the continuity-abnormal segments of the effective speech of step S3-5-2 where the speaker is not pronouncing;
step S3-5-4: based on a context judgment algorithm, judging whether the segments marked in step S3-5-3 are normal pauses (punctuation) or breaks within one and the same speech segment; if they belong to the same speech segment, counting the duration of the break;
step S3-6: the interference degree quantization unit is a noise energy value, and the interference degree analysis method comprises the following steps:
step S3-6-1: carrying out short-time Fourier transform on input voice to respectively smooth a time domain and a frequency domain to obtain a short-time local energy spectrum value of the voice with noise;
step S3-6-2: the ratio of the energy spectrum value to the local minimum value is used as a threshold to remove the noise energy in the voice with noise;
step S3-6-3: continuously updating noise energy according to a threshold judgment result in a judgment process until an optimal noise reduction effect is obtained, wherein an energy value when the optimal noise reduction effect is obtained is used as an interference degree;
step S3-7: the unit of language category quantization is the language category (Chinese: 0, English: 1, Chinese-English mixed: 2).
The index value and the recognition effect are shown in fig. 4:
The closer the speech rate, the pitch (fundamental frequency) and the sound-intensity amplitude of the voice data are to a certain intermediate value, the better the speech recognition effect; the higher the accent similarity, the better the recognition effect; the fewer the abnormal continuity segments, the lower the noise energy, the smaller the professional-term proportion, the less the grey-vocabulary content and the fewer the pitch changes, the better the recognition effect; and the recognition effect is best when the voice data is in a single language.
The language category analysis method comprises the following steps:
step S3-7-1: building Chinese and English speech recognizers, wherein each speech recognizer pertinently contains speech characteristics of respective language;
step S3-7-2: extracting the characteristics of the input voice, matching the characteristics with the voice characteristics of various languages, and determining the category of the voice language;
step S3-8: the quantitative unit of the professional term is the proportion of civil aviation professional terms in a voice text, and the analysis method of the proportion of the professional terms comprises the following steps:
step S3-8-1: acquiring correct texts corresponding to each text in a controlled voice database (based on manual labeling/semi-automation);
step S3-8-2: performing text sentence breaking, word segmentation, character discrimination and other processing by using a text analysis algorithm;
step S3-8-3: establishing a control instruction professional term dictionary by referring to the air traffic radio communication term, matching the vocabulary extracted in the step S3-8-2 with the dictionary through a matching algorithm, and counting the number of matched words as the voice professional term content;
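Steps S3-8-1 to S3-8-3 reduce to dictionary matching over a segmented transcript. The sketch below assumes segmentation has already been done; the sample dictionary and transcript are purely illustrative, not drawn from the patent's database.

```python
def term_proportion(text_tokens, term_dictionary):
    """Proportion of civil-aviation professional terms in a transcript.

    Counts the tokens of an already-segmented transcript that appear in a
    control-instruction term dictionary (built, per the patent, from air
    traffic radiotelephony phraseology) and returns their share of the text."""
    matched = sum(1 for token in text_tokens if token in term_dictionary)
    return matched / len(text_tokens) if text_tokens else 0.0

# Illustrative dictionary and transcript (hypothetical examples):
phraseology = {"cleared", "runway", "squawk", "holding", "wilco", "roger"}
tokens = ["cleared", "to", "land", "runway", "two", "seven", "roger"]
```

Here three of the seven tokens ("cleared", "runway", "roger") match the dictionary, so the proportion is 3/7.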
step S3-9: the grey vocabulary content quantization unit is grey vocabulary content, and the grey vocabulary content analysis method comprises the following steps:
step S3-9-1: training an acoustic model by adopting an induced word bank;
step S3-9-2: performing framing, windowing and preprocessing on input voice, and extracting voice characteristics;
step S3-9-3: the acoustic model of step S3-9-1 receives the voice features of step S3-9-2, detects the audio segments of the input voice that contain induced words, establishes a gating mechanism combined with a context discrimination algorithm, discriminates the induced words, and determines whether each audio segment is retained;
step S3-9-4: labeling the audio segments whose detected induced words are meaningless, and counting the number of meaningless audio segments in the whole voice;
step S3-10: the sound change quantization unit is the number of sound changes, and the sound change analysis method comprises the following steps:
step S3-10-1: constructing a complete polyphone dictionary and a merged vocabulary library which is easy to change;
step S3-10-2: acquiring correct texts corresponding to each text in a controlled voice database (based on manual labeling/semi-automation);
step S3-10-3: performing word segmentation, part of speech tagging, character discrimination and other processing by using a text analysis algorithm;
step S3-10-4: matching the controlled voice text with the polyphone dictionary and the merged vocabulary in the step S3-10-1 by adopting a matching algorithm, and counting the polyphone and vocabulary number contained in the text.
The implementation of the content in step S4 includes the following steps:
step S4-1: inputting the number of grades to be divided and a data set $\{x_1, x_2, x_3, \ldots, x_n\}$ obtained by each single-evaluation-index analysis method, where n is the number of data points in the set;
step S4-2: outputting clustering results and all grade ranges;
the method for grouping the data obtained by analyzing the single evaluation index by adopting the clustering method in the step S4 and specifying the range value of each voice grade under the single evaluation index comprises the following steps:
step S4-2-1: since the classes into which the voice data are divided are not specified in advance (these classes are divided on the basis of a single voice evaluation index and are a different concept from the quality grades mentioned in this patent), the optimal number of classes k, i.e. the number of cluster centers, is determined for the acquired data set by the elbow method or the silhouette coefficient method, together with the centroid set $\{c_1, c_2, c_3, \ldots, c_k\}$, where each $c_i$ may or may not be a value within the data set;
step S4-2-2: initializing the centroid (class center) values cj ∈ {c1, c2, c3, …, ck}, calculating the Euclidean distance from each sample point to each class center, and assigning each sample to the closest class to form a clustering result. The Euclidean distance is the true distance between two points in m-dimensional space, i.e. the natural length of the vector, given by:

d(x_i, c_j) = sqrt( Σ_{k=1}^{m} (x_ik − c_jk)² )
wherein d(x_i, c_j) represents the distance from sample point x_i to centroid c_j, x_ik represents the kth attribute of the ith sample, and c_jk represents the kth attribute of the jth centroid, there being m attribute dimensions in total. In the present invention, each evaluation index is analyzed to obtain a one-dimensional data value, so m = 1 and the Euclidean distance reduces to:

d(x_i, c_j) = |x_i − c_j|
step S4-2-3: updating a clustering central point: calculating the mean value of all samples in the cluster as a new cluster center for the cluster result;
step S4-2-4: convergence check: taking the sum of the distances from the samples to their class centers as the objective function; if the iteration converges or a stopping condition is met, output the result; otherwise, increase the number of categories by 1 and return to step S4-2-2 to recompute;
step S4-2-5: the algorithm relies on iterative computation, so the global optimum is difficult to reach directly; a heuristic strategy may be adopted, searching via Nash equilibrium for the optimal solution of the problem.
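Steps S4-2-1 through S4-2-4 can be sketched for a single evaluation index as follows. This is a minimal illustration under the patent's m = 1 assumption (the Euclidean distance reduces to |x_i − c_j|); the data values are toy numbers, and k would normally come from the elbow or silhouette coefficient method of step S4-2-1.

```python
# A minimal 1-D k-means sketch of steps S4-2-2..S4-2-4: each evaluation
# index yields one scalar per voice, so distances are absolute differences.
import numpy as np

def kmeans_1d(data, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.choice(data, size=k, replace=False)    # S4-2-2: init centers
    for _ in range(iters):
        # assign each sample to its nearest class center
        labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
        # S4-2-3: mean of each class becomes the new center (keep old if empty)
        new = np.array([data[labels == j].mean() if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):                    # S4-2-4: converged
            break
        centers = new
    # grade range boundaries can then be read off between sorted centers
    return labels, np.sort(centers)

data = np.array([0.10, 0.12, 0.50, 0.52, 0.88, 0.90])    # toy index values
labels, centers = kmeans_1d(data, k=3)
print(centers)
```

The sorted centers delimit the range of each voice grade under that single index, which is the output required by step S4-2.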
The steps for rating controlled voice quality are as follows; a roadmap of the quality rating technique is shown in fig. 7:
According to the qualification approval rules for civil aircraft pilots, flight instructors and ground instructors (CCAR-61) and MH/T 4014-2003 Radiotelephony Communications for Air Traffic Services, five grades (1-5) of controlled voice quality are innovatively proposed for the civil aviation field: the higher the grade, the better the voice quality, with grade 5 the highest. The judgment criteria of each grade are as follows:
1) Level 1: small proportion of control professional vocabulary; speech rate too fast or too slow; mixed Chinese-English pronunciation; a noticeable Mandarin accent influenced by the speaker's native language or region; vocabulary audio that misleads semantic understanding (polyphones, homophones); heavy interference from transmission channels, ambient noise, and the like;
2) Level 2: low proportion of control professional vocabulary; speech rate too fast or too slow; a small amount of gray vocabulary and mixed Chinese-English pronunciation; a slight Mandarin accent influenced by the speaker's native language or region; occasional vocabulary audio that misleads semantic understanding;
3) Level 3: average proportion of control professional vocabulary; normal speech rate; no gray-vocabulary pronunciation; single-language speech; occasional pauses in the speech signal; low audio interference; no accent;
4) Level 4: large proportion of control professional vocabulary; good speech clarity; normal speech rate; fluent speech; a few instances of vocabulary audio that misleads semantic understanding;
5) Level 5: high proportion of control professional vocabulary; little interference; fluent speech; standard Mandarin pronunciation; no vocabulary audio that misleads semantic understanding.
Step S5-1: the subjective weighting method uses expert experience to adjust and optimize the weight values objectively assigned to the evaluation indexes, making the weights more scientific and reasonable and enabling a quantitative, intuitive display of controlled voice quality. Using the 1-9 scale method, the importance of each pair of indexes at the same level, relative to the same index at the level above, is compared quantitatively to form a judgment matrix X; the eigenvector corresponding to the maximum eigenvalue of X is computed by the maximum eigenvector method, and when X passes the consistency check, this eigenvector is taken as the weights of the indexes;
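The step S5-1 procedure can be sketched as follows. This is a hedged illustration of the standard AHP-style computation the step describes: the 3×3 judgment matrix entries are assumptions for illustration, not values from the patent, and the random-index table and CR < 0.1 threshold are the conventional Saaty values.

```python
# A sketch of step S5-1: a reciprocal 1-9 scale judgment matrix X, weights
# from the eigenvector of the maximum eigenvalue, and a consistency check.
import numpy as np

RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}   # Saaty random-index table

def ahp_weights(X):
    n = X.shape[0]
    vals, vecs = np.linalg.eig(X)
    i = int(np.argmax(vals.real))                   # maximum eigenvalue
    w = np.abs(vecs[:, i].real)
    w = w / w.sum()                                 # normalized eigenvector
    ci = (vals[i].real - n) / (n - 1)               # consistency index
    cr = ci / RI[n] if RI[n] > 0 else 0.0           # consistency ratio
    return w, cr

X = np.array([[1.0, 3.0, 5.0],                      # illustrative pairwise
              [1/3, 1.0, 3.0],                      # importance comparisons
              [1/5, 1/3, 1.0]])                     # (reciprocal matrix)
w, cr = ahp_weights(X)
print(w.round(3), cr < 0.1)    # weights sum to 1; CR < 0.1 means consistent
```

Only when the consistency ratio is below 0.1 would the eigenvector be accepted as the subjective weights, matching the consistency check named in the step.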
step S5-2: the objective weighting method comprises the following steps:
step S5-2-1: converting the indexes into positive (forward) form, i.e. converting the extremely small indexes and intermediate indexes into extremely large indexes:
extremely small → extremely large:

x′ = max(x) − x

intermediate → extremely large:

x′ = 1 − |x − x_best| / max(|x − x_best|)

wherein x_best is the optimal value of the recognition effect (the central value of the speech data obtained by the evaluation index method is taken as the optimal value) and x′ is the forward-oriented value;
step S5-2-2: data standardization, to balance dimensional differences between indexes:

z_ij = (x_ij − min_i x_ij) / (max_i x_ij − min_i x_ij)

wherein x_ij is the value of the ith voice under the jth evaluation index;
step S5-2-3: data normalization, unifying values to the interval 0-1:

p_ij = z_ij / Σ_{i=1}^{n} z_ij

wherein n is the number of evaluation objects;
step S5-2-4: calculating the information entropy of each evaluation index:

e_j = −(1 / ln n) Σ_{i=1}^{n} p_ij ln p_ij

wherein n is the number of evaluation objects, m is the number of evaluation indexes, and j runs from 1 to m;
step S5-2-5: calculating the weight:

β_j = (1 − e_j) / Σ_{j=1}^{m} (1 − e_j)

wherein j runs from 1 to m;
step S5-3: fusion of the subjective and objective weights:

w_j = (α_j · β_j) / Σ_{j=1}^{m} (α_j · β_j)

wherein n is the number of evaluation objects, α_j is the subjective weight, and β_j is the objective weight;
step S5-4: the comprehensive score of each voice:

S_i = Σ_{j=1}^{m} w_j · z_ij

wherein z_ij is the normalized value of the ith evaluation object under the jth evaluation index, and i runs from 1 to n;
step S5-5: determining the score range of each controlled voice quality level:
the comprehensive score of every voice in the standard controlled voice database is calculated by the above evaluation method; the sorted sequence of comprehensive scores over the whole database is divided into 5 grades, and the interval of each grade is the score range of that quality level. The scores lie in the interval 0 to 1 and all evaluation indexes have been forward-oriented, so a larger comprehensive score means better quality, with grade 5 the best.
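The objective weighting, fusion, and scoring of steps S5-2 through S5-4 can be sketched as follows. This follows common entropy-weight-method conventions; since the patent's formula images are not reproduced here, the multiplicative fusion rule and the toy score matrix are assumptions for illustration.

```python
# A sketch of steps S5-2..S5-4: forward-oriented scores are min-max
# standardized, entropy weights are computed per index, fused with
# subjective weights, and a comprehensive score is formed per voice.
import numpy as np

def entropy_weights(Z):
    """Z: n voices x m indexes, forward-oriented and scaled to [0, 1]."""
    P = Z / Z.sum(axis=0)                            # S5-2-3: column proportions
    P = np.where(P > 0, P, 1e-12)                    # guard against log(0)
    e = -(P * np.log(P)).sum(axis=0) / np.log(Z.shape[0])   # S5-2-4: entropy
    return (1 - e) / (1 - e).sum()                   # S5-2-5: weights

def fuse(subjective, objective):
    w = subjective * objective                       # S5-3: assumed multiplicative fusion
    return w / w.sum()

X = np.array([[0.8, 0.2, 0.9],                       # toy scores: 4 voices x 3 indexes
              [0.5, 0.6, 0.4],
              [0.9, 0.9, 0.8],
              [0.1, 0.4, 0.3]])
Z = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # S5-2-2
w = fuse(np.array([0.5, 0.3, 0.2]), entropy_weights(Z))
scores = Z @ w                                       # S5-4: comprehensive scores
print(scores.round(3))                               # larger score = better quality
```

Sorting these comprehensive scores over the whole database and cutting the sequence into 5 intervals would then yield the per-grade score ranges of step S5-5.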
A quantitative evaluation and classification device for controlled voice quality division adopts a Core i7-12700 processor, a Samsung 980 PRO 1 TB solid-state drive as storage, and four NVIDIA P40 GPUs to accelerate the relevant processing steps.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (7)
1. A quantitative evaluation and classification method for controlling voice quality division is characterized by comprising the following steps:
s1, inputting voice data of a standard control voice database marked with correct meanings;
s2, considering the characteristics of civil aviation land-air communication, constructing an evaluation index system for controlling voice quality division;
s3, qualitatively analyzing each evaluation index, including a technical analysis method and an index grading quantization unit;
s4, grouping data obtained by analyzing the single evaluation index by adopting a clustering method, and specifying the range value of each voice grade under the single evaluation index;
and S5, weighting each evaluation index by a weighted fusion algorithm to obtain controlled voice data sets combined with quality grades.
2. The quantitative evaluation and classification method for controlled voice quality division according to claim 1, wherein in step S2 the evaluation index system for controlled voice quality division includes extremely large indexes, intermediate indexes, extremely small indexes, and specified indexes; for an extremely large index, a larger value indicates a better speech recognition effect (including accent); for an intermediate index, a value closer to a certain middle value indicates a better speech recognition effect (including speech speed, tone, and pitch); for an extremely small index, a smaller value indicates a better speech recognition effect (including continuity, interference degree, professional term proportion, gray vocabulary content, and sound change); and a specified index takes a designated value (including language category).
3. The method for quantitatively evaluating and classifying controlled speech quality division according to claim 1, wherein the step S4 of grouping the data analyzed by the single evaluation index by using the clustering method to specify the range value of each speech level under the single evaluation index comprises the following steps:
step S4-1: inputting the number of grades to be divided and a data set obtained by each single evaluation index analysis method;
step S4-2: and outputting the clustering result and each grade range.
4. The quantitative evaluation and classification method for controlling speech quality classification according to claim 3, wherein the method for outputting the clustering result in step S4-2 comprises the following steps:
step S4-2-1: determining the optimal category number by adopting an elbow method or a contour coefficient method;
step S4-2-2: initializing the class center values, calculating the Euclidean distance from each sample point to each class center, and assigning each sample to the closest class to form a clustering result, wherein the Euclidean distance is the true distance between two points in m-dimensional space, i.e. the natural length of the vector, given by:

d(x_i, c_j) = sqrt( Σ_{k=1}^{m} (x_ik − c_jk)² )

wherein d(x_i, c_j) represents the distance from sample point x_i to centroid c_j, x_ik represents the kth attribute of the ith sample, and c_jk represents the kth attribute of the jth centroid, there being m attribute dimensions in total;
step S4-2-3: calculating the mean value of all samples in each category of the clustering result as a new clustering center;
step S4-2-4: taking the sum of the distances from the sample to the class center as a target function, and outputting if iteration converges or meets a stopping condition; otherwise, the number of the categories is +1, and the step S4-2-2 is returned to for repeated calculation;
step S4-2-5: the algorithm relies on iterative computation, so the global optimum is difficult to reach directly; a heuristic strategy is adopted, using Nash equilibrium to reach the optimal solution of the problem.
5. The method as claimed in claim 1, wherein the controlled speech quality is classified into grades 1-5, and the higher the grade, the better the speech quality.
6. The method as claimed in claim 2, wherein the weighted fusion algorithm used in step S5 includes a subjective weighting method and an objective weighting method, comprising the following steps:
step S5-1: the subjective weighting method uses expert experience to adjust and optimize the weight values objectively assigned to the evaluation indexes: using the 1-9 scale method, the importance of each pair of indexes at the same level, relative to the same index at the level above, is compared quantitatively to form a judgment matrix X; the eigenvector corresponding to the maximum eigenvalue of X is computed by the maximum eigenvector method, and when X passes the consistency check, this eigenvector is taken as the weights of the indexes;
step S5-2: the objective weighting method comprises the following steps:
step S5-2-1: converting the indexes into positive (forward) form, i.e. converting the extremely small indexes and intermediate indexes into extremely large indexes:
extremely small → extremely large:

x′ = max(x) − x

intermediate → extremely large:

x′ = 1 − |x − x_best| / max(|x − x_best|)

wherein x_best is the optimal value of the recognition effect (the central value of the speech data obtained by the evaluation index method is taken as the optimal value) and x′ is the forward-oriented value;
step S5-2-2: data standardization, to balance dimensional differences between indexes:

z_ij = (x_ij − min_i x_ij) / (max_i x_ij − min_i x_ij)

wherein x_ij is the value of the ith voice under the jth evaluation index;
step S5-2-3: data normalization, unifying values to the interval 0-1:

p_ij = z_ij / Σ_{i=1}^{n} z_ij

wherein n is the number of evaluation objects;
step S5-2-4: calculating the information entropy of each evaluation index:

e_j = −(1 / ln n) Σ_{i=1}^{n} p_ij ln p_ij

wherein n is the number of evaluation objects, m is the number of evaluation indexes, and j runs from 1 to m;
step S5-2-5: calculating the weight:

β_j = (1 − e_j) / Σ_{j=1}^{m} (1 − e_j)

wherein j runs from 1 to m;
step S5-3: fusion of the subjective and objective weights:

w_j = (α_j · β_j) / Σ_{j=1}^{m} (α_j · β_j)

wherein n is the number of evaluation objects, α_j is the subjective weight, and β_j is the objective weight;
step S5-4: the comprehensive score of each voice:

S_i = Σ_{j=1}^{m} w_j · z_ij

wherein z_ij is the normalized value of the ith evaluation object under the jth evaluation index, and i runs from 1 to n;
step S5-5: determining the score range of each controlled voice quality level:
the comprehensive score of every voice in the standard controlled voice database is calculated by the above evaluation method; the sorted sequence of comprehensive scores over the whole database is divided into 5 grades, and the interval of each grade is the score range of that quality level. The scores lie in the interval 0 to 1 and all evaluation indexes have been forward-oriented, so a larger comprehensive score means better quality, with grade 5 the best.
7. A device for quantitative evaluation and classification of controlled voice quality division, characterized by comprising at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the classification method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211469949.7A CN115547299B (en) | 2022-11-22 | 2022-11-22 | Quantitative evaluation and classification method and device for quality division of control voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115547299A true CN115547299A (en) | 2022-12-30 |
CN115547299B CN115547299B (en) | 2023-08-01 |
Family
ID=84721576
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211469949.7A Active CN115547299B (en) | 2022-11-22 | 2022-11-22 | Quantitative evaluation and classification method and device for quality division of control voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115547299B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116092482A (en) * | 2023-04-12 | 2023-05-09 | 中国民用航空飞行学院 | Real-time control voice quality metering method and system based on self-attention |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005164870A (en) * | 2003-12-02 | 2005-06-23 | Nippon Telegr & Teleph Corp <Ntt> | Objective evaluation apparatus for speech quality taking band limitation into consideration |
CN104835354A (en) * | 2015-05-20 | 2015-08-12 | 青岛民航空管实业发展有限公司 | Control load management system and controller workload evaluation method |
CN105679309A (en) * | 2014-11-21 | 2016-06-15 | 科大讯飞股份有限公司 | Method and device for optimizing speech recognition system |
CN107564534A (en) * | 2017-08-21 | 2018-01-09 | 腾讯音乐娱乐(深圳)有限公司 | Audio quality authentication method and device |
CN108877839A (en) * | 2018-08-02 | 2018-11-23 | 南京华苏科技有限公司 | The method and system of perceptual evaluation of speech quality based on voice semantics recognition technology |
CN110490428A (en) * | 2019-07-26 | 2019-11-22 | 合肥讯飞数码科技有限公司 | Job of air traffic control method for evaluating quality and relevant apparatus |
CN112435512A (en) * | 2020-11-12 | 2021-03-02 | 郑州大学 | Voice behavior assessment and evaluation method for rail transit simulation training |
CN112466335A (en) * | 2020-11-04 | 2021-03-09 | 吉林体育学院 | English pronunciation quality evaluation method based on accent prominence |
CN112967711A (en) * | 2021-02-02 | 2021-06-15 | 早道(大连)教育科技有限公司 | Spoken language pronunciation evaluation method, spoken language pronunciation evaluation system and storage medium for small languages |
CN113779798A (en) * | 2021-09-14 | 2021-12-10 | 国网江苏省电力有限公司电力科学研究院 | Electric energy quality data processing method and device based on intuitive fuzzy combination empowerment |
CN113792982A (en) * | 2021-08-19 | 2021-12-14 | 北京邮电大学 | Scientific and technological service quality assessment method and device based on combined weighting and fuzzy gray clustering |
EP4086903A1 (en) * | 2021-05-04 | 2022-11-09 | GN Audio A/S | System with post-conversation evaluation, electronic device, and related methods |
Non-Patent Citations (2)
Title |
---|
柳震 (Liu Zhen): "基于管制员语音反应时的疲劳风险定量评价模型" [Quantitative evaluation model of fatigue risk based on controller speech reaction time], 《科技与创新》 (Science and Technology & Innovation), no. 08, 23 March 2018 (2018-03-23) *
Also Published As
Publication number | Publication date |
---|---|
CN115547299B (en) | 2023-08-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP3162994B2 (en) | Method for recognizing speech words and system for recognizing speech words | |
US20190266998A1 (en) | Speech recognition method and device, computer device and storage medium | |
TWI395201B (en) | Method and system for identifying emotional voices | |
CN111640418A (en) | Prosodic phrase identification method and device and electronic equipment | |
Swain et al. | Study of feature combination using HMM and SVM for multilingual Odiya speech emotion recognition | |
JP2007171944A (en) | Method and apparatus for automatic text-independent grading of pronunciation for language instruction | |
CN113539240A (en) | Animation generation method and device, electronic equipment and storage medium | |
CN112349289A (en) | Voice recognition method, device, equipment and storage medium | |
CN112885336A (en) | Training and recognition method and device of voice recognition system, and electronic equipment | |
CN112397054A (en) | Power dispatching voice recognition method | |
CN115547299B (en) | Quantitative evaluation and classification method and device for quality division of control voice | |
CN117711444B (en) | Interaction method, device, equipment and storage medium based on talent expression | |
Gupta et al. | A study on speech recognition system: a literature review | |
Wang | Detecting pronunciation errors in spoken English tests based on multifeature fusion algorithm | |
Mathur et al. | A study of machine learning algorithms in speech recognition and language identification system | |
Barczewska et al. | Detection of disfluencies in speech signal | |
Rao et al. | Language identification—a brief review | |
CN116564281B (en) | Emotion recognition method and device based on AI | |
Hoseini | Persian speech emotion recognition approach based on multilayer perceptron | |
Hlaing et al. | Word Representations for Neural Network Based Myanmar Text-to-Speech S. | |
JP5066668B2 (en) | Speech recognition apparatus and program | |
Marie-Sainte et al. | A new system for Arabic recitation using speech recognition and Jaro Winkler algorithm | |
Mengistu et al. | Text independent amharic language dialect recognition using neuro-fuzzy gaussian membership function | |
Mary et al. | Modeling and fusion of prosody for speaker, language, emotion, and speech recognition | |
Wang et al. | Automatic tonal and non-tonal language classification and language identification using prosodic information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||