CN116564351A - Voice dialogue quality evaluation method and system and portable electronic equipment - Google Patents
- Publication number
- CN116564351A (application number CN202310345168.5A)
- Authority
- CN
- China
- Prior art keywords
- voice
- sub
- segment
- duration
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention discloses a voice dialogue quality evaluation method and system and a portable electronic device, belonging to the technical field of voice quality evaluation. The method comprises the following steps. S100: analyzing the voice dialogue to be evaluated to obtain a first interaction attribute; S200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups; S300: obtaining a second duration attribute of each voice sub-segment in each voice segment group; S400: determining at least one candidate sub-segment from each voice segment group; S500: training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model; S600: inputting the voice sub-segments other than the candidate sub-segments in each voice segment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice segment group. The invention can accurately realize no-reference quality score output for multi-person voice dialogues.
Description
Technical Field
The present invention relates to the field of speech quality evaluation technologies, and in particular, to a speech dialogue quality evaluation method, system, and portable electronic device.
Background
Sound is one of the main ways humans know and perceive the world. With the full popularity of networks, network audio services have grown rapidly, and audio of poor quality requires enhancement processing to improve it. In the audio field, most currently popular audio evaluation platforms rely on only one or two parameters of the audio as the criterion of audio quality. This is unreasonable in practice: audio quality is related to the human auditory system, is influenced by many factors, and cannot be measured by one or two simple parameters alone.
As comprehensive quality evaluation systems have matured, two methods of evaluating audio quality have evolved: subjective evaluation and objective evaluation. In subjective evaluation, organized testers listen to a series of audio sequences according to the international telecommunication union telecommunication standardization sector (ITU-T) P.800 standard; the testers' ratings of the speech quality are averaged, and the final audio quality result is expressed as a mean opinion score (Mean Opinion Score, "MOS"), where a higher MOS value indicates better audio quality. However, subjective evaluation suffers from long experimental periods and high economic cost.
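The MOS described above is simply the arithmetic mean of the listeners' opinion-scale ratings; a minimal sketch (the function name and the range check are illustrative, not part of the patent):

```python
def mean_opinion_score(ratings):
    """Average a list of 1-5 listener ratings into a MOS value (ITU-T P.800 scale)."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 opinion scale")
    return sum(ratings) / len(ratings)
```

For example, four listeners rating a clip 4, 5, 4 and 3 yield a MOS of 4.0.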
Objective assessment methods are widely used to evaluate audio quality and fall into reference and no-reference audio quality assessment models. A reference method compares the processed speech with lossless speech: the two signals are first aligned and their deviation is found; after each small segment of processed speech is aligned with the corresponding lossless segment, each pair is fed independently into an auditory model to examine the loss on each frequency band and the generation of additional frequency components, and to judge whether the increase or decrease of frequency components is noticeable enough to human hearing; finally, the per-segment speech impairments are smoothed and weighted-averaged over the whole time domain and mapped to a single speech quality score. ITU-T has historically introduced three well-known models: PSQM (P.861), PESQ (P.862) and POLQA (P.863), of which POLQA is currently the most widely accepted. The PSQM and PESQ models are only applicable to audio below 16 kHz. The POLQA model can be applied to 48 kHz audio signals, but its algorithm remains protected, is not public, and is costly to use. Moreover, a reference audio quality assessment model requires reference audio to be provided and cannot assess audio quality in scenarios where no reference audio is available.
No-reference audio quality evaluation models are mostly implemented with deep learning. A representative method is MOSNET, which adopts a CNN plus BLSTM network architecture; its training data come from The Voice Conversion Challenge (VCC) 2018, and its evaluation indexes are among the best in the industry. The temporal convolutional network (TCN) has succeeded in machine translation, traffic prediction, sound event detection and other fields, and has the potential to surpass LSTM networks. In addition, the VCC2018 data set is limited in size, and how to fully mine the internal information of small-scale data so as to optimize audio evaluation performance is a problem the industry still needs to solve.
In practical application, the deep learning models used by current no-reference speech quality evaluation techniques adapt poorly to voices with different timbres: the deep neural networks used in the prior art apply, by default, the same evaluation model and the same processing to speech from different sources. They therefore do not consider the adaptability of voice dialogue quality evaluation in multi-voice environments, especially multi-target-person multi-voice dialogue environments, nor the problem of model update training.
Disclosure of Invention
The invention aims to provide a voice dialogue quality evaluation method, a voice dialogue quality evaluation system and portable electronic equipment, which can accurately realize the non-reference quality score output of a multi-character voice dialogue, so that the result is more targeted and adaptive.
In order to achieve the above object, the present invention provides the following solutions:
a voice conversation quality evaluation method, comprising the following steps:
s100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
s200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and the plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
s300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
s400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
s500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
s600: inputting the voice sub-segments other than the candidate sub-segments in each voice segment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice segment group;
the voice quality evaluation model is a convolution-time sequence convolution network model.
Further, the first interaction attribute in the step S100 is used to characterize a first number of different target characters involved in the speech dialogue to be evaluated;
the step S200 specifically includes: and dividing the voice dialogue to be evaluated into a first number of voice fragment groups based on the first interaction attribute.
Further, the second duration attribute in the step S300 is used to characterize the duration of each voice sub-segment.
Further, the second duration attributes of the candidate sub-segments in the step S400 are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
Further, the convolution-time sequence convolution network model comprises a preamble convolution module and a time sequence convolution module;
the preamble convolution module performs feature extraction on the input voice sub-segment;
the time sequence convolution module comprises n dilated convolution modules, wherein the dilation factor of the n-th dilated convolution is 2^(n-1).
The invention also provides a voice dialogue quality evaluation system, which comprises:
a voice analysis unit: comprising a first interaction analysis subunit and a second duration analysis subunit; the first interaction analysis subunit is used for analyzing the voice dialogue to be evaluated, obtaining a first interaction attribute of the voice dialogue to be evaluated, and sending the first interaction attribute to the voice grouping unit; the second duration analysis subunit is configured to analyze each voice sub-segment included in each voice segment group, obtain a second duration attribute of each voice sub-segment in each voice segment group, and send the second duration attribute to the voice candidate unit;
voice grouping unit: dividing the voice dialogue to be evaluated into a first number of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and a plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
a speech candidate unit: the method comprises the steps of determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein second duration attributes of the plurality of candidate sub-segments are identical;
model training unit: the method comprises the steps of training and updating at least one existing voice quality evaluation model by taking the candidate sub-segments as training samples to obtain an updated voice quality evaluation model;
and an evaluation output unit: the voice sub-segments except the candidate sub-segments in each voice segment group are input into the updated voice quality evaluation model, and the voice quality evaluation score of the target person corresponding to each voice segment group is output;
the voice receiving module, the first interaction analysis subunit, the voice grouping module, the voice candidate module, the model training module and the evaluation output module are sequentially connected; the second duration analysis subunit is connected with the voice grouping module and the voice candidate module;
the speech quality evaluation model is a reference-free speech quality evaluation model.
Further, the first interaction attribute is used to characterize a first number of different target persons involved in the speech dialogue to be evaluated.
Further, the second duration attribute is used to characterize a duration of each voice sub-segment.
Further, the second duration attributes of the plurality of candidate sub-segments are the same, including one of:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
The invention also provides a portable electronic device, comprising a voice receiving unit, a memory, a processor and a display unit, wherein the memory stores computer-executable program instructions and the voice receiving unit is used for receiving voice dialogues; when executed by the processor, the program instructions implement the voice dialogue quality evaluation method and display the evaluation result on the display unit.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects. The voice dialogue quality evaluation method analyzes the voice dialogue to be evaluated to obtain a first interaction attribute; divides the dialogue into a plurality of voice segment groups and obtains a second duration attribute of each voice sub-segment in each group; determines at least one candidate sub-segment from each voice segment group; trains and updates at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model; and finally inputs the voice sub-segments other than the candidate sub-segments in each group into the updated model to obtain the voice quality evaluation score of the target person corresponding to each group. The no-reference quality score output of a multi-person voice dialogue can thus be realized accurately, making the results more targeted and adaptive. In addition, the invention utilizes a convolution-temporal convolution (CNN-TCN) network architecture, adopts label distribution learning (LDL) to improve network audio evaluation performance, pre-processes and segments the speech according to the scene characteristics of multi-person voice dialogue, performs targeted update training of the no-reference speech quality evaluation model on the grouped sample data, and outputs the evaluation. The results are therefore more targeted, suitable and accurate, and adapted to multi-voice environments, especially multi-target-person multi-voice dialogue environments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a voice dialogue quality evaluation method according to an embodiment of the invention;
fig. 2 is a block diagram of a plurality of voice segment groups of a voice conversation quality evaluation method in an embodiment of the present invention;
FIG. 3 (a) is a schematic diagram of a no-reference speech dialog quality assessment model of the present invention;
FIG. 3 (b) is a schematic diagram of the structure of a time-series convolution module in the model for quality assessment of a reference-less speech dialogue of the present invention;
FIG. 3 (c) is a schematic diagram of an expanded convolution module of the no-reference speech dialogue quality assessment model of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to a voice conversation quality evaluation method of the present invention;
FIG. 5 is a schematic diagram of a voice conversation quality evaluation system in accordance with one embodiment of the present invention;
FIG. 6 is a schematic diagram of a speech dialog quality assessment system in accordance with a preferred embodiment of the present invention;
FIG. 7 (a) is an effect diagram corresponding to sentence level indicators according to the present invention;
fig. 7 (b) is an effect diagram corresponding to the system level index according to the embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a voice dialogue quality evaluation method, a voice dialogue quality evaluation system and portable electronic equipment, which can accurately realize the non-reference quality score output of a multi-character voice dialogue, so that the result is more targeted and adaptive.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in fig. 1, a method for evaluating quality of a voice conversation according to an embodiment of the present invention includes:
s100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
specifically, the first interaction attribute in step S100 is used to characterize a first number of different target characters involved in the speech dialogue to be evaluated;
s200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and the plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
the step S200 specifically includes:
and dividing the voice dialogue to be evaluated into a first number of voice fragment groups based on the first interaction attribute.
S300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
the second duration attribute in step S300 is used to characterize the duration of each voice sub-segment;
s400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
the second duration attributes of the candidate sub-segments in the step S400 are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
S500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
the speech quality evaluation model in the step S500 is a no-reference speech quality evaluation model, and the no-reference speech quality evaluation model is a convolution-time sequence convolution network model.
S600: inputting the voice sub-fragments except the candidate sub-fragments in each voice fragment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice fragment group.
In one embodiment of the present invention, fig. 2 shows a grouping schematic diagram of a plurality of voice segment groups. The voice conversation quality evaluation method of this embodiment specifically includes:
s100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
in this embodiment, in the leftmost part of fig. 2, the speech dialogue to be evaluated includes speech A1, B1, A2, C1, C2, B2, C3, A3; the voice dialog is an english voice dialog, such as a spoken english dialog; the segment of voice dialog is from at least three different people, assumed to be person a, person B and person C, wherein the voices from person a are A1, A2, A3; the voices from the person B are B1 and B2; the voices from the person C are C1, C2 and C3; in the scenario of FIG. 2, character conversations are interleaved;
in this embodiment, when the step S100 is performed, the first interaction attribute is used to characterize a first number of different target characters related to the speech dialogue to be evaluated.
S200: dividing the voice dialogue to be evaluated into a plurality of voice fragment groups based on the first interaction attribute;
in this embodiment, the voice dialogue to be evaluated is divided into 3 voice segment groups based on the first interaction attribute, which is still represented by A, B, C;
specifically, each voice segment group contains at least one voice sub-segment, and a plurality of voice sub-segments located in the same voice segment group belong to the same target person; for example, the speech segment group a contains speech sub-segments A1, A2 and A3, the speech segment group B contains speech sub-segments B1, B2, and the speech segment group C contains speech sub-segments C1, C2 and C3.
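Step S200 on the Fig. 2 data can be sketched as a simple grouping by speaker label (the speaker labels themselves would come from an upstream speaker separation step, which the patent does not detail; the function and variable names are illustrative):

```python
from collections import defaultdict

def group_by_speaker(dialogue):
    """S200: collect the sub-segments of each target person into one
    speech segment group, preserving dialogue order within each group."""
    groups = defaultdict(list)
    for speaker, segment_id in dialogue:
        groups[speaker].append(segment_id)
    return dict(groups)

# The interleaved dialogue of Fig. 2:
dialogue = [("A", "A1"), ("B", "B1"), ("A", "A2"), ("C", "C1"),
            ("C", "C2"), ("B", "B2"), ("C", "C3"), ("A", "A3")]
groups = group_by_speaker(dialogue)
# First interaction attribute = number of distinct target persons = len(groups) = 3
```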
S300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
the second duration attribute in step S300 is used to characterize the duration of each voice sub-segment;
in the present embodiment, it is assumed that the duration of the voice sub-segments A1, A2, and A3 included in the voice segment group a are 18s, 30s, and 25s, respectively (s represents seconds); the duration of the speech sub-segments B1, B2 included in the speech segment group B is 26s, 18s and 30s, respectively, and the duration of the speech sub-segments C1, C2 and C3 included in the speech segment group C is 25s, 31s and 18s, respectively.
S400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
in this embodiment, the criteria for determining at least one candidate sub-segment from each speech segment group specifically includes one or a combination of the following:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
If criterion (1) is executed, A1, B2 and C3 may be selected as the plurality of candidate sub-segments, since their second duration attributes are the same (all 18s);
if criterion (2) is executed with the preset upper limit value set to 1s in advance, A3, B1 and C1 may be selected as the plurality of candidate sub-segments, since the absolute values of their pairwise duration differences do not exceed the preset upper limit value;
of course, other options are possible, and the above is merely an illustrative example.
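The candidate selection of S400 on the durations above can be sketched as a brute-force search over one pick per group (an illustrative helper, not the patent's prescribed algorithm; criterion (1) corresponds to the special case max_diff = 0):

```python
from itertools import product

def pick_candidates(groups, max_diff=0.0):
    """S400: choose one (segment_id, duration) pair per group so that the
    chosen durations differ by at most max_diff seconds."""
    ids = sorted(groups)
    for combo in product(*(groups[g] for g in ids)):
        durations = [dur for _, dur in combo]
        if max(durations) - min(durations) <= max_diff:
            return dict(zip(ids, combo))
    return None  # no admissible combination exists

# Durations from the embodiment above:
groups = {"A": [("A1", 18), ("A2", 30), ("A3", 25)],
          "B": [("B1", 26), ("B2", 18)],
          "C": [("C1", 25), ("C2", 31), ("C3", 18)]}
equal = pick_candidates(groups, max_diff=0.0)  # criterion (1): A1, B2, C3
```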
S500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
specifically, the existing at least one speech quality assessment model is a no-reference speech quality assessment model.
Specifically, as still another improved embodiment of the present invention, the reference-free speech quality assessment model is a convolution-time series convolution network model.
Fig. 3 (a) - (c) are schematic diagrams illustrating principles and components of a model for quality evaluation of a non-reference voice conversation according to an embodiment of the present invention.
The embodiment provides a new deep learning-based reference-free audio quality evaluation method, which utilizes a convolution-time sequence convolution (CNN-TCN) network architecture and adopts a Label Distribution Learning (LDL) method to improve network audio evaluation performance.
The input signal, produced by the corresponding steps (S100-S400) of the method shown in fig. 1, first passes through the convolution module; its output is fed to the time sequence convolution module, whose output in turn feeds the fully connected module. The output of the fully connected module comprises two branches: the first branch a outputs frame-level MOS values, while the second branch b (mapped through global average pooling) performs sentence-level label distribution processing and outputs sentence-level MOS values, as shown in fig. 3 (a).
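The tensor shapes through this pipeline, as described here and in sections 2-4 below, can be sanity-checked with a small numpy walkthrough (the scores are random placeholders; only the shapes reflect the described architecture, and B and N are example values):

```python
import numpy as np

B, N = 2, 100                                 # batch size, frame count (assumed)
signal_1 = np.zeros((B, N, 257, 1))           # STFT input after reshape
conv_out = np.zeros((B, N, 4, 128))           # after the 12-layer conv module
signal_2 = conv_out.reshape(B, N, 512)        # input to the TCN module
frame_mos = np.random.rand(B, N, 1) * 4 + 1   # branch a: frame-level MOS in [1, 5)
sentence_mos = frame_mos.mean(axis=1)         # branch b: global average pooling
assert sentence_mos.shape == (B, 1)
```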
Next, the principle of retraining and updating the no-reference voice dialogue quality evaluation model used in the present invention will be described.
1. Introduction to data set
VCC2018 contains audio with MOS labels generated by various speech synthesis systems; each audio clip was given a MOS rating (1-5 points) by four raters. The training set contains 13,580 clips, the validation set 3,000 and the test set 4,000. Because the training set is small, this embodiment adds label distribution learning to improve model performance.
The process of retraining and updating an existing model is known in the art and is not elaborated in this embodiment.
2. Signal input
The original audio in VCC2018 is transformed by STFT to obtain the frequency-domain input signal of the network; optionally, a Hamming window is applied, with window length 512, window shift 256 and 512 frequency points. The input signal has dimensions [B, N, F], where B is the batch size, N the number of frames and F the number of one-sided frequency points; when the number of frequency points is 512, F is 257. The input is reshaped to obtain input signal 1 with dimensions [B, N, 257, 1].
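The framing and reshaping described above can be sketched as follows. The use of the magnitude spectrum and the 16 kHz sample rate are assumptions for illustration; the text does not state either.

```python
import numpy as np

def stft_input(audio, n_fft=512, win_shift=256):
    """Frame the waveform with a Hamming window (length 512, shift 256), take
    the one-sided spectrum (257 bins), and reshape to [B, N, 257, 1]."""
    window = np.hamming(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // win_shift
    frames = np.stack([audio[i * win_shift: i * win_shift + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))   # [N, 257]; magnitude is assumed
    return spec[np.newaxis, :, :, np.newaxis]    # [B=1, N, 257, 1]

x = stft_input(np.random.randn(16000))           # one second at an assumed 16 kHz
```

For a 16 000-sample input this yields 61 frames, i.e. a tensor of shape [1, 61, 257, 1].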
3. Convolution module
The input signal first undergoes feature extraction in a convolution module. The convolution module is the same as the one in MOSNET and has 12 layers in total; the parameters of each layer are shown in Table 1.
TABLE 1 convolution module layer parameters
4. Sequential convolution module
After feature extraction by the convolution module, the features have dimensions [B, N, 4, 128]; they are reshaped to obtain input signal 2 of the temporal convolution module, with dimensions [B, N, 512].
The temporal convolution module is composed of n dilated convolution modules; preferably n is 4. The dilation factor of the n-th dilated convolution module is 2^(n-1), where n indexes the dilated convolution modules, as shown in FIG. 3 (b).
Taking the first dilated convolution module as an example: the module has a residual structure. Input signal 2 passes in sequence through a one-dimensional dilated convolution, channel normalization, a ReLU activation function and a Dropout layer, and this structure is repeated once; the one-dimensional dilated convolution may be set to 128 output channels, kernel size 3, stride 1, dilation factor 2^(n-1), 'same' padding and a Dropout rate of 0.3. Input signal 2 also passes through a 1×1 one-dimensional convolution, preferably with 128 output channels; the two resulting signals are added to obtain input signal 3 of the fully connected module, with dimensions [B, N, 128], as shown in fig. 3 (c).
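The dilation schedule above can be illustrated numerically. The receptive-field formula assumes two dilated convolutions per residual block (the structure "repeated once"), which is an interpretation of the text rather than a stated value.

```python
def dilation_schedule(n=4):
    """Dilation factor of the i-th dilated convolution module: 2^(i-1)."""
    return [2 ** (i - 1) for i in range(1, n + 1)]

def receptive_field(n=4, kernel=3, convs_per_block=2):
    """Frames of context seen by the TCN output: each dilated convolution of
    kernel size k and dilation d adds (k-1)*d frames on top of the centre frame."""
    return 1 + sum((kernel - 1) * d * convs_per_block for d in dilation_schedule(n))
```

With the preferred n = 4 the dilation factors are 1, 2, 4, 8, giving a receptive field of 61 frames under these assumptions.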
5. Network output processing
Input signal 3 passes through a fully connected layer, preferably with 128 output channels, a ReLU activation function and a Dropout rate of 0.3, to obtain fully connected module output 4 with dimensions [B, N, 128]. Signal 4 passes through a fully connected layer with 1 output channel to obtain signal 5, the frame-level MOS values, with dimensions [B, N, 1]. Signal 4 also passes through a fully connected layer with 101 output channels and a softmax activation function to obtain fully connected output 6 with dimensions [B, N, 101]; global pooling of signal 6 yields the sentence-level label distribution signal 7, with dimensions [B, 1, 101], and signal 7 is mapped to obtain signal 8, the sentence-level MOS value, with dimensions [B, 1]. When B is 1, the mapping function is:
where y represents signal 8 and x represents signal 7. The result of the above formula lies in the range 0-5.
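The mapping formula itself is not reproduced in this text. One plausible reconstruction, consistent with the 101-bin distribution and the stated 0-5 range, is the expected value over bin centres with bin k encoding MOS k/20; this is an assumption, not the patent's exact formula.

```python
import numpy as np

def map_mos(label_dist):
    """Map a sentence-level label distribution over 101 bins (signal 7) to a
    MOS value in [0, 5] (signal 8), assuming bin k encodes MOS k/20."""
    k = np.arange(101)
    return float(np.sum(label_dist * k / 20.0))
```

Under this reconstruction a uniform distribution maps to 2.5 and a one-hot distribution on the last bin maps to 5.0, covering the stated range.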
6. Loss function
The average avg_mos of each audio clip's MOS ratings in VCC2018 serves as the label for signal 5, the frame-level MOS values (frame_mos), and for signal 8, the sentence-level MOS value (utterance_mos). The variance var_mos of each audio clip's MOS ratings is also computed, and the label distribution (a Gaussian distribution) is calculated from the following equation,
where lk is in [0:100].
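The label-distribution equation is likewise not reproduced in this text. A normalised Gaussian over the 101 bin centres lk/20, parameterised by avg_mos and var_mos, is one assumed reconstruction:

```python
import numpy as np

def label_distribution(avg_mos, var_mos, n_bins=101):
    """Gaussian label distribution over bins lk = 0..100, with bin k taken to
    encode MOS k/20 (an assumption consistent with the 0-5 mapping above)."""
    centres = np.arange(n_bins) / 20.0
    p = np.exp(-((centres - avg_mos) ** 2) / (2.0 * var_mos))
    return p / p.sum()                 # normalise so the bins sum to 1
```

For avg_mos = 3.0 the distribution peaks at bin 60 (= 3.0 × 20), and a smaller var_mos yields a sharper peak.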
All or part of the steps of the various embodiments described above may also be implemented as computer program instructions executed by a portable electronic device.
Referring to fig. 4, the present invention may also be implemented as a portable electronic device. The electronic device includes a voice receiving unit, a memory, a processor and a display unit; the memory stores computer-executable program instructions, and the receiving unit is configured to receive an English voice dialogue;
the executable program instructions are executed by the processor to implement all or part of the steps of the voice dialog quality assessment method described in fig. 1 and display the assessment results on the display unit.
Fig. 5-6 are schematic structural diagrams of a voice conversation quality evaluation system according to various embodiments of the present invention.
As shown in fig. 5, the present embodiment provides a voice dialogue quality evaluation system, which includes a voice receiving module, a voice analyzing module, a voice grouping module, a voice candidate module, a model training module, and an evaluation output module, for implementing the method described in fig. 1.
As shown in fig. 6, in this embodiment, the voice parsing module includes a first interaction parsing subunit and a second duration parsing subunit;
in this embodiment, the first interaction analysis subunit is configured to analyze a voice dialogue to be evaluated, obtain a first interaction attribute of the voice dialogue to be evaluated, and send the first interaction attribute to a voice grouping module; the second duration analysis subunit is configured to analyze each voice sub-segment included in each voice segment group, obtain a second duration attribute of each voice sub-segment in each voice segment group, and send the second duration attribute to a voice candidate module;
in this embodiment, the voice receiving module is configured to receive a voice dialogue to be evaluated;
in this embodiment, the voice grouping module is configured to segment the voice dialog to be evaluated into a first number of voice segment groups based on the first interaction attribute, where each voice segment group includes at least one voice sub-segment, and a plurality of voice sub-segments located in the same voice segment group belong to the same target person;
in this embodiment, the voice candidate module is configured to determine at least one candidate sub-segment from each voice segment group, so as to obtain a plurality of candidate sub-segments, where second duration attributes of the plurality of candidate sub-segments are the same;
in this embodiment, the model training module is configured to train and update an existing at least one speech quality evaluation model with the plurality of candidate sub-segments as training samples, to obtain an updated speech quality evaluation model;
in this embodiment, the evaluation output module is configured to input the speech sub-segments except the candidate sub-segments in each speech segment group into the updated speech quality evaluation model, and output a speech quality evaluation score of the target person corresponding to each speech segment group;
in this embodiment, the voice receiving module, the first interaction analysis subunit, the voice grouping module, the voice candidate module, the model training module, and the evaluation output module are sequentially connected; the second duration analysis subunit is connected with the voice grouping module and the voice candidate module;
in this embodiment, the speech quality evaluation model is a no-reference speech quality evaluation model, and the no-reference speech quality evaluation model is a convolution-time sequence convolution network model.
In this embodiment, the first interaction attribute is used to characterize a first number of different target persons involved in the speech dialogue to be evaluated.
In this embodiment, the second duration attribute is used to characterize the duration of each voice sub-segment.
In this embodiment, the second duration attributes of the candidate sub-segments are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
As a further preference, the temporal convolution module is composed of n dilated convolution modules, the dilation factor of the n-th dilated convolution module being 2^(n-1), where n indexes the dilated convolution modules.
Specifically, to obtain a more beneficial training model and training parameters, it has been verified that the value of n is related to the first number K of different target persons involved in the voice dialogue to be evaluated and to the numbers {num1, num2, ……, numK} of voice sub-segments contained in the K voice segment groups, where numi represents the number of voice sub-segments contained in the i-th voice segment group (i = 1, 2, 3, ……, K).
Specifically, the number n of the expansion convolution modules is determined as follows:
if K is less than or equal to 4, n=4;
if K>4, then
Wherein, min { } represents taking a smaller value,representing taking a larger value of { num1, num2, … … num K };representing an upward rounding.
Fig. 7 shows the effect diagrams corresponding to the sentence-level indices and to the system-level indices for the technical scheme of the present invention.
In this embodiment, the comparison indices selected are the linear correlation coefficient (LCC), the Spearman rank correlation coefficient (SRCC) and the mean square error (MSE).
As can be seen from the comparison of this example's method with MOSNET in Table 2, every index is superior to MOSNET's.
TABLE 2 comparison of the methods of the present example with MOSNET indicators
Index | Present invention | MOSNET |
---|---|---|
LCC | (0.6684, 0.9643) | (0.642, 0.957) |
SRCC | (0.6342, 0.9282) | (0.589, 0.888) |
MSE | (0.4642, 0.0434) | (0.538, 0.084) |
Note that: (A, B) A represents sentence level index, and B represents system level index.
The technical scheme of the present invention uses a convolutional-temporal convolutional network (CNN-TCN) architecture and adopts Label Distribution Learning (LDL) to improve network audio evaluation performance; the speech is pre-segmented according to the scene characteristics of multi-person voice dialogues, targeted update training of the no-reference voice quality evaluation model is carried out on the grouped sample data, and the evaluation is output, so that the result is more targeted and adaptive.
The technical scheme of the present invention first parses the voice dialogue to be evaluated to obtain a first interaction attribute; it then divides the voice dialogue to be evaluated into a plurality of voice segment groups and obtains a second duration attribute of each voice sub-segment in each voice segment group; it next determines at least one candidate sub-segment from each voice segment group, and trains and updates the existing at least one voice quality evaluation model to obtain an updated voice quality evaluation model; finally, it inputs the voice sub-segments other than the candidate sub-segments in each voice segment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice segment group. In this way, no-reference quality score output for a multi-person voice dialogue can be realized accurately, and the result is more targeted and adaptive.
As for the other technical features of the embodiment, those skilled in the art can flexibly select among them according to actual conditions to meet different specific practical requirements.
Modifications and variations which do not depart from the spirit and scope of the invention are intended to be within the scope of the invention as defined by the appended claims. In the above description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that: no such specific details are necessary to practice the invention. In other instances, well-known techniques, such as specific construction details, operating conditions, and other technical conditions, have not been described in detail in order to avoid obscuring the present invention.
The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; also, it is within the scope of the present invention to be modified by those of ordinary skill in the art in light of the present teachings. In view of the foregoing, this description should not be construed as limiting the invention.
Claims (10)
1. A method for evaluating the quality of a voice conversation, comprising:
s100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
s200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and the plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
s300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
s400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
s500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
s600: inputting the voice sub-fragments except the candidate sub-fragments in each voice fragment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice fragment group;
the voice quality evaluation model is a convolution-time sequence convolution network model.
2. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
the first interaction attribute in the step S100 is used to characterize a first number of different target characters involved in the speech dialogue to be evaluated;
the step S200 specifically includes: and dividing the voice dialogue to be evaluated into a first number of voice fragment groups based on the first interaction attribute.
3. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
the second duration attribute in step S300 is used to characterize the duration of each voice sub-segment.
4. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
in the step S400, the second duration attributes of the candidate sub-segments are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
5. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
the convolution-time sequence convolution network model comprises a preamble convolution module and a time sequence convolution module;
the preamble convolution module performs feature extraction on the input voice sub-segment;
the time sequence convolution module comprises n expansion convolution modules, wherein the expansion factor of each expansion convolution module is 2 n-1 。
6. A speech dialog quality assessment system comprising:
a voice receiving module: for receiving a voice dialog to be evaluated;
and a voice analysis module: the method comprises a first interaction analysis subunit and a second duration analysis subunit; the first interaction analysis subunit is used for analyzing the voice dialogue to be evaluated, obtaining a first interaction attribute of the voice dialogue to be evaluated, and sending the first interaction attribute to the voice grouping module; the second duration analysis subunit is configured to analyze each voice sub-segment included in each voice segment group, obtain a second duration attribute of each voice sub-segment in each voice segment group, and send the second duration attribute to a voice candidate module;
and a voice grouping module: dividing the voice dialogue to be evaluated into a first number of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and a plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
a voice candidate module: the method comprises the steps of determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein second duration attributes of the plurality of candidate sub-segments are identical;
model training module: the method comprises the steps of training and updating at least one existing voice quality evaluation model by taking the candidate sub-segments as training samples to obtain an updated voice quality evaluation model;
and an evaluation output module: the voice sub-segments except the candidate sub-segments in each voice segment group are input into the updated voice quality evaluation model, and the voice quality evaluation score of the target person corresponding to each voice segment group is output;
the voice receiving module, the first interaction analysis subunit, the voice grouping module, the voice candidate module, the model training module and the evaluation output module are sequentially connected; the second duration analysis subunit is also connected with the voice grouping module and the voice candidate module;
the voice quality evaluation model is a convolution-time sequence convolution network model.
7. The speech dialog quality assessment system of claim 6, wherein the first interaction attribute is used to characterize a first number of different target persons involved in the speech dialog to be assessed.
8. The speech dialogue quality assessment system of claim 6 wherein,
the second duration attribute is used to characterize the duration of each speech sub-segment.
9. The speech dialogue quality assessment system of claim 6, wherein the second duration attributes of the plurality of candidate sub-segments being the same comprises:
the duration of the plurality of candidate sub-fragments is the same;
or, the absolute value of the difference value of the duration time of the plurality of candidate sub-segments is smaller than a preset upper limit value.
10. A portable electronic device, characterized by comprising a voice receiving unit, a memory, a processor and a display unit, wherein the memory stores computer-executable program instructions and the receiving unit is configured to receive a voice dialogue;
the executable program instructions are executed by the processor to implement the speech dialog quality assessment method of any of claims 1-5 and display the assessment results on the display unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310345168.5A CN116564351B (en) | 2023-04-03 | 2023-04-03 | Voice dialogue quality evaluation method and system and portable electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116564351A true CN116564351A (en) | 2023-08-08 |
CN116564351B CN116564351B (en) | 2024-01-23 |
Family
ID=87485165
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310345168.5A Active CN116564351B (en) | 2023-04-03 | 2023-04-03 | Voice dialogue quality evaluation method and system and portable electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116564351B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130246064A1 (en) * | 2012-03-13 | 2013-09-19 | Moshe Wasserblat | System and method for real-time speaker segmentation of audio interactions |
CN108346434A (en) * | 2017-01-24 | 2018-07-31 | 中国移动通信集团安徽有限公司 | A kind of method and apparatus of speech quality evaluation |
CN108564968A (en) * | 2018-04-26 | 2018-09-21 | 广州势必可赢网络科技有限公司 | A kind of method and device of evaluation customer service |
CN110401622A (en) * | 2018-04-25 | 2019-11-01 | 中国移动通信有限公司研究院 | A kind of speech quality assessment method, device, electronic equipment and storage medium |
CN111429938A (en) * | 2020-03-06 | 2020-07-17 | 江苏大学 | Single-channel voice separation method and device and electronic equipment |
CN112750465A (en) * | 2020-12-29 | 2021-05-04 | 昆山杜克大学 | Cloud language ability evaluation system and wearable recording terminal |
CN112885377A (en) * | 2021-02-26 | 2021-06-01 | 平安普惠企业管理有限公司 | Voice quality evaluation method and device, computer equipment and storage medium |
CN114220419A (en) * | 2021-12-31 | 2022-03-22 | 科大讯飞股份有限公司 | Voice evaluation method, device, medium and equipment |
WO2022103290A1 (en) * | 2020-11-12 | 2022-05-19 | "Stc"-Innovations Limited" | Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems |
CN115512718A (en) * | 2022-09-14 | 2022-12-23 | 中科猷声(苏州)科技有限公司 | Voice quality evaluation method, device and system for stock voice file |
Non-Patent Citations (2)
Title |
---|
YING QIN: "Automatic Assessment of Speech Impairment in Cantonese-Speaking People with Aphasia", IEEE Journal of Selected Topics in Signal Processing * |
MA WEN: "Research on Multi-label Classification Methods for Environmental Audio Based on Deep Learning", China Master's Theses Full-text Database * |
Also Published As
Publication number | Publication date |
---|---|
CN116564351B (en) | 2024-01-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |