CN116564351A - Voice dialogue quality evaluation method and system and portable electronic equipment - Google Patents
- Publication number
- CN116564351A (application number CN202310345168.5A)
- Authority
- CN
- China
- Prior art keywords
- voice
- sub
- segment
- duration
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention discloses a voice dialogue quality evaluation method and system and a portable electronic device, belonging to the technical field of voice quality evaluation. The method comprises the following steps. S100: analyzing the voice dialogue to be evaluated to obtain a first interaction attribute; S200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups; S300: obtaining a second duration attribute of each voice sub-segment in each voice segment group; S400: determining at least one candidate sub-segment from each voice segment group; S500: training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model; S600: inputting the voice sub-segments other than the candidate sub-segments in each voice segment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice segment group. The invention can accurately realize no-reference quality score output for multi-person voice dialogues.
Description
Technical Field
The present invention relates to the field of speech quality evaluation technologies, and in particular, to a speech dialogue quality evaluation method, system, and portable electronic device.
Background
Sound is one of the main ways humans know and perceive the world. With the full popularity of networks, network audio services have grown rapidly, and audio of poor quality requires enhancement processing to improve it. In the audio field, most currently popular audio evaluation platforms rely on only one or two parameters of the audio as the criterion of audio quality. This is unreasonable in practice: audio quality is related to the human auditory system, is influenced by many factors, and cannot be measured by one or two simple parameters alone.
As comprehensive quality evaluation systems have matured, two methods of evaluating audio quality have evolved: subjective evaluation and objective evaluation. In subjective evaluation, organized testers listen to a series of audio sequences according to the international telecommunication union telecommunication standardization sector (ITU-T) P.800 standard; the testers' ratings of the speech quality are averaged, and the final audio quality result is expressed as a mean opinion score (Mean Opinion Score, "MOS"), where a higher MOS value indicates better audio quality. However, subjective evaluation suffers from long experimental periods and high economic cost.
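The MOS described above is simply the arithmetic mean of the listeners' opinion-scale ratings; a minimal sketch (the function name and the range check are illustrative, not part of the patent):

```python
def mean_opinion_score(ratings):
    """Average a list of 1-5 listener ratings into a MOS value (ITU-T P.800 scale)."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 opinion scale")
    return sum(ratings) / len(ratings)
```

For example, four listeners rating a clip 4, 5, 4 and 3 yield a MOS of 4.0.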
Objective assessment methods are widely used to evaluate audio quality and fall into reference and no-reference audio quality assessment models. A reference method compares the processed speech with lossless speech: the two signals are first aligned and their deviation is found; after each small segment of processed speech is aligned with the corresponding lossless segment, each pair is fed independently into an auditory model to examine the loss on each frequency band and the generation of additional frequency components, and to judge whether the increase or decrease of frequency components is noticeable enough to human hearing; finally, the per-segment speech impairments are smoothed and weighted-averaged over the whole time domain and mapped to a single speech quality score. ITU-T has historically introduced three well-known models: PSQM (P.861), PESQ (P.862) and POLQA (P.863), of which POLQA is currently the most widely accepted. The PSQM and PESQ models are only applicable to audio below 16 kHz. The POLQA model can be applied to 48 kHz audio signals, but its algorithm remains protected, is not public, and is costly to use. Moreover, a reference audio quality assessment model requires reference audio to be provided and cannot assess audio quality in scenarios where no reference audio is available.
No-reference audio quality evaluation models are mostly implemented with deep learning. A representative method is MOSNET, which adopts a CNN plus BLSTM network architecture; its training data come from The Voice Conversion Challenge (VCC) 2018, and its evaluation indexes are among the best in the industry. The temporal convolutional network (TCN) has succeeded in machine translation, traffic prediction, sound event detection and other fields, and has the potential to surpass LSTM networks. In addition, the VCC2018 data set is limited in size, and how to fully mine the internal information of small-scale data so as to optimize audio evaluation performance is a problem the industry still needs to solve.
In practical application, the deep learning models used by current no-reference speech quality evaluation techniques adapt poorly to voices with different timbres: the deep neural networks used in the prior art apply, by default, the same evaluation model and the same processing to speech from different sources. They therefore do not consider the adaptability of voice dialogue quality evaluation in multi-voice environments, especially multi-target-person multi-voice dialogue environments, nor the problem of model update training.
Disclosure of Invention
The invention aims to provide a voice dialogue quality evaluation method, a voice dialogue quality evaluation system and portable electronic equipment, which can accurately realize the non-reference quality score output of a multi-character voice dialogue, so that the result is more targeted and adaptive.
In order to achieve the above object, the present invention provides the following solutions:
a voice conversation quality evaluation method, comprising the following steps:
s100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
s200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and the plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
s300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
s400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
s500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
s600: inputting the voice sub-segments other than the candidate sub-segments in each voice segment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice segment group;
the voice quality evaluation model is a convolution-time sequence convolution network model.
Further, the first interaction attribute in the step S100 is used to characterize a first number of different target characters involved in the speech dialogue to be evaluated;
the step S200 specifically includes: and dividing the voice dialogue to be evaluated into a first number of voice fragment groups based on the first interaction attribute.
Further, the second duration attribute in the step S300 is used to characterize the duration of each voice sub-segment.
Further, the second duration attributes of the candidate sub-segments in the step S400 are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
Further, the convolution-time sequence convolution network model comprises a preamble convolution module and a time sequence convolution module;
the preamble convolution module performs feature extraction on the input voice sub-segment;
the time sequence convolution module comprises n dilated convolution modules, wherein the dilation factor of the n-th dilated convolution is 2^(n-1).
The invention also provides a voice dialogue quality evaluation system, which comprises:
a voice analysis unit: comprising a first interaction analysis subunit and a second duration analysis subunit; the first interaction analysis subunit is used for analyzing the voice dialogue to be evaluated, obtaining a first interaction attribute of the voice dialogue to be evaluated, and sending the first interaction attribute to the voice grouping unit; the second duration analysis subunit is configured to analyze each voice sub-segment included in each voice segment group, obtain a second duration attribute of each voice sub-segment in each voice segment group, and send the second duration attribute to the voice candidate unit;
voice grouping unit: dividing the voice dialogue to be evaluated into a first number of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and a plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
a speech candidate unit: the method comprises the steps of determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein second duration attributes of the plurality of candidate sub-segments are identical;
model training unit: the method comprises the steps of training and updating at least one existing voice quality evaluation model by taking the candidate sub-segments as training samples to obtain an updated voice quality evaluation model;
and an evaluation output unit: the voice sub-segments except the candidate sub-segments in each voice segment group are input into the updated voice quality evaluation model, and the voice quality evaluation score of the target person corresponding to each voice segment group is output;
the voice receiving module, the first interaction analysis subunit, the voice grouping module, the voice candidate module, the model training module and the evaluation output module are sequentially connected; the second duration analysis subunit is connected with the voice grouping module and the voice candidate module;
the speech quality evaluation model is a reference-free speech quality evaluation model.
Further, the first interaction attribute is used to characterize a first number of different target persons involved in the speech dialogue to be evaluated.
Further, the second duration attribute is used to characterize a duration of each voice sub-segment.
Further, the second duration attributes of the plurality of candidate sub-segments are the same, including one of:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
The invention also provides a portable electronic device, comprising a voice receiving unit, a memory, a processor and a display unit, wherein the memory stores computer-executable program instructions and the voice receiving unit is used for receiving voice dialogues; when executed by the processor, the program instructions implement the voice dialogue quality evaluation method and display the evaluation result on the display unit.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects. The voice dialogue quality evaluation method analyzes the voice dialogue to be evaluated to obtain a first interaction attribute; divides the dialogue into a plurality of voice segment groups and obtains a second duration attribute of each voice sub-segment in each group; determines at least one candidate sub-segment from each voice segment group; trains and updates at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model; and finally inputs the voice sub-segments other than the candidate sub-segments in each group into the updated model to obtain the voice quality evaluation score of the target person corresponding to each group. The no-reference quality score output of a multi-person voice dialogue can thus be realized accurately, making the results more targeted and adaptive. In addition, the invention utilizes a convolution-temporal convolution (CNN-TCN) network architecture, adopts label distribution learning (LDL) to improve network audio evaluation performance, pre-processes and segments the speech according to the scene characteristics of multi-person voice dialogue, performs targeted update training of the no-reference speech quality evaluation model on the grouped sample data, and outputs the evaluation. The results are therefore more targeted, suitable and accurate, and adapted to multi-voice environments, especially multi-target-person multi-voice dialogue environments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a voice dialogue quality evaluation method according to an embodiment of the invention;
fig. 2 is a block diagram of a plurality of voice segment groups of a voice conversation quality evaluation method in an embodiment of the present invention;
FIG. 3 (a) is a schematic diagram of a no-reference speech dialog quality assessment model of the present invention;
FIG. 3 (b) is a schematic diagram of the structure of a time-series convolution module in the model for quality assessment of a reference-less speech dialogue of the present invention;
FIG. 3 (c) is a schematic diagram of an expanded convolution module of the no-reference speech dialogue quality assessment model of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to a voice conversation quality evaluation method of the present invention;
FIG. 5 is a schematic diagram of a voice conversation quality evaluation system in accordance with one embodiment of the present invention;
FIG. 6 is a schematic diagram of a speech dialog quality assessment system in accordance with a preferred embodiment of the present invention;
FIG. 7 (a) is an effect diagram corresponding to sentence level indicators according to the present invention;
fig. 7 (b) is an effect diagram corresponding to the system level index according to the embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a voice dialogue quality evaluation method, a voice dialogue quality evaluation system and portable electronic equipment, which can accurately realize the non-reference quality score output of a multi-character voice dialogue, so that the result is more targeted and adaptive.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in fig. 1, a method for evaluating quality of a voice conversation according to an embodiment of the present invention includes:
s100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
specifically, the first interaction attribute in step S100 is used to characterize a first number of different target characters involved in the speech dialogue to be evaluated;
s200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and the plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
the step S200 specifically includes:
and dividing the voice dialogue to be evaluated into a first number of voice fragment groups based on the first interaction attribute.
S300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
the second duration attribute in step S300 is used to characterize the duration of each voice sub-segment;
s400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
the second duration attributes of the candidate sub-segments in the step S400 are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
S500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
the speech quality evaluation model in the step S500 is a no-reference speech quality evaluation model, and the no-reference speech quality evaluation model is a convolution-time sequence convolution network model.
S600: inputting the voice sub-fragments except the candidate sub-fragments in each voice fragment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice fragment group.
In one embodiment of the present invention, fig. 2 shows a grouping schematic diagram of a plurality of voice segment groups. The voice conversation quality evaluation method of this embodiment specifically includes:
s100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
in this embodiment, in the leftmost part of fig. 2, the speech dialogue to be evaluated includes speech A1, B1, A2, C1, C2, B2, C3, A3; the voice dialog is an english voice dialog, such as a spoken english dialog; the segment of voice dialog is from at least three different people, assumed to be person a, person B and person C, wherein the voices from person a are A1, A2, A3; the voices from the person B are B1 and B2; the voices from the person C are C1, C2 and C3; in the scenario of FIG. 2, character conversations are interleaved;
in this embodiment, when the step S100 is performed, the first interaction attribute is used to characterize a first number of different target characters related to the speech dialogue to be evaluated.
S200: dividing the voice dialogue to be evaluated into a plurality of voice fragment groups based on the first interaction attribute;
in this embodiment, the voice dialogue to be evaluated is divided into 3 voice segment groups based on the first interaction attribute, which is still represented by A, B, C;
specifically, each voice segment group contains at least one voice sub-segment, and a plurality of voice sub-segments located in the same voice segment group belong to the same target person; for example, the speech segment group a contains speech sub-segments A1, A2 and A3, the speech segment group B contains speech sub-segments B1, B2, and the speech segment group C contains speech sub-segments C1, C2 and C3.
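Step S200 on the Fig. 2 data can be sketched as a simple grouping by speaker label (the speaker labels themselves would come from an upstream speaker separation step, which the patent does not detail; the function and variable names are illustrative):

```python
from collections import defaultdict

def group_by_speaker(dialogue):
    """S200: collect the sub-segments of each target person into one
    speech segment group, preserving dialogue order within each group."""
    groups = defaultdict(list)
    for speaker, segment_id in dialogue:
        groups[speaker].append(segment_id)
    return dict(groups)

# The interleaved dialogue of Fig. 2:
dialogue = [("A", "A1"), ("B", "B1"), ("A", "A2"), ("C", "C1"),
            ("C", "C2"), ("B", "B2"), ("C", "C3"), ("A", "A3")]
groups = group_by_speaker(dialogue)
# First interaction attribute = number of distinct target persons = len(groups) = 3
```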
S300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
the second duration attribute in step S300 is used to characterize the duration of each voice sub-segment;
in the present embodiment, it is assumed that the duration of the voice sub-segments A1, A2, and A3 included in the voice segment group a are 18s, 30s, and 25s, respectively (s represents seconds); the duration of the speech sub-segments B1, B2 included in the speech segment group B is 26s, 18s and 30s, respectively, and the duration of the speech sub-segments C1, C2 and C3 included in the speech segment group C is 25s, 31s and 18s, respectively.
S400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
in this embodiment, the criteria for determining at least one candidate sub-segment from each speech segment group specifically includes one or a combination of the following:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
If criterion (1) is executed, A1, B2 and C3 may be selected as the plurality of candidate sub-segments, since their second duration attributes are the same (all 18s);
if criterion (2) is executed with the preset upper limit value set to 1s in advance, A3, B1 and C1 may be selected as the plurality of candidate sub-segments, since the absolute values of their pairwise duration differences do not exceed the preset upper limit value;
of course, other options are possible, and the above is merely an illustrative example.
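The candidate selection of S400 on the durations above can be sketched as a brute-force search over one pick per group (an illustrative helper, not the patent's prescribed algorithm; criterion (1) corresponds to the special case max_diff = 0):

```python
from itertools import product

def pick_candidates(groups, max_diff=0.0):
    """S400: choose one (segment_id, duration) pair per group so that the
    chosen durations differ by at most max_diff seconds."""
    ids = sorted(groups)
    for combo in product(*(groups[g] for g in ids)):
        durations = [dur for _, dur in combo]
        if max(durations) - min(durations) <= max_diff:
            return dict(zip(ids, combo))
    return None  # no admissible combination exists

# Durations from the embodiment above:
groups = {"A": [("A1", 18), ("A2", 30), ("A3", 25)],
          "B": [("B1", 26), ("B2", 18)],
          "C": [("C1", 25), ("C2", 31), ("C3", 18)]}
equal = pick_candidates(groups, max_diff=0.0)  # criterion (1): A1, B2, C3
```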
S500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
specifically, the existing at least one speech quality assessment model is a no-reference speech quality assessment model.
Specifically, as still another improved embodiment of the present invention, the reference-free speech quality assessment model is a convolution-time series convolution network model.
Fig. 3 (a) - (c) are schematic diagrams illustrating principles and components of a model for quality evaluation of a non-reference voice conversation according to an embodiment of the present invention.
The embodiment provides a new deep learning-based reference-free audio quality evaluation method, which utilizes a convolution-time sequence convolution (CNN-TCN) network architecture and adopts a Label Distribution Learning (LDL) method to improve network audio evaluation performance.
The input signal, produced by the corresponding steps (S100-S400) of the method shown in fig. 1, first passes through the convolution module; its output is fed to the time sequence convolution module, whose output in turn feeds the fully connected module. The output of the fully connected module comprises two branches: the first branch a outputs frame-level MOS values, while the second branch b (mapped through global average pooling) performs sentence-level label distribution processing and outputs sentence-level MOS values, as shown in fig. 3 (a).
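The tensor shapes through this pipeline, as described here and in sections 2-4 below, can be sanity-checked with a small numpy walkthrough (the scores are random placeholders; only the shapes reflect the described architecture, and B and N are example values):

```python
import numpy as np

B, N = 2, 100                                 # batch size, frame count (assumed)
signal_1 = np.zeros((B, N, 257, 1))           # STFT input after reshape
conv_out = np.zeros((B, N, 4, 128))           # after the 12-layer conv module
signal_2 = conv_out.reshape(B, N, 512)        # input to the TCN module
frame_mos = np.random.rand(B, N, 1) * 4 + 1   # branch a: frame-level MOS in [1, 5)
sentence_mos = frame_mos.mean(axis=1)         # branch b: global average pooling
assert sentence_mos.shape == (B, 1)
```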
Next, the principle of retraining and updating the no-reference voice dialogue quality evaluation model used in the present invention will be described.
1. Introduction to data set
VCC2018 contains audio with MOS labels generated by various speech synthesis systems; each audio clip was given a MOS rating (1-5 points) by four raters. The training set contains 13,580 clips, the validation set 3,000 and the test set 4,000. Because the training set is small, this embodiment adds label distribution learning to improve model performance.
The process of retraining and updating an existing model is known in the art and is not elaborated in this embodiment.
2. Signal input
The original audio in VCC2018 is transformed by STFT to obtain the frequency-domain input signal of the network; optionally, a Hamming window is applied, with window length 512, window shift 256 and 512 frequency points. The input signal has dimensions [B, N, F], where B is the batch size, N the number of frames and F the number of one-sided frequency points; when the number of frequency points is 512, F is 257. The input is reshaped to obtain input signal 1 with dimensions [B, N, 257, 1].
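The framing and reshaping described above can be sketched as follows. The use of the magnitude spectrum and the 16 kHz sample rate are assumptions for illustration; the text does not state either.

```python
import numpy as np

def stft_input(audio, n_fft=512, win_shift=256):
    """Frame the waveform with a Hamming window (length 512, shift 256), take
    the one-sided spectrum (257 bins), and reshape to [B, N, 257, 1]."""
    window = np.hamming(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // win_shift
    frames = np.stack([audio[i * win_shift: i * win_shift + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))   # [N, 257]; magnitude is assumed
    return spec[np.newaxis, :, :, np.newaxis]    # [B=1, N, 257, 1]

x = stft_input(np.random.randn(16000))           # one second at an assumed 16 kHz
```

For a 16 000-sample input this yields 61 frames, i.e. a tensor of shape [1, 61, 257, 1].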
3. Convolution module
The input signal first undergoes feature extraction in a convolution module. The convolution module is the same as the one in MOSNET and has 12 layers in total; the parameters of each layer are shown in Table 1.
TABLE 1 convolution module layer parameters
4. Sequential convolution module
After feature extraction by the convolution module, the features have dimensions [B, N, 4, 128]; they are reshaped to obtain input signal 2 of the temporal convolution module, with dimensions [B, N, 512].
The temporal convolution module is composed of n dilated convolution modules; preferably n is 4. The dilation factor of the n-th dilated convolution module is 2^(n-1), where n indexes the dilated convolution modules, as shown in FIG. 3 (b).
Taking the first dilated convolution module as an example: the module has a residual structure. Input signal 2 passes in sequence through a one-dimensional dilated convolution, channel normalization, a ReLU activation function and a Dropout layer, and this structure is repeated once; the one-dimensional dilated convolution may be set to 128 output channels, kernel size 3, stride 1, dilation factor 2^(n-1), 'same' padding and a Dropout rate of 0.3. Input signal 2 also passes through a 1×1 one-dimensional convolution, preferably with 128 output channels; the two resulting signals are added to obtain input signal 3 of the fully connected module, with dimensions [B, N, 128], as shown in fig. 3 (c).
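The dilation schedule above can be illustrated numerically. The receptive-field formula assumes two dilated convolutions per residual block (the structure "repeated once"), which is an interpretation of the text rather than a stated value.

```python
def dilation_schedule(n=4):
    """Dilation factor of the i-th dilated convolution module: 2^(i-1)."""
    return [2 ** (i - 1) for i in range(1, n + 1)]

def receptive_field(n=4, kernel=3, convs_per_block=2):
    """Frames of context seen by the TCN output: each dilated convolution of
    kernel size k and dilation d adds (k-1)*d frames on top of the centre frame."""
    return 1 + sum((kernel - 1) * d * convs_per_block for d in dilation_schedule(n))
```

With the preferred n = 4 the dilation factors are 1, 2, 4, 8, giving a receptive field of 61 frames under these assumptions.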
5. Network output processing
Input signal 3 passes through a fully connected layer, preferably with 128 output channels, a ReLU activation function and a Dropout rate of 0.3, to obtain fully connected module output 4 with dimensions [B, N, 128]. Signal 4 passes through a fully connected layer with 1 output channel to obtain signal 5, the frame-level MOS values, with dimensions [B, N, 1]. Signal 4 also passes through a fully connected layer with 101 output channels and a softmax activation function to obtain fully connected output 6 with dimensions [B, N, 101]; global pooling of signal 6 yields the sentence-level label distribution signal 7, with dimensions [B, 1, 101], and signal 7 is mapped to obtain signal 8, the sentence-level MOS value, with dimensions [B, 1]. When B is 1, the mapping function is:
where y represents signal 8 and x represents signal 7. The result of the above formula lies in the range 0-5.
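The mapping formula itself is not reproduced in this text. One plausible reconstruction, consistent with the 101-bin distribution and the stated 0-5 range, is the expected value over bin centres with bin k encoding MOS k/20; this is an assumption, not the patent's exact formula.

```python
import numpy as np

def map_mos(label_dist):
    """Map a sentence-level label distribution over 101 bins (signal 7) to a
    MOS value in [0, 5] (signal 8), assuming bin k encodes MOS k/20."""
    k = np.arange(101)
    return float(np.sum(label_dist * k / 20.0))
```

Under this reconstruction a uniform distribution maps to 2.5 and a one-hot distribution on the last bin maps to 5.0, covering the stated range.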
6. Loss function
The average avg_mos of each audio clip's MOS ratings in VCC2018 serves as the label for signal 5, the frame-level MOS values (frame_mos), and for signal 8, the sentence-level MOS value (utterance_mos). The variance var_mos of each audio clip's MOS ratings is also computed, and the label distribution (a Gaussian distribution) is calculated from the following equation,
where lk is in [0:100].
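The label-distribution equation is likewise not reproduced in this text. A normalised Gaussian over the 101 bin centres lk/20, parameterised by avg_mos and var_mos, is one assumed reconstruction:

```python
import numpy as np

def label_distribution(avg_mos, var_mos, n_bins=101):
    """Gaussian label distribution over bins lk = 0..100, with bin k taken to
    encode MOS k/20 (an assumption consistent with the 0-5 mapping above)."""
    centres = np.arange(n_bins) / 20.0
    p = np.exp(-((centres - avg_mos) ** 2) / (2.0 * var_mos))
    return p / p.sum()                 # normalise so the bins sum to 1
```

For avg_mos = 3.0 the distribution peaks at bin 60 (= 3.0 × 20), and a smaller var_mos yields a sharper peak.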
All or part of the steps of the various embodiments described above may also be implemented as computer program instructions executed by a portable electronic device.
Referring to fig. 4, the present invention may also be implemented as a portable electronic device. The electronic device includes a voice receiving unit, a memory, a processor and a display unit; the memory stores computer-executable program instructions, and the receiving unit is configured to receive an English voice dialogue;
the executable program instructions are executed by the processor to implement all or part of the steps of the voice dialog quality assessment method described in fig. 1 and display the assessment results on the display unit.
Fig. 5-6 are schematic structural diagrams of a voice conversation quality evaluation system according to various embodiments of the present invention.
As shown in fig. 5, the present embodiment provides a voice dialogue quality evaluation system, which includes a voice receiving module, a voice analyzing module, a voice grouping module, a voice candidate module, a model training module, and an evaluation output module, for implementing the method described in fig. 1.
As shown in fig. 6, in this embodiment, the voice parsing module includes a first interaction parsing subunit and a second duration parsing subunit;
in this embodiment, the first interaction analysis subunit is configured to analyze a voice dialogue to be evaluated, obtain a first interaction attribute of the voice dialogue to be evaluated, and send the first interaction attribute to a voice grouping module; the second duration analysis subunit is configured to analyze each voice sub-segment included in each voice segment group, obtain a second duration attribute of each voice sub-segment in each voice segment group, and send the second duration attribute to a voice candidate module;
in this embodiment, the voice receiving module is configured to receive a voice dialogue to be evaluated;
in this embodiment, the voice grouping module is configured to segment the voice dialog to be evaluated into a first number of voice segment groups based on the first interaction attribute, where each voice segment group includes at least one voice sub-segment, and a plurality of voice sub-segments located in the same voice segment group belong to the same target person;
in this embodiment, the voice candidate module is configured to determine at least one candidate sub-segment from each voice segment group, so as to obtain a plurality of candidate sub-segments, where second duration attributes of the plurality of candidate sub-segments are the same;
in this embodiment, the model training module is configured to train and update an existing at least one speech quality evaluation model with the plurality of candidate sub-segments as training samples, to obtain an updated speech quality evaluation model;
in this embodiment, the evaluation output module is configured to input the speech sub-segments except the candidate sub-segments in each speech segment group into the updated speech quality evaluation model, and output a speech quality evaluation score of the target person corresponding to each speech segment group;
in this embodiment, the voice receiving module, the first interaction analysis subunit, the voice grouping module, the voice candidate module, the model training module, and the evaluation output module are sequentially connected; the second duration analysis subunit is connected with the voice grouping module and the voice candidate module;
in this embodiment, the speech quality evaluation model is a no-reference speech quality evaluation model, and the no-reference speech quality evaluation model is a convolution-time sequence convolution network model.
In this embodiment, the first interaction attribute is used to characterize a first number of different target persons involved in the speech dialogue to be evaluated.
In this embodiment, the second duration attribute is used to characterize the duration of each voice sub-segment.
In this embodiment, the second duration attributes of the candidate sub-segments are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
As a further preference, the temporal convolution module is composed of n dilated convolution modules, the dilation factor of the n-th dilated convolution module being 2^(n-1), where n indexes the dilated convolution modules.
Specifically, to obtain a more beneficial training model and training parameters, it has been verified that the value of n is related to the first number K of different target persons involved in the voice dialogue to be evaluated and to the numbers {num1, num2, ……, numK} of voice sub-segments contained in the K voice segment groups, where numi represents the number of voice sub-segments contained in the i-th voice segment group (i = 1, 2, 3, ……, K).
Specifically, the number n of the expansion convolution modules is determined as follows:
if K is less than or equal to 4, n=4;
if K>4, then
Wherein, min { } represents taking a smaller value,representing taking a larger value of { num1, num2, … … num K };representing an upward rounding.
Fig. 7 shows the effect diagrams corresponding to the sentence-level indices and to the system-level indices for the technical scheme of the present invention.
In this embodiment, the comparison indices selected are the linear correlation coefficient (LCC), the Spearman rank correlation coefficient (SRCC) and the mean square error (MSE).
As can be seen from the comparison of this example's method with MOSNET in Table 2, every index is superior to MOSNET's.
TABLE 2 comparison of the methods of the present example with MOSNET indicators
Index | Present invention | MOSNET |
---|---|---|
LCC | (0.6684, 0.9643) | (0.642, 0.957) |
SRCC | (0.6342, 0.9282) | (0.589, 0.888) |
MSE | (0.4642, 0.0434) | (0.538, 0.084) |
Note that: (A, B) A represents sentence level index, and B represents system level index.
The technical scheme of the present invention uses a convolutional-temporal convolutional network (CNN-TCN) architecture and adopts Label Distribution Learning (LDL) to improve network audio evaluation performance; the speech is pre-segmented according to the scene characteristics of multi-person voice dialogues, targeted update training of the no-reference voice quality evaluation model is carried out on the grouped sample data, and the evaluation is output, so that the result is more targeted and adaptive.
The technical scheme of the present invention first parses the voice dialogue to be evaluated to obtain a first interaction attribute; it then divides the voice dialogue to be evaluated into a plurality of voice segment groups and obtains a second duration attribute of each voice sub-segment in each voice segment group; it next determines at least one candidate sub-segment from each voice segment group, and trains and updates the existing at least one voice quality evaluation model to obtain an updated voice quality evaluation model; finally, it inputs the voice sub-segments other than the candidate sub-segments in each voice segment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice segment group. In this way, no-reference quality score output for a multi-person voice dialogue can be realized accurately, and the result is more targeted and adaptive.
As for the other technical features of the embodiment, those skilled in the art can flexibly select among them according to actual conditions to meet different specific practical requirements.
Modifications and variations which do not depart from the spirit and scope of the invention are intended to be within the scope of the invention as defined by the appended claims. In the above description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that: no such specific details are necessary to practice the invention. In other instances, well-known techniques, such as specific construction details, operating conditions, and other technical conditions, have not been described in detail in order to avoid obscuring the present invention.
The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; also, it is within the scope of the present invention to be modified by those of ordinary skill in the art in light of the present teachings. In view of the foregoing, this description should not be construed as limiting the invention.
Claims (10)
1. A method for evaluating the quality of a voice conversation, comprising:
s100: analyzing a voice dialogue to be evaluated to obtain a first interaction attribute of the voice dialogue to be evaluated;
s200: dividing the voice dialogue to be evaluated into a plurality of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and the plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
s300: analyzing each voice sub-segment contained in each voice segment group to obtain a second duration attribute of each voice sub-segment in each voice segment group;
s400: determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein the second duration attributes of the plurality of candidate sub-segments are the same;
s500: taking the candidate sub-segments as training samples, training and updating at least one existing voice quality evaluation model to obtain an updated voice quality evaluation model;
s600: inputting the voice sub-fragments except the candidate sub-fragments in each voice fragment group into the updated voice quality evaluation model to obtain the voice quality evaluation score of the target person corresponding to each voice fragment group;
the voice quality evaluation model is a convolution-time sequence convolution network model.
2. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
the first interaction attribute in the step S100 is used to characterize a first number of different target characters involved in the speech dialogue to be evaluated;
the step S200 specifically includes: and dividing the voice dialogue to be evaluated into a first number of voice fragment groups based on the first interaction attribute.
3. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
the second duration attribute in step S300 is used to characterize the duration of each voice sub-segment.
4. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
in the step S400, the second duration attributes of the candidate sub-segments are the same, including one of the following cases:
the duration of the plurality of candidate sub-fragments is the same;
the absolute value of the difference value of the duration of the plurality of candidate sub-segments is smaller than a preset upper limit value.
5. The method for evaluating the quality of a voice conversation according to claim 1, wherein,
the convolution-time sequence convolution network model comprises a preamble convolution module and a time sequence convolution module;
the preamble convolution module performs feature extraction on the input voice sub-segment;
the time sequence convolution module comprises n expansion convolution modules, wherein the expansion factor of each expansion convolution module is 2 n-1 。
6. A speech dialog quality assessment system comprising:
a voice receiving module: for receiving a voice dialog to be evaluated;
and a voice analysis module: the method comprises a first interaction analysis subunit and a second duration analysis subunit; the first interaction analysis subunit is used for analyzing the voice dialogue to be evaluated, obtaining a first interaction attribute of the voice dialogue to be evaluated, and sending the first interaction attribute to the voice grouping module; the second duration analysis subunit is configured to analyze each voice sub-segment included in each voice segment group, obtain a second duration attribute of each voice sub-segment in each voice segment group, and send the second duration attribute to a voice candidate module;
and a voice grouping module: dividing the voice dialogue to be evaluated into a first number of voice segment groups based on the first interaction attribute, wherein each voice segment group comprises at least one voice sub-segment, and a plurality of voice sub-segments positioned in the same voice segment group belong to the same target person;
a voice candidate module: the method comprises the steps of determining at least one candidate sub-segment from each voice segment group to obtain a plurality of candidate sub-segments, wherein second duration attributes of the plurality of candidate sub-segments are identical;
model training module: the method comprises the steps of training and updating at least one existing voice quality evaluation model by taking the candidate sub-segments as training samples to obtain an updated voice quality evaluation model;
and an evaluation output module: the voice sub-segments except the candidate sub-segments in each voice segment group are input into the updated voice quality evaluation model, and the voice quality evaluation score of the target person corresponding to each voice segment group is output;
the voice receiving module, the first interaction analysis subunit, the voice grouping module, the voice candidate module, the model training module and the evaluation output module are sequentially connected; the second duration analysis subunit is also connected with the voice grouping module and the voice candidate module;
the voice quality evaluation model is a convolution-time sequence convolution network model.
7. The speech dialog quality assessment system of claim 6, wherein the first interaction attribute is used to characterize a first number of different target persons involved in the speech dialog to be assessed.
8. The speech dialogue quality assessment system of claim 6 wherein,
the second duration attribute is used to characterize the duration of each speech sub-segment.
9. The speech dialogue quality assessment system of claim 6, wherein the second duration attributes of the plurality of candidate sub-segments being the same comprises:
the duration of the plurality of candidate sub-fragments is the same;
or, the absolute value of the difference value of the duration time of the plurality of candidate sub-segments is smaller than a preset upper limit value.
10. A portable electronic device, characterized by comprising a voice receiving unit, a memory, a processor and a display unit, wherein the memory stores computer-executable program instructions and the receiving unit is configured to receive a voice dialogue;
the executable program instructions are executed by the processor to implement the speech dialog quality assessment method of any of claims 1-5 and display the assessment results on the display unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310345168.5A CN116564351B (en) | 2023-04-03 | 2023-04-03 | Voice dialogue quality evaluation method and system and portable electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116564351A true CN116564351A (en) | 2023-08-08 |
CN116564351B CN116564351B (en) | 2024-01-23 |
Family
ID=87485165
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310345168.5A Active CN116564351B (en) | 2023-04-03 | 2023-04-03 | Voice dialogue quality evaluation method and system and portable electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116564351B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130246064A1 (en) * | 2012-03-13 | 2013-09-19 | Moshe Wasserblat | System and method for real-time speaker segmentation of audio interactions |
CN108346434A (en) * | 2017-01-24 | 2018-07-31 | 中国移动通信集团安徽有限公司 | A kind of method and apparatus of speech quality evaluation |
CN108564968A (en) * | 2018-04-26 | 2018-09-21 | 广州势必可赢网络科技有限公司 | A kind of method and device of evaluation customer service |
CN110401622A (en) * | 2018-04-25 | 2019-11-01 | 中国移动通信有限公司研究院 | A kind of speech quality assessment method, device, electronic equipment and storage medium |
CN111429938A (en) * | 2020-03-06 | 2020-07-17 | 江苏大学 | Single-channel voice separation method and device and electronic equipment |
CN112750465A (en) * | 2020-12-29 | 2021-05-04 | 昆山杜克大学 | Cloud language ability evaluation system and wearable recording terminal |
CN112885377A (en) * | 2021-02-26 | 2021-06-01 | 平安普惠企业管理有限公司 | Voice quality evaluation method and device, computer equipment and storage medium |
CN114220419A (en) * | 2021-12-31 | 2022-03-22 | 科大讯飞股份有限公司 | Voice evaluation method, device, medium and equipment |
WO2022103290A1 (en) * | 2020-11-12 | 2022-05-19 | "Stc"-Innovations Limited" | Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems |
CN115512718A (en) * | 2022-09-14 | 2022-12-23 | 中科猷声(苏州)科技有限公司 | Voice quality evaluation method, device and system for stock voice file |
Non-Patent Citations (2)
Title |
---|
YING QIN: "Automatic Assessment of Speech Impairment in Cantonese-Speaking People with Aphasia", IEEE Journal of Selected Topics in Signal Processing * |
MA WEN: "Research on Multi-label Classification Methods for Environmental Audio Based on Deep Learning", China Master's Theses Full-text Database * |
Also Published As
Publication number | Publication date |
---|---|
CN116564351B (en) | 2024-01-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |