CN116110437A - Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics
- Publication number: CN116110437A
- Application number: CN202310395720.1A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L25/66 - Speech or voice analysis techniques specially adapted for extracting parameters related to health condition
- A61B5/4803 - Speech analysis specially adapted for diagnostic purposes
- G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G10L15/02 - Feature extraction for speech recognition; selection of recognition unit
- G10L15/16 - Speech classification or search using artificial neural networks
- G10L25/24 - Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
- G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
- Y02P90/30 - Computing systems specially adapted for manufacturing
Abstract
The invention provides a pathological voice quality evaluation method based on the fusion of voice features and speaker features. Pathological voice is input, two voice features, the spectrogram and mel-frequency cepstral coefficients, are extracted, and feature fusion is performed; the fused voice features serve as input for extracting temporal information and predicting frame-level scores; the mel-frequency cepstral coefficient features serve as input for extracting speaker features; the voice features obtained after the temporal-information extraction and the speaker features then serve as inputs for feature fusion, yielding the prediction of the speech-level quality score. By extracting voice features and speaker features from pathological voice, fusing them, and finally performing score prediction, the invention finds the mapping between pathological voice and its corresponding subjective quality score, realizing objective, quantitative evaluation of pathological voice quality.
Description
Technical Field
The invention belongs to the technical field of pathological voice quality evaluation, and particularly relates to a pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics.
Background
With the accelerating pace of modern life, poor vocal habits, lifestyle habits, and vocal abuse have made voice disorders increasingly common. Voice disorders impede communication, isolate individuals from society, and can even lead to conditions such as depression, with substantial physiological and psychological consequences. Effective diagnosis and treatment of voice disorders is therefore receiving growing attention, and accurate, quantitative evaluation of pathological voice quality plays an important role in that diagnosis and treatment.
Pathological voice quality evaluation is a new direction in voice-disorder research that derives a quality score from analysis of the pathological voice signal. Its main methods divide into subjective perceptual evaluation and objective acoustic analysis. Clinically, pathological voice diagnosis currently relies mainly on subjective perceptual evaluation, in which several physicians listen to a patient's voice and assign a mean opinion score (MOS) against a subjective rating standard to measure pathological voice quality. This mode of evaluation is subjective and inconsistent, however, because physicians differ in experience and auditory perception, and it is poorly repeatable and expensive.
Objective acoustic evaluation can be further divided into referenced and non-referenced modes. Referenced objective evaluation requires both the original voice and the distorted voice, strictly aligned, which limits its use in practice, whereas non-referenced objective evaluation obtains the quality score from the distorted voice alone. In recent years, the development of deep learning has made accurate, end-to-end non-referenced evaluation feasible, but work in this direction has so far targeted normal voices.
Current research on pathological voice focuses mainly on detecting healthy versus pathological voice and on classifying pathological symptoms; research on the perceived quality of pathological voice remains scarce. Although some studies have explored the relationship between objective parameters and pathological voice quality, such indicators support only qualitative analysis. Few researchers have used deep learning to evaluate pathological voice quality quantitatively.
Exploring the mapping between pathological voice quality and subjective MOS scores with deep learning, and building a deep-learning-based pathological voice quality evaluation model, therefore has significant research and practical value.
Disclosure of Invention
In view of the above, the invention provides a pathological voice quality evaluation method based on the fusion of voice features and speaker features, to overcome the poor repeatability and high cost of subjective quality evaluation and to fill the research gap in objective, quantitative evaluation of pathological voice quality.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics comprises the following steps:
step 1: inputting pathological voice, extracting two voice features of a spectrogram and a Mel frequency cepstrum coefficient, and carrying out feature fusion;
step 2: taking the fused voice characteristics as input, extracting time information and predicting frame-level scores;
step 3: taking the voice characteristics of the mel frequency cepstrum coefficient as input to extract the characteristics of a speaker;
step 4: and (3) taking the voice characteristics obtained after the time information extraction in the step (2) and the speaker characteristics obtained in the step (3) as inputs, and carrying out characteristic fusion to obtain the prediction of the speech-level quality score.
Further, in the step 1,
extracting spectrogram voice characteristics comprises framing pathological voice, adding a hamming window and performing short-time Fourier transform to obtain an amplitude spectrogram of voice signals;
extracting the mel-frequency cepstral coefficient voice characteristics comprises applying p mel filters to the pathological voice, obtaining m mel-frequency cepstral coefficients per frame.
Further, in the step 1, feature fusion of the two voice features comprises:
inputting the obtained spectrogram into a convolution module formed by stacking k convolution layers, finally obtaining a feature vector for each frame; the process is expressed as
$$h_1 = \sigma(W_1 * U + b_1), \qquad h_l = \sigma(W_l * h_{l-1} + b_l), \quad l = 2, \dots, k$$
wherein U is the input spectrogram of the pathological voice, σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is h_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, h_{l-1} is the output of the (l-1)-th convolution layer, and h_l is the output of the l-th convolution layer;
concatenating the per-frame output of the convolution module with the m-dimensional mel-frequency cepstral coefficients of the corresponding frame to obtain the preliminarily fused frame-level features F.
Further, the step 2 specifically includes:
step 201: taking F as input and passing it through the input embedding layer to obtain E, which has three dimensions [n, u, v]: the batch size n, the number u of values per vector, and the embedding dimension v of each value;
step 202: performing position encoding on E to obtain PE, the formula being
$$PE(pos, 2t) = \sin\big(pos/10000^{2t/v}\big), \qquad PE(pos, 2t+1) = \cos\big(pos/10000^{2t/v}\big)$$
wherein pos is the position of each value in the input vector and 2t refers to the encoded dimension of the word vector of each value; E and PE are then added dimension-wise to obtain a new vector X with position-encoding information;
step 203: sending X to an M-head self-attention mechanism module to obtain MultiHead(X), whose dimensions are the same as those of X:
$$head_t = \mathrm{softmax}\Big(\frac{Q_t K_t^{\top}}{\sqrt{d_k}}\Big) V_t, \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(head_1, \dots, head_M)\,W^{O}$$
wherein Q_t = X W_t^Q, K_t = X W_t^K and V_t = X W_t^V, with W_t^Q, W_t^K, W_t^V the weight matrices of the t-th head of the M-head attention mechanism, d_k the dimension of the resulting K vector, head_t the output of the attention mechanism module of the t-th head, and MultiHead(X) the total output of the M-head attention mechanism;
step 204: residually connecting X and MultiHead(X) to obtain X′ = X + MultiHead(X), then performing batch normalization to obtain X″ = BN(X′);
step 205: passing X″ through the feed-forward computation to obtain FFN(X″) = σ(X″W_{f1} + b_{f1})W_{f2} + b_{f2}, wherein W_{f1}, b_{f1}, W_{f2}, b_{f2} are the weight and bias parameters of the feed-forward layers;
step 206: residually connecting X″ and FFN(X″) and performing batch normalization to obtain the encoder output Y = BN(X″ + FFN(X″)).
Further, the step 3 includes:
inputting the m-dimensional mel-frequency cepstral coefficient features of each frame into a multi-layer time-delay neural network, which aggregates the frame-level features into one speech-level feature while considering the context information between adjacent frames; the mean and variance of the outputs of the final time-delay-neural-network layer are computed and concatenated, and the result is passed through a further layer to obtain the speaker feature vector x.
Further, in the step 4, performing feature fusion with the voice features obtained after the time information extraction in step 2 and the speaker features obtained in step 3 as inputs comprises:
subjecting the encoder output Y in sequence to dimension conversion, adaptive average pooling, and dimension conversion to obtain G:
$$G = T\big(P(T(Y))\big)$$
wherein T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation;
concatenating G with the speaker feature vector x and feeding the result into a convolution module formed by stacking k convolution layers to obtain the high-dimensional speech-level feature G′, the process being expressed as
$$g_1 = \sigma(W_1 * [G; x] + b_1), \qquad g_l = \sigma(W_l * g_{l-1} + b_l), \quad l = 2, \dots, k$$
wherein σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is g_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, g_{l-1} is the output of the (l-1)-th convolution layer, and g_l is the output of the l-th convolution layer.
Further, in the step 4, predicting the speech-level quality score comprises:
subjecting G′ in sequence to dimension conversion and adaptive average pooling to obtain the final score s:
$$s = P\big(T(G')\big)$$
wherein T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation.
Compared with the prior art, the pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics has the following advantages: voice features and speaker features are extracted from the pathological voice and fused, and score prediction is then performed, so that the mapping between pathological voice and its corresponding subjective quality score is found, realizing objective, quantitative evaluation of pathological voice quality.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a schematic diagram of the overall process of the present invention;
FIG. 2 is a schematic diagram of an X-Vector extraction according to the present invention;
FIG. 3 is a schematic diagram of an encoder network architecture according to the present invention;
FIG. 4 is a graph comparing the fitting results of the present invention and MOSNet against subjective MOS scores;
fig. 5 is a graph of the results of the fitting of the present invention to subjective quality scores.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art in a specific case.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
As shown in Figure 1, the invention provides a pathological voice quality evaluation method based on the fusion of voice features and speaker features. Compared with previous methods, voice features and speaker features are extracted from the pathological voice and fused, so that the network can better extract effective information from the pathological voice. The invention takes pathological voice as input and outputs the corresponding quality score, i.e., the MOS score; it is end-to-end and accurate, and of great practical significance.
The method of the invention generally comprises the following steps:
Step 1, speech feature extraction and fusion. Pathological voice is input, multiple voice features are extracted, and feature fusion is performed. The invention takes the spectrogram and mel-frequency cepstral coefficients (MFCC) as the voice features to be extracted from the original pathological voice, and performs feature fusion taking a multi-layer convolution structure as an example.
Step 2, frame-level score prediction. The fused voice features obtained in step 1 serve as input for extracting temporal information and predicting frame-level scores. The invention uses the encoder as a time-series processing model to extract temporal information from the fused voice features and thereby predict the frame-level scores of pathological voice.
Step 3, speaker feature extraction. The mel-frequency cepstral coefficients (MFCC) extracted in step 1 serve as input for extracting speaker features. The invention uses the X-Vector as the speaker feature to add information representing the speaker's identity.
Step 4, speech-level score prediction. The voice features obtained after the temporal-information extraction in step 2 and the speaker features obtained in step 3 serve as inputs for feature fusion, finally yielding the speech-level quality score. The invention performs feature fusion taking a multi-layer convolution structure as an example, and predicts the speech-level quality score taking the adaptive pooling operation as an example.
In one embodiment, the voice feature extraction and fusion in step 1 of the present invention specifically includes:
1. Extraction of speech features
The following two procedures are applied to the input pathological voice to obtain two voice features: the spectrogram and the mel-frequency cepstral coefficients (MFCC).
1) Spectrogram: the pathological voice is framed, windowed with a Hamming window, and subjected to a short-time Fourier transform to obtain the magnitude spectrogram U of the voice signal.
2) Mel-frequency cepstral coefficients: p mel filters are applied to the pathological voice, yielding m MFCC coefficients per frame.
2. Fusion of speech features
The spectrogram U is input into a convolution module formed by stacking k convolution layers, finally yielding a feature vector for each frame. The process can be expressed as
$$h_1 = \sigma(W_1 * U + b_1), \qquad h_l = \sigma(W_l * h_{l-1} + b_l), \quad l = 2, \dots, k$$
where U is the input spectrogram of the pathological voice, σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is h_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, h_{l-1} is the output of the (l-1)-th convolution layer, and h_l is the output of the l-th convolution layer.
The per-frame output of the convolution module is concatenated with the m-dimensional MFCCs of the corresponding frame to obtain the preliminarily fused frame-level features F.
In one embodiment, in step 2 of the present invention, the prediction of the frame-level score specifically includes:
F is taken as input to a time-series processing module, the encoder, which extracts temporal information from F to obtain the frame-level features Y. The encoder network architecture is shown in Fig. 3. The process is detailed as follows:
1) F is first passed through the input embedding layer to obtain E, which has three dimensions [n, u, v]: the batch size n, the number u of values per vector, and the embedding dimension v of each value.
2) E is position-encoded to obtain PE:
$$PE(pos, 2t) = \sin\big(pos/10000^{2t/v}\big), \qquad PE(pos, 2t+1) = \cos\big(pos/10000^{2t/v}\big)$$
where pos is the position of each value in the input vector and 2t refers to the encoded dimension of the word vector of each value. E and PE are then added dimension-wise to obtain a new vector X with position-encoding information, as in the sketch below.
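As a concrete illustration, this standard sinusoidal encoding can be sketched as follows. PyTorch is an assumed tool here (the patent names no framework), and the frame count and embedding size are placeholder values.

```python
import math

import torch

def sinusoidal_position_encoding(n_pos: int, dim: int) -> torch.Tensor:
    # PE(pos, 2t) = sin(pos / 10000^(2t/dim)), PE(pos, 2t+1) = cos(...),
    # matching the formula in step 2).
    pe = torch.zeros(n_pos, dim)
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

embedded = torch.randn(300, 512)                       # E: (frames, dim), dummy
x = embedded + sinusoidal_position_encoding(300, 512)  # X = E + PE
```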
3) X is sent to an M-head self-attention mechanism module to obtain MultiHead(X), whose dimensions are the same as those of X:
$$head_t = \mathrm{softmax}\Big(\frac{Q_t K_t^{\top}}{\sqrt{d_k}}\Big) V_t, \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(head_1, \dots, head_M)\,W^{O}$$
where Q_t = X W_t^Q, K_t = X W_t^K and V_t = X W_t^V, with W_t^Q, W_t^K, W_t^V the weight matrices of the t-th head; d_k is the dimension of the resulting K vector, head_t is the output of the attention mechanism module of the t-th head, and MultiHead(X) is the total output of the M-head attention mechanism.
4) X and MultiHead(X) are residually connected to give X′ = X + MultiHead(X), which is then batch-normalized to give X″ = BN(X′).
5) X″ passes through the feed-forward computation: FFN(X″) = σ(X″W_{f1} + b_{f1})W_{f2} + b_{f2}.
6) X″ and FFN(X″) are residually connected and batch-normalized to give the encoder output Y = BN(X″ + FFN(X″)).
In one embodiment, in step 3 of the present invention, the extracting of the speaker characteristic includes:
The speaker embedding X-Vector is obtained from the input pathological voice as follows (see Fig. 2).
The m-dimensional MFCC features of each frame are input into a multi-layer time-delay neural network (TDNN), which aggregates the frame-level features into one speech-level feature while considering the context information between adjacent frames; the mean and variance of the outputs of the final TDNN layer are computed and concatenated, and the result is passed through a further layer to obtain the speaker feature vector x.
In one embodiment, in step 4 of the present invention, the speech-level score prediction includes:
1. Fusion of speech features and speaker features
The encoder output Y is subjected in sequence to dimension conversion, adaptive average pooling, and dimension conversion to obtain G:
$$G = T\big(P(T(Y))\big)$$
where T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation.
G is concatenated with the speaker feature vector x and fed into a convolution module formed by stacking k convolution layers to obtain the high-dimensional speech-level feature G′. The process can be expressed as
$$g_1 = \sigma(W_1 * [G; x] + b_1), \qquad g_l = \sigma(W_l * g_{l-1} + b_l), \quad l = 2, \dots, k$$
where σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is g_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, g_{l-1} is the output of the (l-1)-th convolution layer, and g_l is the output of the l-th convolution layer.
2. Prediction of speech-level scores
G′ is subjected in sequence to dimension conversion and adaptive average pooling to obtain the final score s:
$$s = P\big(T(G')\big)$$
where T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation.
The implementation of the invention is illustrated by specific examples.
1. Speech feature extraction and fusion
1. Extraction of speech features
1) Extracting the spectrogram of pathological voice: the pathological voice data is read at a sampling rate of 16 kHz, framed with a frame shift of 256 samples, and windowed with a Hamming window of length 512 samples. A 512-point short-time Fourier transform is applied to each windowed segment to obtain the per-frame magnitude spectrum.
2) Extracting the mel-frequency cepstral coefficients of pathological voice: following the MFCC extraction procedure above, 13 mel-frequency cepstral coefficients are extracted per frame using 40 mel filters. A sketch of both extraction steps follows.
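For illustration, both feature streams can be reproduced with standard audio tooling. The following is a minimal sketch assuming the librosa library (not named by the patent); the parameter values are those of this embodiment, and the file name is a placeholder.

```python
import librosa
import numpy as np

# Read pathological voice at a 16 kHz sampling rate.
y, sr = librosa.load("pathological_voice.wav", sr=16000)

# Magnitude spectrogram: 512-sample Hamming window, frame shift 256,
# 512-point short-time Fourier transform.
stft = librosa.stft(y, n_fft=512, hop_length=256, win_length=512,
                    window="hamming")
magnitude_spectrogram = np.abs(stft)  # shape: (257, n_frames)

# 13 mel-frequency cepstral coefficients per frame from 40 mel filters.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=40,
                            n_fft=512, hop_length=256,
                            window="hamming")  # shape: (13, n_frames)
```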
2. Fusion of speech features
1) The magnitude spectrogram is input into a convolution module composed of 4 convolution layers. Layer 1 has 3×3×16 parameters (3×3 kernel, 16 output channels); layer 2 has 3×3×32 parameters (3×3 kernel, 32 output channels); layer 3 has 3×3×64 parameters (3×3 kernel, 64 output channels); layer 4 has 3×3×128 parameters (3×3 kernel, 128 output channels). Each of the 4 convolution layers uses the ReLU activation function.
2) The output of the convolution module is dimension-transformed to obtain a 512-dimensional vector for each frame, which is concatenated with the 13-dimensional MFCC of the corresponding frame to obtain a 525-dimensional vector, as in the sketch below.
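A sketch of this fusion stage follows, again assuming PyTorch. The channel widths (16/32/64/128), 3×3 kernels, and ReLU activations come from the text above; the padding, stride, and the exact frequency pooling used to reach 512 dimensions per frame are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class SpectrogramFusion(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        # Four 3x3 convolution layers with 16/32/64/128 output channels.
        self.convs = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, padding=1)
            for i in range(4)
        )
        # Assumed: pool the frequency axis to 4 bins so that the
        # 128 channels flatten to the 512 dims per frame cited above.
        self.freq_pool = nn.AdaptiveAvgPool2d((4, None))

    def forward(self, spec: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq, frames); mfcc: (batch, 13, frames)
        h = spec
        for conv in self.convs:
            h = torch.relu(conv(h))
        h = self.freq_pool(h)               # (batch, 128, 4, frames)
        b, c, f, t = h.shape
        h = h.reshape(b, c * f, t)          # (batch, 512, frames)
        return torch.cat([h, mfcc], dim=1)  # (batch, 525, frames)

fused = SpectrogramFusion()(torch.randn(2, 1, 257, 300), torch.randn(2, 13, 300))
```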
2. Extraction of temporal information and prediction of frame-level scores
1. Extraction of time information
The 525-dimensional feature vector of each frame is input into the encoder, and temporal information is extracted to obtain a 512-dimensional vector for each frame. The multi-head attention mechanism in the encoder uses M = 8 heads, and 6 encoder layers are stacked.
2. Prediction of frame level scores
The encoder output, a 512-dimensional feature vector per frame, is mapped through a fully connected layer to a one-dimensional value, i.e., the frame-level MOS score. A sketch of this path follows.
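The sketch below approximates this path with PyTorch's built-in Transformer encoder (M = 8 heads, 6 layers, as above). Two assumptions to note: the patent's encoder applies batch normalization whereas nn.TransformerEncoderLayer uses layer normalization, and the positional encoding of step 202 is omitted here for brevity.

```python
import torch
import torch.nn as nn

embed = nn.Linear(525, 512)  # input embedding: 525-dim fused features -> 512
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)
frame_head = nn.Linear(512, 1)  # 512-dim frame vector -> frame-level MOS score

fused = torch.randn(2, 300, 525)               # (batch, frames, 525) dummy input
frames = encoder(embed(fused))                 # (batch, frames, 512)
frame_scores = frame_head(frames).squeeze(-1)  # (batch, frames)
```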
3. Speaker feature extraction
The 13-dimensional MFCC parameters extracted from each frame are input into a five-layer TDNN for feature extraction: the first TDNN layer takes the 13-dimensional vectors and outputs 512-dimensional vectors; the mean and variance of the outputs of the fifth TDNN layer are computed and spliced into a 1024-dimensional vector, which is mapped into the 512-dimensional X-Vector through a fully connected layer, as in the sketch below.
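A sketch of this extractor follows. The widths (13 to 512 per layer, statistics pooling to 1024, fully connected layer to 512) come from the text; the kernel sizes and dilations are assumptions in the spirit of common x-vector recipes, which the patent does not specify.

```python
import torch
import torch.nn as nn

class XVectorSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Five TDNN layers realized as dilated 1-D convolutions.
        self.tdnn = nn.Sequential(
            nn.Conv1d(13, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
        )
        self.fc = nn.Linear(1024, 512)  # 1024-dim statistics -> 512-dim X-Vector

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, 13, frames)
        h = self.tdnn(mfcc)  # (batch, 512, frames')
        # Mean and variance over time, spliced into a 1024-dim vector.
        stats = torch.cat([h.mean(dim=2), h.var(dim=2)], dim=1)
        return self.fc(stats)  # (batch, 512)

xvec = XVectorSketch()(torch.randn(2, 13, 300))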
4. Fusion of speech and speaker features and speech score prediction
1. Fusion of speech features and speaker features
Will beSequentially performing dimension transformation, self-adaptive average pooling operation, and aggregating frame-level features as speech-level features to obtain 128-dimensional feature vectors of each pathological voice>。
Will be 128-dimensionalAnd 512-dimensional X-vectors are spliced together to obtain a preliminarily fused 640-dimensional speech-level feature Vector. And then the deep-level characteristic extraction is carried out by a convolution module. The convolution module is divided into 4 layers, the parameter number of the 1 st layer is 3 multiplied by 8, wherein the size of the 1 st layer convolution kernel is 3 multiplied by 03, and the number of output channels is 8; the parameter amount of the layer 2 is 3×13×16, wherein the size of the convolution kernel of the layer 2 is 3×3, and the number of output channels is 16; the number of parameters of the 3 rd layer is 3×3×32, wherein the size of the 3 rd layer convolution kernel is 3×3, and the number of output channels is 32; the number of parameters of the 4 th layer is 3×3×64, wherein the size of the 4 th layer convolution kernel is 3×3, and the number of output channels is 64. Wherein the activation function of each of the 4 convolutional layers is a ReLU function. And then, carrying out dimension transformation processing to obtain 512-dimensional feature vectors of each pathological voice.
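The sketch below follows this stage in PyTorch. The 128-dim pooling, 640-dim concatenation, 8/16/32/64-channel convolution stack, and final 512-dim transformation come from the text; how the 640 values are arranged in two dimensions for the 3×3 kernels is an assumption, as the patent leaves it unspecified.

```python
import torch
import torch.nn as nn

convs = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
)

frames = torch.randn(2, 300, 512)                # encoder output (batch, T, 512)
xvec = torch.randn(2, 512)                       # speaker feature, dummy input

utt = frames.mean(dim=1, keepdim=True)           # (batch, 1, 512) frame aggregation
utt = nn.AdaptiveAvgPool1d(128)(utt).squeeze(1)  # (batch, 128) utterance feature
fused = torch.cat([utt, xvec], dim=1)            # (batch, 640)

feat_map = fused.view(-1, 1, 32, 20)             # assumed 32x20 layout of 640 values
deep = convs(feat_map).flatten(1)                # (batch, 64*32*20)
utt512 = nn.AdaptiveAvgPool1d(512)(deep.unsqueeze(1)).squeeze(1)  # (batch, 512)
```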
2. Prediction of speech-level scores
The 512-dimensional feature vector is mapped to a one-dimensional MOS score by adaptive average pooling, giving the speech-level score, as in the sketch below.
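A minimal sketch of this final step, with a dummy input tensor:

```python
import torch
import torch.nn as nn

utt_feature = torch.randn(2, 512)  # (batch, 512) speech-level feature
# Adaptive average pooling maps the 512-dim feature to a single MOS value.
mos = nn.AdaptiveAvgPool1d(1)(utt_feature.unsqueeze(1)).flatten()  # (batch,)
```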
5. Loss function
1. Frame-level score network loss function:
To ensure the accuracy of the frame-level scores output by the network, the frame-level score prediction network is optimized with the mean square loss between the predicted frame scores $\hat{y}_t$ and the true frame-level scores $y_t$:
$$L_{frame} = \frac{1}{T}\sum_{t=1}^{T}\big(\hat{y}_t - y_t\big)^2$$
2. Speech-level score network loss function:
To ensure the accuracy of the speech-level score output by the network, the speech-level score prediction network is optimized with the mean square loss between the predicted speech-level score $\hat{q}$ and the true speech-level score $q$:
$$L_{speech} = \big(\hat{q} - q\big)^2$$
3. Total loss function
The total loss for objective pathological voice quality evaluation combines the frame-level and speech-level mean square losses; a sketch follows.
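A sketch of the training objective follows. Because the loss formulas themselves did not survive extraction, two details are assumptions: each frame is supervised with its utterance's MOS label (the usual MOSNet-style choice), and the two terms are summed without weighting.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_frame: torch.Tensor,  # (batch, frames) predicted frame scores
               pred_utt: torch.Tensor,    # (batch,) predicted speech-level scores
               true_utt: torch.Tensor) -> torch.Tensor:  # (batch,) true MOS labels
    # Frame-level MSE: broadcast the utterance label over frames (assumed).
    frame_target = true_utt.unsqueeze(1).expand_as(pred_frame)
    l_frame = F.mse_loss(pred_frame, frame_target)
    # Speech-level MSE.
    l_speech = F.mse_loss(pred_utt, true_utt)
    return l_frame + l_speech  # assumed unweighted sum

loss = total_loss(torch.rand(2, 300), torch.rand(2), torch.rand(2))
```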
Figure 4 compares the performance of the MOSNet model and the proposed method. As Fig. 4 shows, the predicted and true scores of both methods are distributed around the identity line, i.e., the predictions correlate strongly with the ground truth, but the predictions of the present invention cluster more tightly around that line and fit better. This shows that the method based on fusing voice features and speaker features can effectively represent and extract features from pathological voice.
Fig. 5 shows how well the invention fits the true MOS labels. The predicted pathological voice quality scores are drawn as a curve, and the MOS truth data as a histogram. As Fig. 5 shows, although the truth values of the voice data are dispersed over a wide interval, the invention still makes reliable quality-score predictions: the prediction curve can be regarded as a fitting curve of the histogram of the truth data. This demonstrates that the invention, as an objective, automatic evaluation method, can fit human subjective evaluation and make reliable score predictions for pathological voice.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.
Claims (7)
1. A pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics is characterized in that: the method comprises the following steps:
step 1: inputting pathological voice, extracting two voice features of a spectrogram and a Mel frequency cepstrum coefficient, and carrying out feature fusion;
step 2: taking the fused voice characteristics as input, extracting time information and predicting frame-level scores;
step 3: taking the voice characteristics of the mel frequency cepstrum coefficient as input to extract the characteristics of a speaker;
step 4: and (3) taking the voice characteristics obtained after the time information extraction in the step (2) and the speaker characteristics obtained in the step (3) as inputs, and carrying out characteristic fusion to obtain the prediction of the speech-level quality score.
2. The pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics according to claim 1, wherein in the step 1:
extracting spectrogram voice characteristics comprises framing the pathological voice, applying a Hamming window, and performing a short-time Fourier transform to obtain the magnitude spectrogram of the voice signal;
extracting mel-frequency cepstral coefficient voice characteristics comprises applying p mel filters to the pathological voice to obtain m mel-frequency cepstral coefficients per frame.
3. The pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics according to claim 2, wherein in the step 1, feature fusion of the two voice features comprises:
inputting the obtained spectrogram into a convolution module formed by stacking k convolution layers, finally obtaining a feature vector for each frame, the process being expressed as
$$h_1 = \sigma(W_1 * U + b_1), \qquad h_l = \sigma(W_l * h_{l-1} + b_l), \quad l = 2, \dots, k$$
wherein U is the input spectrogram of the pathological voice, σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is h_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, h_{l-1} is the output of the (l-1)-th convolution layer, and h_l is the output of the l-th convolution layer; and
concatenating the per-frame output of the convolution module with the m-dimensional mel-frequency cepstral coefficients of the corresponding frame to obtain the preliminarily fused frame-level features F.
4. The pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics according to claim 3, wherein the step 2 specifically comprises:
step 201: taking F as input and passing it through the input embedding layer to obtain E, which has three dimensions [n, u, v]: the batch size n, the number u of values per vector, and the embedding dimension v of each value;
step 202: performing position encoding on E to obtain PE, the formula for position encoding being
$$PE(pos, 2t) = \sin\big(pos/10000^{2t/v}\big), \qquad PE(pos, 2t+1) = \cos\big(pos/10000^{2t/v}\big)$$
wherein pos is the position of each value in the input vector and 2t refers to the encoded dimension of the word vector of each value; E and PE are then added dimension-wise to obtain a new vector X with position-encoding information;
step 203: sending X to an M-head self-attention mechanism module to obtain MultiHead(X), whose dimensions are the same as those of X:
$$head_t = \mathrm{softmax}\Big(\frac{Q_t K_t^{\top}}{\sqrt{d_k}}\Big) V_t, \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(head_1, \dots, head_M)\,W^{O}$$
wherein Q_t = X W_t^Q, K_t = X W_t^K and V_t = X W_t^V, with W_t^Q, W_t^K, W_t^V the weight matrices of the t-th head of the M-head attention mechanism, d_k the dimension of the resulting K vector, head_t the output of the attention mechanism module of the t-th head, and MultiHead(X) the total output of the M-head attention mechanism;
step 204: residually connecting X and MultiHead(X) to obtain X′ = X + MultiHead(X), then performing batch normalization to obtain X″ = BN(X′);
step 205: passing X″ through the feed-forward computation to obtain FFN(X″) = σ(X″W_{f1} + b_{f1})W_{f2} + b_{f2};
step 206: residually connecting X″ and FFN(X″) and performing batch normalization to obtain the encoder output Y = BN(X″ + FFN(X″)).
5. The pathological voice quality evaluation method based on the fusion of voice features and speaker features according to claim 4, wherein the step 3 comprises:
inputting the m-dimensional mel-frequency cepstral coefficient features of each frame into a multi-layer time-delay neural network, which aggregates the frame-level features into one speech-level feature while considering the context information between adjacent frames; the mean and variance of the outputs of the final time-delay-neural-network layer are computed and concatenated, and the result is passed through a further layer to obtain the speaker feature vector x.
6. The pathological voice quality evaluation method based on the fusion of voice features and speaker features according to claim 5, wherein in the step 4, feature fusion performed with the voice features obtained after the time information extraction in step 2 and the speaker features obtained in step 3 as inputs comprises:
subjecting the encoder output Y in sequence to dimension conversion, adaptive average pooling, and dimension conversion to obtain G, the calculation being
$$G = T\big(P(T(Y))\big)$$
wherein T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation; and
concatenating G with the speaker feature vector x and feeding the result into a convolution module formed by stacking k convolution layers to obtain the high-dimensional speech-level feature G′, the process being expressed as
$$g_1 = \sigma(W_1 * [G; x] + b_1), \qquad g_l = \sigma(W_l * g_{l-1} + b_l), \quad l = 2, \dots, k$$
wherein σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is g_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, g_{l-1} is the output of the (l-1)-th convolution layer, and g_l is the output of the l-th convolution layer.
7. The pathological voice quality evaluation method based on the fusion of voice features and speaker features according to claim 6, wherein in the step 4, predicting the speech-level quality score comprises:
subjecting G′ in sequence to dimension conversion and adaptive average pooling to obtain the final score s, the calculation being
$$s = P\big(T(G')\big)$$
wherein T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310395720.1A CN116110437B (en) | 2023-04-14 | 2023-04-14 | Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310395720.1A CN116110437B (en) | 2023-04-14 | 2023-04-14 | Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116110437A true CN116110437A (en) | 2023-05-12 |
CN116110437B CN116110437B (en) | 2023-06-13 |
Family
ID=86260107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310395720.1A Active CN116110437B (en) | 2023-04-14 | 2023-04-14 | Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116110437B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103093759A (en) * | 2013-01-16 | 2013-05-08 | 东北大学 | Device and method of voice detection and evaluation based on mobile terminal |
CN103730130A (en) * | 2013-12-20 | 2014-04-16 | 中国科学院深圳先进技术研究院 | Detection method and system for pathological voice |
CN107068167A (en) * | 2017-03-13 | 2017-08-18 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures |
CN109727608A (en) * | 2017-10-25 | 2019-05-07 | 香港中文大学深圳研究院 | A kind of ill voice appraisal procedure based on Chinese speech |
US20210319804A1 (en) * | 2020-04-01 | 2021-10-14 | University Of Washington | Systems and methods using neural networks to identify producers of health sounds |
AU2020102516A4 (en) * | 2020-09-30 | 2020-11-19 | Du, Jiahui Mr | Health status monitoring system based on speech analysis |
CN112820279A (en) * | 2021-03-12 | 2021-05-18 | 深圳市臻络科技有限公司 | Parkinson disease detection method based on voice context dynamic characteristics |
CN114724589A (en) * | 2022-04-14 | 2022-07-08 | 标贝(北京)科技有限公司 | Voice quality inspection method and device, electronic equipment and storage medium |
CN115294970A (en) * | 2022-10-09 | 2022-11-04 | 苏州大学 | Voice conversion method, device and storage medium for pathological voice |
Non-Patent Citations (2)
Title |
---|
A. Gómez et al.: "Acoustic to kinematic projection in Parkinson's disease dysarthria", Biomedical Signal Processing and Control, pages 1-13 *
Zou Jiacheng (邹佳成): "Research and application of deep learning-based auscultation for respiratory lung diseases", China Master's Theses Full-text Database (Medicine & Health Sciences), No. 01, pages 1-65 *
Also Published As
Publication number | Publication date |
---|---|
CN116110437B (en) | 2023-06-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP4002362A1 (en) | Method and apparatus for training speech separation model, storage medium, and computer device | |
Moran et al. | Telephony-based voice pathology assessment using automated speech analysis | |
Jahangir et al. | Deep learning approaches for speech emotion recognition: state of the art and research challenges | |
Muhammad et al. | Convergence of artificial intelligence and internet of things in smart healthcare: a case study of voice pathology detection | |
CN112818892A (en) | Multi-modal depression detection method and system based on time convolution neural network | |
US20040002853A1 (en) | Method and device for speech analysis | |
Seneviratne et al. | Multi-Corpus Acoustic-to-Articulatory Speech Inversion. | |
WO2023139559A1 (en) | Multi-modal systems and methods for voice-based mental health assessment with emotion stimulation | |
CN110946554A (en) | Cough type identification method, device and system | |
CN112397074A (en) | Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning | |
Tripathi et al. | A novel approach for intelligibility assessment in dysarthric subjects | |
CN111489763A (en) | Adaptive method for speaker recognition in complex environment based on GMM model | |
US20240057936A1 (en) | Speech-analysis based automated physiological and pathological assessment | |
Avila et al. | Automatic speaker verification from affective speech using Gaussian mixture model based estimation of neutral speech characteristics | |
Dibazar et al. | A system for automatic detection of pathological speech | |
CN115910097A (en) | Audible signal identification method and system for latent fault of high-voltage circuit breaker | |
CN116110437B (en) | Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics | |
Lee et al. | Assessment of dysarthria using one-word speech recognition with hidden markov models | |
Ribeiro et al. | Towards the prediction of the vocal tract shape from the sequence of phonemes to be articulated | |
Debnath et al. | Study of speech enabled healthcare technology | |
CN116013371A (en) | Neurodegenerative disease monitoring method, system, device and storage medium | |
Xu et al. | Voiceprint recognition of Parkinson patients based on deep learning | |
Amami et al. | A robust voice pathology detection system based on the combined bilstm–cnn architecture | |
Suwannakhun et al. | Characterizing Depressive Related Speech with MFCC | |
Naikare et al. | Classification of voice disorders using i-vector analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |