CN115472157A - Traditional Chinese medicine clinical speech recognition method and model based on deep learning - Google Patents

Traditional Chinese medicine clinical speech recognition method and model based on deep learning Download PDF

Info

Publication number
CN115472157A
Authority
CN
China
Prior art keywords
chinese medicine
attention
traditional chinese
module
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211006117.1A
Other languages
Chinese (zh)
Inventor
王亚强
张�林
舒红平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN202211006117.1A priority Critical patent/CN115472157A/en
Publication of CN115472157A publication Critical patent/CN115472157A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a traditional Chinese medicine clinical speech recognition method and model based on deep learning. The deep learning model Conformer is adopted to complete the traditional Chinese medicine clinical speech recognition task, and an audio feature enhancement method is adopted to improve the model's recognition effect. A joint CTC/Attention mechanism is adopted in the training and decoding of the traditional Chinese medicine clinical speech recognition model: in the training stage the objective function jointly optimizes the CTC loss and the KL divergence loss, and in the decoding stage CTC decoding first generates the n best candidates, the Attention decoder then re-scores the candidates, and the highest-scoring result is taken as the output, so that a better recognition result is obtained in traditional Chinese medicine clinical speech recognition. The invention enters traditional Chinese medicine clinical electronic medical records through speech recognition, replacing the traditional mode in which a traditional Chinese medicine doctor enters medical records by handwriting or by keyboard, which can effectively save doctors' record-entry time and reduce their workload.

Description

Traditional Chinese medicine clinical speech recognition method and model based on deep learning
Technical Field
The invention belongs to the field of speech recognition, and relates to a speech recognition method and a speech recognition model.
Background
At present, research on traditional Chinese medicine clinical speech recognition has stopped at solving the problem with a traditional machine learning model, the Hidden Markov Model (HMM), and only targets the recognition of isolated traditional Chinese medicine words; continuous speech recognition cannot be performed. A hidden Markov model is a process in which a hidden Markov chain randomly generates an unobservable state sequence, and each state then generates an observation, thereby producing an observed random sequence. A hidden Markov model is determined by an initial state probability distribution π, a state transition probability distribution A, and an observation probability distribution B, and can be written as the triple λ = (A, B, π); A, B, and π are called the three elements of a hidden Markov model. Hidden Markov models can infer changes in the underlying states from existing data; the data are commonly referred to as the observed states, and the inferred states are called the hidden states. In speech recognition, the speech signal is the observed state and the recognized Chinese characters are the hidden states.
Hidden Markov models rely only on each state and its corresponding observation, whereas the speech recognition task depends not only on a single Chinese character but also on the length of the observed sequence and the context of the sequence, which is why hidden-Markov-based speech recognition models perform unsatisfactorily in traditional Chinese medicine clinical settings.
In recent years, research focus in the field of machine learning has gradually turned to deep learning. Compared with the traditional machine learning model, the deep learning model has more layers of nonlinear structures and is stronger in expression and modeling capacity, so that the prediction accuracy is greatly improved. In the traditional Chinese medicine clinical speech recognition task, the deep learning model has more advantages than the traditional hidden Markov model in the aspect of processing the complex signal characteristics, and the recognition accuracy is greatly improved.
Disclosure of Invention
The invention adopts a deep learning model, the Conformer, to solve the above problems, and provides a traditional Chinese medicine clinical speech recognition model and method based on deep learning.
The technical scheme of the invention is as follows:
the traditional Chinese medicine clinical speech recognition method based on deep learning is characterized by comprising the following steps:
s1, audio feature extraction: extracting Fbank characteristics of the traditional Chinese medicine audio through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering;
s2: and audio feature processing: masking in the time domain and the frequency domain of the traditional Chinese medicine audio Fbank characteristic; then, two layers of two-dimensional convolution down-sampling networks are adopted, the size of a convolution kernel is 3 multiplied by 3, the step length is 2, and after down-sampling, the number of audio characteristic frames is reduced to one fourth of the original number;
performing audio characteristic enhancement operation on Fbank characteristics of traditional Chinese medicine audio extracted through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering; the traditional Chinese medicine audio Fbank features are two-dimensional vectors which can be divided into time domains and frequency domains, and the audio feature enhancement is to mask in the time domains and the frequency domains; then, down-sampling is carried out on the Fbank features after the audio features are enhanced, namely, the frame number is reduced under the condition that the voice information is not lost, so that the calculated amount in a neural network is reduced, and the effect is generally achieved through convolution in a voice task; therefore, a two-layer two-dimensional convolution down-sampling network is adopted in the down-sampling part, the size of a convolution kernel is 3 multiplied by 3, the step length is set to be 2, and after down-sampling, the frame number is reduced to one fourth of the original frame number;
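The masking and down-sampling described above can be sketched as follows. The masking is SpecAugment-style (zero out a random band of frames and a random band of mel channels); the mask widths and the frame-count arithmetic for two stride-2 layers are illustrative assumptions, not values from the patent.

```python
import random

def mask_features(fbank, max_t=10, max_f=8, seed=0):
    """Zero a random time band (whole frames) and a random frequency band
    (mel channels) of a T x F Fbank matrix. Widths are assumed defaults."""
    rng = random.Random(seed)
    T, F = len(fbank), len(fbank[0])
    masked = [row[:] for row in fbank]          # copy; original untouched
    t0, t_w = rng.randrange(T), rng.randrange(1, max_t + 1)
    f0, f_w = rng.randrange(F), rng.randrange(1, max_f + 1)
    for t in range(t0, min(t0 + t_w, T)):       # time-domain mask
        masked[t] = [0.0] * F
    for row in masked:                          # frequency-domain mask
        for f in range(f0, min(f0 + f_w, F)):
            row[f] = 0.0
    return masked

def downsampled_frames(T, layers=2, stride=2):
    """Two conv layers with stride 2 reduce the frame count to ~1/4."""
    for _ in range(layers):
        T = (T - 1) // stride + 1   # ceil(T / stride), 'same'-style padding
    return T

fbank = [[1.0] * 80 for _ in range(100)]   # 100 frames x 80 mel channels
masked = mask_features(fbank)
print(downsampled_frames(100))   # -> 25, one quarter of the original frames
```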
S3, inputting the processed traditional Chinese medicine audio features into the encoder: the encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module;
the forward feedback module comprises two fully-connected layers, two residual layers and a nonlinear activation function ReLU, and a layer normalization is performed before the first fully-connected layer;
in the multi-head self-attention module, a self-attention mechanism can obtain the correlation among the traditional Chinese medicine audio features, so as to obtain the relationship among the traditional Chinese medicine audio sequences, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k)V  (1)
wherein Q, K, and V are obtained by linear transformation of the traditional Chinese medicine audio feature vectors, and d_k is the feature vector dimension;
the multi-head self-attention mechanism learns the context information of the traditional Chinese medicine audio frequency characteristics from different aspects, and the calculation formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O  (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)  (3)
wherein h represents the number of attention heads, W^O is the random weight matrix for the linear transformation after multi-head attention concatenation, and W_i^Q, W_i^K, and W_i^V are the weight matrices corresponding to Q, K, and V in the i-th attention head;
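Equations (1) to (3) can be illustrated with a minimal plain-Python implementation of scaled dot-product attention (a single head, no learned projections; the toy Q, K, V values are made up for illustration):

```python
import math

def matmul(A, B):
    """Plain list-of-lists matrix product A @ B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)                       # subtract max for numerical stability
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, equation (1)."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])   # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

# Toy 3-frame sequence with d_k = 2 (illustrative numbers only).
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(Q, K, V)
print(len(out), len(out[0]))   # -> 3 2: same sequence length and dimension
```

A multi-head version would run this in h parallel subspaces after projecting Q, K, V with the per-head weight matrices of equation (3), then concatenate the heads and apply W^O.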
the convolution module adopts causal convolution and comprises a point-by-point convolution, a gated linear unit, a one-dimensional depthwise convolution, layer normalization and a ReLU activation function;
s4: text feature extraction: mapping a text label corresponding to the traditional Chinese medicine audio to an index of the Chinese character in the modeling unit, namely text characteristics;
s5: text feature processing: adding position information corresponding to the text features into the text features, wherein the position information is obtained through position coding, and the formula is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))  (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))  (5)
wherein pos represents the position index of the character in the current text feature vector, i represents the index within the text feature vector, and d_model represents the encoding dimension, set to 256;
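Equations (4) and (5) can be sketched directly; the interleaved sin/cos layout below is the standard Transformer positional encoding, with d_model = 256 as stated above:

```python
import math

D_MODEL = 256  # encoding dimension, as set in the patent

def positional_encoding(pos, d_model=D_MODEL):
    """Equations (4) and (5): sin on even indices, cos on odd indices."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))   # PE(pos, 2i)
        pe.append(math.cos(angle))   # PE(pos, 2i + 1)
    return pe

pe0 = positional_encoding(0)
print(pe0[0], pe0[1])   # position 0 -> sin(0) = 0.0, cos(0) = 1.0
```

The resulting 256-dimensional vector is added element-wise to the text feature vector at position pos.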
s6, decoding: the decoder comprises an implicit multi-head self-attention module, a multi-head attention module and a forward feedback module, wherein the implicit multi-head self-attention module is used for calculating text context information corresponding to input traditional Chinese medicine audios;
inputting the text features processed in the step S5 into an implicit multi-head self-Attention module in a decoder, wherein the operation of the implicit multi-head self-Attention module is the same as that of the multi-head self-Attention module in the encoder, and finally obtaining an Attention numerical value of the text; then entering a multi-head attention module, wherein the multi-head attention is calculated in the same way as self attention, but input Q, K and V are different, Q is from a Chinese medicine text sequence, K and V are from a voice characteristic sequence output by an encoder, and the structure of the forward feedback module is consistent with that of a forward feedback module in the encoder;
s7, model training and decoding are carried out by adopting a combined CTC/Attention mechanism, the Attention mechanism is in non-monotonic alignment with the traditional Chinese medicine audio features and the traditional Chinese medicine text labels in association with context, and the CTC forces the input traditional Chinese medicine audio features and the traditional Chinese medicine text labels to be in monotonic alignment through a dynamic programming algorithm, so that the problem of insufficient Attention mechanism alignment is solved, and the advantages of the two can be effectively utilized by using a mixed CTC/Attention structure to eliminate irregular alignment; in the training stage, the objective function jointly optimizes CTC loss and KL divergence loss; in the decoding stage, firstly, n best candidates are generated by CTC decoding, and then the candidates are re-scored by an Attention decoder, and the result with the highest score is used as output.
Further, in step S7, the CTC loss is obtained by applying one forward linear computation to the encoder output, normalizing with Softmax, and then computing according to the CTC loss formula; the KL divergence loss is obtained by applying a Softmax operation to the decoder output and then computing with the KL divergence loss formula; finally the CTC loss and the KL divergence loss are summed with weights to obtain the joint loss, with the formula:
Loss = λ·L_CTC(x, y) + (1 − λ)·L_KL(x, y)  (6)
wherein Loss is the joint loss, L_CTC is the CTC loss, L_KL is the KL divergence loss, x represents the input traditional Chinese medicine audio features, and y represents the text label corresponding to the traditional Chinese medicine audio; λ is a hyper-parameter that balances the importance of the CTC loss and the KL divergence loss, and is set to 0.3.
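Equation (6) is a simple weighted sum and can be written directly; the loss values below are placeholders standing in for the CTC and KL divergence losses computed by the two branches.

```python
def joint_loss(ctc_loss, kl_loss, lam=0.3):
    """Equation (6): Loss = lambda * L_CTC + (1 - lambda) * L_KL, lambda = 0.3."""
    return lam * ctc_loss + (1 - lam) * kl_loss

print(joint_loss(2.0, 1.0))   # 0.3 * 2.0 + 0.7 * 1.0, approximately 1.3
```

With λ = 0.3 the attention-branch (KL) term dominates the gradient, while the CTC term keeps the alignment monotonic.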
Further, the number of mel filters in the audio feature extraction process in step S1 is set to 80.
The invention also provides a traditional Chinese medicine clinical speech recognition model based on deep learning, comprising: an audio feature extraction module, an audio feature processing module, a text feature extraction module, a text feature processing module, an encoder, a decoder, and a model training and decoding module. The audio feature extraction module extracts the Fbank features of the audio through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering; the audio feature processing module masks the time domain and frequency domain of the traditional Chinese medicine audio Fbank features, then applies a two-layer two-dimensional convolution down-sampling network with a 3 × 3 convolution kernel and a stride of 2, after which the frame number is reduced to one quarter of the original. The text feature extraction module maps the traditional Chinese medicine text into feature vectors; the text feature processing module obtains the position information of the text features through position encoding and adds the corresponding position information to the traditional Chinese medicine text features, with the calculation formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))  (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))  (5)
wherein pos represents the position index of the character in the current text feature vector, i represents the index within the text feature vector, and d_model represents the encoding dimension, set to 256;
The encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module; the forward feedback module comprises two fully-connected layers, two residual layers and a nonlinear activation function ReLU, and a layer normalization is performed before the first fully-connected layer; in the multi-head self-attention module, the self-attention mechanism obtains the correlation among the traditional Chinese medicine audio features, and thereby the relationship among the traditional Chinese medicine audio sequences, with the calculation formula:
Attention(Q, K, V) = softmax(QK^T / √d_k)V  (1)
wherein Q, K, and V are obtained by linear transformation of the traditional Chinese medicine audio feature vectors, and d_k is the feature vector dimension;
the multi-head self-attention mechanism learns the context information of the traditional Chinese medicine audio frequency characteristics from different aspects, and the calculation formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O  (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)  (3)
wherein h represents the number of attention heads, W^O is the random weight matrix for the linear transformation after multi-head attention concatenation, and W_i^Q, W_i^K, and W_i^V are the weight matrices corresponding to Q, K, and V in the i-th attention head;
the convolution module adopts causal convolution and comprises a point-by-point convolution, a gated linear unit, a one-dimensional depthwise convolution, layer normalization and a ReLU activation function.
The decoder comprises an implicit multi-head self-attention module, a multi-head attention module and a forward feedback module, wherein the implicit multi-head self-attention module is used for calculating text context information corresponding to input traditional Chinese medicine audios;
the text features processed by the text feature extraction module and the text feature processing module are input into an implicit multi-head self-Attention module in a decoder, the operation of the implicit multi-head self-Attention module is the same as that of the multi-head self-Attention module in the encoder, and finally an Attention numerical value of the text is obtained; and then entering a multi-head attention module, wherein the multi-head attention is calculated in the same way as self attention, but input Q, K and V are different, Q is from a Chinese medicine text sequence, K and V are from a Chinese medicine audio characteristic sequence output by an encoder, and the structure of the forward feedback module is consistent with that of a forward feedback module in the encoder.
Furthermore, the model training and decoding module adopts a joint CTC/Attention mechanism, and the objective function jointly optimizes the CTC loss and the KL divergence loss. The Attention mechanism uses context to align the traditional Chinese medicine audio features with the traditional Chinese medicine text labels non-monotonically, while CTC forces the input audio features and text labels into monotonic alignment through a dynamic programming algorithm, which remedies the Attention mechanism's insufficient alignment; a hybrid CTC/Attention structure can therefore effectively exploit the advantages of both and eliminate irregular alignments. In the training stage, the objective function jointly optimizes the CTC loss and the KL divergence loss; in the decoding stage, CTC decoding first generates the n best candidates, the Attention decoder then re-scores them, and the highest-scoring result is taken as the output. The CTC loss is obtained by applying one forward linear computation to the encoder output, normalizing with Softmax, and computing according to the CTC loss formula; the KL divergence loss is obtained by applying a Softmax operation to the decoder output and then computing with the KL divergence loss formula; finally the CTC loss and the KL divergence loss are summed with weights to obtain the joint loss, with the formula:
Loss = λ·L_CTC(x, y) + (1 − λ)·L_KL(x, y)  (6)
wherein Loss is the joint loss, L_CTC is the CTC loss, L_KL is the KL divergence loss, x represents the input traditional Chinese medicine audio features, and y represents the text label corresponding to the traditional Chinese medicine audio; λ is a hyper-parameter that balances the importance of the CTC loss and the KL divergence loss, and is set to 0.3;
in the decoding stage, firstly, n best candidates are generated by CTC decoding, and then the candidates are re-scored by an Attention decoder, and the result with the highest score is used as output.
Further, the number of mel filters in the audio feature extraction module is set to 80.
In conclusion, the beneficial effects of the invention are as follows:
1. In real traditional Chinese medicine clinical speech recognition scenarios there is usually significant background noise, which causes many errors during recognition. The invention therefore enhances the Mel filter-bank features extracted from the input audio: masking along the frequency domain and masking along the time domain are equivalent to artificially adding some noise, which avoids over-fitting during model training and improves the accuracy of traditional Chinese medicine clinical speech recognition.
2. CNNs are good at extracting local features of the audio, while the Transformer is good at capturing content-based global interactions; the Conformer combines the advantages of both, retaining local features and global representations to the greatest extent, so the traditional Chinese medicine clinical speech recognition effect is better.
3. A joint CTC/Attention mechanism is adopted: CTC and Attention share one encoder, and the objective function jointly optimizes the CTC loss and the Attention (KL divergence) loss, which effectively accelerates training convergence and also yields better recognition results in traditional Chinese medicine clinical speech recognition during decoding.
Drawings
FIG. 1 is a flow chart of Chinese medicine clinical speech recognition model training and decoding based on deep learning;
FIG. 2 is a diagram of a Chinese medicine clinical speech recognition model framework based on deep learning;
FIG. 3 is an identification example of one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
In the description of the embodiments of the present invention, it should be noted that indications of orientation or positional relationship are based on the orientation or positional relationship shown in the drawings, the orientation or positional relationship usually adopted when the product of the invention is used, or the orientation or positional relationship usually understood by those skilled in the art; they are only for the convenience of describing the invention and simplifying the description, and do not indicate or imply that the indicated device or element must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the invention. Furthermore, the terms "first" and "second" are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
In the description of the embodiments of the present invention, it should further be noted that, unless otherwise explicitly specified or limited, the terms "disposed" and "connected" are to be construed broadly, e.g., as fixedly connected, detachably connected, or integrally connected; directly connected, or indirectly connected through an intermediary. The specific meanings of the above terms in the present invention can be understood by those skilled in the art in specific cases.
The first embodiment is as follows:
The framework of the traditional Chinese medicine clinical speech recognition model based on deep learning in this embodiment is shown in fig. 2. The model includes: an audio feature extraction module, an audio feature processing module, a text feature extraction module, a text feature processing module, an encoder, a decoder, and a model training and decoding module. The audio feature extraction module extracts the Fbank features of the traditional Chinese medicine audio, and the audio feature processing module performs audio feature enhancement and down-sampling on the Fbank features; the text feature extraction and processing modules map the traditional Chinese medicine text into indices of the modeling units and add text position information.
The encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module; wherein the feedforward module comprises two fully-connected layers, two residual layers and a non-linear activation function ReLU, and a layer normalization is performed before the first fully-connected layer.
In the multi-head self-attention module, the self-attention mechanism can obtain the correlation among the traditional Chinese medicine audio features, so as to obtain the relationship among the traditional Chinese medicine audio sequences, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k)V  (1)
wherein Q, K, and V are obtained by linear transformation of the traditional Chinese medicine audio feature vectors, and d_k is the feature vector dimension.
The multi-head self-attention mechanism learns the context information of the traditional Chinese medicine audio features from different aspects, compared with a single attention mechanism, the multi-head self-attention mechanism can learn more information, and the calculation formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O  (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)  (3)
wherein h represents the number of attention heads, W^O is the random weight matrix for the linear transformation after multi-head attention concatenation, and W_i^Q, W_i^K, and W_i^V are the weight matrices corresponding to Q, K, and V in the i-th attention head.
Speech recognition is a sequential problem in which only the input at time t and before may be considered, so causal convolution is used in the convolution module, which includes a point-by-point convolution, a gated linear unit, a one-dimensional depthwise convolution, layer normalization, and a ReLU activation function.
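The causality constraint can be illustrated with a minimal one-dimensional example: by left-padding the input with zeros, the output at time t is computed only from the current and earlier samples. The kernel values are illustrative, not from the patent.

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output at time t depends only on x[t]
    and earlier samples, achieved by left-padding with zeros."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)   # left padding keeps causality
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0]
print(causal_conv1d(x, [0.5, 0.5]))   # -> [0.5, 1.5, 2.5, 3.5]
```

Each output is the average of the current and previous sample; no future sample is ever read, which is exactly what the convolution module's causal design guarantees.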
The decoder comprises an implicit multi-head self-attention module, a multi-head attention module and a forward feedback module, wherein the implicit multi-head self-attention module is used for calculating text context information corresponding to input traditional Chinese medicine audios. The method for carrying out Chinese medicine clinical speech recognition by using the model comprises the following steps:
firstly, a section of traditional Chinese medicine clinical voice information is selected, fbank characteristics of traditional Chinese medicine audio are extracted through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering (the number of filters is set to be 80) in an audio characteristic extraction module, and then audio characteristic enhancement operation and down-sampling operation are carried out on the Fbank characteristics by an audio characteristic processing module.
Firstly, extracting Fbank characteristics of traditional Chinese medicine audio through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering (set as 80), and then performing audio characteristic enhancement operation on the Fbank characteristics, wherein the Fbank characteristics of the traditional Chinese medicine audio are two-dimensional vectors which can be divided into a time domain and a frequency domain, and the audio characteristic enhancement is to perform masking in the time domain and the frequency domain.
Then the traditional Chinese medicine audio Fbank features enhanced in the previous step are down-sampled, i.e. the number of frames is reduced without losing speech information, so as to reduce the amount of computation in the neural network; in speech tasks this is generally achieved by convolution. The down-sampling part therefore adopts a two-layer two-dimensional convolution down-sampling network with a 3 × 3 convolution kernel and a stride of 2. After down-sampling, the frame number is reduced to one quarter of the original. Meanwhile, the text feature extraction and processing modules map the traditional Chinese medicine text into indices of the modeling units and add text position information. The position information is obtained through position encoding, with the formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))  (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))  (5)
wherein pos represents the position index of the character in the current text feature vector, i represents the index within the text feature vector, and d_model represents the encoding dimension, set to 256;
the encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module; wherein the feedforward module comprises two fully-connected layers, two residual layers and a nonlinear activation function ReLU, and a layer normalization is performed before the first fully-connected layer.
Then in a decoding structure, inputting the Chinese medicine text vector into an implicit multi-head self-Attention module in a decoder, wherein the operation of the implicit multi-head self-Attention module is the same as that of the multi-head self-Attention module in an encoder, finally obtaining an Attention value of a text, then entering the multi-head Attention module, the calculation mode of the multi-head Attention is the same as that of the self-Attention, but the input Q, K and V are different, wherein the Q is from a Chinese medicine text sequence, the K and V are from a Chinese medicine audio characteristic sequence output by the encoder, and finally a forward feedback module is consistent with a forward feedback module structure in the encoder.
A joint CTC/Attention mechanism is adopted during model training, and the objective function jointly optimizes the CTC loss and the KL divergence loss. The Attention mechanism uses context to align the traditional Chinese medicine audio features with the traditional Chinese medicine text labels non-monotonically, while CTC forces the input audio features and text labels into monotonic alignment through a dynamic programming algorithm, which remedies the Attention mechanism's insufficient alignment; the hybrid CTC/Attention structure can effectively exploit the advantages of both and eliminate irregular alignments. The CTC loss is obtained by applying one forward linear computation to the encoder output, normalizing with Softmax, and computing according to the CTC loss formula; the KL divergence loss is obtained by applying a Softmax operation to the decoder output and then computing with the KL divergence loss formula; finally the CTC loss and the KL divergence loss are summed with weights to obtain the joint loss, with the formula:
Loss = λL_CTC(x, y) + (1 - λ)L_KL(x, y)    (6)
wherein Loss is the joint loss, L_CTC is the CTC loss, L_KL is the KL divergence loss, x represents the input traditional Chinese medicine audio features, and y represents the text label corresponding to the traditional Chinese medicine audio; λ is a hyper-parameter that balances the importance of the CTC loss and the KL divergence loss, and is set to 0.3.
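For illustration, the λ-weighted combination in formula (6) can be sketched in numpy; the logits, label-smoothed targets, stand-in CTC value, and helper names below are hypothetical and only demonstrate the weighted sum with λ = 0.3.

```python
import numpy as np

def kl_divergence_loss(log_probs, target_probs):
    # KL(target || model), summed over classes, averaged over positions
    return float(np.mean(np.sum(
        target_probs * (np.log(target_probs + 1e-12) - log_probs), axis=-1)))

def joint_loss(ctc_loss, kl_loss, lam=0.3):
    # Loss = λ·L_CTC + (1 - λ)·L_KL, with λ = 0.3 as in formula (6)
    return lam * ctc_loss + (1 - lam) * kl_loss

# Hypothetical decoder output over a 4-symbol vocabulary at 3 positions
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 2.0, 0.1, 0.1],
                   [0.1, 0.1, 2.0, 0.1]])
log_probs = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
targets = np.eye(4)[[0, 1, 2]] * 0.9 + 0.025   # label-smoothed one-hot rows

kl = kl_divergence_loss(log_probs, targets)
ctc = 1.7  # stand-in scalar for the CTC loss from the encoder branch
print(joint_loss(ctc, kl, lam=0.3))
```

In a real system the CTC term would come from a proper CTC forward computation over the encoder outputs; here it is a placeholder scalar.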
In the decoding stage, the n best candidates are first generated by CTC decoding; these candidates are then re-scored by the Attention decoder, and the result with the highest score is used as the output. The model training and decoding flow chart is shown in fig. 1.
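The two-pass decoding described above, in which CTC proposes an n-best list and the Attention decoder re-scores it, can be sketched as follows; the candidate list, interpolation weight, and toy scoring function are invented for illustration only.

```python
def rescore_nbest(ctc_candidates, attention_scorer, ctc_weight=0.5):
    # ctc_candidates: list of (hypothesis, ctc_log_prob) from CTC beam search.
    # attention_scorer: callable returning the Attention decoder's
    # log-probability for a hypothesis. The final score interpolates both,
    # and the best-scoring hypothesis is returned.
    best_hyp, best_score = None, float("-inf")
    for hyp, ctc_lp in ctc_candidates:
        score = ctc_weight * ctc_lp + (1 - ctc_weight) * attention_scorer(hyp)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp

# Hypothetical 3-best list and a toy scorer that prefers longer hypotheses
nbest = [("舌红苔薄", -4.0), ("舌红苔白", -4.2), ("舌红", -3.9)]
toy_scorer = lambda hyp: -5.0 + 0.5 * len(hyp)
print(rescore_nbest(nbest, toy_scorer))  # prints 舌红苔薄
```

The interpolation weight between the CTC and Attention scores is itself a tunable decoding hyper-parameter; 0.5 here is an arbitrary choice.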
Next, the recognition effect of the speech recognition model and method of the present invention is verified.
The evaluation index for Chinese speech recognition is the Character Error Rate (CER), so the evaluation index for traditional Chinese medicine clinical speech recognition in the present invention is also the CER. The calculation formula is as follows, where N represents the total number of characters, S the number of substitutions, D the number of deletions, and I the number of insertions:
CER = (S + D + I) / N × 100%
In order to make the recognized character sequence consistent with the standard character sequence, certain characters need to be substituted, deleted, or inserted; the total number of substituted, deleted, and inserted characters, divided by the total number of characters in the standard sequence and expressed as a percentage, is the CER.
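A minimal Python sketch of this CER computation, counting substitutions, deletions, and insertions via Levenshtein distance over the reference character sequence; the example strings are hypothetical.

```python
def cer(reference, hypothesis):
    # Character error rate: (S + D + I) / N, where N is the length of the
    # reference, computed with a one-row dynamic-programming edit distance.
    ref, hyp = list(reference), list(hypothesis)
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,         # deletion
                      dp[j - 1] + 1,     # insertion
                      prev + (r != h))   # substitution (0 if characters match)
            prev, dp[j] = dp[j], cur
    return dp[-1] / len(ref)

print(cer("患者舌红苔薄", "患者舌红苔白"))  # one substitution in six characters
```

Multiplying the returned ratio by 100 gives the percentage form used in the formula above.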
The present invention uses traditional Chinese medicine clinical medical record texts and the audio transcribed from those texts as experimental data. Experiments are conducted with three different deep learning models, namely a Convolutional Neural Network (CNN), a Transformer, and a Conformer, with CER as the evaluation index. The experimental results are shown in Table 1, and the recognition effects are shown in fig. 3.
The experimental results show that the Conformer model performs best on traditional Chinese medicine clinical speech recognition.
TABLE 1 results of the experiment
Model CER
CNN 19.1%
Transformer 5.50%
Conformer 3.79%
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (6)

1. The traditional Chinese medicine clinical speech recognition method based on deep learning is characterized by comprising the following steps:
s1, audio feature extraction: extracting Fbank characteristics of the traditional Chinese medicine audio through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering;
S2, audio feature processing: masking the time domain and frequency domain of the traditional Chinese medicine audio Fbank features, and then applying a two-layer two-dimensional convolution down-sampling network with convolution kernel size 3×3 and stride 2; after down-sampling, the number of audio feature frames is reduced to one quarter of the original;
S3, inputting the processed traditional Chinese medicine audio features into the encoder: the encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module;
the forward feedback module comprises two fully-connected layers, two residual layers and a nonlinear activation function ReLU, and layer normalization is performed once before the first fully-connected layer;
in the multi-head self-attention module, a self-attention mechanism can obtain the correlation among the traditional Chinese medicine audio features, so as to obtain the relationship among the traditional Chinese medicine audio sequences, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k)V    (1)
wherein Q, K and V are the traditional Chinese medicine audio feature vectors obtained by linear transformation, and d_k is the feature vector dimension;
the multi-head self-attention mechanism learns the context information of the traditional Chinese medicine audio features from different aspects, and the calculation formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O    (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (3)
wherein h represents the number of attention heads, W^O is the random weight matrix of the linear transformation applied after the multi-head attention outputs are concatenated, and W_i^Q, W_i^K, W_i^V are the weight matrices corresponding to Q, K and V in the i-th attention head;
the convolution module adopts causal convolution, and comprises a pointwise convolution, a gated linear unit, a one-dimensional depthwise convolution, layer normalization and the activation function ReLU;
S4, text feature extraction: mapping the text label corresponding to the traditional Chinese medicine audio to the indices of its Chinese characters in the modeling unit, i.e., the text features;
s5: text feature processing: adding position information corresponding to the text features into the text features, wherein the position information is obtained through position coding, and the formula is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))    (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (5)
wherein pos represents the position index of the character in the current text feature vector, i represents the dimension index of the text feature vector, and d_model represents the encoding dimension, set to 256;
S6, decoding: the decoder comprises an implicit multi-head self-attention module, a multi-head attention module and a forward feedback module, wherein the implicit multi-head self-attention module is used to calculate the context information of the text corresponding to the input traditional Chinese medicine audio;
inputting the text features processed in step S5 into the implicit multi-head self-Attention module of the decoder, wherein this module operates in the same way as the multi-head self-Attention module in the encoder and finally yields an Attention value for the text; the output then enters the multi-head Attention module, whose calculation is the same as that of self-Attention except that the inputs Q, K and V differ: Q comes from the traditional Chinese medicine text sequence, while K and V come from the traditional Chinese medicine audio feature sequence output by the encoder; the forward feedback module is identical in structure to the forward feedback module in the encoder;
S7, model training and decoding are performed with a joint CTC/Attention mechanism: the Attention mechanism uses context to perform non-monotonic alignment between the traditional Chinese medicine audio features and the traditional Chinese medicine text labels, while CTC forces a monotonic alignment between the input traditional Chinese medicine audio features and the traditional Chinese medicine text labels through a dynamic programming algorithm, remedying the insufficient alignment of the Attention mechanism; the hybrid CTC/Attention structure can effectively exploit the advantages of both to eliminate irregular alignments; in the training stage, the objective function jointly optimizes the CTC loss and the KL divergence loss; in the decoding stage, the n best candidates are generated by CTC decoding, re-scored by the Attention decoder, and the result with the highest score is taken as the output.
2. The deep learning-based traditional Chinese medicine clinical speech recognition method of claim 1, wherein the CTC loss in step S7 is obtained by applying one forward linear transformation to the encoder output, normalizing with Softmax, and evaluating the CTC loss formula; the KL divergence loss is obtained by applying a Softmax operation to the decoder output and evaluating the KL divergence loss formula; finally, the two losses are weighted and summed to obtain the joint loss, whose formula is as follows:
Loss = λL_CTC(x, y) + (1 - λ)L_KL(x, y)    (6)
wherein Loss is the joint loss, L_CTC is the CTC loss, L_KL is the KL divergence loss, x represents the input traditional Chinese medicine audio features, and y represents the text label corresponding to the traditional Chinese medicine audio; λ is a hyper-parameter that balances the importance of the CTC loss and the KL divergence loss, and is set to 0.3.
3. The deep learning-based traditional Chinese medicine clinical speech recognition method of claim 1, wherein the number of mel filters in the audio feature extraction process in step S1 is set to 80.
4. The traditional Chinese medicine clinical speech recognition model based on deep learning is characterized by comprising: an audio feature extraction module, an audio feature processing module, a text feature extraction module, a text feature processing module, an encoder, a decoder, and a model training and decoding module; the audio feature extraction module extracts the Fbank features of the audio through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering; the audio feature processing module masks the time domain and frequency domain of the traditional Chinese medicine audio Fbank features, and then applies a two-layer two-dimensional convolution down-sampling network with convolution kernel size 3×3 and stride 2; after down-sampling, the number of frames is reduced to one quarter of the original; the text feature extraction module maps the traditional Chinese medicine text into feature vectors; the text feature processing module obtains the position information of the text features through position coding and adds the corresponding position information to the traditional Chinese medicine text features, with the following calculation formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))    (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (5)
wherein pos represents the position index of the character in the current text feature vector, i represents the dimension index of the text feature vector, and d_model represents the encoding dimension, set to 256;
the encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module;
the forward feedback module comprises two fully-connected layers, two residual layers and a nonlinear activation function ReLU, and layer normalization is performed once before the first fully-connected layer;
in the multi-head self-attention module, a self-attention mechanism can obtain the correlation among the traditional Chinese medicine audio features, so as to obtain the relationship among the traditional Chinese medicine audio sequences, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k)V    (1)
wherein Q, K and V are the traditional Chinese medicine audio feature vectors obtained by linear transformation, and d_k is the feature vector dimension;
the multi-head self-attention mechanism learns the context information of the traditional Chinese medicine audio features from different aspects, and the calculation formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O    (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (3)
wherein h represents the number of attention heads, W^O is the random weight matrix of the linear transformation applied after the multi-head attention outputs are concatenated, and W_i^Q, W_i^K, W_i^V are the weight matrices corresponding to Q, K and V in the i-th attention head;
the convolution module adopts causal convolution, and comprises a pointwise convolution, a gated linear unit, a one-dimensional depthwise convolution, layer normalization and the activation function ReLU;
the decoder comprises an implicit multi-head self-attention module, a multi-head attention module and a forward feedback module, wherein the implicit multi-head self-attention module is used to calculate the context information of the input traditional Chinese medicine text;
the text features processed by the text feature extraction module and the text feature processing module are input into the implicit multi-head self-Attention module of the decoder; this module operates in the same way as the multi-head self-Attention module in the encoder and finally yields an Attention value for the text; the output then enters the multi-head Attention module, whose calculation is the same as that of self-Attention except that the inputs Q, K and V differ: Q comes from the traditional Chinese medicine text sequence, while K and V come from the traditional Chinese medicine audio feature sequence output by the encoder; the forward feedback module is identical in structure to the forward feedback module in the encoder.
5. The deep learning-based traditional Chinese medicine clinical speech recognition model according to claim 4, wherein the model training and decoding module adopts a joint CTC/Attention mechanism, and the objective function jointly optimizes the CTC loss and the KL divergence loss; the Attention mechanism uses context to perform non-monotonic alignment between the traditional Chinese medicine audio features and the traditional Chinese medicine text labels, while CTC forces a monotonic alignment between the input traditional Chinese medicine audio features and the traditional Chinese medicine text labels through a dynamic programming algorithm, remedying the insufficient alignment of the Attention mechanism; the hybrid CTC/Attention structure can effectively exploit the advantages of both to eliminate irregular alignments; in the decoding stage, the n best candidates are first generated by CTC decoding, then re-scored by the Attention decoder, and the result with the highest score is used as the output; the CTC loss is obtained by applying one forward linear transformation to the encoder output, normalizing with Softmax, and evaluating the CTC loss formula; the KL divergence loss is obtained by applying a Softmax operation to the decoder output and evaluating the KL divergence loss formula; finally, the two losses are weighted and summed to obtain the joint loss, whose formula is as follows:
Loss = λL_CTC(x, y) + (1 - λ)L_KL(x, y)    (6)
wherein Loss is the joint loss, L_CTC is the CTC loss, L_KL is the KL divergence loss, x represents the input traditional Chinese medicine audio features, and y represents the text label corresponding to the traditional Chinese medicine audio; λ is a hyper-parameter that balances the importance of the CTC loss and the KL divergence loss, and is set to 0.3;
in the decoding stage, firstly, n best candidates are generated by CTC decoding, and then the candidates are re-scored by an Attention decoder, and the result with the highest score is used as output.
6. The deep learning based clinical speech recognition model of TCM of claim 4, wherein the number of Mel filters in the audio feature extraction module is set to 80.
CN202211006117.1A 2022-08-22 2022-08-22 Traditional Chinese medicine clinical speech recognition method and model based on deep learning Pending CN115472157A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211006117.1A CN115472157A (en) 2022-08-22 2022-08-22 Traditional Chinese medicine clinical speech recognition method and model based on deep learning


Publications (1)

Publication Number Publication Date
CN115472157A true CN115472157A (en) 2022-12-13

Family

ID=84366070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211006117.1A Pending CN115472157A (en) 2022-08-22 2022-08-22 Traditional Chinese medicine clinical speech recognition method and model based on deep learning

Country Status (1)

Country Link
CN (1) CN115472157A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117936080A (en) * 2024-03-22 2024-04-26 中国人民解放军总医院 Solid malignant tumor clinical auxiliary decision-making method and system based on federal large model
CN117936080B (en) * 2024-03-22 2024-06-04 中国人民解放军总医院 Solid malignant tumor clinical auxiliary decision-making method and system based on federal large model
CN118072901A (en) * 2024-04-18 2024-05-24 中国人民解放军海军青岛特勤疗养中心 Outpatient electronic medical record generation method and system based on voice recognition


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination