CN115472157A - Traditional Chinese medicine clinical speech recognition method and model based on deep learning - Google Patents

Traditional Chinese medicine clinical speech recognition method and model based on deep learning Download PDF

Info

Publication number
CN115472157A
Authority
CN
China
Prior art keywords
chinese medicine
attention
traditional chinese
module
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211006117.1A
Other languages
Chinese (zh)
Inventor
王亚强
张�林
舒红平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN202211006117.1A priority Critical patent/CN115472157A/en
Publication of CN115472157A publication Critical patent/CN115472157A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a traditional Chinese medicine clinical speech recognition method and model based on deep learning. The deep learning model Conformer is adopted to complete the traditional Chinese medicine clinical speech recognition task, and an audio feature enhancement method is adopted to improve the model's recognition effect. A joint CTC/Attention mechanism is adopted in the training and decoding of the traditional Chinese medicine clinical speech recognition model: in the training stage the objective function jointly optimizes the CTC loss and the KL divergence loss, and in the decoding stage CTC decoding first generates the n best candidates, the Attention decoder then re-scores the candidates, and the highest-scoring result is taken as the output, so that a better recognition result is obtained in traditional Chinese medicine clinical speech recognition. The invention enters traditional Chinese medicine clinical electronic medical records through speech recognition, replacing the traditional mode in which a traditional Chinese medicine doctor enters medical records by handwriting or by keyboard, which can effectively save doctors' record-entry time and reduce their workload.

Description

Traditional Chinese medicine clinical speech recognition method and model based on deep learning
Technical Field
The invention belongs to the field of speech recognition, and relates to a speech recognition method and a speech recognition model.
Background
At present, research on traditional Chinese medicine clinical speech recognition has stopped at solving the problem with a traditional machine learning model, the Hidden Markov Model (HMM), and only targets the recognition of isolated traditional Chinese medicine words; continuous speech recognition cannot be performed. A hidden Markov model is a process in which a hidden Markov chain randomly generates an unobservable state sequence, and each state then generates an observation, thereby producing an observed random sequence. A hidden Markov model is determined by an initial state probability distribution π, a state transition probability distribution A, and an observation probability distribution B, and can be written as the triple λ = (A, B, π); A, B, and π are called the three elements of a hidden Markov model. Hidden Markov models can infer changes in the underlying states from existing data; the data are commonly referred to as the observed states, and the inferred states are called the hidden states. In speech recognition, the speech signal is the observed state and the recognized Chinese characters are the hidden states.
Hidden Markov models rely only on each state and its corresponding observation, whereas the speech recognition task depends not only on a single Chinese character but also on the length of the observed sequence and the context of the sequence, which is why hidden-Markov-based speech recognition models perform unsatisfactorily in traditional Chinese medicine clinical settings.
In recent years, research focus in the field of machine learning has gradually turned to deep learning. Compared with the traditional machine learning model, the deep learning model has more layers of nonlinear structures and is stronger in expression and modeling capacity, so that the prediction accuracy is greatly improved. In the traditional Chinese medicine clinical speech recognition task, the deep learning model has more advantages than the traditional hidden Markov model in the aspect of processing the complex signal characteristics, and the recognition accuracy is greatly improved.
Disclosure of Invention
The invention adopts a deep learning model, the Conformer, to solve the above problems, and provides a traditional Chinese medicine clinical speech recognition model and method based on deep learning.
The technical scheme of the invention is as follows:
the traditional Chinese medicine clinical speech recognition method based on deep learning is characterized by comprising the following steps:
s1, audio feature extraction: extracting Fbank characteristics of the traditional Chinese medicine audio through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering;
s2: and audio feature processing: masking in the time domain and the frequency domain of the traditional Chinese medicine audio Fbank characteristic; then, two layers of two-dimensional convolution down-sampling networks are adopted, the size of a convolution kernel is 3 multiplied by 3, the step length is 2, and after down-sampling, the number of audio characteristic frames is reduced to one fourth of the original number;
performing audio characteristic enhancement operation on Fbank characteristics of traditional Chinese medicine audio extracted through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering; the traditional Chinese medicine audio Fbank features are two-dimensional vectors which can be divided into time domains and frequency domains, and the audio feature enhancement is to mask in the time domains and the frequency domains; then, down-sampling is carried out on the Fbank features after the audio features are enhanced, namely, the frame number is reduced under the condition that the voice information is not lost, so that the calculated amount in a neural network is reduced, and the effect is generally achieved through convolution in a voice task; therefore, a two-layer two-dimensional convolution down-sampling network is adopted in the down-sampling part, the size of a convolution kernel is 3 multiplied by 3, the step length is set to be 2, and after down-sampling, the frame number is reduced to one fourth of the original frame number;
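The masking and down-sampling described above can be sketched as follows. The masking is SpecAugment-style (zero out a random band of frames and a random band of mel channels); the mask widths and the frame-count arithmetic for two stride-2 layers are illustrative assumptions, not values from the patent.

```python
import random

def mask_features(fbank, max_t=10, max_f=8, seed=0):
    """Zero a random time band (whole frames) and a random frequency band
    (mel channels) of a T x F Fbank matrix. Widths are assumed defaults."""
    rng = random.Random(seed)
    T, F = len(fbank), len(fbank[0])
    masked = [row[:] for row in fbank]          # copy; original untouched
    t0, t_w = rng.randrange(T), rng.randrange(1, max_t + 1)
    f0, f_w = rng.randrange(F), rng.randrange(1, max_f + 1)
    for t in range(t0, min(t0 + t_w, T)):       # time-domain mask
        masked[t] = [0.0] * F
    for row in masked:                          # frequency-domain mask
        for f in range(f0, min(f0 + f_w, F)):
            row[f] = 0.0
    return masked

def downsampled_frames(T, layers=2, stride=2):
    """Two conv layers with stride 2 reduce the frame count to ~1/4."""
    for _ in range(layers):
        T = (T - 1) // stride + 1   # ceil(T / stride), 'same'-style padding
    return T

fbank = [[1.0] * 80 for _ in range(100)]   # 100 frames x 80 mel channels
masked = mask_features(fbank)
print(downsampled_frames(100))   # -> 25, one quarter of the original frames
```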
S3, inputting the processed traditional Chinese medicine audio features into the encoder: the encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module;
the forward feedback module comprises two fully-connected layers, two residual layers and a nonlinear activation function ReLU, and a layer normalization is performed before the first fully-connected layer;
in the multi-head self-attention module, a self-attention mechanism can obtain the correlation among the traditional Chinese medicine audio features, so as to obtain the relationship among the traditional Chinese medicine audio sequences, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k)V  (1)
wherein Q, K, and V are obtained by linear transformation of the traditional Chinese medicine audio feature vectors, and d_k is the feature vector dimension;
the multi-head self-attention mechanism learns the context information of the traditional Chinese medicine audio frequency characteristics from different aspects, and the calculation formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O  (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)  (3)
wherein h represents the number of attention heads, W^O is the random weight matrix for the linear transformation after multi-head attention concatenation, and W_i^Q, W_i^K, and W_i^V are the weight matrices corresponding to Q, K, and V in the i-th attention head;
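Equations (1) to (3) can be illustrated with a minimal plain-Python implementation of scaled dot-product attention (a single head, no learned projections; the toy Q, K, V values are made up for illustration):

```python
import math

def matmul(A, B):
    """Plain list-of-lists matrix product A @ B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)                       # subtract max for numerical stability
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, equation (1)."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])   # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

# Toy 3-frame sequence with d_k = 2 (illustrative numbers only).
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(Q, K, V)
print(len(out), len(out[0]))   # -> 3 2: same sequence length and dimension
```

A multi-head version would run this in h parallel subspaces after projecting Q, K, V with the per-head weight matrices of equation (3), then concatenate the heads and apply W^O.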
the convolution module adopts causal convolution and comprises a point-by-point convolution, a gated linear unit, a one-dimensional depthwise convolution, layer normalization and a ReLU activation function;
s4: text feature extraction: mapping a text label corresponding to the traditional Chinese medicine audio to an index of the Chinese character in the modeling unit, namely text characteristics;
s5: text feature processing: adding position information corresponding to the text features into the text features, wherein the position information is obtained through position coding, and the formula is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))  (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))  (5)
wherein pos represents the position index of the character in the current text feature vector, i represents the index within the text feature vector, and d_model represents the encoding dimension, set to 256;
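Equations (4) and (5) can be sketched directly; the interleaved sin/cos layout below is the standard Transformer positional encoding, with d_model = 256 as stated above:

```python
import math

D_MODEL = 256  # encoding dimension, as set in the patent

def positional_encoding(pos, d_model=D_MODEL):
    """Equations (4) and (5): sin on even indices, cos on odd indices."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))   # PE(pos, 2i)
        pe.append(math.cos(angle))   # PE(pos, 2i + 1)
    return pe

pe0 = positional_encoding(0)
print(pe0[0], pe0[1])   # position 0 -> sin(0) = 0.0, cos(0) = 1.0
```

The resulting 256-dimensional vector is added element-wise to the text feature vector at position pos.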
s6, decoding: the decoder comprises an implicit multi-head self-attention module, a multi-head attention module and a forward feedback module, wherein the implicit multi-head self-attention module is used for calculating text context information corresponding to input traditional Chinese medicine audios;
inputting the text features processed in the step S5 into an implicit multi-head self-Attention module in a decoder, wherein the operation of the implicit multi-head self-Attention module is the same as that of the multi-head self-Attention module in the encoder, and finally obtaining an Attention numerical value of the text; then entering a multi-head attention module, wherein the multi-head attention is calculated in the same way as self attention, but input Q, K and V are different, Q is from a Chinese medicine text sequence, K and V are from a voice characteristic sequence output by an encoder, and the structure of the forward feedback module is consistent with that of a forward feedback module in the encoder;
s7, model training and decoding are carried out by adopting a combined CTC/Attention mechanism, the Attention mechanism is in non-monotonic alignment with the traditional Chinese medicine audio features and the traditional Chinese medicine text labels in association with context, and the CTC forces the input traditional Chinese medicine audio features and the traditional Chinese medicine text labels to be in monotonic alignment through a dynamic programming algorithm, so that the problem of insufficient Attention mechanism alignment is solved, and the advantages of the two can be effectively utilized by using a mixed CTC/Attention structure to eliminate irregular alignment; in the training stage, the objective function jointly optimizes CTC loss and KL divergence loss; in the decoding stage, firstly, n best candidates are generated by CTC decoding, and then the candidates are re-scored by an Attention decoder, and the result with the highest score is used as output.
Further, in step S7, the CTC loss is obtained by applying one forward linear computation to the encoder output, normalizing with Softmax, and then computing according to the CTC loss formula; the KL divergence loss is obtained by applying a Softmax operation to the decoder output and then computing with the KL divergence loss formula; finally the CTC loss and the KL divergence loss are summed with weights to obtain the joint loss, with the formula:
Loss = λ·L_CTC(x, y) + (1 − λ)·L_KL(x, y)  (6)
wherein Loss is the joint loss, L_CTC is the CTC loss, L_KL is the KL divergence loss, x represents the input traditional Chinese medicine audio features, and y represents the text label corresponding to the traditional Chinese medicine audio; λ is a hyper-parameter that balances the importance of the CTC loss and the KL divergence loss, and is set to 0.3.
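Equation (6) is a simple weighted sum and can be written directly; the loss values below are placeholders standing in for the CTC and KL divergence losses computed by the two branches.

```python
def joint_loss(ctc_loss, kl_loss, lam=0.3):
    """Equation (6): Loss = lambda * L_CTC + (1 - lambda) * L_KL, lambda = 0.3."""
    return lam * ctc_loss + (1 - lam) * kl_loss

print(joint_loss(2.0, 1.0))   # 0.3 * 2.0 + 0.7 * 1.0, approximately 1.3
```

With λ = 0.3 the attention-branch (KL) term dominates the gradient, while the CTC term keeps the alignment monotonic.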
Further, the number of mel filters in the audio feature extraction process in step S1 is set to 80.
The invention also provides a traditional Chinese medicine clinical speech recognition model based on deep learning, comprising: an audio feature extraction module, an audio feature processing module, a text feature extraction module, a text feature processing module, an encoder, a decoder, and a model training and decoding module. The audio feature extraction module extracts the Fbank features of the audio through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering; the audio feature processing module masks the time domain and frequency domain of the traditional Chinese medicine audio Fbank features, then applies a two-layer two-dimensional convolution down-sampling network with a 3 × 3 convolution kernel and a stride of 2, after which the frame number is reduced to one quarter of the original. The text feature extraction module maps the traditional Chinese medicine text into feature vectors; the text feature processing module obtains the position information of the text features through position encoding and adds the corresponding position information to the traditional Chinese medicine text features, with the calculation formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))  (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))  (5)
wherein pos represents the position index of the character in the current text feature vector, i represents the index within the text feature vector, and d_model represents the encoding dimension, set to 256;
The encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module; the forward feedback module comprises two fully-connected layers, two residual layers and a nonlinear activation function ReLU, and a layer normalization is performed before the first fully-connected layer; in the multi-head self-attention module, the self-attention mechanism obtains the correlation among the traditional Chinese medicine audio features, and thereby the relationship among the traditional Chinese medicine audio sequences, with the calculation formula:
Attention(Q, K, V) = softmax(QK^T / √d_k)V  (1)
wherein Q, K, and V are obtained by linear transformation of the traditional Chinese medicine audio feature vectors, and d_k is the feature vector dimension;
the multi-head self-attention mechanism learns the context information of the traditional Chinese medicine audio frequency characteristics from different aspects, and the calculation formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O  (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)  (3)
wherein h represents the number of attention heads, W^O is the random weight matrix for the linear transformation after multi-head attention concatenation, and W_i^Q, W_i^K, and W_i^V are the weight matrices corresponding to Q, K, and V in the i-th attention head;
the convolution module adopts causal convolution and comprises a point-by-point convolution, a gated linear unit, a one-dimensional depthwise convolution, layer normalization and a ReLU activation function.
The decoder comprises an implicit multi-head self-attention module, a multi-head attention module and a forward feedback module, wherein the implicit multi-head self-attention module is used for calculating text context information corresponding to input traditional Chinese medicine audios;
the text features processed by the text feature extraction module and the text feature processing module are input into an implicit multi-head self-Attention module in a decoder, the operation of the implicit multi-head self-Attention module is the same as that of the multi-head self-Attention module in the encoder, and finally an Attention numerical value of the text is obtained; and then entering a multi-head attention module, wherein the multi-head attention is calculated in the same way as self attention, but input Q, K and V are different, Q is from a Chinese medicine text sequence, K and V are from a Chinese medicine audio characteristic sequence output by an encoder, and the structure of the forward feedback module is consistent with that of a forward feedback module in the encoder.
Furthermore, the model training and decoding module adopts a joint CTC/Attention mechanism, and the objective function jointly optimizes the CTC loss and the KL divergence loss. The Attention mechanism uses context to align the traditional Chinese medicine audio features with the traditional Chinese medicine text labels non-monotonically, while CTC forces the input audio features and text labels into monotonic alignment through a dynamic programming algorithm, which remedies the Attention mechanism's insufficient alignment; a hybrid CTC/Attention structure can therefore effectively exploit the advantages of both and eliminate irregular alignments. In the training stage, the objective function jointly optimizes the CTC loss and the KL divergence loss; in the decoding stage, CTC decoding first generates the n best candidates, the Attention decoder then re-scores them, and the highest-scoring result is taken as the output. The CTC loss is obtained by applying one forward linear computation to the encoder output, normalizing with Softmax, and computing according to the CTC loss formula; the KL divergence loss is obtained by applying a Softmax operation to the decoder output and then computing with the KL divergence loss formula; finally the CTC loss and the KL divergence loss are summed with weights to obtain the joint loss, with the formula:
Loss = λ·L_CTC(x, y) + (1 − λ)·L_KL(x, y)  (6)
wherein Loss is the joint loss, L_CTC is the CTC loss, L_KL is the KL divergence loss, x represents the input traditional Chinese medicine audio features, and y represents the text label corresponding to the traditional Chinese medicine audio; λ is a hyper-parameter that balances the importance of the CTC loss and the KL divergence loss, and is set to 0.3;
in the decoding stage, firstly, n best candidates are generated by CTC decoding, and then the candidates are re-scored by an Attention decoder, and the result with the highest score is used as output.
Further, the number of mel filters in the audio feature extraction module is set to 80.
In conclusion, the beneficial effects of the invention are as follows:
1. In real traditional Chinese medicine clinical speech recognition scenarios there is usually significant background noise, which causes many errors during recognition. The invention therefore enhances the Mel filter-bank features extracted from the input audio: masking along the frequency domain and masking along the time domain are equivalent to artificially adding some noise, which avoids over-fitting during model training and improves the accuracy of traditional Chinese medicine clinical speech recognition.
2. CNNs are good at extracting local features of the audio, while the Transformer is good at capturing content-based global interactions; the Conformer combines the advantages of both, retaining local features and global representations to the greatest extent, so the traditional Chinese medicine clinical speech recognition effect is better.
3. A joint CTC/Attention mechanism is adopted: CTC and Attention share one encoder, and the objective function jointly optimizes the CTC loss and the Attention (KL divergence) loss, which effectively accelerates training convergence and also yields better recognition results in traditional Chinese medicine clinical speech recognition during decoding.
Drawings
FIG. 1 is a flow chart of Chinese medicine clinical speech recognition model training and decoding based on deep learning;
FIG. 2 is a diagram of a Chinese medicine clinical speech recognition model framework based on deep learning;
FIG. 3 is an identification example of one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
In the description of the embodiments of the present invention, it should be noted that indications of orientation or positional relationship are based on the orientation or positional relationship shown in the drawings, the orientation or positional relationship usually adopted when the product of the invention is used, or the orientation or positional relationship usually understood by those skilled in the art; they are only for the convenience of describing the invention and simplifying the description, and do not indicate or imply that the indicated device or element must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the invention. Furthermore, the terms "first" and "second" are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
In the description of the embodiments of the present invention, it should further be noted that, unless otherwise explicitly specified or limited, the terms "disposed" and "connected" are to be construed broadly, e.g., as fixedly connected, detachably connected, or integrally connected; directly connected, or indirectly connected through an intermediary. The specific meanings of the above terms in the present invention can be understood by those skilled in the art in specific cases.
The first embodiment is as follows:
The framework of the traditional Chinese medicine clinical speech recognition model based on deep learning in this embodiment is shown in fig. 2. The model includes: an audio feature extraction module, an audio feature processing module, a text feature extraction module, a text feature processing module, an encoder, a decoder, and a model training and decoding module. The audio feature extraction module extracts the Fbank features of the traditional Chinese medicine audio, and the audio feature processing module performs audio feature enhancement and down-sampling on the Fbank features; the text feature extraction and processing modules map the traditional Chinese medicine text into indices of the modeling units and add text position information.
The encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module; wherein the feedforward module comprises two fully-connected layers, two residual layers and a non-linear activation function ReLU, and a layer normalization is performed before the first fully-connected layer.
In the multi-head self-attention module, the self-attention mechanism can obtain the correlation among the traditional Chinese medicine audio features, so as to obtain the relationship among the traditional Chinese medicine audio sequences, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k)V  (1)
wherein Q, K, and V are obtained by linear transformation of the traditional Chinese medicine audio feature vectors, and d_k is the feature vector dimension.
The multi-head self-attention mechanism learns the context information of the traditional Chinese medicine audio features from different aspects, compared with a single attention mechanism, the multi-head self-attention mechanism can learn more information, and the calculation formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O  (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)  (3)
wherein h represents the number of attention heads, W^O is the random weight matrix for the linear transformation after multi-head attention concatenation, and W_i^Q, W_i^K, and W_i^V are the weight matrices corresponding to Q, K, and V in the i-th attention head.
Speech recognition is a sequential problem in which only the input at time t and before may be considered, so causal convolution is used in the convolution module, which includes a point-by-point convolution, a gated linear unit, a one-dimensional depthwise convolution, layer normalization, and a ReLU activation function.
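The causality constraint can be illustrated with a minimal one-dimensional example: by left-padding the input with zeros, the output at time t is computed only from the current and earlier samples. The kernel values are illustrative, not from the patent.

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output at time t depends only on x[t]
    and earlier samples, achieved by left-padding with zeros."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)   # left padding keeps causality
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0]
print(causal_conv1d(x, [0.5, 0.5]))   # -> [0.5, 1.5, 2.5, 3.5]
```

Each output is the average of the current and previous sample; no future sample is ever read, which is exactly what the convolution module's causal design guarantees.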
The decoder comprises an implicit multi-head self-attention module, a multi-head attention module and a forward feedback module, wherein the implicit multi-head self-attention module is used for calculating text context information corresponding to input traditional Chinese medicine audios. The method for carrying out Chinese medicine clinical speech recognition by using the model comprises the following steps:
firstly, a section of traditional Chinese medicine clinical voice information is selected, fbank characteristics of traditional Chinese medicine audio are extracted through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering (the number of filters is set to be 80) in an audio characteristic extraction module, and then audio characteristic enhancement operation and down-sampling operation are carried out on the Fbank characteristics by an audio characteristic processing module.
Firstly, extracting Fbank characteristics of traditional Chinese medicine audio through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering (set as 80), and then performing audio characteristic enhancement operation on the Fbank characteristics, wherein the Fbank characteristics of the traditional Chinese medicine audio are two-dimensional vectors which can be divided into a time domain and a frequency domain, and the audio characteristic enhancement is to perform masking in the time domain and the frequency domain.
Then the traditional Chinese medicine audio Fbank features enhanced in the previous step are down-sampled, i.e. the number of frames is reduced without losing speech information, so as to reduce the amount of computation in the neural network; in speech tasks this is generally achieved by convolution. The down-sampling part therefore adopts a two-layer two-dimensional convolution down-sampling network with a 3 × 3 convolution kernel and a stride of 2. After down-sampling, the frame number is reduced to one quarter of the original. Meanwhile, the text feature extraction and processing modules map the traditional Chinese medicine text into indices of the modeling units and add text position information. The position information is obtained through position encoding, with the formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))  (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))  (5)
wherein pos represents the position index of the character in the current text feature vector, i represents the index within the text feature vector, and d_model represents the encoding dimension, set to 256;
the encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module; wherein the feedforward module comprises two fully-connected layers, two residual layers and a nonlinear activation function ReLU, and a layer normalization is performed before the first fully-connected layer.
Then in a decoding structure, inputting the Chinese medicine text vector into an implicit multi-head self-Attention module in a decoder, wherein the operation of the implicit multi-head self-Attention module is the same as that of the multi-head self-Attention module in an encoder, finally obtaining an Attention value of a text, then entering the multi-head Attention module, the calculation mode of the multi-head Attention is the same as that of the self-Attention, but the input Q, K and V are different, wherein the Q is from a Chinese medicine text sequence, the K and V are from a Chinese medicine audio characteristic sequence output by the encoder, and finally a forward feedback module is consistent with a forward feedback module structure in the encoder.
A joint CTC/Attention mechanism is adopted during model training, and the objective function jointly optimizes the CTC loss and the KL divergence loss. The Attention mechanism uses context to align the traditional Chinese medicine audio features with the traditional Chinese medicine text labels non-monotonically, while CTC forces the input audio features and text labels into monotonic alignment through a dynamic programming algorithm, which remedies the Attention mechanism's insufficient alignment; the hybrid CTC/Attention structure can effectively exploit the advantages of both and eliminate irregular alignments. The CTC loss is obtained by applying one forward linear computation to the encoder output, normalizing with Softmax, and computing according to the CTC loss formula; the KL divergence loss is obtained by applying a Softmax operation to the decoder output and then computing with the KL divergence loss formula; finally the CTC loss and the KL divergence loss are summed with weights to obtain the joint loss, with the formula:
Loss = λL_CTC(x, y) + (1 - λ)L_KL(x, y)    (6)
wherein Loss is the joint loss, L_CTC is the CTC loss, L_KL is the KL divergence loss, x represents the input traditional Chinese medicine audio features, and y represents the text label corresponding to the traditional Chinese medicine audio; λ is a hyper-parameter that balances the importance of the CTC loss and the KL divergence loss, and is set to 0.3.
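For illustration, the λ-weighted combination in formula (6) can be sketched in numpy; the logits, label-smoothed targets, stand-in CTC value, and helper names below are hypothetical and only demonstrate the weighted sum with λ = 0.3.

```python
import numpy as np

def kl_divergence_loss(log_probs, target_probs):
    # KL(target || model), summed over classes, averaged over positions
    return float(np.mean(np.sum(
        target_probs * (np.log(target_probs + 1e-12) - log_probs), axis=-1)))

def joint_loss(ctc_loss, kl_loss, lam=0.3):
    # Loss = λ·L_CTC + (1 - λ)·L_KL, with λ = 0.3 as in formula (6)
    return lam * ctc_loss + (1 - lam) * kl_loss

# Hypothetical decoder output over a 4-symbol vocabulary at 3 positions
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 2.0, 0.1, 0.1],
                   [0.1, 0.1, 2.0, 0.1]])
log_probs = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
targets = np.eye(4)[[0, 1, 2]] * 0.9 + 0.025   # label-smoothed one-hot rows

kl = kl_divergence_loss(log_probs, targets)
ctc = 1.7  # stand-in scalar for the CTC loss from the encoder branch
print(joint_loss(ctc, kl, lam=0.3))
```

In a real system the CTC term would come from a proper CTC forward computation over the encoder outputs; here it is a placeholder scalar.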
In the decoding stage, the n best candidates are first generated by CTC decoding; these candidates are then re-scored by the Attention decoder, and the result with the highest score is used as the output. The model training and decoding flow chart is shown in fig. 1.
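The two-pass decoding described above, in which CTC proposes an n-best list and the Attention decoder re-scores it, can be sketched as follows; the candidate list, interpolation weight, and toy scoring function are invented for illustration only.

```python
def rescore_nbest(ctc_candidates, attention_scorer, ctc_weight=0.5):
    # ctc_candidates: list of (hypothesis, ctc_log_prob) from CTC beam search.
    # attention_scorer: callable returning the Attention decoder's
    # log-probability for a hypothesis. The final score interpolates both,
    # and the best-scoring hypothesis is returned.
    best_hyp, best_score = None, float("-inf")
    for hyp, ctc_lp in ctc_candidates:
        score = ctc_weight * ctc_lp + (1 - ctc_weight) * attention_scorer(hyp)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp

# Hypothetical 3-best list and a toy scorer that prefers longer hypotheses
nbest = [("舌红苔薄", -4.0), ("舌红苔白", -4.2), ("舌红", -3.9)]
toy_scorer = lambda hyp: -5.0 + 0.5 * len(hyp)
print(rescore_nbest(nbest, toy_scorer))  # prints 舌红苔薄
```

The interpolation weight between the CTC and Attention scores is itself a tunable decoding hyper-parameter; 0.5 here is an arbitrary choice.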
Next, the recognition effect of the speech recognition model and method of the present invention is verified.
The evaluation index for Chinese speech recognition is the Character Error Rate (CER), so the evaluation index for traditional Chinese medicine clinical speech recognition in the present invention is also the CER. The calculation formula is as follows, where N represents the total number of characters, S the number of substitutions, D the number of deletions, and I the number of insertions:
CER = (S + D + I) / N × 100%
In order to make the recognized character sequence consistent with the standard character sequence, certain characters need to be substituted, deleted, or inserted; the total number of substituted, deleted, and inserted characters, divided by the total number of characters in the standard sequence and expressed as a percentage, is the CER.
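A minimal Python sketch of this CER computation, counting substitutions, deletions, and insertions via Levenshtein distance over the reference character sequence; the example strings are hypothetical.

```python
def cer(reference, hypothesis):
    # Character error rate: (S + D + I) / N, where N is the length of the
    # reference, computed with a one-row dynamic-programming edit distance.
    ref, hyp = list(reference), list(hypothesis)
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,         # deletion
                      dp[j - 1] + 1,     # insertion
                      prev + (r != h))   # substitution (0 if characters match)
            prev, dp[j] = dp[j], cur
    return dp[-1] / len(ref)

print(cer("患者舌红苔薄", "患者舌红苔白"))  # one substitution in six characters
```

Multiplying the returned ratio by 100 gives the percentage form used in the formula above.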
The present invention uses traditional Chinese medicine clinical medical record texts and the audio transcribed from those texts as experimental data. Experiments are conducted with three different deep learning models, namely a Convolutional Neural Network (CNN), a Transformer, and a Conformer, with CER as the evaluation index. The experimental results are shown in Table 1, and the recognition effects are shown in fig. 3.
The experimental results show that the Conformer model performs best on traditional Chinese medicine clinical speech recognition.
TABLE 1 results of the experiment
Model CER
CNN 19.1%
Transformer 5.50%
Conformer 3.79%
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (6)

1. The traditional Chinese medicine clinical speech recognition method based on deep learning is characterized by comprising the following steps:
s1, audio feature extraction: extracting Fbank characteristics of the traditional Chinese medicine audio through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering;
S2, audio feature processing: masking the time domain and frequency domain of the traditional Chinese medicine audio Fbank features, and then applying a two-layer two-dimensional convolution down-sampling network with convolution kernel size 3×3 and stride 2; after down-sampling, the number of audio feature frames is reduced to one quarter of the original;
S3, inputting the processed traditional Chinese medicine audio features into the encoder: the encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module;
the forward feedback module comprises two fully-connected layers, two residual layers and a nonlinear activation function ReLU, and layer normalization is performed once before the first fully-connected layer;
in the multi-head self-attention module, a self-attention mechanism can obtain the correlation among the traditional Chinese medicine audio features, so as to obtain the relationship among the traditional Chinese medicine audio sequences, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k)V    (1)
wherein Q, K and V are the traditional Chinese medicine audio feature vectors obtained by linear transformation, and d_k is the feature vector dimension;
the multi-head self-attention mechanism learns the context information of the traditional Chinese medicine audio features from different aspects, and the calculation formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O    (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (3)
wherein h represents the number of attention heads, W^O is the random weight matrix of the linear transformation applied after the multi-head attention outputs are concatenated, and W_i^Q, W_i^K, W_i^V are the weight matrices corresponding to Q, K and V in the i-th attention head;
the convolution module adopts causal convolution, and comprises a pointwise convolution, a gated linear unit, a one-dimensional depthwise convolution, layer normalization and the activation function ReLU;
S4, text feature extraction: mapping the text label corresponding to the traditional Chinese medicine audio to the indices of its Chinese characters in the modeling unit, i.e., the text features;
s5: text feature processing: adding position information corresponding to the text features into the text features, wherein the position information is obtained through position coding, and the formula is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))    (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (5)
wherein pos represents the position index of the character in the current text feature vector, i represents the dimension index of the text feature vector, and d_model represents the encoding dimension, set to 256;
S6, decoding: the decoder comprises an implicit multi-head self-attention module, a multi-head attention module and a forward feedback module, wherein the implicit multi-head self-attention module is used to calculate the context information of the text corresponding to the input traditional Chinese medicine audio;
inputting the text features processed in step S5 into the implicit multi-head self-Attention module of the decoder, wherein this module operates in the same way as the multi-head self-Attention module in the encoder and finally yields an Attention value for the text; the output then enters the multi-head Attention module, whose calculation is the same as that of self-Attention except that the inputs Q, K and V differ: Q comes from the traditional Chinese medicine text sequence, while K and V come from the traditional Chinese medicine audio feature sequence output by the encoder; the forward feedback module is identical in structure to the forward feedback module in the encoder;
S7, model training and decoding are performed with a joint CTC/Attention mechanism: the Attention mechanism uses context to perform non-monotonic alignment between the traditional Chinese medicine audio features and the traditional Chinese medicine text labels, while CTC forces a monotonic alignment between the input traditional Chinese medicine audio features and the traditional Chinese medicine text labels through a dynamic programming algorithm, remedying the insufficient alignment of the Attention mechanism; the hybrid CTC/Attention structure can effectively exploit the advantages of both to eliminate irregular alignments; in the training stage, the objective function jointly optimizes the CTC loss and the KL divergence loss; in the decoding stage, the n best candidates are generated by CTC decoding, re-scored by the Attention decoder, and the result with the highest score is taken as the output.
2. The deep learning-based traditional Chinese medicine clinical speech recognition method of claim 1, wherein the CTC loss in step S7 is obtained by applying one forward linear transformation to the encoder output, normalizing with Softmax, and evaluating the CTC loss formula; the KL divergence loss is obtained by applying a Softmax operation to the decoder output and evaluating the KL divergence loss formula; finally, the two losses are weighted and summed to obtain the joint loss, whose formula is as follows:
Loss = λL_CTC(x, y) + (1 - λ)L_KL(x, y)    (6)
wherein Loss is the joint loss, L_CTC is the CTC loss, L_KL is the KL divergence loss, x represents the input traditional Chinese medicine audio features, and y represents the text label corresponding to the traditional Chinese medicine audio; λ is a hyper-parameter that balances the importance of the CTC loss and the KL divergence loss, and is set to 0.3.
3. The deep learning-based traditional Chinese medicine clinical speech recognition method of claim 1, wherein the number of mel filters in the audio feature extraction process in step S1 is set to 80.
4. The traditional Chinese medicine clinical speech recognition model based on deep learning is characterized by comprising: an audio feature extraction module, an audio feature processing module, a text feature extraction module, a text feature processing module, an encoder, a decoder, and a model training and decoding module; the audio feature extraction module extracts the Fbank features of the audio through framing, pre-emphasis, windowing, fast Fourier transform and Mel filtering; the audio feature processing module masks the time domain and frequency domain of the traditional Chinese medicine audio Fbank features, and then applies a two-layer two-dimensional convolution down-sampling network with convolution kernel size 3×3 and stride 2; after down-sampling, the number of frames is reduced to one quarter of the original; the text feature extraction module maps the traditional Chinese medicine text into feature vectors; the text feature processing module obtains the position information of the text features through position coding and adds the corresponding position information to the traditional Chinese medicine text features, with the following calculation formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))    (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (5)
wherein pos represents the position index of the character in the current text feature vector, i represents the dimension index of the text feature vector, and d_model represents the encoding dimension, set to 256;
the encoder comprises two forward feedback modules, a multi-head self-attention module and a convolution module;
the forward feedback module comprises two fully-connected layers, two residual layers and a nonlinear activation function ReLU, and layer normalization is performed once before the first fully-connected layer;
in the multi-head self-attention module, a self-attention mechanism can obtain the correlation among the traditional Chinese medicine audio features, so as to obtain the relationship among the traditional Chinese medicine audio sequences, and the calculation formula is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k)V    (1)
wherein Q, K and V are the traditional Chinese medicine audio feature vectors obtained by linear transformation, and d_k is the feature vector dimension;
the multi-head self-attention mechanism learns the context information of the traditional Chinese medicine audio features from different aspects, and the calculation formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O    (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (3)
wherein h represents the number of attention heads, W^O is the random weight matrix of the linear transformation applied after the multi-head attention outputs are concatenated, and W_i^Q, W_i^K, W_i^V are the weight matrices corresponding to Q, K and V in the i-th attention head;
the convolution module adopts causal convolution, and comprises a pointwise convolution, a gated linear unit, a one-dimensional depthwise convolution, layer normalization and the activation function ReLU;
the decoder comprises an implicit multi-head self-attention module, a multi-head attention module and a forward feedback module, wherein the implicit multi-head self-attention module is used to calculate the context information of the input traditional Chinese medicine text;
the text features processed by the text feature extraction module and the text feature processing module are input into the implicit multi-head self-Attention module of the decoder; this module operates in the same way as the multi-head self-Attention module in the encoder and finally yields an Attention value for the text; the output then enters the multi-head Attention module, whose calculation is the same as that of self-Attention except that the inputs Q, K and V differ: Q comes from the traditional Chinese medicine text sequence, while K and V come from the traditional Chinese medicine audio feature sequence output by the encoder; the forward feedback module is identical in structure to the forward feedback module in the encoder.
5. The deep learning-based traditional Chinese medicine clinical speech recognition model according to claim 4, wherein the model training and decoding module adopts a joint CTC/Attention mechanism, and the objective function jointly optimizes the CTC loss and the KL divergence loss; the Attention mechanism uses context to perform non-monotonic alignment between the traditional Chinese medicine audio features and the traditional Chinese medicine text labels, while CTC forces a monotonic alignment between the input traditional Chinese medicine audio features and the traditional Chinese medicine text labels through a dynamic programming algorithm, remedying the insufficient alignment of the Attention mechanism; the hybrid CTC/Attention structure can effectively exploit the advantages of both to eliminate irregular alignments; in the decoding stage, the n best candidates are first generated by CTC decoding, then re-scored by the Attention decoder, and the result with the highest score is used as the output; the CTC loss is obtained by applying one forward linear transformation to the encoder output, normalizing with Softmax, and evaluating the CTC loss formula; the KL divergence loss is obtained by applying a Softmax operation to the decoder output and evaluating the KL divergence loss formula; finally, the two losses are weighted and summed to obtain the joint loss, whose formula is as follows:
Loss = λL_CTC(x, y) + (1 - λ)L_KL(x, y)    (6)
wherein Loss is the joint loss, L_CTC is the CTC loss, L_KL is the KL divergence loss, x represents the input traditional Chinese medicine audio features, and y represents the text label corresponding to the traditional Chinese medicine audio; λ is a hyper-parameter that balances the importance of the CTC loss and the KL divergence loss, and is set to 0.3;
in the decoding stage, firstly, n best candidates are generated by CTC decoding, and then the candidates are re-scored by an Attention decoder, and the result with the highest score is used as output.
6. The deep learning based clinical speech recognition model of TCM of claim 4, wherein the number of Mel filters in the audio feature extraction module is set to 80.
CN202211006117.1A 2022-08-22 2022-08-22 Traditional Chinese medicine clinical speech recognition method and model based on deep learning Pending CN115472157A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211006117.1A CN115472157A (en) 2022-08-22 2022-08-22 Traditional Chinese medicine clinical speech recognition method and model based on deep learning


Publications (1)

Publication Number Publication Date
CN115472157A true CN115472157A (en) 2022-12-13

Family

ID=84366070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211006117.1A Pending CN115472157A (en) 2022-08-22 2022-08-22 Traditional Chinese medicine clinical speech recognition method and model based on deep learning

Country Status (1)

Country Link
CN (1) CN115472157A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117936080A (en) * 2024-03-22 2024-04-26 中国人民解放军总医院 Solid malignant tumor clinical auxiliary decision-making method and system based on federal large model
CN117936080B (en) * 2024-03-22 2024-06-04 中国人民解放军总医院 Solid malignant tumor clinical auxiliary decision-making method and system based on federal large model
CN118072901A (en) * 2024-04-18 2024-05-24 中国人民解放军海军青岛特勤疗养中心 Outpatient electronic medical record generation method and system based on voice recognition


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination