CN117476215A - Medical auxiliary judging method and system based on AI

Info

Publication number
CN117476215A
Authority
CN
China
Prior art keywords
model
training
text
auxiliary
data
Prior art date
Legal status
Pending
Application number
CN202311536875.9A
Other languages
Chinese (zh)
Inventor
刘杰
王荣霄
徐大鹏
Current Assignee
Shanghai Touchmai Digital Medical Technology Co., Ltd.
Original Assignee
Shanghai Touchmai Digital Medical Technology Co., Ltd.
Application filed by Shanghai Touchmai Digital Medical Technology Co., Ltd.
Priority to CN202311536875.9A
Publication of CN117476215A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Public Health (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses an AI-based medical auxiliary judging method and system, belonging to the technical field of intelligent medical treatment. The method comprises the following steps: S100, generating a speech understanding AI model and an AI auxiliary judging model; S200, linking the speech understanding AI model and the AI auxiliary judging model; S300, inputting the patient's spoken symptom description to the speech understanding AI model, which converts it and outputs the result to the AI auxiliary judging model; S400, the AI auxiliary judging model processes the input and outputs a corresponding result; S500, continuously evaluating and adjusting the models. The speech understanding AI model converts the patient's spoken symptom description into text; after preprocessing, features are extracted from the text and sent to the AI auxiliary judging model, which receives the text input, discriminates disease-related information, and outputs a score for each possible disease. The model's output scores are converted into a probability distribution, and each possible disease is listed together with its corresponding probability.

Description

Medical auxiliary judging method and system based on AI
Technical Field
The invention belongs to the technical field of intelligent medical treatment, and particularly relates to an AI-based medical auxiliary judging method and an AI-based medical auxiliary judging system.
Background
Today, artificial intelligence technology has made significant progress in many fields, and the medical field is no exception. With the continuous accumulation of medical knowledge and the continuous innovation of medical technology, there is a growing desire to improve the accuracy and efficiency of diagnosis and treatment. However, the complexity and diversity of the medical field mean that traditional physician judgment and diagnosis may be limited in some cases.
In this context, an artificial intelligence-based medical auxiliary discrimination system has been developed. The system utilizes advanced technologies such as machine learning, natural language processing, image recognition and the like, and aims to assist doctors in diagnosis and treatment decision by processing massive medical data, research papers, patient medical records, medical images and the like. The system finds patterns by analyzing large amounts of data, discovers hidden associations, and provides valuable information to support clinical practice.
The research and application of the medical auxiliary discrimination system have important clinical significance. First, it can help doctors provide more accurate information in diagnosing diseases, reducing the risk of diagnosis errors and missed diagnoses. Secondly, the system can help doctors to know the latest research results and treatment schemes in time when medical knowledge is updated rapidly, so that better medical services are provided. In addition, for some rare cases or complex diseases, the system can assist doctors in formulating more personalized treatment strategies by comparing data of similar cases.
However, AI-based medical auxiliary discrimination systems also face some challenges. First among these are data privacy and security issues, particularly where the processing of sensitive patient information is concerned. In addition, the reliability and misdiagnosis rate of such a system must be sufficiently validated to ensure its practical effect in clinical practice. At the same time, medical personnel need to be trained so that they fully understand how to work with the system and how to further verify and confirm its results when needed.
In a word, the medical auxiliary discriminating system based on artificial intelligence has great potential in the medical field, can provide powerful support for doctors, and improves the accuracy and efficiency of medical decision, thereby better serving the health needs of patients. However, its development and application requires comprehensive consideration in aspects such as technology, ethics, and law to ensure optimal results and to ensure the rights and interests of the patient.
Existing artificial intelligence remains limited in its ability to infer disease:
(1) Data quality and availability: AI models typically require large amounts of high quality data to train. In the medical field, obtaining such data may be complicated by privacy, ethical and legal issues.
(2) Model generalization: even if trained well over large amounts of data, AI models may perform poorly on new or different data sets, which is a significant problem for disease inference.
(3) Model interpretability: many advanced AI models, especially deep learning models, are considered "black boxes" because their decision process is difficult to interpret. This means that they can provide predictions, but it is difficult to explain the reasons behind predictions. In the medical field, understanding and explaining the cause of diagnosis is very important, as doctors and patients need to understand the basis of diagnosis and treatment decisions.
(4) Complexity and variability: diseases are often multifactorial, and the human body is a complex system involving multiple biological processes and environmental factors. A simple AI model may fail to capture the complex interactions between these factors, while a more complex model may suffer from overfitting and interpretability problems.
(5) Ethical and legal issues: the application of AI in the medical field involves many ethical and legal issues including data privacy, model transparency and responsibility attribution. These problems limit the application of AI in disease inference.
(6) Update and adaptation: medical science is an ever-evolving field. New studies and findings may change our understanding of certain diseases. AI models need to be updated regularly to ensure that their inferences are consistent with the latest medical knowledge.
These limitations need to be overcome by continued research and development.
AI should have a high fault tolerance in disease discrimination, yet the market lacks true multi-view discrimination for a given disease.
AI should have a high fault tolerance in disease discrimination because of the high complexity and diversity of disease itself. The same disease may exhibit different symptoms and characteristics in different patients, and may change over time even in the same patient. Single-view discrimination is therefore likely to lead to misdiagnosis or missed diagnosis.
To increase discrimination accuracy, some disease discrimination AI systems currently on the market attempt to improve accuracy through multi-view methods. Multi-view discrimination means that the AI system integrates information from a variety of sources, such as patient symptoms, personal medical history, physical signs, and laboratory test results; the AI system must be able to jointly analyze information obtained from different sources such as imaging, biochemical tests, and medical history. By integrating these information sources, the AI can evaluate the patient's condition more comprehensively, increasing discrimination accuracy.
However, multi-view discrimination is not easy to realize. It involves integrating large amounts of medical data and knowledge, building complex algorithmic models, and performing extensive data training and validation; such systems must process and integrate large volumes of data and require very complex algorithm models. AI technology currently on the market is still developing: for a given disease, true multi-view comprehensive discrimination may not yet be fully realized, and no comprehensive and reliable multi-view discrimination AI system is yet available. This is both a challenge and a direction of development for AI applications in the medical field.
Disclosure of Invention
1. Technical problem to be solved by the invention
The invention aims to solve the problem that AI-assisted judgment in the prior art has difficulty directly performing multi-view analysis of a user's symptom description.
2. Technical solution
In order to achieve the above purpose, the technical scheme provided by the invention is as follows:
The invention discloses an AI-based medical auxiliary judging method, which comprises the following steps:
S100, generating a speech understanding AI model and an AI auxiliary judging model;
S200, linking the speech understanding AI model and the AI auxiliary judging model;
S300, inputting the patient's spoken symptom description to the speech understanding AI model, which converts it and outputs the result to the AI auxiliary judging model;
S400, the AI auxiliary judging model processes the input and outputs a corresponding result;
S500, continuously evaluating and adjusting the models.
Preferably, generating the speech understanding AI model in step S100 specifically includes:
S111, data collection and data set construction;
S112, data preprocessing;
S113, selecting a pre-training language model;
S114, targeted large-scale training and optimization.
Preferably, step S111 specifically constructs a dataset from audio paired with text transcriptions on the internet, the dataset covering a broad audio distribution from many different environments, recording setups, speakers, and languages;
step S112 specifically uses heuristic automatic screening to improve transcription quality, detecting and removing machine-generated transcripts from the training dataset, and uses an audio language detector to ensure that the spoken language matches the language of the transcription, as determined by CLD2; it also includes fuzzy deduplication of the transcribed text to reduce duplicate and automatically generated content in the training dataset.
Preferably, the pre-training language model in step S113 is Whisper-large-v2; the targeted large-scale training in step S114 passes all tasks and condition information to the decoder as a sequence of input tokens, trains with data-parallel accelerators, trains the model using the Adam optimizer with gradient norm clipping and a linear learning-rate decay to zero after a warm-up over the first 2048 updates, and trains the model for 2^20 updates, completing the traversal of the dataset.
Preferably, generating the AI auxiliary discrimination model in step S100 specifically includes:
S121, dividing the data set into a training set, a validation set, and a test set;
S122, selecting DeBERTaV3 as the pre-training language model;
S123, model training and optimization.
Preferably, step S122 specifically uses the ELECTRA-style pre-training method: given a sequence $X = \{x_i\}$, 15% of its tokens are randomly masked to obtain $\tilde{X}$, and a language model parameterized by $\theta$ is trained to reconstruct $X$ by predicting the masked tokens $\tilde{x}$ conditioned on $\tilde{X}$:

$$L_{\mathrm{MLM}} = \mathbb{E}\left(-\sum_{i \in C} \log p_{\theta}\left(\tilde{x}_i = x_i \mid \tilde{X}\right)\right) \qquad (1)$$

where $C$ is the set of indices of the masked tokens in the sequence.

ELECTRA is trained in GAN style with two Transformer encoders: one, called the generator, is trained with MLM; the other, called the discriminator, is trained with a token-level classifier. The generator produces ambiguous tokens to replace the masked tokens in the input sequence; the modified input sequence is then fed into the discriminator, whose binary classifier must decide whether each token is an original token or one replaced by the generator. With $\theta_G$ and $\theta_D$ denoting the parameters of the generator and the discriminator respectively, the training objective of the discriminator is called RTD (Replaced Token Detection), and the loss function of the generator is written as:

$$L_{\mathrm{MLM}} = \mathbb{E}\left(-\sum_{i \in C} \log p_{\theta_G}\left(\tilde{x}_{i,G} = x_i \mid \tilde{X}_G\right)\right) \qquad (2)$$

where $\tilde{X}_G$, the input to the generator, is generated by randomly masking 15% of the tokens in $X$; the input sequence of the discriminator is constructed by replacing the masked tokens with new tokens sampled according to the output probabilities of the generator:

$$\tilde{x}_{i,D} = \begin{cases} \tilde{x}_i \sim p_{\theta_G}\left(\tilde{x}_{i,G} = x_i \mid \tilde{X}_G\right), & i \in C \\ x_i, & i \notin C \end{cases} \qquad (3)$$

The loss function of the discriminator is:

$$L_{\mathrm{RTD}} = \mathbb{E}\left(-\sum_{i} \log p_{\theta_D}\left(\mathbb{1}\left(\tilde{x}_{i,D} = x_i\right) \mid \tilde{X}_D, i\right)\right) \qquad (4)$$

where $\mathbb{1}(\cdot)$ is the indicator function and $\tilde{X}_D$ is the input to the discriminator constructed by equation (3). In ELECTRA, $L_{\mathrm{MLM}}$ and $L_{\mathrm{RTD}}$ are jointly optimized, i.e. $L = L_{\mathrm{MLM}} + \lambda L_{\mathrm{RTD}}$, where $\lambda$ is the weight of the discriminator loss $L_{\mathrm{RTD}}$ and is set to 50 in ELECTRA.
Preferably, the optimizing of step S123 includes
(1) Data enhancement, including operations such as random masking, random insertion of blanks, and vocabulary replacement, to reduce overfitting and increase the model's robustness to its input;
(2) The distributed training is performed on a plurality of servers by adopting a distributed technology, so as to obtain a globally optimal model, and the training efficiency and expansibility are improved;
(3) The learning rate scheduling, the learning rate scheduling strategy uses a cosine annealing algorithm, the learning rate is dynamically adjusted according to the training process, the learning rate determines the step length of each parameter update, and the model is enabled to be quickly converged and prevented from being trapped into local optimum, so that better model performance is obtained;
(4) Layer normalization, performing normalization operation on the output of each hidden layer, calculating the mean value and variance of the output of each hidden layer, normalizing the output by scaling and translation, and effectively reducing covariate offset between different layers, so that the model is easier to converge and the generalization capability is improved;
(5) And (3) an optimization algorithm, namely minimizing a loss function of the model by using an Adam optimization algorithm.
Preferably, step S123 further includes monitoring the model pre-training process and displaying it visually, to facilitate analysis and adjustment of the model's hyper-parameters; the monitored indicators are
(1) Loss function: using cross entropy loss; for a single class in the multi-label classification task, the cross entropy loss function is:

$$L = -\left(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right)$$

where $y \in \{0, 1\}$ is the true label for the class and $\hat{y}$ is the predicted probability;
the total cross entropy is the sum of the cross entropy of each class in the multi-label classification task;
(2) Validation-set metrics: the performance and generalization ability of the model are evaluated on a validation set held out from the training set; after each training round, the model's metrics on the validation set (accuracy, recall, F1 score) are computed to understand its performance on unseen data, and by monitoring changes in these metrics one can judge whether overfitting is occurring or apply strategies such as early stopping (a minimal monitoring sketch follows this list);
(3) Learning curve: the learning curve shows the change of the performance of the model on the training set and the verification set along with the training turn, and the fitting condition, the over-fitting degree or the under-fitting degree of the model are judged by comparing the learning curves of the training set and the verification set;
(4) Parameter update and learning rate: the update speed and amplitude of the parameters, and the decay process of the learning rate are observed to ensure convergence and stability of the model.
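As referenced in item (2), the following is a minimal monitoring sketch. It assumes a PyTorch multi-label setup with `BCEWithLogitsLoss` (the per-class binary cross entropy above); the model and loader names are illustrative, not part of the invention.

```python
# Minimal sketch of loss/metric monitoring on a held-out validation set.
# Assumes a multi-label classifier that maps input tensors to per-class logits.
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()  # binary cross entropy per class

def evaluate(model, val_loader, device="cpu"):
    model.eval()
    total_loss, exact_match, total = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in val_loader:          # labels: (batch, num_classes) in {0, 1}
            logits = model(inputs.to(device))
            total_loss += criterion(logits, labels.to(device).float()).item()
            preds = (torch.sigmoid(logits) > 0.5).long()
            # subset accuracy: every class label must match
            exact_match += (preds == labels.to(device)).all(dim=1).sum().item()
            total += labels.size(0)
    return total_loss / len(val_loader), exact_match / total
```

Logging these two numbers after each training round gives the learning curve described in item (3) and supports early-stopping decisions.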
The system adopts the above speech understanding AI model and AI auxiliary discrimination model. The speech understanding AI model converts the patient's spoken symptom description into text; after preprocessing, features are extracted from the text and sent to the AI auxiliary discrimination model. The AI auxiliary discrimination model receives the text input and discriminates disease-related information; it classifies the diseases or symptoms in the text, maps the patient's utterance to the corresponding disease classes, and outputs a score for each possible disease. The model's output scores are converted into a probability distribution by a softmax function, and based on the softmax result each possible disease is listed together with its corresponding probability.
Preferably, the system further ranks the diseases by probability according to the probability distribution output by the model, taking the highest-probability diseases as the possible diseases, compares the probability values with a set threshold, and recommends follow-up measures according to the comparison result.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
the invention relates to an AI-based medical auxiliary judging method and a judging system, wherein the method comprises the following steps: s100, generating a speech understanding AI model and an AI auxiliary judging model; s200, linking the speech understanding AI model and the AI auxiliary judging model; s300, inputting patient symptom voice description information to the voice understanding AI model, converting and outputting the patient symptom voice description information to the AI auxiliary judging model; s400, processing the AI auxiliary judging model and outputting a corresponding result; s500, continuously evaluating and adjusting the model. Through the cooperative use of the speech understanding AI model and the AI auxiliary discriminating model, the symptom speech description of the patient is converted into a text through the speech understanding AI model, the text is preprocessed and then is subjected to feature extraction and sent to the AI auxiliary discriminating model, the AI auxiliary discriminating model receives text input and discriminates information related to diseases and is used for classifying the diseases or symptoms in the text, the patient speech is mapped to corresponding disease classification, a score is output for each possible disease, the output score of the model is converted into probability distribution through a softmax function, and each possible disease and the probability corresponding to the possible disease are listed based on the result of the softmax. The method can rapidly provide targeted reference opinion for patients, improves the efficiency and reduces the medical cost.
Drawings
Fig. 1 is a flow chart of the method of Embodiment 1.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the present application described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Example 1
Referring to fig. 1, an AI-based medical assistance discriminating method of the present embodiment includes the steps of:
S100, generating a speech understanding AI model and an AI auxiliary judging model;
S200, linking the speech understanding AI model and the AI auxiliary judging model;
S300, inputting the patient's spoken symptom description to the speech understanding AI model, which converts it and outputs the result to the AI auxiliary judging model;
S400, the AI auxiliary judging model processes the input and outputs a corresponding result;
S500, continuously evaluating and adjusting the models.
The method of this embodiment uses the speech understanding AI model and the AI auxiliary discrimination model in cooperation. The speech understanding AI model converts the patient's spoken symptom description into text; after preprocessing, features are extracted from the text and sent to the AI auxiliary discrimination model. The AI auxiliary discrimination model receives the text input and discriminates disease-related information, classifying the diseases or symptoms in the text and mapping the patient's utterance to the corresponding disease classes; it outputs a score for each possible disease, the output scores are converted into a probability distribution with a softmax function, and based on the softmax result each possible disease is listed together with its corresponding probability. In this way, targeted reference opinions can be provided to patients quickly, improving efficiency and reducing medical costs.
Generating the speech understanding AI model in step S100 specifically includes:
S111, data collection and data set construction;
S112, data preprocessing;
S113, selecting a pre-training language model;
S114, targeted large-scale training and optimization.
Step S111 specifically constructs a dataset from audio paired with text transcriptions on the internet, the dataset covering a broad audio distribution from many different environments, recording setups, speakers, and languages;
step S112 specifically uses heuristic automatic screening to improve transcription quality, detecting and removing machine-generated transcripts from the training dataset, and uses an audio language detector to ensure that the spoken language matches the language of the transcription, as determined by CLD2; if the two do not match, the (audio, transcript) pair is not included in the dataset as a speech recognition training example. It also includes fuzzy deduplication of the transcribed text to reduce duplicate and automatically generated content in the training dataset.
First, the audio files are divided into 30-second segments, and each segment is paired with the subset of the transcript that occurs during that time window. All audio is used for training, including segments without speech (albeit with sub-sampled probability), and these segments serve as training data for voice activity detection. To further improve the quality of the dataset, after training an initial model, information is collected about its error rate on each training data source. These data sources are then manually inspected, giving priority to sources with higher error rates and larger data volumes. This process efficiently identifies and removes low-quality data sources, revealing a large number of only partially transcribed or poorly aligned/misaligned transcripts, as well as low-quality machine-generated subtitles that the initial filtering heuristics had not caught. To avoid contamination, transcript-level deduplication is performed between the training dataset and any evaluation dataset considered to have a higher risk of overlap. Finally, using an off-the-shelf pipeline, all audio is resampled to 16000 Hz and converted into an 80-channel log-scale Mel spectrogram: a 30-second speech signal becomes 3000 frames, each of dimension 80, with all values scaled to lie between -1 and +1 and a mean of approximately zero.
The above approach ensures high quality training data, including various real world scenarios, even including cases without speech, which helps to detect training models for speech activity. Filtering and quality control measures help to enhance the overall performance and authenticity of translation and speech recognition systems.
The pre-training language model in step S113 is Whisper-large-v2. Whisper-large-v2 is a general-purpose speech recognition model; Whisper can also transcribe multiple languages and translate them into English. It is a multilingual, multitask, Transformer-based encoder-decoder model, also known as a sequence-to-sequence model, trained on a large-scale audio dataset. The model is trained with multitask supervision on 680,000 hours of large-scale, weakly supervised labeled speech data, of which approximately 117,000 hours cover 96 other languages and approximately 125,000 hours are translation data from other languages into English. Whisper-large-v2 uses such a large and diverse dataset to improve recognition of accents, background noise, and technical terms. A model trained at this scale transfers zero-shot to existing datasets, achieving high-quality results without any dataset-specific fine-tuning. Compared to the original Whisper large model, large-v2 is trained for 2.5 times more epochs, with regularization added to improve performance. During training, the model receives log-Mel filter-bank features extracted from the audio waveform and is pre-trained to autoregressively generate a transcription or translation; the resulting model generalizes well and performs strongly on standard benchmarks.
The Whisper architecture is a simple end-to-end approach implemented as an encoder-decoder Transformer: the input audio is first divided into 30-second segments, converted into a log-Mel spectrogram, and then passed into the encoder. The decoder is trained to predict the corresponding text transcript, mixed with special tokens that instruct the single model to perform tasks such as language identification, phrase-level timestamping, multilingual speech transcription, and to-English speech translation. All audio is resampled to 16000 Hz, and an 80-channel log-magnitude Mel spectrogram is computed with 25-millisecond windows and a 10-millisecond stride. For feature normalization, the input is globally scaled over the entire pre-training dataset to lie between -1 and 1, with approximately zero mean. The encoder begins with a small stem of two convolution layers with filter width 3 and GELU activation, the second convolution having stride 2. Sinusoidal position embeddings are then added to the output of the stem, after which the encoder Transformer blocks are applied. The Transformer uses pre-activation residual blocks, and a final layer normalization is applied to the encoder output. The decoder uses learned position embeddings and tied input-output token representations. The encoder and decoder have the same width and the same number of Transformer blocks.
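The front-end just described can be sketched as follows with torchaudio. The constants (16 kHz, 25 ms window = 400 samples, 10 ms hop = 160 samples, 80 mel bins) follow the description above; the final scaling into [-1, 1] is an approximation of the stated normalization, not the exact production code.

```python
# Sketch of the audio front-end: resample to 16 kHz, compute an 80-channel
# log-Mel spectrogram with 25 ms windows and 10 ms hop, then scale roughly
# into [-1, 1] with near-zero mean.
import torch
import torchaudio

def log_mel_features(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
    )(waveform)
    log_mel = torch.log10(mel.clamp(min=1e-10))
    log_mel = torch.maximum(log_mel, log_mel.max() - 8.0)  # clip dynamic range
    return (log_mel + 4.0) / 4.0                           # approx. into [-1, 1]
```

Applied to a 30-second clip, this yields on the order of 3000 frames of dimension 80, matching the figures given above.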
The targeted large-scale training in step S114 passes all tasks and condition information to the decoder as a sequence of input tokens. Since the decoder is an audio-conditioned language model, it is also trained to condition on the history of the text transcription, in the hope that it learns to use a longer-range text context to resolve ambiguous audio; specifically, with some probability the transcribed text preceding the current audio segment is added to the decoder context. A special token <|startoftranscript|> denotes the start of prediction. First, the language being spoken is predicted, each language having a unique token (99 in total); these language targets come from the VoxLingua107 model. When there is no speech in an audio segment, the model is trained to predict a special token <|nospeech|> indicating this. The next token specifies the task (transcription or translation) with <|transcribe|> or <|translate|>. After this, whether timestamps are predicted is specified by including the token <|notimestamps|> when they are omitted. At this point the task and desired format are fully specified and output begins. For timestamp prediction, all times are quantized to the nearest 20 ms relative to the current audio segment, matching the native time resolution of the Whisper model, and an additional token is added to the vocabulary for each such time. Timestamp predictions are interleaved with the caption tokens: a start-time token is predicted before each caption text and an end-time token after it. When the final transcript segment is only partially contained in the current 30-second audio block, in timestamp mode only its start-time token is predicted, to indicate that subsequent decoding should be performed on an audio window aligned with that time; otherwise the audio is truncated to exclude the segment. Finally, the token <|endoftranscript|> is added. The training loss is masked only over the previous context text, and the model is trained to predict all other tokens.
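The token layout above can be illustrated with a small prompt builder. It assembles plain token strings rather than real tokenizer ids, so it is a format sketch only; the function name and arguments are illustrative.

```python
# Illustrative construction of the multitask decoder prompt described above.
def build_decoder_prompt(language: str, task: str,
                         timestamps: bool, prev_text: str = "") -> list[str]:
    tokens: list[str] = []
    if prev_text:                          # optional preceding transcript as context
        tokens += ["<|startofprev|>", prev_text]
    tokens.append("<|startoftranscript|>")
    tokens.append(f"<|{language}|>")       # e.g. "<|en|>"; "<|nospeech|>" if no speech
    tokens.append("<|transcribe|>" if task == "transcribe" else "<|translate|>")
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

print(build_decoder_prompt("en", "transcribe", timestamps=False))
# ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']
```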
Training uses data parallelism across accelerators, with FP16, dynamic loss scaling, and activation checkpointing; the model is trained with the Adam optimizer and gradient norm clipping, with a linear learning-rate decay to zero after a warm-up over the first 2048 updates. A batch size of 256 segments is used, and the model is trained for 2^20 updates, corresponding to two to three full passes over the dataset. No data augmentation or regularization techniques are used; instead, the diversity contained within the large dataset is relied upon to promote generalization and robustness.
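A hedged sketch of this optimization recipe in PyTorch follows: AdamW stands in for the Adam variant, `model` and `train_batches` are assumed to exist, and the warm-up/decay constants follow the text.

```python
# Sketch: gradient clipping, linear warm-up then linear decay to zero,
# and fp16 with dynamic loss scaling. `model`/`train_batches` are placeholders.
import torch
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS, WARMUP = 2**20, 2048
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def lr_lambda(step: int) -> float:
    if step < WARMUP:
        return step / WARMUP
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP))

scheduler = LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()        # dynamic loss scaling

for batch in train_batches:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient norm clipping
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```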
The generation of the AI auxiliary discrimination model in step S100 specifically comprises:
S121, dividing the data set into a training set, a validation set, and a test set: the training set is used to train the model, the validation set to evaluate model performance and tune hyper-parameters, and the test set for the final evaluation of model performance. When splitting the data set, its representativeness must be maintained, taking factors such as data distribution and label balance into account (a split sketch follows this list);
S122, selecting DeBERTaV3 as the pre-training language model;
S123, model training and optimization.
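The split in S121 can be done, for example, with scikit-learn; the 80/10/10 ratio is illustrative, and `texts`/`labels` are assumed to be single-label lists (stratification preserves the label balance required above).

```python
# Sketch of an 80/10/10 stratified train/validation/test split.
from sklearn.model_selection import train_test_split

X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
```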
DeBERTaV3 uses the ELECTRA-style pre-training method and introduces Gradient-Disentangled Embedding Sharing (GDES). By using the ELECTRA-style pre-training method, DeBERTaV3 better captures relationships within text and provides a more accurate semantic representation. The expressive power and performance of the language model are further improved through the decoupled attention mechanism, a more capable decoder, improved pre-training strategies, and optimized embedding sharing.
The improvement of DeBERTaV3 mainly comprises the following aspects:
(1) Decoupling attention mechanism (Disentangled Attention): deBERTaV3 introduced a decoupled attention mechanism that decomposed the original self-attention mechanism (self-attention) into the attention of multiple subspaces to better capture semantic relationships between text.
(2) More complex decoder (Decoding-enhanced): deBERTaV3 uses a more complex decoder architecture, improving the performance of the model in the generation task.
(3) Optimized pre-training strategy: DeBERTaV3 adopts the ELECTRA-style strategy during pre-training, replacing the masked language modeling (Masked Language Modeling) task with replaced token detection to better utilize unlabeled data.
(4) Embedded sharing (Embedding Sharing): the DeBERTaV3 uses a gradient decoupling embedding sharing technology, and the learning efficiency and the expression capability of the model are further improved by sharing gradient information of an embedding layer.
DeBERTa improves on BERT through decoupled attention and an enhanced mask decoder. The decoupled attention mechanism uses two independent vectors to represent each input word, one for content and one for position, and computes the attention weights between words from disentangled content and relative-position matrices. DeBERTa is pre-trained using masked language modeling. Since the decoupled attention mechanism already accounts for the content and relative positions of context words, DeBERTa uses an enhanced mask decoder to improve MLM by adding absolute position information of the context words at the MLM decoding layer. These improvements give the model better performance on various natural language processing tasks, such as text classification, named entity recognition, and semantic role labeling. It is a further improvement on the BERT and RoBERTa models, aimed at increasing the expressive and generalization capacity of the model.
Step S122 specifically uses the ELECTRA-style pre-training method: given a sequence $X = \{x_i\}$, 15% of its tokens are randomly masked to obtain $\tilde{X}$, and a language model parameterized by $\theta$ is trained to reconstruct $X$ by predicting the masked tokens $\tilde{x}$ conditioned on $\tilde{X}$:

$$L_{\mathrm{MLM}} = \mathbb{E}\left(-\sum_{i \in C} \log p_{\theta}\left(\tilde{x}_i = x_i \mid \tilde{X}\right)\right) \qquad (1)$$

where $C$ is the set of indices of the masked tokens in the sequence.

ELECTRA is trained in GAN style with two Transformer encoders: one, called the generator, is trained with MLM; the other, called the discriminator, is trained with a token-level classifier. The generator produces ambiguous tokens to replace the masked tokens in the input sequence; the modified input sequence is then fed into the discriminator, whose binary classifier must decide whether each token is an original token or one replaced by the generator. With $\theta_G$ and $\theta_D$ denoting the parameters of the generator and the discriminator respectively, the training objective of the discriminator is called RTD (Replaced Token Detection), and the loss function of the generator is written as:

$$L_{\mathrm{MLM}} = \mathbb{E}\left(-\sum_{i \in C} \log p_{\theta_G}\left(\tilde{x}_{i,G} = x_i \mid \tilde{X}_G\right)\right) \qquad (2)$$

where $\tilde{X}_G$, the input to the generator, is generated by randomly masking 15% of the tokens in $X$; the input sequence of the discriminator is constructed by replacing the masked tokens with new tokens sampled according to the output probabilities of the generator:

$$\tilde{x}_{i,D} = \begin{cases} \tilde{x}_i \sim p_{\theta_G}\left(\tilde{x}_{i,G} = x_i \mid \tilde{X}_G\right), & i \in C \\ x_i, & i \notin C \end{cases} \qquad (3)$$

The loss function of the discriminator is:

$$L_{\mathrm{RTD}} = \mathbb{E}\left(-\sum_{i} \log p_{\theta_D}\left(\mathbb{1}\left(\tilde{x}_{i,D} = x_i\right) \mid \tilde{X}_D, i\right)\right) \qquad (4)$$

where $\mathbb{1}(\cdot)$ is the indicator function and $\tilde{X}_D$ is the input to the discriminator constructed by equation (3). In ELECTRA, $L_{\mathrm{MLM}}$ and $L_{\mathrm{RTD}}$ are jointly optimized, i.e. $L = L_{\mathrm{MLM}} + \lambda L_{\mathrm{RTD}}$, where $\lambda$ is the weight of the discriminator loss $L_{\mathrm{RTD}}$ and is set to 50 in ELECTRA.
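The joint objective can be written compactly in code. Below is a minimal sketch, assuming `generator` and `discriminator` are Transformer encoders mapping token ids to per-position vocabulary logits and per-position scalar logits respectively; all names are illustrative, not the exact DeBERTaV3 implementation.

```python
# Sketch of one ELECTRA-style training step: L = L_MLM + lambda * L_RTD.
import torch
import torch.nn.functional as F

def electra_step(generator, discriminator, input_ids, masked_ids, mask, lam=50.0):
    # Generator: masked language modeling on the masked positions only.
    gen_logits = generator(masked_ids)                       # (B, T, vocab)
    mlm_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])

    # Replace masked positions with tokens sampled from the generator.
    with torch.no_grad():
        sampled = torch.multinomial(
            F.softmax(gen_logits[mask], dim=-1), num_samples=1).squeeze(-1)
    disc_input = input_ids.clone()
    disc_input[mask] = sampled

    # Discriminator: replaced token detection over every position.
    is_original = (disc_input == input_ids).float()          # 1 = original, 0 = replaced
    rtd_logits = discriminator(disc_input)                   # (B, T)
    rtd_loss = F.binary_cross_entropy_with_logits(rtd_logits, is_original)
    return mlm_loss + lam * rtd_loss
```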
DeBERTaV3 combines the advantages of both methods by replacing the MLM objective used in DeBERTa with the RTD objective. During the pre-training phase, ELECTRA uses a generator and a discriminator that share token embeddings, a method called Embedding Sharing (ES). Let $E$ and $g_E$ denote the token embedding parameters and their gradients, respectively. In each training step of ELECTRA, $g_E$ is computed by back-propagation, accumulating the errors of the generator's MLM loss and the discriminator's RTD loss:

$$g_E = \frac{\partial L_{\mathrm{MLM}}}{\partial E} + \lambda \frac{\partial L_{\mathrm{RTD}}}{\partial E}$$
the above equation presents a multi-task learning problem, where the gradient of each task (i.e., the MLM task of the generator or the RTD task of the arbiter) pulls the solution toward its optimal value, creating a balance between the gradients of the two tasks.
Gradient-Disentangled Embedding Sharing (GDES): the GDES method overcomes the disadvantages of ES and of NES (No Embedding Sharing, in which the generator and discriminator use separate embeddings) while retaining their advantages. Token embeddings are shared between the generator and the discriminator, so that both models learn from the same vocabulary and exploit the rich semantic information encoded in the embeddings; at the same time, sharing is restricted throughout model training. In each training iteration, the generator's gradients are computed from the MLM loss alone, not the RTD loss, so the model trains as efficiently as with NES. Sharing in GDES is unidirectional: gradients computed for MLM update both $E_G$ and $E_D$, while gradients computed for RTD update only $E_D$ (through its residual part). A model using GDES converges faster and is as efficient as one using NES. GDES is an efficient weight-sharing method for language models pre-trained with MLM and RTD.
GDES is implemented by re-parameterizing the discriminator's token embedding:

$$E_D = \mathrm{sg}(E_G) + E_\Delta$$

where the stop-gradient operator $\mathrm{sg}$ prevents gradients from flowing into the generator embedding $E_G$, so that only the residual embedding $E_\Delta$ is updated. $E_\Delta$ is initialized as a zero matrix, and the model is trained following the NES procedure: in each iteration, the generator first produces the inputs for the discriminator, and the MLM loss is used to update $E_G$ (and hence $E_D$); the discriminator is then run on the generated inputs, and the RTD loss updates $E_D$ through $E_\Delta$, i.e. only $E_\Delta$ is updated. After training, $E_\Delta$ is added to $E_G$ and the resulting matrix is saved as the discriminator's embedding $E_D$.
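A minimal PyTorch sketch of this re-parameterization follows; the module name and shapes are illustrative, not the actual DeBERTaV3 code. The `detach()` call plays the role of the stop-gradient operator $\mathrm{sg}$, so RTD gradients reach only the residual embedding.

```python
# Sketch of gradient-disentangled embedding sharing: E_D = sg(E_G) + E_delta.
import torch
import torch.nn as nn

class GDESEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden: int, generator_embedding: nn.Embedding):
        super().__init__()
        self.generator_embedding = generator_embedding            # E_G, trained by MLM
        self.delta = nn.Parameter(torch.zeros(vocab_size, hidden))  # E_delta, init zero

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        e_g = self.generator_embedding(input_ids).detach()  # stop-gradient sg(E_G)
        return e_g + self.delta[input_ids]                  # E_D = sg(E_G) + E_delta
```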
The optimization of step S123 includes
(1) Data enhancement: to enhance the generalization ability of the model, data enhancement operations are employed, including random masking, random insertion of blanks, and vocabulary replacement, to reduce overfitting and increase the model's robustness to its input.
(2) Distributed training: and training is performed on a plurality of servers by adopting a distributed technology so as to obtain a globally optimal model, and training efficiency and expansibility are improved.
(3) Learning rate scheduling: the learning-rate schedule uses a cosine annealing algorithm that dynamically adjusts the learning rate over the course of training. The learning rate determines the step size of each parameter update; the schedule lets the model converge quickly while avoiding getting trapped in local optima, yielding better model performance. Cosine annealing reduces the learning rate following a cosine function, whose value first drops slowly, then rapidly, then slowly again; this decay pattern cooperates well with the learning rate and is cheap to compute. The principle of cosine annealing with warm restarts is:

$$\eta_t = \eta_{\min}^{i} + \frac{1}{2}\left(\eta_{\max}^{i} - \eta_{\min}^{i}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$

where $i$ is the index of the run; $\eta_{\max}^{i}$ and $\eta_{\min}^{i}$ are the maximum and minimum learning rates, defining the learning-rate range, and may be decreased after each restart; $T_{cur}$ is the number of epochs performed so far in the current run, updated after each batch; and $T_i$ is the total number of epochs in the $i$-th run.
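PyTorch ships this schedule directly as `CosineAnnealingWarmRestarts`. The sketch below assumes `optimizer`, `loader`, `train_step`, and `num_epochs` are already defined; the restart constants are illustrative.

```python
# Sketch: cosine annealing with warm restarts, matching the formula above.
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# T_0: epochs in the first run; eta_min: lower bound of the learning rate.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-6)

for epoch in range(num_epochs):
    for i, batch in enumerate(loader):
        train_step(batch)
        scheduler.step(epoch + i / len(loader))  # T_cur advances after each batch
```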
(4) Layer normalization (Layer Normalization): layer normalization normalizes the output of each hidden layer: the mean and variance of each hidden layer's output are computed and the output is normalized with learned scaling and shifting, effectively reducing covariate shift between layers, so the model converges more easily and generalizes better. In addition, model performance is optimized through a series of modifications: rearranging the order of layer normalization and residual connections to help avoid the accumulation of numerical errors, using a single linear layer for label prediction to simplify the model structure and improve computational efficiency, and replacing the ReLU activation with GELU to improve the representational capacity and performance of the model. These improvements let DeBERTaV3 exhibit better performance and efficiency across a variety of tasks.
(5) Optimization algorithm: the training process uses Adam optimization algorithm to minimize the loss function of the model. The algorithm automatically adjusts the learning rate of each parameter to better adapt to the gradient conditions of different parameters during the training process. Adam's algorithm combines the gradient accumulation of AdaGrad and the gradient square accumulation of RMSProp. Adam's algorithm adaptively adjusts the learning rate of the parameters by using the estimates to achieve more accurate updates in different parameter spaces.
The Adam update rule is:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

where the gradient $g_t$ is the partial derivative of the loss function with respect to $\theta_t$; $m_t$ is the momentum-style first-moment estimate of the gradient; $v_t$ is the momentum-style second-moment estimate; $\hat{m}_t$ is the bias-corrected first-moment estimate ($\beta_1^t$ being $\beta_1$ to the power $t$); $\hat{v}_t$ is the bias-corrected second-moment estimate; and the final line is the parameter-update formula.
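A direct NumPy transcription of these update equations follows, as a teaching sketch rather than a drop-in optimizer; in practice `torch.optim.Adam` is used.

```python
# One Adam step, mirroring the equations above.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)                  # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```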
Step S123 further includes monitoring the model pre-training process and displaying it visually, to facilitate analysis and adjustment of the model's hyper-parameters; the monitored indicators are the loss function, the validation-set metrics, the learning curve, and the parameter updates and learning rate, as described above.
The speech understanding generating model and the disease discriminating model are combined to create an end-to-end medical diagnostic system that can understand symptoms and infer possible diseases from the patient's speech description. The following is an embodiment of linking the two models:
(1) Audio-to-text: the patient's spoken description is converted to text using the speech understanding model Whisper-large-v2, ensuring that the system obtains text information from the patient's dictation.
(2) Text preprocessing: text obtained from the speech understanding generating model is preprocessed. Including removing noise, normalizing text, splitting text into sentences or paragraphs for subsequent processing.
(3) Feature extraction: useful features are extracted from the patient's speech text, including word embeddings (Word Embedding), TF-IDF (Term Frequency-Inverse Document Frequency) features, and other text representations. These features serve as inputs to the AI discrimination model.
(4) AI disease discrimination model: using the collected patient utterances and corresponding disease-label data, the AI discrimination model DeBERTaV3 is trained; it accepts text input and discriminates disease-related information, classifying the diseases or symptoms in the text and mapping the patient's utterance to the corresponding disease class.
(5) Model evaluation and optimization: and evaluating the link model to verify the accuracy of the link model on the disease discrimination task. And evaluating by using the test data set, and optimizing and adjusting the model according to the evaluation result so as to improve the accuracy and performance of the system.
(6) Real-time processing and deployment: real-time voice input processing is implemented so that the model can respond quickly to the patient's description and provide timely disease discrimination results; a Web service is constructed to process the voice input and return the discrimination result.
These technical details ensure that the overall system is fully evaluated and optimized in terms of performance and accuracy to meet the high demands of the medical field.
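As an illustration of steps (1) through (6), the following sketch links an off-the-shelf Whisper checkpoint to a text classifier using the Hugging Face transformers pipelines. The classifier checkpoint path is a placeholder for a DeBERTaV3 model fine-tuned on patient-utterance/disease-label data as described above; this is a linkage sketch, not the patent's exact implementation.

```python
# Sketch: Whisper transcribes the audio, a fine-tuned classifier scores diseases.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")
classifier = pipeline("text-classification",
                      model="path/to/finetuned-deberta-v3",  # hypothetical checkpoint
                      top_k=None)                            # return all class scores

def assess(audio_path: str):
    text = asr(audio_path)["text"]          # (1) audio -> text
    scores = classifier(text)[0]            # (4) text -> per-disease probabilities
    return text, sorted(scores, key=lambda s: s["score"], reverse=True)
```

Wrapping `assess` in a Web service gives the real-time deployment of step (6).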
Understanding and listing the corresponding possible diseases and their probabilities: the extracted feature vectors are passed to the pre-trained DeBERTaV3 model, whose output layer uses a softmax activation function. Softmax, also known as the normalized exponential function, presents multi-class results in the form of probabilities by mapping the outputs of multiple neurons into the interval (0, 1). Given an array $V$ with $V_i$ denoting its $i$-th element, the softmax value of that element is:

$$S_i = \frac{e^{V_i}}{\sum_{j} e^{V_j}}$$

The fully connected layer multiplies the input vector by a weight matrix and adds a bias, mapping $n$ real numbers in $(-\infty, +\infty)$ to $K$ real-valued scores in $(-\infty, +\infty)$; softmax then maps these $K$ scores to $K$ probabilities in $(0, 1)$ while ensuring that they sum to 1:

$$\hat{y} = \mathrm{softmax}\left(W_{n \times K}^{\top} x + b\right)$$

where $x$ is the input of the fully connected layer, $W_{n \times K}$ is the weight matrix, $b$ is the bias term, and $\hat{y}$ is the probability output of the softmax. The probability assigned to each category $j$ is:

$$p(y = j \mid x) = \frac{e^{w_j^{\top} x}}{\sum_{k=1}^{K} e^{w_k^{\top} x}}$$

where $w_j$ is the weight vector of the fully connected layer corresponding to category $j$.
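A numerically stable implementation of this softmax, with a small worked example (the scores are illustrative):

```python
# Softmax matching the formula above; subtracting the max avoids overflow.
import numpy as np

def softmax(v: np.ndarray) -> np.ndarray:
    e = np.exp(v - v.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw per-disease scores
print(softmax(scores))                # -> [0.659, 0.242, 0.099], which sums to 1
```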
Example 2
The medical auxiliary judgment system of this embodiment adopts the speech understanding AI model and the AI auxiliary discrimination model described in Embodiment 1. The speech understanding AI model converts the patient's spoken symptom description into text; after preprocessing, features are extracted from the text and sent to the AI auxiliary discrimination model. The AI auxiliary discrimination model receives the text input and discriminates disease-related information, classifying the diseases or symptoms in the text and mapping the patient's utterance to the corresponding disease classes; it outputs a score for each possible disease, the output scores are converted into a probability distribution with a softmax function, and based on the softmax result each possible disease is listed together with its corresponding probability.
The system further ranks the diseases by probability according to the probability distribution output by the model, taking the highest-probability diseases as the possible diseases, compares the probability values with a set threshold, and recommends follow-up measures according to the comparison result.
The following are the technical details:
(1) Setting a probability threshold: first, probability thresholds for the different diseases are determined; these are used to decide whether to take further examination measures. The thresholds are adjusted according to medical knowledge, actual data, and advice from medical professionals, and are set according to actual demand to determine which probability values count as high probability. For example, with the threshold set to 0.5, any disease with a probability of 0.5 or more is considered high probability; when the probability of a disease exceeds its threshold, further examination steps are considered (see the sketch after this list).
(2) Interpretation of results: and comparing the calculated probability value with a preset threshold according to probability distribution output by the model. Diseases with probabilities higher than a set threshold are listed as possible diseases.
(3) Checking and deciding: the diseases are ranked by probability so that the doctor or patient sees the most likely ones first. Based on the possible diseases and their probability values, the doctor decides on further examination measures using clinical experience and medical knowledge: high-probability diseases call for more detailed examination or further diagnosis, while low-probability diseases call for further exclusion.
(4) Patient communication: the possible diseases and their probability values are communicated to the patient, interpreting the possible diagnostic results and suggested examination measures. This aids the patient in understanding and participating in the decision making process.
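A minimal sketch of this threshold-and-rank logic; the disease names and the 0.5 threshold are illustrative only.

```python
# Rank diseases by probability and flag those above the threshold for follow-up.
THRESHOLD = 0.5

def interpret(probabilities: dict[str, float], threshold: float = THRESHOLD):
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    flagged = [(d, p) for d, p in ranked if p >= threshold]
    return ranked, flagged

ranked, flagged = interpret({"influenza": 0.62, "common cold": 0.25, "pneumonia": 0.13})
# ranked lists all candidates for the doctor; flagged ("influenza" here)
# would trigger the further examination measures described below.
```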
Writing the conclusion: a diagnostic result is provided, the patient's suspected diseases are discriminated, and the patient is given a plan for further examination:
(1) Formulating examination measures: for diseases whose probability is above the threshold, appropriate examination measures are formulated, including: recommending specific types of laboratory or imaging examinations; recommending that the patient consult specialists, e.g., internists or oncologists; and scheduling more detailed clinical evaluations or follow-up.
(2) Writing a conclusion: and drawing a conclusion according to the probability value and the adopted checking measures. According to the high probability of the disease, corresponding examination measures are proposed, and the conclusion clearly indicates that: which diseases are identified as likely to be present; suggested checking measures for each possibility; a follow-up schedule or follow-up plan is required.
The system of this embodiment deploys the trained model as a service behind an API interface using a Web framework. The client sends data to the API interface via HTTP requests, the model processes the data, and the result is returned to the client. This approach makes the model available as a service for other applications to invoke and is applicable to various application scenarios.
Data input and output of the system of the present embodiment: in an application, specific text data is sent as input to the model by calling the API interface of the model. The model processes the input data and outputs the generated text or answer. And the output result is returned to the application program through the API interface and is displayed or applied on the mobile application or the website.
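As one concrete possibility, the sketch below exposes the discrimination model over HTTP with FastAPI; the framework choice, route name, and the reuse of the `classifier` object from the earlier pipeline sketch are assumptions, not the patent's prescribed stack.

```python
# Minimal HTTP wrapper: the client POSTs text, the model returns ranked scores.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    text: str                                  # preprocessed symptom description

@app.post("/discriminate")
def discriminate(query: Query):
    # `classifier` is assumed loaded at startup (see the pipeline sketch above).
    scores = classifier(query.text)[0]
    return {"predictions": sorted(scores, key=lambda s: s["score"], reverse=True)}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```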
Model optimization of the system of the present embodiment: in the application, the model is optimized for improving the response speed and accuracy of the model. The model optimization technology comprises the following steps:
(1) And (3) outputting a result by the cache model: and by caching the output result of the model, repeated calculation is avoided, and the efficiency and response speed of the model are improved.
(2) Accelerating computations using a GPU: by using a Graphics Processing Unit (GPU) to accelerate the computation of the model, the speed of reasoning is significantly increased. GPUs have excellent performance in parallel computing, and are particularly effective in a large number of matrix computing, convolution operations, or parallel processing tasks.
(3) Deep model compression: the size and computational cost of the model are reduced through model compression techniques, including pruning, quantization, and distillation, to improve the performance and efficiency of the model (a quantization sketch follows this list).
(4) Parallelization and distributed computing: parallel computing and distributed computing are utilized to expedite the training and reasoning process. By distributing computing tasks over multiple computing nodes, computing efficiency and throughput are improved.
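For example, post-training dynamic quantization in PyTorch compresses the linear layers to 8-bit integers, shrinking the model and speeding up CPU inference; `model` is an assumed, already-loaded model, and this is one illustrative technique among those listed.

```python
# Sketch: dynamic quantization of linear layers to int8.
import torch

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)
```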
The above are the technical details of the deployment and application of the system model in this embodiment; other related technologies and methods, such as load balancing, containerized deployment, and automatic scaling, may be considered according to specific service and technical requirements, to meet higher demands of efficiency, reliability, and stability.
The system also includes DeBERTaV3-based agent technology, a key component of the agent system for interacting with users and executing tasks. The technical details are as follows:
(1) Natural Language Processing (NLP) capability: the agent has strong natural language processing capability; it parses the user's instructions, questions, or requirements, performs the corresponding operations, and generates replies according to the user input.
(2) Editable workflow mechanism: the user creates a flow chart or a flow chart of the task through a graphical interface interaction mode. Setting the functions of condition, logic jump, result processing and the like to construct the flow and logic of the task and provide more visual task flow design and control modes. Editable workflow mechanisms allow users to define logical controls such as conditional statements and jumps. The user decides on different paths and operations of the task flow and defines the result data that needs to be recovered during the task execution. For example, data generated at a certain executing node is used in subsequent nodes. In this way, the user can process and utilize the result data, supporting the consistency and integrity of task execution.
(3) Large language model and workflow integration: DeBERTaV3's natural language processing capability parses the user's instructions, questions, or requirements and generates responses accordingly. At the appropriate node in the workflow, the user sends a Prompt, an input describing the user's request or question, to the large language model; through this Prompt, the workflow guides the model to generate text or answers relevant to the task requirements. During task execution, optimizing Prompt transmission and result-data recycling speeds up task completion. In addition, the DeBERTaV3 agent supports result recycling and secondary processing: after the model generates results, the workflow uses variables and data channels to pass input data and save the generated results for subsequent or secondary processing. In later nodes of the workflow, the text or answers generated by the model serve as input data for further operations, including reviewing and modifying the generated text or integrating the output with other steps or data. Through this secondary processing, the user further refines the generated result, and the workflow fully exploits the model's generation capability so that the output meets the task requirements and the expected result.
(4) Review markup and flow control: the results generated by the model are automatically evaluated and labeled using predefined review rules and algorithms, including grammar checking, sentiment analysis, and keyword matching. Each generated result is marked as pass or fail according to the review rules, and the automatic review tags are combined with automatic flow guidance: depending on the evaluation result, an appropriate branch path is selected or a specific operation is performed. For example, if the review tag is pass, flow control directs the task to continue executing; if the tag is fail, flow control can return to a particular node for secondary processing or modification. Finally, a user-interactive control mechanism is introduced into the workflow so that the user can participate directly in review tagging and flow control: the user manually reviews the generated result and provides feedback, and flow control proceeds according to the user's feedback and judgment. This increases flexibility and allows human intervention in both the task flow and the generated results. The flow control function lets the user set control flows in the workflow based on review tags and other conditions; according to the review result, flow control guides the workflow to select different paths or execute specific operations (see the sketch after this subsection).
(5) User feedback and iteration: the DeBERTaV3 agent can process user feedback and iterate, learning how satisfied the user is with the generated results and adjusting and improving the model accordingly, so as to gradually raise the quality of the generated results and meet user needs.
Through these mechanisms, the DeBERTaV3 agent technology realizes interaction with the user, task execution, and result processing, and can iterate and improve according to user feedback and requirements, so it effectively provides higher-quality, customized generation results.
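As a concrete illustration of items (2)-(4), the following is a toy sketch: the `generate` function is a trivial stand-in for a DeBERTaV3-backed generation service, a length check stands in for the predefined review rules, and the node names and control loop are invented for illustration.

```python
# Toy editable workflow: nodes recycle result data, a prompt node calls the
# model, and a review node's pass/fail tag drives flow control.
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Context:
    data: Dict[str, str] = field(default_factory=dict)  # result data shared between nodes

def generate(prompt: str) -> str:
    # Placeholder for the DeBERTaV3-backed generation service (item (3)).
    return f"draft answer for: {prompt}"

def prompt_node(ctx: Context) -> str:
    # Send the Prompt to the model and recycle the result into the workflow.
    ctx.data["answer"] = generate(ctx.data["question"])
    return "review"

def review_node(ctx: Context) -> str:
    # Predefined review rule; a length check stands in for grammar checking,
    # sentiment analysis, keyword matching, etc. (item (4)).
    passed = len(ctx.data["answer"]) > 10
    return "finish" if passed else "revise"  # the review tag selects the branch

def revise_node(ctx: Context) -> str:
    # Secondary processing: modify the input and loop back (items (2)-(3)).
    ctx.data["question"] += " (please elaborate)"
    return "prompt"

NODES: Dict[str, Callable[[Context], str]] = {
    "prompt": prompt_node,
    "review": review_node,
    "revise": revise_node,
}

ctx = Context({"question": "What do these symptoms suggest?"})
node = "prompt"
while node != "finish":  # automatic flow guidance over the node graph
    node = NODES[node](ctx)
print(ctx.data["answer"])
```

Here the review node's pass/fail tag selects the next node, mirroring the automatic flow guidance described in item (4).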
The foregoing examples merely illustrate certain embodiments of the invention in detail and are not to be construed as limiting its scope; it should be noted that a person skilled in the art may make several variations and modifications without departing from the concept of the invention, all of which fall within the scope of protection of the invention; accordingly, the scope of protection of the present invention is determined by the appended claims.

Claims (10)

1. An AI-based medical auxiliary discrimination method, characterized in that the method comprises the following steps:
S100, generating a speech understanding AI model and an AI auxiliary discrimination model;
S200, linking the speech understanding AI model and the AI auxiliary discrimination model;
S300, inputting the patient's spoken symptom description to the speech understanding AI model, which converts it and outputs the result to the AI auxiliary discrimination model;
S400, the AI auxiliary discrimination model processing the input and outputting the corresponding result;
S500, continuously evaluating and adjusting the models.
2. The AI-based medical assistance discriminating method according to claim 1, wherein: the generating of the speech understanding AI model in step S100 specifically comprises:
S111, data collection and dataset construction;
S112, data preprocessing;
S113, selecting a pre-training language model;
S114, targeted large-scale training and optimization.
3. The AI-based medical assistance discriminating method according to claim 2, wherein:
step S111 is specifically to construct a dataset from audio paired with text transcriptions on the Internet, the dataset covering a broad audio distribution drawn from many different environments, recording setups, speakers, and languages;
step S112 is specifically to use heuristic automatic screening to improve transcription quality, detecting and removing machine-generated transcripts from the training dataset, and to use an audio language detector to ensure that the spoken language matches the transcript language as determined by CLD2; it also includes fuzzy deduplication of the transcribed text to reduce duplicated and automatically generated content in the training dataset.
4. The AI-based medical assistance discriminating method according to claim 3, wherein: the pre-training language model in step S113 is Whisper-large-v2; the targeted large-scale training in step S114 is to pass all tasks and conditioning information to the decoder as a sequence of input tokens, train with data parallelism across accelerators, train the model with the Adam optimizer and gradient norm clipping, use a linear learning rate decaying to zero after a warm-up over the first 2048 updates, and train the model for 2^20 updates, completing traversal of the dataset.
5. The AI-based medical assistance discriminating method of claim 3, wherein: the generation of the AI auxiliary discrimination model in step S100 specifically comprises:
S121, dividing the dataset into a training set, a validation set, and a test set;
S122, selecting DeBERTaV3 as the pre-training language model;
S123, model training and optimization.
6. The AI-based medical assistance discriminating method according to claim 4, wherein: step S122 is specifically to use an ELECTRA-style pre-training method: given a sequence $X=\{x_i\}$, it is corrupted into $\tilde{X}$ by randomly masking 15% of its tokens, and a language model parameterized by $\theta$ is then trained to reconstruct $X$ by predicting the masked tokens $\tilde{x}$ conditioned on $\tilde{X}$:

$$L_{\mathrm{MLM}}=\mathbb{E}\left(-\sum_{i\in C}\log p_{\theta}\left(\tilde{x}_{i}=x_{i}\mid\tilde{X}\right)\right)\tag{1}$$

wherein $C$ is the set of indices of the masked tokens in the sequence;

ELECTRA is trained in GAN style with two Transformer encoders: one, called the generator, is trained by MLM; the other, called the discriminator, is trained by a token-level binary classifier; the generator generates ambiguous tokens to replace the masked tokens in the input sequence, the modified input sequence is then fed to the discriminator, and the binary classifier in the discriminator must determine whether each corresponding token is an original token or a token replaced by the generator; $\theta_{G}$ and $\theta_{D}$ denote the parameters of the generator and the discriminator respectively; the training objective of the discriminator is called RTD (Replaced Token Detection); the loss function of the generator is written as:

$$L_{\mathrm{MLM}}=\mathbb{E}\left(-\sum_{i\in C}\log p_{\theta_{G}}\left(\tilde{x}_{i,G}=x_{i}\mid\tilde{X}_{G}\right)\right)\tag{2}$$

wherein $\tilde{X}_{G}$ is the input to the generator, generated by randomly masking 15% of the tokens in $X$; the input sequence of the discriminator is constructed by replacing the masked tokens with new tokens sampled according to the output probabilities of the generator:

$$\tilde{x}_{i,D}=\begin{cases}\tilde{x}_{i}\sim p_{\theta_{G}}\left(\tilde{x}_{i,G}=x_{i}\mid\tilde{X}_{G}\right),&i\in C\\x_{i},&i\notin C\end{cases}\tag{3}$$

the loss function of the discriminator is:

$$L_{\mathrm{RTD}}=\mathbb{E}\left(-\sum_{i}\log p_{\theta_{D}}\left(\mathbb{1}\left(\tilde{x}_{i,D}=x_{i}\right)\mid\tilde{X}_{D},i\right)\right)\tag{4}$$

wherein $\mathbb{1}(\cdot)$ is the indicator function and $\tilde{X}_{D}$ is the input to the discriminator constructed by equation (3); in ELECTRA, $L_{\mathrm{MLM}}$ and $L_{\mathrm{RTD}}$ are optimized jointly, i.e. $L=L_{\mathrm{MLM}}+\lambda L_{\mathrm{RTD}}$, where $\lambda$ is the weight of the discriminator loss $L_{\mathrm{RTD}}$, set to 50 in ELECTRA (a toy implementation sketch of this joint objective follows the claims).
7. The AI-based medical assistance discriminating method according to claim 4, wherein: the optimization in step S123 includes:
(1) data augmentation, including operations such as random masking, random insertion of blanks, and vocabulary replacement, to reduce overfitting and increase the model's robustness to its input;
(2) distributed training, performed across multiple servers using distributed techniques, to obtain a globally optimal model and improve training efficiency and scalability;
(3) learning rate scheduling, using a cosine annealing schedule to dynamically adjust the learning rate over the course of training; the learning rate determines the step size of each parameter update, allowing the model to converge quickly while avoiding local optima, thereby achieving better model performance;
(4) layer normalization, normalizing the output of each hidden layer by computing its mean and variance and then rescaling and shifting the normalized output, effectively reducing covariate shift between layers so that the model converges more easily and generalizes better;
(5) an optimization algorithm, using the Adam optimization algorithm to minimize the model's loss function.
8. The AI-based medical assistance discriminating method according to claim 7, wherein: step S123 further includes monitoring the model pre-training process and displaying it visually, to facilitate analysis and adjustment of the model's hyper-parameters, the monitored indicators being as follows:
(1) Loss function: cross-entropy loss is used; for a single class $k$ in the multi-label classification task, the cross-entropy loss function is:

$$L_{k}=-\left[y_{k}\log\hat{y}_{k}+\left(1-y_{k}\right)\log\left(1-\hat{y}_{k}\right)\right]$$

where $y_{k}$ is the ground-truth label and $\hat{y}_{k}$ the predicted probability for class $k$; the total cross-entropy is the sum of the cross-entropy of each class in the multi-label classification task;
(2) Validation set indicators: the model's performance and generalization are evaluated on a validation set independent of the training set; after each training round, indicators (accuracy, recall, and F1 score) are computed on the validation set to gauge performance on unseen data, and by monitoring changes in these indicators one can judge whether overfitting occurs or apply strategies such as early stopping;
(3) Learning curves: the learning curve shows how the model's performance on the training and validation sets changes with the training round; comparing the two curves reveals how well the model fits and the degree of overfitting or underfitting;
(4) Parameter updates and learning rate: the speed and magnitude of parameter updates and the decay of the learning rate are observed to ensure convergence and stability of the model.
9. An AI-based medical assistance discriminating system, characterized in that: the system adopts the speech understanding AI model and the AI auxiliary discrimination model of claim 1; the speech understanding AI model converts the patient's spoken symptom description into text; the text is preprocessed and features are extracted before being sent to the AI auxiliary discrimination model; the AI auxiliary discrimination model receives the text input and discriminates disease-related information, classifying the diseases or symptoms in the text and mapping the patient's speech to corresponding disease classes; the model outputs a score for each possible disease, the output scores are converted into a probability distribution using a softmax function, and each possible disease is listed with its corresponding probability based on the softmax result (a ranking sketch follows the claims).
10. The AI-based medical assistance discriminating system of claim 9, wherein: the system further comprises ranking the candidate diseases according to the probability distribution output by the model, taking the highest-probability disease as the most likely disease, comparing its probability value with a set threshold, and recommending follow-up measures according to the comparison result.
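For claim 6, the following toy sketch illustrates the ELECTRA-style joint objective of equations (1)-(4) in PyTorch; the vocabulary size, tensor shapes, and single-layer "encoders" are deliberately simplified stand-ins, not the DeBERTaV3 architecture.

```python
# Toy ELECTRA-style pre-training step: MLM generator loss (eq. 2), sampled
# replacements (eq. 3), RTD discriminator loss (eq. 4), joint objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, SEQ, BATCH, MASK_RATE, LAMBDA = 1000, 64, 32, 8, 0.15, 50.0
MASK_ID = 0  # token id standing in for [MASK]

generator = nn.Sequential(nn.Embedding(VOCAB, HIDDEN), nn.Linear(HIDDEN, VOCAB))
discriminator = nn.Sequential(nn.Embedding(VOCAB, HIDDEN), nn.Linear(HIDDEN, 1))

x = torch.randint(1, VOCAB, (BATCH, SEQ))       # original sequence X
mask = torch.rand(BATCH, SEQ) < MASK_RATE       # index set C of masked positions
x_tilde_g = x.masked_fill(mask, MASK_ID)        # generator input (15% masked)

logits_g = generator(x_tilde_g)
l_mlm = F.cross_entropy(logits_g[mask], x[mask])  # eq. (2): generator MLM loss

with torch.no_grad():  # sampling is detached, as in ELECTRA (no GAN gradient)
    sampled = torch.distributions.Categorical(logits=logits_g).sample()
x_tilde_d = torch.where(mask, sampled, x)       # eq. (3): discriminator input

is_original = (x_tilde_d == x).float()          # targets 1(x~_{i,D} = x_i)
logits_d = discriminator(x_tilde_d).squeeze(-1)
l_rtd = F.binary_cross_entropy_with_logits(logits_d, is_original)  # eq. (4)

loss = l_mlm + LAMBDA * l_rtd                   # joint objective L = L_MLM + λ·L_RTD
loss.backward()
```

Because the replacement tokens are sampled without gradient, the generator learns only through its MLM loss, matching the joint (non-adversarial) optimization described in claim 6.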
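For claims 9 and 10, the following sketch shows the score-to-probability conversion with softmax, the ranking, and the threshold comparison; the disease names and the threshold value are invented placeholders.

```python
# Sketch of claims 9-10: softmax over raw scores, probability ranking, and a
# threshold-based follow-up recommendation. Labels and threshold are invented.
import torch
import torch.nn.functional as F

DISEASES = ["common cold", "influenza", "pneumonia", "bronchitis"]
THRESHOLD = 0.5

scores = torch.tensor([1.2, 3.4, 0.3, -0.5])  # one raw model score per disease
probs = F.softmax(scores, dim=0)              # scores -> probability distribution

ranked = sorted(zip(DISEASES, probs.tolist()), key=lambda p: p[1], reverse=True)
for name, p in ranked:
    print(f"{name}: {p:.3f}")

top_name, top_p = ranked[0]
if top_p >= THRESHOLD:  # claim 10: compare against the set threshold
    print(f"Most likely condition: {top_name}; recommend targeted follow-up examination.")
else:
    print("No dominant candidate; recommend broader clinical evaluation.")
```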
CN202311536875.9A 2023-11-17 2023-11-17 Medical auxiliary judging method and system based on AI Pending CN117476215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311536875.9A CN117476215A (en) 2023-11-17 2023-11-17 Medical auxiliary judging method and system based on AI

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311536875.9A CN117476215A (en) 2023-11-17 2023-11-17 Medical auxiliary judging method and system based on AI

Publications (1)

Publication Number Publication Date
CN117476215A true CN117476215A (en) 2024-01-30

Family

ID=89623650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311536875.9A Pending CN117476215A (en) 2023-11-17 2023-11-17 Medical auxiliary judging method and system based on AI

Country Status (1)

Country Link
CN (1) CN117476215A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117954134A (en) * 2024-03-26 2024-04-30 北京大学第三医院(北京大学第三临床医学院) Patient health monitoring and intervention system based on large language model
CN117958765A (en) * 2024-04-01 2024-05-03 华南理工大学 Multi-mode voice viscera organ recognition method based on hyperbolic space alignment


Similar Documents

Publication Publication Date Title
US11790264B2 (en) Systems and methods for performing knowledge distillation
CN117476215A (en) Medical auxiliary judging method and system based on AI
Hasib et al. Depression detection from social networks data based on machine learning and deep learning techniques: An interrogative survey
Li et al. Deep Bayesian Gaussian processes for uncertainty estimation in electronic health records
US20190370383A1 (en) Automatic Processing of Ambiguously Labeled Data
García-Ordás et al. Sentiment analysis in non-fixed length audios using a Fully Convolutional Neural Network
Zörgő et al. Exploring the effects of segmentation on semi-structured interview data with epistemic network analysis
US11182411B2 (en) Combined data driven and knowledge driven analytics
US20200104733A1 (en) Generation of human readable explanations of data driven analytics
Lin et al. A deep learning-based model for detecting depression in senior population
Yadav et al. A novel automated depression detection technique using text transcript
Júnior et al. A natural language understanding model COVID-19 based for chatbots
Agarla et al. Semi-supervised cross-lingual speech emotion recognition
Zaghir et al. Real-world patient trajectory prediction from clinical notes using artificial neural networks and UMLS-based extraction of concepts
Amarasinghe et al. Generative pre-trained transformers for coding text data? An analysis with classroom orchestration data
CN113658688B (en) Clinical decision support method based on word segmentation-free deep learning
Krstev et al. Multimodal data fusion for automatic detection of alzheimer’s disease
Harati et al. Generalization of deep acoustic and NLP models for large-scale depression screening
WO2021247455A1 (en) Image and data analytics model compatibility regulation methods
Du et al. An operational deep learning pipeline for classifying life events from individual tweets
Cao et al. A Policy for Early Sequence Classification
Thushari et al. Identifying discernible indications of psychological well-being using ML: explainable AI in reddit social media interactions
Akyol New chaos-integrated improved grey wolf optimization based models for automatic detection of depression in online social media and networks
CN115495572B (en) Auxiliary management method for depressed emotion based on compound emotion analysis
US20230320642A1 (en) Systems and methods for techniques to process, analyze and model interactive verbal data for multiple individuals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination