CN117457237A - Auxiliary inquiry method and system based on medical entity extraction - Google Patents

Auxiliary inquiry method and system based on medical entity extraction

Publication number
CN117457237A
Authority
CN
China
Prior art keywords
medical
medical inquiry
user
text content
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202311360225.3A
Other languages
Chinese (zh)
Inventor
周雨校
陈钇名
马一鸣
王彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mita Carbon Hangzhou Intelligent Technology Co ltd
Original Assignee
Mita Carbon Hangzhou Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H80/00 ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Public Health (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses an auxiliary inquiry method and system based on medical entity extraction. Starting from a general large language training model, a medical inquiry field combination is acquired and used as input to train the model. Medical inquiry answer pairs are then obtained and converted into medical inquiry answer pair vectors, from which a department knowledge base is established. Based on the similarity between the user question vector and the symptom field vectors in the department knowledge base, the symptom field vector most similar to the user's question is obtained, and with it the corresponding solution vector. A final solution vector is then obtained with the trained large language training model, yielding an answer to the user's question and professional diagnosis and treatment advice. The invention allows users to obtain professional, up-to-date medical advice at any time and place in daily life even when doctors' time is limited, and the accuracy of the online machine inquiry greatly reduces the probability of giving the user wrong medical guidance.

Description

Auxiliary inquiry method and system based on medical entity extraction
Technical Field
The invention relates to the technical field of medical health care informatics, in particular to an auxiliary inquiry method and system based on medical entity extraction.
Background
With the rapid development of the internet and artificial intelligence technology, medical informatization is becoming a major trend in the medical field. Patients often experience a range of day-to-day physical changes and symptoms during an illness, and these discomforts often mean they need timely medical advice and counseling, so demand for online inquiry services in the medical field keeps growing. However, the conventional online inquiry method has several problems. First, because doctors' time is limited, patients' questions often cannot be answered promptly, so patients cannot obtain medical advice when they urgently need it. Second, during an inquiry, users who lack professional knowledge may be unable to describe their symptoms or current problems accurately; a traditional online inquiry system that relies on such non-professional descriptions can lead the doctor to misjudge the disease and impair the treatment effect.
Disclosure of Invention
The invention provides an auxiliary consultation method and system based on medical entity extraction, which are used for overcoming the technical problems.
In order to achieve the above object, the technical scheme of the present invention is as follows:
An auxiliary inquiry method based on medical entity extraction comprises the following steps:
S1: acquiring an original medical inquiry data set; the data in the original medical inquiry data set is unstructured text data;
S2: preprocessing the original medical inquiry data set to obtain a preprocessed medical inquiry data set;
S3: acquiring a medical inquiry field combination based on a universal large language training model according to the preprocessed medical inquiry data set; the medical inquiry field combination is a field combination with the same prompt words and order as those preset;
S4: training the universal large language training model according to the medical inquiry field combination, adjusting the prefix p of each element in the medical inquiry field combination based on a loss function, to obtain a trained large language training model;
S5: acquiring medical inquiry answer pairs according to the preprocessed medical inquiry data set and the trained large language training model, so as to extract structured data from the medical inquiry content;
S6: obtaining medical inquiry answer pair vectors according to the medical inquiry answer pairs, so as to establish a department knowledge base based on a third-party vector library; the department knowledge base comprises symptom field vectors and solution vectors corresponding to the symptom field vectors;
S7: acquiring a user question vector according to the user's question and based on the trained large language training model;
S8: obtaining the similarity between the user question vector and the symptom field vectors in the department knowledge base, to obtain the symptom field vector with the highest similarity to the user's question and, from it, the solution vector for the user's question;
S9: acquiring a final solution vector according to the symptom field vector with the highest similarity to the user's question and the solution vector for the user's question, based on the trained large language training model, so as to obtain an answer to the user's question and provide professional diagnosis and treatment advice to the user.
Further, the method for acquiring the preprocessed medical inquiry data set is as follows:
S21: performing text recognition on the original medical inquiry data set to obtain the text content list[i] of the i-th page, i = 0, 1, …, I, and from it the text content list[i][k] of the k-th line of the i-th page; wherein I is the total number of pages of the text and k is the text line number within the i-th page;
S22: obtaining the average similarity of the k-th line across all pages from the text content list[i][k] of the k-th line of each page, and acquiring the first text content of the i-th page, which is the text content of the i-th page after processing based on the average similarity;
S23: acquiring, from the first text content of the i-th page, the length of its k1-th line, to obtain the second text content of the i-th page;
the second text content of the i-th page is the text content of the i-th page after processing based on the length of the k1-th line of its first text content; where k1 is the line number within the first text content of the i-th page.
Further, the average similarity of the kth line of all pages is obtained as follows:
wherein: $\mathrm{average}_{\mathrm{score}(k)}$ represents the average similarity of the k-th line across all pages; I is the total number of pages; i and j are text page numbers; similarity(list[i][k], list[j][k]) represents the similarity between list[i][k] and list[j][k]; list[i][k] represents the text content of the k-th line of the i-th page; list[j][k] represents the text content of the k-th line of the j-th page.
Further, the method for acquiring the first text content of the ith page is as follows:
when the average similarity of the k-th line across all pages is greater than the similarity threshold, the first text content list[i'] of the i-th page is list[i] with the k-th line list[i][k] removed;
otherwise, the first text content of the i-th page is list[i'] = list[i];
wherein: list[i'] is the first text content of the i-th page; list[i'][k1] represents the k1-th line of the first text content of the i-th page.
Further, the method for acquiring the second text content of the ith page is as follows:
$$
\mathrm{list}[i''] =
\begin{cases}
\mathrm{list}[i'] \setminus \{\mathrm{list}[i'][k1]\}, & \mathrm{num}_i\{L_{i,k1} > L_D\} > \mathrm{num}_D \\
\mathrm{list}[i'], & \text{otherwise}
\end{cases}
$$
wherein: list[i''] represents the second text content of the i-th page; list[i'] is the first text content of the i-th page; list[i'][k1] represents the k1-th line of the first text content of the i-th page; $L_{i,k1}$ represents the length of the k1-th line of the first text content of the i-th page; $L_D$ represents the set text length threshold; $\mathrm{num}_i\{L_{i,k1} > L_D\}$ represents the number of pages whose first text content has a line length greater than the set text length threshold; $\mathrm{num}_D$ represents the set page number threshold.
An auxiliary inquiry system implementing the auxiliary inquiry method based on medical entity extraction, comprising: a medical inquiry data set acquisition module, a data preprocessing module, a medical inquiry entity extraction model fine adjustment module, a medical inquiry entity automatic extraction module, a knowledge base text vectorization module, a user question vectorization module, a question similarity matching module and an answer generation module;
the medical inquiry data set acquisition module is used for acquiring an original medical inquiry data set; the data in the original medical inquiry data set is unstructured text data;
the data preprocessing module is used for preprocessing an original medical inquiry data set to obtain a preprocessed medical inquiry data set;
the medical inquiry entity extraction module is used for acquiring a medical inquiry field combination based on a universal large language training model according to the preprocessed medical inquiry data set; the medical inquiry field combination is a field combination with the same preset prompting words and sequence;
the medical inquiry entity extraction model fine adjustment module is used for training a general large language training model according to the medical inquiry field combination so as to adjust the prefix p of each element in the medical inquiry field combination of the model based on a loss function and acquire a trained large language training model;
the automatic extraction module of the medical inquiry entity is used for acquiring a medical inquiry answer pair according to the medical inquiry data set and the trained large language training model so as to extract medical inquiry contents;
the knowledge base text vectorization module is used for acquiring a medical inquiry answer pair vector according to the medical inquiry answer pair so as to acquire a department knowledge base based on a third party vector library; the department knowledge base comprises symptom field vectors and solution vectors corresponding to the symptom field vectors;
the user question vectorization module is used for acquiring a user question vector according to the user question and based on the trained large language training model;
the question similarity matching module is used for obtaining the similarity between the user question vector and the symptom field vectors in the department knowledge base, so as to obtain the symptom field vector with the highest similarity to the user's question and, from it, the solution vector for the user's question;
the answer generation module is used for obtaining a final solution vector according to the symptom field vector with highest similarity with the user's question and the solution vector of the user's question and based on the trained large language training model.
Beneficial effects: the auxiliary inquiry method and system based on medical entity extraction convert the original medical inquiry information into a structured, preprocessed medical inquiry data set; acquire a medical inquiry field combination based on a general large language training model and use it to train that model; and, combining the user's question with the department knowledge base, obtain a final solution vector for the user's question from the trained large language training model, thereby obtaining an answer to the question and providing professional diagnosis and treatment advice to the user. The invention allows users with high demand to obtain professional, up-to-date medical advice at any time and place in daily life even when doctors' time is limited, and the accuracy of the online machine inquiry greatly reduces the probability of giving the user wrong medical guidance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it will be obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of the auxiliary inquiry method based on medical entity extraction according to the present invention;
FIG. 2 is a signal flow diagram of the auxiliary inquiry system according to an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment provides an auxiliary inquiry method based on medical entity extraction, as shown in fig. 1, comprising the following steps:
S1: acquiring an original medical inquiry data set; the data in the original medical inquiry data set is unstructured text data;
Specifically, the original medical inquiry data set is acquired according to the subject information of different departments (such as obstetrics, orthopedics and internal medicine). The data in the original medical inquiry data set are unstructured text data, including actual clinical case data from each hospital department, publicly available online medical inquiry data, and unstructured text from clinical medical books, papers and the like.
S2: preprocessing the obtained unstructured original medical inquiry data set to obtain a preprocessed medical inquiry data set;
S21: performing text recognition on the original medical inquiry data set to obtain the text content list[i] of the i-th page, i = 0, 1, …, I, and from it the text content list[i][k] of the k-th line of the i-th page; wherein I is the total number of pages of the text and k is the text line number within the i-th page;
S22: obtaining the average similarity of the k-th line across all pages from the text content list[i][k] of the k-th line of each page, and acquiring the first text content of the i-th page, which is the text content of the i-th page after processing based on the average similarity;
wherein: $\mathrm{average}_{\mathrm{score}(k)}$ represents the average similarity of the k-th line across all pages; I is the total number of pages; i and j are text page numbers; similarity(list[i][k], list[j][k]) represents the similarity between list[i][k] and list[j][k]; list[i][k] represents the text content of the k-th line of the i-th page; list[j][k] represents the text content of the k-th line of the j-th page; any similarity measure may be used here, such as Jaccard similarity or cosine similarity;
the first text content is acquired as follows:
when the average similarity of the k-th line across all pages is greater than the similarity threshold, the first text content list[i'] of the i-th page is list[i] with the k-th line list[i][k] removed;
otherwise, the first text content of the i-th page is list[i'] = list[i];
S23: acquiring, from the first text content of the i-th page, the length of its k1-th line, to obtain the second text content of the i-th page; the second text content of the i-th page is the text content of the i-th page after processing based on the length of the k1-th line of its first text content; wherein k1 is the line number within the first text content of the i-th page; obtaining the length is a conventional technique in the art and is not described here.
The method for acquiring the second text content of the ith page is as follows:
$$
\mathrm{list}[i''] =
\begin{cases}
\mathrm{list}[i'] \setminus \{\mathrm{list}[i'][k1]\}, & \mathrm{num}_i\{L_{i,k1} > L_D\} > \mathrm{num}_D \\
\mathrm{list}[i'], & \text{otherwise}
\end{cases}
$$
wherein: list[i''] represents the second text content of the i-th page; list[i'] is the first text content of the i-th page; list[i'][k1] represents the k1-th line of the first text content of the i-th page; $L_{i,k1}$ represents the length of the k1-th line of the first text content of the i-th page; $L_D$ represents the set text length threshold; $\mathrm{num}_i\{L_{i,k1} > L_D\}$ represents the number of pages whose first text content has a line length greater than the set text length threshold; $\mathrm{num}_D$ represents the set page number threshold;
Specifically, text recognition is first performed on content such as pdf and jpg files, either by text extraction with tools such as pdfplumber and pdfminer or by OCR, so that the original, uncleaned text information is extracted. The extracted text contains disordered paragraphs and interfering elements such as headers and footers, which make the text incoherent and adversely affect subsequent text extraction. Further processing is therefore required, so that text running across lines is not interrupted by headers and footers.
Header and footer removal: headers and footers are removed to obtain coherent text information, using a combined method based on similarity and length judgment. Headers and footers in a document typically follow a fixed pattern and appear at the beginning and end of each page. Using this feature, headers and footers are identified by measuring the similarity of the text content across pages, particularly the first and last lines. The text of each page is first split out separately, so that the text content of each page is an element of a list, i.e. [list[0], list[1], list[2], …, list[I]], where the k-th line of the i-th page is expressed as list[i][k]. The similarity of the k-th line across all pages of the pdf is calculated by summing the similarities of the k-th lines between pages and then averaging to obtain the average similarity $\mathrm{average}_{\mathrm{score}(k)}$. If the average similarity exceeds the similarity threshold, the line is considered a header or footer and is removed to obtain the first text content.
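The similarity check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes character-level Jaccard similarity, averaging over unordered pairs of pages, and a hypothetical threshold of 0.8, none of which the source fixes. The names `jaccard`, `average_score` and `strip_repeated_line` are illustrative.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Character-level Jaccard similarity between two lines (an assumed metric)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def average_score(pages: list[list[str]], k: int) -> float:
    """Mean pairwise similarity of line k over all pages that have such a line."""
    lines = [page[k] for page in pages if -len(page) <= k < len(page)]
    pairs = list(combinations(lines, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def drop_line(page: list[str], k: int) -> list[str]:
    """Return a copy of the page without line k (k may be negative, e.g. -1)."""
    if not -len(page) <= k < len(page):
        return list(page)
    idx = k % len(page)
    return page[:idx] + page[idx + 1:]


def strip_repeated_line(pages: list[list[str]], k: int,
                        threshold: float = 0.8) -> list[list[str]]:
    """Remove line k from every page when its average similarity exceeds the threshold."""
    if average_score(pages, k) <= threshold:
        return [list(page) for page in pages]
    return [drop_line(page, k) for page in pages]
```

Calling `strip_repeated_line(pages, 0)` handles headers and `strip_repeated_line(pages, -1)` handles footers; both return new lists and leave the input untouched.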
In addition, the header or footer may be a bare number, in which case the similarity mechanism may fail. A length judgment mechanism is therefore introduced, which obtains the second text content from the first text content. Header and footer lengths should be the same across pages, so for a document, when list[0] has the same length L on most pages, list[0] is likely the header (list[-1] is judged to be the footer in the same way). The lengths of the same line of each page are compared in turn on this basis, and if the number of pages whose line length exceeds the text length setting threshold is greater than the set page number threshold, that line is determined to be a header or footer and is removed from each page.
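A minimal sketch of the length judgment follows. It assumes one particular reading of the paragraph above: a line position is treated as a header or footer when more pages than a page-count threshold share the same line length there. The exact comparison against the text length threshold in the source may differ, and `strip_by_length` and `page_threshold` are hypothetical names.

```python
from collections import Counter


def strip_by_length(pages: list[list[str]], k: int,
                    page_threshold: int) -> list[list[str]]:
    """Remove line k from every page if enough pages share one length at that position.

    Assumption: "same length on most pages" is approximated by the most common
    length of line k; page_threshold plays the role of the set page number threshold.
    """
    lengths = [len(page[k]) for page in pages if -len(page) <= k < len(page)]
    if not lengths:
        return [list(page) for page in pages]
    _, count = Counter(lengths).most_common(1)[0]
    if count <= page_threshold:
        return [list(page) for page in pages]
    cleaned = []
    for page in pages:
        if -len(page) <= k < len(page):
            idx = k % len(page)  # normalise negative indices such as -1
            cleaned.append(page[:idx] + page[idx + 1:])
        else:
            cleaned.append(list(page))
    return cleaned
```

This check is meant to run on the output of the similarity step, so bare page numbers that have nothing in common textually can still be caught by their uniform length.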
Preprocessing the original medical inquiry data set further comprises segmentation of the data from which headers and footers have been removed: paragraphs are recombined by checking for sentence-ending punctuation. First, each extracted line of content is read. When a line does not end with a specific symbol such as "." or "?", it is merged with the next line. In this way the resulting paragraphs are kept as complete as possible, and the preprocessed medical inquiry data set is obtained;
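The paragraph recombination can be illustrated with a short sketch. The set of sentence-ending symbols and the `joiner` parameter are assumptions made for the example; the source only names the end-of-sentence check.

```python
# Assumed sentence-ending symbols (full-width and ASCII variants).
SENTENCE_END = ("。", "？", "！", ".", "?", "!")


def merge_lines(lines: list[str], endings: tuple[str, ...] = SENTENCE_END,
                joiner: str = "") -> list[str]:
    """Merge consecutive lines until one ends with sentence-ending punctuation."""
    paragraphs = []
    buffer = ""
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip empty lines left over from extraction
        buffer = text if not buffer else buffer + joiner + text
        if buffer.endswith(endings):
            paragraphs.append(buffer)
            buffer = ""
    if buffer:
        paragraphs.append(buffer)  # keep a trailing fragment rather than drop it
    return paragraphs
```

For Chinese text the default empty `joiner` is appropriate; for space-delimited languages `joiner=" "` avoids gluing words together.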
s3: acquiring a medical inquiry field combination based on a universal large language training model according to the preprocessed medical inquiry data set; the medical inquiry field combination is a field combination with the same prompting words and sequence as the preset prompting words and sequences related to the medical inquiry field;
Specifically, in the embodiment of the invention, the universal large language training model is an open-source large language model currently available on the market. Few-shot entity extraction is performed on the preprocessed medical inquiry data set according to an entity extraction framework to obtain the medical inquiry field combination. The extracted entities mainly comprise: symptoms, diagnosis results and treatment regimen. Based on this framework, entity extraction is performed by the general large language model (LLM) using the preset prompt words to obtain the medical inquiry field combination, with missing information filled in as none. To ensure that the entity extraction information is complete, the medical inquiry field combination is checked after extraction to confirm that it fully contains the key fields required for subsequent training. For example, if the preset key fields are (symptom + diagnosis), the two fields are merged into one key, and if any field in the key is empty, the record is treated as invalid. Finally, the extracted entity information is manually checked and corrected. The resulting medical inquiry field combination serves as a dedicated data set for the subsequent fine-tuning of the large language training model.
Specifically, one embodiment of the present invention is as follows. The original text information is: "User description: I have recently always felt a headache, sometimes accompanied by nausea. Doctor's reply: You may be suffering from migraine. I suggest that you take paracetamol on time every day, avoid strong light stimulation and keep a regular daily routine." The obtained data are {"symptoms": ["headache", "nausea"], "diagnosis": "migraine", "treatment scheme": "take paracetamol on time every day, avoid strong light stimulation and keep a regular daily routine"};
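The key-field check on an extracted record can be sketched as below. The field names follow the example record above; treating the literal string "none" and empty lists as missing is an assumption about how the extraction output is encoded, and `is_valid_record` is an illustrative name.

```python
# Key fields that must both be present, per the (symptom + diagnosis) example.
KEY_FIELDS = ("symptoms", "diagnosis")


def is_valid_record(record: dict, key_fields: tuple[str, ...] = KEY_FIELDS) -> bool:
    """Return False if any key field is missing, empty, or filled in as 'none'."""
    for field in key_fields:
        value = record.get(field)
        if value is None:
            return False
        if isinstance(value, str) and value.strip().lower() in ("", "none"):
            return False
        if isinstance(value, (list, tuple)) and len(value) == 0:
            return False
    return True


records = [
    {"symptoms": ["headache", "nausea"], "diagnosis": "migraine",
     "treatment scheme": "take paracetamol on time every day"},
    {"symptoms": ["cough"], "diagnosis": "none", "treatment scheme": "rest"},
]
# Only records whose key fields are all filled are kept for fine-tuning.
training_records = [r for r in records if is_valid_record(r)]
```

Records that pass this filter would still go through the manual check described in the text before being used for training.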
S4: training the universal large language training model according to the medical inquiry field combination, adjusting the prefix p of each element in the medical inquiry field combination based on a loss function, to obtain a trained large language training model;
Specifically, a general large language model performs poorly at extracting medical entities and requires large-scale manual adjustment, which is inefficient. Therefore, to obtain more accurate medical entity data, the universal large language model is improved by training, after which full medical entity extraction can be performed on the preprocessed medical inquiry data set obtained in step S2. The process is mainly based on the Prefix-Tuning fine-tuning method: a Prefix template is added before each minimum-unit text token that is input. In this embodiment, the minimum-unit text tokens are the elements of the medical inquiry field combination. The template is a trainable parameter while the other parts of the large language model are fixed, so only the Prefix template is fine-tuned. For a given input x (the token representation of the text) prefixed by p (the token representation of the prefix), the output y of the model can be written as y = f(p, x; θ), where f is the Transformer model and θ is the weight parameter of the model. In this process only p is optimized, while the weight parameter θ is fixed. When Prefix-Tuning is used for the medical inquiry entity extraction task, it guides the model to identify and extract key entities such as symptoms, diseases and treatment methods in the medical text.
Training the universal large language training model to obtain a trained large language training model, wherein the training process is as follows:
(1) Initializing a prefix p by a random value; keeping the weight parameter theta of the model unchanged;
(2) Inputting prefixes and texts, namely adding a prefix p in front of each input text x, and then inputting a medical inquiry field combination into a large language training model;
(3) Calculating loss according to the output y of the model and the real label;
(4) Optimizing prefix, namely updating prefix p by using a gradient descent method;
repeating steps (2) - (4) until the model converges.
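Steps (1) to (4) can be illustrated with a toy sketch. It assumes a frozen linear model standing in for the Transformer f, a squared-error loss and a fixed learning rate; none of these is specified in the source, and the real method would train prefix vectors inside a large language model. `forward` and `tune_prefix` are illustrative names.

```python
import random


def forward(prefix, x, theta):
    """Toy frozen model y = f(p, x; theta): dot product of theta with [prefix; x]."""
    features = list(prefix) + list(x)
    return sum(w * v for w, v in zip(theta, features))


def tune_prefix(data, theta, prefix_len, lr=0.05, epochs=200, seed=0):
    """Optimise only the prefix p by gradient descent; theta is never modified."""
    rng = random.Random(seed)
    prefix = [rng.uniform(-0.1, 0.1) for _ in range(prefix_len)]  # step (1)
    frozen = tuple(theta)  # the weight parameters stay fixed throughout
    for _ in range(epochs):
        for x, target in data:
            y = forward(prefix, x, frozen)      # step (2): prefix placed before the input
            error = y - target                  # step (3): loss is (y - target) ** 2
            for i in range(prefix_len):         # step (4): d(loss)/d(p_i) = 2 * error * theta_i
                prefix[i] -= lr * 2 * error * frozen[i]
    return prefix
```

Because the gradient is taken only with respect to the prefix entries, the model weights are untouched, which mirrors the fixed θ in y = f(p, x; θ).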
In this way a higher-quality entity extraction effect can be achieved. However, because the weight parameter θ of the model is fixed and only the input prefix is fine-tuned, the generalization ability of the model on other tasks may be limited, so the model is best suited to medical inquiry entity extraction.
S5, acquiring a medical inquiry answer pair according to the medical inquiry data set and the trained large language training model so as to extract medical inquiry contents;
Specifically, the automatic extraction of medical inquiry entities and the construction of question-answer pairs proceed as follows. Full entity extraction is performed on the preprocessed medical inquiry data set obtained in step S2 using the trained large language training model from step S4, which structures the preprocessed unstructured-text medical inquiry data set. Different question templates are then formulated for each extracted entity record, and substitution is carried out by combining synonymous rewriting by the large language model with regular expressions, so that the constructed questions are more diverse. The prompt word (prompt) entered into the large language model is as follows: "The user's symptoms are [symptoms]; please construct questions in different ways to describe these symptoms." Regular expressions mainly replace the way the question is asked, for example changing "what should I do?" to "how should I handle this?", "how should this be dealt with?", "how should this be treated?" and so on. Acquiring medical inquiry answer pairs with the trained large language training model merely uses the prior art and is not described further here.
Another embodiment of the invention is as follows. Suppose the user symptoms in a given entity record are chest tightness, cough and fever. Medical inquiry answer pairs are acquired for these symptoms through the large language model, and the rephrased questions are: "1. I feel chest tightness, cough and fever today, what should I do? 2. We have chest tightness, cough and fever today, what should we do? 3. My body temperature has reached 38.5 degrees, accompanied by chest tightness and cough, what should I do?" Regular-expression matching is then applied to further vary the way the question is asked, replacing "what should I do?" with "how should I handle this?", "how should this be dealt with?", "how should this be treated?" and so on. This effectively enriches the diversity of the question-answer pair data set, captures more accurately the questions users may actually ask, and improves the accuracy of question answering.
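The regular-expression step can be sketched as follows. The English pattern and the list of alternative endings are assumptions standing in for the original Chinese phrasings, and `expand_question` is an illustrative name; the synonymous rewriting by the large language model is not reproduced here.

```python
import re

# Assumed English stand-ins for the question endings being swapped.
ENDINGS = [
    "what should I do?",
    "how should I handle this?",
    "how should this be treated?",
]
ENDING_PATTERN = re.compile(r"what should I do\?\s*$", re.IGNORECASE)


def expand_question(question: str, endings: list[str] = ENDINGS) -> list[str]:
    """Return variants of the question with its closing phrase replaced."""
    if not ENDING_PATTERN.search(question):
        return [question]  # nothing to vary; keep the question unchanged
    variants = []
    for ending in endings:
        variant = ENDING_PATTERN.sub(ending, question)
        if variant not in variants:
            variants.append(variant)
    return variants


questions = expand_question(
    "I feel chest tightness, cough and fever today, what should I do?"
)
```

Each variant shares the same answer, so one extracted entity record yields several question-answer pairs.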
S6: obtaining medical question-answer pair vectors from the medical question-answer pairs so as to obtain a department knowledge base based on a third-party vector library; the department knowledge base comprises symptom field vectors and the solution vectors corresponding to the symptom field vectors;
Specifically, for the medical consultation entity data set (medical question-answer pairs) generated in S5, each complete piece of entity information is merged into one line and appended to a plain-text file, with the different entity fields separated by tab characters. The content of each field of the constructed medical entity information is then split using features such as line feeds and tab characters. The medical question-answer pairs are then converted into vector form by an embedding model (e.g., GanymedeNil/text2vec-large-chinese, shibing624/text2vec-base-chinese, nghuyong/ernie-3.0-base-zh, BAAI/bge-large-zh). The vectors are stored and retrieved using a third-party vector library (such as Faiss or Milvus), forming knowledge bases for the different departments that contain medical question-answer pair vectors relating to various diseases.
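As a rough sketch of this step, the code below parses tab-separated records and pairs each record with a vector of its symptom field. Several things here are assumptions made for illustration: the field order in `FIELDS`, the function names, and above all `toy_embed`, which is a hashed character-bigram stand-in for a real embedding model. A plain Python list also stands in for the third-party vector library.

```python
import hashlib
import math

FIELDS = ("symptoms", "diagnosis", "treatment")  # assumed field order
DIM = 64  # size of the toy embedding


def parse_records(text):
    """Split a tab-separated plain-text file into one dict per line."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        values = line.split("\t")
        if len(values) != len(FIELDS):
            continue  # skip rows that do not have exactly one value per field
        records.append(dict(zip(FIELDS, values)))
    return records


def toy_embed(text):
    """Stand-in for an embedding model: hashed character bigrams, L2-normalised."""
    vec = [0.0] * DIM
    for a, b in zip(text, text[1:]):
        digest = hashlib.md5((a + b).encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_knowledge_base(text):
    """Pair each record with the vector of its symptom field."""
    return [(toy_embed(record["symptoms"]), record) for record in parse_records(text)]
```

In a real deployment the list returned by `build_knowledge_base` would be replaced by an index in the chosen vector library, and `toy_embed` by one of the embedding models named above.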
S7: acquiring a user question vector from the user's question based on the trained large language training model;
In this embodiment, the content of the user's question is first preprocessed by the large language training model trained in S4, including sentence understanding and the extraction of keywords and key sentences. The keywords and key sentences of the user's question are then converted into user question vectors using the embedding model of S6, yielding one or more user question vectors depending on the number of extracted keywords and sentences.
S8: obtaining the similarity between the user question vector and the symptom field vectors in the department knowledge base to find the symptom field vector most similar to the user's question, and thereby the solution vector for the user's question;
Specifically, the question vector from step S7 is matched against the symptom field vectors of the department knowledge base constructed in step S6 using cosine similarity, distance-based similarity or the like, and each match is scored. The several target segments with the highest similarity scores are recalled, that is, the symptom fields in the knowledge base most similar to the question vector input by the user, together with their corresponding answers.
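The scoring and recall step can be illustrated with plain cosine similarity. This is a sketch under stated assumptions: the knowledge base is a small in-memory list of (vector, symptoms, answer) tuples with made-up three-dimensional vectors, and the names `cosine` and `recall_top_k` are hypothetical. A production system would delegate this search to the vector library.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall_top_k(question_vec, knowledge_base, k=2):
    """Return the k (score, symptoms, answer) entries with the highest cosine score."""
    scored = [
        (cosine(question_vec, vec), symptoms, answer)
        for vec, symptoms, answer in knowledge_base
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy vectors for illustration; real ones would come from the embedding model.
KB = [
    ([1.0, 0.0, 0.0], "chest tightness, cough", "answer A"),
    ([0.0, 1.0, 0.0], "lower back pain", "answer B"),
    ([0.7, 0.7, 0.0], "cough, fever", "answer C"),
]
hits = recall_top_k([0.9, 0.1, 0.0], KB, k=2)
```

With these toy values the entries for "answer A" and "answer C" are recalled, in that order, because their vectors point in directions closest to the question vector.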
S9: obtaining a final solution vector from the symptom field vector most similar to the user's question and the solution vector for the user's question, based on the trained large language training model.
Specifically, an accurate consultation answer is generated as follows: the user's question is spliced with the matched knowledge-base symptom field most similar to it, and a new question is formed through prompt engineering. The embedding model of S6 then vectorizes this new question to obtain a new question vector, the answer is predicted by the generative large language model, and an accurate, professional consultation answer is finally given.
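One way to picture the splicing step is a small prompt builder that places the recalled knowledge-base entries ahead of the user's question. The wording of the prompt and the function name `build_prompt` are assumptions for illustration; the actual prompt used is not given, and the call to the large language model is omitted.

```python
def build_prompt(user_question, recalled):
    """Splice the user's question with recalled (symptoms, answer) pairs."""
    lines = ["Reference cases from the department knowledge base:"]
    for idx, (symptoms, answer) in enumerate(recalled, start=1):
        lines.append(f"{idx}. Symptoms: {symptoms} | Answer: {answer}")
    lines.append(f"Patient question: {user_question}")
    lines.append("Using the reference cases above, give a consultation answer.")
    return "\n".join(lines)


prompt = build_prompt(
    "I have a cough and a fever. What should I do?",
    [("cough, fever", "Rest and drink fluids.")],
)
```

The resulting text is what would be passed on to the generation model, so the model sees the recalled cases and the question together.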
The embodiment also discloses an auxiliary inquiry system based on the above auxiliary inquiry method based on medical entity extraction, which, as shown in fig. 2, comprises: a medical inquiry data set acquisition module, a data preprocessing module, a medical inquiry entity extraction module, a medical inquiry entity extraction model fine-tuning module, an automatic medical inquiry entity extraction module, a knowledge base text vectorization module, a user question vectorization module, a question similarity matching module and an answer generation module;
the medical inquiry data set acquisition module is used for acquiring an original medical inquiry data set, in which the data is unstructured text data. The original inquiry data set is acquired according to the subject areas of different departments (such as obstetrics, orthopedics and internal medicine) and mainly comprises actual clinical cases from the hospital's departments, medical inquiry data sets published online, medical books, papers and the like.
The data preprocessing module is used for preprocessing the original medical inquiry data set to obtain a preprocessed medical inquiry data set. Specifically, the original files of each department's acquired data set are processed: unstructured text content is extracted and converted using text extraction or OCR, interfering header and footer elements are removed by a combined similarity and length judgment, and the extracted text is segmented to restore the text structure.
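The similarity part of the header and footer removal can be sketched as below. The sketch makes several assumptions that are not taken from the source: line similarity is measured with `difflib.SequenceMatcher`, the average is taken over all page pairs, the threshold is 0.8, and lines are compared by their position from the top of the page. The length-based second stage is left out.

```python
from difflib import SequenceMatcher
from itertools import combinations


def average_similarity(pages, k):
    """Mean pairwise similarity of line k over all pages that have such a line."""
    lines = [page[k] for page in pages if k < len(page)]
    pairs = list(combinations(lines, 2))
    if not pairs:
        return 0.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def strip_repeated_lines(pages, threshold=0.8):
    """Drop line positions whose content is nearly identical across pages."""
    max_len = max((len(page) for page in pages), default=0)
    repeated = {k for k in range(max_len) if average_similarity(pages, k) > threshold}
    return [[line for k, line in enumerate(page) if k not in repeated] for page in pages]


pages = [
    ["General Hospital Obstetrics Handbook", "Morning sickness usually eases after week 12.", "Page 1"],
    ["General Hospital Obstetrics Handbook", "Mild swelling of the ankles is common late in pregnancy.", "Page 2"],
    ["General Hospital Obstetrics Handbook", "Contact a doctor if fever exceeds 38.5 degrees.", "Page 3"],
]
cleaned = strip_repeated_lines(pages)
```

In this toy input the running title and the "Page n" line repeat at the same position on every page and are dropped, while the body lines, which differ from page to page, are kept.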
The medical inquiry entity extraction module is used for acquiring medical inquiry field combinations from the preprocessed medical inquiry data set based on a general large language training model; a medical inquiry field combination is a combination of fields obtained with the same preset prompt words and field order. Specifically, the module designs an entity extraction framework whose main target entities are symptoms, diagnosis and treatment regimen. Entities are extracted from the text passages using the general large language training model with the preset prompt words and order. Valid entity information is then screened out according to key fields (such as symptoms and diagnosis), and the extracted entity information is manually reviewed and corrected to create a dedicated data set for medical inquiry entity extraction.
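A minimal sketch of the fixed-order prompt and the key-field screening might look as follows. The prompt wording, the "field: value" reply format and all function names are assumptions made for illustration; the call to the large language model itself is not shown, and the manual review step is outside the scope of the code.

```python
FIELDS = ["symptoms", "diagnosis", "treatment regimen"]  # target entities, fixed order
KEY_FIELDS = ["symptoms", "diagnosis"]  # a record is kept only if these are filled


def build_extraction_prompt(passage):
    """Ask for the same fields in the same order for every passage."""
    field_lines = "\n".join(f"{name}:" for name in FIELDS)
    return (
        "Extract the following fields from the medical text. "
        "Answer with one line per field, in this order:\n"
        f"{field_lines}\n\nText: {passage}"
    )


def parse_extraction(reply):
    """Parse 'field: value' lines from a model reply into a dict."""
    record = {name: "" for name in FIELDS}
    for line in reply.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip().lower()
        if sep and name in record:
            record[name] = value.strip()
    return record


def is_valid(record):
    """Screen out records whose key fields are empty."""
    return all(record[name] for name in KEY_FIELDS)
```

Keeping the field list in one place means the prompt and the parser cannot drift apart, which is the practical point of using the same prompt words and order for every passage.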
The medical inquiry entity extraction model fine-tuning module is used for training the general large language training model on the medical inquiry field combinations, adjusting, based on a loss function, the prefix p of each element in the medical inquiry field combination of the model so as to obtain a trained large language training model. Specifically, the general large language training model is trained using a fine-tuning method based on Prefix-Tuning.
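For orientation, the commonly cited formulation of Prefix-Tuning keeps the language model weights frozen and trains only the prefix parameters. The block below states that general formulation and is not necessarily the exact loss used in this embodiment. The symbols are assumptions introduced here: $\phi$ denotes the frozen model weights, $\theta$ the trainable prefix parameters, $P_{\mathrm{idx}}$ the prefix positions, $Y_{\mathrm{idx}}$ the output positions, $z_i$ the token at position $i$ and $h_i$ the activation at position $i$.

```latex
\max_{\theta}\ \log p_{\phi}(y \mid x)
  = \sum_{i \in Y_{\mathrm{idx}}} \log p_{\phi}\bigl(z_i \mid h_{<i}\bigr),
\qquad
h_i =
\begin{cases}
  P_{\theta}[i, :], & i \in P_{\mathrm{idx}}, \\
  \mathrm{LM}_{\phi}\bigl(z_i, h_{<i}\bigr), & \text{otherwise.}
\end{cases}
```

Because only $\theta$ receives gradient updates, the loss adjusts the prefix while the general model itself stays unchanged.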
The automatic medical inquiry entity extraction module is used for acquiring medical question-answer pairs from the medical inquiry data set and the trained large language training model, so as to extract the medical inquiry content. Specifically, the model trained in the medical inquiry entity extraction model fine-tuning module performs entity extraction on the preprocessed unstructured medical text, and several question templates are then formulated for each piece of extracted data based on its entities, enriching the diversity of the question-answer pair data set.
The knowledge base text vectorization module is used for acquiring medical question-answer pair vectors from the medical question-answer pairs so as to obtain a department knowledge base based on a third-party vector library; the department knowledge base comprises symptom field vectors and the solution vectors corresponding to the symptom field vectors. Each field of the constructed medical entity information is split and vectorized to form a medical inquiry knowledge base containing the vectorized segments.
The user question vectorization module is used for acquiring a user question vector from the user's question based on the trained large language training model. The user's question is preprocessed, keywords are extracted, and vectorization is performed to obtain one or more question vectors.
The question similarity matching module is used for obtaining the similarity between the user question vector and the symptom field vectors in the department knowledge base to find the symptom field vector most similar to the user's question, and thereby the solution vector for the user's question. The user question vector is matched for similarity against the symptom fields in the knowledge base, and the several target questions with the highest similarity are extracted.
The answer generation module is used for obtaining a final solution vector from the symptom field vector most similar to the user's question and the solution vector for the user's question, based on the trained large language training model. The module splices the user's question with the matched knowledge base content to form a new question, and the trained large language training model generates the final professional consultation answer.
The dynamically updated auxiliary inquiry method and system based on reinforcement learning has been applied to the obstetric online consultation platform of a hospital, where it is positioned as the hospital's online consultation assistant and provides accurate and timely consultation services for obstetric patients. The following functions can be realized: (1) Up-to-date information: the system acquires the latest obstetric clinical cases, medical inquiry data sets published online, and clinical medical books, papers and similar information, and provides the public with the latest relevant medical advice and knowledge. Structured entity extraction is performed on the data set as it is updated in real time, ensuring the accuracy and timeliness of the consultation information provided by the system. (2) Pre-diagnosis service: users can submit their own questions and symptoms through the online obstetric consultation platform, and the system retrieves related questions from the question library and provides accurate pre-diagnosis suggestions. (3) Online medical guidance: the online consultation platform can provide advice and guidance on daily care, including knowledge of daily health care, living habits and diet, helping users develop good care habits and stay healthy.
Through the online auxiliary consultation platform, obstetric patients can obtain professional obstetric medical advice and guidance anytime and anywhere, resolving their doubts and confusion and improving the efficiency of diagnosis. For hospitals, the system can lighten the burden on doctors, improve the utilization of medical resources, and provide better service quality and user experience.
The embodiment has the following beneficial effects:
1. Information resources in the medical field are acquired automatically and automatically extracted into structured information and question-answer pairs, which improves the efficiency of information acquisition and processing and reduces the need to manually read and organize the various information related to each department.
2. By using a large language model to match and generate questions and answers, users are assisted in accurately describing their symptoms or questions, which improves the accuracy of medical consultation.
3. For users' questions about various diseases, the accuracy of online question answering is improved through a reinforcement learning method, users' daily question-answering needs are effectively met, and a large amount of manpower and material resources is saved.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (6)

1. An auxiliary inquiry method based on medical entity extraction is characterized by comprising the following steps:
S1: acquiring an original medical inquiry data set; the data in the original medical inquiry data set is unstructured text data;
S2: preprocessing the original medical inquiry data set to obtain a preprocessed medical inquiry data set;
S3: acquiring a medical inquiry field combination based on a general large language training model according to the preprocessed medical inquiry data set; the medical inquiry field combination is a combination of fields with the same preset prompt words and order;
S4: training the general large language training model according to the medical inquiry field combination, so as to adjust, based on a loss function, the prefix p of each element in the medical inquiry field combination of the model and obtain a trained large language training model;
S5: acquiring medical question-answer pairs according to the preprocessed medical inquiry data set and the trained large language training model, so as to extract structured data of the medical inquiry content;
S6: obtaining medical question-answer pair vectors from the medical question-answer pairs so as to establish a department knowledge base based on a third-party vector library; the department knowledge base comprises symptom field vectors and the solution vectors corresponding to the symptom field vectors;
S7: acquiring a user question vector from the user's question based on the trained large language training model;
S8: obtaining the similarity between the user question vector and the symptom field vectors in the department knowledge base to find the symptom field vector most similar to the user's question, and thereby the solution vector for the user's question;
S9: acquiring a final solution vector from the symptom field vector most similar to the user's question and the solution vector for the user's question, based on the trained large language training model, so as to obtain an answer to the user's question and provide the user with a professional diagnosis and treatment opinion.
2. The auxiliary inquiry method based on medical entity extraction according to claim 1, characterized in that the preprocessed medical inquiry data set is obtained as follows:
S21: performing text recognition on the original medical inquiry data set to obtain the text content list[i] of the i-th page, i = 0, 1, …, and thereby the text content list[i][k] of the k-th line of the i-th page; wherein I is the total number of pages of the text; k is the text line number on the i-th page;
S22: obtaining the average similarity of the k-th line across all pages from the text content list[i][k] of the k-th line of the i-th page, and obtaining the first text content of the i-th page, which is the text content of the i-th page after processing based on the average similarity;
S23: obtaining, from the first text content of the i-th page, the length of the k1-th line of the first text content of the i-th page, so as to obtain the second text content of the i-th page;
the second text content of the i-th page is the text content of the i-th page after processing based on the length of the k1-th line of its first text content; where k1 is the line number within the first text content of the i-th page.
3. The method of claim 2, wherein the average similarity of the kth line of all pages is obtained as follows:
wherein: average_score(k) represents the average similarity of the k-th line across all pages; I is the total number of pages; i and j are page numbers of the text; similarity(list[i][k], list[j][k]) represents the similarity between list[i][k] and list[j][k]; list[i][k] represents the text content of the k-th line of the i-th page; list[j][k] represents the text content of the k-th line of the j-th page.
4. The auxiliary inquiry method based on medical entity extraction according to claim 2, wherein the first text content acquisition method of the ith page is as follows:
when the average similarity of the k-th line of all pages is greater than the similarity threshold, the first text content list[i]' of the i-th page is list[i] with the k-th line removed;
otherwise, the first text content of the i-th page is list[i]' = list[i];
wherein: list[i]' is the first text content of the i-th page; list[i]'[k1] is the k1-th line of the first text content of the i-th page.
5. The auxiliary consultation method based on medical entity extraction according to claim 2, characterized in that the method for acquiring the second text content of the ith page is as follows:
wherein: list[i]'' represents the second text content of the i-th page; list[i]' is the first text content of the i-th page; list[i]'[k1] represents the k1-th line of the first text content of the i-th page; L_{i,k1} represents the length of the k1-th line of the first text content of the i-th page; L_D represents the set text length threshold; num_i{L_{i,k1} > L_D} represents the number of pages whose first text content has a length greater than the set text length threshold; num_D represents the set page number threshold.
6. An auxiliary inquiry system based on the auxiliary inquiry method based on medical entity extraction according to any one of claims 1-5, comprising: a medical inquiry data set acquisition module, a data preprocessing module, a medical inquiry entity extraction module, a medical inquiry entity extraction model fine-tuning module, an automatic medical inquiry entity extraction module, a knowledge base text vectorization module, a user question vectorization module, a question similarity matching module and an answer generation module;
the medical inquiry data set acquisition module is used for acquiring an original medical inquiry data set; the data in the original medical inquiry data set is unstructured text data;
the data preprocessing module is used for preprocessing the original medical inquiry data set to obtain a preprocessed medical inquiry data set;
the medical inquiry entity extraction module is used for acquiring a medical inquiry field combination based on a general large language training model according to the preprocessed medical inquiry data set; the medical inquiry field combination is a combination of fields with the same preset prompt words and order;
the medical inquiry entity extraction model fine-tuning module is used for training the general large language training model according to the medical inquiry field combination, so as to adjust, based on a loss function, the prefix p of each element in the medical inquiry field combination of the model and obtain a trained large language training model;
the automatic medical inquiry entity extraction module is used for acquiring medical question-answer pairs according to the medical inquiry data set and the trained large language training model so as to extract the medical inquiry content;
the knowledge base text vectorization module is used for acquiring medical question-answer pair vectors from the medical question-answer pairs so as to obtain a department knowledge base based on a third-party vector library; the department knowledge base comprises symptom field vectors and the solution vectors corresponding to the symptom field vectors;
the user question vectorization module is used for acquiring a user question vector from the user's question based on the trained large language training model;
the question similarity matching module is used for obtaining the similarity between the user question vector and the symptom field vectors in the department knowledge base to find the symptom field vector most similar to the user's question, and thereby the solution vector for the user's question;
the answer generation module is used for obtaining a final solution vector from the symptom field vector most similar to the user's question and the solution vector for the user's question, based on the trained large language training model.
CN202311360225.3A 2023-10-19 2023-10-19 Auxiliary inquiry method and system based on medical entity extraction Withdrawn CN117457237A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311360225.3A CN117457237A (en) 2023-10-19 2023-10-19 Auxiliary inquiry method and system based on medical entity extraction

Publications (1)

Publication Number Publication Date
CN117457237A true CN117457237A (en) 2024-01-26

Family

ID=89593991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311360225.3A Withdrawn CN117457237A (en) 2023-10-19 2023-10-19 Auxiliary inquiry method and system based on medical entity extraction

Country Status (1)

Country Link
CN (1) CN117457237A (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20240126)