CN116628171B - Medical record retrieval method and system based on pre-training language model


Info

Publication number
CN116628171B
CN116628171B (application CN202310905527.8A; also published as CN116628171A)
Authority
CN
China
Prior art keywords
medical record
document
language model
information
short text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310905527.8A
Other languages
Chinese (zh)
Other versions
CN116628171A (en)
Inventor
王实
李丽
张奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huimeiyun Technology Co., Ltd.
Original Assignee
Beijing Huimeiyun Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huimeiyun Technology Co., Ltd.
Priority to CN202310905527.8A
Publication of CN116628171A
Application granted
Publication of CN116628171B
Legal status: Active

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The application relates to a medical record retrieval method and system based on a pre-trained language model, applied in the technical field of language models, comprising the steps of generating a pre-trained language model based on existing corpus information; adjusting the pre-trained language model based on a preset medical record library; retraining the adjusted pre-trained language model; and retrieving medical record documents according to the retrained pre-trained language model and generating a reply. The application realizes semantic retrieval based on a large language model, realizes natural language query interaction, and improves search accuracy and user experience; improves the recall rate of medical record search through a dedicated electronic medical record document vectorization algorithm; realizes a multi-round dialogue memory effect and natural-language search results by constructing context and prompts; and addresses the language generation model's defects of producing flawed or counterfactual content by combining vector search and context mechanisms.

Description

Medical record retrieval method and system based on pre-training language model
Technical Field
The application relates to the technical field of language models, in particular to a medical record retrieval method and system based on a pre-training language model.
Background
During medical treatment, the diagnostic findings, treatment measures, and course of care must be recorded and kept so that subsequent treatment can proceed on a complete picture of the illness. Modern hospitals have accordingly built their diagnostic workflow around a medical information system: by filling in diagnostic records online and keeping electronic files, physicians further increase diagnostic efficiency without delaying care.
In a medical system, retrieving a patient's medical information is an indispensable step. Existing methods for finding a patient's medical information generally rely on medical record search, implemented with full-text retrieval, a technique for efficiently and quickly finding documents that contain specific keywords or phrases within a large document collection.
With respect to the prior art, the inventors observe that traditional full-text retrieval relies mainly on lexical matching rather than semantic matching; when the vocabulary of the query does not exactly match that of a document, the relevant document may not be found, resulting in a low degree of semantic matching.
Disclosure of Invention
To address the problem that existing full-text retrieval relies mainly on lexical matching rather than semantic matching, and therefore cannot find related documents when the vocabulary of the query and the documents is not exactly the same, resulting in a low degree of semantic matching, the present application provides the following.
According to a first aspect of the present application, there is provided a medical record retrieval method based on a pre-trained language model, the method comprising the steps of:
generating a pre-training language model based on the existing corpus information;
adjusting the pre-training language model based on a preset medical record library;
retraining the adjusted pre-trained language model;
and retrieving the medical record document according to the retrained pre-training language model and generating a reply.
In a specific embodiment, the generating the pre-trained language model comprises:
acquiring a source sequence and a target sequence through corpus information;
respectively segmenting a source sequence and a target sequence;
endowing the segmented source sequence and target sequence with corresponding index identity information and storing the index identity information into a preset vocabulary;
obtaining a vector representation of the source sequence;
acquiring a predicted vector of a target sequence and probability distribution corresponding to each predicted vector;
calculating a cross entropy loss function of each word segmentation and summing to obtain a cross entropy total loss;
calculating the gradient of the cross entropy total loss relative to each parameter through a back propagation algorithm;
updating parameters through an optimization algorithm until the total cross entropy loss reaches a preset range;
setting the optimized training language model as a pre-training language model.
In a specific embodiment, the adjusting the pre-training language model based on the preset medical record library includes:
different electronic medical record data are stored in the medical record library;
acquiring a source sequence and a target sequence through electronic medical record data;
respectively segmenting a source sequence and a target sequence;
endowing the segmented source sequence and target sequence with corresponding index identity information and storing the index identity information into a preset vocabulary;
obtaining a vector representation of the source sequence;
acquiring a predicted vector of a target sequence and probability distribution corresponding to each predicted vector;
calculating a cross entropy loss function of each word segmentation and summing to obtain a cross entropy total loss;
calculating the gradient corresponding to the total loss of the cross entropy relative to the preset parameters through a back propagation algorithm;
updating preset parameters through an optimization algorithm until the total cross entropy loss reaches a preset range;
setting the optimized training language model as a pre-training language model.
In a specific embodiment, the retrieving the medical record document according to the retrained pre-trained language model includes:
acquiring user searching interaction information;
generating corresponding restriction information based on the user search interaction information;
retrieving the medical record document according to the retrained pre-training language model and obtaining a retrieval result;
judging whether the search result accords with the limit information;
if not, canceling the generation of the search result.
In a specific embodiment, the retrieving the medical record document according to the retrained pre-trained language model includes:
obtaining medical record document information, wherein the medical record document information is a preset semi-structured medical record;
constructing a vector expression model of the medical record document;
cutting off the medical record document vector according to the TF-IDF value and obtaining a preset value matrix;
performing dimension reduction operation on a preset value matrix to obtain a low-dimension vector;
clustering the low-dimensional vectors and generating corresponding index information;
and retrieving the medical record document through the index information.
In a specific embodiment, after the retrieving the medical record document by the index information, the method further includes:
acquiring a short text document and converting the short text document into a vector;
after both the short text document and the medical record documents have been converted into vectors, calculating the group corresponding to the short text document through vector similarity and recalling the documents in the same group;
acquiring key document information from documents in the same group according to the similarity;
splicing original texts of the key document information and setting the original texts as contexts of short text documents;
synthesizing the key document information and the short text document text to update the short text document;
sending the updated short text document to the pre-training language model and obtaining a return document;
the return document is set to a reply.
In a specific embodiment, if a rejected short text document submitted by the user is received, the returned document and the rejected short text document are combined to generate a short text document for secondary retrieval;
sending the secondary search short text document to a pre-training language model and acquiring a secondary return document;
and setting the secondary return document as a secondary reply.
According to a second aspect of the present application, there is also provided a medical record retrieval system based on a pre-training language model, including:
the pre-training language model construction module is used for generating a pre-training language model based on the existing corpus information;
the pre-training language model adjusting module is used for adjusting the pre-training language model based on a preset medical record library;
the pre-trained language model fine-tuning module is used for retraining the adjusted pre-trained language model;
and the training model reply module is used for retrieving the medical record document according to the retrained pre-training language model and generating replies.
According to a third aspect of the present application there is also provided a computer device comprising a memory and a processor, the memory having stored thereon a computer program capable of being loaded by the processor and performing the method of the first aspect.
According to a fourth aspect of the present application there is also provided a computer readable storage medium storing a computer program capable of being loaded by a processor and performing the method as in the first aspect.
In summary, the present application includes at least one of the following beneficial technical effects:
1. the semantic retrieval is realized based on the large language model, so that the natural language query interaction is realized, and the searching accuracy and the user experience are improved;
2. the recall rate of medical record searching is improved through a special electronic medical record document vectorization algorithm;
3. the multi-round dialogue memory effect and the natural language search result generation are realized by constructing the context and the prompt;
4. the language generation model's defects of producing flawed or counterfactual content are addressed by combining vector search and context mechanisms.
Drawings
FIG. 1 is a flow chart of a medical record retrieval method based on a pre-trained language model in an embodiment of the application.
FIG. 2 is a schematic diagram of a medical record retrieval system based on a pre-trained language model in an embodiment of the application.
Reference numerals: 201. a pre-training language model construction module; 202. a pre-training language model adjustment module; 203. a pre-trained language model fine-tuning module; 204. a training model reply module.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
The embodiment provides a medical record retrieval method based on a pre-trained language model, which achieves medical semantic understanding by fine-tuning the pre-trained language model on medical data; improves the recall rate of medical record document retrieval with a dedicated electronic medical record vector model; and improves retrieval accuracy while returning results in natural language through a multi-round dialogue model.
As shown in fig. 1, the method for medical record retrieval based on the pre-training language model of the present embodiment specifically includes the following steps:
s1: based on the existing corpus information, a pre-trained language model is generated.
In the process of generating the pre-trained language model, the existing corpus information must first be preprocessed, including word segmentation, stop-word removal, and stemming. Word segmentation splits the text into meaningful lexical units; removing stop words reduces the size of the index and speeds up retrieval; stemming reduces words to their base form, which helps improve retrieval recall.
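As a minimal sketch of this preprocessing step (assuming the jieba library for Chinese word segmentation and a small illustrative stop-word list; stemming mainly applies to alphabetic text and is omitted here):

```python
import jieba  # widely used Chinese word-segmentation library

# Illustrative stop-word list; in practice this is loaded from a curated file.
STOP_WORDS = {"的", "了", "在", "是", "和"}

def preprocess(text: str) -> list[str]:
    """Segment text into meaningful lexical units and drop stop words."""
    tokens = jieba.lcut(text)  # word segmentation
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

print(preprocess("患者在入院时神志清醒"))
```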
S2: and adjusting the pre-training language model based on a preset medical record library.
The pre-trained language model (Pre-trained Language Model) is a neural network model of a particular structure, generally used for text sequence-to-sequence prediction (sequence to sequence). During adjustment, the language model is trained by filling in masked text: tokens of the text are covered with a MASK, and the model is trained on a large Chinese corpus to accurately predict the covered words in a sentence.
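A minimal sketch of how such masked training samples might be constructed (the 15% masking rate and the [MASK] token are illustrative assumptions, not parameters disclosed by the patent):

```python
import random

MASK = "[MASK]"

def make_masked_sample(tokens: list[str], mask_rate: float = 0.15):
    """Return (inputs, labels): labels keep the original token at masked
    positions and None elsewhere, so only masked positions are scored."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            inputs.append(MASK)
            labels.append(tok)    # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)   # position not scored in the loss
    return inputs, labels

print(make_masked_sample(["患者", "神志", "清醒"]))
```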
S3: retraining the adjusted pre-trained language model.
The language model acquires strong semantic understanding by learning from a large-scale Chinese corpus. The main objective of the fine-tuning algorithm is to let the model learn knowledge from new data without changing the parameters of the original neural network; the new model fine-tuning method reduces the computational complexity that fine-tuning requires, so that the demands of the search application can be met.
S4: and retrieving the medical record document according to the retrained pre-training language model and generating a reply.
In one embodiment, considering that constructing the pre-trained model is an indispensable link in the medical record retrieval method, the specific construction process may be performed as follows:
acquiring a source sequence and a target sequence through corpus information;
respectively segmenting a source sequence and a target sequence;
endowing the segmented source sequence and target sequence with corresponding index identity information and storing the index identity information into a preset vocabulary;
obtaining a vector representation of the source sequence; acquiring a predicted vector of a target sequence and probability distribution corresponding to each predicted vector; calculating a cross entropy loss function of each word segmentation and summing to obtain a cross entropy total loss; calculating the gradient of the cross entropy total loss relative to each parameter through a back propagation algorithm; updating parameters through an optimization algorithm until the total cross entropy loss reaches a preset range; setting the optimized training language model as a pre-training language model.
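A condensed sketch of this construction loop (a generic PyTorch seq2seq training step; the tiny GRU stand-in model, vocabulary size, and random data are illustrative assumptions, not the patent's architecture):

```python
import torch
import torch.nn as nn

VOCAB = 1000  # size of the preset vocabulary (illustrative)

class TinySeq2Seq(nn.Module):
    """Stand-in encoder-decoder: embeds index ids, encodes the source,
    and predicts a probability distribution for each target position."""
    def __init__(self, vocab: int = VOCAB, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt_in):
        _, h = self.enc(self.emb(src))        # vector representation of source
        y, _ = self.dec(self.emb(tgt_in), h)  # predicted vectors of target
        return self.out(y)                    # logits over the vocabulary

model = TinySeq2Seq()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()  # per-token cross entropy

src = torch.randint(0, VOCAB, (2, 10))  # index ids of the segmented source
tgt = torch.randint(0, VOCAB, (2, 8))   # index ids of the segmented target

logits = model(src, tgt[:, :-1])
loss = loss_fn(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))  # total loss
optimizer.zero_grad()
loss.backward()   # gradients of the total loss w.r.t. each parameter
optimizer.step()  # optimization algorithm updates the parameters
print(loss.item())
```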
It is worth mentioning that the cross-entropy loss between the predicted word and the actual word is computed as:

$$CE(p, q) = -\sum_{\omega \in V} p(\omega) \log q(\omega)$$

where $CE(p, q)$ is the cross entropy between the actual distribution $p$ and the predicted distribution $q$, $V$ is the vocabulary of predicted and actual words, and $\omega$ is the selected predicted or actual word. Since the actual word is a one-hot target, this reduces to $-\log q(\omega^{*})$ for the actual word $\omega^{*}$.

The total cross-entropy loss is computed as:

$$L = \sum_{t=1}^{T} CE_t$$

where $L$ is the total cross entropy and $T$ is the total number of predicted and actual words.
The pre-training model typically incorporates a Transformer mechanism: rather than simply using a static vector representation of the text, attention weights are adjusted according to the position and actual context of each word in the input text. During training, the target vector is predicted by sequentially selecting the word with the highest probability as the prediction result of each step and generating the next word based on the current prediction; the target sequence of the sample is used only for computing the loss function during training.
For example, "the combination of the space station and the space ship of the space ship No. six completes the meeting and docking", the input of the sample is the meeting and docking of the space ship No. six and the (MASK); output is a space station assembly, the source sequence is changed into [ space station assembly ] after the ship number six freight airship meets and is docked with [ MASK ]; these two sequences are input into the transform's Encoder and Decode, respectively. The Encoder generates a vector representation of the source sequence, and the Encoder generates a predicted vector of the target sequence and a probability distribution corresponding to each predicted vector; the probability distribution of the first word is predicted to be space station 0.6, earth 0.3, aviation center 0.1, the actual word is space station, and the corresponding cross entropy loss function is-log (0.6) =0.51.
In one embodiment, considering that learning knowledge from new data requires a fine-tuning operation on the pre-trained language model without changing the parameters of the original neural network, the specific fine-tuning operation may be performed as follows:
different electronic medical record data are stored in the medical record library; acquiring a source sequence and a target sequence through electronic medical record data; respectively segmenting a source sequence and a target sequence; endowing the segmented source sequence and target sequence with corresponding index identity information and storing the index identity information into a preset vocabulary; obtaining a vector representation of the source sequence; acquiring a predicted vector of a target sequence and probability distribution corresponding to each predicted vector; calculating a cross entropy loss function of each word segmentation and summing to obtain a cross entropy total loss; calculating the gradient corresponding to the total loss of the cross entropy relative to the preset parameters through a back propagation algorithm; updating preset parameters through an optimization algorithm until the total cross entropy loss reaches a preset range; setting the optimized training language model as a pre-training language model.
It should be noted that common large-model fine-tuning strategies include:
1. Additional pre-training (Additional Pre-training): pre-training of the language model is continued on additional, more relevant data to strengthen the model's understanding of a particular task, for example pre-training a language model on dialogue data to improve its dialogue generation capability.
2. Feature extraction (Feature Extraction): the pre-trained language model is frozen and used as a feature extractor. Dialogue data is fed into the model to obtain feature vectors, which are then fed into a downstream task model for training. The downstream task model learns the dialogue-related mapping, while the language model provides semantic features.
3. Downstream fine-tuning (Downstream Finetuning): limited fine-tuning is performed on the pre-trained language model, using data of the relevant task to further train some of the language model's parameters. The general knowledge provided by pre-training is retained while task-related knowledge is learned.
4. Incremental learning (Incremental Learning): when task-related data is scarce, incremental learning can be performed on the pre-trained language model, using only a small amount of data at a time to fine-tune the model parameters. Through multiple rounds of incremental learning, the model gradually accumulates task-related knowledge and improves.
5. Joint learning (Joint Learning): the pre-trained language model and the downstream task model are combined for end-to-end learning, updating the language model parameters and the task model parameters together, so that the two models interact and jointly learn dialogue understanding and generation. All of these methods embody the idea of transfer learning: fine-tuning on the basis of a pre-trained language model with the dataset of a relevant task to improve the model's performance on that task.
This patent designs its fine-tuning algorithm based on the additional pre-training and downstream fine-tuning strategies, improving on them with the idea of dilation: on the one hand the number of updated parameters is reduced, and on the other hand the range of influence of the update is increased. Dilation was first used for high-level features in the image domain, where convolutions are computed over pixels of neighboring regions; by setting the spacing of sampled points, features can be extracted over a larger range of the image. This patent borrows the dilation idea to control the update range of large-model parameters; compared with traditional fine-tuning, it bounds the amount of computation while still guaranteeing the model's update range.
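As one possible reading of this dilation idea (the stride, the layer granularity, and the stand-in blocks below are illustrative assumptions, not parameters disclosed by the patent), fine-tuning could unfreeze only a strided subset of layers, reducing the number of updated parameters while spreading the update across the full depth of the network:

```python
import torch.nn as nn

def apply_dilated_finetuning(layers: nn.ModuleList, stride: int = 3) -> None:
    """Freeze all layers, then unfreeze every `stride`-th layer so updates
    are sparse in number but span the full depth of the model."""
    for i, layer in enumerate(layers):
        trainable = (i % stride == 0)
        for p in layer.parameters():
            p.requires_grad = trainable

# Usage with a stand-in stack of 12 blocks:
blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(12)])
apply_dilated_finetuning(blocks, stride=3)
n_trainable = sum(p.numel() for b in blocks
                  for p in b.parameters() if p.requires_grad)
print(n_trainable)  # parameters the optimizer will actually update
```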
In one embodiment, to prevent the neural network from fabricating data or delivering untruthful content during prediction, a restriction operation that imposes additional conditions on the prediction process may be performed as follows:
acquiring user search interaction information; generating corresponding restriction information based on the user search interaction information, where each piece of restriction information contains the two opening sentences of the dialogue and represents a dialogue scene; retrieving the medical record document according to the retrained pre-trained language model and obtaining a retrieval result; judging whether the retrieval result complies with the restriction information; if not, canceling generation of the retrieval result. Although this is still text-sequence generation (Sequence2Sequence), the sample construction differs, as follows. Inputs: Please determine the medical record documents meeting the search criteria based on the context. Context: content of medical record document 1, content of medical record document 2 …, content of medical record document K; the search criteria are [xxx, the user's actual input]. Outputs: Medical record document 1 meets the search criteria, because xxxx in medical record document 1 refers to yyyyy. The output training samples are manually constructed.
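A minimal sketch of assembling such a constrained input and vetoing non-compliant results (the function names and the simple forbidden-phrase check are illustrative assumptions; the patent's actual compliance check is trained from manually constructed samples):

```python
from typing import Optional

def build_prompt(documents: list[str], user_query: str) -> str:
    """Assemble the Inputs/Context sample format described above."""
    context = "; ".join(f"medical record document {i + 1}: {d}"
                        for i, d in enumerate(documents))
    return ("Please determine the medical record documents meeting the "
            "search criteria based on the context. "
            f"Context: {context}. The search criteria are [{user_query}]")

def check_restrictions(reply: str, restrictions: list[str]) -> Optional[str]:
    """Cancel generation of the retrieval result if it violates any
    restriction; a forbidden-phrase test stands in for the real check."""
    if any(term in reply for term in restrictions):
        return None   # result does not comply: generation is canceled
    return reply

prompt = build_prompt(["fever for 3 days ..."], "patients admitted with fever")
print(prompt)
```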
In one embodiment, given that existing electronic medical records are typically semi-structured, the specific operations for semi-structured electronic medical records may be performed as follows:
obtaining medical record document information, wherein the medical record document information is a preset semi-structured medical record; constructing a vector expression model of the medical record document; cutting off the medical record document vector according to the TF-IDF value and obtaining a preset value matrix; performing a dimension reduction operation on the preset value matrix to obtain low-dimensional vectors; clustering the low-dimensional vectors and generating corresponding index information; and retrieving the medical record document through the index information. The document vector consists of three parts, of lengths 500, 100, and 50 respectively:
w1 to w500 are plain-text word vectors; if a single document exceeds 500 words, the words are truncated according to their TF-IDF values;
x1 to x100 are template word vectors, e.g. consciousness [awake], where "consciousness" is the template word; this part represents the key variables of the medical record and is represented separately by a vector of length 100, with the same truncation strategy as above;
y1 to y50 are type vectors, covering the department and the document type (such as admission records, discharge records, etc.), with the same truncation strategy as above.
Since each word vector is a fixed 64-dimensional vector, the document takes the form of a 650×64 matrix M, which can be flattened into a 1-dimensional 41600×1 vector for ease of computing vector similarity.
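A compact sketch of this document representation (the hash-based token embedding and the empty TF-IDF table are illustrative stand-ins for trained word vectors and real corpus statistics):

```python
import numpy as np

DIM = 64  # fixed per-word vector dimension

def embed_token(tok: str) -> np.ndarray:
    """Stand-in embedding: a per-token pseudo-random vector, reproducible
    within one run; a real system would use trained word vectors."""
    if not tok:
        return np.zeros(DIM)
    rng = np.random.default_rng(abs(hash(tok)) % (2 ** 32))
    return rng.standard_normal(DIM)

def truncate_by_tfidf(tokens: list[str], tfidf: dict, limit: int) -> list[str]:
    """Keep at most `limit` tokens, dropping the lowest-TF-IDF ones first,
    then pad with empty slots to the fixed length."""
    if len(tokens) > limit:
        tokens = sorted(tokens, key=lambda t: -tfidf.get(t, 0.0))[:limit]
    return tokens + [""] * (limit - len(tokens))

def document_vector(text_toks, template_toks, type_toks, tfidf) -> np.ndarray:
    parts = (truncate_by_tfidf(text_toks, tfidf, 500)        # w1..w500 plain text
             + truncate_by_tfidf(template_toks, tfidf, 100)  # x1..x100 template words
             + truncate_by_tfidf(type_toks, tfidf, 50))      # y1..y50 type vectors
    M = np.stack([embed_token(t) for t in parts])            # 650 x 64 matrix M
    return M.reshape(-1)                                     # flattened to 41600

v = document_vector(["fever", "cough"], ["consciousness"], ["admission record"], {})
print(v.shape)  # (41600,)
```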
It should be noted that, after the medical record documents become retrievable through the index information, the method further includes: acquiring a short text document and converting it into a vector; after both the short text document and the medical record documents have been converted into vectors, calculating the group corresponding to the short text document through vector similarity and recalling the documents in the same group; acquiring key document information from the documents in the same group according to similarity; splicing the original texts of the key document information and setting them as the context of the short text document; synthesizing the key document information and the short text document to update the short text document; sending the updated short text document to the pre-trained language model and obtaining a returned document; and setting the returned document as the reply.
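A sketch of the dimension reduction, clustering, and same-group recall steps (PCA and k-means are assumptions for the reduction and clustering algorithms, which the patent does not name; the data is random for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
doc_vecs = rng.standard_normal((200, 41600)).astype(np.float32)  # flattened docs

pca = PCA(n_components=128)               # dimension reduction to low-dim vectors
low = pca.fit_transform(doc_vecs)
kmeans = KMeans(n_clusters=8, n_init=10)  # cluster the low-dimensional vectors
groups = kmeans.fit_predict(low)          # index information: a group per document

def recall_same_group(query_vec: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Map the short-text vector into the same space, find its group, and
    return the most similar (key) documents inside that group."""
    q = pca.transform(query_vec.reshape(1, -1)).ravel()
    g = kmeans.predict(q.reshape(1, -1))[0]
    idx = np.where(groups == g)[0]                      # documents in the group
    sims = low[idx] @ q / (np.linalg.norm(low[idx], axis=1)
                           * np.linalg.norm(q) + 1e-9)  # cosine similarity
    return idx[np.argsort(-sims)[:top_k]]               # key document indices

key_docs = recall_same_group(doc_vecs[0])
# The originals of these key documents would then be spliced as the context
# of the short text document before it is sent to the language model.
print(key_docs)
```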
In one embodiment, considering that the user may be dissatisfied with the answer, the retrieval operation may be selected to be re-performed, and the re-performed retrieval operation may be performed as:
if a rejected short text document submitted by the user is received, the returned document and the rejected short text document are combined to generate a short text document for secondary retrieval; the secondary-retrieval short text document is sent to the pre-trained language model to obtain a secondary returned document; and the secondary returned document is set as the secondary reply.
Based on the method, the embodiment of the application also discloses a medical record retrieval system based on the pre-training language model.
The system comprises the following modules as shown in fig. 2:
a pre-training language model construction module 201, configured to generate a pre-training language model based on existing corpus information;
a pre-training language model adjustment module 202, configured to adjust the pre-training language model based on a preset medical record library;
a pre-trained language model fine-tuning module 203, configured to retrain the adjusted pre-trained language model;
a training model reply module 204 for retrieving medical record documents and generating replies according to the retrained pre-training language model.
In one embodiment, the pre-training language model construction module 201 is further configured to obtain a source sequence and a target sequence through corpus information; respectively segmenting a source sequence and a target sequence; endowing the segmented source sequence and target sequence with corresponding index identity information and storing the index identity information into a preset vocabulary; obtaining a vector representation of the source sequence; acquiring a predicted vector of a target sequence and probability distribution corresponding to each predicted vector; calculating a cross entropy loss function of each word segmentation and summing to obtain a cross entropy total loss; calculating the gradient of the cross entropy total loss relative to each parameter through a back propagation algorithm; updating parameters through an optimization algorithm until the total cross entropy loss reaches a preset range; setting the optimized training language model as a pre-training language model.
In one embodiment, the pre-training language model adjustment module 202 is further configured to adjust the pre-training language model based on a preset medical record library, including: different electronic medical record data are stored in the medical record library; acquiring a source sequence and a target sequence through electronic medical record data; respectively segmenting a source sequence and a target sequence; endowing the segmented source sequence and target sequence with corresponding index identity information and storing the index identity information into a preset vocabulary; obtaining a vector representation of the source sequence; acquiring a predicted vector of a target sequence and probability distribution corresponding to each predicted vector; calculating a cross entropy loss function of each word segmentation and summing to obtain a cross entropy total loss; calculating the gradient corresponding to the total loss of the cross entropy relative to the preset parameters through a back propagation algorithm; updating preset parameters through an optimization algorithm until the total cross entropy loss reaches a preset range; setting the optimized training language model as a pre-training language model.
In one embodiment, the pre-trained language model fine-tuning module 203 is further configured to retrieve medical record documents according to the retrained pre-trained language model, including: acquiring user search interaction information; generating corresponding restriction information based on the user search interaction information; retrieving the medical record document according to the retrained pre-trained language model and obtaining a retrieval result; judging whether the retrieval result complies with the restriction information; and if not, canceling generation of the retrieval result.
In one embodiment, the pre-trained language model fine-tuning module 203 is further configured to retrieve medical record documents according to the retrained pre-trained language model, including: obtaining medical record document information, wherein the medical record document information is a preset semi-structured medical record; constructing a vector expression model of the medical record document; cutting off the medical record document vector according to the TF-IDF value and obtaining a preset value matrix; performing a dimension reduction operation on the preset value matrix to obtain low-dimensional vectors; clustering the low-dimensional vectors and generating corresponding index information; and retrieving the medical record document through the index information.
In one embodiment, the pre-trained language model fine-tuning module 203 is further configured, after the medical record document is retrieved through the index information, to: acquire a short text document and convert it into a vector; after both the short text document and the medical record documents have been converted into vectors, calculate the group corresponding to the short text document through vector similarity and recall the documents in the same group; acquire key document information from the documents in the same group according to similarity; splice the original texts of the key document information and set them as the context of the short text document; synthesize the key document information and the short text document to update the short text document; send the updated short text document to the pre-trained language model and obtain a returned document; and set the returned document as the reply.
In one embodiment, the pre-trained language model fine-tuning module 203 is further configured to, if a rejected short text document submitted by the user is received, combine the returned document and the rejected short text document to generate a short text document for secondary retrieval; send the secondary-retrieval short text document to the pre-trained language model and acquire a secondary returned document; and set the secondary returned document as the secondary reply.
The embodiment of the application also discloses computer equipment.
In particular, the computer device includes a memory and a processor, the memory having stored thereon a computer program that can be loaded by the processor and that performs the medical record retrieval method based on the pre-trained language model described above.
The present embodiment provides a computer readable storage medium storing executable instructions that, when executed by a processor, perform a method of medical record retrieval based on a pre-trained language model as in the method embodiment described above.
In this specification, each embodiment is described in a progressive manner, with identical and similar parts referred to between embodiments and each embodiment mainly describing its differences from the others; it should be clear to those skilled in the art that the above embodiments may be used alone or in combination with each other as required. For the device embodiment, since it corresponds to the method embodiment, its description is relatively brief; for relevant points, refer to the description of the corresponding part of the method embodiment. The system embodiments described above are merely illustrative, in that the modules illustrated as separate components may or may not be physically separate.
The present application may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present application.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: portable computer disks, hard disks, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM or flash memory), Static Random Access Memory (SRAM), portable Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disks (DVD), memory sticks, floppy disks, mechanical encoding devices such as punch cards or raised structures in grooves having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present application may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present application are implemented by customizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The foregoing description of embodiments of the application has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the application is defined by the appended claims.

Claims (6)

1. A medical record retrieval method based on a pre-trained language model, the method comprising:
generating a pre-training language model based on existing corpus information, the generating comprising:
acquiring a source sequence and a target sequence through corpus information;
respectively segmenting a source sequence and a target sequence;
endowing the segmented source sequence and target sequence with corresponding index identity information and storing the index identity information into a preset vocabulary;
obtaining a vector representation of the source sequence;
acquiring a predicted vector of a target sequence and probability distribution corresponding to each predicted vector;
calculating a cross entropy loss function of each word segmentation and summing to obtain a cross entropy total loss;
calculating the gradient of the cross entropy total loss relative to each parameter through a back propagation algorithm;
updating parameters through an optimization algorithm until the total cross entropy loss reaches a preset range;
setting the optimized training language model as a pre-training language model;
adjusting the pre-training language model based on a preset medical record library;
retraining the adjusted pre-trained language model;
acquiring user searching interaction information;
generating corresponding restriction information based on the user search interaction information;
obtaining medical record document information, wherein the medical record document information is a preset semi-structured medical record;
constructing a vector expression model of the medical record document;
cutting off the medical record document vector according to the TF-IDF value and obtaining a preset value matrix;
performing dimension reduction operation on a preset value matrix to obtain a low-dimension vector;
clustering the low-dimensional vectors and generating corresponding index information;
retrieving the medical record document through the index information;
acquiring a short text document and converting the short text document into a vector;
after both the short text document and the medical record documents have been converted into vectors, calculating the group corresponding to the short text document through vector similarity and recalling the documents in the same group;
acquiring key document information from documents in the same group according to the similarity;
splicing original texts of the key document information and setting the original texts as contexts of short text documents;
synthesizing the key document information and the short text document text to update the short text document;
sending the updated short text document to the pre-training language model and obtaining a return document;
setting the returned document as a reply;
judging whether the search result accords with the limit information;
if not, canceling the generation of the search result.
2. The medical record retrieval method based on a pre-training language model according to claim 1, wherein the adjusting the pre-training language model based on the preset medical record library comprises:
different electronic medical record data are stored in the medical record library;
acquiring a source sequence and a target sequence through electronic medical record data;
respectively segmenting a source sequence and a target sequence;
endowing the segmented source sequence and target sequence with corresponding index identity information and storing the index identity information into a preset vocabulary;
obtaining a vector representation of the source sequence;
acquiring a predicted vector of a target sequence and probability distribution corresponding to each predicted vector;
calculating a cross entropy loss function of each word segmentation and summing to obtain a cross entropy total loss;
calculating the gradient corresponding to the total loss of the cross entropy relative to the preset parameters through a back propagation algorithm;
updating preset parameters through an optimization algorithm until the total cross entropy loss reaches a preset range;
setting the optimized training language model as a pre-training language model.
3. The pre-trained language model based medical record retrieval method according to claim 2, wherein the method further comprises:
if the rejected short text document submitted by the user is received, combining the returned document and the rejected short text document to generate a secondary search short text document;
sending the secondary search short text document to a pre-training language model and acquiring a secondary return document;
and setting the secondary return document as a secondary reply.
4. A medical record retrieval system based on a pre-trained language model, the system comprising:
a pre-training language model construction module (201), configured to generate a pre-training language model based on existing corpus information, specifically by:
acquiring a source sequence and a target sequence through corpus information;
respectively segmenting a source sequence and a target sequence;
endowing the segmented source sequence and target sequence with corresponding index identity information and storing the index identity information into a preset vocabulary;
obtaining a vector representation of the source sequence;
acquiring a predicted vector of a target sequence and probability distribution corresponding to each predicted vector;
calculating a cross entropy loss function of each word segmentation and summing to obtain a cross entropy total loss;
calculating the gradient of the cross entropy total loss relative to each parameter through a back propagation algorithm;
updating parameters through an optimization algorithm until the total cross entropy loss reaches a preset range;
setting the optimized training language model as a pre-training language model;
The pre-training language model adjusting module (202) is used for adjusting the pre-training language model based on a preset medical record library;
a pre-trained language model fine-tuning module (203) for retraining the adjusted pre-trained language model;
a training model reply module (204) for obtaining user search interaction information;
generating corresponding restriction information based on the user search interaction information;
obtaining medical record document information, wherein the medical record document information is a preset semi-structured medical record;
constructing a vector expression model of the medical record document;
cutting off the medical record document vector according to the TF-IDF value and obtaining a preset value matrix;
performing dimension reduction operation on a preset value matrix to obtain a low-dimension vector;
clustering the low-dimensional vectors and generating corresponding index information;
retrieving the medical record document through the index information;
acquiring a short text document and converting the short text document into a vector;
after both the short text document and the medical record documents have been converted into vectors, calculating the group corresponding to the short text document through vector similarity and recalling the documents in the same group;
acquiring key document information from documents in the same group according to the similarity;
splicing original texts of the key document information and setting the original texts as contexts of short text documents;
synthesizing the key document information and the short text document text to update the short text document;
sending the updated short text document to the pre-training language model and obtaining a return document;
setting the returned document as a reply;
judging whether the search result accords with the limit information;
if not, canceling the generation of the search result.
5. A computer device comprising a memory and a processor, the memory having stored thereon a computer program capable of being loaded by the processor and performing the method according to any of claims 1 to 3.
6. A computer readable storage medium, characterized in that a computer program is stored which can be loaded by a processor and which performs the method according to any of claims 1 to 3.
CN202310905527.8A 2023-07-24 2023-07-24 Medical record retrieval method and system based on pre-training language model Active CN116628171B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310905527.8A CN116628171B (en) 2023-07-24 2023-07-24 Medical record retrieval method and system based on pre-training language model


Publications (2)

Publication Number Publication Date
CN116628171A CN116628171A (en) 2023-08-22
CN116628171B true CN116628171B (en) 2023-10-20

Family

ID=87592425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310905527.8A Active CN116628171B (en) 2023-07-24 2023-07-24 Medical record retrieval method and system based on pre-training language model

Country Status (1)

Country Link
CN (1) CN116628171B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076650B (en) * 2023-10-13 2024-02-23 之江实验室 Intelligent dialogue method, device, medium and equipment based on large language model
CN117494693B (en) * 2023-12-25 2024-03-15 广东省科技基础条件平台中心 Evaluation document generation method, device and equipment


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330127A (en) * 2017-07-21 2017-11-07 湘潭大学 A kind of Similar Text detection method retrieved based on textual image
CN110196894A (en) * 2019-05-30 2019-09-03 北京百度网讯科技有限公司 The training method and prediction technique of language model
CN111613339A (en) * 2020-05-15 2020-09-01 山东大学 Similar medical record searching method and system based on deep learning
CN112256860A (en) * 2020-11-25 2021-01-22 携程计算机技术(上海)有限公司 Semantic retrieval method, system, equipment and storage medium for customer service conversation content
CN112559686A (en) * 2020-12-11 2021-03-26 北京百度网讯科技有限公司 Information retrieval method and device and electronic equipment
US11194972B1 (en) * 2021-02-19 2021-12-07 Institute Of Automation, Chinese Academy Of Sciences Semantic sentiment analysis method fusing in-depth features and time sequence models
CN113220890A (en) * 2021-06-10 2021-08-06 长春工业大学 Deep learning method combining news headlines and news long text contents based on pre-training
CN113821622A (en) * 2021-09-29 2021-12-21 平安银行股份有限公司 Answer retrieval method and device based on artificial intelligence, electronic equipment and medium
EP4116861A2 (en) * 2021-11-05 2023-01-11 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for pre-training semantic representation model and electronic device
CN114020874A (en) * 2021-11-11 2022-02-08 万里云医疗信息科技(北京)有限公司 Medical record retrieval system, method, equipment and computer readable storage medium
CN114420232A (en) * 2022-01-17 2022-04-29 深圳万海思数字医疗有限公司 Method and system for generating health education data based on electronic medical record data
CN114780678A (en) * 2022-04-02 2022-07-22 中南民族大学 Text retrieval method, device, equipment and storage medium
CN116227603A (en) * 2023-05-10 2023-06-06 山东财经大学 Event reasoning task processing method, device and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cheng Yuhua; Lai Maosheng. Research on an information retrieval model based on D-S evidence theory. Library and Information Service, 2017, No. 21. *

Also Published As

Publication number Publication date
CN116628171A (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN112487182B (en) Training method of text processing model, text processing method and device
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11501182B2 (en) Method and apparatus for generating model
US20230100376A1 (en) Text sentence processing method and apparatus, computer device, and storage medium
CN107783960B (en) Method, device and equipment for extracting information
US11875787B2 (en) Synthetic data generation for training of natural language understanding models
US11803758B2 (en) Adversarial pretraining of machine learning models
US20240046043A1 (en) Multi-turn Dialogue Response Generation with Template Generation
US11797822B2 (en) Neural network having input and hidden layers of equal units
CN116628171B (en) Medical record retrieval method and system based on pre-training language model
WO2018145098A1 (en) Systems and methods for automatic semantic token tagging
CN107832306A (en) A kind of similar entities method for digging based on Doc2vec
US10963647B2 (en) Predicting probability of occurrence of a string using sequence of vectors
EP3563302A1 (en) Processing sequential data using recurrent neural networks
CN115495555A (en) Document retrieval method and system based on deep learning
Landthaler et al. Extending Thesauri Using Word Embeddings and the Intersection Method.
CA3155096A1 (en) Augmenting attention-based neural networks to selectively attend to past inputs
CN110781666A (en) Natural language processing text modeling based on generative countermeasure networks
Wang et al. Data augmentation for internet of things dialog system
CN113869005A (en) Pre-training model method and system based on sentence similarity
CN116662502A (en) Method, equipment and storage medium for generating financial question-answer text based on retrieval enhancement
US20230281400A1 (en) Systems and Methods for Pretraining Image Processing Models
CN114692610A (en) Keyword determination method and device
CN112949313A (en) Information processing model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant