CN116401381B - Method and device for accelerating extraction of medical relations - Google Patents


Info

Publication number
CN116401381B
Authority
CN
China
Prior art keywords
entity
text
predicted
extraction
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310670289.7A
Other languages
Chinese (zh)
Other versions
CN116401381A (en
Inventor
宋佳祥
白琨太
刘硕
杨雅婷
许娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Health China Technologies Co Ltd
Original Assignee
Digital Health China Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Health China Technologies Co Ltd filed Critical Digital Health China Technologies Co Ltd
Priority to CN202310670289.7A priority Critical patent/CN116401381B/en
Publication of CN116401381A publication Critical patent/CN116401381A/en
Application granted granted Critical
Publication of CN116401381B publication Critical patent/CN116401381B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Epidemiology (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a device for accelerating medical relation extraction. Before entity prediction, the texts to be predicted are processed in length and number by a sorting-and-merging rule mechanism, which adjusts the padded length of the texts and reduces the number of prediction batches. This shortens the time spent in the multi-head self-attention layers when the texts pass through the BERT model and improves both prediction efficiency and relation extraction efficiency. Before medical relation extraction, the entities used to construct entity pairs are checked: no features are constructed for two entities of the same type, and entities of different types are analyzed further, so that same-type entity pairs and pairs whose first entity should not be a head entity are removed. The constructed entity pairs are thereby made lightweight, which further improves prediction efficiency and relation extraction efficiency.

Description

Method and device for accelerating extraction of medical relations
Technical Field
The invention belongs to the technical field of medical language processing, and particularly relates to a method and a device for accelerating medical relation extraction.
Background
Most medical records in Chinese hospitals are currently written in natural language. Such unstructured records cannot be used directly by machines and must first be converted into structured information by natural language processing. With the development of medical informatization, extracting key information accurately and quickly from massive electronic medical records and building structured models that conform to medical specifications has become a key step in the secondary use of data. Medical record structuring is mainly based on information extraction technology and involves entity extraction, relation extraction, entity standardization and the like. Relation extraction is a key step of medical record structuring. Current relation extraction methods fall into two categories, pipeline and joint: the joint method decodes entities and relations together, so relations are extracted in a single step; the pipeline method takes two steps, first recognizing entities and then extracting relations on the basis of the recognized entities.
However, existing pipeline relation extraction has the following problems to be optimized.
In relation extraction, two factors affect extraction speed: the number of predicted entities and the length of the text to be predicted.
1. In existing pipeline relation extraction, the more entities are predicted, the longer relation prediction takes, because entities must be combined pairwise into entity pairs and the model must judge for each constructed pair whether a relation exists. The predicted entities themselves cannot be optimized, but the constructed entity pairs can; at present they are not. As a result, the more entities are predicted, the more pairwise entity pairs are produced, which lengthens prediction time and lowers prediction efficiency and relation extraction efficiency.
2. In existing pipeline relation extraction, the length and number of the texts to be predicted are not processed by a sorting-and-merging rule mechanism. Texts are therefore padded to excessive lengths and the number of prediction batches is too large; the longer the input, the more time the multi-head self-attention layers of the BERT model take, which lengthens prediction time and lowers prediction efficiency and relation extraction efficiency.
Disclosure of Invention
In view of the foregoing deficiencies of the prior art, the present application provides a method and apparatus for expediting medical relationship extraction.
In a first aspect, the present application proposes a method for accelerating the extraction of medical relationships, comprising the steps of:
acquiring original diagnosis case data from a hospital database, and extracting an original text corpus from the original diagnosis case data by means of regular expressions;
sorting and counting the original text corpus by text length to construct a text length quantity relation table, merging and padding the texts of the original text corpus according to the text length quantity relation table, and integrating the processing results into texts to be predicted;
inputting the texts to be predicted into a preset entity extraction model for entity extraction to obtain an entity prediction result;
performing entity pair lightweighting on the entity prediction result through an entity pair analysis mechanism to obtain an entity prediction optimization result;
inputting the entity prediction optimization result into a preset relation extraction model for relation extraction to obtain a final medical relation extraction result.
In some embodiments, sorting and counting the original text corpus by text length, constructing the text length quantity relation table, merging and padding the texts according to the table, and integrating the processing results into texts to be predicted comprises:
sorting the original text corpus by text length, counting the texts of each length, and constructing the text length quantity relation table;
setting a merging threshold according to the longest and shortest text lengths in the text length quantity relation table, and merging the original text corpus into batches based on the merging threshold and a preset maximum number of texts per prediction batch;
padding all texts after batch merging, and integrating the padded texts into the texts to be predicted.
In some embodiments, the text to be predicted is input into the preset entity extraction model for entity extraction to obtain the entity prediction result, and the processing procedure of the entity extraction model comprises the following steps:
converting the text to be predicted into a first digital representation according to the BERT vocabulary;
constructing features to be predicted: enumerating candidate spans by permutation and combination and, for each span to be predicted, constructing its start index, end index and span length;
converting the first digital representation and the features to be predicted into a first data tensor;
inputting the first data tensor into the BERT model to obtain an entity prediction result for each span to be predicted.
In some embodiments, performing entity pair lightweighting on the entity prediction result through the entity pair analysis mechanism to obtain the entity prediction optimization result comprises:
setting an entity pair analysis mechanism, which comprises: judging the entity types of two entities; if the types are the same, constructing no features; if the types are different, further judging whether the two entities can form an entity pair;
removing, according to the entity pair analysis mechanism, same-type entity pairs and entity pairs that cannot form a relation from the entity prediction result, to obtain the entity prediction optimization result.
In some embodiments, the entity prediction optimization result is input into the preset relation extraction model for relation extraction to obtain the final medical relation extraction result, and the processing procedure of the relation extraction model comprises the following steps:
constructing entity pairs according to the entity prediction optimization result, wherein each sample to be predicted is formed from the start position index, end position index and entity type of each entity in the pair;
converting the sample to be predicted into a second digital representation according to the BERT vocabulary;
converting the second digital representation into a second data tensor;
inputting the second data tensor into the BERT model to obtain a relation extraction result.
In a second aspect, the present application proposes a device for accelerating medical relation extraction, comprising an original text acquisition module, a text data processing module, an entity extraction module, an entity optimization module and a relation extraction module;
the original text acquisition module is configured to acquire original diagnosis case data from a hospital database and to extract an original text corpus from the original diagnosis case data by means of regular expressions;
the text data processing module is configured to sort and count the original text corpus by text length, construct a text length quantity relation table, merge and pad the texts of the original text corpus according to the text length quantity relation table, and integrate the processing results into texts to be predicted;
the entity extraction module is configured to input the texts to be predicted into a preset entity extraction model for entity extraction to obtain an entity prediction result;
the entity optimization module is configured to perform entity pair lightweighting on the entity prediction result through an entity pair analysis mechanism to obtain an entity prediction optimization result;
the relation extraction module is configured to input the entity prediction optimization result into a preset relation extraction model for relation extraction to obtain a final medical relation extraction result.
In some embodiments, the text data processing module comprises a relation table construction unit, a batch merging processing unit and a text integration unit to be predicted;
the relation table construction unit is used for sequencing the original text corpus according to the text length, counting according to the text length and constructing a text length quantity relation table;
the batch merging processing unit is used for setting a merging threshold according to the longest text length and the shortest text length in the text length quantity relation table, and carrying out batch merging processing on the original text corpus based on the merging threshold and a preset text prediction batch maximum number;
the text to be predicted integrating unit is used for carrying out filling processing on all texts after batch merging processing, and integrating the texts after filling processing results into the text to be predicted.
In some embodiments, the entity extraction module includes a first digital representation conversion unit, a feature to be predicted construction unit, a first data tensor conversion unit, and an entity prediction unit;
the first digital representation conversion unit is used for converting the text to be predicted into a first digital representation according to a bert vocabulary;
the feature to be predicted constructing unit is configured to construct a feature to be predicted: constructing a starting subscript, an ending subscript and a span of a range to be predicted by using a permutation and combination mode;
the first data tensor conversion unit is used for converting the first digital representation and the feature to be predicted into a first data tensor;
the entity prediction unit is used for inputting the first data tensor into a bert model to obtain an entity prediction result of each range to be predicted.
In some embodiments, the entity optimization module includes an entity pair analysis mechanism setting unit and an entity prediction optimization unit;
the entity pair analysis mechanism setting unit is configured to set an entity pair analysis mechanism, which comprises: judging the entity types of two entities; if the types are the same, constructing no features; if the types are different, further judging whether the two entities can form an entity pair;
the entity prediction optimization unit is configured to remove, according to the entity pair analysis mechanism, same-type entity pairs and entity pairs that cannot form a relation from the entity prediction result, to obtain the entity prediction optimization result.
In some embodiments, the relationship extraction module includes a sample to be predicted construction unit, a second digital representation conversion unit, a second data tensor conversion unit, and a relationship extraction unit;
the sample to be predicted constructing unit is used for constructing an entity pair according to the entity prediction optimization result, wherein the constructing process comprises a starting position index, an ending position index and an entity type of each entity to form a sample to be predicted;
the second digital representation conversion unit is used for converting the sample to be predicted into a second digital representation according to a bert vocabulary;
the second data tensor conversion unit is used for converting the second digital representation into a second data tensor;
the relation extraction unit is used for inputting the second data tensor into the bert model to obtain a relation extraction result.
In a third aspect, the present application proposes a computer device comprising:
a memory for storing a computer program; and
a processor for implementing the steps of any of the methods described above when executing the computer program stored in the memory.
In a fourth aspect, the present application proposes a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of any of the methods described above.
The invention has the beneficial effects that:
Before entity prediction, the length and number of the texts to be predicted are processed by a sorting-and-merging rule mechanism, which adjusts the padded length of the texts and reduces the number of prediction batches. This shortens the time spent in the multi-head self-attention layers when the texts pass through the BERT model and improves prediction efficiency and relation extraction efficiency.
Before medical relation extraction, the entities used to construct entity pairs are checked: no features are constructed for two entities of the same type, and entities of different types are analyzed further, so that same-type entity pairs and pairs whose first entity should not be a head entity are removed. The constructed entity pairs are thereby made lightweight, which improves prediction efficiency and relation extraction efficiency.
Drawings
Fig. 1 is a general flow chart of the present invention.
Fig. 2 is a schematic block diagram of the apparatus of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In a first aspect, the present application proposes a method for accelerating extraction of medical relationships, as shown in fig. 1, including the following steps:
s100: acquiring original diagnosis case data from a database of a hospital, and extracting an original text corpus from the original diagnosis case data through a regular expression;
Specifically, the text of the medical record front pages in the hospital database is cleaned to obtain texts of different types, such as diagnosis, examination and laboratory test texts, and texts of the same type are collected to form the original text corpus.
S200: sorting and counting the original text corpus by text length to construct a text length quantity relation table, merging and padding the texts of the original text corpus according to the text length quantity relation table, and integrating the processing results into texts to be predicted;
in some embodiments, this step comprises the following:
sorting the original text corpus by text length, counting the texts of each length, and constructing the text length quantity relation table;
In general, texts in the original text corpus are not predicted one at a time but in batches. For example, with 1000 texts to be predicted and 100 texts per batch, 10 predictions are needed; in each batch the longest of the 100 texts is found and all other texts are padded to that length.
For example, if the longest text in a batch has length 300 while all other texts are no longer than 50, every text still has to be padded to 300.
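The cost of padding to the batch maximum can be made concrete with a minimal sketch. The function name `padded_cells` and the sample lengths are illustrative assumptions, not part of the original method; the sketch only counts how many token positions reach the model when each batch is padded to its longest text, with and without grouping texts of similar length.

```python
def padded_cells(lengths, batch_size):
    """Total token positions fed to the model when every batch is
    padded to the length of its longest text."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical text lengths: two long texts mixed with two short ones.
lengths = [300, 40, 290, 45]

# Batches taken in arrival order: [300, 40] and [290, 45].
unsorted_cost = padded_cells(lengths, batch_size=2)

# Batches taken after sorting by length: [40, 45] and [290, 300].
sorted_cost = padded_cells(sorted(lengths), batch_size=2)
```

In this toy case the sorted order feeds far fewer padded positions to the model, which is the effect the sorting step aims for.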
Therefore, before prediction, the 1000 texts are sorted by length and counted by length. The sorting code is as follows:
text_data = sorted(obj.get("ann_info"),
                   key=lambda x: len(x.get("text")))
The purpose of sorting is to place texts of similar length together, so that short texts receive little padding and prediction time is reduced. The sorting result is shown in Table 1 below:
TABLE 1
Typically, the length distribution of the 1000 texts is relatively sparse; the cases discussed in the following paragraphs illustrate this.
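A text length quantity relation table of this kind could be built as in the following minimal sketch. The function name `length_count_table` and the sample corpus are assumptions made for illustration; the source does not show how the table is computed.

```python
from collections import Counter


def length_count_table(texts):
    """Return (text length, number of texts) rows, shortest length first."""
    counts = Counter(len(t) for t in texts)
    return sorted(counts.items())


# Hypothetical corpus; in practice this would be the original text corpus.
texts = ["aa", "bbbb", "cc", "dddddd", "ee"]
table = length_count_table(texts)
```

Each row pairs a length with the number of texts of that length, which is the information the merging step needs.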
Setting a merging threshold according to the longest and shortest text lengths in the text length quantity relation table, and merging the original text corpus into batches based on the merging threshold and a preset maximum number of texts per prediction batch;
the number of texts finally fed to the model per batch is determined by the preset maximum batch size and by the merging threshold, which bounds the difference between the longest and shortest text lengths in a batch. The merging threshold is generally set to about 30 and the batch size to about 32. The merging threshold is the primary factor and the batch size the secondary factor. For example, if 100 texts fall within a merging threshold of 30, 32 texts are fed to the model per batch and 100/32 rounded up, that is 4 batches, are required; if only 30 texts fall within a merging threshold of 20, they are fed to the model in a single batch.
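The batch counts in these examples follow from rounding up the quotient of group size and batch size. A minimal sketch, in which the helper name `num_batches` is an assumption:

```python
import math


def num_batches(num_texts, batch_size):
    """Number of model calls needed for one merged length group."""
    return math.ceil(num_texts / batch_size)


batches_for_100 = num_batches(100, 32)  # 100 texts, batch size 32
batches_for_30 = num_batches(30, 32)    # fewer texts than one batch
```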
As the table above shows, there are 234 texts of length 12 and each batch holds 100 texts, so the longest text length for these batches can be set to 12 + 6, where the 6 extra positions are the special tokens [CLS], [SEP], [unused3], [unused4], [unused23] and [unused24]. The longest text length in a batch is obtained with:
max_seq_length = max([len(i) for data in batch_data for i in data.get("sendees")]) + 6
The texts of length 12 can thus be predicted in 3 batches.
Similarly, texts of lengths 19 and 22 are handled as described above.
When the number of texts of a given length is far smaller than the batch size (a ratio can be set for this), those texts can be merged for prediction with texts of the next greater lengths. The merging rule is:
longest text length - shortest text length <= threshold. For the table above, with the threshold set to 30, lengths 43, 56 and 60 can be predicted together, while length 103 is predicted alone.
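The rule "longest minus shortest is at most the threshold" can be sketched as a greedy pass over the sorted lengths. This is one possible reading of the rule, not the code given in the source; the function name `merge_by_length` is an assumption.

```python
def merge_by_length(lengths, threshold):
    """Greedily group distinct lengths so that, within each group,
    longest - shortest <= threshold."""
    groups = []
    for length in sorted(set(lengths)):
        # groups[-1][0] is the shortest length of the current group.
        if groups and length - groups[-1][0] <= threshold:
            groups[-1].append(length)
        else:
            groups.append([length])
    return groups


# Lengths taken from the example in the text, threshold 30.
groups = merge_by_length([43, 56, 60, 103], threshold=30)
```

With these inputs, 43, 56 and 60 end up in one group and 103 in a group of its own, matching the example above.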
Further, the code implementing the merging rule is as follows:
from collections import defaultdict

batch_size = 32
diff_length = 30
# texts grouped by length interval
len_count = defaultdict(list)
# obtain the texts
texts = [i.get("text") for i in obj.get("ann_info")]
# obtain the longest text length
max_len = max(len(i) for i in texts)
# build the length intervals from the longest length and the length-difference threshold
intervals = [[diff_length * i, diff_length * (i + 1)] for i in range(int(max_len / diff_length) + 1)]
# traverse all texts and store each one in its length interval
for text in texts:
    for interval in intervals:
        # half-open interval, so a length equal to a boundary is not dropped
        if interval[0] <= len(text) < interval[1]:
            len_count[str(interval[0]) + "~" + str(interval[1])].append(text)
            break
# batch prediction (predict is the model's prediction function, defined elsewhere)
results = []
for interval in len_count:
    if len(len_count[interval]) > batch_size:
        for i in range(0, len(len_count[interval]), batch_size):
            results.append(predict(len_count[interval][i:i + batch_size]))
    else:
        results.append(predict(len_count[interval]))
Padding all texts after batch merging, and integrating the padded texts into the texts to be predicted.
Texts are padded before being fed into the model. Shorter texts are zero-padded up to the length of the longest text plus 6 (the longest text also needs the 6 extra positions), and the padded batch is then fed to the model for prediction. Take "fever cough for 3 days and pharyngalgia for one week" as an example: if the longest text in the same batch has length 300, this text is padded to length 300, with the padded part represented by 0. Since each token is converted into a 512-dimensional vector during model prediction, padding changes the representation of this text from [11, 512] to [300, 512].
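The zero-padding step can be sketched as follows. The function name `pad_batch` and the token ids are hypothetical, and the 6 special-token positions mentioned above are left out to keep the example short.

```python
def pad_batch(token_id_batch, pad_id=0):
    """Right-pad every id sequence with pad_id up to the longest one."""
    max_len = max(len(ids) for ids in token_id_batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in token_id_batch]


# Hypothetical token ids for three texts of different lengths.
batch = [[5, 6, 7], [8], [9, 10]]
padded = pad_batch(batch)
```

After padding, every sequence in the batch has the same length, so the batch can be converted into a single rectangular tensor.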
S300: inputting the text to be predicted into a preset entity extraction model to carry out entity extraction, so as to obtain an entity prediction result;
in some embodiments, the inputting the text to be predicted into a preset entity extraction model for entity extraction to obtain an entity prediction result, and the processing procedure of the entity extraction model includes the following steps:
converting the text to be predicted into a first digital representation according to a bert vocabulary;
constructing characteristics to be predicted: constructing a starting subscript, an ending subscript and a span of a range to be predicted by using a permutation and combination mode;
converting the first digital representation and the feature to be predicted into a first data tensor;
and inputting the first data tensor into a bert model to obtain an entity prediction result of each range to be predicted.
The entity extraction model uses BERT, a current mainstream model. It consists of 12 Transformer (self-attention neural network) encoder layers, each containing Multi-Head Attention, Layer Norm, Feed Forward and Add, that is, multi-head self-attention, layer normalization, a feed-forward fully connected network and a residual connection. The processing flow is as follows:
step 1: each character in the BERT vocabulary corresponds to an id, and the text to be predicted is converted into a digital representation accordingly;
step 2: the features to be predicted are constructed by enumerating candidate spans of the text by permutation and combination (for example "fever", "fever cough", and so on); for each span to be predicted, the start index, end index and span length are constructed, for example: fever -> (0, 1, 2), where 0 is the start index, 1 is the end index and 2 is the length of "fever";
step 3: the results of steps 1 and 2 are converted into a tensor (data tensor);
step 4: the result of step 3 is fed into the BERT model to obtain an entity prediction result for each span to be predicted.
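Step 2 can be sketched as a plain enumeration of (start index, end index, length) triples. The function name `enumerate_spans` and the cap `max_span_len` are assumptions; the source does not say whether span length is limited. Each character of the placeholder string stands for one token of the text.

```python
def enumerate_spans(text, max_span_len):
    """List (start index, end index, span length) for every candidate
    span of at most max_span_len tokens; end index is inclusive."""
    spans = []
    for start in range(len(text)):
        last_end = min(start + max_span_len, len(text))
        for end in range(start, last_end):
            spans.append((start, end, end - start + 1))
    return spans


# Placeholder four-token text; the first two tokens play the role of "fever".
spans = enumerate_spans("ABCD", max_span_len=2)
```

The triple (0, 1, 2) in the result corresponds to the "fever" example in step 2: start index 0, end index 1, length 2.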
S400: performing entity pair lightweighting on the entity prediction result through an entity pair analysis mechanism to obtain an entity prediction optimization result;
in some embodiments, this step comprises:
setting an entity pair analysis mechanism, which comprises: judging the entity types of two entities; if the types are the same, constructing no features; if the types are different, further judging whether the two entities can form an entity pair;
removing, according to the entity pair analysis mechanism, same-type entity pairs and entity pairs that cannot form a relation from the entity prediction result, to obtain the entity prediction optimization result.
The more entities there are in the entity prediction result, the longer relation prediction takes, because entities must be combined pairwise into entity pairs and the model must judge for each constructed pair whether a relation exists. The predicted entities cannot be optimized, but the constructed entity pairs can. Take the above flow as an example: there are two entity types, symptom and duration, and one relation type, duration. With 5 predicted entities, pairwise construction yields 5 × (5 - 1) = 20 entity pairs. However, fever and cough cannot form a relation, and neither can "3 days" and "one week", because the two entities in each case are of the same type (fever and cough, for instance, are both symptoms). The required relation type is duration, so one entity must be a symptom and the other a duration, and on this basis it can be judged whether two entities can form a relation. A symptom paired with a symptom cannot form a relation, nor can a duration paired with a duration. In addition, a duration cannot be the head entity: the entity pair [3 days, fever] has no relation. For this relation, entity A must be a symptom and entity B must be a duration; only then can the pair AB form a relation, and if the positions are swapped it cannot. These admissible type combinations are obtained from the manually annotated corpus during model training.
The two kinds of combinations are removed through the following implementation steps:
1. judge the entity types of the two entities; if the entity types are the same, do not construct features;
2. if the two entity types are different, judge whether the two entities can form an entity pair (that is, whether entity A is a symptom and entity B is a duration):
import os

# Load the entity type pairs that can form a relation from the model path
tot_pos_ner_pairs = get_params(os.path.join(obj.get("entity_model_dir"), "positive_ner_pairs.json"))
# Convert each entity type pair from a list into a tuple
tot_pos_ner_pairs = [tuple(i) for i in tot_pos_ner_pairs]
# When constructing features, keep the sample only if its type pair is in tot_pos_ner_pairs
if (sample['sub_type'], sample['obj_type']) in tot_pos_ner_pairs:
    sent_samples.append(sample)
After the code is executed, entity pairs of the same type and entity pairs whose head entity is not allowed are removed, and the number of entity pairs that finally need to be predicted is 6: [fever, 3 days], [cough, 3 days], [fever, one week], [cough, one week], [pharyngalgia, 3 days], [pharyngalgia, one week]. Compared with the 20 entity pairs to be predicted previously, the number is reduced by 70% (roughly 2/3). Through these two methods, the prediction speed of the model can be greatly accelerated in actual production, improving the prediction efficiency of the model.
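The pruning in this example can be reproduced with a small self-contained sketch. The entity list, the `ALLOWED_TYPE_PAIRS` set (a stand-in for the contents of positive_ner_pairs.json, which the source does not show) and the function names are illustrative assumptions, not the patent's actual implementation.

```python
from itertools import permutations

# Hypothetical predicted entities as (text, type); types follow the example above.
entities = [
    ("fever", "symptom"),
    ("cough", "symptom"),
    ("pharyngalgia", "symptom"),
    ("3 days", "duration"),
    ("one week", "duration"),
]

# Assumed stand-in for positive_ner_pairs.json: ordered (head type, tail type)
# combinations that are able to form a relation.
ALLOWED_TYPE_PAIRS = {("symptom", "duration")}


def all_ordered_pairs(ents):
    """Every ordered pair of distinct entities: n * (n - 1) candidates."""
    return list(permutations(ents, 2))


def prune_pairs(pairs, allowed):
    """Keep only pairs whose (head type, tail type) is in the allowed set."""
    return [(a, b) for a, b in pairs if (a[1], b[1]) in allowed]


candidates = all_ordered_pairs(entities)
kept = prune_pairs(candidates, ALLOWED_TYPE_PAIRS)
print(len(candidates), len(kept))  # 20 6
```

Because the allowed set stores ordered type pairs, a single membership test rejects both same-type pairs and pairs whose head and tail are in the wrong order.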
S500: inputting the entity prediction optimization result into a preset relation extraction model to perform relation extraction, and obtaining a final medical relation extraction result.
In some embodiments, when the entity prediction optimization result is input into the preset relation extraction model for relation extraction to obtain the final medical relation extraction result, the processing procedure of the relation extraction model includes the following steps:
constructing an entity pair according to the entity prediction optimization result, wherein the construction process comprises a starting position index, an ending position index and an entity type of each entity to form a sample to be predicted;
converting the sample to be predicted into a second digital representation according to a bert vocabulary;
converting the second digital representation into a second data tensor;
and inputting the second data tensor into a bert model to obtain a relation extraction result.
The relation extraction model is the same as the entity extraction model except for how the features are constructed. Its processing flow is as follows:
Step 1: construct entity pairs (if the number of predicted entities is n, the number of constructed entity pairs is n(n-1); the construction covers the start position index, end position index and entity type of each entity). Each entity pair is a sample to be predicted;
Step 2: convert the samples from step 1 into a digital representation,
as shown in Table 2 below:
TABLE 2
Here [CLS] and [SEP] represent the sentence start and end symbols, respectively; [unused3] marks the start of the fever entity and [unused4] marks its end; [unused23] marks the start of the cough entity and [unused24] marks its end.
Step 3: convert the result of step 2 into a tensor (data tensor);
Step 4: feed the result of step 3 into the bert model to obtain the prediction result.
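As a rough illustration of how a sample for the relation model might be marked up and turned into ids, the sketch below wraps the head and tail entity spans with marker tokens and looks the tokens up in a vocabulary. The assignment of marker tokens to entities, the toy vocabulary and both function names are assumptions made for illustration; a real system would use the bert vocabulary file, and the sketch assumes the two spans do not overlap.

```python
def build_marked_tokens(tokens, head, tail, head_markers, tail_markers):
    """Wrap the head and tail spans with marker tokens and add [CLS]/[SEP].

    `head` and `tail` are (start, end) token indices with an inclusive end.
    """
    opens = {head[0]: head_markers[0], tail[0]: tail_markers[0]}
    closes = {head[1]: head_markers[1], tail[1]: tail_markers[1]}
    out = ["[CLS]"]
    for i, tok in enumerate(tokens):
        if i in opens:
            out.append(opens[i])   # marker before the first token of a span
        out.append(tok)
        if i in closes:
            out.append(closes[i])  # marker after the last token of a span
    out.append("[SEP]")
    return out


def to_ids(marked, vocab, unk_id=0):
    """Look up each token in the vocabulary (toy stand-in for the bert vocabulary)."""
    return [vocab.get(tok, unk_id) for tok in marked]


tokens = ["fever", "for", "3", "days"]
marked = build_marked_tokens(
    tokens,
    head=(0, 0),
    tail=(2, 3),
    head_markers=("[unused3]", "[unused4]"),    # assumed markers for the head entity
    tail_markers=("[unused23]", "[unused24]"),  # assumed markers for the tail entity
)
# Toy vocabulary built from the sample itself; ids start at 1 so 0 can mean "unknown".
vocab = {tok: i for i, tok in enumerate(sorted(set(marked)), start=1)}
ids = to_ids(marked, vocab)
```

The resulting list of ids is what would then be converted into a data tensor and passed to the model.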
In a second aspect, the present application proposes a device for accelerating extraction of medical relationships, as shown in fig. 2, including an original text acquisition module, a text data processing module, an entity extraction module, an entity optimization module, and a relationship extraction module;
the original text acquisition module is used for acquiring original diagnosis case data from a database of a hospital and extracting original text corpus from the original diagnosis case data through a regular expression;
the text data processing module is used for sorting and counting the original text corpus according to text length, constructing a text length quantity relation table, merging and supplementing the texts of the original text corpus according to the text length quantity relation table, and integrating the processing results into the text to be predicted;
the entity extraction module is used for inputting the text to be predicted into a preset entity extraction model to carry out entity extraction so as to obtain an entity prediction result;
the entity optimization module is used for performing entity pair lightweight processing on the entity prediction result through an entity analysis mechanism to obtain an entity prediction optimization result;
the relation extraction module is used for inputting the entity prediction optimization result into a preset relation extraction model to perform relation extraction, and obtaining a final medical relation extraction result.
In some embodiments, the text data processing module comprises a relation table construction unit, a batch merging processing unit and a text integration unit to be predicted;
the relation table construction unit is used for sorting the original text corpus according to text length, counting according to text length, and constructing a text length quantity relation table;
the batch merging processing unit is used for setting a merging threshold according to the longest text length and the shortest text length in the text length quantity relation table, and carrying out batch merging processing on the original text corpus based on the merging threshold and a preset text prediction batch maximum number;
the text to be predicted integrating unit is used for performing filling (padding) processing on all texts after batch merging processing, and integrating the filled texts into the text to be predicted.
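The three units can be pictured with the following minimal sketch. It assumes the merging threshold is the largest length difference allowed inside one batch and takes it as a parameter, since this section does not give the formula that derives it from the longest and shortest text lengths. All function names and the `[PAD]` filler token are hypothetical.

```python
from collections import Counter


def length_table(texts):
    """Map each text length to the number of texts of that length, sorted by length."""
    return dict(sorted(Counter(len(t) for t in texts).items()))


def merge_into_batches(texts, merge_threshold, max_batch_size):
    """Group length-sorted texts so that each batch holds at most
    `max_batch_size` texts whose lengths differ by at most `merge_threshold`."""
    batches, current = [], []
    for text in sorted(texts, key=len):
        too_full = len(current) >= max_batch_size
        too_long = bool(current) and len(text) - len(current[0]) > merge_threshold
        if too_full or too_long:
            batches.append(current)
            current = []
        current.append(text)
    if current:
        batches.append(current)
    return batches


def pad_batch(batch, pad_token="[PAD]"):
    """Pad every text (as a list of characters) to the longest length in its batch."""
    width = max(len(t) for t in batch)
    return [list(t) + [pad_token] * (width - len(t)) for t in batch]


texts = ["ab", "abc", "abcdefgh", "a", "abcdefg"]
table = length_table(texts)
batches = merge_into_batches(texts, merge_threshold=2, max_batch_size=4)
padded = [pad_batch(b) for b in batches]
```

Grouping texts of similar length means each batch is padded only to its own longest text rather than to the longest text in the whole corpus, which is where the time saving would come from.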
In some embodiments, the entity extraction module includes a first digital representation conversion unit, a feature to be predicted construction unit, a first data tensor conversion unit, and an entity prediction unit;
the first digital representation conversion unit is used for converting the text to be predicted into a first digital representation according to a bert vocabulary;
the feature to be predicted constructing unit is configured to construct the feature to be predicted: constructing, by enumerating combinations, the start index, end index and span width of each range to be predicted;
the first data tensor conversion unit is used for converting the first digital representation and the feature to be predicted into a first data tensor;
the entity prediction unit is used for inputting the first data tensor into a bert model to obtain an entity prediction result of each range to be predicted.
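One common way to build such candidate ranges is to enumerate every (start, end) combination up to a maximum width. The sketch below assumes an inclusive end index and a `max_span_width` cap; both are illustrative choices that the source does not specify.

```python
def enumerate_spans(num_tokens, max_span_width):
    """List (start, end, width) for every candidate range, with an inclusive end index."""
    spans = []
    for start in range(num_tokens):
        last_end = min(start + max_span_width, num_tokens)
        for end in range(start, last_end):
            spans.append((start, end, end - start + 1))
    return spans


spans = enumerate_spans(num_tokens=4, max_span_width=2)
print(len(spans))  # 7
```

Each tuple is one range for which the entity model would predict an entity type or no entity.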
In some embodiments, the entity optimization module includes an entity pair analysis mechanism setting unit and an entity prediction optimization unit;
the entity analysis mechanism setting unit is configured to set an entity analysis mechanism, where the entity analysis mechanism includes: judging entity types of the two entities, if the entity types are the same, not performing feature construction, and if the entity types are different, then judging whether the two entities can form an entity pair;
and the entity prediction optimization unit is used for removing entity pairs of the same type and entity pairs which cannot form a relation in the entity prediction result according to the entity pair analysis mechanism to obtain the entity prediction optimization result.
In some embodiments, the relationship extraction module includes a sample to be predicted construction unit, a second digital representation conversion unit, a second data tensor conversion unit, and a relationship extraction unit;
the sample to be predicted constructing unit is used for constructing an entity pair according to the entity prediction optimization result, wherein the constructing process comprises a starting position index, an ending position index and an entity type of each entity to form a sample to be predicted;
the second digital representation conversion unit is used for converting the sample to be predicted into a second digital representation according to a bert vocabulary;
the second data tensor conversion unit is used for converting the second digital representation into a second data tensor;
the relation extraction unit is used for inputting the second data tensor into the bert model to obtain a relation extraction result.
In a third aspect, the present application proposes a computer device comprising:
a memory for storing a computer program;
and a processor for implementing the steps of any of the methods described above when executing the computer program stored in the memory.
In a fourth aspect, the present application proposes a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of any of the methods described above. A computer program may be divided into one or more modules/units, which are stored in a memory and executed by the processor to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing particular functions, the instruction segments being used to describe the execution of the computer program in a computer device.
The computer device may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. Computer devices may include, but are not limited to, processors and memory. Those skilled in the art will appreciate that a computer device may include more or fewer components, or may combine certain components, or different components, e.g., a computer device may also include input and output devices, network access devices, buses, etc.
The processor may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may be an internal storage unit of the computer device, for example, a hard disk or a memory of the computer device. The memory may also be an external storage device of the computer device, for example, a plug-in hard disk provided on the computer device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like. Further, the memory may also include both internal storage units and external storage devices of the computer device. The memory is used to store computer programs and other programs and data required by the computer device. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative, e.g., the division of modules or elements is merely a logical functional division, and there may be additional divisions of actual implementations, multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method of the above embodiments may be implemented by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and improvements made by those skilled in the art without departing from the present technical solution shall be considered as falling within the scope of the claims.

Claims (6)

1. A method for expediting extraction of medical relationships, characterized in that the method comprises the following steps:
acquiring original diagnosis case data from a database of a hospital, and extracting an original text corpus from the original diagnosis case data through a regular expression;
sequencing and counting the original text corpus according to text length, constructing a text length quantity relation table, merging and supplementing the texts of the original text corpus according to the text length quantity relation table, and integrating the processing result into a text to be predicted, wherein the method comprises the following steps of: sorting the original text corpus according to text lengths, counting according to the text lengths, and constructing a text length quantity relation table; setting a merging threshold according to the longest text length and the shortest text length in the text length quantity relation table, and carrying out batch merging processing on the original text corpus based on the merging threshold and a preset text prediction batch maximum number; performing filling processing on all texts after batch merging processing, and integrating the texts after filling processing results into a text to be predicted;
inputting the text to be predicted into a preset entity extraction model to carry out entity extraction, so as to obtain an entity prediction result;
performing entity pair lightweight processing on the entity prediction result through an entity analysis mechanism to obtain an entity prediction optimization result, which comprises the following steps: setting an entity analysis mechanism, the entity analysis mechanism comprising: judging entity types of the two entities, if the entity types are the same, not performing feature construction, and if the entity types are different, then judging whether the two entities can form an entity pair; removing entity pairs of the same type and entity pairs which cannot form a relation in the entity prediction result according to the entity pair analysis mechanism to obtain an entity prediction optimization result;
inputting the entity prediction optimization result into a preset relation extraction model to perform relation extraction, and obtaining a final medical relation extraction result.
2. The method according to claim 1, characterized in that: inputting the text to be predicted into a preset entity extraction model for entity extraction to obtain an entity prediction result, wherein the processing procedure of the entity extraction model comprises the following steps:
converting the text to be predicted into a first digital representation according to a bert vocabulary;
constructing characteristics to be predicted: constructing a starting subscript, an ending subscript and a span of a range to be predicted by using a permutation and combination mode;
converting the first digital representation and the feature to be predicted into a first data tensor;
and inputting the first data tensor into a bert model to obtain an entity prediction result of each range to be predicted.
3. The method according to claim 2, characterized in that: inputting the entity prediction optimization result into a preset relation extraction model for relation extraction to obtain a final medical relation extraction result, wherein the processing process of the relation extraction model comprises the following steps of:
constructing an entity pair according to the entity prediction optimization result, wherein the construction process comprises a starting position index, an ending position index and an entity type of each entity to form a sample to be predicted;
converting the sample to be predicted into a second digital representation according to a bert vocabulary;
converting the second digital representation into a second data tensor;
and inputting the second data tensor into a bert model to obtain a relation extraction result.
4. A device for expediting the extraction of medical relationships, characterized by comprising: an original text acquisition module, a text data processing module, an entity extraction module, an entity optimization module and a relation extraction module;
the original text acquisition module is used for acquiring original diagnosis case data from a database of a hospital and extracting original text corpus from the original diagnosis case data through a regular expression;
the text data processing module is used for sequencing and counting the original text corpus according to text lengths, constructing a text length quantity relation table, combining and supplementing texts of the original text corpus according to the text length quantity relation table, and integrating processing results into texts to be predicted; the text data processing module comprises a relation table construction unit, a batch merging processing unit and a text integration unit to be predicted; the relation table construction unit is used for sequencing the original text corpus according to the text length, counting according to the text length and constructing a text length quantity relation table; the batch merging processing unit is used for setting a merging threshold according to the longest text length and the shortest text length in the text length quantity relation table, and carrying out batch merging processing on the original text corpus based on the merging threshold and a preset text prediction batch maximum number; the text to be predicted integrating unit is used for carrying out filling processing on all texts after batch merging processing, and integrating the texts after filling processing results into the text to be predicted;
the entity extraction module is used for inputting the text to be predicted into a preset entity extraction model to carry out entity extraction so as to obtain an entity prediction result;
the entity optimization module is used for carrying out entity pair light weight processing on the entity prediction result through an entity analysis mechanism to obtain an entity prediction optimization result; the entity optimization module comprises an entity pair analysis mechanism setting unit and an entity prediction optimization unit; the entity analysis mechanism setting unit is configured to set an entity analysis mechanism, where the entity analysis mechanism includes: judging entity types of the two entities, if the entity types are the same, not performing feature construction, and if the entity types are different, then judging whether the two entities can form an entity pair; the entity prediction optimization unit is used for removing entity pairs of the same type and entity pairs which cannot form a relation in the entity prediction result according to the entity pair analysis mechanism to obtain an entity prediction optimization result;
the relation extraction module is used for inputting the entity prediction optimization result into a preset relation extraction model to perform relation extraction, and obtaining a final medical relation extraction result.
5. The apparatus according to claim 4, wherein: the entity extraction module comprises a first digital representation conversion unit, a feature to be predicted construction unit, a first data tensor conversion unit and an entity prediction unit;
the first digital representation conversion unit is used for converting the text to be predicted into a first digital representation according to a bert vocabulary;
the feature to be predicted constructing unit is configured to construct a feature to be predicted: constructing a starting subscript, an ending subscript and a span of a range to be predicted by using a permutation and combination mode;
the first data tensor conversion unit is used for converting the first digital representation and the feature to be predicted into a first data tensor;
the entity prediction unit is used for inputting the first data tensor into a bert model to obtain an entity prediction result of each range to be predicted.
6. The apparatus according to claim 5, wherein: the relation extraction module comprises a sample construction unit to be predicted, a second digital representation conversion unit, a second data tensor conversion unit and a relation extraction unit;
the sample to be predicted constructing unit is used for constructing an entity pair according to the entity prediction optimization result, wherein the constructing process comprises a starting position index, an ending position index and an entity type of each entity to form a sample to be predicted;
the second digital representation conversion unit is used for converting the sample to be predicted into a second digital representation according to a bert vocabulary;
the second data tensor conversion unit is used for converting the second digital representation into a second data tensor;
the relation extraction unit is used for inputting the second data tensor into the bert model to obtain a relation extraction result.
CN202310670289.7A 2023-06-07 2023-06-07 Method and device for accelerating extraction of medical relations Active CN116401381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310670289.7A CN116401381B (en) 2023-06-07 2023-06-07 Method and device for accelerating extraction of medical relations

Publications (2)

Publication Number Publication Date
CN116401381A CN116401381A (en) 2023-07-07
CN116401381B true CN116401381B (en) 2023-08-04

Family

ID=87018369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310670289.7A Active CN116401381B (en) 2023-06-07 2023-06-07 Method and device for accelerating extraction of medical relations

Country Status (1)

Country Link
CN (1) CN116401381B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117766137B (en) * 2024-02-22 2024-05-28 广东省人民医院 Medical diagnosis result determining method and device based on reinforcement learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020431A (en) * 2019-03-06 2019-07-16 平安科技(深圳)有限公司 Feature extracting method, device, computer equipment and the storage medium of text information
CN113408296A (en) * 2021-06-24 2021-09-17 东软集团股份有限公司 Text information extraction method, device and equipment
CN114996472A (en) * 2022-05-26 2022-09-02 神州医疗科技股份有限公司 Sample optimization method and system based on relation extraction model
CN115906838A (en) * 2021-08-20 2023-04-04 广东博智林机器人有限公司 Text extraction method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015077942A1 (en) * 2013-11-27 2015-06-04 Hewlett-Packard Development Company, L.P. Relationship extraction
CN107403067A (en) * 2017-07-31 2017-11-28 京东方科技集团股份有限公司 Intelligence based on medical knowledge base point examines server, terminal and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant