CN111553161B - Entity and relation labeling system for medical texts - Google Patents

Info

Publication number
CN111553161B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010347165.1A
Other languages
Chinese (zh)
Other versions
CN111553161A (en)
Inventor
张坤丽
赵旭
谢琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/451 Execution arrangements for user interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses an entity and relation labeling platform for medical texts. The platform integrates several preprocessing algorithms to preprocess the texts to be labeled, and performs named entity recognition with rule-based and deep-learning-based methods to pre-label them. The platform provides progress control for labeling tasks and shows the progress of the multiple labeling rounds in real time. The entities, the relation offsets and the original text of each labeled file are stored in JSON format, and a labeling comparison report is generated for quality control of the task. By integrating these algorithms with progress control and quality control, the labeling platform improves the efficiency of manual labeling and helps ensure the quality of the constructed corpus.

Description

Entity and relation labeling system for medical texts
Technical Field
The invention belongs to the technical field of text labeling, and particularly relates to an entity and relationship labeling system for medical texts.
Background
The growing volume of medical text data brings great opportunities and challenges to the development of the whole industry. Most medical text data is semi-structured or unstructured, and scientific research applications can only be carried out after it is converted into structured data that a computer can process; labeling the text is the basis of this structuring. The corpora obtained by text labeling are very important resources and are the basis of related research such as named entity recognition and automatic relation extraction. High-quality labeled corpora are still scarce, and corpora that can be used for research are especially few. This shortage of labeled text resources stands in sharp contrast to the massive amount of text currently available and hinders in-depth research on language resources. Text labeling is extremely heavy and tedious work, and traditional manual labeling is time-consuming, labor-intensive and costly, which deters many researchers and slows the construction of resources.
Disclosure of Invention
Purpose of the invention: to overcome the drawbacks of labeling in the prior art, such as being time-consuming and labor-intensive, complicated to operate, and difficult to control in quality and progress, the invention constructs an extensible semi-automatic labeling platform that integrates multiple automatic recognition and extraction algorithms and provides entity, relation and attribute labeling functions together with a labeled-data analysis function.
Technical solution: to achieve the above purpose, the invention adopts the following technical solution. The invention discloses an entity and relation labeling system for medical texts. Its system architecture is shown in FIG. 1 and comprises five modules: a task center, an algorithm factory, a database, user configuration and a WEB interface.
The task center module is used for creating and distributing labeling tasks and for managing tasks and user permissions;
the algorithm factory module provides built-in preprocessing and named entity recognition algorithms to assist the user in labeling;
the database module is used for storing the annotation files, the user information and the task information;
the user configuration module is used by the user to pre-configure the entity items of the system and the relation items between entities;
and the WEB interface module provides a visual operation interface, so that the user can conveniently label the text and display and analyze the data. Each module is described in detail below:
1) Task center module
The task center module is the core module of the system, and its responsibilities cover two aspects: task management and permission management. Task management is responsible for data uploading, task creation, task allocation and progress control. Permission management includes user grouping and permission control. Users in the system are divided into two types, administrators and ordinary users. An administrator can create and allocate tasks and group users, while an ordinary user can only upload data and label data.
Before the text data can be labeled, the files to be labeled must be uploaded. The system supports three file formats, TXT, Markdown and JSON, and several files can be selected and uploaded at the same time. After the upload is complete, an administrator can create a labeling task in the system. During creation, the labeling task can be allocated to the corresponding users, and a preprocessing algorithm built into the system can be selected; the preprocessing algorithms are described in detail in the next section. After the administrator has allocated the task, each ordinary user can see the task to be labeled in their own system interface and start labeling.
The labeling process of the system is divided into three rounds: a first round, a second round and a third round of labeling.
The specific labeling task is to label the entities and relations present in a text. A medical text labeling task, for example, aims to label entities such as diseases, symptoms and drugs present in the text, together with the relations between them (for example, a "disease symptom" relation between a disease and a symptom, and a "drug therapy" relation between a disease and a drug).
The specific labeling process is as follows. User 1 first performs the first round of labeling on the original text, which generates the first-round annotation file. User 2 then performs the second round of labeling on the basis of the first-round annotation file, correcting problems such as missed or wrong labels made by user 1, which generates the second-round annotation file. Finally, user 1 performs the third round of labeling on the basis of the second-round annotation file and refines the labeling result once more; this refined result is taken as the final labeling result, and the labeling task is then complete.
Progress in the labeling process is measured by the number of files, since each labeling task usually includes several text files to be labeled. The progress of the first round is the number of first-round annotation files divided by the number of source files. The progress of the second round is the number of second-round annotation files divided by the number of first-round annotation files, because the second round is performed on the basis of the first round and a second-round annotation file can only be generated after the corresponding first-round file is complete. Similarly, the progress of the third round is the number of third-round annotation files divided by the number of second-round annotation files.
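The progress rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's actual code: the function name `round_progress` and its parameters are hypothetical, and a round whose input round has produced no files yet is assumed to report zero progress.

```python
def round_progress(source_files, round1_files, round2_files, round3_files):
    """Return the progress of each labeling round as a fraction in [0, 1]."""

    def ratio(done, total):
        # Assumption: a round with no input files yet has made no progress.
        return done / total if total else 0.0

    return {
        "round1": ratio(round1_files, source_files),  # first-round files / source files
        "round2": ratio(round2_files, round1_files),  # second-round files / first-round files
        "round3": ratio(round3_files, round2_files),  # third-round files / second-round files
    }


progress = round_progress(source_files=10, round1_files=8, round2_files=4, round3_files=1)
print(progress)  # {'round1': 0.8, 'round2': 0.5, 'round3': 0.25}
```

Note that the denominator of each later round grows as the previous round advances, so the displayed percentage of a later round can drop while work continues.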
2) Algorithm factory module
Preprocessing algorithm
The preprocessing algorithms are used to segment the data to be labeled and to complete entity names; an administrator can choose whether to enable them when creating a task.
The data set to be labeled can be segmented in two different ways, sentence by sentence or chapter by chapter, and the administrator chooses the segmentation mode when creating the task. If sentence-by-sentence segmentation is selected, the content of one sentence is displayed at a time during labeling, and labeling proceeds sentence by sentence. If chapter-by-chapter segmentation is selected, the whole content of one file is displayed at a time, and labeling proceeds in file order.
Entity name completion is intended for labeling tasks with longer texts. The specific operation flow is as follows:
first, the original text sequence is input;
second, the original text is split into sentences at sentence-ending periods;
third, completion is performed with the name of the entity to be completed, as entered by the administrator on the system interface: the entity is added in front of each sentence and separated from the sentence by the @ symbol;
fourth, the completed sentences are reassembled into an article, the algorithm ends, and labeling follows as the next step.
For example, suppose a document to be labeled mainly concerns entities and relations related to chronic atrial fibrillation, but the entity "chronic atrial fibrillation" appears only at the beginning of the document and is referred to as "the disease" afterwards. If the sentence "the disease will lead to xxxx symptoms" appears later, the relation between chronic atrial fibrillation and the symptoms needs to be labeled, which is impossible because the entity "chronic atrial fibrillation" does not appear in that sentence. The entity name therefore needs to be completed by adding it at the head of the sentence, so that "chronic atrial fibrillation" stands next to "causes xxxx symptoms" and the relation can be labeled smoothly.
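The four completion steps can be sketched as follows. This is a minimal illustration rather than the platform's implementation: the function name `complete_entity_name`, the `joiner` parameter and the choice to split after either a Chinese "。" or an ASCII "." are assumptions made for the example, and such a naive split would also break on decimals or abbreviations.

```python
import re


def complete_entity_name(text, entity, sep="@", joiner=""):
    """Prefix every sentence of `text` with `entity` followed by `sep`."""
    # Step 2: split right after each sentence-ending period, keeping the period.
    sentences = [s.strip() for s in re.split(r"(?<=[。.])", text) if s.strip()]
    # Step 3: add the entity in front of each sentence, separated by "@".
    completed = [f"{entity}{sep}{sentence}" for sentence in sentences]
    # Step 4: reassemble the completed sentences into one article.
    return joiner.join(completed)


text = "Chronic atrial fibrillation is a common arrhythmia. The disease will lead to xxxx symptoms."
completed = complete_entity_name(text, "chronic atrial fibrillation", joiner=" ")
print(completed)
```

After completion, the second sentence carries the entity name at its head, so an annotator working sentence by sentence can label the relation between the disease and the symptom.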
Named Entity Recognition (NER) algorithm
The NER algorithms built into the system automatically recognize named entities such as diseases, drugs and symptoms appearing in the text and pre-label the text, which reduces the manual labeling workload and improves labeling efficiency. This step takes place while an ordinary user is labeling: the user can choose to enable it during labeling to speed up the process.
The system includes two types of algorithms, rule-based and deep-learning-based. The rule-based algorithm uses dictionaries constructed by experts in the medical field. In a dictionary, each line is one entity, and entities are separated by line feeds. The dictionaries built for medical text labeling, for example, include a disease dictionary, a symptom dictionary and a drug dictionary (in the disease dictionary each line is the name of one disease, and the other dictionaries are similar).
The recognition process of the rule-based algorithm is as follows:
the first step: input the text sequence to be processed;
the second step: select the dictionary to be used;
the third step: sort the entities in the selected dictionary in descending order of length, so that longer entities are matched first;
the fourth step: match the entities of the sorted dictionary against the text sequence to obtain the entities that occur in it;
the fifth step: return the set of matched entities, and the algorithm ends.
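The five steps above can be sketched as a longest-match-first dictionary lookup. The sketch below is an assumption about how such matching might be written, not the patented implementation: the helper names `load_dictionary` and `match_entities`, the `DIS` label and the rule that a shorter entity may not overlap an already matched longer one are all illustrative choices.

```python
def load_dictionary(raw):
    """Parse the content of a dictionary file: one entity per line."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def match_entities(text, dictionary, label):
    """Return (start, end, label, entity) tuples for dictionary entities found in text."""
    # Step 3: sort by descending length so that longer entities are matched first.
    ordered = sorted(set(dictionary), key=len, reverse=True)
    taken = [False] * len(text)  # characters already covered by an earlier (longer) match
    found = []
    # Step 4: scan the text for every entity of the sorted dictionary.
    for entity in ordered:
        start = text.find(entity)
        while start != -1:
            end = start + len(entity)
            if not any(taken[start:end]):
                taken[start:end] = [True] * len(entity)
                found.append((start, end, label, entity))
            start = text.find(entity, start + 1)
    # Step 5: return the matched entities in text order.
    return sorted(found)


disease_dict = load_dictionary("atrial fibrillation\nchronic atrial fibrillation\n")
text = "chronic atrial fibrillation may follow atrial fibrillation"
matches = match_entities(text, disease_dict, "DIS")
print([m[3] for m in matches])
# ['chronic atrial fibrillation', 'atrial fibrillation']
```

Sorting by length first is what prevents the shorter entry "atrial fibrillation" from claiming part of the longer mention at the start of the text.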
The deep-learning-based algorithm uses a character-based BiLSTM-CRF model. The training data of the model is a portion of text labeled in advance using the BIO tagging scheme. For example, the text "cold can cause headache" is labeled as "feel B-DIS common I-DIS can cause O head B-SYM pain I-SYM" (a token-by-token rendering of the original Chinese sentence, in which "cold" and "headache" each span two characters), where DIS represents a disease and SYM represents a symptom, both being labels defined manually in advance; B marks the beginning of an entity, I marks the remainder of an entity, and O marks text that does not belong to any entity.
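To make the BIO scheme concrete, the following sketch converts entity spans into BIO tags for the example above. It is an illustration only: the function name `to_bio` is hypothetical, spans are assumed to be given as token indices with an exclusive end, and the tokens are taken from the translated gloss of the example sentence.

```python
def to_bio(tokens, spans):
    """Build BIO tags from (start, end, label) spans; `end` is exclusive."""
    tags = ["O"] * len(tokens)          # everything outside an entity is "O"
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the entity
    return tags


tokens = ["feel", "common", "can cause", "head", "pain"]
spans = [(0, 2, "DIS"), (3, 5, "SYM")]
tags = to_bio(tokens, spans)
print(list(zip(tokens, tags)))
# [('feel', 'B-DIS'), ('common', 'I-DIS'), ('can cause', 'O'), ('head', 'B-SYM'), ('pain', 'I-SYM')]
```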
The labeled data set is then input into the neural network model for training, which yields the trained neural network model.
After training, the model can be used to perform named entity recognition on unlabeled text and recognize entities such as diseases and symptoms. The specific recognition process is as follows:
first, the text sequence to be processed is input;
second, the trained deep learning model is loaded;
third, the text is input into the model;
fourth, the entities are extracted from the result returned by the model;
fifth, the set of recognized entities is returned, and the algorithm ends.
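The fourth step, extracting entities from the model output, can be sketched as a BIO decoder. The model itself is not shown here; the sketch assumes that it returns one BIO tag per input token, and the function name `decode_bio` and the handling of a stray I- tag (treated as outside any entity) are illustrative assumptions.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (label, start, end, tokens) entities; end is exclusive."""
    entities = []
    current = None  # entity being built: [label, start index, list of tokens]
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = [tag[2:], i, [token]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2].append(token)
        else:
            # "O", or an I- tag that does not continue the open entity, closes it.
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(label, start, start + len(toks), toks) for label, start, toks in entities]


tokens = ["feel", "common", "can cause", "head", "pain"]
tags = ["B-DIS", "I-DIS", "O", "B-SYM", "I-SYM"]  # assumed model output
entities = decode_bio(tokens, tags)
print(entities)
# [('DIS', 0, 2, ['feel', 'common']), ('SYM', 3, 5, ['head', 'pain'])]
```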
The named entity recognition process that combines the two algorithms is shown in FIG. 3: the text sequence to be labeled (a preprocessed file) is either recognized by both types of algorithm, with the two results then fused by weighting, or recognized by only one of the two algorithms.
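The text does not specify how the two results are weighted, so the following is only one possible sketch of weighted fusion. Everything in it is an assumption: the function name `fuse`, the weights, the threshold, the convention that a dictionary hit counts with confidence 1.0, and the idea that the model reports a confidence per span.

```python
def fuse(rule_spans, model_spans, w_rule=0.5, w_model=0.5, threshold=0.4):
    """Combine rule-based and model-based spans by a weighted score.

    rule_spans:  set of (start, end, label) found by dictionary matching
    model_spans: dict mapping (start, end, label) to a model confidence in [0, 1]
    """
    scores = {}
    for span in rule_spans:
        # Assumption: a dictionary hit counts with confidence 1.0.
        scores[span] = scores.get(span, 0.0) + w_rule * 1.0
    for span, confidence in model_spans.items():
        scores[span] = scores.get(span, 0.0) + w_model * confidence
    # Keep the spans whose combined score reaches the threshold.
    return sorted(span for span, score in scores.items() if score >= threshold)


rule_spans = {(0, 4, "DIS")}
model_spans = {(0, 4, "DIS"): 0.9, (10, 14, "SYM"): 0.6, (20, 23, "DRUG"): 0.95}
fused = fuse(rule_spans, model_spans)
print(fused)  # [(0, 4, 'DIS'), (20, 23, 'DRUG')]
```

With these illustrative settings, a span found by both recognizers is always kept, while a span found only by the model survives only if the model is confident enough.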
3) Database module
The database module is responsible for storing the annotation data and the system tables. The annotation data comprises the original files to be labeled uploaded by users, the first-round, second-round and third-round annotation files (in JSON format) generated during labeling, and the compressed files generated when data is exported. The files of each part are stored separately, so that changes to one part do not affect the others, which keeps the files safe.
The database also stores the user table, the permission table and the task information table of the system, which contain information such as task creation time and assigned users and are the basis for the operation of the whole system.
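As an illustration of what a JSON annotation file holding the original text together with entity and relation offsets might look like, consider the sketch below. The field names (`text`, `entities`, `relations`, `id`, `start`, `end` and so on) are hypothetical; the source only states that entities, relation offsets and the original text are stored in JSON format.

```python
import json

text = "Chronic diverticulitis may cause bloating after colorectal anastomosis."


def span(surface):
    """Character offsets (start, exclusive end) of `surface` in the text."""
    start = text.index(surface)
    return {"start": start, "end": start + len(surface)}


record = {
    "text": text,  # original text
    "entities": [
        {"id": "T1", "label": "disease", **span("Chronic diverticulitis")},
        {"id": "T2", "label": "symptom", **span("bloating")},
    ],
    "relations": [
        {
            "name": "post-treatment symptom",
            "entity1": "T1",
            "entity2": "T2",
            "attribute": span("colorectal anastomosis"),  # would be None when not labeled
        }
    ],
}

serialized = json.dumps(record, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
for entity in restored["entities"]:
    print(entity["label"], restored["text"][entity["start"]:entity["end"]])
```

Storing offsets next to the original text means that every labeled span can be recovered by slicing, which is what a later comparison of two annotation files would rely on.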
4) User configuration module
This module is mainly used to configure the entity items and relation items in the system. Before text is labeled, the corresponding entity items and relation items need to be added. For medical text labeling, for example, entity items such as disease, symptom and drug need to be added, together with relation items such as clinical symptom (a relation between a disease entity and a symptom entity), medication (a relation between a disease entity and a drug entity) and complication (a relation between a disease entity and another disease entity). Different task types require different entity items and relation items, which users define according to their own needs.
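A configuration of this kind can be pictured as a small data structure. The sketch below is an assumption about one way to represent it: the dictionary layout and the helper names `check_config` and `is_valid_relation` are hypothetical, while the entity and relation items are the ones named above.

```python
CONFIG = {
    "entities": ["disease", "symptom", "drug"],
    "relations": {
        # relation item: (type of entity 1, type of entity 2)
        "clinical symptom": ("disease", "symptom"),
        "medication": ("disease", "drug"),
        "complication": ("disease", "disease"),
    },
}


def check_config(config):
    """Every relation item must connect entity items that are themselves configured."""
    known = set(config["entities"])
    return all(a in known and b in known for a, b in config["relations"].values())


def is_valid_relation(config, name, entity1_type, entity2_type):
    """True if relation `name` is configured between these two entity types."""
    return config["relations"].get(name) == (entity1_type, entity2_type)


print(check_config(CONFIG))                                        # True
print(is_valid_relation(CONFIG, "medication", "disease", "drug"))  # True
print(is_valid_relation(CONFIG, "medication", "symptom", "drug"))  # False
```

A different labeling task would only need a different `CONFIG`, which matches the idea that entity and relation items are defined per task type.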
5) Web interface module
This module provides a visual operation interface for the user and comprises a labeling function, a data analysis function and other auxiliary labeling functions, each of which is described in detail below.
Labeling function module
The entity and relation labeling platform for medical texts constructed by the invention supports entity labeling, relation labeling and attribute labeling.
To label an entity appearing in the text, the user clicks an entity button to select the entity label and then selects the corresponding characters. After entity labeling is finished, the user can choose whether to perform relation labeling, and can switch back and forth between the entity labeling mode and the relation labeling mode at any time.
By default, a relation is defined here as a quadruple (entity 1, entity 2, relation name, relation attribute), such as <chronic diverticulitis, bloating, post-treatment symptom, colorectal anastomosis>, where chronic diverticulitis is entity 1, bloating is entity 2, post-treatment symptom is the relation between the two entities, and colorectal anastomosis is an attribute of the relation, indicating that the symptom occurred after colorectal anastomosis. Relation labeling is similar to entity labeling: the user first selects the relation name, then clicks the entity that serves as entity 1, and then clicks the entity that serves as entity 2 to complete the label.
Attributes are modifications, explanations or conditional restrictions of a relation that qualify it, such as colorectal anastomosis as an attribute of the post-treatment symptom relation mentioned above. The user chooses whether attribute labeling is enabled; when it is disabled, the attribute value is set to null. If attribute labeling is enabled, then during relation labeling a dialog box asking whether to label an attribute pops up once entity 1, entity 2 and the relation have been labeled. If the user selects yes, the user selects the characters corresponding to the attribute to complete the attribute label; otherwise the attribute is set to null.
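The quadruple and the null-attribute rule can be sketched with a small data class. The class name `Relation` and the function `label_relation` are hypothetical, and Python's `None` stands in for the null value mentioned in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """A relation quadruple: (entity 1, entity 2, relation name, relation attribute)."""
    entity1: str
    entity2: str
    name: str
    attribute: Optional[str] = None  # None plays the role of "null"


def label_relation(entity1, entity2, name, attribute_enabled, attribute=None):
    # When attribute labeling is disabled, the attribute is always null.
    return Relation(entity1, entity2, name, attribute if attribute_enabled else None)


with_attr = label_relation(
    "chronic diverticulitis", "bloating", "post-treatment symptom",
    attribute_enabled=True, attribute="colorectal anastomosis",
)
without_attr = label_relation(
    "chronic diverticulitis", "bloating", "post-treatment symptom",
    attribute_enabled=False, attribute="colorectal anastomosis",
)
print(with_attr.attribute)     # colorectal anastomosis
print(without_attr.attribute)  # None
```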
Data analysis function module
When processing data, not only the labeling of the data is important; analyzing the labeled data is also essential. Building on the labeling functions described above, the platform provides analysis of the labeled data and generation of a labeling comparison report, so that the user can keep track of labeling quality while labeling.
The platform provides three different data analysis modes: a list mode, a knowledge graph mode and a chart mode. The list mode lists all relations and entities in the annotation file one by one, and its result can be exported as an Excel file.
The knowledge graph mode displays the data as a graph. For a quadruple such as <chronic diverticulitis, bloating, post-treatment symptom, colorectal anastomosis>, entity 1 "chronic diverticulitis" is the central node, entity 2 "bloating" is one of its leaf nodes, the relation is the line connecting the two nodes, and the attribute is another leaf node of entity 1.
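The mapping from quadruples to graph nodes and edges can be sketched as follows. This is an illustrative data-structure sketch, not the platform's rendering code: the function name `build_graph` is hypothetical, and labeling the edge to an attribute node with the string "attribute" is an assumption, since the text does not say how that edge is labeled.

```python
def build_graph(quadruples):
    """Turn (entity1, entity2, relation, attribute) quadruples into nodes and labeled edges."""
    nodes, edges = set(), []
    for entity1, entity2, relation, attribute in quadruples:
        nodes.update([entity1, entity2])
        edges.append((entity1, entity2, relation))  # line between entity 1 and entity 2
        if attribute is not None:
            nodes.add(attribute)                    # attribute is another leaf of entity 1
            edges.append((entity1, attribute, "attribute"))
    return nodes, edges


quadruples = [
    ("chronic diverticulitis", "bloating", "post-treatment symptom", "colorectal anastomosis"),
]
nodes, edges = build_graph(quadruples)
print(sorted(nodes))
print(edges)
```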
The chart mode shows the composition of relations and entities as pie charts and bar charts. In medical text labeling, for example, entities can be classified into diseases, symptoms, drugs and so on, and the chart shows the proportion of each type of entity or relation.
In addition to the three analysis modes above, the platform can also generate a detailed comparison report for two annotation files (the third-round annotation file is File1 and the second-round annotation file is File2). The report is divided into two parts: overall analysis and detailed content comparison.
The first part shows, in table form, the P (precision), R (recall) and F values of each type of entity in the annotation files. They are calculated with the third-round annotation file (File1) as the gold standard, and the calculation formulas are shown in (2)-(4), where $E_1$ and $E_2$ denote the sets of entities of a given type labeled in File1 and File2 respectively:
$$P = \frac{|E_1 \cap E_2|}{|E_2|} \tag{2}$$
$$R = \frac{|E_1 \cap E_2|}{|E_1|} \tag{3}$$
$$F = \frac{2 \cdot P \cdot R}{P + R} \tag{4}$$
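The per-type P, R and F values can be computed as in the sketch below, which assumes the standard definitions of precision, recall and the balanced F value with File1 as the gold standard. The function names `prf` and `prf_by_type`, and the representation of an entity as a `(start, end, label)` tuple, are assumptions made for the illustration.

```python
def prf(gold, predicted):
    """Precision, recall and F value of `predicted` against `gold` (both sets)."""
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def prf_by_type(file1, file2):
    """Per-entity-type scores, with File1 (third round) as the gold standard."""
    labels = sorted({entity[2] for entity in file1 | file2})
    return {
        label: prf(
            {e for e in file1 if e[2] == label},
            {e for e in file2 if e[2] == label},
        )
        for label in labels
    }


file1 = {(0, 4, "DIS"), (10, 14, "SYM"), (20, 25, "SYM")}  # third-round file (gold)
file2 = {(0, 4, "DIS"), (10, 14, "SYM"), (30, 33, "SYM")}  # second-round file
report = prf_by_type(file1, file2)
print(report)  # {'DIS': (1.0, 1.0, 1.0), 'SYM': (0.5, 0.5, 0.5)}
```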
The second part lists the content of the annotation files in detail and highlights entities in different colors so that the differences between the third-round and second-round annotation files can be compared. An entity is shown in green if it exists in both files, in blue if it exists only in File1, and in red if it exists only in File2.
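The coloring rule can be sketched with set membership tests. The function name `color_entities` and the `(start, end, label)` representation of an entity are assumptions for the example; only the three colors and their meaning come from the text.

```python
def color_entities(file1, file2):
    """Assign a highlight color to every entity that appears in either annotation file."""
    colors = {}
    for entity in sorted(file1 | file2):
        if entity in file1 and entity in file2:
            colors[entity] = "green"  # present in both files
        elif entity in file1:
            colors[entity] = "blue"   # only in File1 (third round)
        else:
            colors[entity] = "red"    # only in File2 (second round)
    return colors


file1 = {(0, 4, "DIS"), (20, 25, "SYM")}
file2 = {(0, 4, "DIS"), (30, 33, "SYM")}
colors = color_entities(file1, file2)
print(colors)
# {(0, 4, 'DIS'): 'green', (20, 25, 'SYM'): 'blue', (30, 33, 'SYM'): 'red'}
```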
Auxiliary function module
The system provides two auxiliary functions: one-click multi-match labeling and data export. If one-click multi-match labeling is enabled, the user only needs to label one occurrence of an entity during labeling, and all other identical occurrences are labeled automatically by string matching, which improves labeling efficiency.
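Propagating one manual label to all identical strings can be sketched as a simple search loop. The function name `propagate_label` is hypothetical, and the choices to match case-sensitively and to skip overlapping occurrences are assumptions, since the text only says that string matching is used.

```python
def propagate_label(text, surface, label):
    """Return (start, end, label) for every non-overlapping occurrence of `surface`."""
    if not surface:
        return []
    spans = []
    start = text.find(surface)
    while start != -1:
        end = start + len(surface)
        spans.append((start, end, label))
        start = text.find(surface, end)  # continue after the current match
    return spans


text = "Headache is common. Severe headache needs care; headache may recur."
spans = propagate_label(text, "headache", "SYM")
print([text[s:e] for s, e, _ in spans])  # ['headache', 'headache']
```

Because matching is exact in this sketch, the capitalized "Headache" at the start of the text is not labeled; a real tool might need to decide how to treat such variants.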
In the data export function, the user can search relations (by relation name, by file name and so on). The search result is displayed as a list in which each row contains the file name, entity 1, the relation, entity 2 and so on. After the search, the result can be exported: the exported relations are written to an Excel file, compressed, and provided to the user for download.
The system provided by the invention has the following advantages:
(1) A visual operation interface is provided that supports labeling of entities, relations and attributes; the user can complete labeling with simple click operations.
(2) A series of dictionaries containing a large amount of entity information, such as dictionaries of diseases, symptoms, drugs, operations and examinations, is built in and used for rule-based auxiliary labeling.
(3) Various deep learning models, such as the BiLSTM-CRF model, are integrated for pre-labeling.
(4) All labeling tasks in the platform follow a multi-round labeling process consisting of a first, second and third round, and task progress is displayed in real time on the task management interface.
(5) To ensure labeling quality, different people label during the multiple rounds, and the system automatically generates a labeling comparison report after labeling is finished, which is used to control labeling quality.
(6) The platform is highly customizable: it is suited to labeling medical texts and can also be applied to other types of labeling task after simple configuration.
(7) The platform is developed on a Python-based web framework, is easy to deploy and has good portability.
Beneficial effects: compared with the prior art, the technical solution of the invention has the following beneficial technical effects:
The relation labeling platform provided by the invention uses rule-based dictionaries and deep-learning-based models for auxiliary labeling, which overcomes the time-consuming and labor-intensive nature of manual labeling in traditional labeling work and improves labeling efficiency. Meanwhile, its progress control and quality control functions improve labeling quality, which is of practical significance for building a high-quality corpus.
Drawings
FIG. 1 is a system architecture diagram of the present invention;
FIG. 2 is a flow chart of data annotation;
FIG. 3 is a diagram of the named entity recognition process.
Detailed Description
The invention provides an entity and relation labeling system for medical texts. Its system architecture is shown in FIG. 1 and comprises five modules: a task center, an algorithm factory, a database, user configuration and a WEB interface.
The task center module is used for creating and distributing labeling tasks and for managing tasks and user permissions;
the algorithm factory module provides built-in preprocessing and named entity recognition algorithms to assist the user in labeling;
the database module is used for storing the annotation files, the user information and the task information;
the user configuration module is used by the user to pre-configure the entity items of the system and the relation items between entities;
and the WEB interface module provides a visual operation interface, so that the user can conveniently label the text and display and analyze the data. Each module is described in detail below:
1) Task center module
The task center module is the core module of the system, and its responsibilities cover two aspects: task management and permission management. Task management is responsible for data uploading, task creation, task allocation and progress control. Permission management includes user grouping and permission control. Users in the system are divided into two types, administrators and ordinary users. An administrator can create and allocate tasks and group users, while an ordinary user can only upload data and label data.
Before the text data can be labeled, the files to be labeled must be uploaded. The system supports three file formats, TXT, Markdown and JSON, and several files can be selected and uploaded at the same time. After the upload is complete, an administrator can create a labeling task in the system. During creation, the labeling task can be allocated to the corresponding users, and a preprocessing algorithm built into the system can be selected; the preprocessing algorithms are described in detail in the next section. After the administrator has allocated the task, each ordinary user can see the task to be labeled in their own system interface and start labeling.
The labeling process of the system is divided into three rounds: a first round, a second round and a third round of labeling.
The specific labeling task is to label the entities and relations present in a text. A medical text labeling task, for example, aims to label entities such as diseases, symptoms and drugs present in the text, together with the relations between them (for example, a "disease symptom" relation between a disease and a symptom, and a "drug therapy" relation between a disease and a drug).
The specific labeling process is as follows. User 1 first performs the first round of labeling on the original text, which generates the first-round annotation file. User 2 then performs the second round of labeling on the basis of the first-round annotation file, correcting problems such as missed or wrong labels made by user 1, which generates the second-round annotation file. Finally, user 1 performs the third round of labeling on the basis of the second-round annotation file and refines the labeling result once more; this refined result is taken as the final labeling result, and the labeling task is then complete.
Progress in the labeling process is measured by the number of files, since each labeling task usually includes several text files to be labeled. The progress of the first round is the number of first-round annotation files divided by the number of source files. The progress of the second round is the number of second-round annotation files divided by the number of first-round annotation files, because the second round is performed on the basis of the first round and a second-round annotation file can only be generated after the corresponding first-round file is complete. Similarly, the progress of the third round is the number of third-round annotation files divided by the number of second-round annotation files.
2) Algorithm factory module
Preprocessing algorithm
The preprocessing algorithm segments the data to be annotated and completes entity names. The administrator user chooses whether to enable segmentation and entity name completion when creating a task.
The data set to be labeled can be segmented in two ways, sentence by sentence or chapter by chapter, and the administrator user selects the mode when creating the task. In sentence-by-sentence mode, one sentence is displayed at a time during labeling, and labeling proceeds sentence by sentence. In chapter-by-chapter mode, the whole content of one file is displayed at a time, and labeling proceeds in file order.
Entity name completion is intended for labeling tasks with longer texts. The specific flow is as follows:
firstly, inputting the original text sequence;
secondly, segmenting the original text at periods (full stops);
thirdly, performing completion using the name of the entity to be completed, which the administrator user enters on the system interface: the entity is added in front of each sentence and separated from the sentence by the @ symbol;
fourthly, reassembling the completed sentences into an article, at which point the algorithm ends and labeling can begin.
For example, suppose a document to be labeled mainly concerns entities and relationships related to chronic atrial fibrillation, but the entity "chronic atrial fibrillation" appears only at the head of the document and is referred to afterwards as "the disease". If a later sentence states that the disease leads to xxxx symptoms, the relationship between chronic atrial fibrillation and those symptoms needs to be labeled, but this cannot be done because the entity "chronic atrial fibrillation" does not appear in that sentence. The entity name therefore needs to be supplemented at the head of the sentence, for example "chronic atrial fibrillation@the disease causes xxxx symptoms", so that the relation labeling can be completed smoothly.
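The four completion steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `complete_entity_name` is hypothetical, sentences are assumed to end with a Chinese or Latin period, and the completed sentences are simply concatenated when the article is reassembled.

```python
import re


def complete_entity_name(text: str, entity: str, sep: str = "@") -> str:
    """Prefix every sentence of `text` with `entity` followed by `sep`."""
    # Step 2: split right after each period ("。" or "."), dropping any
    # whitespace that follows it. Needs Python 3.7+ (split on empty matches).
    sentences = [s for s in re.split(r"(?<=[。.])\s*", text) if s]
    # Step 3: add the entity and the separator in front of each sentence.
    completed = [f"{entity}{sep}{s}" for s in sentences]
    # Step 4: reassemble the completed sentences into one article.
    return "".join(completed)


if __name__ == "__main__":
    doc = "The disease causes palpitations. The disease is treated with drugs."
    print(complete_entity_name(doc, "chronic atrial fibrillation"))
```

Splitting on every period is deliberately naive: decimal numbers or abbreviations would also be split, so a production system would need a more careful sentence splitter.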
Named Entity Recognition (NER) algorithm
The NER algorithms built into the system automatically identify named entities such as diseases, drugs, and symptoms appearing in the text and pre-label the text, which reduces the manual labeling workload and improves labeling efficiency. This step takes place during labeling by a common user, who chooses whether to use it in order to speed up the labeling process.
The system includes two types of algorithm: rule-based and deep-learning-based. The rule-based algorithm uses dictionaries constructed by experts in the medical field. Each line of a dictionary is one entity, and entities are separated by line feeds. For medical text labeling, for example, the dictionaries include a disease dictionary, a symptom dictionary, a drug dictionary, and so on (each line of the disease dictionary holds one disease name, and the other dictionaries follow the same form).
The rule-based algorithm identification process is as follows:
the first step: inputting the text sequence to be processed;
the second step: selecting the dictionary to be used;
the third step: sorting the entities in the selected dictionary in descending order of length, so that longer entities are matched first;
the fourth step: finding the entities from the sorted dictionary that occur in the text sequence;
the fifth step: returning the matched entity set, at which point the algorithm ends.
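The five steps above can be sketched as a longest-first dictionary matcher. The function names and the rule that a shorter entity may not overlap an already matched longer one are assumptions made for this illustration; the patent text only states that longer entities are matched preferentially.

```python
def load_dictionary(lines):
    """One entity per line; blank lines are ignored."""
    return [ln.strip() for ln in lines if ln.strip()]


def dictionary_match(text, entities, label):
    """Return (start, end, entity, label) tuples found in `text`."""
    # Step 3: sort by descending length so longer entities win.
    ordered = sorted(set(entities), key=len, reverse=True)
    taken = [False] * len(text)  # characters already covered by a match
    found = []
    # Step 4: scan the text for every dictionary entity.
    for ent in ordered:
        start = text.find(ent)
        while start != -1:
            end = start + len(ent)
            if not any(taken[start:end]):
                for i in range(start, end):
                    taken[i] = True
                found.append((start, end, ent, label))
            start = text.find(ent, start + 1)
    # Step 5: return the matches in reading order.
    return sorted(found)


if __name__ == "__main__":
    diseases = load_dictionary(["atrial fibrillation", "chronic atrial fibrillation", ""])
    sample = "chronic atrial fibrillation differs from atrial fibrillation"
    for match in dictionary_match(sample, diseases, "DIS"):
        print(match)
```

Without the length ordering, "atrial fibrillation" could be matched inside "chronic atrial fibrillation" and the longer, more specific entity would be lost.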
The deep-learning-based algorithm uses a character-based BiLSTM-CRF model. Its training data is a portion of text labeled in advance using the BIO labeling scheme. For example, the text "a cold can cause headache" is labeled token by token so that the two tokens of "cold" receive B-DIS and I-DIS, the tokens of "can cause" receive O, and the two tokens of "headache" receive B-SYM and I-SYM. Here DIS stands for disease and SYM for symptom, both labels defined manually in advance; B marks the beginning of an entity, I marks the remainder of the entity, and O marks text outside any entity.
The labeled data set is then fed into the neural network model for training, which yields the trained neural network model.
After training is complete, the model can be used for named entity recognition on unlabeled text to identify entities such as diseases and symptoms. The specific recognition process is as follows:
firstly, inputting a text sequence to be processed;
secondly, loading the trained deep learning model;
thirdly, inputting the text into the model;
fourthly, acquiring the entities from the result returned by the model;
And fifthly, returning the identified entity set, and ending the algorithm.
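The fourth step, extracting entities from the model output, can be sketched as a decoder that turns a BIO tag sequence into entity spans. The model itself is not shown. The function name `bio_to_entities` is hypothetical, and the sample sentence is an assumed character-level rendering of the "a cold can cause headache" example.

```python
def bio_to_entities(tokens, tags):
    """Convert parallel token/tag lists into (text, label, start, end) tuples."""
    entities = []
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            entities.append(("".join(tokens[start:i]), label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


if __name__ == "__main__":
    # Assumed original sentence: one tag per Chinese character.
    tokens = list("感冒会导致头疼")
    tags = ["B-DIS", "I-DIS", "O", "O", "O", "B-SYM", "I-SYM"]
    print(bio_to_entities(tokens, tags))
```

In this sketch an I- tag that does not continue an open entity of the same type is ignored rather than treated as the start of a new entity; other decoding conventions exist.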
The named entity recognition process that combines the two algorithms is shown in fig. 3: the text sequence to be labeled (the preprocessed file) is recognized by each of the two types of algorithm, and the results are then weighted and fused. Alternatively, named entity recognition can be carried out by either algorithm on its own.
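The text does not specify how the results are weighted and fused. One possible scheme, shown purely as an assumption, gives each recognizer a weight, sums the weights of the recognizers that proposed a span, and keeps spans whose score reaches a threshold. The names `fuse_entities`, `w_rule`, `w_model`, and `threshold` are all hypothetical.

```python
def fuse_entities(rule_entities, model_entities,
                  w_rule=0.5, w_model=0.5, threshold=0.5):
    """Fuse two sets of (start, end, label) spans by weighted voting."""
    scores = {}
    for span in set(rule_entities):
        scores[span] = scores.get(span, 0.0) + w_rule
    for span in set(model_entities):
        scores[span] = scores.get(span, 0.0) + w_model
    # Keep only the spans whose combined weight reaches the threshold.
    return sorted(span for span, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    rule = {(0, 2, "DIS"), (3, 4, "DRUG")}
    model = {(0, 2, "DIS"), (5, 7, "SYM")}
    print(fuse_entities(rule, model, w_rule=0.4, w_model=0.6, threshold=0.5))
```

With these example weights, a span found only by the dictionary (0.4) is dropped, a span found only by the model (0.6) is kept, and a span found by both (1.0) is always kept. Setting one weight to zero reproduces the case in which a single algorithm is used on its own.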
3) Database module
The database module stores the annotation data and the system tables. The annotation data comprises the original files to be annotated uploaded by users, the first-round, second-round, and third-round annotation files (JSON format) generated during annotation, and the compressed files generated during data export. The files of each part are stored separately, so that a change to one part does not affect the others, which keeps the files safe.
The database also stores the user table, the permission table, and the task information table, which contain information such as task creation time and assigned users and form the basis for the operation of the whole system.
4) User configuration module
This module is mainly used to configure the entity items and relationship items in the system. Before text labeling, the corresponding entity items and relationship items need to be added. For medical text labeling, for example, entity items such as disease, symptom, and drug need to be added, along with relationship items such as clinical symptom (between a disease entity and a symptom entity), medication (between a disease entity and a drug entity), and complication (between a disease entity and another disease entity). Different task types require different entity items and relationship items, which users define according to their own needs.
5) Web interface module
This module provides a visual operation interface for the user, comprising the labeling function, the data analysis function, and other auxiliary labeling functions. Each function is described in detail below.
Labeling function module
The entity and relation labeling platform for medical text constructed here supports entity labeling, relation labeling, and attribute labeling.
To label an entity appearing in the text, the user clicks an entity button to select an entity label and then selects the corresponding characters in the text. After entity labeling, the user can choose whether to perform relation labeling, and can switch back and forth between the entity labeling and relation labeling modes at any time.
By default, a relationship is defined here as a quadruple (entity 1, entity 2, relationship name, relationship attribute), such as <chronic diverticulitis, bloating, post-treatment symptom, colorectal anastomosis>, where chronic diverticulitis is entity 1, bloating is entity 2, post-treatment symptom is the relationship between the two entities, and colorectal anastomosis is the attribute of the relationship, indicating that the symptom occurred after colorectal anastomosis. Relation labeling is similar to entity labeling: the user first selects the relationship name, then clicks the entity that serves as entity 1, and then clicks the entity that serves as entity 2 to complete the label.
An attribute modifies, explains, or places conditions on a relationship, as colorectal anastomosis does for the post-treatment symptom relationship above. The user chooses whether to enable attribute labeling; when it is disabled, the attribute value is set to null. If attribute labeling is enabled, a dialog box for labeling the attribute pops up during relation labeling once entity 1, entity 2, and the relationship have been labeled. The user confirms in the dialog box and then selects the characters corresponding to the attribute to label it; otherwise the attribute is set to null.
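The quadruple and the optional attribute can be modeled with a small data class. This is a sketch only: the class name `Relation`, the helper `make_relation`, and the `attributes_enabled` flag are hypothetical names chosen to mirror the behavior described above, with Python's `None` standing in for the null attribute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """A relationship quadruple: (entity 1, entity 2, name, attribute)."""
    entity1: str
    entity2: str
    name: str
    attribute: Optional[str] = None  # None when no attribute is labeled


def make_relation(entity1, entity2, name, attribute=None, attributes_enabled=True):
    """Build a Relation, forcing the attribute to None when labeling is disabled."""
    return Relation(entity1, entity2, name,
                    attribute if attributes_enabled else None)


if __name__ == "__main__":
    rel = make_relation("chronic diverticulitis", "bloating",
                        "post-treatment symptom", "colorectal anastomosis")
    print(rel)
```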
Data analysis function module
When processing data, labeling is important, but analyzing the labeled data is equally essential. Building on the labeling functions above, the platform provides analysis of the labeled data and generation of labeling comparison reports, so that users can keep track of labeling quality while labeling.
The platform provides three data analysis modes: list, knowledge graph, and chart. The list mode lists all relationships and entities in the labeled file one by one, and its result can be exported as an Excel file.
The knowledge graph mode displays the data as a graph. For a quadruple such as <chronic diverticulitis, bloating, post-treatment symptom, colorectal anastomosis>, entity 1 "chronic diverticulitis" is the central node, entity 2 "bloating" is one of its leaf nodes, the relationship is the line connecting the two nodes, and the attribute is another leaf node of entity 1.
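The mapping from quadruples to graph elements can be sketched as below. The function name `build_graph` and the edge label "attribute" used for the attribute leaf are assumptions; the description only says that the attribute becomes another leaf node of entity 1.

```python
def build_graph(quadruples):
    """Turn (entity1, entity2, relation, attribute) tuples into nodes and edges."""
    nodes, edges = set(), []
    for entity1, entity2, relation, attribute in quadruples:
        nodes.update((entity1, entity2))
        # The relationship is the line connecting entity 1 and entity 2.
        edges.append((entity1, entity2, relation))
        if attribute:
            # The attribute hangs off entity 1 as a further leaf node.
            nodes.add(attribute)
            edges.append((entity1, attribute, "attribute"))
    return nodes, edges


if __name__ == "__main__":
    quads = [("chronic diverticulitis", "bloating",
              "post-treatment symptom", "colorectal anastomosis")]
    nodes, edges = build_graph(quads)
    print(sorted(nodes))
    print(edges)
```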
The chart mode shows the composition of relationships and entities as pie charts and bar charts. In medical text labeling, for example, entities are classified into diseases, symptoms, drugs, and so on, and the chart shows the proportion of each type of entity or relationship.
In addition to these three analysis modes, the platform can generate a detailed comparison report for two annotation files (the third-round annotation file is File1 and the second-round annotation file is File2). The report has two parts: overall analysis and detailed content comparison.
The first part shows, in a table, the P (precision), R (recall), and F values for each type of entity in the annotation file. They are calculated with the third-round annotation file (File1) as the gold standard, using formulas (2) to (4).
[Formulas (2) to (4), which define P, R, and F, are images in the original publication and are not reproduced in this text.]
The second part lists the contents of the annotation files in detail and highlights entities in different colors to show the differences between the third-round and second-round files: an entity present in both files is shown in green, an entity present only in File1 in blue, and an entity present only in File2 in red.
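Both parts of the report can be sketched in Python. Because the formulas themselves are not available in this text, the sketch assumes the standard definitions: precision is the number of entities common to both files divided by the number in File2, recall is the same count divided by the number in File1, and F is their harmonic mean. The function names are hypothetical.

```python
def precision_recall_f(file1_entities, file2_entities):
    """P, R, F of File2 (second round) against File1 (third round, gold)."""
    gold, predicted = set(file1_entities), set(file2_entities)
    common = len(gold & predicted)
    p = common / len(predicted) if predicted else 0.0
    r = common / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def color_entities(file1_entities, file2_entities):
    """Green: in both files; blue: only in File1; red: only in File2."""
    file1, file2 = set(file1_entities), set(file2_entities)
    colors = {}
    for entity in file1 | file2:
        if entity in file1 and entity in file2:
            colors[entity] = "green"
        elif entity in file1:
            colors[entity] = "blue"
        else:
            colors[entity] = "red"
    return colors


if __name__ == "__main__":
    file1 = {("cold", "DIS"), ("headache", "SYM"), ("fever", "SYM"), ("aspirin", "DRUG")}
    file2 = {("cold", "DIS"), ("headache", "SYM"), ("cough", "SYM")}
    print(precision_recall_f(file1, file2))
    print(color_entities(file1, file2))
```

To obtain the per-type values shown in the report table, the same function can be applied after filtering both sets to a single entity label.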
Auxiliary function module
The system provides two auxiliary functions: one-click multi-match and data export. When one-click multi-match is enabled, the user only needs to label one occurrence of an entity during labeling, and all other identical occurrences are labeled automatically by string matching, which improves labeling efficiency.
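Propagating one label to every identical occurrence by string matching can be sketched as follows. The function name `propagate_label` is hypothetical, and the choice to skip overlapping occurrences is an assumption.

```python
def propagate_label(text, entity_text, label):
    """Return (start, end, label) for every occurrence of `entity_text`."""
    if not entity_text:
        return []  # an empty selection would otherwise match everywhere
    spans = []
    start = text.find(entity_text)
    while start != -1:
        end = start + len(entity_text)
        spans.append((start, end, label))
        # Continue after this occurrence so matches never overlap.
        start = text.find(entity_text, end)
    return spans


if __name__ == "__main__":
    print(propagate_label("fever and cough; fever again", "fever", "SYM"))
```

Plain string matching also labels occurrences inside longer words or longer entities, so the automatically added labels still need review in the later labeling rounds.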
In the data export function, the user can search relationships (by relationship name, by file name, and so on). The search results are displayed as a list in which each row contains the file name, entity 1, the relationship, entity 2, and so on. After the search, the results can be exported: the exported relationships are written to an Excel file, compressed, and offered to the user for download.
The system provided by the invention has the following advantages:
(1) A visual operation interface is provided that supports labeling of entities, relationships, and attributes; the user can complete labeling with simple click operations.
(2) A series of dictionaries containing a large amount of entity information is built in, such as dictionaries of diseases, symptoms, drugs, operations, and examinations, and auxiliary labeling is carried out in a rule-based manner.
(3) Various deep learning models, such as the BiLSTM-CRF model, are integrated for pre-labeling.
(4) All labeling tasks on the platform follow a multi-round process comprising first, second, and third rounds of labeling, and task progress is displayed in real time on the task management interface.
(5) To ensure labeling quality, different personnel take part in the multiple rounds of labeling, and the system automatically generates a labeling comparison report after labeling is finished, which is used to control labeling quality.
(6) The platform is highly customizable: it is suitable not only for labeling medical texts but, after simple configuration, also for other types of labeling task.
(7) The system is developed on a Python-based web framework, is easy to deploy, and is highly portable.
The general labeling process is as follows:
the first step: a common user uploads the text to be labeled; an administrator creates the corresponding task, selects the corresponding preprocessing algorithm, and finally assigns the task to specific users;
the second step: a common user enters the task and, according to the specific requirements of the labeling task, selects the required dictionaries (rule-based method) and deep learning model for pre-labeling;
the third step: different annotators complete the first, second, and third rounds of labeling of the task, with the first and third rounds done by the same person; a labeling progress of 100% indicates that the task is complete, and each completed document is stored in JSON format, containing information such as the offsets of entities and relationships in the text;
the fourth step: for a task that is 100% complete, a labeling comparison report is generated, in which the precision, recall, and F value for each entity category are calculated with the third-round annotation file as the gold standard;
the fifth step: labeling tasks that fail the quality check are returned to the specific person for rework; once they pass, the next labeling task begins, and the first to fifth steps are repeated until all documents are labeled.
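The description states that completed documents are stored in JSON format with the offsets of entities and relationships, but it does not give the schema. The sketch below shows one hypothetical layout; every field name, and the function name `build_annotation_record`, is an assumption made for illustration.

```python
import json


def build_annotation_record(text, entities, relations):
    """Serialize one labeled document to a JSON string (hypothetical schema).

    entities:  list of (start, end, label) character offsets into `text`
    relations: list of (entity1_id, entity2_id, name, attribute) tuples,
               where the ids index into the entity list
    """
    record = {
        "text": text,
        "entities": [
            {"id": i, "start": s, "end": e, "label": label, "text": text[s:e]}
            for i, (s, e, label) in enumerate(entities)
        ],
        "relations": [
            {"entity1": a, "entity2": b, "name": name, "attribute": attr}
            for a, b, name, attr in relations
        ],
    }
    # ensure_ascii=False keeps Chinese text readable in the stored file.
    return json.dumps(record, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    doc = "cold can cause headache"
    print(build_annotation_record(
        doc,
        entities=[(0, 4, "DIS"), (15, 23, "SYM")],
        relations=[(0, 1, "clinical symptom", None)],
    ))
```

Storing character offsets rather than only the entity text lets the interface highlight the exact occurrence that was labeled, even when the same string appears several times in a document.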

Claims (4)

1. An entity and relationship labeling system for medical texts, characterized by comprising the following modules:
the task center module, used for uploading annotation data, creating and assigning annotation tasks, and managing tasks and user permissions; the task center module has the following functions: data upload, task creation, task assignment, and progress control; permission management comprises user grouping and permission control; users in the system are divided into two types, administrators and common users; an administrator user can create and assign tasks and group users, while a common user can only upload data and label data;
the algorithm factory module, which has a built-in preprocessing segmentation method for segmenting text and which uses a named entity recognition algorithm to assist the user with pre-labeling;
the database module is used for storing the annotation file, the user information and the task information;
the user configuration module is used for the user to pre-configure the entities of the system and the relationship items among the entities;
the WEB interface module provides a visual operation interface, so that a user can conveniently finish the labeling of texts and the data display and analysis;
the WEB interface module further comprises a data analysis function module and an auxiliary function module, whose functions are as follows:
the data analysis function module is used for displaying the analysis results of the labeled data in list form, knowledge graph form, and chart form;
the auxiliary function module provides two auxiliary functions, one-click multi-match and data export; if the one-click multi-match function is enabled, the user only needs to label one occurrence of an entity during labeling, and other identical occurrences are labeled automatically by string matching;
in the data export function, the search results for relationships are displayed to the user in a list, and the export operation is performed after the search is finished;
the labeling is divided into three rounds, a first round, a second round, and a third round, and the specific labeling process is as follows: a first user performs the first round of labeling on the original text and generates the first-round annotation file; after the first user finishes the first round, a second user performs the second round of labeling on the basis of the first-round annotation file, corrects labels missed or wrongly applied by the first user, and generates the second-round annotation file on completion; finally, the first user performs the third round of labeling on the basis of the second-round annotation file and completes the labeling result, which is taken as the final labeling result;
the functions of the algorithm factory module are as follows:
(1) a built-in preprocessing algorithm is used for segmenting the data to be labeled and completing entity names, and the administrator user selects whether to enable it when creating a task;
(2) the data set to be labeled can be segmented in two different modes, sentence by sentence or chapter by chapter, and the administrator user selects the mode when creating a task; if the sentence-by-sentence mode is selected, the content of one sentence is displayed at a time during labeling, and labeling proceeds sentence by sentence; if the chapter-by-chapter mode is selected, the whole content of one file is displayed at a time during labeling, and labeling proceeds in file order;
the specific process of entity name completion is as follows:
firstly, inputting the original text sequence;
secondly, segmenting the original text at periods (full stops);
thirdly, inputting the name of the entity to be completed, adding the entity in front of each sentence, and separating the entity from the sentence with a preset symbol;
fourthly, reassembling the completed sentences into an article;
the data analysis function module further has the following functions:
(1) calculating the precision P, the recall R, and the F value of each entity type in the annotation file, using the following formulas:
[The formulas for P, R, and F are images in the original publication and are not reproduced in this text.]
wherein File1 is the third-round annotation file and File2 is the second-round annotation file;
(2) comparing the differences between the third-round annotation file and the second-round annotation file, and highlighting entities in different colors.
2. The system of claim 1, wherein the labeling task is assigned to a user to complete labeling or to select pre-processing using a pre-processing algorithm built in the system.
3. The entity and relationship labeling system for medical texts as claimed in claim 1, wherein the named entity recognition algorithm comprises rule-based algorithm recognition and/or neural-network-model-based algorithm recognition, and the specific flow of the rule-based algorithm recognition is as follows:
the first step: inputting the text sequence to be processed;
the second step: selecting the dictionary to be used;
the third step: sorting the entities in the selected dictionary in descending order of length, so that longer entities are matched first;
the fourth step: finding the entities from the sorted dictionary that occur in the text sequence;
the fifth step: returning the matched entity set;
the algorithm identification based on the neural network model comprises the following specific processes:
firstly, carrying out entity labeling on a text data set, and inputting a labeled data set text into a neural network model for training to obtain the neural network model meeting requirements;
secondly, inputting a text sequence to be processed;
thirdly, inputting the text into the neural network model;
fourthly, acquiring an entity in a result returned by the model;
and fifthly, returning the identified entity set.
4. The entity and relationship labeling system for medical texts as claimed in claim 1, wherein the WEB interface module comprises a labeling function module, whose function is as follows: to label an entity appearing in the text, the user clicks an entity button to select an entity label and then selects the corresponding characters; after entity labeling is finished, the user chooses whether to perform relation labeling, and can switch back and forth between the entity labeling and relation labeling modes at any time.
CN202010347165.1A 2020-04-28 2020-04-28 Entity and relation labeling system for medical texts Active CN111553161B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010347165.1A CN111553161B (en) 2020-04-28 2020-04-28 Entity and relation labeling system for medical texts

Publications (2)

Publication Number Publication Date
CN111553161A CN111553161A (en) 2020-08-18
CN111553161B true CN111553161B (en) 2022-11-18

Family

ID=72004090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010347165.1A Active CN111553161B (en) 2020-04-28 2020-04-28 Entity and relation labeling system for medical texts

Country Status (1)

Country Link
CN (1) CN111553161B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant