CN111553161B

CN111553161B - Entity and relation labeling system for medical texts

Info

Publication number: CN111553161B
Application number: CN202010347165.1A
Authority: CN
Inventors: 张坤丽; 赵旭; 谢琦
Original assignee: Zhengzhou University
Current assignee: Zhengzhou University
Priority date: 2020-04-28
Filing date: 2020-04-28
Publication date: 2022-11-18
Anticipated expiration: 2040-04-28
Also published as: CN111553161A

Abstract

The invention discloses an entity and relation labeling platform for medical texts. The platform integrates several preprocessing algorithms to preprocess the text to be labeled, and performs named entity recognition to pre-label the text with rule-based and deep-learning-based methods. It provides progress control for labeling tasks and shows the progress of the multiple labeling rounds in real time. The entities, relation offsets and original text of each labeled file are stored in JSON format, and a labeling comparison report is generated to control the quality of the task. By integrating these algorithms with progress control and quality control, the platform improves the efficiency of manual labeling and helps ensure the quality of the constructed corpus.

Description

Entity and relation labeling system for medical texts

Technical Field

The invention belongs to the technical field of text labeling, and particularly relates to an entity and relationship labeling system for medical texts.

Background

The growing volume of medical text data brings great opportunities and challenges to the whole industry. Most of this data is semi-structured or unstructured, and scientific applications can be built on it only after it has been converted into structured data that a computer can process. Labeling the text is the basis of that structuring. The corpora obtained by text labeling are very important resources and are the basis of related research such as named entity recognition and automatic relation extraction. High-quality labeled corpora are still scarce, and corpora that can be used for research are especially few. This shortage stands in sharp contrast to the mass of text now available and hinders in-depth research on language resources. Text labeling is extremely heavy and tedious work, and traditional manual labeling is time-consuming, laborious and costly, which deters many researchers and slows the construction of resources.

Disclosure of Invention

Purpose of the invention: to overcome the drawbacks of labeling in the prior art, namely that it is time-consuming and laborious, complicated to operate, and hard to control in quality and progress, the invention constructs an extensible semi-automatic labeling platform that integrates several automatic recognition and extraction algorithms and provides entity, relation and attribute labeling together with analysis of the labeled data.

Technical solution: to achieve the above purpose, the invention provides an entity and relation labeling system for medical texts. The system architecture is shown in FIG. 1 and comprises five modules: a task center, an algorithm factory, a database, user configuration and a WEB interface.

The task center module is used for creating and distributing labeling tasks and for managing tasks and user permissions;

the algorithm factory module contains built-in preprocessing and named entity recognition algorithms that assist the user in labeling;

the database module is used for storing the annotation files, the user information and the task information;

the user configuration module is used for the user to pre-configure the entity items of the system and the relation items between entities;

and the WEB interface module provides a visual operation interface, so that the user can conveniently label the text and display and analyze the data. Each module is described in detail below:

1) Task center module

The task center module is the core module of the system. Its responsibilities cover two aspects: task management and permission management. Task management is responsible for data upload, task creation, task assignment and progress control. Permission management covers user grouping and permission control. Users of the system are of two types, administrators and ordinary users. An administrator can create and assign tasks and group users, whereas an ordinary user can only upload data and label data.

Before a labeling task can be carried out, the files to be labeled must be uploaded. The system supports three file formats, TXT, Markdown and JSON, and several files can be selected and uploaded at once. After the upload is complete, an administrator can create a labeling task in the system. During creation, the task can be assigned to the users who will do the labeling, and a preprocessing algorithm built into the system can be selected; the preprocessing algorithms are described in the next section. Once the administrator has assigned the task, the ordinary user sees the task to be labeled in his or her own system interface and can start labeling.

The labeling process of the system can be divided into three rounds, including a first round of labeling, a second round of labeling and a third round of labeling;

the specific labeling task is to label the entities and relations present in a text. In a medical text labeling task, for example, the aim is to label entities such as diseases, symptoms and drugs, together with the relations that hold between them (for example, a "disease symptom" relation between a disease and a symptom, and a "drug therapy" relation between a disease and a drug).

The labeling proceeds as follows. User 1 first performs the first round of labeling on the original text, which produces the first-round annotation files. User 2 then performs the second round of labeling on the basis of the first-round files, correcting problems such as missed or wrong labels left by user 1, which produces the second-round annotation files. Finally, user 1 performs the third round of labeling on the basis of the second-round files, refines the result once more, and the refined result is taken as the final labeling result. At this point the labeling task is complete.

Progress is counted in numbers of files, since each labeling task usually contains several text files to be labeled. The progress of the first round is the number of first-round annotation files divided by the number of source files. The second round is performed on the basis of the first, so a second-round file can be generated only after the corresponding first-round file exists; the progress of the second round is therefore the number of second-round annotation files divided by the number of first-round annotation files. Similarly, the progress of the third round is the number of third-round annotation files divided by the number of second-round annotation files.
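The progress rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `round_progress` and its parameters are hypothetical, and treating an empty denominator as zero progress is a choice made here, not something the text specifies.

```python
def round_progress(source_files: int, round1: int, round2: int, round3: int) -> dict:
    """Return the progress of each labeling round as a fraction in [0, 1].

    Each round is measured against the output of the previous round,
    and the first round against the number of uploaded source files.
    """
    def ratio(done: int, total: int) -> float:
        # Assumption: a round with nothing to work on reports 0 progress.
        return done / total if total else 0.0

    return {
        "round1": ratio(round1, source_files),
        "round2": ratio(round2, round1),
        "round3": ratio(round3, round2),
    }


# Hypothetical task: 10 source files, 8 first-round, 4 second-round, 1 third-round file.
progress = round_progress(source_files=10, round1=8, round2=4, round3=1)
```

Note that under this rule a later round can show 100% while the task as a whole is unfinished, because each round is measured only against the files the previous round has produced so far.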

2) Algorithm factory module

Preprocessing algorithm

The preprocessing algorithms segment the data to be labeled and complete entity names. An administrator chooses whether to enable them when creating a task.

The data to be labeled can be segmented in two ways, sentence by sentence or chapter by chapter, and the administrator selects the mode when creating the task. In sentence mode, one sentence is displayed at a time during labeling, and labeling proceeds sentence by sentence. In chapter mode, the whole content of one file is displayed at a time, and labeling proceeds file by file.

Entity name completion is intended for labeling tasks with longer texts. The procedure is as follows:

firstly, inputting the original text sequence;

secondly, splitting the original text into sentences at full stops;

thirdly, performing the completion according to the name of the entity to be completed, which the administrator enters in the system interface: the entity name is added in front of each sentence and separated from the sentence by the @ symbol;

fourthly, reassembling the completed sentences into an article, at which point the algorithm ends and labeling can proceed.

For example, suppose a document mainly concerns entities and relations related to chronic atrial fibrillation, but the entity "chronic atrial fibrillation" appears only at the head of the document and is referred to as "the disease" afterwards. If a later sentence reads "the disease will lead to xxxx symptoms", the relation between chronic atrial fibrillation and the symptoms should be labeled, but this is impossible because the entity "chronic atrial fibrillation" does not appear in that sentence. The entity name is therefore added at the head of the sentence, for example "chronic atrial fibrillation@the disease will lead to xxxx symptoms", so that the relation can be labeled without difficulty.
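The four steps of entity name completion can be sketched as follows. This is a minimal illustration, not the system's own code: the function name `complete_entity_name` and the `joiner` parameter are hypothetical, and splitting on both the Chinese full stop and the ASCII period is an assumption made so that the example works on English text.

```python
import re


def complete_entity_name(text: str, entity: str, sep: str = "@", joiner: str = "") -> str:
    """Prefix every sentence of `text` with `entity` followed by `sep`."""
    # Step 2: split after each full stop, keeping the stop with its sentence.
    # Caveat of this sketch: an ASCII "." inside a number or abbreviation
    # would also be treated as a sentence boundary.
    sentences = [s.strip() for s in re.split(r"(?<=[。.])", text) if s.strip()]
    # Step 3: add the entity name in front of each sentence.
    completed = [f"{entity}{sep}{s}" for s in sentences]
    # Step 4: reassemble the sentences into one article.
    return joiner.join(completed)


article = "Chronic atrial fibrillation is common. The disease will lead to xxxx symptoms."
completed = complete_entity_name(article, "chronic atrial fibrillation", joiner=" ")
```

After completion, the second sentence begins with "chronic atrial fibrillation@", so both ends of the relation are present in the same sentence.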

Named Entity Recognition (NER) algorithm

The NER algorithms built into the system automatically recognize named entities such as diseases, drugs and symptoms that appear in the text and pre-label the text, which reduces the manual labeling workload and improves labeling efficiency. This step takes place while an ordinary user is labeling: the user chooses whether to apply it in order to speed up the work.

The system contains two types of algorithm, rule-based and deep-learning-based. The rule-based algorithm uses dictionaries constructed by experts in the medical field. In a dictionary, each line holds one entity, and entities are separated by line feeds. For medical text labeling, for example, the dictionaries include a disease dictionary, a symptom dictionary, a drug dictionary and so on (each line of the disease dictionary holds one disease name, and the other dictionaries are similar).

The rule-based algorithm identification process is as follows:

the first step is as follows: inputting a text sequence to be processed;

the second step: selecting a dictionary to be adopted;

the third step: sorting the entities in the selected dictionary in descending order of length to preferentially match longer entities;

the fourth step: matching the entities of the sorted dictionary against the text sequence to obtain the entities that occur in it;

the fifth step: and returning the matched entity set, and ending the algorithm.
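The five steps above amount to longest-first dictionary matching, which can be sketched as below. The function name `match_dictionary` and the tuple layout of the result are hypothetical, and skipping any match that overlaps an earlier, longer match is one plausible reading of "preferentially match longer entities", not a detail the text confirms.

```python
def match_dictionary(text: str, entities: list[str], label: str) -> list[tuple[int, int, str, str]]:
    """Return (start, end, label, surface) for every dictionary hit in `text`."""
    taken = [False] * len(text)  # characters already covered by a longer match
    hits = []
    # Step 3: sort dictionary entries by length, longest first.
    for name in sorted(set(entities), key=len, reverse=True):
        if not name:
            continue
        # Step 4: scan the text for every occurrence of this entry.
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            if not any(taken[start:end]):
                hits.append((start, end, label, name))
                taken[start:end] = [True] * (end - start)
            start = text.find(name, start + 1)
    # Step 5: return the matched entities in text order.
    return sorted(hits)


disease_dictionary = ["atrial fibrillation", "chronic atrial fibrillation"]
sentence = "Patients with chronic atrial fibrillation may also have atrial fibrillation episodes."
hits = match_dictionary(sentence, disease_dictionary, "disease")
```

Because the longer entry is tried first, the first mention is labeled as "chronic atrial fibrillation" as a whole rather than as the shorter "atrial fibrillation" nested inside it.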

The deep-learning-based algorithm adopts a character-based BiLSTM-CRF model. The training data of the model is a portion of text labeled in advance using the BIO tagging scheme. For example, the text 感冒会导致头痛 ("a cold can cause headache") is labeled character by character as "感 B-DIS 冒 I-DIS 会 O 导 O 致 O 头 B-SYM 痛 I-SYM", where DIS (disease) and SYM (symptom) are labels defined manually in advance, B marks the first character of an entity, I marks the remaining characters of the entity, and O marks characters outside any entity.

The labeled data set is input into the neural network model for training to obtain the trained model.

After training, the model can be used to perform named entity recognition on unlabeled text and recognize entities such as diseases and symptoms. The recognition process is as follows:

firstly, inputting a text sequence to be processed;

secondly, loading the trained deep learning model;

thirdly, inputting the text into the model;

fourthly, acquiring the entities from the results returned by the model;

And fifthly, returning the identified entity set, and ending the algorithm.
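The fourth step, obtaining entities from the model output, can be illustrated by decoding a BIO tag sequence into entity spans. The sketch below assumes the model returns one tag per character; the function name `decode_bio` is hypothetical, and no BiLSTM-CRF model is loaded here, so the tag sequence is written out by hand for the example sentence 感冒会导致头痛 ("a cold can cause headache").

```python
def decode_bio(chars: list[str], tags: list[str]) -> list[tuple[int, int, str, str]]:
    """Turn per-character BIO tags into (start, end, label, surface) spans."""
    entities = []
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last character.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            entities.append((start, i, label, "".join(chars[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


chars = list("感冒会导致头痛")
tags = ["B-DIS", "I-DIS", "O", "O", "O", "B-SYM", "I-SYM"]
entities = decode_bio(chars, tags)
```

An I- tag that does not continue an open entity of the same type is ignored in this sketch; other recovery strategies are possible.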

The named entity recognition process that combines the two algorithms is shown in FIG. 3. The text sequence to be labeled (the preprocessed file) is processed by each of the two types of algorithm described above, and the results are then fused by weighting. Alternatively, either algorithm can be used on its own.
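The text does not say how the weighting is performed, so the following is only one possible sketch of weighted fusion. Everything specific in it is an assumption: the function name `fuse`, the weights `w_rule` and `w_model`, the acceptance `threshold`, and the idea that the model supplies a confidence per entity while a dictionary hit counts as fully confident.

```python
from collections import defaultdict


def fuse(rule_hits, model_hits, w_rule=0.5, w_model=0.5, threshold=0.5):
    """Combine rule-based and model-based entities by a weighted vote.

    rule_hits:  iterable of (start, end, label)
    model_hits: iterable of ((start, end, label), confidence)
    """
    scores = defaultdict(float)
    for span in rule_hits:
        scores[span] += w_rule                # dictionary hit: full weight
    for span, confidence in model_hits:
        scores[span] += w_model * confidence  # model hit: weight scaled by confidence
    # Keep only the spans whose combined score reaches the threshold.
    return sorted(span for span, score in scores.items() if score >= threshold)


rule_hits = [(0, 2, "DIS"), (5, 7, "SYM")]
model_hits = [((0, 2, "DIS"), 0.9), ((3, 5, "SYM"), 0.6)]
fused = fuse(rule_hits, model_hits)
```

With these hypothetical settings a dictionary hit alone is accepted, while a model-only prediction is accepted only if its weighted confidence reaches the threshold; changing the weights shifts trust between the two sources.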

3) Database module

The database module stores the annotation data and the system tables. The annotation data comprise the original files to be labeled that users upload, the first-round, second-round and third-round annotation files (in JSON format) generated during labeling, and the compressed files generated when data is exported. The files of each part are stored separately, so that a change to one part cannot affect the others, which keeps the files safe.

The database also stores the user table, the permission table and the task information table, which contain information such as task creation time and assigned users and form the basis on which the whole system operates.
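As an illustration of what an annotation file in JSON format might contain, the sketch below stores the original text together with entity offsets and relations. The field names (`text`, `entities`, `relations`, `start`, `end`, `head`, `tail`, `attribute`) and the identifier scheme are hypothetical; the text states only that entities, relation offsets and the original text are stored as JSON.

```python
import json

# Hypothetical layout of one annotation file.
record = {
    "text": "Chronic diverticulitis may cause bloating.",
    "entities": [
        {"id": "T1", "label": "disease", "start": 0, "end": 22},
        {"id": "T2", "label": "symptom", "start": 33, "end": 41},
    ],
    "relations": [
        # Relations refer to entities by id; attribute is null when not labeled.
        {"name": "clinical symptom", "head": "T1", "tail": "T2", "attribute": None},
    ],
}

# ensure_ascii=False keeps Chinese text readable in the stored file.
serialized = json.dumps(record, ensure_ascii=False, indent=2)
restored = json.loads(serialized)

# The offsets index into the original text, end-exclusive.
surfaces = [record["text"][e["start"]:e["end"]] for e in record["entities"]]
```

Storing offsets rather than only the entity strings lets a later round or the comparison report locate each mention unambiguously, even when the same string occurs several times.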

4) User configuration module

This module is used to configure the entity items and relation items of the system. Before text is labeled, the corresponding entity items and relation items must be added. For medical text labeling, for example, entity items such as disease, symptom and drug must be added, along with relation items such as clinical symptom (the relation between a disease entity and a symptom entity), medication (the relation between a disease entity and a drug entity) and complication (the relation between a disease entity and another disease entity). Different task types require different entity items and relation items, which users define according to their own needs.

5) Web interface module

The module provides a visual operation interface for a user, and comprises a labeling function, a data analysis function and other auxiliary labeling functions, and detailed introduction of each function is described below.

Labeling function module

The entity and relation labeling platform for the medical text, which is constructed by the method, has the functions of entity labeling, relation labeling and attribute labeling.

For an entity appearing in the text, the user clicks an entity button to select the entity label and then selects the corresponding characters in the text to complete the entity labeling. After entity labeling, the user can choose whether to label relations, and can switch back and forth between the entity labeling mode and the relation labeling mode at any time.

The relationship defined herein by default is in the form of a quadruple of (entity 1, entity 2, relationship name, relationship attributes), such as < chronic diverticulitis-bloating-post-treatment symptom-colorectal anastomosis >, where chronic diverticulitis is entity 1, bloating is entity 2, post-treatment symptom is the relationship between the two entities, and colorectal anastomosis is an attribute of a relationship, signifying that symptom occurred after colorectal anastomosis. The relation label is similar to the entity label, and the user needs to select the corresponding relation name first, then click the entity corresponding to the entity 1, and then click the entity corresponding to the entity 2 to complete the label.

An attribute modifies, explains or places a condition on a relation, and so qualifies it; in the example above, colorectal anastomosis is an attribute of the relation post-treatment symptom. The user chooses whether attribute labeling is enabled; when it is disabled, the attribute value is set to null. If attribute labeling is enabled, then during relation labeling a dialog box asking whether to label an attribute pops up once entity 1, entity 2 and the relation have been labeled. If the user answers yes, he or she selects the characters corresponding to the attribute to complete the attribute labeling; otherwise the attribute is set to null.
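The quadruple form of a relation, with an attribute that may be null, can be modeled as a small record type. This is a sketch only; the class name `Relation` and its field names are hypothetical, and Python's `None` stands in for the null attribute value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """A labeled relation: (entity 1, entity 2, relation name, relation attribute)."""
    entity1: str
    entity2: str
    name: str
    # None when attribute labeling is disabled or the user declines the dialog.
    attribute: Optional[str] = None


with_attribute = Relation(
    "chronic diverticulitis", "bloating", "post-treatment symptom", "colorectal anastomosis"
)
without_attribute = Relation("chronic diverticulitis", "bloating", "post-treatment symptom")
```

Keeping the attribute as an optional fourth field means relations labeled with and without attributes share one shape, which simplifies later listing and export.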

Data analysis function module

In the process of processing data, not only the annotation of the data is important, but also the analysis of the annotated data is an essential task. In combination with the realization of the above part of the marking function, the platform provides the analysis function of the marking data and the generation of the marking comparison report, so that the user can conveniently grasp the marking quality while marking.

The platform provides three data analysis modes: a list mode, a knowledge graph mode and a chart mode. The list mode lists all the relations and entities in the annotated file one by one, and its result can be exported as an Excel file.

The knowledge graph mode displays the data as a graph. For a quadruple such as < chronic diverticulitis-bloating-post-treatment symptom-colorectal anastomosis >, entity 1 "chronic diverticulitis" is the central node, entity 2 "bloating" is one of its leaf nodes, the relation is the line connecting the two nodes, and the attribute is another leaf node of entity 1.
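The mapping from a quadruple to graph nodes and edges can be sketched as follows. The function name `quadruple_to_graph`, the node roles `"center"` and `"leaf"`, and the edge label `"attribute"` used for the attribute node are assumptions made for illustration; the text specifies only which element becomes which kind of node.

```python
def quadruple_to_graph(entity1, entity2, relation, attribute=None):
    """Return (nodes, edges) for one quadruple.

    nodes maps a node name to its role; edges are (source, target, label).
    """
    nodes = {entity1: "center", entity2: "leaf"}
    edges = [(entity1, entity2, relation)]  # the relation is the connecting line
    if attribute is not None:
        # The attribute becomes another leaf attached to entity 1.
        nodes[attribute] = "leaf"
        edges.append((entity1, attribute, "attribute"))
    return nodes, edges


nodes, edges = quadruple_to_graph(
    "chronic diverticulitis", "bloating", "post-treatment symptom", "colorectal anastomosis"
)
```

The resulting node and edge lists could be passed to any graph drawing front end; none is assumed here.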

The chart form shows the relationship and the entity composition in a pie chart and a bar chart, for example, in the medical text labeling, the entities can be classified into diseases, symptoms, medicines and the like, and the proportion of various types of entities or relationships can be viewed in the chart.

In addition to the three analysis modes above, the platform can generate a detailed comparison report for two annotation files (the third-round annotation file is File1 and the second-round annotation file is File2). The report has two parts: an overall analysis and a detailed content comparison.

The first part shows, in a table, the P (precision), R (recall) and F values for each entity type in the annotation file. The third-round file (File1) is taken as the gold standard, and the calculation formulas are given in (2) - (4).

The second part lists the contents of the annotation files in detail and highlights the entities in different colors to show the differences between the third-round and second-round files. An entity present in both files is shown in green, an entity present only in File1 in blue, and an entity present only in File2 in red.
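The per-type scores can be sketched as below, assuming the usual definitions: precision is the share of File2 entities that also appear in File1, recall is the share of File1 entities that also appear in File2, and F is their harmonic mean (the balanced F1). The formulas themselves are not reproduced in this text, so these definitions, the exact-match criterion on (start, end, label), and the function names are assumptions.

```python
def prf(gold: set, predicted: set) -> tuple[float, float, float]:
    """Precision, recall and F1 of `predicted` against `gold` (exact match)."""
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def prf_by_type(file1: set, file2: set) -> dict:
    """Score File2 against File1 (the gold standard) for each entity type."""
    labels = sorted({e[2] for e in file1 | file2})
    return {
        lab: prf({e for e in file1 if e[2] == lab}, {e for e in file2 if e[2] == lab})
        for lab in labels
    }


# Entities as (start, end, label); File1 is the third-round file.
file1 = {(0, 2, "DIS"), (5, 7, "SYM"), (10, 12, "SYM")}
file2 = {(0, 2, "DIS"), (5, 7, "SYM")}
report = prf_by_type(file1, file2)
```

In this hypothetical example the second-round file misses one symptom that the third round added, so recall for the symptom type drops while precision stays perfect.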
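The color rule of the detailed comparison reduces to set membership, as in this short sketch. The function name `diff_colors` and the representation of an entity as a (start, end, label) tuple are hypothetical.

```python
def diff_colors(file1: set, file2: set) -> dict:
    """Assign a highlight color to every entity found in either file."""
    colors = {}
    for entity in file1 | file2:
        if entity in file1 and entity in file2:
            colors[entity] = "green"  # present in both files
        elif entity in file1:
            colors[entity] = "blue"   # present only in File1 (third round)
        else:
            colors[entity] = "red"    # present only in File2 (second round)
    return colors


third_round = {(0, 2, "DIS"), (10, 12, "SYM")}
second_round = {(0, 2, "DIS"), (5, 7, "SYM")}
colors = diff_colors(third_round, second_round)
```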

Auxiliary function module

The system provides two auxiliary functions: one-click multi-labeling and data export. When one-click multi-labeling is enabled, the user labels an entity once during labeling, and all other occurrences of the same entity are labeled automatically by string matching, which improves labeling efficiency.

In the data export function, the user can search the relations (by relation name, by file name and so on). The search results are displayed as a list in which each row contains the file name, entity 1, the relation, entity 2 and so on. After the search, the results can be exported: the exported relations are written to an Excel file, which is compressed and offered to the user for download.
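One-click multi-labeling by string matching can be sketched as follows. The function name `label_all_occurrences` is hypothetical, and the choices made here (case-sensitive matching, non-overlapping occurrences) are assumptions, since the text says only that the other occurrences are found by string matching.

```python
def label_all_occurrences(text: str, surface: str, label: str) -> list[tuple[int, int, str]]:
    """Return a (start, end, label) span for every occurrence of `surface`."""
    if not surface:
        return []
    spans = []
    start = text.find(surface)
    while start != -1:
        end = start + len(surface)
        spans.append((start, end, label))
        start = text.find(surface, end)  # continue after this occurrence
    return spans


note = "bloating occurs early; bloating may persist."
spans = label_all_occurrences(note, "bloating", "symptom")
```

The user labels the first "bloating" by hand, and the remaining occurrence receives the same label without further clicks.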

The system provided by the invention has the following advantages:

(1) A visual operation interface is provided that supports the labeling of entities, relations and attributes, and the user can complete the labeling with simple click operations.

(2) A series of dictionaries containing a large amount of entity information, such as dictionaries of diseases, symptoms, drugs, operations and examinations, are built in, and auxiliary labeling is carried out in a rule-based manner.

(3) Several deep learning models, such as a BiLSTM-CRF model, are integrated for pre-labeling.

(4) All labeling tasks in the platform follow a multi-round process comprising a first, a second and a third round of labeling, and the task progress is displayed in real time on the task management interface.

(5) To ensure labeling quality, different annotators take part in the multiple rounds of labeling, and the system automatically generates a labeling comparison report after labeling is finished, so as to control labeling quality.

(6) The platform is highly customizable: it is suitable for the labeling of medical texts and can also be applied to other types of labeling task after simple configuration.

(7) The system is developed on a Python-based web framework, is easy to deploy and is highly portable.

Beneficial effects: compared with the prior art, the technical solution of the invention has the following beneficial technical effects:

the labeling platform provided by the invention uses rule-based dictionaries and deep-learning-based models for auxiliary labeling, which overcomes the time-consuming and laborious nature of traditional manual labeling and improves labeling efficiency. At the same time, its progress control and quality control functions improve labeling quality and are of practical significance for building a high-quality corpus.

Drawings

FIG. 1 is a system architecture diagram of the present invention;

FIG. 2 is a flow chart of data annotation;

FIG. 3 is a diagram of a named entity identification process.

Detailed Description

The invention provides an entity and relation labeling system facing to medical texts, and a system architecture of the system is shown in figure 1 and comprises five modules including a task center, an algorithm factory, a database, user configuration and a WEB interface.

the user configuration module is used for the user to pre-configure the entities of the system and the relationship items among the entities;

and the WEB interface module provides a visual operation interface, so that a user can conveniently finish the labeling of the text and the data display and analysis. The following sub-modules are introduced in an expanded way:

1) Task center module

The task center module is a core module of the system, and responsibilities of the task center module comprise two aspects of task management and authority management. And the task management is responsible for data uploading, task creation, task distribution and progress control work. Rights management includes user grouping and rights control. The users in the system are divided into two types of administrators and ordinary users, the administrator users can create and distribute tasks, group users and the like, and the ordinary users can only upload data and label data.

the specific labeling task is to label entities and relationships existing in the text, such as a medical text labeling task, which aims to label entities such as diseases, symptoms, drugs and the like existing in the text, and certain relationships existing between the entities (for example, a relationship between diseases and symptoms exists, and a relationship between diseases and drugs exists for drug treatment).

2) Algorithm factory module

Preprocessing algorithm

The pre-processing algorithm is used for segmenting the data to be annotated and completing the entity name, and whether the data to be annotated and the entity name are enabled or not can be selected by an administrator user when a task is created.

The division of the data set to be marked can be divided into two different division modes, namely sentence-by-sentence and chapter-by-chapter, and the administrator user can select which division mode to adopt when creating the task. If the sentence-by-sentence division mode is selected, the contents of one sentence are displayed each time during the labeling, and the labeling is carried out sentence by sentence. If the division is performed according to chapters, the whole content of one file is displayed each time during the annotation, and the annotation is performed according to the sequence of the files.

firstly, inputting an original text sequence;

secondly, splitting the original text into sentences at full stops;

Fourthly, the completed sentences are reassembled into articles, the algorithm is finished, and the marking work is carried out next step.

For example, suppose a document mainly concerns entities and relations related to chronic atrial fibrillation, but the entity "chronic atrial fibrillation" appears only at the head of the document and is referred to as "the disease" afterwards. If a later sentence reads "the disease will lead to xxxx symptoms", the relation between chronic atrial fibrillation and the symptoms should be labeled, but this is impossible because the entity "chronic atrial fibrillation" does not appear in that sentence. The entity name is therefore added at the head of the sentence, for example "chronic atrial fibrillation@the disease will lead to xxxx symptoms", so that the relation can be labeled without difficulty.

Named Entity Recognition (NER) algorithm

The NER algorithms built into the system automatically recognize named entities such as diseases, drugs and symptoms that appear in the text and pre-label the text, which reduces the manual labeling workload and improves labeling efficiency. This step takes place while an ordinary user is labeling: the user chooses whether to apply it in order to speed up the work.

The system contains two types of algorithm, rule-based and deep-learning-based. The rule-based algorithm uses dictionaries constructed by experts in the medical field. In a dictionary, each line holds one entity, and entities are separated by line feeds. For medical text labeling, for example, the dictionaries include a disease dictionary, a symptom dictionary, a drug dictionary and so on (each line of the disease dictionary holds one disease name, and the other dictionaries are the same).

The rule-based algorithm identification process is as follows:

the first step is as follows: inputting a text sequence to be processed;

the second step is that: selecting a dictionary to be adopted;

the fourth step: obtaining a corresponding entity matched in the text sequence according to the entity set in the sorted dictionary;

the fifth step: and returning the matched entity set, and finishing the algorithm.

After the model training is completed, the method can be used for carrying out named entity recognition on the unlabeled text to recognize entities such as diseases, symptoms and the like, and the specific recognition process is as follows:

firstly, inputting a text sequence to be processed;

secondly, loading the trained deep learning model;

thirdly, inputting the text into the model;

fourthly, acquiring the entities from the result returned by the model;

And fifthly, returning the identified entity set, and ending the algorithm.

The named entity recognition process combining the two algorithms is shown in fig. 3, and the text sequence to be labeled (the preprocessed file) is recognized by the two types of algorithms respectively according to the two modes, and then the results are weighted and fused. Or the named entity recognition can be carried out by two independent algorithms.

3) Database module

The database module is responsible for storing annotation data and a system table in the system, the annotation data comprises an original file to be annotated uploaded by a user, a first round of annotation, a second round of annotation and a third round of annotation file (JSON format) generated in an annotation process, and a compressed file generated during data export, the files of all parts are stored separately, and the change of each part does not affect other parts, so that the file safety is ensured.

4) User configuration module

5) Web interface module

Labeling function module

For entities appearing in the text, a user clicks an entity button to select an entity label and then selects a corresponding character to finish entity labeling. After the entity labeling is completed, a user can select whether to perform relation labeling or not, and can switch back and forth between two labeling modes of the entity labeling and the relation labeling at any time.

The relationship defined herein by default is in the form of a quadruple of (entity 1, entity 2, relationship name, relationship attributes), such as < chronic diverticulitis-bloating-post-treatment symptom-colorectal anastomosis >, where chronic diverticulitis is entity 1, bloating is entity 2, post-treatment symptom is a relationship between the two entities, and colorectal anastomosis is an attribute of a relationship, signifying that symptom occurred after colorectal anastomosis. The relation label is similar to the entity label, and the user needs to select the corresponding relation name first, then click the entity corresponding to the entity 1, and then click the entity corresponding to the entity 2 to complete the label.

An attribute modifies, explains or places a condition on a relation, and so qualifies it; in the example above, colorectal anastomosis is an attribute of the relation post-treatment symptom. The user chooses whether attribute labeling is enabled; when it is disabled, the attribute value is set to null. If attribute labeling is enabled, then during relation labeling a dialog box asking whether to label an attribute pops up once entity 1, entity 2 and the relation have been labeled. If the user answers yes, he or she selects the characters corresponding to the attribute to label it; otherwise the attribute is set to null.

Data analysis function module

The platform provides three different data analysis modes, namely a list mode, a knowledge graph mode and a chart mode. And the list analysis mode is used for listing all the relations and entities in the marked file one by one in a list form, and the result of the list analysis mode provides export of the Excel format file.

In addition to the three analysis modes above, the platform can generate a detailed comparison report for two annotation files (the third-round annotation file is File1 and the second-round annotation file is File2). The report is divided into two parts: an overall analysis and a detailed content comparison.

The second part lists the specific contents of the annotation files in detail and, so that the differences between the third-round and second-round files can be compared, highlights the entities in different colors. An entity present in both files is shown in green, an entity present only in File1 is shown in blue, and an entity present only in File2 is shown in red.
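The color rule can be sketched as a set comparison. This is an illustration only: it assumes each entity is represented as a hypothetical (start, end, label) tuple and that two entities count as the same only when all three fields match, which the description does not specify.

```python
def highlight_colors(file1, file2):
    """Map each entity to a highlight color.

    file1: entities of the third-round file (File1)
    file2: entities of the second-round file (File2)
    Entities are (start, end, label) tuples; this shape is an assumption.
    """
    colors = {}
    for entity in file1 | file2:
        if entity in file1 and entity in file2:
            colors[entity] = "green"  # present in both files
        elif entity in file1:
            colors[entity] = "blue"   # present only in File1
        else:
            colors[entity] = "red"    # present only in File2
    return colors


file1 = {(0, 5, "disease"), (10, 15, "symptom")}
file2 = {(0, 5, "disease"), (20, 25, "drug")}
colors = highlight_colors(file1, file2)
```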

Auxiliary function module

In the data export function, the user can search for relationships (for example by relationship name or by file name). The search results are displayed as a list in which each row contains the file name, entity 1, the relationship, entity 2 and so on. After the search is finished, the results can be exported; the exported relationships are written to an Excel file, compressed, and provided to the user for download.

The system provided by the invention has the following advantages:

(6) The platform is highly customizable: it is suited to annotation tasks for medical texts and, after simple configuration, can also be applied to other types of annotation tasks.

The general labeling process is as follows:

the first step: an ordinary user uploads the text to be annotated; an administrator creates the corresponding task, selects the preprocessing algorithm to apply, and finally assigns the task to specific users;

the second step: an ordinary user enters the task and, according to the specific requirements of the annotation task, selects the required dictionary (rule-based method) and deep learning model for pre-annotation;

the third step: different annotators complete the first-round, second-round, third-round and any further rounds of annotation for the task, with the first and third rounds completed by the same person; an annotation progress of 100% indicates that the task is complete, and the completed document is stored in JSON format, containing information such as the offsets of the entities and relations in the text;

the fourth step: for a task whose completion is 100%, an annotation comparison report is generated, and the precision, recall and F value of each entity category in the annotated files are calculated, taking the third-round annotation file as the gold standard;

the fifth step: annotation tasks that do not pass the quality check are returned to the responsible person for rework, and the next annotation task is started once they pass; the first to fifth steps are repeated until all documents have been annotated.
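The quality check in the fourth step can be sketched as follows. The platform's own formula is not reproduced in this text, so the sketch assumes the conventional definitions: an entity counts as correct only if its span and label both match the gold file exactly, and the F value is the harmonic mean of precision and recall. The function name and the (start, end, label) tuple shape are hypothetical.

```python
def prf_by_category(gold, predicted):
    """Per-category precision, recall and F value.

    gold:      entities of the third-round file, used as the gold standard
    predicted: entities of the file being evaluated (e.g. the second round)
    Entities are (start, end, label) tuples; exact match is assumed.
    """
    scores = {}
    labels = {entity[2] for entity in gold | predicted}
    for label in sorted(labels):
        g = {e for e in gold if e[2] == label}
        p = {e for e in predicted if e[2] == label}
        correct = len(g & p)
        precision = correct / len(p) if p else 0.0
        recall = correct / len(g) if g else 0.0
        total = precision + recall
        f_value = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f_value)
    return scores


gold = {(0, 5, "disease"), (10, 15, "disease"), (20, 25, "symptom")}
second_round = {(0, 5, "disease"), (30, 35, "disease"), (20, 25, "symptom")}
scores = prf_by_category(gold, second_round)
```

A task could then be sent back for rework when, say, the F value of any category falls below a chosen threshold; the description does not state what threshold the platform uses.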

Claims

1. A medical text-oriented entity and relationship labeling system is characterized by comprising the following modules:

the task center module is used for uploading the annotation data, creating the annotation tasks and task distribution, and managing the tasks and the user authority; the task center module has the following functions: data uploading, task creation, task allocation and progress control; the authority management comprises user grouping and authority control, users in the system are divided into two types, namely an administrator and a common user, the administrator user can create and distribute tasks and group the users, and the common user can only upload data and label the data;

the algorithm factory module is internally provided with a preprocessing segmentation method for segmenting the text and combines a named entity recognition algorithm to assist the user in performing pre-labeling;

the WEB interface module provides a visual operation interface, so that a user can conveniently finish the labeling of texts and the data display and analysis;

the WEB interface module also comprises a data analysis function module and an auxiliary function module, and the functions of the WEB interface module are as follows:

the data analysis function module is used for displaying the analysis result of the marked data in a list form, a knowledge graph form and a chart form;

the auxiliary function module provides two auxiliary functions, namely a one-click multi-match function and a data export function; if the one-click multi-match function is enabled, the user only needs to annotate one entity during annotation, and all other identical entities are annotated automatically by character string matching;

in the data export function, the search results for relations are displayed to the user in list form, and the export operation is performed after the search is finished;

the annotation is divided into three rounds, namely a first round, a second round and a third round, and the specific annotation process is as follows: a first user performs the first round of annotation on the original text and generates a first-round annotation file; after the first user finishes the first round, a second user performs the second round of annotation on the basis of the first-round annotation file, corrects any entities the first user missed or mislabeled, and generates a second-round annotation file upon finishing; finally, the first user performs the third round of annotation on the basis of the second-round annotation file and refines the annotation to produce the final annotation result;

the functions of the algorithm factory module are as follows:

(1) A built-in preprocessing algorithm is used for segmenting the data to be annotated and completing entity names, and the administrator user selects whether it is enabled when creating a task;

(2) The data set to be annotated can be segmented in two different ways, namely sentence-by-sentence segmentation and chapter-by-chapter segmentation, and the administrator user selects which segmentation mode to adopt when creating a task; if sentence-by-sentence segmentation is selected, the content of one sentence is displayed at a time during annotation, and annotation proceeds sentence by sentence; if chapter-by-chapter segmentation is selected, the entire content of one file is displayed at a time during annotation, and annotation proceeds in file order;

the concrete process of entity name completion is as follows:

firstly, inputting an original text sequence;

secondly, segmenting the original text at the periods (full stops);

thirdly, inputting the name of the entity to be completed, adding the entity name in front of each sentence, and separating it from the sentence with a preset symbol;

fourthly, the completed sentences are reassembled into articles;

the data analysis function module also has the following functions:

(1) Calculating the precision P, the recall R and the F value of each entity type in the annotation file, wherein the calculation formula is as follows:

wherein File1 is the third-round annotation file, and File2 is the second-round annotation file;

(2) Comparing the differences between the third-round annotation file and the second-round annotation file, and highlighting the entities in different colors.
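The entity name completion process recited in claim 1 can be illustrated with a minimal sketch. The function name, the default separator ":" and the use of the Chinese full stop "。" as the sentence delimiter are assumptions made for illustration; the claim only requires a preset symbol and sentence-level segmentation.

```python
def complete_entity_name(text, entity, separator=":", period="。"):
    """Prefix every sentence of `text` with `entity`.

    Steps: split the text at periods, put the entity name and a
    preset separator in front of each sentence, then reassemble.
    """
    # Split at the period and drop empty fragments (e.g. after the last one).
    sentences = [s.strip() for s in text.split(period) if s.strip()]
    completed = [f"{entity}{separator}{s}{period}" for s in sentences]
    return "".join(completed)


# Hypothetical input: two short sentences that omit their subject.
article = complete_entity_name("A。B。", "X")
```

Prefixing each sentence this way keeps the subject entity visible when the text is later displayed and annotated sentence by sentence.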

2. The system of claim 1, wherein the annotation task is either assigned directly to a user for annotation or first pre-processed with a pre-processing algorithm built into the system.

3. The system for labeling entities and relations oriented to medical texts as claimed in claim 1, wherein the named entity recognition algorithm comprises rule-based algorithm recognition and/or neural network model-based algorithm recognition, and the specific flow of the rule-based algorithm recognition is as follows:

the first step is as follows: inputting a text sequence to be processed;

the second step is as follows: selecting a dictionary to be adopted;

the fifth step: returning the matched entity set;

the algorithm identification based on the neural network model comprises the following specific processes:

firstly, carrying out entity labeling on a text data set, and inputting a labeled data set text into a neural network model for training to obtain the neural network model meeting requirements;

secondly, inputting a text sequence to be processed;

thirdly, inputting the text into the neural network model;

fourthly, acquiring an entity in a result returned by the model;

and fifthly, returning the identified entity set.
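The rule-based recognition flow of claim 3 can be sketched as a dictionary lookup. The intermediate matching steps are not reproduced in this text, so the sketch assumes a simple forward longest-match scan; the function name, the dictionary layout (surface string mapped to label) and the sample terms are hypothetical.

```python
def match_entities(text, dictionary):
    """Return (start, end, label) for every dictionary term found in `text`.

    dictionary maps a surface string to its entity label. At each
    position the longest matching term wins (an assumed strategy).
    """
    terms = sorted((t for t in dictionary if t), key=len, reverse=True)
    matches = []
    i = 0
    while i < len(text):
        for term in terms:
            if text.startswith(term, i):
                matches.append((i, i + len(term), dictionary[term]))
                i += len(term)  # continue after the matched term
                break
        else:
            i += 1  # no term starts here; advance one character
    return matches


dictionary = {
    "diverticulitis": "disease",
    "chronic diverticulitis": "disease",
    "bloating": "symptom",
}
found = match_entities("chronic diverticulitis with bloating", dictionary)
```

Preferring the longest term avoids annotating "diverticulitis" alone when the dictionary also contains the longer "chronic diverticulitis".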

4. The system for labeling entities and relationships oriented to medical texts as claimed in claim 1, wherein the WEB interface module comprises a labeling function module, and the function of the module is as follows: for entities appearing in the text, a user clicks an entity button to select an entity label and then selects a corresponding character to finish entity marking, and after the entity marking is finished, the user selects whether to perform relation marking or not and can switch back and forth between two marking modes of entity marking and relation marking at any time.