CN114169336A - User-defined multi-mode distributed semi-automatic labeling system - Google Patents


Info

Publication number
CN114169336A
Authority
CN
China
Prior art keywords
labeling
task
user
entity
annotation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111517605.4A
Other languages
Chinese (zh)
Inventor
张坤丽
胡斌
昝红英
代东明
桂明宇
宋玉
赵旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University filed Critical Zhengzhou University
Priority to CN202111517605.4A priority Critical patent/CN114169336A/en
Publication of CN114169336A publication Critical patent/CN114169336A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/32Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324Display of status information
    • G06F11/328Computer systems status display
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a user-defined multi-mode distributed semi-automatic labeling system comprising a management module, a user-defined module, a persistent storage module, a WEB interaction module and an algorithm factory module. The management module manages labeling tasks and system permissions; the user-defined module provides an interface for customization; the persistent storage module stores system data; the WEB interaction module provides a visual interface for completing labeling tasks; and the algorithm factory module provides algorithm support for the WEB interaction module. Every labeling task in the system goes through multiple rounds of labeling: to ensure labeling quality, different personnel label the task in each round, and the system generates a labeling comparison report once labeling is finished, so that labeling quality can be better controlled. Task progress is also displayed in real time in the task management interface, which helps users control progress.

Description

User-defined multi-mode distributed semi-automatic labeling system
Technical Field
The invention relates to the technical field of data annotation, in particular to a user-defined multi-mode distributed semi-automatic annotation system.
Background
With the development of internet technology in recent years, artificial intelligence has become deeply involved in education, transportation, medical treatment and other aspects of our lives. Deep learning is the most important component of modern artificial intelligence, and it requires a large amount of pre-labeled structured data to support model optimization. Although data is now growing explosively, most of it is semi-structured or unstructured, and it can serve scientific research only after being converted into structured data that a computer can process. The annotated corpora obtained through text labeling are the basis for related research such as named entity recognition and automatic relation extraction. However, existing labeled high-quality corpora are quite scarce, and corpora usable for research are rarer still, which greatly limits the development of such research. Labeling is extremely heavy and tedious work: traditional manual labeling is time-consuming and labor-intensive, its cost is huge, and its quality is difficult to control, which deters many researchers and slows the progress of resource construction.
Disclosure of Invention
The invention aims to provide a user-defined multi-mode distributed semi-automatic labeling system to solve the problems identified in the background above: labeling in the prior art is time-consuming and labor-intensive, complicated to operate, and difficult to control in quality and progress.
In order to achieve the purpose, the invention provides the following technical scheme: a user-defined multi-mode distributed semi-automatic labeling system comprises a management module, a user-defined module, a persistent storage module, a WEB interaction module and an algorithm factory module;
the management module is used for managing the labeling task and the system authority;
the user self-defining module is used for providing an interface of a self-defining function;
the persistent storage module is used for storing system data;
the WEB interaction module is used for providing a visual interface and completing a labeling task;
and the algorithm factory module is used for providing algorithm support for the WEB interaction module.
Furthermore, the management module comprises a task management unit and a user management unit;
the task management unit is used for managing the labeling task according to the labeling task type;
the user management unit is used for setting the authority management of the system.
Further, the task management unit is configured to manage the annotation task according to the annotation task type, and specifically includes:
uploading the tasks to be marked to a system in different file formats according to the marking task types;
according to the task to be labeled, setting a task name, a labeling type, used source data, a belonging group and a preprocessing algorithm corresponding to the task to be labeled;
and performing task allocation on the tasks to be labeled and labeling.
Furthermore, the user-defined module is used for providing an interface with a user-defined function, and specifically includes:
self-defining corresponding entities to be annotated and relationship items between the entities;
and customizing a dictionary or a deep learning model required by the named entity recognition algorithm based on rules or deep learning according to the task requirement to be marked.
Furthermore, the WEB interaction module comprises a labeling unit, a data analysis unit, a management unit and an auxiliary labeling unit;
the marking unit is used for marking the task to be marked according to the type of the marked task;
the data analysis unit is used for analyzing the marked data and acquiring a marked comparison report according to an analysis result;
the management unit is used for managing the dictionary file and the trained model file;
the auxiliary labeling unit is used for providing one-click multi-hit labeling, data export, and direct upload of first-round annotation files.
Further, the annotation comparison report includes the precision, recall and F value of each entity item in the annotation file, calculated as follows:

P = N_c / N_2

R = N_c / N_1

F = (2 × P × R) / (P + R)

wherein: P is the precision of each entity item, R is the recall of each entity item, F is the F value of each entity item, N_c is the number of entities annotated identically in both rounds, N_1 is the number of entities in File1 (the subsequent-round annotation file), and N_2 is the number of entities in File2 (the previous-round annotation file).
Furthermore, in the one-click multi-hit process, after any entity is labeled, the labeling system searches for all other identical entities that have not yet been labeled by means of string matching.
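A minimal sketch of this string-matching step, assuming already-labeled occurrences are tracked as (start, end) spans (the function and parameter names are illustrative):

```python
def find_unlabeled_matches(text, entity, labeled_spans):
    """After an entity is labeled once, locate all other identical, still-unlabeled
    occurrences of it by plain substring matching (one-click multi-hit sketch)."""
    matches = []
    start = text.find(entity)
    while start != -1:
        span = (start, start + len(entity))
        if span not in labeled_spans:      # skip occurrences that are already labeled
            matches.append(span)
        start = text.find(entity, start + 1)
    return matches
```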
Furthermore, the algorithm factory module processes the labeling task through a data preprocessing algorithm, a named entity recognition algorithm and a cross-modal data matching algorithm;
the data preprocessing algorithm is used for performing annotation data segmentation and entity name completion on the source file;
the named entity recognition algorithm is used for automatically recognizing corresponding named entities appearing in the text;
the cross-modal data matching algorithm is used to screen a candidate concept set from the other modalities for a given source modality.
Furthermore, annotation data segmentation is applied to oversized text data files, and entity name completion is applied to annotation tasks with long texts. The entity name completion process is specifically as follows:
inputting the original text whose entity names are to be completed into the preprocessing algorithm;
segmenting the original text according to periods;
inputting the entity name to be completed through the WEB interaction module, adding it before each sentence of the segmented original text, and connecting the entity name and each segmented sentence with a preset symbol;
and recombining the complemented sentences.
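The steps above can be sketched as follows; segmentation on the Chinese full stop "。" and "@" as the preset connecting symbol are assumptions, since the description does not fix either choice:

```python
def complete_entity_name(original_text, entity_name, sep="@"):
    """Entity name completion sketch: split the source text on full stops,
    prefix each sentence with the entity name joined by a preset symbol,
    then recombine the completed sentences."""
    sentences = [s for s in original_text.split("。") if s]        # segment on full stops
    completed = [f"{entity_name}{sep}{s}" for s in sentences]      # prepend entity name
    return "。".join(completed) + "。"                              # recombine
```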
Furthermore, the algorithm factory module further comprises a rule-based algorithm and a deep-learning-based algorithm. The rule-based algorithm uses a dictionary constructed by the user through customization, and the deep-learning-based algorithm adopts a word-based BiLSTM-CRF model. The rule-based recognition process is specifically as follows:
inputting the task text to be marked into a system;
selecting a dictionary to be used according to the system prompt information;
arranging all contents in the dictionary in a descending order according to the entity length so as to achieve the aim of preferential matching of long entities;
and obtaining a corresponding entity matched in the text sequence according to the entity set in the sorted dictionary.
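A minimal sketch of this longest-entity-first dictionary matching, under the assumption of a simple left-to-right, non-overlapping scan (the scan strategy is not specified in the description):

```python
def dictionary_match(text, dictionary):
    """Rule-based matching sketch: sort dictionary entries by length in descending
    order so longer entities are matched first, then scan the text left to right."""
    entries = sorted(dictionary, key=len, reverse=True)   # long entities take priority
    found, i = [], 0
    while i < len(text):
        for entry in entries:
            if text.startswith(entry, i):
                found.append((i, i + len(entry), entry))
                i += len(entry)                           # consume the matched span
                break
        else:
            i += 1                                        # no entry matches here; advance
    return found
```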
Compared with the prior art, the invention has the beneficial effects that:
(1) every labeling task in the system goes through multiple rounds of labeling; to ensure labeling quality, different personnel label the task in each round, and the system generates a labeling comparison report once labeling is finished, so that labeling quality can be better controlled; meanwhile, task progress is displayed in real time in the task management interface, which helps users control progress;
(2) the system has a built-in dictionary that the user can customize for their own task and performs auxiliary labeling in a rule-based way, reducing the amount of manual labeling; the system also integrates several deep learning models, which can be trained on the user's annotation data or uploaded by the user, and a trained deep learning model can be used for pre-labeling, further reducing the amount of manual labeling;
(3) the system is highly customizable and applicable to a wide range of existing labeling tasks; it only requires configuring user-defined entities and relations according to user requirements, provides a user-friendly visual operation interface, supports multiple types of labeling tasks, and lets the user finish labeling with simple click operations.
Drawings
FIG. 1 is a system framework diagram of the present invention;
FIG. 2 is a flow chart of data annotation according to the present invention;
FIG. 3 is a diagram of the named entity identification process of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus, once an item is defined or illustrated in one figure, it will not need to be further discussed or illustrated in detail in the description of the following figure.
Referring to fig. 1 to fig. 3, fig. 1 is a system framework diagram of a user-defined multi-modal distributed semi-automatic annotation system according to the embodiment. The embodiment provides a user-defined multi-mode distributed semi-automatic labeling system which comprises a management module, a user-defined module, a persistent storage module, a WEB interaction module and an algorithm factory module. It is noted that the system mainly deals with the following 4 types of tasks, respectively: the system comprises a concept relation type labeling task, an event type labeling task, a case type labeling task and a multi-mode knowledge fusion type labeling task.
In this embodiment, the management module is configured to manage annotation tasks and system permissions. The management module comprises a task management unit and a user management unit; the task management unit manages annotation tasks according to the annotation task type. Specifically, the task management unit is the core unit of the user-defined multi-mode distributed semi-automatic labeling system. It is responsible for task management, and its main functions are annotation data uploading, annotation task creation, annotation task distribution and annotation progress control. Its processing flow is as follows:
step S101: before the annotation task formally starts, the task to be annotated needs to be uploaded to a system in different file formats according to the annotation task type. It is worth noting that the user-defined multi-modal distributed semi-automatic annotation system supports a plurality of file formats such as TXT, JSON, Markdown, JPG, JPEG, and the like, but for different types of annotation tasks, the file formats supported by the system are different, and the targets of the different types of annotation tasks are different.
In particular, for conceptual relational annotation tasks, the main goal is to annotate user-defined entities present in the text and the corresponding relationships between the entities. In this embodiment, the labeling task of the medical text is taken as an example, and the purpose of the labeling task is to label entities such as diseases, symptoms, drugs and the like existing in the text, and some relationships existing between the entities (for example, a relationship between diseases and symptoms exists, and a relationship between diseases and drugs exists, and a relationship between drugs and medication exists). For the type of task, the file format supported by the system is common text type files such as TXT, Markdown, JSON and the like, and the JSON file is used for storing the final labeling result.
For event-type annotation tasks, the main goal is to annotate specific entity items under the entity classes present in the text. This embodiment takes a current-affairs news event annotation task as an example, the purpose of which is to classify a text into one of five major entity classes (visits, meetings, investigation and research, telegrams and messages, and foreign affairs activities) and to annotate specific entity items such as person names, places, times, and meeting names in the text. Because the task entities are complex, the system supports only TXT files for this task and stores the final labeling result in a TXT file.
For the case-type labeling task, the main goal is to label the events existing in the text and the corresponding relationships between those events. This embodiment takes a causal relationship labeling task as an example, the purpose of which is to label events such as "container shortage" and "new crown epidemic spreading" in the text, together with the causal relations among them. For this type of task, the system supports common text file formats such as TXT, Markdown and JSON, and stores the final labeling result in a JSON file.
For a multi-modal knowledge fusion annotation task, the main goal is to integrate information from multiple modalities to obtain a consistent, common model output. This embodiment takes the ImageNet, HowNet and CCD knowledge fusion labeling task as an example. ImageNet mainly contains pictures and their corresponding text descriptions. HowNet is a common-sense knowledge base that takes concepts represented by Chinese and English words as its description objects and takes the relationships between concepts and between concept attributes as its basic content; it is a knowledge system with a mesh structure. The Chinese Concept Dictionary (CCD) is a Chinese-English bilingual semantic dictionary based on the WordNet framework and contains concept nodes corresponding to Chinese and English. The CCD defines concepts in terms of synonym sets and defines the relationships between concepts; its parts of speech include nouns, verbs, adjectives and adverbs, and its main semantic relations include synonymy, antonymy, hyponymy, and whole-part (meronymy) relations. The objective of knowledge fusion in this embodiment is to establish, through concepts, a relationship between ImageNet on one side and HowNet and the CCD on the other, so as to introduce picture information into HowNet and the CCD while manually re-reviewing the ImageNet data, forming a fusion of multi-modal knowledge; that is, the user judges, based on the picture descriptions of ImageNet, whether the HowNet and CCD items matched by the algorithm are consistent. This type of task is complex, and the file types need to be configured for the specific task. Taking the ImageNet, HowNet and CCD knowledge fusion labeling task as an example, the system supports various picture formats such as JPG, JPEG and PNG, and stores the final labeling result in a JSON file.
Step S102: after the file to be annotated in step S101 is uploaded to the system, an administrator user needs to create an annotation task in the system for the uploaded task. Creating an annotation task means setting the task name, annotation type, source data used, owning group, and preprocessing algorithm corresponding to the task to be annotated.
It is noted that the preprocessing algorithm in the task of creating the annotation refers to whether preprocessing is performed using a preprocessing algorithm built in the system. In the present embodiment, the preprocessing algorithm is used only for text-type files, and cannot be used for files such as JPG, JSON, JPEG, and the like.
Step S103: after the annotation task in step S102 is created, the administrator user needs to assign the task to members of the corresponding user group, who then perform the specific annotation of the task. It should be noted that, in this embodiment, the whole labeling process is divided into three rounds: first-round, second-round and third-round labeling.
Specifically, the user 1 performs a first round of annotation on the source file, and the user 1 can select whether to use the algorithm in the algorithm factory module for auxiliary annotation, and after all annotations are completed, a first round of annotation file can be generated. And then, after the user 1 finishes the first round of marking, the user 2 carries out second round of marking on the basis of the first round of marking files, corrects the problems of label missing or label error and the like in the marking process of the user 1, and generates a second round of marking files after the second round of marking is finished. And finally, the user 1 carries out third round of labeling work on the basis of the second round of labeling files, the labeling result is checked and perfected again, and the labeling result can be used as a final labeling result after the completion.
In this embodiment, it should be noted that, since each annotation task usually contains multiple documents to be annotated, progress statistics during annotation are based on the file counts of adjacent rounds. For example, the progress of the second round is the number of second-round annotated files divided by the number of first-round annotated files, since second-round annotation is performed on top of the first round and a second-round file is generated only after the corresponding first-round annotation is finished. Similarly, the progress of the first round is the number of first-round annotated files divided by the number of source files, and the progress of the third round is the number of third-round annotated files divided by the number of second-round annotated files. The progress of each task is clearly visible in the task list, which makes progress management convenient.
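The per-round progress rules above can be sketched directly from the file counts (function and key names are illustrative):

```python
def annotation_progress(n_source, n_round1, n_round2, n_round3):
    """Task progress per round: each round's progress is the number of files it
    has produced divided by the number of files from the stage before it."""
    def ratio(done, total):
        return done / total if total else 0.0
    return {
        "round1": ratio(n_round1, n_source),   # first round vs. source files
        "round2": ratio(n_round2, n_round1),   # second round vs. first-round files
        "round3": ratio(n_round3, n_round2),   # third round vs. second-round files
    }
```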
The user management unit is responsible for the authority management of the user-defined multi-mode distributed semi-automatic labeling system. In the present embodiment, the rights management of the system includes group management and rights management. Specifically, the group management is used to divide users into administrator rights and general user rights, and general users in the same group have the same rights and can label tasks under the group.
The authority management is used for creating and distributing marking tasks, grouping users, data exporting, data analyzing, data uploading and data marking. It is worth noting that the administrator user has all rights, and can perform all operations such as creation and distribution of labeling tasks, grouping of users, data export, data analysis and the like, and the common user can only perform data uploading and data labeling.
In this embodiment, the user-defined module provides the user with an interface for customization. It is mainly used to customize, according to the requirements of the user's specific task, the entities to be labeled and the relation items between them (entity classes and entity items), the dictionary required by the rule-based named entity recognition algorithm, or the trained deep learning model required by the deep-learning-based named entity recognition algorithm. It is worth noting that the dictionaries and trained deep learning models required by different task types, and by different specific tasks of the same type, differ; the user can define them according to the specific task requirements and then upload them to the system through the user-defined module for use in labeling tasks.
Specifically, the user-defined module can perform configuration operation, dictionary configuration uploading and trained model uploading on entity items and relationship items of a concept-relationship type labeling task and a case-type labeling task in the system, and entity classes and specific entity items of an event-type labeling task.
It should be noted that, for the conceptual relational annotation task and the fact-based annotation task, before the annotation operation starts, the administrator user needs to add the corresponding entity item and relationship item to perform the annotation. For example, in the case of medical text labeling, entity items such as diseases, symptoms and medicines and relationship items such as clinical symptoms (relationship between diseases and symptom entities), medication (relationship between diseases and medication entities), complications (relationship between diseases and another disease entity) need to be added.
For the event type annotation task, before the annotation work is started, the corresponding entity class and the specific entity items under the entity class need to be added by the administrator user. For example, when the current news event is labeled, the corresponding visiting, meeting, investigation and research entity classes of the text and the corresponding entity items under each entity class need to be added.
In actual use, it should be understood that the entity classes and entity items (or entities and relations) differ between different types of labeling tasks, and the entity classes and entity items (or entities and relations) corresponding to specific annotation files of the same task type may also differ, so users need to define them according to their own needs. For a specific multi-modal knowledge fusion labeling task, only the form of each modality's files needs to be given.
In this embodiment, the persistent storage module stores system data, which is the basis for the operation of the whole system. The system data comprises annotation task data, dictionary data, model data and system table data. Specifically, the annotation task data includes the original files of tasks to be annotated uploaded by users, the first-, second- and third-round annotation files (in JSON or TXT format) generated during the three-round annotation process in the task management unit, the compressed files generated when data is exported, and the excel files generated for table display. The dictionary data mainly consists of dictionary files (JSON format) defined by users according to their actual tasks. The files of each part are stored separately, and changes to one part do not affect the others, which helps ensure file safety.
In this embodiment, the database used by the user-defined multi-modal distributed semi-automatic labeling system is a MySQL database, and a user table, an authority table, and a task information table in the system are stored in the database, where the user table mainly records personal information of a user, and includes: user name, password, mailbox, etc. The authority table (group table) mainly stores information of each group, including: group members and group leader, etc. The task information table mainly records detailed information of tasks, and comprises: the information of the creation time of the task, the distributed users, the corresponding source files, the total word number of the files in the task, the number of the files, the type of the task and the like.
In this embodiment, the WEB interaction module is configured to provide a visual operation interface for a user to help the user complete a corresponding annotation task, and provide data display and analysis for the user. The WEB interaction module comprises a labeling unit, a data analysis unit, a management unit and an auxiliary labeling unit.
Specifically, the labeling unit is used for labeling the task to be labeled according to the type of the labeling task. That is, the labeling process is different for different types of labeling tasks. In this embodiment, when labeling a conceptual relational labeling task, after the user defines the required entity tags and relations, the user can select corresponding characters after selecting the corresponding entity tags, so as to complete entity labeling. After the entity labeling is completed, when a user needs to label the relationship, the relationship between the corresponding entities can be labeled according to the types of the head entity and the tail entity required by the relationship after the relationship needing to be labeled is selected. Therefore, a user can conveniently switch between the entity labeling mode and the relationship labeling mode at any time according to needs.
The relationship defined by the conceptual relational annotation task is in the form of a quadruple of (entity 1, entity 2, relationship name, relationship attributes), such as < chronic diverticulitis-bloating-post-treatment symptom-colorectal anastomosis >, where chronic diverticulitis is entity 1, bloating is entity 2, post-treatment symptom is a relationship between the two entities, and colorectal anastomosis is an attribute of a relationship that indicates the symptom that appears after colorectal anastomosis. The relation label is similar to the entity label, and the user needs to select the corresponding relation name first, then click the entity corresponding to the entity 1, and then click the entity corresponding to the entity 2 to complete the label.
It is noted that attributes are modifications, explanations, conditional limitations, and the like of relationships. Colorectal anastomosis, as mentioned above, is a property of post-treatment symptoms, which serves as a modification definition. And the marking of the attribute is selected by the user to be started or not, and the corresponding value of the attribute is set to be null when the marking of the attribute is closed. If the user selects to start the attribute marking, in the process of relation marking, the entity 1, the entity 2 and the relation can pop up whether to perform attribute marking dialog boxes after the marking is completed, the user selects characters corresponding to the attributes to complete the marking of the attributes after the selection is yes, and otherwise the attributes are set to be null.
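As a non-claimed illustration, the quadruple form (entity 1, entity 2, relation name, relation attribute) with an optional attribute could be modeled like this (the class name is an assumption):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RelationQuadruple:
    """Relation record for the concept-relation annotation task:
    (entity 1, entity 2, relation name, relation attribute).
    The attribute is stored as None when attribute labeling is switched off."""
    entity1: str
    entity2: str
    relation: str
    attribute: Optional[str] = None

# Example from the description: a post-treatment symptom with an attribute.
example = RelationQuadruple("chronic diverticulitis", "bloating",
                            "post-treatment symptom", "colorectal anastomosis")
```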
For an event-type labeling task, after the user defines the required entity classes and the specific entity items under each class, the user clicks an entity-class button to select the class to which the file belongs, then selects the corresponding entity item under that class, and finally selects the corresponding characters to complete the labeling. It should be noted that the event-type labeling task supports labeling the same segment of text with different entity items.
For a case-type labeling task, after the user defines the required entity tags and relations, the user selects an entity tag and then selects the corresponding characters to complete entity labeling. After entity labeling is complete, when the user needs to label a relation, the user selects the relation to be labeled and then labels the relation between the corresponding entities according to the head-entity and tail-entity types that the relation requires, so the user can switch between the entity labeling mode and the relation labeling mode at any time as needed. The relation defined by the case-type labeling task takes the form of a triple (entity 1, entity 2, relation name), such as <oil price drop, total fiscal revenue reduction, causal relationship>, where oil price drop is entity 1, total fiscal revenue reduction is entity 2, and causal relationship is the relation between the two entities. Relation labeling is similar to entity labeling: the user first selects the corresponding relation name, then clicks the entity corresponding to entity 1, and then clicks the entity corresponding to entity 2 to complete the labeling.
A multi-modal knowledge-fusion labeling task can vary greatly from data set to data set. In this embodiment, the fusion and labeling of ImageNet, HowNet and CCD knowledge is taken as an example. After the user uploads the corresponding data set, the system matches each ImageNet concept description against the possibly identical concepts in HowNet and CCD to build a candidate concept set. The user first checks the data of the source data set, i.e. ImageNet, and marks the correct data; the user then selects candidate concepts in the HowNet candidate region and the CCD candidate region below and marks the correct data there to complete the labeling. The final labeling result is stored as a JSON file containing the incorrect content in the ImageNet data (i.e. content to be deleted), the matched HowNet concepts, and the matched CCD concepts.
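The shape of that JSON result file could look roughly like the following (the key names are assumptions chosen for illustration; the patent does not specify the actual field names):

```python
import json

# Hypothetical structure of one labeling result for the
# ImageNet/HowNet/CCD fusion task; keys are illustrative only.
result = {
    "imagenet_deleted": ["incorrect ImageNet description to remove"],
    "hownet_matched": ["HowNet concept judged to be the same"],
    "ccd_matched": ["CCD concept judged to be the same"],
}
serialized = json.dumps(result, ensure_ascii=False, indent=2)
```

`ensure_ascii=False` keeps any Chinese concept names readable in the stored file rather than escaping them.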
Specifically, the data analysis unit analyzes the labeled data. When processing data, not only the labeling itself matters; analyzing the labeled data is an equally essential part of the work. In addition to labeling the task to be labeled through the labeling unit, the system provides, for each type of labeling task, analysis of the labeled data and generation of a labeling comparison report, allowing the user to monitor labeling quality while labeling.
In this embodiment, for the conceptual-relational labeling task, the data analysis unit provides three data analysis modes: list form, knowledge-graph form, and chart form. The list mode enumerates, one by one, all relations and entities in the files after a given round of labeling of the whole task, including the source file name and source task name of each relation, together with the relation name, sub-relation name, entity 1, entity 2, relation attribute and other details. The result of the list analysis is exported as an Excel file.
The knowledge-graph form displays the results graphically. For the quadruple <chronic diverticulitis, bloating, post-treatment symptom, colorectal anastomosis>, entity 1 'chronic diverticulitis' is the central node, entity 2 'bloating' is one of its leaf nodes, the relation is the edge connecting the two nodes, and the attribute is another leaf node of entity 1.
The chart form visualizes the composition of relations and entities with pie charts and bar charts. For example, in medical text labeling, entities may be divided into diseases, symptoms, medicines and so on, and the proportion and count of each entity or relation type can be viewed in the charts.
In addition to the three analysis modes above, the data analysis unit can generate a detailed comparison report for two annotation files, where the annotation file of the later round is File1 and that of the earlier round is File2; for example, when comparing the third-round and second-round annotation files, the third-round file is File1 and the second-round file is File2. The report consists of two parts: overall analysis and detailed content comparison.
The first part presents, in a table, the precision P, recall R and F value of each entity type in the annotation files, computed with File1 as the gold standard according to the following formulas:
P = |File1 ∩ File2| / |File2|

R = |File1 ∩ File2| / |File1|

F = (2 × P × R) / (P + R)

where |File1 ∩ File2| is the number of annotations common to both files, and |File1| and |File2| are the total numbers of annotations in each file.
wherein: p is the accuracy of each entity item, R is the recall of each entity item, F is the F value of each entity item, File1For the subsequent round, File2And marking the file for the previous round.
The second part lists the specific contents of the annotation files in detail and highlights the entities in different colors, making it easy to compare the differences between the earlier-round annotation file File2 and the later-round annotation file File1; in the multi-round labeling process, File1 and File2 are produced by different annotators.
For event-type labeling tasks, the data analysis unit provides error checking and chart analysis. Since an event-type labeling task first classifies the file, all entity items in the file should belong to that entity class. The data analysis unit supports a basic check for such errors, verifying whether each labeled entity item belongs to the entity class of the current task, and supports generating a log file for the user to download. It should be noted that the log file also contains the basic text of the task and the number of labels for each specific entity item under the task's entity class.
For example, in labeling political news text, the entity items under the 'person résumé' entity class may include name, nationality, native place and so on, and the proportion and count of each entity item can be viewed in the chart.
In addition to the analysis methods above, the data analysis unit can generate a detailed comparison report for two annotation files, where the annotation file of the later round is File1 and that of the earlier round is File2; for example, when comparing the third-round and second-round annotation files, the third-round file is File1 and the second-round file is File2. The report consists of two parts: overall analysis and detailed content comparison.
The first part presents, in a table, the precision P, recall R and F value of each entity item in the annotation files, computed with File1 as the gold standard according to the following formulas:
P = |File1 ∩ File2| / |File2|

R = |File1 ∩ File2| / |File1|

F = (2 × P × R) / (P + R)

where |File1 ∩ File2| is the number of annotations common to both files, and |File1| and |File2| are the total numbers of annotations in each file.
wherein: p is the accuracy of each entity item, R is the recall of each entity item, F is the F value of each entity item, File1For the subsequent round, File2And marking the file for the previous round.
The second part lists the specific contents of the annotation files in detail and highlights the entity items in different colors, making it easy to compare the differences between the earlier-round annotation file File2 and the later-round annotation file File1; in the multi-round labeling process, File1 and File2 are produced by different annotators.
For the case-type labeling task, the data analysis unit provides list-form data analysis. The list mode enumerates all relations and entities in the files after a given round of labeling of the whole task, including the source file name and source task name of each relation, together with the relation name, sub-relation name, entity 1, entity 2, relation attribute and other details. The result of the list analysis is exported as an Excel file.
In addition, the data analysis unit can generate a detailed comparison report for two annotation files, where the annotation file of the later round is File1 and that of the earlier round is File2; for example, when comparing the third-round and second-round annotation files, the third-round file is File1 and the second-round file is File2. The report consists of two parts: overall analysis and detailed content comparison.
The first part presents, in a table, the precision P, recall R and F value of each entity item in the annotation files, computed with File1 as the gold standard according to the following formulas:
P = |File1 ∩ File2| / |File2|

R = |File1 ∩ File2| / |File1|

F = (2 × P × R) / (P + R)

where |File1 ∩ File2| is the number of annotations common to both files, and |File1| and |File2| are the total numbers of annotations in each file.
wherein: p is the accuracy of each entity item, R is the recall of each entity item, F is the F value of each entity item, File1For the subsequent round, File2And marking the file for the previous round.
The second part lists the specific contents of the annotation files in detail and highlights the entity items in different colors, making it easy to compare the differences between the earlier-round annotation file File2 and the later-round annotation file File1; in the multi-round labeling process, File1 and File2 are produced by different annotators.
For the multi-modal knowledge-fusion labeling task, the data analysis unit supports consistency analysis. Specifically, consistency analysis examines the differences between adjacent rounds of the labeling task, i.e., the difference of the first-round annotation file from the source file, of the second-round file from the first-round file, and of the third-round file from the second-round file, and presents these differences to the user through the visual interface. The change in the amount of labeled content across rounds is also shown as a bar chart.
In this embodiment, the management unit manages the dictionary files and the trained model files, preventing the dictionary files and the trained model files in the system from becoming disorganized. It should be noted that, during this management, the user can upload, delete, rename and otherwise operate on the dictionary files and trained model files through the management unit.
In this embodiment, the auxiliary labeling unit provides one-click multi-hit labeling, data export, and direct upload of first-round annotation files. One-click multi-hit works as follows: as soon as the user labels one entity, the system automatically retrieves, by string matching, all other identical entities that have not yet been labeled and labels them with the same entity type (or entity item type) as the one the user chose, which improves labeling efficiency.
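The one-click multi-hit propagation can be sketched as a simple string scan (a minimal illustration; the function name and annotation fields are assumptions, and the real system presumably also checks span overlaps against all existing labels):

```python
def propagate_label(text, entity, label, annotations):
    """After the user labels `entity` once, find every occurrence of the
    same string in `text` and label it identically, skipping spans that
    are already in `annotations`."""
    existing = {(a["start"], a["end"]) for a in annotations}
    start = 0
    while True:
        i = text.find(entity, start)
        if i < 0:
            break
        end = i + len(entity)
        if (i, end) not in existing:
            annotations.append({"start": i, "end": end,
                                "text": entity, "label": label})
        start = end
    return annotations
```

Because matching is purely on the character string, all hits receive the same entity type the user selected for the first occurrence.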
For data export, the user can export the first-, second- and third-round annotation files by task name. For the conceptual-relational and case-type labeling tasks, any user can search relations (by relation name, by file name, and so on); the results are shown as a list whose rows contain the file name, entity 1, relation, entity 2 and other details. After searching, the user can export the retrieved relations, which are compressed as an Excel file for download. For the event-type and multi-modal knowledge-fusion labeling tasks, the corresponding annotation files can be exported directly.
It is worth noting that, in practice, a user's own deep learning model may be inconvenient to disclose, so the system supports pre-labeling data without requiring the model to be uploaded: the user processes the data locally with the trained deep learning model and uploads the processed data directly as the first-round annotation file for subsequent labeling.
In this embodiment, the algorithm factory module provides algorithm support for the WEB interaction module to assist the user in labeling. Its main functions comprise a data preprocessing algorithm and a named entity recognition algorithm for three labeling tasks, namely the conceptual-relational, event-type and case-type labeling tasks, together with a cross-modal data matching algorithm for the multi-modal knowledge-fusion labeling task. The specific processing of the data preprocessing algorithm, the named entity recognition algorithm and the cross-modal data matching algorithm is as follows.
The data preprocessing algorithm processes the source file, mainly performing labeling data segmentation and entity name completion. Labeling data segmentation is mainly used for oversized text data files, which can be segmented in two ways: by periods in the text or by chapters; the segmentation mode is not fixed and is chosen by the user as needed. When segmentation by periods is selected, one sentence is displayed at a time during labeling, and the next sentence is displayed once the current sentence is labeled, facilitating sentence-by-sentence labeling. When segmentation by chapters is selected, the whole content of one file is displayed at a time during labeling, and labeling proceeds in file order.
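Period-based segmentation could be sketched as follows (the regex and function name are illustrative assumptions; the pattern also accepts the Chinese full stop '。', which the original texts presumably use):

```python
import re

def split_sentences(text):
    """Split text after each sentence-final period ('.' or the Chinese
    full stop '。'), discarding the whitespace between sentences."""
    parts = re.split(r"(?<=[.。])\s*", text)
    return [p for p in parts if p]  # drop the empty trailing piece
```

Each element of the returned list would then be presented to the annotator one sentence at a time.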
In this embodiment, entity name completion targets labeling tasks with long texts; the specific flow is as follows:
Step S201: input the original text containing the entity to be completed into the system.
Step S202: the user segments the original text as required, in this case by periods.
Step S203: according to the name of the entity to be completed entered by the administrator user on the system interface, perform the completion operation, adding the entity before each sentence with the '@' symbol as the separator between the entity and the sentence.
Step S204: reassemble the completed sentences into an article and proceed to the subsequent labeling work.
In this embodiment, suppose a document to be labeled mainly concerns entities and relations involving chronic atrial fibrillation, but the entity 'chronic atrial fibrillation' appears only at the head of the document and is referred to as 'the disease' thereafter. If the text later states that the disease causes xxxx symptoms, the relation between chronic atrial fibrillation and those symptoms needs to be labeled, but the entity 'chronic atrial fibrillation' is absent from that sentence, so the labeling cannot be performed. Entity name completion therefore adds the entity name to the beginning of each sentence, e.g. 'chronic atrial fibrillation@the disease causes xxxx symptoms', after which the relation labeling for the entity can be completed successfully.
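Steps S203 and S204 above can be sketched as one small helper (a hypothetical function; only the '@' separator is taken from the described flow):

```python
def complete_entity_names(sentences, entity):
    """Prepend the entity name to each segmented sentence, using '@' as
    the separator (step S203), then reassemble the article (step S204)."""
    return "".join(f"{entity}@{s}" for s in sentences)
```

After completion, every sentence carries the entity it implicitly refers to, so relation labeling no longer fails on sentences where the entity name does not literally occur.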
The cross-modal data matching algorithm determines, for the source modality, a candidate concept set screened from the other modalities; the text preprocessing algorithm described above is not applicable here, and the algorithm differs slightly according to the data form of each modality. The following takes the fusion and labeling of ImageNet, HowNet and CCD knowledge as an example, where the goal is to match, from the concept description in ImageNet, the possibly identical concepts in HowNet and CCD. Since the concept descriptions in both ImageNet and CCD may consist of multiple words, there are both full matches and partial matches: a full match means the ImageNet concept is identical to a concept in HowNet or CCD, while a partial match means a substring obtained by splitting the ImageNet concept matches a concept in HowNet or CCD. The fully matched and partially matched concepts together form the candidate concept set, which is sorted by matching degree, and the final result is stored in a JSON file. The cross-modal data matching algorithm is task-specific and only needs minor adjustment for the data form of each modality.
The named entity recognition algorithm automatically recognizes the named entities appearing in a text. Since the source data of the multi-modal knowledge-fusion labeling task is structured or semi-structured, this algorithm applies to all tasks except that one. In the system, after the algorithm recognizes the named entities, it pre-labels the text according to the recognition results, reducing the manual labeling workload and improving labeling efficiency. Because different types of labeling tasks, and different specific tasks of the same type, involve different user-defined entities and relations, the named entity recognition algorithm must be customized by the user for the specific labeling task. During labeling, an ordinary user can choose whether to use the named entity recognition algorithm to speed up the process.
The system includes rule-based algorithms and deep-learning-based algorithms. The rule-based algorithm uses a user-defined dictionary, which can be built by experts in the relevant field or derived from previous labeling tasks of the same type; the user must upload the dictionary to the system in advance. The dictionary is a JSON file consisting of entity strings and their corresponding entity types; for example, a dictionary entry for medical text labeling might be {"en_name": "diabetes", "label_name": "disease"}. The user may define a dictionary for each specific labeling task, or use multiple dictionaries for one labeling task.
The rule-based recognition process is specifically as follows:
Step S301: input the full text of the task to be labeled into the system.
Step S302: following the system's prompts, the user selects the dictionary to use.
Step S303: sort all dictionary entries in descending order of entity length so that long entities are matched first.
Step S304: match the entity set of the sorted dictionary against the text in sequence to obtain the corresponding matched entities.
Step S305: automatically label all matched entities with their corresponding entity types; the format of the labeling result depends on the specific task type.
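Steps S303-S305 can be sketched as follows (a hypothetical implementation; the overlap handling is an assumption implied by "long entities match first"):

```python
def rule_based_tag(text, dictionary):
    """Sort entries longest-first (step S303), scan the text for each
    entity (step S304), and emit (start, end, entity, type) labels
    (step S305). `dictionary` maps entity string -> entity type.
    Spans already claimed by a longer entity are skipped."""
    matches = []
    occupied = set()  # character positions already labeled
    for ent in sorted(dictionary, key=len, reverse=True):
        start = 0
        while (i := text.find(ent, start)) >= 0:
            span = set(range(i, i + len(ent)))
            if not span & occupied:  # no overlap with a longer match
                matches.append((i, i + len(ent), ent, dictionary[ent]))
                occupied |= span
            start = i + 1
    return sorted(matches)
```

Sorting longest-first is what makes "diabetes mellitus" win over its substring "diabetes" wherever both occur.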
The deep learning algorithm in this embodiment uses a word-based BiLSTM-CRF model. Before using the model, the user must provide some pre-labeled text as training data, labeled with the BIO scheme. For example, in the text 'a cold can cause a headache', the entity 'cold' is tagged B-DIS (its remaining characters I-DIS), the entity 'headache' is tagged B-SYM (its remaining characters I-SYM), and all other text is tagged O, where DIS (disease) and SYM (symptom) are manually defined labels, B marks the beginning of an entity, I marks the remainder of an entity, and O marks text outside any entity.
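The BIO encoding can be illustrated with a small helper that turns labeled spans into per-token tags (a hypothetical utility for preparing training data; the system's actual tokenization is not specified):

```python
def bio_tags(tokens, spans):
    """Convert labeled spans (start_token, end_token, TYPE) into BIO tags:
    B- opens an entity, I- continues it, O marks everything else."""
    tags = ["O"] * len(tokens)
    for start, end, typ in spans:
        tags[start] = f"B-{typ}"
        for i in range(start + 1, end):
            tags[i] = f"I-{typ}"
    return tags
```

Sequences encoded this way form the (token, tag) pairs fed to the BiLSTM-CRF during training.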
After preparing the training data, the user feeds the data set into the BiLSTM-CRF neural network model for training, obtaining a trained model. Notably, the system also supports user-defined deep learning algorithms: the user only needs to provide a trained deep learning model and use the provided interface to apply it during labeling.
In this embodiment, the labeling process of the user-defined multi-modal distributed semi-automatic labeling system specifically includes the following steps:
The first step: the administrator user uses the user management module to create a user group for the task and adds the participating users to the group. Meanwhile, the persistent storage module updates the corresponding data table in the database to persist the group information.
The second step: an ordinary user or the administrator user uploads the source file to be labeled to the task management unit; the administrator user then creates the corresponding task, associates it with the group, and chooses whether to apply a preprocessing algorithm. Meanwhile, the persistent storage module persists the file and updates the corresponding data table in the database.
The third step: the administrator user selects whether to upload the dictionary required by the rule-based named entity recognition algorithm or the trained model required by the deep learning-based named entity recognition algorithm according to task needs. If necessary, the administrator user can upload the corresponding dictionary or model in the user-defined module, and the persistent storage module can perform persistent storage on the dictionary or model and update the corresponding data table in the database.
The fourth step: the annotating personnel enters the task, and can select a needed dictionary (rule-based method) and a needed deep learning model for the annotating task of the text file according to the specific needs of the annotating task to perform pre-annotation, and the multi-mode knowledge fusion annotation source data is structured data which needs manual proofreading, so that the pre-annotation is not needed.
The fifth step: and respectively completing a first-label, a second-label, a third-label and other multi-round labeling processes of the task to be labeled by different labeling personnel by using a labeling function provided by the WEB interaction module, wherein the first label and the third label are completed by the same personnel, and the display of the labeling progress by 100% indicates that the task is completed. And the persistent storage module can perform persistent storage on the labeling result by using the JSON file.
The sixth step: for tasks with 100% completion, quality control can be performed using the quality inspection method the WEB interaction module provides for each type of labeling task.
The seventh step: and reworking the unqualified labeling task to a specific person, performing the next labeling task after the unqualified labeling task is qualified, and continuously repeating the processes from the first step to the fifth step until all the documents are labeled.
Eighth step: and the administrator user derives the labeled data to be used for constructing downstream tasks such as a corresponding corpus and the like.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A user-defined multi-mode distributed semi-automatic labeling system is characterized by comprising a management module, a user-defined module, a persistent storage module, a WEB interaction module and an algorithm factory module;
the management module is used for managing the labeling task and the system authority;
the user self-defining module is used for providing an interface of a self-defining function;
the persistent storage module is used for storing system data;
the WEB interaction module is used for providing a visual interface and completing a labeling task;
and the algorithm factory module is used for providing algorithm support for the WEB interaction module.
2. The user-defined multi-modal distributed semi-automatic tagging system of claim 1, wherein the management module comprises a task management unit and a user management unit;
the task management unit is used for managing the labeling task according to the labeling task type;
the user management unit is used for setting the authority management of the system.
3. The user-defined multi-modal distributed semi-automatic labeling system of claim 2, wherein the task management unit is configured to manage the labeling task according to a labeling task type, specifically as follows:
uploading the tasks to be marked to a system in different file formats according to the marking task types;
according to the task to be labeled, setting a task name, a labeling type, used source data, a belonging group and a preprocessing algorithm corresponding to the task to be labeled;
and performing task allocation on the tasks to be labeled and labeling.
4. The user-defined multi-modal distributed semi-automatic tagging system of claim 1, wherein the user-defined module is configured to provide a user-defined interface, and specifically comprises:
self-defining corresponding entities to be annotated and relationship items between the entities;
and customizing a dictionary or a deep learning model required by the named entity recognition algorithm based on rules or deep learning according to the task requirement to be marked.
5. The user-defined multi-modal distributed semi-automatic labeling system of claim 1, wherein the WEB interaction module comprises a labeling unit, a data analysis unit, a management unit and an auxiliary labeling unit;
the marking unit is used for marking the task to be marked according to the type of the marked task;
the data analysis unit is used for analyzing the marked data and acquiring a marked comparison report according to an analysis result;
the management unit is used for managing the dictionary file and the trained model file;
the auxiliary labeling unit is used for providing one-click multi-hit labeling, data export, and direct upload of first-round annotation files.
6. The user-defined multi-modal distributed semi-automatic annotation system according to claim 5, wherein the labeling comparison report comprises the precision, recall and F value of each entity item in the annotation file, specifically as follows:
P = |File1 ∩ File2| / |File2|

R = |File1 ∩ File2| / |File1|

F = (2 × P × R) / (P + R)
wherein: p is the accuracy of each entity item, R is the recall of each entity item, F is the F value of each entity item, File1For the subsequent round, File2And marking the file for the previous round.
7. The user-defined multi-modal distributed semi-automatic labeling system of claim 5, wherein, during one-click multi-hit labeling, after any entity is labeled, the labeling system retrieves all other identical entities that have not been labeled by means of string matching.
8. The user-defined multi-modal distributed semi-automatic tagging system of claim 1, wherein the algorithm factory module processes tagging tasks through a data pre-processing algorithm, a named entity recognition algorithm, and a cross-modal data matching algorithm;
the data preprocessing algorithm is used for performing annotation data segmentation and entity name completion on the source file;
the named entity recognition algorithm is used for automatically recognizing corresponding named entities appearing in the text;
the cross-modality data matching algorithm is used to determine a set of candidate concepts from other modality screening for a source modality.
9. The user-defined multi-modal distributed semi-automatic annotation system of claim 8, wherein the labeling data segmentation targets oversized text data files and the entity name completion targets labeling tasks with long texts, the entity name completion process being as follows:
inputting the original text containing the entity to be completed into the preprocessing algorithm;
segmenting the original text according to periods;
inputting an entity name to be completed through the WEB interaction module, adding the entity name to be completed before each sentence of the original text after segmentation, and connecting the entity name to be completed and each sentence of the original text after segmentation through a preset symbol;
and recombining the complemented sentences.
10. The user-defined multi-modal distributed semi-automatic labeling system of claim 8, wherein the algorithm factory module further comprises a rule-based algorithm and a deep learning algorithm, the rule-based algorithm using a user-defined dictionary and the deep learning algorithm using a word-based BiLSTM-CRF model, the rule-based recognition process being specifically as follows:
inputting the task text to be marked into a system;
selecting a dictionary to be used according to the system prompt information;
arranging all contents in the dictionary in a descending order according to the entity length so as to achieve the aim of preferential matching of long entities;
and obtaining a corresponding entity matched in the text sequence according to the entity set in the sorted dictionary.
CN202111517605.4A 2021-12-13 2021-12-13 User-defined multi-mode distributed semi-automatic labeling system Pending CN114169336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111517605.4A CN114169336A (en) 2021-12-13 2021-12-13 User-defined multi-mode distributed semi-automatic labeling system

Publications (1)

Publication Number Publication Date
CN114169336A true CN114169336A (en) 2022-03-11

Family

ID=80485941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111517605.4A Pending CN114169336A (en) 2021-12-13 2021-12-13 User-defined multi-mode distributed semi-automatic labeling system

Country Status (1)

Country Link
CN (1) CN114169336A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580577A (en) * 2022-05-05 2022-06-03 天津大学 Multi-mode-oriented interactive data annotation method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402262A (en) * 2020-02-28 2020-07-10 苏州浪潮智能科技有限公司 Multi-mode data labeling method, system and related device
CN111553161A (en) * 2020-04-28 2020-08-18 郑州大学 Entity and relation labeling system for medical texts
CN113656660A (en) * 2021-10-14 2021-11-16 北京中科闻歌科技股份有限公司 Cross-modal data matching method, device, equipment and medium

Similar Documents

Publication Publication Date Title
US11409777B2 (en) Entity-centric knowledge discovery
WO2020052405A1 (en) Corpus annotation set generation method and apparatus, electronic device, and storage medium
CN111553161B (en) Entity and relation labeling system for medical texts
CN111708874A (en) Man-machine interaction question-answering method and system based on intelligent complex intention recognition
US20140280314A1 (en) Dimensional Articulation and Cognium Organization for Information Retrieval Systems
US20150149461A1 (en) System and method for analyzing unstructured data on applications, devices or networks
US10089390B2 (en) System and method to extract models from semi-structured documents
US20190354636A1 (en) Methods and Systems for Comparison of Structured Documents
Cui et al. Introducing Explorer of Taxon Concepts with a case study on spider measurement matrix building
CN109508448A (en) Short information method, medium, device are generated based on long article and calculate equipment
Piad-Morffis et al. A corpus to support ehealth knowledge discovery technologies
CN112635071A (en) Diabetes knowledge map construction method integrating traditional Chinese and western medicine knowledge
CN114169336A (en) User-defined multi-mode distributed semi-automatic labeling system
CN112632223A (en) Case and event knowledge graph construction method and related equipment
Rajbhoj et al. A RFP system for generating response to a request for proposal
US20210117920A1 (en) Patent preparation system
Iwashokun et al. Structural vetting of academic proposals
US20230170099A1 (en) Pharmaceutical process
List Historical language comparison with LingPy and EDICTOR
US20240005230A1 (en) A system and method for recommending expert reviewers for performing quality assessment of an electronic work
Andrade Semantic enrichment of American English corpora through automatic semantic annotation based on top-level ontologies using the CRF clas-sification model
Zhang et al. SentiImgBank: A Large Scale Visual Repository for Image Sentiment Analysis
Liu Enhancing ontology learning with machine learning and natural language processing techniques
Edris et al. Knowledge Discovery from Free Text: Extraction of Violent Events in the African Context
Liang et al. Fast and Accurate Resume Parsing Method Based on Multi-Task Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220311
