CN114446431A - Method and device for selecting annotating personnel of professional data and electronic equipment - Google Patents

Publication number
CN114446431A
Authority
CN
China
Prior art keywords
personnel
entity type
labeling
text
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210113924.7A
Other languages
Chinese (zh)
Inventor
李姣
马鹤桐
王序文
徐晓巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Medical Information CAMS
Original Assignee
Institute of Medical Information CAMS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Medical Information CAMS
Priority to CN202210113924.7A
Publication of CN114446431A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395 Quality analysis or management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06398 Performance of employee with respect to a job function
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H40/00 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
    • G16H40/20 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms

Abstract

The application discloses a method, a device, and electronic equipment for selecting annotators of professional data. User features are constructed for each annotator based on an evaluation of the annotator's ability; entity type features are constructed for each entity type and/or text topic features are constructed for each test text; and at least one target annotator is selected for a task to be annotated according to the user features of each annotator, the entity type features of each entity type and/or the text topic features of each test text, and the task features of the task to be annotated. The application thereby matches a task to be annotated with annotators who are professionally suited to it along multiple dimensions (such as user features, entity type features, and text topic features), which improves the accuracy and credibility of the annotation results and can help an intelligent algorithm achieve a better recognition effect.

Description

Method and device for selecting annotators of professional data, and electronic equipment
Technical Field
The application belongs to the technical field of artificial intelligence applications, and particularly relates to a method and a device for selecting annotators of professional data, and to electronic equipment.
Background
At present, medical artificial intelligence technology has been widely applied in various fields of medicine. Driven by medical big data, its performance has been notable, and it has become an increasingly important aid for medical practitioners. Medical artificial intelligence relies on large amounts of structured data to train models that can assist prediction or classification.
Medical data such as electronic medical records are a very important medical resource. They record in full the series of processes a patient goes through after symptoms appear: the doctor's diagnosis, the formulation of a treatment plan, and the implementation of treatment such as surgery. They are first-hand data for understanding a patient's cause of disease, treatment, and prognosis, contain a large amount of important information, and can provide a data basis for medical artificial intelligence technology. However, electronic medical records are written by clinicians as descriptions and notes and are usually presented in an irregular, that is, unstructured, text form. They cannot be applied directly to data-driven model training; content in fixed categories, such as diseases, operations, and experiments, must first be extracted.
Accordingly, large-scale data training in medical artificial intelligence presupposes medical term annotation, and identifying different types of medical terms, such as diseases and operations, in unstructured medical data (such as electronic medical records) has become an important problem for medical artificial intelligence technology to solve. The applicant has found through research that medical data is generally highly specialized, highly complex, and wide-ranging, while medical artificial intelligence technology requires the annotated medical terms to be highly accurate, which places high demands on medical annotators. Matching each task to be annotated with annotators who are professionally suited to it is therefore very important in this field: it helps improve the accuracy of medical annotation, reduce the number of review rounds, and improve annotation efficiency.
Disclosure of Invention
In view of this, the application provides a method and a device for selecting annotators of professional data, and electronic equipment, which match a task to be annotated with annotators who are professionally suited to it, so as to improve the accuracy and credibility of the annotation results and help an intelligent algorithm achieve a better recognition effect.
The specific technical scheme is as follows:
A method for selecting annotators of professional data comprises the following steps:
performing ability evaluation on the different annotators in an annotator set to obtain an ability evaluation result for each annotator, wherein the ability evaluation result of an annotator represents the annotator's annotation quality for each of a plurality of entity types of professional data in a predetermined field and/or the annotator's annotation quality for each of a plurality of test texts;
constructing user features of each annotator according to the ability evaluation result of the annotator;
constructing entity type features of each of the plurality of entity types, and/or constructing text topic features of each of the plurality of test texts;
selecting, from the annotator set, at least one target annotator for a task to be annotated according to the user features of each annotator, the entity type features of each entity type and/or the text topic features of each test text, and the task features of the task to be annotated;
and allocating the task to be annotated to the selected target annotator.
Optionally, performing ability evaluation on the different annotators in the annotator set to obtain an ability evaluation result for each annotator includes:
setting a plurality of test texts of professional data of the predetermined field, wherein each test text corresponds to one or more entity types of the predetermined field;
acquiring the annotation results obtained by each annotator in the annotator set performing entity annotation on the plurality of test texts;
and determining, according to the annotation results of each annotator, a quality index value of the annotator's annotation quality for each entity type and/or a quality index value of the annotator's annotation quality for each test text, to obtain the ability evaluation result of each annotator.
Optionally, constructing the user features of an annotator according to the ability evaluation result of the annotator includes:
constructing the user features of the annotator according to the quality index value of the annotator's annotation quality for each entity type and/or the quality index value of the annotator's annotation quality for each test text;
wherein the user features of the annotator are an ordered sequence of multi-dimensional features, and each dimension includes: the correspondence between an entity type and the quality index value achieved by the annotator for that entity type, and/or the correspondence between the text identifier of a test text and the quality index value achieved by the annotator on that test text.
Optionally, before constructing the user features of the annotators, the method further includes:
filtering out of the annotator set those annotators whose quality index values do not satisfy a set index condition.
Optionally, constructing the entity type features of each of the plurality of entity types includes:
constructing a feature representation of each entity type according to the entity type structure of that entity type, to obtain the entity type features of each entity type;
and constructing the text topic features of each of the plurality of test texts includes:
constructing the text topic features of each test text according to the topic distribution of that test text.
Optionally, the entity type structure of an entity type is a tree-shaped entity label system constructed by extracting entity objects from knowledge data corresponding to the entity type, annotating them with entity labels, and organizing them according to the object relationships among the entity objects;
and constructing a feature representation of each entity type according to the entity type structure of that entity type includes:
constructing, through a vector space model, a vector representation of the entity type from the entity labels contained in the nodes of its entity type structure, the vector representation serving as the entity type features of the entity type.
Optionally, the task features of the task to be annotated include: entity type features and/or text topic features of the task to be annotated;
and selecting, from the annotator set, at least one target annotator for the task to be annotated according to the user features of each annotator, the entity type features of each entity type and/or the text topic features of each test text, and the task features of the task to be annotated includes:
calculating the entity type similarity between the task to be annotated and the entity types in the user features of each annotator, according to the entity type features of the task to be annotated and the entity type features corresponding to the entity types in the user features of each annotator; and determining as target annotators those annotators whose entity type similarity and entity type quality index value satisfy a first screening condition;
or, calculating the text topic similarity between the task to be annotated and the test texts in the user features of each annotator, according to the text topic features of the task to be annotated and the text topic features corresponding to the test texts in the user features of each annotator; and determining as target annotators those annotators whose text topic similarity and test text quality index value satisfy a second screening condition.
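The entity-type branch of this selection step can be sketched as follows. This is a minimal illustration only: the use of cosine similarity, the threshold values `min_sim` and `min_quality`, the function names, and the toy two-dimensional vectors are all assumptions, since the text here does not fix the similarity measure or the first screening condition.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_annotators(task_vec, type_vecs, user_features, min_sim=0.8, min_quality=0.7):
    """Pick annotators with at least one entity type that is both similar
    to the task and annotated with sufficient quality.

    min_sim and min_quality stand in for the "first screening condition";
    both thresholds are hypothetical.
    """
    selected = []
    for annotator, feature in user_features.items():
        for etype, quality in feature:
            similar = cosine(task_vec, type_vecs[etype]) >= min_sim
            if similar and quality >= min_quality:
                selected.append(annotator)
                break  # one qualifying entity type is enough
    return selected

# Toy entity type features and user features (hypothetical values).
type_vecs = {"disease": [1.0, 0.0], "drug": [0.0, 1.0]}
user_features = {
    "ann_1": [("disease", 0.9), ("drug", 0.5)],
    "ann_2": [("drug", 0.95), ("disease", 0.6)],
}
chosen = select_annotators([0.9, 0.1], type_vecs, user_features)
print(chosen)
```

The text topic branch would follow the same pattern, with topic-distribution vectors and test text quality index values in place of entity type vectors and entity type quality index values.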
Optionally, after the task to be annotated is allocated to the selected target annotator, the method further includes:
acquiring and auditing the target annotator's annotation result for the task to be annotated, to obtain a quality index value of the target annotator's annotation quality for that task;
and updating the user features of the target annotator based on the quality index value obtained from the audit, so that annotators for subsequent tasks to be annotated are selected based on the updated user features.
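One possible form of this feedback update is sketched below. The passage does not state the update rule, so the weighted blend of the stored and audited values, the `weight` parameter, and the function name are purely illustrative assumptions.

```python
def update_user_feature(feature, etype, audited_f1, weight=0.5):
    """Blend an audited F1 value into the stored value for one entity type,
    then re-sort the feature in descending order of quality.

    The blending rule and `weight` are assumptions; the source only says
    that the user features are updated from the audited quality index value.
    """
    scores = dict(feature)
    old = scores.get(etype)
    if old is None:
        scores[etype] = audited_f1
    else:
        scores[etype] = (1 - weight) * old + weight * audited_f1
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

feature = [("operation", 0.9), ("disease", 0.8)]
updated = update_user_feature(feature, "disease", 0.6)
print(updated)
```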
An annotator selection device for professional data, comprising:
an ability evaluation module, used for performing ability evaluation on the different annotators in an annotator set to obtain an ability evaluation result for each annotator, wherein the ability evaluation result of an annotator represents the annotator's annotation quality for each of a plurality of entity types of professional data in a predetermined field and/or the annotator's annotation quality for each of a plurality of test texts;
a first feature construction module, used for constructing the user features of each annotator according to the ability evaluation result of that annotator;
a second feature construction module, used for constructing entity type features of each of the plurality of entity types, and/or constructing text topic features of each of the plurality of test texts;
an annotator selection module, used for selecting, from the annotator set, at least one target annotator for a task to be annotated according to the user features of each annotator, the entity type features of each entity type and/or the text topic features of each test text, and the task features of the task to be annotated;
and a task allocation module, used for allocating the task to be annotated to the selected target annotator.
An electronic device, comprising:
a memory for storing at least one set of instructions;
and a processor for calling and executing the instruction set in the memory, the annotator selection method for professional data being implemented by executing the instruction set.
According to the above scheme, the annotator selection method, device, and electronic equipment for professional data provided by the application construct the user features of each annotator based on an evaluation of the annotator's ability, construct the entity type features of each entity type and/or the text topic features of each test text, and select at least one target annotator for the task to be annotated according to the user features of each annotator, the entity type features of each entity type and/or the text topic features of each test text, and the task features of the task to be annotated.
The application therefore matches a task to be annotated with annotators who are professionally suited to it along multiple dimensions (such as user features, entity type features, and text topic features). A suitable task is pushed to a suitable annotator; because the annotator is skilled in and familiar with the pushed content, the task can be completed quickly, efficiently, and accurately, which improves the accuracy and credibility of the annotation results and can help an intelligent algorithm achieve a better recognition effect.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of the annotator selection method for professional data provided by the present application;
FIG. 2 is a diagram of the implementation process for evaluating the ability of annotators provided by the present application;
FIG. 3 is another flow chart of the annotator selection method for professional data provided by the present application;
FIG. 4 is a block diagram of the annotator selection device for professional data provided by the present application;
FIG. 5 is a structural diagram of the electronic device provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present application.
The application discloses a method and a device for selecting annotators of professional data, and electronic equipment. The method can be used, for the annotation needs of professional data in predetermined fields such as medicine/pharmaceuticals, biology, optoelectronics/photoconduction, and space technology, to match a task to be annotated with annotators who are professionally suited to it. The embodiments of the present application are described mainly using the annotation of medical data as an example.
Medical term annotation is a precondition for large-scale data training; the annotated data can serve, in the form of a gold standard, as a training set for neural network algorithms or as a test set for medical semantic type recognition. In addition, different medical annotators may have studied and practiced in different areas, so the semantic types or medical content they are strongest at annotating also differ.
The methods disclosed in the embodiments of the present application are operational with numerous general-purpose or special-purpose computing device environments or configurations, such as personal computers, server computers, hand-held or portable devices, tablet-type devices, multi-processor apparatus, and distributed computing environments that include any of the above devices or equipment.
Referring to FIG. 1, the annotator selection method for professional data disclosed in the embodiment of the present application specifically includes the following processing steps:
Step 101: performing ability evaluation on the different annotators in an annotator set to obtain an ability evaluation result for each annotator, wherein the ability evaluation result of an annotator represents the annotator's annotation quality for each of a plurality of entity types of professional data in a predetermined field and/or the annotator's annotation quality for each of a plurality of test texts.
The ability evaluation estimates and assesses each annotator's ability to annotate professional data of the predetermined field, so as to characterize the specialized sub-fields in which different annotators work and how proficient they are in them. The professional data of the predetermined field may be, but is not limited to, data from the above-mentioned fields of medicine/pharmaceuticals, biology, optoelectronics/photoconduction, space technology, and the like, such as electronic medical records in the medical field.
Referring to FIG. 2, the ability evaluation of the different annotators can be further implemented as follows:
Step 201: setting test texts covering multiple entity types of professional data in the predetermined field, wherein each entity type corresponds to multiple different test texts.
For example, according to the annotation requirements of medical data, this embodiment sets test texts covering M entity types. The M entity types include, but are not limited to, diseases, medicines, operations, and experiments, and represent the specialized sub-fields of the medical field. The M entity types correspond to N test texts, each test text corresponds to one or more of the M entity types, and M and N are integers greater than 1.
Each test text serves as unstructured data to be annotated and is used to evaluate the ability of the annotators. For each test text, a corresponding annotation standard is formulated as the basis for evaluation; for example, an expert's annotation of each test text is treated as the gold standard for that text and used as the basis for the subsequent ability evaluation.
Step 202: obtaining the annotation results produced by each annotator in the annotator set performing entity annotation on the test texts of the multiple entity types.
Each test text is distributed to the different annotators in the annotator set and annotated by each of them. Each annotator annotates all N test texts in total and thus covers all M entity types. After annotation is finished, the annotation results of each annotator are collected.
Step 203: determining, according to the annotation results of each annotator, a quality index value of the annotator's annotation quality for each entity type and/or a quality index value of the annotator's annotation quality for each test text, to obtain the ability evaluation result of each annotator.
The annotation standards corresponding to the test texts are then used as the basis for comparison, and the quality index values of each annotator's annotation quality are analyzed and computed. This includes, but is not limited to, computing the quality index value of each annotator's annotation quality for each entity type across the N test texts and/or the quality index value of the annotator's annotation quality for each test text, which yields the ability evaluation result of each annotator.
The quality index value of an annotator's annotation quality for each entity type may be, but is not limited to, the annotator's F1 value, accuracy, or recall for that entity type. Similarly, the quality index value of an annotator's annotation quality for each test text may be, but is not limited to, the annotator's F1 value, accuracy, or recall on that test text. The larger the quality index value, the better the annotator's annotation quality for the corresponding entity type or test text, and the higher the annotator's annotation ability.
For the definitions and formulas of the F1 value, accuracy, and recall, reference may be made to the related description in the prior art; they are not described in detail herein.
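For orientation, the per-entity-type quality index values can be computed along the following lines. This is a sketch under stated assumptions: "accuracy" is taken to mean precision in the usual named-entity-recognition sense, an annotation counts as correct only on an exact span-and-type match with the gold standard, and the tuple layout `(text_id, start, end, entity_type)` is hypothetical.

```python
from collections import defaultdict

def per_type_scores(gold, predicted):
    """Return {entity_type: (precision, recall, f1)}.

    gold and predicted are sets of (text_id, start, end, entity_type) tuples.
    Exact-match scoring is an assumption, not something the source specifies.
    """
    tp = defaultdict(int)      # correct annotations per entity type
    n_pred = defaultdict(int)  # annotations made by the annotator per type
    n_gold = defaultdict(int)  # gold-standard annotations per type
    for span in gold:
        n_gold[span[3]] += 1
    for span in predicted:
        n_pred[span[3]] += 1
        if span in gold:
            tp[span[3]] += 1
    scores = {}
    for etype in set(n_gold) | set(n_pred):
        p = tp[etype] / n_pred[etype] if n_pred[etype] else 0.0
        r = tp[etype] / n_gold[etype] if n_gold[etype] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[etype] = (p, r, f1)
    return scores

# Toy gold standard and one annotator's result (hypothetical data).
gold = {("t1", 0, 8, "disease"), ("t1", 12, 19, "drug"), ("t2", 3, 10, "disease")}
pred = {("t1", 0, 8, "disease"), ("t1", 12, 19, "disease"), ("t2", 3, 10, "disease")}
scores = per_type_scores(gold, pred)
print(scores["disease"], scores["drug"])
```

Per-test-text values can be obtained in the same way by grouping on `text_id` instead of `entity_type`.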
Step 102: constructing the user features of each annotator according to the ability evaluation result of the annotator.
Optionally, before the user features are constructed, annotators whose quality index values do not satisfy a set index condition are first filtered out of the annotator set. Specifically, annotators whose F1 value, accuracy, or recall (or whose average F1 value, accuracy, or recall over the entity types or test texts) is below a set threshold are filtered out, so as to preliminarily select from the annotator set the annotators with relatively high annotation ability, to whom tasks to be annotated will subsequently be allocated.
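The filtering step might look like the following sketch, which uses the average F1 over entity types as the index condition. The threshold of 0.6 and the data layout are assumptions chosen for illustration; the source leaves the threshold unspecified.

```python
def filter_annotators(results, threshold=0.6):
    """Keep annotators whose mean F1 across entity types reaches the threshold.

    results maps annotator id -> {entity_type: f1}. The threshold value
    is hypothetical.
    """
    kept = {}
    for annotator, f1_by_type in results.items():
        if not f1_by_type:
            continue  # no evaluation data, so nothing to judge the annotator on
        mean_f1 = sum(f1_by_type.values()) / len(f1_by_type)
        if mean_f1 >= threshold:
            kept[annotator] = f1_by_type
    return kept

results = {
    "ann_1": {"disease": 0.9, "drug": 0.7},
    "ann_2": {"disease": 0.4, "drug": 0.5},
}
kept = filter_annotators(results)
print(sorted(kept))
```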
On this basis, for each annotator that has not been filtered out, the corresponding user features are constructed according to the quality index value of the annotator's annotation quality for each entity type and/or the quality index value of the annotator's annotation quality for each test text.
The user features of an annotator are an ordered sequence of multi-dimensional features, and each dimension includes: the correspondence between an entity type and the quality index value achieved by the annotator for that entity type, and/or the correspondence between the text identifier of a test text and the quality index value achieved by the annotator on that test text.
For ease of understanding, consider the following example. The annotator's accuracy for each entity type is set to the F1 value computed in the preceding ability evaluation, and is set to 100% where that F1 value is zero. (An F1 value of 0 for an entity type means that the annotator has not annotated any test text of that entity type; in this case the accuracy defaults to 100% the first time, which amounts to giving the annotator one chance to try.) On this basis, for each annotator that has not been filtered out, the user features are represented as {(A, P1), (B, P2), (C, P3), ...}, where A, B, C, ... denote different entity types, such as disease, medicine, operation, and experiment, and P1, P2, P3, ... denote the annotator's F1 values for the entity types A, B, C, ... respectively. A, B, C, ... are arranged in descending order of the annotator's F1 value for each entity type.
In the above example, the user features of each annotator thus comprise multi-dimensional features sorted in descending order of F1 value, and each dimension includes the correspondence between an entity type and the quality index value (such as the F1 value) achieved by the annotator for that entity type.
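The example above can be sketched in a few lines. The function name and the sample F1 values are hypothetical; the default of 100% for an F1 value of zero and the descending sort follow the example in the text.

```python
def build_user_feature(f1_by_type, default_for_untested=1.0):
    """Build the ordered user feature [(entity_type, quality), ...].

    An F1 of zero is read as "entity type not yet attempted" and replaced
    by the default of 100%, as in the example; the list is then sorted in
    descending order of quality.
    """
    feature = [
        (etype, default_for_untested if f1 == 0 else f1)
        for etype, f1 in f1_by_type.items()
    ]
    feature.sort(key=lambda pair: pair[1], reverse=True)
    return feature

feature = build_user_feature({"disease": 0.82, "drug": 0.0, "operation": 0.91})
print(feature)
```

Keeping the feature as an ordered list of pairs mirrors the {(A, P1), (B, P2), ...} notation, so an annotator's strongest entity types are always at the front.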
103, constructing entity type characteristics of each entity type in the multiple entity types; and/or constructing text subject characteristics of each test text in the plurality of test texts.
Besides the user characteristics, the embodiment of the application also constructs entity type characteristics of the entity type, text subject characteristics of the test text and other related characteristics.
The construction processes of the entity type characteristic and the text theme characteristic are respectively as follows:
11) entity type characteristics
In this embodiment, a feature representation of each entity type is constructed according to an entity type structure of each entity type in a plurality of entity types, so as to obtain an entity type feature of each entity type.
The entity type structure of each entity type is a tree-type entity label system structure, constructed by extracting entity objects from the knowledge data corresponding to the entity type, labeling them with entity labels, and organizing them according to the object relations among the entity objects. This tree structure reflects the object relations among the entity objects of the entity type, such as inclusion/contained relations, parallel relations, derivative relations, and the like.
Based on this, in practical application, the entity type structure of each entity type can be determined first. There are two main ways to determine it. One is manual: according to actual requirements, relevant entity objects are extracted from existing medical resources such as comprehensive medical vocabularies, specialized medical vocabularies, medical ontologies and medical knowledge graphs, the entity objects are labeled with entity labels, and an entity label system is built based on the object relations among the entity objects, or the label system is built directly by experts, so that suitable entity type annotation tasks can later be recommended to suitable annotators. The other is automatic: a large amount of text content is used for unsupervised learning to generate a label system of medical entity types from the bottom up, thereby obtaining the entity type structure of the entity type.
Then, according to the entity labels contained in each node of the entity type structure of the entity type, a vector representation of the entity type is constructed through a vector space model (Word2vec) and used as the entity type characteristic of the entity type; for example, two entity types may be represented by the vectors V1 and V2.
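One possible way to turn a label tree into an entity type vector is sketched below. The text only states that a vector space model (Word2vec) is used; averaging the label vectors is an assumption made here for illustration, and `LabelNode`, `collect_labels`, `entity_type_vector` and the toy `embeddings` table are hypothetical names standing in for a real pre-trained model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LabelNode:
    """One node of the tree-type entity label system."""
    label: str
    children: List["LabelNode"] = field(default_factory=list)


def collect_labels(node: LabelNode) -> List[str]:
    """Collect the labels of a node and all of its descendants."""
    labels = [node.label]
    for child in node.children:
        labels.extend(collect_labels(child))
    return labels


def entity_type_vector(root: LabelNode, embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the embedding vectors of all labels found in the tree (assumed scheme)."""
    vectors = [embeddings[label] for label in collect_labels(root) if label in embeddings]
    if not vectors:
        raise ValueError("no label of this entity type has an embedding")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional embeddings; a real system would look these up in a trained model.
embeddings = {"disease": [1.0, 0.0], "infection": [0.0, 1.0], "tumor": [0.5, 0.5]}
tree = LabelNode("disease", [LabelNode("infection"), LabelNode("tumor")])
v1 = entity_type_vector(tree, embeddings)
print(v1)
```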
12) Text topic features
The text topic feature of each test text is constructed according to the topic distribution of each test text in a plurality of test texts.
Specifically, an LDA (latent Dirichlet allocation) model is trained on a large number of texts to learn their topic distribution. On this basis, the topic distribution of each test text is constructed and a topic-based vector is built for the test text. Each topic is one feature of the vector, its feature value is the probability that the topic occurs in the test text, and the values are normalized over the topics to finally obtain the text topic characteristic of the test text.
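The normalization step can be illustrated with a short sketch. It assumes that an already trained LDA model has produced one non-negative weight per topic for a test text, and that normalization means scaling the weights so that they sum to 1; the function name `topic_feature` and the sample weights are hypothetical.

```python
from typing import List


def topic_feature(topic_weights: List[float]) -> List[float]:
    """Scale per-topic weights so that they sum to 1 (assumed normalization)."""
    total = sum(topic_weights)
    if total <= 0:
        raise ValueError("topic weights must have a positive sum")
    return [weight / total for weight in topic_weights]


# Hypothetical raw weights of three topics for one test text.
vec = topic_feature([2.0, 1.0, 1.0])
print(vec)
```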
Step 104: select, for the task to be annotated, at least one target annotator for task annotation from the annotator set according to the user characteristics of each annotator, the entity type characteristics of each entity type and/or the text topic characteristics of each test text, and the task characteristics of the task to be annotated.
There are several ways to select target annotators for task annotation; the main ones are as follows:
21) Selection based on similarity to the entity types in the annotators' user characteristics:
In this mode, the task to be annotated (an entity type to be annotated) is first vectorized to obtain the entity type characteristic of the task. The similarity between the entity type of the task and each entity type in the annotators' user characteristics is then calculated from the entity type characteristic of the task and the entity type characteristics of those entity types. Annotators whose entity type similarity and entity type quality index value satisfy a first screening condition are determined as target annotators.
Specifically, the similarity of the two entity types can be calculated based on the feature vector of the entity type of the task to be labeled and the feature vector of each entity type in the user features of the labeling personnel, and the calculation formula is designed as follows:
[Formula image BDA0003495656930000101: similarity of two entity types, computed from a_i, b_i and w_i for 1 ≤ i ≤ k; the formula itself is not reproduced in the text]
where a_i and b_i are the i-th feature components of the two entity types, respectively, w_i is the weight of the i-th feature of the two entity types, 1 ≤ i ≤ k, and i is an integer. The value of w_i can be set according to manual experience and understanding of the business, or can be obtained by training with a machine learning algorithm.
Optionally, the annotators whose entity type similarity ranks in the top k1, and whose quality index value for each entity type within that top k1 ranks in the top k2, may be selected as the target annotators for the task to be annotated. That is, annotators are selected for whom the entity type similarity is high and whose annotation quality index value (such as the F1 value) on that entity type is also high, and they are treated as the target annotators adapted to the task to be annotated for task allocation.
The screening condition formed by top k1 and top k2 may serve as the first screening condition described above.
k1 and k2 are positive integers that can be set freely according to actual requirements.
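The top-k1/top-k2 screening of mode 21) could be sketched as follows. Two points are assumptions rather than statements of the application: the similarity formula is only available as an image, so a weighted cosine similarity over the components a_i, b_i with weights w_i is used here as a stand-in; and the screening is read as "keep the k1 entity types most similar to the task, then keep the k2 annotators with the best F1 on each of them". All function names and sample data are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def weighted_cosine(a: Sequence[float], b: Sequence[float], w: Sequence[float]) -> float:
    """Weighted cosine similarity (assumed formula; weights must be non-negative)."""
    num = sum(wi * ai * bi for ai, bi, wi in zip(a, b, w))
    norm_a = math.sqrt(sum(wi * ai * ai for ai, wi in zip(a, w)))
    norm_b = math.sqrt(sum(wi * bi * bi for bi, wi in zip(b, w)))
    return num / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_by_entity_type(
    task_vec: Sequence[float],
    type_vecs: Dict[str, Sequence[float]],
    user_features: Dict[str, List[Tuple[str, float]]],
    weights: Sequence[float],
    k1: int,
    k2: int,
) -> Set[str]:
    """Pick annotators with high F1 on the entity types most similar to the task."""
    # Top k1 entity types by similarity to the task.
    top_types = sorted(
        type_vecs,
        key=lambda t: weighted_cosine(task_vec, type_vecs[t], weights),
        reverse=True,
    )[:k1]
    selected: Set[str] = set()
    for entity_type in top_types:
        # Top k2 annotators by F1 on this entity type (missing types count as 0).
        scored = [(dict(feats).get(entity_type, 0.0), name) for name, feats in user_features.items()]
        scored.sort(reverse=True)
        selected.update(name for _, name in scored[:k2])
    return selected


type_vecs = {"disease": [1.0, 0.0], "medicine": [0.0, 1.0], "operation": [0.7, 0.7]}
user_features = {
    "ann1": [("disease", 0.9), ("medicine", 0.6)],
    "ann2": [("medicine", 0.95), ("disease", 0.5)],
    "ann3": [("operation", 0.8), ("disease", 0.7)],
}
targets = select_by_entity_type([1.0, 0.1], type_vecs, user_features, [1.0, 1.0], k1=1, k2=2)
print(sorted(targets))
```

With these sample values the task is closest to "disease", so the two annotators with the highest F1 on that type are returned.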
22) Selection based on similarity to the topics of the test texts in the annotators' user characteristics:
In this mode, the topic of the task to be annotated (a text to be annotated) is first predicted through the LDA model to obtain the text topic characteristic of the task. The text topic similarity between the task and each test text in the annotators' user characteristics is then calculated from the text topic characteristic of the task and the text topic characteristics of those test texts. Annotators whose text topic similarity and test text quality index value satisfy a second screening condition are determined as target annotators.
Optionally, the annotators whose text topic similarity ranks in the top k3, and whose annotation quality index value for each test text within that top k3 ranks in the top k4, may be selected as the target annotators for the task to be annotated. That is, annotators are selected for whom the topic similarity to the test text is high and whose annotation quality index value (such as the F1 value) on that test text is also high, and they are treated as the target annotators adapted to the task to be annotated for task allocation.
The screening condition formed by top k3 and top k4 may serve as the second screening condition described above.
k3 and k4 are positive integers that can be set freely according to actual requirements.
It should be noted that the implementation of the present application is not limited to selecting target annotators by mode 21) or mode 22) alone. The two modes may also be combined to select suitable target annotators for the task to be annotated; for example, the intersection of the target annotators selected by mode 21) and those selected by mode 22) is taken, and the annotators in the intersection are used as the target annotators adapted to the task to be annotated.
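The combination of the two modes by intersection can be written in a few lines. The function name `combine_selections` and the annotator identifiers are hypothetical; the two input sets are assumed to be the results of mode 21) and mode 22), respectively.

```python
from typing import List, Set


def combine_selections(by_entity_type: Set[str], by_topic: Set[str]) -> List[str]:
    """Return the annotators chosen by both modes, sorted for stable output."""
    return sorted(by_entity_type & by_topic)


targets = combine_selections({"ann1", "ann3"}, {"ann2", "ann3"})
print(targets)
```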
Step 105: allocate the task to be annotated to the selected target annotators.
Finally, the task to be annotated is pushed to the selected target annotators for text annotation.
In the embodiment of the application, the capabilities of the annotators are evaluated on a set of test texts covering multiple different entity types, so that the specialized subfields in which each annotator achieves higher annotation accuracy and recall are identified and the annotation capability of each annotator in each subfield is characterized. Annotation tasks in fields that are the same as or similar to an annotator's fields of strength are then recommended/pushed to that annotator, which improves the accuracy and efficiency of the overall annotation work and in turn provides a strong guarantee of annotation results for intelligent algorithms.
According to the scheme, the method of the embodiment of the application constructs the user characteristics of each annotating person based on the capability evaluation of the annotating persons, constructs the entity type characteristics of each entity type and/or the text topic characteristics corresponding to each test text, and selects at least one target annotating person for task annotation for the task to be annotated according to the user characteristics of each annotating person, the entity type characteristics of each entity type and/or the text topic characteristics of each test text and the task characteristics of the task to be annotated.
Therefore, the application selects, for the task to be annotated, annotators who match it in terms of professional expertise, drawing on multiple dimensions and aspects (such as user characteristics, entity type characteristics, text topic characteristics and the like). Suitable tasks can thus be pushed to suitable annotators. Because the annotators are familiar with and skilled in the pushed content, they can complete the task quickly, efficiently and accurately, which improves the accuracy and credibility of the annotation results and helps intelligent algorithms achieve a better recognition effect.
In one embodiment, referring to the flow chart of the annotator selecting method for professional data provided in fig. 3, the annotator selecting method for professional data disclosed in the present application may further include, after step 105:
Step 106: acquire and audit the annotation result of the target annotator for the task to be annotated, to obtain a quality index value of the target annotator's annotation quality for the task to be annotated;
Step 107: update the user characteristics of the target annotator based on the quality index value obtained by the audit, so that annotators for subsequent tasks to be annotated are selected based on the updated user characteristics.
Optionally, in this embodiment, after the target annotator finishes annotating the content of the task to be annotated, the annotation result of the target annotator for the task is obtained and audited; for example, the audit is performed by a human auditor or automatically based on a related artificial intelligence method.
After the audit is completed, the annotation quality index values of the target annotator on the assigned task, such as accuracy, recall or F1 value, are calculated, and the relevant data in the target annotator's user characteristics, such as the annotation quality index values and the sort order of the entity types, are updated. The next task recommendation is then made according to the updated user characteristics; that is, annotators who match the next task to be annotated in terms of professional expertise are selected for it according to the updated user characteristics.
In this embodiment, the annotation results of the annotators for their assigned tasks are audited, the annotation quality index values for those tasks are calculated and fed back into the user characteristics, and the user characteristics are updated. The user characteristics of the annotators thus gradually become more accurate and complete, giving an accurate portrait of the field types in which different annotators excel and of their degree of proficiency in specific subdivided professional fields. Annotation tasks in fields the same as or similar to each annotator's fields of strength can then be recommended/pushed to them, which improves the accuracy and efficiency of the overall annotation work and provides a strong guarantee of annotation results.
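The feedback loop of steps 106 and 107 can be sketched as follows. The sketch assumes that annotations are compared with audited reference annotations as sets of (start, end, entity type) spans, that the F1 value is the harmonic mean of precision and recall, and that the new F1 simply replaces the stored value; the application does not specify how old and new values are merged, so this replacement rule and all names are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity type)


def f1_score(predicted: Set[Span], gold: Set[Span]) -> float:
    """F1 of an annotator's spans against the audited reference spans."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def update_user_features(
    features: List[Tuple[str, float]], entity_type: str, new_f1: float
) -> List[Tuple[str, float]]:
    """Overwrite the F1 of one entity type (assumed rule) and re-sort in descending order."""
    updated = dict(features)
    updated[entity_type] = new_f1
    return sorted(updated.items(), key=lambda pair: pair[1], reverse=True)


predicted = {(0, 5, "disease"), (10, 14, "disease"), (20, 25, "disease")}
gold = {(0, 5, "disease"), (10, 14, "disease"), (30, 33, "disease"), (40, 44, "disease")}
new_f1 = f1_score(predicted, gold)
features = update_user_features([("disease", 0.9), ("medicine", 0.7)], "disease", new_f1)
print(new_f1, features)
```

In the sample data the annotator's F1 on "disease" drops after the audit, so "medicine" moves to the first position of the user characteristics.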
Corresponding to the above method, an embodiment of the present application further discloses an annotator selection device for professional data. As shown in fig. 4, the device includes:
the capability evaluating module 401 is configured to evaluate the capabilities of different labeling personnel in the labeling personnel set to obtain capability evaluation results corresponding to the labeling personnel respectively; the capability evaluation result corresponding to the annotating personnel can be used for representing the annotation quality of the annotating personnel on each entity type in a plurality of entity types of professional data in the preset field and/or the annotation quality of each text in a plurality of test texts;
a first feature construction module 402, configured to construct a user feature of each annotating person according to a capability evaluation result corresponding to each annotating person;
a second feature construction module 403, configured to construct an entity type feature of each entity type in the multiple entity types; and/or constructing text theme characteristics of each test text in the plurality of test texts;
an annotator selecting module 404, configured to select, for the task to be annotated, at least one target annotator for task annotation from the annotator set according to the user characteristics of each annotator, the entity type characteristics of each entity type and/or the text topic characteristics of each test text, and the task characteristics of the task to be annotated;
and the task allocation module 405 is configured to allocate the task to be annotated to the selected target annotation person.
In an embodiment, the capability evaluating module 401 is specifically configured to:
setting a plurality of test texts of the professional data of the predetermined field, wherein each test text corresponds to one or more entity types of the predetermined field;
acquiring a labeling result obtained by each labeling person in the labeling person set performing entity labeling on the plurality of test texts;
and determining the quality index value of the labeling quality of each labeling person for each entity type and/or the quality index value of the labeling quality of each test text according to the labeling result of each labeling person to obtain the capability evaluation result corresponding to each labeling person.
In an embodiment, the first feature constructing module 402 is specifically configured to:
constructing user characteristics of the labeling personnel according to the quality index value of the labeling quality of the labeling personnel for each entity type and/or the quality index value of the labeling quality of each test text under each entity type;
the user characteristics of the annotating personnel are sequentially arranged multi-dimensional characteristics, and each dimension characteristic comprises: the correspondence between an entity type and the quality index value achieved by the annotating personnel on that entity type, and/or the correspondence between the text identifier of a test text and the quality index value achieved by the annotating personnel on that test text.
In one embodiment, the apparatus further comprises:
a personnel filtering module, configured to filter out of the annotator set, before the user characteristics of the annotators are constructed, those annotators whose quality index values do not satisfy the set index conditions.
In an embodiment, the second feature constructing module 403, when constructing the entity type feature of each entity type in the plurality of entity types, is specifically configured to: constructing a feature representation of each entity type according to the entity type structure of each entity type in the multiple entity types to obtain the entity type feature of each entity type;
the second feature constructing module 403, when constructing the text topic feature of each test text in the multiple test texts, is specifically configured to: and constructing text theme characteristics of each test text according to the theme distribution of each test text in the plurality of test texts.
In one embodiment, the entity type structure of the entity type includes: extracting entity objects and labeling entity labels from knowledge data corresponding to the entity types, and constructing a tree-type entity label system structure based on the object relationship of each entity object;
the second feature constructing module 403, when constructing the feature representation of each entity type according to the entity type structure of each entity type in the multiple entity types, is specifically configured to:
and constructing vector representation of the entity type through a vector space model according to the entity label contained in each node in the entity type structure of the entity type, wherein the vector representation is used as the entity type characteristic of the entity type.
In one embodiment, the task characteristics of the task to be annotated include: entity type characteristics and/or text subject characteristics of the task to be marked;
the annotator selection module 404 is specifically configured to:
calculating the similarity between the entity type of the task to be marked and the entity type in the user characteristics of each marking person according to the entity type characteristics of the task to be marked and the entity type characteristics corresponding to the entity type in the user characteristics of each marking person; determining a marking person of which the corresponding entity type similarity and the corresponding entity type quality index value meet the first screening condition as a target marking person;
or, calculating the text theme similarity of the test text in the user characteristics of the to-be-labeled task and each labeling person according to the text theme characteristics of the to-be-labeled task and the text theme characteristics corresponding to the test text in the user characteristics of each labeling person; and determining the labeling personnel of which the corresponding text theme similarity and the test text quality index value meet the second screening condition as target labeling personnel.
In one embodiment, the apparatus further comprises:
a user characteristic update module to:
after the task to be labeled is distributed to the selected target labeling personnel, acquiring and auditing the labeling result of the target labeling personnel on the task to be labeled to obtain the quality index value of the labeling quality of the target labeling personnel on the task to be labeled;
and updating the user characteristics of the target annotating personnel based on the quality index values obtained by the examination, so as to select the annotating personnel of the subsequent tasks to be annotated based on the updated user characteristics.
The annotator selection device for professional data disclosed in the embodiment of the application corresponds to the annotator selection method for professional data disclosed in the method embodiments above, so its description is relatively brief; for the related details, refer to the description of the method embodiments above, which is not repeated here.
Embodiments of the present application further disclose an electronic device, which may be, but is not limited to, any of numerous general purpose or special purpose computing device environments or configurations, such as: personal computers, server computers, hand-held or portable devices, tablet-type devices, multi-processor appliances, and the like.
The composition structure of the electronic device is shown in fig. 5, and includes:
a memory 501 for storing a set of computer instructions;
the set of computer instructions in the memory 501 may be implemented in the form of a computer program.
A processor 502 for implementing the annotator selection method for professional data as disclosed in the above method embodiments by executing a set of computer instructions.
The processor 502 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA) or other programmable logic device.
Besides, the electronic device may further include a communication interface, a communication bus, and the like. The memory, the processor and the communication interface communicate with each other via a communication bus.
The communication interface is used for communication between the electronic device and other devices. The communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and the like.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
For convenience of description, the above system or apparatus is described as being divided into various modules or units by function, respectively. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
Finally, it is further noted that, herein, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. A method for selecting a marking person of professional data is characterized by comprising the following steps:
evaluating the capabilities of different annotating personnel in the annotating personnel set to obtain the capability evaluation results corresponding to the annotating personnel respectively; the capability evaluation result corresponding to the annotating personnel can be used for representing the annotation quality of the annotating personnel for each entity type in a plurality of entity types of professional data in the predetermined field and/or the annotation quality for each text in a plurality of test texts;
constructing user characteristics of the labeling personnel according to the capability evaluation result corresponding to the labeling personnel;
constructing entity type characteristics of each entity type in the multiple entity types; and/or constructing text theme characteristics of each test text in the plurality of test texts;
selecting at least one target marking person for task marking from the marking person set for the task to be marked according to the user characteristics of the marking persons, the entity type characteristics of each entity type and/or the text subject characteristics of each test text and the task characteristics of the task to be marked;
and distributing the tasks to be labeled to the selected target labeling personnel.
2. The method according to claim 1, wherein the performing capability evaluation on different annotating personnel in the annotating personnel set to obtain capability evaluation results corresponding to the annotating personnel respectively comprises:
setting a plurality of test texts of the professional data of the predetermined field, wherein each test text corresponds to one or more entity types of the predetermined field;
acquiring a labeling result obtained by each labeling person in the labeling person set performing entity labeling on the plurality of test texts;
and determining the quality index value of the labeling quality of each labeling person for each entity type and/or the quality index value of the labeling quality of each test text according to the labeling result of each labeling person to obtain the capability evaluation result corresponding to each labeling person.
3. The method according to claim 2, wherein the step of constructing the user characteristics of the annotating personnel according to the ability evaluation results corresponding to the annotating personnel comprises the following steps:
constructing user characteristics of the labeling personnel according to the quality index value of the labeling quality of each entity type by the labeling personnel and/or the quality index value of the labeling quality of each test text;
the user characteristics of the annotating personnel are sequentially arranged multi-dimensional characteristics, and each dimension characteristic comprises: the correspondence between an entity type and the quality index value achieved by the annotating personnel on that entity type, and/or the correspondence between the text identifier of a test text and the quality index value achieved by the annotating personnel on that test text.
4. The method of claim 3, further comprising, prior to constructing the user characteristics of the annotating personnel:
and filtering out the corresponding labeling personnel with the corresponding quality index values not meeting the set index conditions from the labeling personnel set.
5. The method of claim 3, wherein said constructing entity type characteristics for each entity type of said plurality of entity types comprises:
constructing a feature representation of each entity type according to the entity type structure of each entity type in the multiple entity types to obtain the entity type feature of each entity type;
the constructing of the text topic feature of each test text in the plurality of test texts comprises:
and constructing text theme characteristics of each test text according to the theme distribution of each test text in the plurality of test texts.
6. The method of claim 5, wherein the entity type structure of the entity type comprises: extracting entity objects and labeling entity labels from knowledge data corresponding to the entity types, and constructing a tree-type entity label system structure based on the object relationship of each entity object;
constructing a feature representation of each entity type according to the entity type structure of each entity type in the plurality of entity types, including:
and constructing vector representation of the entity type through a vector space model according to the entity label contained in each node in the entity type structure of the entity type, wherein the vector representation is used as the entity type characteristic of the entity type.
7. The method of claim 5, wherein the task features of the task to be annotated comprise: entity type characteristics and/or text subject characteristics of the task to be marked;
the method for selecting at least one target marking person for task marking from the marking person set for the task to be marked according to the user characteristics of each marking person, the entity type characteristics of each entity type and/or the text subject characteristics of each test text and the task characteristics of the task to be marked comprises the following steps:
calculating the similarity between the entity type of the task to be marked and the entity type in the user characteristics of each marking person according to the entity type characteristics of the task to be marked and the entity type characteristics corresponding to the entity type in the user characteristics of each marking person; determining the corresponding labeling personnel with the entity type similarity and the entity type quality index value meeting the first screening condition as target labeling personnel;
or, calculating the text theme similarity of the test text in the user characteristics of the to-be-labeled task and each labeling person according to the text theme characteristics of the to-be-labeled task and the text theme characteristics corresponding to the test text in the user characteristics of each labeling person; and determining the labeling personnel of which the corresponding text theme similarity and the test text quality index value meet the second screening condition as target labeling personnel.
8. The method of claim 1, wherein after the task to be labeled is assigned to the selected target labeling person, the method further comprises:
acquiring and auditing the labeling result of the target labeling personnel on the task to be labeled to obtain a quality index value of the labeling quality of the target labeling personnel on the task to be labeled;
and updating the user characteristics of the target annotating personnel based on the quality index values obtained by the examination, so as to select the annotating personnel of the subsequent tasks to be annotated based on the updated user characteristics.
9. An annotator selection device for professional data, characterized by comprising:
the capacity evaluating module is used for evaluating the capacity of different marking personnel in the marking personnel set to obtain the capacity evaluating result corresponding to each marking personnel; the capability evaluation result corresponding to the annotating personnel can be used for representing the annotation quality of the annotating personnel to each entity type in a plurality of entity types of professional data in the preset field and/or the annotation quality of each text in a plurality of test texts;
the first feature construction module is used for constructing the user features of each annotation person according to the ability evaluation result corresponding to each annotation person;
a second feature construction module for constructing an entity type feature for each of the plurality of entity types; and/or constructing text theme characteristics of each test text in the plurality of test texts;
the annotation personnel selection module is used for selecting at least one target annotation personnel for task annotation from the annotation personnel set for the task to be annotated according to the user characteristics of each annotation personnel, the entity type characteristics of each entity type and/or the text subject characteristics of each test text and the task characteristics of the task to be annotated;
and the task allocation module is used for allocating the tasks to be annotated to the selected target annotation personnel.
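The division of work among these modules can be sketched as a single class with one method per module. This is a simplified illustration under stated assumptions: the class and method names, the accuracy-based capability score, and the weighted-sum ranking are hypothetical, and the second feature construction module is reduced to the per-entity-type weights carried by the task.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatorSelectionDevice:
    user_features: dict = field(default_factory=dict)
    assignments: dict = field(default_factory=dict)

    def evaluate_capability(self, test_results):
        """Capability evaluation: per-entity-type accuracy on test data.
        test_results maps annotator -> {entity_type: (correct, total)}."""
        return {
            person: {
                etype: correct / total
                for etype, (correct, total) in per_type.items()
                if total > 0
            }
            for person, per_type in test_results.items()
        }

    def build_user_features(self, evaluation):
        """First feature construction: here the evaluation is stored as is."""
        self.user_features = evaluation

    def select_annotators(self, task_entity_weights, top_k=1):
        """Selection: rank annotators by a weighted sum of their quality on
        the entity types the task contains (assumed scoring rule)."""
        def score(person):
            feats = self.user_features[person]
            return sum(w * feats.get(etype, 0.0)
                       for etype, w in task_entity_weights.items())
        ranked = sorted(self.user_features, key=score, reverse=True)
        return ranked[:top_k]

    def assign(self, task_id, selected):
        """Task allocation: record which annotators receive the task."""
        self.assignments[task_id] = list(selected)

# Hypothetical test results and task profile.
device = AnnotatorSelectionDevice()
evaluation = device.evaluate_capability({
    "annotator_a": {"disease": (9, 10), "drug": (5, 10)},
    "annotator_b": {"disease": (6, 10), "drug": (9, 10)},
})
device.build_user_features(evaluation)
chosen = device.select_annotators({"disease": 0.8, "drug": 0.2}, top_k=1)
device.assign("task_1", chosen)
print(device.assignments)
```

Because the task here is dominated by disease entities, annotator_a ranks first even though annotator_b is stronger on drug entities.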
10. An electronic device, comprising:
a memory for storing at least one set of instructions;
a processor for invoking and executing the set of instructions in the memory, wherein execution of the set of instructions implements the method for selecting annotating personnel of professional data according to any of claims 1-7.
CN202210113924.7A 2022-01-30 2022-01-30 Method and device for selecting annotating personnel of professional data and electronic equipment Pending CN114446431A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210113924.7A CN114446431A (en) 2022-01-30 2022-01-30 Method and device for selecting annotating personnel of professional data and electronic equipment

Publications (1)

Publication Number Publication Date
CN114446431A (en) 2022-05-06

Family

ID=81371803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210113924.7A Pending CN114446431A (en) 2022-01-30 2022-01-30 Method and device for selecting annotating personnel of professional data and electronic equipment

Country Status (1)

Country Link
CN (1) CN114446431A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination