CN111881294B - Corpus labeling system, corpus labeling method and storage medium

Info

Publication number: CN111881294B
Application number: CN202010754175.7A
Authority: CN
Inventors: 杨永东
Original assignee: Benzhi Technology Shenzhen Co ltd
Current assignee: Benzhi Technology Shenzhen Co ltd
Priority date: 2020-07-30
Filing date: 2020-07-30
Publication date: 2023-10-24
Anticipated expiration: 2040-07-30
Also published as: CN111881294A

Abstract

The application discloses a corpus labeling system, a corpus labeling method and a storage medium. Corpus data to be labeled is input into a pre-labeling model selected according to the category of the corpus data, and the model computes pre-labeled corpus data. Noise risk points in the pre-labeled data, such as labeling errors, are checked using a knowledge graph and a data dictionary, and the risk points are then labeled manually. Finally, the labeled data is uploaded to a distributed file server, transaction rules are constructed in a blockchain, and transactions are carried out according to the pre-constructed transaction rules.

Description

Corpus labeling system, corpus labeling method and storage medium

Technical Field

The application relates to artificial intelligence, in particular to a corpus labeling system, a corpus labeling method and a storage medium.

Background

A corpus comprises documents of natural-language text and voice information, for example a speech transcript XX.txt or a recording XX.wav. With the development of artificial intelligence, corpora have more and more applications and can be recognized and processed by computers or other intelligent devices. However, a corpus must first be labeled to form a corpus carrying label information before an artificial intelligence model can be trained on it to extract and process information automatically.

The labeling of the corpus refers to adding additional codes representing various language features to corresponding language components so as to facilitate the recognition and processing of the corpus by a computer.

Existing corpus construction techniques generally rely either on fully manual labeling or on manual review of every machine-generated label, which consumes a great deal of manpower and material resources; moreover, the labeled corpus is owned entirely by the buyer and cannot be shared.

Disclosure of Invention

The technical problem mainly solved by the application is how to label corpora more efficiently.

According to a first aspect, one embodiment provides a corpus labeling system comprising a user terminal and a central processing server, wherein the user terminal is connected to the central processing server through a network, and the central processing server comprises a request receiving module, a corpus uploading module, an automatic labeling module, an automatic checking module and a labeling data uploading/downloading module;

the request receiving module is used for receiving a request from a user terminal and sending the request to the corpus uploading module;

the corpus uploading module is used for receiving a corpus labeling request from a user terminal, analyzing task content of the corpus labeling request, calling a corpus uploading interface from a preset database according to the task content, and returning the corpus uploading interface to the user terminal;

the automatic labeling module is used for receiving the corpus data to be labeled from the corpus uploading interface, storing the corpus data to be labeled in a preset database, analyzing the content in the corpus data to be labeled, classifying the corpus data to be labeled according to the content to obtain the category of the corpus data to be labeled, selecting a pre-constructed pre-labeling model corresponding to the category according to the category of the corpus data to be labeled, and inputting the corpus data to be labeled into the corresponding pre-constructed pre-labeling model for calculation to obtain the pre-labeled corpus data;

the automatic checking module is used for retrieving a pre-constructed knowledge graph and a data dictionary, determining risk points in the pre-labeled corpus data by utilizing a differential transformation algorithm according to the pre-constructed knowledge graph and the data dictionary, marking and correcting the risk points in the pre-labeled corpus data, and sending the pre-labeled corpus data with the marked risk points to an annotator terminal; the annotator terminal is used for manually checking and correcting the risk points in the pre-labeled corpus data to obtain annotation data;

the annotation data uploading/downloading module is used for receiving the annotation data from the annotator terminal, uploading the annotation data to the distributed file server, extracting relevant features in the annotation data, uploading the relevant features to the blockchain, and constructing transaction rules in the blockchain based on the relevant features, wherein the relevant features are used for identifying the annotation data, measuring the workload of the annotation data and representing the annotation quality of the annotation data.

Further, the central processing server also comprises a payment module for receiving payment/purchase requests from the user terminal, searching for pre-constructed transaction rules from the blockchain, determining payment fees according to the pre-constructed transaction rules based on the annotation data, returning the payment fees to the user terminal, and if the payment fees are received, calling the annotation data from the distributed file server and sending the annotation data to the user terminal.

Further, the task content of the corpus labeling request comprises at least one of segmentation of the corpus, part-of-speech labeling of the corpus, dependency relationship labeling of the corpus, entity labeling of the corpus, relationship labeling of the corpus, event labeling of the corpus, reading and understanding labeling of the corpus and question and answer labeling of the corpus.

Further, the pre-constructed pre-labeling model is a deep learning model.

Further, the central processing server also comprises a data conversion module, which is used for converting the corpus data to be annotated into a first preset data format and converting the annotation data into a second preset data format.

According to a second aspect, in one embodiment, a corpus labeling method is provided, applied to a central processing server, and includes:

receiving a corpus labeling request from a user terminal, analyzing task content of the corpus labeling request, calling a corpus uploading interface from a preset database according to the task content, and returning the corpus uploading interface to the user terminal;

receiving corpus data to be annotated from a corpus uploading interface, storing the corpus data to be annotated in a preset database, analyzing the content in the corpus data to be annotated, classifying the corpus data to be annotated according to the content to obtain the category of the corpus data to be annotated, selecting a pre-constructed pre-annotation model corresponding to the category according to the category of the corpus data to be annotated, and inputting the corpus data to be annotated into the corresponding pre-constructed pre-annotation model for calculation to obtain the pre-annotated corpus data;

determining risk points in the pre-labeled corpus data by utilizing a differential transformation algorithm according to a pre-constructed knowledge graph and a data dictionary, marking and correcting the risk points in the pre-labeled corpus data, and transmitting the pre-labeled corpus data with the marked risk points to an annotator terminal; the annotator terminal is used for manually checking and correcting the risk points in the pre-labeled corpus data to obtain annotation data;

and receiving the annotation data from the annotator terminal, uploading the annotation data to a distributed file server, extracting relevant features from the annotation data, uploading the relevant features to a blockchain, and constructing a transaction rule in the blockchain based on the relevant features, wherein the relevant features are used for identifying the annotation data, measuring the workload of the annotation data and representing the annotation quality of the annotation data.

Further, the method also comprises:

receiving a payment/purchase request from the user terminal, looking up the pre-constructed transaction rules in the blockchain, determining a payment fee for the annotation data according to the pre-constructed transaction rules, returning the payment fee to the user terminal, and, if the payment fee is received, retrieving the annotation data from the distributed file server and sending it to the user terminal.

Further, after receiving the corpus data to be annotated from the corpus uploading interface and before analyzing the content in the corpus data to be annotated, the method also comprises:

converting the corpus data to be annotated into a first preset data format, and converting the annotation data into a second preset data format.

According to a third aspect, an embodiment provides a computer readable storage medium including a program executable by a processor to implement the method described in the above embodiments.

According to the corpus labeling system, the corpus labeling method and the storage medium of the above embodiments, the central processing server analyzes the content of the corpus data to be labeled uploaded by the user terminal, classifies the corpus data according to the content and the task, selects the pre-labeling model corresponding to its category, and inputs the corpus data into that model for calculation to obtain pre-labeled corpus data. Risk points in the pre-labeled corpus data are checked according to the knowledge graph and the data dictionary, and the error-prone risk points are manually checked and corrected through the annotator terminal to obtain the labeling data. Finally, the labeling data is uploaded to the distributed file server, transaction rules are constructed in a blockchain, and transactions are carried out according to the pre-constructed transaction rules.

Drawings

FIG. 1 is a block diagram of a corpus tagging system according to one embodiment;

FIG. 2 is a block diagram of a central processing server according to one embodiment;

FIG. 3 is a flow chart of a corpus labeling method according to an embodiment.

Detailed Description

The application is described in further detail below through specific embodiments with reference to the drawings. Like elements in different embodiments are given associated like reference numerals. In the following embodiments, numerous specific details are set forth to provide a better understanding of the application. However, one skilled in the art will readily recognize that some of these features may be omitted, or replaced by other elements, materials or methods, in different situations. In some instances, certain operations related to the application are not shown or described in the specification, so that the core of the application is not obscured by excessive description; a detailed description of these operations is unnecessary for those skilled in the art, who can fully understand them from the description and from general technical knowledge in the field.

Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description and drawings are for clarity of description of only certain embodiments, and are not meant to be required orders unless otherwise indicated.

The ordinal numbering of components herein, e.g. "first", "second", etc., is used merely to distinguish the described objects and carries no sequential or technical meaning. The term "coupled" as used herein includes both direct and indirect coupling, unless otherwise indicated.

Embodiment one:

Referring to FIG. 1, which is a block diagram of a corpus labeling system according to an embodiment, the corpus labeling system includes a user terminal 10 and a central processing server 20, the user terminal 10 being connected to the central processing server 20 through a network. In one implementation, the corpus labeling system provided by this embodiment further includes a distributed file server 30 and an annotator terminal 40.

Referring to FIG. 2, the central processing server 20 includes a request receiving module 201, a corpus uploading module 202, an automatic labeling module 203, an automatic checking module 204, a labeling data uploading/downloading module 205, and a payment module 206.

The request receiving module 201 is configured to receive a request from the user terminal 10, and send the request to the corpus uploading module 202. The user terminal 10 in this embodiment may be a mobile terminal such as a smart phone or a tablet PC, or may be a terminal such as a PC.

In an embodiment, the user selects the required task content through the human-machine interface on the user terminal 10 and sends the request. The task content in this embodiment is defined by the labeling functions required for the corpus and includes at least one of the following (task content and task type): segmenting the corpus, part-of-speech labeling of the corpus, dependency labeling of the corpus, entity labeling of the corpus, relation labeling of the corpus, event labeling of the corpus, reading-comprehension labeling of the corpus, and question-answer labeling of the corpus.

The corpus uploading module 202 is configured to receive a corpus labeling request from the user terminal, analyze the task content and task type of the request, retrieve a corpus uploading interface from a preset database according to the task content, and return the corpus uploading interface to the user terminal. In this embodiment, different task contents have different corpus uploading interfaces. Corpus data to be annotated can be uploaded through the interface as an attachment, or as an attachment combined with typed text content; the attachment can be, for example, a text file (.txt) or a voice file (.wav).
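
The lookup performed by the corpus uploading module can be illustrated with a minimal sketch. The task identifiers, the `UploadInterface` record, the endpoint paths and the in-memory `INTERFACE_DB` dictionary are all hypothetical stand-ins for the preset database; the application does not specify them.

```python
from dataclasses import dataclass

# Hypothetical task identifiers mirroring the task contents listed in this embodiment.
TASKS = {
    "segmentation", "pos_tagging", "dependency", "entity",
    "relation", "event", "reading_comprehension", "question_answering",
}


@dataclass(frozen=True)
class UploadInterface:
    task: str
    url: str                  # hypothetical endpoint returned to the user terminal
    accepts: tuple            # accepted attachment extensions
    allows_inline_text: bool  # whether typed text may accompany the attachment


# Stand-in for the preset database: one interface record per task content.
INTERFACE_DB = {
    task: UploadInterface(task, f"/upload/{task}", (".txt", ".wav"), True)
    for task in TASKS
}


def resolve_upload_interface(request: dict) -> UploadInterface:
    """Parse the task content of a labeling request and look up its upload interface."""
    task = request.get("task_content")
    if task not in INTERFACE_DB:
        raise ValueError(f"unsupported task content: {task!r}")
    return INTERFACE_DB[task]
```

For example, `resolve_upload_interface({"task_content": "entity"})` returns the record for the entity-labeling interface, while an unknown task content is rejected before any corpus data is uploaded.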

The automatic labeling module 203 is configured to receive corpus data to be labeled from a corpus uploading interface, store the corpus data to be labeled in a preset database, analyze content in the corpus data to be labeled, classify the corpus data to be labeled according to the content to obtain a category of the corpus data to be labeled, select a pre-constructed pre-labeling model corresponding to the category according to the category of the corpus data to be labeled, and input the corpus data to be labeled into the corresponding pre-constructed pre-labeling model for calculation and inference to obtain the pre-labeled corpus data.

In an embodiment, the corpus data to be annotated uploaded by users can come from various industries, and industry characteristics can be found in its content. For example, medical record documents and diagnostic reports indicate that the corpus data belongs to the medical industry; alarm (police report) record documents indicate the public security field; education system planning documents indicate the education system; and legal documents indicate the legal field. Corpus data in different industry fields has different labeling characteristics, so this embodiment constructs pre-labeling models of multiple categories in advance, each category corresponding to one industry field. In this embodiment the pre-constructed pre-labeling model may be a deep learning model or another existing computational model. Constructing the pre-labeling model includes: acquiring labeled corpus data; inputting the labeled corpus data into a deep learning model; and adjusting the parameters of the deep learning model according to its input data and output data until optimal parameters are obtained. The trained deep learning model is the pre-constructed pre-labeling model.
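
The classify-then-select step can be sketched as follows. This is only an illustration under stated assumptions: the application does not say how the content is classified, so a simple keyword count is used here, and the cue words, category names and model names are invented for the example.

```python
from collections import Counter

# Hypothetical cue words per industry field; a real system would more likely
# use a trained classifier than a keyword list.
FIELD_KEYWORDS = {
    "medical": {"diagnosis", "patient", "medical", "symptom"},
    "public_security": {"police", "alarm", "suspect", "incident"},
    "education": {"curriculum", "school", "teaching", "student"},
    "legal": {"court", "plaintiff", "defendant", "judgment"},
}

# Stand-ins for the pre-constructed pre-labeling models, one per category.
PRELABEL_MODELS = {field: f"prelabel-model-{field}" for field in FIELD_KEYWORDS}
DEFAULT_MODEL = "prelabel-model-general"


def classify_corpus(text: str) -> str:
    """Return the industry field whose cue words occur most often in the text."""
    tokens = Counter(text.lower().split())
    scores = {
        field: sum(tokens[word] for word in words)
        for field, words in FIELD_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"


def select_model(text: str) -> str:
    """Pick the pre-labeling model that corresponds to the corpus category."""
    return PRELABEL_MODELS.get(classify_corpus(text), DEFAULT_MODEL)
```

Text with no recognizable cue words falls back to a general model rather than being forced into an industry category.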

In this embodiment, the pre-labeling model is built on a bidirectional long short-term memory (BiLSTM) recurrent neural network and a conditional random field (CRF) algorithm. The model is trained over multiple iterations on its own training data (manually fine-labeled corpora) to form a number of sequence labeling models for different tasks and different data types. The models are deployed as micro-services that respond to labeling requests for the different tasks and data types.
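
In a sequence labeling model of this kind, the conditional random field part is commonly decoded with the Viterbi algorithm, which combines per-token scores from the neural network with tag-to-tag transition scores. The sketch below shows only that decoding step in plain Python; the tag set and the scores are made up, and the application does not state which decoding procedure it uses.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: list of dicts, one per token, mapping tag -> score
               (for example the output of the recurrent network).
    transitions: dict mapping (previous_tag, tag) -> score
                 (for example learned by the conditional random field).
    """
    if not emissions:
        return []
    # best[t][tag] is the best score of any path ending in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, tag)])
            scores[tag] = best[-1][prev] + transitions[(prev, tag)] + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

A strongly negative transition score, such as for an inside tag directly following an outside tag, lets the decoder reject label sequences that are individually plausible per token but inconsistent as a whole.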

The automatic checking module 204 is configured to retrieve a pre-constructed knowledge graph and a data dictionary, determine risk points in the pre-labeled corpus data by using a differential transformation algorithm according to the pre-constructed knowledge graph and the data dictionary, mark and correct the risk points in the pre-labeled corpus data, and send the pre-labeled corpus data with the marked risk points to the annotator terminal 40; the annotator terminal is used for manually labeling the risk points in the pre-labeled corpus data to obtain the annotation data.

In this embodiment, the knowledge graph and the data dictionary are constructed manually using ontology technology. Using the knowledge graph and the data dictionary stored in the preset database, error-prone risk points in the pre-labeled corpus data are checked by a differential transformation method; the risk points found are color-highlighted and pre-corrected according to preset rules, and the highlighted, pre-corrected pre-labeled corpus data is sent to the annotator terminal for manual labeling. The annotator terminal in this embodiment may be a mobile terminal such as a smartphone or tablet computer, or a terminal such as a PC. Through the human-machine interface on the terminal, the annotator manually labels the highlighted risk points in the pre-labeled corpus data. Because risk points are detected and corrected through the knowledge graph, the data dictionary and the differential transformation algorithm, and are also highlighted, the annotator can find them quickly during manual labeling; combining automatic labeling with manual labeling in this way improves both the efficiency and the accuracy of corpus labeling.
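
The idea of checking pre-labels against reference knowledge and flagging disagreements can be sketched as below. This is not the differential transformation algorithm named in the application, whose details are not given; a plain lookup comparison is swapped in for illustration. The `RiskPoint` record, the label names and the `[[...]]` highlighting convention are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RiskPoint:
    index: int
    token: str
    predicted: str
    suggested: Optional[str]  # pre-correction, or None when no single fix is known


def find_risk_points(prelabeled, dictionary, graph_types):
    """Flag tokens whose predicted label contradicts the reference resources.

    prelabeled: list of (token, label) pairs from the pre-labeling model.
    dictionary: token -> set of labels the data dictionary allows.
    graph_types: entity name -> entity type taken from the knowledge graph.
    """
    risks = []
    for i, (token, label) in enumerate(prelabeled):
        expected = None
        if token in graph_types:
            expected = {graph_types[token]}
        elif token in dictionary:
            expected = dictionary[token]
        if expected is not None and label not in expected:
            suggestion = next(iter(expected)) if len(expected) == 1 else None
            risks.append(RiskPoint(i, token, label, suggestion))
    return risks


def highlight(prelabeled, risks):
    """Wrap risky tokens in [[...]] so an annotator can spot them quickly."""
    risky = {r.index for r in risks}
    return " ".join(
        f"[[{tok}/{lab}]]" if i in risky else f"{tok}/{lab}"
        for i, (tok, lab) in enumerate(prelabeled)
    )
```

Only the flagged tokens need a human decision; tokens whose labels agree with the reference resources, or that the resources do not cover, pass through unchanged.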

In this embodiment, the knowledge graph is an abstract representation of things in the real world and is a way of organizing and storing data, embodied as a database type with its own standards. A knowledge graph is quite different from a relational database: its data consists of points (nodes) and the relationships (edges) between them, and it also differs from a relational database in its storage, query and computation methods. The knowledge graph data in this embodiment is additionally drawn from open, public graph projects and covers entities and relationships that are common in the world. The principle of knowledge graph construction is the same across industry fields, and the knowledge graph constructed in this embodiment focuses on common-sense knowledge and commercial graphs. The data dictionary in this embodiment holds the minimum units of which a corpus is composed, including words, parts of speech, binary sequences, ternary sequences and the like, and is synonymous with a data dictionary in the conventional sense.
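
The two data structures can be pictured with a small sketch. It assumes the knowledge graph is stored as (head, relation, tail) triples and reads the binary and ternary sequences of the data dictionary as word bigrams and trigrams; both are interpretations for illustration, and the class and function names are hypothetical.

```python
from collections import defaultdict


class TripleStore:
    """Minimal graph: entities are nodes, each (head, relation, tail) triple is an edge."""

    def __init__(self):
        self._out = defaultdict(list)

    def add(self, head, relation, tail):
        self._out[head].append((relation, tail))

    def neighbors(self, head, relation=None):
        """Return the tails reachable from `head`, optionally filtered by relation."""
        return [
            tail for rel, tail in self._out.get(head, [])
            if relation is None or rel == relation
        ]


def build_data_dictionary(tagged_sentences):
    """Collect words with their parts of speech, plus word bigrams and trigrams."""
    words, bigrams, trigrams = defaultdict(set), set(), set()
    for sentence in tagged_sentences:
        tokens = [word for word, _ in sentence]
        for word, pos in sentence:
            words[word].add(pos)
        bigrams.update(zip(tokens, tokens[1:]))
        trigrams.update(zip(tokens, tokens[1:], tokens[2:]))
    return {"words": dict(words), "bigrams": bigrams, "trigrams": trigrams}
```

Queries on the graph follow edges from a node instead of joining tables, which is the practical difference from a relational database that the paragraph above points to.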

The annotation data uploading/downloading module 205 is configured to receive annotation data from the annotator terminal, upload the annotation data to the distributed file server 30, extract relevant features in the annotation data, upload the relevant features to the blockchain, and construct a transaction rule in the blockchain based on the relevant features, where the relevant features are used to identify the annotation data, measure the workload of the annotation data, and represent the annotation quality of the annotation data.
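
The split between bulk storage and an on-chain record can be sketched with two toy classes. `FileStore` is a hypothetical content-addressed store standing in for the distributed file server, and `Ledger` is a hypothetical append-only hash chain standing in for the blockchain; neither models a real network, consensus, or smart-contract transaction rules.

```python
import hashlib
import json


class FileStore:
    """Toy content-addressed store standing in for the distributed file server."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class Ledger:
    """Toy append-only hash chain standing in for the blockchain."""

    GENESIS = "0" * 64

    def __init__(self):
        self.blocks = []

    @staticmethod
    def _digest(prev: str, payload: dict) -> str:
        body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def append(self, payload: dict) -> str:
        prev = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        block = {"prev": prev, "payload": payload, "hash": self._digest(prev, payload)}
        self.blocks.append(block)
        return block["hash"]

    def verify(self) -> bool:
        """Recompute every hash and check that each block links to its predecessor."""
        prev = self.GENESIS
        for block in self.blocks:
            if block["prev"] != prev or self._digest(prev, block["payload"]) != block["hash"]:
                return False
            prev = block["hash"]
        return True
```

The annotation data itself stays in the file store; only its key and the extracted features go into the ledger, so a later change to a recorded feature is detectable by `verify()`.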

The relevant features in this embodiment include at least one of the following: a unique characterization vector of the annotated original text content, the number of annotated segmentations, the number of labels, the number of entities, the number of events, the number of relations, a quality-inspection score factor, sample data, and the like. Extracting these features effectively avoids repeated labeling of the same corpus and measures the labeling workload so that transaction rules can be constructed in the blockchain; the features can also be used to represent the labeling quality. A buyer of labeled corpora cannot see the complete annotation data before paying the fee, but can use the relevant features recorded on the blockchain for that annotation data to assess its labeling quality and to judge whether the workload matches the payment fee.
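
A minimal sketch of feature extraction follows. The record layout and field names are hypothetical, and a SHA-256 digest of the whitespace-normalized text is used as a simple stand-in for the unique characterization of the original text; the application does not say how that vector is computed.

```python
import hashlib


def extract_features(record: dict, quality_score: float, sample_chars: int = 20) -> dict:
    """Summarize one annotated record without exposing its full content."""
    # Collapse whitespace so trivially reformatted copies share a fingerprint.
    normalized = " ".join(record["text"].split())
    return {
        # Assumption: a digest stands in for the unique characterization vector.
        "fingerprint": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
        "segment_count": len(record.get("segments", [])),
        "label_count": len(record.get("labels", [])),
        "entity_count": len(record.get("entities", [])),
        "event_count": len(record.get("events", [])),
        "relation_count": len(record.get("relations", [])),
        "quality_score": quality_score,
        "sample": normalized[:sample_chars],
    }


def is_duplicate(features: dict, seen_fingerprints: set) -> bool:
    """Report whether this text was already labeled, then remember it."""
    duplicate = features["fingerprint"] in seen_fingerprints
    seen_fingerprints.add(features["fingerprint"])
    return duplicate
```

The counts give a rough measure of workload, the fingerprint supports the duplicate check, and the short sample lets a prospective buyer inspect a fragment without receiving the whole data set.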

The payment module 206 is configured to receive a payment/purchase request from the user terminal, look up the pre-constructed transaction rules in the blockchain, determine a payment fee for the annotation data according to those rules, and return the payment fee to the user terminal; if the payment fee is received, the module retrieves the annotation data from the distributed file server 30 and sends it to the user terminal. The distributed file server 30 in this embodiment may be an InterPlanetary File System (IPFS) storage server or a private distributed storage system.
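
How a fee might follow from a transaction rule can be sketched as below. The application does not disclose the form of the rules, so the rule here is purely hypothetical: a unit price per counted feature, scaled by a quality score, with a minimum charge. All names and prices are invented for the example.

```python
from decimal import Decimal

# Hypothetical transaction rule: a unit price for each counted feature.
EXAMPLE_RULE = {
    "label_count": Decimal("0.02"),
    "entity_count": Decimal("0.05"),
}


def payment_fee(features: dict, rule: dict, minimum: Decimal = Decimal("1.00")) -> Decimal:
    """Price the workload recorded in the features, scaled by labeling quality."""
    workload = sum(price * features.get(name, 0) for name, price in rule.items())
    quality = Decimal(str(features.get("quality_score", 1)))
    return max(workload * quality, minimum).quantize(Decimal("0.01"))


def release_if_paid(amount_received: Decimal, fee: Decimal, fetch):
    """Return the annotation data only once the full fee has been received."""
    return fetch() if amount_received >= fee else None
```

Because the fee depends only on features that are already public on the chain, the buyer can recompute it before paying, which is the check described in the preceding paragraphs.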

The corpus labeling system in this embodiment of the application can not only label corpora but also provide users with a corpus package for training a corpus labeling model, the corpus data in the package being already labeled; in one implementation, the corpus labeling model itself can also be provided.

In an embodiment, the corpus labeling system further includes a data conversion module for converting the corpus data to be annotated into a preset data format. In this embodiment, the uploaded corpus data to be annotated is varied and its encoding formats are often inconsistent, so, to make subsequent processing more convenient and faster, the data is converted into a unified preset format, for example the UTF-8 encoding format.
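
Normalizing uploads to UTF-8 can be sketched with trial decoding. The candidate encoding list is an assumption chosen for Chinese-language input, and trial decoding is a heuristic: it picks the first encoding that decodes without error, which is not guaranteed to be the encoding the file was written in.

```python
def to_utf8(raw: bytes, candidates=("utf-8-sig", "gb18030")) -> bytes:
    """Decode with the first candidate encoding that fits, then re-encode as UTF-8."""
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return text.encode("utf-8")
    raise ValueError("could not decode input with any candidate encoding")
```

Trying `utf-8-sig` first accepts UTF-8 with or without a byte order mark and strips the mark, so downstream modules see one consistent byte representation.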

The embodiment may also receive a corpus purchase request from the user terminal 10, retrieve, from a preset database, a corpus of a category matching the corpus purchase request according to the corpus purchase request, and return the retrieved corpus to the user terminal 10.

Embodiment two:

Referring to FIG. 3, which is a flowchart of a corpus labeling method executed in the central processing server 20, the method includes steps S101 to S105, described in detail below.

Step S101, receiving a corpus labeling request from a user terminal, analyzing task content of the corpus labeling request, calling a corpus uploading interface from a preset database according to the task content, and returning the corpus uploading interface to the user terminal.

Step S102, receiving corpus data to be annotated from a corpus uploading interface, storing the corpus data to be annotated in a preset database, analyzing the content in the corpus data to be annotated, classifying the corpus data to be annotated according to the content to obtain the category of the corpus data to be annotated, selecting a pre-constructed pre-annotation model corresponding to the category according to the category of the corpus data to be annotated, and inputting the corpus data to be annotated into the corresponding pre-constructed pre-annotation model for calculation to obtain the pre-annotated corpus data.

Step S103, determining risk points in the pre-labeled corpus data by utilizing a differential transformation algorithm according to a pre-constructed knowledge graph and a data dictionary, marking and correcting the risk points in the pre-labeled corpus data, and sending the pre-labeled corpus data with the marked risk points to an annotator terminal; the annotator terminal is used for manually checking and correcting the risk points in the pre-labeled corpus data to obtain the annotation data.

Step S104, receiving the annotation data from the annotator terminal, uploading the annotation data to the distributed file server, extracting relevant features from the annotation data, uploading the relevant features to the blockchain, and constructing a transaction rule in the blockchain based on the relevant features, wherein the relevant features are used for identifying the annotation data, measuring the workload of the annotation data and detecting the annotation quality of the annotation data.

Step S105, receiving a payment/purchase request from the user terminal, looking up the pre-constructed transaction rules in the blockchain, determining a payment fee for the annotation data according to the pre-constructed transaction rules, returning the payment fee to the user terminal, and, if the payment fee is received, retrieving the annotation data from the distributed file server and sending it to the user terminal.

The steps of the corpus labeling method in this embodiment correspond to the modules in embodiment one; their specific implementation has been described in detail in embodiment one and is not repeated here.
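
The order in which the server runs steps S101 to S104 can be sketched as a small driver. The step names and the convention of passing each step in as a callable are hypothetical; step S105 is left out because it is triggered separately by a later payment/purchase request.

```python
def run_labeling_pipeline(request, corpus, steps):
    """Run steps S101 to S104 in order.

    steps: dict of callables supplied by the caller, one per stage.
    """
    interface = steps["resolve_interface"](request)  # S101: pick the upload interface
    prelabeled = steps["prelabel"](corpus)           # S102: classify and pre-label
    flagged = steps["check_risks"](prelabeled)       # S103: automatic risk check
    annotated = steps["manual_review"](flagged)      # S103: review at the annotator terminal
    receipt = steps["publish"](annotated)            # S104: store data, record features
    return {"interface": interface, "receipt": receipt}
```

Keeping each stage behind a callable mirrors the one-module-per-step correspondence described above and lets any stage be replaced without touching the others.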

Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware or by a computer program. When all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a computer readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disk, a hard disk and the like, and the program is executed by a computer to realize the above functions. For example, the program is stored in the memory of a device, and all or part of the functions described above are realized when the program in the memory is executed by the processor. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk or a removable hard disk, and installed in the memory of a local device by downloading or copying, or used to update the system version of the local device; all or part of the functions in the above embodiments are then realized when the program in the memory is executed by the processor.

The foregoing description of the application has been presented for purposes of illustration and description, and is not intended to be limiting. Several simple deductions, modifications or substitutions may also be made by a person skilled in the art to which the application pertains, based on the idea of the application.

Claims

1. A corpus tagging system, comprising: the system comprises a user terminal and a central processing server, wherein the user terminal is connected to the central processing server through a network, and the central processing server comprises a request receiving module, a corpus uploading module, an automatic labeling module, an automatic checking module and a labeling data uploading/downloading module;

2. The corpus tagging system of claim 1, wherein the central processing server further comprises a payment module for receiving payment/purchase requests from the user terminal, finding pre-built transaction rules from the blockchain, determining payment fees based on the tagging data according to the pre-built transaction rules, and returning the payment fees to the user terminal, and if the payment fees are received, retrieving the tagging data from the distributed file server and transmitting to the user terminal.

3. The corpus tagging system of claim 1, wherein the task content of the corpus tagging request includes at least one of segmentation of the corpus, part-of-speech tagging of the corpus, dependency tagging of the corpus, entity tagging of the corpus, relationship tagging of the corpus, event tagging of the corpus, reading understanding tagging of the corpus, and question-answer tagging of the corpus.

4. The corpus labeling system of claim 1, wherein the pre-constructed pre-labeling model is a deep learning model.

5. The corpus labeling system according to any of claims 1-4, wherein the central processing server further comprises a data conversion module for converting the corpus data to be labeled into a first preset data format and converting the labeling data into a second preset data format.

6. A corpus labeling method applied to a central processing server, characterized by comprising the following steps:

7. The corpus labeling method of claim 6, further comprising:

8. The corpus tagging method according to claim 6, wherein the task content of the corpus tagging request includes at least one of segmentation of the corpus, part-of-speech tagging of the corpus, dependency tagging of the corpus, entity tagging of the corpus, relation tagging of the corpus, event tagging of the corpus, reading understanding tagging of the corpus, and question-answer tagging of the corpus.

9. The corpus labeling method according to any of claims 6 to 8, further comprising, after receiving corpus data to be labeled from a corpus uploading interface before parsing the content in the corpus data to be labeled:

10. A computer readable storage medium comprising a program executable by a processor to implement the method of any one of claims 6-9.