CN111881294B - Corpus labeling system, corpus labeling method and storage medium - Google Patents


Info

Publication number
CN111881294B
Authority
CN
China
Prior art keywords
corpus
data
labeling
labeled
tagging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010754175.7A
Other languages
Chinese (zh)
Other versions
CN111881294A (en
Inventor
杨永东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Benzhi Technology Shenzhen Co ltd
Original Assignee
Benzhi Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Benzhi Technology Shenzhen Co ltd filed Critical Benzhi Technology Shenzhen Co ltd
Priority to CN202010754175.7A priority Critical patent/CN111881294B/en
Publication of CN111881294A publication Critical patent/CN111881294A/en
Application granted granted Critical
Publication of CN111881294B publication Critical patent/CN111881294B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a corpus labeling system, a corpus labeling method and a storage medium. Corpus data to be labeled is input into a pre-labeling model selected according to the category of the corpus data, and pre-labeled corpus data is obtained by calculation. Noise risk points in the pre-labeled data, such as labeling errors, are checked by using a knowledge graph and a data dictionary, and the risk points are then labeled manually. Finally, the labeled data is uploaded to a distributed file server, transaction rules are constructed in a blockchain, and transactions are carried out according to the pre-constructed transaction rules.

Description

Corpus labeling system, corpus labeling method and storage medium
Technical Field
The application relates to artificial intelligence, in particular to a corpus labeling system, a corpus labeling method and a storage medium.
Background
A corpus includes documents of natural-language text and voice information, such as a speech transcript XX.txt or a recording XX.wav. With the development of artificial intelligence, corpora have more and more applications and can be recognized and processed by a computer or other intelligent device. However, a corpus must first be labeled to form a corpus carrying label information before an artificial intelligence model can be trained on it to extract and process information automatically.
The labeling of the corpus refers to adding additional codes representing various language features to corresponding language components so as to facilitate the recognition and processing of the corpus by a computer.
Existing corpus construction techniques generally rely either on fully manual labeling or on manual review of all machine-labeled data, which consumes a great deal of manpower and material resources. In addition, the labeled corpus is wholly owned by the buyer and cannot be shared.
Disclosure of Invention
The technical problem mainly solved by the application is how to label a corpus more efficiently.
According to a first aspect, in one embodiment, there is provided a corpus labeling system, including: the system comprises a user terminal and a central processing server, wherein the user terminal is connected to the central processing server through a network, and the central processing server comprises a request receiving module, a corpus uploading module, an automatic labeling module, an automatic checking module and a labeling data uploading/downloading module;
the request receiving module is used for receiving a request from a user terminal and sending the request to the corpus uploading module;
the corpus uploading module is used for receiving a corpus labeling request from a user terminal, analyzing task content of the corpus labeling request, calling a corpus uploading interface from a preset database according to the task content, and returning the corpus uploading interface to the user terminal;
the automatic labeling module is used for receiving the corpus data to be labeled from the corpus uploading interface, storing the corpus data to be labeled in a preset database, analyzing the content in the corpus data to be labeled, classifying the corpus data to be labeled according to the content to obtain the category of the corpus data to be labeled, selecting a pre-constructed pre-labeling model corresponding to the category according to the category of the corpus data to be labeled, and inputting the corpus data to be labeled into the corresponding pre-constructed pre-labeling model for calculation to obtain the pre-labeled corpus data;
the automatic checking module is used for retrieving a pre-constructed knowledge graph and a data dictionary, determining risk points in the pre-labeled corpus data by utilizing a differential transformation algorithm according to the pre-constructed knowledge graph and the data dictionary, marking and correcting the risk points in the pre-labeled corpus data, and sending the pre-labeled corpus data with the marked risk points to an annotator terminal; the annotator terminal is used for manually checking and correcting the risk points in the pre-labeled corpus data to obtain annotation data;
the annotation data uploading/downloading module is used for receiving the annotation data from the annotator terminal, uploading the annotation data to the distributed file server, extracting relevant features in the annotation data, uploading the relevant features to the blockchain, and constructing transaction rules in the blockchain based on the relevant features, wherein the relevant features are used for identifying the annotation data, measuring the workload of the annotation data and representing the annotation quality of the annotation data.
Further, the central processing server also comprises a payment module for receiving payment/purchase requests from the user terminal, searching for pre-constructed transaction rules from the blockchain, determining payment fees according to the pre-constructed transaction rules based on the annotation data, returning the payment fees to the user terminal, and if the payment fees are received, calling the annotation data from the distributed file server and sending the annotation data to the user terminal.
Further, the task content of the corpus labeling request comprises at least one of segmentation of the corpus, part-of-speech labeling of the corpus, dependency relationship labeling of the corpus, entity labeling of the corpus, relationship labeling of the corpus, event labeling of the corpus, reading and understanding labeling of the corpus and question and answer labeling of the corpus.
Further, the pre-constructed pre-labeling model is a deep learning model.
Further, the central processing server also comprises a data conversion module, which is used for converting the corpus data to be annotated into a first preset data format and converting the annotation data into a second preset data format.
According to a second aspect, in one embodiment, a corpus labeling method is provided, applied to a central processing server, and includes:
receiving a corpus labeling request from a user terminal, analyzing task content of the corpus labeling request, calling a corpus uploading interface from a preset database according to the task content, and returning the corpus uploading interface to the user terminal;
receiving corpus data to be annotated from a corpus uploading interface, storing the corpus data to be annotated in a preset database, analyzing the content in the corpus data to be annotated, classifying the corpus data to be annotated according to the content to obtain the category of the corpus data to be annotated, selecting a pre-constructed pre-annotation model corresponding to the category according to the category of the corpus data to be annotated, and inputting the corpus data to be annotated into the corresponding pre-constructed pre-annotation model for calculation to obtain the pre-annotated corpus data;
determining risk points in the pre-labeled corpus data by utilizing a differential transformation algorithm according to a pre-constructed knowledge graph and a data dictionary, marking and correcting the risk points in the pre-labeled corpus data, and sending the pre-labeled corpus data with the marked risk points to an annotator terminal; the annotator terminal is used for manually checking and correcting the risk points in the pre-labeled corpus data to obtain annotation data;
and receiving the annotation data from the annotator terminal, uploading the annotation data to a distributed file server, extracting relevant features from the annotation data, uploading the relevant features to a blockchain, and constructing a transaction rule in the blockchain based on the relevant features, wherein the relevant features are used for identifying the annotation data, measuring the workload of the annotation data and representing the annotation quality of the annotation data.
Further, the method further comprises the following steps:
receiving a payment/purchase request from the user terminal, searching for a pre-constructed transaction rule in the blockchain, determining the payment fee according to the pre-constructed transaction rule based on the annotation data, returning the payment fee to the user terminal, and, if the payment is received, retrieving the annotation data from the distributed file server and sending the annotation data to the user terminal.
Further, the task content of the corpus labeling request comprises at least one of segmentation of the corpus, part-of-speech labeling of the corpus, dependency relationship labeling of the corpus, entity labeling of the corpus, relationship labeling of the corpus, event labeling of the corpus, reading and understanding labeling of the corpus and question and answer labeling of the corpus.
Further, before analyzing the content in the corpus data to be annotated, after receiving the corpus data to be annotated from the corpus uploading interface, the method further comprises the following steps:
converting the corpus data to be annotated into a first preset data format, and converting the annotation data into a second preset data format.
According to a third aspect, an embodiment provides a computer readable storage medium including a program executable by a processor to implement the method described in the above embodiments.
According to the corpus labeling system, the corpus labeling method and the storage medium described above, the central processing server analyzes the content of the corpus data to be labeled uploaded by the user terminal, classifies the corpus data according to its content and task, selects the pre-labeling model corresponding to the category of the corpus data, and inputs the corpus data into that pre-labeling model for calculation to obtain pre-labeled corpus data. Risk points in the pre-labeled corpus data are checked according to the knowledge graph and the data dictionary, and the error-prone risk points are manually checked and corrected through the annotator terminal to obtain the annotation data. Finally, the annotation data is uploaded to the distributed file server, transaction rules are constructed in a blockchain, and transactions are carried out according to the pre-constructed transaction rules.
Drawings
FIG. 1 is a block diagram of a corpus tagging system according to one embodiment;
FIG. 2 is a block diagram of a central processing server according to one embodiment;
FIG. 3 is a flow chart of a corpus labeling method according to an embodiment.
Detailed Description
The application will be described in further detail below with reference to the drawings by means of specific embodiments. Like elements in different embodiments are given associated like reference numerals. In the following embodiments, numerous specific details are set forth in order to provide a better understanding of the present application. However, one skilled in the art will readily recognize that some of the features may be omitted, or replaced by other elements, materials, or methods in different situations. In some instances, operations related to the present application are not shown or described in the specification, in order to avoid obscuring the core of the application with excessive description; a detailed description of these operations is unnecessary for persons skilled in the art, who can fully understand them from the description in the specification and from general knowledge in the art.
Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description and drawings are for clarity of description of only certain embodiments, and are not meant to be required orders unless otherwise indicated.
The numbering of components herein, e.g. "first", "second", etc., is used merely to distinguish between the described objects and does not have any sequential or technical meaning. The term "coupled" as used herein includes both direct and indirect connection (coupling), unless otherwise indicated.
Embodiment one:
Referring to FIG. 1, which is a block diagram of a corpus labeling system according to an embodiment, the corpus labeling system includes a user terminal 10 and a central processing server 20, the user terminal 10 being connected to the central processing server 20 through a network. In one implementation, the corpus labeling system provided by this embodiment further includes a distributed file server 30 and an annotator terminal 40.
Referring to FIG. 2, the central processing server 20 includes a request receiving module 201, a corpus uploading module 202, an automatic labeling module 203, an automatic checking module 204, an annotation data uploading/downloading module 205, and a payment module 206.
The request receiving module 201 is configured to receive a request from the user terminal 10, and send the request to the corpus uploading module 202. The user terminal 10 in this embodiment may be a mobile terminal such as a smart phone or a tablet PC, or may be a terminal such as a PC.
In an embodiment, the user selects the required task content through a human-machine interaction interface on the user terminal 10 and sends the request. The task content in this embodiment may be defined by the kind of labeling the corpus requires, and includes at least one of the following functions (task content and task type): segmenting the corpus, part-of-speech labeling, dependency relationship labeling, entity labeling, relationship labeling, event labeling, reading comprehension labeling, and question-and-answer labeling of the corpus.
The corpus uploading module 202 is configured to receive a corpus labeling request from the user terminal, analyze the task content and task type of the corpus labeling request, retrieve a corpus uploading interface from a preset database according to the task content, and return the corpus uploading interface to the user terminal. In this embodiment, different task contents have different corpus uploading interfaces. Corpus data to be annotated can be uploaded through the corpus uploading interface as attachments, or by combining attachments with directly entered text content; the attachments can be text files (.txt), voice files (.wav) and the like.
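The lookup performed by the corpus uploading module can be illustrated with a minimal Python sketch. The task names follow the list given in this embodiment; the `UPLOAD_INTERFACES` table, the endpoint paths and the `get_upload_interface` function are hypothetical and merely stand in for the preset database, whose actual schema is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UploadInterface:
    endpoint: str             # hypothetical URL path of the upload interface
    accepts_attachment: bool  # e.g. .txt or .wav files
    accepts_text: bool        # directly entered text content


# Hypothetical stand-in for the preset database of upload interfaces.
UPLOAD_INTERFACES = {
    "word_segmentation": UploadInterface("/upload/segmentation", True, True),
    "pos_tagging": UploadInterface("/upload/pos", True, True),
    "entity_labeling": UploadInterface("/upload/entity", True, True),
    "qa_labeling": UploadInterface("/upload/qa", True, False),
}


def get_upload_interface(request: dict) -> UploadInterface:
    """Parse the task type of a labeling request and return its interface."""
    task = request.get("task_type")
    if task not in UPLOAD_INTERFACES:
        raise ValueError(f"unsupported task type: {task!r}")
    return UPLOAD_INTERFACES[task]
```

Keeping the mapping in a table rather than in branching code is one possible way to let new task types be added without changing the module itself.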
The automatic labeling module 203 is configured to receive corpus data to be labeled from a corpus uploading interface, store the corpus data to be labeled in a preset database, analyze content in the corpus data to be labeled, classify the corpus data to be labeled according to the content to obtain a category of the corpus data to be labeled, select a pre-constructed pre-labeling model corresponding to the category according to the category of the corpus data to be labeled, and input the corpus data to be labeled into the corresponding pre-constructed pre-labeling model for calculation and inference to obtain the pre-labeled corpus data.
In an embodiment, the corpus data to be annotated uploaded by users can come from various industries, and industry characteristics can be identified from the content of the uploaded data. For example, medical record documents, diagnostic reports and the like indicate that the corpus data belongs to the medical industry; alarm record documents indicate the public security field; education system planning documents indicate the education system; and legal documents indicate the legal field. Corpus data in different industry fields has different labeling characteristics, so this embodiment constructs pre-labeling models of a plurality of categories in advance for different industry fields, where each category can correspond to one industry field. In this embodiment, the pre-constructed pre-labeling model can be a deep learning model or another existing calculation model, and constructing the pre-labeling model comprises the following steps: acquiring labeled corpus data; and inputting the labeled corpus data into a deep learning model and adjusting the parameters of the deep learning model according to its output data and input data to obtain optimal parameters. The trained deep learning model is the pre-constructed pre-labeling model.
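The classify-then-dispatch step described above might look roughly like the following Python sketch. It is only an illustration: the keyword lists, the `general` fallback category and the stub models are assumptions, and a simple keyword count is swapped in for whatever content analysis the system actually uses.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical keyword lists for the industry fields named in the text.
CATEGORY_KEYWORDS: Dict[str, List[str]] = {
    "medical": ["medical record", "diagnosis", "patient"],
    "public_security": ["alarm record", "suspect", "police"],
    "education": ["curriculum", "school", "teaching plan"],
    "legal": ["plaintiff", "defendant", "court"],
}

Labeled = List[Tuple[str, str]]


def classify_corpus(text: str, default: str = "general") -> str:
    """Pick the category whose keywords occur most often in the text."""
    lowered = text.lower()
    scores = {
        category: sum(lowered.count(keyword) for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def make_stub_model(category: str) -> Callable[[str], Labeled]:
    """Placeholder for a trained pre-labeling model of one category."""
    def model(text: str) -> Labeled:
        return [(token, f"O-{category}") for token in text.split()]
    return model


# One pre-constructed model per category, plus the fallback.
MODEL_REGISTRY = {c: make_stub_model(c) for c in [*CATEGORY_KEYWORDS, "general"]}


def pre_label(text: str) -> Tuple[str, Labeled]:
    """Classify the corpus, then run the model registered for its category."""
    category = classify_corpus(text)
    return category, MODEL_REGISTRY[category](text)
```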
In this embodiment, the pre-labeling model is built on a bidirectional long short-term memory (BiLSTM) recurrent neural network and a conditional random field (CRF) algorithm. The model is trained over multiple iterations on in-house training data (manually fine-labeled corpus) to form a plurality of sequence labeling models for different tasks and different categories. These models are deployed as micro-services that respond to labeling requests for different tasks and data types.
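In a sequence labeling model that combines a recurrent network with a conditional random field, the network typically produces per-token emission scores and the conditional random field layer selects the best tag sequence by Viterbi decoding. The sketch below shows only that decoding step in plain Python; the emission and transition scores are hand-written illustrative values, not outputs of a trained model, and the function name `viterbi_decode` is an assumption.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag] is the score of `tag` at position t;
    transitions[(prev, cur)] is the score of moving from prev to cur
    (missing pairs are treated as 0.0).
    """
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

The transition scores are what let the decoder reject label sequences that are locally plausible but structurally invalid, such as an inside tag directly following an outside tag.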
The automatic checking module 204 is configured to retrieve a pre-constructed knowledge graph and a data dictionary, determine risk points in the pre-labeled corpus data by using a differential transformation algorithm according to the pre-constructed knowledge graph and the data dictionary, mark and correct the risk points in the pre-labeled corpus data, and send the pre-labeled corpus data with the marked risk points to an annotator terminal; the annotator terminal is used for manually labeling the risk points in the pre-labeled corpus data to obtain the annotation data.
In this embodiment, ontology technology is used to manually construct a knowledge graph and a data dictionary. Error-prone risk points in the pre-labeled corpus data are checked against the knowledge graph and the data dictionary in a preset database by a differential transformation method, the detected risk points are color-highlighted and pre-corrected according to preset rules, and the highlighted and pre-corrected pre-labeled corpus data is sent to the annotator terminal for manual labeling. The annotator terminal in this embodiment may be a mobile terminal such as a smart phone or a tablet computer, or a terminal such as a PC. Through a human-machine interaction interface on the terminal, the annotator manually labels the risk points in the highlighted pre-labeled corpus data. Because the risk points are detected and corrected by means of the knowledge graph, the data dictionary and the differential transformation algorithm and are also highlighted, the annotator can find them quickly during manual labeling; combining automatic and manual labeling in this way improves the efficiency and accuracy of corpus labeling.
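The text does not specify the differential transformation algorithm, so the following Python sketch substitutes a much simpler check for illustration: it compares each predicted label with the labels a data dictionary allows for that token, flags the differences as risk points, proposes a pre-correction when the dictionary is unambiguous, and marks the flagged tokens for the annotator. The `RiskPoint` type, both function names and the `[[...]]` marker are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class RiskPoint:
    index: int                # position of the token in the sequence
    token: str
    predicted: str            # label produced by the pre-labeling model
    suggested: Optional[str]  # pre-correction, None when ambiguous


def check_risk_points(
    pre_labeled: List[Tuple[str, str]],
    data_dictionary: Dict[str, Set[str]],
) -> List[RiskPoint]:
    """Flag tokens whose predicted label conflicts with the dictionary."""
    risks = []
    for i, (token, label) in enumerate(pre_labeled):
        allowed = data_dictionary.get(token)
        if allowed is None or label in allowed:
            continue  # unknown token or consistent label: not flagged
        suggested = next(iter(allowed)) if len(allowed) == 1 else None
        risks.append(RiskPoint(i, token, label, suggested))
    return risks


def highlight(pre_labeled: List[Tuple[str, str]], risks: List[RiskPoint]) -> str:
    """Render the sequence with flagged tokens wrapped in [[...]]."""
    flagged = {risk.index for risk in risks}
    return " ".join(
        f"[[{token}/{label}]]" if i in flagged else f"{token}/{label}"
        for i, (token, label) in enumerate(pre_labeled)
    )
```

In this sketch only the flagged positions need the annotator's attention, which is the source of the efficiency gain the embodiment describes.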
In this embodiment, the knowledge graph is an abstract counterpart of things in the real world; it is a way of organizing and storing data and takes the form of a standardized type of database. The knowledge graph is completely different from a relational database: its data consists of points and the relationships (edges) between them, and it also differs from a relational database in its storage, query and calculation methods. The knowledge graph data in this embodiment is also obtained from public graph projects and includes common entities and relationships in the world. The knowledge graph construction principle is the same across industry fields, and the knowledge graph constructed in this embodiment focuses on common-sense knowledge and commercial graphs. The data dictionary in this embodiment consists of the minimum units of which a corpus is composed, including words, parts of speech, binary sequences, ternary sequences and the like, and is synonymous with a data dictionary in the conventional sense.
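The two data structures can be pictured with a small Python sketch. It assumes the knowledge graph is held as (head, relation, tail) triples and reads the binary and ternary sequences of the data dictionary as two-token and three-token sequences; both readings, along with the class and function names, are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


class KnowledgeGraph:
    """Tiny in-memory graph: points connected by labeled edges."""

    def __init__(self) -> None:
        self._out = defaultdict(set)

    def add(self, head: str, relation: str, tail: str) -> None:
        self._out[head].add((relation, tail))

    def neighbors(self, head: str, relation: Optional[str] = None) -> List[str]:
        """Tails reachable from head, optionally filtered by relation."""
        return sorted(
            tail
            for rel, tail in self._out.get(head, ())
            if relation is None or rel == relation
        )


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-token sequences in tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_data_dictionary(sentences: List[List[str]]) -> Dict[str, Counter]:
    """Count words, two-token and three-token sequences over a corpus."""
    dictionary = {"words": Counter(), "bigrams": Counter(), "trigrams": Counter()}
    for tokens in sentences:
        dictionary["words"].update(tokens)
        dictionary["bigrams"].update(ngrams(tokens, 2))
        dictionary["trigrams"].update(ngrams(tokens, 3))
    return dictionary
```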
The annotation data uploading/downloading module 205 is configured to receive annotation data from the annotator terminal, upload the annotation data to the distributed file server 30, extract relevant features in the annotation data, upload the relevant features to the blockchain, and construct a transaction rule in the blockchain based on the relevant features, where the relevant features are used to identify the annotation data, measure the workload of the annotation data, and represent the annotation quality of the annotation data.
The relevant features in this embodiment include at least one of the following: a unique characterization vector of the annotated original text content, labeled cut scores, label numbers, entity numbers and event numbers, off coefficients, quality inspection score factors, sample data and the like. Extracting these features effectively avoids repeated labeling of the same corpus, measures the labeling workload so that transaction rules can be constructed in the blockchain, and can also characterize the labeling quality. A buyer of labeled corpus cannot see the complete annotation data before paying the fee, but can use the relevant features recorded on the blockchain for that annotation data to check its labeling quality and to judge whether the workload matches the fee.
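A minimal Python sketch of such feature extraction follows. It uses a SHA-256 digest of the original text as a stand-in for the unique characterization vector (the text does not say how that vector is computed) and simple counts as workload measures; the annotation tuple layout and the function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical annotation layout: (start, end, label, kind),
# where kind is for example "entity" or "event".
Annotation = Tuple[int, int, str, str]


def extract_features(original_text: str, annotations: List[Annotation]) -> Dict:
    """Summarize annotation data without exposing the data itself."""
    fingerprint = hashlib.sha256(original_text.encode("utf-8")).hexdigest()
    kinds = Counter(kind for _, _, _, kind in annotations)
    return {
        "fingerprint": fingerprint,       # identifies the source text
        "label_count": len(annotations),  # rough workload measure
        "entity_count": kinds["entity"],
        "event_count": kinds["event"],
    }


def is_duplicate(features: Dict, known_fingerprints: Set[str]) -> bool:
    """True if the same original text has already been labeled."""
    return features["fingerprint"] in known_fingerprints
```

Because only the digest and the counts would be published, a buyer could compare workload against price without seeing the annotation data, which matches the behavior described above.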
The payment module 206 is configured to receive a payment request from the user terminal, search for a pre-constructed transaction rule in the blockchain, determine the payment fee according to the pre-constructed transaction rule based on the annotation data, and return the payment fee to the user terminal; if the payment is received, the payment module retrieves the annotation data from the distributed file server and sends the annotation data to the user terminal. The distributed file server 30 in this embodiment may be a distributed file storage server based on the InterPlanetary File System (IPFS), or may be a private distributed storage system.
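The form of the transaction rule is not given in the text. Purely as an illustration, the Python sketch below assumes a rule made of a base fee plus a per-label price scaled by a quality score, and releases the data only once the fee is covered. The rule fields, the feature names and both functions are hypothetical.

```python
from decimal import Decimal
from typing import Callable, Dict, Optional


def compute_fee(features: Dict, rule: Dict[str, Decimal]) -> Decimal:
    """Fee = base fee + label count * price per label * quality score."""
    workload = Decimal(features["label_count"]) * rule["price_per_label"]
    fee = rule["base_fee"] + workload * Decimal(str(features["quality_score"]))
    return fee.quantize(Decimal("0.01"))


def deliver_if_paid(
    paid: Decimal, fee: Decimal, fetch: Callable[[], str]
) -> Optional[str]:
    """Fetch the annotation data only after the fee has been received."""
    if paid < fee:
        return None
    return fetch()
```

`Decimal` is used instead of floating point so that the fee is computed and rounded exactly, which is a common choice for monetary amounts.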
The corpus labeling system in the embodiment of the application can not only label corpora but also provide users with a corpus package for training a corpus labeling model, in which the corpus data is already labeled; in one implementation, the corpus labeling model itself can also be provided.
In an embodiment, the corpus labeling system further includes a data conversion module for converting the corpus data to be annotated into a preset data format. In this embodiment, the uploaded corpus data to be annotated is varied and its encoding formats are often inconsistent, so to make subsequent processing more convenient and faster, the data is converted into a unified preset format, for example the UTF-8 encoding format.
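One simple way to normalize encodings is trial decoding, sketched below in Python. The candidate list and the function name `to_utf8` are assumptions; trial decoding is a heuristic and can misidentify an encoding when several candidates happen to decode the same bytes without error.

```python
from typing import Sequence


def to_utf8(
    raw: bytes, candidates: Sequence[str] = ("utf-8", "gb18030", "big5")
) -> bytes:
    """Decode with the first candidate encoding that works, re-encode as UTF-8."""
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
        return text.encode("utf-8")
    raise ValueError("could not decode with any candidate encoding")
```

UTF-8 is tried first because its validation is strict, so non-UTF-8 input is unlikely to be accepted by mistake.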
The embodiment may also receive a corpus purchase request from the user terminal 10, retrieve, from a preset database, a corpus of a category matching the corpus purchase request according to the corpus purchase request, and return the retrieved corpus to the user terminal 10.
Embodiment two:
Referring to FIG. 3, which is a flowchart of a corpus labeling method executed in the central processing server 20, the method includes steps S101 to S105, which are described in detail below.
Step S101, receiving a corpus labeling request from a user terminal, analyzing task content of the corpus labeling request, calling a corpus uploading interface from a preset database according to the task content, and returning the corpus uploading interface to the user terminal.
Step S102, receiving corpus data to be annotated from a corpus uploading interface, storing the corpus data to be annotated in a preset database, analyzing the content in the corpus data to be annotated, classifying the corpus data to be annotated according to the content to obtain the category of the corpus data to be annotated, selecting a pre-constructed pre-annotation model corresponding to the category according to the category of the corpus data to be annotated, and inputting the corpus data to be annotated into the corresponding pre-constructed pre-annotation model for calculation to obtain the pre-annotated corpus data.
Step S103, determining risk points in the pre-labeled corpus data by utilizing a differential transformation algorithm according to a pre-constructed knowledge graph and a data dictionary, marking and correcting the risk points in the pre-labeled corpus data, and sending the pre-labeled corpus data with the marked risk points to an annotator terminal; the annotator terminal is used for manually checking and correcting the risk points in the pre-labeled corpus data to obtain the annotation data.
Step S104, receiving the annotation data from the annotator terminal, uploading the annotation data to the distributed file server, extracting relevant features from the annotation data, uploading the relevant features to the blockchain, and constructing a transaction rule in the blockchain based on the relevant features, wherein the relevant features are used for identifying the annotation data, measuring the workload of the annotation data and detecting the annotation quality of the annotation data.
Step S105, receiving a payment/purchase request from the user terminal, searching for a pre-constructed transaction rule in the blockchain, determining the payment fee according to the pre-constructed transaction rule based on the annotation data, returning the payment fee to the user terminal, and, if the payment is received, retrieving the annotation data from the distributed file server and sending the annotation data to the user terminal.
The steps of the corpus labeling method in this embodiment correspond to the modules in embodiment one. Their specific implementation has been described in detail in embodiment one and is not repeated here.
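The flow of steps S102 to S104 can be summarized in a short Python sketch in which every stage is passed in as a callable. The function name and all parameter names are hypothetical; the request handling of step S101 and the payment of step S105 are left out for brevity.

```python
from typing import Any, Callable, Dict


def run_labeling_pipeline(
    corpus: str,
    classify: Callable[[str], str],
    models: Dict[str, Callable[[str], Any]],
    check_risks: Callable[[Any], Any],
    manual_review: Callable[[Any, Any], Any],
    store: Callable[[Any], str],
    publish: Callable[[Any, str], None],
) -> str:
    """Run classification, pre-labeling, checking, review and storage."""
    category = classify(corpus)                  # S102: determine category
    pre_labeled = models[category](corpus)       # S102: pre-label with its model
    risks = check_risks(pre_labeled)             # S103: find risk points
    labeled = manual_review(pre_labeled, risks)  # S103: annotator corrects them
    location = store(labeled)                    # S104: distributed file server
    publish(labeled, location)                   # S104: features to the blockchain
    return location
```

Passing the stages in as callables keeps the sketch independent of any particular model, storage backend or blockchain client.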
Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware, or may be implemented by a computer program. When all or part of the functions in the above embodiments are implemented by means of a computer program, the program may be stored in a computer readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic disk, optical disk, hard disk, etc., and the program is executed by a computer to realize the above-mentioned functions. For example, the program is stored in the memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above can be realized. In addition, when all or part of the functions in the above embodiments are implemented by means of a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied into the memory of a local device, or used to update the system version of the local device; when the program in the memory is executed by a processor, all or part of the functions in the above embodiments can be realized.
The foregoing description of the application has been presented for purposes of illustration and description, and is not intended to be limiting. A person skilled in the art to which the application pertains may also make several simple deductions, modifications or substitutions based on the idea of the application.

Claims (10)

1. A corpus tagging system, comprising a user terminal and a central processing server, wherein the user terminal is connected to the central processing server through a network, and the central processing server comprises a request receiving module, a corpus uploading module, an automatic labeling module, an automatic checking module and a labeling data uploading/downloading module;
the request receiving module is used for receiving a request from a user terminal and sending the request to the corpus uploading module;
the corpus uploading module is used for receiving a corpus labeling request from a user terminal, analyzing task content of the corpus labeling request, calling a corpus uploading interface from a preset database according to the task content, and returning the corpus uploading interface to the user terminal;
the automatic labeling module is used for receiving the corpus data to be labeled from the corpus uploading interface, storing the corpus data to be labeled in a preset database, analyzing the content in the corpus data to be labeled, classifying the corpus data to be labeled according to the content to obtain the category of the corpus data to be labeled, selecting a pre-constructed pre-labeling model corresponding to the category according to the category of the corpus data to be labeled, and inputting the corpus data to be labeled into the corresponding pre-constructed pre-labeling model for calculation to obtain the pre-labeled corpus data;
the automatic checking module is used for retrieving a pre-constructed knowledge graph and a data dictionary, determining risk points in the pre-labeled corpus data by utilizing a differential transformation algorithm according to the pre-constructed knowledge graph and the data dictionary, marking and correcting the risk points in the pre-labeled corpus data, and sending the pre-labeled corpus data with the risk points marked to an annotator terminal; the annotator terminal is used for manually checking and correcting the risk points in the pre-labeled corpus data to obtain annotation data;
the annotation data uploading/downloading module is used for receiving the annotation data from the annotator terminal, uploading the annotation data to the distributed file server, extracting relevant features in the annotation data, uploading the relevant features to the blockchain, and constructing transaction rules in the blockchain based on the relevant features, wherein the relevant features are used for identifying the annotation data, measuring the workload of the annotation data and representing the annotation quality of the annotation data.
2. The corpus tagging system of claim 1, wherein the central processing server further comprises a payment module for receiving payment/purchase requests from the user terminal, finding pre-built transaction rules from the blockchain, determining payment fees based on the tagging data according to the pre-built transaction rules, and returning the payment fees to the user terminal, and if the payment fees are received, retrieving the tagging data from the distributed file server and transmitting to the user terminal.
3. The corpus tagging system of claim 1, wherein the task content of the corpus tagging request includes at least one of segmentation of the corpus, part-of-speech tagging of the corpus, dependency tagging of the corpus, entity tagging of the corpus, relationship tagging of the corpus, event tagging of the corpus, reading comprehension tagging of the corpus, and question-answer tagging of the corpus.
4. The corpus labeling system of claim 1, wherein the pre-constructed pre-labeling model is a deep learning model.
5. The corpus labeling system according to any of claims 1-4, wherein the central processing server further comprises a data conversion module for converting the corpus data to be labeled into a first preset data format and converting the labeling data into a second preset data format.
6. A corpus labeling method, applied to a central processing server, characterized by comprising the following steps:
receiving a corpus labeling request from a user terminal, analyzing task content of the corpus labeling request, calling a corpus uploading interface from a preset database according to the task content, and returning the corpus uploading interface to the user terminal;
receiving corpus data to be annotated from a corpus uploading interface, storing the corpus data to be annotated in a preset database, analyzing the content in the corpus data to be annotated, classifying the corpus data to be annotated according to the content to obtain the category of the corpus data to be annotated, selecting a pre-constructed pre-annotation model corresponding to the category according to the category of the corpus data to be annotated, and inputting the corpus data to be annotated into the corresponding pre-constructed pre-annotation model for calculation to obtain the pre-annotated corpus data;
determining risk points in the pre-labeled corpus data by utilizing a differential transformation algorithm according to a pre-constructed knowledge graph and a data dictionary, marking and correcting the risk points in the pre-labeled corpus data, and transmitting the pre-labeled corpus data with the risk points marked to an annotator terminal; the annotator terminal is used for manually checking and correcting the risk points in the pre-labeled corpus data to obtain labeling data;
and receiving the labeling data from the annotator terminal, uploading the labeling data to a distributed file server, extracting relevant features in the labeling data, uploading the relevant features to a blockchain, and constructing a transaction rule in the blockchain based on the relevant features, wherein the relevant features are used for identifying the labeling data, measuring the workload of the labeling data and representing the labeling quality of the labeling data.
7. The corpus labeling method of claim 6, further comprising:
receiving a payment/purchase request from the user terminal, looking up a pre-constructed transaction rule on the blockchain, determining the payment fee for the labeling data according to the pre-constructed transaction rule, returning the payment fee to the user terminal, and, once the payment fee is received, retrieving the labeling data from the distributed file server and sending the labeling data to the user terminal.
8. The corpus tagging method according to claim 6, wherein the task content of the corpus tagging request includes at least one of segmentation of the corpus, part-of-speech tagging of the corpus, dependency tagging of the corpus, entity tagging of the corpus, relationship tagging of the corpus, event tagging of the corpus, reading comprehension tagging of the corpus, and question-answer tagging of the corpus.
9. The corpus labeling method according to any of claims 6 to 8, further comprising, after receiving the corpus data to be labeled from the corpus uploading interface and before parsing the content in the corpus data to be labeled:
converting the corpus data to be annotated into a first preset data format, and converting the annotation data into a second preset data format.
10. A computer readable storage medium comprising a program executable by a processor to implement the method of any one of claims 6-9.
CN202010754175.7A 2020-07-30 2020-07-30 Corpus labeling system, corpus labeling method and storage medium Active CN111881294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010754175.7A CN111881294B (en) 2020-07-30 2020-07-30 Corpus labeling system, corpus labeling method and storage medium

Publications (2)

Publication Number Publication Date
CN111881294A CN111881294A (en) 2020-11-03
CN111881294B true CN111881294B (en) 2023-10-24

Family

ID=73205723

Country Status (1)

Country Link
CN (1) CN111881294B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant