CN116776881A

CN116776881A - Active learning-based domain entity identification system and identification method

Info

Publication number: CN116776881A
Application number: CN202310598745.1A
Authority: CN
Inventors: 宋荣伟; 刘蜜; 谢博; 蒲天应; 申国伟
Original assignee: Shanghai Workpower Telecom Technology Co ltd; Guizhou University
Current assignee: Shanghai Workpower Telecom Technology Co ltd; Guizhou University
Priority date: 2023-05-25
Filing date: 2023-05-25
Publication date: 2023-09-19

Abstract

A domain entity recognition system and a recognition method based on active learning relate to the technical field of entity recognition and classification. The method solves the problems that entity identification in the existing specific field lacks a high-quality labeling data set, so that the labeling cost is very high and the identification accuracy is low. The recognition system comprises a data preprocessing module, a model training module, an active learning module, a domain entity recognition model and a domain entity crowdsourcing labeling platform device; the data preprocessing module is used for carrying out data processing on the original text data and sending the text data to the model training module and the active learning module; the model training module is used for mapping and identifying the received data and training and evaluating the domain entity identification model; the active learning module is used for driving the domain entity crowdsourcing labeling platform device to carry out marking processing on the received data and sending the data to the domain entity identification model for training. The method is suitable for entity identification and classification in Chinese information processing.

Description

Active learning-based domain entity identification system and identification method

Technical Field

The invention relates to the technical field of entity identification and classification, in particular to the technical field of entity identification and classification in Chinese information processing.

Background

With the development of information technology, we have entered the age of big data and artificial intelligence. Traditional enterprises have accumulated a great deal of data and experiential knowledge in the course of informatization, which supports the improvement of their business capabilities. The knowledge graph is a technical link connecting big data and artificial intelligence, and is a cornerstone of the move from perceptual intelligence to cognitive intelligence. To turn data assets into knowledge assets, shallow understanding and analysis of data are achieved through data digitization, data management, data mining and related technologies, and deeper knowledge mining, analysis and application are then achieved through knowledge discovery, knowledge fusion and the like.

For the large amount of unstructured text data that enterprises produce, the aggregated data can be deeply understood and organized in the form of a knowledge graph. When a knowledge graph is constructed for a specific domain, a domain knowledge ontology model is built, and extracting domain entities and entity relations is a key step of the construction. Domain entity identification is therefore a very important task in knowledge graph construction. For example, in constructing a security knowledge graph in the field of network security, the goal is to extract security entities, such as attack organizations, enterprises, vulnerabilities and software, from network security text data. Likewise, in constructing a product development knowledge graph in the manufacturing field, the goal is to extract entities related to product components, such as product names, design criteria and materials, from the text data produced during product development and design.

Compared with entity identification in the general domain, entity identification in a specific domain lacks labeled data, so labeling is costly and identification accuracy is low. Domain-specific entity identification tasks are often complex, involving large numbers of mixed-type entities and nested entities. Furthermore, deep learning relies on large-scale labeled data, yet large-scale, high-quality entity labeling data is lacking in certain fields. Labeling a data set therefore requires a great deal of manpower and material resources, which makes labeling very costly.

In summary, entity identification in a specific field lacks a high-quality labeling dataset, so that the labeling cost is very high, and the identification accuracy is low.

Disclosure of Invention

The method solves the problems that entity identification in the existing specific field lacks a high-quality labeling data set, so that the labeling cost is very high and the identification accuracy is low.

In order to achieve the above object, the present invention provides the following solutions:

the invention provides a domain entity identification system based on active learning, which comprises a data preprocessing module, a model training module, an active learning module, a domain entity identification model and a domain entity crowdsourcing labeling platform device, wherein the model training module is used for acquiring a domain entity identification model;

the data preprocessing module is used for carrying out data processing on the original text data and sending the text data after the data preprocessing to the model training module and the active learning module;

the model training module is used for mapping and identifying the received data and training and evaluating the domain entity identification model;

and the active learning module is used for driving the domain entity crowdsourcing labeling platform device to carry out marking processing on the received data and sending the data to the domain entity identification model for training.

Further, in a preferred embodiment, the data preprocessing module includes a data cleansing unit, a data format unit and a data segmentation unit;

the data cleaning unit is used for acquiring text data of the field and removing special characters and pictures in the text;

the data format unit is used for processing an input format;

the data segmentation unit is used for splitting the data and providing the data source for the selection strategy.

Further, in a preferred embodiment, the model training module includes a model identifying unit and a model evaluating unit;

the model identification unit is used for identifying the received data;

the model evaluation unit is used for evaluating the trained domain entity recognition model according to the evaluation standard.

Further, in a preferred embodiment, the evaluation criteria include accuracy, recall, and F1 metric.

Further, in a preferred embodiment, the active learning module includes a policy selection unit, a data labeling unit, and an iterative training unit;

the strategy selection unit is used for selecting data with labeling value;

the data labeling unit is used for driving the domain entity crowdsourcing labeling platform device to label data;

the iterative training unit is used for setting the number of iterative training times.

Further, in a preferred embodiment, the entity crowdsourcing labeling platform device includes a user layer, a display layer, a business layer, a data layer, a database, and a data analysis layer;

the user layer is used for collecting text data to be labeled and sending the text data to the business layer through the display layer;

the business layer is used for calling an algorithm in the database to process the received data and sending the processed data to the data analysis layer;

the data analysis layer is used for labeling the received data to obtain labeled data.

Further, in a preferred embodiment, the overall architecture of the entity crowdsourcing annotation platform device is implemented by adopting a B/S architecture;

the display layer is constructed by using the Angular framework, the NG-ZORRO component library and the ECharts visualization library.

The invention also provides a domain entity identification method based on active learning, which is realized by adopting the domain entity identification system based on active learning, and comprises the following steps:

s1, adopting the data preprocessing module to carry out data cleaning on original text data to obtain input data of a domain entity identification model;

s2, dividing the input format data of the domain entity identification model to obtain training set data, verification set data and test set data;

s3, dividing the training set data to obtain training set data with labels and training set data without labels;

s4, training the domain entity recognition model by the model training module according to the training set data with the labels to obtain a trained domain entity recognition model;

s5, analyzing the test set data by the trained domain entity recognition model to obtain analysis result data;

s6, the active learning module uses the domain entity crowdsourcing labeling platform device to label the unlabeled training set data, and adds the newly labeled data to the labeled training set data to obtain a new labeled training data set;

and S7, training the domain entity recognition model by adopting the analysis result data and the new training data set with the label until the performance of the domain entity recognition model reaches the preset condition, and stopping training to obtain the recognized domain entity.
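Steps S4 to S7 can be pictured as a loop. The Python sketch below is only an illustration under stated assumptions, not the patented implementation: the function `active_learning_loop`, its callback parameters (`train`, `evaluate`, `value`, `annotate`), the batch size and the toy "model" (a plain lookup table) are all hypothetical names and choices introduced here. The `annotate` callback stands in for the crowdsourcing labeling platform.

```python
from typing import Any, Callable, List, Sequence, Tuple

Labeled = Tuple[str, str]  # (text, label)


def active_learning_loop(
    labeled: List[Labeled],
    unlabeled: List[str],
    train: Callable[[Sequence[Labeled]], Any],
    evaluate: Callable[[Any], float],
    value: Callable[[Any, str], float],
    annotate: Callable[[str], str],
    batch_size: int = 2,
    target: float = 0.9,
    max_rounds: int = 10,
) -> Tuple[Any, List[float]]:
    """Train, check the stop condition, label the most valuable samples, repeat."""
    model = train(labeled)            # S4: train on the labeled training set
    history = [evaluate(model)]       # S5: analyse held-out data
    rounds = 0
    while history[-1] < target and unlabeled and rounds < max_rounds:
        # S6: rank unlabeled samples by labeling value, highest first
        ranked = sorted(unlabeled, key=lambda s: value(model, s), reverse=True)
        batch = ranked[:batch_size]
        labeled.extend((s, annotate(s)) for s in batch)
        unlabeled[:] = [s for s in unlabeled if s not in batch]
        model = train(labeled)        # S7: retrain on the enlarged labeled set
        history.append(evaluate(model))
        rounds += 1
    return model, history


# Toy stand-ins (hypothetical): the "model" is a lookup table of known terms.
ORACLE = {"apt28": "ORG", "wannacry": "MALWARE", "cve-2017-0144": "VULN",
          "windows": "SOFTWARE", "lazarus": "ORG"}
TEST = [("wannacry", "MALWARE"), ("windows", "SOFTWARE"), ("lazarus", "ORG")]

labeled = [("apt28", "ORG")]
unlabeled = ["wannacry", "cve-2017-0144", "windows", "lazarus"]
model, history = active_learning_loop(
    labeled,
    unlabeled,
    train=lambda data: dict(data),
    evaluate=lambda m: sum(m.get(t) == y for t, y in TEST) / len(TEST),
    value=lambda m, s: 0.0 if s in m else 1.0,
    annotate=lambda s: ORACLE[s],
    target=1.0,
)
```

In this toy run the score rises with each labeled batch and the loop stops once the preset target is reached or the unlabeled pool is empty, which mirrors the stop condition in S7.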

The present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs an active learning based domain entity identification method as described above.

The invention provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes the domain entity identification method based on active learning.

The beneficial effects of the invention are as follows:

1. The invention provides a domain entity recognition system based on active learning. It uses a domain entity recognition framework built from a dilated convolutional neural network based on the BERT pre-training model together with a conditional random field, selects data with high labeling value by an active learning method for labeling by annotation experts, and adds the labeled data to the training data to train the domain entity recognition framework. To address the problem of how to select high-value data for annotation, a new selection strategy in active learning is implemented, so that the unlabeled data set can be fully exploited to improve the entity recognition performance of the framework.

2. The invention provides a domain entity recognition system based on active learning, which uses active learning to select data with high labeling value and adds that data to the training data to train the domain entity recognition framework. This avoids the large amount of manpower and material resources otherwise required to label a data set, and so reduces the labeling cost.

3. The invention provides a domain entity recognition system based on active learning, which uses active learning to select data with high labeling value and implements a new selection strategy, so that the unlabeled data set can be fully exploited to improve the entity recognition performance of the framework.

The method is suitable for entity identification and classification in Chinese information processing.

Drawings

FIG. 1 is a schematic diagram of the domain entity identification system based on active learning according to the first embodiment;

FIG. 2 is a diagram of the BERT model according to the third embodiment;

FIG. 3 is a diagram of the dilated convolutional neural network model according to the third embodiment;

FIG. 4 is a schematic diagram of the domain entity crowdsourcing labeling platform device according to the sixth embodiment;

FIG. 5 is a schematic diagram of an implementation of the domain entity crowdsourcing labeling platform device according to the sixth embodiment;

FIG. 6 is a graph of the overall experimental results according to the eleventh embodiment;

FIG. 7 (a) is a graph of the overall test results on the F1 index according to the eleventh embodiment;

FIG. 7 (b) is a graph of the test results on the F1 index for each entity type according to the eleventh embodiment.

Detailed Description

In the first embodiment, described with reference to FIG. 1, a domain entity recognition system based on active learning is provided; the recognition system includes a data preprocessing module, a model training module, an active learning module, a domain entity recognition model, and a domain entity crowdsourcing labeling platform device.

In practical application, the data preprocessing module acquires domain text data and cleans the original text, converts the cleaned text into the input format of the model, and splits it into a training set, a verification set and a test set; the training set is further divided into a labeled training set and an unlabeled training set. The model training module constructs a domain entity recognition model based on BERT and a dilated convolutional neural network, maps the input text into real-valued vectors rich in positional, semantic and syntactic features, identifies the entities in the text, and evaluates the trained model. The active learning module selects data with high labeling value from the unlabeled training set, has it labeled on the domain entity crowdsourcing labeling platform device, adds the labeled data to the labeled data set, and iteratively trains the model; iteration stops when the performance of the model reaches the preset condition. The entity crowdsourcing labeling platform device is a data labeling platform that provides specific entity labeling tasks for the active learning module, and supports entity labeling where entity labels are missing in a specific domain.

This embodiment provides a domain entity recognition system based on active learning, which uses active learning to select data with high labeling value and adds that data to the training data to train the domain entity recognition framework. This avoids the large amount of manpower and material resources otherwise required to label a data set, and so reduces the labeling cost.

This embodiment provides a domain entity recognition system based on active learning, which uses active learning to select data with high labeling value and implements a new selection strategy, so that the unlabeled data set can be fully exploited to improve the entity recognition performance of the framework.

In the second embodiment, the data preprocessing module in the active learning-based domain entity recognition system according to the first embodiment is illustrated, where the data preprocessing module includes a data cleaning unit, a data format unit and a data segmentation unit;

the data format unit is used for processing an input format;

In practical application, the data preprocessing module comprises a data cleaning unit, a data format unit and a data segmentation unit. The data cleaning unit removes special characters and pictures from the text; the data format unit converts the text into the input format of the BERT model; the data segmentation unit splits the data for training and evaluating the model and provides the data source for the selection strategy in active learning.
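The two-stage split performed by the data segmentation unit can be sketched as follows. This is a minimal illustration that uses only the Python standard library rather than any particular machine learning framework; the helper `split`, the 80/10/10 ratio and the 10/90 labeled/unlabeled ratio are assumptions made for the example and are not stated in the text.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split(items: Sequence[T], fractions: Sequence[float], seed: int = 0) -> List[List[T]]:
    """Shuffle once, then cut into consecutive parts sized by `fractions`."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    parts: List[List[T]] = []
    start = 0
    for i, frac in enumerate(fractions):
        # The last part takes the remainder so that no item is lost to rounding.
        is_last = i == len(fractions) - 1
        end = len(shuffled) if is_last else start + round(frac * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts


sentences = [f"sentence {i}" for i in range(100)]

# Stage 1: training, verification and test sets (assumed 80/10/10).
train, val, test = split(sentences, [0.8, 0.1, 0.1])

# Stage 2: the training set is divided again; the unlabeled part is the pool
# from which the active learning selection strategy draws (assumed 10/90).
labeled, unlabeled = split(train, [0.1, 0.9], seed=1)
```

Keeping the unlabeled pool inside the training portion means that the verification and test sets never influence which sentences are selected for labeling.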

In the third embodiment, described with reference to FIG. 2 and FIG. 3, the model training module in the active learning-based domain entity recognition system of the first embodiment is illustrated; the model training module includes a model recognition unit and a model evaluation unit;

the model identification unit is used for identifying the received data;

In practical application, the model training module comprises a model training unit and a model evaluation unit, wherein:

the model training unit uses a BERT-based dilated convolutional neural network model to identify the entities in the text; the BERT model is shown in FIG. 2, and the dilated convolutional neural network model is shown in FIG. 3. The model evaluation unit evaluates the trained model. It is also used to evaluate the model trained with the BERT-based dilated convolutional neural network and to generate a performance comparison report against other baseline active learning methods.
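For readers unfamiliar with the "hole" (dilated) convolution named here, the following Python sketch shows the operation itself on a one-dimensional sequence. It illustrates the general technique only and is not the network of FIG. 3; the function names `dilated_conv1d` and `receptive_field` and the example kernel are hypothetical.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int = 1) -> List[float]:
    """'Valid' 1-D convolution (cross-correlation) whose taps are `dilation` apart."""
    span = (len(kernel) - 1) * dilation + 1  # inputs covered by one output
    return [
        sum(w * x[i + k * dilation] for k, w in enumerate(kernel))
        for i in range(len(x) - span + 1)
    ]


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of stacked stride-1 dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [1.0, 0.0, -1.0]

dense = dilated_conv1d(signal, kernel, dilation=1)   # taps at i, i+1, i+2
sparse = dilated_conv1d(signal, kernel, dilation=2)  # taps at i, i+2, i+4
```

With the same three weights, a dilation of 2 lets each output see five input positions instead of three. Stacking layers with growing dilation rates widens the context quickly: `receptive_field(3, [1, 2, 4])` covers 15 positions, whereas three undilated layers cover 7.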

In a fourth embodiment, the evaluation criteria in the active learning-based domain entity recognition system described in the third embodiment are exemplified, where the evaluation criteria include accuracy, recall, and F1 metric.
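The three evaluation criteria can be computed as in the Python sketch below. Two assumptions are made here and are not stated in the text: "accuracy" is read as precision in the information-extraction sense, and a predicted entity counts as correct only if its span and type match a gold entity exactly. The function name `prf1` and the example entities are hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, type), end exclusive


def prf1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity spans."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {(0, 2, "ORG"), (5, 9, "MALWARE"), (12, 15, "VULN"), (20, 22, "SOFTWARE")}
pred = {(0, 2, "ORG"), (5, 9, "MALWARE"), (12, 14, "VULN")}  # last span is too short

p, r, f = prf1(gold, pred)
```

Here two of three predictions are correct (precision 2/3), two of four gold entities are found (recall 1/2), and F1, the harmonic mean of the two, is 4/7.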

In the fifth embodiment, the active learning module in the active learning-based domain entity recognition system of the first embodiment is illustrated; the active learning module includes a policy selection unit, a data labeling unit and an iterative training unit;

the strategy selection unit is used for selecting data with labeling value;
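One way the strategy selection unit could rank sentences is sketched below in Python, using the three selection indices named in the method embodiment (uncertainty, confidence and normalized dictionary match count). The text does not say how the indices are combined, so the equal-weight sum is an assumption, as is the premise that each index is already oriented so that a larger value means a higher labeling value. All function names and example values are hypothetical.

```python
from typing import Iterable, List, Sequence


def match_counts(sentences: Sequence[str], dictionary: Iterable[str]) -> List[int]:
    """Count, per sentence, how many times dictionary entities occur in it."""
    entries = [e for e in set(dictionary) if e]
    return [sum(s.count(e) for e in entries) for s in sentences]


def min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def rank_for_labeling(
    sentences: Sequence[str],
    uncertainty: Sequence[float],
    confidence: Sequence[float],
    dictionary: Iterable[str],
) -> List[str]:
    """Sort sentences by a combined score, highest labeling value first."""
    matches = min_max(match_counts(sentences, dictionary))
    # ASSUMPTION: equal-weight sum; the source only says the three indices
    # are used together and sorted in descending order.
    scores = [u + c + m for u, c, m in zip(uncertainty, confidence, matches)]
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order]


sentences = [
    "APT28 used WannaCry against Windows hosts",
    "The weather was fine",
    "WannaCry spreads via SMB",
]
dictionary = ["APT28", "WannaCry", "Windows"]  # entities from the labeled set
counts = match_counts(sentences, dictionary)
ranked = rank_for_labeling(sentences, [0.2, 0.9, 0.6], [0.1, 0.3, 0.5], dictionary)
```

In practice the weights of the three indices would need tuning on the verification set; the sketch only shows the mechanics of matching, normalizing and sorting.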

In the sixth embodiment, the entity crowdsourcing labeling platform device in the active learning-based domain entity identification system of the first embodiment is illustrated; the entity crowdsourcing labeling platform device includes a user layer, a display layer, a business layer, a data layer, a database, and a data analysis layer;

In practical application, the network security entity labeling system adopts a customized B/S architecture, as shown in FIG. 4. The data analysis layer is a completely independent entity annotation learning model; the entity labeling system only needs to call the interface provided by the analysis layer model to obtain annotation data and deliver it to the user. From the user's point of view it is a black box: users need no deep machine learning background and do not have to tune parameters by hand; they only import data to optimize their own labeling model. System developers can likewise complete their development tasks without a machine learning background.

The data layer is responsible for data interaction, queries and related work between the business layer and the database; it implements inserting, modifying and deleting data and also provides some advanced query functions. The data layer also defines a dedicated database interface, so that when business logic needs a database operation, it can perform it simply by obtaining the interface through dependency injection and calling it.
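The idea of a data-layer interface obtained through dependency injection can be sketched in Python as follows. The text does not name the back-end language or framework, so this is only an illustration of the pattern; `AnnotationStore`, `InMemoryStore` and `AnnotationService` are hypothetical names, and the in-memory store stands in for a real database.

```python
from typing import Dict, List, Optional, Protocol


class AnnotationStore(Protocol):
    """Hypothetical data-layer interface: all that business logic sees."""

    def insert(self, text: str, labels: List[str]) -> int: ...
    def update(self, record_id: int, labels: List[str]) -> None: ...
    def delete(self, record_id: int) -> None: ...
    def find(self, record_id: int) -> Optional[Dict[str, object]]: ...


class InMemoryStore:
    """Stand-in for a real database, convenient for tests."""

    def __init__(self) -> None:
        self._rows: Dict[int, Dict[str, object]] = {}
        self._next_id = 1

    def insert(self, text: str, labels: List[str]) -> int:
        record_id = self._next_id
        self._rows[record_id] = {"text": text, "labels": list(labels)}
        self._next_id += 1
        return record_id

    def update(self, record_id: int, labels: List[str]) -> None:
        self._rows[record_id]["labels"] = list(labels)

    def delete(self, record_id: int) -> None:
        del self._rows[record_id]

    def find(self, record_id: int) -> Optional[Dict[str, object]]:
        return self._rows.get(record_id)


class AnnotationService:
    """Business-layer class; the store is injected through the constructor."""

    def __init__(self, store: AnnotationStore) -> None:
        self._store = store

    def submit(self, text: str, labels: List[str]) -> int:
        if not text.strip():
            raise ValueError("text must not be empty")
        return self._store.insert(text, labels)

    def relabel(self, record_id: int, labels: List[str]) -> None:
        if self._store.find(record_id) is None:
            raise KeyError(record_id)
        self._store.update(record_id, labels)


store = InMemoryStore()
service = AnnotationService(store)
rid = service.submit("WannaCry exploits CVE-2017-0144", ["MALWARE"])
service.relabel(rid, ["MALWARE", "VULN"])
```

Because the service depends only on the interface, the in-memory store can be replaced by a database-backed implementation without touching the business logic.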

The business layer is responsible for receiving the data from front-end interaction, processing it, and passing it on to the database. The business layer also provides a number of controller classes that respond to the various requests of the front end. In addition, so that the front end can work with data sent from the back end more conveniently, the business layer defines a standard interface that constrains the format of back-end response data.

The display layer serves as the front end of the network security entity labeling system. While rendering the user-layer pages, it receives the data standardized by the back end, processes it again, and applies the page logic before delivering the page to the user. The back end therefore does not need to spend much effort handling the scattered data of a traditional front end; that work is done in the front-end engine. The display layer's processing logic also makes the user interaction pages more flexible, and the display layer contributes substantially to user-friendly labeling functions and convenient interaction.

In the seventh embodiment, the entity crowdsourcing labeling platform device in the active learning-based domain entity identification system of the sixth embodiment is illustrated; the overall architecture of the entity crowdsourcing labeling platform device is implemented with a B/S architecture;

In practical application, this embodiment builds the display layer with the Angular framework, the NG-ZORRO component library and the ECharts visualization library; the display layer is responsible for processing back-end data and rendering pages.

Compared with the prior art, the labeling platform device has the following advantages: (1) The strategy selection unit in the data analysis layer of the device can select data with labeling value, whereas existing traditional labeling platforms cannot screen data by labeling value. (2) The model evaluation unit in the device evaluates the trained domain entity recognition model; whereas a traditional labeling platform simply uses a model, this labeling platform device can evaluate different models. (3) The data preprocessing module in the device comprises a data cleaning unit, a data format unit and a data segmentation unit, whereas a traditional labeling platform can import data only after data cleaning has been done beforehand.

The eighth embodiment provides an active learning-based domain entity recognition method, where the recognition method is implemented by using the active learning-based domain entity recognition system as set forth in any one of the first to seventh embodiments, and the recognition method is as follows:

In practical application, in the data preprocessing module: the data cleaning unit removes special characters and pictures from the text; the data format unit performs word segmentation on the cleaned data and uses one-hot coding as the original coding of the data; the position of each word in the sentence is encoded by trigonometric functions; all segment codes of a sentence are initialized to 1; finally, the three codes are used as the input of the BERT model. The data segmentation unit uses the scikit-learn (sklearn) machine learning framework to split the data into a training set, a verification set and a test set; the training set is further divided into a labeled data set and an unlabeled data set for the active learning selection strategy.

In the model training module: the model is trained with the processed data as input, and is optimized with the Adam algorithm.

In the active learning module: the probability of each label in each sentence is calculated, and the lowest probability value is taken as the uncertainty of the sentence; the sentence is decoded with the Viterbi algorithm of the conditional random field, and the result is taken as the confidence; the entities in the labeled data set are extracted as an initial dictionary, the dictionary is matched against the characters of each sentence, and the number of matches per sentence is calculated and normalized. With the uncertainty, the confidence and the matching number as the selection indices of the selection strategy, the sentences are sorted in descending order, so that sentences with high labeling value can be selected. The selected data is then labeled with entities on the domain entity crowdsourcing labeling platform, which provides label data for further iterative optimization in active learning.
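The Viterbi decoding step of a linear-chain conditional random field can be sketched in Python as below. This is a generic textbook version working on additive (log-space) scores, not the patent's code: the function name `viterbi`, the BIO tag set and the example scores are hypothetical, and how the decoded result is turned into the sentence confidence is not specified in the text. The sketch simply returns the best path together with its score.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi(
    emissions: Sequence[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
) -> Tuple[List[str], float]:
    """Return the highest-scoring tag path and its total (additive) score."""
    tags = list(emissions[0])
    score = dict(emissions[0])               # best score of a path ending in each tag
    backpointers: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            # Best previous tag for a path that ends in `cur` at this position.
            prev = max(tags, key=lambda p: score[p] + transitions[(p, cur)])
            new_score[cur] = score[prev] + transitions[(prev, cur)] + emit[cur]
            pointers[cur] = prev
        score = new_score
        backpointers.append(pointers)
    last = max(tags, key=lambda t: score[t])
    path = [last]
    for pointers in reversed(backpointers):  # walk the back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


TAGS = ["B", "I", "O"]
transitions = {(p, c): 0.0 for p in TAGS for c in TAGS}
transitions[("O", "I")] = -10.0              # an entity cannot continue after "O"

emissions = [
    {"B": 1.0, "I": 0.0, "O": 2.0},
    {"B": 0.0, "I": 3.0, "O": 0.0},
    {"B": 0.0, "I": 0.0, "O": 1.0},
]
path, best_score = viterbi(emissions, transitions)
```

Picking the best tag for each token independently would give the invalid sequence O, I, O here; Viterbi accounts for the transition penalty and returns B, I, O instead.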

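In the ninth embodiment, a computer readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program performs the active learning-based domain entity identification method described in the eighth embodiment.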

In the tenth embodiment, a computer device is provided, which includes a memory and a processor, where the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes the active learning-based domain entity identification method described in the eighth embodiment.

An eleventh embodiment is a verification description of the domain entity identification system based on active learning according to any one of the first to seventh embodiments. Experimental results show that the accuracy and recall rate obtained by the method provided by the invention on the network security entity data set are better than those obtained by the baseline method.

To ensure the accuracy of the test, the effectiveness of the selection strategy of the invention is also verified with a BiLSTM deep learning model; the test results are shown in FIG. 6. The security entity recognition performance of the BiLSTM-CRF model is poorer than that of BERT-RDCNN-CRF. However, the results show that whether the BERT model or the BiLSTM model is used, the proposed method is better than the baseline selection strategies in both accuracy and recall.

To balance accuracy and recall, we also calculated the F1 values of the different methods. The overall test results on the F1 index are shown in FIG. 7 (a), and the test results on the F1 index for each entity type are shown in FIG. 7 (b). The figures show that the proposed selection strategy is better than the baseline methods for both the BERT model and the BiLSTM model, which further verifies the effectiveness of the proposed method. For the BERT-based model, the F1 value of the proposed method is 88.54%, compared with 87.6%, 85.2% and 87.1% for the LTC, MTP and RANDOM methods, an increase of 0.94, 3.34 and 1.44 percentage points respectively. For the BiLSTM model, the F1 value of the proposed method is 80.17%, compared with 78.81%, 77% and 77.24% for the LTC, MTP and RANDOM methods, an improvement of 1.36, 3.17 and 2.93 percentage points respectively.

Therefore, the domain entity recognition system based on active learning disclosed in this embodiment combines active learning with deep learning and addresses the complex structure of Chinese domain entities. It also provides a new selection strategy, so that Chinese entities in a specific domain are recognized well and the cost of labeling entities in that domain is reduced.

This embodiment provides a domain entity recognition system based on active learning. It uses a domain entity recognition framework built from a dilated convolutional neural network based on the BERT pre-training model together with a conditional random field, selects data with high labeling value by an active learning method for labeling by annotation experts, and adds the labeled data to the training data to train the domain entity recognition framework. To address the problem of how to select high-value data for annotation, a new selection strategy in active learning is implemented, so that the unlabeled data set can be fully exploited to improve the entity recognition performance of the framework.

The above description is only an example of the present invention and is not intended to limit the present invention; various modifications and changes will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention is intended to be included within the scope of the claims of the present invention.

Claims

1. The domain entity identification system based on active learning is characterized by comprising a data preprocessing module, a model training module, an active learning module, a domain entity identification model and a domain entity crowdsourcing labeling platform device;

2. The active learning-based domain entity recognition system of claim 1, wherein the data preprocessing module comprises a data cleaning unit, a data format unit and a data segmentation unit;

the data format unit is used for processing an input format;

3. The active learning-based domain entity recognition system of claim 1, wherein the model training module comprises a model recognition unit and a model evaluation unit;

the model identification unit is used for identifying the received data;

4. A domain entity-identification system based on active learning as claimed in claim 3, wherein said evaluation criteria include accuracy, recall and F1 metric.

5. The active learning-based domain entity recognition system of claim 1, wherein the active learning module comprises a policy selection unit, a data annotation unit and an iterative training unit;

the strategy selection unit is used for selecting data with labeling value;

6. The active learning-based domain entity identification system of claim 1, wherein the entity crowdsourcing annotation platform device comprises a user layer, a presentation layer, a business layer, a data layer, a database, and a data analysis layer;

7. The active learning-based domain entity identification system of claim 6, wherein the overall architecture of the entity crowdsourcing annotation platform device is implemented using a B/S architecture;

8. An active learning-based domain entity identification method, characterized in that the identification method is implemented by adopting the active learning-based domain entity identification system according to any one of claims 1-7, and the identification method comprises the following steps:

9. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs a domain entity identification method based on active learning as claimed in claim 8.

10. A computer device, characterized by: the apparatus comprises a memory and a processor, wherein the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes the domain entity identification method based on active learning as claimed in claim 8.