CN116776881A - Active learning-based domain entity identification system and identification method - Google Patents
- Publication number
- CN116776881A (application CN202310598745.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- domain entity
- model
- training
- active learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19147—Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
A domain entity recognition system and recognition method based on active learning, relating to the technical field of entity recognition and classification. The invention addresses the problem that entity recognition in specific domains lacks high-quality labeled data sets, which makes labeling very costly and recognition accuracy low. The recognition system comprises a data preprocessing module, a model training module, an active learning module, a domain entity recognition model and a domain entity crowdsourcing labeling platform device. The data preprocessing module processes the original text data and sends the preprocessed text data to the model training module and the active learning module. The model training module maps and recognizes the received data, and trains and evaluates the domain entity recognition model. The active learning module drives the domain entity crowdsourcing labeling platform device to label the received data and sends the labeled data to the domain entity recognition model for training. The invention is suitable for entity recognition and classification in Chinese information processing.
Description
Technical Field
The invention relates to the technical field of entity identification and classification, in particular to the technical field of entity identification and classification in Chinese information processing.
Background
With the development of information technology, we have entered the age of big data and artificial intelligence. Traditional enterprises have accumulated a great deal of data and experiential knowledge in the course of informatization, which supports the improvement of their business capabilities. The knowledge graph is the technical link between big data and artificial intelligence, and a cornerstone of the move from perceptual intelligence to cognitive intelligence. To turn data assets into knowledge assets, a shallow understanding and analysis of the data is first achieved through data digitization, data management, data mining and related technologies; deeper knowledge mining, analysis and application are then achieved through knowledge discovery, knowledge fusion and the like.
For the large amount of unstructured text data that enterprises produce, the aggregated data can be deeply understood and organized in the form of a knowledge graph. In building a domain-specific knowledge graph, constructing the domain knowledge ontology model and extracting domain entities and entity relations are the key steps. Domain entity recognition is therefore a very important task in knowledge graph construction. For example, when building a security knowledge graph in the field of network security, the goal is to extract security entities such as attack organizations, enterprises, vulnerabilities and software from network security text. Likewise, when building a product development knowledge graph in the manufacturing field, the goal is to extract entities related to product components, such as product names, design criteria and materials, from the text produced during product development and design.
Compared with entity recognition in the general domain, entity recognition in a specific domain suffers from a lack of labeled data and high labeling cost, so recognition accuracy is low. Domain-specific entity recognition tasks are often complex, involving many entities of mixed types as well as nested entities. Furthermore, deep learning relies on large-scale labeled data, yet some domains lack large-scale, high-quality entity annotations. Labeling a data set therefore requires a great deal of manpower and material resources, which makes labeling very costly.
In summary, entity identification in a specific field lacks a high-quality labeling dataset, so that the labeling cost is very high, and the identification accuracy is low.
Disclosure of Invention
The method solves the problems that entity identification in the existing specific field lacks a high-quality labeling data set, so that the labeling cost is very high and the identification accuracy is low.
In order to achieve the above object, the present invention provides the following solutions:
the invention provides a domain entity identification system based on active learning, which comprises a data preprocessing module, a model training module, an active learning module, a domain entity identification model and a domain entity crowdsourcing labeling platform device, wherein the model training module is used for acquiring a domain entity identification model;
the data preprocessing module is used for carrying out data processing on the original text data and sending the text data after the data preprocessing to the model training module and the active learning module;
the model training module is used for mapping and identifying the received data and training and evaluating the domain entity identification model;
and the active learning module is used for driving the domain entity crowdsourcing labeling platform device to carry out marking processing on the received data and sending the data to the domain entity identification model for training.
Further, in a preferred embodiment, the data preprocessing module includes a data cleansing unit, a data format unit and a data segmentation unit;
the data cleaning unit is used for acquiring text data of the field and removing special characters and pictures in the text;
the data format unit is used for processing an input format;
the data segmentation unit is used for splitting the data into the sets used for model training and evaluation and into the data source for the active learning selection strategy.
Further, in a preferred embodiment, the model training module includes a model identifying unit and a model evaluating unit;
the model identification unit is used for identifying the received data;
the model evaluation unit is used for evaluating the trained domain entity recognition model according to the evaluation standard.
Further, in a preferred embodiment, the evaluation criteria include precision, recall, and the F1 score.
Further, in a preferred embodiment, the active learning module includes a strategy selection unit, a data labeling unit, and an iterative training unit;
the strategy selection unit is used for selecting data with labeling value;
the data labeling unit is used for driving the domain entity crowdsourcing labeling platform device to label data;
the iterative training unit is used for setting the number of iterative training times.
Further, in a preferred embodiment, the entity crowdsourcing labeling platform device includes a user layer, a display layer, a service layer, a data layer, a database, and a data analysis layer;
the user layer is used for collecting text data to be marked and sending the text data to the service layer through the display layer;
the service layer is used for calling an algorithm in the database to process the received data and sending the processed data to the data analysis layer;
the data analysis layer is used for labeling the received data to obtain labeled data.
Further, in a preferred embodiment, the overall architecture of the entity crowdsourcing annotation platform device is implemented by adopting a B/S architecture;
the display layer is constructed using the Angular framework, the NG-ZORRO component library and the ECharts visualization library.
The invention also provides a domain entity identification method based on active learning, which is realized by adopting the domain entity identification system based on active learning, and comprises the following steps:
S1, using the data preprocessing module to clean the original text data and obtain the input data of the domain entity recognition model;
S2, splitting the input-format data of the domain entity recognition model into training set data, verification set data and test set data;
S3, dividing the training set data into labeled training set data and unlabeled training set data;
S4, training the domain entity recognition model with the model training module on the labeled training set data to obtain a trained domain entity recognition model;
S5, analyzing the test set data with the trained domain entity recognition model to obtain analysis result data;
S6, using the active learning module and the domain entity crowdsourcing labeling platform device to label the unlabeled training set data, and adding the newly labeled data to the labeled training set data to obtain a new labeled training data set;
S7, training the domain entity recognition model with the analysis result data and the new labeled training data set, and stopping when the performance of the domain entity recognition model reaches the preset condition, to obtain the recognized domain entities.
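Steps S4 to S7 amount to a pool-based active learning loop. The following is a minimal sketch of that loop only, assuming hypothetical callables `train`, `evaluate`, `select` and `annotate` and hypothetical parameters `target_f1`, `batch_size` and `max_rounds`; none of these names come from the patent, and the stopping rule is an illustrative stand-in for the "preset condition".

```python
from typing import Callable, List, Tuple

Sentence = str
Labeled = Tuple[Sentence, List[str]]  # (sentence, one tag per character)


def active_learning_loop(
    labeled: List[Labeled],
    unlabeled: List[Sentence],
    train: Callable[[List[Labeled]], object],
    evaluate: Callable[[object], float],
    select: Callable[[object, List[Sentence], int], List[Sentence]],
    annotate: Callable[[List[Sentence]], List[Labeled]],
    target_f1: float = 0.85,
    batch_size: int = 2,
    max_rounds: int = 10,
):
    """Hypothetical sketch of steps S4 to S7; every callable is a placeholder."""
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train(labeled)                        # S4: initial training
    history = [evaluate(model)]                   # S5: analyse held-out data
    for _ in range(max_rounds):
        if history[-1] >= target_f1 or not pool:
            break                                 # S7: preset condition reached
        batch = select(model, pool, batch_size)   # choose high-value sentences
        chosen = set(batch)
        pool = [s for s in pool if s not in chosen]
        labeled.extend(annotate(batch))           # S6: crowdsourced labeling
        model = train(labeled)                    # S7: retrain on the enlarged set
        history.append(evaluate(model))
    return model, labeled, history
```

Passing the four stages in as callables keeps the loop independent of any particular model or labeling platform, so each stage can be replaced or mocked separately.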
The present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs an active learning based domain entity identification method as described above.
The invention provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes the domain entity identification method based on active learning.
The beneficial effects of the invention are as follows:
1. The invention provides a domain entity recognition system based on active learning. It uses a domain entity recognition framework built from a dilated convolutional neural network based on a BERT pre-trained model together with a conditional random field, uses active learning to select data with high labeling value for annotation by labeling experts, and adds the labeled data to the training data to train the framework. To address the problem of how to select high-value data for annotation, a new active learning selection strategy is implemented so that the unlabeled data set can be fully exploited to improve the entity recognition performance of the framework.
2. The invention provides a domain entity recognition system based on active learning, which uses active learning to select data with high labeling value and adds that data to the training data to train the domain entity recognition framework, so that large amounts of manpower and material resources are no longer required when labeling a data set and the labeling cost is reduced.
3. The invention provides a domain entity recognition system based on active learning, which uses active learning to select data with high labeling value and implements a new selection strategy, so that the unlabeled data set can be fully exploited to improve the entity recognition performance of the framework.
The method is suitable for entity identification and classification in Chinese information processing.
Drawings
FIG. 1 is a schematic diagram of the domain entity recognition system based on active learning according to the first embodiment;
FIG. 2 shows the BERT model according to the third embodiment;
FIG. 3 shows the dilated convolutional neural network model according to the third embodiment;
FIG. 4 is a schematic diagram of the domain entity crowdsourcing labeling platform device according to the sixth embodiment;
FIG. 5 is a schematic diagram of an implementation of the domain entity crowdsourcing labeling platform device according to the sixth embodiment;
FIG. 6 is a graph of the overall experimental results according to the eleventh embodiment;
FIG. 7 (a) is a graph of the overall test results on the F1 index according to the eleventh embodiment;
FIG. 7 (b) is a graph of the test results on the F1 index for each entity type according to the eleventh embodiment.
Detailed Description
In the first embodiment, described with reference to FIG. 1, a domain entity recognition system based on active learning is provided. The recognition system includes a data preprocessing module, a model training module, an active learning module, a domain entity recognition model, and a domain entity crowdsourcing labeling platform device;
the data preprocessing module is used for carrying out data processing on the original text data and sending the text data after the data preprocessing to the model training module and the active learning module;
the model training module is used for mapping and identifying the received data and training and evaluating the domain entity identification model;
and the active learning module is used for driving the domain entity crowdsourcing labeling platform device to carry out marking processing on the received data and sending the data to the domain entity identification model for training.
In practical application, the data preprocessing module acquires the domain text data, cleans the original text, converts the cleaned text into the input format of the model, and splits it into a training set, a verification set and a test set; the training set is further divided into a labeled training set and an unlabeled training set. The model training module constructs the domain entity recognition model from a BERT-based dilated convolutional neural network, maps the input text into real-valued vectors rich in positional, semantic and syntactic features, recognizes the entities in the text, and evaluates the trained model. The active learning module selects data with high labeling value from the unlabeled training set, has it labeled on the domain entity crowdsourcing labeling platform device, adds the labeled data to the labeled data set, and trains the model iteratively; iteration stops when the performance of the model reaches the preset condition. The domain entity crowdsourcing labeling platform device is a data labeling platform that provides the active learning module with domain-specific entity labeling tasks, supporting entity labeling where entity labels are missing in a specific domain.
This embodiment provides a domain entity recognition system based on active learning, which uses active learning to select data with high labeling value and adds that data to the training data to train the domain entity recognition framework, so that large amounts of manpower and material resources are no longer required when labeling a data set and the labeling cost is reduced.
This embodiment provides a domain entity recognition system based on active learning, which uses active learning to select data with high labeling value and implements a new selection strategy, so that the unlabeled data set can be fully exploited to improve the entity recognition performance of the framework.
In the second embodiment, the data preprocessing module in the active learning-based domain entity recognition system of the first embodiment is illustrated. The data preprocessing module includes a data cleaning unit, a data format unit and a data segmentation unit;
the data cleaning unit is used for acquiring text data of the field and removing special characters and pictures in the text;
the data format unit is used for processing an input format;
the data segmentation unit is used for splitting the data into the sets used for model training and evaluation and into the data source for the active learning selection strategy.
In practical application, the data preprocessing module comprises a data cleaning unit, a data format unit and a data segmentation unit. The data cleaning unit removes special characters and pictures from the text; the data format unit produces the input format of the BERT model; the data segmentation unit splits the data for training and evaluating the model and for use as the data source of the selection strategy in active learning.
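The cleaning and segmentation units can be sketched as follows. This is a standard-library-only illustration, not the patent's implementation: the definition of "special characters", the image-placeholder patterns, the 80/10/10 split ratios and the 10% labeled fraction are all assumptions chosen for the example, and `clean` and `split` are hypothetical names.

```python
import random
import re
from typing import List, Sequence, Tuple

# Assumption: image residue looks like a Markdown or HTML image tag.
_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)|<img[^>]*>", re.IGNORECASE)
# Assumption: everything except CJK characters, ASCII letters and digits,
# common punctuation, whitespace and hyphens counts as a special character.
_SPECIAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9，。！？、；：,.!?;:\s-]")


def clean(text: str) -> str:
    """Remove image placeholders and special characters, then tidy whitespace."""
    text = _IMAGE.sub("", text)
    text = _SPECIAL.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def split(
    items: Sequence[str],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    labeled_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str], List[str]]:
    """Return (labeled_train, unlabeled_train, validation, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # remainder goes to the test set
    n_labeled = max(1, int(len(train) * labeled_fraction))
    return train[:n_labeled], train[n_labeled:], val, test
```

The unlabeled part of the training set is what the active learning selection strategy later draws from.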
In the third embodiment, described with reference to FIG. 2 and FIG. 3, the model training module in the active learning-based domain entity recognition system of the first embodiment is illustrated. The model training module includes a model recognition unit and a model evaluation unit;
the model identification unit is used for identifying the received data;
the model evaluation unit is used for evaluating the trained domain entity recognition model according to the evaluation standard.
In practical application, the model training module comprises a model recognition unit and a model evaluation unit.
The model recognition unit uses a BERT-based dilated convolutional neural network model to recognize the entities in the text; the BERT model is shown in FIG. 2 and the dilated convolutional neural network model is shown in FIG. 3. The model evaluation unit evaluates the trained model; it is also used to evaluate the model trained with the BERT-based dilated convolutional neural network and to generate a performance comparison report against other baseline active learning methods.
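A dilated convolution (sometimes translated literally as "hole" convolution) applies its kernel to inputs spaced `dilation` positions apart, which widens the receptive field without adding weights. The pure-Python sketch below shows the operation in one dimension only; it illustrates the general technique and is not the patent's network, whose layer sizes and dilation rates are not given in this text. The function name `dilated_conv1d` is hypothetical.

```python
from typing import List, Sequence


def dilated_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """'Valid' 1-D cross-correlation whose kernel taps are `dilation` apart."""
    span = (len(kernel) - 1) * dilation + 1   # receptive field of one output
    return [
        sum(w * x[i + k * dilation] for k, w in enumerate(kernel))
        for i in range(len(x) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
dense = dilated_conv1d(signal, [1, 1, 1], dilation=1)   # each output sees 3 inputs
sparse = dilated_conv1d(signal, [1, 1, 1], dilation=2)  # each output spans 5 inputs
```

With the same three weights, the dilated version covers five input positions per output, which is why stacking such layers lets a sequence labeler see wide context cheaply.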
In the fourth embodiment, the evaluation criteria in the active learning-based domain entity recognition system of the third embodiment are illustrated. The evaluation criteria include precision, recall, and the F1 score.
In the fifth embodiment, the active learning module in the active learning-based domain entity recognition system of the first embodiment is illustrated. The active learning module includes a strategy selection unit, a data labeling unit and an iterative training unit;
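For entity recognition these three criteria are usually computed over whole entities. The sketch below assumes that the first criterion is precision in the usual sense (the quantity that F1 balances against recall) and that a predicted entity counts as correct only when its span and type both match exactly; the text does not state either convention, and `precision_recall_f1` is a hypothetical name.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity type)


def precision_recall_f1(
    gold: Iterable[Span], predicted: Iterable[Span]
) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 under exact-match scoring."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

F1 is the harmonic mean of the other two values, so it is high only when both are high.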
the strategy selection unit is used for selecting data with labeling value;
the data labeling unit is used for driving the domain entity crowdsourcing labeling platform device to label data;
the iterative training unit is used for setting the number of iterative training times.
In the sixth embodiment, the entity crowdsourcing labeling platform device in the active learning-based domain entity recognition system of the first embodiment is illustrated. The entity crowdsourcing labeling platform device includes a user layer, a display layer, a service layer, a data layer, a database, and a data analysis layer;
the user layer is used for collecting text data to be marked and sending the text data to the service layer through the display layer;
the service layer is used for calling an algorithm in the database to process the received data and sending the processed data to the data analysis layer;
the data analysis layer is used for labeling the received data to obtain labeled data.
In practical application, the network security entity labeling system adopts a customized B/S (browser/server) architecture, as shown in FIG. 4. The data analysis layer is a fully independent entity annotation learning model; the entity labeling system only needs to call the interface provided by the analysis layer model to obtain annotation data and deliver it to the user. From the user's point of view the model is a black box: users need no background in machine learning and need not tune parameters manually; importing data is enough to optimize their own labeling model. System developers can likewise complete their development tasks without a machine learning background.
The data layer is responsible for completing the data interaction, inquiry and other works between the service layer and the database, realizes the related functions of inserting, modifying, deleting and the like on the data, and also provides some advanced inquiry functions. Meanwhile, the data layer defines a special database interface, and when the database operation is needed in the business logic, the database operation can be easily completed by calling the interface only through dependency injection.
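The relationship between the data layer's database interface and the business logic that receives it by dependency injection can be sketched as follows. The class and method names (`AnnotationStore`, `InMemoryStore`, `LabelingService`, `insert`, `query`, `delete`) are hypothetical and chosen for illustration; the platform's actual interfaces are not given in this text.

```python
from typing import Dict, List, Optional, Protocol


class AnnotationStore(Protocol):
    """Database interface that the data layer exposes (hypothetical)."""

    def insert(self, sentence_id: int, tags: List[str]) -> None: ...
    def query(self, sentence_id: int) -> Optional[List[str]]: ...
    def delete(self, sentence_id: int) -> None: ...


class InMemoryStore:
    """Stand-in for a real database, useful for tests."""

    def __init__(self) -> None:
        self._rows: Dict[int, List[str]] = {}

    def insert(self, sentence_id: int, tags: List[str]) -> None:
        self._rows[sentence_id] = list(tags)

    def query(self, sentence_id: int) -> Optional[List[str]]:
        return self._rows.get(sentence_id)

    def delete(self, sentence_id: int) -> None:
        self._rows.pop(sentence_id, None)


class LabelingService:
    """Business logic; it receives its store instead of creating one."""

    def __init__(self, store: AnnotationStore) -> None:
        self._store = store  # injected dependency

    def save(self, sentence_id: int, tags: List[str]) -> None:
        if not tags:
            raise ValueError("an annotation needs at least one tag")
        self._store.insert(sentence_id, tags)

    def load(self, sentence_id: int) -> Optional[List[str]]:
        return self._store.query(sentence_id)
```

Because `LabelingService` depends only on the interface, a real database adapter can replace `InMemoryStore` without touching the business logic.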
The service layer is responsible for receiving the data sent by the front end and, after processing, passing it on to the database. The service layer also provides several service controller classes that respond to the various requests of the front end. In addition, so that the front end can work more conveniently with the data sent from the back end, the service layer defines a standard interface that constrains the back-end response data.
The display layer is the front end of the network security entity labeling system. While rendering the user-layer pages, it receives the data standardized by the back end, processes it again, and applies the page logic before delivering the pages to the user. The back end therefore does not need to spend effort handling the scattered data of a traditional front end; that work is done by the front-end engine. The display-layer logic also makes the user interaction pages more flexible, and the display layer contributes substantially to a user-friendly labeling function and convenient interaction.
In the seventh embodiment, the entity crowdsourcing labeling platform device in the active learning-based domain entity recognition system of the sixth embodiment is illustrated. The overall architecture of the entity crowdsourcing labeling platform device is implemented with a B/S architecture;
the display layer is constructed using the Angular framework, the NG-ZORRO component library and the ECharts visualization library.
In practical application, this embodiment constructs the display layer with the Angular framework, the NG-ZORRO component library and the ECharts visualization library; the display layer is responsible for processing the back-end data and rendering the pages.
Compared with the prior art, the labeling platform device has the following advantages: (1) The strategy selection unit in the data analysis layer of the device can select data with labeling value, whereas existing traditional labeling platforms cannot screen data by labeling value. (2) The model evaluation unit in the device evaluates the trained domain entity recognition model; whereas a traditional labeling platform simply uses a model, this labeling platform device can evaluate different models. (3) The data preprocessing module in the device includes a data cleaning unit, a data format unit and a data segmentation unit, whereas a traditional labeling platform can only import data that has already been cleaned beforehand.
In the eighth embodiment, an active learning-based domain entity recognition method is provided. The method is implemented with the active learning-based domain entity recognition system of any one of the first to seventh embodiments, and comprises the following steps:
S1, using the data preprocessing module to clean the original text data and obtain the input data of the domain entity recognition model;
S2, splitting the input-format data of the domain entity recognition model into training set data, verification set data and test set data;
S3, dividing the training set data into labeled training set data and unlabeled training set data;
S4, training the domain entity recognition model with the model training module on the labeled training set data to obtain a trained domain entity recognition model;
S5, analyzing the test set data with the trained domain entity recognition model to obtain analysis result data;
S6, using the active learning module and the domain entity crowdsourcing labeling platform device to label the unlabeled training set data, and adding the newly labeled data to the labeled training set data to obtain a new labeled training data set;
S7, training the domain entity recognition model with the analysis result data and the new labeled training data set, and stopping when the performance of the domain entity recognition model reaches the preset condition, to obtain the recognized domain entities.
In practical application, in the data preprocessing module, the data cleaning unit removes special characters and pictures from the text. The data format unit performs word segmentation on the cleaned data and uses one-hot encoding as the original encoding of the data, encodes the position of each word in the sentence with trigonometric functions, and initializes all segment codes of a sentence to 1; these three encodings are then used as the input of the BERT model. The data segmentation unit uses the scikit-learn (sklearn) machine learning framework to split the data into a training set, a verification set and a test set; the training set is further divided into a labeled data set and an unlabeled data set for the active learning selection strategy.
In the model training module, the model is trained with the processed data as input and optimized with the Adam algorithm.
In the active learning module, the probability of each label in each sentence of the data is calculated, and the lowest probability value is taken as the uncertainty of the sentence; each sentence is decoded with the Viterbi algorithm of the conditional random field to obtain its confidence; and the entities in the labeled data set are extracted as an initial dictionary, which is then matched against the characters of each sentence to calculate a normalized match count per sentence. By using the uncertainty, the confidence and the match count as the indices of the selection strategy and sorting in descending order, sentences with high labeling value can be selected. The selected data is then labeled on the domain entity crowdsourcing labeling platform, which provides labeled data for further iterative optimization in active learning.
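The selection strategy can be sketched roughly as below. The text names three indices but does not say how they are combined, so several parts of this example are assumptions made for illustration: the decoding confidence is supplied as a number in [0, 1] rather than computed by Viterbi decoding, match counts are normalized by the largest count in the pool, a low minimum label probability and a low confidence are both treated as signs of high labeling value, and the three terms are summed with equal weight. The function names are hypothetical.

```python
from typing import List, Sequence, Set


def min_label_probability(token_probs: Sequence[float]) -> float:
    """Lowest per-token label probability in one sentence."""
    return min(token_probs)


def match_ratios(sentences: Sequence[str], dictionary: Set[str]) -> List[float]:
    """Dictionary hits per sentence, scaled by the largest count in the pool."""
    counts = [sum(s.count(entity) for entity in dictionary) for s in sentences]
    top = max(counts, default=0)
    return [c / top if top else 0.0 for c in counts]


def select_for_labeling(
    sentences: Sequence[str],
    token_probs: Sequence[Sequence[float]],
    confidences: Sequence[float],
    dictionary: Set[str],
    k: int,
) -> List[str]:
    """Rank sentences by an assumed equal-weight score and return the top k."""
    matches = match_ratios(sentences, dictionary)
    scores = [
        (1.0 - min_label_probability(p)) + (1.0 - c) + m
        for p, c, m in zip(token_probs, confidences, matches)
    ]
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order[:k]]
```

In a fuller version the weights of the three terms would be tunable, and the confidence would come from the conditional random field decoder rather than being passed in.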
In a third embodiment, a method for identifying a domain entity based on active learning according to the first embodiment is provided.
In a tenth aspect, the present embodiment provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes the domain entity identification method based on active learning as described in the eighth aspect.
An eleventh embodiment is a verification description of the domain entity identification system based on active learning according to any one of the first to seventh embodiments. Experimental results show that the accuracy and recall rate obtained by the method provided by the invention on the network security entity data set are better than those obtained by the baseline method.
To ensure the reliability of the test, the effectiveness of the selection strategy of the invention is also verified with a BiLSTM deep learning model. The test results are shown in figure 6: the security entity recognition performance of the BiLSTM-CRF model is poorer than that of BERT-RDCNN-CRF. However, the results show that, whether the BERT model or the BiLSTM model is used, the proposed method outperforms the baseline selection strategies in both accuracy and recall.
To balance accuracy and recall, we also calculated the F1 values of the different methods. The overall F1 test results are shown in fig. 7 (a), and the F1 test results for each entity type are shown in fig. 7 (b). The figures show that the proposed selection strategy outperforms the baseline methods with both the BERT model and the BiLSTM model, which further verifies the effectiveness of the proposed method. For the BERT-based model, the F1 value of the proposed method is 88.54%, compared with 87.6%, 85.2% and 87.1% for the LTC, MTP and RANDOM methods, an improvement of 0.94, 3.34 and 1.44 percentage points respectively. For the BiLSTM model, the F1 value of the proposed method is 80.17%, compared with 78.81%, 77% and 77.24% for the LTC, MTP and RANDOM methods, an improvement of 1.36, 3.17 and 2.93 percentage points respectively.
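For reference, the F1 value is the harmonic mean of accuracy (precision) and recall, and the reported improvements are plain differences between F1 values. The short Python sketch below recomputes those differences from the figures quoted above. The helper names `f1_score` and `gains` are chosen for this illustration only, and the precision and recall passed to `f1_score` in the last line are arbitrary example values, not results from the experiments.

```python
from typing import Dict


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def gains(results: Dict[str, float]) -> Dict[str, float]:
    """Difference, in percentage points, between 'proposed' and each baseline."""
    return {name: round(results["proposed"] - value, 2)
            for name, value in results.items() if name != "proposed"}


# F1 values (%) as quoted in the text.
BERT_F1 = {"proposed": 88.54, "LTC": 87.6, "MTP": 85.2, "RANDOM": 87.1}
BILSTM_F1 = {"proposed": 80.17, "LTC": 78.81, "MTP": 77.0, "RANDOM": 77.24}

bert_gains = gains(BERT_F1)
bilstm_gains = gains(BILSTM_F1)

# Arbitrary example values, only to show the formula in use.
example_f1 = f1_score(0.5, 1.0)
```

The recomputed differences agree with the improvements stated in the text for both models.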
Therefore, the active learning-based domain entity recognition system of this embodiment combines active learning with deep learning and addresses the problem of the complex structure of Chinese domain entities. It also provides a new selection strategy, which gives good recognition of Chinese entities in a specific domain and reduces the cost of labeling entities in that domain.
This embodiment provides an active learning-based domain entity identification system. It uses a domain entity identification framework built from a dilated (atrous) convolutional neural network on top of a BERT pre-trained model, together with a conditional random field. An active learning method selects data with high labeling value for labeling by annotation experts, and the labeled data are added to the training data to train the domain entity identification framework. To address the problem of how to select high-value data for annotation, a new selection strategy for active learning is implemented, so that the data set can be fully labeled to improve the entity identification performance of the framework.
The above description is only an example of the present invention and is not intended to limit it; various modifications and changes will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention is intended to be included within the scope of the claims of the present invention.
Claims (10)
1. The domain entity identification system based on active learning is characterized by comprising a data preprocessing module, a model training module, an active learning module, a domain entity identification model and a domain entity crowdsourcing labeling platform device;
the data preprocessing module is used for carrying out data processing on the original text data and sending the text data after the data preprocessing to the model training module and the active learning module;
the model training module is used for mapping and identifying the received data and training and evaluating the domain entity identification model;
and the active learning module is used for driving the domain entity crowdsourcing labeling platform device to carry out marking processing on the received data and sending the data to the domain entity identification model for training.
2. The active learning-based domain entity recognition system of claim 1, wherein the data preprocessing module comprises a data cleaning unit, a data format unit and a data segmentation unit;
the data cleaning unit is used for acquiring text data of the field and removing special characters and pictures in the text;
the data format unit is used for processing an input format;
the data segmentation unit is used for providing the data source for the selection strategy.
3. The active learning-based domain entity recognition system of claim 1, wherein the model training module comprises a model recognition unit and a model evaluation unit;
the model identification unit is used for identifying the received data;
the model evaluation unit is used for evaluating the trained domain entity recognition model according to the evaluation standard.
4. The active learning-based domain entity identification system of claim 3, wherein the evaluation criteria include accuracy, recall and F1 value.
5. The active learning-based domain entity recognition system of claim 1, wherein the active learning module comprises a policy selection unit, a data annotation unit and an iterative training unit;
the strategy selection unit is used for selecting data with labeling value;
the data labeling unit is used for driving the domain entity crowdsourcing labeling platform device to label data;
the iterative training unit is used for setting the number of iterative training times.
6. The active learning-based domain entity identification system of claim 1, wherein the domain entity crowdsourcing labeling platform device comprises a user layer, a display layer, a business layer, a data layer, a database, and a data analysis layer;
the user layer is used for collecting text data to be marked and sending the text data to the service layer through the display layer;
the business layer is used for calling an algorithm in the database to process the received data and sending the processed data to the data analysis layer;
the data analysis layer is used for labeling the received data to obtain labeled data.
7. The active learning-based domain entity identification system of claim 6, wherein the overall architecture of the domain entity crowdsourcing labeling platform device is implemented using a B/S architecture;
the display layer is constructed using the Angular framework, the ng-zorro component library and the ECharts visualization library.
8. An active learning-based domain entity identification method, characterized in that the identification method is implemented by adopting the active learning-based domain entity identification system according to any one of claims 1-7, and the identification method comprises the following steps:
S1, using the data preprocessing module to clean the original text data to obtain input data for the domain entity identification model;
S2, dividing the input data of the domain entity identification model to obtain training set data, verification set data and test set data;
S3, dividing the training set data to obtain labeled training set data and unlabeled training set data;
S4, training the domain entity recognition model with the model training module according to the labeled training set data to obtain a trained domain entity recognition model;
S5, analyzing the test set data with the trained domain entity recognition model to obtain analysis result data;
S6, using the active learning module and the domain entity crowdsourcing labeling platform device to label the unlabeled training set data, and adding the newly labeled data to the labeled training set data to obtain a new labeled training data set;
and S7, training the domain entity recognition model with the analysis result data and the new labeled training data set until the performance of the domain entity recognition model reaches the preset condition, then stopping training to obtain the recognized domain entities.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs a domain entity identification method based on active learning as claimed in claim 8.
10. A computer device, characterized in that the device comprises a memory and a processor, wherein the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes the active learning-based domain entity identification method as claimed in claim 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310598745.1A CN116776881A (en) | 2023-05-25 | 2023-05-25 | Active learning-based domain entity identification system and identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116776881A (en) | 2023-09-19 |
Family
ID=87987020
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310598745.1A Pending CN116776881A (en) | 2023-05-25 | 2023-05-25 | Active learning-based domain entity identification system and identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116776881A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117436449A (en) * | 2023-11-01 | 2024-01-23 | 哈尔滨工业大学 | Crowd-sourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109189942B (en) | Construction method and device of patent data knowledge graph | |
CN110727779A (en) | Question-answering method and system based on multi-model fusion | |
CN111027327A (en) | Machine reading understanding method, device, storage medium and device | |
CN110781276A (en) | Text extraction method, device, equipment and storage medium | |
CN111783394A (en) | Training method of event extraction model, event extraction method, system and equipment | |
CN110222194B (en) | Data chart generation method based on natural language processing and related device | |
CN113515600B (en) | Automatic calculation method for spatial analysis based on metadata | |
CN115470338B (en) | Multi-scenario intelligent question answering method and system based on multi-path recall | |
CN112395421B (en) | Course label generation method and device, computer equipment and medium | |
CN111782793A (en) | Intelligent customer service processing method, system and equipment | |
CN114547072A (en) | Method, system, equipment and storage medium for converting natural language query into SQL | |
CN113064995A (en) | Text multi-label classification method and system based on deep learning of images | |
CN110807086A (en) | Text data labeling method and device, storage medium and electronic equipment | |
CN111241209A (en) | Method and apparatus for generating information | |
CN111859969A (en) | Data analysis method and device, electronic equipment and storage medium | |
CN115099239B (en) | Resource identification method, device, equipment and storage medium | |
CN116776881A (en) | Active learning-based domain entity identification system and identification method | |
CN115803734A (en) | Natural language enrichment using action interpretation | |
CN115099233A (en) | Semantic analysis model construction method and device, electronic equipment and storage medium | |
CN113988071A (en) | Intelligent dialogue method and device based on financial knowledge graph and electronic equipment | |
CN113935314A (en) | Abstract extraction method, device, terminal equipment and medium based on heteromorphic graph network | |
CN117520503A (en) | Financial customer service dialogue generation method, device, equipment and medium based on LLM model | |
CN116433934A (en) | Multi-mode pre-training method for generating CT image representation and image report | |
CN112883183B (en) | Method for constructing multi-classification model, intelligent customer service method, and related device and system | |
CN115269998A (en) | Information recommendation method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||