CN116776881A - Active learning-based domain entity identification system and identification method - Google Patents

Active learning-based domain entity identification system and identification method

Info

Publication number
CN116776881A
Authority
CN
China
Prior art keywords
data
domain entity
model
training
active learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310598745.1A
Other languages
Chinese (zh)
Inventor
宋荣伟
刘蜜
谢博
蒲天应
申国伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Workpower Telecom Technology Co ltd
Guizhou University
Original Assignee
Shanghai Workpower Telecom Technology Co ltd
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Workpower Telecom Technology Co ltd, Guizhou University
Priority to CN202310598745.1A
Publication of CN116776881A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

An active learning-based domain entity recognition system and recognition method, relating to the technical field of entity recognition and classification. The invention addresses the problem that entity recognition in specific domains lacks high-quality labeled data sets, which makes labeling very costly and recognition accuracy low. The recognition system comprises a data preprocessing module, a model training module, an active learning module, a domain entity recognition model and a domain entity crowdsourcing labeling platform device. The data preprocessing module processes the original text data and sends the preprocessed text data to the model training module and the active learning module; the model training module maps and recognizes the received data and trains and evaluates the domain entity recognition model; the active learning module drives the domain entity crowdsourcing labeling platform device to label the received data and sends the labeled data to the domain entity recognition model for training. The method is suitable for entity recognition and classification in Chinese information processing.

Description

Active learning-based domain entity identification system and identification method
Technical Field
The invention relates to the technical field of entity identification and classification, in particular to the technical field of entity identification and classification in Chinese information processing.
Background
With the development of information technology, we have entered the age of big data and artificial intelligence. Traditional enterprises have accumulated a great deal of data and experiential knowledge in the course of informatization, which supports the improvement of their business capabilities. The knowledge graph is the technical link connecting big data and artificial intelligence, and a cornerstone of the move from perceptual intelligence to cognitive intelligence. To turn data assets into knowledge assets, shallow understanding and analysis of data is achieved through data digitization, data management, data mining and related technologies, and deeper knowledge mining, analysis and application are then achieved through knowledge discovery, knowledge fusion and the like.
For the large amount of unstructured text data accumulated by enterprises, the aggregated data can be deeply understood and organized in the form of a knowledge graph. When constructing a knowledge graph for a specific domain, a domain knowledge ontology model is built, and extracting domain entities and entity relations is a key step. Domain entity identification is therefore a very important task in knowledge graph construction. For example, when constructing a security knowledge graph in the field of network security, the goal is to extract security entities, such as attack organizations, enterprises, vulnerabilities and software, from network security text data. Likewise, when constructing a product development knowledge graph in the manufacturing field, the goal is to extract entities related to product components, such as product names, design criteria and materials, from the text data involved in the product development and design process.
Compared with entity identification in the general domain, entity identification in a specific domain lacks labeled data and has a high labeling cost, so identification accuracy is low. Domain-specific entity identification tasks are often complex, involving large numbers of mixed-type entities and nested entities. Furthermore, deep learning relies on large-scale labeled data, yet large-scale, high-quality entity annotation data is lacking in specific domains. Labeling a data set therefore requires a great deal of manpower and material resources, which makes labeling very costly.
In summary, entity identification in a specific domain lacks high-quality labeled data sets, which makes labeling very costly and identification accuracy low.
Disclosure of Invention
The invention addresses the problem that entity identification in existing specific domains lacks high-quality labeled data sets, which makes labeling very costly and identification accuracy low.
In order to achieve the above object, the present invention provides the following solutions:
the invention provides a domain entity identification system based on active learning, which comprises a data preprocessing module, a model training module, an active learning module, a domain entity identification model and a domain entity crowdsourcing labeling platform device;
the data preprocessing module is used for carrying out data processing on the original text data and sending the text data after the data preprocessing to the model training module and the active learning module;
the model training module is used for mapping and identifying the received data and training and evaluating the domain entity identification model;
and the active learning module is used for driving the domain entity crowdsourcing labeling platform device to carry out marking processing on the received data and sending the data to the domain entity identification model for training.
Further, in a preferred embodiment, the data preprocessing module includes a data cleaning unit, a data format unit and a data segmentation unit;
the data cleaning unit is used for acquiring text data of the field and removing special characters and pictures in the text;
the data format unit is used for processing an input format;
the data segmentation unit is used for splitting the data and providing the data source for the selection strategy.
Further, in a preferred embodiment, the model training module includes a model identifying unit and a model evaluating unit;
the model identification unit is used for identifying the received data;
the model evaluation unit is used for evaluating the trained domain entity recognition model according to the evaluation standard.
Further, in a preferred embodiment, the evaluation criteria include accuracy, recall, and F1 metric.
Further, in a preferred embodiment, the active learning module includes a strategy selection unit, a data labeling unit, and an iterative training unit;
the strategy selection unit is used for selecting data with labeling value;
the data labeling unit is used for driving the domain entity crowdsourcing labeling platform device to label data;
the iterative training unit is used for setting the number of training iterations.
Further, in a preferred embodiment, the entity crowdsourcing labeling platform device includes a user layer, a display layer, a service layer, a data layer, a database, and a data analysis layer;
the user layer is used for collecting text data to be marked and sending the text data to the service layer through the display layer;
the service layer is used for calling an algorithm in the database to process the received data and sending the processed data to the data analysis layer;
the data analysis layer is used for labeling the received data to obtain labeled data.
Further, in a preferred embodiment, the overall architecture of the entity crowdsourcing annotation platform device is implemented by adopting a B/S architecture;
the display layer is constructed as follows: the display layer is built using the Angular framework, the NG-ZORRO component library and the ECharts visualization library.
The invention also provides a domain entity identification method based on active learning, which is realized by adopting the domain entity identification system based on active learning, and comprises the following steps:
S1, using the data preprocessing module to clean the original text data and obtain input-format data for the domain entity identification model;
S2, splitting the input-format data of the domain entity identification model to obtain training set data, verification set data and test set data;
S3, dividing the training set data to obtain labeled training set data and unlabeled training set data;
S4, the model training module training the domain entity identification model on the labeled training set data to obtain a trained domain entity identification model;
S5, the trained domain entity identification model analyzing the test set data to obtain analysis result data;
S6, the active learning module using the domain entity crowdsourcing labeling platform device to label the unlabeled training set data, and adding the newly labeled data to the labeled training set data to obtain a new labeled training data set;
S7, training the domain entity identification model with the analysis result data and the new labeled training data set until the performance of the domain entity identification model reaches the preset condition, then stopping training to obtain the identified domain entities.
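Steps S4 to S7 can be read as a pool-based active learning loop. The following is only a minimal sketch under stated assumptions: `train`, `evaluate`, `score` and `annotate` are hypothetical callables standing in for the model training module, the model evaluation unit, the selection strategy and the crowdsourcing labeling platform, and `batch_size`, `target_f1` and `max_rounds` are illustrative parameters that the source does not specify.

```python
from typing import Callable, List, Tuple

Sentence = str
Labeled = Tuple[Sentence, List[str]]  # (sentence, one tag per character)


def active_learning_loop(
    labeled: List[Labeled],
    unlabeled: List[Sentence],
    train: Callable[[List[Labeled]], object],
    evaluate: Callable[[object], float],
    score: Callable[[object, Sentence], float],
    annotate: Callable[[List[Sentence]], List[Labeled]],
    batch_size: int = 2,
    target_f1: float = 0.9,
    max_rounds: int = 10,
):
    """Hypothetical sketch of S4-S7: train, evaluate, select, annotate, repeat."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    model = train(labeled)               # S4: train on the labeled set
    history = [evaluate(model)]          # S5: evaluate the trained model
    for _ in range(max_rounds):
        # Stop when the preset condition is met or the pool is exhausted.
        if history[-1] >= target_f1 or not unlabeled:
            break
        # Rank the unlabeled pool by labeling value, highest first.
        ranked = sorted(unlabeled, key=lambda s: score(model, s), reverse=True)
        batch = ranked[:batch_size]
        labeled.extend(annotate(batch))  # S6: label and merge the batch
        unlabeled = [s for s in unlabeled if s not in batch]
        model = train(labeled)           # S7: retrain on the enlarged set
        history.append(evaluate(model))
    return model, labeled, history
```

Passing the four callables in keeps the loop independent of any particular model or labeling platform, so each can be swapped or mocked separately.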
The present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs an active learning based domain entity identification method as described above.
The invention provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes the domain entity identification method based on active learning.
The beneficial effects of the invention are as follows:
1. The invention provides a domain entity recognition system based on active learning. It uses a domain entity recognition framework built from a dilated convolutional neural network on top of a BERT pre-trained model together with a conditional random field, uses active learning to select data with high labeling value for annotation by labeling experts, and adds that data to the training data to train the framework. To address the problem of how to select high-value data for annotation, a new selection strategy for active learning is implemented, so that the unlabeled data set can be fully exploited to improve the entity recognition performance of the framework.
2. The invention provides a domain entity recognition system based on active learning, which selects data with high labeling value by active learning and adds it to the training data to train the domain entity recognition framework, so that labeling a data set no longer requires a large amount of manpower and material resources, and the labeling cost is reduced.
3. The invention provides a domain entity recognition system based on active learning, which selects data with high labeling value by active learning and implements a new selection strategy, so that the unlabeled data set can be fully utilized to improve the entity recognition performance of the framework.
The method is suitable for entity identification and classification in Chinese information processing.
Drawings
FIG. 1 is a schematic diagram of the active learning-based domain entity identification system according to embodiment one;
FIG. 2 is the BERT model according to embodiment three;
FIG. 3 is the dilated convolutional neural network model according to embodiment three;
FIG. 4 is a schematic diagram of the domain entity crowdsourcing labeling platform device according to embodiment six;
FIG. 5 is a schematic diagram of an implementation of the domain entity crowdsourcing labeling platform device according to embodiment six;
FIG. 6 is a graph of the overall experimental results of embodiment eleven;
FIG. 7 (a) is a graph of the overall test results on the F1 index according to embodiment eleven;
FIG. 7 (b) is a graph of the F1 test results for each entity type according to embodiment eleven.
Detailed Description
Embodiment one. Referring to FIG. 1, this embodiment provides a domain entity recognition system based on active learning, where the recognition system includes a data preprocessing module, a model training module, an active learning module, a domain entity recognition model, and a domain entity crowdsourcing labeling platform device;
the data preprocessing module is used for carrying out data processing on the original text data and sending the text data after the data preprocessing to the model training module and the active learning module;
the model training module is used for mapping and identifying the received data and training and evaluating the domain entity identification model;
and the active learning module is used for driving the domain entity crowdsourcing labeling platform device to carry out marking processing on the received data and sending the data to the domain entity identification model for training.
In practical application, the data preprocessing module acquires domain text data and cleans the original text, converts the cleaned text into the model input format, and splits it into a training set, a verification set and a test set; the training set is further divided into a labeled training set and an unlabeled training set. The model training module constructs a domain entity recognition model based on BERT and a dilated convolutional neural network, maps the input text into real-valued vectors rich in positional, semantic and syntactic features, identifies the entities in the text, and evaluates the trained model. The active learning module selects data with high labeling value from the unlabeled training set, has it labeled using the domain entity crowdsourcing labeling platform device, adds the labeled data to the labeled data set, and iteratively trains the model; iteration stops when the performance of the model reaches the preset condition. The entity crowdsourcing labeling platform device is a data labeling platform that serves the specific entity labeling tasks of the active learning module and supports entity labeling where entity labels are missing in a specific domain.
This embodiment provides a domain entity recognition system based on active learning, which selects data with high labeling value by active learning and adds it to the training data to train the domain entity recognition framework, so that labeling a data set no longer requires a large amount of manpower and material resources, and the labeling cost is reduced.
This embodiment provides a domain entity recognition system based on active learning, which selects data with high labeling value by active learning and implements a new selection strategy, so that the unlabeled data set can be fully utilized to improve the entity recognition performance of the framework.
Embodiment two. This embodiment further illustrates the data preprocessing module in the active learning-based domain entity recognition system of embodiment one, wherein the data preprocessing module includes a data cleaning unit, a data format unit and a data segmentation unit;
the data cleaning unit is used for acquiring text data of the field and removing special characters and pictures in the text;
the data format unit is used for processing an input format;
the data segmentation unit is used for splitting the data and providing the data source for the selection strategy.
In practical application, the data preprocessing module comprises a data cleaning unit, a data format unit and a data segmentation unit. The data cleaning unit removes special characters and pictures from the text; the data format unit converts the text into the input format of the BERT model; the data segmentation unit splits the data for model training and evaluation and provides the data source for the selection strategy in active learning.
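The cleaning and segmentation units can be illustrated with a short sketch. It is not the patented implementation: the definition of "special characters", the assumption that pictures appear as HTML `<img>` tags or Markdown image links, the split ratios and the `labeled_fraction` parameter are all assumptions made for illustration, and the sketch uses only the Python standard library.

```python
import random
import re
from typing import Dict, Iterable, List

# Assumption: pictures appear as HTML <img> tags or Markdown image links.
_IMAGE = re.compile(r"<img[^>]*>|!\[[^\]]*\]\([^)]*\)")
# Assumption: anything that is not a CJK character, an ASCII letter or digit,
# whitespace or common punctuation counts as a "special character".
_SPECIAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s,.;:!?，。；：！？、()（）-]")


def clean_text(raw: str) -> str:
    """Remove pictures and special characters, then collapse whitespace."""
    text = _IMAGE.sub("", raw)
    text = _SPECIAL.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def split_dataset(
    items: Iterable,
    ratios=(0.8, 0.1, 0.1),
    labeled_fraction: float = 0.2,
    seed: int = 0,
) -> Dict[str, List]:
    """Shuffle, split into train/val/test, then split train into
    an initial labeled set and an unlabeled pool for active learning."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    n_labeled = max(1, int(len(train) * labeled_fraction))
    return {
        "labeled": train[:n_labeled],
        "unlabeled": train[n_labeled:],
        "val": val,
        "test": test,
    }
```

Fixing the random seed makes the split reproducible, which matters when selection strategies are compared on the same unlabeled pool.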
Embodiment three. Referring to FIG. 2 and FIG. 3, this embodiment further illustrates the model training module in the active learning-based domain entity recognition system of embodiment one, wherein the model training module includes a model identification unit and a model evaluation unit;
the model identification unit is used for identifying the received data;
the model evaluation unit is used for evaluating the trained domain entity recognition model according to the evaluation standard.
In practical application, the model training module comprises a model identification unit and a model evaluation unit, wherein
the model identification unit uses a BERT-based dilated convolutional neural network model to identify the entities in the text; the BERT model is shown in FIG. 2 and the dilated convolutional neural network model in FIG. 3. The model evaluation unit evaluates the trained model; it also evaluates the model trained with the BERT-based dilated convolutional neural network and generates a performance comparison report against other baseline active learning methods.
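The dilated ("hole" or "cavity") convolution referred to above widens the context seen by each output position without adding parameters, by skipping `dilation - 1` inputs between kernel taps. The sketch below shows the operation in one dimension in plain Python; the kernel values and dilation rates are illustrative assumptions, since the source does not give the layer configuration.

```python
from typing import List, Sequence


def dilated_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """Zero-padded ("same") 1-D dilated convolution in cross-correlation form."""
    k = len(kernel)
    assert k % 2 == 1, "an odd kernel width keeps the output aligned with the input"
    half = (k // 2) * dilation
    padded = [0.0] * half + list(x) + [0.0] * half
    out = []
    for i in range(len(x)):
        # Kernel taps are spaced `dilation` positions apart.
        out.append(sum(kernel[j] * padded[i + j * dilation] for j in range(k)))
    return out


def receptive_field(kernel_width: int, dilations: Sequence[int]) -> int:
    """Input positions covered by one output after stacking dilated layers."""
    return 1 + sum((kernel_width - 1) * d for d in dilations)
```

For example, three stacked layers of width 3 with dilations 1, 2 and 4 cover 15 input positions, whereas three undilated layers of the same width cover only 7.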
Embodiment four. This embodiment further illustrates the evaluation criteria in the active learning-based domain entity recognition system of embodiment three, wherein the evaluation criteria include accuracy, recall, and F1 metric.
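The evaluation criteria above can be computed at the entity level as follows. This is a sketch under an assumption: the translated term "accuracy" is taken to mean precision, as is usual for entity recognition, and an entity counts as correct only if its span and type both match. The `Entity` tuple layout is hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, entity type)


def precision_recall_f1(gold: Set[Entity], pred: Set[Entity]):
    """Entity-level precision, recall and F1 with exact span and type match."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high.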
Embodiment five. This embodiment further illustrates the active learning module in the active learning-based domain entity recognition system of embodiment one, wherein the active learning module includes a strategy selection unit, a data labeling unit and an iterative training unit;
the strategy selection unit is used for selecting data with labeling value;
the data labeling unit is used for driving the domain entity crowdsourcing labeling platform device to label data;
the iterative training unit is used for setting the number of training iterations.
Embodiment six. This embodiment further illustrates the entity crowdsourcing labeling platform device in the active learning-based domain entity identification system of embodiment one, wherein the entity crowdsourcing labeling platform device includes a user layer, a display layer, a service layer, a data layer, a database, and a data analysis layer;
the user layer is used for collecting text data to be marked and sending the text data to the service layer through the display layer;
the service layer is used for calling an algorithm in the database to process the received data and sending the processed data to the data analysis layer;
the data analysis layer is used for labeling the received data to obtain labeled data.
In practical application, the network security entity labeling system adopts a customized B/S architecture, as shown in FIG. 4. The data analysis layer is a completely independent entity annotation learning model; the entity labeling system only needs to call the interface provided by the analysis-layer model to obtain annotation data and deliver it to the user. From the user's point of view it is a black box: users need no deep machine learning background and no manual tuning of parameters, and only have to import data to optimize their own labeling model. System developers can likewise complete development tasks without a machine learning background.
The data layer handles data interaction, queries and related work between the service layer and the database, implements insertion, modification, deletion and similar operations on the data, and also provides some advanced query functions. The data layer also defines a dedicated database interface; when business logic needs a database operation, it obtains that interface through dependency injection and simply calls it.
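The dependency-injection arrangement described for the data layer can be sketched as follows. All names here (`AnnotationStore`, `InMemoryStore`, `AnnotationService`) are hypothetical and chosen for illustration; the source does not describe the actual interface or the back-end language.

```python
from typing import Dict, List, Optional, Protocol


class AnnotationStore(Protocol):
    """Hypothetical database interface defined by the data layer."""

    def insert(self, doc_id: str, record: dict) -> None: ...

    def find(self, doc_id: str) -> Optional[dict]: ...

    def delete(self, doc_id: str) -> None: ...


class InMemoryStore:
    """Stand-in for a real database; satisfies AnnotationStore structurally."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def insert(self, doc_id: str, record: dict) -> None:
        self._rows[doc_id] = record

    def find(self, doc_id: str) -> Optional[dict]:
        return self._rows.get(doc_id)

    def delete(self, doc_id: str) -> None:
        self._rows.pop(doc_id, None)


class AnnotationService:
    """Business logic that receives its store through the constructor."""

    def __init__(self, store: AnnotationStore) -> None:
        self._store = store

    def save(self, doc_id: str, text: str, entities: List[dict]) -> None:
        self._store.insert(doc_id, {"text": text, "entities": entities})

    def entity_count(self, doc_id: str) -> int:
        record = self._store.find(doc_id)
        return len(record["entities"]) if record else 0
```

Because the service depends only on the interface, the database implementation can be replaced, or mocked in tests, without touching the business logic.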
The service layer receives the data from front-end interactions and, after processing, passes it on to the database. The service layer also provides a number of service controller classes that respond to the various requests of the front end. In addition, so that the front end can more conveniently handle the data returned by the back end, the service layer defines a standard interface that constrains the back-end response data.
The display layer is the front end of the network security entity labeling system. While rendering the user-layer pages, it receives the data standardized by the back end, processes it again, and applies the logic needed before delivering the page to the user. The back end therefore does not need to spend excessive effort handling the scattered data of a traditional front end; that work is done in the front-end engine. The display-layer logic also makes the user interaction pages more flexible and contributes substantially to user-friendly labeling functions and convenient interaction.
Embodiment seven. This embodiment further illustrates the entity crowdsourcing labeling platform device in the active learning-based domain entity identification system of embodiment six, wherein the overall architecture of the entity crowdsourcing labeling platform device adopts a B/S architecture;
the display layer is constructed as follows: the display layer is built using the Angular framework, the NG-ZORRO component library and the ECharts visualization library.
In practical application, this embodiment builds the display layer using the Angular framework, the NG-ZORRO component library and the ECharts visualization library; the display layer is responsible for processing back-end data and rendering pages.
Compared with the prior art, the labeling platform device has the following advantages: (1) The strategy selection unit in the data analysis layer can select data with labeling value, whereas existing traditional labeling platforms cannot screen data by labeling value. (2) The model evaluation unit evaluates the trained domain entity recognition model; whereas a traditional labeling platform simply uses a model, this device can evaluate different models. (3) The data preprocessing module in the device comprises a data cleaning unit, a data format unit and a data segmentation unit, whereas a traditional labeling platform can only import data after data cleaning has been carried out beforehand.
Embodiment eight. This embodiment provides an active learning-based domain entity recognition method, implemented using the active learning-based domain entity recognition system of any one of embodiments one to seven, and comprising the following steps:
S1, using the data preprocessing module to clean the original text data and obtain input-format data for the domain entity identification model;
S2, splitting the input-format data of the domain entity identification model to obtain training set data, verification set data and test set data;
S3, dividing the training set data to obtain labeled training set data and unlabeled training set data;
S4, the model training module training the domain entity identification model on the labeled training set data to obtain a trained domain entity identification model;
S5, the trained domain entity identification model analyzing the test set data to obtain analysis result data;
S6, the active learning module using the domain entity crowdsourcing labeling platform device to label the unlabeled training set data, and adding the newly labeled data to the labeled training set data to obtain a new labeled training data set;
S7, training the domain entity identification model with the analysis result data and the new labeled training data set until the performance of the domain entity identification model reaches the preset condition, then stopping training to obtain the identified domain entities.
In practical application, this embodiment works as follows. In the data preprocessing module, the data cleaning unit removes special characters and pictures from the text. The data format unit performs word segmentation on the cleaned data and uses one-hot coding as the original encoding of the data; the position of each word in the sentence is encoded with trigonometric functions; all segment codes of a sentence are initialized to 1; finally, the three encodings are used as the input of the BERT model. The data segmentation unit uses the scikit-learn (sklearn) machine learning framework to split the data into a training set, a verification set and a test set; the training set is further divided into a labeled data set and an unlabeled data set for the active learning selection strategy.
In the model training module, the model is trained with the processed data as input and optimized with the Adam algorithm.
In the active learning module, the probability of each label in each sentence is calculated, and the lowest probability value is taken as the uncertainty of the sentence; each sentence is decoded with the Viterbi algorithm of the conditional random field, and the result is taken as its confidence; the entities in the labeled data set are extracted as an initial dictionary, the dictionary is matched against the characters of each sentence, and the number of matches per sentence is counted and normalized. Using the uncertainty, the confidence and the matching number as the selection indexes of the selection strategy and sorting in descending order, sentences with high labeling value can be selected. The selected data is then entity-labeled on the domain entity crowdsourcing labeling platform, providing labeled data for further iterative optimization in active learning.
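The selection strategy described above combines three indexes per sentence. The source does not say how they are combined or oriented, so the sketch below makes several labeled assumptions: uncertainty is taken as one minus the lowest per-token probability, the confidence index as one minus the probability of the Viterbi best path (`path_probs`, supplied by the caller), more dictionary matches are treated as more valuable, and the three terms are summed with equal weights. All function names are hypothetical.

```python
from typing import List, Sequence


def uncertainty(token_probs: Sequence[float]) -> float:
    """Assumption: 1 - lowest probability of any predicted tag in the sentence."""
    return 1.0 - min(token_probs)


def dictionary_matches(sentence: str, dictionary: Sequence[str]) -> int:
    """Count occurrences of known entity strings in the sentence."""
    return sum(sentence.count(entity) for entity in dictionary)


def min_max(values: Sequence[float]) -> List[float]:
    """Normalize values to [0, 1]; all zeros if the values are constant."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_sentences(
    sentences: Sequence[str],
    token_probs: Sequence[Sequence[float]],
    path_probs: Sequence[float],
    dictionary: Sequence[str],
    weights=(1.0, 1.0, 1.0),
) -> List[int]:
    """Return sentence indexes sorted by assumed labeling value, highest first."""
    u = [uncertainty(p) for p in token_probs]
    c = [1.0 - p for p in path_probs]  # assumption: low path probability = valuable
    m = min_max([dictionary_matches(s, dictionary) for s in sentences])
    w_u, w_c, w_m = weights
    scores = [w_u * a + w_c * b + w_m * d for a, b, d in zip(u, c, m)]
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```

The first few indexes returned would form the batch sent to the crowdsourcing labeling platform in each iteration.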
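Embodiment nine. This embodiment provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the active learning-based domain entity identification method of embodiment eight.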
Embodiment ten. This embodiment provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes the active learning-based domain entity identification method of embodiment eight.
Embodiment eleven. This embodiment verifies the active learning-based domain entity identification system of any one of embodiments one to seven. Experimental results show that the accuracy and recall obtained by the proposed method on the network security entity data set are better than those of the baseline methods.
To ensure the reliability of the test, the effectiveness of the proposed selection strategy is also verified with a BiLSTM deep learning model. The test results are shown in FIG. 6: the security entity recognition performance of the BiLSTM-CRF model is poorer than that of BERT-RDCNN-CRF. However, the results also show that, whether the BERT model or the BiLSTM model is used, the proposed method is better than the baseline selection strategies in both accuracy and recall.
To balance accuracy and recall, F1 values were also calculated for the different methods. The overall test results on the F1 index are shown in FIG. 7 (a), and the F1 results for each entity type are shown in FIG. 7 (b). The figures show that the proposed selection strategy is better than the baseline methods for both the BERT model and the BiLSTM model, which further verifies its effectiveness. For the BERT-based model, the F1 value of the proposed method is 88.54%, compared with 87.6%, 85.2% and 87.1% for the LTC, MTP and RANDOM methods, an increase of 0.94, 3.34 and 1.44 percentage points respectively. For the BiLSTM model, the F1 value of the proposed method is 80.17%, compared with 78.81%, 77% and 77.24% for the LTC, MTP and RANDOM methods, an improvement of 1.36, 3.17 and 2.93 percentage points respectively.
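The reported gains are plain differences between F1 values, in percentage points. A quick check, using only the F1 values stated above (the dictionary layout and the `gains` helper are illustrative, not part of the source):

```python
# F1 values (in percent) as reported in the text above.
reported = {
    "BERT": {"proposed": 88.54, "LTC": 87.6, "MTP": 85.2, "RANDOM": 87.1},
    "BiLSTM": {"proposed": 80.17, "LTC": 78.81, "MTP": 77.0, "RANDOM": 77.24},
}


def gains(scores: dict) -> dict:
    """Difference between the proposed method and each baseline, in points."""
    return {
        name: round(scores["proposed"] - value, 2)
        for name, value in scores.items()
        if name != "proposed"
    }
```

Calling `gains(reported["BERT"])` and `gains(reported["BiLSTM"])` reproduces the six differences quoted in the text.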
Therefore, the domain entity recognition system based on the active learning, which is disclosed by the embodiment, fuses the active learning and the deep learning, and solves the problem of complex structure of the Chinese domain entity. And a new selection strategy is provided, so that the identification effect of the Chinese entity in the specific field is good, and the labeling cost of the entity in the specific field is reduced.
This embodiment provides an active learning-based domain entity identification system. A domain entity identification framework is built from a BERT pre-training model, a dilated (hole) convolutional neural network and a conditional random field. An active learning method selects data with high labeling value for labeling by annotation experts, and the labeled data are added to the training data to train the framework. To address the problem of how to select high-value data for labeling, a new selection strategy for active learning is implemented, so that the labeled data set is used to full effect to improve the entity recognition performance of the framework.
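As an illustration of the hole (dilated) convolution mentioned above, the following is a minimal pure-Python sketch of a one-dimensional dilated convolution. The kernel values, the dilation rates and the function names are assumptions chosen for illustration; this section does not specify the kernel sizes or dilation rates of the RDCNN layer. As is usual in deep learning, the operation is written as a cross-correlation.

```python
def dilated_conv1d(sequence, kernel, dilation=1):
    """1-D dilated ("hole") convolution with zero padding and same output length.

    Kernel taps are spaced `dilation` positions apart, so a kernel of width k
    covers a span of (k - 1) * dilation + 1 input positions.
    """
    center = len(kernel) // 2
    out = []
    for i in range(len(sequence)):
        total = 0.0
        for j, weight in enumerate(kernel):
            pos = i + (j - center) * dilation
            if 0 <= pos < len(sequence):  # positions outside the input count as zero
                total += weight * sequence[pos]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated convolutions with the given rates."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=1))
print(dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=2))
print(receptive_field(3, [1, 2, 4]))
```

Stacking layers with growing dilation rates widens the context seen by each token without adding parameters, which is the usual motivation for using dilated convolutions in sequence labeling.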
The above description is only an example of the present invention and is not intended to limit the present invention; various modifications and changes will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention is intended to be included within the scope of the claims of the present invention.

Claims (10)

1. An active learning-based domain entity identification system, characterized by comprising a data preprocessing module, a model training module, an active learning module, a domain entity identification model and a domain entity crowdsourcing labeling platform device;
the data preprocessing module is used for carrying out data processing on the original text data and sending the text data after the data preprocessing to the model training module and the active learning module;
the model training module is used for mapping and identifying the received data and training and evaluating the domain entity identification model;
and the active learning module is used for driving the domain entity crowdsourcing labeling platform device to carry out marking processing on the received data and sending the data to the domain entity identification model for training.
2. The active learning-based domain entity recognition system of claim 1, wherein the data preprocessing module comprises a data cleaning unit, a data format unit and a data segmentation unit;
the data cleaning unit is used for acquiring text data of the field and removing special characters and pictures in the text;
the data format unit is used for converting the data into the input format;
the data segmentation unit is used for dividing the data to provide the data source for the selection strategy.
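The three units of claim 2 can be sketched as three small functions. This is a minimal sketch under stated assumptions: the regular expressions, the character-level input format, the 8:1:1 split ratio and all function names are hypothetical choices for illustration and are not specified by the claims.

```python
import random
import re

# Assumption: "pictures" appear as HTML <img> tags in the crawled text.
IMG_TAG = re.compile(r"<img\b[^>]*>", flags=re.IGNORECASE)
# Assumption: keep word characters (including CJK), whitespace and basic punctuation.
SPECIAL = re.compile(r"[^\w\s。，、；：？！.,;:?!()（）\-]")


def clean_text(raw):
    """Data cleaning unit: drop image tags and special characters."""
    text = IMG_TAG.sub("", raw)
    text = SPECIAL.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def to_model_input(sentence):
    """Data format unit: character-level tokens (an assumed input format)."""
    return [ch for ch in sentence if not ch.isspace()]


def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Data segmentation unit: shuffle, then split into train/validation/test."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


cleaned = clean_text('攻击者利用<img src="a.png">漏洞★ CVE-2021-44228！')
print(cleaned)
print(to_model_input("木马 病毒"))
```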
3. The active learning-based domain entity recognition system of claim 1, wherein the model training module comprises a model recognition unit and a model evaluation unit;
the model identification unit is used for identifying the received data;
the model evaluation unit is used for evaluating the trained domain entity recognition model according to the evaluation standard.
4. The active learning-based domain entity identification system of claim 3, wherein the evaluation criteria include accuracy, recall and the F1 metric.
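The evaluation criteria of claim 4 can be illustrated at the entity level. The sketch below assumes that "accuracy" is used in the sense of precision, as is common in named entity recognition, and that labels follow the BIO scheme; neither assumption is stated in the claims, and the tag names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def precision_recall_f1(gold_seqs, pred_seqs):
    """Entity-level precision, recall and F1 over parallel tag sequences."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        gold_spans, pred_spans = extract_entities(gold), extract_entities(pred)
        tp += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["B-ORG", "I-ORG", "O", "B-VUL"]]
pred = [["B-ORG", "I-ORG", "O", "O"]]
print(precision_recall_f1(gold, pred))
```

An entity counts as correct only when its boundaries and type both match, so a model that finds one of two entities exactly has perfect precision but a recall of one half.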
5. The active learning-based domain entity recognition system of claim 1, wherein the active learning module comprises a policy selection unit, a data annotation unit and an iterative training unit;
the strategy selection unit is used for selecting data with labeling value;
the data labeling unit is used for driving the domain entity crowdsourcing labeling platform device to label data;
the iterative training unit is used for setting the number of iterative training times.
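The strategy selection unit of claim 5 can be illustrated with a generic uncertainty-based ranking. The sketch below uses least-confidence sampling as a stand-in; it is not the new selection strategy of the invention, which is not described in this section. The pool layout, the function names and the iteration count are hypothetical.

```python
def sentence_uncertainty(token_probs):
    """Least-confidence score: 1 minus the mean of each token's top label probability."""
    if not token_probs:
        return 0.0
    return 1.0 - sum(max(probs) for probs in token_probs) / len(token_probs)


def select_for_labeling(pool, batch_size):
    """Return the ids of the `batch_size` most uncertain sentences.

    `pool` maps a sentence id to a list of per-token label probability lists,
    as they might be produced by the current recognition model.
    """
    ranked = sorted(pool, key=lambda sid: sentence_uncertainty(pool[sid]), reverse=True)
    return ranked[:batch_size]


MAX_ITERATIONS = 5  # hypothetical value set by the iterative training unit

pool = {
    "s1": [[0.9, 0.1], [0.8, 0.2]],
    "s2": [[0.5, 0.5], [0.6, 0.4]],
    "s3": [[0.99, 0.01]],
}
print(select_for_labeling(pool, batch_size=2))
```

Sentences on which the model is least sure are sent to the labeling platform first, on the assumption that their labels are the most informative for the next training round.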
6. The active learning-based domain entity identification system of claim 1, wherein the domain entity crowdsourcing labeling platform device comprises a user layer, a display layer, a business layer, a data layer, a database, and a data analysis layer;
the user layer is used for collecting text data to be labeled and sending the text data to the business layer through the display layer;
the business layer is used for calling an algorithm in the database to process the received data and sending the processed data to the data analysis layer;
the data analysis layer is used for labeling the received data to obtain labeled data.
7. The active learning-based domain entity identification system of claim 6, wherein the overall architecture of the domain entity crowdsourcing labeling platform device is implemented using a B/S architecture;
the display layer is constructed using the Angular framework, the NG-ZORRO component library and the ECharts visualization library.
8. An active learning-based domain entity identification method, characterized in that the identification method is implemented by adopting the active learning-based domain entity identification system according to any one of claims 1-7, and the identification method comprises the following steps:
S1, adopting the data preprocessing module to carry out data cleaning on original text data to obtain input data of the domain entity identification model;
S2, dividing the input data of the domain entity identification model to obtain training set data, verification set data and test set data;
S3, dividing the training set data to obtain labeled training set data and unlabeled training set data;
S4, training the domain entity identification model by the model training module according to the labeled training set data to obtain a trained domain entity identification model;
S5, analyzing the test set data by the trained domain entity identification model to obtain analysis result data;
S6, the active learning module adopts the domain entity crowdsourcing labeling platform device to label the unlabeled training set data and adds the newly labeled data to the labeled training set data to obtain a new labeled training data set;
S7, training the domain entity identification model by adopting the analysis result data and the new labeled training data set until the performance of the domain entity identification model reaches a preset condition, and then stopping training to obtain the identified domain entities.
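The control flow of steps S3 to S7 can be sketched as a small loop. This is a minimal sketch under stated assumptions: the LookupTagger class is a toy stand-in for the domain entity identification model, the oracle function stands in for the crowdsourcing labeling platform, the selection rule (most unknown tokens first) is a placeholder for the selection strategy, and token accuracy stands in for the preset performance condition. None of these names come from the patent.

```python
import random


class LookupTagger:
    """Toy stand-in for the recognition model: memorizes token -> label."""

    def __init__(self):
        self.table = {}

    def fit(self, labeled):
        self.table = {}
        for tokens, labels in labeled:
            self.table.update(zip(tokens, labels))

    def predict(self, tokens):
        return [self.table.get(tok, "O") for tok in tokens]

    def unknown_ratio(self, tokens):
        return sum(tok not in self.table for tok in tokens) / max(len(tokens), 1)


def token_accuracy(model, dataset):
    correct = total = 0
    for tokens, labels in dataset:
        correct += sum(p == g for p, g in zip(model.predict(tokens), labels))
        total += len(labels)
    return correct / total if total else 0.0


def active_learning_loop(train, test, oracle, seed_size=2, batch_size=2,
                         target=1.0, max_rounds=10, seed=0):
    items = list(train)
    random.Random(seed).shuffle(items)
    labeled = [(t, oracle(t)) for t in items[:seed_size]]  # S3: labeled part
    unlabeled = items[seed_size:]                          # S3: unlabeled part
    model, history = LookupTagger(), []
    for _ in range(max_rounds):
        model.fit(labeled)                                 # S4 / S7: (re)train
        score = token_accuracy(model, test)                # S5: analyze the test set
        history.append(score)
        if score >= target or not unlabeled:               # S7: preset condition
            break
        unlabeled.sort(key=model.unknown_ratio, reverse=True)
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend((t, oracle(t)) for t in batch)      # S6: label and merge
    return model, history


GOLD = {"Emotet": "B-MAL", "Log4j": "B-VUL", "APT28": "B-ORG", "Mirai": "B-MAL"}


def oracle(tokens):
    """Hypothetical annotator: looks labels up in a fixed gazetteer."""
    return [GOLD.get(tok, "O") for tok in tokens]


train = [["Emotet", "spreads"], ["Log4j", "exploited"], ["APT28", "attacks"],
         ["Mirai", "botnet"], ["patch", "released"]]
test = [(s, oracle(s)) for s in (["APT28", "uses", "Emotet"], ["Mirai", "and", "Log4j"])]
model, history = active_learning_loop(train, test, oracle)
print(history)
```

In this toy setting the score can only rise or stay flat from round to round, because each round adds consistent labels to the lookup table; a real model would need the validation set of step S2 to monitor training.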
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs a domain entity identification method based on active learning as claimed in claim 8.
10. A computer device, characterized in that the device comprises a memory and a processor, wherein the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes the domain entity identification method based on active learning as claimed in claim 8.
CN202310598745.1A 2023-05-25 2023-05-25 Active learning-based domain entity identification system and identification method Pending CN116776881A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310598745.1A CN116776881A (en) 2023-05-25 2023-05-25 Active learning-based domain entity identification system and identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310598745.1A CN116776881A (en) 2023-05-25 2023-05-25 Active learning-based domain entity identification system and identification method

Publications (1)

Publication Number Publication Date
CN116776881A true CN116776881A (en) 2023-09-19

Family

ID=87987020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310598745.1A Pending CN116776881A (en) 2023-05-25 2023-05-25 Active learning-based domain entity identification system and identification method

Country Status (1)

Country Link
CN (1) CN116776881A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117436449A (en) * 2023-11-01 2024-01-23 哈尔滨工业大学 Crowd-sourced named entity recognition model and system based on multi-source domain adaptation and reinforcement learning

Similar Documents

Publication Publication Date Title
CN109189942B (en) Construction method and device of patent data knowledge graph
CN110727779A (en) Question-answering method and system based on multi-model fusion
CN111027327A (en) Machine reading understanding method, device, storage medium and device
CN110781276A (en) Text extraction method, device, equipment and storage medium
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN110222194B (en) Data chart generation method based on natural language processing and related device
CN113515600B (en) Automatic calculation method for spatial analysis based on metadata
CN115470338B (en) Multi-scenario intelligent question answering method and system based on multi-path recall
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN114547072A (en) Method, system, equipment and storage medium for converting natural language query into SQL
CN113064995A (en) Text multi-label classification method and system based on deep learning of images
CN110807086A (en) Text data labeling method and device, storage medium and electronic equipment
CN111241209A (en) Method and apparatus for generating information
CN111859969A (en) Data analysis method and device, electronic equipment and storage medium
CN115099239B (en) Resource identification method, device, equipment and storage medium
CN116776881A (en) Active learning-based domain entity identification system and identification method
CN115803734A (en) Natural language enrichment using action interpretation
CN115099233A (en) Semantic analysis model construction method and device, electronic equipment and storage medium
CN113988071A (en) Intelligent dialogue method and device based on financial knowledge graph and electronic equipment
CN113935314A (en) Abstract extraction method, device, terminal equipment and medium based on heteromorphic graph network
CN117520503A (en) Financial customer service dialogue generation method, device, equipment and medium based on LLM model
CN116433934A (en) Multi-mode pre-training method for generating CT image representation and image report
CN112883183B (en) Method for constructing multi-classification model, intelligent customer service method, and related device and system
CN115269998A (en) Information recommendation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination