CN114356990A - Base named entity recognition system and method based on transfer learning


Info

Publication number
CN114356990A
Authority
CN
China
Prior art keywords
base
named entity
training
entity recognition
format
Prior art date
Legal status
Pending
Application number
CN202111652819.2A
Other languages
Chinese (zh)
Inventor
马良荔
覃基伟
李陶圆
何智勇
牛敬华
Current Assignee
Naval University of Engineering PLA
Original Assignee
Naval University of Engineering PLA
Priority date
Filing date
Publication date
Application filed by Naval University of Engineering PLA
Priority to CN202111652819.2A
Publication of CN114356990A
Legal status: Pending

Abstract

The invention discloses a base named entity recognition method based on transfer learning, comprising the following steps: 1. acquire and preprocess base data, and predefine entity types as needed; 2. label the base data; 3. acquire an open-source transfer-learning model; 4. train the model of the invention; 5. perform named entity recognition. Through transfer learning, the model is pre-trained in a self-supervised manner on massive additional data, overcoming the traditional deep learning requirement for massive labeled training data; meanwhile, a BiGRU model with strong generalization capability encodes the contextual information of entities. The method thus achieves more accurate base named entity prediction with little human intervention and provides technical support for the subsequent automatic construction of knowledge graphs.

Description

Base named entity recognition system and method based on transfer learning
Technical Field
The invention relates to the technical field of natural language processing and transfer learning, in particular to a base named entity recognition system and method based on transfer learning.
Background
With the rise of social media, unstructured data described in natural language on the Internet is growing rapidly, and much of it implies information that can assist machine decision-making. Such information can be used for shopping recommendation, intelligent search, decision support, and the like. In the age of user-generated content, most content is described in natural language; how to automatically organize and summarize the information it contains and use it to assist machine decision-making requires targeted research and exploration.
Natural language processing is a technical means of analyzing and understanding natural language with computers, enabling machines to process unstructured data such as natural language text. It helps machines extract the actual semantic knowledge contained in unstructured data and improves their capability for automatic, intelligent knowledge acquisition.
The knowledge graph is a powerful tool for organizing, managing, and applying knowledge derived from unstructured data, and named entity recognition is one of the key steps in knowledge graph construction. In the construction of fine-grained domain knowledge graphs, labeled training data are often scarce, and the base knowledge graph is no exception.
Transfer learning is a technical means of addressing the poor performance of neural networks when training data are lacking: the network is first deeply pre-trained on massive, indirectly related data so that it possesses this data as background knowledge before being trained on the target data, ultimately improving the model's generalization capability under data scarcity.
Typical named entity recognition methods include hand-crafted production rules, machine-learning classification algorithms, and deep learning. Research shows that, compared with traditional methods, deep-learning-based methods achieve better precision and recall and are suited to named entity recognition with large-scale training data. However, research on deep-learning-based methods mostly focuses on named entity recognition in the general domain, and recognition in base descriptions faces several challenges. On the one hand, base description data are scarce relative to the general domain, making it difficult to meet the massive data requirement of training deep neural networks; on the other hand, base named entity recognition requires finely subdivided entity types, and the larger number of types increases the complexity of recognition. The base named entity recognition method therefore needs improvement in reducing the required target training data, enhancing the model's fine-grained learning of description texts, and improving the model's generalization capability.
Disclosure of Invention
The invention aims to provide a base named entity recognition system and method based on transfer learning, to address the scarcity of publicly available base description texts and the resulting difficulty of training traditional deep learning models. An ALBERT model pre-trained on massive additional data is adopted as the base model; transfer learning is used to generate fine-grained word vectors for the description text; a BiGRU model serves as the coding layer, learning both forward and backward information of the base description text; and a CRF model constrains the output, yielding a named entity recognition result that meets the requirements.
To achieve this purpose, the base named entity recognition method based on transfer learning comprises the following steps:
Step 1: acquiring a natural language description corpus of a base from the Internet and preprocessing it, so as to remove picture description information and HTML tag information and to unify the units used to describe attribute values;
Step 2: randomly dividing the preprocessed natural language description corpus into a test set, a validation set, and a training set, and performing entity labeling on the three corpora in the BIOES labeling format, forming a test set, a validation set, and a training set in BIOES format;
Step 3: acquiring an open-source pre-trained ALBERT model and updating it by fine-tuning on the base natural language description corpus, obtaining an updated transfer-learning ALBERT layer;
Step 4: constructing a base named entity recognition model from the updated transfer-learning ALBERT layer, a BiGRU coding layer, and a CRF constraint layer; training the model with the BIOES-format test, validation, and training sets as its training data, iterating until fit while using precision, recall, and F1 as training evaluation indicators, to obtain the trained base named entity recognition model;
Step 5: recognizing the sentences uploaded by the user with the trained base named entity recognition model, obtaining the BIOES-format labels corresponding to the uploaded sentences.
The invention has the beneficial effects that:
the invention fully utilizes the capability of migration learning to improve the performance of the model under the condition of less labeled data quantity, and breaks through the problem of lower performance of the traditional neural network model under the condition of less labeled data. In the characteristic learning stage, the requirement for fine-grained depiction of word vectors in the early stage of the model is met by a method of self-supervision and pre-training the model in advance by using mass data, then context characteristics of BiGRU model coding sentences are adopted, information of the sentences is learned from front to back and from back, and finally constraints are added through a CRF layer, so that the integral model provided by the invention can quickly and accurately recognize base named entities.
Drawings
FIG. 1 is a flow chart of training data construction in the present invention;
FIG. 2 is a schematic diagram of training data labeling according to the present invention;
FIG. 3 is a diagram of a base named entity recognition model architecture according to the present invention;
FIG. 4 is a diagram of the transfer learning (ALBERT) model architecture in accordance with the present invention;
FIG. 5 is a schematic diagram of the SOP self-supervised pre-training process of the ALBERT model in the present invention;
FIG. 6 is a schematic diagram of the overall structure of a GRU model according to the present invention;
FIG. 7 is the overall structure of the BiGRU model of the present invention;
FIG. 8 is a graph comparing the results of model experiments in the present invention;
FIG. 9 is a diagram illustrating an example of base named entity recognition in the present invention;
FIG. 10 is a schematic structural diagram of the present invention.
In FIG. 10: 1, base corpus collection and preprocessing module; 2, machine learning data set construction module; 3, transfer-learning ALBERT layer updating module; 4, base named entity recognition model training module; 5, sentence recognition module.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
the base named entity identification method based on the transfer learning shown in fig. 1-9 is characterized in that: it comprises the following steps:
step 1: acquiring a natural Language description corpus of a base from the Internet, and preprocessing the natural Language description corpus so as to remove picture description information and HTML (Hypertext Markup Language) tag information and uniformly describe units of attribute values;
step 2: randomly dividing the preprocessed natural language description corpus into a test set, a verification set and a training set, and performing entity labeling on the natural language description corpus of the test set, the natural language description corpus of the verification set and the natural language description corpus of the training set by using a BIOES format labeling mode and by using Doccano as a labeling tool (shown in figure 2) to form the test set, the verification set and the training set of the BIOES labeling format, wherein the test set, the verification set and the training set are formed by 2: 2: 6, random division;
and step 3: acquiring an open-source transfer learning ALBERT model, updating the open-source transfer learning ALBERT model by using a finetune mode through natural language description linguistic data of a base, and acquiring an updated transfer learning ALBERT layer for acquiring word vectors with fine-grained depiction;
and 4, step 4: constructing a base named entity recognition model (programming language is Python and a deep learning frame is PyTorch) by utilizing an updated migration learning ALBERT layer (A Lite Bidirectional Encoder representation from converters) based on lightweight Bidirectional coding representation of a converter, a BiGRU (Bidirectional Gated Loop Unit) coding layer and a CRF (Conditional Random Field) constraint layer, training the base named entity recognition model by using a test set, a verification set and a training set of a BIOES labeling format as a training data set of the base named entity recognition model, and obtaining the trained base named entity recognition model by using accuracy, recall and F1 values as training evaluation indexes and continuously performing iterative fitting in the training process;
and 5: and recognizing the sentences uploaded by the user by using the trained base named entity recognition model to obtain BIOES format labels corresponding to the uploaded sentences.
In step 1 of the above technical scheme, a Python-based Selenium crawler crawls base news reports from public websites such as 'Chinese Military Net', 'Global Military Net', and 'Phoenix Military' according to a base name list, obtaining an unprocessed collection of news reports; the content of this unprocessed collection is manually screened, selecting the base name, the base location, the name of the area the base is responsible for, the names of the base's weaponry, the names of the troops garrisoned at the base, the names of the building facilities in the base, the numbers of those building facilities, the position of each building facility within the base, and base evaluations, while news reports irrelevant to the bases under study are rejected;
base description information is likewise crawled from Wikipedia and/or Baidu Baike according to the base name list with the Python-based Selenium crawler, thereby obtaining the base name, the base location, the name of the base's responsible area, the names of the base's weaponry, the names of the troops garrisoned at the base, the names of the building facilities in the base, the numbers of the building facilities, the positions of the building facilities, and base evaluations;
and the base's natural language description corpus is formed from the news reports together with the information on base physical facilities, base location, base personnel, and base weaponry acquired from Wikipedia and/or Baidu Baike.
In step 1 of the above technical solution, the natural language description corpus is preprocessed through data cleaning, data integration, and labeled-data format conversion.
The data cleaning method is: delete duplicate texts and largely empty texts from the original data set, delete description content that originally belonged to pictures, and remove HTML tag information that is not natural language description, such as paragraph tags '<p>' and hyperlink tags.
The data integration method is: first, mark entities that denote the same referent under different names as aliases and handle them specially; second, unify units, such as those of the attribute values of military facilities; third, label the data obtained from the two data sources uniformly, collating the labeled entity types applicable to both sources, to reduce the likelihood of large-area sparse labeling.
The labeled-data format conversion method is: the data corpus is entity-labeled with the open-source Doccano labeling tool; the data exported by the tool are in Json format, and the Json-format labeled data are converted into the BIOES (Begin, Inside, Other, End, Single) labeling format with the Python programming language. After preprocessing, the preprocessed natural language description corpus is obtained.
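To make the format conversion concrete, the sketch below converts one line of a Doccano Json export into BIOES tags. The field names ('text' and 'labels' holding [start, end, type] character offsets) follow a typical Doccano export and are an assumption, not the exact schema used by the invention.

```python
# Sketch: one Doccano Json record -> per-character BIOES tags.
# Assumed export schema: {"text": "...", "labels": [[start, end, "TYPE"], ...]}
import json

def doccano_to_bioes(json_line: str):
    record = json.loads(json_line)
    text = record["text"]
    tags = ["O"] * len(text)                    # O: other characters
    for start, end, label in record["labels"]:  # character-offset spans
        if end - start == 1:
            tags[start] = f"S-{label}"          # S: single-character entity
        else:
            tags[start] = f"B-{label}"          # B: first character of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"          # I: internal characters
            tags[end - 1] = f"E-{label}"        # E: last character of the entity
    return list(zip(text, tags))
```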
In step 2 of the above technical solution, the base description information of the natural language description corpora in the BIOES-format test, validation, and training sets is labeled with base description categories according to a preset base description classification characterizing bases, and the base description information is additionally labeled with character ordering.
In step 3 of the above technical solution, the BIOES-format training set is used as the data sample for fitting the base named entity recognition model, the BIOES-format validation set is used to evaluate the current training state of the model, and the BIOES-format test set evaluates the generalization performance of the trained model.
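A minimal sketch of the random 2:2:6 division described in step 2; the sentence list and the fixed seed are illustrative assumptions.

```python
# Sketch: shuffle the preprocessed corpus and split it 2:2:6
# into test, validation, and training sets.
import random

def split_corpus(sentences, seed=42):
    shuffled = sentences[:]                       # copy so the original order is kept
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    test = shuffled[: n * 2 // 10]                # 20% test set
    valid = shuffled[n * 2 // 10: n * 4 // 10]    # 20% validation set
    train = shuffled[n * 4 // 10:]                # 60% training set
    return test, valid, train
```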
In step 5 of the above technical scheme, the updated transfer-learning ALBERT layer of the trained base named entity recognition model maps each character of the sentence uploaded by the user into a low-dimensional, dense, distributed character vector; transfer learning on a large Chinese pre-training corpus makes the distributed character vectors more finely characterized;
the BiGRU coding layer of the trained base named entity recognition model encodes the distributed character vectors output by the transfer-learning ALBERT layer with a BiGRU network, forming multi-dimensional character vectors;
and the CRF decoding layer of the trained base named entity recognition model decodes the multi-dimensional character vectors output by the BiGRU coding layer, applies constraints according to the implicit sequential relations of the BIOES labeling format, and computes a label sequence that meets the requirements, obtaining the BIOES-format labels corresponding to the uploaded sentence.
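The three layers just described can be sketched in PyTorch. This is a minimal sketch under stated assumptions: the ALBERT encoder is loaded through the HuggingFace transformers library and the CRF layer comes from the third-party pytorch-crf package; the patent names only Python and PyTorch, not these libraries.

```python
# Sketch of the ALBERT -> BiGRU -> CRF architecture (assumed libraries:
# transformers for ALBERT, pytorch-crf for the CRF layer).
import torch.nn as nn
from transformers import AlbertModel
from torchcrf import CRF

class BaseNerModel(nn.Module):
    def __init__(self, albert_name: str, num_tags: int, hidden: int = 256):
        super().__init__()
        self.albert = AlbertModel.from_pretrained(albert_name)    # fine-tuned ALBERT layer
        self.bigru = nn.GRU(self.albert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)  # BiGRU coding layer
        self.fc = nn.Linear(2 * hidden, num_tags)   # emission score per BIOES tag
        self.crf = CRF(num_tags, batch_first=True)  # CRF constraint/decoding layer

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.albert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.bigru(x)                 # multi-dimensional character vectors
        emissions = self.fc(h)
        mask = attention_mask.bool()
        if tags is not None:                 # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: Viterbi label sequence
```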
In step 4 of the above technical solution, the precision P, recall R, and F1 value are calculated as follows:
P = TP / (TP + FP)    (1)
R = TP / (TP + FN)    (2)
F1 = 2PR / (P + R)    (3)
wherein TP denotes the number of correctly predicted entities, FN the number of positive examples predicted as negative, and FP the number of negative examples predicted as positive, so that TP + FP is the number of samples predicted positive and TP + FN is the total number of positive samples; F1 is the harmonic mean of P and R, balancing the proportion of the two indicators.
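Equations (1)-(3) translate directly into code; the exact-span matching of predicted and gold entity sets below is an illustrative assumption.

```python
# Sketch: precision, recall, and F1 over entity spans, per equations (1)-(3).
def precision_recall_f1(pred_entities: set, gold_entities: set):
    tp = len(pred_entities & gold_entities)  # correctly predicted entities
    fp = len(pred_entities - gold_entities)  # negatives predicted as positive
    fn = len(gold_entities - pred_entities)  # positives predicted as negative
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1
```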
Named entity recognition plays a fundamental role in natural language processing: its purpose is to recognize predefined entity classes carrying semantic information in natural language description text. Unlike conventional binary or multi-class classification problems in, for example, the image domain, named entity recognition in the Natural Language Processing (NLP) domain must account not only for the length of the sequence but also for its order. The NLP field therefore typically converts the problem into a sequence-labeling classification problem, i.e., carefully classifying each element of a linear sequence according to its context. Converted in this way, the problem becomes a classification problem and can be evaluated with the classification evaluation indicators of machine learning.
In step 5 of the above technical scheme, the BIOES-format labels corresponding to the uploaded sentence are stored in Json format; the Json file is parsed with the loads function of the json module in the Python programming language, and the entity strings and entity types predicted by the trained base named entity recognition model are displayed at the Web front end.
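A minimal sketch of that parsing step; the file name and the 'entities'/'entity'/'type' keys are illustrative assumptions about the output layout.

```python
# Sketch: parse the Json prediction with json.loads and print what the
# Web front end would display (assumed keys).
import json

with open("prediction.json", encoding="utf-8") as f:
    result = json.loads(f.read())
for item in result["entities"]:
    print(item["entity"], item["type"])  # entity string and predicted entity type
```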
In the above technical solution, the BiGRU coding layer is formed by combining GRU models that run over the entity (base description) in the forward and backward directions; one GRU unit contains a reset gate and an update gate, calculated in detail as follows:
z(t)=σ(W(z)x(t)+U(z)h(t-1)) (4)
r(t)=σ(W(r)x(t)+U(r)h(t-1)) (5)
h̃(t)=tanh(Wx(t)+U(r(t)∘h(t-1))) (6)
h(t)=z(t)∘h(t-1)+(1-z(t))∘h̃(t) (7)
wherein σ denotes the Sigmoid activation function, which compresses values to the interval (0,1); z denotes the update gate; r the reset gate; t the time step; ∘ the Hadamard product, i.e., element-wise multiplication of corresponding matrix entries; and tanh the tanh activation function, which compresses values to the interval (-1,1);
z(t) denotes the output of the update gate z at time t; W(z) the weight matrix of the input vector x for the update gate z at time t; U(z) the weight matrix of the hidden-layer vector h for the update gate z at time t; h(t-1) the hidden-layer vector at time t-1; r(t) the output of the reset gate r at time t; W(r) the weight matrix of the input vector x for the reset gate r at time t; U(r) the weight matrix of the hidden-layer vector h for the reset gate r at time t; h̃(t) the candidate hidden state at time t; and h(t) the hidden state at time t.
The distributed character vector x(t) is input forwards into equation (4), yielding the entity's forward GRU output →h(t) at time t, i.e., h(t) of equation (7) for the forward GRU; the distributed character vector x(t) is then input backwards into equation (4), yielding the entity's backward GRU output ←h(t) at time t, i.e., h(t) of equation (7) for the backward GRU.
The outputs of the forward and backward hidden-state layers at the same position are then concatenated to obtain the BiGRU output ht at time t, namely:
ht = [→h(t) ; ←h(t)]    (8)
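For clarity, equations (4)-(7) written out for a single GRU step; the NumPy formulation and the weight shapes are illustrative, not the patent's implementation.

```python
# Sketch: one GRU step, equations (4)-(7); "*" is the Hadamard product.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)         # update gate, eq. (4)
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)         # reset gate, eq. (5)
    h_cand = np.tanh(W @ x_t + U @ (r_t * h_prev))  # candidate state, eq. (6)
    return z_t * h_prev + (1 - z_t) * h_cand        # hidden state, eq. (7)
```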
in the above technical solution, the decoding method of the CRF decoding layer comprises:
the BiGRU output ht at time t is substituted into the following formulas: equation (9) computes the score of each input value at each position; equation (10) obtains the likelihood function by maximum likelihood estimation; equation (11) normalizes the likelihood of the sequence y with the Softmax function; finally, the label sequence with the highest score is extracted with the Viterbi algorithm, i.e., the BIOES-format labels corresponding to the uploaded sentence;
S(h,y)=Σ(i=0..N)A(yi,yi+1)+Σ(i=1..N)h(i,yi) (9)
L(θ)=Σ(i=1..N)log P(yi|hi)-λ‖θ‖² (10)
P(y|h)=e^S(h,y)/Σ(y'∈Y(h))e^S(h,y') (11)
y*=argmax(y'∈Y(h))S(h,y') (12)
wherein A denotes the transition score matrix of the BIOES-format labels; a transition score expresses the probability of the label combination of the current-position character and its adjacent character, and the transition score matrix is randomly initialized at first, its final scores being learned through the neural network's continued back-propagation; h denotes the emission score matrix, the emission scores being the predicted output ht of the BiGRU layer; S(h, y) denotes the adjacent character order score; N denotes the total number of characters of the training sentence and i the iteration index; P(y|h) denotes the probability of the actual character label sequence given the predicted character label sequence, and P(yi|hi) denotes equation (11) with yi substituted for y and hi substituted for h, hi being the character at position i of the vector h; h denotes the output of the BiGRU, λ the coefficient of the parameters in the likelihood function, and L the likelihood function; θ denotes the parameter distribution to be estimated, i.e., the parameter values of the CRF model, and Y(h) denotes the set of all possible BIOES-format label sequences; y* denotes the most probable BIOES-format label sequence, S(h, y') denotes equation (9) with y' substituted for y, y' being an element of Y(h); e denotes the natural constant; and argmax in equation (12) selects the label sequence with the maximum score over the set of all possible BIOES label sequences.
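A sketch of the path score of equation (9) and the Viterbi extraction of y* in equation (12); the transition matrix A and the emission scores h are assumed to be available as NumPy arrays, and the start-of-sequence transition is simplified away.

```python
# Sketch: CRF path scoring (eq. 9) and Viterbi decoding (eq. 12).
import numpy as np

def sequence_score(emissions, transitions, tags):
    """S(h, y): emission plus transition scores along the tag path y."""
    score = emissions[0, tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score

def viterbi_decode(emissions, transitions):
    """y*: the highest-scoring BIOES tag sequence."""
    n, k = emissions.shape
    dp = emissions[0].copy()               # best score ending in each tag so far
    back = np.zeros((n, k), dtype=int)     # back-pointers for path recovery
    for i in range(1, n):
        cand = dp[:, None] + transitions + emissions[i][None, :]
        back[i] = cand.argmax(axis=0)
        dp = cand.max(axis=0)
    best = [int(dp.argmax())]
    for i in range(n - 1, 0, -1):          # walk the back-pointers
        best.append(int(back[i, best[-1]]))
    return best[::-1]
```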
A base named entity recognition system based on transfer learning, shown in FIG. 10, comprises a base corpus collection and preprocessing module 1, a machine learning data set construction module 2, a transfer-learning ALBERT layer updating module 3, a base named entity recognition model training module 4, and a sentence recognition module 5;
the base corpus collection and preprocessing module 1 is used to acquire natural language description corpora of a base from the Internet and preprocess them, so as to remove picture description information and HTML tag information and to unify the units used to describe attribute values;
the machine learning data set construction module 2 is used to randomly divide the preprocessed natural language description corpus into a test set, a validation set, and a training set and to perform entity labeling on the three corpora in the BIOES labeling format, forming a test set, a validation set, and a training set in BIOES format;
the transfer-learning ALBERT layer updating module 3 is used to acquire an open-source pre-trained ALBERT model and update it by fine-tuning on the base natural language description corpus, obtaining an updated transfer-learning ALBERT layer;
the base named entity recognition model training module 4 is used to construct a base named entity recognition model from the updated transfer-learning ALBERT layer, a BiGRU coding layer, and a CRF constraint layer and to train it with the BIOES-format test, validation, and training sets as its training data, iterating until fit while using precision, recall, and F1 as training evaluation indicators, to obtain the trained base named entity recognition model;
and the sentence recognition module 5 is used to recognize sentences uploaded by the user with the trained base named entity recognition model, obtaining the BIOES-format labels corresponding to the uploaded sentences.
FIG. 2 illustrates the labeling process for the base named entity training data. Combining the characteristics of the base natural language description texts and removing entity types that are rare in the corpus, eleven types of military base description entities characterizing a base are recognized: base name (BaseName), base position (BasePlace), time (Time), responsible area (ResponsibleArea), base weaponry (Weaponry), garrison troops (GarrisonTroops), base facilities (BaseFacilities), facility number (BaseFacilitiesId), facility attribute (BaseFacilitiesProperty), facility position (BaseFacilitiesPlace), and base evaluation (BaseEvaluation).
According to the eleven predefined named entity types describing bases, the preprocessed data are entity-labeled with Doccano (an open-source visual named entity labeling tool). Labeling produces a Json file of the description, which is then converted into the currently most common BIOES scheme (Begin-Inside-Other-End-Single), where B marks the first character of an entity, I an internal character of an entity, O any other character, E the last character of an entity, and S an entity formed by a single character. After manual fine-grained entity labeling, every individual character in the base corpus carries a corresponding classification label, yielding the training data set.
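For illustration, a hypothetical sentence labeled in this scheme; the characters and entity spans are invented for demonstration and do not come from the patent's corpus.

```python
# Sketch: character-level BIOES tags for one invented sentence,
# "某某基地位于东海之滨" ("the XX base lies on the East Sea coast").
chars = list("某某基地位于东海之滨")
tags = ["B-BaseName", "I-BaseName", "I-BaseName", "E-BaseName",      # 某某基地
        "O", "O",                                                    # 位于
        "B-BasePlace", "I-BasePlace", "I-BasePlace", "E-BasePlace"]  # 东海之滨
for ch, tag in zip(chars, tags):
    print(ch, tag)
```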
FIG. 3 is a structural diagram of the base named entity recognition model provided by the invention; the model mainly comprises a transfer-learning (ALBERT) layer, a BiGRU coding layer, and a CRF constraint layer, and finally outputs the recognition result.
FIG. 4 illustrates the process of generating word vectors with the ALBERT model. First, self-supervised learning on a large Chinese pre-training corpus yields the trained ALBERT model; sentences input into the model then yield word vectors that characterize the words in fine detail. The ALBERT model is built on the BERT model proposed by Google; BERT, as one of the pre-training models, brought a milestone change to NLP. Compared with the BERT model, ALBERT reduces the parameter count by matrix factorization: a low-dimensional vector space E is introduced, turning the direct mapping between the original vocabulary vector V and the hidden-layer vector H into an indirect one, O denoting the order of the parameter count:
O(V×H) → O(V×E+E×H)
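A back-of-the-envelope check of this factorization; V = 21128 matches the common Chinese BERT/ALBERT vocabulary and E = 128, H = 768 match ALBERT-base, all assumed values.

```python
# Sketch: embedding parameter counts with and without the factorization.
V, H, E = 21_128, 768, 128
direct = V * H             # O(V x H): direct mapping, ~16.2M parameters
factored = V * E + E * H   # O(V x E + E x H): ~2.8M parameters
print(direct, factored, round(direct / factored, 1))  # roughly a 5.8x reduction
```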
ALBERT innovatively proposes the SOP (Sentence Order Prediction) pre-training strategy to replace the NSP (Next Sentence Prediction) strategy. Under the original NSP strategy, the sentence following a given sentence in the corpus is replaced at a certain ratio to create the training data for self-supervised learning; however, the researchers behind XLNet and RoBERTa found that the NSP training strategy limits the performance of pre-training models. ALBERT therefore adopts SOP as its pre-training strategy: two consecutive segments of a document form a positive sample, and the same two segments in swapped order form a negative sample.
FIG. 5 shows the process of the ALBERT SOP pre-training strategy. The NSP pre-training strategy can degenerate the final model into one that merely judges whether two consecutive sentences come from the same topic or document; the SOP strategy avoids this problem, so the model also learns to judge the order of two sentences on the same topic. The original order of two sentences in the pre-training corpus is a positive example, and the reversed order is a negative example.
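A minimal sketch of SOP sample construction under these rules; the segment list is assumed to hold consecutive text segments of one document.

```python
# Sketch: build SOP training pairs from consecutive document segments.
def sop_pairs(segments):
    pairs = []
    for a, b in zip(segments, segments[1:]):  # consecutive segment pairs
        pairs.append(((a, b), 1))             # original order -> positive sample
        pairs.append(((b, a), 0))             # swapped order  -> negative sample
    return pairs
```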
FIG. 6 shows the structure of the GRU model, where σ denotes the Sigmoid activation function, which compresses values to the interval (0,1); z denotes the update gate; r the reset gate; the subscript t the time step; tanh the tanh activation function, which compresses values to the interval (-1,1); h̃(t) the candidate hidden state at time t; and h(t) the hidden state at time t.
FIG. 7 shows the structure of the BiGRU model, which consists of two parts: a GRU encoding the sequence forwards and a GRU encoding it backwards. The two GRU models combined encode the distributed character vectors output by the transfer-learning ALBERT layer into multi-dimensional character vectors.
FIG. 8 compares the experimental results of seven models: HMM (Hidden Markov Model), CRF (Conditional Random Field), BiLSTM (Bidirectional Long Short-Term Memory), BiLSTM-CRF, BiGRU-CRF (Bidirectional Gated Recurrent Unit-Conditional Random Field), ALBERT-BiLSTM-CRF, and the ALBERT-BiGRU-CRF model of the invention. The comparison shows that the proposed model achieves better named entity recognition performance under little human intervention and scarce training data. The model performs well on the base named entity recognition task and provides technical support for the subsequent automatic construction of knowledge graphs.
FIG. 9 shows a prototype system built on the base named entity recognition model of the invention to demonstrate its function: a text to be recognized is entered in the system's interaction box, the pre-trained base named entity recognition model is called, and after model processing the named entity recognition prediction is obtained.
The named entity recognition module is realized with the model of the invention behind a Web interactive interface developed in the Python language with the Django framework. The specific operation has two stages: the input text sentence is submitted with a Request to the backend for processing, and the prediction result is displayed visually; the resulting entity predictions can serve as input to a subsequent relation extraction module, as in the sketch below.
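A minimal sketch of that two-stage interaction; the Django view recognize and the helper predict_bioes wrapping the trained model are hypothetical names, not taken from the patent.

```python
# Sketch: a Django view receiving the text and returning the prediction.
from django.http import JsonResponse

def predict_bioes(text):
    """Hypothetical wrapper around the trained ALBERT-BiGRU-CRF model;
    returns (entity string, entity type) pairs."""
    raise NotImplementedError  # stands in for the actual model call

def recognize(request):
    text = request.POST.get("text", "")          # stage 1: sentence submitted to backend
    entities = predict_bioes(text)
    payload = [{"entity": e, "type": t} for e, t in entities]
    return JsonResponse({"entities": payload})   # stage 2: shown at the Web front end
```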
Details not described in this specification belong to the prior art well known to those skilled in the art.

Claims (10)

1. A base named entity recognition method based on transfer learning, characterized in that it comprises the following steps:
Step 1: acquiring a natural language description corpus of a base from the Internet and preprocessing it, so as to remove picture description information and HTML tag information and to unify the units used to describe attribute values;
Step 2: randomly dividing the preprocessed natural language description corpus into a test set, a validation set, and a training set, and performing entity labeling on the three corpora in the BIOES labeling format, forming a test set, a validation set, and a training set in BIOES format;
Step 3: acquiring an open-source pre-trained ALBERT model and updating it by fine-tuning on the base natural language description corpus, obtaining an updated transfer-learning ALBERT layer;
Step 4: constructing a base named entity recognition model from the updated transfer-learning ALBERT layer, a BiGRU coding layer, and a CRF constraint layer; training the model with the BIOES-format test, validation, and training sets as its training data, iterating until fit while using precision, recall, and F1 as training evaluation indicators, to obtain the trained base named entity recognition model;
Step 5: recognizing the sentences uploaded by the user with the trained base named entity recognition model, obtaining the BIOES-format labels corresponding to the uploaded sentences.
2. The method of claim 1, wherein: in step 1, a Python-based Selenium crawler crawls news reports according to a base name list to obtain an unprocessed collection of news reports; the content of the unprocessed collection is screened, selecting the base name, the base location, the name of the area the base is responsible for, the names of the base's weaponry, the names of the troops garrisoned at the base, the names of the building facilities in the base, the numbers of those building facilities, the position of each building facility within the base, and base evaluations;
base description information is crawled from Wikipedia and/or Baidu Baike according to the base name list with the Python-based Selenium crawler, thereby obtaining the base name, the base location, the name of the base's responsible area, the names of the base's weaponry, the names of the troops garrisoned at the base, the names of the building facilities in the base, the numbers of the building facilities, the positions of the building facilities, and base evaluations;
and the base's natural language description corpus is formed from the news reports together with the information on base physical facilities, base location, base personnel, and base weaponry acquired from Wikipedia and/or Baidu Baike.
3. The method of claim 1, wherein: in step 2, the base description information of the natural language description corpora in the BIOES-format test, validation, and training sets is labeled with base description categories according to a preset base description classification characterizing bases, and the base description information is additionally labeled with character ordering.
4. The method of claim 1, wherein: in step 3, the BIOES-format training set is used as the data sample for fitting the base named entity recognition model, the BIOES-format validation set is used to evaluate the current training state of the model, and the BIOES-format test set evaluates the generalization performance of the trained model.
5. The method of claim 1, wherein: in step 5, the updated transfer-learning ALBERT layer of the trained base named entity recognition model maps each character of the sentence uploaded by the user into a distributed character vector;
the BiGRU coding layer of the trained base named entity recognition model encodes the distributed character vectors output by the transfer-learning ALBERT layer with a BiGRU network, forming multi-dimensional character vectors;
and the CRF decoding layer of the trained base named entity recognition model decodes the multi-dimensional character vectors output by the BiGRU coding layer, applies constraints according to the implicit sequential relations of the BIOES labeling format, and computes a label sequence that meets the requirements, obtaining the BIOES-format labels corresponding to the uploaded sentence.
6. The method of claim 1, wherein: in step 4, the precision P, recall R, and F1 value are calculated as follows:
P = TP / (TP + FP)    (1)
R = TP / (TP + FN)    (2)
F1 = 2PR / (P + R)    (3)
wherein TP denotes the number of correctly predicted entities, FN the number of positive examples predicted as negative, and FP the number of negative examples predicted as positive, so that TP + FP is the number of samples predicted positive and TP + FN is the total number of positive samples; F1 is the harmonic mean of P and R, balancing the proportion of the two indicators.
7. The method of claim 1, wherein: in step 5, the BIOES-format labels corresponding to the uploaded sentence are stored in Json format; the Json file is parsed with the loads function of the json module in the Python programming language, and the entity strings and entity types predicted by the trained base named entity recognition model are displayed at the Web front end.
8. The method of claim 5, wherein: the BiGRU coding layer is formed by combining GRU models that run over the entity in the forward and backward directions; one GRU unit contains a reset gate and an update gate, calculated in detail as follows:
z(t)=σ(W(z)x(t)+U(z)h(t-1)) (4)
r(t)=σ(W(r)x(t)+U(r)h(t-1)) (5)
h̃(t)=tanh(Wx(t)+U(r(t)∘h(t-1))) (6)
h(t)=z(t)∘h(t-1)+(1-z(t))∘h̃(t) (7)
wherein σ denotes the Sigmoid activation function, which compresses values to the interval (0,1); z denotes the update gate; r the reset gate; t the time step; ∘ the Hadamard product, i.e., element-wise multiplication of corresponding matrix entries; and tanh the tanh activation function, which compresses values to the interval (-1,1);
z(t) denotes the output of the update gate z at time t; W(z) the weight matrix of the input vector x for the update gate z at time t; U(z) the weight matrix of the hidden-layer vector h for the update gate z at time t; h(t-1) the hidden-layer vector at time t-1; r(t) the output of the reset gate r at time t; W(r) the weight matrix of the input vector x for the reset gate r at time t; U(r) the weight matrix of the hidden-layer vector h for the reset gate r at time t; h̃(t) the candidate hidden state at time t; and h(t) the hidden state at time t;
the distributed character vector x(t) is input forwards into equation (4), yielding the entity's forward GRU output →h(t) at time t; the distributed character vector x(t) is then input backwards into equation (4), yielding the entity's backward GRU output ←h(t) at time t;
the outputs of the forward and backward hidden-state layers at the same position are then concatenated to obtain the BiGRU output ht at time t, namely:
ht = [→h(t) ; ←h(t)]    (8)
9. The method of claim 5, wherein: the decoding method of the CRF decoding layer is:
the BiGRU output ht at time t is substituted into the following formulas: equation (9) computes the score of each input value at each position; equation (10) obtains the likelihood function by maximum likelihood estimation; equation (11) normalizes the likelihood of the sequence y with the Softmax function; finally, the label sequence with the highest score is extracted with the Viterbi algorithm, i.e., the BIOES-format labels corresponding to the uploaded sentence;
S(h,y)=Σ(i=0..N)A(yi,yi+1)+Σ(i=1..N)h(i,yi) (9)
L(θ)=Σ(i=1..N)log P(yi|hi)-λ‖θ‖² (10)
P(y|h)=e^S(h,y)/Σ(y'∈Y(h))e^S(h,y') (11)
y*=argmax(y'∈Y(h))S(h,y') (12)
wherein A denotes the transition score matrix of the BIOES-format labels; a transition score expresses the probability of the label combination of the current-position character and its adjacent character, and the transition score matrix is randomly initialized at first, its final scores being learned through the neural network's continued back-propagation; h denotes the emission score matrix, the emission scores being the predicted output ht of the BiGRU layer; S(h, y) denotes the adjacent character order score; N denotes the total number of characters of the training sentence and i the iteration index; P(y|h) denotes the probability of the actual character label sequence given the predicted character label sequence, and P(yi|hi) denotes equation (11) with yi substituted for y and hi substituted for h, hi being the character at position i of the vector h; h denotes the output of the BiGRU, λ the coefficient of the parameters in the likelihood function, and L the likelihood function; θ denotes the parameter distribution to be estimated, i.e., the parameter values of the CRF model, and Y(h) denotes the set of all possible BIOES-format label sequences; y* denotes the most probable BIOES-format label sequence, S(h, y') denotes equation (9) with y' substituted for y, y' being an element of Y(h); e denotes the natural constant; and argmax in equation (12) selects the label sequence with the maximum score over the set of all possible BIOES label sequences.
10. A base named entity recognition system based on transfer learning, characterized in that: it comprises a base corpus collection and preprocessing module (1), a machine learning data set construction module (2), a transfer-learning ALBERT layer updating module (3), a base named entity recognition model training module (4), and a sentence recognition module (5);
the base corpus collection and preprocessing module (1) is used for obtaining natural language description corpuses of a base from the Internet and preprocessing the natural language description corpuses so as to remove picture description information and HTML label information and uniformly describe units of attribute values;
the machine learning data set construction module (2) is used for randomly dividing the preprocessed natural language description corpus into a test set, a verification set and a training set, and performing entity labeling on the natural language description corpus of the test set, the natural language description corpus of the verification set and the natural language description corpus of the training set by using a BIOES format labeling mode to form the test set, the verification set and the training set in a BIOES labeling format;
the migration learning ALBERT layer updating module (3) is used for acquiring an open-source migration learning ALBERT model, updating the open-source migration learning ALBERT model by using a finetune mode through the natural language description linguistic data of the base, and obtaining an updated migration learning ALBERT layer;
the base named entity recognition model training module (4) is used for constructing a base named entity recognition model by utilizing the updated migration learning ALBERT layer, the BiGRU coding layer and the CRF constraint layer, training the base named entity recognition model by using a test set, a verification set and a training set of a BIOES labeling format as a training data set of the base named entity recognition model, and obtaining the trained base named entity recognition model by continuous iterative fitting by using accuracy, recall rate and F1 values as training evaluation indexes in the training process;
and the sentence recognition module (5) is used for recognizing the sentences uploaded by the user by using the trained base named entity recognition model to obtain the BIOES format labels corresponding to the uploaded sentences.
CN202111652819.2A 2021-12-30 2021-12-30 Base named entity recognition system and method based on transfer learning Pending CN114356990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111652819.2A CN114356990A (en) 2021-12-30 2021-12-30 Base named entity recognition system and method based on transfer learning


Publications (1)

Publication Number Publication Date
CN114356990A true CN114356990A (en) 2022-04-15

Family

ID=81103499


Country Status (1)

Country Link
CN (1) CN114356990A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116467500A (en) * 2023-06-15 2023-07-21 阿里巴巴(中国)有限公司 Data relation identification, automatic question-answer and query sentence generation method
CN116467500B (en) * 2023-06-15 2023-11-03 阿里巴巴(中国)有限公司 Data relation identification, automatic question-answer and query sentence generation method
CN116910646A (en) * 2023-07-04 2023-10-20 南京航空航天大学 Method for classifying internal link objectives of knowledge units in SO website
CN116910646B (en) * 2023-07-04 2024-02-09 南京航空航天大学 Method for classifying internal link objectives of knowledge units in SO website


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination