CN116304020A - Industrial text entity extraction method based on semantic source analysis and span characteristics - Google Patents
- Publication number
- CN116304020A CN116304020A CN202310045143.3A CN202310045143A CN116304020A CN 116304020 A CN116304020 A CN 116304020A CN 202310045143 A CN202310045143 A CN 202310045143A CN 116304020 A CN116304020 A CN 116304020A
- Authority
- CN
- China
- Prior art keywords
- entity
- span
- word
- text
- industrial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention discloses an industrial text entity extraction method based on sememe (semantic source) analysis and span features, belonging to the technical field of natural language processing. The method comprises the following steps: acquiring an industrial text data set and preprocessing it to obtain the original text; performing word segmentation on the original text while training word vectors; obtaining entity category definitions based on sememe analysis, then performing entity labeling and data set division on the original text; designing a span-based entity extraction model and training and testing it with the labeled original text to obtain a trained entity extraction model; and performing entity recognition on unlabeled industrial text with the trained model. The method can rapidly define entity categories, reducing the dependence of entity definition on expert knowledge and manual text analysis; at the same time, the multi-feature fusion entity extraction model identifies common entity boundaries, such as industrial parts, more accurately, giving higher entity recognition accuracy.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to an industrial text entity extraction method based on semantic source analysis and span characteristics.
Background
With the maturing of computer software and hardware and of information processing technology in the field of artificial intelligence, China is vigorously promoting the application of artificial intelligence technology in industry to further advance industrial intelligence. The knowledge graph, as an emerging technology, can integrate complex and massive data and connect them through mined relations, giving it strong data description capability and rich semantic relations.
Entities are the key language units carrying information in text and the core elements of a knowledge graph. Completing the entity extraction task with high quality is the basis of subsequent work such as attribute and relation extraction, event extraction, and knowledge graph construction. Its main task is to identify and classify entities with specific meanings in text; the general domain, for example, usually takes person names, place names, and organization names as extraction targets. The research object here is the target entity of the industrial field: named entities of high value in unstructured industrial text from real industrial scenes, such as component units, processing tools, and fault conditions. These entities often carry a great deal of industrial knowledge and have high practical value. At present, data in the industrial field is disordered and text structure differs greatly across sub-domains; without expert guidance, combing valuable entities out of a corpus and defining them consumes a great deal of labor, so designing a general extraction scheme for text entities in this field is significant.
Once the entity types are defined, a domain entity extraction model can identify industrial named entities accurately and efficiently, provide support for downstream tasks such as industrial question-answering systems, industrial information retrieval, and the construction of industrial reasoning models, improve the construction efficiency of industrial knowledge bases, and further promote the automation and intelligence of the industrial field.
Disclosure of Invention
Aiming at the long duration and low efficiency of manually modeling the entity extraction task for industrial-field text, the invention provides an entity category definition method based on sememes, thereby assisting non-experts in semi-automatically defining the entity categories of the industrial entity extraction task and improving knowledge extraction modeling efficiency.
In order to achieve the above purpose, the invention provides an industrial text entity extraction method based on semantic source analysis and span characteristics, which mainly comprises the following steps:
(1) Acquiring an industrial text data set, and preprocessing the industrial text data set to obtain an original text;
(2) Performing word segmentation operation on the original text to obtain a word segmentation result, and performing word vector training on the word segmentation result to obtain word vectors of an industrial corpus;
(3) Acquiring entity class definitions based on semantic source analysis, and carrying out entity labeling and data set division on the original text;
(4) Designing a span-based entity extraction model, and performing model training and testing by using the marked original text to obtain a trained entity extraction model;
the entity extraction model obtains the feature representation of each span through multi-feature construction, feature fusion, and feature encoding, and then completes entity recognition through a sememe center word recognizer, a boundary regressor, and an entity classifier;
(5) And carrying out entity recognition on the unlabeled industrial text by using the trained entity extraction model.
Further, the preprocessing in step 1 includes deduplication, stemming, stop-word removal, removal of numeric sequence labels, and removal of English text.
Further, the step 2 specifically includes:
performing word segmentation on the original text by using a word segmentation tool to obtain a word segmentation result;
the word segmentation result is input to a pre-training model; the pre-training model uses Word2Vec from the Python Gensim topic-model package and adopts the Skip-gram model to complete pre-training on the segmentation result, yielding the word vectors of the industrial corpus.
Further, obtaining entity category definitions based on sememe analysis includes:
(3.1) calculating TF-IDF feature values for the word segmentation results, and selecting a certain proportion of the segmented words with the highest TF-IDF values as candidate words of the original text;
the TF-IDF feature value is calculated as:
TF-IDF = TF × IDF
wherein TF is the frequency with which the segmented word appears in a sentence, and IDF is the base-10 logarithm of the total number of sentences divided by the number of sentences containing the word;
(3.2) performing cluster analysis on the candidate words by using a K-means algorithm;
(3.2.1) initializing k distinct cluster centers, each cluster center c_i being a randomly chosen candidate-word vector:
c_i ← w_j, 1 ≤ i ≤ k, 1 ≤ j ≤ n
wherein j is a random positive integer in [1, n], n is the number of candidate-word vectors, n = card(W), W is the set of candidate-word vectors, and i is the index of the cluster center;
W = {w_j | w_j = (v_j1, v_j2, …, v_jd), 1 ≤ j ≤ n}
(3.2.2) assigning each candidate-word sample to the nearest of the k cluster centers by the shortest-distance principle, yielding clusters c_i, wherein the distance dist is the Euclidean distance;
(3.2.3) recalculating the center of each of the k clusters; judging whether the recalculated k cluster centers are identical to the current k cluster centers; if not, taking the recalculated centers as the current centers and repeating steps (3.2.2)-(3.2.3) until the recalculated centers are identical to the current ones, at which point the k divided clusters are taken as the k task clusters;
(3.3) performing sememe analysis on the candidate-word clustering result, and defining the entity category of each cluster according to the analysis result;
the sememes of each candidate word are obtained with the OpenHowNet sememe analysis tool; according to the clustering result, the sememe frequency distribution of the candidate words in each cluster is counted to obtain a ranking, the highest-ranked sememe class is taken as the reference entity category, and the entity type of that cluster is then defined manually.
Further, the entity labeling comprises entity category labeling and entity span labeling; entity span labeling requires marking the start position index and end position index of the entity within the sentence, which together form the entity's span representation;
the data division divides the labeled original text into a training set, a validation set, and a test set.
Further, the span-based entity extraction model includes:
constructing multiple characteristics;
extracting features of the original text by using a BERT pre-training model to obtain BERT feature vectors corresponding to each character;
splitting the radicals corresponding to each character of the original text, and coding all the radicals through One-Hot coding to obtain radical feature vectors;
constructing sequence features on the word segmentation result of the original text: the character at the start of each word and the characters at its other positions receive different labels, part-of-speech features are added to the labels, and each label is initialized with the nn.Embedding method in PyTorch, yielding the word-segmentation feature vector of each character;
feature fusion
extracting the radical feature vectors and word-segmentation feature vectors with two separate bidirectional long short-term memory (BiLSTM) networks, and concatenating the two extracted feature vectors with the BERT feature vector to obtain the fused feature vector of each character;
feature encoding
encoding the fused feature vectors of each word's span start character and span end character together with a word-length feature vector to obtain the feature representation of each word's span;
sememe center word span recognizer
performing sememe analysis on the word segmentation result of the original text to obtain the true sememe-center-word tag of each word; the similarity between the sememe analysis result and the entity categories is computed, a word meeting the similarity threshold being a true sememe center word, which yields the gold span of the sememe center word; the span feature representation of each segmented word is passed through an MLP and a sigmoid function to obtain the span's sememe-center-word score, and if the score exceeds a threshold the span is a candidate sememe center word;
boundary regressor
extracting the feature information of the characters before and after the candidate sememe center word with a convolutional network, computing the forward and backward offsets with two tanh functions respectively, and finally obtaining the complete entity span;
entity classifier
obtaining the span feature representation from the complete entity span, training a classifier using a feedforward neural network (FFNN), and predicting the entity's category score to complete entity recognition.
Further, the sememe-center-word score of a span is computed as:
score(s_j) = sigmoid(MLP(h_e(s_j)))
wherein h_e(s_j) is the span representation of the j-th span, and the MLP consists of a linear classification layer and a GELU activation function.
Further, the boundary regressor specifically includes:
extracting the feature information of the characters before and after the candidate sememe center word with a convolutional network, the representation being:
h_c(s_j) = MaxPool(W ∗ x_j + b)
wherein x_j is the concatenation of the vectors of the k characters before and after the center of the j-th span, W is the convolution kernel, b is the bias; convolution extracts the character features around the center word, and max pooling is applied;
calculating the forward offset and backward offset:
Δ_f(s_j) = tanh(W_1 h_c(s_j) + b_1), Δ_b(s_j) = tanh(W_2 h_c(s_j) + b_2)
wherein Δ_f is the forward offset, Δ_b is the backward offset, and W_1, b_1 and W_2, b_2 are the parameters of the forward and backward offset calculations respectively;
to correct the offsets, the calculated offsets are rounded:
l'_j = [l_j + Δ_f(s_j)], r'_j = [r_j + Δ_b(s_j)]
wherein l'_j denotes the start position index of the j-th span, r'_j denotes its end position index, and [·] denotes rounding;
the loss function uses the Smooth L1 loss over the predicted and gold boundary indices:
L_reg = Σ_j SmoothL1(l'_j − l*_j) + SmoothL1(r'_j − r*_j), where SmoothL1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise, and l*_j, r*_j are the gold start and end indices;
further, the entity classifier specifically includes:
training the classifier with a feedforward neural network (FFNN), selecting the ReLU function as the nonlinear activation function of each layer;
softmax is used as the scoring function:
P_e(e|s_i) = softmax(W_e f_activate(f_classify(h_e(s_i))))
wherein P_e(e|s_i) is the probability that the i-th span is an entity of type e, f_activate is the activation function, f_classify is the FFNN, and h_e(s_i) is the span representation of the i-th span;
the loss function is the cross-entropy loss:
L_cls = −Σ_{s_i ∈ S} log P_e(e*(s_i)|s_i)
wherein S is the enumerated span set and e*(s_i) is the gold label of span s_i; during training, the model is optimized by minimizing the negative log-likelihood;
the entity category label y_e(s_i) with the highest score is taken as the prediction:
y_e(s_i) = argmax_{e ∈ ε} P_e(e|s_i)
wherein ε is the set of entity labels marking entity spans.
Further, the model training and testing process comprises:
(4.1) inputting the training set data into the entity extraction model; obtaining the feature representation of each word's span through multi-feature construction, feature fusion, and feature encoding; inputting the representation into the sememe center word recognizer, boundary regressor, and entity classifier for training; and optimizing the loss function of each of the three components to obtain the trained entity extraction model;
(4.2) feeding the test set data into the trained model to obtain entity recognition results, comparing them against the entity labels, and counting the correct and incorrect detections in the test samples to compute the comprehensive evaluation index, the F1 value:
F1 = 2PR / (P + R)
wherein P is the recognition precision and R is the recall.
The invention has the beneficial effects that:
In the task of defining text entities in the industrial field, the invention can rapidly identify important candidate words in complex industrial text and rapidly define entity categories, reducing the dependence of entity definition on expert knowledge and manual text analysis; at the same time, the multi-feature fusion entity extraction model identifies common entity boundaries, such as industrial parts, more accurately, giving higher entity recognition accuracy.
Drawings
FIG. 1 is a flow chart of an industrial text entity extraction method based on semantic source analysis and span features according to an embodiment of the present invention.
FIG. 2 is a flow chart of an entity class definition method based on semantic source analysis according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a span-based entity extraction model training process according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a feature stitching fusion mode according to an embodiment of the present invention.
FIG. 5 is a diagram of a span-based entity extraction model according to an embodiment of the present invention.
Detailed Description
In order to provide a better understanding of the aspects of the present invention, the present invention will be described in further detail with reference to specific embodiments. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A preferred embodiment is given below with reference to the accompanying drawings:
as shown in fig. 1, the span-based industrial field text entity extraction method of the present embodiment includes the following steps:
s101, acquiring an industrial text data set, and preprocessing the industrial text data set to obtain an original text;
First, an industrial text data set is acquired. To improve the quality of the original industrial text data, the original text is preprocessed: repeated sentences in the corpus are removed, stop words are deleted based on the Harbin Institute of Technology (HIT) stop-word list, step sequence-number labels possibly present in the corpus are removed, and English text is removed when no entity in the corpus contains English. Data processing yields industrial text data of higher quality.
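The preprocessing described above can be sketched with standard-library tools; the regular expressions and the sample stop-word set here are illustrative assumptions, not the patent's exact rules.

```python
import re

def preprocess(sentences, stopwords):
    """Deduplicate sentences, strip step-number labels, stop words,
    and English tokens (assumes no entity contains English)."""
    seen, cleaned = set(), []
    for s in sentences:
        s = re.sub(r'^\s*\d+[.)、]\s*', '', s)   # leading step labels like "1." or "2)"
        s = re.sub(r'[A-Za-z]+', '', s)          # drop English runs
        s = ''.join(ch for ch in s if ch not in stopwords)
        s = s.strip()
        if s and s not in seen:                  # keep first occurrence only
            seen.add(s)
            cleaned.append(s)
    return cleaned
```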
S102, performing word segmentation on the original text to obtain the word segmentation result, and training word vectors on the result to obtain the word vectors of the industrial corpus;
The original text is divided into sentences, which are then segmented with a word segmentation tool to obtain the word segmentation result. The result is input to the word vector training model: Word2Vec from the Python Gensim topic-model package, pre-trained with the Skip-gram model. It takes the segmentation result as input and outputs a distributed vector representation of each word, yielding the word vectors of the industrial corpus; the vector dimension is a configurable hyperparameter that defaults to 200.
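To make the Skip-gram objective concrete, the sketch below enumerates the (center word, context word) training pairs that Skip-gram predicts within a context window; in practice this step would be delegated to Gensim's `Word2Vec(sentences, sg=1)`, so this pure-Python version is illustrative only.

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate the (center, context) pairs that Skip-gram trains on:
    for each position, every neighbor within `window` is a context word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```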
S103, acquiring entity class definition based on semantic source analysis, and carrying out entity labeling and data set division on the original text;
A. as shown in fig. 2, the acquisition of entity class definitions includes the steps of:
(1) Calculating TF-IDF feature values for the word segmentation results, and selecting a certain proportion of the segmented words with the highest TF-IDF values as candidate words of the original text;
TF-IDF is a statistical method for evaluating the importance of a word to a document in a document set or corpus. A word's importance increases with the number of times it appears in the document but decreases with its frequency across the corpus. The calculation formula is:
tfidf_{i,j} = tf_{i,j} × idf_i
tf is the term frequency (Term Frequency), the frequency with which term i occurs in document d_j:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}
wherein the numerator n_{i,j} is the number of occurrences of the term in the document, and the denominator is the total number of occurrences of all terms in the document.
idf is the inverse document frequency (Inverse Document Frequency), a measure of a term's general importance: the total number of documents is divided by the number of documents containing the term, and the quotient is taken as a base-10 logarithm:
idf_i = log10( |D| / (1 + |{j : t_i ∈ d_j}|) )
wherein |D| is the total number of documents (equivalently, sentences) in the corpus, the denominator is the number of documents containing the term, and adding one prevents a zero denominator.
The main idea of IDF is: the fewer the documents containing term t (the smaller n), the larger the IDF, and the better the class-distinguishing ability of t. Conversely, if term t appears in m documents of some class C and in p documents of the other classes, then when m is large, n = m + p is also large; by the IDF formula the IDF value is small, indicating that term t has weak classification ability.
High term frequencies within a particular document, and low document frequencies of that term throughout the document collection, may yield a high weighted TF-IDF. Thus, TF-IDF tends to filter out common words, preserving important words. The importance ranking of the words is obtained by calculating and utilizing the importance of the TF-IDF to measure the words.
After the feature values are calculated, a proportion of the segmented words with the highest TF-IDF values is selected as the candidate words of the original text; the top 30% is selected by default, and a different proportion can be chosen according to the actual word count.
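The TF-IDF computation above, with the add-one denominator and base-10 logarithm, can be sketched as follows; the toy token lists in the test are invented.

```python
import math

def tfidf(sentences):
    """sentences: list of token lists (each sentence treated as a document).
    Returns {(word, sentence_index): tf-idf score}."""
    n_sent = len(sentences)
    df = {}                                  # document frequency per word
    for toks in sentences:
        for w in set(toks):
            df[w] = df.get(w, 0) + 1
    scores = {}
    for j, toks in enumerate(sentences):
        for w in set(toks):
            tf = toks.count(w) / len(toks)   # tf_{i,j} = n_{i,j} / Σ_k n_{k,j}
            idf = math.log10(n_sent / (1 + df[w]))  # idf_i with add-one denominator
            scores[(w, j)] = tf * idf
    return scores
```

With this weighting, a rare word outscores a word that appears in every sentence, which is exactly why the top-ranked words serve as entity candidates.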
(2) Performing cluster analysis on the candidate words by using a K-means algorithm;
the method comprises the following specific steps:
(2.1) initializing k distinct cluster centers, each cluster center c_i being a randomly chosen candidate-word vector:
c_i ← w_j, 1 ≤ i ≤ k, 1 ≤ j ≤ n
wherein j is a random positive integer in [1, n], n is the number of candidate-word vectors, n = card(W), W is the set of candidate-word vectors, i is the index of the cluster center, and k is a preset hyperparameter;
W = {w_j | w_j = (v_j1, v_j2, …, v_jd), 1 ≤ j ≤ n}
(2.2) assigning each candidate-word sample to the nearest of the k cluster centers by the shortest-distance principle, yielding clusters c_i, wherein the distance dist is the Euclidean distance;
(2.3) recalculating the center of each of the k clusters; judging whether the recalculated k cluster centers are identical to the current k cluster centers; if not, taking the recalculated centers as the current centers and repeating steps (2.2)-(2.3) until the recalculated centers are identical to the current ones, at which point the k divided clusters are taken as the k task clusters.
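Steps (2.1)-(2.3) are standard Lloyd-iteration K-means; a minimal sketch over word vectors is given below. For simplicity it initializes centers from the first k vectors rather than random candidates, and the 2-D toy vectors in the test are invented.

```python
import math

def kmeans(vectors, k, max_iter=100):
    """K-means with Euclidean distance: assign each vector to the nearest
    center, recompute centers as cluster means, stop when assignments fix."""
    centers = [list(v) for v in vectors[:k]]       # simple deterministic init
    assign = [0] * len(vectors)
    for _ in range(max_iter):
        new_assign = [
            min(range(k), key=lambda i: math.dist(v, centers[i]))
            for v in vectors
        ]
        if new_assign == assign:                   # converged: partition stable
            break
        assign = new_assign
        for i in range(k):                         # recompute cluster centers
            members = [v for v, a in zip(vectors, assign) if a == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers
```

In practice `sklearn.cluster.KMeans` would be the idiomatic choice; this version only illustrates the loop described in the steps above.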
(3) Performing semantic source analysis on the candidate word clustering result, and defining entity category of each cluster according to the semantic source analysis result;
The clustering result C obtained in step (2) comprises k groups of sets, where k is the manually defined number of cluster entity categories; each group contains several words, namely the candidate words.
Sememe analysis is performed on each of the k word sets: with a sememe analysis tool, sememe word-frequency statistics are computed over all words in a set to obtain a frequency-ranked list of sememes. The top few high-frequency sememe classes are then analyzed manually; these sememes fuse the conceptual features of the words in the cluster and can abstractly represent the cluster's entity type. A sememe class is manually selected as the entity's type (the entity category), or several sememe definitions are combined to cover an abstract entity type for the high-frequency sememe words.
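The per-cluster sememe frequency statistic can be sketched as below. The sememe lookup table is a hypothetical stand-in for a real OpenHowNet query (OpenHowNet exposes word-to-sememe lookup via its `HowNetDict`); the words and sememe labels here are invented for illustration.

```python
from collections import Counter

# Hypothetical word -> sememe mapping standing in for an OpenHowNet lookup.
SEMEME_DICT = {
    "齿轮": ["part", "machine"],
    "轴承": ["part", "machine"],
    "铣刀": ["tool", "cut"],
}

def rank_sememes(cluster_words, sememe_dict=SEMEME_DICT):
    """Count sememe frequencies over one cluster's candidate words; the
    top-ranked sememes suggest the cluster's entity category (the final
    category name is still chosen manually, per the method)."""
    counts = Counter(s for w in cluster_words for s in sememe_dict.get(w, []))
    return counts.most_common()
```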
B. Entity annotation
The text is labeled manually according to the self-defined k entity classes. Labeling must identify the position index of the entity in the sentence and the entity's category: the position index comprises the start and end position indices of the entity within the sentence, and the category is one of the k defined entity categories.
C. Data partitioning
The labeled data is divided into a training set, a validation set, and a test set in an approximate 8:1:1 ratio.
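A minimal sketch of the 8:1:1 split; the shuffle seed and the ratio tuple are illustrative defaults, not values specified by the method.

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle labeled samples and split them into train/validation/test
    sets in an approximate 8:1:1 ratio."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)     # deterministic shuffle for reproducibility
    n = len(samples)
    n_train = int(n * ratios[0])
    n_dev = int(n * ratios[1])
    return (samples[:n_train],
            samples[n_train:n_train + n_dev],
            samples[n_train + n_dev:])
```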
S104, designing a span-based entity extraction model, and performing model training and testing by using the marked original text to obtain a trained entity extraction model;
As shown in fig. 3 and 5, the entity extraction model obtains the feature representation of each span through multi-feature construction, feature fusion, and feature encoding, and then completes entity recognition through a sememe center word recognizer, a boundary regressor, and an entity classifier;
The original text sequence is T = {c_1, c_2, …, c_m}, wherein c_i denotes the i-th character in the sequence and m denotes the sentence length.
As shown in fig. 4, the feature stitching fusion mode adopts multi-feature fusion to fuse three feature vectors of part-of-speech features, component features and word vectors, the fused feature vectors are input into a two-way long-short-term memory network for encoding, and the start position vector, the end position vector and the entity length feature vector of the span are selected from the encoded sentence character feature vectors for stitching.
The BERT pre-training model splits the original text data into individual characters and extracts features from the original text, yielding the BERT feature vector of each character.
The radicals of each character of the original text are split off, and all radicals are encoded with One-Hot encoding to obtain the radical feature vectors.
Sequence features are constructed on the word segmentation result of the original text: the character at the start of a word and the characters at its other positions receive different labels, with part-of-speech features added; the start position is labeled "B-" + part-of-speech and the remaining positions "I-" + part-of-speech. Finally, each label is initialized with the nn.Embedding method in PyTorch to obtain the word-segmentation feature vector of each character.
Feature fusion: the radical feature vectors and word-segmentation feature vectors are each passed through a separate bidirectional long short-term memory network, and the results are concatenated with the BERT feature vector. If the character (BERT) embedding dimension is b, the radical embedding dimension is p, and the word-segmentation embedding dimension is f, the final embedding dimension is b + p + f, yielding the fused character feature vector E_i = [E_BERT; E_radical; E_seg].
Span representation: each word-segmentation span is encoded. The representation of the j-th span in the sentence's entity label set is defined as:
h_e(s_j) = [E_START(j); E_END(j); E_length(j)]
wherein s denotes a span in the sentence, j denotes the j-th span in the sentence's entity label set, E_START(j) denotes the character feature vector of the span's start character (i being the entity's start position index in the sentence), E_END(j) the character feature vector of its end character, and E_length(j) the entity length feature vector, which is encoded here with the nn.Embedding method in PyTorch.
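The span representation h_e(s_j) = [E_START; E_END; E_length] is a plain concatenation; a toy sketch with invented vectors (and a dict standing in for the nn.Embedding length table) makes the dimensions explicit.

```python
def span_representation(char_vecs, start, end, length_emb):
    """Concatenate the start-character vector, end-character vector, and a
    length-feature vector for the span [start, end] (inclusive indices).
    char_vecs: per-character fused feature vectors (lists);
    length_emb: hypothetical length -> embedding table."""
    return char_vecs[start] + char_vecs[end] + length_emb[end - start + 1]
```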
Sememe-center-word span recognizer: first, each segmented word of the original text is taken as a span; considering special tokens composed of letters, digits, and symbols in sentences, complex compound tokens are also taken as spans via regular-expression matching, and the word segmentation results and the rule-matched results are merged into the initial span word set. For the j-th span s_j in the set, sememe analysis is performed to obtain its sememe analysis result. The similarity between the sememe analysis result and the defined entity categories is then computed; a word meeting the similarity threshold is a true sememe center word, yielding the gold span of the sememe center word. Cosine similarity is used by default.
The span feature representation of each segmented word of the original text is passed through an MLP and a sigmoid function to obtain the span's sememe head-word score, computed as follows:

wherein: the MLP consists of a linear classification layer and a GELU activation function; if the score exceeds a threshold, the span is taken as a candidate sememe head word.
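A sketch of the scorer above, assuming illustrative layer sizes and a threshold of 0.5 (the patent does not state these values); the structure — linear layer, GELU, then sigmoid — follows the text:

```python
import torch
import torch.nn as nn

class HeadWordScorer(nn.Module):
    """MLP (linear + GELU) followed by sigmoid, yielding a head-word score."""
    def __init__(self, dim=20, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden),
                                 nn.GELU(),
                                 nn.Linear(hidden, 1))

    def forward(self, span_repr):
        return torch.sigmoid(self.mlp(span_repr))

scorer = HeadWordScorer()
score = scorer(torch.randn(20))          # score in (0, 1)
is_candidate = score.item() > 0.5        # assumed threshold
```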
A boundary regressor is added after the trained head-word recognizer to learn the forward and backward word offsets, since entities in the industrial field are often formed by combining several segmented words and exhibit entity nesting, so that a longer entity must be assembled around the head word. The initial representation of the boundary regressor augments the head-word representation with features of the characters on both sides of the head word, as shown in the following formula:

wherein the input for the j-th span is the concatenation of the vectors of the k characters before and after its center, W is a convolution kernel, and b is a bias; convolution is used to extract the character features around the head word, and max pooling is applied by default.
The forward and backward offsets of the head word are computed separately with two tanh activation functions, one producing the forward offset and the other the backward offset; w_1, b_1 and w_2, b_2 are the parameters of the forward and backward computations respectively. This yields the new candidate entity span, as given by the formula:
to correct the offset, the calculated offset is rounded,a start position index indicating the jth span, < ->Index indicating termination position of jth span, "[]"means rounding calculation, the rounding calculation is as follows:
training a classifier, namely enumerating all spans in sentences, calculating to obtain the representation of each span, training the classifier by using a feedforward neural network FFNN as a neural network model, selecting a RuLU function as a nonlinear activation function of each layer, and improving the training speed of the model, wherein the FFNN layer number and the hidden layer neuron number are set as super-parameters, and finally, the task is to predict the possibility that the spans are entities, and because the entity class classification task is a multi-classification task, the softmax is used as a scoring function, as shown in the formula:
P_e(e|s_i) = softmax(W_e f_activate(f_classify(h_e(s_i))))

wherein: P_e(e|s_i) represents the probability that the i-th span is an entity of class e, f_activate represents the activation function, and f_classify represents the neural network FFNN.
The loss function employs a cross entropy loss function as shown in the formula:
wherein: S represents the enumerated span set; during training, the model is trained by minimizing the negative log-likelihood.
The entity class label y_e(s_i) with the highest predicted score is taken as the prediction result, as shown in the formula, where ε is the set of entity labels marking entity spans.
y_e(s_i) = argmax_{e∈ε} P_e(e|s_i)
The final loss function consists of three optimization objectives: the candidate head-word loss, the boundary regression loss, and the entity classifier loss. The ground truth of the candidate head words is based on sememe similarity; the candidate head-word loss and the entity classifier loss both use cross-entropy, while the boundary regression loss uses the smooth L1 loss, calculated as follows:
the final loss function is calculated as follows:
Loss = Loss_core + Loss_Reg + Loss_e
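The three loss terms can be sketched as follows; the particular probabilities and offsets fed in are invented for illustration:

```python
import math

def smooth_l1(pred, target):
    """Smooth L1: quadratic below |d| = 1, linear above."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1 else d - 0.5

def cross_entropy(prob_true_class):
    """Cross-entropy for the probability assigned to the true class."""
    return -math.log(prob_true_class)

loss = (cross_entropy(0.8)        # candidate head-word loss (assumed prob)
        + smooth_l1(4.3, 4.0)     # boundary regression loss (assumed offsets)
        + cross_entropy(0.9))     # entity classifier loss (assumed prob)
```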
training the model by using the training set, and minimizing candidate center word loss, boundary regression loss and entity classifier loss.
Using the validation set, the model is tuned to its optimum according to the comprehensive evaluation index (F1 value), and the optimal model is saved.
Model testing, namely feeding the test set data into the trained model to obtain predicted labels, comparing them with the actual labels, and counting the numbers of correctly and incorrectly detected test samples to obtain the detection precision, recall, and comprehensive evaluation index F1 value;
wherein: p is the recognition accuracy rate, R is the recall rate.
S105, performing entity recognition on the unlabeled industrial text by using the trained entity extraction model.
In the description of the present invention, the description is provided merely for convenience and simplicity; it does not indicate or imply that the algorithms or processes referred to must take a specific form or a specific construction, operation, and design, and it should therefore not be construed as limiting the invention.
Claims (10)
1. An industrial text entity extraction method based on semantic source analysis and span characteristics is characterized by comprising the following steps:
(1) Acquiring an industrial text data set, and preprocessing the industrial text data set to obtain an original text;
(2) Performing word segmentation operation on the original text to obtain a word segmentation result, and performing word vector training on the word segmentation result to obtain word vectors of an industrial corpus;
(3) Acquiring entity class definitions based on semantic source analysis, and carrying out entity labeling and data set division on the original text;
(4) Designing a span-based entity extraction model, and performing model training and testing by using the marked original text to obtain a trained entity extraction model;
the entity extraction model obtains the feature representation of each span through multi-feature construction, feature fusion and feature encoding, and then completes entity recognition through a sememe head-word recognizer, a boundary regressor and an entity classifier;
(5) And carrying out entity recognition on the unlabeled industrial text by using the trained entity extraction model.
2. The method for extracting industrial text entities based on semanteme analysis and span features according to claim 1, wherein: the preprocessing in step 1 comprises deduplication, stemming, stop-word removal, and removal of numeric serial numbers and English text.
3. The method for extracting industrial text entities based on semanteme analysis and span features according to claim 1, wherein said step 2 specifically comprises:
performing word segmentation on the original text by using a word segmentation tool to obtain a word segmentation result;
and using the word segmentation result as the input of a pre-training model, wherein the pre-training model uses Word2Vec from the Python Gensim topic-model package and adopts the Skip-gram model to complete pre-training on the word segmentation result, obtaining the word vectors of the industrial corpus.
4. The method for extracting industrial text entities based on semblance analysis and span features as claimed in claim 1, wherein the obtaining entity class definitions based on semblance analysis comprises:
(3.1) calculating TF-IDF characteristic values aiming at the word segmentation results, and selecting a certain proportion of the word segmentation results with higher TF-IDF characteristic values as candidate words of the original text;
the TF-IDF characteristic value is calculated as follows:
TF-IDF=TF*IDF
wherein: TF represents the frequency of the segmented word within a sentence, and IDF is the base-10 logarithm of the total number of sentences divided by the number of sentences containing the word;
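The TF-IDF computation just defined can be sketched as follows, using sentence-level counts exactly as described (the toy corpus is invented):

```python
import math

def tf_idf(word, sentence, sentences):
    """TF = word frequency within the sentence; IDF = log10(N / n_w)."""
    tf = sentence.count(word) / len(sentence)
    n_w = sum(1 for s in sentences if word in s)
    idf = math.log10(len(sentences) / n_w)
    return tf * idf

sentences = [["pump", "leak"], ["pump", "wear"], ["leak", "valve"]]
score = tf_idf("leak", sentences[0], sentences)   # TF = 0.5, IDF = log10(3/2)
```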
(3.2) performing cluster analysis on the candidate words by using a K-means algorithm;
(3.2.1) initializing k different cluster centers, cluster center c being:
c_i ← w_j, 1 ≤ i ≤ k, 1 ≤ j ≤ n
wherein: j is a random positive integer in [1, n ], n is the number of word vectors of the candidate word, n=card (W), W is a set of word vectors of the candidate word, and i is a label of the cluster center;
W = {w_j | w_j = {v_j1, v_j2, ..., v_jd}, 1 ≤ j ≤ n}
(3.2.2) For the current k cluster centers, divide the candidate-word samples into k clusters according to the shortest-distance principle, obtaining clusters c_i, where the distance dist is computed with the Euclidean distance;
(3.2.3) recalculating cluster centers for the divided k cluster centers to obtain calculated k cluster centers; judging whether the calculated k cluster centers are the same as the current k cluster centers or not; if the k cluster centers are different, the calculated k cluster centers are re-used as the current k cluster centers, and the steps (3.2.1) - (3.2.3) are repeated until the calculated k cluster centers are the same as the current k cluster centers, and the divided k clusters are used as k task clusters;
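Steps (3.2.1)–(3.2.3) amount to standard k-means with Euclidean distance; a minimal sketch (the toy 2-D data are invented, and real candidate-word vectors would be d-dimensional Word2Vec embeddings):

```python
import math
import random

def kmeans(vectors, k, seed=0):
    """Assign to nearest centre, recompute centres, repeat until stable."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)           # (3.2.1) initial centres
    while True:
        clusters = [[] for _ in range(k)]
        for v in vectors:                      # (3.2.2) shortest-distance assignment
            i = min(range(k), key=lambda c: math.dist(v, centers[c]))
            clusters[i].append(v)
        # (3.2.3) recompute centres; keep the old centre if a cluster is empty
        new = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:                     # stop when centres are unchanged
            return clusters
        centers = new

data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters = kmeans(data, k=2)
```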
(3.3) carrying out semantic source analysis on the candidate word clustering result, and defining the entity category of each cluster according to the semantic source analysis result;
obtaining the sememes of each candidate word by using the OpenHowNet sememe analysis tool; according to the cluster analysis result, counting the sememe frequency distribution of the candidate words within each cluster to obtain a ranking, selecting the top-ranked sememe as the reference entity category name for that cluster, and manually defining the entity type of the cluster.
5. The method for extracting industrial text entities based on semanteme analysis and span features according to claim 1, wherein: the entity label comprises the entity category label and an entity span label, wherein the entity span label requires to label a starting position index and a terminating position index of an entity in a sentence, and the starting position index and the terminating position index form an entity span representation of the entity in the sentence;
and dividing the data, namely dividing the marked original text into a training set, a verification set and a test set.
6. The method for industrial text entity extraction based on semanteme analysis and span features of claim 5, wherein the span-based entity extraction model comprises:
constructing multiple characteristics;
extracting features of the original text by using a BERT pre-training model to obtain BERT feature vectors corresponding to each character;
splitting the radicals corresponding to each character of the original text, and coding all the radicals through One-Hot coding to obtain radical feature vectors;
constructing sequence features on the word segmentation result of the original text: words at the initial position of a word and at other positions are given different labels, part-of-speech features are added to the labels, and the label of each word is initialized with the nn.Embedding method in PyTorch to obtain the word segmentation feature vector of each character;
feature fusion
extracting the two features, namely the radical feature vector and the word segmentation feature vector, by using different bidirectional long short-term memory networks, and concatenating the extracted two feature vectors with the BERT feature vector to obtain the feature fusion vector of each character;
feature encoding
Coding by using the feature fusion vector of the span start character and the span end character of each word and the word length feature vector to obtain the feature representation of the span of each word;
sememe head-word span recognizer
performing sememe analysis on the word segmentation result of the original text to obtain the true sememe head-word label of each segmented word; computing the similarity between the sememe analysis result and the entity categories, segmented words meeting the similarity threshold being true sememe head words, which yields the standard spans of the sememe head words; passing the span feature representation of each segmented word through the MLP and sigmoid function to obtain the span's sememe head-word score, a span being a candidate sememe head word if the score exceeds a threshold;
boundary regression device
extracting the character feature information before and after the candidate sememe head word with a convolution network, computing the forward and backward offsets with two tanh functions respectively, and finally obtaining the complete entity span;
entity classifier
obtaining the span feature representation of the complete entity span, training a classifier using a feed-forward neural network FFNN as the neural network model, predicting the class score of the entity, and completing entity recognition.
7. The method for extracting industrial text entities based on semanteme analysis and span features according to claim 6, wherein the sememe head-word score of a span is computed specifically as:
wherein: h_e(s_j) is the span representation of the j-th span, and the MLP consists of a linear classification layer and a GELU activation function.
8. The method for extracting industrial text entities based on semanteme analysis and span features according to claim 6, wherein the boundary regressor is specifically:
extracting the character feature information before and after the candidate sememe head word by using a convolution network, the representation being as shown in the formula:

wherein the input for the j-th span is the concatenation of the vectors of the k characters before and after its center, W is a convolution kernel, and b is a bias; convolution extracts the character features around the head word, and max pooling is applied;
calculating the forward offset and the backward offset:

wherein: l_j^f is the forward offset, l_j^b is the backward offset, and w_1, b_1 and w_2, b_2 are the parameters of the forward and backward offset computations respectively;
to correct the offsets, the calculated offsets are rounded; the rounding is calculated as follows:

wherein: i_j^s denotes the start position index of the j-th span, i_j^e denotes the end position index of the j-th span, and "[ ]" denotes the rounding operation;
the loss function uses the smooth L1 loss, calculated as follows:
9. the method for extracting industrial text entities based on semanteme analysis and span features according to claim 6, wherein the entity classifier is specifically:
training a classifier by using a feed-forward neural network FFNN as the neural network model, and selecting the ReLU function as the nonlinear activation function of each layer;
the softmax was used as a scoring function as shown in the formula:
P_e(e|s_i) = softmax(W_e f_activate(f_classify(h_e(s_i))))

wherein: P_e(e|s_i) represents the probability that the i-th span is an entity of class e, f_activate represents the activation function, f_classify represents the neural network FFNN, and h_e(s_i) is the span representation of the i-th span;
the loss function employs a cross entropy loss function as shown in the formula:
wherein: s represents an enumerated span set, and in the training process, a model is trained by minimizing negative log likelihood probability;
taking the entity class label y_e(s_i) with the highest predicted score as the prediction result, as shown in the formula:
y_e(s_i) = argmax_{e∈ε} P_e(e|s_i)
wherein: epsilon is the entity tag set that marks the span of the entity.
10. The method for extracting industrial text entities based on semanteme analysis and span features of claim 6, wherein the model training and testing process comprises:
(4.1) Inputting the training set data into the entity extraction model, obtaining the span feature representations of the segmented words through multi-feature construction, feature fusion and feature encoding, then feeding them into the sememe head-word recognizer, boundary regressor and entity classifier for training, and optimizing their loss functions respectively to obtain the trained entity extraction model;
and (4.2) feeding the test set data into the trained model to obtain entity recognition results, comparing the results with the entity labels, and counting the correct and incorrect detections in the test samples to obtain the comprehensive evaluation index F1 value, calculated as:
wherein: p is the recognition accuracy rate, R is the recall rate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310045143.3A CN116304020A (en) | 2023-01-30 | 2023-01-30 | Industrial text entity extraction method based on semantic source analysis and span characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310045143.3A CN116304020A (en) | 2023-01-30 | 2023-01-30 | Industrial text entity extraction method based on semantic source analysis and span characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116304020A true CN116304020A (en) | 2023-06-23 |
Family
ID=86836760
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310045143.3A Pending CN116304020A (en) | 2023-01-30 | 2023-01-30 | Industrial text entity extraction method based on semantic source analysis and span characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116304020A (en) |
Cited By (4)

Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117131198A (en) * | 2023-10-27 | 2023-11-28 | 中南大学 | Knowledge enhancement entity relationship joint extraction method and device for medical teaching library
CN117131198B (en) * | 2023-10-27 | 2024-01-16 | 中南大学 | Knowledge enhancement entity relationship joint extraction method and device for medical teaching library
CN117236335A (en) * | 2023-11-13 | 2023-12-15 | 江西师范大学 | Two-stage named entity recognition method based on prompt learning
CN117236335B (en) * | 2023-11-13 | 2024-01-30 | 江西师范大学 | Two-stage named entity recognition method based on prompt learning

2023-01-30 CN CN202310045143.3A patent/CN116304020A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298037B (en) | Convolutional neural network matching text recognition method based on enhanced attention mechanism | |
CN109165294B (en) | Short text classification method based on Bayesian classification | |
CN110825877A (en) | Semantic similarity analysis method based on text clustering | |
CN106991085B (en) | Entity abbreviation generation method and device | |
CN111291188B (en) | Intelligent information extraction method and system | |
CN116304020A (en) | Industrial text entity extraction method based on semantic source analysis and span characteristics | |
CN110415071B (en) | Automobile competitive product comparison method based on viewpoint mining analysis | |
CN112052684A (en) | Named entity identification method, device, equipment and storage medium for power metering | |
CN111158641B (en) | Automatic recognition method for transaction function points based on semantic analysis and text mining | |
CN112464656A (en) | Keyword extraction method and device, electronic equipment and storage medium | |
CN112597283A (en) | Notification text information entity attribute extraction method, computer equipment and storage medium | |
CN111985228A (en) | Text keyword extraction method and device, computer equipment and storage medium | |
CN115952292B (en) | Multi-label classification method, apparatus and computer readable medium | |
CN114611491A (en) | Intelligent government affair public opinion analysis research method based on text mining technology | |
CN113343690A (en) | Text readability automatic evaluation method and device | |
CN114328939B (en) | Natural language processing model construction method based on big data | |
CN115146062A (en) | Intelligent event analysis method and system fusing expert recommendation and text clustering | |
CN116842194A (en) | Electric power semantic knowledge graph system and method | |
CN113486143A (en) | User portrait generation method based on multi-level text representation and model fusion | |
CN111859955A (en) | Public opinion data analysis model based on deep learning | |
CN117009521A (en) | Knowledge-graph-based intelligent process retrieval and matching method for engine | |
CN113392191B (en) | Text matching method and device based on multi-dimensional semantic joint learning | |
CN115934936A (en) | Intelligent traffic text analysis method based on natural language processing | |
CN115269833A (en) | Event information extraction method and system based on deep semantics and multitask learning | |
CN114610882A (en) | Abnormal equipment code detection method and system based on electric power short text classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||