CN116304020A - Industrial text entity extraction method based on semantic source analysis and span characteristics - Google Patents


Info

Publication number
CN116304020A
CN116304020A (application CN202310045143.3A)
Authority
CN
China
Prior art keywords
entity
span
word
text
industrial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310045143.3A
Other languages
Chinese (zh)
Inventor
胡建鹏 (Hu Jianpeng)
黄子麒 (Huang Ziqi)
高永彬 (Gao Yongbin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University of Engineering Science
Original Assignee
Shanghai University of Engineering Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University of Engineering Science filed Critical Shanghai University of Engineering Science
Priority to CN202310045143.3A priority Critical patent/CN116304020A/en
Publication of CN116304020A publication Critical patent/CN116304020A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses an industrial text entity extraction method based on semantic source (sememe) analysis and span characteristics, belonging to the technical field of natural language processing. The method comprises the following steps: acquiring an industrial text data set and preprocessing it to obtain the original text; performing word segmentation on the original text and training word vectors; obtaining entity class definitions through sememe analysis, then labeling entities in the original text and partitioning the data set; designing a span-based entity extraction model and training and testing it on the labeled text to obtain a trained entity extraction model; and performing entity recognition on unlabeled industrial text with the trained model. The method can define entity categories quickly, reducing the dependence of entity definition on expert knowledge and on manual text analysis; at the same time, the multi-feature entity extraction model identifies the boundaries of common entities such as industrial parts more accurately, giving higher entity recognition accuracy.

Description

Industrial text entity extraction method based on semantic source analysis and span characteristics
Technical Field
The invention relates to the technical field of natural language processing, in particular to an industrial text entity extraction method based on semantic source analysis and span characteristics.
Background
With the maturing of computer software and hardware and of information processing technology in the field of artificial intelligence, China is vigorously promoting the application of artificial intelligence in industry to further advance industrial intelligence. As an emerging technology, the knowledge graph can integrate complex and massive data and link them through mined relations, giving it strong data description capability and rich semantic relations.
Entities are the key language units that carry information in text and the core elements of a knowledge graph. Completing the entity extraction task with high quality is the basis for subsequent work such as attribute and relation extraction, event extraction, and knowledge graph construction. The main task is to identify and classify entities with specific meaning in text; in the general domain, for example, person names, place names, and organization names are typical extraction targets. The object of this work is the target entity of the industrial domain: named entities of high value in unstructured industrial text from real industrial scenes, such as component units, processing tools, and fault conditions. Such entities often carry a great deal of industrial knowledge and have high practical value. At present, industrial-domain data are disordered and text structure differs greatly across subdivided directions; without expert guidance, combing valuable entities out of a corpus and defining them consumes a large amount of labor. Designing a general extraction scheme for domain text entities is therefore significant.
Once the entity types are defined, a domain entity extraction model can identify industrial-domain named entities accurately and efficiently, providing support for downstream tasks such as industrial question answering, industrial information retrieval, and industrial reasoning model construction, improving the construction efficiency of industrial knowledge bases, and further promoting industrial automation and intelligence.
Disclosure of Invention
Aiming at the long duration and low efficiency of manually modeling the entity extraction task for industrial-domain text, the invention provides an entity class definition method based on sememes, which assists non-experts in semi-automatically defining the entity classes of the industrial entity extraction task and improves knowledge extraction modeling efficiency.
In order to achieve the above purpose, the invention provides an industrial text entity extraction method based on semantic source analysis and span characteristics, which mainly comprises the following steps:
(1) Acquiring an industrial text data set, and preprocessing the industrial text data set to obtain an original text;
(2) Performing word segmentation operation on the original text to obtain a word segmentation result, and performing word vector training on the word segmentation result to obtain word vectors of an industrial corpus;
(3) Acquiring entity class definitions based on semantic source analysis, and carrying out entity labeling and data set division on the original text;
(4) Designing a span-based entity extraction model, and performing model training and testing by using the marked original text to obtain a trained entity extraction model;
the entity extraction model obtains the feature representation of each span through multi-feature construction, feature fusion and feature encoding, and then completes entity recognition through a sememe center-word recognizer, a boundary regressor and an entity classifier;
(5) And carrying out entity recognition on the unlabeled industrial text by using the trained entity extraction model.
Further, the preprocessing in step 1 includes deduplication, stemming, stop-word removal, and removal of numeric serial numbers and English text.
Further, the step 2 specifically includes:
performing word segmentation on the original text by using a word segmentation tool to obtain a word segmentation result;
and feeding the word segmentation result to a pre-training model; the pre-training model uses Word2Vec from the Python Gensim topic-model package with the Skip-gram architecture to complete pre-training on the segmented text, yielding the word vectors of the industrial corpus.
Further, obtaining the entity class definitions based on sememe analysis includes:
(3.1) calculating TF-IDF feature values for the word segmentation results, and selecting a certain proportion of the words with the highest TF-IDF values as candidate words of the original text;
the TF-IDF feature value is calculated as follows:
TF-IDF = TF * IDF
wherein TF is the frequency with which a word occurs in a sentence after segmentation, and IDF is the base-10 logarithm of the total number of sentences divided by the number of sentences containing the word;
(3.2) performing cluster analysis on the candidate words by using a K-means algorithm;
(3.2.1) initializing k distinct cluster centers:

c_i ← w_j, 1 ≤ i ≤ k, 1 ≤ j ≤ n

wherein j is a random positive integer in [1, n], n is the number of candidate-word vectors, n = card(W), W is the set of candidate-word vectors, and i is the index of the cluster center;

W = {w_j | w_j = (v_{j1}, v_{j2}, \dots, v_{jd}), 1 ≤ j ≤ n}

(3.2.2) assigning each candidate-word sample to the nearest of the k current cluster centers according to the shortest-distance principle, yielding clusters c_i, where the distance dist is the Euclidean distance:

dist(w_j, c_i) = \sqrt{\sum_{l=1}^{d}(v_{jl} - c_{il})^2}

\lambda_j = \arg\min_{i \in \{1,\dots,k\}} dist(w_j, c_i)
(3.2.3) recalculating cluster centers for the divided k cluster centers to obtain calculated k cluster centers; judging whether the calculated k cluster centers are the same as the current k cluster centers or not; if the k cluster centers are different, the calculated k cluster centers are re-used as the current k cluster centers, and the steps (3.2.1) - (3.2.3) are repeated until the calculated k cluster centers are the same as the current k cluster centers, and the divided k clusters are used as k task clusters;
(3.3) performing sememe analysis on the candidate-word clustering result, and defining the entity category of each cluster according to the analysis result;
the sememes of each candidate word are obtained with the OpenHowNet sememe analysis tool; according to the clustering result, the sememe frequency distribution of the candidate words in each cluster is counted to obtain a ranking, the highest-ranked sememe name is selected as the reference entity class, and the entity type of the cluster is then defined manually.
Further, the entity annotation comprises the entity category label and the entity span label; the entity span label requires annotating the start position index and the end position index of the entity in the sentence, which together form the entity span representation of the entity in the sentence;
the data partitioning divides the annotated original text into a training set, a validation set and a test set.
Further, the span-based entity extraction model includes:
constructing multiple characteristics;
extracting features of the original text by using a BERT pre-training model to obtain BERT feature vectors corresponding to each character;
splitting the radicals corresponding to each character of the original text, and coding all the radicals through One-Hot coding to obtain radical feature vectors;
constructing sequence features on the word segmentation results of the original text, assigning different labels to characters at the word-initial position and at other positions within a word, adding part-of-speech features to the labels, and initializing the label of each character with the nn.Embedding method of PyTorch to obtain the word segmentation feature vector of each character;
feature fusion
extracting the radical feature vector and the word segmentation feature vector with two separate bidirectional long short-term memory (BiLSTM) networks, and concatenating the two extracted feature vectors with the BERT feature vector to obtain the fused feature vector of each character;
feature encoding
encoding the fused feature vectors of the start and end characters of each word's span together with a word-length feature vector to obtain the feature representation of each word's span;
sememe center-word span recognizer
performing sememe analysis on the word segmentation result of the original text to obtain the true sememe center-word tag of each word: the analysis result is compared with the entity categories by similarity calculation, a word meeting the similarity threshold is a true sememe center word, and its span is the gold span of the sememe center word; the span feature representation of each word is passed through an MLP and a sigmoid function to obtain the span's sememe center-word score, and a span whose score exceeds a threshold is a candidate sememe center word;
boundary regressor
extracting character feature information before and after the candidate sememe center word with a convolutional network, computing the forward offset and the backward offset with two tanh functions respectively, and finally obtaining the complete entity span;
entity classifier
obtaining the span feature representation from the complete entity span, training a classifier using a feedforward neural network (FFNN), and predicting the category score of the entity to complete entity recognition.
Further, the sememe center-word score of a span is calculated as:

score(s_j) = \mathrm{sigmoid}(\mathrm{MLP}(h_e(s_j)))

wherein h_e(s_j) is the span representation of the j-th span, and the MLP consists of a linear classification layer and a GELU activation function.
Further, the boundary regressor specifically comprises:
extracting character feature information before and after the candidate sememe center word with a convolutional network, the representation being:

\tilde{h}(s_j) = \mathrm{MaxPool}\big(W * [E_{START(j)-k}; \dots; E_{END(j)+k}] + b\big)

wherein the input of the j-th span is the concatenation of the vectors of the k characters before and after the center word, W is a convolution kernel, b is a bias, the convolution extracts the character features around the center word, and max pooling is applied;
calculating the forward and backward offsets:

\Delta_j^{fw} = \tanh(W_1 \tilde{h}(s_j) + b_1)
\Delta_j^{bw} = \tanh(W_2 \tilde{h}(s_j) + b_2)

wherein \Delta_j^{fw} is the forward offset, \Delta_j^{bw} is the backward offset, and W_1, b_1 and W_2, b_2 are the parameters of the forward and backward offset calculations, respectively;
to correct the offsets, the calculated offsets are rounded:

\hat{i}_j^{start} = [\, i_j^{start} + \Delta_j^{fw} \,]
\hat{i}_j^{end} = [\, i_j^{end} + \Delta_j^{bw} \,]

wherein \hat{i}_j^{start} denotes the start position index of the j-th span, \hat{i}_j^{end} denotes its end position index, and "[ ]" denotes rounding;
the loss function uses the Smooth L1 loss, calculated as follows:

\mathrm{SmoothL1}(x) = \begin{cases} 0.5 x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}

where x is the difference between a predicted boundary position and the annotated one.
further, the entity classifier specifically includes:
training a classifier by using a feedforward neural network FFNN as a neural network model, and selecting a RuLU function as a nonlinear activation function of each layer;
the softmax was used as a scoring function as shown in the formula:
P e (e|s i )=softmax(W e f activite (f classify (h e (s i ))))
wherein: p (P) e (e|s i ) Representing the probability that the ith span is an entity, f activeate The activation function is represented as a function of the activation,f classify representing a neural network FFNN, h e (s i ) Span characterization for the ith span;
the loss function employs a cross entropy loss function as shown in the formula:
Figure BDA0004055051290000082
wherein: s represents an enumerated span set, and in the training process, a model is trained by minimizing negative log likelihood probability;
finding out entity class label y with highest score in prediction e (s i ) As a result of the prediction, as shown in the formula:
y e (s i )=argmax e∈ε P e (e|s i )
wherein: epsilon is the entity tag set that marks the span of the entity.
Further, the model training and testing process comprises:
(4.1) inputting the training set into the entity extraction model, obtaining the feature representation of each word span through multi-feature construction, feature fusion and feature encoding, feeding the representations into the sememe center-word recognizer, the boundary regressor and the entity classifier for training, and optimizing their respective loss functions to obtain the trained entity extraction model;
(4.2) feeding the test set into the trained model to obtain entity recognition results, comparing the results with the entity labels, and counting correct and incorrect detections in the test samples to obtain the comprehensive evaluation index F1, calculated as:

F1 = \frac{2PR}{P + R}

wherein P is the recognition precision and R is the recall.
The invention has the beneficial effects that:
according to the invention, candidate words with larger importance of complex industrial texts can be rapidly identified in the task of defining the text entities in the industrial field, meanwhile, entity categories can be rapidly defined, dependence of entity definition on expert knowledge and manual analysis on the texts are reduced, meanwhile, the effect of the entity extraction model fused with multiple features on identifying common entity boundaries such as industrial parts is better, and the entity identification accuracy is higher.
Drawings
FIG. 1 is a flow chart of an industrial text entity extraction method based on semantic source analysis and span features according to an embodiment of the present invention.
FIG. 2 is a flow chart of an entity class definition method based on semantic source analysis according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a span-based entity extraction model training process according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a feature stitching fusion mode according to an embodiment of the present invention.
FIG. 5 is a diagram of a span-based entity extraction model according to an embodiment of the present invention.
Detailed Description
In order to provide a better understanding of the aspects of the present invention, the present invention will be described in further detail with reference to specific embodiments. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A preferred embodiment is given below with reference to the accompanying drawings:
as shown in fig. 1, the span-based industrial field text entity extraction method of the present embodiment includes the following steps:
s101, acquiring an industrial text data set, and preprocessing the industrial text data set to obtain an original text;
First, an industrial text data set is acquired. To improve the quality of the raw industrial text, data preprocessing is performed: repeated sentences in the corpus are removed, stop words are deleted based on the Harbin Institute of Technology (HIT) stop-word list, step serial-number labels that may exist in the corpus are removed, and English text is removed when no entity in the corpus contains English. This processing yields higher-quality industrial text data.
S102, performing word segmentation operation on the original text to obtain word segmentation results, and performing word vector training on the word segmentation results to obtain word vectors of an industrial corpus
The original text is split into sentences, and the sentences are segmented into words with a word segmentation tool. The segmentation results are fed to the word vector training model: Word2Vec from the Python Gensim topic-model package performs word pre-training with the Skip-gram model, taking the segmented text as input and outputting a distributed vector representation of each word, i.e., the word vectors of the industrial corpus. The vector dimension is a hyperparameter that can be set freely; the default is 200.
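As an illustrative sketch of this step (the segmentation tool and file names are assumptions; the patent does not fix a specific tokenizer), the Gensim call might look as follows:

```python
# Sketch of S102: sentence segmentation + Skip-gram word-vector training.
# jieba and corpus.txt are illustrative choices, not fixed by the patent.
import jieba
from gensim.models import Word2Vec

with open("corpus.txt", encoding="utf-8") as f:
    sentences = [list(jieba.cut(line.strip())) for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=200,  # default dimension stated in the text
    sg=1,             # sg=1 selects the Skip-gram architecture
    window=5,
    min_count=1,
)
model.wv.save("industrial_word_vectors.kv")  # one distributed vector per word
```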
S103, acquiring entity class definition based on semantic source analysis, and carrying out entity labeling and data set division on the original text;
A. As shown in fig. 2, the acquisition of entity class definitions includes the following steps:
(1) Calculating TF-IDF feature values for the word segmentation results, and selecting a certain proportion of the words with the highest TF-IDF values as candidate words of the original text;
TF-IDF is a statistical method for evaluating how important a word is to a document in a document set or corpus. The importance of a word increases with the number of times it appears in the document but decreases with its frequency across the corpus. The calculation is:

tfidf_{i,j} = tf_{i,j} \times idf_i

tf is the term frequency, i.e., the frequency with which the term occurs in document d:

tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

wherein the numerator n_{i,j} is the number of occurrences of the word in the document and the denominator is the total number of word occurrences in the document.
idf is the inverse document frequency, a measure of the general importance of a word; the idf of a particular word is the base-10 logarithm of the total number of documents divided by the number of documents containing the word:

idf_i = \log_{10} \frac{|D|}{1 + |\{j : t_i \in d_j\}|}

wherein |D| is the total number of documents (or sentences) in the corpus, the denominator is the number of documents containing the word, and adding one prevents a zero denominator.
The main idea of IDF is: the fewer the documents containing term t (the smaller n is), the larger the IDF and the better the class-distinguishing ability of t. Conversely, if a class of documents C contains t in m documents and the other classes contain t in p documents, then when m is large the total n = m + p is also large, the IDF value obtained from the formula is small, and t has weak discriminating power.
A high term frequency within a particular document combined with a low document frequency across the whole collection yields a high TF-IDF weight; TF-IDF therefore tends to filter out common words and keep important ones. The importance ranking of the words is obtained by computing TF-IDF as the importance measure.
After the feature values are calculated, a fixed proportion of the top-ranked words is selected as candidate words of the original text, the top 30% by default; other proportions can be chosen according to the actual word count.
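A minimal sketch of this step-(1) selection, implementing the formulas above directly (the 30% proportion follows the stated default; tokenized_sentences is assumed to come from the segmentation step):

```python
# Sketch: candidate-word selection by TF-IDF over segmented sentences.
import math
from collections import Counter

def top_tfidf_words(tokenized_sentences, proportion=0.30):
    """Return the top `proportion` of words ranked by their best TF-IDF score."""
    n_docs = len(tokenized_sentences)
    doc_freq = Counter()
    for sent in tokenized_sentences:
        doc_freq.update(set(sent))

    best_score = {}
    for sent in tokenized_sentences:
        tf_counts = Counter(sent)
        total = sum(tf_counts.values())
        for word, n in tf_counts.items():
            tf = n / total
            idf = math.log10(n_docs / (1 + doc_freq[word]))  # +1 avoids a zero denominator
            best_score[word] = max(best_score.get(word, 0.0), tf * idf)

    ranked = sorted(best_score, key=best_score.get, reverse=True)
    return ranked[: max(1, int(len(ranked) * proportion))]
```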
(2) Performing cluster analysis on the candidate words by using a K-means algorithm;
the method comprises the following specific steps:
(2.1) initializing k distinct cluster centers:

c_i ← w_j, 1 ≤ i ≤ k, 1 ≤ j ≤ n

wherein j is a random positive integer in [1, n], n is the number of candidate-word vectors, n = card(W), W is the set of candidate-word vectors, i is the index of the cluster center, and k is a preset hyperparameter;

W = {w_j | w_j = (v_{j1}, v_{j2}, \dots, v_{jd}), 1 ≤ j ≤ n}

(2.2) assigning each candidate-word sample to the nearest of the k current cluster centers according to the shortest-distance principle, yielding clusters c_i, where the distance dist is the Euclidean distance:

dist(w_j, c_i) = \sqrt{\sum_{l=1}^{d}(v_{jl} - c_{il})^2}

\lambda_j = \arg\min_{i \in \{1,\dots,k\}} dist(w_j, c_i)

(2.3) recalculating the centers of the k partitioned clusters; if any recalculated center differs from the current one, the recalculated centers become the current centers and steps (2.1)-(2.3) are repeated until the centers no longer change, at which point the k partitioned clusters are taken as the k task clusters.
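A compact numpy sketch of this clustering loop over the candidate-word vectors (k is the preset hyperparameter; random initialization from the samples follows step (2.1)):

```python
# Sketch: K-means over candidate-word vectors with Euclidean distance.
import numpy as np

def kmeans(W, k, seed=0, max_iter=100):
    """W: (n, d) array of candidate-word vectors; returns labels and centers."""
    rng = np.random.default_rng(seed)
    centers = W[rng.choice(len(W), size=k, replace=False)]  # init from samples
    for _ in range(max_iter):
        # assign each sample to its nearest center (shortest-distance principle)
        dists = np.linalg.norm(W[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            W[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):  # centers unchanged: converged
            break
        centers = new_centers
    return labels, centers
```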
(3) Performing sememe analysis on the candidate-word clustering result, and defining the entity category of each cluster according to the analysis result;
The clustering result C obtained in step (2) consists of k sets, where k is the manually defined number of cluster entity categories; each set contains several words, the candidate words.
Sememe analysis is then performed on the k word sets separately. For the words of one set, a sememe analysis tool counts sememe word frequencies to obtain a frequency ranking of the sememes. The top few high-frequency sememe sets are analyzed manually; since they fuse the conceptual features of the words in the cluster, they can abstractly represent the entity type of the cluster. A sememe is then manually selected as the class of the entities, i.e., the entity category, or several sememes are combined into a definition covering an abstract entity type of the high-frequency sememe words.
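A sketch of the per-cluster sememe frequency count (the return structure of OpenHowNet's get_sememes_by_word differs across versions, so the flattening below is an assumption):

```python
# Sketch: rank sememes by frequency over the words of one cluster.
from collections import Counter
import OpenHowNet

# OpenHowNet.download() must have been run once to fetch the HowNet data.
hownet = OpenHowNet.HowNetDict()

def sememe_frequencies(cluster_words):
    counts = Counter()
    for word in cluster_words:
        # one entry per sense; each entry carries the sememes of that sense
        for entry in hownet.get_sememes_by_word(word):
            counts.update(str(s) for s in entry.get("sememes", []))
    return counts.most_common()  # top sememes suggest the cluster's entity class
```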
B. Entity annotation
The text is labeled manually according to the self-defined k entity classes. The annotation identifies the position index of each entity in the sentence and its category: the position index comprises the start and end position indexes of the entity in the sentence, and the category is one of the k defined entity categories.
C. Data partitioning
The labeled data are divided into a training set, a validation set and a test set at a ratio of approximately 8:1:1.
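A minimal split sketch under that ratio (random shuffling is an assumption; the patent does not specify the sampling scheme):

```python
# Sketch: 8:1:1 split of the labeled sentences.
import random

def split_dataset(samples, seed=0):
    samples = samples[:]
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```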
S104, designing a span-based entity extraction model, and performing model training and testing by using the marked original text to obtain a trained entity extraction model;
As shown in fig. 3 and 5, the entity extraction model obtains the feature representation of each span through multi-feature construction, feature fusion and feature encoding, and then completes entity recognition through the sememe center-word recognizer, the boundary regressor and the entity classifier.
The original text sequence is T = {c_1, c_2, \dots, c_m}, wherein c_i is the i-th character in the sequence and m is the sentence length.
As shown in fig. 4, the feature fusion adopts multi-feature concatenation: the part-of-speech (segmentation) features, the radical features and the word vectors are fused, the fused feature vectors are fed into a bidirectional long short-term memory (BiLSTM) network for encoding, and the start position vector, end position vector and entity-length feature vector of a span are selected from the encoded character feature vectors and concatenated.
The raw text is split into individual characters, and features are extracted from the original text with a BERT pre-training model, giving the BERT feature vector E_i^{bert} of each character.
The radical of each character of the original text is split off, and all radicals are One-Hot encoded to give the radical feature vector E_i^{rad}.
Sequence features are built on the word segmentation result of the original text: the character at the start of a word and the characters at the other positions of the word receive different labels, part-of-speech features are added to the labels ("B-" + part of speech for the start position, "I-" + part of speech for the remaining positions), and the label of each character is initialized with the nn.Embedding method of PyTorch to give the word segmentation feature vector E_i^{seg} of each character.
Feature fusion: the radical feature vectors and the word segmentation feature vectors are extracted with two separate BiLSTM networks, and the results are concatenated with the BERT character feature vector. If the character feature embedding dimension is b, the radical embedding dimension p and the segmentation embedding dimension f, the final embedding dimension is b + p + f. The fused character feature vector is:

E_i = [E_i^{bert}; E_i^{rad}; E_i^{seg}]
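A PyTorch sketch of this construction and fusion (embedding and LSTM sizes are illustrative; the BERT character vectors are taken as precomputed input):

```python
# Sketch: fuse BERT, radical and segmentation features per character.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    def __init__(self, n_radicals, n_seg_labels, seg_dim=50, lstm_dim=64, bert_dim=768):
        super().__init__()
        self.n_radicals = n_radicals
        self.seg_emb = nn.Embedding(n_seg_labels, seg_dim)  # nn.Embedding init, as in the text
        # one BiLSTM per auxiliary feature, as described above
        self.rad_lstm = nn.LSTM(n_radicals, lstm_dim, bidirectional=True, batch_first=True)
        self.seg_lstm = nn.LSTM(seg_dim, lstm_dim, bidirectional=True, batch_first=True)

    def forward(self, bert_vecs, radical_ids, seg_label_ids):
        # bert_vecs: (batch, seq_len, bert_dim) precomputed BERT character vectors
        rad_onehot = F.one_hot(radical_ids, self.n_radicals).float()  # One-Hot radicals
        rad_out, _ = self.rad_lstm(rad_onehot)
        seg_out, _ = self.seg_lstm(self.seg_emb(seg_label_ids))
        # E_i = [E_bert; E_rad; E_seg]; fused dimension = b + p + f
        return torch.cat([bert_vecs, rad_out, seg_out], dim=-1)
```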
Span representation: each word segmentation span is encoded. The span representation of the j-th span of the tag set in a sentence is defined as:

h_e(s_j) = [E_{START(j)}; E_{END(j)}; E_j^{len}]

wherein s denotes a span, j indexes the j-th span of the entity tag set in the sentence, E_{START(j)} is the character feature vector of the span's start character (with i the start position index of the entity in the sentence), E_{END(j)} is that of its end character, and E_j^{len} is the entity-length feature vector, encoded here with the nn.Embedding method of PyTorch.
Sememe center-word based span recognizer: first, each segmented word of the original text is taken as a span. To cover special tokens composed of letters, digits and symbols, complex compound tokens matched by regular expressions are also taken as spans, and the segmentation results and the rule-matched results are merged into the initial span word set. For the j-th span s_j in the span set, sememe analysis is performed to obtain its sememe analysis result. The result is compared with the defined entity categories by similarity calculation; a word meeting the similarity threshold is a true sememe center word, which yields the gold spans of the sememe center words. Similarity is computed with cosine similarity by default.
The span feature representation of each segmented word is passed through an MLP and a sigmoid function to obtain the span's sememe center-word score:

score(s_j) = \mathrm{sigmoid}(\mathrm{MLP}(h_e(s_j)))

wherein the MLP consists of a linear classification layer and a GELU activation function; if the score exceeds a threshold, the span is a candidate sememe center word.
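A sketch of the span representation and its scorer under these definitions (hidden size, maximum length and the score threshold are illustrative):

```python
# Sketch: span representation h_e(s_j) = [E_start; E_end; E_len] and its scorer.
import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    def __init__(self, char_dim, max_len=30, len_dim=20, hidden=128):
        super().__init__()
        self.len_emb = nn.Embedding(max_len, len_dim)  # entity-length feature
        in_dim = 2 * char_dim + len_dim
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, 1))

    def span_repr(self, char_feats, start, end):
        # char_feats: (seq_len, char_dim) fused character vectors E_i
        length = torch.tensor(end - start + 1).clamp(max=self.len_emb.num_embeddings - 1)
        return torch.cat([char_feats[start], char_feats[end], self.len_emb(length)])

    def forward(self, span_reprs):
        # sigmoid(MLP(h_e(s_j))): spans above a threshold become candidates
        return torch.sigmoid(self.mlp(span_reprs)).squeeze(-1)
```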
Boundary regressor: industrial-domain entities often combine several segmented words into a new long entity, and entity nesting also occurs, so a boundary regressor is added to learn the word offsets that expand a candidate center word forwards and backwards. The initial representation of the boundary regressor augments the center word with the features of the characters on both sides:

\tilde{h}(s_j) = \mathrm{MaxPool}\big(W * [E_{START(j)-k}; \dots; E_{END(j)+k}] + b\big)

wherein the input of the j-th span is the concatenation of the vectors of the k characters before and after the center word, W is a convolution kernel, b is a bias, the convolution extracts the character features around the center word, and max pooling is applied by default.
Calculating the bias before and after the center word, and respectively calculating the bias before and after the center word by adopting two activation functions, wherein
Figure BDA0004055051290000173
Pre-calculation bias +.>
Figure BDA0004055051290000174
And calculating back bias, wherein w and b are parameters calculated by the back bias and the forth bias respectively, so that new entity candidate words are obtained, and the calculation formula is as follows:
Figure BDA0004055051290000175
Figure BDA0004055051290000176
To correct the offsets, the calculated offsets are rounded:

\hat{i}_j^{start} = [\, i_j^{start} + \Delta_j^{fw} \,]
\hat{i}_j^{end} = [\, i_j^{end} + \Delta_j^{bw} \,]

wherein \hat{i}_j^{start} denotes the start position index of the j-th span, \hat{i}_j^{end} denotes its end position index, and "[ ]" denotes rounding.
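A sketch of the regressor under these formulas (the kernel width and context size k are assumptions; with a unit-range tanh the rounded offset shifts a boundary by at most one position):

```python
# Sketch: boundary regressor - CNN over the span's +/- k character context,
# two tanh heads for the forward and backward offsets, rounded to integers.
import torch
import torch.nn as nn

class BoundaryRegressor(nn.Module):
    def __init__(self, char_dim, conv_dim=128, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(char_dim, conv_dim, kernel, padding=kernel // 2)
        self.fw = nn.Linear(conv_dim, 1)  # W_1, b_1: forward-offset head
        self.bw = nn.Linear(conv_dim, 1)  # W_2, b_2: backward-offset head

    def forward(self, context_feats, start, end):
        # context_feats: (window_len, char_dim) vectors of the span plus the
        # k characters on each side of the candidate center word
        h = self.conv(context_feats.t().unsqueeze(0))  # (1, conv_dim, window_len)
        h = torch.max(h, dim=2).values.squeeze(0)      # max pooling over positions
        d_fw = torch.tanh(self.fw(h))                  # forward offset
        d_bw = torch.tanh(self.bw(h))                  # backward offset
        # rounding corrects the offsets to integer index shifts
        return start + int(torch.round(d_fw)), end + int(torch.round(d_bw))
```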
Training the classifier: all spans in the sentence are enumerated and their representations computed. The classifier is trained with a feedforward neural network (FFNN), using the ReLU function as the nonlinear activation of each layer to improve training speed; the number of FFNN layers and of hidden neurons are hyperparameters. The final task is to predict the probability that a span is an entity; since entity classification is a multi-class task, softmax is used as the scoring function:

P_e(e|s_i) = \mathrm{softmax}\big(W_e f_{activate}(f_{classify}(h_e(s_i)))\big)

wherein P_e(e|s_i) is the probability that the i-th span is an entity of class e, f_{activate} is the activation function, and f_{classify} is the FFNN.
The loss function is the cross-entropy loss:

Loss_e = -\sum_{s_i \in S} \log P_e(e^*|s_i)

wherein S is the enumerated span set and e^* is the annotated class; during training, the model is trained by minimizing the negative log-likelihood.
The entity class label with the highest predicted score is taken as the prediction result, where ε is the entity tag set marking the entity spans:

y_e(s_i) = \arg\max_{e \in \varepsilon} P_e(e|s_i)
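A sketch of the classifier described here (two hidden layers are assumed; depth and width are hyperparameters in the text):

```python
# Sketch: FFNN span classifier with ReLU layers and a softmax scoring head.
import torch
import torch.nn as nn

class EntityClassifier(nn.Module):
    def __init__(self, span_dim, n_classes, hidden=256, n_layers=2):
        super().__init__()
        layers, dim = [], span_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        self.ffnn = nn.Sequential(*layers)
        self.out = nn.Linear(dim, n_classes)  # W_e

    def forward(self, span_reprs):
        return self.out(self.ffnn(span_reprs))  # logits; softmax applied in the loss

# cross-entropy training step, minimizing the negative log-likelihood:
#   loss_e = nn.CrossEntropyLoss()(clf(span_reprs), gold_labels)
# prediction: y = clf(span_reprs).argmax(dim=-1)
```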
The final loss function consists of three optimization targets: the candidate center-word loss, the boundary regression loss and the entity classifier loss. The ground truth of the candidate center words is based on sememe similarity; the candidate center-word loss and the entity classifier loss both use cross-entropy, while the boundary regression loss uses the Smooth L1 loss:

\mathrm{SmoothL1}(x) = \begin{cases} 0.5 x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}

The final loss is calculated as:

Loss = Loss_{core} + Loss_{Reg} + Loss_e
training the model by using the training set, and minimizing candidate center word loss, boundary regression loss and entity classifier loss.
Using the validation set, the model is tuned to the optimum according to the comprehensive evaluation index F1, and the best model is saved.
Model testing: the test set is fed into the trained model to obtain predicted labels, which are compared with the actual labels; the numbers of correct and incorrect detections on the test samples give the precision, the recall and the comprehensive evaluation index F1:

F1 = \frac{2PR}{P + R}

wherein P is the recognition precision and R is the recall.
S105, performing entity recognition on the unlabeled industrial text by using the trained entity extraction model.
The descriptions herein are provided merely for convenience in describing the invention simply; they do not indicate or imply that the algorithms or processes referred to must take a specific form or a specific construction and operation, and should therefore not be construed as limiting the invention.

Claims (10)

1. An industrial text entity extraction method based on semantic source analysis and span characteristics is characterized by comprising the following steps:
(1) Acquiring an industrial text data set, and preprocessing the industrial text data set to obtain an original text;
(2) Performing word segmentation operation on the original text to obtain a word segmentation result, and performing word vector training on the word segmentation result to obtain word vectors of an industrial corpus;
(3) Acquiring entity class definitions based on semantic source analysis, and carrying out entity labeling and data set division on the original text;
(4) Designing a span-based entity extraction model, and performing model training and testing by using the marked original text to obtain a trained entity extraction model;
the entity extraction model obtains the feature representation of each span through multi-feature construction, feature fusion and feature encoding, and then completes entity recognition through a sememe center-word recognizer, a boundary regressor and an entity classifier;
(5) And carrying out entity recognition on the unlabeled industrial text by using the trained entity extraction model.
2. The industrial text entity extraction method based on semantic source analysis and span characteristics according to claim 1, wherein the preprocessing in step 1 comprises deduplication, stemming, stop-word removal, and removal of numeric serial numbers and English text.
3. The industrial text entity extraction method based on semantic source analysis and span characteristics according to claim 1, wherein step 2 specifically comprises:
performing word segmentation on the original text by using a word segmentation tool to obtain a word segmentation result;
and feeding the word segmentation result to a pre-training model; the pre-training model uses Word2Vec from the Python Gensim topic-model package with the Skip-gram architecture to complete pre-training on the segmented text, yielding the word vectors of the industrial corpus.
4. The industrial text entity extraction method based on semantic source analysis and span characteristics according to claim 1, wherein obtaining the entity class definitions based on sememe analysis comprises:
(3.1) calculating TF-IDF feature values for the word segmentation results, and selecting a certain proportion of the words with the highest TF-IDF values as candidate words of the original text;
the TF-IDF feature value is calculated as follows:
TF-IDF = TF * IDF
wherein TF is the frequency with which a word occurs in a sentence after segmentation, and IDF is the base-10 logarithm of the total number of sentences divided by the number of sentences containing the word;
(3.2) performing cluster analysis on the candidate words by using a K-means algorithm;
(3.2.1) initializing k distinct cluster centers:

c_i ← w_j, 1 ≤ i ≤ k, 1 ≤ j ≤ n

wherein j is a random positive integer in [1, n], n is the number of candidate-word vectors, n = card(W), W is the set of candidate-word vectors, and i is the index of the cluster center;

W = {w_j | w_j = (v_{j1}, v_{j2}, \dots, v_{jd}), 1 ≤ j ≤ n}

(3.2.2) assigning each candidate-word sample to the nearest of the k current cluster centers according to the shortest-distance principle, yielding clusters c_i, where the distance dist is the Euclidean distance:

dist(w_j, c_i) = \sqrt{\sum_{l=1}^{d}(v_{jl} - c_{il})^2}

\lambda_j = \arg\min_{i \in \{1,\dots,k\}} dist(w_j, c_i)
(3.2.3) recalculating cluster centers for the divided k cluster centers to obtain calculated k cluster centers; judging whether the calculated k cluster centers are the same as the current k cluster centers or not; if the k cluster centers are different, the calculated k cluster centers are re-used as the current k cluster centers, and the steps (3.2.1) - (3.2.3) are repeated until the calculated k cluster centers are the same as the current k cluster centers, and the divided k clusters are used as k task clusters;
(3.3) performing sememe analysis on the candidate-word clustering result, and defining the entity category of each cluster according to the analysis result;
the sememes of each candidate word are obtained with the OpenHowNet sememe analysis tool; according to the clustering result, the sememe frequency distribution of the candidate words in each cluster is counted to obtain a ranking, the highest-ranked sememe name is selected as the reference entity class, and the entity type of the cluster is then defined manually.
5. The industrial text entity extraction method based on semantic source analysis and span characteristics according to claim 1, wherein: the entity annotation comprises the entity category label and the entity span label; the entity span label requires annotating the start position index and the end position index of the entity in the sentence, which together form the entity span representation of the entity in the sentence;
the data partitioning divides the annotated original text into a training set, a validation set and a test set.
6. The industrial text entity extraction method based on semantic source analysis and span characteristics according to claim 5, wherein the span-based entity extraction model comprises:
constructing multiple characteristics;
extracting features of the original text by using a BERT pre-training model to obtain BERT feature vectors corresponding to each character;
splitting the radicals corresponding to each character of the original text, and coding all the radicals through One-Hot coding to obtain radical feature vectors;
constructing sequence features on the word segmentation results of the original text, assigning different labels to characters at the word-initial position and at other positions within a word, adding part-of-speech features to the labels, and initializing the label of each character with the nn.Embedding method of PyTorch to obtain the word segmentation feature vector of each character;
feature fusion
extracting the radical feature vector and the word segmentation feature vector with two separate bidirectional long short-term memory (BiLSTM) networks, and concatenating the two extracted feature vectors with the BERT feature vector to obtain the fused feature vector of each character;
feature encoding
encoding the fused feature vectors of the start and end characters of each word's span together with a word-length feature vector to obtain the feature representation of each word's span;
sememe center-word span recognizer
performing sememe analysis on the word segmentation result of the original text to obtain the true sememe center-word tag of each word: the analysis result is compared with the entity categories by similarity calculation, a word meeting the similarity threshold is a true sememe center word, and its span is the gold span of the sememe center word; the span feature representation of each word is passed through an MLP and a sigmoid function to obtain the span's sememe center-word score, and a span whose score exceeds a threshold is a candidate sememe center word;
boundary regressor
extracting character feature information before and after the candidate sememe center word with a convolutional network, computing the forward offset and the backward offset with two tanh functions respectively, and finally obtaining the complete entity span;
entity classifier
obtaining the span feature representation from the complete entity span, training a classifier using a feedforward neural network (FFNN), and predicting the category score of the entity to complete entity recognition.
7. The industrial text entity extraction method based on semantic source analysis and span characteristics according to claim 6, wherein the sememe center-word score of a span is calculated as:

score(s_j) = \mathrm{sigmoid}(\mathrm{MLP}(h_e(s_j)))

wherein h_e(s_j) is the span representation of the j-th span, and the MLP consists of a linear classification layer and a GELU activation function.
8. The industrial text entity extraction method based on semantic source analysis and span characteristics according to claim 6, wherein the boundary regressor specifically comprises:
extracting character feature information before and after the candidate sememe center word with a convolutional network, the representation being:

\tilde{h}(s_j) = \mathrm{MaxPool}\big(W * [E_{START(j)-k}; \dots; E_{END(j)+k}] + b\big)

wherein the input of the j-th span is the concatenation of the vectors of the k characters before and after the center word, W is a convolution kernel, b is a bias, the convolution extracts the character features around the center word, and max pooling is applied;
calculating the forward and backward offsets:

\Delta_j^{fw} = \tanh(W_1 \tilde{h}(s_j) + b_1)
\Delta_j^{bw} = \tanh(W_2 \tilde{h}(s_j) + b_2)

wherein \Delta_j^{fw} is the forward offset, \Delta_j^{bw} is the backward offset, and W_1, b_1 and W_2, b_2 are the parameters of the forward and backward offset calculations, respectively;
to correct the offsets, the calculated offsets are rounded:

\hat{i}_j^{start} = [\, i_j^{start} + \Delta_j^{fw} \,]
\hat{i}_j^{end} = [\, i_j^{end} + \Delta_j^{bw} \,]

wherein \hat{i}_j^{start} denotes the start position index of the j-th span, \hat{i}_j^{end} denotes its end position index, and "[ ]" denotes rounding;
the loss function uses the Smooth L1 loss, calculated as follows:

\mathrm{SmoothL1}(x) = \begin{cases} 0.5 x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}
9. The industrial text entity extraction method based on semantic source analysis and span characteristics according to claim 6, wherein the entity classifier specifically comprises:
training a classifier using a feedforward neural network (FFNN), with the ReLU function as the nonlinear activation function of each layer;
softmax is used as the scoring function:

P_e(e|s_i) = \mathrm{softmax}\big(W_e f_{activate}(f_{classify}(h_e(s_i)))\big)

wherein P_e(e|s_i) is the probability that the i-th span is an entity of class e, f_{activate} is the activation function, f_{classify} is the FFNN, and h_e(s_i) is the span representation of the i-th span;
the loss function is the cross-entropy loss:

Loss_e = -\sum_{s_i \in S} \log P_e(e^*|s_i)

wherein S is the enumerated span set and e^* is the annotated class; during training, the model is trained by minimizing the negative log-likelihood;
the entity class label with the highest predicted score is taken as the prediction:

y_e(s_i) = \arg\max_{e \in \varepsilon} P_e(e|s_i)

wherein ε is the entity tag set marking the entity spans.
10. The industrial text entity extraction method based on semantic source analysis and span characteristics according to claim 6, wherein the model training and testing process comprises:
(4.1) inputting the training set into the entity extraction model, obtaining the feature representation of each word span through multi-feature construction, feature fusion and feature encoding, feeding the representations into the sememe center-word recognizer, the boundary regressor and the entity classifier for training, and optimizing their respective loss functions to obtain the trained entity extraction model;
(4.2) feeding the test set into the trained model to obtain entity recognition results, comparing the results with the entity labels, and counting correct and incorrect detections in the test samples to obtain the comprehensive evaluation index F1, calculated as:

F1 = \frac{2PR}{P + R}

wherein P is the recognition precision and R is the recall.
CN202310045143.3A 2023-01-30 2023-01-30 Industrial text entity extraction method based on semantic source analysis and span characteristics Pending CN116304020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310045143.3A CN116304020A (en) 2023-01-30 2023-01-30 Industrial text entity extraction method based on semantic source analysis and span characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310045143.3A CN116304020A (en) 2023-01-30 2023-01-30 Industrial text entity extraction method based on semantic source analysis and span characteristics

Publications (1)

Publication Number Publication Date
CN116304020A true CN116304020A (en) 2023-06-23

Family

ID=86836760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310045143.3A Pending CN116304020A (en) 2023-01-30 2023-01-30 Industrial text entity extraction method based on semantic source analysis and span characteristics

Country Status (1)

Country Link
CN (1) CN116304020A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131198A (en) * 2023-10-27 2023-11-28 中南大学 Knowledge enhancement entity relationship joint extraction method and device for medical teaching library
CN117131198B (en) * 2023-10-27 2024-01-16 中南大学 Knowledge enhancement entity relationship joint extraction method and device for medical teaching library
CN117236335A (en) * 2023-11-13 2023-12-15 江西师范大学 Two-stage named entity recognition method based on prompt learning
CN117236335B (en) * 2023-11-13 2024-01-30 江西师范大学 Two-stage named entity recognition method based on prompt learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination