CN116644182A - Unsupervised text multi-label marking method based on entity word influence area evaluation standard - Google Patents

Unsupervised text multi-label marking method based on entity word influence area evaluation standard

Info

Publication number
CN116644182A
CN116644182A
Authority
CN
China
Prior art keywords
word
words
entity
label
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310675101.8A
Other languages
Chinese (zh)
Inventor
王锐
檀潮
刘峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202310675101.8A priority Critical patent/CN116644182A/en
Publication of CN116644182A publication Critical patent/CN116644182A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The unsupervised text multi-label marking method based on an entity word influence area evaluation standard comprises the following steps. 1) Preparation stage: a set of texts to be marked is prepared as the corpus; through word segmentation, word embedding, and word vector clustering, a limited number of word sets that summarize the overall content of the corpus are obtained, and for each word set one word is computed to serve as its label. 2) Modeling stage: the corpus is split into sentence sets, labeled in IOB format based on the label word sets, and an NER model is trained, with the optimal model parameters saved. 3) Marking stage: the NER model identifies the entity words and their positions in the text to be marked; the degree to which an entity word represents its label is described via the entity word influence area, and the degree to which it characterizes the text is described via the TF-IDF, TextRank, and LDA models; the score of each label component of the text is computed, labels with generalizing meaning are marked, and the labels are applied to text understanding, classification, and retrieval.

Description

Unsupervised text multi-label marking method based on entity word influence area evaluation standard
Technical Field
The invention relates to text multi-label marking, and involves clustering, named entity recognition, and keyword extraction techniques in the process.
Background
Text multi-label marking is a technique that, after analyzing the components of a given document, marks it with several content labels to assist understanding, classification, retrieval, and similar needs. In the text multi-label marking problem, semantic label extraction methods merely extract a text's keywords: labels of different texts share little commonality, the total number of labels is unbounded, their meanings are divided chaotically, and no overall analysis of a text set's content into representative components is possible. Multi-label classification methods mark documents against a given sample set, so the labels are limited in number and carry inductive meaning, but they must work under the supervision of the sample set, and building a multi-label text sample set of any scale is a difficult task. Existing text multi-label marking methods thus struggle, in the unsupervised setting, to mark texts with multiple labels while having the labels describe and generalize the overall content of the text set.
In this method, labels with inductive meaning for the corpus content are obtained through word vector clustering; the entity words of each label class contained in a given text are obtained through named entity recognition; and the representativeness of an entity word for the text is evaluated with keyword extraction techniques, so that the proportion of each label's content in the text can be analyzed. Since the representativeness of an isolated entity word for its label content is hard to judge, a reward mechanism in the content component calculation is provided for adjacently occurring entity words of the same label, and the mutual weakening of adjacent entity words of different classes is handled correspondingly. To count the entity word contents of the different labels with more accurate representative meaning, the invention therefore introduces the co-occurrence relations and distribution characteristics of concept-describing words via the entity word influence area, and proposes an unsupervised text multi-label marking method based on an entity word influence area evaluation standard.
Disclosure of Invention
The invention provides an unsupervised text multi-label marking method based on an entity word influence area evaluation standard. Word segmentation, word embedding, and word vector clustering are performed on a training corpus to obtain several word sets that partition the overall content of the corpus; for each word set, one word is computed, combining word frequency and the average word vector, to serve as the label representing that set; an NER model is trained on these word sets; named entity recognition identifies each labeled entity word and its position in a text; the degree to which an entity word characterizes the text is described with keyword extraction techniques; the degree to which a label-describing entity word represents its label is evaluated via the entity word influence area; and the component proportion of a given label's content in a given text is evaluated, so that multi-label marking is completed under unsupervised conditions.
The keyword extraction techniques used in the invention are TF-IDF, TextRank, and LDA; each represents a feature model built from the corpus segmentation, which describes, through a weighted word set, the degree to which the words of a text characterize it. Exchanging the order of NER sequence labeling and feature model word set matching yields label component score calculations under two flows. Meanwhile, the concept of the entity word influence area is introduced: words close to an entity word are considered to carry weaker similar semantics, so part of the content around the entity word is regarded as the word's influence area at two granularities, character level and sentence level, and the counted word numbers are weighted over the influence area, from which the component score of the entity word's label is computed. The influence area weight around an entity word decays with distance as a power series; influence area scores between adjacent words of different classes weaken each other, while those between adjacent words of the same class receive a co-occurrence bonus, giving a more complete component score calculation for the entity word contents of the different labels. The final normalized component score of each label in the text is obtained by weighted combination of the normalized component scores under all the methods; labels whose scores are not smaller than a given threshold are assigned to the document, with the scores as percentage values of the label components.
An unsupervised text multi-label marking method based on an entity word influence area evaluation standard comprises the following steps:
1) Preparation stage
a) Collecting a number of texts to be marked as the corpus C = {d_i}, where d_i denotes a document;
b) Performing word segmentation and stop-word filtering on the whole corpus C with Jieba to obtain a segmentation file P, loading a word vector file trained on the Baidu Encyclopedia corpus with Word2Vec, and filtering out words that do not appear in P, obtaining a word set W and a word vector set {e_w} that map the words of corpus C into a vector space;
c) Clustering the word vector set {e_w} to obtain several word sets with cohesive word senses, and computing from each word set, based on word frequency and the average word vector, one word to serve as the tag name representing that set;
d) Ending the preparation stage;
2) Training stage
a) Dividing corpus C into a training set and a validation set at an 8:2 ratio, splitting each document into a sentence set by punctuation, and labeling in IOB format using the labeled named-entity word sets (allowing a long label to cover a short one) to form the NER training data set (a labeling sketch follows this step list);
b) Constructing an NER model (the invention uses BiLSTM-ATT-CRF), training it on the data set, and saving the optimal model parameters;
c) Ending the training stage;
3) Marking stage
a) Acquiring a document t_i to be marked;
b) Identifying the entity words and their positions in document t_i with the NER model, computing the influence area range and weight of each word from the word lengths, co-occurrence relations, and distribution characteristics of the entity words of the different label classes, and evaluating the components of each label based on the entity word influence areas and the TF-IDF, TextRank, and LDA models, thereby computing the normalized component score of each label under the various calculation strategies;
c) Assigning empirical weights to the scores obtained under the various calculation strategies to obtain the final normalized component score of each label, assigning to the document the labels whose scores are not smaller than a given threshold, with the component scores serving as the labels' component percentages;
d) Repeating 3-a) to 3-c) until all documents to be marked are labeled;
e) Ending the marking stage;
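To make the IOB labeling of step 2-a) concrete (as referenced above), the following is a minimal sketch in Python; the function name and the dict layout (label name mapped to its entity word set) are illustrative assumptions, and longest-match priority stands in for "a long label covers a short label":

    def iob_label(sentence, label_words):
        # label_words: dict mapping a label name to its set of entity words.
        # Longer entities are tried first, so a long entity covers any
        # shorter entity contained in it.
        vocab = sorted(((w, lab) for lab, ws in label_words.items() for w in ws),
                       key=lambda x: -len(x[0]))
        tags = ["O"] * len(sentence)
        i = 0
        while i < len(sentence):
            for word, lab in vocab:
                if sentence.startswith(word, i):
                    tags[i] = "B-" + lab
                    for j in range(i + 1, i + len(word)):
                        tags[j] = "I-" + lab
                    i += len(word)
                    break
            else:
                i += 1
        return list(zip(sentence, tags))

Character-level tags of this kind pair naturally with a character-input sequence model such as the BiLSTM-ATT-CRF used in the training stage.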
The word vector clustering and naming process of step 1-c) is as follows:
To avoid the situation where no uniform clustering result can be obtained by adjusting a single t threshold (one cluster holds by far the most words, differing from the other clusters' word counts by an order of magnitude), recursive hierarchical clustering is adopted: the t threshold is selected by bisection, and the clustering granularity is controlled by setting an upper limit (the invention uses 2000; the recommended value is 5%-20% of the total number of words participating in clustering) and a lower limit (the invention uses 50) on the number of words allowed in a single cluster. The bisection ends when one of the following conditions is met:
1) A cluster exists whose word count exceeds the upper limit;
2) The number of clusters whose word counts meet the upper and lower limits is at least N (N may change with the recursion depth; the invention uniformly takes 2);
3) The length of the bisection interval for the t threshold is smaller than a fixed value δ (the invention takes 0.001);
after the bisection is finished, carrying out recursion processing on each cluster obtained by clustering under the current t threshold, removing clusters with the number of words lower than the set number lower limit from a clustering result, and carrying out the clustering of the same process on the clusters with the number of words exceeding the upper limit in a recursion manner until a final clustering result is obtained;
in the process of extracting the tag signature from the word sets, a word vector set { e } is calculated for each group of word sets w Weighted cluster center vector e, weight is the occurrence word frequency P of word w in word segmentation file P w Calculate word vector e w Cosine distance d from cluster center vector e w The term representation calculation formula is obtained as follows:
represent_score(w)=(1-d w )p w
The word with the largest representativeness in each word set is selected as the tag name representing that set;
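A short sketch of this naming step, assuming vecs maps each word to its Word2Vec vector e_w and word_freq maps each word to its frequency p_w in the segmentation file P (both names are illustrative):

    import numpy as np

    def tag_name(cluster_words, vecs, word_freq):
        # Frequency-weighted cluster centre vector e.
        freqs = np.array([word_freq[w] for w in cluster_words])
        E = np.stack([vecs[w] for w in cluster_words])
        centre = (freqs[:, None] * E).sum(axis=0) / freqs.sum()

        def cos_dist(a, b):
            return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

        # represent_score(w) = (1 - d_w) * p_w
        scores = [(1.0 - cos_dist(vecs[w], centre)) * word_freq[w]
                  for w in cluster_words]
        return cluster_words[int(np.argmax(scores))]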
The tag component score calculation of step 3-b) is as follows:
1) To distinguish the contribution of entity words to labels, the text content around an entity word is considered to carry similar semantics and is regarded as the word's influence area, reflecting the co-occurrence relations and distribution characteristics of entity words. A weighted sum of the number of words in the influence area (including the entity word itself; punctuation is not counted) is computed at two granularities, character level and sentence level, and the influence area score is used as the label component score. The calculation is as follows:
when the influence area scores are counted character by character, the co-occurrence is described in a emphasis mode, and the influence area scores obtained when the same tag class words are adjacent are required to be higher than those obtained when the tag class words are isolated from each other:
a) Basic rules: if the word length is l, q is used as c As the public ratio of the descending number of the weight of the affected area, 1-1/l, each word weight in the entity word is lambda (describing the representation of the entity word to the tag, the candidate values include TF-IDF value, textRank value, the occurrence frequency of words P in the word segmentation file P w The invention takes λ=1), the character with a distance d from the boundary of the entity word (considered as calculated on the text with punctuation removed, the first character distance d=1 outside the word) is given a weight λq c d /2;
b) Word isolation hypothesis: assuming that no other entity word exists in the context of the entity word, the influenced characters extend infinitely to both sides of the word;
c) Same-class adjacency: in the interval between adjacent entity words of the same tag class, to reward same-class co-occurrence, the total influence area score obtained by the tag class in the interval is the score the two words would obtain extending to infinity without mutual interference, plus a bonus score inversely related to the distance; the bonus must be computed for each of the two words separately, by the following formula:
where l is the word length, d is the length of the interval between the words (excluding punctuation), and d_0 is 0 when the two words lie in the same punctuation-delimited sentence and 1 otherwise;
d) Different-class adjacency: in the interval between adjacent entity words of different tags, if the influence area of a word reaches a punctuation mark before reaching the other word, the region up to that punctuation mark is counted entirely as its influence range; if an unassigned part remains in the interval, the influence ranges of the two words are assigned in proportion to their word lengths (non-integer lengths are allowed); finally the influence area score is computed from the length of each influence range and substituted into the weight formula;
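Under the word isolation hypothesis, the character-level score of a single entity word reduces to a truncated geometric series. A small sketch (the helper name is an assumption, and the finite context lengths stand in for the infinite extension):

    def char_influence_score(word_len, left_chars, right_chars, lam=1.0):
        # q_c = 1 - 1/l; each character of the word scores lam, and a
        # character at distance d outside the word scores lam * q_c**d / 2.
        q = 1.0 - 1.0 / word_len
        score = lam * word_len
        for side in (left_chars, right_chars):
            score += sum(lam * q ** d / 2.0 for d in range(1, side + 1))
        return score

    # An isolated word of length l tends to lam * (2*l - 1), since the
    # geometric tail on each side sums to (l - 1) / 2:
    print(char_influence_score(3, 200, 200))   # ~5.0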
When influence area scores are counted sentence by sentence, the distribution range of each tag's content is emphasized: words of the same class that are adjacent but not close are regarded as having larger influence areas, and obtain a higher score, than mutually isolated ones, while words of the same class that are close together are treated as semantic reinforcement and merged into one whole before the influence area score is computed:
a) Basic rules: extending from word to both sides, each punctuation mark or other word is broken into a sentence unit (allowing empty sentence units to be generated), the influence weight is reduced by power series according to the sentence unit, and the weight public ratio is q s (1/2) is taken by the invention, so that each word in the entity word has a weight of lambda (the representing degree of the entity word to the label is described by the invention, lambda=1), and each word in the ith sentence (the sentence adjacent to the word is 1 st) from the boundary of the entity word is given with the weight of lambda wq s i Where w is a weight based on word length (w=1-1/l, l is word length);
b) Word isolation hypothesis: assuming that no other entity word exists in the context of the entity word, the influenced sentence units extend to both sides of the word, and only units with i ≤ N are counted (N is positively correlated with the proportion of the union of the entity word sets in the Word2Vec word set W; the invention takes N = 2);
c) Same-class adjacency: if entity words of the same tag class are adjacent and the number of sentence units between them is greater than M (the invention takes 2), the influence areas of the two words facing each other together cover the whole interval between them, and each word's influence area score over the interval's sentence units is computed separately; if the number is less than or equal to M, the two words and the region between them are merged and treated as one long word under that tag class: instead of computing the two words' scores separately, the long word's influence area score is computed, with its weight λ taking the larger value λ_max of the two words' weights, and all rules (including this merging rule) are applied recursively to the merged long word;
d) Different-class adjacency: in the interval between adjacent entity words of different tags (including merged long words), the sentence units are divided equally by count between the influence areas of the two words; if a sentence unit is equidistant (in unit order) from both words, its words are assigned in proportion to the two word lengths;
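For an isolated entity word, the sentence-level rule likewise reduces to a short sum; the sketch below (the helper name and the per-side lists of sentence-unit word counts are assumed inputs) scores one word from the word counts of its neighbouring sentence units:

    def sent_influence_score(word_len, left_units, right_units,
                             lam=1.0, q_s=0.5, n_max=2):
        # Each word in the i-th sentence unit from the boundary scores
        # lam * w * q_s**i with w = 1 - 1/l; only units with i <= N count.
        w = 1.0 - 1.0 / word_len
        score = lam * word_len                  # the entity word itself
        for units in (left_units, right_units):
            for i, n_words in enumerate(units[:n_max], start=1):
                score += n_words * lam * w * q_s ** i
        return score

    # 4-character word, 5 words in the left unit, 3 in the right:
    print(sent_influence_score(4, [5], [3]))    # 4 + (5+3)*0.75*0.5 = 7.0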
Under the influence area strategy there are thus two calculation modes, character level and sentence level; in each mode, the sum of the influence area scores of all entity words that NER labels as belonging to a given label is taken as that label's component score in the text.
2) To distinguish the importance of an entity word to the text, the label content score contributed by entity word w is described by a feature model, with the calculation formula:
label_score(w) = feature(w) × len_weight(w)
where len_weight(w) is a weight based on word length (the invention takes the length of the word), and feature(w) is a feature weight describing the importance of a word in a document to that document; analyzing the text under three feature models yields a word set with feature weights, based respectively on the TF-IDF, TextRank, and LDA keyword extraction techniques;
a)TF-IDF:
feature(w) takes the TF-IDF value of the word; the term frequency TF is computed from the document's word segmentation, and the inverse document frequency IDF has two computation modes, over the corpus C to be labeled or over an additional corpus (the invention uses the Jieba corpus);
b)TextRank:
feature(w) takes the TextRank weight of the word, with segmentation and the TextRank of each word computed over the document;
c)LDA:
feature(w) is computed with the existing LDA-w method: an LDA model is fitted to corpus C to obtain a doc-topic distribution matrix θ and a topic-word distribution matrix φ (the invention uses Gibbs sampling); the topic-word matrix is normalized into a word-topic probability matrix, mapping each word to a topic vector with the topic number K as its dimension; a K-dimensional garbage topic vector with 1/K in every dimension is defined, and the cosine distance d between the word's topic vector and the garbage topic vector, together with the word's generation probability P(w), gives the feature(w) formula:
feature(w) = (1 - d) · P(w)
where P(w) = Σ_i P(z_i)·P(w|z_i), z_i ranges over the topics of word w, P(z_i) is the probability that a word in the document takes topic z_i, and P(w|z_i) is the conditional probability that topic z_i generates word w; P(z_i) and P(w|z_i) are taken from the matrices θ and φ respectively;
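A sketch of this LDA feature weight, assuming θ and φ have already been estimated (e.g., by Gibbs sampling as above) and that the word's K-dimensional topic vector has been read off the normalized word-topic matrix; the argument layout is illustrative:

    import numpy as np

    def lda_feature(word_topic_probs, doc_topic, topic_word_col):
        # word_topic_probs: K-dim P(topic | w); doc_topic: theta row of the
        # document; topic_word_col: P(w | z_i) column of phi for word w.
        K = len(word_topic_probs)
        garbage = np.full(K, 1.0 / K)            # uniform "garbage" topic vector
        cos_sim = (word_topic_probs @ garbage) / (
            np.linalg.norm(word_topic_probs) * np.linalg.norm(garbage))
        d = 1.0 - cos_sim                        # cosine distance
        p_w = float(doc_topic @ topic_word_col)  # P(w) = sum_i P(z_i) P(w|z_i)
        return (1.0 - d) * p_w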
The entity words labeled by NER do not necessarily coincide with the segmentation results of the three feature models, so there are two calculation orders:
a) Matching the feature model word set after NER: NER sequence labeling is performed on each document sentence by sentence (split at punctuation); from the obtained tag classes and entity word positions, component scores are computed for the corresponding labels through the three feature models; if an identified entity word has no corresponding word in a feature model, it is segmented again (the invention uses the precise mode of Jieba segmentation) and the component score calculation is repeated on the finer-grained words by the same procedure;
b) Performing NER on the word set of the feature model: for the entity words identified within each feature word, component scores are computed for the corresponding labels; if NER cannot mark any entity in a feature word but the word can be mapped to a word vector (using the Word2Vec model loaded in the preparation stage), the cosine distance between its vector and the average word vector of each label's word set is computed, the distances are normalized into probabilities that the word belongs to each label, and these probabilities serve as weights with which the word contributes its score to each label;
Under the feature model strategy, since the IDF in TF-IDF has two sources, there are four types of feature weights, and under the two orders, NER labeling first or feature word set matching first, eight feature model component scores are computed; in each mode, the sum of the label content scores contributed by all entity words belonging to a label is taken as that label's component score in the text;
With the two cases (character level and sentence level) under the influence area strategy and the eight cases under the feature model strategy, there are ten label component score calculation modes in total; in each mode, the score of each label is normalized into percentage form to obtain the normalized component scores;
The label assignment of step 3-c) is as follows:
The component scores obtained under each calculation strategy are combined according to empirical weights, which are tuned according to the labeling effect; the invention adopts the following weighting:
The average of the character-level and sentence-level scores under the influence area strategy is taken with weight 0.5; the average of the four NER-first feature model cases with weight 0.3; and the average of the three feature-word-set-first cases (removing the TF-IDF strategy whose IDF is computed on the corpus C to be marked) with weight 0.2, giving the final component score of each label;
The invention takes the reciprocal of the number of labels as the score threshold for assigning labels; if documents with a weak component tendency are to be given no label or a special label, the reciprocal multiplied by 1.1 is taken as the threshold.
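Putting the combination and thresholding together, a minimal sketch (the function name and the per-strategy score lists are assumed inputs; every score vector is already normalized over the labels):

    import numpy as np

    def assign_labels(influence_scores, ner_first_scores, feature_first_scores,
                      strict=False):
        # influence_scores: 2 vectors (character level, sentence level);
        # ner_first_scores: 4 vectors (NER first, then feature matching);
        # feature_first_scores: 3 vectors (feature word set first).
        final = (0.5 * np.mean(influence_scores, axis=0)
                 + 0.3 * np.mean(ner_first_scores, axis=0)
                 + 0.2 * np.mean(feature_first_scores, axis=0))
        threshold = 1.0 / len(final)            # reciprocal of label count
        if strict:                              # no-label / special-label mode
            threshold *= 1.1
        return {i: s for i, s in enumerate(final) if s >= threshold}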
Drawings
Fig. 1 is the overall flow chart.
Fig. 2 is the preparation stage flow chart.
Fig. 3 is the training stage flow chart.
Fig. 4 is the marking stage flow chart.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
As shown in fig. 1, the unsupervised text multi-label marking method based on an entity word influence area evaluation standard first obtains a corpus and generates the label word sets by analyzing it. After the corpus is labeled based on these word sets, the NER model is trained on it. After model training is completed, a document to be marked is obtained, multi-strategy component scores are computed from the entity words labeled by NER through the influence areas and the feature models, and finally the multi-label marking of the document is completed.
As shown in fig. 2, the preparation stage performs word segmentation and filtering of meaningless stop words on the corpus, maps words to word vectors through word embedding, clusters the word vectors into several word sets, and automatically generates a tag name for each, completing the preparation stage.
Step 2-0, starting the preparation stage;
step 2-1, word segmentation and stop word filtering are carried out on the corpus;
step 2-2, word embedding is carried out on the word segmentation result;
step 2-3, word vector clustering and word set naming;
step 2-4 ends the preparation phase.
Fig. 3 is a process description of the training phase.
Step 3-0, starting the training stage;
step 3-1, dividing the text into sentence sets according to punctuation;
step 3-2, labeling the IOB format based on a plurality of word sets with tag names;
step 3-3, performing NER model training by using the marked corpus;
step 3-4, saving optimal NER model parameters;
and 3-5, ending the training phase.
Fig. 4 is a flow description of the marking phase.
Step 4-0, starting the marking stage;
step 4-1, obtaining a document to be marked;
steps 4-2, 4-3 and 4-4, computing label component scores from the entity words marked by NER, combining the entity word influence areas with the three feature models TF-IDF, TextRank, and LDA;
step 4-5, combining empirical weights to obtain the final score of each label component;
step 4-6, assigning labels with scores not less than a given threshold value to the documents;
step 4-7, checking whether any document still needs to be marked; if so, jumping to step 4-1, otherwise going to step 4-8;
step 4-8 ends the marking phase.

Claims (3)

1. An unsupervised text multi-label marking method based on an entity word influence area evaluation standard, characterized by comprising the following steps:
1) A preparation phase comprising:
a) Collecting a number of texts to be marked as the corpus C = {d_i}, where d_i denotes a document;
b) Performing word segmentation and stop-word filtering on the whole corpus C with Jieba to obtain a segmentation file P, loading a word vector file trained on the Baidu Encyclopedia corpus with Word2Vec, and filtering out words that do not appear in P, obtaining a word set W and a word vector set {e_w} that map the words of corpus C into a vector space;
c) Clustering the word vector set {e_w} to obtain several word sets with cohesive word senses, and computing from each word set, based on word frequency and the average word vector, one word to serve as the tag name representing that set;
d) Ending the preparation stage;
2) A training phase comprising:
a) Dividing corpus C into a training set and a validation set at an 8:2 ratio, splitting each document into a sentence set by punctuation, and labeling in IOB format using the labeled named-entity word sets (allowing a long label to cover a short one) to form the NER training data set;
b) Constructing an NER model (the invention uses BiLSTM-ATT-CRF), training it on the data set, and saving the optimal model parameters;
c) Ending the training stage;
3) A marking phase comprising:
a) Acquiring a document t_i to be marked;
b) Identifying the entity words and their positions in document t_i with the NER model, computing the influence area range and weight of each word from the word lengths, co-occurrence relations, and distribution characteristics of the entity words of the different label classes, and evaluating the components of each label based on the entity word influence areas and the TF-IDF, TextRank, and LDA models, thereby computing the normalized component score of each label under the various calculation strategies;
c) Assigning empirical weights to the scores obtained under the various calculation strategies to obtain the final normalized component score of each label, assigning to the document the labels whose scores are not smaller than a given threshold, with the component scores serving as the labels' component percentages;
d) Repeating 3-a) to 3-c) until all documents to be marked are labeled;
e) The marking phase is ended.
2. The unsupervised text multi-label marking method based on an entity word influence area evaluation standard according to claim 1, characterized in that the evaluation of each label component based on the TF-IDF, TextRank, and LDA models in step 3-b), and thus the calculation of the normalized component score of each label under the various calculation strategies, comprises computing a weighted sum of the number of words in the influence area at two granularities, character level and sentence level, and taking the influence area score as the label component score, calculated as follows:
(1) When influence area scores are counted character by character, co-occurrence is emphasized:
a) Basic rules: if the word length is l, q is used as c As the public ratio of the descending number of the weight of the affected area, 1-1/l, each word weight in the entity word is lambda (describing the representation of the entity word to the tag, the candidate values include TF-IDF value, textRank value, the occurrence frequency of words P in the word segmentation file P w The invention takes λ=1), the character with a distance d from the boundary of the entity word (considered as calculated on the text with punctuation removed, the first character distance d=1 outside the word) is given a weight λq c d /2;
b) Word isolation hypothesis: assuming that no other entity word exists in the context of the entity word, the influenced characters extend infinitely to both sides of the word;
c) Same-class adjacency: in the interval between adjacent entity words of the same tag class, to reward same-class co-occurrence, the total influence area score obtained by the tag class in the interval is the score the two words would obtain extending to infinity without mutual interference, plus a bonus score inversely related to the distance; the bonus must be computed for each of the two words separately, by the following formula:
where l is the word length, d is the length of the interval between the words (excluding punctuation), and d_0 is 0 when the two words lie in the same punctuation-delimited sentence and 1 otherwise;
d) Different-class adjacency: in the interval between adjacent entity words of different tags, if the influence area of a word reaches a punctuation mark before reaching the other word, the region up to that punctuation mark is counted entirely as its influence range; if an unassigned part remains in the interval, the influence ranges of the two words are assigned in proportion to their word lengths (non-integer lengths are allowed); finally the influence area score is computed from the length of each influence range and substituted into the weight formula;
(2) When influence area scores are counted sentence by sentence, the distribution range of each tag's content is emphasized:
a) Basic rules: extending from word to both sides, each punctuation mark or other word is broken into a sentence unit (allowing empty sentence units to be generated), the influence weight is reduced by power series according to the sentence unit, and the weight public ratio is q s (1/2) is taken by the invention, so that each word in the entity word has a weight of lambda (the representing degree of the entity word to the label is described by the invention, lambda=1), and each word in the ith sentence (the sentence adjacent to the word is 1 st) from the boundary of the entity word is given with the weight of lambda wq s i Where w is a weight based on word length (w=1-1/l, l is word length);
b) Word isolation hypothesis: assuming that no other entity word exists in the context of the entity word, the influenced sentence units extend to both sides of the word, and only units with i ≤ N are counted (N is positively correlated with the proportion of the union of the entity word sets in the Word2Vec word set W; the invention takes N = 2);
c) Same-class adjacency: if entity words of the same tag class are adjacent and the number of sentence units between them is greater than M (the invention takes 2), the influence areas of the two words facing each other together cover the whole interval between them, and each word's influence area score over the interval's sentence units is computed separately; if the number is less than or equal to M, the two words and the region between them are merged and treated as one long word under that tag class: instead of computing the two words' scores separately, the long word's influence area score is computed, with its weight λ taking the larger value λ_max of the two words' weights, and all rules (including this merging rule) are applied recursively to the merged long word;
d) Different-class adjacency: in the interval between adjacent entity words of different tags (including merged long words), the sentence units are divided equally by count between the influence areas of the two words; if a sentence unit is equidistant (in unit order) from both words, its words are assigned in proportion to the two word lengths.
3. The unsupervised text multi-label marking method based on an entity word influence area evaluation standard according to claim 1, characterized in that, in evaluating each label component with the TF-IDF, TextRank, and LDA models in step 3-b) and assigning empirical weights to the scores of each calculation strategy in step 3-c), in order to distinguish the importance of an entity word to the text, the label content score contributed by entity word w is described by a feature model, with the calculation formula:
label_score(w) = feature(w) × len_weight(w)
where len_weight(w) is a weight based on word length (the invention takes the length of the word), and feature(w) is a feature weight describing the importance of a word in a document to that document; analyzing the text under three feature models yields a word set with feature weights, based respectively on the TF-IDF, TextRank, and LDA keyword extraction techniques;
The entity words labeled by NER do not necessarily coincide with the segmentation results of the three feature models, so there are two calculation orders:
a) Matching the feature model word set after NER: NER sequence labeling is performed on each document sentence by sentence (split at punctuation); from the obtained tag classes and entity word positions, component scores are computed for the corresponding labels through the three feature models; if an identified entity word has no corresponding word in a feature model, it is segmented again (the invention uses the precise mode of Jieba segmentation) and the component score calculation is repeated on the finer-grained words by the same procedure;
b) Performing NER on the word set of the feature model: for the entity words identified within each feature word, component scores are computed for the corresponding labels; if NER cannot mark any entity in a feature word but the word can be mapped to a word vector (using the Word2Vec model loaded in the preparation stage), the cosine distance between its vector and the average word vector of each label's word set is computed, the distances are normalized into probabilities that the word belongs to each label, and these probabilities serve as weights with which the word contributes its score to each label;
Under the feature model strategy, since the IDF in TF-IDF has two sources, the corpus C and an additional corpus (the invention uses the Jieba corpus), there are four types of feature weights, and under the two orders, NER labeling first or feature word set matching first, eight feature model component scores are computed; in each mode, the sum of the label content scores contributed by all entity words belonging to a label is taken as that label's component score in the text;
With the two cases (character level and sentence level) under the influence area strategy and the eight cases under the feature model strategy, there are ten label component score calculation modes in total; in each mode, the score of each label is normalized into percentage form to obtain the normalized component scores.
CN202310675101.8A 2023-06-08 2023-06-08 Unsupervised text multi-label marking method based on entity word influence area evaluation standard Pending CN116644182A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310675101.8A CN116644182A (en) 2023-06-08 2023-06-08 Unsupervised text multi-label marking method based on entity word influence area evaluation standard

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310675101.8A CN116644182A (en) 2023-06-08 2023-06-08 Unsupervised text multi-label marking method based on entity word influence area evaluation standard

Publications (1)

Publication Number Publication Date
CN116644182A (en) 2023-08-25

Family

ID=87643291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310675101.8A Pending CN116644182A (en) 2023-06-08 2023-06-08 Unsupervised text multi-label marking method based on entity word influence area evaluation standard

Country Status (1)

Country Link
CN (1) CN116644182A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination