CN110705247A - χ²-C-based text similarity calculation method - Google Patents
χ²-C-based text similarity calculation method
- Publication number
- CN110705247A (application CN201910811440.8A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- similarity
- feature
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The invention discloses a χ²-C-based text similarity calculation method, in particular in the field of text information processing. The method classifies a test data set by using a convolutional neural network (CNN), calculates the initial weight of each feature word in a detection sample according to TF-IDF, and then uses the χ²-C algorithm to calculate a domain association factor. The initial weight is calculated in combination with the domain association factor and the word position factor α to obtain the feature word weight; a word bank is established from all the feature words of the detection sample, and the detection sample is expressed as an initial text vector by combining the word bank and the feature word weights. The similarity between words in the word bank is calculated with the word2vec tool and formed into a word-sense similarity matrix; the initial text vector is multiplied by this matrix to obtain the text vector, and finally the cosine similarity algorithm is applied to the text vectors to obtain the similarity between texts. The method thereby incorporates the degree of association between the feature words and their domain, the semantic relations among the feature words, and word position information, improving the accuracy of text similarity calculation.
Description
Technical Field
The invention relates to the field of text information processing, in particular to a χ²-C-based text similarity calculation method.
Background
Text similarity is the calculation of semantic similarity between texts, and in this era of information explosion it is applied in many fields. Traditional text similarity calculation is based on the vector space model (VSM): TF-IDF is used to compute the weights of the feature words in a text, the text is converted into a vector in a multidimensional space, and the similarity between text vectors is calculated to measure the similarity between texts. However, the TF-IDF algorithm only considers the relation between a term and the document and ignores the term's relevance to the category, so the accuracy of text similarity calculation is low. The LDA topic model is a three-layer Bayesian probability model comprising words, topics and documents. The basic idea of calculating text similarity with LDA is to perform topic modeling on a text, traverse and extract the words of the text within the word distributions corresponding to the topics to obtain the topic distribution of the text, and calculate text similarity from that distribution. Because a short text contains few representative words, LDA does not necessarily achieve the expected effect for topic mining on short texts and is better suited to long texts. The currently most popular text similarity calculation method uses a convolutional neural network: the word vectors of all words in a sentence are stacked in order into a two-dimensional matrix, a CNN model is applied on this basis, the structural properties of the CNN are used to extract and filter local semantics in the sentence, and the feature vector representation of the sentence is finally abstracted. However, the CNN model has the defects of a complex structure, many parameters, and long running time.
Disclosure of Invention
The invention aims to address these defects and provides a χ²-C-based text similarity calculation method that adopts TF-IDF combined with the χ²-C algorithm to calculate the feature word weights, incorporates word position information, and uses word2vec to add word-sense information to the text vector.
The invention specifically adopts the following technical scheme:
A χ²-C-based text similarity calculation method, comprising the steps of:
step 1: preprocessing the test data and the content of the corpus;
step 2: classifying the test data set using a Convolutional Neural Network (CNN);
Step 3: calculating the initial weight of the feature words in the detection sample by using a TF-IDF algorithm;
Step 4: calculating a domain association factor Ca by using the χ²-C algorithm;
Step 5: calculating the feature word weight from the initial weight by using the word position factor α in combination with the domain association factor Ca;
Step 6: establishing a word bank and generating an initial text vector according to the feature word weights obtained in step 5;
Step 7: calculating the word-sense similarity among the feature words by using the word2vec tool to obtain a similarity matrix P;
Step 8: calculating the text vector from the initial text vector according to the similarity matrix obtained in step 7;
Step 9: calculating the text similarity by applying the cosine similarity algorithm to the text vectors generated in step 8.
Preferably, step 4 comprises:
the χ²-C algorithm combines the χ² statistic into a domain association algorithm and calculates the domain association factor Ca; because the χ²(d,c) values calculated for some feature words are large, which greatly influences the subsequent calculation and makes the result inaccurate, formula (1) is adopted to process the calculation result:
χ²'(d,c) = χ²(d,c) / Σ_{i=1..count} χ²(d_i, c)   (1)
wherein d represents a feature word, c represents the domain described by the feature word, d_i represents the ith feature word in the text, and count represents the total number of feature words in the text;
Ca is calculated by formulae (2), (3) and (4), of which (2) and (3) are:
q_d = e_d / E_d   (2)
p_d = n_d / N_d   (3)
wherein q_d indicates the proportion of documents in the positive class containing the feature word d, e_d represents the number of documents in the positive class containing the feature word d, E_d indicates the total number of documents in the positive class, p_d represents the proportion of documents in the negative class containing the feature word d, n_d represents the number of documents in the negative class containing d, N_d represents the total number of documents in the negative class, and w_d represents the degree of association of d with its domain.
Preferably, in step 5, in order to increase the degree of association between the feature words and the domain, formula (5) combines the domain association factor Ca(d,c) and the word position information α with the initial weight to obtain the feature word weight:
w'_dt = Ca(d,c) × w_dt × α   (5)
wherein w'_dt represents the feature word weight and α represents the position information of the word; α is obtained by equation (6).
Preferably, in step 6, the feature words of texts β and γ are first merged to establish a word bank of texts β and γ, so that each word has a corresponding label;
the two texts to be tested, β and γ, are then expressed in the form of formula (7) by means of the word bank:
v_k = <w'_1k, w'_2k, …, w'_ηk, …, w'_Nk>   (7)
wherein v_k represents the initial text vector; the feature word sets of β and γ are S_dβ = {d_β1, d_β2, d_β3, …, d_βi, …, d_βn} and S_dγ = {d_γ1, d_γ2, d_γ3, …, d_γj, …, d_γm}; the feature word weight sets of β and γ are S_wβ = {w'_β1, w'_β2, w'_β3, …, w'_βn} and S_wγ = {w'_γ1, w'_γ2, w'_γ3, …, w'_γm}; k ∈ {β, γ}; w'_kη ∈ S_wβ ∪ S_wγ; η represents the label of feature word d_η in the word bank, d_η ∈ S_dβ ∪ S_dγ; and N is the total number of feature words in the word bank;
for text β, w'_βη obeys equation (8), and for text γ, w'_γη obeys equation (9): the component equals the word's weight when d_η occurs in the text, and 0 otherwise;
d_η refers to the word labeled η in the word bank, with d_βi ∈ S_dβ and d_γi ∈ S_dγ.
Preferably, in step 7, the similarity between the feature word labeled i and the feature word labeled j in the word bank is denoted sim_ij; the similarity matrix among the feature words is then the N × N matrix of formula (10), whose (i, j) entry is sim_ij.
Preferably, in step 9, word-sense information is added to the text vector, and the β and γ text vectors are updated by using the similarity matrix; the text vectors of the test texts β and γ are expressed by equations (11) and (12), respectively:
v_β = <w'_β1, w'_β2, …, w'_βη, …, w'_βN>   (11)
v_γ = <w'_γ1, w'_γ2, …, w'_γη, …, w'_γN>   (12)
the inner product of the initial text vector and the feature word similarity matrix gives the text vector:
v'_k = v_k × P   (13)
where k ∈ {β, γ}; the new text vectors v'_β = <w''_β1, w''_β2, …, w''_βη, …, w''_βN> and v'_γ = <w''_γ1, w''_γ2, …, w''_γη, …, w''_γN> are thereby computed, and each text vector contains word frequency, word sense, word position information and domain association degree.
The invention has the following beneficial effects:
the invention uses TF-IDF to combine chi2the-C algorithm calculates the weight of the feature words, increases word position information, utilizes word2vec to increase word meaning information to the text vector, considers the influence of the field on the distribution condition of the feature words and the semantic relation among the feature words, and the experimental result shows that the method is based on chi2The calculation method accuracy of the similarity of the C text is 82.64%, and the F1 value is 84.09%, which is superior to other methods.
Drawings
FIG. 1 is a flow chart of the χ²-C-based text similarity calculation.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings:
The χ²-C-based text similarity calculation method comprises the following steps:
step 1: preprocessing the test data and the content of the corpus;
step 2: classifying the test data set using a Convolutional Neural Network (CNN);
Step 3: calculating the initial weight of the feature words in the detection sample by using the TF-IDF algorithm:
W_dt = TF_dt × IDF_dt, with TF_dt = m_d / S and IDF_dt = log(N / n_d), scaled by the normalization factor nf;
wherein W_dt represents the weight of the feature word d in document t, TF_dt represents the word frequency of the feature word d in document t, m_d is the number of occurrences of the feature word d in document t, S represents the total number of feature words in document t, IDF_dt is the inverse document frequency of the feature word d, n_d is the number of texts in the corpus containing the feature word d, N represents the total number of texts in the corpus, and nf is a normalization factor.
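As a rough illustration (not the patented implementation), the TF-IDF initial weights of step 3 can be sketched as follows; the toy corpus and the omission of the normalization factor nf are simplifying assumptions:

```python
import math

def tf_idf_weights(doc_tokens, corpus):
    """Initial weight W_dt = TF_dt * IDF_dt for each feature word d in
    document t (a token list), given a corpus of token lists."""
    n_docs = len(corpus)
    total = len(doc_tokens)  # S: total number of feature words in the document
    weights = {}
    for d in set(doc_tokens):
        tf = doc_tokens.count(d) / total            # TF_dt = m_d / S
        n_d = sum(1 for doc in corpus if d in doc)  # texts containing d
        idf = math.log(n_docs / n_d)                # IDF_dt = log(N / n_d)
        weights[d] = tf * idf
    return weights
```

A word occurring in every document gets IDF 0, reflecting TF-IDF's weak ability to distinguish domains, which the patent addresses with the domain association factor of step 4.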
Step 4: calculating a domain association factor Ca (class association) by using the χ²-C algorithm;
the χ²-C algorithm combines the χ² statistic into a domain correlation algorithm and calculates the domain correlation factor Ca; the domain relation is assumed to follow a distribution similar to the χ² distribution with one degree of freedom, and the value of the χ² statistic is inversely proportional to the independence between the feature word d and the domain c; equation (14) is the chi-square test of the feature word d for the domain c:
χ²(d,c) = N × (A×D − B×C)² / ((A+B) × (C+D) × (A+C) × (B+D))   (14)
wherein A represents the number of documents belonging to c and containing d, B the number of documents not belonging to c but containing d, C the number of documents belonging to c but not containing d, D the number of documents neither belonging to c nor containing d, and N the total number of documents;
because the χ²(d,c) values calculated for some feature words are large, which greatly influences the subsequent calculation and makes the result inaccurate, formula (1) is adopted to process the calculation result;
wherein d represents a feature word, c represents the domain described by the feature word, and d_i represents the ith feature word in the text;
Ca is calculated by formulae (2), (3) and (4), of which (2) and (3) are:
q_d = e_d / E_d   (2)
p_d = n_d / N_d   (3)
wherein q_d indicates the proportion of documents in the positive class containing the feature word d, e_d represents the number of documents in the positive class containing the feature word d, E_d indicates the total number of documents in the positive class, p_d represents the proportion of documents in the negative class containing the feature word d, n_d represents the number of documents in the negative class containing d, N_d represents the total number of documents in the negative class, and w_d represents the degree of association of d with its domain.
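The chi-square test of equation (14) and the scaling of formula (1) can be sketched as below; the combination of q_d and p_d into Ca (formulae (2)–(4)) is not fully reproduced in this text, so the sketch stops at the normalized statistic:

```python
def chi_square(a, b, c, d):
    """Equation (14): chi-square of a feature word for a domain, where
    a = docs in the domain containing the word, b = docs outside the domain
    containing it, c = docs in the domain without it, d = docs outside the
    domain without it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom

def normalize_chi(chis):
    """Formula (1): scale each word's chi-square by the sum over all feature
    words in the text, so a few very large values do not dominate later steps."""
    total = sum(chis.values())
    return {w: v / total for w, v in chis.items()}
```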
Step 5: calculating the feature word weight from the initial weight by using the word position factor α in combination with the domain association factor Ca;
in order to increase the degree of association between the feature words and the domain, formula (5) combines the domain association factor Ca(d,c) and the word position information α with the initial weight to obtain the feature word weight:
w'_dt = Ca(d,c) × w_dt × α   (5)
wherein w'_dt represents the feature word weight and α represents the position information of the word (first-position value > second-position value > third-position value); α is obtained using equation (6).
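Equation (5) is an elementwise product; a minimal sketch follows, in which the concrete values of the position factor α are invented for illustration, since equation (6) is not reproduced in this text (only the ordering first > second > third is stated):

```python
# Hypothetical position factors: earlier positions weigh more, matching the
# stated ordering in eq. (6). The actual values are assumptions.
POSITION_ALPHA = {"first": 1.5, "second": 1.2, "third": 1.0}

def feature_word_weight(initial_weight, ca, position):
    """Equation (5): w'_dt = Ca(d,c) * w_dt * alpha."""
    return ca * initial_weight * POSITION_ALPHA[position]
```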
Step 6: establishing a word bank and generating an initial text vector according to the feature word weights obtained in step 5;
a text vector is generated from the updated feature word weights: the feature words of texts β and γ are first merged to establish a word bank of texts β and γ, so that each word has a corresponding label; the purpose of establishing the word bank is to better represent the texts in vector form;
the two texts to be tested, β and γ, are then expressed in the form of formula (7) by means of the word bank:
v_k = <w'_1k, w'_2k, …, w'_ηk, …, w'_Nk>   (7)
wherein v_k represents the initial text vector; the feature word sets of β and γ are S_dβ = {d_β1, d_β2, d_β3, …, d_βi, …, d_βn} and S_dγ = {d_γ1, d_γ2, d_γ3, …, d_γj, …, d_γm}; the feature word weight sets of β and γ are S_wβ = {w'_β1, w'_β2, w'_β3, …, w'_βn} and S_wγ = {w'_γ1, w'_γ2, w'_γ3, …, w'_γm}; k ∈ {β, γ}; w'_kη ∈ S_wβ ∪ S_wγ; η represents the label of feature word d_η in the word bank, d_η ∈ S_dβ ∪ S_dγ; and N is the total number of feature words in the word bank;
for text β, w'_βη obeys equation (8), and for text γ, w'_γη obeys equation (9): the component equals the word's weight when d_η occurs in the text, and 0 otherwise;
d_η refers to the word labeled η in the word bank, with d_βi ∈ S_dβ and d_γi ∈ S_dγ.
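The word-bank construction and the zero-padded initial vectors implied by equations (7)–(9) can be sketched as:

```python
def initial_text_vectors(weights_beta, weights_gamma):
    """weights_* map each feature word to its weight w'. Returns the merged
    word bank (ordered labels) and the two initial vectors of equation (7),
    with 0 where a word is absent from a text (equations (8) and (9))."""
    lexicon = sorted(set(weights_beta) | set(weights_gamma))
    v_beta = [weights_beta.get(w, 0.0) for w in lexicon]
    v_gamma = [weights_gamma.get(w, 0.0) for w in lexicon]
    return lexicon, v_beta, v_gamma
```

Both texts are thereby expressed in the same N-dimensional space, which is what makes the later matrix product and cosine comparison well defined.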
Step 7: calculating the word-sense similarity among the feature words by using the word2vec tool to obtain a similarity matrix P;
the word-sense similarity among the feature words is calculated with the word2vec tool, and the similarity matrix is listed according to these similarities.
word2vec can generate word vectors containing semantic information; the similarity between two words is obtained by applying the cosine similarity algorithm to the word vectors corresponding to the words:
sim(A, B) = Σ_{i=1..p} A_i × B_i / (√(Σ_{i=1..p} A_i²) × √(Σ_{i=1..p} B_i²))
wherein sim(A, B) is the degree of similarity between the two words, A and B are the word vectors of the feature words produced by word2vec, and p represents the dimensionality of the word vectors.
With sim_ij denoting the similarity between the feature word labeled i and the feature word labeled j in the word bank, the similarity matrix among the feature words is the N × N matrix of formula (10), whose (i, j) entry is sim_ij.
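The pairwise cosine similarities and the matrix P of formula (10) can be sketched as follows; in practice the word vectors would come from a trained word2vec model (e.g. gensim's Word2Vec), which is assumed rather than trained here:

```python
import math

def cosine(a, b):
    """Cosine similarity of two word vectors (e.g. word2vec embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(word_vectors, lexicon):
    """Formula (10): P[i][j] = sim_ij between lexicon words i and j."""
    return [[cosine(word_vectors[wi], word_vectors[wj]) for wj in lexicon]
            for wi in lexicon]
```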
and 8: calculating an initial text vector according to the similarity matrix obtained in the step 7 to obtain a text vector;
and step 9: and (4) calculating the text vector generated in the step (8) by using a cosine similarity algorithm to obtain the text similarity.
Adding word sense information in the text vector, and updating beta and gamma text vectors by using a similarity matrix; the text vectors of the test texts β and γ are respectively expressed by equations (11) and (12):
vβ=<w'β1,w'β1,w'βη……w'βN)>(11)
vγ=<w'γ1,w'γ1,w'γη……w'γN)>(12)
performing inner product on the initial text vector and the feature word similarity matrix to obtain a text vector;
v'k=vk×P (13)
where k ∈ { β, γ }, a new text vector v 'is computed'β=<w”β1,w”β1,w”βη……w”βN)>And v'γ=<w”γ1,w”γ1,w”γη……w”γN)>The text vector includes word frequency, word sense, word position information and domain association degree.
Equation (15) is a cosine similarity calculation equation.
Wherein sim (v'β,v'γ) Representing the degree of similarity, v ', between the texts β and γ'βAnd v'γIs the final text vector, w, of the text corresponding to β and γ "βiV 'to'βValue of the ith dimension of the medium, similarly w "γiV 'to'γThe value of the ith dimension, P' is the dimension of the text vector.
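Steps 8 and 9 — the row-vector-times-matrix product of equation (13) followed by the cosine of equation (15) — can be sketched as:

```python
import math

def update_vector(v, p_matrix):
    """Equation (13): v'_k = v_k x P (row vector times similarity matrix),
    which spreads each word's weight onto semantically similar words."""
    n = len(v)
    return [sum(v[i] * p_matrix[i][j] for i in range(n)) for j in range(n)]

def text_similarity(v1, v2):
    """Equation (15): cosine similarity between the two final text vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm
```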
A subset of about 8000 texts of the THUCNews news text classification data set is selected as an embodiment, and the texts are classified into the categories science and technology, sports, politics and entertainment by a convolutional neural network (CNN).
I. The contents of the test data set and the corpus are preprocessed, mainly comprising the following steps:
(1) Deleting symbols in the text content: commas, question marks, exclamation marks, etc.
(2) Removing stop words: deleting stop words from the text content; common stop words include pronouns, prepositions, adverbs, modal particles, conjunctions and other words without definite meaning.
(3) Word segmentation: the text content with stop words removed is segmented into words.
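Steps (1)–(3) can be sketched as below; the punctuation set and stop-word list are illustrative assumptions, and the text is taken as already segmented (in practice a Chinese segmenter such as jieba would produce the token list):

```python
PUNCTUATION = set("，。！？、；：,.!?;:")      # symbols to delete, per step (1)
STOP_WORDS = {"的", "是", "在", "了", "和"}   # illustrative stop-word list

def preprocess(tokens):
    """Remove punctuation symbols and stop words from a segmented text."""
    return [t for t in tokens
            if t not in PUNCTUATION and t not in STOP_WORDS]
```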
II, selecting scientific and technological detection samples: text β and text γ, wherein the content of text β:
"natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. "
Content of text γ:
"natural language processing is a branch of data science, and its main coverage is: the text data is systematically analyzed, understood and information extracted in an intelligent and efficient mode. "
After preprocessing the text β:
natural language/processing/computer/science/domain/artificial intelligence/domain/importance/direction/research/enable/implementation/human/computer/inter/use/natural language/go/active/communication/theory/method
After preprocessing of the text γ:
natural language/processing/data/science/branching/main/overlay/content/intelligence/efficiency/manner/pair/text/data/systematization/analysis/understanding/information extraction/process
III. The initial weights of the feature words in the test texts are calculated with the TF-IDF algorithm; TF refers to the frequency of a feature word in the specified document and is calculated by formula (16):
TF = m_d / S   (16)
For the text β, the TF values of the remaining feature words are 0.0526, except that the TF values corresponding to "natural language" and "domain" are 0.10526, respectively.
For the text γ, the TF values for the remaining tokens are 0.0556, except that "data" corresponds to a TF value of 0.1111.
The IDF (inverse document frequency) is a measure of the general importance of a term: the fewer documents contain a certain feature word, the larger its IDF value; IDF is calculated by formula (17):
IDF = log(N / n_d)   (17)
The IDF values of the text β are as in table 1:
TABLE 1
The IDF value of the text γ is as in table 2:
TABLE 2
The weights corresponding to the feature words in the text β are shown in table 3:
TABLE 3
The initial weight corresponding to each feature word in the text γ is shown in table 4:
TABLE 4
As shown in Table 3, the initial weights corresponding to feature words that are important in the natural language processing field, such as "natural language" and "processing", are not high, mainly because the IDF calculation has weak ability to distinguish the domain and does not reflect the degree of association between the feature words and the domain.
IV. The degree of association between the feature words and the domain is calculated with the chi-square statistic to obtain the domain association factor. According to formula (1) and formula (14), the χ²'(d,c) value corresponding to each feature word can be calculated.
The chi-squared statistics for the feature words in text β are shown in table 5.
TABLE 5
The chi-square statistic corresponding to the feature words in the text gamma is shown in table 6.
TABLE 6
As can be seen from Tables 5 and 6, the feature words that are more important in the natural language processing field have higher chi-square statistics, which indicates that the chi-square statistic represents the degree of association between feature words and the domain well.
V. The domain association degree of the feature words is calculated using the χ²-C algorithm to obtain the domain association factor Ca, which, combined with the chi-square statistic, better represents the degree of association between the feature words and the domain and lays a good foundation for calculating the text similarity. The domain association factor Ca of text β is shown in Table 7.
TABLE 7
Table 8 shows the domain association factor Ca for each feature word in γ:
TABLE 8
The initial weight and the domain association factor Ca have now been calculated for each feature word in this example. Ca and the initial weight are combined according to equation (5) to calculate the feature word weights. Table 9 and Table 10 show the weights of the feature words in text β and text γ, respectively.
TABLE 9
TABLE 10
The corresponding text vectors are generated using the calculated word weights; a word bank is established from all the words in the test texts β and γ, and Table 11 shows the word bank.
TABLE 11
Text vectors corresponding to the texts are generated according to the word bank and formulas (8) and (9):
v_βs = <0.502677433, 0.108777994, 0, 0.06002695, 0, 0.178649449, 0, 0.000008, 0, 0.000283959, ……, 0.077038474>
v_γs = <0.146430991, 0.063410661, 0.020417404, 0.015952149, 0.045591454, 0, 0.140432493, 0, 0.013076936, ……, 0>
VI. The word-sense similarity among the feature words is calculated with the word2vec tool to obtain the similarity matrix P. word2vec is mainly used to train large-scale corpora into high-quality word vectors, which are not only low-dimensional and dense but also contain semantic information. The similarity between two words is calculated with the cosine similarity, and the similarity matrix is established from the results and the word bank.
The subscripts are the labels of the corresponding words in the word bank; for example, sim_56 is the similarity between the feature words labeled d5 and d6, i.e., between the meanings of "high efficiency" and "artificial intelligence". The matrix P is as follows:
VII. The calculated similarity matrix P is used to update the text vectors v_βs and v_γs according to formula (13), i.e.
v_βz = v_βs × P
v_γz = v_γs × P
v_βz and v_γz are the final text vectors, which now contain word frequency, word sense, word position information and domain association degree. The similarity of the two texts is then calculated according to the cosine similarity algorithm; the similarity of texts β and γ is 0.89325.
The method is compared with other text similarity algorithms. 12124 preprocessed texts are used as the corpus, the text set is classified with the convolutional neural network CNN, and science and technology, education, sports and military are selected as the test text sets. In order to present the experimental results better, the text data in each class is manually screened, and for low-similarity cases similar texts are constructed by manual modification.
The accuracy and the recall rate are used as evaluation standards; the accuracy P, recall rate R and F1 value of each class are calculated from the data, and finally the average accuracy, average recall rate and average F1 value are obtained. The calculation formulas are formulas (18), (19) and (20):
P = TP / (TP + FP)   (18)
R = TP / (TP + FN)   (19)
F1 = 2 × P × R / (P + R)   (20)
TABLE 12
In order to verify the effect of different algorithms on calculating text similarity, the four text similarity algorithms — based on χ²-C, on the LDA model, on HowNet (知网) and on the convolutional neural network CNN — are compared experimentally; the evaluation indexes are the accuracy, recall rate and F1 value of formulas (18), (19) and (20), and the experimental results are shown in Table 12.
As can be seen from the experimental results in Table 12, the accuracy, recall rate and F1 value of the χ²-C-based text similarity algorithm are slightly better than those of the other three algorithms: LDA works poorly for topic mining on short texts and does not take into account the structure and word position information of the text; the knowledge base of HowNet contains a limited vocabulary and lacks many professional terms and uncommon words, so its similarity results have low accuracy; and although the CNN-based algorithm achieves high accuracy, it has many parameters, a complex model and a long running time. The χ²-C-based algorithm has a simple structure, incorporates the degree of association between the feature words and their domain, and uses the word2vec tool to make the text vectors contain word-sense information, improving the accuracy of text similarity calculation.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.
Claims (6)
1. A χ²-C-based text similarity calculation method, characterized by comprising the steps of:
step 1: preprocessing the test data and the content of the corpus;
step 2: classifying the test data set using a Convolutional Neural Network (CNN);
Step 3: calculating the initial weight of the feature words in the detection sample by using a TF-IDF algorithm;
Step 4: calculating a domain association factor Ca by using the χ²-C algorithm;
Step 5: calculating the feature word weight from the initial weight by using the word position factor α in combination with the domain association factor Ca;
Step 6: establishing a word bank and generating an initial text vector according to the feature word weights obtained in step 5;
Step 7: calculating the word-sense similarity among the feature words by using the word2vec tool to obtain a similarity matrix P;
Step 8: calculating the text vector from the initial text vector according to the similarity matrix obtained in step 7;
Step 9: calculating the text similarity by applying the cosine similarity algorithm to the text vectors generated in step 8.
2. The χ²-C-based text similarity calculation method according to claim 1, wherein step 4 comprises:
the χ²-C algorithm combines the χ² statistic into a domain association algorithm and calculates the domain association factor Ca; because the χ²(d,c) values calculated for some feature words are large, which greatly influences the subsequent calculation and makes the result inaccurate, formula (1) is adopted to process the calculation result;
wherein d represents a feature word, c represents the domain described by the feature word, d_i represents the ith feature word in the text, and count represents the total number of feature words in the text;
Ca is calculated by formulae (2), (3) and (4):
wherein q_d indicates the proportion of documents in the positive class containing the feature word d, e_d represents the number of documents in the positive class containing the feature word d, E_d indicates the total number of documents in the positive class, p_d represents the proportion of documents in the negative class containing the feature word d, n_d represents the number of documents in the negative class containing d, N_d represents the total number of documents in the negative class, and w_d represents the degree of association of d with its domain.
3. The text similarity calculation method based on χ²-C of claim 1, wherein in step 5, in order to increase the degree of association between the feature words and the domain, formula (5) combines the domain association factor Ca(d,c) with the word position information α to calculate the initial weight of the feature words, obtaining the feature word weights;
w'_dt = Ca(d,c) × w_dt × α  (5)
wherein w'_dt represents the feature word weight, α represents the position information of the word, and α is obtained using formula (6):
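Formula (5) is a plain product of three factors. A minimal sketch, noting that formula (6) for α is only an image in the source (a typical convention, assumed here, is α > 1 for title words and α = 1 for body words):

```python
def feature_weight(w_dt, ca, alpha):
    """Formula (5): w'_dt = Ca(d,c) × w_dt × α.

    w_dt:  base weight of feature word d in text t (e.g. a TF-IDF value).
    ca:    domain association factor Ca(d,c) from formulas (2)-(4).
    alpha: word position factor from formula (6) (image-only in the source;
           assumed here to boost words in prominent positions such as titles).
    """
    return ca * w_dt * alpha
```

So a base weight of 0.5 with Ca = 0.8 and a title boost α = 1.2 yields a final weight of 0.48.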
4. The text similarity calculation method based on χ²-C of claim 1, wherein in step 6, the feature words of the texts β and γ are first combined to establish a lexicon for β and γ, so that each word has a corresponding label;
then the two texts to be tested, β and γ, are each expressed in the form of formula (7) using the lexicon:
v_k = <w'_1k, w'_2k, ..., w'_ηk, ..., w'_Nk>  (7)
wherein v_k represents the initial text vector; the feature word sets of β and γ are S_dβ = {d_β1, d_β2, d_β3, ..., d_βi, ..., d_βn} and S_dγ = {d_γ1, d_γ2, d_γ3, ..., d_γj, ..., d_γm} respectively; the feature word weight sets of β and γ are S_wβ = {w'_β1, w'_β2, w'_β3, ..., w'_βj, ..., w'_βn} and S_wγ = {w'_γ1, w'_γ2, w'_γ3, ..., w'_γj, ..., w'_γm} respectively; k ∈ {β, γ}; w'_kη ∈ S_wβ ∪ S_wγ; η represents the label of feature word d_η in the lexicon, with d_η ∈ S_dβ ∪ S_dγ; and N is the total number of feature words in the lexicon;
w'_kη in the text β obeys formula (8);
w'_kη in the text γ obeys formula (9);
d_η refers to the word labeled η in the lexicon, and d_βi ∈ S_dβ, d_γj ∈ S_dγ.
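Step 6 can be sketched as follows, under the assumption (formulas (8) and (9) are image-only in the source) that w'_kη equals the feature word's weight when the word labeled η occurs in text k, and 0 otherwise:

```python
def initial_vectors(weights_beta, weights_gamma):
    """Build the shared lexicon and the initial text vectors of formula (7).

    weights_beta / weights_gamma map each feature word of texts β / γ to its
    weight w' from formula (5). Sorting the union of both word sets assigns
    every word a stable label η; a word absent from a text contributes 0
    (the assumed reading of formulas (8) and (9)).
    """
    lexicon = sorted(set(weights_beta) | set(weights_gamma))
    v_beta = [weights_beta.get(word, 0.0) for word in lexicon]
    v_gamma = [weights_gamma.get(word, 0.0) for word in lexicon]
    return lexicon, v_beta, v_gamma

# Toy usage: two texts sharing one feature word ("dog").
lexicon, v_beta, v_gamma = initial_vectors(
    {"cat": 1.0, "dog": 2.0},
    {"dog": 3.0, "car": 4.0},
)
```

Both vectors have the same dimension N (here 3), so they can be compared component-wise in the later steps.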
6. The text similarity calculation method based on χ²-C of claim 1, wherein in step 9, word sense information is added to the text vectors, and the β and γ text vectors are updated using the similarity matrix; the text vectors of the test texts β and γ are expressed by formulas (11) and (12) respectively:
v_β = <w'_β1, w'_β2, ..., w'_βη, ..., w'_βN>  (11)
v_γ = <w'_γ1, w'_γ2, ..., w'_γη, ..., w'_γN>  (12)
the inner product of the initial text vector and the feature word similarity matrix gives the text vector:
v'_k = v_k × P  (13)
where k ∈ {β, γ}; the new text vectors v'_β = <w''_β1, w''_β2, ..., w''_βη, ..., w''_βN> and v'_γ = <w''_γ1, w''_γ2, ..., w''_γη, ..., w''_γN> are obtained; the text vectors incorporate word frequency, word sense, word position information, and domain association degree.
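Formula (13) followed by the cosine comparison can be sketched in a few lines; the vector-matrix product and cosine are spelled out explicitly so the sketch stays self-contained:

```python
import math

def update_and_similarity(v_beta, v_gamma, P):
    """Apply formula (13), v'_k = v_k × P, then return the cosine similarity
    of the two updated vectors (the step-9 comparison).

    P is the N×N feature word similarity matrix from step 7, given as a
    list of rows.
    """
    def matvec(v, M):
        # Row vector times matrix: result[j] = sum_i v[i] * M[i][j].
        return [sum(v[i] * M[i][j] for i in range(len(v)))
                for j in range(len(M[0]))]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    return cosine(matvec(v_beta, P), matvec(v_gamma, P))

# Toy usage: two texts with disjoint feature words, but the words are 50%
# similar according to P, so the updated vectors overlap.
sim = update_and_similarity([1.0, 0.0], [0.0, 1.0], [[1.0, 0.5], [0.5, 1.0]])
```

With the identity matrix for P the result reduces to the plain cosine of the initial vectors; the off-diagonal entries of P are what let semantically related but non-identical words contribute to the similarity.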
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910811440.8A CN110705247B (en) | 2019-08-30 | 2019-08-30 | Text similarity calculation method based on χ²-C |
PCT/CN2019/112951 WO2021035921A1 (en) | 2019-08-30 | 2019-10-24 | TEXT SIMILARITY CALCULATION METHOD EMPLOYING χ2-C |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110705247A true CN110705247A (en) | 2020-01-17 |
CN110705247B CN110705247B (en) | 2020-08-04 |
Family
ID=69193643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910811440.8A Active CN110705247B (en) | 2019-08-30 | 2019-08-30 | Text similarity calculation method based on χ²-C |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110705247B (en) |
WO (1) | WO2021035921A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112002415A (en) * | 2020-08-23 | 2020-11-27 | 吾征智能技术(北京)有限公司 | Intelligent cognitive disease system based on human excrement |
CN113190672A (en) * | 2021-05-12 | 2021-07-30 | 上海热血网络科技有限公司 | Advertisement judgment model, advertisement filtering method and system |
CN113722427A (en) * | 2021-07-30 | 2021-11-30 | 安徽掌学科技有限公司 | Thesis duplicate checking method based on feature vector space |
CN114676701A (en) * | 2020-12-24 | 2022-06-28 | 腾讯科技(深圳)有限公司 | Text vector processing method, device, medium and electronic equipment |
CN116957362A (en) * | 2023-09-18 | 2023-10-27 | 国网江西省电力有限公司经济技术研究院 | Multi-target planning method and system for regional comprehensive energy system |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114492420B (en) * | 2022-04-02 | 2022-07-29 | 北京中科闻歌科技股份有限公司 | Text classification method, device and equipment and computer readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102930063A (en) * | 2012-12-05 | 2013-02-13 | 电子科技大学 | Feature item selection and weight calculation based text classification method |
CN103886108A (en) * | 2014-04-13 | 2014-06-25 | 北京工业大学 | Feature selection and weight calculation method of imbalance text set |
CN104102626A (en) * | 2014-07-07 | 2014-10-15 | 厦门推特信息科技有限公司 | Method for computing semantic similarities among short texts |
CN105512311A (en) * | 2015-12-14 | 2016-04-20 | 北京工业大学 | Chi square statistic based self-adaption feature selection method |
CN106202518A (en) * | 2016-07-22 | 2016-12-07 | 桂林电子科技大学 | Based on CHI and the short text classification method of sub-category association rule algorithm |
CN106502990A (en) * | 2016-10-27 | 2017-03-15 | 广东工业大学 | A kind of microblogging Attribute selection method and improvement TF IDF method for normalizing |
CN108536677A (en) * | 2018-04-09 | 2018-09-14 | 北京信息科技大学 | A kind of patent text similarity calculating method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079026B (en) * | 2007-07-02 | 2011-01-26 | 蒙圣光 | Text similarity, acceptation similarity calculating method and system and application system |
CN104090865B (en) * | 2014-07-08 | 2017-11-03 | 安一恒通(北京)科技有限公司 | Text similarity calculation method and device |
CN105786789B (en) * | 2014-12-16 | 2019-07-23 | 阿里巴巴集团控股有限公司 | A kind of calculation method and device of text similarity |
Non-Patent Citations (4)
Title |
---|
YUJIA ZHAI,ET AL.: "A Chi-Square Statistics Based Feature Selection Method in Text Classification", 《2018 IEEE 9TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS)》 * |
HE YAN ET AL.: "A New Feature Selection Method for Uyghur Text Classification", Journal of Henan University of Science and Technology (Natural Science Edition) * |
SHI JUNTAO: "Chi-square Feature Extraction and TF-IDF Weight Improvement in Chinese Text Classification", China Masters' Theses Full-text Database * |
QIU LINGYUN ET AL.: "Research on Improvement of the PageRank Algorithm", Software Guide * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112002415B (en) * | 2020-08-23 | 2024-03-01 | 吾征智能技术(北京)有限公司 | Intelligent cognitive disease system based on human excrement |
Also Published As
Publication number | Publication date |
---|---|
WO2021035921A1 (en) | 2021-03-04 |
CN110705247B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Naili et al. | Comparative study of word embedding methods in topic segmentation | |
CN111177365B (en) | Unsupervised automatic abstract extraction method based on graph model | |
US11379668B2 (en) | Topic models with sentiment priors based on distributed representations | |
CN110705247B (en) | Text similarity calculation method based on χ²-C | |
CN105183833B (en) | Microblog text recommendation method and device based on user model | |
Froud et al. | Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering | |
CN107895000B (en) | Cross-domain semantic information retrieval method based on convolutional neural network | |
Suleiman et al. | Comparative study of word embeddings models and their usage in Arabic language applications | |
CN109885675B (en) | Text subtopic discovery method based on improved LDA | |
CN102662987B (en) | A kind of sorting technique of the network text semanteme based on Baidupedia | |
Chang et al. | A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING. | |
Reddy et al. | Profile specific document weighted approach using a new term weighting measure for author profiling | |
Parlar et al. | Analysis of data pre-processing methods for sentiment analysis of reviews | |
Bashir et al. | Automatic Hausa LanguageText Summarization Based on Feature Extraction using Naïve Bayes Model | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
Ayadi et al. | A Survey of Arabic Text Representation and Classification Methods. | |
Villegas et al. | Vector-based word representations for sentiment analysis: a comparative study | |
Al-Sarem et al. | The effect of training set size in authorship attribution: application on short Arabic texts | |
CN110580286A (en) | Text feature selection method based on inter-class information entropy | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation | |
Yang et al. | Exploring word similarity to improve chinese personal name disambiguation | |
Ling | Coronavirus public sentiment analysis with BERT deep learning | |
Zmandar et al. | Multilingual Financial Word Embeddings for Arabic, English and French | |
AbuElAtta et al. | Arabic Regional Dialect Identification (ARDI) using Pair of Continuous Bag-of-Words and Data Augmentation. | |
Lin et al. | Domain Independent Key Term Extraction from Spoken Content Based on Context and Term Location Information in the Utterances |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20231106 Address after: 266000 room 2102, 21 / F, block B, No.1 Keyuan Weiyi Road, Laoshan District, Qingdao City, Shandong Province Patentee after: Qingdao Guancheng Software Co.,Ltd. Address before: 579 qianwangang Road, Huangdao District, Qingdao City, Shandong Province Patentee before: SHANDONG University OF SCIENCE AND TECHNOLOGY |