CN110705247A - χ²-C-based text similarity calculation method - Google Patents

χ²-C-based text similarity calculation method

Info

Publication number
CN110705247A
CN110705247A (application CN201910811440.8A)
Authority
CN
China
Prior art keywords
word
text
similarity
feature
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910811440.8A
Other languages
Chinese (zh)
Other versions
CN110705247B (en)
Inventor
赵卫东
李化泽
王铭
刘昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Guancheng Software Co ltd
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology filed Critical Shandong University of Science and Technology
Priority to CN201910811440.8A priority Critical patent/CN110705247B/en
Priority to PCT/CN2019/112951 priority patent/WO2021035921A1/en
Publication of CN110705247A publication Critical patent/CN110705247A/en
Application granted granted Critical
Publication of CN110705247B publication Critical patent/CN110705247B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a χ²-C-based text similarity calculation method and relates in particular to the field of text information processing. The method classifies the test data set with a convolutional neural network (CNN), calculates the initial weight of each feature word in a detection sample with TF-IDF, and then uses the χ²-C algorithm to calculate a domain association factor. The initial weight is combined with the word position factor α and the domain association factor to obtain the feature word weight; a lexicon is built from all feature words of the detection sample, and the detection sample is expressed as an initial text vector by combining the lexicon and the feature word weights. The word2vec tool is used to calculate the similarity between words in the lexicon and form a word-sense similarity matrix; the initial text vector is computed with this matrix to obtain the text vector, and finally the cosine similarity algorithm is applied to the text vectors to obtain the similarity between texts. The method thereby incorporates the degree of association between the feature words and their domain, the semantic relations between the feature words, and the position information of the feature words, which improves the accuracy of text similarity calculation.

Description

χ²-C-based text similarity calculation method
Technical Field
The invention relates to the field of text information processing, and in particular to a χ²-C-based text similarity calculation method.
Background
Text similarity is the calculation of the semantic similarity between texts, and in the era of information explosion it is applied in many fields. Traditional text similarity calculation is based on the vector space model (VSM): the model uses TF-IDF to calculate the weights of the feature words in a text, converts the text into a text vector in a multidimensional space, and measures the similarity between texts by calculating the similarity of the text vectors. However, the TF-IDF algorithm only considers the relation between the term features and the document and ignores the relevance to the category, so the accuracy of text similarity calculation is low. The LDA topic model is a three-layer Bayesian probability model comprising a word, topic, and document structure. The basic idea of computing text similarity with LDA is to perform topic modeling on the text, traverse and extract the words of the text from the word distributions corresponding to the topics to obtain the topic distribution of the text, and compute the text similarity from this distribution. Because a short text contains few representative words, LDA does not necessarily achieve the expected effect in topic mining on short texts and is better suited to long texts. The currently most popular text similarity calculation methods use convolutional neural networks: the main idea is to stack the word vectors of all words in a sentence, in order, into a two-dimensional matrix, attach a CNN model on top of this matrix, use the structural properties of the CNN to extract and filter local semantics in the sentence, and finally abstract a feature vector representation of the sentence. However, the CNN model has drawbacks such as a complex structure, many parameters, and long running time.
Disclosure of Invention
In order to overcome the above defects, the invention provides a χ²-C-based text similarity calculation method that combines TF-IDF with the χ²-C algorithm to calculate the weights of the feature words, adds word position information, and uses word2vec to add word-sense information to the text vector.
The invention specifically adopts the following technical scheme:
A χ²-C-based text similarity calculation method, comprising the following steps:
Step 1: preprocessing the test data and the content of the corpus;
Step 2: classifying the test data set by using a convolutional neural network (CNN);
Step 3: calculating the initial weight of the feature words in the detection sample by using the TF-IDF algorithm;
Step 4: calculating a domain association factor Ca by using the χ²-C algorithm;
Step 5: calculating the feature word weights by combining the initial weights with the word position factor α and the domain association factor Ca;
Step 6: establishing a lexicon and generating the initial text vectors according to the feature word weights obtained in step 5;
Step 7: calculating the word-sense similarity among the feature words by using the word2vec tool to obtain the similarity matrix P;
Step 8: calculating the text vectors from the initial text vectors and the similarity matrix obtained in step 7;
Step 9: calculating the text similarity by applying the cosine similarity algorithm to the text vectors generated in step 8.
Preferably, the step 4 comprises:
The χ²-C algorithm combines the χ² statistic into a domain association algorithm and calculates a domain association factor Ca. Because the χ²(d,c) value calculated for some feature words is very large, it would strongly influence the subsequent calculation and make the result inaccurate, so formula (1) is used to process the calculated value;
where d represents a feature word, c represents the domain described by the feature word, d_i represents the i-th feature word in the text, and count represents the total number of feature words in the text;
Ca is calculated by formulae (2), (3), and (4):
[formulae (2)-(4): equation images not reproduced]
where q_d represents the proportion of documents in the positive class that contain the feature word d, e_d represents the number of documents in the positive class that contain the feature word d, E_d represents the total number of documents in the positive class, p_d represents the proportion of documents in the negative class that contain the feature word d, n_d represents the number of documents in the negative class that contain d, N_d represents the total number of documents in the negative class, and w_d represents the degree of association between d and the domain.
Preferably, in step 5, in order to increase the degree of association between the feature words and the domain, formula (5) is used to combine the domain association factor Ca(d,c) and the word position information α with the initial weights of the feature words to obtain the feature word weights;
w_dt' = Ca(d,c) × w_dt × α   (5)
where w_dt' represents the feature word weight and α represents the position information of the word; α is obtained by formula (6):
[formula (6): equation image not reproduced]
Preferably, in step 6, the feature words of texts β and γ are first merged and a lexicon of texts β and γ is established, so that each word has a corresponding label;
the two texts to be tested, β and γ, are then expressed in the form of formula (7) using the lexicon:
v_k = <w'_1k, w'_2k, …, w'_ηk, …, w'_Nk>   (7)
where v_k represents the initial text vector; the feature word sets of β and γ are S_dβ = {d_β1, d_β2, d_β3, …, d_βi, …, d_βn} and S_dγ = {d_γ1, d_γ2, d_γ3, …, d_γj, …, d_γm}, respectively; the feature word weight sets of β and γ are S_wβ = {w'_β1, w'_β2, w'_β3, …, w'_βn} and S_wγ = {w'_γ1, w'_γ2, w'_γ3, …, w'_γm}, respectively; k ∈ {β, γ}; w' ∈ S_wβ ∪ S_wγ; η denotes the label of the feature word d_η in the lexicon, d_η ∈ S_dβ ∪ S_dγ; and N is the total number of feature words in the lexicon;
For w'_βη in text β, formula (8) holds:
w'_βη = w'_βi, if d_η = d_βi ∈ S_dβ; otherwise w'_βη = 0   (8)
For w'_γη in text γ, formula (9) holds:
w'_γη = w'_γj, if d_η = d_γj ∈ S_dγ; otherwise w'_γη = 0   (9)
where d_η refers to the word labelled η in the lexicon, and d_βi ∈ S_dβ, d_γj ∈ S_dγ.
Preferably, in the step 7,
the similarity between the characteristic word with the label i and the characteristic word with the label j in the word stock is simijThen, the similarity matrix between the feature words is represented by formula (10):
P = (sim_ij), an N × N matrix with i, j = 1, 2, …, N   (10)
Preferably, in step 9, word-sense information is added to the text vectors, and the β and γ text vectors are updated using the similarity matrix; the text vectors of the test texts β and γ are expressed by formulae (11) and (12), respectively:
v_β = <w'_β1, w'_β2, …, w'_βη, …, w'_βN>   (11)
v_γ = <w'_γ1, w'_γ2, …, w'_γη, …, w'_γN>   (12)
performing inner product on the initial text vector and the feature word similarity matrix to obtain a text vector;
v'_k = v_k × P   (13)
where k ∈ {β, γ}; the new text vectors v'_β = <w''_β1, w''_β2, …, w''_βη, …, w''_βN> and v'_γ = <w''_γ1, w''_γ2, …, w''_γη, …, w''_γN> are obtained, and these text vectors contain word frequency, word sense, word position information, and domain association degree.
The invention has the following beneficial effects:
the invention uses TF-IDF to combine chi2the-C algorithm calculates the weight of the feature words, increases word position information, utilizes word2vec to increase word meaning information to the text vector, considers the influence of the field on the distribution condition of the feature words and the semantic relation among the feature words, and the experimental result shows that the method is based on chi2The calculation method accuracy of the similarity of the C text is 82.64%, and the F1 value is 84.09%, which is superior to other methods.
Drawings
FIG. 1 is a flow chart of the χ²-C-based text similarity calculation method.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings:
The χ²-C-based text similarity calculation method comprises the following steps:
Step 1: preprocessing the test data and the content of the corpus;
Step 2: classifying the test data set by using a convolutional neural network (CNN);
Step 3: calculating the initial weight of the feature words in the detection sample by using the TF-IDF algorithm;
The TF-IDF algorithm is expressed as
w_dt = TF_dt × IDF_dt   [equation images for the TF and IDF terms not reproduced]
where w_dt represents the weight of the feature word d in document t; TF_dt represents the term frequency of the feature word d in document t; m_d is the number of occurrences of the feature word d in document t; S is the total number of feature words in document t; IDF_dt represents the inverse document frequency of the feature word d; n_d is the number of texts in the corpus that contain the feature word d; N is the total number of texts in the corpus; and nf is a normalization factor.
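As an illustration of step 3, the following Python sketch computes the initial TF-IDF weights. The function name and the exact TF and IDF forms used (TF_dt = m_d / S, IDF_dt = log(N / n_d), omitting the normalization factor nf) are assumptions for illustration, since the patent's equation images are not reproduced above.

import math
from collections import Counter

def tf_idf_weights(doc_tokens, corpus):
    """Initial TF-IDF weights for the feature words of one detection sample.

    doc_tokens : list of feature words of the detection sample (after preprocessing)
    corpus     : list of token lists, one per corpus document
    Assumes TF_dt = m_d / S and IDF_dt = log(N / n_d); the normalization
    factor nf mentioned in the patent is omitted here.
    """
    counts = Counter(doc_tokens)
    s = len(doc_tokens)            # S: total number of feature words in the document
    n = len(corpus)                # N: total number of texts in the corpus
    weights = {}
    for word, m_d in counts.items():
        n_d = sum(1 for doc in corpus if word in doc)   # texts containing the word
        idf = math.log(n / n_d) if n_d else 0.0
        weights[word] = (m_d / s) * idf
    return weights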
Step 4: calculating a domain association factor Ca (class association) by using the χ²-C algorithm;
The χ²-C algorithm combines the χ² statistic into a domain association algorithm and calculates the domain association factor Ca. It is assumed that the domain relation follows a distribution similar to the χ² distribution with one degree of freedom, and the value of the χ² statistic is inversely proportional to the independence between the feature word d and the domain c. Formula (14) is the chi-square test calculation of the feature word d for the domain c:
χ²(d,c) = N × (A×D − C×B)² / [(A+C) × (B+D) × (A+B) × (C+D)]   (14)
where A represents the number of documents that belong to c and contain d, B represents the number of documents that do not belong to c but contain d, C represents the number of documents that belong to c but do not contain d, D represents the number of documents that neither belong to c nor contain d, and N represents the total number of documents;
Because the χ²(d,c) value calculated for some feature words is very large, it would strongly influence the subsequent calculation and make the result inaccurate, so formula (1) is used to process the calculated value;
[formula (1): equation image not reproduced]
where d represents a feature word, c represents the domain described by the feature word, d_i represents the i-th feature word in the text, and count represents the total number of feature words in the text;
Ca is calculated by formulae (2), (3), and (4):
[formulae (2)-(4): equation images not reproduced]
where q_d represents the proportion of documents in the positive class that contain the feature word d, e_d represents the number of documents in the positive class that contain the feature word d, E_d represents the total number of documents in the positive class, p_d represents the proportion of documents in the negative class that contain the feature word d, n_d represents the number of documents in the negative class that contain d, N_d represents the total number of documents in the negative class, and w_d represents the degree of association between d and the domain.
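A minimal sketch of the chi-square test of formula (14) follows. The domain association factor Ca of formulae (1)-(4) builds on this value, but those equations are only available as images in the patent and are therefore not implemented here; the function name is illustrative.

def chi_square(a, b, c, d):
    """Chi-square statistic of a feature word for a domain, per formula (14).

    a: documents in the domain that contain the word
    b: documents outside the domain that contain the word
    c: documents in the domain that do not contain the word
    d: documents outside the domain that do not contain the word
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0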
Step 5: calculating the feature word weights by combining the initial weights with the word position factor α and the domain association factor Ca;
In order to increase the degree of association between the feature words and the domain, formula (5) combines the domain association factor Ca(d,c) and the word position information α with the initial weights of the feature words to obtain the feature word weights;
w_dt' = Ca(d,c) × w_dt × α   (5)
where w_dt' represents the feature word weight and α represents the position information of the word (the value for the first position > the value for the second position > the value for the third position); α is obtained using formula (6):
[formula (6): equation image not reproduced]
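The weighting of formula (5) can be sketched as follows. The concrete values of the word position factor α come from formula (6), which is not reproduced above, so the numbers used here are assumed placeholders that merely respect the stated ordering (earlier positions receive larger values); the function name is illustrative.

def feature_word_weight(initial_weight, ca, position):
    """Feature-word weight per formula (5): w_dt' = Ca(d,c) * w_dt * alpha.

    The alpha values below are assumed placeholders, not the values of
    formula (6); they only preserve first > second > third.
    """
    alpha_by_position = {"first": 1.5, "second": 1.2, "third": 1.0}
    return ca * initial_weight * alpha_by_position.get(position, 1.0)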
Step 6: establishing a lexicon and generating the initial text vectors according to the feature word weights obtained in step 5;
The text vectors are generated using the updated feature word weights. First, the feature words of texts β and γ are merged and a lexicon of texts β and γ is established, so that each word has a corresponding label; the purpose of establishing the lexicon is to better represent the texts in vector form.
The two texts to be tested, β and γ, are then expressed in the form of formula (7) using the lexicon:
v_k = <w'_1k, w'_2k, …, w'_ηk, …, w'_Nk>   (7)
where v_k represents the initial text vector; the feature word sets of β and γ are S_dβ = {d_β1, d_β2, d_β3, …, d_βi, …, d_βn} and S_dγ = {d_γ1, d_γ2, d_γ3, …, d_γj, …, d_γm}, respectively; the feature word weight sets of β and γ are S_wβ = {w'_β1, w'_β2, w'_β3, …, w'_βn} and S_wγ = {w'_γ1, w'_γ2, w'_γ3, …, w'_γm}, respectively; k ∈ {β, γ}; w' ∈ S_wβ ∪ S_wγ; η denotes the label of the feature word d_η in the lexicon, d_η ∈ S_dβ ∪ S_dγ; and N is the total number of feature words in the lexicon;
For w'_βη in text β, formula (8) holds:
w'_βη = w'_βi, if d_η = d_βi ∈ S_dβ; otherwise w'_βη = 0   (8)
For w'_γη in text γ, formula (9) holds:
w'_γη = w'_γj, if d_η = d_γj ∈ S_dγ; otherwise w'_γη = 0   (9)
where d_η refers to the word labelled η in the lexicon, and d_βi ∈ S_dβ, d_γj ∈ S_dγ.
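A short sketch of step 6, building the lexicon and the initial text vectors of formulae (7)-(9); the helper names are illustrative, and the weight dictionaries are assumed to come from the previous step.

def build_lexicon(weights_beta, weights_gamma):
    """Merge the feature words of texts beta and gamma into one labelled lexicon."""
    return sorted(set(weights_beta) | set(weights_gamma))

def initial_text_vector(lexicon, weights):
    """Initial text vector per formulae (7)-(9): position eta holds the weight
    of the feature word labelled eta if the text contains it, and 0 otherwise."""
    return [weights.get(word, 0.0) for word in lexicon]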
Step 7: calculating the word-sense similarity among the feature words by using the word2vec tool to obtain the similarity matrix P;
The word-sense similarity among the feature words is calculated by using the word2vec tool, and the similarity matrix is built from these word-sense similarities.
Word vectors containing semantic information can be generated by using word2vec, and the similarity of the word vectors corresponding to the words is calculated by a cosine similarity algorithm to obtain the similarity between the two words.
sim(A, B) = (A · B) / (|A| × |B|)
where sim(A, B) is the similarity between the two words, and A and B are the p-dimensional word vectors of the feature words produced by word2vec (p is the dimensionality of the word vectors).
The similarity between the feature word labelled i and the feature word labelled j in the lexicon is sim_ij, and the similarity matrix between the feature words is represented by formula (10):
P = (sim_ij), an N × N matrix with i, j = 1, 2, …, N   (10)
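Step 7 can be sketched with the gensim implementation of word2vec (one possible choice; the patent only says "the word2vec tool"). The training parameters and the handling of words missing from the trained vocabulary are assumptions.

import numpy as np
from gensim.models import Word2Vec

def similarity_matrix(lexicon, training_sentences):
    """Word-sense similarity matrix P of formula (10), built from word2vec.

    training_sentences: list of token lists used to train the word vectors.
    Words absent from the trained vocabulary are given similarity 0 to other
    words, an assumption not stated in the patent.
    """
    model = Word2Vec(training_sentences, vector_size=100, window=5, min_count=1, workers=1)
    n = len(lexicon)
    p = np.eye(n)                                   # sim_ii = 1 on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            wi, wj = lexicon[i], lexicon[j]
            if wi in model.wv and wj in model.wv:
                p[i, j] = p[j, i] = float(model.wv.similarity(wi, wj))
    return p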
Step 8: calculating the text vector from the initial text vector and the similarity matrix obtained in step 7;
Step 9: calculating the text similarity by applying the cosine similarity algorithm to the text vectors generated in step 8.
Word-sense information is added to the text vectors, and the β and γ text vectors are updated using the similarity matrix; the text vectors of the test texts β and γ are expressed by formulae (11) and (12), respectively:
v_β = <w'_β1, w'_β2, …, w'_βη, …, w'_βN>   (11)
v_γ = <w'_γ1, w'_γ2, …, w'_γη, …, w'_γN>   (12)
performing inner product on the initial text vector and the feature word similarity matrix to obtain a text vector;
v'_k = v_k × P   (13)
where k ∈ {β, γ}; the new text vectors v'_β = <w''_β1, w''_β2, …, w''_βη, …, w''_βN> and v'_γ = <w''_γ1, w''_γ2, …, w''_γη, …, w''_γN> are obtained, and these text vectors contain word frequency, word sense, word position information, and domain association degree.
Formula (15) is the cosine similarity calculation:
sim(v'_β, v'_γ) = Σ_i (w''_βi × w''_γi) / (√(Σ_i w''_βi²) × √(Σ_i w''_γi²)), with i = 1, …, P'   (15)
where sim(v'_β, v'_γ) represents the similarity between texts β and γ, v'_β and v'_γ are the final text vectors corresponding to β and γ, w''_βi is the value of the i-th dimension of v'_β, w''_γi is likewise the value of the i-th dimension of v'_γ, and P' is the dimensionality of the text vectors.
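Steps 8 and 9 can be sketched as follows, assuming the initial vectors and the matrix P from the previous steps are available; numpy and the function name are used here purely for illustration.

import numpy as np

def text_similarity(v_beta, v_gamma, p):
    """Update the initial vectors with the similarity matrix P (formula (13))
    and compare the resulting text vectors with cosine similarity (formula (15))."""
    v_beta_new = np.asarray(v_beta, dtype=float) @ p
    v_gamma_new = np.asarray(v_gamma, dtype=float) @ p
    denom = np.linalg.norm(v_beta_new) * np.linalg.norm(v_gamma_new)
    return float(v_beta_new @ v_gamma_new / denom) if denom else 0.0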
As an embodiment, a subset of about 8000 texts from the THUCNews news text classification data set is selected, and the texts are classified into the categories of science and technology, sports, politics, and entertainment by a convolutional neural network (CNN).
I. Preprocessing the content of the test data set and the corpus, which mainly comprises the following steps (a minimal code sketch follows the list):
(1) deleting symbols in the text content: comma, question mark, exclamation mark, etc.
(2) Removing stop words: the stop words in the text content are deleted; common stop words are words without definite meaning, such as auxiliary words, prepositions, adverbs, modal particles, and conjunctions.
(3) Word segmentation: the text content with stop words removed is segmented into words.
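A minimal preprocessing sketch follows. The patent does not name a particular segmentation tool or stop-word list; the jieba tokenizer and the punctuation pattern below are assumptions used only to illustrate the three steps above.

import re
import jieba   # assumed segmentation tool; the patent does not name one

def preprocess(text, stop_words):
    """Preprocessing per step I: delete symbols, segment into words, remove stop words."""
    text = re.sub(r"[，。、；：？！,.;:?!\"'“”（）()]", "", text)          # (1) delete symbols
    tokens = jieba.cut(text)                                              # (3) word segmentation
    return [w for w in tokens if w.strip() and w not in stop_words]       # (2) remove stop words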
II. Science-and-technology detection samples are selected: text β and text γ. The content of text β is:
"natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. "
Content of text γ:
"natural language processing is a branch of data science, and its main coverage is: the text data is systematically analyzed, understood and information extracted in an intelligent and efficient mode. "
After preprocessing the text β:
natural language/processing/computer/science/domain/artificial intelligence/domain/importance/direction/research/enable/implementation/human/computer/inter/use/natural language/go/active/communication/theory/method
After preprocessing of the text γ:
natural language/processing/data/science/branching/main/overlay/content/intelligence/efficiency/manner/pair/text/data/systematization/analysis/understanding/information extraction/process
The initial weights of the feature words in the test texts are calculated using the TF-IDF algorithm. TF refers to the frequency of a feature word in the specified document, and its calculation formula is (16):
TF_d = m_d / S   (16)
For the text β, the TF values of the remaining feature words are 0.0526, except that the TF values corresponding to "natural language" and "domain" are 0.10526, respectively.
For the text γ, the TF values for the remaining tokens are 0.0556, except that "data" corresponds to a TF value of 0.1111.
The inverse document frequency IDF is a measure of the general importance of a term: the smaller the number of documents containing a certain feature word, the larger its IDF value. IDF is calculated by formula (17):
IDF_d = log(N / n_d)   (17)
The IDF values of the text β are as in table 1:
TABLE 1
The IDF value of the text γ is as in table 2:
TABLE 2
The weights corresponding to the feature words in the text β are shown in table 3:
TABLE 3
The initial weight corresponding to each feature word in the text γ is shown in table 4:
TABLE 4
As shown in Table 3, the initial weights corresponding to feature words that are important in the natural language processing field, such as "natural language" and "processing", are not high, mainly because the IDF calculation has a weak ability to distinguish the domain and does not reflect the degree of association between a feature word and its domain.
IV. The degree of association between the feature words and the domain is calculated using the chi-square statistic to obtain the domain association factor. According to formula (1) and formula (14), the χ²'(d,c) value corresponding to each feature word can be calculated.
The chi-squared statistics for the feature words in text β are shown in table 5.
TABLE 5
The chi-square statistic corresponding to the feature words in the text gamma is shown in table 6.
TABLE 6
As can be seen from tables 5 and 6, the more important feature words in the natural language processing field have higher chi-square statistic, which indicates that the chi-square statistic can well represent the degree of association between the feature words and the field.
The χ²-C algorithm is used to calculate the domain association degree of the feature words and obtain the domain association factor Ca. Combining the chi-square statistic in this way better represents the degree of association between the feature words and the domain and lays a good foundation for calculating the text similarity. The domain association factors Ca for text β are shown in Table 7,
TABLE 7
Table 8 shows the domain association factor Ca for each feature word in γ:
TABLE 8
The initial weight and the domain association factor Ca have now been calculated for each feature word in this example. Ca and the initial weight are combined according to formula (5) to calculate the feature word weights. Table 9 and Table 10 show the feature word weights in text β and text γ, respectively.
TABLE 9
TABLE 10
The corresponding text vectors are generated using the calculated word weights: a lexicon is established from all the words in the test texts β and γ, and Table 11 shows the lexicon.
TABLE 11
The text vector corresponding to each text is generated according to the lexicon and formulae (8) and (9):
v_βs = <0.502677433, 0.108777994, 0, 0.06002695, 0, 0.178649449, 0, 0.000008, 0, 0.000283959, ……, 0.077038474>
v_γs = <0.146430991, 0.063410661, 0.020417404, 0.015952149, 0.045591454, 0, 0.140432493, 0, 0.013076936, ……, 0>
VI. The word-sense similarity between the feature words is calculated using the word2vec tool to obtain the similarity matrix P. word2vec is mainly used to train large-scale corpora into high-quality word vectors, which are not only low-dimensional and dense but also contain semantic information. The similarity between two words is calculated with the cosine similarity, and the similarity matrix is established from the results and the lexicon.
[similarity matrix layout: image not reproduced]
The subscripts represent the labels of the corresponding words in the lexicon; for example, sim_56 is the similarity between the feature words labelled d5 and d6, i.e. the word-sense similarity between "high efficiency" and "artificial intelligence". The matrix P is as follows:
[matrix P values: image not reproduced]
VII. The calculated similarity matrix P is used to update the text vectors v_βs and v_γs according to formula (13), i.e.:
v_βz = v_βs × P
v_γz = v_γs × P
v_βz and v_γz are the final text vectors, which at this point contain word frequency, word sense, word position information, and domain association degree. The similarity of the two texts is then calculated according to the cosine similarity algorithm; the similarity of texts β and γ is 0.89325.
The method is compared with other text similarity algorithms: 12124 preprocessed texts are used as the corpus, a convolutional neural network (CNN) is used to classify the text set, and the science-and-technology, education, sports, and military classes are selected as the test text sets. In order to better present the experimental results, the text data in each class is manually screened, and where similarity is low, similar texts are constructed by manual modification.
The accuracy and the recall rate are used as evaluation criteria; the accuracy P, recall rate R, and F1 value of each class are calculated from the data, and finally the average accuracy, average recall rate, and average F1 value are obtained. The calculation formulas are formulae (18), (19), and (20):
P = TP / (TP + FP)   (18)
R = TP / (TP + FN)   (19)
F1 = 2 × P × R / (P + R)   (20)
where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives for the class.
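The evaluation of formulae (18)-(20) can be sketched as follows; the per-class true-positive, false-positive, and false-negative counts are assumed inputs, since the patent does not detail how they are tallied, and the function name is illustrative.

def evaluate(per_class_counts):
    """Macro-averaged precision, recall, and F1 per formulae (18)-(20).

    per_class_counts: dict mapping class name -> (tp, fp, fn); these counts
    are assumed inputs, not something the patent spells out.
    """
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in per_class_counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    n = len(per_class_counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n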
In order to verify the effect of different algorithms in calculating text similarity, four text similarity algorithms are compared experimentally: the χ²-C-based method, a method based on the LDA model, a method based on HowNet, and a method based on the convolutional neural network CNN. The evaluation indexes are the three indexes of accuracy, recall rate, and F1 value given by formulae (18), (19), and (20), and the experimental results are shown in Table 12.
TABLE 12
As can be seen from the experimental results in Table 12, the accuracy, recall, and F1 value of the χ²-C-based text similarity algorithm are slightly better than those of the other three algorithms, because LDA works poorly for topic mining on short texts and does not take the structure and word position information of the text into account. The HowNet knowledge base contains a limited vocabulary and lacks many professional terms and uncommon words, so the accuracy of its similarity calculation results is low. Although the CNN-based algorithm achieves high accuracy, it has many parameters, a complex model, and a long running time. The χ²-C-based algorithm has a simple structure, incorporates the degree of association between feature words and their domain, and uses the word2vec tool to include word-sense information in the text vectors, which improves the accuracy of text similarity calculation.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (6)

1. A χ²-C-based text similarity calculation method, characterized by comprising the following steps:
Step 1: preprocessing the test data and the content of the corpus;
Step 2: classifying the test data set by using a convolutional neural network (CNN);
Step 3: calculating the initial weight of the feature words in the detection sample by using the TF-IDF algorithm;
Step 4: calculating a domain association factor Ca by using the χ²-C algorithm;
Step 5: calculating the feature word weights by combining the initial weights with the word position factor α and the domain association factor Ca;
Step 6: establishing a lexicon and generating the initial text vectors according to the feature word weights obtained in step 5;
Step 7: calculating the word-sense similarity among the feature words by using the word2vec tool to obtain the similarity matrix P;
Step 8: calculating the text vectors from the initial text vectors and the similarity matrix obtained in step 7;
Step 9: calculating the text similarity by applying the cosine similarity algorithm to the text vectors generated in step 8.
2. The χ²-C-based text similarity calculation method of claim 1, wherein step 4 comprises:
The χ²-C algorithm combines the χ² statistic into a domain association algorithm and calculates a domain association factor Ca. Because the χ²(d,c) value calculated for some feature words is very large, it would strongly influence the subsequent calculation and make the result inaccurate, so formula (1) is used to process the calculated value;
[formula (1): equation image not reproduced]
where d represents a feature word, c represents the domain described by the feature word, d_i represents the i-th feature word in the text, and count represents the total number of feature words in the text;
Ca is calculated by formulae (2), (3), and (4):
[formulae (2)-(4): equation images not reproduced]
where q_d represents the proportion of documents in the positive class that contain the feature word d, e_d represents the number of documents in the positive class that contain the feature word d, E_d represents the total number of documents in the positive class, p_d represents the proportion of documents in the negative class that contain the feature word d, n_d represents the number of documents in the negative class that contain d, N_d represents the total number of documents in the negative class, and w_d represents the degree of association between d and the domain.
3. The χ²-C-based text similarity calculation method of claim 1, wherein in step 5, in order to increase the degree of association between the feature words and the domain, formula (5) is used to combine the domain association factor Ca(d,c) and the word position information α with the initial weights of the feature words to obtain the feature word weights;
w_dt' = Ca(d,c) × w_dt × α   (5)
where w_dt' represents the feature word weight and α represents the position information of the word; α is obtained using formula (6):
[formula (6): equation image not reproduced]
4. The χ²-C-based text similarity calculation method of claim 1, wherein in step 6, the feature words of texts β and γ are first merged to establish a lexicon of texts β and γ, so that each word has a corresponding label;
the two texts to be tested, β and γ, are then expressed in the form of formula (7) using the lexicon:
v_k = <w'_1k, w'_2k, …, w'_ηk, …, w'_Nk>   (7)
where v_k represents the initial text vector; the feature word sets of β and γ are S_dβ = {d_β1, d_β2, d_β3, …, d_βi, …, d_βn} and S_dγ = {d_γ1, d_γ2, d_γ3, …, d_γj, …, d_γm}, respectively; the feature word weight sets of β and γ are S_wβ = {w'_β1, w'_β2, w'_β3, …, w'_βn} and S_wγ = {w'_γ1, w'_γ2, w'_γ3, …, w'_γm}, respectively; k ∈ {β, γ}; w' ∈ S_wβ ∪ S_wγ; η denotes the label of the feature word d_η in the lexicon, d_η ∈ S_dβ ∪ S_dγ; and N is the total number of feature words in the lexicon;
For w'_βη in text β, formula (8) holds:
w'_βη = w'_βi, if d_η = d_βi ∈ S_dβ; otherwise w'_βη = 0   (8)
For w'_γη in text γ, formula (9) holds:
w'_γη = w'_γj, if d_η = d_γj ∈ S_dγ; otherwise w'_γη = 0   (9)
where d_η refers to the word labelled η in the lexicon, and d_βi ∈ S_dβ, d_γj ∈ S_dγ.
5. The χ²-C-based text similarity calculation method of claim 1, wherein in step 7,
the similarity between the feature word labelled i and the feature word labelled j in the lexicon is sim_ij, and the similarity matrix between the feature words is represented by formula (10):
P = (sim_ij), an N × N matrix with i, j = 1, 2, …, N   (10)
6. The χ²-C-based text similarity calculation method of claim 1, wherein in step 9, word-sense information is added to the text vectors, and the β and γ text vectors are updated using the similarity matrix; the text vectors of the test texts β and γ are expressed by formulae (11) and (12), respectively:
v_β = <w'_β1, w'_β2, …, w'_βη, …, w'_βN>   (11)
v_γ = <w'_γ1, w'_γ2, …, w'_γη, …, w'_γN>   (12)
performing inner product on the initial text vector and the feature word similarity matrix to obtain a text vector;
v'_k = v_k × P   (13)
where k ∈ {β, γ}; the new text vectors v'_β = <w''_β1, w''_β2, …, w''_βη, …, w''_βN> and v'_γ = <w''_γ1, w''_γ2, …, w''_γη, …, w''_γN> are obtained, and these text vectors contain word frequency, word sense, word position information, and domain association degree.
CN201910811440.8A 2019-08-30 2019-08-30 χ²-C-based text similarity calculation method Active CN110705247B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910811440.8A CN110705247B (en) 2019-08-30 2019-08-30 χ²-C-based text similarity calculation method
PCT/CN2019/112951 WO2021035921A1 (en) 2019-08-30 2019-10-24 TEXT SIMILARITY CALCULATION METHOD EMPLOYING χ2-C

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910811440.8A CN110705247B (en) 2019-08-30 2019-08-30 χ²-C-based text similarity calculation method

Publications (2)

Publication Number Publication Date
CN110705247A true CN110705247A (en) 2020-01-17
CN110705247B CN110705247B (en) 2020-08-04

Family

ID=69193643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910811440.8A Active CN110705247B (en) χ²-C-based text similarity calculation method

Country Status (2)

Country Link
CN (1) CN110705247B (en)
WO (1) WO2021035921A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112002415A (en) * 2020-08-23 2020-11-27 吾征智能技术(北京)有限公司 Intelligent cognitive disease system based on human excrement
CN113190672A (en) * 2021-05-12 2021-07-30 上海热血网络科技有限公司 Advertisement judgment model, advertisement filtering method and system
CN113722427A (en) * 2021-07-30 2021-11-30 安徽掌学科技有限公司 Thesis duplicate checking method based on feature vector space
CN114676701A (en) * 2020-12-24 2022-06-28 腾讯科技(深圳)有限公司 Text vector processing method, device, medium and electronic equipment
CN116957362A (en) * 2023-09-18 2023-10-27 国网江西省电力有限公司经济技术研究院 Multi-target planning method and system for regional comprehensive energy system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492420B (en) * 2022-04-02 2022-07-29 北京中科闻歌科技股份有限公司 Text classification method, device and equipment and computer readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930063A (en) * 2012-12-05 2013-02-13 电子科技大学 Feature item selection and weight calculation based text classification method
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN104102626A (en) * 2014-07-07 2014-10-15 厦门推特信息科技有限公司 Method for computing semantic similarities among short texts
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN106202518A (en) * 2016-07-22 2016-12-07 桂林电子科技大学 Based on CHI and the short text classification method of sub-category association rule algorithm
CN106502990A (en) * 2016-10-27 2017-03-15 广东工业大学 A kind of microblogging Attribute selection method and improvement TF IDF method for normalizing
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079026B (en) * 2007-07-02 2011-01-26 蒙圣光 Text similarity, acceptation similarity calculating method and system and application system
CN104090865B (en) * 2014-07-08 2017-11-03 安一恒通(北京)科技有限公司 Text similarity calculation method and device
CN105786789B (en) * 2014-12-16 2019-07-23 阿里巴巴集团控股有限公司 A kind of calculation method and device of text similarity

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930063A (en) * 2012-12-05 2013-02-13 电子科技大学 Feature item selection and weight calculation based text classification method
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN104102626A (en) * 2014-07-07 2014-10-15 厦门推特信息科技有限公司 Method for computing semantic similarities among short texts
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN106202518A (en) * 2016-07-22 2016-12-07 桂林电子科技大学 Based on CHI and the short text classification method of sub-category association rule algorithm
CN106502990A (en) * 2016-10-27 2017-03-15 广东工业大学 A kind of microblogging Attribute selection method and improvement TF IDF method for normalizing
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUJIA ZHAI,ET AL.: "A Chi-Square Statistics Based Feature Selection Method in Text Classification", 《2018 IEEE 9TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS)》 *
何燕 et al., "A New Feature Selection Method for Uyghur Text Classification", Journal of Henan University of Science and Technology (Natural Science Edition) *
石俊涛, "Chi-square Feature Extraction and TF-IDF Weight Improvement in Chinese Text Classification", China Master's Theses Full-text Database *
邱苓芸 et al., "Research on Improving the PageRank Algorithm", Software Guide (软件导刊) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112002415A (en) * 2020-08-23 2020-11-27 吾征智能技术(北京)有限公司 Intelligent cognitive disease system based on human excrement
CN112002415B (en) * 2020-08-23 2024-03-01 吾征智能技术(北京)有限公司 Intelligent cognitive disease system based on human excrement
CN114676701A (en) * 2020-12-24 2022-06-28 腾讯科技(深圳)有限公司 Text vector processing method, device, medium and electronic equipment
CN113190672A (en) * 2021-05-12 2021-07-30 上海热血网络科技有限公司 Advertisement judgment model, advertisement filtering method and system
CN113722427A (en) * 2021-07-30 2021-11-30 安徽掌学科技有限公司 Thesis duplicate checking method based on feature vector space
CN116957362A (en) * 2023-09-18 2023-10-27 国网江西省电力有限公司经济技术研究院 Multi-target planning method and system for regional comprehensive energy system

Also Published As

Publication number Publication date
WO2021035921A1 (en) 2021-03-04
CN110705247B (en) 2020-08-04

Similar Documents

Publication Publication Date Title
Naili et al. Comparative study of word embedding methods in topic segmentation
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
US11379668B2 (en) Topic models with sentiment priors based on distributed representations
CN110705247B (en) χ²-C-based text similarity calculation method
CN105183833B (en) Microblog text recommendation method and device based on user model
Froud et al. Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering
CN107895000B (en) Cross-domain semantic information retrieval method based on convolutional neural network
Suleiman et al. Comparative study of word embeddings models and their usage in Arabic language applications
CN109885675B (en) Text subtopic discovery method based on improved LDA
CN102662987B (en) A kind of sorting technique of the network text semanteme based on Baidupedia
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
Reddy et al. Profile specific document weighted approach using a new term weighting measure for author profiling
Parlar et al. Analysis of data pre-processing methods for sentiment analysis of reviews
Bashir et al. Automatic Hausa LanguageText Summarization Based on Feature Extraction using Naïve Bayes Model
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Ayadi et al. A Survey of Arabic Text Representation and Classification Methods.
Villegas et al. Vector-based word representations for sentiment analysis: a comparative study
Al-Sarem et al. The effect of training set size in authorship attribution: application on short Arabic texts
CN110580286A (en) Text feature selection method based on inter-class information entropy
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
Yang et al. Exploring word similarity to improve chinese personal name disambiguation
Ling Coronavirus public sentiment analysis with BERT deep learning
Zmandar et al. Multilingual Financial Word Embeddings for Arabic, English and French
AbuElAtta et al. Arabic Regional Dialect Identification (ARDI) using Pair of Continuous Bag-of-Words and Data Augmentation.
Lin et al. Domain Independent Key Term Extraction from Spoken Content Based on Context and Term Location Information in the Utterances

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231106

Address after: 266000 room 2102, 21 / F, block B, No.1 Keyuan Weiyi Road, Laoshan District, Qingdao City, Shandong Province

Patentee after: Qingdao Guancheng Software Co.,Ltd.

Address before: 579 qianwangang Road, Huangdao District, Qingdao City, Shandong Province

Patentee before: SHANDONG University OF SCIENCE AND TECHNOLOGY

TR01 Transfer of patent right