CN110705247A - χ²-C-based text similarity calculation method - Google Patents
χ²-C-based text similarity calculation method
- Publication number
- CN110705247A (application CN201910811440.8A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- similarity
- feature
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The invention discloses a χ²-C-based text similarity calculation method, in particular in the field of text information processing. The method classifies a test data set by using a convolutional neural network (CNN), calculates the initial weight of each feature word in a detection sample according to TF-IDF, and then uses the χ²-C algorithm to calculate a domain association factor. The initial weight is calculated in combination with the domain association factor and the word position factor α to obtain the feature word weight; a word bank is established from all the feature words of the detection sample, and the detection sample is expressed as an initial text vector by combining the word bank and the feature word weights. The similarity between words in the word bank is calculated with the word2vec tool and formed into a word-sense similarity matrix; the initial text vector is multiplied by this matrix to obtain the text vector, and finally the cosine similarity algorithm is applied to the text vectors to obtain the similarity between texts. The method thereby incorporates the degree of association between the feature words and their domain, the semantic relations among the feature words, and word position information, improving the accuracy of text similarity calculation.
Description
Technical Field
The invention relates to the field of text information processing, in particular to a χ²-C-based text similarity calculation method.
Background
Text similarity is the calculation of semantic similarity between texts, and in this era of information explosion it is applied in many fields. Traditional text similarity calculation is based on the vector space model (VSM): TF-IDF is used to compute the weights of the feature words in a text, the text is converted into a vector in a multidimensional space, and the similarity between text vectors is calculated to measure the similarity between texts. However, the TF-IDF algorithm only considers the relation between a term and the document and ignores the term's relevance to the category, so the accuracy of text similarity calculation is low. The LDA topic model is a three-layer Bayesian probability model comprising words, topics and documents. The basic idea of calculating text similarity with LDA is to perform topic modeling on a text, traverse and extract the words of the text within the word distributions corresponding to the topics to obtain the topic distribution of the text, and calculate text similarity from that distribution. Because a short text contains few representative words, LDA does not necessarily achieve the expected effect for topic mining on short texts and is better suited to long texts. The currently most popular text similarity calculation method uses a convolutional neural network: the word vectors of all words in a sentence are stacked in order into a two-dimensional matrix, a CNN model is applied on this basis, the structural properties of the CNN are used to extract and filter local semantics in the sentence, and the feature vector representation of the sentence is finally abstracted. However, the CNN model has the defects of a complex structure, many parameters, and long running time.
Disclosure of Invention
The invention aims to address these defects and provides a χ²-C-based text similarity calculation method that adopts TF-IDF combined with the χ²-C algorithm to calculate the feature word weights, incorporates word position information, and uses word2vec to add word-sense information to the text vector.
The invention specifically adopts the following technical scheme:
A χ²-C-based text similarity calculation method, comprising the steps of:
step 1: preprocessing the test data and the content of the corpus;
step 2: classifying the test data set using a Convolutional Neural Network (CNN);
Step 3: calculating the initial weight of the feature words in the detection sample by using a TF-IDF algorithm;
Step 4: calculating a domain association factor Ca by using the χ²-C algorithm;
Step 5: calculating the feature word weight from the initial weight by using the word position factor α in combination with the domain association factor Ca;
Step 6: establishing a word bank and generating an initial text vector according to the feature word weights obtained in step 5;
Step 7: calculating the word-sense similarity among the feature words by using the word2vec tool to obtain a similarity matrix P;
Step 8: calculating the text vector from the initial text vector according to the similarity matrix obtained in step 7;
Step 9: calculating the text similarity by applying the cosine similarity algorithm to the text vectors generated in step 8.
Preferably, step 4 comprises:
the χ²-C algorithm combines the χ² statistic into a domain association algorithm and calculates the domain association factor Ca; because the χ²(d,c) values calculated for some feature words are large, which greatly influences the subsequent calculation and makes the result inaccurate, formula (1) is adopted to process the calculation result:
χ²'(d,c) = χ²(d,c) / Σ_{i=1..count} χ²(d_i, c)   (1)
wherein d represents a feature word, c represents the domain described by the feature word, d_i represents the ith feature word in the text, and count represents the total number of feature words in the text;
Ca is calculated by formulae (2), (3) and (4), of which (2) and (3) are:
q_d = e_d / E_d   (2)
p_d = n_d / N_d   (3)
wherein q_d indicates the proportion of documents in the positive class containing the feature word d, e_d represents the number of documents in the positive class containing the feature word d, E_d indicates the total number of documents in the positive class, p_d represents the proportion of documents in the negative class containing the feature word d, n_d represents the number of documents in the negative class containing d, N_d represents the total number of documents in the negative class, and w_d represents the degree of association of d with its domain.
Preferably, in step 5, in order to increase the degree of association between the feature words and the domain, formula (5) combines the domain association factor Ca(d,c) and the word position information α with the initial weight to obtain the feature word weight:
w'_dt = Ca(d,c) × w_dt × α   (5)
wherein w'_dt represents the feature word weight and α represents the position information of the word; α is obtained by equation (6).
Preferably, in step 6, the feature words of texts β and γ are first merged to establish a word bank of texts β and γ, so that each word has a corresponding label;
the two texts to be tested, β and γ, are then expressed in the form of formula (7) by means of the word bank:
v_k = <w'_1k, w'_2k, …, w'_ηk, …, w'_Nk>   (7)
wherein v_k represents the initial text vector; the feature word sets of β and γ are S_dβ = {d_β1, d_β2, d_β3, …, d_βi, …, d_βn} and S_dγ = {d_γ1, d_γ2, d_γ3, …, d_γj, …, d_γm}; the feature word weight sets of β and γ are S_wβ = {w'_β1, w'_β2, w'_β3, …, w'_βn} and S_wγ = {w'_γ1, w'_γ2, w'_γ3, …, w'_γm}; k ∈ {β, γ}; w'_kη ∈ S_wβ ∪ S_wγ; η represents the label of feature word d_η in the word bank, d_η ∈ S_dβ ∪ S_dγ; and N is the total number of feature words in the word bank;
for text β, w'_βη obeys equation (8), and for text γ, w'_γη obeys equation (9): the component equals the word's weight when d_η occurs in the text, and 0 otherwise;
d_η refers to the word labeled η in the word bank, with d_βi ∈ S_dβ and d_γi ∈ S_dγ.
Preferably, in step 7, the similarity between the feature word labeled i and the feature word labeled j in the word bank is denoted sim_ij; the similarity matrix among the feature words is then the N × N matrix of formula (10), whose (i, j) entry is sim_ij.
Preferably, in step 9, word-sense information is added to the text vector, and the β and γ text vectors are updated by using the similarity matrix; the text vectors of the test texts β and γ are expressed by equations (11) and (12), respectively:
v_β = <w'_β1, w'_β2, …, w'_βη, …, w'_βN>   (11)
v_γ = <w'_γ1, w'_γ2, …, w'_γη, …, w'_γN>   (12)
the inner product of the initial text vector and the feature word similarity matrix gives the text vector:
v'_k = v_k × P   (13)
where k ∈ {β, γ}; the new text vectors v'_β = <w''_β1, w''_β2, …, w''_βη, …, w''_βN> and v'_γ = <w''_γ1, w''_γ2, …, w''_γη, …, w''_γN> are thereby computed, and each text vector contains word frequency, word sense, word position information and domain association degree.
The invention has the following beneficial effects:
the invention uses TF-IDF to combine chi2the-C algorithm calculates the weight of the feature words, increases word position information, utilizes word2vec to increase word meaning information to the text vector, considers the influence of the field on the distribution condition of the feature words and the semantic relation among the feature words, and the experimental result shows that the method is based on chi2The calculation method accuracy of the similarity of the C text is 82.64%, and the F1 value is 84.09%, which is superior to other methods.
Drawings
FIG. 1 is a flow chart of the χ²-C-based text similarity calculation.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings:
The χ²-C-based text similarity calculation method comprises the following steps:
step 1: preprocessing the test data and the content of the corpus;
step 2: classifying the test data set using a Convolutional Neural Network (CNN);
Step 3: calculating the initial weight of the feature words in the detection sample by using the TF-IDF algorithm:
W_dt = TF_dt × IDF_dt, with TF_dt = m_d / S and IDF_dt = log(N / n_d), scaled by the normalization factor nf;
wherein W_dt represents the weight of the feature word d in document t, TF_dt represents the word frequency of the feature word d in document t, m_d is the number of occurrences of the feature word d in document t, S represents the total number of feature words in document t, IDF_dt is the inverse document frequency of the feature word d, n_d is the number of texts in the corpus containing the feature word d, N represents the total number of texts in the corpus, and nf is a normalization factor.
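As a rough illustration (not the patented implementation), the TF-IDF initial weights of step 3 can be sketched as follows; the toy corpus and the omission of the normalization factor nf are simplifying assumptions:

```python
import math

def tf_idf_weights(doc_tokens, corpus):
    """Initial weight W_dt = TF_dt * IDF_dt for each feature word d in
    document t (a token list), given a corpus of token lists."""
    n_docs = len(corpus)
    total = len(doc_tokens)  # S: total number of feature words in the document
    weights = {}
    for d in set(doc_tokens):
        tf = doc_tokens.count(d) / total            # TF_dt = m_d / S
        n_d = sum(1 for doc in corpus if d in doc)  # texts containing d
        idf = math.log(n_docs / n_d)                # IDF_dt = log(N / n_d)
        weights[d] = tf * idf
    return weights
```

A word occurring in every document gets IDF 0, reflecting TF-IDF's weak ability to distinguish domains, which the patent addresses with the domain association factor of step 4.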
Step 4: calculating a domain association factor Ca (class association) by using the χ²-C algorithm;
the χ²-C algorithm combines the χ² statistic into a domain correlation algorithm and calculates the domain correlation factor Ca; the domain relation is assumed to follow a distribution similar to the χ² distribution with one degree of freedom, and the value of the χ² statistic is inversely proportional to the independence between the feature word d and the domain c; equation (14) is the chi-square test of the feature word d for the domain c:
χ²(d,c) = N × (A×D − B×C)² / ((A+B) × (C+D) × (A+C) × (B+D))   (14)
wherein A represents the number of documents belonging to c and containing d, B the number of documents not belonging to c but containing d, C the number of documents belonging to c but not containing d, D the number of documents neither belonging to c nor containing d, and N the total number of documents;
because the χ²(d,c) values calculated for some feature words are large, which greatly influences the subsequent calculation and makes the result inaccurate, formula (1) is adopted to process the calculation result;
wherein d represents a feature word, c represents the domain described by the feature word, and d_i represents the ith feature word in the text;
Ca is calculated by formulae (2), (3) and (4), of which (2) and (3) are:
q_d = e_d / E_d   (2)
p_d = n_d / N_d   (3)
wherein q_d indicates the proportion of documents in the positive class containing the feature word d, e_d represents the number of documents in the positive class containing the feature word d, E_d indicates the total number of documents in the positive class, p_d represents the proportion of documents in the negative class containing the feature word d, n_d represents the number of documents in the negative class containing d, N_d represents the total number of documents in the negative class, and w_d represents the degree of association of d with its domain.
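The chi-square test of equation (14) and the scaling of formula (1) can be sketched as below; the combination of q_d and p_d into Ca (formulae (2)–(4)) is not fully reproduced in this text, so the sketch stops at the normalized statistic:

```python
def chi_square(a, b, c, d):
    """Equation (14): chi-square of a feature word for a domain, where
    a = docs in the domain containing the word, b = docs outside the domain
    containing it, c = docs in the domain without it, d = docs outside the
    domain without it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom

def normalize_chi(chis):
    """Formula (1): scale each word's chi-square by the sum over all feature
    words in the text, so a few very large values do not dominate later steps."""
    total = sum(chis.values())
    return {w: v / total for w, v in chis.items()}
```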
Step 5: calculating the feature word weight from the initial weight by using the word position factor α in combination with the domain association factor Ca;
in order to increase the degree of association between the feature words and the domain, formula (5) combines the domain association factor Ca(d,c) and the word position information α with the initial weight to obtain the feature word weight:
w'_dt = Ca(d,c) × w_dt × α   (5)
wherein w'_dt represents the feature word weight and α represents the position information of the word (first-position value > second-position value > third-position value); α is obtained using equation (6).
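Equation (5) is an elementwise product; a minimal sketch follows, in which the concrete values of the position factor α are invented for illustration, since equation (6) is not reproduced in this text (only the ordering first > second > third is stated):

```python
# Hypothetical position factors: earlier positions weigh more, matching the
# stated ordering in eq. (6). The actual values are assumptions.
POSITION_ALPHA = {"first": 1.5, "second": 1.2, "third": 1.0}

def feature_word_weight(initial_weight, ca, position):
    """Equation (5): w'_dt = Ca(d,c) * w_dt * alpha."""
    return ca * initial_weight * POSITION_ALPHA[position]
```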
Step 6: establishing a word bank and generating an initial text vector according to the feature word weights obtained in step 5;
a text vector is generated from the updated feature word weights: the feature words of texts β and γ are first merged to establish a word bank of texts β and γ, so that each word has a corresponding label; the purpose of establishing the word bank is to better represent the texts in vector form;
the two texts to be tested, β and γ, are then expressed in the form of formula (7) by means of the word bank:
v_k = <w'_1k, w'_2k, …, w'_ηk, …, w'_Nk>   (7)
wherein v_k represents the initial text vector; the feature word sets of β and γ are S_dβ = {d_β1, d_β2, d_β3, …, d_βi, …, d_βn} and S_dγ = {d_γ1, d_γ2, d_γ3, …, d_γj, …, d_γm}; the feature word weight sets of β and γ are S_wβ = {w'_β1, w'_β2, w'_β3, …, w'_βn} and S_wγ = {w'_γ1, w'_γ2, w'_γ3, …, w'_γm}; k ∈ {β, γ}; w'_kη ∈ S_wβ ∪ S_wγ; η represents the label of feature word d_η in the word bank, d_η ∈ S_dβ ∪ S_dγ; and N is the total number of feature words in the word bank;
for text β, w'_βη obeys equation (8), and for text γ, w'_γη obeys equation (9): the component equals the word's weight when d_η occurs in the text, and 0 otherwise;
d_η refers to the word labeled η in the word bank, with d_βi ∈ S_dβ and d_γi ∈ S_dγ.
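The word-bank construction and the zero-padded initial vectors implied by equations (7)–(9) can be sketched as:

```python
def initial_text_vectors(weights_beta, weights_gamma):
    """weights_* map each feature word to its weight w'. Returns the merged
    word bank (ordered labels) and the two initial vectors of equation (7),
    with 0 where a word is absent from a text (equations (8) and (9))."""
    lexicon = sorted(set(weights_beta) | set(weights_gamma))
    v_beta = [weights_beta.get(w, 0.0) for w in lexicon]
    v_gamma = [weights_gamma.get(w, 0.0) for w in lexicon]
    return lexicon, v_beta, v_gamma
```

Both texts are thereby expressed in the same N-dimensional space, which is what makes the later matrix product and cosine comparison well defined.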
Step 7: calculating the word-sense similarity among the feature words by using the word2vec tool to obtain a similarity matrix P;
the word-sense similarity among the feature words is calculated with the word2vec tool, and the similarity matrix is listed according to these similarities.
word2vec can generate word vectors containing semantic information; the similarity between two words is obtained by applying the cosine similarity algorithm to the word vectors corresponding to the words:
sim(A, B) = Σ_{i=1..p} A_i × B_i / (√(Σ_{i=1..p} A_i²) × √(Σ_{i=1..p} B_i²))
wherein sim(A, B) is the degree of similarity between the two words, A and B are the word vectors of the feature words produced by word2vec, and p represents the dimensionality of the word vectors.
With sim_ij denoting the similarity between the feature word labeled i and the feature word labeled j in the word bank, the similarity matrix among the feature words is the N × N matrix of formula (10), whose (i, j) entry is sim_ij.
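The pairwise cosine similarities and the matrix P of formula (10) can be sketched as follows; in practice the word vectors would come from a trained word2vec model (e.g. gensim's Word2Vec), which is assumed rather than trained here:

```python
import math

def cosine(a, b):
    """Cosine similarity of two word vectors (e.g. word2vec embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(word_vectors, lexicon):
    """Formula (10): P[i][j] = sim_ij between lexicon words i and j."""
    return [[cosine(word_vectors[wi], word_vectors[wj]) for wj in lexicon]
            for wi in lexicon]
```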
and 8: calculating an initial text vector according to the similarity matrix obtained in the step 7 to obtain a text vector;
and step 9: and (4) calculating the text vector generated in the step (8) by using a cosine similarity algorithm to obtain the text similarity.
Adding word sense information in the text vector, and updating beta and gamma text vectors by using a similarity matrix; the text vectors of the test texts β and γ are respectively expressed by equations (11) and (12):
vβ=<w'β1,w'β1,w'βη……w'βN)>(11)
vγ=<w'γ1,w'γ1,w'γη……w'γN)>(12)
performing inner product on the initial text vector and the feature word similarity matrix to obtain a text vector;
v'k=vk×P (13)
where k ∈ { β, γ }, a new text vector v 'is computed'β=<w”β1,w”β1,w”βη……w”βN)>And v'γ=<w”γ1,w”γ1,w”γη……w”γN)>The text vector includes word frequency, word sense, word position information and domain association degree.
Equation (15) is a cosine similarity calculation equation.
Wherein sim (v'β,v'γ) Representing the degree of similarity, v ', between the texts β and γ'βAnd v'γIs the final text vector, w, of the text corresponding to β and γ "βiV 'to'βValue of the ith dimension of the medium, similarly w "γiV 'to'γThe value of the ith dimension, P' is the dimension of the text vector.
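Steps 8 and 9 — the row-vector-times-matrix product of equation (13) followed by the cosine of equation (15) — can be sketched as:

```python
import math

def update_vector(v, p_matrix):
    """Equation (13): v'_k = v_k x P (row vector times similarity matrix),
    which spreads each word's weight onto semantically similar words."""
    n = len(v)
    return [sum(v[i] * p_matrix[i][j] for i in range(n)) for j in range(n)]

def text_similarity(v1, v2):
    """Equation (15): cosine similarity between the two final text vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm
```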
A subset of about 8000 texts of the THUCNews news text classification data set is selected as an embodiment, and the texts are classified into the categories science and technology, sports, politics and entertainment by a convolutional neural network (CNN).
I. The contents of the test data set and the corpus are preprocessed, mainly comprising the following steps:
(1) Deleting symbols in the text content: commas, question marks, exclamation marks, etc.
(2) Removing stop words: deleting stop words from the text content; common stop words include pronouns, prepositions, adverbs, modal particles, conjunctions and other words without definite meaning.
(3) Word segmentation: the text content with stop words removed is segmented into words.
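Steps (1)–(3) can be sketched as below; the punctuation set and stop-word list are illustrative assumptions, and the text is taken as already segmented (in practice a Chinese segmenter such as jieba would produce the token list):

```python
PUNCTUATION = set("，。！？、；：,.!?;:")      # symbols to delete, per step (1)
STOP_WORDS = {"的", "是", "在", "了", "和"}   # illustrative stop-word list

def preprocess(tokens):
    """Remove punctuation symbols and stop words from a segmented text."""
    return [t for t in tokens
            if t not in PUNCTUATION and t not in STOP_WORDS]
```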
II, selecting scientific and technological detection samples: text β and text γ, wherein the content of text β:
"natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. "
Content of text γ:
"natural language processing is a branch of data science, and its main coverage is: the text data is systematically analyzed, understood and information extracted in an intelligent and efficient mode. "
After preprocessing the text β:
natural language/processing/computer/science/domain/artificial intelligence/domain/importance/direction/research/enable/implementation/human/computer/inter/use/natural language/go/active/communication/theory/method
After preprocessing of the text γ:
natural language/processing/data/science/branching/main/overlay/content/intelligence/efficiency/manner/pair/text/data/systematization/analysis/understanding/information extraction/process
III. The initial weights of the feature words in the test texts are calculated with the TF-IDF algorithm; TF refers to the frequency of a feature word in the specified document and is calculated by formula (16):
TF = m_d / S   (16)
For the text β, the TF values of the remaining feature words are 0.0526, except that the TF values corresponding to "natural language" and "domain" are 0.10526, respectively.
For the text γ, the TF values for the remaining tokens are 0.0556, except that "data" corresponds to a TF value of 0.1111.
The IDF (inverse document frequency) is a measure of the general importance of a term: the fewer documents contain a certain feature word, the larger its IDF value; IDF is calculated by formula (17):
IDF = log(N / n_d)   (17)
The IDF values of the text β are as in table 1:
TABLE 1
The IDF value of the text γ is as in table 2:
TABLE 2
The weights corresponding to the feature words in the text β are shown in table 3:
TABLE 3
The initial weight corresponding to each feature word in the text γ is shown in table 4:
TABLE 4
As shown in Table 3, the initial weights corresponding to feature words that are important in the natural language processing field, such as "natural language" and "processing", are not high, mainly because the IDF calculation has weak ability to distinguish the domain and does not reflect the degree of association between the feature words and the domain.
IV. The degree of association between the feature words and the domain is calculated with the chi-square statistic to obtain the domain association factor. According to formula (1) and formula (14), the χ²'(d,c) value corresponding to each feature word can be calculated.
The chi-squared statistics for the feature words in text β are shown in table 5.
TABLE 5
The chi-square statistic corresponding to the feature words in the text gamma is shown in table 6.
TABLE 6
As can be seen from Tables 5 and 6, the feature words that are more important in the natural language processing field have higher chi-square statistics, which indicates that the chi-square statistic represents the degree of association between feature words and the domain well.
V. The domain association degree of the feature words is calculated using the χ²-C algorithm to obtain the domain association factor Ca, which, combined with the chi-square statistic, better represents the degree of association between the feature words and the domain and lays a good foundation for calculating the text similarity. The domain association factor Ca of text β is shown in Table 7.
TABLE 7
Table 8 shows the domain association factor Ca for each feature word in γ:
TABLE 8
The initial weight and the domain association factor Ca have now been calculated for each feature word in this example. Ca and the initial weight are combined according to equation (5) to calculate the feature word weights. Table 9 and Table 10 show the weights of the feature words in text β and text γ, respectively.
TABLE 9
TABLE 10
The corresponding text vectors are generated using the calculated word weights; a word bank is established from all the words in the test texts β and γ, and Table 11 shows the word bank.
TABLE 11
Text vectors corresponding to the texts are generated according to the word bank and formulas (8) and (9):
v_βs = <0.502677433, 0.108777994, 0, 0.06002695, 0, 0.178649449, 0, 0.000008, 0, 0.000283959, ……, 0.077038474>
v_γs = <0.146430991, 0.063410661, 0.020417404, 0.015952149, 0.045591454, 0, 0.140432493, 0, 0.013076936, ……, 0>
VI. The word-sense similarity among the feature words is calculated with the word2vec tool to obtain the similarity matrix P. word2vec is mainly used to train large-scale corpora into high-quality word vectors, which are not only low-dimensional and dense but also contain semantic information. The similarity between two words is calculated with the cosine similarity, and the similarity matrix is established from the results and the word bank.
The subscripts are the labels of the corresponding words in the word bank; for example, sim_56 is the similarity between the feature words labeled d5 and d6, i.e., between the meanings of "high efficiency" and "artificial intelligence". The matrix P is as follows:
VII. The calculated similarity matrix P is used to update the text vectors v_βs and v_γs according to formula (13), i.e.
v_βz = v_βs × P
v_γz = v_γs × P
v_βz and v_γz are the final text vectors, which now contain word frequency, word sense, word position information and domain association degree. The similarity of the two texts is then calculated according to the cosine similarity algorithm; the similarity of texts β and γ is 0.89325.
The method is compared with other text similarity algorithms. 12124 preprocessed texts are used as the corpus, the text set is classified with the convolutional neural network CNN, and science and technology, education, sports and military are selected as the test text sets. In order to present the experimental results better, the text data in each class is manually screened, and for low-similarity cases similar texts are constructed by manual modification.
The accuracy and the recall rate are used as evaluation standards; the accuracy P, recall rate R and F1 value of each class are calculated from the data, and finally the average accuracy, average recall rate and average F1 value are obtained. The calculation formulas are formulas (18), (19) and (20):
P = TP / (TP + FP)   (18)
R = TP / (TP + FN)   (19)
F1 = 2 × P × R / (P + R)   (20)
TABLE 12
In order to verify the effect of different algorithms on calculating text similarity, the four text similarity algorithms — based on χ²-C, on the LDA model, on HowNet (知网) and on the convolutional neural network CNN — are compared experimentally; the evaluation indexes are the accuracy, recall rate and F1 value of formulas (18), (19) and (20), and the experimental results are shown in Table 12.
As can be seen from the experimental results in Table 12, the accuracy, recall rate and F1 value of the χ²-C-based text similarity algorithm are slightly better than those of the other three algorithms: LDA works poorly for topic mining on short texts and does not take into account the structure and word position information of the text; the knowledge base of HowNet contains a limited vocabulary and lacks many professional terms and uncommon words, so its similarity results have low accuracy; and although the CNN-based algorithm achieves high accuracy, it has many parameters, a complex model and a long running time. The χ²-C-based algorithm has a simple structure, incorporates the degree of association between the feature words and their domain, and uses the word2vec tool to make the text vectors contain word-sense information, improving the accuracy of text similarity calculation.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.
Claims (6)
1. A χ²-C-based text similarity calculation method, characterized by comprising the steps of:
step 1: preprocessing the test data and the content of the corpus;
step 2: classifying the test data set using a Convolutional Neural Network (CNN);
Step 3: calculating the initial weight of the feature words in the detection sample by using a TF-IDF algorithm;
Step 4: calculating a domain association factor Ca by using the χ²-C algorithm;
Step 5: calculating the feature word weight from the initial weight by using the word position factor α in combination with the domain association factor Ca;
Step 6: establishing a word bank and generating an initial text vector according to the feature word weights obtained in step 5;
Step 7: calculating the word-sense similarity among the feature words by using the word2vec tool to obtain a similarity matrix P;
Step 8: calculating the text vector from the initial text vector according to the similarity matrix obtained in step 7;
Step 9: calculating the text similarity by applying the cosine similarity algorithm to the text vectors generated in step 8.
2. The χ²-C-based text similarity calculation method according to claim 1, wherein step 4 comprises:
the χ²-C algorithm combines the χ² statistic into a domain association algorithm and calculates the domain association factor Ca; because the χ²(d,c) values calculated for some feature words are large, which greatly influences the subsequent calculation and makes the result inaccurate, formula (1) is adopted to process the calculation result;
wherein d represents a feature word, c represents the domain described by the feature word, d_i represents the ith feature word in the text, and count represents the total number of feature words in the text;
Ca is calculated by formulae (2), (3) and (4):
wherein q_d indicates the proportion of documents in the positive class containing the feature word d, e_d represents the number of documents in the positive class containing the feature word d, E_d indicates the total number of documents in the positive class, p_d represents the proportion of documents in the negative class containing the feature word d, n_d represents the number of documents in the negative class containing d, N_d represents the total number of documents in the negative class, and w_d represents the degree of association of d with its domain.
3. The text similarity calculation method based on χ²-C of claim 1, wherein in step 5, in order to increase the degree of association between the feature words and the domain, formula (5) combines the domain association factor Ca(d,c) with the word position information α to calculate the initial weight of the feature words, obtaining the feature word weights;
w'_dt = Ca(d,c) × w_dt × α  (5)
wherein w'_dt represents the feature word weight, α represents the position information of the word, and α is obtained using formula (6):
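Formula (5) is a plain product of three factors. A minimal sketch, noting that formula (6) for α is only an image in the source (a typical convention, assumed here, is α > 1 for title words and α = 1 for body words):

```python
def feature_weight(w_dt, ca, alpha):
    """Formula (5): w'_dt = Ca(d,c) × w_dt × α.

    w_dt:  base weight of feature word d in text t (e.g. a TF-IDF value).
    ca:    domain association factor Ca(d,c) from formulas (2)-(4).
    alpha: word position factor from formula (6) (image-only in the source;
           assumed here to boost words in prominent positions such as titles).
    """
    return ca * w_dt * alpha
```

So a base weight of 0.5 with Ca = 0.8 and a title boost α = 1.2 yields a final weight of 0.48.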
4. The text similarity calculation method based on χ²-C of claim 1, wherein in step 6, the feature words of the texts β and γ are first combined to establish a lexicon for β and γ, so that each word has a corresponding label;
then the two texts to be tested, β and γ, are each expressed in the form of formula (7) using the lexicon:
v_k = <w'_1k, w'_2k, ..., w'_ηk, ..., w'_Nk>  (7)
wherein v_k represents the initial text vector; the feature word sets of β and γ are S_dβ = {d_β1, d_β2, d_β3, ..., d_βi, ..., d_βn} and S_dγ = {d_γ1, d_γ2, d_γ3, ..., d_γj, ..., d_γm} respectively; the feature word weight sets of β and γ are S_wβ = {w'_β1, w'_β2, w'_β3, ..., w'_βj, ..., w'_βn} and S_wγ = {w'_γ1, w'_γ2, w'_γ3, ..., w'_γj, ..., w'_γm} respectively; k ∈ {β, γ}; w'_kη ∈ S_wβ ∪ S_wγ; η represents the label of feature word d_η in the lexicon, with d_η ∈ S_dβ ∪ S_dγ; and N is the total number of feature words in the lexicon;
w'_kη in the text β obeys formula (8);
w'_kη in the text γ obeys formula (9);
d_η refers to the word labeled η in the lexicon, and d_βi ∈ S_dβ, d_γj ∈ S_dγ.
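Step 6 can be sketched as follows, under the assumption (formulas (8) and (9) are image-only in the source) that w'_kη equals the feature word's weight when the word labeled η occurs in text k, and 0 otherwise:

```python
def initial_vectors(weights_beta, weights_gamma):
    """Build the shared lexicon and the initial text vectors of formula (7).

    weights_beta / weights_gamma map each feature word of texts β / γ to its
    weight w' from formula (5). Sorting the union of both word sets assigns
    every word a stable label η; a word absent from a text contributes 0
    (the assumed reading of formulas (8) and (9)).
    """
    lexicon = sorted(set(weights_beta) | set(weights_gamma))
    v_beta = [weights_beta.get(word, 0.0) for word in lexicon]
    v_gamma = [weights_gamma.get(word, 0.0) for word in lexicon]
    return lexicon, v_beta, v_gamma

# Toy usage: two texts sharing one feature word ("dog").
lexicon, v_beta, v_gamma = initial_vectors(
    {"cat": 1.0, "dog": 2.0},
    {"dog": 3.0, "car": 4.0},
)
```

Both vectors have the same dimension N (here 3), so they can be compared component-wise in the later steps.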
6. The text similarity calculation method based on χ²-C of claim 1, wherein in step 9, word sense information is added to the text vectors, and the β and γ text vectors are updated using the similarity matrix; the text vectors of the test texts β and γ are expressed by formulas (11) and (12) respectively:
v_β = <w'_β1, w'_β2, ..., w'_βη, ..., w'_βN>  (11)
v_γ = <w'_γ1, w'_γ2, ..., w'_γη, ..., w'_γN>  (12)
the inner product of the initial text vector and the feature word similarity matrix gives the text vector:
v'_k = v_k × P  (13)
where k ∈ {β, γ}; the new text vectors v'_β = <w''_β1, w''_β2, ..., w''_βη, ..., w''_βN> and v'_γ = <w''_γ1, w''_γ2, ..., w''_γη, ..., w''_γN> are obtained; the text vectors incorporate word frequency, word sense, word position information, and domain association degree.
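Formula (13) followed by the cosine comparison can be sketched in a few lines; the vector-matrix product and cosine are spelled out explicitly so the sketch stays self-contained:

```python
import math

def update_and_similarity(v_beta, v_gamma, P):
    """Apply formula (13), v'_k = v_k × P, then return the cosine similarity
    of the two updated vectors (the step-9 comparison).

    P is the N×N feature word similarity matrix from step 7, given as a
    list of rows.
    """
    def matvec(v, M):
        # Row vector times matrix: result[j] = sum_i v[i] * M[i][j].
        return [sum(v[i] * M[i][j] for i in range(len(v)))
                for j in range(len(M[0]))]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    return cosine(matvec(v_beta, P), matvec(v_gamma, P))

# Toy usage: two texts with disjoint feature words, but the words are 50%
# similar according to P, so the updated vectors overlap.
sim = update_and_similarity([1.0, 0.0], [0.0, 1.0], [[1.0, 0.5], [0.5, 1.0]])
```

With the identity matrix for P the result reduces to the plain cosine of the initial vectors; the off-diagonal entries of P are what let semantically related but non-identical words contribute to the similarity.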
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910811440.8A CN110705247B (en) | 2019-08-30 | 2019-08-30 | Text similarity calculation method based on χ²-C |
PCT/CN2019/112951 WO2021035921A1 (en) | 2019-08-30 | 2019-10-24 | TEXT SIMILARITY CALCULATION METHOD EMPLOYING χ2-C |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110705247A true CN110705247A (en) | 2020-01-17 |
CN110705247B CN110705247B (en) | 2020-08-04 |
Family
ID=69193643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910811440.8A Active CN110705247B (en) | 2019-08-30 | 2019-08-30 | Text similarity calculation method based on χ²-C |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110705247B (en) |
WO (1) | WO2021035921A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112002415A (en) * | 2020-08-23 | 2020-11-27 | 吾征智能技术(北京)有限公司 | Intelligent cognitive disease system based on human excrement |
CN113190672A (en) * | 2021-05-12 | 2021-07-30 | 上海热血网络科技有限公司 | Advertisement judgment model, advertisement filtering method and system |
CN113722427A (en) * | 2021-07-30 | 2021-11-30 | 安徽掌学科技有限公司 | Thesis duplicate checking method based on feature vector space |
CN114676701A (en) * | 2020-12-24 | 2022-06-28 | 腾讯科技(深圳)有限公司 | Text vector processing method, device, medium and electronic equipment |
CN116957362A (en) * | 2023-09-18 | 2023-10-27 | 国网江西省电力有限公司经济技术研究院 | Multi-target planning method and system for regional comprehensive energy system |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114492420B (en) * | 2022-04-02 | 2022-07-29 | 北京中科闻歌科技股份有限公司 | Text classification method, device and equipment and computer readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102930063A (en) * | 2012-12-05 | 2013-02-13 | 电子科技大学 | Feature item selection and weight calculation based text classification method |
CN103886108A (en) * | 2014-04-13 | 2014-06-25 | 北京工业大学 | Feature selection and weight calculation method of imbalance text set |
CN104102626A (en) * | 2014-07-07 | 2014-10-15 | 厦门推特信息科技有限公司 | Method for computing semantic similarities among short texts |
CN105512311A (en) * | 2015-12-14 | 2016-04-20 | 北京工业大学 | Chi square statistic based self-adaption feature selection method |
CN106202518A (en) * | 2016-07-22 | 2016-12-07 | 桂林电子科技大学 | Based on CHI and the short text classification method of sub-category association rule algorithm |
CN106502990A (en) * | 2016-10-27 | 2017-03-15 | 广东工业大学 | A kind of microblogging Attribute selection method and improvement TF IDF method for normalizing |
CN108536677A (en) * | 2018-04-09 | 2018-09-14 | 北京信息科技大学 | A kind of patent text similarity calculating method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079026B (en) * | 2007-07-02 | 2011-01-26 | 蒙圣光 | Text similarity, acceptation similarity calculating method and system and application system |
CN104090865B (en) * | 2014-07-08 | 2017-11-03 | 安一恒通(北京)科技有限公司 | Text similarity calculation method and device |
CN105786789B (en) * | 2014-12-16 | 2019-07-23 | 阿里巴巴集团控股有限公司 | A kind of calculation method and device of text similarity |
Non-Patent Citations (4)
Title |
---|
YUJIA ZHAI,ET AL.: "A Chi-Square Statistics Based Feature Selection Method in Text Classification", 《2018 IEEE 9TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS)》 * |
HE YAN ET AL.: "A New Feature Selection Method for Uyghur Text Classification", Journal of Henan University of Science and Technology (Natural Science Edition) * |
SHI JUNTAO: "Chi-square Feature Extraction and TF-IDF Weight Improvement in Chinese Text Classification", China Masters' Theses Full-text Database * |
QIU LINGYUN ET AL.: "Research on Improvement of the PageRank Algorithm", Software Guide * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112002415B (en) * | 2020-08-23 | 2024-03-01 | 吾征智能技术(北京)有限公司 | Intelligent cognitive disease system based on human excrement |
Also Published As
Publication number | Publication date |
---|---|
WO2021035921A1 (en) | 2021-03-04 |
CN110705247B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Naili et al. | Comparative study of word embedding methods in topic segmentation | |
CN111177365B (en) | Unsupervised automatic abstract extraction method based on graph model | |
US11379668B2 (en) | Topic models with sentiment priors based on distributed representations | |
CN110705247B (en) | Text similarity calculation method based on χ²-C | |
CN105183833B (en) | Microblog text recommendation method and device based on user model | |
Froud et al. | Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering | |
CN107895000B (en) | Cross-domain semantic information retrieval method based on convolutional neural network | |
Suleiman et al. | Comparative study of word embeddings models and their usage in Arabic language applications | |
CN109885675B (en) | Text subtopic discovery method based on improved LDA | |
CN102662987B (en) | A kind of sorting technique of the network text semanteme based on Baidupedia | |
Chang et al. | A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING. | |
Reddy et al. | Profile specific document weighted approach using a new term weighting measure for author profiling | |
Parlar et al. | Analysis of data pre-processing methods for sentiment analysis of reviews | |
Bashir et al. | Automatic Hausa LanguageText Summarization Based on Feature Extraction using Naïve Bayes Model | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
Ayadi et al. | A Survey of Arabic Text Representation and Classification Methods. | |
Villegas et al. | Vector-based word representations for sentiment analysis: a comparative study | |
Al-Sarem et al. | The effect of training set size in authorship attribution: application on short Arabic texts | |
CN110580286A (en) | Text feature selection method based on inter-class information entropy | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation | |
Yang et al. | Exploring word similarity to improve chinese personal name disambiguation | |
Ling | Coronavirus public sentiment analysis with BERT deep learning | |
Zmandar et al. | Multilingual Financial Word Embeddings for Arabic, English and French | |
AbuElAtta et al. | Arabic Regional Dialect Identification (ARDI) using Pair of Continuous Bag-of-Words and Data Augmentation. | |
Lin et al. | Domain Independent Key Term Extraction from Spoken Content Based on Context and Term Location Information in the Utterances |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20231106 Address after: 266000 room 2102, 21 / F, block B, No.1 Keyuan Weiyi Road, Laoshan District, Qingdao City, Shandong Province Patentee after: Qingdao Guancheng Software Co.,Ltd. Address before: 579 qianwangang Road, Huangdao District, Qingdao City, Shandong Province Patentee before: SHANDONG University OF SCIENCE AND TECHNOLOGY |