CN108647322A - Method for identifying the similarity of massive Web text information based on a word net
- Publication number: CN108647322A
- Application number: CN201810445807.4A
- Authority: CN (China)
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Abstract
The invention discloses a method for identifying the similarity of massive Web text information based on a word net, comprising the following steps: (1) constructing the word net; (2) identifying the similarity of the text information of a new Web page, which comprises: extracting the text information from the new Web page to form a new document, and extracting feature words f_1, f_2, …, f_m from the new document; solving the set of similar words of each feature word f; solving the similar-document set of each feature word f; determining the similar documents of the new document and calculating the similarity value of each document in the similar-document set; filtering the documents in the similar-document set to obtain the final similar-document set; (3) updating the word net with the new Web page by the method of step (1). The method can be used to discover information plagiarism, imitation and tampering, to mine implicit correlations between different fields, and to eliminate duplicate Web pages, thereby reducing the burden on search engines and optimizing storage and index structures.
Description
Technical Field
The invention relates to a method for identifying the similarity of Internet text information, and in particular to a method for identifying the similarity of massive Web text information based on a word net.
Background
The evolution of Internet technology, while transmitting information and knowledge, has provided a platform on which offline and online users can publish, exchange and communicate information. By drawing ordinary users into participation it has driven the rapid growth of online information and helped turn the Internet into one of the most important components of the information resource library.
To cope with the rapid growth of Internet information, many research projects have targeted how to organize this vast amount of information efficiently, enable end users to obtain the desired information quickly and accurately, and reduce the cost of organizing it. Web information on the Internet is formatted through HTML tags and displayed to the end user in text form, so many Web document processing systems treat Web documents as ordinary text and process them with general text data processing technology. Web document processing comprises many stages, in general: crawling Web pages, removing HTML tags, eliminating redundant blank lines, removing interference words, extracting word stems, mining the text data, and displaying the information. If related Web pages with link relations require special processing, the link relations between the pages must also be analyzed; the core of the whole process is text data mining. Text data mining has much in common with conventional data mining, including analyzing the latent structure of the data and clustering similar data. Applied to general text data, clustering methods try to identify the group to which each text document belongs and then form clusters according to the degree of similarity between documents, such that documents within a cluster have high similarity and documents in different clusters have low similarity.
Therefore, organizing Internet information in an orderly and standardized way, and improving the transparency and orderliness of public Internet information, is one of the first problems to solve if end users are not to be submerged by massive information while being served. Especially in the mobile Internet era, quickly extracting valuable information from the ocean of data and presenting it to end users, while protecting the property and privacy of the individuals or organizations involved in generating the information, and eliminating useless, duplicated and sensitive information from the Internet, is essential to improving the experience of Internet users.
A common means adopted by conventional methods for comparing similar texts is comparison based on the literal content of the text; representative methods include hashing based on text content, the document vector space model, and the edit distance method. For example: (1) the query keywords provided by the end user are compared with the subject terms of each document in a pre-built document index library, and if the similarity between the query keywords and a document's subject terms falls within a preset threshold, the document containing those subject terms is considered a result to return to the user. (2) A more efficient and simple method for comparing the similarity of large numbers of Web text documents is simhash, which Google uses in practice to remove large numbers of duplicate Web pages while crawling; experience shows that it has good similar-text recognition ability, suits Google's need to process huge numbers of Web pages quickly, and is stable in that small changes in a document do not cause drastic changes in its fingerprint. Considering that a text document is composed of a series of words, a k-gram method selects k consecutive words to form a subsequence, each subsequence is converted into a hash value to form a shingle, and a document is finally represented by the set of its shingles, which serves as the feature set uniquely identifying the document; the shingle method based on text content thus identifies similar documents by comparing the shingle hash values of different documents. (3) In contrast to hashing, another method preprocesses each document so that it contains only a number of feature words, where a feature word occurs frequently in certain text documents and rarely in others, giving it good discriminating power for different texts. The document vector space model extracts all feature words of a document, calculates the TF-IDF value of each, converts the document into a text feature vector composed of the TF-IDF values of its feature words, and computes the similarity of two documents by comparing the difference of their feature vectors. (4) The idea of the edit distance method is to transform one text string into another through editing operations such as insertion, deletion and replacement; the similarity of the two strings is measured by counting the total number of editing operations. Compared with hashing and the vector space model, this method compares the similarity of text strings more directly from their content and gives more accurate results, but it is unsuitable for long strings, since the memory and CPU time required by the computation grow multiplicatively with string length.
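To make the shingling scheme above concrete, the following sketch illustrates k-gram shingling and a Jaccard resemblance comparison. It is a minimal illustration in Python, not Google's production simhash; the window size k = 4, the MD5-based 64-bit fingerprint and the sample sentences are all arbitrary choices for demonstration.

```python
import hashlib

def shingles(text, k=4):
    """Split a document into k-grams of consecutive words and hash each
    k-gram to a 64-bit fingerprint (a "shingle")."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {int(hashlib.md5(g.encode()).hexdigest()[:16], 16) for g in grams}

def resemblance(a, b):
    """Jaccard overlap of two shingle sets as a document similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0

d1 = shingles("the quick brown fox jumps over the lazy dog")
d2 = shingles("the quick brown fox leaps over the lazy dog")
print(resemblance(d1, d2))  # high, but below 1.0: one word changed
```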
Methods that determine text similarity from the literal content of the text can return the basic results required by a query, or compare similarity at the level of literal content, but have the following defects: (1) the ambiguity of the end user's query is not considered, i.e. the intended target of the query may be unclear, so the content of the query lacks focus and the returned results may not be what the user expects; (2) two documents may differ greatly in literal form yet convey essentially the same information or meaning, describing the same issue from different angles with different words or even with synonyms; such documents cannot be recognized as similar by literal comparison.
Besides the above methods, similarity identification for Web pages also includes methods based on Web page link relations, anchor text and anchor windows. For example, to identify related Web pages from the existing link relations between them, two pages are considered related if they share the same inbound links (links pointing to them) or the same outbound links (links from them to other pages). For another example, among different objects with reference relations, two objects are considered similar if they are referenced by similar objects; this way of identifying object similarity has been applied both to link relations between Web pages and to citation relations between scientific papers. Haveliwala et al. point out that identifying related Web pages from link relations alone does not work well when links are few, and propose combining anchor text and anchor window methods to compensate; such methods are easily affected by the number of links between pages and by the type or quality of the pages.
Disclosure of Invention
The invention aims to solve the above problems by providing a method for identifying the similarity of massive Web text information based on a word net.
The invention achieves this purpose through the following technical scheme:
a method for identifying the similarity of massive Web text information based on a word net comprises the following steps:
(1) constructing a word network, comprising the following steps:
1.1, extracting text information from Web pages to form a document set D consisting of a number of documents d; extracting feature words from one document d in the document set D, and for any two feature words f_i, f_j among them calculating the pairwise normalized mutual information values norm_I_ij and norm_I_ji; from the calculated norm_I_ij and norm_I_ji values constructing the mutual-information word pairs <f_i, f_j> and <f_j, f_i> between the feature words f_i, f_j, with norm_I_ij as the weight of the word pair <f_i, f_j> and norm_I_ji as the weight of the word pair <f_j, f_i>, where norm_I_ij = norm_I_ji; and adding the mutual-information word pairs <f_i, f_j> and <f_j, f_i> to the word net;
1.2, performing the operation of step 1.1 on all documents d in the document set D until every document d in D has been processed; in this process, when a new document d' is introduced, extracting its feature words f'_i, f'_j, calculating for any two feature words f'_i, f'_j the two equal normalized mutual information values norm_I'_ij and norm_I'_ji, and building the mutual-information word pairs <f'_i, f'_j> and <f'_j, f'_i> between them; if the word pairs <f'_i, f'_j> and <f'_j, f'_i> already exist in the word net, updating the weight of the mutual-information relation in the word net with the value norm_I'_ij; if they do not exist in the word net, adding them to it; finally the whole word net is formed and stored in a database system;
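A minimal in-memory sketch of the construction loop of steps 1.1 and 1.2 follows. It assumes a helper norm_mi(f_i, f_j) that returns the normalized mutual information value defined by formulas (I) to (III) below; the patent stores the word net in a database system such as HBase, while a plain dictionary stands in here.

```python
from itertools import combinations

# word net: (f_i, f_j) -> norm_I_ij; both directed pairs carry the same weight
word_net = {}

def add_document(doc_features, norm_mi):
    """Steps 1.1 / 1.2: for every pair of feature words in one document,
    insert or update the mutual-information word pairs in the word net.
    doc_features is the set of feature words extracted from the document;
    norm_mi(fi, fj) is an assumed helper returning norm_I_ij."""
    for fi, fj in combinations(sorted(doc_features), 2):
        w = norm_mi(fi, fj)
        # norm_I_ij == norm_I_ji, so an existing pair is simply overwritten
        # with the newly computed weight (the "update" of step 1.2).
        word_net[(fi, fj)] = w
        word_net[(fj, fi)] = w
```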
(2) the identification of the similarity of the text information of the new Web page comprises the following steps:
2.1, extracting text information from the new Web page to form a new document, and extracting feature words f from the new document: segmenting the new document into words, calculating the weight metric TF-IDF value of each word, and selecting the feature words f_1, f_2, …, f_m according to the TF-IDF values;
2.2, solving the set of similar words of each feature word f: for each feature word f, searching the word net in the database system for the words having a direct mutual-information relation with f and recording the mutual information value of each such word, forming the similar-word set corresponding to each feature word, i.e. f_1 → {t_11:I_11, t_12:I_12, ...}, f_2 → {t_21:I_21, t_22:I_22, ...}, …, f_m → {t_m1:I_m1, t_m2:I_m2, ...}, where all the words in the similar-word set {t_m1, t_m2, ...} of one feature word f_m are different; different feature words f may share similar words, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of their similar-word sets T_l and T_k may be non-empty, T_l ∩ T_k ⊇ ∅, where ∅ denotes the empty set;
2.3, solving the similar-document set of each feature word f: for the similar-word set {t_1:I_1, t_2:I_2, ..., t_n:I_n} corresponding to a feature word f, solving the document set corresponding to each word in the similar-word set, forming the document sets corresponding to the similar-word set, and accumulating the mutual information values of the documents in those sets. That is, for every word t_i in the similar-word set {t_1:I_1, t_2:I_2, ..., t_n:I_n}, solving the set of all documents containing t_i, {I_i:(d_i1, d_i2, ...)}, where I_i is the corresponding mutual information value and d_i1, d_i2, ... are the different documents containing t_i; after this is done for every t, taking the union over all the document sets corresponding to the t, i.e. {I_1:(d_11, d_12, ...)} ∪ {I_2:(d_21, d_22, ...)} ∪ ... ∪ {I_n:(d_n1, d_n2, ...)}, to obtain a new set {d_1:Id_1, d_2:Id_2, ...}, where for each item d_i:Id_i all the d are different documents and Id_i accumulates, over the union, the mutual information value I corresponding to d_i and the tf-idf value of the corresponding t in d_i; then {d_1:Id_1, d_2:Id_2, ...} is the document set having a mutual-information relation with the feature word f, i.e. f → {d_1:Id_1, d_2:Id_2, ...}. Suppose f_1 → {d_11:I_11, d_12:I_12, ...}, f_2 → {d_21:I_21, d_22:I_22, ...}, …, f_m → {d_m1:I_m1, d_m2:I_m2, ...}, where d_i1, d_i2, ..., d_ij are different documents in the document library; the document sets of any two feature words may contain the same documents, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of their mutual-information document sets D_l and D_k may be non-empty, D_l ∩ D_k ⊇ ∅;
2.4, determining the similar documents of the new document: applying an intersection operation to the document sets obtained in step 2.3 that have a mutual-information relation with the feature words f, i.e. obtaining the similar-document set Ω = {d_11:I_11, d_12:I_12, ...} ∧ {d_21:I_21, d_22:I_22, ...} ∧ ... ∧ {d_m1:I_m1, d_m2:I_m2, ...}; the result of computing Ω is {d_1:I_1, d_2:I_2, ...}, where each d_i is a document present in all the sets and I_i, the similarity value of the document d_i, is the sum of the mutual information values of that document in all the sets when the intersection is taken; Ω is then the set of documents similar to the new document containing the feature words f_1, f_2, …, f_m;
2.5, filtering the documents in the similar-document set to obtain the final similar-document set: for each document d_i in the similar-document set Ω, comparing its similarity value I_i with a threshold δ; if I_i is smaller than δ the document is filtered out and discarded, otherwise it is kept; the filtered similar-document set Ω' is the final similar-document set;
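The identification pipeline of steps 2.2 to 2.5 can be sketched as follows. The word_net mapping from the previous sketch and an inverted index docs_containing (word to the set of documents containing it) are assumed inputs; the default threshold 0.55 is the value trained later in the experiments, and the linear scan of the word net is for clarity only, since a real deployment would query the database.

```python
from collections import defaultdict

def similar_documents(features, word_net, docs_containing, delta=0.55):
    """Steps 2.2-2.5: look up the similar words of each feature word,
    gather the documents containing them, intersect across feature words,
    and filter the result by the threshold delta (0.5-0.7 per the patent)."""
    per_feature = []
    for f in features:
        # Step 2.2: words with a direct mutual-information relation to f.
        similar = {t: w for (a, t), w in word_net.items() if a == f}
        # Step 2.3: union of the documents containing each similar word,
        # accumulating the mutual-information weights per document.
        doc_scores = defaultdict(float)
        for t, w in similar.items():
            for d in docs_containing.get(t, ()):
                doc_scores[d] += w
        per_feature.append(doc_scores)
    if not per_feature:
        return {}
    # Step 2.4: documents present in every per-feature set form the
    # similar-document set Omega; their scores are summed across sets.
    common = set(per_feature[0])
    for scores in per_feature[1:]:
        common &= set(scores)
    omega = {d: sum(scores[d] for scores in per_feature) for d in common}
    # Step 2.5: discard documents whose similarity value falls below delta.
    return {d: v for d, v in omega.items() if v >= delta}
```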
(3) updating the word net with the new Web page according to the method of step (1), in preparation for identifying the text information similarity of the next updated Web page.
Note: the initial weight of a feature word in a document is measured by the weight metric commonly used in traditional information retrieval, i.e. the TF-IDF measure. The correlation between feature words is quantified by mutual information: the occurrences of two different feature words in text are treated as two random events, and mutual information is the amount of information one event provides towards eliminating the uncertainty of the other; the magnitude of the mutual information between two feature words is defined as the measure of their degree of correlation or similarity.
Preferably, in step 1.1 and step 2.1, extracting the feature word f includes the following steps:
A. firstly, extracting text information;
B. filtering symbols and segmenting words;
C. producing the list of segmented words;
D. converting each word to lower case;
E. stemming words using the Porter stemming algorithm;
F. filtering out digits and stop words to obtain the feature words f.
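A sketch of steps A to F follows, assuming NLTK is installed for its Porter stemmer; the stop-word list here is abbreviated to a few entries for illustration.

```python
import re
from nltk.stem import PorterStemmer  # assumes NLTK is available

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # abbreviated
stemmer = PorterStemmer()

def extract_feature_words(text):
    """Steps A-F: extract text, filter symbols, segment, lowercase,
    stem with the Porter algorithm, drop digits and stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)     # B/C: symbol filter + word list
    tokens = [t.lower() for t in tokens]           # D: lowercase
    tokens = [stemmer.stem(t) for t in tokens]     # E: Porter stemming
    return [t for t in tokens                      # F: drop digits and stop words
            if not t.isdigit() and t not in STOP_WORDS]
```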
In step 1.1, calculating the pairwise normalized mutual information value norm_I_ij of any two feature words f_i, f_j comprises the following steps:
① constructing the weight metric TF-IDF vectors of the two feature words f_i and f_j over all documents d in the document set D: according to the TF-IDF values of the two feature words f_i and f_j in each document d of the specific document set D, constructing a TF-IDF vector of the same dimension for each of the two feature words; if the feature word occurs in document d_i, the value at the i-th position of its TF-IDF vector is the TF-IDF value of the word in the corresponding document d_i; if the feature word does not occur in document d_i, the value at the i-th position of the TF-IDF vector is 0;
② calculating the TF-IDF vector distance of the two feature words f_i and f_j: the cosine of the two TF-IDF vectors is taken as the TF-IDF vector distance of the two words, computed as in formula (I); the vector distance quantifies the similarity of the two TF-IDF vectors and reflects the degree of similarity of the information expressed by the two feature words f_i and f_j within the document set D:

dis(f_i, f_j) = cos(V_i, V_j) = (V_i · V_j) / (|V_i| × |V_j|)    (I)

where V_i denotes the TF-IDF vector of f_i within the document set D, and V_j denotes the TF-IDF vector of f_j within the document set D;
③ calculating the normalized mutual information value norm_I_ij of the two feature words f_i and f_j: the mutual information value of the two feature words, computed according to formula (II), is normalized using their TF-IDF vector distance, finally giving the normalized mutual information value norm_I_ij of f_i and f_j according to formula (III):

I(X; Y) = Σ_{x∈{0,1}} Σ_{y∈{0,1}} p(x, y) log( p(x, y) / (p(x) p(y)) )    (II)

where X and Y respectively denote the two random events of the words f_i and f_j occurring; "0" means the word f_i or f_j does not occur in a particular document of the document set, and "1" means the word f_i or f_j does occur in a particular document of the document set; p(x, y) denotes the joint probability of f_i and f_j occurring together in particular documents of the document set, and p(x) and p(y) denote the marginal probabilities of f_i and f_j occurring in particular documents of the document set;
in step 2.1, taking the word w as an example, calculating the weight metric TF-IDF value of each word comprises the following steps:
a. calculating the frequency TF of the word w in the document d according to the following formula, namely the ratio of the number of times the word w appears in the document d to the total number of words in the document d:
TF(w,d)=count(w,d)/size(d)
wherein, TF (w, d) represents the frequency of the word w appearing in the document d, count (w, d) represents the number of times the word w appears in the document d, and size (d) represents the total number of words contained in the document d;
b. calculating the inverse text frequency IDF of the word w over the whole document set D according to the following formula, namely taking the logarithm of the ratio of the total number of documents in the set to the number of documents containing the word w:

IDF(w, d; D) = log( sum(D) / count(w, d; D) )

where IDF(w, d; D) denotes the inverse text frequency of the word w in the document set D, sum(D) denotes the total number of documents in D, and count(w, d; D) denotes the number of documents in D containing the word w;
c. the TF-IDF value of word w in document d, i.e., the product of the TF value and the IDF value of word w, is calculated as follows:
TF-IDF(w, d) = TF × IDF.
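A direct transcription of steps a to c; the three-document corpus at the bottom is invented purely to exercise the functions.

```python
import math

def tf(word, doc):
    """TF(w, d) = count(w, d) / size(d)."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """IDF(w, d; D) = log(sum(D) / number of documents containing w)."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing) if containing else 0.0

def tf_idf(word, doc, docs):
    """TF-IDF(w, d) = TF x IDF."""
    return tf(word, doc) * idf(word, docs)

# Documents as lists of already-preprocessed feature words (invented sample).
docs = [["grain", "corn", "export"], ["grain", "wheat"], ["trade", "deal"]]
print(tf_idf("corn", docs[0], docs))  # only doc 0 contains "corn"
```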
in step 2.5, the threshold δ takes values in the range 0.5 to 0.7.
The database system in step 1.2 is a distributed database HBase.
The invention has the beneficial effects that:
starting from a statistical method of word correlation, the invention builds a word net model from the mutual-information relations between words, and then compares the similarity of two different text documents based on a number of mutual-information-related words located in the two documents, i.e. it establishes a "document - mutual-information word - document" relation model. This can supplement methods that compare text document similarity at the literal level of content; applying the "mutual-information word" relation model achieves text similarity comparison with real semantic meaning, offers more candidate results for ambiguous query requests, solves the problem that traditional methods cannot recognize synonymous descriptions of the same information, eliminates content plagiarism and imitation, expands the extraction range of effective information, and returns more relevant results to the querying end user.
The method can be used to discover information plagiarism, imitation and tampering, and also to discover implicit correlations between different fields; by studying duplicated Web text information, duplicate Web pages can be eliminated, the load on search engines reduced, storage and index structures optimized, and the retrieval efficiency of search engine systems and the quality of retrieval results improved.
Drawings
FIG. 1 is a graph of time taken to construct a word net as a function of text content size in an embodiment of the present invention;
FIG. 2 is a graph of accuracy, recall, and F1 metrics as a function of a similarity threshold δ in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram showing the comparison between the experiment effects of the Naive Bayes method in Mahout and the method of the present invention in the embodiment of the present invention;
FIG. 4 is a graph of the inter-cluster density, intra-cluster density, and F1 metrics as a function of the similarity threshold δ in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram showing the comparison between the k-means method in Mahout and the experimental effect of the method of the present invention in the embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings in which:
the invention discloses a method for identifying the similarity of massive Web text information based on a word net, which comprises the following steps:
(1) constructing a word network, comprising the following steps:
1.1, extracting text information from Web pages to form a document set D consisting of a number of documents d; extracting feature words from one document d in the document set D, and for any two feature words f_i, f_j among them calculating the pairwise normalized mutual information values norm_I_ij and norm_I_ji; from the calculated norm_I_ij and norm_I_ji values constructing the mutual-information word pairs <f_i, f_j> and <f_j, f_i> between the feature words f_i, f_j, with norm_I_ij as the weight of the word pair <f_i, f_j> and norm_I_ji as the weight of the word pair <f_j, f_i>, where norm_I_ij = norm_I_ji; and adding the mutual-information word pairs <f_i, f_j> and <f_j, f_i> to the word net;
calculating the pairwise normalized mutual information value norm_I_ij of any two feature words f_i, f_j comprises the following steps:
① constructing the weight metric TF-IDF vectors of the two feature words f_i and f_j over all documents d in the document set D: according to the TF-IDF values of the two feature words f_i and f_j in each document d of the specific document set D, constructing a TF-IDF vector of the same dimension for each of the two feature words; if the feature word occurs in document d_i, the value at the i-th position of its TF-IDF vector is the TF-IDF value of the word in the corresponding document d_i; if the feature word does not occur in document d_i, the value at the i-th position of the TF-IDF vector is 0;
② calculating the TF-IDF vector distance of the two feature words f_i and f_j: the cosine of the two TF-IDF vectors is taken as the TF-IDF vector distance of the two words, computed as in formula (I); the vector distance quantifies the similarity of the two TF-IDF vectors and reflects the degree of similarity of the information expressed by the two feature words f_i and f_j within the document set D:

dis(f_i, f_j) = cos(V_i, V_j) = (V_i · V_j) / (|V_i| × |V_j|)    (I)

where V_i denotes the TF-IDF vector of f_i within the document set D, and V_j denotes the TF-IDF vector of f_j within the document set D;
③ calculating the normalized mutual information value norm_I_ij of the two feature words f_i and f_j: the mutual information value of the two feature words, computed according to formula (II), is normalized using their TF-IDF vector distance, finally giving the normalized mutual information value norm_I_ij of f_i and f_j according to formula (III):

I(X; Y) = Σ_{x∈{0,1}} Σ_{y∈{0,1}} p(x, y) log( p(x, y) / (p(x) p(y)) )    (II)

where X and Y respectively denote the two random events of the words f_i and f_j occurring; "0" means the word f_i or f_j does not occur in a particular document of the document set, and "1" means the word f_i or f_j does occur in a particular document of the document set; p(x, y) denotes the joint probability of f_i and f_j occurring together in particular documents of the document set, and p(x) and p(y) denote the marginal probabilities of f_i and f_j occurring in particular documents of the document set;
1.2, performing the operation of step 1.1 on all documents d in the document set D until every document d in D has been processed; in this process, when a new document d' is introduced, extracting its feature words f'_i, f'_j, calculating for any two feature words f'_i, f'_j the two equal normalized mutual information values norm_I'_ij and norm_I'_ji, and building the mutual-information word pairs <f'_i, f'_j> and <f'_j, f'_i> between them; if the word pairs <f'_i, f'_j> and <f'_j, f'_i> already exist in the word net, updating the weight of the mutual-information relation in the word net with the value norm_I'_ij; if they do not exist in the word net, adding them to it; finally the whole word net is formed and stored in a database system, preferably the distributed database HBase;
extracting the feature words f comprises the following steps:
A. firstly, extracting text information;
B. filtering symbols and segmenting words;
C. producing the list of segmented words;
D. converting each word to lower case;
E. stemming words using the Porter stemming algorithm;
F. filtering out digits and stop words to obtain the feature words f;
(2) the identification of the similarity of the text information of the new Web page comprises the following steps:
2.1, extracting text information from the new Web page to form a new document, and extracting feature words f from the new document: segmenting the new document into words, calculating the weight metric TF-IDF value of each word, and selecting the feature words f_1, f_2, …, f_m according to the TF-IDF values;
Taking the word w as an example, calculating the weight metric TF-IDF value of each word comprises the following steps:
a. calculating the frequency TF of the word w in the document d according to the following formula, namely the ratio of the number of times the word w appears in the document d to the total number of words in the document d:
TF(w,d)=count(w,d)/size(d)
wherein, TF (w, d) represents the frequency of the word w appearing in the document d, count (w, d) represents the number of times the word w appears in the document d, and size (d) represents the total number of words contained in the document d;
b. calculating the inverse text frequency IDF of the word w over the whole document set D according to the following formula, namely taking the logarithm of the ratio of the total number of documents in the set to the number of documents containing the word w:

IDF(w, d; D) = log( sum(D) / count(w, d; D) )

where IDF(w, d; D) denotes the inverse text frequency of the word w in the document set D, sum(D) denotes the total number of documents in D, and count(w, d; D) denotes the number of documents in D containing the word w;
c. the TF-IDF value of word w in document d, i.e., the product of the TF value and the IDF value of word w, is calculated as follows:
TF-IDF(w,d)=TF×IDF;
2.2, solving the set of similar words of each feature word f: for each feature word f, searching the word net in the database system for the words having a direct mutual-information relation with f and recording the mutual information value of each such word, forming the similar-word set corresponding to each feature word, i.e. f_1 → {t_11:I_11, t_12:I_12, ...}, f_2 → {t_21:I_21, t_22:I_22, ...}, …, f_m → {t_m1:I_m1, t_m2:I_m2, ...}, where all the words in the similar-word set {t_m1, t_m2, ...} of one feature word f_m are different; different feature words f may share similar words, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of their similar-word sets T_l and T_k may be non-empty, T_l ∩ T_k ⊇ ∅, where ∅ denotes the empty set;
2.3, solving the similar-document set of each feature word f: for the similar-word set {t_1:I_1, t_2:I_2, ..., t_n:I_n} corresponding to a feature word f, solving the document set corresponding to each word in the similar-word set, forming the document sets corresponding to the similar-word set, and accumulating the mutual information values of the documents in those sets. That is, for every word t_i in the similar-word set {t_1:I_1, t_2:I_2, ..., t_n:I_n}, solving the set of all documents containing t_i, {I_i:(d_i1, d_i2, ...)}, where I_i is the corresponding mutual information value and d_i1, d_i2, ... are the different documents containing t_i; after this is done for every t, taking the union over all the document sets corresponding to the t, i.e. {I_1:(d_11, d_12, ...)} ∪ {I_2:(d_21, d_22, ...)} ∪ ... ∪ {I_n:(d_n1, d_n2, ...)}, to obtain a new set {d_1:Id_1, d_2:Id_2, ...}, where for each item d_i:Id_i all the d are different documents and Id_i accumulates, over the union, the mutual information value I corresponding to d_i and the tf-idf value of the corresponding t in d_i; then {d_1:Id_1, d_2:Id_2, ...} is the document set having a mutual-information relation with the feature word f, i.e. f → {d_1:Id_1, d_2:Id_2, ...}. Suppose f_1 → {d_11:I_11, d_12:I_12, ...}, f_2 → {d_21:I_21, d_22:I_22, ...}, …, f_m → {d_m1:I_m1, d_m2:I_m2, ...}, where d_i1, d_i2, ..., d_ij are different documents in the document library; the document sets of any two feature words may contain the same documents, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of their mutual-information document sets D_l and D_k may be non-empty, D_l ∩ D_k ⊇ ∅;
2.4, determining the similar documents of the new document: applying an intersection operation to the document sets obtained in step 2.3 that have a mutual-information relation with the feature words f, i.e. obtaining the similar-document set Ω = {d_11:I_11, d_12:I_12, ...} ∧ {d_21:I_21, d_22:I_22, ...} ∧ ... ∧ {d_m1:I_m1, d_m2:I_m2, ...}; the result of computing Ω is {d_1:I_1, d_2:I_2, ...}, where each d_i is a document present in all the sets and I_i, the similarity value of the document d_i, is the sum of the mutual information values of that document in all the sets when the intersection is taken; Ω is then the set of documents similar to the new document containing the feature words f_1, f_2, …, f_m;
2.5, filtering the documents in the similar-document set to obtain the final similar-document set: for each document d_i in the similar-document set Ω, comparing its similarity value I_i with a threshold δ; if I_i is smaller than δ the document is filtered out and discarded, otherwise it is kept; the filtered similar-document set Ω' is the final similar-document set, and the threshold δ takes values in the range 0.5 to 0.7;
(3) updating the word net with the new Web page according to the method of step (1), in preparation for identifying the text information similarity of the next updated Web page.
The effectiveness of the method was verified experimentally as follows:
experiments were carried out on the data sets 20-NewsGroups and Reuters-21578 respectively, compared with the Naive Bayes text classification method provided by Mahout on 20-NewsGroups and with the k-means text clustering method provided by Mahout on Reuters-21578. The experimental procedure is divided into two stages: the first stage constructs a word net with topic classifications from all existing documents, corresponding to the word net model generation stage; the second stage retrieves the similar documents of a given document according to the word net model generated in the first stage.
Experimental setup:
the experimental environment is a Hadoop cluster of 19 machine nodes with a total configured capacity of 6.42 TB; one node is the NameNode, one is the SecondaryNameNode, and the rest are DataNodes. The distributed database HBase cluster used in the experiments has 13 machine nodes, of which one is the HMaster node and the others are HRegionServer nodes. The cluster runs Hadoop version 2.2.0 and HBase version 0.98.6. The structure of the whole Hadoop cluster environment and the machine node performance are shown in Table 1; those of the HBase cluster are shown in Table 2.
TABLE 1 Hadoop Cluster architecture and machine node Performance
TABLE 2 HBase Cluster Structure and machine node Performance
Currently, the most widely used Web data sets in text clustering and classification research are 20-NewsGroups and Reuters-21578.
The 20-NewsGroups data set consists of 20 predefined categories; for example, the category directory soc.religion.christian contains 997 files. Different categories contain different degrees of similar information: the information in comp.sys.ibm.pc.hardware and comp.sys.mac.hardware is very close, while the information in misc.forsale and soc.religion.christian differs greatly. The 20-NewsGroups data set is commonly used as a corpus for text classification algorithms.
The Reuters-21578 data set was manually collected and organized from the Reuters newswire and is distributed over 22 data files: from reut2-000.sgm to reut2-020.sgm each data file contains 1000 documents, and reut2-021.sgm contains 578 documents, hence the name Reuters-21578. Each data file begins with the declaration <!DOCTYPE lewis SYSTEM "lewis.dtd"> defining the document type, and each document's content is delimited by an opening <REUTERS> tag and a closing </REUTERS> tag. The Reuters-21578 files are in SGML format, so before their text content can be used, a preprocessing step must remove the SGML markup and extract the text, after which subsequent analysis can proceed. The Reuters-21578 data set is commonly used as a corpus for text clustering algorithms.
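The extraction step can be sketched with a regular expression over the <REUTERS> records; this is a rough illustration, not a full SGML parser, and the latin-1 encoding is an assumption about the raw files.

```python
import re

def extract_reuters_bodies(sgml_text):
    """Pull the content of each <REUTERS>...</REUTERS> record and strip
    the remaining SGML tags."""
    records = re.findall(r"<REUTERS.*?>(.*?)</REUTERS>", sgml_text, re.S)
    return [re.sub(r"<[^>]+>", " ", r).strip() for r in records]

with open("reut2-000.sgm", encoding="latin-1") as fh:
    documents = extract_reuters_bodies(fh.read())
print(len(documents))  # 1000 records are expected in this file
```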
Experimental evaluation:
Usually, the performance of a computer system or program is evaluated from the resources it consumes, such as space or time. A text mining system, however, is compared not only on the time and space consumption of the overall system but also on its ability to mine relevant documents. Recall and Precision are two measures of a text processing system's ability to discover relevant documents; each has its strengths and weaknesses, and they complement each other. Recall is defined as the proportion of the relevant documents in the system that are retrieved, emphasizing the system's completeness, and is calculated as in formula (4-1); Precision is defined as the proportion of the retrieved documents that are relevant, emphasizing the system's exactness, and is calculated as in formula (4-2):

Recall = |relevant ∩ retrieved| / |relevant|    (4-1)

Precision = |relevant ∩ retrieved| / |retrieved|    (4-2)

Currently, a common measure takes both recall and precision into consideration, namely the F1 measure:

F1 = 2 × Precision × Recall / (Precision + Recall)
the method of the invention is adopted to calculate the similarity among all documents, the documents with the similarity within a certain threshold value range are classified together to form a new document set theta, and the proportion of the documents in the formed new document set theta, which are correctly classified into the same category pre-established with the original data set, is used as the standard of the performance correctness of the metric calculation method. For example, the recall rate of a classification directory of a data set is the ratio of the number of documents with the similarity within a certain threshold value delta range, which belong to the classification directory, to the total number of documents in the classification directory of an original document set; the accuracy is the ratio of the number of documents with the similarity within a certain threshold value delta range belonging to the classification catalogue to the total number of retrieved documents with the similarity within the threshold value delta range.
Experimental results and analysis of processing 20-Newsgroups:
most of the data in 20-NewsGroups is the text content of Web retrieval results; it contains many empty lines and spaces, and text attributes describing the content. The original data set therefore needs preprocessing, such as removing empty lines, punctuation marks, single letters, digits and useless words, followed by word segmentation, stem extraction, and calculating the weight of each word in its document.
The experimental operation process is divided into two stages, wherein the first stage is to construct a word network from all preprocessed documents, and the word network is equivalent to a word network model generation stage; the second stage is to search similar documents of a certain document according to the word network model generated in the first stage.
In the first stage of constructing the word network, the relationship between the time used by the system and the length of the text content is shown in fig. 1.
The word net formed from the data set 20-NewsGroups has 77,053,480 edges, i.e. more than 77 million word pairs with a mutual-information relation.
Before similar documents are identified with the word net model, the documents under each classification directory of the 20-NewsGroups data set are split into a training set and a test set in the ratio 3:2; the training set is used to train the document similarity threshold δ, and the test set finally verifies the accuracy of the model for similarities above the threshold δ.
In the training stage, δ is varied from 0.1 to 1, the recall and precision of the files of the 20 different classification directories in the training set are solved for each classification, and finally the averages of recall and precision over all classifications are taken as the overall recall and precision of the data set. The average recall and average precision, and the corresponding F1 measure, as functions of δ are shown in Fig. 2.
As can be seen from Fig. 2, the F1 measure combining precision and recall reaches its maximum for 0.5 ≤ δ ≤ 0.6; at δ = 0.55 the precision, recall and F1 curves intersect at one point and the F1 measure attains its maximum. Therefore, when classifying the data set 20-NewsGroups by identifying similar documents with the method of the invention, the document similarity threshold δ is taken as 0.55, i.e. when the similarity between documents is greater than 0.55 the text information they contain is considered similar.
The test data of the data set were then compared experimentally using the Naive Bayes text classification method provided by Mahout and the method of the invention with a similarity threshold δ of 0.55; the average precision, average recall and F1-Measure are shown in Fig. 3.
As can be seen from Fig. 3, when the method of the present invention classifies similar texts its precision is higher than that of the Naive Bayes text classification method in Mahout and its recall is slightly lower, but its F1 value, the comprehensive index of precision and recall, is higher, indicating that the method of the invention is better suited to similarity-based text classification.
In terms of time efficiency, the Naive Bayes text classification method in Mahout took 95,642 seconds, while the word correlation method took 128,397 seconds; all experimental operations were performed in the Hadoop distributed cluster environment. The word correlation method must access the distributed database every time it fetches related information from the word net, which consumes a large amount of time.
Experimental results and analysis of processing Reuters-21578:
all text content of the Reuters-21578 data set is stored in data files in SGML format, distributed over the 22 data files reut2-000.sgm to reut2-021.sgm in order of generation time; except for reut2-021.sgm, which contains 578 documents, each data file contains 1000 documents on average. The different types of information are not evenly distributed among the data files, so before any operation on the Reuters-21578 data set the document contents in all the data files must be extracted into separate files.
At present the partitioning regarded as best divides the Reuters-21578 data set into 10 topic types, but it has a problem: some documents contain much cross-topic information, so the topic classification they belong to is hard to determine; for example, the degree of information overlap between the two topic classifications corn and wheat and the grain topic classification is hard to define. Ana adopted a simpler and more intuitive method: documents containing more than one unrelated topic are discarded, and documents labeled with the three related topics corn, wheat and grain are classified into a single grain topic classification, so that the Reuters-21578 data set is finally divided into 8 topic types; the document distribution under each topic type is shown in Table 3. In the experiment, on the training and test sets obtained with this partitioning, the k-means clustering algorithm in Mahout is used on the training set to obtain the document clusters under each topic type and to calculate the inter-cluster density and intra-cluster density of each type, while the word correlation method proposed herein is used to train the threshold δ of text similarity within each cluster on the training data set. Finally, experimental comparisons were performed on the test data set using the k-means clustering algorithm in Mahout and the method of the invention.
TABLE 3 distribution of documents for Reuters-21578 dataset partitioned into 8 topic types
The experimental procedure for the data set Reuters-21578 is again divided into two stages: the first stage constructs a word net from all preprocessed documents, corresponding to the word net model generation stage; the second stage retrieves the similar documents of a given document according to the word net model generated in the first stage. The word net construction of the first stage is the same as for 20-NewsGroups; the constructed word net has 27,526,742 edges, i.e. more than 27 million word pairs with a mutual-information relation.
After text clustering of the training data set with the k-means method, the minimum similarity between documents within each cluster averages 0.527, the inter-cluster density is 0.5969, and the intra-cluster density is 0.7038. Experiments with the method of the invention on the training set give the inter-cluster density, intra-cluster density and F1 measure as functions of the similarity threshold δ shown in Fig. 4, where the values are averages over all cluster categories.
When the method processes the Reuters-21578 data set, the F1 measure attains its maximum within the range 0.5 ≤ δ ≤ 0.7. To ensure that the F1 measure of the algorithm attains a maximum while the inter-cluster density is smaller and the intra-cluster density larger, δ is taken as 0.7 in the experiment; i.e. when the similarity between documents is greater than 0.7 they are considered to belong to the same cluster, and if a document's similarity to several clusters exceeds δ it is assigned to the cluster with the greatest similarity.
On the test data, the k-means text clustering method provided by Mahout and the method of the invention with similarity threshold δ = 0.7 were compared experimentally; the average inter-cluster density, average intra-cluster density and F1-Measure of the two are shown in Fig. 5.
As can be seen from Fig. 5, in text clustering the inter-cluster density of the method is lower than that of the k-means algorithm and the intra-cluster density slightly higher, meaning the clusters generated by the word correlation method are more compact. Judged by the F1 measure, however, the k-means method retains an advantage over the word correlation method in text clustering applications. The experiment then further processed the documents under the corresponding topics of the data set partitioning shown in Table 3 with the Naive Bayes algorithm and the C4.5 algorithm; the distribution of their F1 values is shown in Table 4.
TABLE 4 F1 results (%) of the Naive Bayes and C4.5 algorithms on the data set partitioning shown in Table 3
It can be seen from Table 4 that the information in some documents of the Reuters-21578 data set is strongly skewed; for example, for documents under the topic trade the F1 measures obtained with different methods differ greatly.
In terms of time efficiency, the k-means text clustering method in Mahout took 2,342 seconds, while the word correlation method took 3,971 seconds; all experimental operations were performed in the Hadoop distributed cluster environment. The word correlation method must access the distributed database every time it fetches related information from the word net, which consumes a large amount of time.
Addressing the characteristics of information propagation on the Internet, the invention proposes a fuzzy word-correlation recognition algorithm based on the contextual relations of text documents to recognize similar information, overcoming the limitation of traditional methods that recognize similar information only from the literal content of text fragments.
The Internet contains a large amount of information of free form and irregular content, which greatly increases the difficulty of obtaining effective information. Conventional methods, however, either design very complex algorithms to improve the precision of the solution or neglect the precision of the result to improve efficiency, making it hard to balance simplicity, efficiency and precision. Based on the excellent open-source distributed processing platform Hadoop, the invention proposes a fuzzy similar-document recognition method based on word correlation; starting from a statistical language processing model, it builds a word net within a given information topic field and recognizes documents with similar information in a broad sense, i.e. the synonymous reformulations commonly found in documents, thereby expanding the recognition range of similar information.
In future research, broader data corpora can be used to study the word correlation model proposed herein in greater depth, the parameters of the model can be further optimized, and the mutual-information relations between words, as well as the attenuation of the strength of mutual-information relations established through intermediate words, can be investigated. In addition, because the model needs sufficient correlation training among the words representing each type of information topic while the word net is being constructed, the early word net construction consumes a large amount of time, which is also a direction for future research.
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its technical solutions; any technical solution that can be realized on the basis of the above embodiments without creative effort shall be considered to fall within the protection scope of this patent.
Claims (6)
1. A method for identifying the similarity of a large amount of Web text information based on a word network, characterized in that the method comprises the following steps:
(1) constructing a word network (an illustrative sketch follows step 1.2), comprising the following steps:
1.1, extracting text information from Web pages to form a document set D consisting of a number of documents d; extracting the feature words from one document d in the document set D; calculating, for any two feature words f_i and f_j, the pairwise normalized mutual information values norm_I_ij and norm_I_ji; constructing from the calculated norm_I_ij and norm_I_ji values the mutual-information word pairs <f_i, f_j> and <f_j, f_i> of the feature words f_i and f_j, where norm_I_ij is the weight of the word pair <f_i, f_j>, norm_I_ji is the weight of the word pair <f_j, f_i>, and norm_I_ij = norm_I_ji; and adding the mutual-information word pairs <f_i, f_j> and <f_j, f_i> to the word net;
1.2, executing the operation of step 1.1 on all documents d in the document set D until every document d in D has been processed; in this process, when a new document d' is introduced, extracting its feature words f'_i and f'_j, calculating for any two feature words the two equal normalized mutual information values norm_I'_ij and norm_I'_ji, and establishing the mutual-information word pairs <f'_i, f'_j> and <f'_j, f'_i> between them; if the mutual-information word pairs <f'_i, f'_j> and <f'_j, f'_i> already exist in the word network, updating the weight of that relation in the word network with the norm_I'_ij value; if they do not exist in the word network, adding them to the word network; finally forming the complete word network and storing it in a database system;
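As an illustration of steps 1.1 and 1.2, the following minimal Python sketch builds the word net with an in-memory dictionary standing in for the database system of step 1.2; `extract_feature_words` and `norm_mi` are assumed helpers corresponding to claims 2 and 3, and all names are illustrative rather than part of the claimed method.

```python
from itertools import combinations

def build_word_net(documents, extract_feature_words, norm_mi):
    """Sketch of steps 1.1-1.2: build the word net over a document set D.

    documents             -- list of raw document texts (the document set D)
    extract_feature_words -- assumed helper per claim 2: text -> feature words
    norm_mi               -- assumed helper per claim 3: (f_i, f_j, D) -> norm_I_ij
    Returns a dict mapping directed word pairs <f_i, f_j> to weights norm_I_ij.
    """
    documents = list(documents)
    word_net = {}  # in-memory stand-in for the database system of step 1.2
    for doc in documents:
        features = set(extract_feature_words(doc))
        for f_i, f_j in combinations(sorted(features), 2):
            weight = norm_mi(f_i, f_j, documents)  # norm_I_ij == norm_I_ji
            # add both directed pairs; if a pair already exists, its weight
            # is simply updated with the newly computed value (step 1.2)
            word_net[(f_i, f_j)] = weight
            word_net[(f_j, f_i)] = weight
    return word_net
```

Storing both directed pairs with the same weight mirrors the claim's symmetric relation while keeping lookups by either word simple.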
(2) identifying the similarity of the text information of a new Web page (an illustrative sketch follows step 2.5), comprising the following steps:
2.1, extracting text information from the new Web page to form a new document, and extracting the feature words f from the new document: segmenting the new document into words, calculating the TF-IDF weight of each word, and selecting the feature words f_1, f_2, …, f_m according to their TF-IDF values;
2.2, finding the similar-word set of each feature word f: for each feature word f, looking up the words that have a direct mutual-information relation with it in the word network stored in the database system, while recording the mutual-information value of each such word, to form the similar-word set of each feature word, i.e. f_1 → {t_11: I_11, t_12: I_12, ...}, f_2 → {t_21: I_21, t_22: I_22, ...}, …, f_m → {t_m1: I_m1, t_m2: I_m2, ...}, where all words in the similar-word set {t_m1, t_m2, ...} of the same feature word f_m are different; different feature words f may share common similar words, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of the corresponding similar-word sets may satisfy {t_l1, t_l2, ...} ∩ {t_k1, t_k2, ...} ≠ ∅, where ∅ denotes the empty set;
2.3, finding the similar-document set of each feature word f: for the similar-word set {t_1: I_1, t_2: I_2, ..., t_n: I_n} corresponding to a feature word f, finding the document set corresponding to each word in the similar-word set, thereby forming the document sets corresponding to the similar-word set, and accumulating the mutual-information values of the documents in these sets; i.e., for every word t_i in the similar-word set {t_1: I_1, t_2: I_2, ..., t_n: I_n}, finding all documents that contain t_i, {I_i: (d_i1, d_i2, ...)}, where I_i is the corresponding mutual-information value and d_i1, d_i2, ... are the different documents containing t_i; after this process has been completed for all t, taking the union of the document sets corresponding to all t, i.e. {I_1: (d_11, d_12, ...)} ∪ {I_2: (d_21, d_22, ...)} ∪ ... ∪ {I_n: (d_n1, d_n2, ...)}, to obtain a new set {d_1: Id_1, d_2: Id_2, ...}; for an item d_i: Id_i of this set, all d are different documents, and Id_i combines, over the union, the mutual-information value I corresponding to d_i and the tf-idf value in d_i of the corresponding t; then {d_1: Id_1, d_2: Id_2, ...} is the set of documents having a mutual-information relation with the feature word f, i.e. f → {d_1: Id_1, d_2: Id_2, ...}; suppose f_1 → {d_11: I_11, d_12: I_12, ...}, f_2 → {d_21: I_21, d_22: I_22, ...}, …, f_m → {d_m1: I_m1, d_m2: I_m2, ...}, where d_i1, d_i2, ..., d_ij are different documents in the document library; the document sets of any two feature words may contain the same documents, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of the corresponding mutual-information document sets may be non-empty;
2.4, determining the documents similar to the new document: applying the intersection operation to the document sets obtained in step 2.3 that have a mutual-information relation with the feature words f, i.e. obtaining the similar-document set Ω = {d_11: I_11, d_12: I_12, ...} ∧ {d_21: I_21, d_22: I_22, ...} ∧ ... ∧ {d_m1: I_m1, d_m2: I_m2, ...}; suppose the result of the calculation of Ω is {d_1: I_1, d_2: I_2, ...}, where each d_i is a document that exists in all the sets and I_i is the similarity value of the document d_i, equal to the sum of the mutual-information values of the corresponding document in all the sets when the intersection is taken; then the documents similar to the document containing the feature words f_1, f_2, …, f_m are the documents of Ω;
2.5, filtering the documents in the similar-document set to obtain the final similarity document set: for each document d_i in the similarity document set Ω, comparing its similarity value I_i with a threshold δ; if I_i is smaller than δ, filtering the document out and discarding it, otherwise keeping it; the filtered similar-document set Ω' so obtained is the final similarity document set;
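A minimal Python sketch of steps 2.1–2.5 follows; it assumes the word net of step (1) is available as a dictionary, that `docs_containing` is a hypothetical lookup returning the documents containing a given word, and that the number of feature words m and the threshold handling are illustrative only.

```python
from collections import defaultdict

def find_similar_documents(new_doc_tfidf, word_net, docs_containing, delta=0.6, m=10):
    """Sketch of steps 2.1-2.5: documents similar to a new document.

    new_doc_tfidf   -- {word: tf-idf value} for the new document (step 2.1)
    word_net        -- {(f, t): norm_I} word pairs built in step (1)
    docs_containing -- assumed lookup: word -> set of document ids
    delta           -- filtering threshold of step 2.5 (claim 5: 0.5-0.7)
    """
    # 2.1: select the m words with the highest TF-IDF values as feature words
    feature_words = sorted(new_doc_tfidf, key=new_doc_tfidf.get, reverse=True)[:m]

    per_feature = []
    for f in feature_words:
        # 2.2: words with a direct mutual-information relation to f
        similar = {t: mi for (w, t), mi in word_net.items() if w == f}
        # 2.3: union of the documents containing each similar word,
        # accumulating the mutual-information values per document
        doc_scores = defaultdict(float)
        for t, mi in similar.items():
            for d in docs_containing(t):
                doc_scores[d] += mi
        per_feature.append(doc_scores)

    # 2.4: intersection over all feature words, summing similarity values
    if not per_feature:
        return {}
    common = set.intersection(*(set(s) for s in per_feature))
    omega = {d: sum(s[d] for s in per_feature) for d in common}

    # 2.5: filter out documents whose similarity value falls below delta
    return {d: v for d, v in omega.items() if v >= delta}
```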
(3) updating the word net with the new Web page according to the method of step (1), in preparation for identifying the text-information similarity of the next updated Web page.
2. The method for identifying the similarity of a large amount of Web text information based on a word network according to claim 1, characterized in that: in step 1.1 and step 2.1, extracting the feature words f comprises the following steps (an illustrative sketch follows step F):
A. first, extracting the text information;
B. filtering out symbols and segmenting the text into words;
C. obtaining the word list;
D. converting each word to lower case;
E. reducing each word to its stem using the Porter stemming algorithm;
F. filtering out numbers and stop words to obtain the feature words f.
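The pipeline A–F can be sketched in Python as follows; the NLTK Porter stemmer and the stop-word list are assumptions of this sketch, not prescribed by the claim.

```python
import re
from nltk.stem import PorterStemmer  # Porter stemming algorithm of step E

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}  # illustrative

def extract_feature_words(text):
    """Sketch of steps A-F: raw text -> feature words f."""
    words = re.findall(r"[A-Za-z0-9]+", text)   # B: filter symbols, segment words
    words = [w.lower() for w in words]          # C, D: word list, lower-cased
    stemmer = PorterStemmer()
    words = [stemmer.stem(w) for w in words]    # E: reduce each word to its stem
    return [w for w in words                    # F: drop numbers and stop words
            if not w.isdigit() and w not in STOP_WORDS]
```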
3. The method for identifying the similarity of a large amount of Web text information based on a word network according to claim 1, characterized in that: in step 1.1, calculating the pairwise normalized mutual information value norm_I_ij of any two feature words f_i and f_j comprises the following steps (an illustrative sketch follows step ③):
① constructing the TF-IDF weight vectors of the two feature words f_i and f_j over all documents d in the document set D: according to the TF-IDF values of the two feature words f_i and f_j in each document d of the specific document set D, forming TF-IDF vectors of the same dimension for the two feature words; if a feature word occurs in document d_i, the value at the i-th position of its TF-IDF vector is the TF-IDF value of the word in the corresponding document d_i; if the feature word does not occur in document d_i, the value at the i-th position of the TF-IDF vector is 0;
② calculating the TF-IDF vector distance of the two feature words f_i and f_j: calculating the cosine of the two TF-IDF vectors as the TF-IDF vector distance measuring the two words, as shown in formula (I); the vector distance quantifies the similarity of the two TF-IDF vectors and reflects the degree of similarity of the information expressed by the two feature words f_i and f_j within the document set D:

cos(V_fi, V_fj) = (V_fi · V_fj) / (|V_fi| × |V_fj|)    (I)

wherein V_fi denotes the TF-IDF vector of f_i within the document set D, and V_fj denotes the TF-IDF vector of f_j within the document set D;
③ calculating the normalized mutual information value norm_I_ij of the two feature words f_i and f_j: calculating the mutual information value of the two feature words according to formula (II), normalizing it with the TF-IDF vector distance of the two feature words f_i and f_j, and finally obtaining the normalized mutual information value norm_I_ij of f_i and f_j according to formula (III):

I(X; Y) = Σ_{x ∈ {0,1}} Σ_{y ∈ {0,1}} p(x, y) log( p(x, y) / (p(x) p(y)) )    (II)

wherein X and Y respectively denote the random events that the words f_i and f_j occur, "0" means that the word f_i or f_j does not occur in a particular document of the document set, "1" means that the word f_i or f_j occurs in a particular document of the document set, p(x, y) denotes the joint probability that f_i and f_j occur simultaneously in particular documents of the document set, and p(x) and p(y) denote the marginal probabilities that f_i and f_j occur in particular documents of the document set;
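A Python sketch of steps ①–③ follows. Since formula (III) appears only as an image in the original, the way it combines the mutual information of formula (II) with the cosine distance of formula (I) is not recoverable; multiplying the two, as done below, is an assumption of this sketch.

```python
import math

def norm_mutual_information(f_i, f_j, doc_tfidf):
    """Sketch of claim 3: normalized mutual information of f_i and f_j.

    doc_tfidf -- list of {word: tf-idf value} dicts, one per document in D.
    """
    n = len(doc_tfidf)
    # Step 1: TF-IDF vectors over all documents (0 where the word is absent)
    v_i = [d.get(f_i, 0.0) for d in doc_tfidf]
    v_j = [d.get(f_j, 0.0) for d in doc_tfidf]
    # Step 2: cosine of the two vectors -- formula (I)
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm = math.sqrt(sum(a * a for a in v_i)) * math.sqrt(sum(b * b for b in v_j))
    cos = dot / norm if norm else 0.0
    # Step 3: mutual information over the binary occurrence events X, Y -- formula (II)
    def p(x, y):
        return sum(1 for d in doc_tfidf
                   if (f_i in d) == x and (f_j in d) == y) / n
    mi = 0.0
    for x in (False, True):
        for y in (False, True):
            pxy = p(x, y)
            px = p(x, False) + p(x, True)
            py = p(False, y) + p(True, y)
            if pxy > 0:
                mi += pxy * math.log(pxy / (px * py))
    # ASSUMPTION: formula (III) is not reproduced in the text; here the mutual
    # information is normalized by multiplying with the cosine distance of (I)
    return cos * mi
```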
4. The method for identifying the similarity of a large amount of Web text information based on a word network according to claim 1, characterized in that: in step 2.1, taking the word w as an example, calculating the TF-IDF weight of each word comprises the following steps (an illustrative sketch follows step c):
a. calculating the frequency TF of the word w in the document d according to the following formula, namely the ratio of the number of times the word w appears in the document d to the total number of words in the document d:
TF(w,d)=count(w,d)/size(d)
wherein, TF (w, d) represents the frequency of the word w appearing in the document d, count (w, d) represents the number of times the word w appears in the document d, and size (d) represents the total number of words contained in the document d;
b. calculating the inverse document frequency IDF of the word w over the whole document set D according to the following formula, i.e. taking the logarithm of the ratio of the total number of documents in the document set to the number of documents containing the word w:

IDF(w; D) = log( sum(D) / count(w, d; D) )

wherein IDF(w; D) represents the inverse document frequency of the word w in the document set D, sum(D) represents the total number of documents in the document set D, and count(w, d; D) represents the number of documents in the document set D that contain the word w;
c. calculating the TF-IDF value of the word w in the document d, i.e. the product of the TF value and the IDF value of the word w, according to the following formula:

TF-IDF(w, d) = TF(w, d) × IDF(w; D).
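The three formulas of claim 4 translate directly into Python; the sketch below assumes each document is given as a list of its words.

```python
import math

def tf(word, doc_words):
    """TF(w, d) = count(w, d) / size(d)."""
    return doc_words.count(word) / len(doc_words)

def idf(word, all_docs):
    """IDF(w; D) = log(sum(D) / count(w, d; D))."""
    containing = sum(1 for d in all_docs if word in d)
    return math.log(len(all_docs) / containing) if containing else 0.0

def tf_idf(word, doc_words, all_docs):
    """TF-IDF(w, d) = TF(w, d) x IDF(w; D)."""
    return tf(word, doc_words) * idf(word, all_docs)
```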
5. The method for identifying the similarity of a large amount of Web text information based on a word network according to claim 1, characterized in that: in step 2.5, the threshold δ takes values in the range 0.5 to 0.7.
6. The method for identifying the similarity of a large amount of Web text information based on a word network according to claim 1, characterized in that: the database system in step 1.2 is the distributed database HBase.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810445807.4A CN108647322B (en) | 2018-05-11 | 2018-05-11 | Method for identifying similarity of mass Web text information based on word network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108647322A true CN108647322A (en) | 2018-10-12 |
CN108647322B CN108647322B (en) | 2021-12-17 |
Family
ID=63754348
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810445807.4A (Expired - Fee Related) | Method for identifying similarity of mass Web text information based on word network | 2018-05-11 | 2018-05-11
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647322B (en) |
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130060785A1 (en) * | 2005-03-30 | 2013-03-07 | Primal Fusion Inc. | Knowledge representation systems and methods incorporating customization |
US7689531B1 (en) * | 2005-09-28 | 2010-03-30 | Trend Micro Incorporated | Automatic charset detection using support vector machines with charset grouping |
US20080275870A1 (en) * | 2005-12-12 | 2008-11-06 | Shanahan James G | Method and apparatus for constructing a compact similarity structure and for using the same in analyzing document relevance |
CN101582080A (en) * | 2009-06-22 | 2009-11-18 | 浙江大学 | Web image clustering method based on image and text relevant mining |
CN102033867A (en) * | 2010-12-14 | 2011-04-27 | 西北工业大学 | Semantic-similarity measuring method for XML (Extensible Markup Language) document classification |
CN102169495A (en) * | 2011-04-11 | 2011-08-31 | 趣拿开曼群岛有限公司 | Industry dictionary generating method and device |
CN102332012A (en) * | 2011-09-13 | 2012-01-25 | 南方报业传媒集团 | Chinese text sorting method based on correlation study between sorts |
CN104063502A (en) * | 2014-07-08 | 2014-09-24 | 中南大学 | WSDL semi-structured document similarity analyzing and classifying method based on semantic model |
CN104239436A (en) * | 2014-08-27 | 2014-12-24 | 南京邮电大学 | Network hot event detection method based on text classification and clustering analysis |
CN104615714A (en) * | 2015-02-05 | 2015-05-13 | 北京中搜网络技术股份有限公司 | Blog duplicate removal method based on text similarities and microblog channel features |
CN105183813A (en) * | 2015-08-26 | 2015-12-23 | 山东省计算中心(国家超级计算济南中心) | Mutual information based parallel feature selection method for document classification |
CN105701167A (en) * | 2015-12-31 | 2016-06-22 | 北京工业大学 | Topic relevance judgement method based on coal mine safety event |
CN106547739A (en) * | 2016-11-03 | 2017-03-29 | 同济大学 | A kind of text semantic similarity analysis method |
Non-Patent Citations (3)
Title |
---|
QINGLIN GUO: "The similarity computing of documents based on VSM", 2008 32nd Annual IEEE International Computer Software and Applications Conference *
GONGYE Xiaoyan et al.: "Topic term extraction algorithm based on an improved TF-IDF algorithm and co-occurring words", Journal of Nanjing University (Natural Science) *
CHENG Pengsen: "Recognition algorithm for duplicate and near-duplicate news web pages based on feature word groups", Journal of Chengdu University of Information Technology *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220058690A1 (en) * | 2017-03-29 | 2022-02-24 | Ebay Inc. | Generating keywords by associative context with input words |
US11769173B2 (en) * | 2017-03-29 | 2023-09-26 | Ebay Inc. | Generating keywords by associative context with input words |
CN110175234A (en) * | 2019-04-08 | 2019-08-27 | 北京百度网讯科技有限公司 | Unknown word identification method, apparatus, computer equipment and storage medium |
CN110175234B (en) * | 2019-04-08 | 2022-02-25 | 北京百度网讯科技有限公司 | Unknown word recognition method and device, computer equipment and storage medium |
CN110134760A (en) * | 2019-05-17 | 2019-08-16 | 北京思维造物信息科技股份有限公司 | A kind of searching method, device, equipment and medium |
CN110276390B (en) * | 2019-06-14 | 2022-09-16 | 六盘水市食品药品检验检测所 | Comprehensive information processing system and method for third-party food detection mechanism |
CN110276390A (en) * | 2019-06-14 | 2019-09-24 | 六盘水市食品药品检验检测所 | A kind of third party's food inspection synthesis of mechanism information processing system and method |
CN110852090A (en) * | 2019-11-07 | 2020-02-28 | 中科天玑数据科技股份有限公司 | Public opinion crawling mechanism characteristic vocabulary extension system and method |
CN110852090B (en) * | 2019-11-07 | 2024-03-19 | 中科天玑数据科技股份有限公司 | Mechanism characteristic vocabulary expansion system and method for public opinion crawling |
CN111539028A (en) * | 2020-04-23 | 2020-08-14 | 周婷 | File storage method and device, storage medium and electronic equipment |
CN111539028B (en) * | 2020-04-23 | 2023-05-12 | 国网浙江省电力有限公司物资分公司 | File storage method and device, storage medium and electronic equipment |
CN111881256A (en) * | 2020-07-17 | 2020-11-03 | 中国人民解放军战略支援部队信息工程大学 | Text entity relation extraction method and device and computer readable storage medium equipment |
CN111881256B (en) * | 2020-07-17 | 2022-11-08 | 中国人民解放军战略支援部队信息工程大学 | Text entity relation extraction method and device and computer readable storage medium equipment |
CN114090421A (en) * | 2021-09-17 | 2022-02-25 | 秒针信息技术有限公司 | Test set generation method and device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN108647322B (en) | 2021-12-17 |
Similar Documents
Publication | Title |
---|---|
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
Rousseau et al. | Main core retention on graph-of-words for single-document keyword extraction | |
CN103514183B (en) | Information search method and system based on interactive document clustering | |
CN108681557B (en) | Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint | |
CN108763348B (en) | Classification improvement method for feature vectors of extended short text words | |
Xie et al. | Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb | |
CN110888991B (en) | Sectional type semantic annotation method under weak annotation environment | |
CN113962293B (en) | LightGBM classification and representation learning-based name disambiguation method and system | |
CN107180045A (en) | A kind of internet text contains the abstracting method of geographical entity relation | |
CN108132927A (en) | A kind of fusion graph structure and the associated keyword extracting method of node | |
CN114706972B (en) | Automatic generation method of unsupervised scientific and technological information abstract based on multi-sentence compression | |
CN108090178B (en) | Text data analysis method, text data analysis device, server and storage medium | |
CN111221968B (en) | Author disambiguation method and device based on subject tree clustering | |
CN112559684A (en) | Keyword extraction and information retrieval method | |
CN112836029A (en) | Graph-based document retrieval method, system and related components thereof | |
CN115563313A (en) | Knowledge graph-based document book semantic retrieval system | |
KR20180129001A (en) | Method and System for Entity summarization based on multilingual projected entity space | |
CN103761286B (en) | A kind of Service Source search method based on user interest | |
Chow et al. | A new document representation using term frequency and vectorized graph connectionists with application to document retrieval | |
Pang et al. | A text similarity measurement based on semantic fingerprint of characteristic phrases | |
Chen et al. | Research on clustering analysis of Internet public opinion | |
Triwijoyo et al. | Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms | |
CN113705217B (en) | Literature recommendation method and device for knowledge learning in electric power field | |
CN113157857B (en) | Hot topic detection method, device and equipment for news | |
CN114943285A (en) | Intelligent auditing system for internet news content data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20211217 |